Author manuscript; available in PMC: 2025 Aug 21.
Published in final edited form as: IEEE Trans Signal Process. 2024 Sep 10;72:4428–4443. doi: 10.1109/tsp.2024.3457529

Principal Component Analysis in Space Forms

Puoya Tabaghi 1, Michael Khanzadeh 2, Yusu Wang 3, Siavash Mirarab 4
PMCID: PMC12366648  NIHMSID: NIHMS2096070  PMID: 40843461

Abstract

Principal Component Analysis (PCA) is a workhorse of modern data science. While PCA assumes the data conform to Euclidean geometry, for specific data types, such as hierarchical and cyclic data structures, other spaces are more appropriate. We study PCA in space forms, that is, Riemannian manifolds with constant curvature. At a point on a Riemannian manifold, we can define a Riemannian affine subspace based on a set of tangent vectors. Finding the optimal low-dimensional affine subspace for given points in a space form amounts to dimensionality reduction. Our Space Form PCA (SFPCA) seeks the affine subspace that best represents a set of manifold-valued points with the minimum projection cost. We propose proper cost functions that enjoy two properties: (1) their optimal affine subspace is the solution to an eigenequation, and (2) optimal affine subspaces of different dimensions form a nested set. These properties provide advances over existing methods, which are mostly iterative algorithms with slow convergence and weaker theoretical guarantees. We evaluate the proposed SFPCA on real and simulated data in spherical and hyperbolic spaces. We show that it outperforms alternative methods in estimating true subspaces (on simulated data) with respect to convergence speed or accuracy, often both.

Index Terms: Principal component analysis, Riemannian manifolds, hyperbolic and spherical spaces

I. Introduction

GIVEN a set of multivariate points, principal component analysis (PCA) finds orthogonal basis vectors so that different components of the data, in the new coordinates, become uncorrelated and the leading bases carry the largest projected variance of the points. PCA is related to factor analysis [1], the Karhunen-Loève expansion, and singular value decomposition [2] — with a history going back to the 18th century [3]. The modern formalism of PCA goes back to the work of Hotelling [4]. Owing to its interpretability and flexibility, PCA has been an indispensable tool in data science applications [5]. The PCA formulation has been studied extensively in the literature. Tipping and Bishop [6] established a connection between factor analysis and PCA in a probabilistic framework. Other extensions have been proposed [7], e.g., Gaussian process [8] and sensible [9], Bayesian [10], sparse [11], [12], [13], [14], and robust PCA [15].

PCA's main features are its linearity and nested optimality of subspaces with different dimensions. PCA uses a linear transformation to extract features. Thus, applying PCA to non-Euclidean data ignores their geometry, produces points that may not belong to the original space, and breaks downstream applications relying on this geometry [16], [17], [18], [19].

We focus on space forms: complete, simply connected Riemannian manifolds of dimension d ≥ 2 and constant curvature — spherical, Euclidean, or hyperbolic spaces (positive, zero, and negative curvatures) [20]. Space forms have gained attention in the machine learning community due to their ability to represent many forms of data. Hyperbolic spaces are suitable for hierarchical structures [18], [21], [22], [23], [24], biological data [25], [26], and phylogenetic trees [17]. Spherical spaces find application in text embeddings [27], longitudinal data [28], and cycle structures in graphs [29].

To address the shortcomings of Euclidean PCA for non-Euclidean data, several authors propose Riemannian PCA methods [16], [30], [31], [32], [33], [34], [35], [36], [37], [38]. Riemannian manifolds generally lack a vector space structure [39], posing challenges for defining principal components. A common approach for dimensionality reduction of manifold-valued data relies on tangent spaces. Fletcher et al. [33] propose a cost to quantify the quality of a Riemannian affine subspace but use a heuristic approach, principal geodesics analysis (PGA), to optimize it: (1) the base point (intrinsic mean) is the solution to a fixed-point problem, (2) Euclidean PCA in tangent space estimates the low-rank tangent vectors. Even more principled approaches do not readily yield tractable solutions necessary for analyzing large-scale data [34], as seen in spherical and hyperbolic PCAs [28], [40], [41].

Despite recent progress, PCA in space forms remains inadequately explored. In general, cost-based Riemannian (e.g., spherical and hyperbolic) PCAs rely on finding the optimal Riemannian affine subspace by minimizing a nonconvex function. The cost function, proxy, or methodology is usually inspired by the ℓ₂² cost, with no definite justification for it [16], [28], [33], [35], [40], [41]. These algorithms rely on iterative methods to estimate the Riemannian affine subspaces, e.g., gradient descent, fixed-point iterations, and proximal alternating minimization; they are slow to converge and require parameter tuning. There is also no guarantee that the estimated Riemannian affine subspaces form a total order under inclusion (i.e., optimal higher-dimensional subspaces include lower-dimensional ones) unless cost minimization is performed in a greedy (suboptimal) fashion by building high-dimensional subspaces on top of previously estimated low-dimensional ones. Notably, Chakraborty et al. propose a greedy PGA for space forms by estimating one principal geodesic at a time [42]. They derive an analytic formula for projecting a point onto a parameterized geodesic, which simplifies the projection step of PGA. However, we still have to solve a nonconvex optimization problem (with no theoretical guarantees) to estimate the principal geodesic at each iteration.

We address PCA limitations in space forms by proposing a closed-form, theoretically optimal, and computationally efficient method to derive all principal geodesics at once. We begin with a differential geometric view of Euclidean PCA (Section II), followed by a generic description of Riemannian PCA (Section III). In this view, a proper PCA cost function must (1) naturally define a centroid for manifold-valued points and (2) yield theoretically optimal affine subspaces forming a total order under inclusion. We introduce proper costs for spherical (Section IV) and hyperbolic (Section V) PCA problems. Minimizing each cost function leads to an eigenequation, which can be effectively solved. For hyperbolic PCA, the optimal affine subspace solves an eigenequation in Lorentzian space which is equipped with an indefinite inner product. These results give us efficient algorithms to derive hyperbolic principal components. We delegate all proofs to the Appendix.

A. Preliminaries and Notations

Let (ℳ, g) be a Riemannian manifold. The tangent space T_pℳ is the collection of all tangent vectors at p ∈ ℳ. The Riemannian metric g_p : T_pℳ × T_pℳ → R is given by a positive-definite inner product and depends smoothly on p. We use g_p to define notions such as subspaces, norms, and angles, similar to inner product spaces. For any subspace H ⊆ T_pℳ, we define its orthogonal complement as follows:

H^⊥ = {h′ ∈ T_pℳ : g_p(h, h′) = 0, ∀ h ∈ H} ⊆ T_pℳ. (1)

The norm of v ∈ T_pℳ is ‖v‖ := √(g_p(v, v)). We denote the length of a smooth curve γ : [0, 1] → ℳ as L[γ] = ∫₀¹ ‖γ′(t)‖ dt. A geodesic γ_{p₁,p₂} is the shortest-length path between p₁ and p₂, that is, γ_{p₁,p₂} = argmin_γ {L[γ] : γ(0) = p₁, γ(1) = p₂}. Interpreting the parameter t as time, if a geodesic γ(t) starts at γ(0) = p with initial velocity γ′(0) = v ∈ T_pℳ, the exponential map exp_p(v) gives its position at t = 1. For p, x ∈ ℳ, the logarithmic map log_p(x) gives the initial velocity needed to move (with constant speed) along the geodesic from p to x in one time step. A Riemannian manifold is geodesically complete if the exponential and logarithmic maps are well-defined operators at every point p [43]. A submanifold ℳ′ of a Riemannian manifold (ℳ, g) is geodesic if any geodesic on ℳ′ with its induced metric is also a geodesic on (ℳ, g). For N ∈ N, we let [N] := {1, …, N} and [N]₀ := [N] ∪ {0}. The variable x₁ can be an element of the vector x = (x₀, …, x_{D−1})^⊤ ∈ R^D, or an indexed vector, e.g., x₁, …, x_N ∈ R^D; the distinction will be clear from the context. We use E_N[·] to denote the empirical mean of its inputs with indices in [N].

II. Principal Component Analysis — Revisited

Similar to the notion by Pearson [44], PCA finds the optimal low-dimensional affine space to represent data. Let p ∈ R^D and let the column span of H ∈ R^{D×K} be a subspace. For the affine subspace p + H, PCA assumes the following cost:

cost_{p+H}(𝒳) = E_N[f(d(x_n, 𝒫_{p+H}(x_n)))],

where 𝒳 = {x_n ∈ R^D : n ∈ [N]}, 𝒫_{p+H}(x_n) = argmin_{x ∈ p+H} d(x, x_n), d(·, ·) computes the ℓ₂ distance, and the distortion function is f(x) = x². This formalism relies on (1) an affine subspace p + H, (2) the projection operator 𝒫_{p+H}, and (3) the distortion function f. To generalize affine subspaces to Riemannian manifolds, consider parametric lines:

γ_{p,x}(t) = (1 − t)p + tx, and γ_h(t) = p + th, (2)

where h ∈ H^⊥, the orthogonal complement of the subspace H; see Fig. 1(a) and 1(b). We reformulate affine subspaces as:

p + H = {x ∈ R^D : ⟨x − p, h⟩ = 0, for all h ∈ H^⊥} = {x ∈ R^D : ⟨γ′_{p,x}(0), γ′_h(0)⟩ = 0, for all h ∈ H^⊥},

where ⟨·, ·⟩ is the dot product and γ′(t₀) := (d/dt)γ(t)|_{t=t₀}.

Fig. 1.

(a,b) One- (a) and two-dimensional (b) affine subspaces in R³. We show the subspaces (H^⊥) at the point p instead of the origin. We may define the same Riemannian affine subspace using other base points, e.g., p′. (c) A two-dimensional affine subspace in a hyperbolic space (Poincaré model), where h ∈ T_p𝕀³ = R³.

Definition 1: An affine subspace is a set of points x for which there exists a p ∈ R^D such that the tangent of the line γ_{p,x} (i.e., γ′_{p,x}(0)) is normal to a set of tangent vectors at p.

Definition 1 requires dim(H^⊥) parameters to describe p + H. For example, in R³, we need two orthonormal vectors to represent a one-dimensional affine subspace; see Fig. 1(a). We use Definition 1 since it describes affine subspaces in terms of lines and tangent vectors, not a global coordinate system.

III. Riemannian Principal Component Analysis

We next introduce Riemannian affine subspaces and then propose a generic framework for Riemannian PCA.

A. Riemannian Affine Subspaces

The notion of affine subspaces can be extended to Riemannian manifolds using tangent subspaces of the Riemannian manifold [33]. The Riemannian affine subspace is the image of a subspace of Tp under the exponential map, i.e.,

ℳ_H = exp_p(H) := {exp_p(h) : h ∈ H}, (3)

where H is a subspace of T_pℳ and p ∈ ℳ. Equivalently, the Riemannian affine subspace is defined as follows.

Definition 2: Let (ℳ, g) be a geodesically complete Riemannian manifold, p ∈ ℳ, and H ⊆ T_pℳ a subspace. We let ℳ_H = {x ∈ ℳ : g_p(log_p(x), h) = 0, ∀ h ∈ H^⊥}, where H^⊥ is the orthogonal complement of H; see equation (1).

The set ℳ_H is a collection of points on geodesics originating at p such that their initial velocities are normal to the vectors in H^⊥ ⊆ T_pℳ, or equivalently belong to the subspace H ⊆ T_pℳ; cf. Definition 1.

Example 1: When H is a one-dimensional subspace, ℳ_H contains the geodesics that pass through p with initial velocity in H, i.e., ℳ_H = {γ(t) : geodesic γ where γ(0) = p, γ′(0) ∈ H, t ∈ R}. Thus, with equation (3), geodesics are one-dimensional Riemannian affine subspaces.

Example 2: The Euclidean exponential map is exp_p(h) = p + h; see Table I. Therefore, equation (3) recovers Euclidean affine subspaces, i.e., exp_p(H) = {p + h : h ∈ H} := p + H.

TABLE I.

Summary of Relevant Operators in Euclidean, Spherical, and Hyperbolic Spaces

ℳ | T_pℳ | g_p(u, v) | log_p(x), θ := √|C|·d(x, p) | exp_p(v) | d(x, p)
R^D | R^D | ⟨u, v⟩ | x − p | p + v | ‖x − p‖₂
S^D | p^⊥ | ⟨u, v⟩ | (θ/sin θ)(x − cos(θ)p) | cos(√C‖v‖)p + sin(√C‖v‖)·v/(√C‖v‖) | (1/√C)·arccos(C⟨x, p⟩)
H^D | p^⊥ | [u, v] | (θ/sinh θ)(x − cosh(θ)p) | cosh(√(−C)‖v‖)p + sinh(√(−C)‖v‖)·v/(√(−C)‖v‖) | (1/√(−C))·acosh(C[x, p])
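The spherical row of Table I can be sanity-checked numerically. Below is a minimal sketch (with C = 1 for simplicity; the helper names sph_exp and sph_log are ours, not from the paper) verifying that the logarithmic map inverts the exponential map:

```python
import numpy as np

def sph_exp(p, v, C=1.0):
    """Exponential map on S^D (Table I): follow the geodesic from p with velocity v."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return p.copy()
    t = np.sqrt(C) * nv
    return np.cos(t) * p + np.sin(t) * v / (np.sqrt(C) * nv)

def sph_log(p, x, C=1.0):
    """Logarithmic map on S^D (Table I): initial velocity of the geodesic from p to x."""
    theta = np.arccos(np.clip(C * np.dot(x, p), -1.0, 1.0))  # equals sqrt(C) * d(x, p)
    if theta < 1e-12:
        return np.zeros_like(p)
    return (theta / np.sin(theta)) * (x - np.cos(theta) * p)

rng = np.random.default_rng(0)
p = rng.standard_normal(4); p /= np.linalg.norm(p)   # a point on S^3 (C = 1)
v = rng.standard_normal(4); v -= (v @ p) * p         # a tangent vector at p
v *= 0.3 / np.linalg.norm(v)                         # small step along the geodesic
x = sph_exp(p, v)
assert np.allclose(sph_log(p, x), v, atol=1e-8)      # log inverts exp
```

The same pattern applies to the hyperbolic row with cosh/sinh and the Lorentzian inner product.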

Recall that a nonempty set V ⊆ R^D is a Euclidean affine subspace if and only if there exists p ∈ V such that α(v₁ − p) + p ∈ V and (v₁ − p) + (v₂ − p) + p ∈ V for all α ∈ R and v₁, v₂ ∈ V. We have a similar definition for Riemannian affine subspaces.

Definition 3: Let (ℳ, g) be a geodesically complete Riemannian manifold. Then, a nonempty V ⊆ ℳ is an affine set if and only if there exists p ∈ V such that exp_p(α log_p(v₁)) ∈ V and exp_p(log_p(v₁) + log_p(v₂)) ∈ V for all α ∈ R and v₁, v₂ ∈ V.

B. Proper Cost for Riemannian PCA

Similar to Euclidean PCA, Riemannian PCA aims to find a (Riemannian) affine subspace with minimum average distortion between points and their projections.

Definition 4: Let (ℳ, g) be a geodesically complete Riemannian manifold equipped with the distance function d, let p ∈ ℳ, and let H ⊆ T_pℳ be a subspace. A geodesic projection of x ∈ ℳ onto ℳ_H is 𝒫_{ℳ_H}(x) ∈ argmin_{y ∈ ℳ_H} d(x, y). If argmin_{y ∈ ℳ_H} d(x, y) ≠ ∅, then min_{y ∈ ℳ_H} d(x, y) = ‖log_{𝒫_{ℳ_H}(x)}(x)‖ for any geodesic projection 𝒫_{ℳ_H}(x).

Remark 1: Projecting a manifold-valued point onto a Riemannian affine subspace is not a trivial task, often requiring the solution of a nonconvex optimization problem over a submanifold, i.e., solving argmin_{y ∈ ℳ_H} d(x, y). Definition 4 states that if a solution exists (which is not always guaranteed), the projection distance must equal ‖log_{𝒫_{ℳ_H}(x)}(x)‖. This also requires computing the logarithmic map, which may not be available for all Riemannian manifolds.

In Euclidean PCA, minimizing the ℓ₂² cost is equivalent to maximizing the variance of the projected data. To avoid confusing arguments regarding the notions of variance and centroid, we formalize the cost (parameterized by ℳ_H) in terms of the projection distance, viz.,

cost_{ℳ_H}(𝒳) = E_N[f(‖log_{𝒫_{ℳ_H}(x_n)}(x_n)‖)], (4)

where 𝒳 = {x_n ∈ ℳ : n ∈ [N]} and f : R₊ → R is a monotonically increasing distortion function. The projection point 𝒫_{ℳ_H}(x) may not be unique. The minimizer of the cost, if it exists, is the best affine subspace to represent the manifold-valued points.

Definition 5: Riemannian PCA aims to minimize the cost in equation (4) — for a specific choice of distortion function f.

Choice of f. The closed-form solution for the optimal Euclidean affine subspace is due to letting f(x)=x2. This is a proper cost function with the following properties:

  1. Consistent Centroid. The optimal zero-dimensional affine subspace (a point) is the centroid of the data points, i.e., p* = argmin_{y ∈ R^D} E_N[f(d(x_n, y))] = E_N[x_n].

  2. Nested Optimality. The optimal affine subspaces form a nested set, i.e., p* ∈ p + H₁* ⊆ p + H₂* ⊆ ⋯, where p + H_d* is the optimal d-dimensional affine subspace.
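Both properties can be verified directly for Euclidean PCA with an SVD of the centered data; a small numerical sketch (dimensions are arbitrary):

```python
import numpy as np

# Numerical check of the two properties for Euclidean PCA (f(x) = x^2):
# the optimal base point is the sample mean, and the optimal subspaces of
# increasing dimension are nested (each basis extends the previous one).
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 5))           # N = 100 points in R^5
p_star = X.mean(axis=0)                     # consistent centroid
_, _, Vt = np.linalg.svd(X - p_star, full_matrices=False)
H1 = Vt[:1].T                               # optimal 1-D subspace basis
H2 = Vt[:2].T                               # optimal 2-D subspace basis
# H1 lies inside span(H2): projecting H1 onto span(H2) leaves it unchanged.
assert np.allclose(H2 @ (H2.T @ H1), H1)
```

Nestedness holds here because all optimal subspaces are read off one eigendecomposition, which is exactly the behavior the proposed space-form costs are designed to preserve.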

Definition 6: For Riemannian PCA, we call cost_{ℳ_H}(𝒳) a proper cost function if its minimizers satisfy the consistent centroid and nested optimality conditions.

Deriving the logarithm operator is not a trivial task for general Riemannian manifolds, e.g., the manifold of rank-deficient positive semidefinite matrices [45]. Focusing on constant-curvature Riemannian manifolds, we propose distortion functions that, unlike existing methods, arrive at proper cost functions with closed-form optimal solutions.

IV. Spherical PCA

Consider the spherical manifold (S^D, g^S) with curvature C > 0, where S^D = {x ∈ R^{D+1} : ⟨x, x⟩ = C^{−1}} is a sphere with radius C^{−1/2}, and g_p^S(u, v) = ⟨u, v⟩ computes the dot product of u, v ∈ T_pS^D = {x ∈ R^{D+1} : ⟨x, p⟩ = 0} := p^⊥.

A. Spherical Affine Subspace and the Projection Operator

Let p ∈ S^D and let H ⊆ T_pS^D = p^⊥ be a subspace. Following Definition 2 and Table I, the spherical affine subspace is:

S_H^D = {x ∈ S^D : ⟨x, h⟩ = 0, ∀ h ∈ H^⊥} = S^D ∩ (p ⊕ H),

where ⊕ is the direct sum operator, i.e., p ⊕ H = {αp + h : h ∈ H, α ∈ R}, and H^⊥ is the orthogonal complement of H; see equation (1). This matches Pennec's notion of the metric completion of the exponential barycentric spherical subspace [37].

Claim 1: S_H^D is a geodesic submanifold.

There are orthogonal tangent vectors h₁, …, h_{D−K} that form a complete basis for H^⊥, i.e., ⟨h_i, h_j⟩ = Cδ_{i,j}, where δ_{i,j} = 0 if i ≠ j and δ_{i,j} = 1 if i = j. Using these basis vectors, we derive a simple expression for the projection distance.

Proposition 1: For any S_H^D and x ∈ S^D, we have

min_{y ∈ S_H^D} d(x, y) = C^{−1/2} arccos(√(1 − Σ_{k∈[D−K]} ⟨x, h_k⟩²)),

where {h_k}_{k∈[D−K]} are the complete orthogonal basis vectors of H^⊥. Both S_H^D and S^D have the same fixed curvature C > 0.

For points at distance C^{−1/2}·π/2 from S_H^D, there is no unique projection onto the affine subspace. Nevertheless, Proposition 1 provides a closed-form expression for the projection distance in terms of basis vectors for H^⊥. This distance monotonically increases with the length of the residual of x on H^⊥, i.e., Σ_{k∈[D−K]} ⟨x, h_k⟩². Since dim(H) ≪ D is common, switching to the basis of H helps us represent the projection distance with fewer parameters.

Proposition 2: For any S_H^D and x ∈ S^D, we have

min_{y ∈ S_H^D} d(x, y) = C^{−1/2} arccos(√(C²⟨x, p⟩² + Σ_{k∈[K]} ⟨x, h_k⟩²)),

where {h_k}_{k∈[K]} are the complete orthogonal basis vectors of H.
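As a sanity check, the two propositions must agree, since for a unit-norm x (C = 1) the squared components along p, H, and H^⊥ sum to one. A quick numerical sketch with an arbitrary orthonormal frame:

```python
import numpy as np

# Sanity check (C = 1): the projection distance from Proposition 1 (basis of
# H-perp) matches Proposition 2 (base point p plus basis of H). The dimensions
# below are illustrative.
rng = np.random.default_rng(2)
D, K = 5, 2
Q, _ = np.linalg.qr(rng.standard_normal((D + 1, D + 1)))  # orthonormal frame of R^{D+1}
p, H, Hperp = Q[:, 0], Q[:, 1:K + 1], Q[:, K + 1:]        # p, basis of H, basis of H-perp
x = rng.standard_normal(D + 1); x /= np.linalg.norm(x)    # a point on S^D

d1 = np.arccos(np.sqrt(1 - np.sum((Hperp.T @ x) ** 2)))               # Proposition 1
d2 = np.arccos(np.sqrt((p @ x) ** 2 + np.sum((H.T @ x) ** 2)))        # Proposition 2
assert np.isclose(d1, d2)
```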

Next, we derive an isometry between S_H^D and S^{dim(H)} — where both have the same fixed curvature C > 0.

Theorem 2: The isometry 𝒬 : S_H^D → S^K and its inverse are

𝒬(x) = C^{−1/2}(C⟨x, p⟩, ⟨x, h₁⟩, …, ⟨x, h_K⟩)^⊤, 𝒬^{−1}(y) = C^{−1/2}(y₀Cp + Σ_{k=1}^{K} y_k h_k),

where {h_k}_{k∈[K]} are the complete orthogonal basis vectors of H.

Corollary 1: The dimension of S_H^D is dim(H).

Finally, we can provide an alternative view of spherical affine subspaces based on sliced unitary matrices {G ∈ R^{(D+1)×(K+1)} : G^⊤G = I_{K+1}}.

Claim 3: For any S_H^D, there is a sliced-unitary operator G : S^{dim(H)} → S_H^D and vice versa.

B. Minimum Distortion Spherical Subspaces

To define principal components, we need a specific choice of distortion function f; see equation (4). Before presenting our choice, let us discuss previously studied cost functions.

1). Review of Existing Work:

Dai and Müller consider an intrinsic PCA for smooth functional data 𝒳 on a Riemannian manifold [28] with the distortion function f(x) = x², i.e.,

cost^{Dai}_{S_H^D}(𝒳) = E_N[f(min_{y ∈ S_H^D} d(x_n, y))]. (5)

Their algorithm, Riemannian functional principal component analysis (RFPCA), first solves for the base point — the optimal zero-dimensional affine subspace, i.e., the Fréchet mean p* = argmin_{p ∈ S^D} E_N[d(x_n, p)²]. Then, they project each point to T_{p*}S^D using the logarithmic map. Next, they perform PCA on the resulting tangent vectors to obtain the K-dimensional tangent subspace. Finally, they map the projected tangent vectors back to the spherical space using the exponential map. Despite its simplicity, this approach suffers from four shortcomings. (1) There is no closed-form expression for the Fréchet mean of spherical data. (2) A theoretical analysis of the computational complexity of estimating the Fréchet mean is not yet known, and its computation involves an argmin operation that oftentimes cannot be easily differentiated [46]. (3) Even for an accurately estimated Fréchet mean, there is no guarantee that the Fréchet mean is the optimal base point for problem (5). Huckemann and Ziezold [47] show that the Fréchet mean may not belong to the optimal one-dimensional affine spherical subspace. (4) Even if the Fréchet mean happens to be the optimal base point, performing PCA in the tangent space does not solve problem (5).

Liu et al. propose a spherical matrix factorization problem:

min_{G ∈ R^{(D+1)×(K+1)}, y_n ∈ S^K, n∈[N]} E_N[‖x_n − Gy_n‖₂²] subject to G^⊤G = I_{K+1}, (6)

where 𝒳 = {x_n ∈ R^{D+1} : n ∈ [N]} is the measurement set and y₁, …, y_N ∈ S^K are features in a spherical space with C = 1 [40]. They propose a proximal algorithm to solve for the affine subspace and features. This formalism is not a spherical PCA because the measurements do not belong to a spherical space. The objective in equation (6) aims to best project Euclidean points onto a low-dimensional spherical affine subspace with respect to the squared Euclidean distance — refer to Claim 2. Nevertheless, if we change their formalism and let the input space be a spherical space, we arrive at:

cost^{Liu}_{S_H^D}(𝒳) = E_N[−cos(min_{y_n ∈ S^K} d(x_n, Gy_n))] = E_N[f(d(x_n, 𝒫_H(x_n)))], (7)

where H is a tangent subspace that corresponds to G (see Claim 2) and 𝒫_H is the spherical projection operator. This formalism uses the distortion function f(x) = −cos(x).

Nested spheres [48] by Jung et al. is an alternative procedure for fitting principal nested spheres to iteratively reduce the dimensionality of data. It finds the optimal (D−1)-dimensional subsphere 𝒰_{D−1} by minimizing the following cost,

cost^{Jung}_{𝒰_{D−1}}(𝒳) = E_N[(d(x_n, p) − r)²],

where 𝒰_{D−1} = {x ∈ S^D : d(x, p) = r} — over p ∈ S^D and r ∈ R₊. This is a constrained nonlinear optimization problem without closed-form solutions. Once they estimate the optimal 𝒰_{D−1}, they map each point to the lower-dimensional spherical space S^{D−1} — and repeat this process until they reach the target dimension. The subspheres are not necessarily great spheres, making this decomposition nongeodesic.

2). A Proper Cost Function for Spherical PCA:

In contrast to the distortions f(x) = −cos(x) and f(x) = x² used by Liu et al. [40] and Dai and Müller [28], we choose f(x) = sin²(√C x). Using Proposition 1, we arrive at:

cost_{S_H^D}(𝒳) = E_N[Σ_{k∈[D−K]} ⟨x_n, h_k⟩²], (8)

i.e., the average ℓ₂² norm of the projections of the points in the directions of the vectors h₁, …, h_{D−K} ∈ T_pS^D. The expression (8) leads to a tractable constrained optimization problem.

Claim 4: Let x₁, …, x_N ∈ S^D. Spherical PCA with the cost in equation (8) aims to find p ∈ S^D and orthogonal h₁, …, h_{D−K} ∈ T_pS^D that minimize Σ_{k∈[D−K]} h_k^⊤ C_x h_k, where C_x = E_N[x_n x_n^⊤].

The solution to the problem in Claim 4 is the minimizer of the cost in equation (8): a set of orthogonal vectors h₁, …, h_{D−K} ∈ p^⊥ that capture, in quantum physics terminology, the least possible energy of C_x.

Theorem 5: Let x₁, …, x_N ∈ S^D. Then, an optimal solution for p is the (scaled) leading eigenvector of C_x, and h₁, …, h_{D−K} (basis vectors of H^⊥) are the eigenvectors that correspond to the smallest D − K eigenvalues of the second-moment matrix C_x.

Corollary 2: The optimal subspace H is spanned by the K leading eigenvectors of C_x, discarding the first one.
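The resulting algorithm is a single eigendecomposition. A minimal sketch for C = 1 (the helper name spherical_pca is ours; this is a sketch of Theorem 5 and Corollary 2, not the authors' reference implementation):

```python
import numpy as np

def spherical_pca(X, K):
    """X: (N, D+1) array of unit-norm points on S^D (C = 1); returns (p, H)."""
    Cx = X.T @ X / X.shape[0]              # second-moment matrix E_N[x x^T]
    _, vecs = np.linalg.eigh(Cx)           # eigh returns ascending eigenvalues
    vecs = vecs[:, ::-1]                   # reorder to descending
    p = vecs[:, 0]                         # base point: leading eigenvector
    H = vecs[:, 1:K + 1]                   # subspace H: next K eigenvectors
    return p, H

# Points clustered around e_1 on S^2 should recover a base point aligned with e_1.
rng = np.random.default_rng(3)
X = rng.standard_normal((200, 3)) * 0.2 + np.array([1.0, 0.0, 0.0])
X /= np.linalg.norm(X, axis=1, keepdims=True)
p, H = spherical_pca(X, K=1)
assert abs(p[0]) > 0.99                    # aligned (up to sign) with e_1
assert np.allclose(H.T @ p, 0, atol=1e-8)  # H is a subspace of T_p S^D
```

Note that p is recovered only up to sign, consistent with the non-uniqueness of the spherical mean discussed below.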

Claim 6: The cost function in equation (8) is proper.

The distortion in equation (8) implies a closed-form definition for the centroid of spherical data points, i.e., a zero-dimensional affine subspace that best represents the data.

Definition 7: A spherical mean μ(𝒳) for the point set 𝒳 is any point such that E_N[f(d(x_n, μ(𝒳)))] = min_{p ∈ S^D} E_N[f(d(x_n, p))]. The solution is a scaled leading eigenvector of C_x.

Interpreting the optimal base point as the spherical mean in Definition 7 shows our spherical PCA has consistent centroid and nested optimality: optimal spherical affine subspaces of different dimensions form a chain under inclusion. However, μ(𝒳) is not unique and only identifies the direction of the main component of points; see Fig. 2.

Fig. 2.

(a) A set of data points in S^D, where D = 2. (b) The best estimate for the base point p and the tangent subspace H = span{h₁} ⊆ T_pS^D — the spherical affine subspace S_H^D = (p ⊕ H) ∩ S^D. (c) The projection of the points onto S_H^D. (d) The low-dimensional features in S^K, where K = dim(S_H^D) = 1.

Remark 2: PGA [33] and RFPCA [28] involve the intensive task of Fréchet mean estimation. This involves iterative techniques like gradient descent or fixed-point iterations on nonlinear optimization objectives. There has been work on the numerical analysis of Fréchet mean computations [46]. On the other hand, SPCA [40] uses alternating linearized minimization to estimate the optimal subspace. In contrast, our method (SFPCA) requires computing the second-moment matrix C_x with a complexity of O(D²N) and involves an eigendecomposition with a worst-case complexity of O(D³).

V. Hyperbolic PCA

Let us first introduce Lorentzian spaces.

Definition 8: The Lorentzian (D+1)-space, denoted by R^{1,D}, is a vector space equipped with the Lorentzian inner product

[x, y] = x^⊤ J_D y for all x, y ∈ R^{1,D}, where J_D = diag(−1, I_D)

and I_D is the D × D identity matrix.

The D-dimensional hyperbolic manifold (H^D, g^H) has curvature C < 0, where H^D = {x ∈ R^{D+1} : [x, x] = C^{−1} and x₀ > 0} and the metric is g_p^H(u, v) = [u, v] for u, v ∈ T_pH^D = {x ∈ R^{D+1} : [x, p] = 0} := p^⊥; see Table I.

A. Eigenequation in Lorentzian spaces

Like inner product spaces, we define operators in R^{1,D}.

Definition 9: Let A ∈ R^{(D+1)×(D+1)} be a matrix (operator) in R^{1,D}. We let A^{[∗]} be the J_D-adjoint of A if and only if A^{[∗]} = J_D A^⊤ J_D. A^{[−1]} is the J_D-inverse of A if and only if (iff) A^{[−1]} J_D A = A J_D A^{[−1]} = J_D. An invertible matrix A is called J_D-unitary iff A^⊤ J_D A = J_D; see [49] for more detail.

The Lorentzian space R^{1,D} is equipped with an indefinite inner product, i.e., ∃ x ∈ R^{1,D} : [x, x] < 0. Therefore, it requires a form of eigenequation defined by its indefinite inner product. For completeness, we propose the following definition of an eigenequation in the complex Lorentzian space C^{1,D}.

Definition 10: For A ∈ C^{(D+1)×(D+1)}, v ∈ C^{1,D} is its J_D-eigenvector and λ is the corresponding J_D-eigenvalue if

A J_D v = sgn([v*, v]) λ v, where λ ∈ C, (9)

and v* is the complex conjugate of v. The sign of the norm, sgn([v*, v]), defines positive and negative J_D-eigenvectors.

Definition 10 is subtly different from hyperbolic eigenequation [50] — a special case of (A,J) eigenvalue decomposition. We prefer Definition 10 as it carries over familiar results from Euclidean to the Lorentzian space.

Proposition 3: If A = A^{[∗]}, then its J_D-eigenvalues are real.

Let v be a J_D-eigenvector of A with [v*, v] = 1. Then, [v*, A J_D v] = sgn([v*, v])λ[v*, v] = λ, the J_D-eigenvalue of A; see Definition 10. There is a connection between Euclidean and Lorentzian eigenequations. Namely, suppose A J_D ∈ C^{(D+1)×(D+1)} has an eigenvector v ∈ C^{D+1} with eigenvalue λ and [v*, v] ≠ 0. Then, (|[v*, v]|^{−1/2} v, sgn([v*, v])λ) is a J_D-eigenpair of A.

Claim 7: J_D-eigenvectors of A are parallel to eigenvectors of A J_D.
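This connection suggests a direct recipe for computing J_D-eigenpairs from a standard eigendecomposition. A sketch (the helper jd_eigen is ours; we test it on the second moment of points on H^3 with C = −1, for which C_x is positive definite and C_x J_D is similar to a symmetric matrix, hence has real eigenvalues):

```python
import numpy as np

def jd_eigen(A):
    """J_D-eigenpairs of A via the Euclidean eigendecomposition of A @ J_D (Claim 7)."""
    D1 = A.shape[0]
    J = np.diag(np.r_[-1.0, np.ones(D1 - 1)])
    lams, V = np.linalg.eig(A @ J)
    lams, V = np.real(lams), np.real(V)     # real for the matrices considered here
    pairs = []
    for lam, v in zip(lams, V.T):
        q = v @ J @ v                       # Lorentzian norm [v, v]
        s = np.sign(q)
        # J_D-eigenpair: (sgn([v, v]) * lam, v / sqrt(|[v, v]|)), plus the sign.
        pairs.append((s * lam, v / np.sqrt(abs(q)), s))
    return pairs

# Second moment of points on H^3 (C = -1): x_0 = sqrt(1 + ||xbar||^2).
rng = np.random.default_rng(4)
Xbar = 0.5 * rng.standard_normal((50, 3))
X = np.column_stack([np.sqrt(1 + np.sum(Xbar**2, axis=1)), Xbar])
Cx = X.T @ X / X.shape[0]
J = np.diag(np.r_[-1.0, np.ones(3)])
for lam, v, s in jd_eigen(Cx):
    # Defining relation (9): Cx J_D v = sgn([v, v]) * lambda * v.
    assert np.allclose(Cx @ J @ v, s * lam * v, atol=1e-6)
```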

Proposition 4 shows that the normalization factor is well-defined for full-rank matrices.

Proposition 4: If A is full-rank, then {v ∈ R^{D+1} : A J_D v = λv and [v*, v] = 0} = ∅.

Our algorithm uses the connection between Euclidean and JD-eigenequations. We extend the notion of diagonalizability to derive the optimal affine subspace; see Proposition 7.

Definition 11: A ∈ C^{(D+1)×(D+1)} is J_D-diagonalizable if and only if there is a J_D-invertible V ∈ C^{(D+1)×(D+1)} such that A J_D V = V J_D Λ, where Λ is a diagonal matrix.

B. Hyperbolic Affine Subspace and the Projection Operator

Let p ∈ H^D and let H be a K-dimensional subspace of T_pH^D = p^⊥. Following Definition 2 and Table I, we arrive at the following definition for the hyperbolic affine subspace:

H_H^D = {x ∈ H^D : [x, h] = 0, ∀ h ∈ H^⊥}, (10)

where H^⊥ is the orthogonal complement of H, i.e., H_H^D = H^D ∩ (p ⊕ H). This also coincides with the metric completion of the exponential barycentric hyperbolic subspace [37].

Claim 8: The hyperbolic subspace is a geodesic submanifold.

Lemma 1 shows that there is a complete set of orthogonal tangents h₁, …, h_{D−K} with [h_i, h_j] = −Cδ_{i,j} that span H^⊥. In Proposition 5, we provide a closed-form expression for the projection distance onto H_H^D in terms of the basis of H^⊥.

Proposition 5: For any H_H^D and x ∈ H^D, we have

min_{y ∈ H_H^D} d(x, y) = |C|^{−1/2} acosh(√(1 + Σ_{k∈[D−K]} [x, h_k]²)),

where {h_k}_{k∈[D−K]} is a complete orthogonal basis of H^⊥.

The projection distance monotonically increases with the norm of the residual of x on H^⊥, i.e., Σ_{k∈[D−K]} [x, h_k]². Proposition 5 asks for the orthogonal basis of H^⊥ — commonly, a high-dimensional space. We can instead use the basis of H to compute the projection distance.

Proposition 6: For any H_H^D and x ∈ H^D, we have

min_{y ∈ H_H^D} d(x, y) = |C|^{−1/2} acosh(√(C²[x, p]² − Σ_{k∈[K]} [x, h_k]²)),

where {h_k}_{k∈[K]} is a complete orthogonal basis of H.
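As in the spherical case, the two expressions must agree. A quick numerical check with C = −1 and the canonical frame p = e₀, H = span{e₁, …, e_K} (chosen for simplicity; any J_D-orthogonal frame works):

```python
import numpy as np

# Sanity check (C = -1): Proposition 5 (basis of H-perp) and Proposition 6
# (base point p plus basis of H) give the same projection distance.
D, K = 5, 2
J = np.diag(np.r_[-1.0, np.ones(D)])
p = np.eye(D + 1)[:, 0]                     # base point e_0 on H^D
H = np.eye(D + 1)[:, 1:K + 1]               # tangent basis of H
Hperp = np.eye(D + 1)[:, K + 1:]            # tangent basis of H-perp

rng = np.random.default_rng(5)
xbar = rng.standard_normal(D)
x = np.r_[np.sqrt(1 + xbar @ xbar), xbar]   # a point on H^D ( [x, x] = -1, x_0 > 0 )

d5 = np.arccosh(np.sqrt(1 + np.sum((Hperp.T @ J @ x) ** 2)))            # Proposition 5
d6 = np.arccosh(np.sqrt((p @ J @ x) ** 2 - np.sum((H.T @ J @ x) ** 2))) # Proposition 6
assert np.isclose(d5, d6)
```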

We represent points in H_H^D as a linear combination of the base point and tangent vectors. Given these K + 1 vectors, we can find a low-dimensional representation for points in H_H^D — reducing the dimensionality of hyperbolic data points.

Theorem 9: The isometry 𝒬 : H_H^D → H^K and its inverse are

𝒬(x) = α(C[x, p], [x, h₁], …, [x, h_K])^⊤, 𝒬^{−1}(y) = α(−y₀Cp + Σ_{k=1}^{K} y_k h_k),

where α = |C|^{−1/2} and H has complete orthogonal basis vectors {h_k}_{k∈[K]}. Both H_H^D and H^K have curvature C < 0.

Corollary 3: The affine dimension of H_H^D is dim(H).

Similar to the spherical case, we can characterize hyperbolic affine subspaces in terms of sliced J_D-unitary matrices — paving the way for constrained optimization methods over sliced J_D-unitary matrices to solve hyperbolic PCA problems.

Claim 10: For any H_H^D, there is a sliced J_D-unitary operator G : H^{dim(H)} → H_H^D, i.e., G^⊤J_D G = J_{dim(H)}, and vice versa.

C. Minimum Distortion Hyperbolic Subspaces

1). Review of Existing Work:

Chami et al. propose HoroPCA [41]. They define GH(p, q₁, …, q_K) as the geodesic hull of γ₁, …, γ_K, where γ_k is a geodesic such that γ_k(0) = p ∈ H^D and lim_{t→+∞} γ_k(t) = q_k ∈ ∂H^D for all k ∈ [K]. The geodesic hull of γ₁, …, γ_K contains the straight lines between γ_k(t) and γ_{k′}(t′) for all t, t′ ∈ R and k, k′ ∈ [K].

Claim 11: GH(p, q₁, …, q_K) is a hyperbolic affine subspace.

Their goal is to maximize a proxy for the projected variance:

cost^{Chami}_{H_H^D}(𝒳) = −E_N[d(𝒫̂_H(x_n), 𝒫̂_H(x_{n′}))²], (11)

where 𝒫^H is the horospherical projection operator — which is not a geodesic (distance-minimizing) projection. They propose a sequential algorithm to minimize the cost in equation (11), using a gradient descent method, as follows: (1) the base point is computed as the Fréchet mean via gradient descent; and (2) a higher-dimensional affine subspace is estimated based on the optimal affine subspace of lower dimension. One may formulate the hyperbolic PCA problem as follows:

min_{G ∈ R^{(D+1)×(K+1)}, y_n ∈ H^K, n∈[N]} E_N[‖x_n − Gy_n‖₂²] subject to G^⊤J_D G = J_K, (12)

where x₁, …, x_N ∈ R^{D+1} are the measurements and y₁, …, y_N ∈ H^K are low-dimensional hyperbolic features. The formulation in equation (12) leads to the decomposition of a Euclidean matrix in terms of a sliced J_D-unitary matrix and a hyperbolic matrix — a topic for future studies.

2). A Proper Cost Function for Hyperbolic PCA:

We choose f(x) = sinh²(√|C| x) to arrive at the following cost:

cost_{H_H^D}(𝒳) = E_N[Σ_{k∈[D−K]} [x_n, h_k]²]; (13)

see Proposition 5. We interpret cost_{H_H^D}(𝒳) as the aggregate dissipated energy of the points in the directions of the normal tangent vectors. If x ∈ H^D has no components in the directions of the normal tangents — i.e., [x, h_k] = 0, where h₁, …, h_{D−K} are orthogonal basis vectors for H^⊥ — then Σ_{k∈[D−K]} [x, h_k]² = 0. Our distortion function in equation (13) leads to the formulation of hyperbolic PCA as a constrained optimization problem:

Problem 12: Let x₁, …, x_N ∈ H^D and C_x = E_N[x_n x_n^⊤]. Hyperbolic PCA aims to find a point p ∈ H^D and a set of orthogonal vectors h₁, …, h_{D−K} ∈ T_pH^D = p^⊥ that minimize Σ_{k∈[D−K]} h_k^⊤ J_D C_x J_D h_k.

Problem 12 minimizes the cost in equation (13), i.e.,

cost_{H_H^D}(𝒳) = Σ_{k∈[D−K]} h_k^⊤ J_D C_x J_D h_k,

where p ∈ H^D, h₁, …, h_{D−K} ∈ p^⊥ with [h_i, h_j] = −Cδ_{i,j}, and C_x = E_N[x_n x_n^⊤] — over p ∈ H^D, H ⊆ p^⊥, and dim(H) = K. Problem 12 asks for J_D-orthogonal vectors h₁, …, h_{D−K} in an appropriate tangent space T_pH^D that capture the least possible energy of C_x with respect to the Lorentzian scalar product.

Remark 3: The spectrum of a matrix is its set of eigenvalues. Discarding an eigenvalue from the matrix's eigenvalue decomposition approximates the matrix with an error proportional to the magnitude of the discarded eigenvalue, that is, the discarded energy. Similarly, one can define the J_D-spectrum of the second-moment matrix C_x as the set of its J_D-eigenvalues. As we demonstrate in the numerical experiments, the J_D-spectrum can identify the existence of outlier hyperbolic points.

This is akin to the Euclidean PCA: the subspace is spanned by the leading eigenvectors of the covariance matrix. However, to prove the hyperbolic PCA theorem, we need a technical result on JD-diagonalizability of the second-moment matrix.

Proposition 7: If A ∈ R^{(D+1)×(D+1)} is a symmetric and J_D-diagonalizable matrix, i.e., A J_D V = V J_D Λ, with distinct (in absolute value) diagonal elements of Λ, then A = VΛV^⊤, where V is a J_D-unitary matrix.

From Proposition 7, any symmetric, JD-diagonalizable matrix with distinct (absolute) JD-eigenvalues has D positive and one negative JD-eigenvectors — all orthogonal to each other.

Theorem 13: Let x₁, …, x_N ∈ H^D and let C_x = E_N[x_n x_n^⊤] be a J_D-diagonalizable matrix. Then, the optimal solution for the point p is the scaled negative J_D-eigenvector of C_x, and the optimal h₁, …, h_{D−K} are the scaled positive J_D-eigenvectors that correspond to the smallest D − K J_D-eigenvalues of C_x. And H is spanned by the K scaled positive J_D-eigenvectors that correspond to the leading J_D-eigenvalues of C_x.

The J_D-diagonalizability condition requires C_x to be similar to a diagonal matrix. Proposition 7 provides a sufficient condition for its J_D-diagonalizability; in fact, we conjecture that symmetry is sufficient even if C_x has repeated J_D-eigenvalues.
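Under these conditions, Theorem 13 yields a direct algorithm. A minimal sketch for C = −1 (the helper name hyperbolic_pca is ours; it returns the base point and a basis of H^⊥, and assumes C_x J_D has real eigenvalues, which holds whenever C_x is positive definite):

```python
import numpy as np

def hyperbolic_pca(X, K):
    """X: (N, D+1) points on H^D (C = -1); returns base point p and a basis of H-perp."""
    N, D1 = X.shape
    J = np.diag(np.r_[-1.0, np.ones(D1 - 1)])
    Cx = X.T @ X / N                               # second-moment matrix
    lams, V = np.linalg.eig(Cx @ J)                # Euclidean proxy (Claim 7)
    lams, V = np.real(lams), np.real(V)
    signs = np.array([v @ J @ v for v in V.T])     # Lorentzian norms [v, v]
    i_neg = int(np.argmax(signs < 0))              # the unique negative direction
    p = V[:, i_neg] / np.sqrt(-signs[i_neg])       # scale so that [p, p] = -1
    if p[0] < 0:
        p = -p                                     # pick the upper sheet (x_0 > 0)
    pos = sorted((i for i in range(D1) if signs[i] > 0), key=lambda i: lams[i])
    return p, V[:, pos[:D1 - 1 - K]]               # smallest D - K positive ones

# Points spread mostly along the first spatial axis of H^2; K = 1.
rng = np.random.default_rng(6)
xbar = rng.standard_normal((200, 2)) * np.array([1.0, 0.05])
X = np.column_stack([np.sqrt(1 + np.sum(xbar**2, axis=1)), xbar])
p, Hperp = hyperbolic_pca(X, K=1)
J = np.diag([-1.0, 1.0, 1.0])
assert np.isclose(p @ J @ p, -1.0) and p[0] > 0    # p lies on H^2
assert np.allclose(Hperp.T @ J @ p, 0, atol=1e-6)  # H-perp is J_D-orthogonal to p
```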

Claim 14: The cost function in equation (13) is proper.

The proper cost in equation (13) implies the following closed-form definition for the hyperbolic centroid.

Definition 12: A hyperbolic mean of 𝒳 is μ(𝒳) if E_N[f(d(x_n, μ(𝒳)))] = min_{p ∈ H^D} E_N[f(d(x_n, p))]. The solution is the scaled negative J_D-eigenvector of C_x.

Remark 4: The formalisms of space form (Euclidean, spherical, and hyperbolic) PCAs show similarities through the use of (in)definite eigenequations. This arises from the introduction of proper cost functions, which result in quadratic cost functions with respect to the base points and tangent vectors. However, this approach is not necessarily generalizable to other Riemannian manifolds. This limitation is due to the absence of (1) a simple Riemannian metric, (2) a closed-form distance function, and (3) closed-form exponential and logarithmic maps in general Riemannian manifolds, e.g., the manifold of rank-deficient positive semidefinite matrices [45].

VI. Numerical Results

We compare our space form PCA algorithm (SFPCA) to other leading algorithms in terms of accuracy and speed.

A. Synthetic Data and Experimental Setup

We generate random, noise-contaminated points on known (but random) spherical and hyperbolic affine subspaces. We then apply PCA methods to estimate the affine subspace and recover the projected points. We conduct experiments examining the impact of the number of points N, the ambient dimension D, the dimension of the affine subspace K, and the noise level σ on the performance of the algorithms.

1). Random Affine Subspace:

For fixed ambient and subspace dimensions, D and K, we sample from a normal distribution and normalize the sample to get the spherical (or hyperbolic) point p. We then generate random vectors from the standard normal distribution and use the Gram–Schmidt process to construct tangent vectors: we project the first random vector onto the orthogonal complement of p and normalize it to obtain h_1 ∈ T_p S^D, where S ∈ {S, H}. We then project the second random vector onto the orthogonal complement of span{p} ⊕ H, where H = span{h_1}, and normalize it to obtain h_2 ∈ T_p S^D. We repeat this process until we form a K-dimensional affine subspace in T_p S^D.
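For the spherical case, this Gram–Schmidt construction is a few lines of NumPy; the sketch below uses our own illustrative names:

```python
import numpy as np

def random_spherical_subspace(D, K, rng):
    """Sample a random K-dim spherical affine subspace of S^D (unit sphere in R^{D+1}).

    Returns a base point p and a matrix H of K orthonormal tangent vectors,
    built by Gram-Schmidt against p and the previously accepted tangents.
    Names are illustrative, not taken from the paper's code release.
    """
    p = rng.standard_normal(D + 1)
    p /= np.linalg.norm(p)               # base point on the sphere
    basis = [p]                          # directions to orthogonalize against
    H = []
    while len(H) < K:
        v = rng.standard_normal(D + 1)
        for b in basis:                  # project out p and earlier tangents
            v -= (v @ b) * b
        v /= np.linalg.norm(v)
        basis.append(v)
        H.append(v)
    return p, np.column_stack(H)
```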

2). Noise-Contaminated Points:

Let H ∈ R^{(D+1)×K} be a basis for the subspace of T_p S^D. For n ∈ [N], we generate c_n ∼ 𝒩(0, α_S I_K) and let v_n = H c_n. To add noise, we project ν_n ∼ 𝒩(0, α_S σ² I_{D+1}) onto T_p S^D, i.e., P_{p^⊥} ν_n. We then let x_n = exp_p(v_n + P_{p^⊥} ν_n) be the noise-contaminated point. Finally, α_S = π/4 if S = S and α_S = 1 if S = H.
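A sketch of this generator for the unit sphere (curvature C = 1, α_S = π/4), with our own names; the exponential map used is exp_p(v) = cos(‖v‖) p + sin(‖v‖) v/‖v‖:

```python
import numpy as np

def sample_noisy_points(p, H, N, sigma, alpha=np.pi / 4, rng=None):
    """Generate noise-contaminated points around a spherical affine subspace.

    Illustrative sketch of the paper's setup for the unit sphere:
    v_n = H c_n with c_n ~ N(0, alpha I_K), noise nu_n projected onto T_p S^D,
    and x_n = exp_p(v_n + P_{p_perp} nu_n).
    """
    rng = rng or np.random.default_rng()
    D1, K = H.shape
    def exp_map(v):                       # exponential map on the unit sphere
        t = np.linalg.norm(v)
        return p if t < 1e-12 else np.cos(t) * p + np.sin(t) * v / t
    X = np.empty((N, D1))
    for n in range(N):
        v = H @ rng.normal(0.0, np.sqrt(alpha), size=K)       # in-subspace part
        nu = rng.normal(0.0, np.sqrt(alpha) * sigma, size=D1)
        nu -= (nu @ p) * p                # project the noise onto T_p S^D
        X[n] = exp_map(v + nu)
    return X
```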

3). PCA on Noisy Data:

We use each algorithm to estimate an affine subspace S_Ĥ^D, where Ĥ ⊆ T_p̂ S^D and p̂ are the estimated parameters. We let n_i ≝ E_N[d(x_n, 𝒫_H(x_n))] be the empirical mean of the distance between the measurements and the true subspace, and n_o ≝ E_N[d(𝒫_Ĥ(x_n), 𝒫_H(𝒫_Ĥ(x_n)))] be the average distance between the denoised points {𝒫_Ĥ(x_n)}_{n∈[N]} and the true affine subspace. If S_Ĥ^D is a good approximation to S_H^D, then n_o is small. We evaluate the performance of the algorithms using the normalized output error n_o/n_i.

Remark 5: The ratio of n_o over n_i quantifies how much farther the denoised points are from the true subspace compared to the original noise-contaminated points. This is a normalized quantity, i.e., it is invariant with respect to the scale of the data points, which makes it ideal for comparing results as D, K, σ, and N vary. A reasonable upper bound for this ratio is 1, as PCA is expected to denoise the point sets by finding the optimal low-dimensional affine subspace for them.
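For the unit sphere, both n_i and n_o can be computed with the closed-form projection (Euclidean projection onto the span of p and the tangent vectors, followed by renormalization). A minimal sketch with our own names:

```python
import numpy as np

def proj(X, p, H):
    """Project rows of X onto the spherical affine subspace spanned by p and H,
    for curvature C = 1: Euclidean projection, then renormalization (sketch)."""
    B = np.column_stack([p, H])          # orthonormal basis of span{p} + H
    Y = X @ B @ B.T
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

def sph_dist(X, Y):
    return np.arccos(np.clip(np.sum(X * Y, axis=1), -1.0, 1.0))

def normalized_output_error(X, p_true, H_true, p_hat, H_hat):
    """n_o / n_i: distance of denoised points to the true subspace, normalized
    by the distance of the raw points to the true subspace."""
    n_i = sph_dist(X, proj(X, p_true, H_true)).mean()
    Xd = proj(X, p_hat, H_hat)           # denoised points on estimated subspace
    n_o = sph_dist(Xd, proj(Xd, p_true, H_true)).mean()
    return n_o / n_i
```

A perfect estimate (p̂, Ĥ) = (p, H) yields a ratio of zero, since the denoised points land exactly on the true subspace.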

4). Randomized Experiments and Algorithms:

For each random affine subspace and noise-contaminated point set, we report the normalized error and the running time for each algorithm. Then, we repeat each random trial 100 times. We use our implementation of principal geodesic analysis (PGA) [33]. We also implement Riemannian functional principal component analysis (RFPCA) for spherical PCA [28] and spherical principal component analysis (SPCA) [40]. Since SPCA is computationally expensive, we first run our SFPCA to provide it with good initial parameters. For hyperbolic experiments, we use HoroPCA [41] and barycentric subspace analysis (BSA) [37], implemented by Chami et al. [41].

B. Spherical PCA

1). Experiment SK1 :

For fixed D = 10² and N = 10⁴, increasing the subspace dimension K worsens the normalized output errors of all algorithms; see Fig. 3(a). RFPCA is unreliable, while the other methods are similar in their error-reduction patterns. When K is close to D, SFPCA has a marginal but consistent advantage over PGA. SFPCA is faster than the rest, and K has a minor impact on running times.

Fig. 3.

For each spherical experiment, on the y-axes, we report running time and normalized output error. We show the median (solid line) and the first and third quartiles (shaded transparent area) over all random trials. Figures (a,b,c) show the results for SK1,SD1,SN1, respectively. All axes are in logarithmic scale.

2). Experiments SD1 and SD2 :

In SD1 — fixed K = 1, N = 10⁴, and varying D — PGA, SFPCA, and SPCA exhibit similar denoising performance, not impacted by D; see Fig. 3(b). RFPCA has higher output error levels than the other methods. To further compare SFPCA and its close competitor PGA, we design the challenging experiment SD2 with K = 10 and N = 10³. In this setting, SFPCA exhibits a clear advantage over PGA in error reduction; see Fig. 4. In both settings, SFPCA continues to be the fastest in almost all conditions, despite using a warm start for PGA.

Fig. 4.

Spherical experiment SD2. The y-axes show running time and normalized output error. All axes are in logarithmic scale.

3). Experiment SN1:

For fixed K = 1 and D = 10², when we vary N and σ, our SFPCA has the fastest running time and is tied with SPCA and PGA for the lowest normalized output error; see Fig. 3(c). As expected, increasing N generally makes all methods slower, partially because computing C_x has O(N) complexity. Computing a base point p using iterative computations on all N points becomes time-consuming as N grows, whereas our SFPCA has a worst-case complexity of O(D³) once C_x is formed. SFPCA provides error reductions similar to the rest, likely because each algorithm is given an excessive number of points. SPCA fails in some cases, as evident from the erratic behavior of its normalized output error. This may be due to the algorithm's failure to converge within the allocated maximum running time. SPCA takes about 15 minutes on 10⁴ points in each trial, while our SFPCA takes less than a second. PGA is the closest competitor in normalized error but is about three times slower.

C. Hyperbolic PCA

1). Experiments HK1 and HK2:

On small datasets in HK1 (D = 50, N = 51), for each trial, HoroPCA and BSA take close to an hour whereas SFPCA and PGA take milliseconds; see Fig. 5(a). Increasing K only increases the running time of BSA and HoroPCA but does not change SFPCA's and PGA's. This is expected, as BSA and HoroPCA estimate an affine subspace greedily, one dimension at a time. Regarding error reduction, as expected, all methods become less effective as K grows. For small σ, all methods achieve similar normalized output error levels, with only a slight advantage for PGA and SFPCA. As σ increases, PGA and HoroPCA become less effective compared to BSA and SFPCA. For large σ, SFPCA exhibits a clear advantage over all other methods. In the larger HK2 experiments (D = 10², N = 10⁴), we compare the two fastest methods, SFPCA and PGA; see Fig. 6(a). When σ is small, both methods have similar denoising performance for small K; SFPCA performs better only for larger K. As σ increases, SFPCA outperforms PGA irrespective of K.

Fig. 5.

For each scaled-down hyperbolic experiment, on the y-axes, we report running time and normalized output error in logarithmic scale. We report the median (solid line) and the first and third quartiles (shaded transparent area) over all random trials. Figures in rows (a), (b), and (c) are HK1,HD1, and HN1.

Fig. 6.

For each full-scale hyperbolic experiment, on the y-axes, we report running time and normalized output error in logarithmic scale. We report the median (solid line) and the first and third quartiles (shaded transparent area). Figures in rows (a),(b), and (c) are HK2,HD2, and HN2. All axes are in logarithmic scale.

2). Experiments HD1 and HD2 :

In HD1, we fix K = 1, N = 101, and in HD2, we let K = 1, N = 10⁴. Changing D impacts each method differently; see Fig. 5(b). Both SFPCA and PGA take successively more time as D increases, but they remain significantly faster than the other two, with average running times below 0.1 seconds. The running time of HoroPCA is (almost) agnostic to D since its cost function (projected variance) is free of the parameter D. Neither HoroPCA nor BSA outperforms SFPCA in error reduction. All methods improve in their error-reduction performance as D increases. For large σ, SFPCA provides the best error reduction among all algorithms. Comparing the fastest methods, SFPCA and PGA, we observe consistent patterns in HD1 and HD2: (1) SFPCA is faster regardless of D, and the gap between the two methods can be as high as a factor of 10. (2) When σ < 0.1, PGA slightly outperforms SFPCA in reducing error; with the lowest noise (σ = 0.01), PGA gives 17% better accuracy, on average over all values of D. However, as σ increases, SFPCA becomes more effective; at the highest end (σ = 0.5), SFPCA outperforms PGA by 40%, on average over D; see Fig. 6(b).

3). Experiments HN1 and HN2:

In HN1 (K = 1, D = 10), increasing N impacts the running time of SFPCA and PGA due to the computation of C_x; see Fig. 5(c). Nevertheless, both are orders of magnitude faster than HoroPCA and BSA. All methods provide improved error reduction as N increases. Comparing the fast methods SFPCA and PGA on the larger datasets of HN2 (K = 1, D = 10²) shows that SFPCA is always faster, has a slight disadvantage in output error for small σ, and offers substantial improvements for large σ; see Fig. 6(c).

D. Real Data: Spherical Spaces

We evaluate the performance of PCA methods using the following datasets: (1) Intestinal Microbiome: Lahti et al. [51] analyzed the gut microbiota of N = 1,006 adults covering D = 130 bacterial groups. The study explored the effects of age, nationality, BMI, and DNA extraction methods on the microbiome. They assessed variations in microbiome composition across young (18 to 40), middle-aged (41 to 60), and older (61 to 77) age groups (a ternary classification problem). (2) Throat Microbiome: Charlson et al. [52] investigated the impact of cigarette smoking on the airway microbiota in 29 smokers and 33 non-smokers (a binary classification problem) using culture-independent 454 pyrosequencing of 16S rRNA. (3) Newsgroups: Using Python's scikit-learn package, the 20 Newsgroups dataset was streamlined to a binary classification problem by retaining N = 400 samples from two distinct classes. Feature reduction was performed using TF-IDF, narrowing the data down to D = 3000 features to improve computational efficiency. Each dataset has undergone standard preprocessing, e.g., normalization and square-root transformation, to ensure the data points are spherically distributed. For a fixed subspace dimension, we estimate spherical affine subspaces. Then, we compute the projected spherical data points and denoise the original compositional data.

Distortion Analysis.

For compositional data, we calculate distance matrices using the Aitchison (AI), Jensen–Shannon (JS), and total variation (TV) metrics. We also compute the spherical distance matrix (S). For each embedding dimension K, we compute the projected point sets. We then compute the normalized errors; an example of a normalized error is ‖D_TV − D̂_TV‖_F / ‖D_TV‖_F, where D_TV and D̂_TV are the total variation distance matrices for the original and estimated data. For each algorithm, we then divide these normalized errors by their average across all algorithms, providing relative measures; that is, if the resulting relative error is greater than 1, the algorithm performs worse than average. We then report the mean and standard deviation of relative errors across different dimensions; see Table II. On all datasets, SFPCA outperforms the rest. The Newsgroups experiments are limited to SFPCA and PGA due to the significant computational complexity of SPCA and RFPCA.
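The normalization and the cross-method relative errors can be sketched as follows (illustrative helper names):

```python
import numpy as np

def normalized_error(D_true, D_est):
    """||D - D_hat||_F / ||D||_F for a pair of pairwise-distance matrices."""
    return np.linalg.norm(D_true - D_est) / np.linalg.norm(D_true)

def relative_errors(errors_by_method):
    """Divide each method's normalized error by the mean across methods, so
    values > 1 mean worse than average. `errors_by_method` maps method name
    to its normalized error (an assumed, illustrative structure)."""
    mean = sum(errors_by_method.values()) / len(errors_by_method)
    return {m: e / mean for m, e in errors_by_method.items()}
```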

TABLE II.

The Mean and Standard Deviation of Normalized Distance Errors Are Divided by Their Average Across Methods. Classification Accuracies Are Percentage Deviations From 100% — Representing the Average Accuracy Across Methods. Boldface and Red Indicate SFPCA and the Top-Performing Method. Lower Distortions (↓) and Higher Accuracies (↑) Are Better

Metric (Method) Throat Microbiome Intestinal Microbiome Newsgroups
SFPCA RFPCA SPCA PGA SFPCA RFPCA SPCA PGA SFPCA PGA
S (↓) 0.88 ± 0.99 1.13 ± 1.28 0.9 ± 0.99 1.1 ± 1.18 0.75 ± 1.64 0.93 ± 2.09 1.47 ± 2.07 0.85 ± 1.79 0.77 ± 1.03 1.23 ± 1.4
AI (↓) 0.98 ± 0.48 1.03 ± 0.57 0.99 ± 0.46 1.01 ± 0.52 0.8 ± 1.06 0.81 ± 1.03 1.55 ± 1.04 0.84 ± 1.14 0.999 ± 0.4 1.001 ± 0.5
JS (↓) 0.91 ± 0.84 1.08 ± 1.04 0.95 ± 0.84 1.06 ± 0.97 0.75 ± 1.55 0.93 ± 1.96 1.47 ± 1.93 0.85 ± 1.7 0.9 ± 0.81 1.1 ± 0.98
TV (↓) 0.91 ± 1.0 1.08 ± 1.21 0.94 ± 1.0 1.07 ± 1.14 0.76 ± 1.65 0.92 ± 2.09 1.49 ± 2.07 0.83 ± 1.75 0.87 ± 0.88 1.13 ± 1.12
NN (↑) −0.06 ± 3.03 0.15 ± 3.15 −0.26 ± 3.7 0.17 ± 3.0 0.04 ± 0.36 −0.05 ± 0.9 −0.02 ± 0.6 0.03 ± 0.3 0.04 ± 1.6 −0.04 ± 1.6
RF (↑) 0.59 ± 9.5 1.2 ± 9.6 −1.9 ± 9.3 0.11 ± 9.8 0.4 ± 2.3 0.11 ± 2.7 −0.74 ± 2.8 0.2 ± 2.3 0.3 ± 3.5 −0.3 ± 4.4

Classification Performance.

For each K, using the denoised compositional data, we train two classifiers: a five-layer neural network (NN) and a random forest model (RF). We normalize the classification accuracies by the average accuracy of all methods and report their mean and standard deviation. From Table II, SFPCA outperforms competing methods on Intestinal Microbiome and Newsgroups, though the accuracy differences are mostly less than one percent. In Section I, we further compare the performance of the two classifiers on Newsgroups as it relates to PCA analysis.

E. Real Data: Hyperbolic Spaces

We use a biological dataset of 103 plant and algal transcriptomes [53]. The authors inferred phylogenetic trees from genome-wide genes. Tree leaves are present-day species, internal nodes are ancestral species, and branch weights represent evolutionary distances. The dataset includes an “unfiltered” version with 852 trees and a “filtered” version with 844 trees after removing error-prone genes and filtering problematic sequences. Errors appear as outliers, with more expected in the unfiltered dataset a priori. Other studies [54], [55] have used this dataset to evaluate outlier detection methods. We preprocess each tree by rescaling branch lengths to a diameter of 10, compute the distance matrix between leaves, and embed it into a D-dimensional (D=20) hyperbolic space using a semidefinite program [18]. We use two metrics to evaluate PCA results, and then apply them for outlier detection.

Distortion Analysis.

For a fixed dimension K, we estimate hyperbolic affine subspaces, compute the projected hyperbolic points, and compute their hyperbolic distance matrix D̂_H. The normalized distance error ‖D_H − D̂_H‖_F / ‖D_H‖_F is calculated, where D_H is the original distance matrix. These errors are averaged over K ∈ [D] for each algorithm and then divided by the average normalized errors across all algorithms — providing relative errors. If the relative error is greater than 1, the algorithm performs worse than average. For each algorithm, we report the mean and standard deviation of these relative errors across all gene trees. Distortion is not a perfect measure of PCA accuracy, as highly noise-contaminated data should experience high distortion during the projection (denoising) step. In all experiments, PGA outperforms the others in terms of distortion (Table III). SFPCA provides average distance-preserving performance, contrary to the synthetic experiments. We conjecture this may be due to the trees being relatively small (see the scaled-down hyperbolic experiments in Fig. 5), to high noise levels making distortion an inappropriate accuracy metric, or to the discordance between our choice of distortion function f = cosh (which overemphasizes large distances) and the distance distortion metric.

TABLE III.

Normalized Distance Errors (H) Mean and Standard Deviation Are Divided by Their Average Across Methods. Quartet Scores (Q) Are Percentage Deviations From 100% (the Average). Lower Distortions (↓) and Higher Scores (↑) Are Better

Method Filtered Unfiltered
Q (↑) H (↓) Q (↑) H (↓)
SFPCA 1.50 ± 1.91 0.98 ± 0.23 1.10 ± 1.74 1.01 ± 0.25
PGA 1.48 ± 1.47 0.55 ± 0.09 1.80 ± 1.45 0.53 ± 0.08
BSA 1.48 ± 1.61 1.45 ± 0.24 1.67 ± 1.91 1.45 ± 0.31
HoroPCA −4.48 ± 2.79 1.01 ± 0.21 −4.57 ± 2.58 1.00 ± 0.20

Quartet Scores.

To use a biologically motivated accuracy measure, we use the quartet score [56]. For a target dimension K, each algorithm is applied to an embedded hyperbolic point set to compute the projected (denoised) points. For each set of four points, we find the optimal tree topology with minimum distance distortion using the four-point condition [57]. For 10⁵ randomly chosen (but fixed) sets of four projected points, we estimate their topology and compare it with the true topology from the gene trees. For each dimension K ∈ [D], we compute the percentage of correctly estimated topologies, then average this over all dimensions. We normalize the quartet scores by the average score of all methods and report their mean and standard deviation, as detailed in Table III. In these experiments, PGA and SFPCA exhibit the best performance compared to the alternatives. This is particularly informative, as the quartet score measures tree topology accuracy, not distance.
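The four-point-condition step can be sketched as follows: among the three possible pairings of four leaves, the split whose across-pair distance sum is smallest is selected (an illustrative sketch; the paper's pipeline may differ in details):

```python
def quartet_topology(d):
    """Pick the quartet topology via the four-point condition: for leaves
    {0, 1, 2, 3} with pairwise distances d[i][j], the pairing whose within-pair
    distance sum is smallest separates the quartet (illustrative sketch)."""
    pairings = [((0, 1), (2, 3)), ((0, 2), (1, 3)), ((0, 3), (1, 2))]
    sums = [d[a][b] + d[c][e] for (a, b), (c, e) in pairings]
    return pairings[sums.index(min(sums))]
```

For an additive tree metric with split 01|23, d(0,1) + d(2,3) is strictly smaller than the two alternative sums, so the minimum identifies the correct topology.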

1). Outlier Detection With Hyperbolic Spectrum:

We showcase the practical utility of the J_D-eigenequation (Definition 10) in species tree estimation. As proved in Theorem 4, the principal axes align with the leading J_D-eigenvectors of C_x. Thus, the optimal SFPCA cost corresponds to the sum of its neglected J_D-eigenvalues. We conjecture that a tree with outliers has more outlier J_D-eigenvalues; see Fig. 7(a)–7(b). If a tree has an outlier set of species (likely from incorrect sequences), its second leading J_D-eigenvalue (λ_2)¹ is significantly larger than the rest. We quantify this by plotting its normalized retained energy Σ_{k=2}^{K} λ_k / Σ_{d=2}^{D} λ_d (or cumulative spectrum) versus the normalized embedding dimension (or number of J_D-eigenvalues) x = K/D and finding its knee point. This lets us sort gene trees by their hyperbolic spectrum.
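A sketch of the knee-point computation, following the definition in Fig. 7 (the crossing of the retained-energy curve with the y = 1 − x line); the names and the argmin-based crossing search are ours:

```python
import numpy as np

def knee_point(eigvals):
    """Knee of the normalized retained-energy curve: the point where the
    cumulative spectrum (excluding lambda_1, which corresponds to the base
    point) crosses the y = 1 - x line. Illustrative sketch."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1][1:]  # drop lambda_1
    x = np.arange(1, lam.size + 1) / lam.size                  # normalized dim
    energy = np.cumsum(lam) / lam.sum()                        # retained energy
    i = int(np.argmin(np.abs(energy - (1.0 - x))))             # crossing index
    return x[i], energy[i]
```

A spectrum dominated by one large λ_2 (an outlier tree) yields a knee near (0, 1), whereas a flat spectrum yields a knee farther to the right.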

Fig. 7.

(a) and (b): Normalized retained energy versus normalized number of eigenvalues for two gene trees. The knee point (the intersection with the y = 1 − x line) for trees with outliers approaches (0, 1). (c) The quartet score for species trees constructed using the top N trees (by knee value) versus random orders.

After sorting, we use the top N trees with the smallest knee values (least prone to outliers) to construct a species topology using ASTRAL [58]. ASTRAL outputs the quartet score between the estimated species tree and the input gene trees, where a higher score indicates more congruence among input trees. Thus, a higher score after filtering means outlier gene trees, likely inferred from problematic sequences, have been removed. Our results (Fig. 7(c)) show that hyperbolic spectrum-based sorting — offered only by our SFPCA — effectively identifies the worst trees most dissimilar to others, without explicitly comparing tree topologies. In contrast, random sorting keeps the quartet score fixed. Filtered trees have a higher score than unfiltered trees and benefit less from further filtering. It is remarkable that using eigenvalues alone, we can effectively find genes with discordant evolutionary histories.

Acknowledgment

The authors would like to thank the National Institutes of Health (NIH) for their financial support.

This research was supported in part by the NIH under Grant 1R35GM142725. The associate editor coordinating the review of this article and approving it for publication was Dr. Marco Felipe Duarte.

Appendix

2) Claim 1: Let x, y ∈ S_H^D ⊆ S^D and let γ_{x,y}(t) be the geodesic with γ_{x,y}(0) = x and γ_{x,y}(1) = y. This geodesic lies at the intersection of span{x, y} and S^D. Since x, y ∈ S_H^D = (span{p} ⊕ H) ∩ S^D and span{p} ⊕ H is a subspace, we have span{x, y} ⊆ span{p} ⊕ H. Therefore, we have γ_{x,y}(t) ∈ span{x, y} ∩ S^D ⊆ (span{p} ⊕ H) ∩ S^D = S_H^D for all t ∈ [0, 1].

3) Claim 2: For a spherical affine subspace S_H^D, we define the sliced unitary matrix G = [C^{1/2} p, C^{-1/2} h_1, …, C^{-1/2} h_K], where p is the base point and h_1, …, h_K are an orthogonal basis for H. Then, we have G y ∈ S_H^D for all y ∈ S^K. Conversely, for any sliced unitary matrix G = [g_0, g_1, …, g_K], we let p = C^{-1/2} g_0 and h_k = C^{1/2} g_k for k ∈ [K]. Since the h_k's and p are orthogonal, we can define the spherical affine subspace.

4) Claim 3: We simplify its PCA cost as cost_{S_H^D}(𝒳) = 1 − E_N[‖P_H(x_n)‖₂²] = (a) E_N[Σ_{k∈[K^⊥]} ⟨x_n, h_k^⊥⟩²] = (b) Σ_{k∈[K^⊥]} h_k^⊥⊤ C_x h_k^⊥, where (a) follows from Proposition 1, h_1^⊥, …, h_{K^⊥}^⊥ are an orthogonal basis for the orthogonal complement of H in T_p S^D, and (b) follows from the cyclic property of the trace.

5) Claim 4: From Corollary 2 and Definition 7, an optimal zero-dimensional affine subspace (a point) is a subset of any other optimal spherical affine subspace. In general, S_{H_1}^D ⊆ S_{H_2}^D if and only if dim(S_{H_1}^D) ≤ dim(S_{H_2}^D).

6) Claim 5: Let v ∈ C^{D+1} be such that A J_D v = sgn(⟨v*, v⟩) λ v, where |⟨v*, v⟩| = 1. Then ‖v‖₂^{-1} v is an eigenvector of A J_D with eigenvalue sgn(⟨v*, v⟩) λ.

7) Claim 6: Let x, y ∈ H_H^D and let γ_{x,y} be the geodesic with γ_{x,y}(0) = x and γ_{x,y}(1) = y. This geodesic belongs to span{x, y} ∩ H^D. We have span{x, y} ⊆ span{p} ⊕ H since x, y ∈ H_H^D = (span{p} ⊕ H) ∩ H^D and span{p} ⊕ H is a subspace. Thus, we have γ_{x,y}(t) ∈ span{x, y} ∩ H^D ⊆ H_H^D for all t ∈ [0, 1].

8) Claim 7: For a hyperbolic affine subspace H_H^D, we define G = [|C|^{1/2} p, |C|^{-1/2} h_1, …, |C|^{-1/2} h_K], where p is the base point and h_1, …, h_K are an orthogonal basis for H. We have G^⊤ J_D G = J_K and G y ∈ H_H^D for all y ∈ H^K. Conversely, for any sliced J-unitary matrix G = [g_0, g_1, …, g_K], we let p = |C|^{-1/2} g_0 and h_k = |C|^{1/2} g_k for k ∈ [K]. Since the h_k's and p are orthogonal, we can define the hyperbolic affine subspace.

9) Claim 8: Let p ∈ H^D, q_1, …, q_K ∈ H^D, and let γ_1, …, γ_K be the aforementioned geodesics. A point x ∈ S ≝ GH(p, q_1, …, q_K) belongs to a geodesic whose endpoints are γ_k(t) and γ_{k′}(t′) for t, t′ ∈ R and k, k′ ∈ [K]. Let us show x ∈ H_H^D for a subspace H ⊆ T_p H^D. From Claim 6, H_H^D is a geodesic submanifold. It suffices to show that γ_1, …, γ_K belong to H_H^D. Let h_k = γ̇_k(0) ∈ T_p H^D for all k ∈ [K], and let H = span{h_1, …, h_K}. This proves S ⊆ H_H^D. Conversely, let x ∈ H_H^D — the hyperbolic affine subspace constructed as before. Since H_H^D is a geodesic submanifold, x belongs to a geodesic whose endpoints are γ_k(t) and γ_{k′}(t′) for t, t′ ∈ R and k, k′ ∈ [K], constructed as before. From x ∈ conv hull(γ_1, …, γ_K), we have H_H^D ⊆ S.

10) Claim 9: From Theorem 4 and Definition 12, an optimal zero-dimensional affine subspace is a subset of any other optimal affine subspace. For optimal affine subspaces H_{H_i}^D, we have H_{H_1}^D ⊆ H_{H_2}^D if and only if dim(H_{H_1}^D) ≤ dim(H_{H_2}^D).

11) Proposition 1: Consider the following Lagrangian:

ℒ(y, γ, Λ) = ⟨x, y⟩ + γ(⟨y, y⟩ − C^{-1}) + Σ_{k∈[K^⊥]} λ_k ⟨y, h_k^⊥⟩, (14)

where Λ ≝ {λ_k : k ∈ [K^⊥]}, ⟨y, y⟩ = C^{-1}, and ⟨y, h_k^⊥⟩ = 0 for k ∈ [K^⊥]. The solution to equation (14) takes the form 𝒫_H(x) = Σ_{i∈[K^⊥]} α_i h_i^⊥ + β x, for scalars {α_i}_{i∈[K^⊥]} and β. The subspace conditions — ⟨𝒫_H(x), h_k^⊥⟩ = 0, k ∈ [K^⊥] — give 𝒫_H(x) = β P_H(x), where P_H(x) = x − C^{-1} Σ_{k∈[K^⊥]} ⟨x, h_k^⊥⟩ h_k^⊥. Enforcing the norm condition, we arrive at 𝒫_H(x) = C^{-1/2} ‖P_H(x)‖₂^{-1} P_H(x), where ‖P_H(x)‖₂² = C^{-1}(1 − Σ_{k∈[K^⊥]} ⟨x, h_k^⊥⟩²). Then, we have d(x, 𝒫_H(x)) = C^{-1/2} acos(C ⟨x, C^{-1/2} ‖P_H(x)‖₂^{-1} P_H(x)⟩) = C^{-1/2} acos(C^{1/2} ‖P_H(x)‖₂). If P_H(x) = 0 — i.e., x ∈ span{h_1^⊥, …, h_{K^⊥}^⊥} — then 𝒫_H(x) ∈ S_H^D is nonunique, but the projection distance is well-defined as d(x, y) = C^{-1/2} π/2 for all y ∈ S_H^D.

12) Proposition 2: Let A = [C p, h_1, …, h_K, h_1^⊥, …, h_{K^⊥}^⊥] ∈ R^{(D+1)×(D+1)}, where K + K^⊥ = D. Distinct columns of A are orthogonal, i.e., A^⊤ A = C I_{D+1}. Hence, p, h_1, …, h_K, h_1^⊥, …, h_{K^⊥}^⊥ are linearly independent. Therefore, we have P_H(x) = Σ_{k∈[K]} α_k h_k + Σ_{k∈[K^⊥]} β_k h_k^⊥ + γ p = (a) x − C^{-1} Σ_{k∈[K^⊥]} ⟨x, h_k^⊥⟩ h_k^⊥, where {α_k}_{k∈[K]}, {β_k}_{k∈[K^⊥]}, and γ are scalars, and (a) is due to Proposition 1. We hence have β_k = C^{-1} ⟨P_H(x), h_k^⊥⟩ = 0, α_k = C^{-1} ⟨P_H(x), h_k⟩ = C^{-1} ⟨x, h_k⟩, and γ = C ⟨P_H(x), p⟩ = C ⟨x, p⟩. We can accordingly compute ‖P_H(x)‖₂ and prove the proposition.
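As a numerical sanity check of this closed form for C = 1, the projection (Euclidean projection onto span{p, h_1, …, h_K} followed by renormalization) should be at least as close to x as any densely sampled point of the subspace. The script below is an illustrative check with our own names:

```python
import numpy as np

# Closed-form projection onto a spherical affine subspace for C = 1
# (Propositions 1-2, illustrative sketch): Euclidean projection onto
# span{p, h_1, ..., h_K}, then renormalization to the unit sphere.
def project(x, B):
    y = B @ (B.T @ x)                    # B has orthonormal columns [p, h_1, ...]
    return y / np.linalg.norm(y)

rng = np.random.default_rng(0)
B = np.linalg.qr(rng.standard_normal((5, 2)))[0]  # random base point + 1 tangent
x = rng.standard_normal(5)
x /= np.linalg.norm(x)
px = project(x, B)
# Brute force: sample the one-dimensional subspace (a great circle) densely.
ts = np.linspace(0, 2 * np.pi, 20000, endpoint=False)
circle = np.cos(ts)[:, None] * B[:, 0] + np.sin(ts)[:, None] * B[:, 1]
best = circle[np.argmin(np.arccos(np.clip(circle @ x, -1, 1)))]
assert np.arccos(np.clip(px @ x, -1, 1)) <= np.arccos(np.clip(best @ x, -1, 1)) + 1e-4
```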

13) Proposition 3: Let A be a real matrix such that A = A^[*], that is, A = J_D A^* J_D. Let (v, λ) be a J_D-eigenvector–eigenvalue pair of A, i.e., A J_D v = sgn(⟨v*, v⟩) λ v; then A J_D v* = sgn(⟨v*, v⟩) λ* v*. We have λ* = λ since

λ ⟨v*, v⟩ = (v*)^⊤ J_D A J_D v · sgn(⟨v*, v⟩) = (A J_D v*)^⊤ J_D v · sgn(⟨v*, v⟩) = sgn(⟨v*, v⟩) λ* ⟨v*, v⟩ · sgn(⟨v*, v⟩) = λ* ⟨v*, v⟩.

14) Proposition 4: Let A be a full-rank matrix and v ∈ C^{D+1} be such that A J_D v = λ v and ⟨v*, v⟩ = 0. Then, we have v^H J_D A J_D v = v^H J_D λ v = λ ⟨v*, v⟩ = 0. This contradicts the assumption that A is full-rank.

15) Proposition 5: Consider the following Lagrangian:

ℒ(y, γ, Λ) = [x, y] + γ([y, y] − C^{-1}) + Σ_{k∈[K^⊥]} λ_k [y, h_k^⊥], (15)

where Λ = {λ_k}_{k∈[K^⊥]}, admits the solution 𝒫_H(x) = Σ_{i∈[K^⊥]} α_i h_i^⊥ + β x, for scalars {α_i}_{i∈[K^⊥]} and β. The subspace conditions, [𝒫_H(x), h_k^⊥] = 0 for k ∈ [K^⊥], give 𝒫_H(x) = β P_H(x), where P_H(x) = x + C^{-1} Σ_{k∈[K^⊥]} [x, h_k^⊥] h_k^⊥. Enforcing the norm condition, we get 𝒫_H(x) = |C|^{-1/2} ‖P_H(x)‖_*^{-1} P_H(x), where ‖P_H(x)‖_* ≝ √(−[P_H(x), P_H(x)]) = |C|^{-1/2} √(1 + Σ_{k∈[K^⊥]} [x, h_k^⊥]²). We have d(x, 𝒫_H(x)) = |C|^{-1/2} acosh(C [x, 𝒫_H(x)]) = |C|^{-1/2} acosh(|C|^{1/2} ‖P_H(x)‖_*).

16) Proposition 6: Let A = [C p, h_1, …, h_K, h_1^⊥, …, h_{K^⊥}^⊥] ∈ R^{(D+1)×(D+1)}, where K + K^⊥ = D. The columns of A are J-orthogonal, that is, A^⊤ J_D A = |C| J_D. Since we have J_D A^⊤ J_D A = |C| I_{D+1}, the vectors p, h_1, …, h_K, h_1^⊥, …, h_{K^⊥}^⊥ are linearly independent, and we have P_H(x) = Σ_{k∈[K]} α_k h_k + Σ_{k∈[K^⊥]} β_k h_k^⊥ + γ p = (a) x + C^{-1} Σ_{k∈[K^⊥]} [x, h_k^⊥] h_k^⊥, where {α_k}_{k∈[K]}, {β_k}_{k∈[K^⊥]}, and γ are scalars, and (a) is due to Proposition 5. So we have β_k = −C^{-1} [P_H(x), h_k^⊥] = 0, α_k = −C^{-1} [x, h_k], and γ = C [P_H(x), p] = C [x, p], i.e., P_H(x) = C [x, p] p − C^{-1} Σ_{k∈[K]} [x, h_k] h_k. Then, we can compute ‖P_H(x)‖_* and prove the proposition.

17) Proposition 7: Let A = V Λ V^⊤, where V is J_D-unitary. Since Λ is a diagonal matrix, we can write A = V J_D Λ J_D V^⊤. We have A J_D V = V J_D Λ J_D V^⊤ J_D V = V J_D Λ, i.e., A is J_D-diagonalizable.

Conversely, let A ∈ R^{(D+1)×(D+1)} be symmetric and A J_D V = V J_D Λ for an invertible V and Λ = diag(λ_d : d ∈ [D+1]) with elements distinct in absolute value. Then, we have

A J_D v_d = −λ_d v_d if d = 1, and A J_D v_d = λ_d v_d if d ≠ 1, (16)

where v_d is the d-th column of V. In the eigenequation (16), the negative (positive) sign is designated for the eigenvectors with negative (positive) norms. For distinct i and j, we have

λ_i ⟨v_i, v_j⟩ = (a) ⟨A J_D v_i, v_j⟩ = v_i^⊤ J_D A J_D v_j = ⟨v_i, A J_D v_j⟩ = λ_j ⟨v_i, v_j⟩,

where (a) is due to the eigenequation (16). Since λ_i ≠ λ_j, we must have ⟨v_i, v_j⟩ = 0. Without loss of generality, the J_D-eigenvectors are scaled such that |⟨v_d, v_d⟩| = 1 for d ∈ [D+1]. Lemma 1 shows that ⟨v_1, v_1⟩ = −1 and ⟨v_d, v_d⟩ = 1 for d > 1.

Lemma 1: V^⊤ J_D V = J_D.

Proof: Let A satisfy A J_D V = V J_D Λ. Then V diagonalizes B ≝ J_D A J_D A J_D, viz., V^⊤ B V = (a) V^⊤ J_D V J_D Λ J_D Λ = (b) Δ Λ², where (a) follows from applying A J_D V = V J_D Λ twice, and (b) from ⟨v_i, v_j⟩ = 0 for i ≠ j, which makes Δ ≝ V^⊤ J_D V a diagonal matrix with entries ⟨v_d, v_d⟩ = ±1. However, B = (A J_D)^⊤ J_D (A J_D) is congruent to J_D and is therefore a symmetric matrix with only one negative eigenvalue [18]. Therefore, without loss of generality, the first diagonal element of Δ is negative, i.e., ⟨v_1, v_1⟩ = −1 and V^⊤ J_D V = J_D. □

From Lemma 1, we have V^{-1} = J_D V^⊤ J_D and A = A J_D J_D = V J_D Λ V^{-1} J_D = V J_D Λ J_D V^⊤ J_D J_D = V J_D Λ J_D V^⊤ = V Λ V^⊤.

18) Theorem 1: Consider S_H^D with orthogonal tangents h_1, …, h_K. For x ∈ S_H^D, we have P_H(x) = x, ‖P_H(x)‖₂² = C^{-1}, and ‖𝒬(x)‖₂² = C^{-1}, i.e., 𝒬(x) ∈ S^K and 𝒬 is a map from S_H^D to S^K; see the proof of Proposition 2. We also have x = P_H(x) = C ⟨x, p⟩ p + C^{-1} Σ_{k∈[K]} ⟨x, h_k⟩ h_k = 𝒬^{-1}(𝒬(x)) for all x ∈ S_H^D. Hence, 𝒬^{-1} is the inverse map of 𝒬 — a bijection. Finally, 𝒬 is an isometry between S_H^D and S^K since d(x_1, x_2) = d(𝒬(x_1), 𝒬(x_2)) for all x_1, x_2 ∈ S_H^D.

19) Theorem 2: We have Σ_{k∈[K^⊥]} h_k^⊥⊤ C_x h_k^⊥ ≥ Σ_{k∈[K^⊥]} C λ_{D+1−k}(C_x), where λ_d(C_x) is the d-th largest eigenvalue of C_x. We achieve the lower bound if we let h_k^⊥ = C^{1/2} v_{D+1−k}(C_x) for k ∈ [K^⊥]. The optimal base point is any vector orthogonal to span{h_1^⊥, …, h_{K^⊥}^⊥} with norm C^{-1/2}; in particular, p = C^{-1/2} v_1(C_x) allows for nested affine subspaces.
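Theorem 2 makes the spherical algorithm a single symmetric eigendecomposition. An illustrative sketch for the unit sphere (C = 1), with our own names:

```python
import numpy as np

def spherical_pca(X, K):
    """Sketch of spherical SFPCA (Theorem 2) for the unit sphere (C = 1).

    The optimal affine subspace is read off one eigendecomposition of the
    second-moment matrix: the leading eigenvector gives the base point p, and
    the next K eigenvectors span the tangent subspace H. Illustrative names.
    """
    Cx = X.T @ X / X.shape[0]
    vals, vecs = np.linalg.eigh(Cx)      # ascending eigenvalues
    vecs = vecs[:, ::-1]                 # reorder to descending
    p = vecs[:, 0]                       # base point direction v_1(C_x)
    H = vecs[:, 1:K + 1]                 # tangent vectors v_2, ..., v_{K+1}
    return p, H
```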

20) Theorem 3: Consider H_H^D with orthogonal tangents h_1, …, h_K. For x ∈ H_H^D, we have P_H(x) = x, [P_H(x), P_H(x)] = C^{-1}, and [𝒬(x), 𝒬(x)] = C^{-1}, i.e., 𝒬(x) ∈ H^K and 𝒬 is a map from H_H^D to H^K; see the proof of Proposition 6. We also have x = P_H(x) = C [x, p] p − C^{-1} Σ_{k∈[K]} [x, h_k] h_k = 𝒬^{-1}(𝒬(x)) for all x ∈ H_H^D. Hence, 𝒬^{-1} is the inverse map of 𝒬 — a bijection. Finally, 𝒬 is an isometry between H_H^D and H^K since d(x_1, x_2) = d(𝒬(x_1), 𝒬(x_2)) for all x_1, x_2 ∈ H_H^D.

21) Theorem 4: WLOG, we scale {h_k}_k and define H ≝ [h_1, …, h_K] ∈ R^{(D+1)×K} such that ⟨h_i, h_j⟩ = δ_{i,j}, i.e., H^⊤ J_D H = I_K. The cost is:

cost_{H_H^D}(𝒳) = Σ_{k∈[K]} h_k^⊤ J_D C_x J_D h_k = Tr(H^⊤ J_D C_x J_D H) = Tr(H^⊤ J_D V Λ V^⊤ J_D H) = Tr(W^⊤ Λ W) = (a) Tr(W W^⊤ J_D Λ J_D) = Tr(𝒲 Λ J_D),

where C_x = V Λ V^⊤ is J_D-diagonalizable, W = V^⊤ J_D H ∈ R^{(D+1)×K}, (a) follows from Λ being a diagonal matrix, i.e., Λ = J_D Λ J_D, and 𝒲 ≝ W W^⊤ J_D.

Lemma 2: W^⊤ J_D W = I_K.

Proof: W^⊤ J_D W = H^⊤ J_D V J_D V^⊤ J_D H = (a) H^⊤ J_D H = I_K, where (a) follows from V J_D V^⊤ = J_D. This is the case since, by definition, we have V^⊤ J_D V = J_D, or J_D V^⊤ J_D V = I_{D+1}, that is, J_D V^⊤ J_D = V^{-1}. Hence, we have V J_D V^⊤ = J_D. Finally, H^⊤ J_D H = I_K is a direct result of the orthogonality of the basis vectors h_1, …, h_K. □

We write the cost function as follows:

cost_{H_H^D}(𝒳) = Tr(𝒲 Λ J_D) = −𝒲_{1,1} λ_1 + Σ_{d=2}^{D+1} 𝒲_{d,d} λ_d,

where Σ_{d=1}^{D+1} 𝒲_{d,d} = Tr(W^⊤ J_D W) = Tr(I_K) = K, i.e., 𝒲_{1,1} = −Σ_{d=2}^{D+1} 𝒲_{d,d} + K. Let W ∈ R^{(D+1)×K} be as follows:

W = [√(‖w_1‖₂² − 1), …, √(‖w_K‖₂² − 1); w_1, …, w_K], (17)

for vectors w_1, …, w_K ∈ R^D with ℓ₂ norms greater than or equal to 1 — notice that W^⊤ J_D W = I_K, consistent with Lemma 2. From 𝒲 = W W^⊤ J_D and equation (17), we have:

𝒲_{1,1} = −Σ_{k∈[K]} (‖w_k‖₂² − 1) ≤ 0, since ‖w_k‖₂ ≥ 1 for all k ∈ [K].

For d ≥ 2, 𝒲_{d,d} is the squared norm of the d-th row of W. Let W_c ∈ R^{D×K} be the matrix whose d-th row equals the (d+1)-th row of W. Therefore, we have

Σ_{d=2}^{D+1} 𝒲_{d,d} = Tr(W_c^⊤ W_c) = Σ_{k=1}^{K} ‖w_k‖₂². (18)

Let us now simplify the cost function as follows:

cost_{H_H^D} = Σ_{d=2}^{D+1} 𝒲_{d,d} (λ_1 + λ_d) − K λ_1. (19)

Fig. 8.

Classification accuracy (%) of PCA to dimension K on reconstructed compositional data with (a) random forest and (b) neural net.

Lemma 3: For all d ≥ 2, we have λ_d + λ_1 ≥ 0.

Proof: Let v_d be the d-th J_D-eigenvector of C_x. We have λ_d = v_d^⊤ J_D (sgn(⟨v_d, v_d⟩) λ_d v_d) = v_d^⊤ J_D C_x J_D v_d = N^{-1} Σ_{n∈[N]} ⟨x_n, v_d⟩² ≥ 0, for all d ≥ 1. □

From Lemma 3, equation (18), and ‖w_k‖₂ ≥ 1 for all k ∈ [K], the minimum of the cost function in equation (19) is attained only if Σ_{d=2}^{D+1} 𝒲_{d,d} = Σ_{k=1}^{K} ‖w_k‖₂² = K, i.e., ‖w_1‖₂ = ⋯ = ‖w_K‖₂ = 1. Therefore, we have W = V^⊤ J_D H = [0, …, 0; w_1, …, w_K]. The first row of V^⊤ J_D H corresponds to the Lorentzian products of h_1, …, h_K with the first column of V, i.e., v_1. Hence, we have h_1, …, h_K ⊥ v_1. Since v_1 is the only negative J_D-eigenvector of C_x, we have p = v_1. From the unit-norm constraints on w_1, …, w_K, we have W^⊤ W = I_K, i.e., W is a sliced unitary matrix and W W^⊤ has zero–one eigenvalues. The cost function cost_{H_H^D}(𝒳) = Tr(W^⊤ Λ W) = Tr(Λ W W^⊤) achieves its minimum if and only if the nonzero eigenvalues of W W^⊤ are aligned with the K smallest diagonal values of Λ — by von Neumann's trace inequality [59]. Let λ_2 ≤ λ_3 ≤ ⋯ ≤ λ_{D+1}. If we choose w_1, …, w_K to be the standard basis vectors supported on coordinates 2, …, K+1 of W, then we achieve the minimum of the cost function; that is, h_1, …, h_K are the K positive J_D-eigenvectors paired with the smallest J_D-eigenvalues of C_x.

A. Additional Experiments

We present experiments on the Newsgroups dataset to demonstrate the impact of spherical PCA on classification accuracy. Using random forest and five-layer neural network classifiers with a 90% training and 10% test split, we report the average accuracy over 20 random splits. As shown in Fig. 8, SFPCA outperforms PGA in average accuracy over the K (target dimension) experiments. Both methods significantly improve random forest performance and modestly improve the neural network's. This may be due to the neural network's own denoising ability. For large K, random forest accuracy declines, unlike the neural network's. This may be due to the Newsgroups dataset's high sparsity: discarding small eigenvalues of the second-moment matrix significantly alters the data's sparsity, adversely affecting the random forest's accuracy.

Footnotes

1

The J_D-eigenvalue λ_1 corresponds to the base point.

Contributor Information

Puoya Tabaghi, Halicioğlu Data Science Institute, University of California San Diego, San Diego, CA 92093 USA.

Michael Khanzadeh, Computer Science and Engineering Department, University of California San Diego, San Diego, CA 92093 USA. He is now with the Department of Computer Science, Columbia University, New York, NY 10027 USA.

Yusu Wang, Halicioğlu Data Science Institute, University of California San Diego, San Diego, CA 92093 USA.

Siavash Mirarab, Electrical and Computer Engineering Department, University of California San Diego, San Diego, CA 92093 USA.

REFERENCES

  • [1].Thurstone LL, “Multiple factor analysis.” Psychol. Rev, vol. 38, no. 5, 1931, Art. no. 406. [Google Scholar]
  • [2].Wold S, Esbensen K, and Geladi P, “Principal component analysis,” Chemometrics Intell. Lab. Syst, vol. 2, nos. 1–3, pp. 37–52, 1987. [Google Scholar]
  • [3].Stewart GW, “On the early history of the singular value decomposition,” SIAM Rev., vol. 35, no. 4, pp. 551–566, 1993. [Google Scholar]
  • [4].Hotelling H, “Analysis of a complex of statistical variables into principal components,” J. Educ. Psychol, vol. 24, no. 6, 1933, Art. no. 417. [Google Scholar]
  • [5].Jolliffe IT, Principal Component Analysis. New York, NY, USA: Springer, 2002. [Google Scholar]
  • [6].Tipping ME and Bishop CM, “Probabilistic principal component analysis,” J. Roy. Statist. Soc.: Ser. B (Statist. Methodol.), vol. 61, no. 3, pp. 611–622, 1999. [Google Scholar]
  • [7].Vidal R, Ma Y, and Sastry S, “Generalized principal component analysis (GPCA),” IEEE Trans. Pattern Anal. Mach. Intell, vol. 27, no. 12, pp. 1945–1959, Dec. 2005. [DOI] [PubMed] [Google Scholar]
  • [8].Lawrence N, “Gaussian process latent variable models for visualisation of high dimensional data,” in Proc. Adv. Neural Inf. Process. Syst, vol. 16, 2003, pp. 329–336. [Google Scholar]
  • [9].Roweis S, “EM algorithms for PCA and SPCA,” in Proc. Adv. Neural Inf. Process. Syst, vol. 10, 1997, pp. 626–632. [Google Scholar]
  • [10].Bishop C, “Bayesian PCA,” in Proc. Adv. Neural Inf. Process. Syst, vol. 11, 1998, pp. 382–388. [Google Scholar]
  • [11].Jolliffe IT, Trendafilov NT, and Uddin M, “A modified principal component technique based on the LASSO,” J. Comput. Graphical Statist, vol. 12, no. 3, pp. 531–547, 2003. [Google Scholar]
  • [12].Zou H, Hastie T, and Tibshirani R, “Sparse principal component analysis,” J. Comput. Graphical Statist, vol. 15, no. 2, pp. 265–286, 2006. [Google Scholar]
  • [13].Cai TT, Ma Z, and Wu Y, “Sparse PCA: Optimal rates and adaptive estimation,” Ann. Statist, vol. 41, no. 6, pp. 3074–3110, 2013. [Google Scholar]
  • [14].Guan Y and Dy J, “Sparse probabilistic principal component analysis,” in Proc. Artif. Intell. Statist, PMLR, 2009, pp. 185–192. [Google Scholar]
  • [15].Xu H, Caramanis C, and Sanghavi S, “Robust PCA via outlier pursuit,” in Proc. Adv. Neural Inf. Process. Syst, vol. 23, 2010, pp. 2496–2504. [Google Scholar]
  • [16].Fletcher PT, Lu C, Pizer SM, and Joshi S, “Principal geodesic analysis for the study of nonlinear statistics of shape,” IEEE Trans. Med. Imag, vol. 23, no. 8, pp. 995–1005, Aug. 2004. [DOI] [PubMed] [Google Scholar]
  • [17].Jiang Y, Tabaghi P, and Mirarab S, “Learning hyperbolic embedding for phylogenetic tree placement and updates,” Biology, vol. 11, no. 9, 2022, Art. no. 1256. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [18].Tabaghi P and Dokmanić I, “Hyperbolic distance matrices,” in Proc. 26th ACM SIGKDD Int. Conf. Knowl. Discovery & Data Mining, 2020, pp. 1728–1738. [Google Scholar]
  • [19].Fletcher PT and Joshi S, “Principal geodesic analysis on symmetric spaces: Statistics of diffusion tensors,” in Proc. Comput. Vis. Math. Methods Med. Biomed. Image Anal, Heidelberg, Germany: Springer, 2004, pp. 87–98. [Google Scholar]
  • [20].Lee JM, Riemannian Manifolds: An Introduction to Curvature, vol. 176. New York, NY, USA: Springer Science & Business Media, 2006. [Google Scholar]
  • [21].Sonthalia R and Gilbert A, “Tree! I am no tree! I am a low dimensional hyperbolic embedding,” in Proc. Adv. Neural Inf. Process. Syst, vol. 33, 2020, pp. 845–856. [Google Scholar]
  • [22].Tabaghi P, Peng J, Milenkovic O, and Dokmanić I, “Geometry of similarity comparisons,” 2020, arXiv:2006.09858.
  • [23].Chien E, Pan C, Tabaghi P, and Milenkovic O, “Highly scalable and provably accurate classification in Poincaré balls,” in Proc. IEEE Int. Conf. Data Mining, Piscataway, NJ, USA: IEEE Press, 2021, pp. 61–70. [Google Scholar]
  • [24].Chien E, Tabaghi P, and Milenkovic O, “HyperAid: Denoising in hyperbolic spaces for tree-fitting and hierarchical clustering,” in Proc. 28th ACM SIGKDD Conf. Knowl. Discovery Data Mining, 2022, pp. 201–211. [Google Scholar]
  • [25].Klimovskaia A, Lopez-Paz D, Bottou L, and Nickel M, “Poincaré maps for analyzing complex hierarchies in single-cell data,” Nature Commun., vol. 11, no. 1, pp. 1–9, 2020. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [26].Zhou Y and Sharpee TO, “Hyperbolic geometry of gene expression,” Iscience, vol. 24, no. 3, 2021, Art. no. 102225. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [27].Meng Y et al. , “Spherical text embedding,” in Proc. Adv. Neural Inf. Process. Syst, vol. 32, 2019, pp. 8208–8217. [Google Scholar]
  • [28].Dai X and Müller H-G, “Principal component analysis for functional data on Riemannian manifolds and spheres,” Ann. Statist, vol. 46, no. 6B, pp. 3334–3361, 2018. [Google Scholar]
  • [29].Gu A, Sala F, Gunel B, and Ré C, “Learning mixed-curvature representations in product spaces,” in Proc. Int. Conf. Learn. Representations, 2018, pp. 2898–2918. [Google Scholar]
  • [30].Rahman IU, Drori I, Stodden VC, Donoho DL, and Schröder P, “Multiscale representations for manifold-valued data,” Multiscale Model. & Simul, vol. 4, no. 4, pp. 1201–1232, 2005. [Google Scholar]
  • [31].Tournier M, Wu X, Courty N, Arnaud E, and Reveret L, “Motion compression using principal geodesics analysis,” in Computer Graphics Forum, vol. 28, no. 2. Oxford, U.K.: Wiley Online Library, 2009, pp. 355–364. [Google Scholar]
  • [32].Anirudh R, Turaga P, Su J, and Srivastava A, “Elastic functional coding of human actions: From vector-fields to latent variables,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2015, pp. 3147–3155. [Google Scholar]
  • [33].Fletcher PT, Lu C, and Joshi S, “Statistics of shape via principal geodesic analysis on Lie groups,” in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit, vol. 1. Piscataway, NJ, USA: IEEE Press, 2003, pp. I–I. [Google Scholar]
  • [34].Sommer S, Lauze F, and Nielsen M, “Optimization over geodesics for exact principal geodesic analysis,” Adv. Comput. Math, vol. 40, no. 2, pp. 283–313, 2014. [Google Scholar]
  • [35].Huckemann S, Hotz T, and Munk A, “Intrinsic shape analysis: Geodesic PCA for Riemannian manifolds modulo isometric lie group actions,” Statistica Sinica, vol. 20, no. 1, pp. 1–58, 2010. [Google Scholar]
  • [36].Lazar D and Lin L, “Scale and curvature effects in principal geodesic analysis,” J. Multivariate Anal, vol. 153, pp. 64–82, Jan. 2017. [Google Scholar]
  • [37].Pennec X, “Barycentric subspace analysis on manifolds,” Ann. Statist, vol. 46, no. 6A, pp. 2711–2746, 2018. [Google Scholar]
  • [38].Sommer S, Lauze F, Hauberg S, and Nielsen M, “Manifold valued statistics, exact principal geodesic analysis and the effect of linear approximations,” in Comput. Vis. (ECCV): 11th Eur. Conf. Comput. Vis, Heraklion, Crete, Greece. Springer, 2010, pp. 43–56. [Google Scholar]
  • [39].Tabaghi P, Chien E, Pan C, Peng J, and Milenković O, “Linear classifiers in product space forms,” 2021, arXiv:2102.10204.
  • [40].Liu K, Li Q, Wang H, and Tang G, “Spherical principal component analysis,” in Proc. SIAM Int. Conf. Data Mining, Philadelphia, PA, USA: SIAM, 2019, pp. 387–395. [Google Scholar]
  • [41].Chami I, Gu A, Nguyen DP, and Ré C, “HoroPCA: Hyperbolic dimensionality reduction via horospherical projections,” in Proc. Int. Conf. Mach. Learn. PMLR, 2021, pp. 1419–1429. [Google Scholar]
  • [42].Chakraborty R, Seo D, and Vemuri BC, “An efficient exact-PGA algorithm for constant curvature manifolds,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit, 2016, pp. 3976–3984. [Google Scholar]
  • [43].Gallier J and Quaintance J, Differential Geometry and Lie Groups: A Computational Perspective. Springer Nature, 2020. [Google Scholar]
  • [44].Pearson K, “LIII. On lines and planes of closest fit to systems of points in space,” London, Edinburgh, Dublin Philos. Mag. J. Sci, vol. 2, no. 11, pp. 559–572, 1901. [Google Scholar]
  • [45].Lahav A and Talmon R, “Procrustes analysis on the manifold of SPSD matrices for data sets alignment,” IEEE Trans. Signal Process, vol. 71, pp. 1907–1921, 2023. [Google Scholar]
  • [46].Lou A, Katsman I, Jiang Q, Belongie S, Lim S-N, and De Sa C, “Differentiating through the Fréchet mean," in Proc. Int. Conf. Mach. Learn, PMLR, 2020, pp. 6393–6403. [Google Scholar]
  • [47].Huckemann S and Ziezold H, “Principal component analysis for Riemannian manifolds, with an application to triangular shape spaces,” Adv. Appl. Probability, vol. 38, no. 2, pp. 299–319, 2006. [Google Scholar]
  • [48].Jung S, Dryden IL, and Marron JS, “Analysis of principal nested spheres,” Biometrika, vol. 99, no. 3, pp. 551–568, 2012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [49].Higham NJ, “J-orthogonal matrices: Properties and generation,” SIAM Rev., vol. 45, no. 3, pp. 504–519, 2003. [Google Scholar]
  • [50].Slapničar I and Veselić K, “A bound for the condition of a hyperbolic eigenvector matrix,” Linear Algebra Appl., vol. 290, nos. 1–3, pp. 247–255, 1999. [Google Scholar]
  • [51].Lahti L, Salojärvi J, Salonen A, Scheffer M, and De Vos WM, “Tipping elements in the human intestinal ecosystem,” Nature Commun., vol. 5, no. 1, 2014, Art. no. 4344. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [52].Charlson ES et al. , “Disordered microbial communities in the upper respiratory tract of cigarette smokers,” PLoS One, vol. 5, no. 12, 2010, Art. no. e15216. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [53].Wickett NJ et al., “Phylotranscriptomic analysis of the origin and early diversification of land plants,” Proc. Nat. Acad. Sci, vol. 111, no. 45, pp. 4859–4868, Oct. 2014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [54].Mai U and Mirarab S, “TreeShrink: fast and accurate detection of outlier long branches in collections of phylogenetic trees,” BMC Genomics, vol. 19, no. S5, May 2018, Art. no. 272. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [55].Comte A et al. , “PhylteR: Efficient identification of outlier sequences in phylogenomic datasets,” Mol. Biol. Evol, vol. 40, no. 11, Nov. 2023, Art. no. msad234. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [56].Estabrook GF, McMorris FR, and Meacham CA, “Comparison of undirected phylogenetic trees based on subtrees of four evolutionary units,” Systematic Biol., vol. 34, no. 2, pp. 193–200, Jun. 1985. [Google Scholar]
  • [57].Warnow T, Computational Phylogenetics: An Introduction to Designing Methods for Phylogeny Estimation. Cambridge, U.K.: Cambridge Univ. Press, 2017. [Google Scholar]
  • [58].Zhang C, Rabiee M, Sayyari E, and Mirarab S, “ASTRAL-III: Polynomial time species tree reconstruction from partially resolved gene trees,” BMC Bioinf., vol. 19, no. S6, May 2018, Art. no. 153. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [59].Mirsky L, “A trace inequality of John von Neumann,” Monatshefte für Mathematik, vol. 79, no. 4, pp. 303–306, 1975. [Google Scholar]