Author manuscript; available in PMC 2014 Jan 9.
Published in final edited form as: Commun Pure Appl Math. 2012 Mar 30;65(8). doi: 10.1002/cpa.21395

Vector Diffusion Maps and the Connection Laplacian

A Singer 1, H-T Wu 2
PMCID: PMC3886882  NIHMSID: NIHMS484980  PMID: 24415793

Abstract

We introduce vector diffusion maps (VDM), a new mathematical framework for organizing and analyzing massive high-dimensional data sets, images, and shapes. VDM is a mathematical and algorithmic generalization of diffusion maps and other nonlinear dimensionality reduction methods, such as LLE, ISOMAP, and Laplacian eigenmaps. While existing methods are either directly or indirectly related to the heat kernel for functions over the data, VDM is based on the heat kernel for vector fields. VDM provides tools for organizing complex data sets, embedding them in a low-dimensional space, and interpolating and regressing vector fields over the data. In particular, it equips the data with a metric, which we refer to as the vector diffusion distance. In the manifold learning setup, where the data set is distributed on a low-dimensional manifold ℳd embedded in ℝp, we prove the relation between VDM and the connection Laplacian operator for vector fields over the manifold.

1 Introduction

A popular way to describe the affinities between data points is using a weighted graph, whose vertices correspond to the data points, whose edges connect data points with large enough affinities, and whose weights quantify those affinities. In the past decade we have witnessed the emergence of nonlinear dimensionality reduction methods, such as locally linear embedding (LLE) [33], ISOMAP [39], Hessian LLE [12], local tangent space alignment (LTSA) [42], Laplacian eigenmaps [2], and diffusion maps [9]. These methods use the local affinities in the weighted graph to learn its global features. They provide invaluable tools for organizing complex networks and data sets, embedding them in a low-dimensional space, and studying and regressing functions over graphs. Inspired by recent developments in the mathematical theory of cryo-electron microscopy [18, 37] and synchronization [10, 35], in this paper we demonstrate that in many applications, the representation of the data set can be vastly improved by attaching to every edge of the graph not only a weight but also a linear orthogonal transformation (see Figure 1.1).

Figure 1.1.

Figure 1.1

In VDM, the relationships between data points are represented as a weighted graph, where the weights wij are accompanied by linear orthogonal transformations Oij.

Consider, for example, a data set of images, or small patches extracted from images (see, e.g., [8, 27]). While weights are usually derived from the pairwise comparison of the images in their original representation, we instead associate the weight wij to the similarity between image i and image j when they are optimally rotationally aligned. The dissimilarity between images when they are optimally rotationally aligned is sometimes called the rotationally invariant distance [31]. We further define the linear transformation Oij as the 2 × 2 orthogonal transformation that registers the two images (see Figure 1.2). Similarly, for data sets consisting of three-dimensional shapes, Oij encodes the optimal 3 × 3 orthogonal registration transformation. In the case of manifold learning, the linear transformations can be constructed using local principal component analysis (PCA) and alignment, as discussed in Section 2.

Figure 1.2.

Figure 1.2

An example of a weighted graph with orthogonal transformations: Ii and Ij are two different images of the digit 1, corresponding to nodes i and j in the graph. Oij is the 2 × 2 rotation matrix that rotationally aligns Ij with Ii and wij is some measure for the affinity between the two images when they are optimally aligned. The affinity wij is large, because the images Ii and OijIj are actually the same. On the other hand, Ik is an image of the digit 2, and the discrepancy between Ik and Ii is large even when these images are optimally aligned. As a result, the affinity wik would be small, perhaps so small that there is no edge in the graph connecting nodes i and k. The matrix Oik is clearly not as meaningful as Oij. If there is no edge between i and k, then Oik is not represented in the weighted graph.

In this paper, the linear transformations relating data points are restricted to orthogonal transformations in O(d). Transformations belonging to other matrix groups, such as translations and dilations, are not treated here.

While diffusion maps and other nonlinear dimensionality reduction methods are either directly or indirectly related to the heat kernel for functions over the data, our VDM framework is based on the heat kernel for vector fields. We construct this kernel from the weighted graph and the orthogonal transformations. Through the spectral decomposition of this kernel, VDM defines an embedding of the data in a Hilbert space. In particular, it defines a metric for the data, that is, distances between data points that we call vector diffusion distances. For some applications, the vector diffusion metric is more meaningful than currently used metrics, since it takes into account the linear transformations, and as a result, it provides a better organization of the data. In the manifold learning setup, we prove a convergence theorem illuminating the relation between VDM and the connection Laplacian operator for vector fields over the manifold.

The paper is organized in the following way: First, Table 1.1 summarizes the notation used throughout this paper. In Section 2 we describe the manifold learning setup and a procedure to extract the orthogonal transformations from a point cloud scattered in a high-dimensional euclidean space using local PCA and alignment. In Section 3 we specify the vector diffusion mapping of the data set into a finite-dimensional Hilbert space. At the heart of the vector diffusion mapping construction lies a certain symmetric matrix that can be normalized in slightly different ways. Different normalizations lead to different embeddings, as discussed in Section 4. These normalizations resemble the normalizations of the graph Laplacian in spectral graph theory and spectral clustering algorithms. In the manifold learning setup, it is known that when the point cloud is uniformly sampled from a low-dimensional Riemannian manifold, then the normalized graph Laplacian approximates the Laplace-Beltrami operator for scalar functions. In Section 5 we formulate a similar result, stated as Theorem 5.3, for the convergence of the appropriately normalized vector diffusion mapping matrix to the connection Laplacian operator for vector fields.1

TABLE 1.1.

Summary of symbols used throughout the paper.

Symbol Meaning

p  Dimension of the ambient euclidean space
d  Dimension of the low-dimensional Riemannian manifold
ℳd  d-dimensional Riemannian manifold embedded in ℝp
ι  Embedding of ℳd into ℝp
g  Metric of ℳ induced from ℝp
dV  Volume form associated with the metric g
n  Number of data points sampled from ℳd
x1, …, xn  Points sampled from ℳd
expx  Exponential map at x
Δ  Laplace-Beltrami operator
Tℳ  Tangent bundle of ℳ
Txℳ  Tangent space to ℳ at x
X  Vector field
Ck(Tℳ)  Space of k-times continuously differentiable vector fields, k = 1, 2, …
L2(Tℳ)  Space of square-integrable vector fields
Px,y  Parallel transport from y to x along the geodesic linking them
▽  Connection of the tangent bundle
▽2  Connection (rough) Laplacian
et▽²  Heat kernel associated with the connection Laplacian
ℛ  Riemannian curvature tensor
Ric  Ricci curvature
s  Scalar curvature
Π  Second fundamental form of the embedding ι
K, KPCA  Kernel functions
∊, ∊PCA  Bandwidth parameters of the kernel functions

The proof of Theorem 5.3 appears in Appendix B. We verify Theorem 5.3 numerically for spheres of different dimensions, as reported in Section 6 and Appendix C. We also use other surfaces to perform numerical comparisons between the vector diffusion distance, the diffusion distance, and the geodesic distance. In Section 7 we briefly discuss out-of-sample extrapolation of vector fields via the Nyström extension scheme. The role played by the heat kernel of the connection Laplacian is discussed in Section 8. We use the well-known short-time asymptotic expansion of the heat kernel to show the relationship between vector diffusion distances and geodesic distances for nearby points. In Section 9 we briefly discuss the application of VDM to cryo-electron microscopy, as a prototypical multireference rotational alignment problem. We conclude in Section 10 with a summary followed by a discussion of some other possible applications and extensions of the mathematical framework.

2 Data Sampled from a Riemannian Manifold

One of the main objectives in the analysis of a high-dimensional large data set is to learn its geometric and topological structure. Even though the data itself is parametrized as a point cloud in a high-dimensional ambient space ℝp, the correlation between parameters often suggests the popular “manifold assumption” that the data points are distributed on (or near) a single low-dimensional Riemannian manifold ℳd embedded in ℝp, where d is the dimension of the manifold and d ≪ p. Suppose that the point cloud consists of n data points x1, x2, …, xn that are viewed as points in ℝp but are restricted to the manifold.

We now describe how the orthogonal transformations Oij can be constructed from the point cloud using local PCA and alignment.

Local PCA. For every data point xi we suggest estimating a basis for the tangent plane Txiℳ to the manifold at xi using the following procedure, which we refer to as local PCA. We fix a scale parameter ∊PCA > 0 and define 𝒩xi,∊PCA as the set of neighbors of xi inside a ball of radius √∊PCA centered at xi:

$$\mathcal{N}_{x_i,\epsilon_{\mathrm{PCA}}} = \{x_j : 0 < \|x_j - x_i\|_{\mathbb{R}^p} < \sqrt{\epsilon_{\mathrm{PCA}}}\}.$$

Denote the number of neighboring points of xi by Ni, that is, Ni = |𝒩xi,∊PCA|, and denote the neighbors of xi by xi1, xi2, …, xiNi. We assume that ∊PCA is large enough so that Ni ≫ d, but at the same time ∊PCA is small enough such that Ni ≪ n. At this point we assume d is known; methods to estimate it will be discussed later in this section. In Theorem B.1 we show that a satisfactory choice for ∊PCA is given by ∊PCA = O(n−2/(d+1)), so that Ni = O(n1/(d+1)). In fact, it is even possible to choose ∊PCA = O(n−2/(d+2)) if the manifold does not have a boundary.

Observe that the neighboring points are located near Txiℳ, where deviations are possible due to curvature. Define Xi to be a p × Ni matrix whose jth column is the vector xij − xi, that is,

$$X_i = \begin{bmatrix} x_{i_1} - x_i & x_{i_2} - x_i & \cdots & x_{i_{N_i}} - x_i \end{bmatrix}.$$

In other words, Xi is the data matrix of the neighbors shifted to be centered at the point xi. Notice that while it is more common to shift the data for PCA by the mean $\mu_i = \frac{1}{N_i}\sum_{j=1}^{N_i} x_{i_j}$, here we shift the data by xi. Shifting the data by µi is also possible for all practical purposes, but has the slight disadvantage of complicating the proof for the convergence of the local PCA step (see Appendix B.1).

The local covariance matrix corresponding to the neighbors of xi is XiXiᵀ. Among the neighbors, those that are farther away from xi contribute the most to the covariance matrix. However, if the manifold is not flat at xi, then we would like to give more emphasis to the nearby points, so that the tangent space estimation is more accurate. In order to give more emphasis to nearby points, we weigh the contribution of each point by a monotonically decreasing function of its distance from xi. Let KPCA be a C2 positive monotonic decreasing function with support on the interval [0, 1], for example, the Epanechnikov kernel KPCA(u) = (1 − u2)χ[0,1](u), where χ is the indicator function. Let Di be an Ni × Ni diagonal matrix with

$$D_i(j,j) = \sqrt{K_{\mathrm{PCA}}\!\left(\frac{\|x_i - x_{i_j}\|_{\mathbb{R}^p}}{\sqrt{\epsilon_{\mathrm{PCA}}}}\right)}, \qquad j = 1, 2, \ldots, N_i,$$

and define the p × Ni matrix Bi as

Bi=XiDi. (2.1)

The local weighted covariance matrix at xi, which we denote by Ξi is

$$\Xi_i = B_i B_i^{\mathsf{T}} = \sum_{j=1}^{N_i} K_{\mathrm{PCA}}\!\left(\frac{\|x_i - x_{i_j}\|_{\mathbb{R}^p}}{\sqrt{\epsilon_{\mathrm{PCA}}}}\right)(x_{i_j} - x_i)(x_{i_j} - x_i)^{\mathsf{T}}. \qquad (2.2)$$

Since KPCA is supported on the interval [0, 1] the covariance matrix Ξi can also be represented as

$$\Xi_i = \sum_{j=1}^{n} K_{\mathrm{PCA}}\!\left(\frac{\|x_i - x_j\|_{\mathbb{R}^p}}{\sqrt{\epsilon_{\mathrm{PCA}}}}\right)(x_j - x_i)(x_j - x_i)^{\mathsf{T}}. \qquad (2.3)$$

The definition of Di(j, j) above is via the square root of the kernel, so it appears linearly in the covariance matrix. We denote the singular values of Bi by σi,1 ≥ σi,2 ≥ ⋯ ≥ σi,Ni. The eigenvalues of the p × p local covariance matrix Ξi equal the squared singular values. Since the ambient space dimension p is typically large, it is usually more efficient to compute the singular values and singular vectors of Bi rather than the eigendecomposition of Ξi.

Suppose that the singular value decomposition (SVD) of Bi is given by

Bi=UiΣiViΤ.

The columns of the p × Ni matrix Ui are orthonormal and are known as the left singular vectors

$$U_i = \begin{bmatrix} u_{i,1} & u_{i,2} & \cdots & u_{i,N_i} \end{bmatrix}.$$

We define the p × d matrix Oi by the first d left singular vectors (corresponding to the largest singular values):

$$O_i = \begin{bmatrix} u_{i,1} & u_{i,2} & \cdots & u_{i,d} \end{bmatrix}. \qquad (2.4)$$

The d columns of Oi are orthonormal, i.e., OiᵀOi = Id×d. The columns of Oi represent an orthonormal basis for a d-dimensional subspace of ℝp. This basis is a numerical approximation to an orthonormal basis of the tangent plane Txiℳ. The order of the approximation (as a function of ∊PCA and n, where n is the number of the data points) is established later in Appendix B, using the fact that the columns of Oi are also the eigenvectors (corresponding to the d largest eigenvalues) of the p × p weighted covariance matrix Ξi. We emphasize that the covariance matrix is never actually formed due to its excessive storage requirements, and all computations are performed with the matrix Bi.
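To make the local PCA step concrete, the following minimal NumPy sketch estimates the tangent basis Oi at a single point; it assumes the data are stored in an n × p array `X` and uses the Epanechnikov kernel mentioned above. The function name and interface are illustrative, not part of the original algorithm description.

```python
import numpy as np

def local_pca_basis(X, i, eps_pca, d):
    """Estimate a p x d orthonormal basis O_i for the tangent plane at x_i.

    X       : (n, p) array of data points in the ambient space R^p
    i       : index of the base point x_i
    eps_pca : local PCA bandwidth (neighbors lie within radius sqrt(eps_pca))
    d       : intrinsic dimension (known or estimated beforehand)
    """
    diffs = X - X[i]                                  # neighbors shifted to be centered at x_i
    dists = np.linalg.norm(diffs, axis=1)
    mask = (dists > 0) & (dists < np.sqrt(eps_pca))   # the neighborhood N_{x_i, eps_PCA}
    Xi = diffs[mask].T                                # p x N_i matrix X_i
    u = dists[mask] / np.sqrt(eps_pca)
    Di = np.sqrt(1.0 - u**2)                          # sqrt of Epanechnikov kernel K_PCA(u) = (1 - u^2)
    Bi = Xi * Di                                      # B_i = X_i D_i (scale each column)
    # left singular vectors of B_i = eigenvectors of the weighted covariance Xi_i = B_i B_i^T
    U, _, _ = np.linalg.svd(Bi, full_matrices=False)
    return U[:, :d]                                   # columns approximate a basis of T_{x_i}M
```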

In cases where the intrinsic dimension d is not known in advance, the following procedure can be used to estimate d. The underlying assumption is that data points lie exactly on the manifold, without any noise contamination. We remark that in the presence of noise, other procedures (e.g., [28]) could give a more accurate estimation.

Because our previous definition of the neighborhood size parameter ∊PCA involves d, we are not allowed to use it when trying to estimate d. Instead, we can take the neighborhood size ∊PCA to be any monotonically decreasing function of n that satisfies n∊PCAd/2 → ∞ as n → ∞. This condition ensures that the number of neighboring points increases indefinitely in the limit of an infinite number of sampling points. One can choose, for example, ∊PCA = 1/log(n), though other choices are also possible.

Notice that if the manifold is flat, then the neighboring points in 𝒩xi,∊PCA are located exactly on Txiℳ, and as a result rank(Xi) = rank(Bi) = d and Bi has exactly d nonvanishing singular values (i.e., σi,d+1 = σi,d+2 = ⋯ = σi,Ni = 0). In such a case, the dimension can be estimated as the number of nonzero singular values. For nonflat manifolds, due to curvature, there may be more than d nonzero singular values. Clearly, as n goes to infinity, these singular values approach 0, since the curvature effect disappears. A common practice is to estimate the dimension as the number of singular values that account for a high enough percentage of the variability of the data. That is, one sets a threshold γ between 0 and 1 (usually closer to 1 than to 0), and estimates the dimension as the smallest integer di for which

$$\frac{\sum_{j=1}^{d_i} \sigma_{i,j}^2}{\sum_{j=1}^{N_i} \sigma_{i,j}^2} > \gamma.$$

For example, setting γ = 0.9 means that di singular values account for at least 90% of the variability of the data, while di − 1 singular values account for less than 90%. We refer to the smallest such integer di as the estimated local dimension of ℳ at xi. From the previous discussion it follows that this procedure produces an accurate estimation of the dimension at each point as n goes to infinity. One possible way to estimate the dimension of the manifold would be to use the mean of the estimated local dimensions d1, d2, …, dn, that is, $\hat{d} = \frac{1}{n}\sum_{i=1}^{n} d_i$ (and then round it to the closest integer). The mean estimator minimizes the sum of squared errors $\sum_{i=1}^{n}(d_i - \hat{d})^2$, but it is sensitive to outliers. We instead estimate the intrinsic dimension of the manifold by the median value of all the di’s; that is, we define the estimator for the intrinsic dimension d as

$$\hat{d} = \mathrm{median}\{d_1, d_2, \ldots, d_n\}.$$

In all subsequent steps of the algorithm we use the median estimator d̂, but in order to facilitate the notation we write d instead of d̂.
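A small sketch of this dimension-estimation rule is given below; it assumes the singular values σi,1 ≥ σi,2 ≥ ⋯ of each Bi have already been computed (e.g., by the local PCA sketch above), and the function name is illustrative only.

```python
import numpy as np

def estimate_dimension(singular_values_per_point, gamma=0.9):
    """Median of the local dimension estimates d_i (threshold rule with parameter gamma).

    singular_values_per_point : list of 1-D arrays holding the singular values of B_i,
                                sorted in decreasing order, one array per point x_i
    """
    local_dims = []
    for s in singular_values_per_point:
        energy = np.cumsum(s**2) / np.sum(s**2)                      # fraction of variability explained
        d_i = int(np.searchsorted(energy, gamma, side='right') + 1)  # smallest d_i with ratio > gamma
        local_dims.append(d_i)
    return int(np.median(local_dims))                                # the median estimator d_hat
```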

Alignment. Suppose xi and xj are two nearby points whose euclidean distance satisfies ‖xi − xj‖ℝp < √∊, where ∊ > 0 is a scale parameter different from the scale parameter ∊PCA. In fact, ∊ is much larger than ∊PCA, as we later choose ∊ = O(n−2/(d+4)), while, as mentioned earlier, ∊PCA = O(n−2/(d+1)) (manifolds with boundary) or ∊PCA = O(n−2/(d+2)) (manifolds with no boundary). In any case, ∊ is small enough so that the tangent spaces Txiℳ and Txjℳ are also close (in the sense that their Grassmannian distance, given approximately by the operator norm ‖OiOiᵀ − OjOjᵀ‖, is small). Therefore, the column spaces of Oi and Oj are almost the same. If the subspaces were to be exactly the same, then the matrices Oi and Oj would differ by a d × d orthogonal transformation Oij satisfying OiOij = Oj, or equivalently Oij = OiᵀOj. In that case, OiᵀOj would be the matrix representation of the operator that transports vectors from Txjℳ to Txiℳ, viewed as copies of ℝd. The subspaces, however, are usually not exactly the same, due to curvature. As a result, the matrix OiᵀOj is not necessarily orthogonal, and we define Oij as its closest orthogonal matrix, i.e.,

$$O_{ij} = \operatorname*{argmin}_{O \in O(d)} \|O - O_i^{\mathsf{T}} O_j\|_{\mathrm{HS}}, \qquad (2.5)$$

where ∥·∥HS is the Hilbert-Schmidt (HS) norm (given by ‖A‖HS2 = Tr(AAᵀ) for any real matrix A) and O(d) is the set of orthogonal d × d matrices. This minimization problem has a simple solution3 [1, 13, 21, 25] via the SVD of OiᵀOj. Specifically, if

OiΤOj=UΣVΤ

is the SVD of OiΤOj, then Oij is given by

Oij=UVΤ.

We refer to the process of finding the optimal orthogonal transformation between bases as alignment. Later in Appendix B we show that the matrix Oij is an approximation to the parallel transport operator4 from Txjℳ to Txiℳ whenever xi and xj are nearby.
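The alignment step amounts to projecting OiᵀOj onto O(d) via its SVD; a minimal NumPy sketch (with an illustrative function name) is:

```python
import numpy as np

def align_bases(O_i, O_j):
    """Closest orthogonal matrix O_ij to O_i^T O_j in the Hilbert-Schmidt norm (eq. (2.5)).

    O_i, O_j : (p, d) matrices with orthonormal columns produced by local PCA.
    The returned d x d matrix O_ij = U V^T approximates the parallel transport
    operator from T_{x_j}M to T_{x_i}M whenever x_i and x_j are nearby.
    """
    U, _, Vt = np.linalg.svd(O_i.T @ O_j)   # SVD of O_i^T O_j = U Sigma V^T
    return U @ Vt
```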

Note that not all bases are aligned; only the bases of nearby points are aligned. We set E to be the edge set of the undirected graph over n vertices that correspond to the data points, where an edge between i and j exists if and only if their corresponding bases are aligned by the algorithm5 (or equivalently, if and only if 0 < ‖xi − xj‖ℝp < √∊). The weights wij are defined6 using a kernel function K as

$$w_{ij} = K\!\left(\frac{\|x_i - x_j\|_{\mathbb{R}^p}}{\sqrt{\epsilon}}\right), \qquad (2.6)$$

where we assume that K is supported on the interval [0, 1]. For example, the Gaussian kernel K(u) = exp(−u2)χ[0,1](u) leads to weights of the form wij = exp(−‖xi − xj‖2/∊) for 0 < ‖xi − xj‖ < √∊ and 0 otherwise. Notice that the kernel K used for the definition of the weights wij could be different from the kernel KPCA used for the previous step of local PCA.

3 Vector Diffusion Mapping

We construct the following matrix S:

$$S(i,j) = \begin{cases} w_{ij} O_{ij} & (i,j) \in E, \\ 0_{d \times d} & (i,j) \notin E. \end{cases} \qquad (3.1)$$

That is, S is a block matrix, with n × n blocks, each of which is of size d × d. Each block is either a d × d orthogonal transformation Oij multiplied by the scalar weight wij or a zero d × d matrix. (As mentioned in a previous footnote, the edge set does not contain self-loops, so wii = 0 and S(i, i) = 0d × d.) The matrix S is symmetric since OijΤ=Oji and wij = wji, and its overall size is nd × nd. We define a diagonal matrix D of the same size, where the diagonal blocks are scalar matrices given by

D(i,i)=deg(i)Id×d, (3.2)

and

$$\deg(i) = \sum_{j:(i,j)\in E} w_{ij} \qquad (3.3)$$

is the weighted degree of node i. The matrix D−1S can be applied to vectors v of length nd, which we regard as n vectors of length d, such that v(i) is a vector in ℝd viewed as a vector in Txiℳ. The matrix D−1S is an averaging operator for vector fields, since

$$(D^{-1}Sv)(i) = \frac{1}{\deg(i)} \sum_{j:(i,j)\in E} w_{ij} O_{ij} v(j). \qquad (3.4)$$

This implies that the operator D−1S : (ℝd)n → (ℝd)n transports vectors from the tangent spaces Txjℳ (that are nearby to Txiℳ) to Txiℳ and then averages the transported vectors in Txiℳ.
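As a concrete illustration of how S, D, and the averaging operator D−1S are assembled from the weights and transports, here is a short NumPy sketch; the dictionary-based interface and dense storage are simplifying assumptions made for illustration (in practice S is sparse).

```python
import numpy as np

def build_S_and_deg(weights, transports, n, d):
    """Assemble the nd x nd block matrix S and the weighted degrees deg(i).

    weights    : dict mapping each undirected edge (i, j) with i < j to the weight w_ij
    transports : dict mapping the same edges to the d x d orthogonal matrix O_ij
    """
    S = np.zeros((n * d, n * d))
    deg = np.zeros(n)
    for (i, j), w in weights.items():
        O_ij = transports[(i, j)]
        S[i*d:(i+1)*d, j*d:(j+1)*d] = w * O_ij      # block S(i, j) = w_ij O_ij
        S[j*d:(j+1)*d, i*d:(i+1)*d] = w * O_ij.T    # symmetry: O_ji = O_ij^T, w_ji = w_ij
        deg[i] += w
        deg[j] += w
    return S, deg

def average_vector_field(S, deg, v, d):
    """Apply the averaging operator: (D^{-1} S v)(i) = (1/deg(i)) sum_j w_ij O_ij v(j)."""
    return np.repeat(1.0 / deg, d) * (S @ v)
```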

Notice that diffusion maps and other nonlinear dimensionality reduction methods make use of the weight matrix W = (wij)1≤i,j≤n but not of the transformations Oij. In diffusion maps, the weights are used to define a discrete random walk over the graph, where the transition probability aij in a single time step from node i to node j is given by

aij=wijdeg(i). (3.5)

The Markov transition matrix A=(aij)i,j=1n can be written as

$$A = \mathcal{D}^{-1} W, \qquad (3.6)$$

where 𝒟 is an n × n diagonal matrix with

𝒟(i,i)=deg(i). (3.7)

While A is the Markov transition probability matrix in a single time step, At is the transition matrix for t steps. In particular, At(i, j) sums the probabilities of all paths of length t that start at i and end at j. Coifman and Lafon [9, 26] showed that At can be used to define an inner product in a Hilbert space. Specifically, the matrix A is similar to the symmetric matrix 𝒟−1/2W𝒟−1/2 through A = 𝒟−1/2(𝒟−1/2W𝒟−1/2)𝒟1/2. It follows that A has a complete set of real eigenvalues {µl}l=1,…,n and eigenvectors {φl}l=1,…,n, respectively, satisfying Aφl = µlφl. Their diffusion mapping Φt is given by

$$\Phi_t(i) = \left(\mu_1^t \varphi_1(i), \mu_2^t \varphi_2(i), \ldots, \mu_n^t \varphi_n(i)\right), \qquad (3.8)$$

where φl(i) is the ith entry of the eigenvector φl. The mapping Φt satisfies

$$\sum_{k=1}^{n} \frac{A^t(i,k)}{\sqrt{\deg(k)}}\,\frac{A^t(j,k)}{\sqrt{\deg(k)}} = \langle \Phi_t(i), \Phi_t(j) \rangle, \qquad (3.9)$$

where 〈·,·〉 is the usual dot product over euclidean space. The metric associated to this inner product is known as the diffusion distance. The diffusion distance dDM,t (i, j) between i and j is given by

$$d_{\mathrm{DM},t}^2(i,j) = \sum_{k=1}^{n} \frac{\left(A^t(i,k) - A^t(j,k)\right)^2}{\deg(k)} = \langle \Phi_t(i), \Phi_t(i) \rangle + \langle \Phi_t(j), \Phi_t(j) \rangle - 2\langle \Phi_t(i), \Phi_t(j) \rangle. \qquad (3.10)$$

Thus, the diffusion distance between i and j is the weighted-ℓ2 proximity between the probability clouds of random walkers starting at i and j after t steps.
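For reference, the classical diffusion map computation described by (3.5)-(3.10) can be sketched in a few lines of NumPy; this is a simplified illustration (dense matrices, integer diffusion time t), not the implementation used in the paper.

```python
import numpy as np

def diffusion_map(W, t):
    """Diffusion mapping Phi_t and pairwise diffusion distances from a weight matrix W.

    W : (n, n) symmetric weight matrix (w_ij);  t : positive integer diffusion time.
    """
    deg = W.sum(axis=1)
    W_sym = W / np.sqrt(np.outer(deg, deg))       # D^{-1/2} W D^{-1/2}, similar to A = D^{-1} W
    mu, v = np.linalg.eigh(W_sym)
    phi = v / np.sqrt(deg)[:, None]               # right eigenvectors of A (phi_l = D^{-1/2} v_l)
    Phi_t = (mu**t) * phi                         # row i is the embedded point Phi_t(i)
    # squared diffusion distance d_DM,t^2(i, j) = ||Phi_t(i) - Phi_t(j)||^2, cf. eq. (3.10)
    sq_dists = ((Phi_t[:, None, :] - Phi_t[None, :, :])**2).sum(axis=-1)
    return Phi_t, np.sqrt(sq_dists)
```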

In the VDM framework, we define the affinity between i and j by considering all paths of length t connecting them, but instead of just summing the weights of all paths, we sum the transformations. A path of length t from j to i is some sequence of vertices j0, j1, …, jt with j0 = j and jt = i, and its corresponding orthogonal transformation is obtained by multiplying the orthogonal transformations along the path in the following order:

$$O_{j_t, j_{t-1}} O_{j_{t-1}, j_{t-2}} \cdots O_{j_2, j_1} O_{j_1, j_0}. \qquad (3.11)$$

Every path from j to i may therefore result in a different transformation. This is analogous to the parallel transport operator from differential geometry that depends on the path connecting two points whenever the manifold has curvature (e.g., the sphere). Thus, when adding transformations of different paths, cancellations may happen.

We would like to define the affinity between i and j as the consistency between these transformations, with higher affinity expressing more agreement among the transformations that are being averaged. To quantify this affinity, we consider again the matrix D−1S, which is similar to the symmetric matrix

$$\tilde{S} = D^{-1/2} S D^{-1/2} \qquad (3.12)$$

through D−1S = D−1/2S̃D1/2, and define the affinity between i and j as

$$\|\tilde{S}^{2t}(i,j)\|_{\mathrm{HS}}^2,$$

that is, as the squared HS norm of the d × d matrix S̃2t(i, j), which takes into account all paths of length 2t, where t is a positive integer. In a sense, ‖S̃2t(i, j)‖HS2 measures not only the number of paths of length 2t connecting i and j but also the amount of agreement between their transformations. That is, for a fixed number of paths, ‖S̃2t(i, j)‖HS2 is larger when the path transformations are in agreement and is smaller when they differ.

Since S̃ is symmetric, it has a complete set of eigenvectors v1, v2, …, vnd and eigenvalues λ1, λ2, …, λnd. We order the eigenvalues in decreasing order of magnitude |λ1| ≥ |λ2| ≥ ⋯ ≥ |λnd|. The spectral decompositions of S̃ and S̃2t are given by

$$\tilde{S}(i,j) = \sum_{l=1}^{nd} \lambda_l v_l(i) v_l(j)^{\mathsf{T}} \quad\text{and}\quad \tilde{S}^{2t}(i,j) = \sum_{l=1}^{nd} \lambda_l^{2t} v_l(i) v_l(j)^{\mathsf{T}}, \qquad (3.13)$$

where vl(i) ∈ ℝd for i = 1, 2, …, n and l = 1, 2, …, nd. The HS norm of S̃2t(i, j) is calculated using the trace:

$$\|\tilde{S}^{2t}(i,j)\|_{\mathrm{HS}}^2 = \mathrm{Tr}\!\left[\tilde{S}^{2t}(i,j)\,\tilde{S}^{2t}(i,j)^{\mathsf{T}}\right] = \sum_{l,r=1}^{nd} (\lambda_l \lambda_r)^{2t} \langle v_l(i), v_r(i)\rangle \langle v_l(j), v_r(j)\rangle. \qquad (3.14)$$

It follows that the affinity ‖S̃2t(i, j)‖HS2 is an inner product for the finite-dimensional Hilbert space ℝ(nd)² via the mapping Vt:

$$V_t : i \mapsto \left((\lambda_l \lambda_r)^t \langle v_l(i), v_r(i)\rangle\right)_{l,r=1}^{nd}. \qquad (3.15)$$

That is,

$$\|\tilde{S}^{2t}(i,j)\|_{\mathrm{HS}}^2 = \langle V_t(i), V_t(j)\rangle. \qquad (3.16)$$

Note that in the manifold learning setup, the embedding i ↦ Vt(i) is invariant to the choice of basis for Txiℳ because the dot products 〈vl(i), vr(i)〉 are invariant to orthogonal transformations. We refer to Vt as the vector diffusion mapping.

From the symmetry of the dot products 〈vl(i), vr(i)〉, it is clear that ‖S̃2t(i, j)‖HS2 is also an inner product for the finite-dimensional Hilbert space ℝnd(nd+1)/2 corresponding to the mapping

$$i \mapsto \left(c_{lr} (\lambda_l \lambda_r)^t \langle v_l(i), v_r(i)\rangle\right)_{1 \le l \le r \le nd}, \quad\text{where}\quad c_{lr} = \begin{cases} \sqrt{2} & l < r, \\ 1 & l = r. \end{cases}$$

We define the symmetric vector diffusion distance dVDM,t (i, j) between nodes i and j as

$$d_{\mathrm{VDM},t}^2(i,j) = \langle V_t(i), V_t(i)\rangle + \langle V_t(j), V_t(j)\rangle - 2\langle V_t(i), V_t(j)\rangle. \qquad (3.17)$$

The matrices I − S̃ and I + S̃ are positive semidefinite due to the following identity:

$$v^{\mathsf{T}}\!\left(I \pm D^{-1/2} S D^{-1/2}\right)v = \sum_{(i,j)\in E} w_{ij} \left\| \frac{v(i)}{\sqrt{\deg(i)}} \pm \frac{O_{ij}\, v(j)}{\sqrt{\deg(j)}} \right\|^2 \ge 0 \qquad (3.18)$$

for any v ∈ ℝnd. As a consequence, all eigenvalues λl of S̃ reside in the interval [−1, 1]. In particular, for large enough t, most terms of the form (λlλr)2t in (3.14) are close to 0, and ‖S̃2t(i, j)‖HS2 can be well approximated by using only the few largest eigenvalues and their corresponding eigenvectors. This lends itself to an efficient approximation of the vector diffusion distances dVDM,t(i, j) of (3.17), and it is not necessary to raise the matrix S̃ to its 2t power (which usually results in dense matrices). Thus, for any δ > 0, we define the truncated vector diffusion mapping Vtδ that embeds the data set in ℝm² (or equivalently, but more efficiently, in ℝm(m+1)/2) using the eigenvectors v1, v2, …, vm as

$$V_t^{\delta} : i \mapsto \left((\lambda_l \lambda_r)^t \langle v_l(i), v_r(i)\rangle\right)_{l,r=1}^{m}, \qquad (3.19)$$

where m = m(t, δ) is the largest integer for which

$$\left(\frac{\lambda_m}{\lambda_1}\right)^{2t} > \delta \quad\text{and}\quad \left(\frac{\lambda_{m+1}}{\lambda_1}\right)^{2t} \le \delta.$$

We remark that Vt is defined through S̃2t rather than S̃t, because we cannot guarantee that in general all eigenvalues of S̃ are nonnegative. In Section 8, we show that in the continuous setup of the manifold learning problem all eigenvalues are nonnegative. We anticipate that for most practical applications that correspond to the manifold assumption, all negative eigenvalues (if any) would be small in magnitude (say, smaller than δ). In such cases, one can use any real t > 0 for the truncated vector diffusion map Vtδ.
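The truncated vector diffusion mapping of (3.19) and the distances of (3.17) can be computed directly from the top eigenpairs of S̃; the following NumPy sketch (dense matrices, integer t, illustrative interface) shows one way to do it.

```python
import numpy as np

def truncated_vdm(S, deg, n, d, t, delta=0.2):
    """Truncated vector diffusion mapping V_t^delta; rows of the output embed the n points.

    S   : (n*d, n*d) symmetric block matrix with blocks w_ij O_ij
    deg : (n,) weighted degrees deg(i);  t : positive integer diffusion time
    """
    D_inv_sqrt = np.repeat(1.0 / np.sqrt(deg), d)
    S_tilde = D_inv_sqrt[:, None] * S * D_inv_sqrt[None, :]   # S~ = D^{-1/2} S D^{-1/2}
    lam, V = np.linalg.eigh(S_tilde)
    order = np.argsort(-np.abs(lam))                          # |lam_1| >= |lam_2| >= ...
    lam, V = lam[order], V[:, order]
    keep = np.abs(lam / lam[0])**(2 * t) > delta              # truncation rule of eq. (3.19)
    lam, V = lam[keep], V[:, keep]
    blocks = V.reshape(n, d, -1)                              # blocks[i, :, l] = v_l(i) in R^d
    G = np.einsum('ial,iar->ilr', blocks, blocks)             # G[i, l, r] = <v_l(i), v_r(i)>
    emb = ((lam[None, :, None] * lam[None, None, :])**t) * G  # entries (lam_l lam_r)^t <v_l(i), v_r(i)>
    return emb.reshape(n, -1)

# Squared vector diffusion distances, cf. eq. (3.17):
# emb = truncated_vdm(S, deg, n, d, t); d2 = ((emb[:, None] - emb[None, :])**2).sum(-1)
```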

4 Normalized Vector Diffusion Mappings

It is also possible to obtain slightly different vector diffusion mappings using different normalizations of the matrix S. These normalizations are similar to the ones used in the diffusion map framework [9]. For example, notice that

$$w_l = D^{-1/2} v_l \qquad (4.1)$$

are the right eigenvectors of D−1S, that is, D−1Swl = λlwl. We can thus define another vector diffusion mapping, denoted V′t, as

$$V'_t : i \mapsto \left((\lambda_l \lambda_r)^t \langle w_l(i), w_r(i)\rangle\right)_{l,r=1}^{nd}. \qquad (4.2)$$

From (4.1) it follows that Vt and V′t satisfy the relations

$$V'_t(i) = \frac{1}{\deg(i)}\, V_t(i) \qquad (4.3)$$

and

$$\langle V'_t(i), V'_t(j)\rangle = \frac{\langle V_t(i), V_t(j)\rangle}{\deg(i)\deg(j)}. \qquad (4.4)$$

As a result,

$$\langle V'_t(i), V'_t(j)\rangle = \frac{\|\tilde{S}^{2t}(i,j)\|_{\mathrm{HS}}^2}{\deg(i)\deg(j)} = \frac{\|(D^{-1}S)^{2t}(i,j)\|_{\mathrm{HS}}^2}{\deg(j)^2}. \qquad (4.5)$$

In other words, the Hilbert-Schmidt norm of the matrix D−1S leads to an embedding of the data set in a Hilbert space only upon proper normalization by the vertex degrees (similar to the normalization by the vertex degrees in (3.9) and (3.10) for the diffusion map). We define the associated vector diffusion distances as

$$d'^{\,2}_{\mathrm{VDM},t}(i,j) = \langle V'_t(i), V'_t(i)\rangle + \langle V'_t(j), V'_t(j)\rangle - 2\langle V'_t(i), V'_t(j)\rangle. \qquad (4.6)$$

We comment that the normalized mappings i ↦ Vt(i)/‖Vt(i)‖ and i ↦ V′t(i)/‖V′t(i)‖ that map the data points to the unit sphere are equivalent, that is,

$$\frac{V_t(i)}{\|V_t(i)\|} = \frac{V'_t(i)}{\|V'_t(i)\|}. \qquad (4.7)$$

This means that the angles between pairs of embedded points are the same for both mappings. For a diffusion map, it has been observed that in some cases the distances

$$\left\| \frac{\Phi_t(i)}{\|\Phi_t(i)\|} - \frac{\Phi_t(j)}{\|\Phi_t(j)\|} \right\|$$

are more meaningful than ∥Φt(i)–Φt(j)∥ (see, for example, [17]). This may also suggest the usage of the distances

$$\left\| \frac{V_t(i)}{\|V_t(i)\|} - \frac{V_t(j)}{\|V_t(j)\|} \right\|$$

in the VDM framework.

Another important family of normalized diffusion mappings is obtained by the following procedure. Suppose 0 ≤ α ≤ 1, and define the symmetric matrices Wα and Sα as

$$W_{\alpha} = \mathcal{D}^{-\alpha} W \mathcal{D}^{-\alpha} \qquad (4.8)$$

and

$$S_{\alpha} = D^{-\alpha} S D^{-\alpha}. \qquad (4.9)$$

We define the weighted degrees degα(1), degα(2), …, degα(n) corresponding to Wα by

$$\deg_{\alpha}(i) = \sum_{j=1}^{n} W_{\alpha}(i,j),$$

the n × n diagonal matrix 𝒟α as

𝒟α(i,i)=degα(i), (4.10)

and the n × n block diagonal matrix Dα (with blocks of size d × d) as

Dα(i,i)=degα(i)Id×d. (4.11)

We can then use the matrices Sα and Dα (instead of S and D) to define the vector diffusion mappings Vα,t and V′α,t. Notice that for α = 0 we have S0 = S and D0 = D, so that V0,t = Vt and V′0,t = V′t. The case α = 1 turns out to be especially important, as discussed in the next section.
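A short sketch of this α-normalization, continuing the NumPy conventions of the earlier sketches (dense matrices, illustrative function names), is:

```python
import numpy as np

def alpha_normalize(W, S, d, alpha=1.0):
    """Form S_alpha and the degrees deg_alpha(i) of eqs. (4.8)-(4.11).

    W : (n, n) weight matrix;  S : (n*d, n*d) block matrix with blocks w_ij O_ij.
    """
    deg = W.sum(axis=1)
    W_alpha = W / np.outer(deg**alpha, deg**alpha)     # W_alpha = D^{-alpha} W D^{-alpha}
    scale = np.repeat(deg**(-alpha), d)
    S_alpha = scale[:, None] * S * scale[None, :]      # S_alpha = D^{-alpha} S D^{-alpha}
    deg_alpha = W_alpha.sum(axis=1)                    # degrees of W_alpha; D_alpha(i,i) = deg_alpha(i) I_d
    return S_alpha, deg_alpha

# Feeding (S_alpha, deg_alpha) into the earlier truncated_vdm sketch yields V_{alpha,t};
# alpha = 1 corresponds to the normalization used in Sections 5 and 6.
```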

5 Convergence to the Connection Laplacian

For diffusion maps, the discrete random walk over the data points converges to a continuous diffusion process over that manifold in the limit n → ∞ and ∊ → 0. This convergence can be stated in terms of the normalized graph Laplacian L given by

$$L = \mathcal{D}^{-1} W - I.$$

In the case where the data points x1, x2, …, xn are sampled independently from the uniform distribution over ℳd, the graph Laplacian converges pointwise to the Laplace-Beltrami operator, as we have the following proposition [3, 20, 26, 34]: If f : ℳd → ℝ is a smooth function (e.g., f ∈ C3(ℳ)), then with high probability

$$\frac{1}{\epsilon}\sum_{j=1}^{n} L_{ij} f(x_j) = \frac{1}{2}\Delta f(x_i) + O\!\left(\epsilon + \frac{1}{n^{1/2}\epsilon^{1/2 + d/4}}\right), \qquad (5.1)$$

where Δ is the Laplace-Beltrami operator on ℳd. The error consists of two terms: a bias term O(∊) and a variance term that decreases as 1/n but also depends on ∊. Balancing the two terms may lead to an optimal choice of the parameter ∊ as a function of the number of points n. In the case of uniform sampling, Belkin and Niyogi [4] have shown that the eigenvectors of the graph Laplacian converge to the eigenfunctions of the Laplace-Beltrami operator on the manifold, which is stronger than the pointwise convergence given in (5.1).

In the case where the data points x1, x2, …, xn are independently sampled from a probability density function p(x) whose support is a d-dimensional manifold ℳd and which satisfies some mild conditions, the graph Laplacian converges pointwise to the Fokker-Planck operator, as stated in the following proposition [3, 20, 26, 34]: If f ∈ C3(ℳ), then with high probability

$$\frac{1}{\epsilon}\sum_{j=1}^{n} L_{ij} f(x_j) = \frac{1}{2}\Delta f(x_i) + \nabla U(x_i)\cdot\nabla f(x_i) + O\!\left(\epsilon + \frac{1}{n^{1/2}\epsilon^{1/2 + d/4}}\right), \qquad (5.2)$$

where the potential term U is given by U(x) = −2 log p(x). The error is interpreted in the same way as in the uniform sampling case. In [9] it is shown that it is also possible to recover the Laplace-Beltrami operator for nonuniform sampling processes using W1 and 𝒟1 (that correspond to α = 1 in (4.8) and (4.11)). The matrix 𝒟1−1W1 − I converges to the Laplace-Beltrami operator independently of the sampling density function p(x).

For VDM, we prove in Appendix B the following theorem, Theorem 5.3, which states that the matrix Dα−1Sα − I, where 0 ≤ α ≤ 1, converges to the connection Laplacian operator (defined via the covariant derivative; see Appendix A and [32]) plus some potential terms depending on p(x). In particular, D1−1S1 − I converges to the connection Laplacian operator, without any additional potential terms. Using the terminology of spectral graph theory, it may thus be appropriate to call D1−1S1 − I the connection Laplacian of the graph.

The main content of Theorem 5.3 specifies the way in which VDM generalizes diffusion maps: while diffusion mapping is based on the heat kernel and the Laplace-Beltrami operator over scalar functions, VDM is based on the heat kernel and the connection Laplacian over vector fields. While for diffusion maps the computed eigenvectors are discrete approximations of the Laplacian eigenfunctions, for VDM the lth eigenvector vl of D1−1S1 − I is a discrete approximation of the lth eigenvector field Xl of the connection Laplacian ▽2 over ℳ, which satisfies ▽2Xl = −λlXl for some λl ≥ 0.

In the formulation of Theorem 5.3, as well as in the remainder of the paper, we slightly change the notation used so far in the paper, as we denote the sampled data points in ℳd by x1, x2, …, xn, and the observed data points in ℝp by ι(x1), ι(x2), …, ι(xn), where ι : ℳ ↪ ℝp is the embedding of the Riemannian manifold ℳ in ℝp. Furthermore, we denote by ι*Txiℳ the d-dimensional subspace of ℝp that is the embedding of Txiℳ in ℝp. It is important to note that in the manifold learning setup, the manifold ℳ, the embedding ι, and the points x1, x2, …, xn ∈ ℳ are assumed to exist but cannot be directly observed.

Theorems 5.3, 5.5, and 5.6, stated later in this section, and their proofs in Appendix B all share the following assumption:

ASSUMPTION 5.1.

  1. ι : ℳ ↪ ℝp is an embedding of a smooth d-dimensional compact Riemannian manifold ℳ in ℝp, with metric g induced from the canonical metric on ℝp.

  2. When ∂ℳ ≠ ∅, we denote ℳt = {x ∈ ℳ : miny∈∂ℳ d(x, y) ≤ t}, where t > 0 and d(x, y) is the geodesic distance between x and y.

  3. The data points x1, x2, …, xn are independent samples from ℳ according to the probability density function p ∈ C3(ℳ) supported on ℳ ⊂ ℝp, where p is uniformly bounded from below and above, that is, 0 < pm ≤ p(x) ≤ pM < ∞.

  4. K ∈ C2([0, 1]) is a positive function. Also, ml := ∫ℝd ‖x‖l K(‖x‖) dx for l = 0, 1, 2, …. We assume m0 = 1.

  5. The vector field X is in C3(Tℳ).

  6. Denote by τ the largest number having the following property: the open normal bundle about ℳ of radius r is embedded in ℝp for every r < τ [30]. This condition holds automatically since ℳ is compact. In all theorems, we assume that √∊ < τ. In [30], 1/τ is referred to as the “condition number” of ℳ.

  7. To ease notation, in what follows we use the same notation ▽ to denote different connections on different bundles whenever there is no confusion and the meaning is clear from the context.

DEFINITION 5.2. For ∊ > 0, define

$$K_{\epsilon}(x_i, x_j) = \begin{cases} K\!\left(\dfrac{\|\iota(x_i) - \iota(x_j)\|_{\mathbb{R}^p}}{\sqrt{\epsilon}}\right) & \text{for } 0 < \|\iota(x_i) - \iota(x_j)\| < \sqrt{\epsilon}, \\ 0 & \text{otherwise.} \end{cases}$$

We define the empirical probability density function by

$$p_{\epsilon}(x_i) = \sum_{j=1}^{n} K_{\epsilon}(x_i, x_j)$$

and for 0 ≤ α ≤ 1 define the α-normalized kernel K∊, α by

$$K_{\epsilon,\alpha}(x_i, x_j) = \frac{K_{\epsilon}(x_i, x_j)}{p_{\epsilon}^{\alpha}(x_i)\, p_{\epsilon}^{\alpha}(x_j)}.$$

For 0 ≤ α ≤ 1, we define

$$T_{\epsilon,\alpha}X(x) = \frac{\int_{\mathcal{M}} K_{\epsilon,\alpha}(x,y)\, P_{x,y} X(y)\, p(y)\, dV(y)}{\int_{\mathcal{M}} K_{\epsilon,\alpha}(x,y)\, p(y)\, dV(y)}.$$

THEOREM 5.3. In addition to Assumption 5.1, suppose ℳ is closed and ∊PCA = O(n−2/(d+2)). Then for all xi, with high probability (w.h.p.),

$$\begin{aligned}
\frac{1}{\epsilon}\left[\frac{\sum_{j=1}^{n} K_{\epsilon,\alpha}(x_i,x_j)\, O_{ij}\bar{X}_j}{\sum_{j=1}^{n} K_{\epsilon,\alpha}(x_i,x_j)} - \bar{X}_i\right]
&= \frac{m_2}{2d}\left(\left\langle \iota_*\!\left\{\nabla^2 X(x_i) + \frac{2\,\nabla X(x_i)\cdot\nabla(p^{1-\alpha})(x_i)}{p^{1-\alpha}(x_i)}\right\},\, u_l(x_i)\right\rangle\right)_{l=1}^{d} + O\!\left(\epsilon^{1/2} + \epsilon^{-1} n^{-\frac{3}{d+2}} + n^{-1/2}\epsilon^{-\frac{d+2}{4}}\right) \\
&= \frac{m_2}{2d}\left(\left\langle \iota_*\!\left\{\nabla^2 X(x_i) + \frac{2\,\nabla X(x_i)\cdot\nabla(p^{1-\alpha})(x_i)}{p^{1-\alpha}(x_i)}\right\},\, e_l(x_i)\right\rangle\right)_{l=1}^{d} + O\!\left(\epsilon^{1/2} + \epsilon^{-1} n^{-\frac{3}{d+2}} + n^{-1/2}\epsilon^{-\frac{d+2}{4}}\right), \qquad (5.3)
\end{aligned}$$

where X̄i := (⟨ι*X(xi), ul(xi)⟩)l=1,…,d ∈ ℝd for all i, {ul(xi)}l=1,…,d is an orthonormal basis for a d-dimensional subspace of ℝp determined by local PCA (i.e., the columns of Oi), {el(xi)}l=1,…,d is an orthonormal basis for ι*Txiℳ, Oij is the optimal orthogonal transformation determined by the alignment procedure, and ∇X(xi)·∇(p1−α)(xi) := Σl=1,…,d ∇ElX ∇El(p1−α), where {El}l=1,…,d is an orthonormal basis for Txiℳ.

When ∊PCA = O(n−2/(d+1)), the same convergence results stated above hold almost surely, but with a slower convergence rate.

COROLLARY 5.4. For ∊ = O(n−2/(d+4)) almost surely,

$$\lim_{n\to\infty} \frac{1}{\epsilon}\left[\frac{\sum_{j=1}^{n} K_{\epsilon,\alpha}(x_i,x_j)\, O_{ij}\bar{X}_j}{\sum_{j=1}^{n} K_{\epsilon,\alpha}(x_i,x_j)} - \bar{X}_i\right] = \frac{m_2}{2d}\left(\left\langle \iota_*\!\left\{\nabla^2 X(x_i) + \frac{2\,\nabla X(x_i)\cdot\nabla(p^{1-\alpha})(x_i)}{p^{1-\alpha}(x_i)}\right\},\, e_l(x_i)\right\rangle\right)_{l=1}^{d}, \qquad (5.4)$$

and in particular

$$\lim_{n\to\infty} \frac{1}{\epsilon}\left[\frac{\sum_{j=1}^{n} K_{\epsilon,1}(x_i,x_j)\, O_{ij}\bar{X}_j}{\sum_{j=1}^{n} K_{\epsilon,1}(x_i,x_j)} - \bar{X}_i\right] = \frac{m_2}{2d}\left(\left\langle \iota_* \nabla^2 X(x_i),\, e_l(x_i)\right\rangle\right)_{l=1}^{d}. \qquad (5.5)$$

When the manifold is compact with boundary, (5.3) does not hold at the boundary. However, we have the following result for the convergence behavior near the boundary:

THEOREM 5.5. Suppose the boundary ∂ℳ is smooth and that Assumption 5.1 applies. Choose ∊PCA = O(n−2/(d+1)). When xi ∈ ℳ√∊, we have

$$\frac{\sum_{j=1}^{n} K_{\epsilon,1}(x_i,x_j)\, O_{ij}\bar{X}_j}{\sum_{j=1}^{n} K_{\epsilon,1}(x_i,x_j)} = \left(\left\langle \iota_* P_{x_i,x_0}\!\left(X(x_0) + \frac{m_1^{\epsilon}}{m_0^{\epsilon}}\,\nabla_{\partial_d} X(x_0)\right),\, e_l(x_i)\right\rangle\right)_{l=1}^{d} + O\!\left(\epsilon + n^{-\frac{3}{2(d+1)}} + n^{-1/2}\epsilon^{-\frac{d-2}{4}}\right), \qquad (5.6)$$

where x0 = argminy∈∂ℳ d(xi, y), m1∊ and m0∊ are constants defined in (B.93) and (B.94), and ∂d is the normal direction to the boundary at x0.

For the choice ∊ = O(n−2/(d+4)) (as in Corollary 5.4), the error appearing in (5.6) is O(∊3/4), which is asymptotically smaller than O(√∊), which is the order of m1∊/m0∊. A consequence of Theorem 5.3, Theorem 5.5, and the above discussion about the error terms is that the eigenvectors of D1−1S1 − I are discrete approximations of the eigenvector fields of the connection Laplacian operator with homogeneous Neumann boundary condition, that satisfy

$$\begin{cases} \nabla^2 X(x) = -\lambda X(x) & \text{for } x \in \mathcal{M}, \\ \nabla_{\partial_d} X(x) = 0 & \text{for } x \in \partial\mathcal{M}. \end{cases} \qquad (5.7)$$

We remark that the Neumann boundary condition also emerges for the choice ∊PCA = O(n−2/(d+2)). This is due to the fact that the error in the local PCA term is O(∊PCA1/2) = O(n−1/(d+2)), which is asymptotically smaller than the O(∊1/2) = O(n−1/(d+4)) error term.

Finally, Theorem 5.6 details the way in which the algorithm approximates the continuous heat kernel of the connection Laplacian:

THEOREM 5.6. Under Assumption 5.1, for any t > 0, the heat kernel et▽² can be approximated on L2(Tℳ) by T∊,1t/∊, that is,

$$\lim_{\epsilon\to 0} T_{\epsilon,1}^{t/\epsilon} = e^{t\nabla^2}.$$

6 Numerical Simulations

In all numerical experiments reported in this section, we use the normalized vector diffusion mapping V1,t corresponding to α = 1 in (4.9) and (4.10); that is, we use the eigenvectors of D1−1S1 to define the VDM. In all experiments we used the kernel function K(u) = e−5u²χ[0,1](u) for the local PCA step as well as for the definition of the weights wij. The specific choices for ∊ and ∊PCA are detailed below. We remark that the results are not very sensitive to these choices; that is, similar results are obtained for a wide regime of parameters. The purpose of the first experiment is to numerically verify Theorem 5.3 using spheres of different dimensions. Specifically, we sampled n = 8000 points uniformly from 𝕊d embedded in ℝd+1 for d = 2, 3, 4, 5. Figure 6.1 shows bar plots of the largest 30 eigenvalues of the matrix D1−1S1 for ∊PCA = 0.1 when d = 2, 3, 4 and ∊PCA = 0.2 when d = 5, and ∊ = ∊PCA^{(d+1)/(d+4)}. It is noticeable that the eigenvalues have numerical multiplicities greater than 1. Since the connection Laplacian commutes with rotations, the dimensions of its eigenspaces can be calculated using representation theory (see Appendix C). In particular, our calculation predicted the following dimensions for the eigenspaces of the largest eigenvalues:

𝕊2: 6, 10, 14, …;  𝕊3: 4, 6, 9, 16, 16, …;  𝕊4: 5, 10, 14, …;  𝕊5: 6, 15, 20, ….

Figure 6.1.

Figure 6.1

Bar plots of the largest 30 eigenvalues of D1−1S1 for n = 8000 points uniformly distributed over spheres of different dimensions.

These dimensions are in full agreement with the bar plots shown in Figure 6.1.

In the second set of experiments, we numerically compare the vector diffusion distance, the diffusion distance, and the geodesic distance for different compact manifolds with and without boundaries. The comparison is performed for the following four manifolds: (1) the sphere 𝕊2 embedded in ℝ3; (2) the torus 𝕋2 embedded in ℝ3; (3) the interval [−π, π] in ℝ; and (4) the square [0, 2π] × [0, 2π] in ℝ2. For both VDM and DM we truncate the mappings using δ = 0.2; see (3.19). The geodesic distance is computed by the algorithm of Dijkstra on a weighted graph, whose vertices correspond to the data points, whose edges link data points whose euclidean distance is less than √∊, and whose weights wG(i, j) are the euclidean distances, that is,

$$w_G(i,j) = \begin{cases} \|x_i - x_j\|_{\mathbb{R}^p}, & \|x_i - x_j\| < \sqrt{\epsilon}, \\ +\infty & \text{otherwise.} \end{cases}$$
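This graph-based geodesic computation can be reproduced, for instance, with SciPy's shortest-path routines; the sketch below (illustrative, dense distance matrix) builds the ∊-neighborhood graph and runs Dijkstra's algorithm on it.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import dijkstra
from scipy.spatial.distance import cdist

def geodesic_distances(X, eps):
    """Approximate geodesic distances on the epsilon-neighborhood graph of the point cloud X.

    X : (n, p) array of data points; points closer than sqrt(eps) are linked by an edge
    whose weight is their euclidean distance, and Dijkstra gives shortest-path distances.
    """
    dist = cdist(X, X)                        # pairwise euclidean distances
    dist[dist >= np.sqrt(eps)] = 0.0          # drop long edges (zeros are treated as "no edge")
    graph = csr_matrix(dist)
    return dijkstra(graph, directed=False)    # (n, n) matrix of graph geodesic distances
```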

𝕊2 case: We sampled n = 5000 points uniformly from 𝕊2 = {x ∈ ℝ3 : ∥x∥ = 1} ⊂ ℝ3 and set ∊PCA = 0.1 and ∊ = √∊PCA ≈ 0.316. For the truncated vector diffusion distance, when t = 10, we find that the number of eigenvectors whose eigenvalue is larger (in magnitude) than λ1δ1/(2t) is mVDM = mVDM(t = 10, δ = 0.2) = 16 (recall the definition of m(t, δ) that appears after (3.19)). The corresponding embedded dimension is mVDM(mVDM + 1)/2, which in this case is 16 · 17/2 = 136. Similarly, for t = 100, mVDM = 6 (embedded dimension is 6 · 7/2 = 21), and when t = 1000, mVDM = 6 (embedded dimension is again 21). Although the first eigenspace (corresponding to the largest eigenvalue) of the connection Laplacian over 𝕊2 is of dimension 6, there are small discrepancies between the top 6 numerically computed eigenvalues, due to the finite sampling. This numerical discrepancy is amplified upon raising the eigenvalues to the tth power, when t is large, e.g., t = 1000. For demonstration purposes, we remedy this numerical effect by artificially setting λl = λ1 for l = 2, 3, …, 6. For the truncated diffusion distance, when t = 10, mDM = 36 (embedded dimension is 36 − 1 = 35), when t = 100, mDM = 4 (embedded dimension is 3), and when t = 1000, mDM = 4 (embedded dimension is 3). Similarly, we have the same numerical effect when t = 1000, that is, µ2, µ3, and µ4 are close but not exactly the same, so we again set µl = µ2 for l = 3, 4. The results are shown in Figure 6.2.

Figure 6.2.

Figure 6.2

𝕊2 case. Top: truncated vector diffusion distances for t = 10, t = 100, and t = 1000. Bottom: truncated diffusion distances for t = 10, t = 100, and t = 1000, and the geodesic distance. The reference point from which distances are computed is marked in red.

𝕋2 case: We sampled n = 5000 points (u, v) uniformly over the [0, 2π) × [0, 2π) square and then mapped them to ℝ3 using the following transformation, which defines the surface 𝕋2 as

$$\mathbb{T}^2 = \{((2+\cos v)\cos u,\; (2+\cos v)\sin u,\; \sin v) : (u,v) \in [0,2\pi)\times[0,2\pi)\} \subset \mathbb{R}^3.$$

Notice that the resulting sample points are not uniformly distributed over 𝕋2. Therefore, the usage of S1 and D1 instead of S and D is important if we want the eigenvectors to approximate the eigenvector fields of the connection Laplacian over 𝕋2. We used ∊PCA = 0.2 and ∊ = √∊PCA ≈ 0.447, and find that for the truncated vector diffusion distance, when t = 10, the embedded dimension is 2628; when t = 100, the embedded dimension is 36; and when t = 1000, the embedded dimension is 3. For the truncated diffusion distance, when t = 10, the embedded dimension is 130; when t = 100, the embedded dimension is 14; and when t = 1000, the embedded dimension is 2. The results are shown in Figure 6.3.

Figure 6.3.

Figure 6.3

𝕋2 case. Top: truncated vector diffusion distances for t = 10, t = 100, and t = 1000. Bottom: truncated diffusion distances for t = 10, t = 100, and t = 1000, and the geodesic distance. The reference point from which distances are computed is marked in red.

One-dimensional interval case: We sampled n = 5000 equally spaced grid points from the interval [−π, π] ⊂ ℝ1 and set ∊PCA = 0.01 and ∊ = ∊PCA^{2/5} ≈ 0.158. For the truncated vector diffusion distance, when t = 10, the embedded dimension is 120; when t = 100, the embedded dimension is 15; and when t = 1000, the embedded dimension is 3. For the truncated diffusion distance, when t = 10, the embedded dimension is 36; when t = 100, the embedded dimension is 11; and when t = 1000, the embedded dimension is 3. The results are shown in Figure 6.4.

Figure 6.4.

Figure 6.4

One-dimensional interval case. Top: truncated vector diffusion distances for t = 10, t = 100, and t = 1000. Bottom: truncated diffusion distances for t = 10, t = 100, and t = 1000, and the geodesic distance. The reference point from which distances are computed is marked in red.

Square case: We sampled n = 6561 = 81² equally spaced grid points from the square [0, 2π] × [0, 2π] and fixed ∊PCA = 0.01 and ∊ = √∊PCA = 0.1. For the truncated vector diffusion distance, when t = 10, the embedded dimension is 20,100 (we only calculate the first 200 eigenvalues); when t = 100, the embedded dimension is 1596; and when t = 1000, the embedded dimension is 36. For the truncated diffusion distance, when t = 10, the embedded dimension is 200 (we only calculate the first 200 eigenvalues); when t = 100, the embedded dimension is 200; and when t = 1000, the embedded dimension is 28. The results are shown in Figure 6.5.

Figure 6.5.

Figure 6.5

Square case. Top: truncated vector diffusion distances for t = 10, t = 100, and t = 1000. Bottom: truncated diffusion distances for t = 10, t = 100, and t = 1000, and the geodesic distance. The reference point from which distances are computed is marked in red.

7 Out-of-Sample Extension of Vector Fields

Let 𝒳 = {x1, …, xn} and 𝒴 = {y1, …, ym}, so that 𝒳, 𝒴 ⊂ ℳd, where ℳ is embedded in ℝp by ι. Suppose X is a smooth vector field that we observe only on 𝒳 and want to extend to 𝒴. That is, we observe the vectors ι*X(x1), ι*X(x2), …, ι*X(xn) ∈ ℝp and want to estimate ι*X(y1), ι*X(y2), …, ι*X(ym). The set 𝒳 is assumed to be fixed, while the points in 𝒴 may arrive on the fly and need to be processed in real time. We propose the following Nyström scheme for extending X from 𝒳 to 𝒴.

In the preprocessing step we use the points x1, x2; …, xn for local PCA, alignment, and vector diffusion mapping as described in Sections 2 and 3. That is, using local PCA, we find the p × d matrices Oi (i = 1, 2, …, n) such that the columns of Oi are an orthonormal basis for a subspace that approximates the embedded tangent plane ι*Txiℳ; using alignment, we find the orthonormal d × d matrices Oij that approximate the parallel transport operator from Txjℳ to Txiℳ; and using wij and Oij we construct the matrices S and D and compute (a subset of) the eigenvectors v1, v2, …, vnd and eigenvalues λ1, λ2, …, λnd of D−1S.

We project the embedded vector field ι*X(xi) ∈ ℝp into the d-dimensional subspace spanned by the columns of Oi and define Xi ∈ ℝd as

Xi=OiΤι*X(xi). (7.1)

We represent the vector field X on 𝒳 by the vector x of length nd, organized as n vectors of length d, with

$$x(i) = X_i \quad\text{for } i = 1, 2, \ldots, n.$$

We use the orthonormal basis of eigenvector fields v1, v2, …, vnd to decompose x as

$$x = \sum_{l=1}^{nd} a_l v_l, \qquad (7.2)$$

where al = xΤvl. This concludes the preprocessing computations.

Suppose y ∈ 𝒴 is a “new” out-of-sample point. First, we perform the local PCA step to find a p × d matrix, denoted Oy, whose columns form an orthonormal basis to a d-dimensional subspace of ℝp that approximates the embedded tangent plane ι*Tyℳ. The local PCA step uses only the neighbors of y among the points in 𝒳 (but not in 𝒴) inside a ball of radius √∊PCA centered at y.

Next, we use the alignment process to compute the d × d orthonormal matrix Oy,i between xi and y by setting

$$O_{y,i} = \operatorname*{argmin}_{O \in O(d)} \|O_y^{\mathsf{T}} O_i - O\|_{\mathrm{HS}}.$$

Notice that the eigenvector fields satisfy

$$v_l(i) = \frac{1}{\lambda_l} \frac{\sum_{j=1}^{n} K_{\epsilon}(x_i, x_j)\, O_{ij} v_l(j)}{\sum_{j=1}^{n} K_{\epsilon}(x_i, x_j)}.$$

We denote the extension of vl to the point y by ṽl(y) and define it as

$$\tilde{v}_l(y) = \frac{1}{\lambda_l} \frac{\sum_{j=1}^{n} K_{\epsilon}(y, x_j)\, O_{y,j} v_l(j)}{\sum_{j=1}^{n} K_{\epsilon}(y, x_j)}. \qquad (7.3)$$

To finish the extrapolation problem, we denote the extension of x to y by x̃(y) and define it as

$$\tilde{x}(y) = \sum_{l=1}^{m(\delta)} a_l \tilde{v}_l(y), \qquad (7.4)$$

where m(δ) = max{l : |λl| > δ} and δ > 0 is some fixed parameter to ensure the numerical stability of the extension procedure (due to the division by λl in (7.3)); 1/δ can be regarded as the condition number of the extension procedure.
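The whole out-of-sample step can be summarized in a short sketch; the interface below (precomputed bases, eigenpairs, and coefficients passed in as arrays) is an assumption made for illustration and is not the authors' implementation.

```python
import numpy as np

def nystrom_extend(y, O_y, X, O_list, eigvals, eigvecs, a, eps, kernel, m):
    """Nystrom extension of a vector field to an out-of-sample point y (Section 7 sketch).

    y        : (p,) new point;  O_y : (p, d) local PCA basis at y (computed as in Section 2)
    X        : (n, p) in-sample points;  O_list : their (p, d) local PCA bases O_i
    eigvals, eigvecs : eigenvalues lambda_l and eigenvectors v_l of D^{-1} S (columns of eigvecs)
    a        : coefficients a_l = x^T v_l of the observed field (eq. (7.2))
    kernel   : scalar kernel K supported on [0, 1];  m : number of retained eigenpairs
    """
    n = X.shape[0]
    d = O_y.shape[1]
    num = np.zeros((d, m))
    den = 0.0
    for j in range(n):
        w = kernel(np.linalg.norm(y - X[j]) / np.sqrt(eps))
        if w <= 0.0:
            continue
        U, _, Vt = np.linalg.svd(O_y.T @ O_list[j])      # alignment: O_{y,j} is the closest
        O_yj = U @ Vt                                    # orthogonal matrix to O_y^T O_j
        num += w * (O_yj @ eigvecs[j*d:(j+1)*d, :m])     # accumulate K * O_{y,j} v_l(j)
        den += w
    v_tilde = num / (den * eigvals[:m][None, :])         # extended eigenvector fields, eq. (7.3)
    x_y = v_tilde @ a[:m]                                # extended field in the O_y basis, eq. (7.4)
    return O_y @ x_y                                     # approximate ambient vector iota_* X(y)
```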

8 The Continuous Case: Heat Kernels

As discussed earlier, in the limit n → ∞ and ∊ → 0 considered in (5.2), the normalized graph Laplacian converges to the Laplace-Beltrami operator, which is the generator of the heat kernel for functions (0-forms). Similarly, in the limit n → ∞ considered in (5.3), we get the connection Laplacian operator, which is the generator of a heat kernel for vector fields (or 1-forms). The connection Laplacian ▽2 is a self-adjoint, second-order elliptic operator defined over the tangent bundle Tℳ. It is well-known [16] that the spectrum of ▽2 is discrete inside ℝ, and the only possible accumulation point is −∞. We will denote the spectrum by {−λk}k=0,1,2,…, where 0 ≤ λ0 ≤ λ1 ≤ ⋯. From classical elliptic theory (see, for example, [16]), we know that et▽² has the kernel

$$k_t(x,y) = \sum_{n=0}^{\infty} e^{-\lambda_n t}\, X_n(x) \otimes \overline{X_n(y)},$$

where ▽2Xn = −λnXn. Also, the eigenvector fields Xn of ▽2 form an orthonormal basis of L2(Tℳ). In the continuous setup, we define the vector diffusion distance between x, y ∈ ℳ using ‖kt(x, y)‖HS2. An explicit calculation gives

$$\|k_t(x,y)\|_{\mathrm{HS}}^2 = \mathrm{Tr}\!\left[k_t(x,y)\, k_t(x,y)^*\right] = \sum_{n,m=0}^{\infty} e^{-(\lambda_n + \lambda_m)t}\, \langle X_n(x), X_m(x)\rangle\, \overline{\langle X_n(y), X_m(y)\rangle}. \qquad (8.1)$$

It is well-known that the heat kernel kt(x, y) is smooth in x and y and analytic in t [16], so for t > 0 we can define a family of vector diffusion mappings Vt that map any x ∈ ℳ into the Hilbert space ℓ2 by

$$V_t : x \mapsto \left(e^{-(\lambda_n + \lambda_m)t/2}\, \langle X_n(x), X_m(x)\rangle\right)_{n,m=0}^{\infty}, \qquad (8.2)$$

which satisfies

$$\|k_t(x,y)\|_{\mathrm{HS}}^2 = \langle V_t(x), V_t(y)\rangle_{\ell^2}. \qquad (8.3)$$

The vector diffusion distance dVDM,t(x, y) between x ∈ ℳ and y ∈ ℳ is defined as

$$d_{\mathrm{VDM},t}(x,y) := \|V_t(x) - V_t(y)\|_{\ell^2}, \qquad (8.4)$$

which is clearly a distance function over ℳ. In practice, due to the decay of e−(λn+λm)t, only pairs (n, m) for which λn + λm is not too large are needed to get a good approximation of this vector diffusion distance. As in the discrete case, the dot products 〈Xn(x), Xm(x)〉 are invariant to the choice of basis for the tangent space at x.

We now study some properties of the vector diffusion map Vt (8.2). First, we claim that for all t > 0, the vector diffusion mapping Vt is an embedding of the compact Riemannian manifold ℳ into ℓ2.

THEOREM 8.1. Given a d-dimensional closed Riemannian manifold (ℳ, g) and an orthonormal basis {Xn}n=0,1,2,… of L2(Tℳ) composed of the eigenvector fields of the connection Laplacian ▽2, then for any t > 0, the vector diffusion map Vt is a diffeomorphic embedding of ℳ in ℓ2.

PROOF. We show that Vt: ℳ → ℓ2 is continuous in x by noting that

$$\|V_t(x) - V_t(y)\|_{\ell^2}^2 = \sum_{n,m=0}^{\infty} e^{-(\lambda_n + \lambda_m)t}\left(\langle X_n(x), X_m(x)\rangle - \langle X_n(y), X_m(y)\rangle\right)^2 = \mathrm{Tr}\!\left(k_t(x,x)k_t(x,x)^*\right) + \mathrm{Tr}\!\left(k_t(y,y)k_t(y,y)^*\right) - 2\,\mathrm{Tr}\!\left(k_t(x,y)k_t(x,y)^*\right). \qquad (8.5)$$

From the continuity of the kernel kt(x, y) it is clear that ‖Vt(x) − Vt(y)‖ℓ²2 → 0 as y → x. Since ℳ is compact, it follows that Vt(ℳ) is compact in ℓ2. Finally, we show that Vt is one-to-one. Fix x ≠ y and a smooth vector field X that satisfies 〈X(x), X(x)〉 ≠ 〈X(y), X(y)〉. Since the eigenvector fields {Xn}n=0,1,2,… form a basis for L2(Tℳ), we have

$$X(z) = \sum_{n=0}^{\infty} c_n X_n(z) \quad\text{for all } z \in \mathcal{M},$$

where cn = ∫ℳ 〈X, Xn〉 dV. As a result,

$$\langle X(z), X(z)\rangle = \sum_{n,m=0}^{\infty} c_n c_m \langle X_n(z), X_m(z)\rangle.$$

Since 〈X(x), X(x)〉 ≠ 〈X(y), X(y)〉, there exist n, m ∈ ℕ such that

$$\langle X_n(x), X_m(x)\rangle \neq \langle X_n(y), X_m(y)\rangle,$$

which shows that Vt (x) ≠ Vt (y); i.e., Vt is one-to-one. From the fact that the map Vt is continuous and one-to-one from ℳ, which is compact, onto Vt (ℳ), we conclude that Vt is an embedding.

Next, we demonstrate the asymptotic behavior of the vector diffusion distance dVDM,t(x, y) and the diffusion distance dDM,t (x, y) when t is small and x is close to y. The following theorem shows that in this asymptotic limit both the vector diffusion distance and the diffusion distance behave like the geodesic distance.

THEOREM 8.2. Let (ℳ, g) be a smooth d-dimensional closed Riemannian manifold. Suppose x, y ∈ ℳ are such that x = expy(v), where v ∈ Tyℳ. For any t > 0, when ‖v‖2 ≪ t ≪ 1 we have the following asymptotic expansion of the vector diffusion distance:

$$d_{\mathrm{VDM},t}^2(x,y) = \frac{d}{(4\pi)^d}\,\frac{\|v\|^2}{t^{d+1}} + O\!\left(\frac{\|v\|^2}{t^{d}}\right).$$

Similarly, when ‖v‖2 ≪ t ≪ 1, we have the following asymptotic expansion of the diffusion distance:

$$d_{\mathrm{DM},t}^2(x,y) = \frac{1}{(4\pi)^{d/2}}\,\frac{\|v\|^2}{2\,t^{d/2+1}} + O\!\left(\frac{\|v\|^2}{t^{d/2}}\right).$$

PROOF. Fixing y and a normal coordinate around y, we define j(x, y) = |det(dv expy)|, where x = expy(v), v ∈ Tyℳ. Suppose ∥v∥ is small enough so that x = expy(v) is away from the cut locus of y. It is well-known that the heat kernel kt(x, y) for the connection Laplacian ▽2 over the vector bundle ℰ possesses the following asymptotic expansion when x and y are close [5, p. 84] or [11]:

$$\left\|\partial_t^k\!\left(k_t(x,y) - k_t^N(x,y)\right)\right\|_l = O\!\left(t^{N - d/2 - l/2 - k}\right), \qquad (8.6)$$

where ∥·∥l is the Cl norm,

$$k_t^N(x,y) := (4\pi t)^{-\frac{d}{2}}\, e^{-\frac{\|v\|^2}{4t}}\, j(x,y)^{-\frac{1}{2}} \sum_{i=0}^{N} t^i\, \Phi_i(x,y), \qquad (8.7)$$

N > d/2, and Φi is a smooth section of the vector bundle ℰ ⊗ ℰ over ℳ × ℳ. Moreover, Φ0(x, y) = Px, y is the parallel transport from ℰy to ℰx. In the VDM setup, we take ℰ = Tℳ, the tangent bundle of ℳ. Also, by [5, prop. 1.28], we have the following expansion:

$$j(x,y) = 1 + \frac{\mathrm{Ric}(v,v)}{6} + O(\|v\|^3). \qquad (8.8)$$

Equations (8.7) and (8.8) lead to the following expansion under the assumption ∥v∥2 ≪ t:

$$\begin{aligned}
\mathrm{Tr}\!\left(k_t(x,y)k_t(x,y)^*\right) &= (4\pi t)^{-d}\, e^{-\frac{\|v\|^2}{2t}} \left(1 + \frac{\mathrm{Ric}(v,v)}{6} + O(\|v\|^3)\right)^{-1} \times \mathrm{Tr}\!\left((P_{x,y} + O(t))(P_{x,y} + O(t))^*\right) \\
&= (4\pi t)^{-d}\, e^{-\frac{\|v\|^2}{2t}} \left(1 - \frac{\mathrm{Ric}(v,v)}{6} + O(\|v\|^3)\right)(d + O(t)) \\
&= (d + O(t))\,(4\pi t)^{-d} \left(1 - \frac{\|v\|^2}{2t} + O\!\left(\frac{\|v\|^4}{t^2}\right)\right).
\end{aligned}$$

In particular, for ∥v∥ = 0 we have

$$\mathrm{Tr}\!\left(k_t(x,x)k_t(x,x)^*\right) = (d + O(t))\,(4\pi t)^{-d}.$$

Thus, for ∥v∥2 ≪ t ≪ 1, we have

$$d_{\mathrm{VDM},t}^2(x,y) = \mathrm{Tr}\!\left(k_t(x,x)k_t(x,x)^*\right) + \mathrm{Tr}\!\left(k_t(y,y)k_t(y,y)^*\right) - 2\,\mathrm{Tr}\!\left(k_t(x,y)k_t(x,y)^*\right) = \frac{d}{(4\pi)^d}\,\frac{\|v\|^2}{t^{d+1}} + O\!\left(\frac{\|v\|^2}{t^{d}}\right). \qquad (8.9)$$

By the same argument we can carry out the asymptotic expansion of the diffusion distance dDM,t (x, y). Denote the eigenfunctions and eigenvalues of the Laplace-Beltrami operator Δ by φn and µn. We can rewrite the diffusion distance as follows:

$$d_{\mathrm{DM},t}^2(x,y) = \sum_{n=1}^{\infty} e^{-\mu_n t}\left(\varphi_n(x) - \varphi_n(y)\right)^2 = \tilde{k}_t(x,x) + \tilde{k}_t(y,y) - 2\tilde{k}_t(x,y), \qquad (8.10)$$

where k̃t is the heat kernel of the Laplace-Beltrami operator. Note that the Laplace-Beltrami operator is equal to the connection Laplacian operator defined over the trivial line bundle over ℳ. As a result, equation (8.7) also describes the asymptotic expansion of the heat kernel for the Laplace-Beltrami operator as

$$\tilde{k}_t(x,y) = (4\pi t)^{-\frac{d}{2}}\, e^{-\frac{\|v\|^2}{4t}} \left(1 + \frac{\mathrm{Ric}(v,v)}{6} + O(\|v\|^3)\right)^{-\frac{1}{2}} (1 + O(t)).$$

Putting these facts together, we obtain

$$d_{\mathrm{DM},t}^2(x,y) = \frac{1}{(4\pi)^{d/2}}\,\frac{\|v\|^2}{2\,t^{d/2+1}} + O\!\left(\frac{\|v\|^2}{t^{d/2}}\right), \qquad (8.11)$$

when ∥v∥2 ≪ t ≪ 1.

9 Application of VDM to Cryo-Electron Microscopy

In addition to being a general framework for data analysis and manifold learning, VDM is useful for performing robust multireference rotational alignment of objects, such as one-dimensional periodic signals, two-dimensional images, and three-dimensional shapes. In this section, we briefly describe the application of VDM to a particular multireference rotational alignment problem of two-dimensional images that arise in the field of cryo-electron microscopy (cryo-EM). A more comprehensive study of this problem can be found in [19, 37]. It can be regarded as a prototypical multireference alignment problem, and we expect many other multireference alignment problems that arise in areas such as computer vision and computer graphics to benefit from the proposed approach.

The goal in cryo-EM [14] is to determine three-dimensional macromolecular structures from noisy projection images taken at unknown random orientations by an electron microscope, i.e., a random computed tomography (CT). Determining three-dimensional macromolecular structures for large biological molecules remains vitally important, as witnessed, for example, by the 2003 Chemistry Nobel Prize, co-awarded to R. MacKinnon for resolving the three-dimensional structure of the Shaker K+ channel protein, and by the 2009 Chemistry Nobel Prize, awarded to V. Ramakrishnan, T. Steitz, and A. Yonath for studies of the structure and function of the ribosome. The standard procedure for structure determination of large molecules is X-ray crystallography. The challenge in this method is often more in the crystallization itself than in the interpretation of the X-ray results, since many large proteins have so far withstood all attempts to crystallize them.

In cryo-EM, an alternative to X-ray crystallography, the sample of macromolecules is rapidly frozen in an ice layer so thin that their tomographic projections are typically disjoint; this seems the most promising alternative for large molecules that defy crystallization. The cryo-EM imaging process produces a large collection of tomographic projections of the same molecule, corresponding to different and unknown projection orientations. The goal is to reconstruct the three-dimensional structure of the molecule from such unlabeled projection images, where data sets typically range from 104 to 105 projection images whose size is roughly 100 × 100 pixels. The intensity of the pixels in a given projection image is proportional to the line integrals of the electric potential induced by the molecule along the path of the imaging electrons (see Figure 9.1). The highly intense electron beam destroys the frozen molecule, and it is therefore impractical to take projection images of the same molecule at known different directions as in the case of classical CT. In other words, a single molecule can be imaged only once, rendering an extremely low signal-to-noise ratio (SNR) for the images (see Figure 9.2 for a sample of real microscope images), mostly due to shot noise induced by the maximal allowed electron dose (other sources of noise include the varying width of the ice layer and partial knowledge of the contrast function of the microscope). In the basic homogeneity setting considered hereafter, all imaged molecules are assumed to have the exact same structure; they differ only by their spatial rotation. Every image is a projection of the same molecule but at an unknown random three-dimensional rotation, and the cryo-EM problem is to find the three-dimensional structure of the molecule from a collection of noisy projection images.

Figure 9.1.

Figure 9.1

Schematic drawing of the imaging process: every projection image corresponds to some unknown three-dimensional rotation of the unknown molecule.

Figure 9.2.

Figure 9.2

A collection of four real electron microscope images of the E. coli 50S ribosomal subunit; courtesy of Dr. Fred Sigworth (Yale Medical School).

The rotation group SO(3) is the group of all orientation-preserving orthogonal transformations about the origin of the three-dimensional euclidean space ℝ3 under the operation of composition. Any three-dimensional rotation can be expressed using a 3 × 3 orthogonal matrix

$$R = \begin{pmatrix} | & | & | \\ R^1 & R^2 & R^3 \\ | & | & | \end{pmatrix}$$

satisfying RRΤ = RΤR = I3×3 and det R = 1. The column vectors R1, R2, R3 of R form an orthonormal basis to ℝ3. To each projection image P there corresponds a 3 × 3 unknown rotation matrix R describing its orientation (see Figure 9.1). Excluding the contribution of noise, the intensity P(x, y) of the pixel located at (x, y) in the image plane corresponds to the line integral of the electric potential induced by the molecule along the path of the imaging electrons, that is,

P(x,y)=ϕ(xR1+yR2+zR3)dz (9.1)

where φ : ℝ3 ↦ ℝ is the electric potential of the molecule in some fixed “laboratory” coordinate system. The projection operator (9.1) is also known as the X-ray transform [29].

We therefore identify the third column R3 of R as the imaging direction, also known as the viewing angle of the molecule. The first two columns R1 and R2 form an orthonormal basis for the plane in ℝ3 perpendicular to the viewing angle R3. All clean projection images of the molecule that share the same viewing angle look the same up to some in-plane rotation. That is, if Ri and Rj are two rotations with the same viewing angle Ri3=Rj3,thenRi1,Ri2andRj1,Rj2 are two orthonormal bases for the same plane. On the other hand, two rotations with opposite viewing angles Ri3=Rj3 give rise to two projection images that are the same after reflection (mirroring) and some in-plane rotation.

As projection images in cryo-EM have extremely low SNR, a crucial initial step in all reconstruction methods is “class averaging” [14, 41]. Class averaging is the grouping of a large data set of n noisy raw projection images P1, P2; …, Pn into clusters such that images within a single cluster have similar viewing angles (it is possible to artificially double the number of projection images by including all mirrored images). Averaging rotationally aligned noisy images within each cluster results in “class averages”; these are images that enjoy a higher SNR and are used in later cryo-EM procedures such as the angular reconstitution procedure [40] that requires better-quality images. Finding consistent class averages is challenging due to the high level of noise in the raw images as well as the large size of the image data set. A sketch of the class-averaging procedure is shown in Figure 9.3.

Figure 9.3.

Figure 9.3

(a) A clean simulated projection image of the ribosomal subunit generated from its known volume. (b) Noisy instance of (a), denoted Pi, obtained by the addition of white Gaussian noise. For the simulated images we chose the SNR to be higher than that of experimental images in order for image features to be clearly visible. (c) Noisy projection, denoted Pj, taken at the same viewing angle but with a different in-plane rotation. (d) Averaging the noisy images (b) and (c) after in-plane rotational alignment. The class average of the two images has a higher SNR than that of the noisy images (b) and (c), and it has better similarity with the clean image (a).

Penczek, Zhu, and Frank [31] introduced the rotationally invariant K-means clustering procedure to identify images that have similar viewing angles. Their rotationally invariant distance dRID(i, j) between image Pi and image Pj is defined as the euclidean distance between the images when they are optimally aligned with respect to in-plane rotations (assuming the images are centered)

dRID(i,j)=minθ[0,2π)PiR(θ)Pj, (9.2)

, where R(θ) is the rotation operator of an image by an angle θ in the counterclockwise direction. Prior to computing the invariant distances of (9.2), a common practice is to center all images by correlating them with their total average 1nΣi=1nPi, which is approximately radial (i.e., has little angular variation) due to the randomness in the rotation. The resulting centers usually miss the true centers by only a few pixels (as can be validated in simulations during the refinement procedure). Therefore, like [31], we also choose to focus on the more challenging problem of rotational alignment by assuming that the images are properly centered, while the problem of translational alignment can be solved later by solving an overdetermined linear system.

It is worth noting that the specific choice of metric to measure proximity between images can make a big difference in class averaging. The cross-correlation and euclidean distance (9.2) are by no means optimal measures of proximity. In practice, it is common to denoise the images prior to computing their pairwise distances. Although the discussion that follows is independent of the particular choice of filter or distance metric, we emphasize that filtering can have a dramatic effect on finding meaningful class averages.

The invariant distance between noisy images that share the same viewing angle (with perhaps a different in-plane rotation) is expected to be small. Ideally, all neighboring images of some reference image Pi in a small invariant distance ball centered at Pi should have similar viewing angles, and averaging such neighboring images (after proper rotational alignment) would amplify the signal and diminish the noise.

Unfortunately, due to the low SNR, it often happens that two images of completely different viewing angles have a small invariant distance. This can happen when the realizations of the noise in the two images match well for some random in-plane rotational angle, leading to spurious neighbor identification. Therefore, averaging the nearest-neighbor images can sometimes yield a poor estimate of the true signal in the reference image.

The histograms of Figure 9.5 demonstrate the ability of small rotationally invariant distances to identify images with similar viewing directions. For each image we use the rotationally invariant distances to find its 40 nearest neighbors among the entire set of n = 40;000 images. In our simulation we know the original viewing directions, so for each image we compute the angles (in degrees) between the viewing direction of the image and the viewing directions of its 40 neighbors. Small angles indicate successful identification of “true” neighbors that belong to a small spherical cap, while large angles correspond to outliers. We see that for SNR=12 there are no outliers, and all the viewing directions of the neighbors belong to a spherical cap whose opening angle is about 8°. However, for lower values of the SNR, there are outliers, indicated by arbitrarily large angles (all the way to 180°).

Figure 9.5.

Figure 9.5

Histograms of the angle (in degrees, x-axis) between the viewing directions of 40,000 images and the viewing directions of their 40 nearest neighboring images as found by computing the rotationally invariant distances. (Courtesy of Zhizhen Zhao, Princeton University)

Clustering algorithms, such as the K-means algorithm, perform much better than naive nearest-neighbors averaging, because they take into account all pairwise distances, not just distances to the reference image. Such clustering procedures are based on the philosophy that images that share a similar viewing angle with the reference image are expected to have a small invariant distance not only to the reference image but also to all other images with similar viewing angles. This observation was utilized in the rotationally invariant K-means clustering algorithm [31]. Still, due to noise, the rotationally invariant K-means clustering algorithm may suffer from misidentifications at the low SNR values present in experimental data.

VDM is a natural algorithmic framework for the class-averaging problem, as it can further improve the detection of neighboring images even at lower SNR values. The rotationally invariant distance neglects an important piece of information, namely, the optimal angle that realizes the best rotational alignment in (9.2):

θij=argminθ[0,2π)PiR(θ)Pj,i,j=1,2,,n. (9.3)

In VDM, we use the optimal in-plane rotation angles θij to define the orthogonal transformations Oij and to construct the matrix S in (3.1). The eigenvectors and eigenvalues of D−1S (other normalizations of S are also possible) are then used to define the vector diffusion distances between images.

This VDM based classification method has proven to be quite powerful in practice. We applied it to a set of n = 40;000 noisy images with SNR=164. For every image we found the 40 nearest neighbors using the vector diffusion metric. In the simulation we knew the viewing directions of the images, and we computed for each pair of neighbors the angle (in degrees) between their viewing directions. The histogram of these angles is shown in Figure 9.6 (left panel). About 92% of the identified images belong to a small spherical cap of opening angle 20° whereas this percentage is only about 65% when neighbors are identified by the rotationally invariant distances (right panel). We remark that for SNR=150, the percentage of correctly identified images by the VDM method goes up to about 98%.

Figure 9.6.

Figure 9.6

SNR =164: Histogram of the angles (x-axis, in degrees) between the viewing directions of each image (out of 40;000) and its 40 neighboring images. Left: neighbors are post-identified using vector diffusion distances. Right: neighbors are identified using the original rotationally invariant distances dRID.

The main advantage of the algorithm presented here is that it successfully identifies images with similar viewing angles even in the presence of a large number of spurious neighbors, that is, even when many pairs of images with viewing angles that are far apart have relatively small rotationally invariant distances. In other words, the VDM-based algorithm is shown to be robust to outliers.

10 Summary and Discussion

This paper introduced vector diffusion maps, an algorithmic and mathematical framework for analyzing data sets where scalar affinities between data points are accompanied by orthogonal transformations. The consistency among the orthogonal transformations along different paths that connect any fixed pair of data points is used to define an affinity between them. We showed that this affinity is equivalent to an inner product, giving rise to the embedding of the data points in a Hilbert space and to the definition of distances between data points, to which we referred as vector diffusion distances.

For data sets of images, the orthogonal transformations and the scalar affinities are naturally obtained via the procedure of optimal registration. The registration process seeks to find the optimal alignment of two images over some class of transformations (also known as deformations), such as rotations, reflections, translations, and dilations. For the purpose of vector diffusion mapping, we extract from the optimal deformation only the corresponding orthogonal transformation (rotation and reflection). We demonstrated the usefulness of the vector diffusion map framework in the organization of noisy cryo-electron microscopy images, an important step towards resolving three-dimensional structures of macromolecules. Optimal registration is often used in various mainstream problems in computer vision and computer graphics, for example, in optimal matching of three-dimensional shapes. We therefore expect the vector diffusion map framework to become a useful tool in such applications.

In the case of manifold learning, where the data set is a collection of points in a high-dimensional euclidean space, but with a low-dimensional Riemannian manifold structure, we detailed the construction of the orthogonal transformations via the optimal alignment of the orthonormal bases of the tangent spaces. These bases are found using the classical procedure of PCA. Under certain mild conditions about the sampling process of the manifold, we proved that the orthogonal transformation obtained by the alignment procedure approximates the parallel transport operator between the tangent spaces. The proof required careful analysis of the local PCA step, which we believe is interesting in its own right. Furthermore, we proved that if the manifold is sampled uniformly, then the matrix that lies at the heart of the vector diffusion map framework approximates the connection Laplacian operator. Following spectral graph theory terminology, we call that matrix the connection Laplacian of the graph. Using different normalizations of the matrix we proved convergence to the connection Laplacian operator also for the case of nonuniform sampling. We showed that the vector diffusion mapping is an embedding and proved its relation with the geodesic distance using the asymptotic expansion of the heat kernel for vector fields. These results provide the mathematical foundation for the algorithmic framework that underlies the vector diffusion mapping.

We expect many possible extensions and generalizations of the vector diffusion mapping framework. We conclude by mentioning a few of them.

  • Topology of the data. In [36] we showed how the vector diffusion mapping can determine if a manifold is orientable or nonorientable, and in the latter case to embed its double covering in a euclidean space. To that end we used the information in the determinant of the optimal orthogonal transformation between bases of nearby tangent spaces. In other words, we used just the optimal reflection between two orthonormal bases. This simple example shows that vector diffusion mapping can be used to extract topological information from the point cloud. We expect more topological information can be extracted using appropriate modifications of the vector diffusion mapping.

  • Hodge and higher-order Laplacians. Using tensor products of the optimal orthogonal transformations it is possible to construct higher-order connection Laplacians that act on p-forms (p ≥ 1). The index theorem [16] relates topological structure with geometrical structure. For example, the so-called Betti numbers are related to the multiplicities of the harmonic p-forms of the Hodge Laplacian. For the extraction of topological information it would therefore be useful to modify our construction in order to approximate the Hodge Laplacian instead of the connection Laplacian.

  • Multiscale, sparse, and robust PCA. In the manifold learning case, an important step of our algorithm is local PCA for estimating the bases for tangent spaces at different data points. In the description of the algorithm, a single-scale parameter ∊PCA is used for all data points. It is conceivable that a better estimation can be obtained by choosing a different, location-dependent scale parameter. A better estimation of the tangent space Txiℳ may be obtained by using a location-dependent scale parameter ∊PCA,i due to several reasons: nonuniform sampling of the manifold, varying curvature of the manifold, and global effects such as different pieces of the manifold that are almost touching at some points (i.e., varying the “condition number” of the manifold). Choosing the correct scale was recently considered in [28], where a multiscale approach was taken to resolve the optimal scale. We recommend the incorporation of such multiscale PCA approaches into the vector diffusion mapping framework. Another difficulty that we may face when dealing with real-life data sets is that the underlying assumption about the data points being located exactly on a low-dimensional manifold does not necessarily hold. In practice, the data points are expected to reside off the manifold, either due to measurement noise or due to the imperfection of the low-dimensional manifold model assumption. It is therefore necessary to estimate the tangent spaces in the presence of noise. Noise is a limiting factor for successful estimation of the tangent space, especially when the data set is embedded in a high-dimensional space and noise affects all coordinates [23]. We expect recent methods for robust PCA [7] and sparse PCA [6, 24] to improve the estimation of the tangent spaces and as a result to become useful in the vector diffusion map framework.

  • Random matrix theory and noise sensitivity. The matrix S that lies at the heart of the vector diffusion map is a block matrix whose blocks are either d × d orthogonal matrices Oij or the zero blocks. We anticipate that for some applications the measurement of Oij would be imprecise and noisy. In such cases, the matrix S can be viewed as a random matrix, and we expect tools from random matrix theory to be useful in analyzing the noise sensitivity of its eigenvectors and eigenvalues. The noise model may also allow for outliers, for example, orthogonal matrices that are uniformly distributed over the orthogonal group O(d) (according to the Haar measure). Notice that the expected value of such random orthogonal matrices is 0, which leads to robustness of the eigenvectors and eigenvalues even in the presence of a large number of outliers (see, for example, the random matrix theory analysis in [35]).

  • Compact and noncompact groups and their matrix representation. As mentioned earlier, the vector diffusion mapping is a natural framework to organize data sets for which the affinities and transformations are obtained from an optimal alignment process over some class of transformations (deformations). In this paper we focused on utilizing orthogonal transformations. At this point the reader has probably asked herself the following question: Is the method limited to orthogonal transformations, or is it possible to utilize other groups of transformations such as translations, dilations, and more? We note that the orthogonal group O(d) is a compact group that has a matrix representation and remark that the vector diffusion mapping framework can be extended to such groups of transformations without much difficulty. However, the extension to noncompact groups, such as the euclidean group of rigid transformation, the general linear group of invertible matrices, and the special linear group is less obvious. Such groups arise naturally in various applications, rendering the importance of extending the vector diffusion mapping to the case of noncompact groups.

Figure 2.1.

Figure 2.1

The orthonormal basis of the tangent plane Txiℳ is determined by local PCA using data points inside a euclidean ball of radius εPCA centered at xi. The bases for Txiℳ and Txjℳ are optimally aligned by an orthogonal transformation Oij that can be viewed as a mapping from Txjℳ to Txiℳ.

Figure 9.4.

Figure 9.4

Simulated projection with various levels of additive Gaussian white noise.

Acknowledgment

A. Singer was partially supported by grant DMS-0914892 from the National Science Foundation, by grant FA9550-09-1-0551 from the Air Force Office of Scientific Research, by grant R01GM090200 from the National Institute of General Medical Sciences, and by the Alfred P. Sloan Foundation. H.-T. Wu acknowledges support by Federal Highway Administration Grant DTFH61-08-C-00028. The authors would like to thank Charles Fefferman for various discussions regarding this work and the anonymous referee for his useful comments and suggestions. They also express gratitude to the audiences of the seminars at Tel Aviv University, the Weizmann Institute of Science, Princeton University, Yale University, Stanford University, University of Pennsylvania, Duke University, Johns Hopkins University, and the Air Force, where parts of this work were presented in 2010 and 2011.

Appendix A

Some Differential Geometry Background

The purpose of this appendix is to provide the required mathematical background for readers who are not familiar with concepts such as the parallel transport operator, connection, and the connection Laplacian. We illustrate these concepts by considering a surface ℳ embedded in ℝ3.

Given a function f (x) : ℝ3 → ℝ, its gradient vector field is given by

f(fx,fy,fz).

Through the gradient, we can find the rate of change of f at x ∈ ℝ3 in a given unit vector v ∈ ℝ3, using the directional derivative:

f(x)(v)limt0f(x+tv)f(x)t.

Define ▽vf(x) := ▽f(x)(v).

Let X be a vector field on ℝ3,

X(x,y,z)=(f1(x,y,z),f2(x,y,z),f3(x,y,z)).

It is natural to extend the derivative notion to a given vector field X at x ∈ ℝ3 by mimicking the derivative definition for functions in the following way:

limt0X(x+tv)X(x)t (A.1)

where v ∈ ℝ3. Following the same notation for the directional derivative of a function, we denote this limit by ▽vX(x). This quantity tells us that at x, following the direction v, we compare the vector field at two points x and x + tv, and see how the vector field changes. While this definition looks good at first sight, we now explain that it has certain shortcomings that need to be fixed in order to generalize it to the case of a surface embedded in ℝ3.

Consider a two-dimensional smooth surface ℳ embedded in ℝ3 by ι. Fix a point x ∈ ℳ and a smooth curve γ(t) : (−∊, ∊); → ℳ ⊂ ℝ3 where ∊ ≪ 1 and γ(0) = x. We call γʹ(0) ∈ ℝ3 a tangent vector to ℳ at x. The two-dimensional subspace in ℝ3 spanned by the collection of all tangent vectors to ℳ at x is defined to be the tangent plane at x and denoted by Txℳ;7 see Figure A.1 (left panel). Having defined the tangent plane at each point x ∈ ℳ, we define a vector field X over ℳ to be a differentiable map that maps x to a tangent vector in Txℳ.8

Figure A.1.

Figure A.1

Left: a tangent plane and a curve γ. Middle: a vector field. Right: the covariant derivative.

We now generalize the definition of the derivative of a vector field over ℝ3 (A.1) to define the derivative of a vector field over ℳ. The first difficulty we face is how to make sense of “X(x +tv),” since x + tv does not belong to ℳ. This difficulty can be tackled easily by changing the definition (A.1) a bit by considering the curve γ: (−∊, ∊) → ℳ ⊂ ℝ3 so that γ(0) = x and γʹ(0) = v. Thus, (A.1) becomes

limt0X(γ(t))X(γ(0))t (A.2)

where v ∈ ℝ3. In ℳ, the existence of the curve γ: (−∊, ∊) → ℝ3 and γ(0) = x and γʹ(0) = v is guaranteed by the classical ordinary differential equation theory. However, (A.2) still cannot be generalized to ℳ directly even though X(γ(t)) is well-defined. The difficulty we face here is how to compare X(γ(t)) and X(x), that is, how to make sense of the subtraction X(γ(t)) − X(γ(0)). It is not obvious since a priori we do not know how Tγ(t)ℳ and Tγ(0)ℳ are related. The way we proceed is by defining an important notion in differential geometry called “parallel transport,” which plays an essential role in our VDM framework.

Fix a point x ∈ ℳ and a vector field X on ℳ, and consider a parametrized curve γ: (−∊, ∊) → ℳ so that γ(0) = x. Define a vector-valued function V : (−∊, ∊) → ℝ3 by restricting X to γ, that is, V(t) = X(γ(t)). The derivative of V is well-defined as usual:

dVdt(h)limt0V(h+t)V(h)t,

where h ∈ (−∊; ∊). The covariant derivative DVdt(h) is defined as the projection of dVdt(h) onto Tγ(h)ℳ. Then, using the definition of DVdt(h), we consider the following equation:

{DWdt(t)=0,W(0)=w,

where wTγ(0)ℳ. The solution W(t) exists by the classical ordinary differential equation theory. The solution W(t) along γ(t) is called the parallel vector field along the curve γ(t), and we also call W(t) the parallel transport of w along the curve γ(t) and denote W(t) = Pγ(t),γ (0)w.

We come back to address the initial problem: how to define the “derivative” of a given vector field over a surface ℳ. We define the covariant derivative of a given vector field X over ℳ as follows:

vX(x)=limt0Pγ(0),γ(t)X(γ(t))X(γ(0))t, (A.3)

where γ: (−∊, ∊) → ℳ with γ(0) = x ∈ ℳ, γʹ (0) = vTγ(0)ℳ. This definition says that if we want to analyze how a given vector field at x ∈ ℳ changes along the direction v, we choose a curve γ so that γ(0) = x and γʹ(0) = v, and then “transport” the vector field value at point γ(t) to γ(0) = x so that the comparison of the two tangent planes makes sense. The key point of the whole story is that without applying parallel transport to transport the vector at point γ(t) to Tγ(0)ℳ, then the subtraction X(γ(t)) – X(γ(0)) ∈ ℝ3 in general does not live on Txℳ, which distorts the notion of derivative. For comparison, let us reconsider definition (A.1). Since at each point x ∈ ℝ3, the tangent plane at x is Tx3 = ℝ3, the subtraction X(x + tv) – X(x) always makes sense. To be more precise, the true meaning of X(x+tv) is Pγ(0),γ(t)X(γ(t)), where Pγ(0),γ(t) = id, and γ(t) = x + tv.

With the above definition, when X and Y are two vector fields on ℳ, we define ▽XY to be a new vector field on ℳ so that

XY(x)X(x)Y.

Note that X(x) ∈ Txℳ. We call ▽ a connection on ℳ. (The notion of connection can be quite general. For our purposes, this definition is sufficient.)

Once we know how to differentiate a vector field over ℳ, it is natural to consider the second-order differentiation of a vector field. The second-order differentiation of a vector field is a natural notion in ℝ3. For example, we can define a second-order differentiation of a vector field X over ℝ3 as follows:

2XxxX+yyX+zzX, (A.4)

where x, y, z are standard unit vectors corresponding to the three axes. This definition can be generalized to a vector field over ℳ as follows:

2X(x)E1E1X(x)+E2E2X(x), (A.5)

where X is a vector field over ℳ, x ∈ ℳ, and E1, E2 are two vector fields on ℳ that satisfy ▽EiEj = 0 for i, j = 1, 2. The condition ▽EiEj = 0 (for i, j = 1, 2) is needed for technical reasons. (Please see [32] for details.) Note that in the ℝ3 case (A.4), if we set E1 = x, E2 = y, and E3 = z, then ▽EiEj = 0 for i, j = 1, 2, 3. The operator ▽2 is called the connection Laplacian operator, which lies at the heart of the VDM framework. The notion of an eigenvector field over ℳ is defined to be the solution of the following equation:

2X(x)=λX(x)

for some λ ∈ ℝ. The existence and other properties of the eigenvector fields can be found in [16]. Finally, we comment that all the above definitions can be extended to the general manifold setup without much difficulty, where, roughly speaking, a “manifold” is the higher-dimensional generalization of a surface. (We will not provide details in the manifold setting, and refer readers to standard differential geometry textbooks, such as [32].)

Appendix B

Proofs of Theorems 5.3, 5.5, and 5.6

Throughout this appendix, we adapt Assumption 5.1 and Definition 5.2 (see also Table 1.1). We divide the proof of Theorem 5.3 into four theorems, each of which has its own interest. The first theorem, Theorem B.1, states that the columns of the matrix Oi that are found by local PCA (see (2.4)) form an orthonormal basis to a d-dimensional subspace of ℝp that approximates the embedded tangent plane ι*Txiℳ. The proven order of approximation is crucial for proving Theorem 5.3. The proof of Theorem B.1 involves geometry and probability theory.

THEOREM B.1. In addition to Assumption 5.1, suppose KPCAC2([0, 1]). IfPCA = O(n−2/(d+2)) and xiεPCA then, with high probability (w.h.p.), the columns {ul(xi)}l=1d of the p × d matrix Oi, which is determined by local PCA, form an orthonormal basis to a d-dimensional subspace ofp that deviates from ι*TxibyO(εPCA3/2) in the following sense:

minOO(d)OiΤΘiOHS=O(εPCA3/2)=O(n3d+2), (B.1)

where Θi is a p × d matrix whose columns form an orthonormal basis to ι*Txiℳ. Let the minimizer in (B.1) be

O^i=argminOO(d)OiΤΘiOHS, (B.2)

and denote by Qi the p × d matrix

QiΘiO^iΤ, (B.3)

and el(xi) the lth column of Qi. The columns of Qi form an orthonormal basis to ι*Txiℳ, and

OiQiHS=O(εPCA). (B.4)

If xiεPCA then, w.h.p.

minOO(d)OiΤΘiOHS=O(εPCA1/2)=O(n1d+2).

Better convergence near the boundary is obtained forPCA = O(n−2/(d+1)), which gives

minOO(d)OiΤΘiOHS=O(εPCA3/4)=O(n32(d+1))

for xiεPCA, and

minOO(d)OiΤΘiOHS=O(εPCA5/4)=O(n52(d+1))

for xiεPCA.

Theorem B.1 may seem a bit counterintuitive at first glance. When considering data points in a ball of radius εPCA, it is expected that the order of approximation would be O(∊PCA), while equation (B.1) indicates that the order of approximation is higher (32in stead of1). The true order of approximation for the tangent space, as observed in (B.4), is still O(∊PCA). The improvement observed in (B.1) is of relevance to Theorem B.2, and we relate it to the probabilistic nature of the PCA procedure, more specifically, to a large-deviation result for the error in the law of large numbers for the covariance matrix that underlies PCA. Since the convergence of PCA is slower near the boundary, then for manifolds with boundary we need a smaller ∊PCA. Specifically, for manifolds without boundary we choose ∊PCA = O(n−2/(d+2)), and for manifolds with boundary we choose ∊PCA = O(n−2/(d+1)). We remark that the first choice also works for manifolds with boundary at the expense of a slower convergence rate.

The second theorem, Theorem B.2, states that the d × d orthonormal matrix Oij which is the output of the alignment procedure (2.5), approximates the parallel transport operator Pxi,xj from xj to xi along the geodesic connecting them. Assuming that xixj=0(ε) (here, ∊ is different than ∊PCA), the order of this approximation is O(εPCA3/2+ε3/2) whenever xi, xj are away from the boundary. This result is crucial for proving Theorem 5.3. The proof of Theorem B.2 uses Theorem B.1 and is purely geometric.

THEOREM B.2. Consider xi,xjεPCA satisfying that the geodesic distance between xi and xj is O(ε). ForPCA = O(n−2/(d+2)) w.h.p., Oij approximates Pxi, xj in the following sense:

OijX¯j=(ι*Pxi,xjX(xj),ul(xi))l=1d+O(εPCA3/2+ε3/2)for allXC3(T), (B.6)

where X¯i(ι*X(xi),ul(xi))l=1ddand{ul(xi)}l=1d is an orthonormal set determined by local PCA. For xi,xjεPCA

OijX¯j=(ι*Pxi,xjX(xj),ul(xi))l=1d+O(εPCA1/2+ε3/2)for allXC3(T). (B.7)

ForPCA = O(n−2/(d+1)), the orders ofPCA in the error terms change according to Theorem B.1.

The third theorem, Theorem B.3, states that the n × n block matrix Dα1Sα is a discrete approximation of an integral operator over smooth sections of the tangent bundle. The integral operator involves the parallel transport operator. The proof of Theorem B.3 mainly uses probability theory.

THEOREM B.3. In addition to Assumption 5.1 supposePCA = O(n−2/(d+2)). For xiεPCA we have w.h.p.

j=1,jinKε,α(xi,xj)OijX¯jj=1,jinKε,α(xi,xj)=(ι*Tε,αX(xi),ul(xi))l=1d+O(1n1/2εd/41/2+εPCA3/2+ε3/2), (B.8)

where T∊, α is defined in (5.2), X¯i(ι*X(xi),ul(xi))l=1dd,{ul(xi)}l=1d is the orthonormal set determined by local PCA, and Oij is the optimal orthogonal transformation determined by the alignment procedure.

For xiεPCA we have w.h.p.

j=1,jinKε,α(xi,xj)OijX¯jj=1,jinKε,α(xi,xj)=(ι*Tε,αX(xi),ul(xi))l=1d+O(1n1/2εd/41/2+εPCA1/2+ε3/2). (B.9)

ForPCA = O(n−2/(d+1)) the orders ofPCA in the error terms change according to Theorem B.1.

The fourth theorem, Theorem B.4, states that the operator T∊, α can be expanded in powers of ε, where the leading-order term is the identity operator, the second-order term is the connection Laplacian operator plus some possible potential terms, and the first- and third-order terms vanish for vector fields that are sufficiently smooth. For α = 1, the potential terms vanish, and as a result, the second-order term is the connection Laplacian. The proof is based on geometry.

THEOREM B.4. For xε we have

Tε,αX(x)=X(x)+εm22d{2X(x)+2X(x)·(p1α)(x)p1α(x)}+O(ε2), (B.10)

where T∊, α is defined in (5.2).

COROLLARY B.5. Under the same conditions and notation as in Theorem B.4 if XC3(Tℳ), then for all xε we have

Tε,1X(x)=X(x)+εm22d2X(x)+O(ε2). (B.11)

Putting Theorems B.1, B.3, and B.4 together, we now prove Theorem 5.3.

PROOF OF THEOREM 5.3. Suppose xiε. By Theorem B.3, w.h.p.

j=1,jinKε,α(xi,xj)OijX¯jj=1,jinKε,α(xi,xj)=(ι*Tε,αX(xi),ul(xi))l=1d+O(1n1/2εd/41/2+εPCA3/2+ε3/2)=(ι*Tε,αX(xi),el(xi))l=1d+O(1n1/2εd/41/2+εPCA3/2+ε3/2),

where ∊PCA = O(n−2/(d+2)), and Theorem B.1 was used to replace ul(xi) by el(xi). Using Theorem B.4 for the right-hand side of (B), we get

j=1,jinKε,α(xi,xj)OijX¯jj=1,jinKε,α(xi,xj)=(ι*X(xi)+εm22dι*{2X(xi)+2X(xi)·(p1α)(xi)p1α(xi)},el(xi))l=1d+O(1n1/2εd/41/2+εPCA3/2+ε3/2).

For ∊ = O(n−2/(d+4)), upon dividing by ∊, the three error terms are

1n1/2εd/4+1/2=O(n1d+4),1εεPCA3/2=O(nd+8(d+1)(d+2)),ε12=O(n1d+4).

Clearly the three error terms vanish as n → ∞. Specifically, the dominant error is O(n−1/(d+4)), which is the same as O(ε). As a result, in the limit n → ∞, almost surely,

limn1ε[j=1,jinKε,α(xi,xj)OijX¯jj=1,jinKε,α(xi,xj)X¯i]=m22d(ι*{2X(xi)+2X(xi)·(p1α)(xi)p1α(xi)},el(xi))l=1d,

as required.

B.1 Preliminary Lemmas

For the proofs of Theorems B.1 through B.4, we need the following lemmas: LEMMA B.6. In polar coordinates around x ∈ ℳ, the Riemannian measure is given by

dV(expxtθ)=J(t,θ)dtdθ,

, where θ ∈ Txℳ, ∥θ∥ = 1, t > 0 and

J(t,θ)=td1+td+1Ric(θ,θ)+O(td+2).

PROOF. Please see [32].

The following lemma is needed in Theorems B.1 and B.2.

LEMMA B.7. Fix x ∈ ℳ and denote by expx the exponential map at x and by expι(x)p the exponential map ofp at ι(x). With the identification of Tι(x)p withp, for vTxwithv∥ ≪ 1 we have

ιexpx(v)=ι(x)+dι(v)+12Π(v,v)+16vΠ(v,v)+O(v4). (B.12)

Furthermore, for wTxℳ ≅ ℝd, we have

d[ιexpx]v(w)=d[ιexpx]v=0(w)+Π(v,w)+16vΠ(v,w)+13wΠ(v,v)+O(v3). (B.13)

PROOF. Define ϕ:=(expι(x)p)1ιexpx, that is,

ιexpx=expι(x)pϕ. (B.14)

Since φ can be viewed as a function from Txℳ ≅ ℝd to Tι,(x)p≅ ℝp we can Taylor-expand it to get

ιexpx(v)=ι(x)+dϕ|0(v)+12dϕ|0(v,v)+162dϕ|0(v,v,v)+O(v4),

where the equality holds since φ(0) = 0 and expι(x)p(w)=ι(x)+w for all wTι(x)p if we identify Tι(x)p with ℝp. To conclude (B.12), we claim that

kdϕ|0(v,,v)=kdι|x(v,,v)for allk0,

which comes from the chain rule and the fact that

kdexpx|0(v,,v)=0for allk1. (B.15)

Indeed, we have that d expxΓ(T*Txexpx*T) from the definition of the exponential map, where expx*T is the pullback tangent bundle, and

dexpx(v(t),v(t))=dexpx(v(t))dexpx(v(t))dexpx(v(t)v(t))=0,

where vTxℳ, v(t) = tvTxℳ, and vʹ(t) = vTtvTxℳ. The claim (B.15) for k = 1 follows when t = 0. The result for k ≥ 2 follows from a similar argument. Hence

ιexpx(v)=ι(x)+dι(v)+12dι(v,v)+162dι(v,v,v)+O(v4),

which gives us (B.12) since ▽dι =Π and ▽2dι = ▽Π.

Next we show (B.13). When wTvTxℳ, we view d[ι○expx].(w) as a function from Txℳ≅ ℝd to ℝp so that when ∥v∥ ≪ 1, Taylor expansion gives us

d[ιexpx]v(w)=d[ιexpx]0(w)+(d[ιexpx].(w))|0(v)+122(d[ιexpx].(w))|0(v,v)+O(v3); (B.16)

here d and ▽ are understood as the ordinary differentiation over ℝd. To simplify the calculation of ▽(d[ι ○ expx].(w))|0 (v) and ▽2(d[ι ○ expx].(w))|0 (v,v), we denote

Hw(v)=16wΠ(v,v)+16vΠ(w,v)+16vΠ(v,w)

and

Gw(u,v)=13(wΠ(u,v)+uΠ(w,u)+uΠ(u,w)),

where u, v, w ∈ ℝd. By (B.12), when ∥v∥ is small enough, we have

d[ιexpx]v(w)=limδ0ιexpx(v+δw)ιexpx(v)δ=limδ0dι(δw)+Π(v,δw)+δHw(v)+R(v+δw)R(v)δ,

where R(v) is the remainder term in the Taylor expansion:

R(v)=|α|=41α!(01(1t)34(ιexpx)(tv)dt)vα.

Thus

d[ιexpx]v(w)=dι(w)+Π(v,w)+Hw(v)+O(vw) (B.17)

since

R(v+δw)R(v)δ=O(vw).

Similarly, from (B.17), when ∥u∥ is small enough we have

(d[ιexpx].(w))|u(v)=limδ0d[ιexpx]u+δv(w)d[ιexpx]u(w)δ=Π(v,w)+Gw(u,v)+O(uvw) (B.18)

and

2d([ιexpx].(w))|0(v,v)=G(v,v). (B.19)

As a result, from (B.17), (B.18), and (B.19) we have that

d[ιexpx]0(w)=dι(w),(d[ιexpx].(w))|0(v)=Π(v,w),G(v,v)=13vΠ(v,w)+23wΠ(v,v).

Plugging them into (B.16) we get (B.13) as required.

LEMMA B.8. Suppose x, y ∈ ℳ such that y = expx(tθ)where θ ∈ Txand ∥θ∥ = 1. If t ≪ 1, then h = ∥ι(x) – ι(y)∥ ≪ 1 satisfies

t=h+124Π(θ,θ)h3+O(h4). (B.20)

PROOF. Please see [9] or apply (B.12) directly.

LEMMA B.9. Fixx ∈ ℳ and y = expx(tθ)where θ ∈ Txand ∥θ∥ = 1. Let {l(x)}l=1d be the normal coordinate on a neighborhood U of x; then for a sufficiently small t we have

ι*Py,xl(x)=ι*l(x)+tΠ(θ,l(x))+t26θΠ(θ,l(x))+t23l(x)Π(θ,θ)t26ι*Py,x((θ,l(x))θ)+O(t3) (B.21)

for all l = 1, 2, …, d.

PROOF. Choose an open subset U ⊂ ℳ small enough and find an open neighborhood B of 0 ∈ Txℳ so that expx : BU is diffeomorphic. It is well-known that

l(expx(tθ))=Jl(t)t,

where Jl(t) is the Jacobi field with Jl(0) = 0 and ▽tJl (0) = ∂l(x). By applying Taylor’s expansion in a neighborhood of t = 0, we have

Jl(t)=Py,x(Jl(0)+ttJl(0)+t22t2Jl(0)+t36t3Jl(0))+O(t4).

Since Jl(0)=t2Jl(0)=0, the following relationship holds:

l(expx(tθ))=Py,x(tJl(0)+t26t3Jl(0))+O(t3)=Py,xl(x)+t26Py,x((θ,l(x))θ)+O(t3). (B.22)

Thus we obtain

Py,xl(x)=l(expx(tθ))t26Py,x((θ,l(x))θ)+O(t3). (B.23)

On the other hand, from (B.13) in Lemma B.7 we have

ι*l(expx(tθ))=ι*l(x)+tΠ(θ,l(x))+t26θΠ(θ,l(x))+t23l(x)Π(θ,θ)+O(t3). (B.24)

Putting (B.23) and (B.24) together, it follows that, for l = 1, 2, … d,

ι*Py,xl(x)=l*l(expx(tθ))t26ι*Py,x((θ,l(x))θ)+O(t3)=l*l(x)+tΠ(θ,l(x))+t26θΠ(θ,l(x))+t23l(x)Π(θ,θ)t26ι*Py,x((θ,l(x))θ)+O(t3). (B.25)

B.2 Proof of Theorem B.1

Fix xiεPCA. Denote by {vk}k=1p the standard orthonormal basis of ℝpthat is, vk has 1 in the kth entry and 0 elsewhere. We can properly translate and rotate the embedding ι so that ι(xi) = 0 and the first d components {v1, v2; …, vd} ⊂ ℝp form the orthonormal basis of ι*Txiℳ, and we find a normal coordinate {k}k=1d around xi so that ι*k(xi) = vk. Instead of directly analyzing the matrix Bi that appears in the local PCA procedure given in (2.3), we analyze the covariance matrix Ξi:=BiBiΤ, whose eigenvectors coincide with the left singular vectors of Bi. Throughout this proof, we let K = KPCA to simply the notation. We rewrite Ξi as

Ξi=jinFj, (B.26)

where

Fj=K(ι(xi)ι(xj)pεPCA)(ι(xj)ι(xi))(ι(xj)ι(xi))Τ (B.27)

and

Fj(k,l)=K(ι(xi)ι(xj)pεPCA)ι(xj)ι(xi),vkι(xj)ι(xi),vl. (B.28)

Let BεPCA(xi) denote the geodesic ball of radius εPCA around xi. We apply the same variance error analysis as in [34, sec. 3] to approximate Ξi. Since the points xi are independent identically distributed (i.i.d.), Fj, j ≠ i , are also i.i.d.; by the law of large numbers one expects

1n1jinFj𝔼F, (B.29)

where F = F1,

𝔼F=BεPCA(xi)KεPCA(xi,y)(ι(y)ι(xi))(ι(y)ι(xi))Τp(y)dV(y) (B.30)

and

𝔼F(k,l)=BεPCA(xi)KεPCA(xi,y)ι(y)ι(xi),vkι(y)ι(xi),vlp(y)dV(y). (B.31)

In order to evaluate the first moment 𝔼F (k, l) of (B.31), we note that for y = expxi v, where vTxiℳ, by (B.12) in Lemma B.7 we have

ι(expxiv)ι(xi),vk=ι*v,vk+12Π(v,v),vk+16vΠ(v,v),vk+O(v4). (B.32)

By substituting (B.32) into (B.31), applying Taylor’s expansion, and combining Lemma B.8 and Lemma B.6, we have

BεPCA(xi)KεPCA(xi,y)ι(y)ι(xi),vkι(y)ι(xi),vlp(y)dV(y)=𝕊d10εPCA[K(tεPCA)+O(t3εPCA)]×{t2ι*θ,vkι*θ,vl+t32(Π(θ,θ),vkι*θ,vl+Π(θ,θ),vlι*θ,vk)+O(t4)}×(p(xi)+tθp(xi)+O(t2))(td1+O(td+1))dtdθ=𝕊d10εPCA[K(tεPCA)ι*θ,vkι*θ,vlp(xi)td+1+O(td+3)]dtdθ,

where the last equality holds since integrals involving odd powers of θ must vanish due to the symmetry of the sphere 𝕊 d−1. Note that 〈ι*θ, vk〉 = 0 when k = d + 1, d + 2, … p. Therefore,

𝔼F(k,l)={DεPCAd/2+1+O(εPCAd/2+2)for1k=ld,O(εPCAd/2+2)otherwise. (B.33)

where D=p(xi)𝕊d1|ι*θ,v1|2dθ01K(u)ud+1du is a positive constant. Similar considerations give the second moment of F(k, l) as

𝔼[F(k,l)2]={O(εPCAd/2+2)fork,l=1,2,,d,O(εPCAd/2+4)fork,l=d+1,d+2,,p,O(εPCAd/2+3)otherwise. (B.34)

Hence, the variance of F(k, l) becomes

VarF(k,l)={O(εPCAd/2+2)fork,l=1,2,,d,O(εPCAd/2+4)fork,l=d+1,d+2,,p,O(εPCAd/2+3)otherwise. (B.35)

We now move on to establish a large-deviation bound on the estimation of 1n1ΣjiFj(k,l) by its mean 𝔼Fj(k, l). For that purpose, we measure the deviation from the mean value by α and define its probability by

pk,l(n,α)Pr{|1n1jinFj(k,l)𝔼F(k,l)|>α}. (B.36)

To establish an upper bound for the probability pk,l(n, α), we use Bernstein’s inequality; see, e.g., [22]. Define

Yj(k,l)Fj(k,l)𝔼F(k,l).

Clearly Yj (k, l) are zero mean i.i.d. random variables. From the definition of Fj (k, l) (see (B.27) and (B.28)) and from the calculation of its first moment (B.33), it follows that Yj (k, l) are bounded random variables. More specifically,

Yj(k,l)={O(εPCA)fork,l=1,2,,d,O(εPCA2)fork,l=d+1,d+2,,p,O(εPCA3/2)otherwise. (B.37)

.

Consider first the case k, l = 1, 2, …, d, for which Bernstein’s inequality gives

pk,l(n,α)exp{(n1)α22𝔼(Y1(k,l)2)+O(εPCA)α}exp{(n1)α2O(εPCAd/2+2)+O(εPCA)α}. (B.38)

From (B.38) it follows that for

α=O(εPCAd/4+1n1/2)

and

1n1/2εPCAd/41, (B.39)

the probability pk, l(n, α) is exponentially small.

Similarly, for k, l = d + 1, d + 2, …, p, we have

pk,l(n,α)exp{(n1)α2O(εPCAd/2+4)+O(εPCA2)α},

, which means that for

α=O(εPCAd/4+2n1/2)

and (B.39), the probability pk, l(n, α) is exponentially small.

Finally, for k = d+1, d+2, …, p, l = 1, 2, …, d or l = d+1, d+2; …, p, and k = 1, 2, …, d, we have

pk,l(n,α)exp{(n1)α2O(εPCAd/2+3)+O(εPCA3/2)α},

, which means that for

α=O(εPCAd/4+3/2n1/2)

and (B.39), the probability pk,l(n, α) is exponentially small. The condition (B.39) is quite intuitive as it is equivalent to nPCAd/2 ≫ 1, which says that the expected number of points inside BεPCA(xi) should be large.

As a result, when (B.39) holds, w.h.p., the covariance matrix Ξi is given by

Ξi=εPCAd/2+1D[Id×d0d×pd0pd×d0pd×pd] (B.40)
+εPCAd/2+2[O(1)O(1)O(1)O(1)] (B.41)
+εPCAd/4+1n[O(1)O(εPCA1/2)O(εPCA1/2)O(εPCA)], (B.42)

where Id×d is the identity matrix of size d×d, and 0m×mʹ is the zero matrix of size m × mʹ. The error term in (B.41) is the bias term due to the curvature of the manifold, while the error term in (B.42) is the variance term due to finite sampling (i.e., finite n). In particular, under the condition in the statement of the theorem for the sampling rate, namely, ∊PCA = O(n−2/(d+2)), we have w.h.p.

Ξi=εPCAd/2+1D[Id×d0d×pd0pd×d0pd×pd]+εPCAd/2+2[O(1)O(1)O(1)O(1)]+εPCAd/2+3/2[O(1)O(εPCA1/2)O(εPCA1/2)O(εPCA)]=εPCAd/2+1{D[Id×d0d×pd0pd×d0pd×pd]+[O(εPCA1/2)O(εPCA)O(εPCA)O(εPCA)]}. (B.43)

Note that by definition Ξi is symmetric, so we rewrite (B.43) as

Ξi=εPCAd/2+1D[I+εPCA1/2AεPCACεPCACΤεPCAB], (B.44)

where I is the d × d identity matrix, A is a d × d symmetric matrix, C is a d × (pd) matrix, and B is a (p – d) × (p – d) symmetric matrix. All entries of A, B, and C are O(1).

Denote by uk and λk, k = 1, 2, …, p, the eigenvectors and eigenvalues of Ξi, where the eigenvectors are orthonormal and the eigenvalues are listed in decreasing order. Using regular perturbation theory, we find that λk = D∊PCAd/2+1 (1 + O (∊PCA1/2)) (for k = 1, 2, …, d), and that the expansion of the first d eigenvectors {uk}k=1d is given by

uk=[[wk+O(εPCA3/2)]d×1[O(εPCA)]pd×1]p, (B.45)

where {wk}k=1d are orthonormal eigenvectors of A satisfying Awk=λkAwk. Indeed, a direct calculation gives us

[I+εPCA1/2AεPCACεPCACΤεPCAB][wk+εPCA3/2v3/2+εPCA2v2+O(εPCA5/2)εPCAz1+εPCA3/2z3/2+O(εPCA2)]=[wk+εPCA1/2Awk+εPCA3/2v3/2+εPCA2(Av3/2+v2+Cz1)+O(εPCA5/2)εPCACΤwk+εPCA2Bz1+O(εPCA5/2)], (B.46)

where v3/2, v2 ∈ ℝd and z1, z3/2 ∈ ℝp–d. On the other hand,

(1+εPCA1/2λkA+εPCA2λ2+O(εPCA5/2))[wk+εPCA3/2v3/2+εPCA2v2+O(εPCA5/2)εPCAz1+εPCA3/2z3/2+O(εPCA2)]=[wk+εPCA1/2λkAwk+εPCA3/2v3/2+εPCA2(λkAv2+v3/2+λ2wk)+O(εPCA5/2)εPCAz1+εPCA3/2(λkAz1+z3/2)+O(εPCA2)], (B.47)

where γ2 ∈ R. Matching orders of ∊PCA between (B.46) and (B.47), we conclude that

O(εPCA):z1=CΤwk,O(εPCA3/2):z3/2=λkAz1,O(εPCA2):(AλkAI)v3/2=λ2wkCCΤwk. (B.48)

Note that the matrix (AλkAI) appearing in (B.48) is singular and its null space is spanned by the vector wk so the solvability condition is λ2=CΤwk2/wk2. We mention that A is a generic symmetric matrix generated due to random finite sampling, so almost surely the eigenvalue λkA is simple.

Denote by Oi the p × d matrix whose kth column is the vector uk. We measure the deviation of the d-dimensional subspace of ℝp spanned by uk, k = 1, 2, …, d, from ιTxiM by

minOO(d)OiΤΘiOHS, (B.49)

where Θi is a p×d matrix whose kth column is vk (recall that vk is the kth standard unit vector in ℝp). Let Ô be the d × d orthonormal matrix

O^=[w1ΤwdΤ]d×d.

Then,

minOO(d)OiΤΘiOHSOiΤΘiO^HS=O(εPCA3/2), (B.50)

which completes the proof for points away from the boundary.

Next, we consider xiεPCA. The proof is almost the same as the above, so we just point out the main differences without giving the full details. The notations Ξi, Fj (k, l), pk,l(n, α), and Yj (k, l) refer to the same quantities. Here the expectation of Fj (k, l) is

𝔼F(k,l)=BεPCA(xi)KεPCA(xi,y)ι(y)ι(xi),vk×ι(y)ι(xi),vlp(y)dV(y). (B.51)

Due to the asymmetry of the integration domain expxi1(BεPCA(xi)) when xi is near the boundary, we do not expect 𝔼Fj (k, l) to be the same as (B.33) and (B.34), since integrals involving odd powers of θ do not vanish. In particular, when l = d + 1, …, p, k = 1, 2, …, d or k = d + 1, d + 2, …, p, l = 1, 2, …, d, 𝔼F(k, l) becomes

expxi1(BεPCA(xi))K(tεPCA)ι*θ,vkΠ(θ,θ),vlp(xi)td+2dtdθ+O(εPCAd/2+2),

which is O(∊PCAd/2+3/2). Note that for xiεPCA the bias term in the expansion of the covariance matrix differs from (B.41) when l = d + 1, d = 2, …, p, k = 1, 2, …, d or k = d +1, d +2, …, p, l = 1, 2, …, d. Similar calculations show that

𝔼F(k,l)={O(εPCAd/2+1)whenk,l=1,2,,d,O(εPCAd/2+2)whenk,l=d+1,d+2,,p,O(εPCAd/2+3/2)otherwise, (B.52)
𝔼[F(k,l)2]={O(εPCAd/2+2)whenk,l=1,2,,d,O(εPCAd/2+4)whenk,l=d+1,d+2,,p,O(εPCAd/2+3)otherwise, (B.53)
VarF(k,l)={O(εPCAd/2+2)whenk,l=1,2,,d,O(εPCAd/2+4)whenk,l=d+1,d+2,,p,O(εPCAd/2+3)otherwise. (B.54)

Similarly, Yj (k, l) are also bounded random variables satisfying

Yj(k,l)={O(εPCA)fork,l=1,2,,d,O(εPCA2)fork,l=d+1,d+2,,p,O(εPCA3/2)otherwise. (B.55)

Consider first the case k, l = 1, 2, …, d, for which Bernstein’s inequality gives

pk,l(n,α)exp{(n1)α2O(εPCAd/2+2)+O(εPCA)α}. (B.56)

From (B.56) it follows that w.h.p.

α=O(εPCAd/4+1n1/2)

provided (B.39). Similarly, for k, l = d + 1, d + 2, …, p, we have

pk,l(n,α)exp{(n1)α2O(εPCAd/2+4)+O(εPCA2)α},

which means that w.h.p.

α=O(εPCAd/4+2n1/2)

provided (B.39). Finally, for k = d + 1, d + 2, …, p, l = 1, 2, …, d, or l = d + 1, d + 2, …, p, k = 1, 2, …, d, we have

pk,l(n,α)exp{(n1)α2O(εPCAd/2+3)+O(εPCA3/2)α},

which means that w.h.p.

α=O(εPCAd/4+3/2n1/2)

provided (B.39). As a result, under the condition in the statement of the theorem for the sampling rate, namely, ∊PCA = O(n−2/(d+2)), we have w.h.p.

Ξi=εPCAd/2+1[O(1)000]+εPCAd/2+3/2[O(1)O(1)O(1)O(εPCA1/2)]+εPCAd/4+1n[O(1)O(εPCA1/2)O(εPCA1/2)O(εPCA)]=εPCAd/2+1{[O(1)0d×pd0pd×d0pd×pd]+[O(εPCA1/2)O(εPCA1/2)O(εPCA1/2)O(εPCA)]}.

Then, by the same argument as in the case when xiεPCA, we conclude that

minOO(d)OiΤΘiOHS=O(εPCA1/2).

Similar calculations show that for ∊PCA = O(n−2/(d+1)) we get

minOO(d)OiΤΘiOHS=O(εPCA5/4)forxiεPCA,minOO(d)OiΤΘiOHS=O(εPCA3/4)forxiεPCA.

B.3 Proof of Theorem B.2

Denote by Oi the p × d matrix whose columns ul(xi), l = 1, 2, …, d, are orthonormal inside ℝp as determined by local PCA around xi. As in (B.3), we denote by el(xi) the lth column of Qi where Qi is a p × d matrix whose columns form an orthonormal basis of ι*Txiℳ so by Theorem B.1 OiΤQiIdHS=O(εPCA3/2)forεPCA=O(n2/(d+2)), which is the case of focus here (if ∊PCA = O(n−2/(d+1)), then OiΤQiIdHS=O(εPCA5/4)).

Fix xi and the normal coordinate {l}l=1d around xi so that ι*l(xi) = el(xi). Let xj = expxi tθ, where θ ∈ Txiℳ, θ=1,andt=O(ε). Then, by the definition of the parallel transport, we have

Pxi,xjX(xj)=l=1dg(X(xj),Pxj,xil(xi))l(xi), (B.57)

and since the parallel transport and the embedding ι are isometric, we have

g(Pxi,xjX(xj),l(xi))=g(X(xj),Pxj,xil(xi))=ι*X(xj),ι*Pxj,xil(xi). (B.58)

Local PCA provides an estimation of an orthonormal basis spanning ι*Txiℳ, which is free up to O(d). Thus, there exists RO(p) so that ι*Txjℳ is invariant under R and el(xj) = Rι*Pxj,xil (xi) for all l = 1, 2, …, d. Hence we have the following relationship:

ι*X(xj),ι*Pxj,xil(xi)=k=1dι*X(xj),ek(xj)ek(xj),ι*Pxj,xil(xi)=k=1dι*X(xj),ek(xj)ek(xj),ι*Pxj,xil(xi)=k=1dRι*Pxj,xik(xj),ι*Pxj,xil(xi)ι*X(xj),ek(xj)k=1dR¯l,kι*X(xj),ek(xj)R¯Xj, (B.59)

where R¯l,k:=Rι*Pxj,xik(xi),ι*Pxj,xil(xi),R¯:=[R¯l,k]l,k=1d,andXj=(ι*X(xj),ek(xj))k=1d

On the other hand, since QiΤQj=[ι*l(xi)ΤRι*Pxj,xik(xi)]l,k=1d, Lemma B.9 gives us

QiΤQj=[ι*Pxj,xil(xi)ΤRι*Pxj,xik(xi)]l,k=1dt[Π(θ,l(xi))ΤRι*Pxj,xik(xi)]l,k=1dt26[2l(xi)Π(θ,θ)ΤRι*Pxj,xik(xi)+θΠ(θ,l(xi))ΤRι*Pxj,xik(xi)(ι*Pxj,xi(θ,l(xi))θ)ΤRι*Pxj,xik(xi)]l,k=1d+O(Τ3).

We now analyze the right-hand side term by term. Note that since ι*Txjℳ is invariant under R, we have Rι*Pxj,xik(xi)=Σr=1dR¯r,kι*Pxj,xir(xi). For the O(t) term, we have

Π(θ,l(xi))ΤRι*Pxj,xik(xi)=r=1dR¯r,kΠ(θ,l(xi))Τι*Pxj,xir(xi)=r=1dR¯r,kΠ(θ,l(xi))Τ[ι*r(xi)+tΠ(θ,r(xi))+O(t2)]=tr=1dR¯r,kΠ(θ,l(xi))ΤΠ(θ,r(xi))+O(t2) (B.60)

where the second equality is due to Lemma B.9 and the third equality holds since Π(θ, ∂l(xi)) is perpendicular to ι*r(xi) for all l, r = 1, 2, …, d. Moreover, the Gauss equation gives us

0=(θ,θ)r(xi),l(xi)=Π(θ,l(xi))ΤΠ(θ,r(xi))Π(θ,r(xi))ΤΠ(θ,l(xi)),

which means the matrix

S1[[Π(θ,l(xi))ΤΠ(θ,r(xi))]l,r=1d

is symmetric.

Fix a vector field X on a neighborhood around xi so that X(xi) = θ. By definition we have

lΠ(X,X)=l(Π(X,X))2Π(X,lX). (B.61)

Viewing Tℳ as a subbundle of Tpwe have the equation of Weingarten:

l(Π(X,X))=AΠ(X,X)l+l(Π(X,X)), (B.62)

where AΠ(X, X)l and l(Π(X,X)) are the tangential and normal components of ▽l, respectively. Moreover, the following equation holds:

AΠ(X,X)l,ι*k=Π(l,k),Π(X,X). (B.63)

By evaluating (B.61) and (B.62) at xi we have

l(xi)Π(θ,θ)ΤRι*Pxj,xik(xi)=r=1dR¯r,kl(xi)Π(θ,θ)Τι*Pxj,xir(xi)=r=1dR¯r,k(AΠ(θ,θ)l(xi)+l(xi)(Π(X,X))2Π(θ,l(xi)X))Τ×[ι*r(xi)+tΠ(θ,r(xi))+O(t2)]=r=1dR¯r,k(AΠ(θ,θ)l(xi))Τι*r(xi)+O(t)=r=1dR¯r,kΠ(θ,θ),Π(l(xi),r(xi))+O(t). (B.64)

where the third equality holds since Π(θ,l(xi)X)andl(xi)(Π(X,X)) are perpendicular to ι*l(xi) and the last equality holds by (B.63). Due to the symmetry of the second fundamental form, we know that the matrix

S2=[Π(θ,θ),Π(l(xi),r(xi))]l,r=1d

is symmetric.

Similarly we have

θΠ(l(xi),θ)ΤRι*Pxj,xik(xi)=r=1dR¯r,k(AΠ(l(xi),θ)θ)Τι*r(xi)+O(t). (B.65)

Since (AΠ(l(xi),θ)θ)Τι*r(xi)=Π(θ,l(xi))ΤΠ(θ,r(xi)) by (B.63), which we denoted earlier by S1 and used the Gauss equation to conclude that it is symmetric.

To estimate the last term, we work out the following calculation by using the isometry of the parallel transport:

(ι*Pxj,xi(θ,l(xi))θ)ΤRι*Pxj,xik(xi)=r=1dR¯r,kι*Pxj,xi((θ,l(xi))θ),ι*Pxj,xir(xi)=r=1dR¯r,kg(Pxj,xi((θ,l(xi))θ),Pxj,xir(xi))=r=1dR¯r,kg((θ,l(xi))θ,r(xi)). (B.66)

Denote S3=[g((θ,l(xi))θ,r(xi))]l,r=1d, which is symmetric by the definition of ℛ.

By (B.60), (B.64), (B.65), and (B.66) we have

QiΤQj=R¯+t2(S1S23+S16S36)R¯+O(t3)=R¯+t2SR¯+O(t3), (B.67)

where S:= −S1S2/3 + S1/6 − S3/6 is a symmetric matrix.

Suppose that both xi and xj are not in ε. To finish the proof, we have to understand the relationship between OiΤOjandQiΤQj which is rewritten as

OiΤOj=QiΤQj+(OiQi)ΤQj+OiΤ(OjQj). (B.68)

From (B.1) in Theorem B.1, we know

(OiQi)ΤQiHS=OiΤQiIdHS=O(εPCA3/2),

which is equivalent to

(OiQi)ΤQi=O(εPCA3/2). (B.69)

Due to (B.67) we have Qj = Qi +t2QiS+O(t3), which together with (B.69) gives

(OiQi)ΤQj=(OiQi)Τ(QiR¯+t2QiSR¯+O(t3))=O(εPCA3/2+ε3/2). (B.70)

Together with the fact that QiΤ=R¯QjΤ+t2SR¯QjΤ+O(t3) derived from (B.67), we have

OiΤ(OjQj)=QiΤ(OjQj)+(OiQi)Τ(OjQj)=(R¯QjΤ+t2SR¯QjΤ+O(t3))(OjQj)+(OiQi)Τ(OjQj)=O(εPCA3/2+ε3/2)+(OiQi)Τ(OjQj). (B.71)

Recall that the following relationship between Oi and Qi holds (B.45)

Oi=Qi+[O(εPCA3/2)O(εPCA)] (B.72)

when the embedding ι is translated and rotated so that it satisfies ι(xi) = 0, the first d standard unit vectors {v1, v2, … vd} ⊂ ℝp form the orthonormal basis of ι*Txiℳ, and the normal coordinates {k}k=1daroundxisatisfyι*k(xi)=vk. Similarly, the following relationship between Oj and Qj holds (B.45):

Oj=Qj+[O(εPCA3/2)O(εPCA)] (B.73)

when the embedding ι is translated and rotated so that it satisfies ι(xj) = 0, the first d standard unit vectors {v1, v2, … vd} ⊂ ℝp form the orthonormal basis of ι*Txjℳ, and the normal coordinates {k}k=1d around xj satisfy ι*k(xj) = ι*Pxj,xik(xi)=vk. Also, recall that ι*Txjℳis invariant under the rotation R and from Lemma B.9, ek(xi) and ek(xj) are related by ek(xj)=ek(xi)+O(ε). Therefore,

OjQj=(R+O(ε))[O(εPCA3/2)O(εPCA)]=[O(εPCA3/2)O(εPCA)] (B.74)

when expressed in the standard basis of ℝp so that the first d standard unit vectors {v1, v2, … vd} ⊂ ℝp form the orthonormal basis of ι*Txiℳ. Hence, plugging (B.74) into (B.71) gives

OiΤ(OjQj)=O(εPCA3/2)+(OiQi)Τ(OjQj)=O(εPCA3/2). (B.75)

Inserting (B.71) and (B.75) into (B.68), we conclude

OiΤOj=QiΤQj+O(εPCA3/2+ε3/2). (B.76)

Recall that Oij is defined as Oij = UVΤ, where U and V come from the singular value decomposition of OiΤOj,that is,OiΤOj=UΣVΤ. As a result,

Oij=argminOO(d)OiΤOjOHS=argminOO(d)QiΤQj+O(εPCA3/2+ε3/2)OHS=argminOO(d)R¯ΤQiΤQj+O(εPCA3/2+ε3/2)R¯ΤOHS=argminOO(d)Id+t2R¯ΤSR¯+O(εPCA3/2+ε3/2)R¯ΤOHS.

Since ΤS is symmetric, we rewrite ΤS = UΣUΤ, where U is an orthonormal matrix and Σ is a diagonal matrix with the eigenvalues of ΤS on its diagonal. Thus,

Id+t2R¯ΤSR¯+O(εPCA3/2+ε3/2)R¯ΤO=U(Id+t2Σ)UΤ+O(εPCA3/2+ε3/2)R¯ΤO.

Since the Hilbert-Schmidt norm is invariant to orthogonal transformations, we have

Id+t2R¯ΤSR¯+O(εPCA3/2+ε3/2)R¯ΤOHS=U(Id+t2Σ)UΤ+O(εPCA3/2+ε3/2)R¯ΤOHS=Id+t2Σ+O(εPCA3/2+ε3/2)UΤR¯ΤOUHS.

Since UΤΤOU is orthogonal, the minimizer must satisfy UΤR¯ΤOU=Id+O(εPCA3/2+ε3/2), as otherwise the sum of squares of the matrix entries would be larger. Hence we conclude Oij=R¯+O(εPCA3/2+ε3/2)

Applying (B.59) and (B.45), we conclude

OijX¯j=R¯X¯j+O(εPCA3/2+ε3/2)=R¯(ι*X(xj),ul(xj))l=1d+O(εPCA3/2+ε3/2)=R¯(ι*X(xj),el(xj)+O(εPCA3/2))l=1d+O(εPCA3/2+ε3/2)=(ι*X(xj),ι*Pxj,xil(xi))l=1d+O(εPCA3/2+ε3/2)=(ι*Pxi,xjX(xj),el(xi))l=1d+O(εPCA3/2+ε3/2)=(ι*Pxi,xjX(xj),ul(xi))l=1d+O(εPCA3/2+ε3/2).

This concludes the proof for points away from the boundary.

When xi and xj are in εPCA, by the same reasoning as above we get

OijX¯j=(ι*Pxi,xjX(xj),ul(xi))l=1d+O(εPCA1/2+ε3/2).

This concludes the proof. Similar results hold for ∊PCA = O(n−2/(d+1)) by using the results in Theorem B.1.

B.4 Proof of Theorem B.3

We demonstrate the proof for the case when the data is uniformly distributed over the manifold. The proof for the nonuniform sampling case is the same but more tedious. Note that when the data is uniformly distributed, T∊,α = T∊,0 for all 0 < α ≤ 1, so in the proof we focus on analyzing T := T∊, 0. Denote K := K∊, 0. Fix xiεPCA. We rewrite the left-hand side of (B.8) as

j=1,jinKε(xi,xj)OijX¯jj=1,jinKε(xi,xj)=1n1j=1,jinFj1n1j=1,jinGj (B.77)

where

Fj=Kε(xi,xj)OijX¯j,Gj=Kε(xi,xj).

Since x1, x2; …, xn are i.i.d. random variables, then Gj for ji are also i.i.d. random variables. However, the random vectors Fj for ji are not independent, because the computation of Oij involves several data points, which leads to possible dependency between Oij1 and Oij2. Nonetheless, Theorem B.2 implies that the random vectors Fj are well approximated by the i.i.d. random vectors Fj that are defined as

FjKε(xi,xj)(ι*Pxi,xjX(xj),ul(xi))l=1d, (B.78)

and the approximation is given by

Fj=Fj+Kε(xi,xj)O(εPCA3/2+ε3/2), (B.79)

where we use ∊PCA = O(n−2/(d+2)) (the following analysis can be easily modified to apply to the case ∊PCA = O(n−2/(d+1)).

Since Gj when ji are identical and independent random variables and Fj when ji are identical and independent random vectors, we hereafter replace Fj and Gj by Fʹ and G in order to ease notation. By the law of large numbers we should expect the following approximation to hold:

1n1j=1,jinFj1n1j=1,jinGj=1n1j=1,jin[Fj+GjO(εPCA3/2+ε3/2)]1n1j=1,jinGj𝔼F𝔼G+O(εPCA3/2+ε3/2), (B.80)

where

𝔼F=(ι*Kε(xi,y)Pxi,yX(y)dV(y),ul(xi)))l=1d (B.81)

and

𝔼G=Kε(xi,y)dV(y). (B.82)

In order to analyze the error of this approximation, we make use of the result in [34, eq. (3.14), p. 132] to conclude a large-deviation bound on each of the d coordinates of the error. Together with a simple union bound, we obtain the following large-deviation bound:

Pr{1n1j=1,jinFj1n1j=1,jinGj𝔼F𝔼G>α}C1exp{C2(n1)α2εd/2vol()2ε[|y=xiι*Pxi,yX(y),ul(xi)2+O(ε)]}, (B.83)

where C1 and C2 are some constants (related to d). This large-deviation bound implies that w.h.p. the variance term is O(1n1/2εd/41/2). As a result,

j=1,jinKε,α(xi,xj)OijX¯jj=1,jinKε,α(xi,xj)=(ι*Tε,αX(xi),ul(xi))l=1d+O(1n1/2εd/41/2+εPCA3/2+ε3/2),

which completes the proof for points away from the boundary. The proof for points inside the boundary is similar.

B.5 Proof of Theorem B.4

We begin the proof by citing the following lemma from [9, lemma 8]: LEMMA B.10. Suppose f ∈ 𝒞3(ℳ) and xε then

Bε(x)εd/2Kε(x,y)f(y)dV(y)=f(x)+εm2d[Δf(x)2+w(x)f(x)]+O(ε2)

where w(x)=s(x)+m3z(x)24|𝕊d1|andz(x)=𝕊d1Π(θ,θ)dθ.

By Lemma B.10, we get

pε(y)=p(y)+εm2d(Δp(y)2+w(y)p(y))+O(ε2), (B.84)

which leads to

p(y)pεα(y)=p1α(y)[1αεm2d(w(y)+Δp(y)2p(y))]+O(ε2). (B.85)

Plug (B.85) into the numerator of T∊,αX(x):

Bε(x)Kε,α(x,y)Px,yX(y)p(y)dV(y)=pεα(x)Bε(x)Kε(x,y)Px,yX(y)pεα(y)p(y)dV(y)=pεα(x)Bε(x)Kε(x,y)Px,yX(y)p1α(y)×[1αεm2d(w(y)+Δp(y)2p(y))]dV(y)+O(εd/2+2)pεα(x)Am2εdαpεα(x)B+O(εd/2+2)

where

{ABε(x)Kε(x,y)Px,yX(y)p1α(y)dV(y),BBε(x)Kε(x,y)Px,yX(y)p1α(y)(w(y)+Δp(y)2p(y))dV(y).

We evaluate A and B by changing the integration variables to polar coordinates, and the odd monomials in the integral vanish because the kernel is symmetric. Thus, applying Taylor’s expansion to A leads to

A=𝕊d10ε[K(tε)+K(tε)Π(θ,θ)t324ε+O(t6ε)]×[X(x)+θX(x)t+θ,θ2X(x)t22+O(t3)]×[p1α(x)+θ(p1α)(x)t+θ,θ2(p1α)(x)t22+O(t3)]×[td1+Ric(θ,θ)td+1+O(td+2)]dtdθ,

which after rearrangement becomes

A=p1α(x)X(x)𝕊d10ε{K(tε)[1+Ric(θ,θ)t2]td1×K(tε)Π(θ,θ)td+224ε}dtdθ+p1α(x)𝕊d10εK(tε)θ,θ2X(x)td+12dtdθ+X(x)𝕊d10εK(tε)θ,θ2(p1α)(x)td+12dtdθ+𝕊d10εK(tε)θX(x)θ(p1α)(x)td+1dtdθ+O(εd/2+2).

From the definition of z(x) it follows that

Bε(0)1εd/2K(tε)Π(θ,θ)td+224εdtdθ=εd/2+1m3z(x)24|𝕊d1|.

Suppose {El}l=1d is an orthonormal basis of Txℳ and express θ=Σl=1dθlEl. A direct calculation shows that

$$\int_{\mathbb{S}^{d-1}} \nabla^2_{\theta,\theta} X(x)\, d\theta = \sum_{k,l=1}^d \int_{\mathbb{S}^{d-1}} \theta^l \theta^k\, \nabla^2_{E_l,E_k} X(x)\, d\theta = \frac{|\mathbb{S}^{d-1}|}{d}\, \nabla^2 X(x),$$

and similarly

$$\int_{\mathbb{S}^{d-1}} \operatorname{Ric}(\theta,\theta)\, d\theta = \frac{|\mathbb{S}^{d-1}|}{d}\, s(x).$$
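Both spherical averages rest on the second-moment identity $\int_{\mathbb{S}^{d-1}} \theta^k\theta^l\, d\theta = \frac{|\mathbb{S}^{d-1}|}{d}\,\delta_{kl}$. The following quick Monte Carlo sanity check (illustrative only, not part of the proof) verifies the normalized version of this identity in $d = 3$:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 3
theta = rng.normal(size=(2_000_000, d))
theta /= np.linalg.norm(theta, axis=1, keepdims=True)   # uniform samples on S^{d-1}

# E[theta theta^T] over the uniform measure should be (1/d) * identity,
# i.e. the integral over S^{d-1} equals (|S^{d-1}|/d) * delta_{kl}.
print(np.round(theta.T @ theta / len(theta), 3))
```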

Therefore, the first three terms of A become

$$\varepsilon^{d/2}\, p^{1-\alpha}(x)\left\{\left(1 + \varepsilon\,\frac{m_2}{d}\,\frac{\Delta(p^{1-\alpha})(x)}{2p^{1-\alpha}(x)} + \varepsilon\,\frac{m_2}{d}\, w(x)\right) X(x) + \varepsilon\,\frac{m_2}{2d}\,\nabla^2 X(x)\right\},$$

and the last term is simplified to

$$\varepsilon^{\frac{d}{2}+1}\,\frac{m_2}{d}\,\nabla X(x)\cdot\nabla(p^{1-\alpha})(x).$$

Next, we consider B. Since B is multiplied by ∊ in the expression above, only its leading-order term is needed. To simplify notation, let $Q(y) = p^{1-\alpha}(y)\left(w(y) + \frac{\Delta p(y)}{2p(y)}\right)$. Applying Taylor's expansion to each term in the integrand of B leads to

$$\begin{aligned}
B &= \int_{B_{\sqrt{\varepsilon}}(x)} K_\varepsilon(x,y)\, P_{x,y}X(y)\, Q(y)\, dV(y)\\
&= \int_{\mathbb{S}^{d-1}}\int_0^{\sqrt{\varepsilon}} \left[K\!\left(\tfrac{t}{\sqrt{\varepsilon}}\right) + O\!\left(\tfrac{t^3}{\sqrt{\varepsilon}}\right)\right]\left[X(x) + \nabla_\theta X(x)\, t + O(t^2)\right]
\left[Q(x) + \nabla_\theta Q(x)\, t + O(t^2)\right]\left[t^{d-1} + \operatorname{Ric}(\theta,\theta)\, t^{d+1} + O(t^{d+2})\right] dt\, d\theta\\
&= \varepsilon^{\frac{d}{2}}\, X(x)\, Q(x) + O(\varepsilon^{d/2+1}).
\end{aligned}$$

In conclusion, the numerator of T∊,αX(x) becomes

$$\varepsilon^{d/2}\, p^{1-\alpha}(x)\, p_\varepsilon^{-\alpha}(x)\left\{1 + \varepsilon\,\frac{m_2}{d}\left[\frac{\Delta(p^{1-\alpha})(x)}{2p^{1-\alpha}(x)} - \alpha\,\frac{\Delta p(x)}{2p(x)}\right]\right\} X(x)
+ \varepsilon^{\frac{d}{2}+1}\,\frac{m_2}{d}\, p_\varepsilon^{-\alpha}(x)\left\{\frac{p^{1-\alpha}(x)}{2}\,\nabla^2 X(x) + \nabla X(x)\cdot\nabla(p^{1-\alpha})(x)\right\} + O(\varepsilon^{\frac{d}{2}+2}).$$

A similar calculation for the denominator of $T_{\varepsilon,\alpha}X(x)$ gives

$$\begin{aligned}
\int_{B_{\sqrt{\varepsilon}}(x)} K_{\varepsilon,\alpha}(x,y)\, p(y)\, dV(y)
&= p_\varepsilon^{-\alpha}(x)\int_{B_{\sqrt{\varepsilon}}(x)} K_\varepsilon(x,y)\, p^{1-\alpha}(y)\left[1-\alpha\varepsilon\,\frac{m_2}{d}\left(w(y)+\frac{\Delta p(y)}{2p(y)}\right)\right] dV(y) + O(\varepsilon^{\frac{d}{2}+2})\\
&\equiv p_\varepsilon^{-\alpha}(x)\, C - \frac{m_2\,\varepsilon}{d}\,\alpha\, p_\varepsilon^{-\alpha}(x)\, D + O(\varepsilon^{\frac{d}{2}+2}),
\end{aligned}$$

where

$$\begin{cases}
C \equiv \displaystyle\int_{B_{\sqrt{\varepsilon}}(x)} K_\varepsilon(x,y)\, p^{1-\alpha}(y)\, dV(y),\\[2ex]
D \equiv \displaystyle\int_{B_{\sqrt{\varepsilon}}(x)} K_\varepsilon(x,y)\, p^{1-\alpha}(y)\left(w(y)+\dfrac{\Delta p(y)}{2p(y)}\right) dV(y).
\end{cases}$$

We apply Lemma B.10 to C and D and get

$$C = \varepsilon^{\frac{d}{2}}\, p^{1-\alpha}(x)\left[1 + \varepsilon\,\frac{m_2}{d}\left(w(x) + \frac{\Delta(p^{1-\alpha})(x)}{2p^{1-\alpha}(x)}\right)\right] + O(\varepsilon^{\frac{d}{2}+2}),$$
$$D = \varepsilon^{\frac{d}{2}}\, p_\varepsilon^{1-\alpha}(x)\left(s(x) + \frac{\Delta p(x)}{2p(x)}\right) + O(\varepsilon^{\frac{d}{2}+1}).$$

In conclusion, the denominator of T∊,αX(x) is

$$\varepsilon^{\frac{d}{2}}\, p_\varepsilon^{-\alpha}(x)\, p^{1-\alpha}(x)\left\{1 + \varepsilon\,\frac{m_2}{d}\left(\frac{\Delta(p^{1-\alpha})(x)}{2p^{1-\alpha}(x)} - \alpha\,\frac{\Delta p(x)}{2p(x)}\right)\right\} + O(\varepsilon^{\frac{d}{2}+2}).$$

Putting all the above together, we have

$$T_{\varepsilon,\alpha}X(x) = X(x) + \varepsilon\,\frac{m_2}{2d}\left(\nabla^2 X(x) + \frac{2\,\nabla X(x)\cdot\nabla(p^{1-\alpha})(x)}{p^{1-\alpha}(x)}\right) + O(\varepsilon^2).$$

In particular, when α = 1, we have

$$T_{\varepsilon,1}X(x) = X(x) + \varepsilon\,\frac{m_2}{2d}\,\nabla^2 X(x) + O(\varepsilon^2).$$
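As a simple sanity check of this expansion (our example, not from the paper), take the unit circle (d = 1) with uniform density and the vector field $X(\theta) = \sin(\theta)\,\partial_\theta$; there the connection Laplacian reduces to the second derivative of the coefficient, so $\nabla^2 X = -X$ and
$$T_{\varepsilon,1}X = \left(1 - \frac{\varepsilon\, m_2}{2}\right)X + O(\varepsilon^2), \qquad \frac{T_{\varepsilon,1}X - X}{\varepsilon} = \frac{m_2}{2}\,\nabla^2 X + O(\varepsilon).$$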

B.6 Proof of Theorem 5.5

Suppose $\min_{y\in\partial\mathcal{M}} d(x,y) = \tilde{\varepsilon}$. Choose normal coordinates $\{\partial_1, \partial_2, \ldots, \partial_d\}$ on the geodesic ball $B_{\sqrt{\varepsilon}}(x)$ around $x$ so that $x_0 = \exp_x(\tilde{\varepsilon}\,\partial_d(x))$. By the Gauss lemma, $\operatorname{span}\{\partial_1(x_0), \partial_2(x_0), \ldots, \partial_{d-1}(x_0)\} = T_{x_0}\partial\mathcal{M}$ and $\partial_d(x_0)$ is the outer normal at $x_0$.

We focus first on the integral appearing in the numerator of T∊, 1X(x):

$$\int_{B_{\sqrt{\varepsilon}}(x)} \varepsilon^{-d/2}\, K_{\varepsilon,1}(x,y)\, P_{x,y}X(y)\, p(y)\, dV(y).$$

We divide the integration domain $\exp_x^{-1}(B_{\sqrt{\varepsilon}}(x))$ into slices $S_\eta$ defined by

$$S_\eta = \left\{(u,\eta)\in\mathbb{R}^d : \|(u_1,u_2,\ldots,u_{d-1},\eta)\| < \sqrt{\varepsilon}\right\},$$

where $\eta \in [-\varepsilon^{1/2},\varepsilon^{1/2}]$ and $u = (u_1, u_2, \ldots, u_{d-1}) \in \mathbb{R}^{d-1}$. By Taylor's expansion and (B.85), the numerator of $T_{\varepsilon,1}X$ becomes

$$\begin{aligned}
&\int_{B_{\sqrt{\varepsilon}}(x)} K_{\varepsilon,1}(x,y)\, P_{x,y}X(y)\, p(y)\, dV(y)\\
&\quad = p_\varepsilon^{-1}(x)\int_{S_\eta}\int_{-\sqrt{\varepsilon}}^{\sqrt{\varepsilon}} K\!\left(\frac{\sqrt{\|u\|^2+\eta^2}}{\sqrt{\varepsilon}}\right)
\left(X(x) + \sum_{i=1}^{d-1} u_i\,\nabla_i X(x) + \eta\,\nabla_d X(x) + O(\varepsilon)\right)
\left[1 - \varepsilon\,\frac{m_2}{d}\left(w(y) + \frac{\Delta p(y)}{2p(y)}\right) + O(\varepsilon^2)\right] d\eta\, du\\
&\quad = p^{-1}(x)\int_{S_\eta}\int_{-\sqrt{\varepsilon}}^{\sqrt{\varepsilon}} K\!\left(\frac{\sqrt{\|u\|^2+\eta^2}}{\sqrt{\varepsilon}}\right)
\left(X(x) + \sum_{i=1}^{d-1} u_i\,\nabla_i X(x) + \eta\,\nabla_d X(x) + O(\varepsilon)\right) d\eta\, du. \tag{B.86}
\end{aligned}$$

Note that in general the integration domain $S_\eta$ is not symmetric with respect to $(0, \ldots, 0, \eta)$, so we symmetrize $S_\eta$ by defining the symmetrized slices

$$\tilde{S}_\eta = \bigcap_{i=1}^{d-1}\left(R_i S_\eta \cap S_\eta\right),$$

where $R_i(u_1, \ldots, u_i, \ldots, \eta) = (u_1, \ldots, -u_i, \ldots, \eta)$. Note that from (B.22) in Lemma B.9, the orthonormal frame $\{P_{x_0,x}\partial_1(x), P_{x_0,x}\partial_2(x), \ldots, P_{x_0,x}\partial_{d-1}(x)\}$ of $T_{x_0}\partial\mathcal{M}$ differs from $\{\partial_1(x_0), \partial_2(x_0), \ldots, \partial_{d-1}(x_0)\}$ by $O(\varepsilon)$. Also note that, up to an error of order $\varepsilon^{3/2}$, we can express $\partial\mathcal{M} \cap B_{\sqrt{\varepsilon}}(x)$ by a homogeneous degree-2 polynomial in the variables $\{P_{x_0,x}\partial_1(x), P_{x_0,x}\partial_2(x), \ldots, P_{x_0,x}\partial_{d-1}(x)\}$. Thus the difference between $\tilde{S}_\eta$ and $S_\eta$ is of order $\varepsilon$, and (B.86) can be reduced to

$$p^{-1}(x)\int_{\tilde{S}_\eta}\int_{-\sqrt{\varepsilon}}^{\sqrt{\varepsilon}} K\!\left(\frac{\sqrt{\|u\|^2+\eta^2}}{\sqrt{\varepsilon}}\right)\left(X(x) + \sum_{i=1}^{d-1} u_i\,\nabla_i X(x) + \eta\,\nabla_d X(x) + O(\varepsilon)\right) d\eta\, du. \tag{B.87}$$

Next, we apply Taylor’s expansion to X(x):

$$P_{x,x_0}X(x_0) = X(x) + \tilde{\varepsilon}\,\nabla_d X(x) + O(\varepsilon).$$

Since

$$\nabla_d X(x) = P_{x,x_0}\left(\nabla_d X(x_0)\right) + O(\varepsilon^{\frac{1}{2}}),$$

the Taylor expansion of X(x) becomes

$$X(x) = P_{x,x_0}\left(X(x_0) - \tilde{\varepsilon}\,\nabla_d X(x_0) + O(\varepsilon)\right). \tag{B.88}$$

Similarly for all i = 1, 2, …, d we have

$$P_{x,x_0}\left(\nabla_i X(x_0)\right) = \nabla_i X(x) + O(\varepsilon^{\frac{1}{2}}). \tag{B.89}$$

Plugging (B.88) and (B.89) into (B.87) further reduces (B.86) to

$$p^{-1}(x)\int_{\tilde{S}_\eta}\int_{-\sqrt{\varepsilon}}^{\sqrt{\varepsilon}} K\!\left(\frac{\sqrt{\|u\|^2+\eta^2}}{\sqrt{\varepsilon}}\right) P_{x,x_0}\left(X(x_0) + \sum_{i=1}^{d-1} u_i\,\nabla_i X(x_0) + (\eta - \tilde{\varepsilon})\,\nabla_d X(x_0) + O(\varepsilon)\right) d\eta\, du. \tag{B.90}$$

The symmetry of the kernel implies that for i = 1, 2, …, d – 1,

$$\int_{\tilde{S}_\eta} K\!\left(\frac{\sqrt{\|u\|^2+\eta^2}}{\sqrt{\varepsilon}}\right) u_i\, du = 0, \tag{B.91}$$

and hence the numerator of $T_{\varepsilon,1}X(x)$ becomes

$$p^{-1}(x)\, P_{x,x_0}\left(m_0^\varepsilon\, X(x_0) + m_1^\varepsilon\, \nabla_d X(x_0)\right) + O(\varepsilon^{\frac{d}{2}+1}), \tag{B.92}$$

where

$$m_0^\varepsilon = \int_{\tilde{S}_\eta}\int_{-\sqrt{\varepsilon}}^{\sqrt{\varepsilon}} K\!\left(\frac{\sqrt{\|u\|^2+\eta^2}}{\sqrt{\varepsilon}}\right) d\eta\, du = O(\varepsilon^{\frac{d}{2}}) \tag{B.93}$$

and

$$m_1^\varepsilon = \int_{\tilde{S}_\eta}\int_{-\sqrt{\varepsilon}}^{\sqrt{\varepsilon}} K\!\left(\frac{\sqrt{\|u\|^2+\eta^2}}{\sqrt{\varepsilon}}\right)(\eta - \tilde{\varepsilon})\, d\eta\, du = O(\varepsilon^{\frac{d}{2}+\frac{1}{2}}). \tag{B.94}$$

Similarly, the denominator of T∊,1X can be expanded as

$$\int_{B_{\sqrt{\varepsilon}}(x)} K_{\varepsilon,1}(x,y)\, p(y)\, dV(y) = p^{-1}(x)\, m_0^\varepsilon + O(\varepsilon^{\frac{d}{2}+\frac{1}{2}}), \tag{B.95}$$

which together with (B.92) gives us the following asymptotic expansion:

$$T_{\varepsilon,1}X(x) = P_{x,x_0}\left(X(x_0) + \frac{m_1^\varepsilon}{m_0^\varepsilon}\,\nabla_d X(x_0)\right) + O(\varepsilon). \tag{B.96}$$

Combining (B.96) with (B.9) in Theorem B.3, we conclude the theorem.

B.7 Proof of Theorem 5.6

We denote the spectrum of $\nabla^2$ by $\{\lambda_l\}_{l=0}^{\infty}$, where $0 \le \lambda_0 \le \lambda_1 \le \cdots$, and the corresponding eigenspaces by $E_l := \{X \in L^2(T\mathcal{M}) : \nabla^2 X = -\lambda_l X\}$, $l = 0, 1, \ldots$. The eigenvector fields are smooth and form a basis for $L^2(T\mathcal{M})$, that is,

$$L^2(T\mathcal{M}) = \overline{\bigoplus_{l\in\mathbb{N}\cup\{0\}} E_l}.$$

Thus we proceed by considering the approximation on each eigenvector-field subspace. To simplify notation, we rescale the kernel $K$ so that $\frac{m_2}{2d} = 1$.

Fix $X_l \in E_l$. When $x \notin \mathcal{M}_{\sqrt{\varepsilon}}$, from Corollary B.5 we have, uniformly,

$$\frac{T_{\varepsilon,1}X_l(x) - X_l(x)}{\varepsilon} = \nabla^2 X_l(x) + O(\varepsilon).$$

When $x \in \mathcal{M}_{\sqrt{\varepsilon}}$, from Theorem 5.5 and the Neumann condition, we have, uniformly,

$$T_{\varepsilon,1}X_l(x) = P_{x,x_0}X_l(x_0) + O(\varepsilon). \tag{B.97}$$

Note that we have

$$P_{x,x_0}X_l(x_0) = X_l(x) + \tilde{\varepsilon}\, P_{x,x_0}\nabla_d X_l(x_0) + O(\varepsilon);$$

thus again by the Neumann condition at x0, (B.97) becomes

$$T_{\varepsilon,1}X_l(x) = X_l(x) + O(\varepsilon).$$

In conclusion, when $x \in \mathcal{M}_{\sqrt{\varepsilon}}$, we have, uniformly,

$$\frac{T_{\varepsilon,1}X_l(x) - X_l(x)}{\varepsilon} = O(1).$$

Note that when the boundary of the manifold is smooth, the measure of $\mathcal{M}_{\sqrt{\varepsilon}}$ is $O(\varepsilon^{1/2})$. We conclude that in the $L^2$ sense,

$$\left\|\frac{T_{\varepsilon,1}X_l - X_l}{\varepsilon} - \nabla^2 X_l\right\|_{L^2} = O(\varepsilon^{1/4}). \tag{B.98}$$

Next we show how $T_{\varepsilon,1}^{t/\varepsilon}$ converges to $e^{t\nabla^2}$. We know $I + \varepsilon\nabla^2$ is invertible on $E_l$ with norm $\frac{1}{2} \le \|I + \varepsilon\nabla^2\| < 1$ when $\varepsilon < \frac{1}{2\lambda_l}$. Next, note that if $B$ is a bounded operator with norm $\|B\| < 1$, we have the following bound for any $s > 0$ by the binomial expansion:

$$\begin{aligned}
\|(I+B)^s - I\| &= \left\|sB + \frac{s(s-1)}{2!}B^2 + \frac{s(s-1)(s-2)}{3!}B^3 + \cdots\right\|\\
&\le s\|B\| + \frac{s(s-1)}{2!}\|B\|^2 + \frac{s(s-1)(s-2)}{3!}\|B\|^3 + \cdots\\
&= s\|B\|\left\{1 + \frac{s-1}{2!}\|B\| + \frac{(s-1)(s-2)}{3!}\|B\|^2 + \cdots\right\}\\
&\le s\|B\|\left\{1 + \frac{s-1}{1!}\|B\| + \frac{(s-1)(s-2)}{2!}\|B\|^2 + \cdots\right\}
= s\|B\|\,(1+\|B\|)^{s-1}. \tag{B.99}
\end{aligned}$$
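A quick scalar sanity check of (B.99) (ours, restricted to $s \ge 1$, which is the regime $s = t/\varepsilon$ used below; the grid values are arbitrary):

```python
import numpy as np

b = np.linspace(1e-3, 0.999, 500)          # stand-in for ||B|| < 1
for s in [1.0, 2.5, 10.0, 100.0]:          # s = t/eps >= 1 in the application below
    lhs = (1.0 + b) ** s - 1.0
    rhs = s * b * (1.0 + b) ** (s - 1.0)
    assert np.all(lhs <= rhs + 1e-9), s    # (1+b)^s - 1 <= s*b*(1+b)^(s-1)
print("scalar version of (B.99) holds on the tested grid")
```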

On the other hand, note that on El

$$e^{t\nabla^2} = \left(I + \varepsilon\nabla^2\right)^{\frac{t}{\varepsilon}} + O(\varepsilon). \tag{B.100}$$

Indeed, for $X \in E_l$ we have $e^{t\nabla^2}X = \left(1 - t\lambda_l + t^2\lambda_l^2/2 - \cdots\right)X$ and

$$\left(I+\varepsilon\nabla^2\right)^{\frac{t}{\varepsilon}}X = \left(1 - t\lambda_l + \frac{t^2\lambda_l^2}{2} - \frac{t\varepsilon\lambda_l^2}{2} + \cdots\right)X$$

by the binomial expansion. Thus we have the claim.
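A scalar illustration of (B.100) (ours, not from the paper): on $E_l$ the operator $I + \varepsilon\nabla^2$ acts as multiplication by $1 - \varepsilon\lambda_l$, and $(1 - \varepsilon\lambda_l)^{t/\varepsilon}$ approaches $e^{-t\lambda_l}$ with an $O(\varepsilon)$ gap. The values of $t$ and $\lambda_l$ below are arbitrary choices for the demonstration.

```python
import numpy as np

t, lam = 1.0, 3.0                           # time and an eigenvalue lambda_l (assumed values)
for eps in [1e-1, 1e-2, 1e-3, 1e-4]:
    gap = abs(np.exp(-t * lam) - (1.0 - eps * lam) ** (t / eps))
    print(eps, gap, gap / eps)              # gap/eps stabilizes, consistent with O(eps)
```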

Putting all the above together, over $E_l$ for all $l \ge 0$, when $\varepsilon < 1/\lambda_l^{16+2d}$ we have

$$\begin{aligned}
\left\|T_{\varepsilon,1}^{t/\varepsilon} - e^{t\nabla^2}\right\|
&= \left\|\left(I + \varepsilon\nabla^2 + O(\varepsilon^{\frac{5}{4}})\right)^{\frac{t}{\varepsilon}} - \left(I + \varepsilon\nabla^2\right)^{\frac{t}{\varepsilon}}\right\| + O(\varepsilon)\\
&\le \left\|\left(I + \varepsilon\nabla^2\right)^{\frac{t}{\varepsilon}}\right\|\,\left\|\left[I + \left(I + \varepsilon\nabla^2\right)^{-1}O(\varepsilon^{5/4})\right]^{\frac{t}{\varepsilon}} - I\right\| + O(\varepsilon)\\
&\le \left(1 + t + O(\varepsilon)\right)\left(\varepsilon^{1/4}\, t + O(\varepsilon)\right) = O(\varepsilon^{\frac{1}{4}}),
\end{aligned}$$

where the first equality comes from (B.98) and (B.100), and the third inequality comes from (B.99). Thus on $\overline{\bigoplus_{l:\lambda_l<\varepsilon^{-1/(16+2d)}} E_l}$ we have $\left\|T_{\varepsilon,1}^{t/\varepsilon} - e^{t\nabla^2}\right\| = O(\varepsilon^{1/8})$. Taking $\varepsilon \to 0$ completes the proof.

Appendix C

Multiplicities of Eigen 1-Forms of Connection Laplacian over 𝕊n

All results and proofs in this section can be found in [15, 38]. Consider the following setting:

$$G = SO(n+1),\qquad K = SO(n),\qquad \mathcal{M} = G/K = \mathbb{S}^n,\qquad \mathfrak{g} = \mathfrak{so}(n+1),\qquad \mathfrak{k} = \mathfrak{so}(n).$$

Denote by Ω1(𝕊n) the complexified smooth 1-forms, which form a G-module under (g · s)(x) = g · s(g−1x) for g ∈ G, s ∈ Ω1(𝕊n), and x ∈ 𝕊n. Over ℳ we have the Haar measure dµ defined by the bi-invariant Hermitian metric 〈·, ·〉, which comes from the Killing form of G, and the Hodge Laplacian Δ = dδ + δd defined by 〈·, ·〉. Since Δ is a self-adjoint, uniformly elliptic second-order operator on Ω1(𝕊n), its eigenvalues λi are discrete nonnegative real numbers with an accumulation point only at ∞, and the associated eigenspaces Ei are finite dimensional. We also know that $\bigoplus_{i=1}^{\infty} E_i$ is dense in Ω1(𝕊n) in the topology defined by the inner product $(f, g)_{\mathbb{S}^n} := \int_{\mathbb{S}^n}\langle f, g\rangle\, d\mu$. Note that $\mathfrak{g}/\mathfrak{k}\otimes\mathbb{C} \cong \mathbb{C}^n$ when G = SO(n + 1) and K = SO(n). Denote by V = ℂn the standard representation of SO(n). We split the calculation of the multiplicity of eigenforms of Δ over 𝕊n into five steps.

Step 1. Clearly Ω1(𝕊n) is a reducible G-module. Fix an irreducible representation Γλ of G with highest weight λ and construct a G-homomorphism

$$\operatorname{Hom}_G\!\left(\Gamma_\lambda, \Omega^1(\mathbb{S}^n)\right) \otimes \Gamma_\lambda \to \Omega^1(\mathbb{S}^n)$$

by φ ⊗ v ↦ φ(v). We call the image the Γλ-isotypical summand in Ω1(𝕊n), with multiplicity $\dim \operatorname{Hom}_G(\Gamma_\lambda, \Omega^1(\mathbb{S}^n))$. Then we apply the Frobenius reciprocity law:

$$\operatorname{Hom}_G\!\left(\Gamma_\lambda, \Omega^1(\mathbb{S}^n)\right) \cong \operatorname{Hom}_K\!\left(\operatorname{res}^G_K \Gamma_\lambda,\, V\right).$$

Thus if we can calculate $\dim \operatorname{Hom}_K(\operatorname{res}^G_K\Gamma_\lambda, V)$, we know how many copies of the irreducible representation $\Gamma_\lambda$ are inside $\Omega^1(\mathbb{S}^n)$.
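As a worked instance of this counting (ours; it is consistent with the 𝕊² row of Table C.1 below), take n = 2, G = SO(3), K = SO(2), and let $\Gamma_k$ denote the (2k + 1)-dimensional irreducible SO(3)-module. Restricted to K = SO(2), $\Gamma_k$ decomposes into the characters of weights $-k, \ldots, k$, each with multiplicity one, while V = ℂ² decomposes into the characters of weights ±1. Hence, for $k \ge 1$,
$$\dim\operatorname{Hom}_K\!\left(\operatorname{res}^G_K\Gamma_k,\, V\right) = 2,$$
so $\Gamma_k$ appears in Ω¹(𝕊²) with multiplicity 2, and the corresponding eigenvalue has total multiplicity $2(2k+1)$, i.e., 6, 10, 14, … for k = 1, 2, 3, ….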

Denote by $L_1, L_2, \ldots, L_n$ the basis for the dual space of the Cartan subalgebra of $\mathfrak{so}(2n)$ or $\mathfrak{so}(2n+1)$. Then $L_1, L_2, \ldots, L_n$ together with $\frac{1}{2}\sum_{i=1}^n L_i$ generate the weight lattice. The Weyl chamber of $SO(2n+1)$ is

$$\mathcal{W} = \left\{\textstyle\sum_i a_i L_i : a_1 \ge a_2 \ge \cdots \ge a_n \ge 0\right\},$$

and the edges of $\mathcal{W}$ are thus the rays generated by the vectors $L_1$, $L_1 + L_2$, …, $L_1 + L_2 + \cdots + L_n$; for $SO(2n)$, the Weyl chamber is

$$\mathcal{W} = \left\{\textstyle\sum_i a_i L_i : a_1 \ge a_2 \ge \cdots \ge |a_n|\right\},$$

and the edges are thus the rays generated by the vectors $L_1$, $L_1 + L_2$, …, $L_1 + L_2 + \cdots + L_{n-3} + L_{n-2}$, $L_1 + L_2 + \cdots + L_{n-2} + L_{n-1} + L_n$, and $L_1 + L_2 + \cdots + L_{n-2} + L_{n-1} - L_n$.

To keep the notation unified, we denote the fundamental weights $\omega_p$ separately for the two cases. When $G = SO(2n+1)$, denote

$$\begin{cases}
\omega_0 = 0 & \text{when } p = 0,\\
\omega_p = \sum_{i=1}^{p} L_i & \text{when } 1 \le p \le n-1,\\
\omega_n = \frac{1}{2}\sum_{i=1}^{n} L_i & \text{when } p = n;
\end{cases} \tag{C.1}$$

when $G = SO(2n)$, denote

$$\begin{cases}
\omega_0 = 0 & \text{when } p = 0,\\
\omega_p = \sum_{i=1}^{p} L_i & \text{when } 1 \le p \le n-2,\\
\omega_{n-1} = \frac{1}{2}\left(\sum_{i=1}^{n-1} L_i - L_n\right) & \text{when } p = n-1,\\
\omega_n = \frac{1}{2}\left(\sum_{i=1}^{n-1} L_i + L_n\right) & \text{when } p = n.
\end{cases} \tag{C.2}$$

THEOREM C.1.

  1. When m = 2n + 1, the exterior powers ΛpV of the standard representation V of $\mathfrak{so}(2n+1)$ are the irreducible representations with highest weight ωp when p < n and 2ωn when p = n.

  2. When m = 2n, the exterior powers ΛpV of the standard representation V of $\mathfrak{so}(2n)$ are the irreducible representations with highest weight ωp when p ≤ n − 1; when p = n, ΛnV splits into two irreducible representations with highest weights 2ωn−1 and 2ωn.

PROOF. Please see [15] for details.

THEOREM C.2 (Branching Theorem). When G = SO(m) and K = SO(m − 1), the restriction of an irreducible representation of G to K decomposes as a direct sum of irreducible representations of K in the following way. Let Γλ be an irreducible G-module over ℂ with highest weight $\lambda = \sum_{i=1}^{n}\lambda_i L_i \in \mathcal{W}$.

  1. If m = 2n,
     $$\Gamma_\lambda = \bigoplus \Gamma_{\sum_{i=1}^{n-1}\bar{\lambda}_i L_i},$$
     where the direct sum runs over all $\bar{\lambda}_i$ such that
     $$\lambda_1 \ge \bar{\lambda}_1 \ge \lambda_2 \ge \bar{\lambda}_2 \ge \cdots \ge \bar{\lambda}_{n-1} \ge |\lambda_n|,$$
     with the $\lambda_i$ and $\bar{\lambda}_i$ simultaneously all integers or all half-integers. Here $\Gamma_{\sum_{i=1}^{n-1}\bar{\lambda}_i L_i}$ is the irreducible K-module with highest weight $\sum_{i=1}^{n-1}\bar{\lambda}_i L_i$.
  2. If m = 2n + 1,
     $$\Gamma_\lambda = \bigoplus \Gamma_{\sum_{i=1}^{n}\bar{\lambda}_i L_i},$$
     where the direct sum runs over all $\bar{\lambda}_i$ such that
     $$\lambda_1 \ge \bar{\lambda}_1 \ge \lambda_2 \ge \bar{\lambda}_2 \ge \cdots \ge \bar{\lambda}_{n-1} \ge \lambda_n \ge |\bar{\lambda}_n|,$$
     with the $\lambda_i$ and $\bar{\lambda}_i$ simultaneously all integers or all half-integers. Here $\Gamma_{\sum_{i=1}^{n}\bar{\lambda}_i L_i}$ is the irreducible K-module with highest weight $\sum_{i=1}^{n}\bar{\lambda}_i L_i$.

PROOF. Please see [15] for details.

From the above theorems, we know how to calculate $\dim\operatorname{Hom}_G(\Gamma_\lambda, \Omega^1(\mathcal{M}))$. To be more precise, since $V$ on the right-hand side of $\operatorname{Hom}_K(\operatorname{res}^G_K\Gamma_\lambda, V)$ is the irreducible representation of K with highest weight $\omega_1$ (or splits in the low-dimensional case), by Schur's lemma and the classification theorem $\dim\operatorname{Hom}_K(\operatorname{res}^G_K\Gamma_\lambda, V)$ can be 1 only if $\operatorname{res}^G_K\Gamma_\lambda$ contains an irreducible summand with that same highest weight.

Step 2. In this step we relate the irreducible representation Γλ ⊂ Ω1(𝕊n) to the eigenvalue of the Hodge-Laplace operator Δ.

THEOREM C.3. Suppose Γµ ⊂ Ω1(𝕊n) is an irreducible G-module with the highest weight µ; then we have

$$\Delta f = \langle \mu + 2\rho,\, \mu\rangle\, f,$$

where f ∈ Γµ, ρ is the half sum of all positive roots, and 〈·, ·〉 is the inner product on the dual of the Cartan subalgebra of 𝔤 induced from the Killing form B.

PROOF. Please see [38] for details.

Note that for $SO(2n+1)$, $\rho = \sum_{i=1}^{n}\left(n + \frac{1}{2} - i\right)L_i$ since $R^+ = \{L_i - L_j,\; L_i + L_j\; (i < j),\; L_i\}$; for $SO(2n)$, $\rho = \sum_{i=1}^{n}(n - i)L_i$ since $R^+ = \{L_i - L_j,\; L_i + L_j : i < j\}$.

Combining these theorems, we know that if Γµ is an irreducible representation of G, then it lies in an eigenspace of Δ with eigenvalue λ = 〈µ + 2ρ, µ〉. In particular, if we can decompose the eigen 1-form space Eλ ⊂ Ω1(𝕊n) into irreducible G-modules and calculate their dimensions, we can determine the multiplicity of Eλ.
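For instance (a worked instance of Theorem C.3, ours, consistent with the 𝕊² row of Table C.1), take G = SO(3), so n = 1, $\rho = \frac{1}{2}L_1$, and µ = kL₁:
$$\langle \mu + 2\rho,\, \mu\rangle = \langle (k+1)L_1,\, kL_1\rangle = k(k+1),$$
which is the eigenvalue listed for 𝕊² in Table C.1; the corresponding G-module $\Gamma_{kL_1}$ has dimension 2k + 1.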

Step 3. Now we apply the Weyl character formula to calculate the dimension of Γλ for G.

THEOREM C.4.

  1. When m = 2n + 1, consider $\lambda = \sum_{i=1}^{n}\lambda_i L_i$ with λ1 ≥ λ2 ≥ … ≥ λn ≥ 0, the highest weight of an irreducible representation Γλ. Then
     $$\dim\Gamma_\lambda = \prod_{i<j}\frac{l_i - l_j}{j - i}\;\prod_{i\le j}\frac{l_i + l_j}{2n + 1 - i - j}, \qquad\text{where } l_i = \lambda_i + n - i + \tfrac{1}{2}.$$
  2. When m = 2n, consider $\lambda = \sum_{i=1}^{n}\lambda_i L_i$ with λ1 ≥ λ2 ≥ … ≥ |λn|, the highest weight of an irreducible representation Γλ. Then
     $$\dim\Gamma_\lambda = \prod_{i<j}\frac{(l_i - l_j)(l_i + l_j)}{(j - i)(2n - i - j)}, \qquad\text{where } l_i = \lambda_i + n - i.$$

PROOF. Please see [15] for details.
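The following short script (ours; the function names are of course not from the paper) evaluates the two dimension formulas of Theorem C.4 for the weights appearing in Table C.1. For 𝕊⁴ (G = SO(5)) and 𝕊⁵ (G = SO(6)) it reproduces the multiplicities 5, 10, 14 and 6, 15, 20 quoted at the end of this appendix.

```python
# Sketch only: evaluate the Weyl dimension formulas of Theorem C.4 with exact arithmetic.
from fractions import Fraction
from itertools import combinations

def dim_so_odd(lam):
    """Dimension of the irrep of so(2n+1) with highest weight sum(lam_i * L_i)."""
    n = len(lam)
    l = [Fraction(lam[i]) + n - (i + 1) + Fraction(1, 2) for i in range(n)]
    d = Fraction(1)
    for i, j in combinations(range(n), 2):          # product over i < j
        d *= (l[i] - l[j]) / (j - i)
    for i in range(n):                              # product over i <= j
        for j in range(i, n):
            d *= (l[i] + l[j]) / (2 * n + 1 - (i + 1) - (j + 1))
    return int(d)

def dim_so_even(lam):
    """Dimension of the irrep of so(2n) with highest weight sum(lam_i * L_i)."""
    n = len(lam)
    l = [Fraction(lam[i]) + n - (i + 1) for i in range(n)]
    d = Fraction(1)
    for i, j in combinations(range(n), 2):          # product over i < j
        d *= ((l[i] - l[j]) * (l[i] + l[j])) / ((j - i) * (2 * n - (i + 1) - (j + 1)))
    return int(d)

# S^4 = SO(5)/SO(4): weights L_1, L_1 + L_2, 2L_1  ->  5, 10, 14
print([dim_so_odd(w) for w in [(1, 0), (1, 1), (2, 0)]])
# S^5 = SO(6)/SO(5): same weights for so(6)        ->  6, 15, 20
print([dim_so_even(w) for w in [(1, 0, 0), (1, 1, 0), (2, 0, 0)]])
```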

Step 4. We need the following theorem about real representations of G to solve the original problem.

TABLE C.1.

Eigenvalues and their multiplicities for the eigen 1-form spaces of $\mathbb{S}^m$, $m \ge 2$. Here $l_{0,i} = \lambda_i + n + 1 - i$ and $m_{0,i} = n + 1 - i$ (indices $1 \le i \le n+1$), while $l_{1/2,i} = \lambda_i + n - i + \frac{1}{2}$ and $m_{1/2,i} = n - i + \frac{1}{2}$ (indices $1 \le i \le n$).

| | highest weight λ | eigenvalue | multiplicity |
| --- | --- | --- | --- |
| $\mathbb{S}^{2n}$, $n \ge 2$ | $kL_1$, $k \ge 1$ | $k(k + 2n - 1)$ | $\prod_{i<j}\frac{l_{1/2,i}^2 - l_{1/2,j}^2}{m_{1/2,i}^2 - m_{1/2,j}^2}\prod_i\frac{l_{1/2,i}}{m_{1/2,i}}$ |
| | $kL_1 + L_2$, $k \ge 1$ | $(k + 1)(k + 2n - 2)$ | $\prod_{i<j}\frac{l_{1/2,i}^2 - l_{1/2,j}^2}{m_{1/2,i}^2 - m_{1/2,j}^2}\prod_i\frac{l_{1/2,i}}{m_{1/2,i}}$ |
| $\mathbb{S}^{2n+1}$, $n \ge 2$ | $kL_1$, $k \ge 1$ | $k(k + 2n)$ | $\prod_{i<j}\frac{l_{0,i}^2 - l_{0,j}^2}{m_{0,i}^2 - m_{0,j}^2}$ |
| | $kL_1 + L_2$, $k \ge 1$ | $(k + 1)(k + 2n - 1)$ | $\prod_{i<j}\frac{l_{0,i}^2 - l_{0,j}^2}{m_{0,i}^2 - m_{0,j}^2}$ |
| $\mathbb{S}^3$ | $kL_1$, $k \ge 1$ | $k(k + 2)$ | $(k + 1)^2$ |
| | $kL_1 + L_2$, $k \ge 1$ | $(k + 1)^2$ | $k(k + 2)$ |
| | $kL_1 - L_2$, $k \ge 1$ | $(k + 1)^2$ | $k(k + 2)$ |
| $\mathbb{S}^2$ | $kL_1$, $k \ge 1$ | $k(k + 1)$ | $2(2k + 1)$ |

THEOREM C.5.

  1. When m = 2n + 1, for any weight λ = a1ω1 + a2ω2 + … + an−1ωn−1 + anωn of $\mathfrak{so}(2n+1,\mathbb{R})$, the irreducible representation Γλ with highest weight λ is real if an is even, or if an is odd and n ≡ 0 or 3 mod 4; if an is odd and n ≡ 1 or 2 mod 4, then Γλ is quaternionic.

  2. When m = 2n, the representation Γλ of $\mathfrak{so}(2n,\mathbb{R})$ with highest weight λ = a1ω1 + a2ω2 + … + an−2ωn−2 + an−1ωn−1 + anωn is complex if n is odd and an−1 ≠ an; it is quaternionic if n ≡ 2 mod 4 and an−1 + an is odd; and it is real otherwise.

PROOF. Please see [15] for details.

Combining Table C.1 and this theorem, we see that all the eigen 1-form spaces are of real type.

Step 5. Now we put all the above together. All the eigen 1-form spaces of 𝕊m, viewed as irreducible representations of SO(m + 1), m ≥ 2, are listed in Table C.1, where the highest weight is denoted by λ.

Consider SO(3) for example. In this case, n = 1 and ℳ = 𝕊2. From the analysis in cryo-EM [18], we know that the multiplicities of eigenvectors are 6, 10, …, which echoes the above analysis. Moreover, the first few multiplicities of 𝕊3, 𝕊4, and 𝕊5 are 4, 6, 9, 16, 16, …; 5, 10, 14, …; and 6, 15, 20, …, respectively.

Footnotes

1

One of the main considerations in the presentation of this paper is to make it as accessible as possible, including to readers who are not familiar with differential geometry. Although the connection Laplacian is essential to the understanding of the mathematical framework that underlies VDM, and differential geometry is extensively used in Appendix B for the proof of Theorem 5.3, we do not assume knowledge of differential geometry in Sections 2 through 10 (except for some parts of Section 8) that detail the algorithmic framework. The concepts of differential geometry that are required for achieving basic familiarity with the connection Laplacian are explained in Appendix A.

2

Since $N_i$ depends on $\varepsilon_{\mathrm{PCA}}$, it should be denoted $N_{i,\varepsilon_{\mathrm{PCA}}}$; but since $\varepsilon_{\mathrm{PCA}}$ is kept fixed, it is suppressed from the notation, a convention that we use except in cases where confusion may arise.

3

The solution is unique whenever $O_i^{\mathsf{T}}O_j$ is nonsingular, a condition that is satisfied whenever the distance between $x_i$ and $x_j$ is sufficiently small, due to bounded curvature.

4

The definition of the parallel transport operator is provided in Appendix A and in textbooks on differential geometry; see, e.g., [32, chap. 2].

5

We do not align a basis with itself, so the edge set E does not contain self-loops of the form (i, i).

6

Notice that the weights are only a function of the Euclidean distance between data points; another possibility, which we do not consider in this paper, is to include the Grassmannian distance $\|O_iO_i^{\mathsf{T}} - O_jO_j^{\mathsf{T}}\|_2$ in the definition of the weight.

7

Here we abuse notation slightly. Usually Txℳ defined here is understood as the embedded tangent plane by the embedding ι of the tangent plane at x. Please see [32] for a rigorous definition of the tangent plane.

8

See [32] for the exact notion of differentiability. Here again we abuse notation slightly. Usually X defined here is understood as the embedded vector field by the embedding ι of the vector field X. For the rigorous definition of a vector field, please see [32].

Contributor Information

A. Singer, Princeton University, Dedicated to the memory of Partha Niyogi, Fine Hall, Washington Road, Princeton, N.J. 08544-1000, amits@math.princeton.edu

H.-T. Wu, Princeton University, Dedicated to the memory of Partha Niyogi, Fine Hall, Washington Road, Princeton, N.J. 08544-1000, hauwu@math.princeton.edu

Bibliography

  1. Arun KS, Huang TS, Blostein SD. Least-squares fitting of two 3-D point sets. IEEE Trans. Patt. Anal. Mach. Intell. 1987;9(no. 5):698–700. doi: 10.1109/tpami.1987.4767965.
  2. Belkin M, Niyogi P. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation. 2003;15(no. 6):1373–1396.
  3. Belkin M, Niyogi P. Towards a theoretical foundation for Laplacian-based manifold methods. In: Learning Theory. Lecture Notes in Computer Science, Vol. 3559. Berlin: Springer; 2005. pp. 486–500.
  4. Belkin M, Niyogi P. Convergence of Laplacian eigenmaps. In: Advances in Neural Information Processing Systems. Cambridge, Mass.: MIT Press; 2008. pp. 1–31.
  5. Berline N, Getzler E, Vergne M. Heat kernels and Dirac operators. Berlin: Springer; 2004.
  6. Bickel PJ, Levina E. Covariance regularization by thresholding. Ann. Statist. 2008;36(no. 6):2577–2604.
  7. Candès EJ, Li X, Ma Y, Wright J. Robust principal component analysis? J. ACM. 2011;58(no. 3):Art. 11, 37 pp.
  8. Carlsson G, Ishkhanov T, de Silva V, Zomorodian A. On the local behavior of spaces of natural images. Int. J. Comput. Vis. 2008;76(no. 1):1–12.
  9. Coifman RR, Lafon S. Diffusion maps. Appl. Comput. Harmon. Anal. 2006;21(no. 1):5–30.
  10. Cucuringu M, Lipman Y, Singer A. Sensor network localization by eigenvector synchronization over the Euclidean group. ACM Transactions on Sensor Networks, in press. doi: 10.1145/2240092.2240093.
  11. DeWitt B. The global approach to quantum field theory. New York: Oxford University Press; 2003.
  12. Donoho DL, Grimes C. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci. USA. 2003;100(no. 10):5591–5596. doi: 10.1073/pnas.1031596100.
  13. Fan K, Hoffman AJ. Some metric inequalities in the space of matrices. Proc. Amer. Math. Soc. 1955;6(no. 1):111–116.
  14. Frank J. Three-dimensional electron microscopy of macromolecular assemblies: visualization of biological molecules in their native state. 2nd ed. New York: Oxford University Press; 2006.
  15. Fulton W, Harris J. Representation theory: a first course. New York: Springer; 1991.
  16. Gilkey P. The index theorem and the heat equation. Mathematics Lecture Series, 4. Boston: Publish or Perish; 1974.
  17. Goldberg M, Kim S. Some remarks on diffusion distances. J. Appl. Math. 2010;2010:Art. ID 464815, 17 pp.
  18. Hadani R, Singer A. Representation theoretic patterns in three dimensional cryo-electron microscopy I—the intrinsic reconstitution algorithm. Ann. of Math. (2) 2011;174(no. 2):1219–1241. doi: 10.4007/annals.2011.174.2.11.
  19. Hadani R, Singer A. Representation theoretic patterns in three dimensional cryo-electron microscopy II—the class averaging problem. Found. Comput. Math. 2011;11(no. 5):589–616. doi: 10.1007/s10208-011-9095-3.
  20. Hein M, Audibert J-Y, von Luxburg U. From graphs to manifolds—weak and strong pointwise consistency of graph Laplacians. In: Learning Theory. Lecture Notes in Computer Science, Vol. 3559. Berlin: Springer; 2005. pp. 470–485.
  21. Higham NJ. Computing the polar decomposition—with applications. SIAM J. Sci. Statist. Comput. 1986;7(no. 4):1160–1174.
  22. Hoeffding W. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc. 1963;58:13–30.
  23. Johnstone IM. High dimensional statistical inference and random matrices. In: International Congress of Mathematicians, Vol. 1. Zürich: European Mathematical Society; 2007. pp. 307–333.
  24. Johnstone IM, Lu AY. On consistency and sparsity for principal components analysis in high dimensions. J. Amer. Statist. Assoc. 2009;104(no. 486):682–693. doi: 10.1198/jasa.2009.0121.
  25. Keller JB. Closest unitary, orthogonal and Hermitian operators to a given operator. Math. Mag. 1975;48(no. 4):192–197. Available at: http://www.jstor.org/stable/2690338.
  26. Lafon S. Diffusion maps and geometric harmonics. Thesis. New Haven, Conn.: Yale University; 2004.
  27. Lee A, Pedersen K, Mumford D. The nonlinear statistics of high-contrast patches in natural images. Int. J. Comput. Vis. 2003;54(no. 1–3):83–103.
  28. Little AV, Lee J, Jung Y, Maggioni M. Estimation of intrinsic dimensionality of samples from noisy low-dimensional manifolds in high dimensions with multiscale SVD. In: 2009 IEEE/SP 15th Workshop on Statistical Signal Processing; 2009. pp. 85–88.
  29. Natterer F. The mathematics of computerized tomography. Classics in Applied Mathematics, Vol. 32. Philadelphia: Society for Industrial and Applied Mathematics (SIAM); 2001.
  30. Niyogi P, Smale S, Weinberger S. Finding the homology of submanifolds with high confidence from random samples. Discrete Comput. Geom. 2008;39(no. 1–3):419–441.
  31. Penczek P, Zhu J, Frank J. A common-lines based method for determining orientations for N ≥ 3 particle projections simultaneously. Ultramicroscopy. 1996;63(no. 3–4):205–218. doi: 10.1016/0304-3991(96)00037-x.
  32. Petersen P. Riemannian geometry. New York: Springer; 2006.
  33. Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. Science. 2000;290(no. 5500):2323–2326. doi: 10.1126/science.290.5500.2323.
  34. Singer A. From graph to manifold Laplacian: the convergence rate. Appl. Comput. Harmon. Anal. 2006;21(no. 1):128–134.
  35. Singer A. Angular synchronization by eigenvectors and semidefinite programming. Appl. Comput. Harmon. Anal. 2011;30(no. 1):20–36. doi: 10.1016/j.acha.2010.02.001.
  36. Singer A, Wu H-T. Orientability and diffusion map. Appl. Comput. Harmon. Anal. 2011;31(no. 1):44–58. doi: 10.1016/j.acha.2010.10.001.
  37. Singer A, Zhao Z, Shkolnisky Y, Hadani R. Viewing angle classification of cryo-electron microscopy images using eigenvectors. SIAM J. Imaging Sci. 2011;4(no. 2):723–759. doi: 10.1137/090778390.
  38. Taylor ME. Noncommutative harmonic analysis. Mathematical Surveys and Monographs, Vol. 22. Providence, RI: American Mathematical Society; 1986.
  39. Tenenbaum JB, de Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science. 2000;290(no. 5500):2319–2323. doi: 10.1126/science.290.5500.2319.
  40. van Heel M. Angular reconstitution: a posteriori assignment of projection directions for 3D reconstruction. Ultramicroscopy. 1987;21(no. 2):111–123. doi: 10.1016/0304-3991(87)90078-7.
  41. van Heel M, Gowen B, Matadeen R, Orlova E, Finn R, Pape T, Cohen D, Stark H, Schmidt R, Schatz M, Patwardhan A. Single-particle electron cryo-microscopy: towards atomic resolution. Quarterly Reviews of Biophysics. 2000;33(no. 4):307–369. doi: 10.1017/s0033583500003644.
  42. Zhang Z, Zha H. Principal manifolds and nonlinear dimensionality reduction via tangent space alignment. SIAM J. Sci. Comput. 2004;26(no. 1):313–338.
