Abstract
This paper studies the linear convergence of the subspace constrained mean shift (SCMS) algorithm, a well-known algorithm for identifying a density ridge defined by a kernel density estimator. By arguing that the SCMS algorithm is a special variant of a subspace constrained gradient ascent (SCGA) algorithm with an adaptive step size, we derive the linear convergence of such an SCGA algorithm. While the existing research focuses mainly on density ridges in the Euclidean space, we generalize density ridges and the SCMS algorithm to directional data. In particular, we establish the stability theorem of density ridges with directional data and prove the linear convergence of our proposed directional SCMS algorithm.
Keywords: ridges, subspace constrained mean shift, directional data, optimization on a manifold
Mathematics Subject Classification: 62G05, 49Q12, 62H11
1. Introduction
Identifying meaningful lower-dimensional structures from a point cloud has long been a popular research topic in Statistics and Machine Learning [60, 111]. One reliable characterization of such a low-dimensional structure is the density ridge, which can be feasibly estimated by a kernel density estimator (KDE) from point cloud data [39, 45]. Loosely speaking, an estimated density ridge signifies a high-density curve or surface in a point cloud; see the left panel of Fig. 1. Let $p$ be the underlying probability density function that generates the data in the Euclidean space $\mathbb{R}^D$. Its order-$d$ density ridge $R_d$ with $0 \leq d < D$ is the set of points defined as

$$R_d = \left\{\mathbf{x} \in \mathbb{R}^D : V_d(\mathbf{x}) V_d(\mathbf{x})^T \nabla p(\mathbf{x}) = \mathbf{0},\ \lambda_{d+1}(\mathbf{x}) < 0\right\}, \tag{1.1}$$

where $\lambda_1(\mathbf{x}) \geq \cdots \geq \lambda_D(\mathbf{x})$ are the eigenvalues of the Hessian $\nabla\nabla p(\mathbf{x})$ and $V_d(\mathbf{x}) = \left[\mathbf{v}_{d+1}(\mathbf{x}), \ldots, \mathbf{v}_D(\mathbf{x})\right]$ has its columns as the last $D - d$ orthonormal eigenvectors. The notion of density ridges has appeared in various scientific fields, such as medical imaging [114], seismology [95] and astronomy [26, 101]. To locate an estimated density ridge defined by the (Euclidean) KDE, [83] proposed a practical method called the subspace constrained mean shift (SCMS) algorithm.
Fig. 1. Density ridges estimated by the Euclidean and directional SCMS algorithms on two synthetic datasets (drawn as black points) with hidden circular manifold structures (indicated by blue curves) on $\mathbb{R}^2$ and the unit sphere $\Omega_2$, respectively. Left: the orange points indicate the estimated ridge obtained by the Euclidean SCMS algorithm from the dataset on $\mathbb{R}^2$. Right: the red points represent the estimated directional ridge identified by our directional SCMS algorithm, while the orange points indicate the estimated ridge obtained by the Euclidean SCMS algorithm from the dataset on $\Omega_2$. This panel is presented under the Hammer projection; see Appendix B for more details.
While the statistical estimation and asymptotic theories of density ridges in $\mathbb{R}^D$ have been well studied [22, 24, 45, 88, 89], the literature falls short of addressing the algorithmic properties of the ridge-finding method, i.e. the SCMS algorithm. To the best of our knowledge, [46, 47] were the only available works to investigate the SCMS algorithm and its modified version from an algorithmic perspective. However, they only proved a non-decreasing property of density estimates and the validity of two stopping criteria for the SCMS algorithm. The algorithmic convergence of the SCMS algorithm remains an open question. There are two challenges to answering this question. First, because every iteration of the SCMS algorithm involves a projection matrix defined by the (estimated) Hessian, it is no longer a conventional first-order method in optimization. Second, estimating a density ridge in practice is a nonconvex/nonconcave optimization problem. Thus, the first objective of this paper is to provide a theoretical study on the algorithmic convergence and its associated (linear) rate of convergence for the SCMS algorithm.
In stark contrast to the abundant research on density ridges in the Euclidean space, little work has been done to examine the statistical properties of density ridges on the unit hypersphere $\Omega_q$, or any practical algorithm for estimating them. Nevertheless, data on $\Omega_q$ are ubiquitous in many scientific fields of study, such as seismology (e.g. longitudes and latitudes of the epicenters of earthquakes) and astronomy (e.g. right ascensions and declinations of astronomical objects). Such data are generally known as directional data in the statistical literature [70, 74]. Hence, the second objective of this paper is to generalize density ridges and the SCMS algorithm to directional data.

More importantly, identifying an estimated density ridge from directional data on $\Omega_2$ by the Euclidean SCMS algorithm always suffers from high bias near the two poles of $\Omega_2$. Consider a synthetic dataset with independently and identically distributed (i.i.d.) observations sampled from a great circle connecting the North and South Poles of $\Omega_2$ with additive noises. We apply both the Euclidean and directional SCMS algorithms to this simulated dataset. While the estimated ridges by the Euclidean SCMS algorithm fail to recover the desired great circle in high-latitude regions, the ridges identified by our proposed directional SCMS algorithm align well with the underlying circular structure; see the right panel in Fig. 1 for a preview and Appendix B for a more detailed discussion.
Main Results. The main contributions of this paper are summarized as follows:
- We present the convergence analysis of the SCMS and the general SCGA algorithms and prove their linear convergence properties with Euclidean data (Theorem 3.1, Corollary 3.2, and related discussion in Section 3.3):
$$\left\|\hat{\mathbf{x}}^{(t)} - \hat{\mathbf{x}}^{\star}\right\|_2 = O\left(\Upsilon^{t}\right) \quad \text{as } t \to \infty,$$
where $\left\{\hat{\mathbf{x}}^{(t)}\right\}_{t=0}^{\infty}$ is a sequence of points generated by the SCGA or SCMS algorithm in $\mathbb{R}^D$, $\hat{\mathbf{x}}^{\star}$ is the limit point of the sequence, and $\Upsilon \in (0,1)$ is a constant.
- We generalize density ridges and the SCMS algorithm to directional data on $\Omega_q$ (Section 4).
- We prove the statistical convergence rate of a ridge estimator on the sphere $\Omega_q$ defined by the directional KDE (Theorem 4.1):
$$\mathrm{Haus}\left(\underline{R}_d, \underline{\hat{R}}_d\right) = O(h^2) + O_P\left(\sqrt{\frac{|\log h|}{n h^{q+4}}}\right),$$
where $\underline{R}_d$ and $\underline{\hat{R}}_d$ are the population and estimated directional density ridges, respectively, $\mathrm{Haus}$ is the Hausdorff distance, and $q$ is the dimension of $\Omega_q$.
- We establish the convergence of the SCMS and the general SCGA algorithms with directional data and derive their linear convergence results (Theorem 4.2, Corollary 4.2 and related expositions in Section 4.3):
$$d_g\left(\underline{\hat{\mathbf{x}}}^{(t)}, \underline{\hat{\mathbf{x}}}^{\star}\right) = O\left(\underline{\Upsilon}^{t}\right) \quad \text{as } t \to \infty,$$
where $\left\{\underline{\hat{\mathbf{x}}}^{(t)}\right\}_{t=0}^{\infty}$ is the sequence of points generated by the directional SCGA or SCMS algorithm, $\underline{\hat{\mathbf{x}}}^{\star}$ is the convergence point, $\underline{\Upsilon} \in (0,1)$ is a constant and $d_g$ is the geodesic distance on $\Omega_q$.
Other Related Literature. The problem of density ridge estimation has its unique standing in both the computer science and statistics literature; see [33, 39, 52, 53] and references therein. Among various definitions of density ridges [79, 84], our definition follows from [22, 39, 45], because its statistical estimation theory has been well established and it can be directly generalized to directional densities. Practically, the SCMS algorithm for identifying an estimated density ridge first appeared in the field of computer vision [94] before its introduction to the statistical community by [83]. More recently, [90] proposed alternative methods to the SCMS algorithm for finding density ridges, which are based on a gradient descent of the ridgeness and have connections to solution manifolds [28]. They presented the convergence analysis on continuous versions of their proposed methods and discretized them via Euler's method. Our directional SCMS algorithm is extended from the directional mean shift algorithm [62, 65, 80, 113, 117, 118]. As we cast the (directional) SCMS algorithms into subspace constrained gradient ascent (SCGA) algorithms (on a hypersphere), it is worth mentioning that one should not confuse the SCGA algorithm here with the projected gradient ascent/descent method for a constrained problem in the standard optimization theory; see Section 3.2 in [17] for some references of the latter one. The SCGA algorithm discussed in this paper is a gradient ascent algorithm but with a subspace constrained gradient. When the subspace coincides with alternating one-dimensional coordinate spaces, the SCGA algorithm reduces to the well-known coordinate ascent/descent method [112]. Some linear convergence results of the coordinate descent algorithms were previously established by [11, 73].

Other related work includes [66, 67], though, in their problem setups, the projection matrix onto the subspace is random and has its expectation equal to the identity matrix. The SCGA algorithm of interest here always has a deterministic constrained subspace defined by the eigenspace associated with the last several eigenvalues of the Hessian of the density $p$.
Outlines and Notations. Section 2 introduces the definitions of the Euclidean and directional KDEs and reviews some preliminary concepts of differential geometry on $\Omega_q$. We discuss the assumptions on Euclidean density ridges and establish the (linear) convergence results of the SCGA and SCMS algorithms in Section 3. In Section 4, we generalize the definition of density ridges to the directional data scenario and prove the (linear) convergence properties of the SCGA and SCMS algorithms on $\Omega_q$. Some simulation studies and real-world applications of the Euclidean and directional SCMS algorithms are presented in Section 5, whose code is available at https://github.com/zhangyk8/EuDirSCMS. We conclude the paper and discuss some potential impacts in Section 6.

Throughout the paper, we use $d$ as the intrinsic dimension of density ridges, whose ambient spaces are $\mathbb{R}^D$ in the Euclidean data case and $\Omega_q$ in the directional data case. Notice that a quantity under the directional data setting that has its counterpart in the Euclidean data case will be denoted by the same notation with an extra underline. For instance, $R_d$ is a ridge of the density $p$ in the Euclidean space $\mathbb{R}^D$, while $\underline{R}_d$ refers to a ridge of the directional density $f$ on the sphere $\Omega_q$.
Let $f: \mathbb{R}^D \to \mathbb{R}$ be a smooth function and $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_D)$ be a multi-index (that is, $\alpha_1, \ldots, \alpha_D$ are nonnegative integers and $|\boldsymbol{\alpha}| = \sum_{i=1}^{D} \alpha_i$). Define $D^{\boldsymbol{\alpha}} = \frac{\partial^{|\boldsymbol{\alpha}|}}{\partial x_1^{\alpha_1} \cdots \partial x_D^{\alpha_D}}$ as the $|\boldsymbol{\alpha}|$-th order partial derivative operator, where $D^{\boldsymbol{\alpha}} f$ is often written as $f^{(\boldsymbol{\alpha})}$. For any integer $j \geq 0$, we define the functional norms

$$\|f\|_{\infty}^{(j)} = \max_{|\boldsymbol{\alpha}| = j}\ \sup_{\mathbf{x} \in \mathbb{R}^D} \left|D^{\boldsymbol{\alpha}} f(\mathbf{x})\right|.$$

When $j = 0$, this becomes the infinity norm of $f$; for $j \geq 1$, the above norms are indeed semi-norms. We also define $\|f\|_{\infty,k}^{*} = \max_{0 \leq j \leq k} \|f\|_{\infty}^{(j)}$.

The (total) gradient and Hessian of $f$ are defined as $\nabla f(\mathbf{x}) = \left(\frac{\partial f(\mathbf{x})}{\partial x_1}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_D}\right)^T$ and $\nabla\nabla f(\mathbf{x}) = \left[\frac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_j}\right]_{1 \leq i, j \leq D}$. Inductively, the third derivative of $f$ is a $D \times D \times D$ array given by $\nabla\nabla\nabla f(\mathbf{x}) = \left[\frac{\partial^3 f(\mathbf{x})}{\partial x_i \partial x_j \partial x_k}\right]_{1 \leq i, j, k \leq D}$. When $f$ is a directional density supported on $\Omega_q$, the preceding functional norms are defined via the Riemannian gradient, Hessian and higher-order derivatives of $f$ within the tangent space $T_{\mathbf{x}}$ at $\mathbf{x} \in \Omega_q$, and the supremum will be taken over $\Omega_q$ instead of $\mathbb{R}^D$. They are equivalent to the derivatives of $f$ with respect to the local coordinate chart on $\Omega_q$; see Section 2.3 for a review.
Let $[\mathbf{A}]_{ij}$ denote the $(i,j)$-th entry of a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$. Then, the Frobenius norm is $\|\mathbf{A}\|_F = \sqrt{\mathrm{tr}\left(\mathbf{A}^T \mathbf{A}\right)}$, where $\mathrm{tr}(\cdot)$ is the trace of the square matrix, and the operator norm is $\|\mathbf{A}\|_p = \sup_{\|\mathbf{x}\|_p = 1} \|\mathbf{A}\mathbf{x}\|_p$. In most cases, we consider the $\ell_2$ (operator) norm $\|\mathbf{A}\|_2$. We define $\|\mathbf{A}\|_{\max} = \max_{i,j} \left|[\mathbf{A}]_{ij}\right|$. The inequality relationships between the above matrix norms are $\|\mathbf{A}\|_{\max} \leq \|\mathbf{A}\|_2$, $\|\mathbf{A}\|_2 \leq \|\mathbf{A}\|_F \leq \sqrt{\mathrm{rank}(\mathbf{A})} \cdot \|\mathbf{A}\|_2$ and $\|\mathbf{A}\|_2 \leq \sqrt{mn} \cdot \|\mathbf{A}\|_{\max}$.
We use the big-O notation $a_n = O(b_n)$ if the absolute value of $a_n$ is upper bounded by a positive constant multiple of $b_n$ for all sufficiently large $n$. In contrast, $a_n = o(b_n)$ when $\lim_{n \to \infty} |a_n| / b_n = 0$. For random vectors, the notation $o_P(1)$ is short for a sequence of random vectors that converges to zero in probability. The expression $O_P(1)$ denotes a sequence that is bounded in probability; see Section 2.2 of [107] for details.
2. Preliminaries
In this section, we review the KDE with Euclidean and directional data as well as some differential geometry concepts on $\Omega_q$.
2.1 Kernel Density Estimation with Euclidean Data
Let $\mathbf{X}_1, \ldots, \mathbf{X}_n$ be a random sample from a distribution $P$ with density $p$ supported on the Euclidean space $\mathbb{R}^D$. We call such a random sample Euclidean data in the sequel. The (Euclidean) KDE at a point $\mathbf{x} \in \mathbb{R}^D$ with a kernel function $K$ and bandwidth parameter $h > 0$ is written as [27, 96, 110]:

$$\hat{p}_n(\mathbf{x}) = \frac{1}{n h^D} \sum_{i=1}^{n} K\left(\frac{\mathbf{x} - \mathbf{X}_i}{h}\right). \tag{2.1}$$

The kernel $K: \mathbb{R}^D \to \mathbb{R}$ is generally a unimodal function satisfying the following properties:

- (K1) $\int_{\mathbb{R}^D} K(\mathbf{x})\, d\mathbf{x} = 1$.
- (K2) $K$ is (radially) symmetric, i.e. $K(\mathbf{x}) = K(-\mathbf{x})$.
- (K3) $\int_{\mathbb{R}^D} \|\mathbf{x}\|_2^2\, K(\mathbf{x})\, d\mathbf{x} < \infty$ and $\int_{\mathbb{R}^D} K(\mathbf{x})^2\, d\mathbf{x} < \infty$, where $\|\cdot\|_2$ is the usual $\ell_2$ norm in $\mathbb{R}^D$.

One possible approach to construct a multivariate kernel $K$ with the above properties is to derive it from a kernel profile as follows:

$$K(\mathbf{x}) = c_{k,D} \cdot k\left(\|\mathbf{x}\|_2^2\right), \tag{2.2}$$

where $c_{k,D} > 0$ is the normalizing constant such that $K$ satisfies (K1) and the function $k: [0, \infty) \to [0, \infty)$ is called the profile of the kernel. This kernel form is generally used in deriving (subspace constrained) mean shift algorithms; see Section 3.2. An important example of the profile function is $k(x) = e^{-x/2}$ for $x \geq 0$, leading to the multivariate Gaussian kernel $K(\mathbf{x}) = (2\pi)^{-D/2} \exp\left(-\frac{\|\mathbf{x}\|_2^2}{2}\right)$.
Another approach to designing a multivariate kernel function is to leverage the product kernel technique as $K(\mathbf{x}) = \prod_{j=1}^{D} K_j(x_j)$, where $K_1, \ldots, K_D$ are kernel functions defined on $\mathbb{R}$ satisfying the properties (K1–3). This leads to a multivariate KDE as

$$\hat{p}_n(\mathbf{x}) = \frac{1}{n h_1 \cdots h_D} \sum_{i=1}^{n} \prod_{j=1}^{D} K_j\left(\frac{x_j - X_{ij}}{h_j}\right). \tag{2.3}$$

In fact, the multivariate Gaussian kernel can be obtained by defining its kernel profile as $k(x) = e^{-x/2}$ for $x \geq 0$ or by taking each $K_j$ to be the univariate Gaussian kernel. In practice, the multivariate KDE (2.1) with the Gaussian kernel is the most popular nonparametric density estimator with Euclidean data.
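As a concrete illustration, the KDE (2.1) with the Gaussian kernel reduces to a few lines of code. This is a minimal sketch for exposition (the synthetic dataset and the bandwidth value are our own choices, not taken from the paper):

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Multivariate KDE (2.1) with the Gaussian kernel
    K(u) = (2*pi)^(-D/2) * exp(-||u||_2^2 / 2)."""
    n, D = data.shape
    u = (x - data) / h                              # (n, D) scaled differences
    kernel_vals = np.exp(-np.sum(u**2, axis=1) / 2) / (2 * np.pi)**(D / 2)
    return kernel_vals.sum() / (n * h**D)

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))                    # synthetic 2-D Gaussian sample
print(gaussian_kde(np.zeros(2), data, h=0.5))       # density estimate at the origin
```

For a standard bivariate Gaussian sample, the estimate at the origin should be close to the smoothed density value $1/(2\pi(1 + h^2))$.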
The most crucial part in applying the KDE is to select the bandwidth parameter $h$. Common methods in the literature aim at minimizing the mean integrated squared error (MISE):

$$\mathrm{MISE}\left(\hat{p}_n\right) = \mathbb{E} \int_{\mathbb{R}^D} \left[\hat{p}_n(\mathbf{x}) - p(\mathbf{x})\right]^2 d\mathbf{x},$$

or its asymptotic counterpart, through the rule of thumb [99], cross-validation [16, 50, 91, 102] and plug-in methods [98]. As choosing the bandwidth is not the main focus of this paper, we refer the interested reader to [61, 97] and Chapter 6.5 of [96] for comprehensive reviews.
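To make the rule-of-thumb approach concrete, the sketch below implements the standard normal-reference bandwidth, which minimizes the asymptotic MISE when the underlying density is Gaussian. The specific formula is the classical Silverman-type rule, shown here as background rather than as the paper's own choice:

```python
import numpy as np

def rule_of_thumb_bandwidth(data):
    """Normal-reference bandwidth for the multivariate Gaussian-kernel KDE:
    h = (4 / (D + 2))^(1/(D+4)) * n^(-1/(D+4)) * sigma,
    where sigma is an average marginal scale of the data."""
    n, D = data.shape
    sigma = np.mean(np.std(data, axis=0, ddof=1))
    return (4.0 / (D + 2))**(1.0 / (D + 4)) * n**(-1.0 / (D + 4)) * sigma

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))
print(rule_of_thumb_bandwidth(data))    # roughly 0.35 for this sample
```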
2.2 Kernel Density Estimation with Directional Data
The Euclidean KDE (2.1) exhibits some salient drawbacks in dealing with directional data; see Appendix B for a detailed exposition. Fortunately, the theory of kernel density estimation with directional data has been well studied since the late 1970s [7, 12, 43, 51, 86, 120]. Let $\mathbf{X}_1, \ldots, \mathbf{X}_n$ be a random sample generated from an underlying directional density function $f$ on $\Omega_q = \left\{\mathbf{x} \in \mathbb{R}^{q+1} : \|\mathbf{x}\|_2 = 1\right\}$ with $\int_{\Omega_q} f(\mathbf{x})\, \omega_q(d\mathbf{x}) = 1$, where $\omega_q$ is the Lebesgue measure on $\Omega_q$. The directional KDE is given by

$$\hat{f}_h(\mathbf{x}) = \frac{c_{h,q}(L)}{n} \sum_{i=1}^{n} L\left(\frac{1 - \mathbf{x}^T \mathbf{X}_i}{h^2}\right), \tag{2.4}$$

where $L$ is a directional kernel (i.e. a rapidly decaying function with nonnegative values defined on $[0, R_L)$ for some constant $R_L \in (0, \infty]$), $h > 0$ is the bandwidth parameter and $c_{h,q}(L)$ is a normalizing constant satisfying $\int_{\Omega_q} \hat{f}_h(\mathbf{x})\, \omega_q(d\mathbf{x}) = 1$.
Remark 2.1.
The distance metric used by the directional KDE (2.4) on $\Omega_q$ is identical to the standard Euclidean metric in the ambient space $\mathbb{R}^{q+1}$. This is because the standard Euclidean metric $\|\mathbf{x}_1 - \mathbf{x}_2\|_2$ of $\mathbb{R}^{q+1}$ is topologically equivalent (but not strongly equivalent) to the geodesic distance $d_g(\mathbf{x}_1, \mathbf{x}_2) = \arccos\left(\mathbf{x}_1^T \mathbf{x}_2\right)$ on $\Omega_q$ due to the following equality:

$$\|\mathbf{x}_1 - \mathbf{x}_2\|_2 = \sqrt{2 - 2\, \mathbf{x}_1^T \mathbf{x}_2} = 2 \sin\left(\frac{d_g(\mathbf{x}_1, \mathbf{x}_2)}{2}\right). \tag{2.5}$$

See Section C.1.5 in [81] for the definition of equivalence of metrics. Hence, the distance metric in (2.4) is indeed intrinsic on $\Omega_q$ and adaptive to its geometry.
As in the applications of Euclidean KDEs, bandwidth selection is a critical part in determining the performance of directional KDEs [7, 43, 51, 75, 82, 93, 106]. On the contrary, the choice of the kernel is less crucial; see, e.g. Page 72 of [110] and Section 6.3.2 in [96] for the reasoning. A popular candidate is the so-called von Mises kernel $L(r) = e^{-r}$, which serves as a counterpart of the Gaussian kernel for directional KDEs. Its name originates from the famous $q$-von Mises–Fisher distribution on $\Omega_q$, which is denoted by $\mathrm{vMF}(\boldsymbol{\mu}, \nu)$ and has the density:

$$f_{\mathrm{vMF}}(\mathbf{x}; \boldsymbol{\mu}, \nu) = C_q(\nu) \exp\left(\nu\, \boldsymbol{\mu}^T \mathbf{x}\right) \quad \text{with} \quad C_q(\nu) = \frac{\nu^{\frac{q-1}{2}}}{(2\pi)^{\frac{q+1}{2}}\, \mathcal{I}_{\frac{q-1}{2}}(\nu)}, \tag{2.6}$$

where $\boldsymbol{\mu} \in \Omega_q$ is the directional mean, $\nu \geq 0$ is the concentration parameter and $\mathcal{I}_{\alpha}(\nu)$ is the modified Bessel function of the first kind at order $\alpha$. For more details on the statistical properties of the von Mises–Fisher distribution and directional KDEs, we refer the interested reader to [9, 44, 74].
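The directional KDE (2.4) with the von Mises kernel is equally short in code. The sketch below specializes to $\Omega_2$ (the sphere in $\mathbb{R}^3$), where the von Mises–Fisher normalizing constant has the closed form $\kappa / (4\pi \sinh \kappa)$ with $\kappa = 1/h^2$; the dataset is an illustrative uniform sample, not from the paper:

```python
import numpy as np

def vmf_kde(x, data, h):
    """Directional KDE (2.4) on the unit sphere S^2 in R^3 with the
    von Mises kernel L(r) = exp(-r), i.e. an average of von Mises-Fisher
    densities centered at the data with concentration kappa = 1/h^2."""
    kappa = 1.0 / h**2
    t = data @ x                       # cosine similarity with each observation
    # kappa/(4*pi*sinh(kappa)) * exp(kappa*t), rewritten to avoid overflow:
    vals = kappa / (2 * np.pi) * np.exp(kappa * (t - 1)) / (1 - np.exp(-2 * kappa))
    return vals.mean()

rng = np.random.default_rng(1)
raw = rng.normal(size=(400, 3))
data = raw / np.linalg.norm(raw, axis=1, keepdims=True)   # uniform sample on S^2
x = np.array([0.0, 0.0, 1.0])
print(vmf_kde(x, data, h=0.5))    # near the uniform density 1/(4*pi) ~ 0.08
```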
2.3 Riemannian Gradient, Hessian and Exponential Map on $\Omega_q$

Given that the unit hypersphere $\Omega_q$ is a nonlinear manifold, the Riemannian gradient and Hessian of a smooth function $f$ on $\Omega_q$ are defined within its tangent spaces. They are different from, but also interconnected with, the total gradient and Hessian of $f$ in the ambient Euclidean space $\mathbb{R}^{q+1}$.
Riemannian Gradient on $\Omega_q$. Let $T_{\mathbf{x}}$ be the tangent space of $\Omega_q$ at a point $\mathbf{x} \in \Omega_q$, which consists of all the vectors starting from $\mathbf{x}$ and tangent to $\Omega_q$. Given a smooth function $f: \Omega_q \to \mathbb{R}$, its Riemannian gradient $\mathrm{grad}\, f(\mathbf{x}) \in T_{\mathbf{x}}$ is defined as

$$\left\langle \mathrm{grad}\, f(\mathbf{x}), \mathbf{v} \right\rangle_{\mathbf{x}} = df_{\mathbf{x}}(\mathbf{v}) \tag{2.7}$$

for any (unit) vector $\mathbf{v} \in T_{\mathbf{x}}$, where $\langle \cdot, \cdot \rangle_{\mathbf{x}}$ is the inner product (or Riemannian metric) in $T_{\mathbf{x}}$ and $df_{\mathbf{x}}$ is the differential operator of $f$ at $\mathbf{x}$; see, e.g. Section 3.1 in [10] for more details. Note that the Riemannian metric on $\Omega_q$ coincides with the standard inner product $\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^T \mathbf{v}$ in the ambient space $\mathbb{R}^{q+1}$; see Section 3.6.1 in [1]. If $f$ is smooth in an open neighborhood containing $\Omega_q$ and we consider $\mathrm{grad}\, f(\mathbf{x})$ and $\nabla f(\mathbf{x})$ as vectors in $\mathbb{R}^{q+1}$, then the inner product in $T_{\mathbf{x}}$ reduces to the usual one in $\mathbb{R}^{q+1}$ and the Riemannian gradient $\mathrm{grad}\, f(\mathbf{x})$ can be expressed in terms of the total gradient $\nabla f(\mathbf{x})$ as

$$\mathrm{grad}\, f(\mathbf{x}) = \left(\mathbf{I}_{q+1} - \mathbf{x}\mathbf{x}^T\right) \nabla f(\mathbf{x}), \tag{2.8}$$

where $\mathbf{I}_{q+1}$ is the identity matrix in $\mathbb{R}^{(q+1) \times (q+1)}$. The left-hand side of (2.8) is thus the projection of the total gradient $\nabla f(\mathbf{x})$ onto the tangent space $T_{\mathbf{x}}$ at $\mathbf{x}$.
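The projection in (2.8) is a one-line operation in practice. A minimal sketch (the point and gradient below are arbitrary illustrative values):

```python
import numpy as np

def riemannian_grad(x, total_grad):
    """Riemannian gradient on the unit sphere via (2.8):
    (I - x x^T) grad = grad - (x . grad) x."""
    return total_grad - x * (x @ total_grad)

x = np.array([0.0, 0.0, 1.0])        # a point on S^2
g = np.array([1.0, 2.0, 3.0])        # some total gradient at x
v = riemannian_grad(x, g)
print(v)                             # [1. 2. 0.]: the component along x is removed
```

The result is always orthogonal to $\mathbf{x}$, i.e. it lies in the tangent space $T_{\mathbf{x}}$.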
Riemannian Hessian on $\Omega_q$. The Riemannian Hessian $\mathcal{H} f(\mathbf{x})$ at a point $\mathbf{x} \in \Omega_q$ is a symmetric bilinear map from the tangent space $T_{\mathbf{x}}$ into itself defined as

$$\mathcal{H} f(\mathbf{x})[\mathbf{u}] = \bar{\nabla}_{\mathbf{u}}\, \mathrm{grad}\, f(\mathbf{x}) \tag{2.9}$$

for any $\mathbf{u} \in T_{\mathbf{x}}$, where $\bar{\nabla}$ is the Riemannian connection on $\Omega_q$. Similar to $\mathrm{grad}\, f(\mathbf{x})$, the Riemannian Hessian $\mathcal{H} f(\mathbf{x})$ has the following explicit formula when viewed in the ambient Euclidean space $\mathbb{R}^{q+1}$:

$$\mathcal{H} f(\mathbf{x}) = \left(\mathbf{I}_{q+1} - \mathbf{x}\mathbf{x}^T\right) \left[\nabla\nabla f(\mathbf{x}) - \mathbf{x}^T \nabla f(\mathbf{x}) \cdot \mathbf{I}_{q+1}\right] \left(\mathbf{I}_{q+1} - \mathbf{x}\mathbf{x}^T\right), \tag{2.10}$$

where $\nabla f(\mathbf{x})$ and $\nabla\nabla f(\mathbf{x})$ are the total gradient and Hessian of $f$ in $\mathbb{R}^{q+1}$. This formula can be derived via the Riemannian connection and Weingarten map on $\Omega_q$ ([2] and Section 5.5 in [1]) or geodesics on $\Omega_q$ [118].
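Formula (2.10) can likewise be evaluated directly from ambient-space derivatives. As a sanity check (our own toy example, not from the paper), take the height function $f(\mathbf{x}) = x_3$ on $S^2$, whose total gradient is $\mathbf{e}_3$ and total Hessian is zero; at the north pole its Riemannian Hessian should be $-\mathbf{I}$ on the tangent plane, since the north pole is a maximum of the height:

```python
import numpy as np

def riemannian_hessian(x, total_grad, total_hess):
    """Riemannian Hessian on the unit sphere via (2.10):
    P [ Hess - (x^T grad) I ] P  with projection P = I - x x^T."""
    n = x.size
    P = np.eye(n) - np.outer(x, x)
    return P @ (total_hess - (x @ total_grad) * np.eye(n)) @ P

# height function f(x) = x_3 at the north pole
x = np.array([0.0, 0.0, 1.0])
H = riemannian_hessian(x, np.array([0.0, 0.0, 1.0]), np.zeros((3, 3)))
print(H)     # diag(-1, -1, 0): negative definite on the tangent plane
```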
Exponential Map. An exponential map $\mathrm{Exp}_{\mathbf{x}}: T_{\mathbf{x}} \to \Omega_q$ at $\mathbf{x} \in \Omega_q$ is a mapping that takes a vector $\mathbf{v} \in T_{\mathbf{x}}$ to a point $\mathbf{y} = \mathrm{Exp}_{\mathbf{x}}(\mathbf{v}) \in \Omega_q$ along the curve $\gamma$ with $\gamma(0) = \mathbf{x}$ and $\gamma'(0) = \mathbf{v}$. Here, $\gamma$ is a curve of minimum length between $\mathbf{x}$ and $\mathbf{y}$ (i.e. the so-called geodesic on $\Omega_q$). An intuitive way of thinking of the exponential map $\mathrm{Exp}_{\mathbf{x}}$ evaluated at $\mathbf{v} \in T_{\mathbf{x}}$ on $\Omega_q$ is that, starting at the point $\mathbf{x}$, we identify another point $\mathbf{y}$ on $\Omega_q$ along the geodesic (or great circle) in the direction of $\mathbf{v}$ so that the geodesic distance between $\mathbf{x}$ and $\mathbf{y}$ is $\|\mathbf{v}\|_2$. As $\Omega_q$ is a compact Riemannian manifold, the exponential map $\mathrm{Exp}_{\mathbf{x}}$ is a diffeomorphism (smooth bijection) from a neighborhood of $\mathbf{0} \in T_{\mathbf{x}}$ to its image on $\Omega_q$; see Lemma 6.16 in [69]. The inverse of an exponential map (or logarithmic map) is defined within a neighborhood $U$ around $\mathbf{x}$ as a mapping $\mathrm{Exp}_{\mathbf{x}}^{-1}: U \to T_{\mathbf{x}}$ such that $\mathrm{Exp}_{\mathbf{x}}^{-1}(\mathbf{y})$ represents the vector in $T_{\mathbf{x}}$ starting at $\mathbf{x}$, pointing to $\mathbf{y}$, and with its length equal to the geodesic distance between $\mathbf{x}$ and $\mathbf{y}$.
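On the sphere, both maps admit closed forms: $\mathrm{Exp}_{\mathbf{x}}(\mathbf{v}) = \cos(\|\mathbf{v}\|_2)\,\mathbf{x} + \sin(\|\mathbf{v}\|_2)\,\mathbf{v}/\|\mathbf{v}\|_2$, and the logarithmic map rescales the tangential component of $\mathbf{y}$ to the geodesic distance $\arccos(\mathbf{x}^T\mathbf{y})$. A minimal sketch (our own illustration):

```python
import numpy as np

def exp_map(x, v):
    """Exponential map on the unit sphere: move from x along the geodesic
    with initial direction v (a tangent vector at x) for arc length ||v||."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def log_map(x, y):
    """Logarithmic (inverse exponential) map: the tangent vector at x that
    points toward y, with length equal to the geodesic distance."""
    t = np.clip(x @ y, -1.0, 1.0)
    w = y - t * x                       # component of y orthogonal to x
    nw = np.linalg.norm(w)
    if nw < 1e-12:
        return np.zeros_like(x)
    return np.arccos(t) * (w / nw)

x = np.array([0.0, 0.0, 1.0])           # north pole of S^2
v = np.array([np.pi / 2, 0.0, 0.0])     # tangent vector of length pi/2
y = exp_map(x, v)
print(y)                                # lands on the equator, at (1, 0, 0)
```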
3. Linear Convergence of the SCMS Algorithm With Euclidean Data
Given the definition of an order-$d$ ridge $R_d$ in (1.1) of the (smooth) density $p$ on the Euclidean space $\mathbb{R}^D$, we introduce, in this section, some commonly assumed conditions to regularize $R_d$ and its stability theorem. After revisiting the frameworks of the Euclidean mean shift and SCMS algorithms, as well as deriving the SCMS algorithm as an SCGA algorithm with an adaptive step size, we present our (linear) convergence analysis of the SCGA and SCMS algorithms.
3.1 Assumptions and Stability of Euclidean Density Ridges
Under the spectral decomposition of the Hessian $\nabla\nabla p(\mathbf{x})$ as $\nabla\nabla p(\mathbf{x}) = V(\mathbf{x}) \Lambda(\mathbf{x}) V(\mathbf{x})^T$, we know that $V(\mathbf{x}) = \left[\mathbf{v}_1(\mathbf{x}), \ldots, \mathbf{v}_D(\mathbf{x})\right]$ is a real orthogonal matrix with the eigenvectors of $\nabla\nabla p(\mathbf{x})$ as its columns and $\Lambda(\mathbf{x}) = \mathrm{diag}\left(\lambda_1(\mathbf{x}), \ldots, \lambda_D(\mathbf{x})\right)$ is a diagonal matrix with $\lambda_1(\mathbf{x}) \geq \cdots \geq \lambda_D(\mathbf{x})$. Given that $V_d(\mathbf{x}) = \left[\mathbf{v}_{d+1}(\mathbf{x}), \ldots, \mathbf{v}_D(\mathbf{x})\right]$, we let $V_d(\mathbf{x}) V_d(\mathbf{x})^T$ be the projection matrix onto the column space of $V_d(\mathbf{x})$ and $U_d(\mathbf{x}) U_d(\mathbf{x})^T$ be the projection matrix onto the complement space, where $U_d(\mathbf{x}) = \left[\mathbf{v}_1(\mathbf{x}), \ldots, \mathbf{v}_d(\mathbf{x})\right]$ and $U_d(\mathbf{x}) U_d(\mathbf{x})^T + V_d(\mathbf{x}) V_d(\mathbf{x})^T = \mathbf{I}_D$ is the identity matrix in $\mathbb{R}^{D \times D}$. Then, the order-$d$ principal gradient (or projected gradient in [22, 45]) is defined as

$$G_d(\mathbf{x}) = V_d(\mathbf{x}) V_d(\mathbf{x})^T \nabla p(\mathbf{x}), \tag{3.1}$$

and $\nabla p(\mathbf{x}) - G_d(\mathbf{x}) = U_d(\mathbf{x}) U_d(\mathbf{x})^T \nabla p(\mathbf{x})$ will be called the residual gradient. The order-$d$ density ridge can be equivalently defined as

$$R_d = \left\{\mathbf{x} \in \mathbb{R}^D : G_d(\mathbf{x}) = \mathbf{0},\ \lambda_{d+1}(\mathbf{x}) < 0\right\}. \tag{3.2}$$

It follows that the 0-ridge $R_0$ is the set of local modes of $p$, whose statistical properties and practical estimation algorithms have been well studied in [6, 25]. Thus, we only consider the case when $1 \leq d < D$ in the sequel. We define the projection from a point $\mathbf{x} \in \mathbb{R}^D$ onto a ridge $R_d$ by $\pi_{R_d}(\mathbf{x}) = \mathrm{argmin}_{\mathbf{y} \in R_d} \|\mathbf{x} - \mathbf{y}\|_2$ and the distance from $\mathbf{x}$ to $R_d$ by $d(\mathbf{x}, R_d) = \inf_{\mathbf{y} \in R_d} \|\mathbf{x} - \mathbf{y}\|_2$. Note that the projection from a point $\mathbf{x}$ to $R_d$ may not be unique. To guarantee the uniqueness of the projection, we introduce a concept called the reach [32, 42]:

$$\mathrm{reach}(R_d) = \sup\left\{r \geq 0 : \text{every point in } R_d \oplus r \text{ has a unique projection onto } R_d\right\}, \tag{3.3}$$

where $A \oplus r = \bigcup_{\mathbf{x} \in A} B(\mathbf{x}, r)$ and $B(\mathbf{x}, r)$ is a $D$-dimensional ball of radius $r$ centered at $\mathbf{x}$. To obtain a well-behaved ridge $R_d$, some assumptions need to be imposed on the underlying density $p$ around a small neighborhood of $R_d$.
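Computing the principal gradient (3.1) amounts to one symmetric eigendecomposition followed by a projection. A minimal sketch (the gradient and Hessian values below are illustrative stand-ins for $\nabla p$ and $\nabla\nabla p$ at a point):

```python
import numpy as np

def principal_gradient(grad, hess, d):
    """Order-d principal gradient (3.1): project the gradient onto the span
    of the eigenvectors tied to the D - d smallest Hessian eigenvalues."""
    D = hess.shape[0]
    eigvecs = np.linalg.eigh(hess)[1]   # columns ordered by ascending eigenvalue
    V_d = eigvecs[:, :D - d]            # the last D - d (smallest) eigenvectors
    return V_d @ (V_d.T @ grad)

# D = 2, d = 1: eigenvalues -2 (eigenvector e1) and 1 (eigenvector e2)
G = principal_gradient(np.array([3.0, 4.0]), np.diag([-2.0, 1.0]), d=1)
print(G)    # [3. 0.]: only the component along the negative-curvature axis remains
```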
- (A1) (Differentiability) We assume that $p$ is bounded and at least four times differentiable with bounded partial derivatives up to the fourth order for every $\mathbf{x} \in \mathbb{R}^D$.
- (A2) (Eigengap) We assume that there exist constants $\rho > 0$ and $\beta_0 > 0$ such that $\lambda_{d+1}(\mathbf{x}) \leq -\beta_0$ and $\lambda_d(\mathbf{x}) - \lambda_{d+1}(\mathbf{x}) \geq \beta_0$ for any $\mathbf{x} \in R_d \oplus \rho$.
- (A3) (Path Smoothness) Under the same $\rho > 0$ in (A2), we assume that there exists another constant $\beta_1 > 0$ such that $\|\nabla p(\mathbf{x})\|_2 \cdot \|\nabla\nabla\nabla p(\mathbf{x})\|_{\max} \leq \beta_1 < \frac{\beta_0^2}{2\sqrt{D}}$ for all $\mathbf{x} \in R_d \oplus \rho$ and $1 \leq d < D$.
Condition (A1) is a natural differentiability assumption in the context of ridge estimation. Condition (A2) is a curvature assumption on the true density $p$, ensuring that $p$ is 'strongly concave' around $R_d$ inside the $(D-d)$-dimensional linear space spanned by the columns of $V_d(\mathbf{x})$. We call this property 'subspace constrained strong concavity'. It is one of the most important components in establishing the linear convergence of the SCGA and SCMS algorithms; see Remark 3.3 for the reasoning. Condition (A3) prevents the gradient and third-order derivatives of $p$ from being too steep around the ridge $R_d$. These conditions are also imposed by [45] for characterizing a quadratic behavior of $p$ around $R_d$ and ensuring the stability of $R_d$, as well as by [22] to avoid degenerate normal spaces of $R_d$. Consequently, $R_d$ is a $d$-dimensional manifold that contains neither intersections nor endpoints; see also Lemma C.1 in the Appendix. Notice that the inequality assumptions in (A3) depend on both the ambient dimension $D$ and the intrinsic dimension $d$ of the ridge $R_d$. The larger the dimensions $D$ and $d$ are, the harder it is for the assumptions to hold. This phenomenon, in some sense, reflects the curse of dimensionality in nonparametric ridge estimation.
Given conditions (A1–3), the ridge $R_d$ will be stable under small perturbations of the underlying density $p$ and its derivatives, which is summarized in the following lemma. The stability of $R_d$ is generally measured by the Hausdorff distance defined as

$$\mathrm{Haus}(A, B) = \max\left\{\sup_{\mathbf{x} \in A} d(\mathbf{x}, B),\ \sup_{\mathbf{y} \in B} d(\mathbf{y}, A)\right\}, \tag{3.4}$$

where $A, B$ are two sets in $\mathbb{R}^D$.
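For finite point sets, such as the discrete outputs of an SCMS run, the Hausdorff distance (3.4) can be computed from the pairwise distance matrix. A minimal sketch (our own illustration):

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance (3.4) between two finite point sets in R^D,
    given as arrays of shape (m, D) and (k, D)."""
    dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)   # pairwise
    return max(dist.min(axis=1).max(), dist.min(axis=0).max())

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 0.0]])
print(hausdorff(A, B))    # 1.0: the point (1, 0) is at distance 1 from B
```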
Lemma 3.1. (Theorem 4 in [45]).
Assume conditions (A1–3) for two densities $p$ and $\tilde{p}$. When $\|p - \tilde{p}\|_{\infty,2}^{*}$ is sufficiently small, we have

$$\mathrm{Haus}\left(R_d, \tilde{R}_d\right) = O\left(\|p - \tilde{p}\|_{\infty,2}^{*}\right),$$

where $R_d$ and $\tilde{R}_d$ are the $d$-ridges of $p$ and $\tilde{p}$, respectively.
When the true density $p$ that generates the Euclidean data $\mathbf{X}_1, \ldots, \mathbf{X}_n$ is replaced by the Euclidean KDE $\hat{p}_n$ in the definition (1.1) of density ridges, we obtain a natural (plug-in) estimator of the true ridge $R_d$ as

$$\hat{R}_d = \left\{\mathbf{x} \in \mathbb{R}^D : \hat{V}_d(\mathbf{x}) \hat{V}_d(\mathbf{x})^T \nabla \hat{p}_n(\mathbf{x}) = \mathbf{0},\ \hat{\lambda}_{d+1}(\mathbf{x}) < 0\right\}.$$

To regularize the statistical behavior of the estimated ridge $\hat{R}_d$, we make the following assumptions on the kernel of the form (2.2):

- (E1) We assume that the kernel profile $k: [0, \infty) \to [0, \infty)$ is non-increasing and at least three times continuously differentiable with bounded fourth-order partial derivatives, as well as $\int_{\mathbb{R}^D} k\left(\|\mathbf{x}\|_2^2\right) d\mathbf{x} < \infty$ with $\int_{\mathbb{R}^D} \|\mathbf{x}\|_2^2\, k\left(\|\mathbf{x}\|_2^2\right) d\mathbf{x} < \infty$.
- (E2) Let
$$\mathcal{K} = \left\{\mathbf{y} \mapsto D^{\boldsymbol{\alpha}} K\left(\frac{\mathbf{x} - \mathbf{y}}{h}\right) : \mathbf{x} \in \mathbb{R}^D,\ h > 0,\ |\boldsymbol{\alpha}| \leq 4\right\}.$$
We assume that $\mathcal{K}$ is a bounded VC (subgraph) class of measurable functions on $\mathbb{R}^D$; that is, there exist constants $A, \upsilon > 0$ such that for any $0 < \epsilon < 1$,
$$\sup_{Q} N\left(\mathcal{K}, L_2(Q), \epsilon \|F\|_{L_2(Q)}\right) \leq \left(\frac{A}{\epsilon}\right)^{\upsilon},$$
where $N(\mathcal{K}, L_2(Q), \epsilon)$ is the $\epsilon$-covering number of the normed space $\left(\mathcal{K}, \|\cdot\|_{L_2(Q)}\right)$, $Q$ is any probability measure on $\mathbb{R}^D$ and $F$ is an envelope function of $\mathcal{K}$. Here, the norm $\|F\|_{L_2(Q)}$ is defined as $\left[\int_{\mathbb{R}^D} |F(\mathbf{x})|^2\, dQ(\mathbf{x})\right]^{1/2}$.
Remark 3.1.
Recall that the $\epsilon$-covering number $N(\mathcal{F}, \|\cdot\|, \epsilon)$ is defined as the minimal number of $\|\cdot\|$-balls of radius $\epsilon$ needed to cover the (function) class $\mathcal{F}$. One popular concept for controlling the uniform covering number $\sup_Q N(\mathcal{F}, L_2(Q), \epsilon)$ is the notion of Vapnik–Červonenkis (subgraph) classes, or simply VC classes. Starting from collections of sets, we say that a collection $\mathcal{C}$ of subsets of the sample space $\mathcal{X}$ picks out a certain subset $A$ of the finite set $\{x_1, \ldots, x_m\} \subset \mathcal{X}$ if it can be written as $A = \{x_1, \ldots, x_m\} \cap C$ for some $C \in \mathcal{C}$. The collection $\mathcal{C}$ is said to shatter $\{x_1, \ldots, x_m\}$ if $\mathcal{C}$ picks out each of its $2^m$ subsets. The VC-index $V(\mathcal{C})$ of $\mathcal{C}$ is the smallest $m$ for which no set of size $m$ is shattered by $\mathcal{C}$. A collection $\mathcal{C}$ of measurable sets is called a VC class if its index $V(\mathcal{C})$ is finite. To generalize this concept to a class $\mathcal{F}$ of real-valued and measurable functions defined on $\mathcal{X}$, we say that $\mathcal{F}$ is a VC subgraph class if the collection of all subgraphs of the functions in $\mathcal{F}$ forms a VC class of sets in $\mathcal{X} \times \mathbb{R}$. An important property of VC (subgraph) classes is that their $\epsilon$-covering numbers grow polynomially in $\frac{1}{\epsilon}$, as stated in condition (E2); see Theorem 2.6.4 in [108]. A more in-depth discussion of VC classes can be found in Chapter 2.6 of the same book.
Condition (E1) can be relaxed so that the kernel profile $k$ is three times continuously differentiable except at a finite number of points on $[0, \infty)$. Such a relaxation allows us to include the Epanechnikov and other compactly supported kernels. The integrability assumption on $k$ in condition (E1) is similar to the conditions (K1) and (K3) in Section 2.1, for the purpose of bounding the expectations and variances of the KDE $\hat{p}_n$ and its (partial) derivatives. Condition (E2) regularizes the complexity of the kernel and its (partial) derivatives, which is essential in establishing the uniform consistency of $\hat{p}_n$ and its derivatives to the corresponding quantities of $p$, as in equation (3.5).

Given conditions (E1) and (E2), the techniques in [20, 40, 48] can be utilized to show the uniform consistency of the Euclidean KDE $\hat{p}_n$ and its derivatives as

$$\left\|\hat{p}_n - p\right\|_{\infty}^{(j)} = O(h^2) + O_P\left(\sqrt{\frac{|\log h|}{n h^{D+2j}}}\right) \quad \text{for } j = 0, 1, \ldots, 4. \tag{3.5}$$
3.2 Mean Shift and SCMS Algorithms with Euclidean Data
We begin with a quick review of the Euclidean mean shift algorithm, as the SCMS algorithm is built on top of its formulation. Given condition (E1) and the Euclidean KDE $\hat{p}_n$ with kernel (2.2), its gradient estimator takes the form

$$\nabla \hat{p}_n(\mathbf{x}) = \frac{2 c_{k,D}}{n h^{D+2}} \left[\sum_{i=1}^{n} -k'\left(\left\|\frac{\mathbf{x} - \mathbf{X}_i}{h}\right\|_2^2\right)\right] \cdot \left[\frac{\sum_{i=1}^{n} \mathbf{X}_i \left(-k'\right)\left(\left\|\frac{\mathbf{x} - \mathbf{X}_i}{h}\right\|_2^2\right)}{\sum_{i=1}^{n} \left(-k'\right)\left(\left\|\frac{\mathbf{x} - \mathbf{X}_i}{h}\right\|_2^2\right)} - \mathbf{x}\right], \tag{3.6}$$

where the first term is a variant of KDEs and the second term is the mean shift vector

$$\Xi_h(\mathbf{x}) = \frac{\sum_{i=1}^{n} \mathbf{X}_i \left(-k'\right)\left(\left\|\frac{\mathbf{x} - \mathbf{X}_i}{h}\right\|_2^2\right)}{\sum_{i=1}^{n} \left(-k'\right)\left(\left\|\frac{\mathbf{x} - \mathbf{X}_i}{h}\right\|_2^2\right)} - \mathbf{x}. \tag{3.7}$$

This factorization suggests that the mean shift vector aligns with the direction of maximum increase in $\hat{p}_n$. Thus, moving a point along its mean shift vector successively yields an ascending path to a local mode [29, 31, 71]. Let $\left\{\hat{\mathbf{x}}^{(t)}\right\}_{t=0}^{\infty}$ be the mean shift sequence with the Euclidean KDE $\hat{p}_n$. Then, one iteration of the mean shift algorithm is written as

$$\hat{\mathbf{x}}^{(t+1)} = \hat{\mathbf{x}}^{(t)} + \Xi_h\left(\hat{\mathbf{x}}^{(t)}\right) = \hat{\mathbf{x}}^{(t)} + \eta_h\left(\hat{\mathbf{x}}^{(t)}\right) \cdot \nabla \hat{p}_n\left(\hat{\mathbf{x}}^{(t)}\right), \tag{3.8}$$

showing that the mean shift algorithm is a gradient ascent method with an adaptive step size

$$\eta_h(\mathbf{x}) = \left[\frac{2 c_{k,D}}{n h^{D+2}} \sum_{i=1}^{n} -k'\left(\left\|\frac{\mathbf{x} - \mathbf{X}_i}{h}\right\|_2^2\right)\right]^{-1}. \tag{3.9}$$

Here, we denote by $\hat{g}_n(\mathbf{x}) = \frac{2 c_{k,D}}{n h^{D+2}} \sum_{i=1}^{n} -k'\left(\left\|\frac{\mathbf{x} - \mathbf{X}_i}{h}\right\|_2^2\right)$ the denominator of the adaptive step size $\eta_h(\mathbf{x})$. Lemma 3.2 shows that under condition (E1) and the differentiability assumption on $p$, $h^2 \hat{g}_n(\mathbf{x})$ tends to a fixed constant with probability tending to 1 for any $\mathbf{x} \in \mathbb{R}^D$ as $n \to \infty$ and $h \to 0$. Therefore, the step size $\eta_h(\mathbf{x})$ has its asymptotic rate as $O(h^2)$ and tends to zero as $n \to \infty$ and $h \to 0$ as well. The proof of Lemma 3.2 can be found in Appendix D.

Lemma 3.2.

Assume conditions (A1) and (E1). The convergence rate of the adaptive step size is $\eta_h(\mathbf{x}) = O(h^2)$ for any $\mathbf{x} \in \mathbb{R}^D$ as $n \to \infty$ and $h \to 0$.
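With the Gaussian profile $k(x) = e^{-x/2}$, the weighted average in (3.7) takes a particularly simple form, and iterating (3.8) is a few lines of code. The sketch below is our own illustration (synthetic data, arbitrary bandwidth), not the paper's implementation:

```python
import numpy as np

def mean_shift_step(x, data, h):
    """One mean-shift iteration (3.8) with the Gaussian kernel: x moves to
    the kernel-weighted average of the sample (x plus the mean-shift vector)."""
    w = np.exp(-np.sum((x - data)**2, axis=1) / (2 * h**2))
    return (w[:, None] * data).sum(axis=0) / w.sum()

rng = np.random.default_rng(2)
data = rng.normal(size=(500, 2))         # unimodal sample with mode near 0
x = np.array([1.5, 1.5])
for _ in range(100):
    x = mean_shift_step(x, data, h=0.8)
print(x)                                 # near the estimated mode of the KDE
```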
As the mean shift algorithm is not the main focus of this paper, we will abuse notation and denote by $\left\{\hat{\mathbf{x}}^{(t)}\right\}_{t=0}^{\infty}$ the sequence produced by the SCMS or SCGA algorithm in the sequel. Compared with the mean shift iteration (3.8), the SCMS algorithm updates the sequence $\left\{\hat{\mathbf{x}}^{(t)}\right\}_{t=0}^{\infty}$ through the projected SCMS vector as

$$\hat{\mathbf{x}}^{(t+1)} = \hat{\mathbf{x}}^{(t)} + \hat{V}_d\left(\hat{\mathbf{x}}^{(t)}\right) \hat{V}_d\left(\hat{\mathbf{x}}^{(t)}\right)^T \Xi_h\left(\hat{\mathbf{x}}^{(t)}\right). \tag{3.10}$$

See Algorithm 1 in Appendix A for the entire procedure. This also implies that the SCMS algorithm can be viewed as a sample-based SCGA method as

$$\hat{\mathbf{x}}^{(t+1)} = \hat{\mathbf{x}}^{(t)} + \eta_h\left(\hat{\mathbf{x}}^{(t)}\right) \cdot \hat{V}_d\left(\hat{\mathbf{x}}^{(t)}\right) \hat{V}_d\left(\hat{\mathbf{x}}^{(t)}\right)^T \nabla \hat{p}_n\left(\hat{\mathbf{x}}^{(t)}\right), \tag{3.11}$$

with the same adaptive step size $\eta_h\left(\hat{\mathbf{x}}^{(t)}\right)$ as the Euclidean mean shift algorithm in (3.8). The formulation (3.11) sheds light on some (linear) convergence properties of the SCMS algorithm, as we will demonstrate in the next subsection.
3.3 Linear Convergence of Population and Sample-Based SCGA Algorithms
We have shown in (3.11) that the (usual/Euclidean) SCMS algorithm is a variant of the sample-based SCGA algorithm in $\mathbb{R}^D$ with an adaptive step size $\eta_h(\mathbf{x})$. To establish the (linear) convergence results of the SCMS algorithm with the Euclidean KDE $\hat{p}_n$, it suffices to study the (linear) convergence of the sample-based SCGA algorithm with objective function $\hat{p}_n$. To this end, we begin by studying the convergence of the population SCGA algorithm, whose objective function is the underlying density $p$.

Let $\left\{\mathbf{x}^{(t)}\right\}_{t=0}^{\infty}$ be the sequence defined by the population SCGA algorithm and $\left\{\hat{\mathbf{x}}^{(t)}\right\}_{t=0}^{\infty}$ be the sequence defined by the sample-based SCGA algorithm. The population SCGA algorithm is defined by its iterative formula as

$$\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + \eta \cdot V_d\left(\mathbf{x}^{(t)}\right) V_d\left(\mathbf{x}^{(t)}\right)^T \nabla p\left(\mathbf{x}^{(t)}\right), \tag{3.12}$$

where $\eta > 0$ is a (fixed) step size. The sample-based SCGA algorithm has its iterative formula as in (3.11), except that the standard sample-based SCGA algorithm normally embraces a constant step size $\eta > 0$.
Remark 3.2.
In (3.10) and (3.11), we consider the SCMS algorithm as a sample-based SCGA iteration with an adaptive step size $\eta_h\left(\hat{\mathbf{x}}^{(t)}\right)$. Our Lemma 3.2 suggests that $\eta_h\left(\hat{\mathbf{x}}^{(t)}\right)$ tends to zero at the rate $O(h^2)$ as $n \to \infty$ and $h \to 0$. However, once the sample size $n$ is fixed and the bandwidth $h$ is chosen, the step size $\eta_h\left(\hat{\mathbf{x}}^{(t)}\right)$ is not only upper bounded but also uniformly lower bounded away from zero with respect to the iteration number $t$ by the differentiability condition (E1), when the current iterative point $\hat{\mathbf{x}}^{(t)}$ lies within the compact neighborhood $\hat{R}_d \oplus \rho$. Note that $\hat{R}_d \oplus \rho$ is compact because $\hat{R}_d$ is a finite union of connected and compact manifolds; see (d) of Lemma C.1. More importantly, these upper and lower bounds of $\eta_h\left(\hat{\mathbf{x}}^{(t)}\right)$ when $\hat{\mathbf{x}}^{(t)} \in \hat{R}_d \oplus \rho$ are independent of the iteration number $t$. Therefore, conditioning on the case when the sample size $n$ is sufficiently large, one can always select a small bandwidth $h$ such that the adaptive step size $\eta_h\left(\hat{\mathbf{x}}^{(t)}\right)$ of the SCMS algorithm is sufficiently small but not equal to zero.
As revealed by the following proposition, our imposed conditions (A1–3) in Section 3.1 ensure that, as long as the step size $\eta$ is small, the objective function $p$ along any population SCGA sequence $\left\{\mathbf{x}^{(t)}\right\}_{t=0}^{\infty}$ is non-decreasing and the sequence itself converges to $R_d$ when it is initialized within a small neighborhood of $R_d$.

Proposition 3.1. (Convergence of the SCGA Algorithm.)

For any SCGA sequence $\left\{\mathbf{x}^{(t)}\right\}_{t=0}^{\infty}$ defined by (3.12) with a sufficiently small step size $\eta > 0$, the following properties hold.

(a) Under condition (A1), the objective function sequence $\left\{p\left(\mathbf{x}^{(t)}\right)\right\}_{t=0}^{\infty}$ is non-decreasing and converges.

(b) Under condition (A1), $\lim_{t \to \infty} \left\|G_d\left(\mathbf{x}^{(t)}\right)\right\|_2 = 0$.

(c) Under conditions (A1–3), $\lim_{t \to \infty} d\left(\mathbf{x}^{(t)}, R_d\right) = 0$ whenever $\mathbf{x}^{(0)} \in R_d \oplus r_0$ with the convergence radius $r_0 > 0$ satisfying

$$r_0 \leq \min\left\{\rho,\ \mathrm{reach}(R_d),\ \frac{\delta_0}{A_1}\right\},$$

where $\delta_0 > 0$ is a constant defined in (h) of Lemma C.1, while $A_1 > 0$ is a quantity depending on both the dimension $D$ and the functional norm $\|p\|_{\infty,4}^{*}$ of $p$ up to its fourth-order (partial) derivatives.

The proof of Proposition 3.1 can be found in Appendix D. We make two comments on the choice of the convergence radius $r_0$ in (c) of Proposition 3.1. The first two quantities in the upper bound of $r_0$ ensure that $\mathbf{x}^{(t)} \in R_d \oplus \min\left\{\rho, \mathrm{reach}(R_d)\right\}$ and, therefore, the projection of $\mathbf{x}^{(t)}$ onto $R_d$ is well defined. The last quantity in the upper bound of $r_0$ is critical to guarantee that the distances $d\left(\mathbf{x}^{(t)}, R_d\right)$ from the SCGA sequence $\left\{\mathbf{x}^{(t)}\right\}_{t=0}^{\infty}$ to the ridge $R_d$ can be controlled by the norms $\left\|G_d\left(\mathbf{x}^{(t)}\right)\right\|_2$ of the order-$d$ principal gradients for $t = 0, 1, \ldots$.
Corollary 3.1. (Convergence of the SCMS Algorithm.)
When the fixed sample size $n$ is sufficiently large and the fixed bandwidth $h$ is chosen to be sufficiently small, the following properties hold for the SCMS sequence $\left\{\hat{\mathbf{x}}^{(t)}\right\}_{t=0}^{\infty}$ with high probability under conditions (A1–3) and (E1–2).

(a) The Euclidean KDE sequence $\left\{\hat{p}_n\left(\hat{\mathbf{x}}^{(t)}\right)\right\}_{t=0}^{\infty}$ is non-decreasing and thus converges.

(b) $\lim_{t \to \infty} \left\|\hat{G}_d\left(\hat{\mathbf{x}}^{(t)}\right)\right\|_2 = 0$.

(c) $\lim_{t \to \infty} d\left(\hat{\mathbf{x}}^{(t)}, \hat{R}_d\right) = 0$ whenever $\hat{\mathbf{x}}^{(0)} \in \hat{R}_d \oplus r_0$ with the convergence radius $r_0$ defined in (c) of Proposition 3.1.

Corollary 3.1 is the sample-based version of Proposition 3.1. On the one hand, when $n$ is sufficiently large and $h$ is small enough, the estimated ridge $\hat{R}_d$ also satisfies conditions (A1–3) with high probability; see Lemma 3.1 and the uniform bounds (3.5) of $\hat{p}_n$. On the other hand, the adaptive step size $\eta_h\left(\hat{\mathbf{x}}^{(t)}\right)$ of the SCMS algorithm can always be made smaller than the required threshold when the sample size $n$ is sufficiently large and $h$ is small; see Remark 3.2. Consequently, our arguments in Proposition 3.1 can be applied to establish the (local) convergence of the SCMS sequence here. In addition, we point out that Proposition 2 in [46] also proved the results (a–b) of Corollary 3.1 under condition (E1) and a convexity assumption on the kernel profile $k$. The difference is that our arguments hold when $n$ is large and $h$ is small, while the extra convexity assumption in [46] enables the authors to prove the results (a–b) universally for any choice of the bandwidth $h$.
By Proposition 3.1 and Corollary 3.1, it is now reasonable to denote the limit points of the population and sample-based SCGA sequences $\left\{\mathbf{x}^{(t)}\right\}_{t=0}^{\infty}$ and $\left\{\hat{\mathbf{x}}^{(t)}\right\}_{t=0}^{\infty}$ by $\mathbf{x}^{\star}$ and $\hat{\mathbf{x}}^{\star}$, respectively. Before stating our main linear convergence results, we introduce the concepts of Q-linear and R-linear convergence from the optimization literature; see, e.g. Appendix A2 in [78].
Definition 3.2. (Linear Rate of Convergence.)
We say that the convergence of a sequence $\left\{\mathbf{x}^{(t)}\right\}_{t=0}^{\infty}$ to $\mathbf{x}^{\star}$ is Q-linear if there exists a constant $\Upsilon \in (0,1)$ such that

$$\frac{\left\|\mathbf{x}^{(t+1)} - \mathbf{x}^{\star}\right\|_2}{\left\|\mathbf{x}^{(t)} - \mathbf{x}^{\star}\right\|_2} \leq \Upsilon \quad \text{for all } t \text{ sufficiently large}.$$

We say that the convergence is R-linear if there is a sequence of nonnegative scalars $\left\{\upsilon_t\right\}_{t=0}^{\infty}$ such that

$$\left\|\mathbf{x}^{(t)} - \mathbf{x}^{\star}\right\|_2 \leq \upsilon_t \ \text{ for all } t, \quad \text{where } \left\{\upsilon_t\right\}_{t=0}^{\infty} \text{ converges Q-linearly to zero}.$$

The linear convergence of the SCGA sequence $\left\{\mathbf{x}^{(t)}\right\}_{t=0}^{\infty}$ will be established under the following local condition.

- (A4) (Quadratic Behavior of Residual Vectors) We assume that the SCGA sequence $\left\{\mathbf{x}^{(t)}\right\}_{t=0}^{\infty}$ with step size $\eta > 0$ and $\mathbf{x}^{\star}$ as its limit point satisfies

$$\left\|\left(\mathbf{I}_D - V_d\left(\mathbf{x}^{(t)}\right) V_d\left(\mathbf{x}^{(t)}\right)^T\right)\left(\mathbf{x}^{(t)} - \mathbf{x}^{\star}\right)\right\|_2 \leq \frac{\beta_0}{A_4}\left\|\mathbf{x}^{(t)} - \mathbf{x}^{\star}\right\|_2^2$$

for some constant $A_4 > 0$, where $\beta_0$ is the constant defined in condition (A2).
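The Q-linear rate in Definition 3.2 can be checked empirically by monitoring the ratio of successive distances to the limit point. A toy illustration of our own (a geometric sequence, standing in for an SCGA path):

```python
import numpy as np

# A toy sequence x^(t) -> x* with ||x^(t+1) - x*|| / ||x^(t) - x*|| = 1/2,
# i.e. Q-linear convergence with constant Upsilon = 1/2 (Definition 3.2).
x_star = np.zeros(2)
xs = [np.array([1.0, -2.0]) * 0.5**t for t in range(20)]
ratios = [np.linalg.norm(xs[t + 1] - x_star) / np.linalg.norm(xs[t] - x_star)
          for t in range(19)]
print(ratios[:3])    # every ratio equals 0.5
```

In practice, an approximately constant ratio strictly below one across iterations is the empirical signature of Q-linear convergence; ratios drifting toward one indicate sublinear behavior.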
Condition (A4) imposes a direct assumption on the SCGA sequence, under which the residual vector and its inner product with the residual gradient are upper bounded by a quadratic term in the distance to the limiting point. This condition is imposed to guarantee that the density is ‘subspace constrained strongly concave’ around the limiting point; see also Remark 3.3. Our proof of Theorem 3.3 suggests that the residual vector is only required to be smaller than the corresponding first-order term; for simplicity, we require it to be quadratic. When condition (A4) fails to hold, the associated SCGA sequence can only converge sublinearly. Condition (A4) is therefore an essential element in the linear convergence of the SCGA algorithm, and we discuss some potentially weaker assumptions that imply condition (A4) in Appendix E. Intuitively, the SCGA path converges to its limit along the direction of the principal gradient. To gain more insight into the correctness of condition (A4), we consider a special density function
![]() |
(3.13) |
on the plane, whose one-dimensional ridge follows from the definition (1.1). Some careful calculations suggest that the principal gradient of this density points toward the ridge from either side; see Fig. 2 for a graphical illustration. Furthermore, the smallest eigenvalue of its Hessian is negative in a neighborhood of the ridge. Hence, the residual gradient is perpendicular to the SCGA direction, and condition (A4) naturally holds.
Fig. 2.
Contour lines of the density function (3.13) and its principal gradient flows.
We now present our linear convergence results for the population and sample-based SCGA algorithms.
Theorem 3.3. (Linear Convergence of the SCGA Algorithm.)
Assume conditions (A1–4) throughout the theorem.
(a) Q-Linear convergence of the population SCGA sequence: Consider a convergence radius satisfying the stated upper bound, in which one constant is defined in (h) of Lemma C.1 and another is a quantity defined in (c) of Proposition 3.1 that depends on both the dimension of the data and the functional norm of the density up to its fourth-order derivatives. Whenever the step size is below the stated threshold and the initial point lies within this radius of the ridge, the population SCGA sequence converges Q-linearly to its limiting point.
(b) R-Linear convergence of the distance to the ridge: Under the same radius as in (a), whenever the step size is below the stated threshold and the initial point lies within this radius of the ridge, the distances from the population SCGA sequence to the ridge converge R-linearly to zero.
We further assume conditions (E1–2) in the rest of the statements, with the sample size tending to infinity and the bandwidth tending to zero at suitable rates.
(c) Q-Linear convergence of the sample-based SCGA sequence: Under the same radius and step size threshold as in (a), the sample-based SCGA sequence converges Q-linearly to its limiting point with probability tending to 1, whenever the step size is below the threshold and the initial point lies within the radius of the estimated ridge.
(d) R-Linear convergence of the distance to the estimated ridge: Under the same radius and step size threshold as in (a), the distances from the sample-based SCGA sequence to the estimated ridge converge R-linearly to zero with probability tending to 1, under the same requirements on the step size and the initial point.
The detailed proof of Theorem 3.3 can be found in Appendix D. Note that, as in (c) of Proposition 3.1, we elucidate a threshold value for the convergence radius in (a), under which the population SCGA algorithm converges linearly. The first three quantities in this threshold value are directly adopted from the upper bound of the convergence radius in (c) of Proposition 3.1, while the last term controls the ‘subspace constrained strong concavity’ (3.15) of the density within the convergence neighborhood.
Remark 3.3.
Notice that the standard strong concavity assumption on the objective (density) function is not sufficient to establish the linear convergence of the population SCGA algorithm (3.12). This is because, under the (quasi-)strong concavity assumption [76], the objective function would satisfy the inequality (3.14) for some constant, and the standard proofs of the linear convergence of gradient ascent methods rely on this inequality; see Section 3.4 in [17]. However, as indicated in our proof of Theorem 3.3, the linear convergence of the SCGA algorithm requires the inequality (3.15) instead, for some constant, where the projection is generally taken onto the principal subspace. We say that a function satisfying (3.15) is ‘subspace constrained strongly concave’. Since the two inequalities differ by a residual gradient term, the strong concavity assumption (3.14) will not imply the key inequality (3.15) for the linear convergence of the population SCGA algorithm unless the residual gradient term can be upper bounded by the second-order error term. The imposed eigengap condition (A2), together with condition (A4) and its related discussion in Appendix E, fills this gap, ensuring that such a quadratic upper bound holds on the residual gradients along the SCGA sequence.
Corollary 3.2. (Linear Convergence of the SCMS Algorithm.)
Assume conditions (A1–4) and (E1–2). When the fixed sample size is sufficiently large and the bandwidth is chosen to be sufficiently small, there exists a convergence radius such that the SCMS sequence satisfies the stated linear convergence property with high probability, whenever the initial point lies within this radius of the estimated ridge.
Corollary 3.2 should also be regarded as the linear convergence of the sample-based SCGA algorithm to the estimated ridge defined by the Euclidean KDE. Based on conditions (E1–2) and the uniform bounds (3.5), the Euclidean KDE, together with its ridge and the sample-based SCGA sequence, satisfies conditions (A1–4) with probability tending to 1 as the sample size tends to infinity and the bandwidth tends to zero. As a result, one can follow our argument in (a) of Theorem 3.3 to establish the linear convergence of the sample-based SCGA algorithm with a fixed, sufficiently small step size. Furthermore, when the fixed sample size is sufficiently large and the bandwidth is chosen to be small, the adaptive step size of the SCMS algorithm always falls below the threshold for linear convergence but is also uniformly bounded away from zero with respect to the iteration number; see our Remark 3.2. By taking the infimum of the adaptive step size with respect to the iteration number, one can thus establish the linear convergence of the SCMS algorithm with an explicit rate of convergence.
4. The SCMS Algorithm With Directional Data and Its Linear Convergence
In this section, we generalize the definition (1.1) of density ridges to directional densities on the unit hypersphere and propose our directional SCMS algorithm to identify directional density ridges. In addition, we prove the linear convergence of our directional SCMS algorithm by adjusting the arguments in Section 3.3. Throughout this section, the data form a random sample from a directional distribution with density supported on the unit hypersphere, which is embedded in the ambient Euclidean space.
4.1 Definitions, Assumptions and Stability of Directional Density Ridges
To apply the matrix forms of the Riemannian gradient and Hessian of a directional density f in the ambient space, we first extend f from its support on the unit hypersphere to the punctured ambient space by defining

f(x) := f(x/‖x‖) for all x ≠ 0. (4.1)
Now, given the expressions of the Riemannian gradient and Hessian defined in (2.8) and (2.10), we perform the spectral decomposition on the Riemannian Hessian at each point x, where the resulting real orthogonal matrix has columns given by eigenvectors that are associated with the corresponding eigenvalues and lie within the tangent space at x. Note that the Riemannian Hessian also has the unit eigenvector x itself, which is orthogonal to the tangent space and corresponds to the eigenvalue 0.
Let the relevant submatrix consist of the trailing columns of the orthogonal matrix above, i.e. the unit eigenvectors inside the tangent space corresponding to the smallest eigenvalues of the Riemannian Hessian, and let the associated projection matrix be the one onto the linear subspace spanned by these columns. We define the principal Riemannian gradient of a given order by
![]() |
(4.2) |
where the last equality follows from the fact that these columns are orthogonal to the unit vector x. The directional density ridge (i.e. the density ridge on the unit hypersphere) of the corresponding order is the set of points defined as
![]() |
(4.3) |
Our definition of density ridges on the unit hypersphere can arguably be generalized to any smooth function supported on an arbitrary Riemannian manifold. It also follows that the 0-ridge is the set of local modes of the density on the sphere, whose statistical properties and practical estimation algorithm are discussed in [118]. Therefore, we only focus on ridges of positive order in this paper. To regularize the directional density ridge, we modify our assumptions on the Euclidean density ridge in Section 3.1 as follows:
- (A1) (Differentiability) Under the extension (4.1) of the directional density, we assume that its total gradient, total Hessian matrix and third-order derivative tensor exist, are continuous on the punctured ambient space and are square integrable on the unit hypersphere. We also assume that the density has bounded fourth-order derivatives on the sphere.
- (A2) (Eigengap) We assume that there exist constants such that the stated eigengap inequalities hold within a neighborhood of the directional ridge.
- (A3) (Path Smoothness) Under the same neighborhood as in (A2), we assume that there exists another constant such that the stated smoothness bound holds along the relevant paths.
Recall that the neighborhood above is a tubular neighborhood of the directional ridge in the ambient space. The discussions about conditions (A1–3) in Section 3.1 apply to their directional counterparts, except that the eigengap condition (A2) is imposed on eigenvalues within the tangent space at each point. However, since the only eigenvalue of the Riemannian Hessian associated with an eigenvector outside the tangent space is 0, the eigengap condition (A2) is also valid for the entire spectrum of the Riemannian Hessian in the ambient space. The extension (4.1) has also been used by [43, 44, 120]. Because the directional density f remains unchanged along every radial direction under the extension (4.1), the radial component of its total gradient vanishes, i.e. x^T ∇f(x) = 0 for all x on the sphere, and the Riemannian gradient (2.8) of f on the sphere becomes

grad f(x) = (I − x x^T) ∇f(x) = ∇f(x). (4.4)

Similarly, the Riemannian Hessian (2.10) of f on the sphere reduces to

H f(x) = (I − x x^T) ∇∇f(x) (I − x x^T). (4.5)

Both the Riemannian gradient and Hessian of f on the sphere are invariant under this extension.
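As a quick numerical sanity check of this invariance, the following sketch builds a radially constant extension of a smooth bump on the unit sphere and verifies that the radial derivative vanishes, so the tangent-space projection leaves the total gradient unchanged. The bump function, concentration value and test point are illustrative choices of ours, not objects from the paper.

```python
import numpy as np

# Radially constant extension f(x) = g(x / ||x||) of a smooth function g
# on the unit sphere; g is an (unnormalized) von Mises-Fisher-style bump.
mu = np.array([0.0, 0.0, 1.0])

def f_ext(x):
    u = x / np.linalg.norm(x)
    return np.exp(3.0 * u @ mu)

def num_grad(f, x, eps=1e-6):
    """Central finite-difference gradient."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return g

x = np.array([0.6, 0.0, 0.8])                 # a point on the unit sphere
grad = num_grad(f_ext, x)
proj = (np.eye(3) - np.outer(x, x)) @ grad    # Riemannian gradient, cf. (4.4)

assert abs(x @ grad) < 1e-4                   # radial component vanishes
assert np.allclose(proj, grad, atol=1e-4)     # projection is a no-op here
```

The same check fails for a generic (non-radially-constant) extension, which is precisely why the extension (4.1) is convenient.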
Remark 4.1. (Connection to Solution Manifolds.)
Example 4 in [28] showed that any Euclidean density ridge defined in (1.1) is a concrete example of a solution manifold, i.e. the zero set of a vector-valued function. It is not difficult to verify that our directional density ridge defined in (4.3) also takes the general form of a solution manifold, where the defining vector-valued function is built from the projections of the Riemannian gradient onto the trailing eigenvectors of the Riemannian Hessian of the directional density. More importantly, our imposed conditions (A1–3) in the Euclidean ridge case and their directional counterparts imply all the required assumptions in [28], i.e. the differentiability of the defining function and the non-degeneracy of the normal space of the manifold; see (d) of Lemmas C.1 and G.1 in the Appendix. Therefore, the discussions about statistical properties and (normal) gradient flows of a generic solution manifold apply to the Euclidean and directional density ridges here.
Similar to the Euclidean case, we establish the following stability theorem for directional density ridges. To measure the distance between two directional ridges defined by two directional densities, we adopt the definition (3.4) of the Hausdorff distance between two sets in the ambient Euclidean space. Note that the Euclidean norm used in the definition (3.4) is upper bounded by the geodesic distance when the sets of interest lie on the unit hypersphere; see also (2.5). We will leverage this property in our proof of Theorem 4.1; see Appendix H for details.
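For intuition, the sketch below computes the Hausdorff distance between two finite point sets on the unit sphere under both the Euclidean (chord) and geodesic (arc) distances; since the chord length never exceeds the arc length, the Euclidean Hausdorff distance is bounded by the geodesic one, which is the property used above. The random point sets are purely illustrative.

```python
import numpy as np

def hausdorff(A, B, dist):
    """Hausdorff distance between finite point sets A, B under `dist`."""
    d = np.array([[dist(a, b) for b in B] for a in A])
    return max(d.min(axis=1).max(), d.min(axis=0).max())

euclid = lambda a, b: np.linalg.norm(a - b)                    # chord length
geodesic = lambda a, b: np.arccos(np.clip(a @ b, -1.0, 1.0))   # arc length

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
A /= np.linalg.norm(A, axis=1, keepdims=True)
B = rng.normal(size=(60, 3))
B /= np.linalg.norm(B, axis=1, keepdims=True)

# Chord <= arc pointwise on the unit sphere, and Hausdorff distance is
# monotone in the underlying metric, so the inequality always holds.
assert hausdorff(A, B, euclid) <= hausdorff(A, B, geodesic)
```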
Theorem 4.1.
Suppose that conditions (A1–3) hold for the directional density and that condition (A1) holds for its perturbation. When the perturbation is sufficiently small,
(a) conditions (A2–3) hold for the perturbed density.
(b)
.
(c)
for a constant
.
One natural estimator of the directional density ridge can be obtained by plugging the directional KDE into the definition (4.3) as
![]() |
To regularize the statistical behavior of the estimated directional ridge, we consider the following assumptions, which generalize conditions (E1–2):
- (D1) Assume that the kernel is a bounded and three times continuously differentiable function with a bounded fourth-order derivative on its domain, and that the stated integrability conditions hold for some constant.
- (D2) We assume that the associated class of kernel-based functions is a bounded VC (subgraph) class of measurable functions on the unit hypersphere; that is, there exist constants such that the usual polynomial covering number bound holds for any probability measure on the sphere and any small radius, where the covering numbers are computed with respect to the corresponding weighted L2-norm and an envelope function of the class.
The differentiability assumption in condition (D1) can be relaxed so that the kernel is (three times) continuously differentiable except on a set of points of Lebesgue measure zero. Conditions (D1) and (A1) are generally required for establishing the pointwise convergence rates of the directional KDE and its derivatives [43, 44, 51, 64, 120]. Under these two conditions, the quantity appearing in the adaptive step sizes of the directional mean shift or SCMS algorithm can also be shown to diverge as the sample size tends to infinity and the bandwidth tends to zero; see Section 4.2 for details. Condition (D2) regularizes the complexity of the kernel and its derivatives, as condition (E2) does, so that uniform convergence rates of the directional KDE and its derivatives hold; see (4.6) below. One can justify via integration by parts that the von Mises kernel and many compactly supported kernels satisfy conditions (D1–2).
Given conditions (D1–2), the techniques in [7, 43, 44, 51, 118, 120] can be utilized to demonstrate that
![]() |
(4.6) |
where the derivatives are taken with respect to the Riemannian connection (covariant derivative) on the unit hypersphere; see Section 5.3 in [1] and Chapter 4 in [69].
4.2 Mean Shift and SCMS Algorithm with Directional Data
Before deriving our directional SCMS algorithm, we first review the mean shift algorithm with directional data [62, 80, 113]; the formal derivation can be found in Section 3 of [118]. Given the directional KDE in (2.4), the directional mean shift vector can be defined as
![]() |
(4.7) |
Similar to the Euclidean mean shift vector (3.7), the directional mean shift vector also points toward the direction of maximum increase in the directional KDE after being projected onto the tangent space. Thus, the directional mean shift iteration translates a point along this vector, with an extra standardization to draw the shifted point back to the sphere.
Let {y_t} denote the sequence defined by the above directional mean shift procedure; later, by abuse of notation, we will use the same notation for the directional SCGA/SCMS sequence. As each update is parallel to a weighted sum of the data points, some simple algebra shows that the directional mean shift algorithm can be written as the following fixed-point iteration formula:

y_{t+1} = − [∑_{i=1}^n X_i · L′((1 − X_i^T y_t)/h²)] / ‖∑_{i=1}^n X_i · L′((1 − X_i^T y_t)/h²)‖, (4.8)

where y_t denotes the current iterate, L the kernel profile and h the bandwidth. From (4.8), it is also possible to write the directional mean shift algorithm as a gradient ascent method on the sphere with the iteration formula [116]:
![]() |
(4.9) |
where the adaptive step size is given by
![]() |
(4.10) |
Here, we denote the angle between the current point and its mean shift update (equivalently, the angle between the current point and the shifted point before standardization) as in Section 5.2 of [118], where detailed derivations can be found. Within small neighborhoods around the local modes of the directional KDE, the cosine of this angle is close to 1, and the adaptive step size is dominated by the remaining normalizing factor. The following lemma characterizes the asymptotic behavior of this factor and, consequently, of the adaptive step size.
Lemma 4.1. (Lemma 10 in [118]).
Assume conditions (D1) and (A1). For any fixed point, the stated divergence holds as the sample size tends to infinity and the bandwidth tends to zero, where the constant involved depends only on the kernel and the dimension.
Lemma 4.1 indicates that the relevant normalizing quantity diverges with probability tending to 1 as the sample size tends to infinity and the bandwidth tends to zero. The conclusion may seem counterintuitive at first glance, but one should be aware that the consistency of the directional KDE and its derivatives holds only in the tangential components; see (4.6). The radial component of the estimated total gradient, which is perpendicular to the tangent space, diverges, despite the fact that the true directional density does not have any radial component. Using Lemma 4.1, one can argue that the adaptive step size in (4.10) of the directional mean shift algorithm, viewed as a gradient ascent method on the sphere, tends to zero as the sample size tends to infinity and the bandwidth tends to zero.
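The fixed-point iteration (4.8) is straightforward to prototype. The sketch below uses the von Mises kernel profile L(r) = exp(−r), so that −L′(r) = exp(−r); the data-generating mechanism, bandwidth and iteration count are illustrative assumptions of ours rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, 0.0, 1.0])
# Noisy cloud around the pole mu, projected onto the unit sphere S^2.
X = rng.normal(size=(500, 3)) * 0.3 + mu
X /= np.linalg.norm(X, axis=1, keepdims=True)

def mean_shift_step(y, X, h):
    """One directional mean shift update, cf. the fixed-point form (4.8)."""
    w = np.exp(-(1.0 - X @ y) / h**2)      # -L'((1 - x_i^T y)/h^2) >= 0
    v = (w[:, None] * X).sum(axis=0)       # weighted sum of data points
    return v / np.linalg.norm(v)           # standardize back to the sphere

y = np.array([1.0, 0.0, 0.0])              # arbitrary starting point
for _ in range(100):
    y = mean_shift_step(y, X, h=0.5)

assert y @ mu > 0.95    # the iterates drift toward the high-density pole
```

Note that the update never leaves the sphere, matching the standardization step in (4.8).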
In the sequel, we use the same notation for the iterative sequence generated by our directional SCMS algorithm. There are two different ways to define a directional SCMS iteration; we will demonstrate that one of them is superior.
Method 1: As in the Euclidean SCMS algorithm, one can define the directional SCMS sequence by the directional mean shift vector (4.7) as
![]() |
(4.11) |
where the projection matrix onto the estimated subspace is the estimated version of its population counterpart defined by the directional KDE. Here, we plug in (4.7) and leverage the orthogonality between the columns of the estimated eigenvector matrix and the current point in the step marked (*).
Unlike the Euclidean SCMS algorithm, we need an extra standardization step to project the updated point back to the sphere, which leads to the following fixed-point iteration:
![]() |
(4.12) |
where the two components of the update are always orthogonal at every iteration; see Fig. 3 for a graphical illustration.
Fig. 3.
An illustration of one-step iterations under two candidate directional SCMS algorithms
Method 2: The fixed-point iteration formula (4.8) of the directional mean shift algorithm suggests a more efficient formulation of the directional SCMS algorithm as
![]() |
(4.13) |
where we replace the directional mean shift vector in (4.11) with the standardized total gradient estimator. This directional SCMS update is again a fixed-point iteration:
![]() |
(4.14) |
A direct computation demonstrates that, by the non-increasing property of the kernel and elementary facts about the weights involved, we have the inequality
![]() |
(4.15) |
Because the radial components in the directional SCMS iterative formulae (4.12) and (4.14) make no contribution to the movement of the iterates on the sphere, the inequality (4.15) indicates that the directional SCMS algorithm with iterative formula (4.14) takes a larger effective step in moving the SCMS sequence on the sphere. This helps accelerate the movement of points that are far away from the ridge or lie in regions where the estimated density is low. In this sense, the directional SCMS algorithm with iterative formula (4.14) is superior to (4.12); see Fig. 3 for a graphical demonstration. We thus choose Method 2 as our directional SCMS algorithm. Algorithm 2 in Appendix A provides the detailed steps for implementing Method 2 in practice.
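To illustrate the flavor of a directional SCMS-type update, the following simplified sketch projects a mean shift step onto the estimated minor eigenspace of the Riemannian Hessian of a von Mises KDE and renormalizes. It is a stand-in in the spirit of the iterations above, not the paper's exact Method 2 formula, and the dataset, bandwidth and starting point are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
t = rng.uniform(0.0, 2 * np.pi, size=800)
X = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])
X += rng.normal(scale=0.1, size=X.shape)
X /= np.linalg.norm(X, axis=1, keepdims=True)  # noisy great circle (equator)

h = 0.3

def scms_step(y):
    w = np.exp((X @ y - 1.0) / h**2)           # von Mises kernel weights
    grad = (w[:, None] * X).sum(axis=0)        # unnormalized total gradient
    hess = (X.T * w) @ X / h**2                # unnormalized total Hessian
    P = np.eye(3) - np.outer(y, y)             # tangent-space projector
    rhess = P @ (hess - (y @ grad) * np.eye(3)) @ P   # Riemannian Hessian
    V = np.linalg.eigh(rhess)[1][:, :1]        # most negative tangent direction
    m = (w @ X) / w.sum() - y                  # directional mean shift vector
    y_new = y + V @ (V.T @ m)                  # subspace constrained step
    return y_new / np.linalg.norm(y_new)       # standardize back to sphere

y = np.array([0.0, np.cos(np.pi / 12), np.sin(np.pi / 12)])  # 15 deg off ridge
for _ in range(100):
    y = scms_step(y)

assert abs(y[2]) < 0.1     # the iterate settles near the equatorial ridge
```

The key design choice mirrors the text: only the component of the shift lying in the minor eigenspace of the Hessian moves the point, so the iterate slides onto the ridge rather than onto a mode.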
Inspired by Proposition 2 in [46] for the Euclidean SCMS algorithm, we derive the ascending property of our directional SCMS algorithm (4.13) and two convergent results for stopping the algorithm in the following proposition. The proof is deferred to Appendix I, in which our argument is similar to but logically different from the proof of Proposition 2 in [46].
Proposition 4.2.
Assume that the directional kernel is non-increasing, twice continuously differentiable and convex. Given the directional KDE and the directional SCMS sequence defined by (4.13) or (4.14), the following properties hold:
(a) The estimated density values along the sequence are non-decreasing and thus converge.
(b) The norms of the principal Riemannian gradient estimates along the sequence converge to zero.
(c) If the kernel is also strictly decreasing on its support, then the distance between consecutive iteration points converges to zero.
Remark 4.2.
Our results (b) and (c) in Proposition 4.2 demonstrate that the stopping criterion of our directional SCMS algorithm can be based either on the norm of the principal Riemannian gradient estimator or on the (Euclidean) distance between two consecutive iteration points, where the latter requires a strictly decreasing kernel such as the von Mises kernel.
Motivated by the iterative formula (4.9) for the gradient ascent algorithm on the sphere, we consider writing our directional SCMS algorithm as a variant of the SCGA algorithm on the sphere with an iterative formula:
![]() |
(4.16) |
where the exponential map is taken at the current iterate and the step size is adaptive. Analogous to the Euclidean SCMS algorithm and its SCGA representation (3.11), the formulation (4.16) will reveal the (linear) convergence properties of our directional SCMS algorithm in the upcoming Section 4.3. To derive an explicit formula for the adaptive step size, we recall the fixed-point equation (4.14) of our directional SCMS algorithm and compute the geodesic distance between the current iterate and its one-step directional SCMS update as
![]() |
where, in the second equality, we equate the geodesic distance between consecutive iterates to the norm of the tangent vector inside the exponential map in (4.16). This suggests that our directional SCMS algorithm is a sample-based SCGA algorithm on the sphere with adaptive step size
![]() |
(4.17) |
for every iteration, where the primed angle denotes the angle between the current iterate and its one-step update. Note that the above derivation is based on the orthogonality between the current iterate and the principal Riemannian gradient estimator
![]() |
see Fig. 3 for a graphical illustration. When our directional SCMS algorithm approaches the estimated ridge, this angle tends to 0 and its cosine is approximately equal to 1. Thus, the step size is again controlled by the remaining normalizing factor, as in the directional mean shift scenario; see Equation (4.10). Therefore, Lemma 4.1 still applies to argue that the step size converges to 0 with probability tending to 1 as the sample size tends to infinity and the bandwidth tends to zero.
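For reference, the exponential map appearing in (4.16) admits a closed form on the unit sphere: Exp_x(v) = cos(‖v‖) x + sin(‖v‖) v/‖v‖ for a tangent vector v at x. This is a standard formula, sketched below with an illustrative tangent vector:

```python
import numpy as np

def exp_map(x, v):
    """Exponential map on the unit sphere: Exp_x(v) for tangent v at x."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return x
    return np.cos(t) * x + np.sin(t) * v / t

x = np.array([0.0, 0.0, 1.0])
v = np.array([np.pi / 2, 0.0, 0.0])       # tangent at x, length pi/2
y = exp_map(x, v)

# A quarter great circle lands on the equator, staying on the sphere.
assert np.allclose(y, [1.0, 0.0, 0.0], atol=1e-12)
assert abs(np.linalg.norm(y) - 1.0) < 1e-12
```

Because the geodesic step length equals ‖v‖, this is exactly the quantity matched to the adaptive step size in the derivation above.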
4.3 Linear Convergence of Population and Sample-Based SCGA Algorithms on the Sphere
As we have shown in (4.16) that our proposed directional SCMS algorithm is an example of the sample-based SCGA method on the sphere with the directional KDE as its objective and an adaptive step size, our main focus in this subsection will be on the (linear) convergence of such SCGA algorithms on the sphere. We first consider the population SCGA algorithm on the sphere defined by its iterative formula as
![]() |
(4.18) |
with a suitable choice of the step size. The sample-based version substitutes the subspace constrained Riemannian gradient with its estimator and generally has a constant step size; see (4.16). In the sequel, we distinguish the sequence defined by the population SCGA algorithm, whose objective is the true directional density, from the sequence defined by the sample-based SCGA algorithm, whose objective is the directional KDE.
Remark 4.3.
Note that the definition (4.18) of the SCGA algorithm is adaptive to any Riemannian manifold, not restricted to the unit hypersphere. The only requirement for (4.18) to be valid is that the exponential map is well defined within a small neighborhood of the origin of the tangent space at each point. More importantly, our assumptions (A1–3) and condition (A4) are generalizable to any smooth function supported on such a manifold, and our (linear) convergence results are applicable to the SCGA algorithm (4.18) on any Riemannian manifold whose sectional curvature is lower bounded by a real number; see one of the key lemmas in our proofs (Lemma I.1).
Similar to the SCGA algorithm in the Euclidean space, the following proposition demonstrates that the SCGA algorithm (4.18) on the sphere yields a non-decreasing sequence of objective function values and a SCGA sequence converging to the directional ridge, as long as the step size is sufficiently small.
Proposition 4.3. (Convergence of the SCGA Algorithm on the Sphere.)
For any SCGA sequence defined by (4.18) with a sufficiently small step size, the following properties hold:
(a) Under condition (A1), the sequence of objective function values is non-decreasing and thus converges.
(b) Under condition (A1), the norms of the subspace constrained Riemannian gradients along the sequence converge to zero.
(c) Under conditions (A1–3), the SCGA sequence converges to the directional ridge whenever the initial point lies within a convergence radius satisfying the stated bound, where one constant is defined in (h) of Lemma G.1 while the other is a quantity depending on both the dimension and the functional norm of the density up to its fourth-order (partial) derivatives.
The proof of Proposition 4.3 can be found in Appendix I. The upper bound for the convergence radius has the same meaning as in Proposition 3.1 for the Euclidean SCGA algorithm, ensuring that the distances from the SCGA sequence on the sphere to the directional ridge can be upper bounded by the norms of the principal Riemannian gradients at every iteration.
Corollary 4.1. (Convergence of the Directional SCMS Algorithm.)
When the fixed sample size is sufficiently large and the bandwidth is chosen to be correspondingly small, the following properties hold for the directional SCMS sequence with high probability under conditions (A1–3) and (D1–2):
(a) The directional KDE values along the sequence are non-decreasing and thus converge.
(b) The norms of the principal Riemannian gradient estimates along the sequence converge to zero.
(c) The directional SCMS sequence converges to the estimated directional ridge whenever the initial point lies within the convergence radius defined in (c) of Proposition 4.3.
Corollary 4.1 should also be considered as the convergence results of the sample-based SCGA algorithm on the sphere. To justify Corollary 4.1, we know from Theorem 4.1 that conditions (A1–3) also hold with high probability for the directional KDE and its estimated directional ridge when the sample size is sufficiently large and the bandwidth is small enough. Further, by Lemma 4.1, the adaptive step size of our directional SCMS algorithm can be smaller than the threshold value in Proposition 4.3 but also universally bounded away from zero with respect to the iteration number, given a sufficiently large but fixed sample size and a sufficiently small bandwidth; recall our Remark 3.2. As a result, Corollary 4.1 follows from Proposition 4.3. Notice that the statements in Proposition 4.2 are essentially the same as the results (a–b) in Corollary 4.1 here. However, similar to Proposition 2 in [46] for the Euclidean SCMS algorithm, Proposition 4.2 for the directional SCMS algorithm is established under the convexity assumption on the directional kernel and holds for any sample size and bandwidth. On the contrary, the results (a–b) in Corollary 4.1 are asymptotic and probabilistic properties, for which we require the sample size to tend to infinity and the bandwidth to zero.
According to Proposition 4.3 and Corollary 4.1, we can speak of the limiting points of the population and sample-based SCGA algorithms on the sphere. The definition of the linear convergence of any convergent sequence on the sphere (or an arbitrary Riemannian manifold) is similar to the one in the flat Euclidean space (see Definition 3.2), except that the Euclidean distance is replaced with the geodesic distance; see Section 4.5 in [1].
Using the notation in [116], we introduce an auxiliary curvature-dependent function of the step size and the distance to the limiting point. Given that the sectional curvature is constant on the unit hypersphere, one can show by differentiation that this function is strictly increasing in its distance argument and bounded for any admissible step size. Analogous to the Euclidean SCGA algorithms, we will establish the linear convergence of the SCGA sequence on the sphere (or any Riemannian manifold whose sectional curvature is lower bounded by a real number), as well as its sample-based version, under the following local condition.
- (A4) (Quadratic Behaviors of Residual Vectors) We assume that the SCGA sequence on the sphere, with its step size and limiting point as above, satisfies the stated quadratic bounds for some constant, where a second constant is the one defined in condition (A2) and the bounds are expressed through the logarithmic map.
Condition (A4) serves as a generalization of its Euclidean counterpart to the sphere, and again requires a quadratic behavior of the residual vector within the tangent space. Under this condition, the objective (density) function is ‘subspace constrained geodesically strongly concave’ around the directional ridge; see also Remark 4.4. Some discussions about potentially weaker assumptions that imply condition (A4) in Appendix E are also applicable in the manifold setting after some modifications; see Remark E.1. One intuitive example in which condition (A4) holds is presented in the second row of Fig. 5, where the directional SCMS/SCGA iterative vector is always orthogonal to the residual space around the (estimated) ridge on the sphere.
Fig. 5.
Density ridges estimated by the directional SCMS algorithm performed on the two simulated datasets and their (linear) convergence plots. Horizontally, the first row displays the results on the simulated vMF mixture dataset, while the second row presents the results on the simulated circular dataset on the sphere. Vertically, the first column includes plots with the directional KDE, estimated ridges and trajectories of directional SCMS sequences from two (randomly) chosen initial points. The second and third columns present the convergence plots for the log-distances of points in the highlighted sequences (indicated by hollow cyan points) to their limiting points or the estimated ridges.
Theorem 4.4. (Linear Convergence of the SCGA Algorithm on the Sphere.)
Assume conditions (A1–4) throughout the theorem.
(a) Q-Linear convergence of the population SCGA sequence: Consider a convergence radius satisfying the stated upper bound, in which one constant is defined in (h) of Lemma G.1 and another is a quantity defined in (c) of Proposition 4.3 that depends on both the dimension and the functional norm of the density up to its fourth-order (partial) derivatives. Whenever the step size is below the stated threshold and the initial point lies within this radius of the directional ridge, the population SCGA sequence converges Q-linearly to its limiting point.
(b) R-Linear convergence of the distance to the directional ridge: Under the same radius as in (a), whenever the step size is below the stated threshold and the initial point lies within this radius of the directional ridge, the (geodesic) distances from the population SCGA sequence to the ridge converge R-linearly to zero.
We further assume conditions (D1–2) in the rest of the statements, with the sample size tending to infinity and the bandwidth tending to zero at suitable rates.
(c) Q-Linear convergence of the sample-based SCGA sequence: Under the same radius and step size threshold as in (a), the sample-based SCGA sequence converges Q-linearly to its limiting point with probability tending to 1, under the same requirements on the step size and the initial point relative to the estimated directional ridge.
(d) R-Linear convergence of the distance to the estimated directional ridge: Under the same radius and step size threshold as in (a), the (geodesic) distances from the sample-based SCGA sequence to the estimated directional ridge converge R-linearly to zero with probability tending to 1, under the same requirements on the step size and the initial point.
The detailed proof of Theorem 4.4 is in Appendix I. The theorem illuminates both the step size requirement and the convergence radius for the linear convergence of SCGA algorithms on the sphere. Similar to the Euclidean SCGA algorithms in Theorem 3.3, the upper bound of the convergence radius consists of three quantities adopted from Proposition 4.3 and a quantity controlling the ‘subspace constrained geodesically strong concavity’ around the directional ridge.
Remark 4.4.
Similar to the Euclidean SCGA algorithms, the geodesically strong concavity assumption [116] on the objective function is not sufficient to prove the linear convergence of the SCGA algorithm (4.18) on the sphere. We instead establish the following ‘subspace constrained geodesically strong concavity’ under the mild conditions (A1–4):
(4.19) for some constant, where the projection is generally taken onto the principal subspace. In fact, the most critical factors for establishing this property are the eigengap condition (A2) and the quadratic behaviors of residual vectors stated in condition (A4).
Corollary 4.2. (Linear Convergence of the Directional SCMS Algorithm.)
Assume conditions (A1–4) and (D1–2). When the fixed sample size is sufficiently large and the fixed bandwidth is chosen to be sufficiently small, there exists a convergence radius such that the directional SCMS sequence satisfies the stated linear convergence property with high probability, whenever the initial point lies within this radius of the estimated directional ridge.
We also identify Corollary 4.2 as the linear convergence of the sample-based SCGA algorithm on the sphere to the estimated directional ridge defined by the directional KDE. The corollary can be justified by noticing that, under conditions (D1–2) and the uniform bounds (4.6), the directional KDE satisfies conditions (A1–3) with probability tending to 1 as the sample size tends to infinity and the bandwidth tends to zero; see Theorem 4.1. With this fact, one can leverage our argument in (a) of Theorem 4.4 to prove the linear convergence of the sample-based SCGA algorithm on the sphere with a fixed, sufficiently small step size. Additionally, when the fixed sample size is sufficiently large and the bandwidth is chosen to be accordingly small, the adaptive step size of our directional SCMS algorithm in (4.17) always falls below the threshold value for linear convergence by Lemma 4.1, but is also bounded away from zero; recall Remark 3.2. Taking the infimum of the adaptive step size with respect to the iteration number, under a fixed sample size and bandwidth, yields our results in Corollary 4.2.
5. Experiments
In this section, we first validate our linear convergence results of both Euclidean and directional SCMS algorithms on some simulated datasets. Then, we apply these two algorithms to a real-world earthquake dataset so as to identify its density ridges and compare the estimated ridges with boundaries of tectonic plates and fault lines, on which earthquakes are known to happen frequently.
We leverage the Gaussian kernel profile in the Euclidean SCMS algorithm and the von Mises kernel in the directional SCMS algorithm. In addition, the logarithms of the estimated densities are utilized in our actual implementations (Step 2 in Algorithms 1 and 2 in Appendix A) of the Euclidean and directional SCMS algorithms because of two advantages. First, using the log-density in the Euclidean SCMS algorithm leads to faster convergence [46]; see our empirical illustration in Fig. A7. Second, estimating a hidden manifold with a density ridge defined by a log-density stabilizes the valid region for a well-defined ridge, compared with the corresponding ridge defined by the original density; see Theorem 7 (the surrogate theorem) in [45].
Unless stated otherwise, we set the default bandwidth parameter of the Euclidean SCMS algorithm via the normal reference rule in [20, 25], which is
![]() |
(5.1) |
where the rule involves the sample standard deviation along each coordinate and the (Euclidean) dimension of the data.
As mentioned by [25], there are two advantages of applying the normal reference rule (5.1) in our context. First, the resulting KDE tends to oversmooth [97], because the bandwidth minimizes the asymptotic MISE for estimating the first-order derivatives of a multivariate Gaussian density; see Corollary 4 in [20]. More importantly, the Euclidean SCMS algorithm with an oversmoothed KDE does not produce too many spurious ridges. Second, compared with cross-validation methods, the normal reference rule is easy to compute in practice, especially when the dimension of the data is high.
is easy to compute in practice, especially when the dimension of data is high. The default bandwidth parameter of the directional SCMS algorithm is selected via the rule of thumb in Proposition 2 of [43], which optimizes the asymptotic MISE for a
distribution. The concentration parameter
is estimated by Equation (4.4) in [9]. That is,
![]() |
(5.2) |
where the mean resultant length, i.e. the norm of the sample mean of the directional dataset, appears in the estimate, and we recall that the modified Bessel function of the first kind enters the underlying von Mises–Fisher density. As a highly concentrated von Mises–Fisher distribution behaves like a Gaussian distribution on the sphere, choosing the bandwidth based on (5.2) also helps smooth out the resulting directional KDE.
for any SCMS algorithm.
5.1 Simulation Study on the Euclidean SCMS Algorithm
To evaluate the algorithmic rate of convergence of the Euclidean SCMS algorithm (Algorithm 1), we generate the first simulated dataset by randomly drawing 1000 data points from a Gaussian mixture model with density
, where
,
and
. Another simulated dataset consists of 1000 data points randomly generated from an upper half circle with radius 2 and i.i.d. Gaussian noises
. When applying Algorithm 1 with the estimated log-density on each of these two simulated datasets, we choose the set of initial mesh points as the simulated dataset itself and, to obtain a cleaner ridge structure, remove from the mesh those initial points whose density values are below 25% of the maximum density.
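The two Euclidean simulated datasets can be generated along the following lines; the mixture means, weights, covariance, and noise scale below are illustrative assumptions, since the exact parameter values are given in the (unrendered) equations above.

```python
import numpy as np

rng = np.random.default_rng(123)

def sample_gaussian_mixture(n, means, weights, cov=np.eye(2)):
    """Draw n points from a Gaussian mixture with the given component
    means and mixing weights (illustrative stand-in for Dataset 1)."""
    labels = rng.choice(len(means), size=n, p=weights)
    return np.array([rng.multivariate_normal(means[k], cov) for k in labels])

def sample_half_circle(n, radius=2.0, noise_sd=0.2):
    """Draw n points on an upper half circle of the given radius plus
    i.i.d. Gaussian noise (illustrative stand-in for Dataset 2)."""
    theta = rng.uniform(0.0, np.pi, size=n)
    pts = radius * np.column_stack([np.cos(theta), np.sin(theta)])
    return pts + rng.normal(scale=noise_sd, size=(n, 2))

def filter_mesh(mesh, density_values, frac=0.25):
    """Keep only mesh points whose estimated density is at least
    frac * (maximum density), as in the denoising step above."""
    return mesh[density_values >= frac * density_values.max()]
```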
Figure 4 presents the Euclidean KDE plots, estimated density ridges from the Euclidean SCMS algorithm and their (linear) convergence plots on the two simulated datasets. The linear trends of those plots in the second and third columns of Fig. 4 empirically demonstrate the correctness of our Theorem 3.1 and Corollary 3.2 about the linear convergence of the Euclidean SCMS algorithm.
Fig. 4.

Density ridges estimated by the Euclidean SCMS algorithm on the two simulated datasets and their (linear) convergence plots. Horizontally, the first row displays the results of the simulated Gaussian mixture dataset, while the second row presents the results of the half circle simulated dataset. Vertically, the first column includes plots with Euclidean KDE, estimated ridges, and trajectories of SCMS sequences from two (randomly) chosen initial points. The second and third columns present the (linear) convergence plots for the log-distances of points in the highlighted sequences (indicated by hollow cyan points) to their limiting points or the estimated ridges.
5.2 Simulation Study on the Directional SCMS Algorithm
Analogous to our simulation study for the linear convergence of the Euclidean SCMS algorithm, we verify the linear convergence of our directional SCMS algorithm (Algorithm 2) on two different simulated datasets. One of them comprises 1000 data points randomly generated from a vMF mixture model
with
,
and
. The other simulated dataset is identical to the example in the right panel of Fig. 1 and the underlying dataset in Fig. B9, which consists of 1000 randomly sampled points from a circle connecting two poles on
with i.i.d. additive Gaussian noises
to their Cartesian coordinates and additional
normalization onto
. In our implementation of Algorithm 2 with the directional log-density on the two simulated datasets, we also set each initial mesh as the dataset itself and remove from each mesh those points whose density values are below 10% of the maximal density value.
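The second directional dataset can be generated as sketched below; placing the circle in the x-z plane and the noise scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_noisy_polar_circle(n, noise_sd=0.2):
    """Sample n points on a great circle through the poles of S^2 (here
    the circle in the x-z plane), add i.i.d. Gaussian noise to the
    Cartesian coordinates, then L2-normalize back onto S^2."""
    t = rng.uniform(0.0, 2.0 * np.pi, size=n)
    circle = np.column_stack([np.cos(t), np.zeros(n), np.sin(t)])
    noisy = circle + rng.normal(scale=noise_sd, size=(n, 3))
    return noisy / np.linalg.norm(noisy, axis=1, keepdims=True)
```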
Figure 5 shows the directional KDE plots, estimated density ridges on
from the directional SCMS algorithm and their (linear) convergence plots on the aforementioned simulated datasets. The linearly decreasing trends in the convergence plots, possibly after a few initial iterations, illustrate the local linear convergence of the directional SCMS algorithm that we proved in Theorem 4.4 and Corollary 4.2. Note that the minor perturbations at the tails of some linear convergence plots in Fig. 5 are due to numerical precision errors.
5.3 Density Ridges on Earthquake Data
It is well known that earthquakes on Earth tend to strike more frequently along the boundaries of tectonic plates and fault lines (i.e. sections where parts of a plate, or two plates, move in different directions); see [54, 103] for more details. We analyze earthquakes with magnitudes of 2.5+ occurring between 2020-10-01 00:00:00 UTC and 2021-03-31 23:59:59 UTC, which can be obtained from the Earthquake Catalog (https://earthquake.usgs.gov/earthquakes/search/) of the United States Geological Survey. The dataset
contains 15049 earthquakes worldwide in this half-year period.
The normal reference rule (5.1) leads to the bandwidth parameter
and the rule of thumb (5.2) yields
under the earthquake dataset
. However, as these bandwidths lead to oversmoothed density estimates, we decrease the bandwidths for the Euclidean and directional SCMS algorithms to
and
respectively, in order to detect more ridge structures. We generate 5000 points uniformly on the sphere
as the initial mesh points.
To compare the earthquake ridges obtained by the Euclidean and directional SCMS algorithms with the boundaries of tectonic plates, we download the boundary geometry file of the 56 tectonic plates from https://www.kaggle.com/cwthompson/tectonic-plate-boundaries according to the models of [5, 13] and overlap them with the estimated ridges in Fig. 6. The results suggest that the ridges identified by the Euclidean and directional SCMS algorithms on the earthquake dataset coincide with the boundaries of tectonic plates to a large extent. Note that the Euclidean and directional ridges on the earthquake dataset
do not differ much, because most of the observed earthquakes occur in the low-latitude region (
) where most of the world's population lives. Yet, the ridges estimated by our proposed directional SCMS algorithm do align better with the boundary of the Eurasian Plate near the North Pole than the ones estimated by the Euclidean SCMS algorithm, which confirms the advantage of our directional SCMS algorithm in the high-latitude region; see also Appendix B for a more in-depth analysis.
Fig. 6.

Comparisons between density ridges obtained by the Euclidean SCMS algorithm on angular coordinates and the directional SCMS algorithm on Cartesian coordinates from the earthquake dataset. On each panel, the ground-truth boundaries of tectonic plates are plotted as blue curves.
We further quantify the performances of earthquake ridges
and
estimated by the Euclidean and directional SCMS algorithms from two different perspectives. First, given the fact that an estimated ridge should lie on the region where earthquakes happen more intensively, we compute the mean geodesic distances from each point in the earthquake dataset
to the ridges
and
, respectively, as
![]() |
where
is the number of earthquakes in the dataset. The ridge
estimated by our directional SCMS algorithm is around 4% closer to the earthquakes in
on average. Second, we assess the estimation errors of
and
with respect to the boundaries of tectonic plates. To this end, we view the surface of the Earth as a unit sphere
and define a manifold-recovering error measure [119] between the set of boundary points
and an estimated ridge
as
![]() |
(5.3) |
where
and
are the cardinalities of
and
, respectively. Note that although the density ridge
and the boundaries of tectonic plates
are continuous structures in theory, they are generally represented by sets of discrete points in practice. That is why we can calculate their cardinalities without computing complicated integrals. Moreover, the manifold-recovering error measure is an average between the mean geodesic distances from each point in
to
and from each point in
to
. We define such a balanced error measure to avoid biasing toward an estimated ridge
that only approximates a small portion of
with high accuracy but fails to cover other parts of
; see Fig. 4 in [119] for an illustrative example. The manifold-recovering error measures of the ridges
and
estimated by the Euclidean and directional SCMS algorithms with respect to the boundaries of tectonic plates
are
![]() |
Our directional SCMS algorithm again reduces the estimation error by around 3.9%. In summary, the earthquake ridges yielded by our directional SCMS algorithm are not only closer to the earthquakes on average than the ones identified by the Euclidean SCMS algorithm but also have a lower error in approximating the boundaries of tectonic plates.
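Both error measures admit a direct implementation once the ridges and boundaries are represented as finite sets of unit vectors, as discussed above; the function names below are ours.

```python
import numpy as np

def geodesic_dist(X, Y):
    """Pairwise great-circle distances between unit vectors (rows of X, Y)."""
    cos = np.clip(X @ Y.T, -1.0, 1.0)
    return np.arccos(cos)

def mean_geodesic_to_ridge(data, ridge):
    """Mean geodesic distance from each data point to its nearest ridge point."""
    return geodesic_dist(data, ridge).min(axis=1).mean()

def manifold_recovering_error(boundary, ridge):
    """Symmetrized error measure in the spirit of (5.3): average of the two
    one-sided mean nearest-neighbour geodesic distances, following [119]."""
    D = geodesic_dist(boundary, ridge)
    return 0.5 * (D.min(axis=1).mean() + D.min(axis=0).mean())
```

The symmetrization penalizes a ridge that tracks only a small portion of the boundary, exactly as argued above.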
6. Discussions
In this paper, we have provided a rigorous proof for the linear convergence of the well-known SCMS algorithm by viewing it as an example of the SCGA algorithm. We have also generalized the definition of density ridges from the usual densities supported on compact sets in
to the directional densities supported on
with nonzero curvature. The stability theorem of directional density ridges has been established, and the linear convergence of our proposed directional SCMS algorithm has been proved. Table 1 summarizes the frameworks of considering the (directional) mean shift/SCMS algorithms as gradient ascent/SCGA methods (on
) and our results of asymptotic convergence rates of their corresponding step sizes.
Table 1.
Comparisons between the Euclidean and directional mean shift (MS) or SCMS algorithms and summary of the asymptotic convergence rates of their adaptive step sizes when viewed as GA/SCGA algorithms in
or on
.
Our theoretical analyses of the SCGA algorithm in the Euclidean space
and on the unit hypersphere
have potential implications beyond proving the linear convergence of SCMS algorithms. In the optimization literature [1, 77, 78, 116], it is well known that a standard gradient ascent method (on a smooth manifold) will converge linearly given an appropriate step size when the objective function is smooth and (geodesically) strongly concave. However, as we have discussed in Remarks 3.3 and 4.4, the smoothness and (geodesically) strong concavity assumptions are not sufficient for the linear convergence of the SCGA algorithms. Therefore, identifying density ridges with the SCGA algorithms is not only a nonconvex optimization problem, but also fundamentally more complex than standard gradient ascent methods. The assumptions and proof arguments developed in this paper may give some insights into the linear convergence of the SCGA algorithms with other forms of subspace constrained gradients.
There are still many open problems related to the SCMS algorithm. First, a central issue in determining the performance of an SCMS algorithm is the bandwidth selection. There is a variety of bandwidth selection mechanisms available to the Euclidean KDE and its derivatives in the literature [20, 96], but it is unclear how they can be applied to the SCMS algorithm. We plan to specialize or generalize such techniques to the SCMS algorithm for both Euclidean and directional data. Second, our definition of density ridges is generalizable to any density supported on an arbitrary Riemannian manifold. As [56] has formulated the principal curve on a Riemannian manifold based on its classical definition in [55], it will be interesting to propose a new definition of principal curves from the perspective of density ridges on Riemannian manifolds and derive a more general SCMS algorithm, possibly based on some existing nonlinear mean shift methods on manifolds [104, 105].
Data Availability Statement
The data and code underlying this paper are available at https://github.com/zhangyk8/EuDirSCMS. Specifically, the earthquake data in Section 5.3 were obtained from the Earthquake Catalog (https://earthquake.usgs.gov/earthquakes/search/) of the United States Geological Survey.
Supplementary Material
Acknowledgment
We thank the anonymous reviewers for their helpful comments that improved the quality of this paper.
Funding
Y.C. is supported by the National Science Foundation [DMS-1952781 and DMS-2112907] and CAREER award [DMS-2141808], and the National Institutes of Health [U24-AG072122].
A. Algorithmic Summaries of Euclidean and Directional SCMS Algorithms
In this section, we provide algorithmic summaries of the Euclidean and directional SCMS algorithms for practical reference. Algorithm 1 describes each step of the Euclidean SCMS algorithm in detail. In our actual implementation of the algorithm, we replace the density estimator
with
. To demonstrate that the (directional) SCMS algorithms under the log-density implementation give rise to a faster convergence process, we repeat our experiments in Sections 5.1 and 5.2 (i.e. Figs 4 and 5) 20 times for each simulated dataset with the (directional) SCMS algorithms under the original (estimated) density and the (estimated) log-density, respectively. The comparisons between their running times are shown in Fig. A7, in which the (directional) SCMS algorithms under the log-density implementation clearly outperform their counterparts with the original density in terms of the average elapsed time until convergence.
Fig. A7.

Running time comparisons between the (directional) SCMS algorithms with the original density and the log-density applied to our simulated datasets in Figs 4 and 5.
Additionally, when the observational data in practice are noisy, it is common to incorporate an extra denoising step before Step 2 of Algorithm 1 to remove observations in low-density areas and stabilize the (Euclidean) SCMS algorithm; see [23, 45] for comparative studies that demonstrate the significance of denoising.
We summarize the directional SCMS algorithm in Algorithm 2. Note that in Step 2-1 of Algorithm 2, we compute the scaled versions
and
for
because the estimated principal Riemannian gradient
and Hessian
are often very small. The scaling stabilizes the numerical computation. The spectral decomposition is thus performed on the scaled Hessian estimator
, and the scaled principal Riemannian gradient estimator is calculated as
![]() |
where
has its columns equal to the orthonormal eigenvectors associated with the
smallest eigenvalues of the scaled Hessian estimator
(or equivalently,
) inside the tangent space
.
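The spectral decomposition and projection in Step 2-1 can be sketched as follows. This is a simplified illustration with our own function name, not the exact scaled implementation of Algorithm 2: we project the Hessian onto the tangent space, discard the radial eigenvector, and project the gradient onto the span of the retained tangent eigenvectors.

```python
import numpy as np

def projected_gradient_step(x, grad, hess, d=1):
    """Project the (Riemannian) gradient at a unit vector x on S^q onto
    the span of the d eigenvectors of the tangent-space Hessian with the
    smallest eigenvalues -- a simplified sketch of Step 2-1 of Algorithm 2."""
    # Restrict to the tangent space by projecting out the radial direction.
    P = np.eye(len(x)) - np.outer(x, x)
    H_t = P @ hess @ P
    vals, vecs = np.linalg.eigh(H_t)  # eigenvalues in ascending order
    # Keep eigenvectors lying in the tangent space (orthogonal to x),
    # then take the d with the smallest eigenvalues.
    tangent = [v for v in vecs.T if abs(v @ x) < 1e-8]
    V = np.column_stack(tangent[:d])
    return V @ V.T @ (P @ grad)
```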
B. Limitations of Euclidean KDE in Handling Directional Data
In this section, we demonstrate with examples and simulation studies that it is inadequate to analyze angular or directional data with Euclidean KDE (2.1) and SCMS algorithm (Algorithm 1). Consider a directional data sample
generated from a directional density
on
. In real-world applications, the random observations
on
are commonly represented by their angular coordinates
with
or equivalently,
for
, where
are longitudes and
are latitudes.
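The two coordinate systems are related by the usual conversion, sketched below for reference (longitude and latitude in radians):

```python
import numpy as np

def angular_to_cartesian(lon, lat):
    """(longitude, latitude) in radians -> unit vectors in R^3 (rows)."""
    return np.column_stack([
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ])

def cartesian_to_angular(X):
    """Unit vectors in R^3 (rows) -> (longitude, latitude) in radians."""
    lon = np.arctan2(X[:, 1], X[:, 0])
    lat = np.arcsin(np.clip(X[:, 2], -1.0, 1.0))
    return lon, lat
```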
B.1 Case I: Density Estimation
As the angular coordinates
of the directional dataset
have their ranges in a subset
of the flat Euclidean space
, it is tempting to apply the Euclidean KDE on
to construct a density estimator as
![]() |
(B1) |
where
uses a radial symmetric kernel with profile
, and
leverages a product kernel. However, the Euclidean KDEs in (B1) (both
and
) exhibit two potential drawbacks when dealing with directional data.
First,
in (B1) is an estimator of the directional density
under its angular representation
. Here,
is
-periodic in its first coordinate and
-periodic in its second coordinate. Then, the bias of
in estimating
is
![]() |
where
and
is the Laplacian of
; see [27] for details. However, the second-order partial derivative
along the lines of constant latitude (or parallels) would tend to infinity as we approach the north and south poles, given that the first-order partial derivative
is bounded. One way to see this is that the curvatures of these parallels, which are the reciprocals of their radii, tend to infinity as these radii shrink. In addition, one should recall that the curvature of a function
is defined as
. Therefore, applying (B1) to estimate the angular representation
of the directional density
will produce high bias as the estimator
approaches the high-latitude regions (around the north and south poles); see also Panel (c) of Fig. B9.
Second, the Euclidean KDE
leverages the Euclidean distances between any query point
and observations
under their angular coordinates to construct the density estimates, instead of using the (intrinsic) geodesic distances. Note that the Euclidean distance in the angular coordinate system is not equivalent to the Euclidean distance in the ambient Euclidean space
containing the directional data on
. As a result, some observations that have dramatically different geodesic distances to density query points can have the same density contributions in
, as illustrated in Example B.1.
Example B.1.
Suppose that we want to estimate the density values at
and
, where
is of a small value. Consider a random sample consisting of only two observations
and
. If we use the Euclidean distance, the distance between
and the distance between
are the same. Therefore, when we use the Euclidean KDE
to estimate the underlying density, the contribution of
to
will be the same as the contribution of
to
. Nevertheless, their geodesic distances are very different, because
while
is a quantity close to zero; see Fig. B8 for a graphical illustration. This explains, from a different angle, why the Euclidean KDE
will have a large bias in estimating the underlying density when the query point
is within the high latitude region.
Fig. B8.

Graphical illustration of geodesic distances between
and
as well as
and
.
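A small numerical check of the phenomenon in Example B.1, using illustrative points near the North Pole and on the equator of our own choosing (the exact points of the example are not reproduced here):

```python
import numpy as np

def to_cart(lon, lat):
    """(longitude, latitude) in radians -> a unit vector in R^3."""
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def geodesic(u, v):
    """Great-circle distance between two unit vectors."""
    return float(np.arccos(np.clip(u @ v, -1.0, 1.0)))

eps = 0.01  # radians away from the North Pole
# Pair A: two points near the pole, 180 degrees apart in longitude.
a1, a2 = (0.0, np.pi / 2 - eps), (np.pi, np.pi / 2 - eps)
# Pair B: two points on the equator, 180 degrees apart in longitude.
b1, b2 = (0.0, 0.0), (np.pi, 0.0)

# Same Euclidean distance in angular coordinates ...
d_ang_A = np.hypot(a2[0] - a1[0], a2[1] - a1[1])
d_ang_B = np.hypot(b2[0] - b1[0], b2[1] - b1[1])
# ... but vastly different geodesic distances on the sphere:
d_geo_A = geodesic(to_cart(*a1), to_cart(*a2))  # about 2 * eps
d_geo_B = geodesic(to_cart(*b1), to_cart(*b2))  # pi (antipodal points)
```

Both pairs are at angular-coordinate Euclidean distance $\pi$, yet the near-pole pair is geodesically only about $2\varepsilon$ apart, so the Euclidean KDE on angular coordinates treats geodesically close and far observations identically.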
B.2 Case II: Ridge-Finding Problem
Consider the following simulated example of identifying a density ridge via the Euclidean SCMS algorithm (Algorithm 1) and our proposed directional SCMS algorithm (Algorithm 2). We generate 1000 data points
uniformly from a great circle connecting the North and South Poles of
with some i.i.d. additive Gaussian noises
to their Cartesian coordinates. Then, all the simulated points will be standardized back to
via
normalization. The angular coordinates of these simulated points are denoted by
accordingly. Figure B9 presents the result of applying both the Euclidean SCMS algorithm (with the Gaussian kernel) to angular coordinates and the directional SCMS algorithm (with the von Mises kernel) to Cartesian coordinates of our simulated dataset. As shown in the panel (b) of Fig. B9, the Euclidean SCMS algorithm exhibits high bias in estimating the true circular structure near two poles of
, while our directional SCMS algorithm is able to recover the true circular structure with negligible error. The density plot in panel (c) of Fig. B9 exhibits two nonsmooth peaks at the North Pole because the Hessian of the underlying density diverges in its angular coordinates; recall our discussion in Section B.1. This also explains the erratic behavior of the Euclidean KDE in high-latitude regions.
Fig. B9.

Euclidean and directional SCMS algorithms performed on the simulated dataset. Panels (a)–(c): Outcomes of the Euclidean SCMS algorithm with the contour plot for the Euclidean KDE. Panels (d)–(f): Outcomes of our directional SCMS algorithm with the contour plot for the directional KDE. Panels (a)–(b) and (d)–(e) are shown in the view of Hammer projections (page 160 in [100]), while Panels (c) and (f) are presented under the orthographic projections.
At this point, some readers may have a natural concern: why not directly apply the Euclidean SCMS algorithm to the Cartesian coordinates
of the available data points? We discuss the potential downsides of this approach from two different aspects.
The Euclidean SCMS algorithm is not intrinsically designed for handling the directional data
. Directly applying the algorithm to these Cartesian coordinates leads to an estimated ridge not lying on
. While the
normalization is able to standardize the ridge points back to
, this standardization process will inevitably introduce extra bias. When estimating the underlying density of
, we know from (3.5) and some KDE literature [20, 27, 96] that the (uniform) rates of convergence of the Euclidean KDE and its derivatives depend on the dimension
of the ambient space instead of the intrinsic dimension
of directional data. This dimensionality effect also appears in the (linear) convergence of the downstream SCMS algorithm, which, for instance, shrinks the upper bounds of the (linear) convergence radius and step size threshold in Theorem 3.1. Thus, analyzing directional data
with the Euclidean KDE and SCMS algorithm will slow down the statistical and algorithmic rates of convergence of the density estimators as well as lower the accuracy of the resulting ridge in recovering the underlying structure inside the dataset.
To support our above explanations, we extend our simulation study in Fig. B9 as follows. We vary the maximum latitude attained by the underlying (intrinsic) circular structure on
from
to
while keeping the circle parallel to the original great circle connecting the North and South Poles of
; see the panel (a) in Fig. B10 for an illustration. For each of these underlying circles, we follow the same sampling scheme as in Fig. B9, i.e. sampling 1000 points uniformly on the circle with some i.i.d. additive Gaussian noises
to their Cartesian coordinates and
normalization back to
. The Cartesian coordinates of the simulated points from each circular structure are denoted by
while their angular coordinates are represented by
. Then, we apply our directional SCMS algorithm to
from each of these simulated datasets. Moreover, the Euclidean SCMS algorithm is applied to both the angular coordinates
and Cartesian coordinates
from each of these simulated datasets, where we consider
as a dataset in the ambient space
in the latter case. Here, the sets of initial points for the Euclidean and directional SCMS algorithms are the simulated datasets themselves. Finally, we compute the average geodesic distance errors on
from the resulting ridges to the corresponding true circular structures. To reduce the randomness of our simulation studies, we also repeat the above sampling and experimental procedures 20 times for each true circular structure.
Fig. B10.

Euclidean and directional SCMS algorithms applied to the simulated datasets whose true structures are circles on
attaining their maximum latitudes from
to
, respectively. The dots on each line plot in the panels (b–d) are the means of the associated statistics for the repeated experiments, while the error bars indicate their corresponding standard deviations.
We present our comparisons of the Euclidean and directional SCMS algorithms based on three metrics in Fig. B10: (i) average geodesic distance errors between the estimated ridges and the true circular structures, (ii) the number of iteration steps and (iii) the running time. Notice that, as the latitudes of the underlying circular structures increase, the distance errors of (Euclidean) ridges based on the Euclidean SCMS algorithm applied on the angular coordinates
rise. Conversely, the distance errors of directional ridges and the ridges based on the Euclidean SCMS algorithm in
decrease when the true circular structures climb on
; see the panel (b) of Fig. B10. While the performances of our directional SCMS algorithm and the Euclidean SCMS algorithm in
are almost indistinguishable in terms of the average geodesic distance errors, our directional SCMS algorithm significantly outperforms the Euclidean SCMS algorithm with regard to time efficiency; see the panels (c–d) of Fig. B10. Note that the Euclidean SCMS algorithm exhibits high variance in the number of iteration steps under the repeated experiments, because each simulated dataset may contain some outliers that are far away from the true circular structure on
and the Euclidean SCMS algorithm requires an exceptionally large number of iterations to converge when initialized from these outliers. Our directional SCMS algorithm, however, is more stable in its iteration count because it adapts to the geometry of
.
Other potential issues of analyzing directional data with Euclidean methods and ignoring the curvature of
can be found in [30]. In summary, it is inadequate and inefficient to handle directional data with the Euclidean KDE and SCMS algorithm, which motivates the introduction of the directional KDE (2.4) and our SCMS algorithm tailored to directional data (Algorithm 2).
C. Normal Space of the Euclidean Density Ridge
As we will refer to conditions (A1–3) frequently in the next two sections, we restate them here:
(A1) (Differentiability) We assume that
is bounded and at least four times differentiable with bounded partial derivatives up to the fourth order for every
.(A2) (Eigengap) We assume that there exist constants
and
such that
and
for any
.- (A3) (Path Smoothness) Under the same
in (A2), we assume that there exists another constant
such that
for all
and
.
Given a matrix-valued function
, its gradient
will be an
array defined as
. The derivative of
in the direction of a vector
is defined as
![]() |
When the matrix
, we will use the notation
interchangeably to denote its directional derivative along
.
Recall that an order-
ridge of the density
in
is the collection of points defined as
![]() |
Lemma C.1 below shows that under conditions (A1–3), the Jacobian matrix
has rank
at every point of
, and
is a
-dimensional manifold by the implicit function theorem [92]. Consequently, the row space of
spans the normal space to
.
If we define
, the derivation in pages 60–63 of [39] shows that
![]() |
(C1) |
for
, and the column space of
spans the normal space to
. Let
![]() |
for
. Then,
![]() |
(C2) |
However, the columns of
are not orthonormal. Thus, we leverage the orthonormalization in [22] to construct
whose columns are orthonormal and span the same column space as
in the following steps. Under the condition that
has full rank
at every point
(see Lemma C.1),
is positive definite, and we perform the Cholesky decomposition on it, that is,
![]() |
(C3) |
where
is a lower triangular matrix whose diagonal elements are positive. We then define
![]() |
(C4) |
Notice that
intrinsically depend on the dimension
of the ridge
, but we do not explicate these dependencies in their notations. As discussed in [22],
might not be unique because the eigenvalues of
can have their multiplicities greater than 1. Any collection of linearly independent unit eigenvectors of
fits into the above construction for
. However, as will be shown later, this volatility of
will not affect our results, as we only require the smoothness of
to develop a lower bound of
.
Lemma C.1.
Assume conditions (A1–3). Given that
and
are defined in (C2) and (C4), we have the following properties:
(a)and
have the same column space. In addition,
That is,
is the projection matrix onto the columns of
.
(b) The columns of
are orthonormal to each other.
(c) For
, the column space of
is normal to the (tangent) direction of
at
.
(d) For all
,
. Moreover,
is a
-dimensional manifold that contains neither intersections nor endpoints. Namely,
is a finite union of connected and compact manifolds.
(e) For, all the
nonzero singular values of
are greater than
and therefore,
(f) When is sufficiently small and
,
for some constant
.
(g) Assume that another density function also satisfies conditions (A1–3) and
is sufficiently small. Then
for some constant
and any
, where
is the matrix defined in (C4) with the underlying density
.
(h) The reach of satisfies
for some constant
.
Lemma C.1 is extended from Lemma 2 in [22] to handle the density ridge
with
. As our conditions (A1–3) imply the imposed conditions of Lemma 2 in [22], our proof of Lemma C.1 essentially follows from their arguments with some minor modifications.
Proof of Lemma C.1 —
We adopt and generalize parts of the proof of Lemma 2 in [22].
(a) This property is a natural corollary of the Cholesky decomposition as
(b) Some direct calculations show that
(c) It can be proved by the argument of Lemma 1 in [22]. Or, we define an arbitrary parametrized curve
lying within
for some
. Then
aligns with the tangent direction at
. Since
, taking the derivative with respect to
gives us that
with
. Hence, by the arbitrariness of
, the column of
is normal to the tangent direction of
at
.
(d) We prove that the
nonzero singular values of
are bounded away from 0. Recall that
with
for
. Under conditions (A2-3),
It shows that all the singular values of
are less than
. Moreover, under condition (A2) again, all the
nonzero singular values of
are greater than
. By Theorem 3.3.16 in [57], we know that all the
nonzero singular values of
are greater than
. Therefore,
. The rest of the proof follows directly from Claim 4 in [22].
(e) By the proof of (d), we already know that all the
nonzero singular values of
are greater than
. Thus,
, and
where
is the smallest singular value of matrix
.
Finally, the proofs of properties (f), (g) and (h) are essentially the same as the corresponding claims in [22]. We thus omit them.
As we have discussed in Remark 4.1, property (d) of Lemma C.1 demonstrates that our imposed assumptions (A1–3) for the ridge
are sufficient to imply the critical full-rank condition of its normal space in [28] in order for
to be a well-defined solution manifold.
D. Proofs of Lemma 3.2, Proposition 3.1, and Theorem 3.1
Lemma D.1.
Assume conditions (A1) and (E1). The convergence rate of
is
for any
as
and
.
Another interpretation of Lemma 3.2 is that
diverges to infinity at the rate
![]() |
if we select the bandwidth
to minimize the asymptotic MISE [27], where ‘
’ stands for the asymptotic equivalence.
Proof of Lemma 3.2 —
Note that
(D1) Given the differentiability of
guaranteed by condition (A1), the expectation of
is given by
By condition (E1), the dominating constant
is finite and therefore,
(D2) In addition, we calculate the variance of
as
Again, by condition (E1), the dominating constant
is finite. Thus, by the central limit theorem,
(D3) where
. Combining (D1), (D2) and (D3), we conclude that
for any
as
and
.
Remark D.1.
Some previous research papers on the mean shift algorithm [3, 6, 19, 71] have already justified that the algorithm converges to a local mode of the KDE
when its local modes are isolated and the algorithm starts within some small neighborhoods of these estimated local modes. Lemma 3.2 here provides a (probabilistic) perspective on the linear convergence of the mean shift algorithm. It is well known that the set of the true local modes of
can be approximated by the set of estimated modes defined by
[25]. Moreover, around the true local modes of the density
, one can argue that
is strongly convex and has a Lipschitz gradient with probability tending to 1 by the uniform consistency of
and
as
and
; see the uniform bounds (3.5). Hence, by some standard results in optimization theory (e.g. Chapter 3 in [17]), a sample-based gradient ascent algorithm with objective function
converges linearly to (estimated) local modes around their neighborhoods as long as its step size is below some threshold value. Finally, recall that (i) the mean shift algorithm is a special variant of the sample-based gradient ascent method with an adaptive size
by (3.8) and (ii)
can be sufficiently small but bounded away from 0 when
is large and
is small by Lemma 3.2; see also Remark 3.2. Therefore, the mean shift algorithm will converge linearly with high probability around some small neighborhood of the local modes of
when the sample size
is sufficiently large and
is chosen to be small.
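The adaptive-step-size view of the mean shift iteration in this remark can be sketched with a Gaussian kernel as follows; the kernel, bandwidth, and stopping rule are illustrative choices of ours.

```python
import numpy as np

def mean_shift(x, data, h, max_iter=500, tol=1e-8):
    """Mean shift with a Gaussian kernel: each update moves x to the
    kernel-weighted average of the data, which is a gradient ascent step
    on the KDE with an adaptive step size (cf. (3.8))."""
    for _ in range(max_iter):
        w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2.0 * h ** 2))
        x_new = w @ data / w.sum()     # weighted average of the sample
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```

Started near a region of high density, the sequence converges to the corresponding local mode of the KDE, consistent with the discussion above.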
Proposition D.1. (Convergence of the SCGA Algorithm.)
For any SCGA sequence
defined by (3.12) with
, the following properties hold.
(a) Under condition (A1), the objective function sequence
is non-decreasing and converges.
(b) Under condition (A1),
.
(c) Under conditions (A1-3),whenever
with the convergence radius
satisfying
where
is a constant defined in (h) of Lemma C.1 and
is a quantity depending on both the dimension
and functional norm
up to the fourth-order (partial) derivatives of
.
Proof of Proposition 3.1 —
(a) We first derive the following fact about the objective function
.
Fact 1. Given (A1),
is
-smooth, that is,
is
-Lipschitz.
This fact follows from the differentiability of
ensured by condition (A1) and Taylor’s theorem that
for any
, where
is within a
-neighborhood of
. Moreover,
(D4) When
, we have that
(D5) showing that the objective function
is non-decreasing along
. Given the boundedness of
guaranteed by condition (A1), we know that the sequence
is bounded. Thus,
also converges.
(b) From (a), we know that when
,
Since the sequence
converges as
, it follows that
(c) Given condition (A2) and the fact that
, we know that
Let
denote the projection of
in the SCGA sequence onto the ridge
. Since
by (h) of Lemma C.1,
is well defined when
. Given that the definition of
in (C2), we know that
(D6) when
, where we leverage (e) of Lemma C.1 to obtain the inequality (i). More specifically,
is a full row rank matrix by (d) of Lemma C.1 and
lies within the row space of
because
is normal to
at
. Since the nonzero singular values of
are lower bounded by
, it follows that
. From the above derivation, we also know that
is indeed the supremum norm of
over the line segment connecting
and
, which depends on the uniform functional norm
of the partial derivatives of
up to the fourth order. The result follows from (b).
The following Davis–Kahan theorem [36] is one of the most notable theorems in matrix perturbation theory. We present the theorem in a modified version from [45, 109]. Other useful variants of the Davis–Kahan theorem can be found in [115].
Lemma D.2. (Davis–Kahan.)
Let
and
be two symmetric matrices in
, whose spectra (Definition 1.1.4 in [58]) are
and
, and
be an interval. Denote by
the set of eigenvalues of
that are contained in
, and by
the matrix whose columns are the corresponding (unit) eigenvectors to
(more formally,
is the image of the spectral projection induced by
). Denote by
and
the analogous quantities for
. If
then the distance
between two subspaces is bounded by
for any orthogonally invariant norm
, such as the Frobenius norm
and the
-operator norm
, where
is a diagonal matrix with the ascending principal angles between the column spaces of
and
on the diagonal.
Note that when we take the Frobenius norm in Lemma D.2,
by some simple algebra. Consequently, we will utilize the following inequality from the Davis–Kahan theorem in our subsequent proofs as
(D7)
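As a numerical sanity check of a Davis–Kahan-type bound (our own sketch, not from the paper; the test matrices and the conservative factor-2 form of the bound, in the spirit of the variants in [115], are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.diag([5.0, 4.0, 1.0, 0.5, 0.2, 0.1])    # top-2 eigenspace, eigengap 4 - 1 = 3
E = 0.01 * rng.standard_normal((6, 6))
E = (E + E.T) / 2                               # symmetric perturbation

def top_subspace(M, q):
    """Columns: eigenvectors of the q largest eigenvalues of symmetric M."""
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -q:]

V0, V1 = top_subspace(A, 2), top_subspace(A + E, 2)
# Frobenius sin-theta distance between the two 2-dim subspaces
sin_theta = np.linalg.norm(V1 @ V1.T - V0 @ V0.T, 'fro') / np.sqrt(2)
# conservative factor-2 form of the Davis-Kahan bound ||E|| / eigengap
bound = 2 * np.linalg.norm(E, 'fro') / (4.0 - 1.0)
assert sin_theta <= bound
```

The difference of projection matrices is used here because, as noted above, its Frobenius norm equals the sin-theta distance up to the factor √2.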
Theorem D.1. (Linear Convergence of the SCGA Algorithm.)
Assume conditions (A1–4) throughout the theorem.
(a) Q-Linear convergence of: Consider a convergence radius
satisfying
where
is the constant defined in (h) of Lemma C.1 and
is a quantity defined in (c) of Proposition 3.1 that depends on both the dimension
and the functional norm
up to the fourth-order derivative of
. Whenever
and the initial point
with
, we have that
(b) R-Linear convergence of: Under the same radius
in (a), we have that whenever
and the initial point
with
,
We further assume conditions (E1–2) in the remaining statements. If
and
,
(c) Q-Linear convergence of: under the same radius
and
in (a), we have that
with probability tending to 1 whenever
and the initial point
with
.
(d) R-Linear convergence of: under the same radius
and
in (a), we have that
with probability tending to 1 whenever
and the initial point
with
.
Proof. Proof of Theorem 3.1 —
The entire proof is inspired by some standard results in optimization theory. However, the objective function
is no longer strongly concave, and we focus on the SCGA iteration instead of the standard gradient ascent method. We first recall the following two facts.
Fact 1. Given condition (A1),
is
-smooth, that is,
is
-Lipschitz.
Fact 2. Given conditions (A1–3), we know that
for any
and
for any
.
Fact 1 has been proved in Proposition 3.1, implying that the objective function sequence
is non-decreasing when
. Fact 2 is a natural corollary of Proposition 3.1, because
and
is the objective function value after one-step SCGA iteration with step size
. The iteration will move
toward the ridge
. With the help of these two facts, we start the proofs of (a–d).
(a) We first show the following claim: for all
and the initial point
,
(D8) where
. By the differentiability of
guaranteed by condition (A1) and Taylor’s theorem, we have that
where we use the equality
in (i) and (iii), leverage conditions (A2) and (A4) to obtain that
and
in (ii), apply the quadratic bound on
from condition (A4) to obtain (iv), and use the fact that
in (v). We also use the fact that
and
in (v). Claim (D8) thus follows.
Given Fact 2 and any
,
where we apply (D4) to obtain the inequality. It implies that
(D9) Therefore,
whenever
, where we apply (D8) and (D9) in (i) and use the choice of
to argue that
in (ii). By telescoping, we conclude that when
and
,
The result follows.
(b) The result follows easily from (a) and the inequality
for all
.
(c) The proof here is partially inspired by the proof of Theorem 2 in [8]. We write the spectral decompositions of
and
as
where
and
. By Weyl’s theorem (Theorem 4.3.1 in [58]) and uniform bounds (3.5), we know that for any
,
Thus,
satisfies the first two inequalities in condition (A2) when
is sufficiently small and
is sufficiently large. According to the Davis–Kahan theorem (Lemma D.2 here) and uniform bounds (3.5),
for any
, where we use (D7) and the fact that
to obtain (i). Hence, when
and
,
(D10) with probability tending to 1.
We now claim that
and
(D11) for all
. We will prove this claim by induction on the iteration number. Note that when
, we derive from triangle inequality that
where we apply the result in (a) to obtain the last inequality. Moreover, by the choice of
and (D10), we are guaranteed that
. In the induction from
, we suppose that
and the claim (D11) holds at iteration
. The same argument then implies that the claim (D11) holds for iteration
and that
. The claim (D11) is thus proved.
Now, given that
, we iterate the claim (D11) to show that
where the fourth inequality follows by summing the geometric series, and the last equality is due to our notation
. It completes the proof.
(d) The result follows easily from (c) and the inequality
for all
.
E. Discussion on Condition (A4)
In this section, we explore several avenues to derive condition (A4) based on some potentially weaker assumptions. Recall from Section 3.3 that condition (A4) requires the following:
- (A4) (Quadratic Behaviors of Residual Vectors) We assume that the SCGA sequence
with step size
and
as its limiting point satisfies that
for some constant
, where
is the constant defined in condition (A2).
E.1 Self-Contractedness Assumption
One important assumption that connects condition (A4) with the existing conditions (A1–3) in Section 3.1 is the so-called self-contracted property [34, 35, 49]:
- (A5) (Self-Contractedness) We assume that the SCGA sequence
satisfies that 
Condition (A5) requires the SCGA sequence to move toward the ridge
under a relatively straight and shrinking path. As we have proved in Proposition 3.1 that the SCGA sequence
converges to
when the sequence is initialized near
and its step size is small, condition (A5) is indeed a mild assumption as long as the sequence
does not move erratically around
. More importantly, we demonstrate by Proposition E.1 below that condition (A5) can be implied by a subspace constrained version of the concavity assumption on the objective (density) function
.
Proposition E.1.
Assume condition (A1) and the following assumption on the objective function
:
(A6) (Subspace Constrained Concavity) For any with
being a constant radius, it holds that
Then, the SCGA sequence
defined in (3.12) with step size
and initial point
is self-contracted.
Notice that the density function (3.13) satisfies the ‘subspace constrained concavity’ condition (A6) around a small neighborhood of its ridge
. Moreover, it is straightforward to verify that condition (A6) is a weaker assumption than our established ‘subspace constrained strong concavity’ in Theorem 3.1; see also Remark 3.3.
Proof. Proof of Proposition E.1 —
The proof is inspired by Lemma 14 in [49]. We show the self-contractedness for
as follows, where
is arbitrary. For all
and
with
, we calculate that
where we apply condition (A6) in inequality (i), use the ascending property of
from (a) of Proposition 3.1 to argue that
in inequality (ii), and leverage the inequality (D5) guaranteed by condition (A1) to obtain (iii). The self-contractedness of the SCGA sequence
thus follows.
Under the self-contractedness condition (A5), we argue by the following lemma that the existing conditions (A1–3) in the literature [22, 45] are nearly sufficient to imply the quadratic behavior of the residual vector
along the SCGA sequence
. In other words, condition (A4) and the linear convergence of the SCGA algorithm hold without any extra assumption.
Lemma E.1.
Assume condition (A5) throughout the lemma.
(a) The total length of the SCGA trajectory is of the linear order, i.e.
(b) We further assume conditions (A1–2). Then, for any
with some radius
, where we recall that
is the effective radius in condition (A2) under which the underlying density
has an eigengap
between the
-th and
-th eigenvalues of its Hessian matrix
.
Proof. Proof of Lemma E.1 —
(a) This result follows directly from Theorem 15 of [49]. Note that although their results are stated for the standard gradient descent path, the associated proof only utilizes the self-contractedness property of the iterative path. Thus, their proof is applicable to our SCGA setting under condition (A5).
(b) We first decompose the vector
into an infinite sum of SCGA iterations
for
and obtain that
(E1) for any
, where we leverage the orthogonality between
and
and the idempotence of
for all
in (ii). See also Fig. E11 for a graphical illustration of the decomposition. By the Davis–Kahan theorem (Lemma D.2 and (D7) here) and conditions (A1–2), we deduce that for all
,
where we use the Taylor’s theorem in (i) as well as apply the self-contractedness condition (A5) and possibly shrink the radius
so that
in (ii). Hence, by (E1) and the fact that
, we obtain that
implying the second bound in condition (A4) with
. In addition,
The results follow.
Fig. E11.

Decomposition of the vector
into the summation
of subspace constrained gradient ascent iterative vectors.
According to (b) of Lemma E.1, condition (A4) will hold with
whenever
(E2)
The choice of
is a valid constant under the differentiability condition (A1). More importantly, (E2) is essentially the same assumption as the first inequality of condition (A3). Compared with the corresponding condition in (A3), the upper bound in (E2) for
around the ridge
is only shrunk by a dimension-dependent factor
. As conditions (A3) and (E2) are local, this adjustment does not induce too much extra strictness on the underlying density
.
E.2 Subspace Constrained Polyak-Łojasiewicz Inequality Assumption
We have demonstrated in Appendix E.1 that the crucial condition (A4) is valid under the self-contractedness assumption on the SCGA sequence
. Consequently, the linear convergence of the SCGA algorithm can be established by slightly modifying the common assumptions (A1–3) in ridge estimation. Nevertheless, the self-contractedness property of the SCGA sequence
does not always hold in practice, and it may only be implied by the subspace constrained concavity condition (A6) as proved in Proposition E.1.
Given the fact that the underlying density function
or its estimator
may not satisfy the subspace constrained concavity assumption in many practical applications of SCGA and SCMS algorithms, we present another approach to deduce condition (A4) based on the well-known Polyak–Łojasiewicz inequality [72, 87]. Given any SCGA sequence
with limiting point
and step size
, we consider the following condition:
- (A7) (Subspace Constrained Polyak–Łojasiewicz Inequality) For all
, there exists a constant
such that 
Similar to the standard Polyak–Łojasiewicz inequality, there exist some objective functions that satisfy the subspace constrained Polyak–Łojasiewicz inequality but fail to be concave in the subspace constrained sense as in condition (A6); see [21, 41] and Equation (36) in [28]. From this aspect, condition (A7) incorporates some extra SCGA sequences satisfying condition (A4) and converging linearly to the ridge
. However, as the subspace constrained Polyak–Łojasiewicz inequality does not imply condition (A5) or (A6), it should not be regarded as a more general condition. Furthermore, unlike the standard gradient ascent/descent method (Theorem 2 in [63]), the error bound condition (i.e. Equation (D6) here) does not imply the subspace constrained Polyak–Łojasiewicz inequality, indicating a challenge in validating condition (A7) in practice.
Despite these disadvantages, the subspace constrained Polyak–Łojasiewicz inequality condition does give rise to a concise proof for the linear convergence of the objective function value
along the SCGA sequence
.
Proposition E.2.
Assume conditions (A1) and (A7). Then, for any SCGA sequence
with step size
, we have that
Proof. Proof of Proposition E.2 —
The proof is inspired by Theorem 1 in [63]. From (D5) and condition (A7), we know that
for all
when
. By some rearrangements, we conclude that
The final display follows from telescoping.
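The telescoped bound can be reproduced numerically on a standard example. In the sketch below (our own illustration), f(x) = x² + 3 sin²x is a classic function that satisfies the Polyak–Łojasiewicz inequality (with μ = 1/32 and smoothness constant L = 8, as reported in the optimization literature) without being convex; gradient descent with step 1/L then contracts the function gap at the Q-linear rate 1 − μ/L (the ascent case is symmetric):

```python
import numpy as np

# f(x) = x^2 + 3 sin^2(x): satisfies the PL inequality (mu = 1/32) but is not
# convex; f is L-smooth with L = 8 since f''(x) = 2 + 6 cos(2x) lies in [-4, 8].
f = lambda x: x**2 + 3 * np.sin(x)**2
grad = lambda x: 2 * x + 3 * np.sin(2 * x)

L, mu = 8.0, 1.0 / 32.0
x, fstar = 2.0, 0.0                # global minimum value f(0) = 0
gaps = []
for _ in range(200):
    gaps.append(f(x) - fstar)
    x = x - grad(x) / L            # gradient descent with step 1/L

rate = 1 - mu / L                  # Q-linear contraction factor from PL + smoothness
assert all(g1 <= rate * g0 + 1e-12 for g0, g1 in zip(gaps, gaps[1:]))
assert gaps[-1] < 1e-6
```

Per step, L-smoothness gives a descent of at least |f′(x)|²/(2L), and the PL inequality lower-bounds |f′(x)|² by the function gap, which is exactly the one-line rearrangement used in the proof above.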
More importantly, the subspace constrained Polyak–Łojasiewicz inequality controls the total length of the SCGA path to be of the linear order and implies the quadratic behavior of residual vectors as required by condition (A4).
Lemma E.2.
Assume conditions (A1) and (A7) throughout the lemma.
(a) The total length of the SCGA trajectory is of the linear order, i.e.
(b) We further assume condition (A2). Then,for any
with some radius
, where we recall that
is the effective radius in condition (A2) under which the underlying density
has an eigengap
between the
-th and
-th eigenvalues of its Hessian matrix
.
Proof. Proof of Lemma E.2 —
(a) This part of the proof is inspired by the arguments in Theorem 9 of [49]. Based on the proof of (a) in Proposition 3.1 under condition (A1), we know from (D5) that
when
. Using this inequality and condition (A7), we derive that
where we use the inequality
to obtain (i) and apply condition (A7) in inequality (ii). Since
, some rearrangement of the above inequality suggests that
Therefore,
where we leverage condition (A7) again in (i). In addition, to obtain inequalities (ii) and (iii), we recall from the proof of (d) for Lemma C.1 that
, in which the singular values of
are bounded by
and the singular values of
are bounded by
. The result thus follows.
(b) This part of the proof is analogous to our arguments in (b) of Lemma E.1, except that the SCGA sequence
is no longer self-contracted. For completeness, we repeat some arguments and highlight the differences here. By the Davis–Kahan theorem (Lemma D.2 and (D7) here) and conditions (A1–2), we have that for all
,
where we possibly shrink the radius
so that
to obtain inequality (ii). Notice also that, since
may not hold without the self-contractedness property, we use a looser bound
from (a) to derive inequality (i). Therefore, by (E1) and the fact that
, we obtain that
which implies the second bound in condition (A4) with
. Finally,
The results follow.
The results in (b) of Lemma E.2 also imply condition (A4) with
whenever
(E3)
Once again, the choice of
is feasible under condition (A1) and the upper bound (E3) can be viewed as a variant of the first inequality in condition (A3). From this perspective, the subspace constrained Polyak–Łojasiewicz inequality (A7) also leads to an alternative assumption for condition (A4) and the linear convergence of the SCGA algorithm.
Remark E.1.
Note that the results in Proposition E.2 can be generalized to the directional or arbitrary manifold cases under conditions (A1–3). First, the subspace constrained Polyak–Łojasiewicz inequality for the SCGA sequence
on
or an arbitrary manifold can be modified as
where
is the objective (density) function. Based on the proof of (a) in Proposition 4.2 and our arguments in (a) of Lemma E.2, it follows that the total length of the SCGA trajectory on
or an arbitrary manifold is of the linear order, i.e.
Second, to establish the quadratic bounds for
and
, one can follow the arguments in the proof of (b) in Lemma E.2 and leverage the two facts:
1. The tangent vector
can be decomposed into an infinite sum of SCGA updates (4.18) on
or an arbitrary manifold as
See Fig. E12 for a graphical illustration. This equation is valid because parallel transports preserve inner products and are linear.
2. Under conditions (A1–2), we know that
for some constant
, where we leverage the fact that the vector field
with
and
has its variation
bounded by
according to the Davis–Kahan theorem for any
. However, we are not sure whether the self-contractedness condition can also be adapted to the directional or general manifold cases, given that the arguments in Theorem 15 of [49] are based on Euclidean geometry.
Fig. E12.

Decomposition of the vector
within the tangent space
into the summation
of parallel transported SCGA iterative vectors. Here, the blue curves on
are iterative paths of the SCGA algorithm, while the green vectors are tangent vectors
after being parallel transported to
.
F. Other Technical Concepts of Differential Geometry on
Taylor’s Theorem on
. Given a smooth function
on
, its Taylor’s expansion is often written as [85]:
(F1)
for any
, where
is the exponential map at
. One may replace the exponential map with a more general concept called a retraction on an arbitrary manifold; see Section 4.1 and Proposition 5.5.5 in [1].
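For concreteness, the exponential map on the unit sphere has the well-known closed form exp_x(v) = cos(‖v‖) x + sin(‖v‖) v/‖v‖. A minimal sketch (our own illustration; the base point and tangent vector are arbitrary):

```python
import numpy as np

def exp_map(x, v):
    """Exponential map on the unit sphere: exp_x(v) = cos(|v|) x + sin(|v|) v/|v|,
    i.e. follow the great circle from x with initial velocity v for unit time."""
    t = np.linalg.norm(v)
    return x if t < 1e-15 else np.cos(t) * x + np.sin(t) * v / t

x = np.array([0.0, 0.0, 1.0])
v = np.array([np.pi / 2, 0.0, 0.0])            # tangent vector at x of length pi/2
y = exp_map(x, v)
assert np.allclose(y, [1.0, 0.0, 0.0])         # a quarter of a great circle
assert np.isclose(np.linalg.norm(y), 1.0)      # stays on the sphere
```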
Parallel Transport. When comparing vectors in two different tangent spaces
on
, we leverage the notion of parallel transport
to transport vectors from one tangent space to another along a geodesic. In addition,
is a tangent vector in
after being parallel transported from
along a geodesic (or great circle) on
. The parallel transport mapping
is a linear isometry along any smooth curve on
, i.e.
for any
; see Proposition 5.5 in [69] or Proposition 1 in Section 4-4 of [37].
Sectional Curvature. Sectional curvature is the Gaussian curvature of the two-dimensional submanifold formed as the image, under the exponential map, of a two-dimensional subspace of a tangent space; see Section 3-2 in [37] for detailed discussions of the Gaussian curvature. It is known that a two-dimensional submanifold with positive, zero or negative sectional curvature is locally isometric to a two-dimensional sphere, a Euclidean plane or a hyperbolic plane with the same Gaussian curvature [116].
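The parallel transport on the sphere described above also admits a standard closed form along the geodesic connecting two non-antipodal unit vectors x and y: v ↦ v − (⟨v, y⟩/(1 + ⟨x, y⟩))(x + y). A minimal sketch checking the linear-isometry property (our own illustration; the points and tangent vector are arbitrary):

```python
import numpy as np

def transport(x, y, v):
    """Parallel transport of v in T_x S^2 to T_y S^2 along the connecting
    geodesic; standard closed form, valid when <x, y> > -1."""
    return v - (np.dot(v, y) / (1 + np.dot(x, y))) * (x + y)

x = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0])
v = np.array([0.0, 0.3, 0.4])                  # tangent at x: <v, x> = 0
w = transport(x, y, v)
assert abs(np.dot(w, y)) < 1e-12               # lands in the tangent space at y
assert np.isclose(np.linalg.norm(w), np.linalg.norm(v))   # linear isometry
```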
- Geodesically Strong Concavity. A function
is said to be geodesically concave if for any
, it holds that
for any
, where
is a geodesic with
and
. When
is differentiable, an equivalent statement of the geodesic concavity is that (Theorem 11.17 in [15])
A function
is said to be geodesically
-strongly concave if for any
, it holds that 
G. Normal Space of Directional Density Ridge
Recall that we extend the directional density
from its support
to
by defining
for all
. As we will refer to conditions (A1–3) frequently in the next three sections, we restate them here:
(A1) (Differentiability) Under the extension (4.1) of the directional density
, we assume that the total gradient
, total Hessian matrix
and third-order derivative tensor
in
exist, and are continuous on
and square integrable on
. We also assume that
has bounded fourth-order derivatives on
.(A2) (Eigengap) We assume that there exist constants
and
such that
and
for any
.- (A3) (Path Smoothness) Under the same
in (A2), we assume that there exists another constant
such that
for all
and
.
Recall that an order-
density ridge of a directional density
on
is the set of points defined as
(G1)
Lemma G.1 below shows that under conditions (A1–3), the Jacobian matrices
and
(i.e. projecting the columns of
onto the tangent space
) both have rank
at every point on
, and
will be a
-dimensional submanifold on
by the implicit function theorem [68, 92]. Analogous to the discussion about the normal space of a Euclidean density ridge in Appendix C, we define
Different from the Euclidean density ridge case, it is the column space of
(G2)
that spans the normal space of
within the ambient space
. It can be seen from our Remark 4.1 that the rows of
span the normal space of the solution manifold
; see also Lemma 1 in [28]. Consequently, the column space of
spans the normal space of
within the tangent space
at each
. The technique on pages 60–63 of [39] remains valid for arguing that
(G3)
for
, where we use the fact that
on
under the extension of
as in (A1). Let
for
. Then,
(G4)
As in the Euclidean data case, the columns of
are not orthonormal, and we again leverage the orthonormalization technique in [22] to construct
that shares the same column space with
but has orthonormal columns. That is, under the condition that
has full column rank
at every point
(see Lemma G.1),
(G5)
with the Cholesky decomposition
, where
is a lower triangular matrix whose diagonal elements are positive. Finally, the non-uniqueness of
will not affect our subsequent discussions about the properties of directional density ridges.
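The Cholesky-based orthonormalization just described can be sketched numerically. Below, a generic full-column-rank matrix V stands in for the (unavailable) ridge quantity; the construction V(Lᵀ)⁻¹ follows the technique of [22], and the assertions check the two claimed properties (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
V = rng.standard_normal((6, 3))        # generic matrix with full column rank 3

# Cholesky of the Gram matrix: V^T V = L L^T, L lower triangular, positive diagonal
L = np.linalg.cholesky(V.T @ V)
Vr = V @ np.linalg.inv(L).T            # rescaled basis V (L^T)^{-1}

assert np.allclose(Vr.T @ Vr, np.eye(3))       # columns are orthonormal
P = V @ np.linalg.solve(V.T @ V, V.T)          # projector onto col(V)
assert np.allclose(Vr @ Vr.T, P)               # same column space as V
```

The two checks are immediate algebraically: VrᵀVr = L⁻¹(LLᵀ)L⁻ᵀ = I, and VrVrᵀ = V(VᵀV)⁻¹Vᵀ is the projection onto the column space of V.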
Lemma G.1.
Assume conditions (A1–3). Given that
,
and
are defined in (G4) and (G5), we have the following properties:
(a) and
have the same column space. In addition,
That is,
is the projection matrix onto the columns of
.
(b) The columns of
are orthonormal to each other.
(c) For
, the column space of
is normal to the (tangent) direction of
at
.
(d) For, the smallest eigenvalue
, and
Moreover, all the nonzero singular values of
are greater than
, and
Therefore,
is a
-dimensional submanifold that contains neither intersections nor endpoints on
. Namely,
is a finite union of connected and compact submanifolds on
.
(e) For all,
(f) When is sufficiently small and
,
for some constant
.
(g) Assume that another directional density function also satisfies conditions (A1–3) after the extension
in
, and
is sufficiently small. Then,
for some constant
and any
, where
is the matrix defined in (G5) with directional density
.
(h) The reach of satisfies
for some constant
.
This lemma is a direct extension of Lemma C.1 to the directional data scenario; thus, its proof is similar to the proof of Lemma C.1.
Proof. Proof of Lemma G.1 —
The proofs of properties (a), (b) and (c) can be inherited from the corresponding ones in Lemma C.1 with mild modifications and we thus omit them.
(d) We will prove that the
nonzero singular values of
and
are bounded away from 0. Recall that
with
for
. Under condition (A2),
It shows that all the singular values of
or simply
are less than
. Moreover, under condition (A2) again, all the
singular values of
are greater than
.
By Theorem 3.3.16 in [57], we know that all the singular values of
and
are greater than
where
are singular values of a matrix
in their descending order. Therefore, the minimum eigenvalue of
satisfies
(G6) Now, given
and
, we know that
If we denote the orthonormal eigenvectors of
by
, then
are the orthonormal eigenvectors of
, whose eigenvalues are thus lower bounded by
due to (G6). Hence,
.
By the implicit function theorem and the extra constraint
,
is a
-dimensional submanifold on
. It also implies that
cannot have intersections, because otherwise the intersected points would violate the rank condition. Finally, we argue by contradiction that
has no endpoints. Assume, on the contrary, that
has an end point
. Our preceding argument has shown that
, the derivative of
, is bounded. In addition,
. However, this contradicts the implicit function theorem, which indicates that
is a
-dimensional submanifold on
, because at the end point
, there exists no local coordinate chart for
defined on an open set in
. The results follow.
(e) By the proof of (d), we already know that all the
nonzero singular values of
and
are greater than
. Also, all the
nonzero singular values of
are greater than
. Thus, the results follow easily from the argument of (e) in Lemma C.1.
Finally, the proofs of properties (f), (g) and (h) are essentially the same as those of the corresponding claims in [22], so we omit them. For (h), the reader should be aware that we have extended the directional density
from
to
. In addition, it is the columns of
that span the normal space of
in the ambient space, whose nonzero singular values are lower bounded by
. The proof of (h) can also be found in Theorem 3 of [28].
H. Stability of Directional Density Ridge
H.1 Subspace Constrained Gradient Flows
This subsection is modified from Section 4 in [45] for directional densities and their ridges on
. A map
is a subspace constrained gradient flow with the principal Riemannian gradient
if
and
(H1)
where the last equality follows from (4.4). Given the definition of the directional density ridge
in (G1), it consists of the destinations of the subspace constrained gradient flow
, i.e.
if
for some
satisfying (H1). It will be convenient to parametrize the SCGA path with
by arc length. Let
be the arc length from
to
:
Denote the inverse of
by
. Note that
With
, we have that
(H2)
which is a reparametrization of (H1) by arc length. Note that
always lies on
because its velocity is within the tangent space
for every
. Lemma 2 in [45] justifies the uniqueness of
passing through any particular point
under conditions (A1–3). The (reversed) subspace constrained gradient flow
can be lifted onto the directional function
, as we may define
(H3)
Sometimes, we may add the subscript
to the curves
if we want to emphasize that
start from or pass through the specific point
.
To analyze the behavior of the subspace constrained gradient flow
lifted on
, we need the derivative of the projection matrix
along the path
. Recall that
. The collection
defines a matrix field: there is a matrix
attached to each point
. As mentioned earlier, there is a unique path
and unique
such that
for any
. Define
(H4)
where
with
being the Riemannian connection on
. Under conditions (A1–3),
has a quadratic-like behavior near the directional ridge
, analogous to Lemma 3 in [45].
Lemma H.1.
Assume that conditions (A1–3) hold. For all
, we have the following properties:
(a)
,
, and
. Thus,
is non-decreasing in
.
(b) The second derivative of
satisfies
.
(c)
.
Proof. Proof of Lemma H.1 —
The proof is adapted from Lemma 3 in [45].
(a) The first property
is obvious from the definition (H3). Then,
since
for all
. By the definition of
in (G1),
when
. Thus,
and
is non-decreasing in
.
(b) Note that
Differentiating both sides of the equation, we have that
Since
(idempotent), we have that
, and hence the second term on the right-hand side of the above equation becomes
Thus,
By (a) and (H2), we conclude that
(H5) Now, we will bound the two terms in (H5), respectively. As for the first term
, we notice that
is in the column space of
. Hence,
where
. Therefore, from condition (A2),
and consequently,
As for the second term
, we notice that
, where
, and
. Then,
However,
. To see this, note that
and it implies that
showing that
. To bound
, we proceed as follows. As before, we let
. Then, by the Davis–Kahan theorem (Lemma D.2 here),
Note that
, because
. Thus, from condition (A3),
Therefore,
.
(c) For some
,
by (a) and (b). As
is parametrized by arc length, we conclude that
The result follows.
The statement (c) in Lemma H.1 is known as the quadratic growth condition in the optimization literature [4, 38]. Under conditions (A1–3), such a quadratic growth of the subspace constrained gradient flow
lifted onto the directional density
enables us to quantify the stability of directional ridges under small perturbations on the directional density and develop the linear convergence of the (directional) SCGA algorithms on
.
H.2 Proof of Theorem 4.1
We now show that if two directional densities
and
are close, their corresponding ridges
and
are also close. We will use, for instance,
and
, to refer to the principal (Riemannian) gradient and projection matrix with its columns as the eigenvectors corresponding to the smallest
eigenvalues of the (Riemannian) Hessian
with the tangent space of
defined by
.
Theorem H.1.
Suppose that conditions (A1–3) hold for the directional density
and that condition (A1) holds for
. When
is sufficiently small,
(a) conditions (A2–3) hold for
.
(b)
.
(c)
for a constant
.
Proof. Proof of Theorem 4.1 —
Our arguments are modified from the proof of Theorem 4 in [45] as well as Proposition 4 and Theorem 5 in [28].
(a) We write the spectral decompositions of
and
as
By Weyl’s Theorem (Theorem 4.3.1 in [58]), we know that
where we recall that there are at most
nonzero eigenvalues of the Riemannian Hessian
on
. Thus,
satisfies condition (A2). Moreover, since condition (A3) depends only on the first- and third-order derivatives of
, it holds for
when
is small enough.
(b) We present two methods, based on two different flows, to prove this statement and comment on their pros and cons in Remark H.1. Method A: By the Davis–Kahan theorem (Lemma D.2 and (D7)),
for any
. Then, given that
,
Therefore, by the differentiability of
from (A1) and the compactness of
, we obtain from the above calculations that
for some constant
that only depends on the dimension
.
Now, let
. Then,
, and
. Let
be the SCGA flow through
as defined in Section H.1 so that
for some
. Note that
. From property (a) of Lemma H.1, we have that
. Moreover, by Taylor’s theorem,
for some
between
and
. Since
, from property (b) of Lemma H.1,
and consequently,
, where
denotes the geodesic distance between
and
on
. Therefore,
Now let
. The same argument shows that
for some constant
because conditions (A1–3) hold for
.
As a result,
.
Method B: Since we are only required to bound the maximum Euclidean distance between
and
, i.e.
, we may view
and
as solution manifolds in
and tentatively ignore the manifold constraint
. Define
. Given that
, the gradient of
,
(H6) is a vector in
. Let
. We define a flow
such that
It can be argued by Theorem 7 in [28] that
when
for some small
. In addition, we can always choose
to be small enough so that
. By Theorem 3.39 in [59],
is uniquely defined because the gradient
is well defined for all
. We can also reparametrize
by arc length as
Let
be the terminal time/arc-length point and
be the destination of
on
. The above argument also demonstrates that the flows
or
converge to the manifold
from the normal direction of
, because we can write
and the column space of
spans the normal space of
at
. The goal now is to bound
because its length must be greater than or equal to
. We then define
. Differentiating
with respect to
leads to
(H7) by (d) in Lemma G.1. (Note that
because
and by the continuity of
, we can always choose
such that
for all
.) As
, by the proof of Method A, we know that
where
is some value between
and
. Hence,
, which is independent of
. This implies that
We can exchange the role of
and
and apply the same argument to show that
In total, this leads to the conclusion that
.
(c) By (h) in Lemma G.1, the reach of
has a lower bound,
. Note that
and
depend on the first three orders of derivatives of
. Thus, the lower bound for the reach of
will be identical to the one for
with an error rate
.
Note that for the stability of directional ridges, one can relax the condition (A1) by requiring
to be
-Hölder with
.
Remark H.1.
We apply two different methods to establish the stability theorem of directional density ridges. Method A utilizes the subspace constrained gradient flow constructed in Section H.1 and its quadratic behavior (Lemma H.1), while Method B defines a normal flow to the ridge
induced by the column space of
. Each of these two flows has its pros and cons. The subspace constrained gradient flow aligns more coherently with our directional SCMS algorithm (Algorithm 2) to identify the (estimated) directional ridge from data, because it relies only on the first- and second-order derivatives of the (estimated) density
. Nevertheless, the subspace constrained gradient flow does not necessarily converge to
in the optimal direction, that is, the normal direction to
. This can be seen from the explicit formula (G4) of
, which spans the normal space of
. The normal flow
defined in Method B, however, converges to
in its normal direction by construction. In general, the normal flow tends to the ridge
faster than the subspace constrained gradient flow, but it may be complicated to compute in any practical ridge-finding task due to its involvement with third-order derivatives of the (estimated) density
. Recently, [90] presented explicit formulae for finding density ridges via such a normal flow and its discrete gradient descent approximation. Additionally, they defined a smoothed version of the ridgeness function that also circumvents the computations of third-order derivatives of
.
I. Proofs of Proposition 4.1, Proposition 4.2 and Theorem 4.2
Proposition I.1.
Assume that the directional kernel
is non-increasing, twice continuously differentiable and convex with
. Given the directional KDE
and the directional SCMS sequence
defined by (4.13) or (4.14), the following properties hold:
(a) The estimated density sequence
is non-decreasing and thus converges.
(b)
.
(c) If the kernel
is also strictly decreasing on
, then
.
Proof of Proposition 4.1 —
(a) The sequence
is bounded because the kernel
is non-increasing with
. Hence, it suffices to show that it is non-decreasing. The convexity and differentiability of kernel
imply that
(I1) for all
. Then, with
and the iterative formula (4.14) in the main paper, we derive that
where we use the orthogonality between
and
in (i), multiply
into both the numerators and denominators of the two summands to obtain (ii), leverage the fact that
in (iii), and use the inequality
in (iv). This completes the proof of (a).
(b) Our derivation in (a) already shows that
Notice that, on the one hand, the differentiability of kernel
and the compactness of
imply that
for all
, where
only depends on the bandwidth
and kernel
. On the other hand, our argument in (a) already proves the convergence of
. Therefore,
as
. The result follows.
(c) Given the iterative formula (4.14) in the main paper, we deduce that
where we leverage the orthogonality between
and
to obtain (i) and (ii). Under the assumption that the kernel
is strictly decreasing and (twice) continuously differentiable, we know that
is lower bounded away from 0 on
. Therefore, with the result in (b), the above calculation indicates that
as
. The result follows.
Remark I.1.
The conditions imposed on kernel
in Proposition 4.1 are satisfied by some commonly used kernels, such as the von Mises kernel
. However, they can be further relaxed. On the one hand, it is sufficient to assume that the kernel
is twice continuously differentiable except for finitely many points on
. On the other hand, as long as the kernel
satisfies
and the true directional density
is positive almost everywhere on
, Lemma 4.1 demonstrates that
with probability tending to 1 when
and
. Therefore, our upper bound on
in our proof of (c) will be asymptotically valid for all
, even without the strictly decreasing assumption on the kernel
. Under such a relaxation, our conclusions in Proposition 4.1 apply to directional SCMS algorithms with other kernels that have bounded support on
.
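As a simplified companion to the discussion above, the fixed-point flavor of the directional updates can be illustrated with a plain directional mean shift iteration under the von Mises kernel L(r) = exp(-r), without the subspace constraint of (4.13)–(4.14). The synthetic circular sample and bandwidth below are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample on the unit circle, clustered around angle 0.
theta = rng.normal(0.0, 0.3, size=200)
X = np.column_stack([np.cos(theta), np.sin(theta)])
h = 0.5  # smoothing bandwidth (illustrative choice)

def fhat(x):
    # Directional KDE with the von Mises kernel L(r) = exp(-r),
    # up to a normalizing constant that does not affect the iteration.
    return np.mean(np.exp(-(1.0 - X @ x) / h**2))

def ms_step(x):
    # One directional mean shift step: a kernel-weighted average of the
    # data, rescaled back onto the unit sphere.
    w = np.exp(-(1.0 - X @ x) / h**2)
    y = X.T @ w
    return y / np.linalg.norm(y)

x = np.array([0.0, 1.0])  # initialize at angle pi/2
densities = [fhat(x)]
for _ in range(50):
    x = ms_step(x)
    densities.append(fhat(x))
```

The recorded density values are non-decreasing along the iteration, mirroring part (a) of Proposition I.1, and every iterate lies on the sphere by construction.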
Proposition I.2. Convergence of the SCGA Algorithm on
.
For any SCGA sequence
defined by (4.18) with
, the following properties hold:
(a) Under condition (A1), the objective function sequence
is non-decreasing and thus converges.
(b) Under condition (A1),
.
(c) Under conditions (A1–3), whenever
with the convergence radius
satisfying
where
is a constant defined in (h) of Lemma G.1 and
is a quantity depending on both the dimension
and the functional norm
up to the fourth-order (partial) derivatives of
.
Proof of Proposition 4.2 —
The proof is similar to our arguments in Proposition 3.1. For completeness, we still delineate the detailed steps, because the proof requires some nontrivial techniques on general Riemannian manifolds, such as parallel transports and line integrals.
(a) We first derive the following property of the objective function
supported on
, which is a counterpart of Fact 1 in the proof of Proposition 3.1.
Property 1. Given (A1), the function
is
-smooth on
, that is,
is
-Lipschitz. This property follows easily from the differentiability of
guaranteed by condition (A1) and Theorem 4.34 in [69] that
(I2) for any
, where
lies on the geodesic curve
with
and
. Then,
(I3) where the equality (i) follows from the fundamental theorem for line integrals (Theorem 11.39 in [68]), equality (ii) utilizes the isometric property of parallel transports, and inequality (iv) follows from (I2). Moreover, since the velocity of the geodesic
is always constant, we deduce that
and the equality (iii) follows. We will make use of the following direction of the inequality (I3):
(I4) Moreover, when
,
showing that the objective function
is non-decreasing along the SCGA path
on
. Given the compactness of
and the differentiability of
, we know that the sequence
is bounded. Thus, it converges.
(b) From (a), we know that when
,
Since the sequence
converges, it follows that
Recall from (2.5) that
, so
as well.
(c) Given condition (A2) and the fact that
, we know that
Let
be the projection of
in the SCGA sequence onto the directional ridge
. Since
by (h) of Lemma G.1,
is well defined when
. Recall from (G3) that the column space of
coincides with the normal space of
within the tangent space
. We define a geodesic
with
and calculate that
where we utilize the isometric properties of parallel transports in (i), note that the velocity of the geodesic is constant, i.e.
for any
to obtain (ii), leverage (d) of Lemma G.1 to deduce (iii), and use the fact that
when
in the inequality (iv). In particular for the inequality (iii),
is a full column rank matrix and
lies within the column space of
. Since the nonzero singular values of
are lower bounded by
, it follows that
In addition, we know that
comes from the supremum norm of
over the geodesic connecting
and
with
being the Riemannian connection, which in turn depends on the uniform functional norm
of the partial derivatives of
up to the fourth order. By (b), we deduce that
The results follow.
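The proof above manipulates geodesics, exponential maps and geodesic distances on the sphere, all of which have closed forms there. A small self-contained sketch (the function names are ours):

```python
import numpy as np

def geod_dist(x, y):
    # Great-circle (geodesic) distance between unit vectors.
    return np.arccos(np.clip(x @ y, -1.0, 1.0))

def exp_map(x, v):
    # Exponential map at x applied to a tangent vector v (with v @ x == 0):
    # follow the geodesic with initial velocity v for unit time.
    nv = np.linalg.norm(v)
    if nv == 0.0:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

x = np.array([0.0, 0.0, 1.0])
v = np.array([0.3, 0.4, 0.0])        # tangent vector at x with length 0.5
gamma = lambda t: exp_map(x, t * v)  # geodesic with constant speed 0.5
```

The geodesic stays on the sphere and travels at constant speed, the two properties invoked repeatedly in the derivation above.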
The nonzero curvature structure of the unit hypersphere
, on which the objective function (or density)
is defined, poses an extra challenge in establishing the linear convergence of the population and sample-based SCGA algorithms. Some useful techniques for analyzing the non-asymptotic convergence of first-order methods in
, such as the law of cosines and linearizations of the objective function, would fail on
[116]. Therefore, we first introduce a practical trigonometric distance bound for Alexandrov spaces [18] with sectional curvature bounded from below.
Lemma I.1. Lemma 5 in [116]; see also [14].
If
are the sides (i.e. side lengths) of a geodesic triangle in an Alexandrov space with sectional curvature lower bounded by
, and
is the angle between sides
and
, then
(I5)
A proof sketch of Lemma I.1 can be found in Lemma 5 of [116]. Note that the sectional curvature
on
. We inherit the notation in [116] and denote
by
for the curvature-dependent quantity in the inequality (I5). One can show by differentiating
with respect to
that
is strictly increasing and greater than 1 for any
and fixed
. With Lemma I.1 in hand, we are able to state a straightforward corollary indicating an important relation between two consecutive points in the SCGA sequence
on
defined by (4.18):
(I6)
Corollary I.1.
For any point
in a geodesically convex set on
, the update in (I6) satisfies
where
is the geodesic distance between
and
on
.
Proof of Corollary I.1 —
Recall that the (population) SCGA iterative formula on
is given by
. Note that for the geodesic triangle
with
, we have that
and
By letting
and
in Lemma I.1, we obtain that
Rearranging then yields the final display.
Note that
in our conditions (A2–3) is a geodesically convex set, in which the minimal geodesic between any two points in the set
always lies within the set. Hence, Corollary I.1 is applicable to the SCGA algorithm of interest when initialized within
.
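Assuming the curvature-dependent quantity takes the form ζ(κ, c) = √|κ| c / tanh(√|κ| c) for a sectional-curvature lower bound κ < 0, as in Lemma 5 of [116] (an assumption on our part, since the displayed formula is not reproduced here), its claimed monotonicity in c and the lower bound ζ > 1 can be checked numerically:

```python
import math

def zeta(kappa, c):
    # Curvature-dependent quantity in the trigonometric distance bound,
    # assuming the form sqrt(|kappa|) * c / tanh(sqrt(|kappa|) * c)
    # for a sectional-curvature lower bound kappa < 0 (cf. Lemma 5 of [116]).
    s = math.sqrt(abs(kappa)) * c
    return s / math.tanh(s) if s > 0.0 else 1.0

cs = [0.1 * k for k in range(1, 30)]
vals = [zeta(-1.0, c) for c in cs]
increasing = all(b > a for a, b in zip(vals, vals[1:]))
above_one = all(v > 1.0 for v in vals)
```

Both checks pass over this range, consistent with the differentiation argument sketched in the text, and ζ(κ, c) tends to 1 as c tends to 0.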
Theorem I.1. Linear Convergence of the SCGA Algorithm on
.
Assume conditions (A1–4) throughout the theorem.
(a) Q-Linear convergence of: Consider a convergence radius
satisfying
where
is the constant defined in (h) of Lemma G.1 and
is a quantity defined in (c) of Proposition 4.2 that depends on both the dimension
and the functional norm
up to the fourth-order (partial) derivatives of
. Whenever
and the initial point
with
, we have that
(b) R-Linear convergence of: Under the same radius
in (a), we have that whenever
and the initial point
with
,
We further assume (D1–2) in the rest of the statements. Suppose that
and
.
(c) Q-Linear convergence of: Under the same radius
and
in (a), we have that
with probability tending to 1 whenever
and the initial point
with
.
(d) R-Linear convergence of: Under the same radius
and
in (a), we have that
with probability tending to 1 whenever
and the initial point
with
.
Proof of Theorem 4.2 —
The proof is similar to our argument in Theorem 3.1, except that the objective function
is supported on a nonlinear manifold
here. The key additional ingredient is Corollary I.1. We first recall the following two properties.
Property 1. Given (A1), the function
is
-smooth on
, that is,
is
-Lipschitz.
Property 2. Given conditions (A1–3), we know that
and
for any
with
.
Property 1 has been established in the proof of Proposition 4.2, indicating that the objective function sequence
is non-decreasing when
Property 2 is a direct corollary of Proposition 4.2, because
and
is the objective function value after one step of the SCGA iteration on
with step size
. The iteration will move
closer to the directional ridge
. With the help of these two properties, we start the proofs of (a–d).
(a) We first prove the following claim using Lemma E.1: for all
and
,
(I7) where
. By the differentiability of
ensured by condition (A1) and Taylor’s theorem on
, we deduce that
![]()
where we leverage the equality
in (i) and (iii), use conditions (A2) and (A4) that
and
in (ii), apply the quadratic bound for
in condition (A4) to obtain (iv), and leverage the facts that
and
when
in (v); recall (2.5). Our claim (I7) is thus proved.
In addition, given Property 2 and any
, we derive that
where we apply (I3) to obtain the inequality. This indicates that
(I8) for any
. Therefore, by Corollary I.1, we obtain that
whenever
, where we utilize Corollary I.1 and the monotonicity of
with respect to
in (i), apply (I7) and (I8) to obtain (ii), and use the choice of
to argue that
in (iii). By telescoping, we conclude that when
and
,
The result follows.
(b) The result follows immediately from (a) and the fact that
for all
.
(c) The proof is logically similar to the proof of (c) in Theorem 3.1. We write the spectral decompositions of
and
as
By Weyl's theorem (Theorem 4.3.1 in [58]) and the uniform bounds (4.6),
Thus,
will satisfy condition (A2) with high probability when
is sufficiently small and
is sufficiently large. According to Davis–Kahan theorem (Lemma D.2 here), uniform bounds (4.6), and the continuity of exponential maps, we have that
for any
, where we utilize the Davis–Kahan theorem and
in (i). Hence, when
and
,
(I9) with probability tending to 1.
We now claim that
and
(I10) for all
. We again prove this claim by induction on the iteration number. Note that when
, we derive that
where we apply the triangle inequality in (i) and leverage the result in (a) and (I9) to obtain (ii). The triangle inequality is valid in this context because the geodesic distance is the minimal distance between two points on
. Moreover, by the choice of
and (I9), we are sure that
. In the induction from
, we suppose that
and the claim (I10) holds at iteration
. The same argument then implies that the claim (I10) holds for iteration
and that
. The claim (I10) is thus verified.
Now, given that
, we iterate the claim (I10) to show that
where the fourth inequality follows by summing the geometric series, and the last equality is due to our notation
. This completes the proof.
(d) The result follows directly from (c) and the inequality
for all
.
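The perturbation argument in part (c) rests on two linear-algebra facts: Weyl's theorem controls how much the eigenvalues move under a symmetric perturbation, and a Davis–Kahan-type bound controls how much the associated eigenvector subspace rotates. A self-contained numeric sketch, with a hypothetical symmetric matrix and perturbation of our own choosing standing in for the population and estimated Hessians:

```python
import numpy as np

rng = np.random.default_rng(1)

A = np.diag([3.0, 1.0, -2.0])           # "population" symmetric matrix
E = 1e-3 * rng.standard_normal((3, 3))
E = (E + E.T) / 2.0                     # small symmetric perturbation
B = A + E                               # "sample" matrix

la, Va = np.linalg.eigh(A)              # eigenvalues in ascending order
lb, Vb = np.linalg.eigh(B)

# Weyl's theorem: each eigenvalue moves by at most the spectral norm of E.
eig_shift = np.max(np.abs(la - lb))
opnorm_E = np.linalg.norm(E, 2)

# Davis-Kahan flavor: the bottom eigenvector subspace (1-dimensional here)
# rotates by O(||E|| / eigen-gap) in the projection (sin-theta) distance.
Pa = np.outer(Va[:, 0], Va[:, 0])
Pb = np.outer(Vb[:, 0], Vb[:, 0])
subspace_dist = np.linalg.norm(Pa - Pb, 2)
```

With a perturbation of spectral norm around 1e-3 and an eigen-gap of 3 between the bottom eigenvalue and the rest, the subspace distance is of the same small order, which is the mechanism exploited through Lemma D.2 in the proof of (c).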
Contributor Information
Yikun Zhang, Department of Statistics, University of Washington, Seattle, WA 98195, USA.
Yen-Chi Chen, Department of Statistics, University of Washington, Seattle, WA 98195, USA.
References
- 1. Absil, P.-A., Mahony, R. & Sepulchre, R. (2008) Optimization Algorithms on Matrix Manifolds. Princeton, NJ: Princeton University Press.
- 2. Absil, P.-A., Mahony, R. & Trumpf, J. (2013) An extrinsic look at the Riemannian Hessian. Geometric Science of Information (F. Nielsen & F. Barbaresco eds). Berlin, Heidelberg: Springer, pp. 361–368.
- 3. Aliyari Ghassabeh, Y. (2015) A sufficient condition for the convergence of the mean shift algorithm with Gaussian kernel. J. Multivariate Anal., 135, 1–10.
- 4. Anitescu, M. (2000) Degenerate nonlinear programming with a quadratic growth condition. SIAM J. Optim., 10, 1116–1135.
- 5. Argus, D. F., Gordon, R. G. & DeMets, C. (2011) Geologically current motion of 56 plates relative to the no-net-rotation reference frame. Geochemistry, Geophysics, Geosystems, 12.
- 6. Arias-Castro, E., Mason, D. & Pelletier, B. (2016) On the estimation of the gradient lines of a density and the consistency of the mean-shift algorithm. J. Mach. Learn. Res., 17, 1–28.
- 7. Bai, Z., Rao, C. & Zhao, L. (1988) Kernel estimators of density function of directional data. J. Multivariate Anal., 27, 24–39.
- 8. Balakrishnan, S., Wainwright, M. J. & Yu, B. (2017) Statistical guarantees for the EM algorithm: from population to sample-based analysis. Ann. Statist., 45, 77–120.
- 9. Banerjee, A., Dhillon, I. S., Ghosh, J. & Sra, S. (2005) Clustering on the unit hypersphere using von Mises-Fisher distributions. J. Mach. Learn. Res., 6, 1345–1382.
- 10. Banyaga, A. & Hurtubise, D. (2004) Lectures on Morse Homology. Texts in the Mathematical Sciences. Netherlands: Springer.
- 11. Beck, A. & Tetruashvili, L. (2013) On the convergence of block coordinate descent type methods. SIAM J. Optim., 23, 2037–2060.
- 12. Beran, R. (1979) Exponential models for directional data. Ann. Statist., 7, 1162–1178.
- 13. Bird, P. (2003) An updated digital model of plate boundaries. Geochemistry, Geophysics, Geosystems, 4.
- 14. Bonnabel, S. (2013) Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Automat. Control, 58, 2217–2229.
- 15. Boumal, N. (2020) An introduction to optimization on smooth manifolds. Available online.
- 16. Bowman, A. W. (1984) An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71, 353–360.
- 17. Bubeck, S. (2015) Convex optimization: algorithms and complexity. Found. Trends Mach. Learn., 8, 231–357.
- 18. Burago, Y., Gromov, M. & Perel'man, G. (1992) A. D. Alexandrov spaces with curvature bounded below. Russian Math. Surveys, 47, 1–58.
- 19. Carreira-Perpiñán, M. Á. (2007) Gaussian mean-shift is an EM algorithm. IEEE Trans. Pattern Anal. Mach. Intell., 29, 767–776.
- 20. Chacón, J. E., Duong, T. & Wand, M. P. (2011) Asymptotics for general multivariate kernel density derivative estimators. Statist. Sinica, 21, 807.
- 21. Charles, Z. & Papailiopoulos, D. (2018) Stability and generalization of learning algorithms that converge to global optima. International Conference on Machine Learning. PMLR, pp. 745–754.
- 22. Chen, Y.-C., Genovese, C. R. & Wasserman, L. (2015a) Asymptotic theory for density ridges. Ann. Statist., 43, 1896–1928.
- 23. Chen, Y.-C., Ho, S., Freeman, P. E., Genovese, C. R. & Wasserman, L. (2015b) Cosmic web reconstruction through density ridges: method and algorithm. Monthly Notices of the Royal Astronomical Society, 454, 1140–1156.
- 24. Chen, Y.-C., Genovese, C. R., Ho, S. & Wasserman, L. (2015c) Optimal ridge detection using coverage risk. Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc.
- 25. Chen, Y.-C., Genovese, C. R. & Wasserman, L. (2016a) A comprehensive approach to mode clustering. Electron. J. Stat., 10, 210–241.
- 26. Chen, Y.-C., Ho, S., Brinkmann, J., Freeman, P. E., Genovese, C. R., Schneider, D. P. & Wasserman, L. (2016b) Cosmic web reconstruction through density ridges: catalogue. Monthly Notices of the Royal Astronomical Society, 461, 3896–3909.
- 27. Chen, Y.-C. (2017) A tutorial on kernel density estimation and recent advances. Biostatistics & Epidemiology, 1, 161–187.
- 28. Chen, Y.-C. (2022) Solution manifold and its statistical applications. Electron. J. Stat., 16, 408–450.
- 29. Cheng, Y. (1995) Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell., 17, 790–799.
- 30. Chrisman, N. R. (2017) Calculating on a round planet. International Journal of Geographical Information Science, 31, 637–657.
- 31. Comaniciu, D. & Meer, P. (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell., 24, 603–619.
- 32. Cuevas, A. (2009) Set estimation: another bridge between statistics and geometry. Bol. Estad. Investig. Oper., 25, 71–85.
- 33. Damon, J. (1999) Properties of ridges and cores for two-dimensional images. J. Math. Imaging Vis., 10, 163–174.
- 34. Daniilidis, A., Ley, O. & Sabourau, S. (2010) Asymptotic behaviour of self-contracted planar curves and gradient orbits of convex functions. J. Math. Pures Appl., 94, 183–199.
- 35. Daniilidis, A., David, G., Durand-Cartagena, E. & Lemenant, A. (2015) Rectifiability of self-contracted curves in the Euclidean space and applications. J. Geom. Anal., 25, 1211–1239.
- 36. Davis, C. & Kahan, W. M. (1970) The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal., 7, 1–46.
- 37. do Carmo, M. (2016) Differential Geometry of Curves and Surfaces: Revised and Updated Second Edition. Dover Books on Mathematics. Dover Publications.
- 38. Drusvyatskiy, D. & Lewis, A. S. (2018) Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res., 43, 919–948.
- 39. Eberly, D. (1996) Ridges in Image and Data Analysis. Computational Imaging and Vision. Springer Netherlands.
- 40. Einmahl, U. & Mason, D. M. (2005) Uniform in bandwidth consistency of kernel-type function estimators. Ann. Statist., 33, 1380–1403.
- 41. Fazel, M., Ge, R., Kakade, S. & Mesbahi, M. (2018) Global convergence of policy gradient methods for the linear quadratic regulator. International Conference on Machine Learning. PMLR, pp. 1467–1476.
- 42. Federer, H. (1959) Curvature measures. Trans. Amer. Math. Soc., 93, 418–491.
- 43. García-Portugués, E. (2013) Exact risk improvement of bandwidth selectors for kernel density estimation with directional data. Electron. J. Stat., 7, 1655–1685.
- 44. García-Portugués, E., Crujeiras, R. M. & González-Manteiga, W. (2013) Kernel density estimation for directional-linear data. J. Multivariate Anal., 121, 152–175.
- 45. Genovese, C. R., Perone-Pacifico, M., Verdinelli, I. & Wasserman, L. (2014) Nonparametric ridge estimation. Ann. Statist., 42, 1511–1545.
- 46. Ghassabeh, Y. A., Linder, T. & Takahara, G. (2013) On some convergence properties of the subspace constrained mean shift. Pattern Recognition, 46, 3140–3147.
- 47. Ghassabeh, Y. A. & Rudzicz, F. (2020) Modified subspace constrained mean shift algorithm. J. Classification, 1–17.
- 48. Giné, E. & Guillou, A. (2002) Rates of strong uniform consistency for multivariate kernel density estimators. Annales de l'Institut Henri Poincaré (B) Probability and Statistics, 38, 907–921.
- 49. Gupta, C., Balakrishnan, S. & Ramdas, A. (2021) Path length bounds for gradient descent and flow. J. Mach. Learn. Res., 22, 1–63.
- 50. Hall, P. (1983) Large sample optimality of least squares cross-validation in density estimation. Ann. Statist., 1156–1174.
- 51. Hall, P., Watson, G. S. & Cabrera, J. (1987) Kernel density estimation with spherical data. Biometrika, 74, 751–762.
- 52. Hall, P., Qian, W. & Titterington, D. M. (1992) Ridge finding from noisy data. J. Comput. Graph. Statist., 1, 197–211.
- 53. Hall, P., Peng, L. & Rau, C. (2001) Local likelihood tracking of fault lines and boundaries. J. R. Stat. Soc. Ser. B Stat. Methodol., 63, 569–582.
- 54. Harris, R. A. (2017) Large earthquakes and creeping faults. Reviews of Geophysics, 55, 169–198.
- 55. Hastie, T. & Stuetzle, W. (1989) Principal curves. J. Amer. Statist. Assoc., 84, 502–516.
- 56. Hauberg, S. (2015) Principal curves on Riemannian manifolds. IEEE Trans. Pattern Anal. Mach. Intell., 38, 1915–1921.
- 57. Horn, R. A. & Johnson, C. R. (1991) Topics in Matrix Analysis. Cambridge Univ. Press.
- 58. Horn, R. A. & Johnson, C. R. (2012) Matrix Analysis, 2nd edn. Cambridge Univ. Press.
- 59. Irwin, M. C. (2001) Smooth Dynamical Systems, vol. 17. World Scientific.
- 60. Izenman, A. J. (2012) Introduction to manifold learning. Wiley Interdiscip. Rev. Comput. Stat., 4, 439–446.
- 61. Jones, M. C., Marron, J. S. & Sheather, S. J. (1996) A brief survey of bandwidth selection for density estimation. J. Amer. Statist. Assoc., 91, 401–407.
- 62. Kafai, M., Miao, Y. & Okada, K. (2010) Directional mean shift and its application for topology classification of local 3D structures. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops. IEEE, pp. 170–177.
- 63. Karimi, H., Nutini, J. & Schmidt, M. (2016) Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. Machine Learning and Knowledge Discovery in Databases. Cham: Springer International Publishing, pp. 795–811.
- 64. Klemelä, J. (2000) Estimation of densities and derivatives of densities with directional data. J. Multivariate Anal., 73, 18–40.
- 65. Kobayashi, T. & Otsu, N. (2010) Von Mises-Fisher mean shift for clustering on a hypersphere. 20th International Conference on Pattern Recognition. IEEE, pp. 2130–2133.
- 66. Kozak, D., Becker, S., Doostan, A. & Tenorio, L. (2019) Stochastic subspace descent. arXiv preprint arXiv:1904.01145.
- 67. Kozak, D., Becker, S., Doostan, A. & Tenorio, L. (2020) A stochastic subspace approach to gradient-free optimization in high dimensions. arXiv preprint arXiv:2003.02684.
- 68. Lee, J. (2012) Introduction to Smooth Manifolds. Graduate Texts in Mathematics, 2nd edn. Springer.
- 69. Lee, J. M. (2018) Introduction to Riemannian Manifolds. Springer.
- 70. Ley, C. & Verdebout, T. (2017) Modern Directional Statistics. CRC Press.
- 71. Li, X., Hu, Z. & Wu, F. (2007) A note on the convergence of the mean shift. Pattern Recognition, 40, 1756–1762.
- 72. Lojasiewicz, S. (1963) A topological property of real analytic subsets. Coll. du CNRS, Les équations aux dérivées partielles, 117, 87–89.
- 73. Luo, Z.-Q. & Tseng, P. (1992) On the convergence of the coordinate descent method for convex differentiable minimization. J. Optim. Theory Appl., 72, 7–35.
- 74. Mardia, K. & Jupp, P. (2000) Directional Statistics. Wiley Series in Probability and Statistics. Wiley.
- 75. Marzio, M. D., Panzera, A. & Taylor, C. C. (2011) Kernel density estimation on the torus. J. Statist. Plann. Inference, 141, 2156–2173.
- 76. Necoara, I., Nesterov, Y. & Glineur, F. (2019) Linear convergence of first order methods for non-strongly convex optimization. Math. Programming, 175, 69–107.
- 77. Nesterov, Y. (2018) Lectures on Convex Optimization, vol. 137. Springer.
- 78. Nocedal, J. & Wright, S. J. (2006) Numerical Optimization. Springer Series in Operations Research and Financial Engineering, 2nd edn. New York: Springer.
- 79. Norgard, G. & Bremer, P.-T. (2012) Second derivative ridges are straight lines and the implications for computing Lagrangian coherent structures. Phys. D, 241, 1475–1476.
- 80. Oba, S., Kato, K. & Ishii, S. (2005) Multi-scale clustering for gene expression profiling data. Proceedings of Fifth IEEE Symposium on Bioinformatics and Bioengineering (BIBE'05). IEEE, pp. 210–217.
- 81. Ok, E. A. (2007) Real Analysis with Economic Applications, vol. 10. Princeton University Press.
- 82. Oliveira, M., Crujeiras, R. M. & Rodríguez-Casal, A. (2012) A plug-in rule for bandwidth selection in circular density estimation. Comput. Stat. Data Anal., 56, 3898–3908.
- 83. Ozertem, U. & Erdogmus, D. (2011) Locally defined principal curves and surfaces. J. Mach. Learn. Res., 12, 1249–1286.
- 84. Peikert, R., Günther, D. & Weinkauf, T. (2013) Comment on "Second derivative ridges are straight lines and the implications for computing Lagrangian coherent structures". Phys. D, 242, 65–66.
- 85. Pennec, X. (2006) Intrinsic statistics on Riemannian manifolds: basic tools for geometric measurements. J. Math. Imaging Vision, 25, 127–154.
- 86. Pewsey, A. & García-Portugués, E. (2021) Recent advances in directional statistics. Test, 1–58.
- 87. Polyak, B. (1963) Gradient methods for the minimisation of functionals. Comput. Math. Math. Phys., 3, 864–878.
- 88. Qiao, W. (2021) Asymptotic confidence regions for density ridges. Bernoulli, 27, 946–975.
- 89. Qiao, W. & Polonik, W. (2016) Theoretical analysis of nonparametric filament estimation. Ann. Statist., 44, 1269–1297.
- 90. Qiao, W. & Polonik, W. (2021) Algorithms for ridge estimation with convergence guarantees. arXiv preprint arXiv:2104.12314.
- 91. Rudemo, M. (1982) Empirical choice of histograms and kernel density estimators. Scand. J. Statist., 65–78.
- 92. Rudin, W. (1976) Principles of Mathematical Analysis, 3rd edn. New York: McGraw-Hill.
- 93. Saavedra-Nieves, P. & Crujeiras, R. M. (2020) Nonparametric estimation of directional highest density regions. arXiv preprint arXiv:2009.08915.
- 94. Saragih, J. M., Lucey, S. & Cohn, J. F. (2009) Face alignment through subspace constrained mean-shifts. Proceedings of the IEEE 12th International Conference on Computer Vision. IEEE, pp. 1034–1041.
- 95. Sasaki, H., Kanamori, T. & Sugiyama, M. (2017) Estimating density ridges by direct estimation of density-derivative-ratios. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol. 54 (A. Singh & J. Zhu eds). Fort Lauderdale, FL, USA: PMLR, pp. 204–212.
- 96. Scott, D. (2015) Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley Series in Probability and Statistics. Wiley.
- 97. Sheather, S. J. (2004) Density estimation. Statist. Sci., 19, 588–597.
- 98. Sheather, S. J. & Jones, M. C. (1991) A reliable data-based bandwidth selection method for kernel density estimation. J. R. Stat. Soc. Ser. B Stat. Methodol., 53, 683–690.
- 99. Silverman, B. W. (1986) Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
- 100. Snyder, J., Voxland, P. & Geological Survey (U.S.) (1989) An Album of Map Projections. U.S. Government Printing Office.
- 101. Sousbie, T., Pichon, C., Courtois, H., Colombi, S. & Novikov, D. (2007) The three-dimensional skeleton of the SDSS. The Astrophysical Journal, 672, L1–L4.
- 102. Stone, C. J. (1984) An asymptotically optimal window selection rule for kernel density estimates. Ann. Statist., 1285–1297.
- 103. Subarya, C., Chlieh, M., Prawirodirdjo, L., Avouac, J.-P., Bock, Y., Sieh, K., Meltzner, A. J., Natawidjaja, D. H. & McCaffrey, R. (2006) Plate-boundary deformation associated with the great Sumatra-Andaman earthquake. Nature, 440, 46–51.
- 104. Subbarao, R. & Meer, P. (2006) Nonlinear mean shift for clustering over analytic manifolds. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol. 1. IEEE, pp. 1168–1175.
- 105. Subbarao, R. & Meer, P. (2009) Nonlinear mean shift over Riemannian manifolds. Int. J. Comput. Vis., 84, 1.
- 106. Taylor, C. C. (2008) Automatic bandwidth selection for circular density estimation. Comput. Statist. Data Anal., 52, 3493–3500.
- 107. van der Vaart, A. W. (1998) Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge Univ. Press.
- 108. van der Vaart, A. W. & Wellner, J. A. (1996) Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Science & Business Media.
- 109. von Luxburg, U. (2007) A tutorial on spectral clustering. Statist. Comput., 17, 395–416.
- 110. Wasserman, L. (2006) All of Nonparametric Statistics. Springer Texts in Statistics. Berlin, Heidelberg: Springer-Verlag.
- 111. Wasserman, L. (2018) Topological data analysis. Annu. Rev. Stat. Appl., 5, 501–532.
- 112. Wright, S. J. (2015) Coordinate descent algorithms. Math. Programming, 151, 3–34.
- 113. Yang, M.-S., Chang-Chien, S.-J. & Kuo, H.-C. (2014) On mean shift clustering for directional data on a hypersphere. Proceedings of the Artificial Intelligence and Soft Computing. Cham: Springer International Publishing, pp. 809–818.
- 114. You, S., Bas, E., Erdogmus, D. & Kalpathy-Cramer, J. (2011) Principal curve based retinal vessel segmentation towards diagnosis of retinal diseases. Proceedings of the IEEE First International Conference on Healthcare Informatics, Imaging and Systems Biology. IEEE, pp. 331–337.
- 115. Yu, Y., Wang, T. & Samworth, R. J. (2014) A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102, 315–323.
- 116. Zhang, H. & Sra, S. (2016) First-order methods for geodesically convex optimization. Proceedings of the 29th Annual Conference on Learning Theory (V. Feldman, A. Rakhlin & O. Shamir eds). Proceedings of Machine Learning Research, vol. 49. New York, NY, USA: PMLR, pp. 1617–1638.
- 117. Zhang, Y. & Chen, Y.-C. (2021a) The EM perspective of directional mean shift algorithm. arXiv preprint arXiv:2101.10058.
- 118. Zhang, Y. & Chen, Y.-C. (2021b) Kernel smoothing, mean shift, and their learning theory with directional data. J. Mach. Learn. Res., 22, 1–92.
- 119. Zhang, Y. & Chen, Y.-C. (2021c) Mode and ridge estimation in Euclidean and directional product spaces: a mean shift approach. arXiv preprint arXiv:2110.08505.
- 120. Zhao, L. & Wu, C. (2001) Central limit theorem for integrated squared error of kernel estimators of spherical density. Sci. China Ser. A Math., 44, 474–483.
Data Availability Statement
The data and code underlying this paper are available at https://github.com/zhangyk8/EuDirSCMS. Specifically, the earthquake data in Section 5.3 were obtained from the Earthquake Catalog (https://earthquake.usgs.gov/earthquakes/search/) of the United States Geological Survey.