Abstract
This paper studies the linear convergence of the subspace constrained mean shift (SCMS) algorithm, a well-known algorithm for identifying a density ridge defined by a kernel density estimator. By arguing that the SCMS algorithm is a special variant of a subspace constrained gradient ascent (SCGA) algorithm with an adaptive step size, we derive the linear convergence of such an SCGA algorithm. While the existing research focuses mainly on density ridges in the Euclidean space, we generalize density ridges and the SCMS algorithm to directional data. In particular, we establish the stability theorem of density ridges with directional data and prove the linear convergence of our proposed directional SCMS algorithm.
Keywords: ridges, subspace constrained mean shift, directional data, optimization on a manifold
Mathematics Subject Classification: 62G05, 49Q12, 62H11
1. Introduction
Identifying meaningful lower-dimensional structures from a point cloud has long been a popular research topic in Statistics and Machine Learning [60, 111]. One reliable characterization of such a low-dimensional structure is the density ridge, which can be feasibly estimated by a kernel density estimator (KDE) from point cloud data [39, 45]. Loosely speaking, an estimated density ridge signifies a high-density curve or surface in a point cloud; see the left panel of Fig. 1. Let $p$ be the underlying probability density function that generates the data in the Euclidean space $\mathbb{R}^D$. Its order-$d$ density ridge $R_d$ with $0 \leq d < D$ is the set of points defined as

$$R_d = \left\{\mathbf{x} \in \mathbb{R}^D : V_d(\mathbf{x}) V_d(\mathbf{x})^T \nabla p(\mathbf{x}) = \mathbf{0},\ \lambda_{d+1}(\mathbf{x}) < 0\right\}, \tag{1.1}$$

where $\lambda_1(\mathbf{x}) \geq \cdots \geq \lambda_D(\mathbf{x})$ are the eigenvalues of the Hessian $\nabla\nabla p(\mathbf{x})$ and $V_d(\mathbf{x}) = \left[\mathbf{v}_{d+1}(\mathbf{x}), \ldots, \mathbf{v}_D(\mathbf{x})\right]$ has its columns as the last $D - d$ orthonormal eigenvectors. The notion of density ridges has appeared in various scientific fields, such as medical imaging [114], seismology [95] and astronomy [26, 101]. To locate an estimated density ridge defined by the (Euclidean) KDE, [83] proposed a practical method called the subspace constrained mean shift (SCMS) algorithm.
Fig. 1. Density ridges estimated by the Euclidean and directional SCMS algorithms on two synthetic datasets (drawn as black points) with hidden circular manifold structures (indicated by blue curves) on $\mathbb{R}^2$ and the unit sphere $\Omega_2$, respectively. Left: the orange points indicate the estimated ridge obtained by the Euclidean SCMS algorithm from the dataset on $\mathbb{R}^2$. Right: the red points represent the estimated directional ridge identified by our directional SCMS algorithm, while the orange points indicate the estimated ridge obtained by the Euclidean SCMS algorithm from the dataset on $\Omega_2$. This panel is presented under the Hammer projection; see Appendix B for more details.
While the statistical estimation and asymptotic theories of density ridges in $\mathbb{R}^D$ have been well studied [22, 24, 45, 88, 89], the literature falls short of addressing the algorithmic properties of the ridge-finding method, i.e. the SCMS algorithm. To the best of our knowledge, [46, 47] were the only available works to investigate the SCMS algorithm and its modified version from an algorithmic perspective. However, they only proved a non-decreasing property of density estimates and the validity of two stopping criteria for the SCMS algorithm. The algorithmic convergence of the SCMS algorithm remains an open question. There are two challenges to answering this question. First, because every iteration of the SCMS algorithm involves a projection matrix defined by the (estimated) Hessian, it is no longer a conventional first-order method in optimization. Second, estimating a density ridge in practice is a nonconvex/nonconcave optimization problem. Thus, the first objective of this paper is to provide a theoretical study on the algorithmic convergence and its associated (linear) rate of convergence for the SCMS algorithm.
In stark contrast to the abundant research on density ridges in the Euclidean space, little work has been done to examine the statistical properties of density ridges on the unit hypersphere $\Omega_q$, or any practical algorithm for estimating them. Nevertheless, data on $\Omega_q$ are ubiquitous in many scientific fields of study, such as seismology (e.g. longitudes and latitudes of the epicenters of earthquakes) and astronomy (e.g. right ascensions and declinations of astronomical objects). Such data are generally known as directional data in the statistical literature [70, 74]. Hence, the second objective of this paper is to generalize density ridges and the SCMS algorithm to directional data.

More importantly, identifying an estimated density ridge from directional data on $\Omega_2$ by the Euclidean SCMS algorithm always suffers from high bias near the two poles of $\Omega_2$. Consider a synthetic dataset with independently and identically distributed (i.i.d.) observations sampled from a great circle connecting the North and South Poles of $\Omega_2$ with additive noises. We apply both the Euclidean and directional SCMS algorithms to this simulated dataset. While the estimated ridges by the Euclidean SCMS algorithm fail to recover the desired great circle in high-latitude regions, the ridges identified by our proposed directional SCMS algorithm align well with the underlying circular structure; see the right panel in Fig. 1 for a preview and Appendix B for a more detailed discussion.
Main Results. The main contributions of this paper are summarized as follows:
- We present the convergence analysis of the SCMS and the general SCGA algorithms and prove their linear convergence properties with Euclidean data (Theorem 3.1, Corollary 3.2, and related discussion in Section 3.3):
$$\left\|\hat{\mathbf{x}}^{(t)} - \hat{\mathbf{x}}^{\star}\right\|_2 = O\left(\Upsilon^{t}\right) \quad \text{as } t \to \infty,$$
where $\left\{\hat{\mathbf{x}}^{(t)}\right\}_{t=0}^{\infty}$ is a sequence of points generated by the SCGA or SCMS algorithm in $\mathbb{R}^D$, $\hat{\mathbf{x}}^{\star}$ is the limit point of the sequence, and $\Upsilon \in (0,1)$ is a constant.
- We generalize density ridges and the SCMS algorithm to directional data on $\Omega_q$ (Section 4).
- We prove the statistical convergence rate of a ridge estimator on the sphere $\Omega_q$ defined by the directional KDE (Theorem 4.1):
$$\mathrm{Haus}\left(\underline{R}_d, \underline{\hat{R}}_d\right) = O(h^2) + O_P\left(\sqrt{\frac{|\log h|}{n h^{q+4}}}\right),$$
where $\underline{R}_d$ and $\underline{\hat{R}}_d$ are the population and estimated directional density ridges, respectively, $\mathrm{Haus}$ is the Hausdorff distance, and $q$ is the dimension of $\Omega_q$.
- We establish the convergence of the SCMS and the general SCGA algorithms with directional data and derive their linear convergence results (Theorem 4.2, Corollary 4.2 and related expositions in Section 4.3):
$$d_g\left(\underline{\hat{\mathbf{x}}}^{(t)}, \underline{\hat{\mathbf{x}}}^{\star}\right) = O\left(\underline{\Upsilon}^{t}\right) \quad \text{as } t \to \infty,$$
where $\left\{\underline{\hat{\mathbf{x}}}^{(t)}\right\}_{t=0}^{\infty}$ is the sequence of points generated by the directional SCGA or SCMS algorithm, $\underline{\hat{\mathbf{x}}}^{\star}$ is the convergence point, $\underline{\Upsilon} \in (0,1)$ is a constant and $d_g$ is the geodesic distance on $\Omega_q$.
Other Related Literature. The problem of density ridge estimation has its unique standing in both the computer science and statistics literature; see [33, 39, 52, 53] and references therein. Among various definitions of density ridges [79, 84], our definition follows from [22, 39, 45], because its statistical estimation theory has been well established and it can be directly generalized to directional densities. Practically, the SCMS algorithm for identifying an estimated density ridge first appeared in the field of computer vision [94] before its introduction to the statistical community by [83]. More recently, [90] proposed alternative methods to the SCMS algorithm for finding density ridges, which are based on a gradient descent of the ridgeness and have connections to solution manifolds [28]. They presented the convergence analysis on continuous versions of their proposed methods and discretized them via Euler's method. Our directional SCMS algorithm is extended from the directional mean shift algorithm [62, 65, 80, 113, 117, 118]. As we cast the (directional) SCMS algorithms into subspace constrained gradient ascent (SCGA) algorithms (on a hypersphere), it is worth mentioning that one should not confuse the SCGA algorithm here with the projected gradient ascent/descent method for a constrained problem in the standard optimization theory; see Section 3.2 in [17] for some references of the latter one. The SCGA algorithm discussed in this paper is a gradient ascent algorithm but with a subspace constrained gradient. When the subspace coincides with alternating one-dimensional coordinate spaces, the SCGA algorithm reduces to the well-known coordinate ascent/descent method [112]. Some linear convergence results of the coordinate descent algorithms were previously established by [11, 73].

Other related work includes [66, 67], though, in their problem setups, the projection matrix onto the subspace is random and has its expectation equal to the identity matrix. The SCGA algorithm of interest here always has a deterministic constrained subspace defined by the eigenspace associated with the last several eigenvalues of the Hessian of the density $p$.
Outlines and Notations. Section 2 introduces the definitions of the Euclidean and directional KDEs and reviews some preliminary concepts of differential geometry on $\Omega_q$. We discuss the assumptions on Euclidean density ridges and establish the (linear) convergence results of the SCGA and SCMS algorithms in Section 3. In Section 4, we generalize the definition of density ridges to the directional data scenario and prove the (linear) convergence properties of the SCGA and SCMS algorithms on $\Omega_q$. Some simulation studies and real-world applications of the Euclidean and directional SCMS algorithms are presented in Section 5, whose code is available at https://github.com/zhangyk8/EuDirSCMS. We conclude the paper and discuss some potential impacts in Section 6.

Throughout the paper, we use $d$ as the intrinsic dimension of density ridges, whose ambient spaces are $\mathbb{R}^D$ in the Euclidean data case and $\Omega_q$ in the directional data case. Notice that a quantity under the directional data setting that has its counterpart in the Euclidean data case will be denoted by the same notation with an extra underline. For instance, $R_d$ is a ridge of the density $p$ in the Euclidean space $\mathbb{R}^D$, while $\underline{R}_d$ refers to a ridge of the directional density $f$ on the sphere $\Omega_q$.
Let $f: \mathbb{R}^D \to \mathbb{R}$ be a smooth function and $\boldsymbol{\alpha} = (\alpha_1, \ldots, \alpha_D)$ be a multi-index (that is, $\alpha_1, \ldots, \alpha_D$ are nonnegative integers and $|\boldsymbol{\alpha}| = \sum_{i=1}^{D} \alpha_i$). Define $D^{\boldsymbol{\alpha}} = \frac{\partial^{|\boldsymbol{\alpha}|}}{\partial x_1^{\alpha_1} \cdots \partial x_D^{\alpha_D}}$ as the $|\boldsymbol{\alpha}|$-th order partial derivative operator, where $D^{\boldsymbol{\alpha}} f$ is often written as $f^{(\boldsymbol{\alpha})}$. For any integer $j \geq 0$, we define the functional norms

$$\|f\|_{\infty}^{(j)} = \max_{|\boldsymbol{\alpha}| = j}\ \sup_{\mathbf{x} \in \mathbb{R}^D} \left|D^{\boldsymbol{\alpha}} f(\mathbf{x})\right|.$$

When $j = 0$, this becomes the infinity norm of $f$; for $j \geq 1$, the above norms are indeed semi-norms. We also define $\|f\|_{\infty,k}^{*} = \max_{0 \leq j \leq k} \|f\|_{\infty}^{(j)}$.

The (total) gradient and Hessian of $f$ are defined as $\nabla f(\mathbf{x}) = \left(\frac{\partial f(\mathbf{x})}{\partial x_1}, \ldots, \frac{\partial f(\mathbf{x})}{\partial x_D}\right)^T$ and $\nabla\nabla f(\mathbf{x}) = \left[\frac{\partial^2 f(\mathbf{x})}{\partial x_i \partial x_j}\right]_{1 \leq i, j \leq D}$. Inductively, the third derivative of $f$ is a $D \times D \times D$ array given by $\nabla\nabla\nabla f(\mathbf{x}) = \left[\frac{\partial^3 f(\mathbf{x})}{\partial x_i \partial x_j \partial x_k}\right]_{1 \leq i, j, k \leq D}$. When $f$ is a directional density supported on $\Omega_q$, the preceding functional norms are defined via the Riemannian gradient, Hessian and higher-order derivatives of $f$ within the tangent space $T_{\mathbf{x}}$ at $\mathbf{x} \in \Omega_q$, and the supremum will be taken over $\Omega_q$ instead of $\mathbb{R}^D$. They are equivalent to the derivatives of $f$ with respect to the local coordinate chart on $\Omega_q$; see Section 2.3 for a review.
Let $[\mathbf{A}]_{ij}$ denote the $(i,j)$-th entry of a matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$. Then, the Frobenius norm is $\|\mathbf{A}\|_F = \sqrt{\mathrm{tr}\left(\mathbf{A}^T \mathbf{A}\right)}$, where $\mathrm{tr}(\cdot)$ is the trace of the square matrix, and the operator norm is $\|\mathbf{A}\|_p = \sup_{\|\mathbf{x}\|_p = 1} \|\mathbf{A}\mathbf{x}\|_p$. In most cases, we consider the $\ell_2$ (operator) norm $\|\mathbf{A}\|_2$. We define $\|\mathbf{A}\|_{\max} = \max_{i,j} \left|[\mathbf{A}]_{ij}\right|$. The inequality relationships between the above matrix norms are $\|\mathbf{A}\|_{\max} \leq \|\mathbf{A}\|_2$, $\|\mathbf{A}\|_2 \leq \|\mathbf{A}\|_F \leq \sqrt{\mathrm{rank}(\mathbf{A})} \cdot \|\mathbf{A}\|_2$ and $\|\mathbf{A}\|_2 \leq \sqrt{mn} \cdot \|\mathbf{A}\|_{\max}$.
We use the big-O notation $a_n = O(b_n)$ if the absolute value of $a_n$ is upper bounded by a positive constant multiple of $b_n$ for all sufficiently large $n$. In contrast, $a_n = o(b_n)$ when $\lim_{n \to \infty} |a_n| / b_n = 0$. For random vectors, the notation $o_P(1)$ is short for a sequence of random vectors that converges to zero in probability. The expression $O_P(1)$ denotes a sequence that is bounded in probability; see Section 2.2 of [107] for details.
2. Preliminaries
In this section, we review the KDE with Euclidean and directional data as well as some differential geometry concepts on $\Omega_q$.
2.1 Kernel Density Estimation with Euclidean Data
Let $\mathbf{X}_1, \ldots, \mathbf{X}_n$ be a random sample from a distribution $P$ with density $p$ supported on the Euclidean space $\mathbb{R}^D$. We call such a random sample Euclidean data in the sequel. The (Euclidean) KDE at a point $\mathbf{x} \in \mathbb{R}^D$ with a kernel function $K$ and bandwidth parameter $h > 0$ is written as [27, 96, 110]:

$$\hat{p}_n(\mathbf{x}) = \frac{1}{n h^D} \sum_{i=1}^{n} K\left(\frac{\mathbf{x} - \mathbf{X}_i}{h}\right). \tag{2.1}$$

The kernel $K: \mathbb{R}^D \to \mathbb{R}$ is generally a unimodal function satisfying the following properties:

- (K1) $\int_{\mathbb{R}^D} K(\mathbf{x})\, d\mathbf{x} = 1$.
- (K2) $K$ is (radially) symmetric, i.e. $K(\mathbf{x}) = K(-\mathbf{x})$.
- (K3) $\int_{\mathbb{R}^D} \|\mathbf{x}\|_2^2\, K(\mathbf{x})\, d\mathbf{x} < \infty$ and $\int_{\mathbb{R}^D} K(\mathbf{x})^2\, d\mathbf{x} < \infty$, where $\|\cdot\|_2$ is the usual $\ell_2$ norm in $\mathbb{R}^D$.

One possible approach to construct a multivariate kernel $K$ with the above properties is to derive it from a kernel profile as follows:

$$K(\mathbf{x}) = c_{k,D} \cdot k\left(\|\mathbf{x}\|_2^2\right), \tag{2.2}$$

where $c_{k,D} > 0$ is the normalizing constant such that $K$ satisfies (K1) and the function $k: [0, \infty) \to [0, \infty)$ is called the profile of the kernel. This kernel form is generally used in deriving (subspace constrained) mean shift algorithms; see Section 3.2. An important example of the profile function is $k(x) = e^{-x/2}$ for $x \geq 0$, leading to the multivariate Gaussian kernel $K(\mathbf{x}) = (2\pi)^{-D/2} \exp\left(-\frac{\|\mathbf{x}\|_2^2}{2}\right)$.
Another approach to designing a multivariate kernel function is to leverage the product kernel technique as $K(\mathbf{x}) = \prod_{j=1}^{D} K_j(x_j)$, where $K_1, \ldots, K_D$ are kernel functions defined on $\mathbb{R}$ satisfying the properties (K1–3). This leads to a multivariate KDE as

$$\hat{p}_n(\mathbf{x}) = \frac{1}{n h_1 \cdots h_D} \sum_{i=1}^{n} \prod_{j=1}^{D} K_j\left(\frac{x_j - X_{ij}}{h_j}\right). \tag{2.3}$$

In fact, the multivariate Gaussian kernel can be obtained by defining its kernel profile as $k(x) = e^{-x/2}$ for $x \geq 0$ or by taking each $K_j$ to be the univariate Gaussian kernel. In practice, the multivariate KDE (2.1) with the Gaussian kernel is the most popular nonparametric density estimator with Euclidean data.
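As a concrete illustration, the KDE (2.1) with the Gaussian kernel reduces to a few lines of code. This is a minimal sketch for exposition (the synthetic dataset and the bandwidth value are our own choices, not taken from the paper):

```python
import numpy as np

def gaussian_kde(x, data, h):
    """Multivariate KDE (2.1) with the Gaussian kernel
    K(u) = (2*pi)^(-D/2) * exp(-||u||_2^2 / 2)."""
    n, D = data.shape
    u = (x - data) / h                              # (n, D) scaled differences
    kernel_vals = np.exp(-np.sum(u**2, axis=1) / 2) / (2 * np.pi)**(D / 2)
    return kernel_vals.sum() / (n * h**D)

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))                    # synthetic 2-D Gaussian sample
print(gaussian_kde(np.zeros(2), data, h=0.5))       # density estimate at the origin
```

For a standard bivariate Gaussian sample, the estimate at the origin should be close to the smoothed density value $1/(2\pi(1 + h^2))$.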
The most crucial part in applying the KDE is to select the bandwidth parameter $h$. Common methods in the literature aim at minimizing the mean integrated squared error (MISE):

$$\mathrm{MISE}\left(\hat{p}_n\right) = \mathbb{E} \int_{\mathbb{R}^D} \left[\hat{p}_n(\mathbf{x}) - p(\mathbf{x})\right]^2 d\mathbf{x},$$

or its asymptotic counterpart, through the rule of thumb [99], cross-validation [16, 50, 91, 102] and plug-in methods [98]. As choosing the bandwidth is not the main focus of this paper, we refer the interested reader to [61, 97] and Chapter 6.5 of [96] for comprehensive reviews.
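To make the rule-of-thumb approach concrete, the sketch below implements the standard normal-reference bandwidth, which minimizes the asymptotic MISE when the underlying density is Gaussian. The specific formula is the classical Silverman-type rule, shown here as background rather than as the paper's own choice:

```python
import numpy as np

def rule_of_thumb_bandwidth(data):
    """Normal-reference bandwidth for the multivariate Gaussian-kernel KDE:
    h = (4 / (D + 2))^(1/(D+4)) * n^(-1/(D+4)) * sigma,
    where sigma is an average marginal scale of the data."""
    n, D = data.shape
    sigma = np.mean(np.std(data, axis=0, ddof=1))
    return (4.0 / (D + 2))**(1.0 / (D + 4)) * n**(-1.0 / (D + 4)) * sigma

rng = np.random.default_rng(0)
data = rng.normal(size=(500, 2))
print(rule_of_thumb_bandwidth(data))    # roughly 0.35 for this sample
```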
2.2 Kernel Density Estimation with Directional Data
The Euclidean KDE (2.1) exhibits some salient drawbacks in dealing with directional data; see Appendix B for a detailed exposition. Fortunately, the theory of kernel density estimation with directional data has been well studied since the late 1970s [7, 12, 43, 51, 86, 120]. Let $\mathbf{X}_1, \ldots, \mathbf{X}_n$ be a random sample generated from an underlying directional density function $f$ on $\Omega_q = \left\{\mathbf{x} \in \mathbb{R}^{q+1} : \|\mathbf{x}\|_2 = 1\right\}$ with $\int_{\Omega_q} f(\mathbf{x})\, \omega_q(d\mathbf{x}) = 1$, where $\omega_q$ is the Lebesgue measure on $\Omega_q$. The directional KDE is given by

$$\hat{f}_h(\mathbf{x}) = \frac{c_{h,q}(L)}{n} \sum_{i=1}^{n} L\left(\frac{1 - \mathbf{x}^T \mathbf{X}_i}{h^2}\right), \tag{2.4}$$

where $L$ is a directional kernel (i.e. a rapidly decaying function with nonnegative values defined on $[0, R_L)$ for some constant $R_L \in (0, \infty]$), $h > 0$ is the bandwidth parameter and $c_{h,q}(L)$ is a normalizing constant satisfying $\int_{\Omega_q} \hat{f}_h(\mathbf{x})\, \omega_q(d\mathbf{x}) = 1$.
Remark 2.1.
The distance metric used by the directional KDE (2.4) on $\Omega_q$ is identical to the standard Euclidean metric in the ambient space $\mathbb{R}^{q+1}$. This is because the standard Euclidean metric $\|\mathbf{x}_1 - \mathbf{x}_2\|_2$ of $\mathbb{R}^{q+1}$ is topologically equivalent (but not strongly equivalent) to the geodesic distance $d_g(\mathbf{x}_1, \mathbf{x}_2) = \arccos\left(\mathbf{x}_1^T \mathbf{x}_2\right)$ on $\Omega_q$ due to the following equality:

$$\|\mathbf{x}_1 - \mathbf{x}_2\|_2 = \sqrt{2 - 2\, \mathbf{x}_1^T \mathbf{x}_2} = 2 \sin\left(\frac{d_g(\mathbf{x}_1, \mathbf{x}_2)}{2}\right). \tag{2.5}$$

See Section C.1.5 in [81] for the definition of equivalence of metrics. Hence, the distance metric in (2.4) is indeed intrinsic on $\Omega_q$ and adaptive to its geometry.
As in the applications of Euclidean KDEs, bandwidth selection is a critical part in determining the performance of directional KDEs [7, 43, 51, 75, 82, 93, 106]. On the contrary, the choice of the kernel is less crucial; see, e.g. Page 72 of [110] and Section 6.3.2 in [96] for the reasoning. A popular candidate is the so-called von Mises kernel $L(r) = e^{-r}$, which serves as a counterpart of the Gaussian kernel for directional KDEs. Its name originates from the famous $q$-von Mises–Fisher distribution on $\Omega_q$, which is denoted by $\mathrm{vMF}(\boldsymbol{\mu}, \nu)$ and has the density:

$$f_{\mathrm{vMF}}(\mathbf{x}; \boldsymbol{\mu}, \nu) = C_q(\nu) \exp\left(\nu\, \boldsymbol{\mu}^T \mathbf{x}\right) \quad \text{with} \quad C_q(\nu) = \frac{\nu^{\frac{q-1}{2}}}{(2\pi)^{\frac{q+1}{2}}\, \mathcal{I}_{\frac{q-1}{2}}(\nu)}, \tag{2.6}$$

where $\boldsymbol{\mu} \in \Omega_q$ is the directional mean, $\nu \geq 0$ is the concentration parameter and $\mathcal{I}_{\alpha}(\nu)$ is the modified Bessel function of the first kind at order $\alpha$. For more details on the statistical properties of the von Mises–Fisher distribution and directional KDEs, we refer the interested reader to [9, 44, 74].
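The directional KDE (2.4) with the von Mises kernel is equally short in code. The sketch below specializes to $\Omega_2$ (the sphere in $\mathbb{R}^3$), where the von Mises–Fisher normalizing constant has the closed form $\kappa / (4\pi \sinh \kappa)$ with $\kappa = 1/h^2$; the dataset is an illustrative uniform sample, not from the paper:

```python
import numpy as np

def vmf_kde(x, data, h):
    """Directional KDE (2.4) on the unit sphere S^2 in R^3 with the
    von Mises kernel L(r) = exp(-r), i.e. an average of von Mises-Fisher
    densities centered at the data with concentration kappa = 1/h^2."""
    kappa = 1.0 / h**2
    t = data @ x                       # cosine similarity with each observation
    # kappa/(4*pi*sinh(kappa)) * exp(kappa*t), rewritten to avoid overflow:
    vals = kappa / (2 * np.pi) * np.exp(kappa * (t - 1)) / (1 - np.exp(-2 * kappa))
    return vals.mean()

rng = np.random.default_rng(1)
raw = rng.normal(size=(400, 3))
data = raw / np.linalg.norm(raw, axis=1, keepdims=True)   # uniform sample on S^2
x = np.array([0.0, 0.0, 1.0])
print(vmf_kde(x, data, h=0.5))    # near the uniform density 1/(4*pi) ~ 0.08
```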
2.3 Riemannian Gradient, Hessian and Exponential Map on $\Omega_q$

Given that the unit hypersphere $\Omega_q$ is a nonlinear manifold, the Riemannian gradient and Hessian of a smooth function $f$ on $\Omega_q$ are defined within its tangent spaces. They are different from, but also interconnected with, the total gradient and Hessian of $f$ in the ambient Euclidean space $\mathbb{R}^{q+1}$.
Riemannian Gradient on $\Omega_q$. Let $T_{\mathbf{x}}$ be the tangent space of $\Omega_q$ at a point $\mathbf{x} \in \Omega_q$, which consists of all the vectors starting from $\mathbf{x}$ and tangent to $\Omega_q$. Given a smooth function $f: \Omega_q \to \mathbb{R}$, its Riemannian gradient $\mathrm{grad}\, f(\mathbf{x}) \in T_{\mathbf{x}}$ is defined as

$$\left\langle \mathrm{grad}\, f(\mathbf{x}), \mathbf{v} \right\rangle_{\mathbf{x}} = df_{\mathbf{x}}(\mathbf{v}) \tag{2.7}$$

for any (unit) vector $\mathbf{v} \in T_{\mathbf{x}}$, where $\langle \cdot, \cdot \rangle_{\mathbf{x}}$ is the inner product (or Riemannian metric) in $T_{\mathbf{x}}$ and $df_{\mathbf{x}}$ is the differential operator of $f$ at $\mathbf{x}$; see, e.g. Section 3.1 in [10] for more details. Note that the Riemannian metric on $\Omega_q$ coincides with the standard inner product $\langle \mathbf{u}, \mathbf{v} \rangle = \mathbf{u}^T \mathbf{v}$ in the ambient space $\mathbb{R}^{q+1}$; see Section 3.6.1 in [1]. If $f$ is smooth in an open neighborhood containing $\Omega_q$ and we consider $\mathrm{grad}\, f(\mathbf{x})$ and $\nabla f(\mathbf{x})$ as vectors in $\mathbb{R}^{q+1}$, then the inner product in $T_{\mathbf{x}}$ reduces to the usual one in $\mathbb{R}^{q+1}$ and the Riemannian gradient $\mathrm{grad}\, f(\mathbf{x})$ can be expressed in terms of the total gradient $\nabla f(\mathbf{x})$ as

$$\mathrm{grad}\, f(\mathbf{x}) = \left(\mathbf{I}_{q+1} - \mathbf{x}\mathbf{x}^T\right) \nabla f(\mathbf{x}), \tag{2.8}$$

where $\mathbf{I}_{q+1}$ is the identity matrix in $\mathbb{R}^{(q+1) \times (q+1)}$. The left-hand side of (2.8) is thus the projection of the total gradient $\nabla f(\mathbf{x})$ onto the tangent space $T_{\mathbf{x}}$ at $\mathbf{x}$.
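The projection in (2.8) is a one-line operation in practice. A minimal sketch (the point and gradient below are arbitrary illustrative values):

```python
import numpy as np

def riemannian_grad(x, total_grad):
    """Riemannian gradient on the unit sphere via (2.8):
    (I - x x^T) grad = grad - (x . grad) x."""
    return total_grad - x * (x @ total_grad)

x = np.array([0.0, 0.0, 1.0])        # a point on S^2
g = np.array([1.0, 2.0, 3.0])        # some total gradient at x
v = riemannian_grad(x, g)
print(v)                             # [1. 2. 0.]: the component along x is removed
```

The result is always orthogonal to $\mathbf{x}$, i.e. it lies in the tangent space $T_{\mathbf{x}}$.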
Riemannian Hessian on $\Omega_q$. The Riemannian Hessian $\mathcal{H} f(\mathbf{x})$ at a point $\mathbf{x} \in \Omega_q$ is a symmetric bilinear map from the tangent space $T_{\mathbf{x}}$ into itself defined as

$$\mathcal{H} f(\mathbf{x})[\mathbf{u}] = \bar{\nabla}_{\mathbf{u}}\, \mathrm{grad}\, f(\mathbf{x}) \tag{2.9}$$

for any $\mathbf{u} \in T_{\mathbf{x}}$, where $\bar{\nabla}$ is the Riemannian connection on $\Omega_q$. Similar to $\mathrm{grad}\, f(\mathbf{x})$, the Riemannian Hessian $\mathcal{H} f(\mathbf{x})$ has the following explicit formula when viewed in the ambient Euclidean space $\mathbb{R}^{q+1}$:

$$\mathcal{H} f(\mathbf{x}) = \left(\mathbf{I}_{q+1} - \mathbf{x}\mathbf{x}^T\right) \left[\nabla\nabla f(\mathbf{x}) - \mathbf{x}^T \nabla f(\mathbf{x}) \cdot \mathbf{I}_{q+1}\right] \left(\mathbf{I}_{q+1} - \mathbf{x}\mathbf{x}^T\right), \tag{2.10}$$

where $\nabla f(\mathbf{x})$ and $\nabla\nabla f(\mathbf{x})$ are the total gradient and Hessian of $f$ in $\mathbb{R}^{q+1}$. This formula can be derived via the Riemannian connection and Weingarten map on $\Omega_q$ ([2] and Section 5.5 in [1]) or geodesics on $\Omega_q$ [118].
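Formula (2.10) can likewise be evaluated directly from ambient-space derivatives. As a sanity check (our own toy example, not from the paper), take the height function $f(\mathbf{x}) = x_3$ on $S^2$, whose total gradient is $\mathbf{e}_3$ and total Hessian is zero; at the north pole its Riemannian Hessian should be $-\mathbf{I}$ on the tangent plane, since the north pole is a maximum of the height:

```python
import numpy as np

def riemannian_hessian(x, total_grad, total_hess):
    """Riemannian Hessian on the unit sphere via (2.10):
    P [ Hess - (x^T grad) I ] P  with projection P = I - x x^T."""
    n = x.size
    P = np.eye(n) - np.outer(x, x)
    return P @ (total_hess - (x @ total_grad) * np.eye(n)) @ P

# height function f(x) = x_3 at the north pole
x = np.array([0.0, 0.0, 1.0])
H = riemannian_hessian(x, np.array([0.0, 0.0, 1.0]), np.zeros((3, 3)))
print(H)     # diag(-1, -1, 0): negative definite on the tangent plane
```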
Exponential Map. An exponential map $\mathrm{Exp}_{\mathbf{x}}: T_{\mathbf{x}} \to \Omega_q$ at $\mathbf{x} \in \Omega_q$ is a mapping that takes a vector $\mathbf{v} \in T_{\mathbf{x}}$ to a point $\mathbf{y} = \mathrm{Exp}_{\mathbf{x}}(\mathbf{v}) \in \Omega_q$ along the curve $\gamma$ with $\gamma(0) = \mathbf{x}$ and $\gamma'(0) = \mathbf{v}$. Here, $\gamma$ is a curve of minimum length between $\mathbf{x}$ and $\mathbf{y}$ (i.e. the so-called geodesic on $\Omega_q$). An intuitive way of thinking of the exponential map $\mathrm{Exp}_{\mathbf{x}}$ evaluated at $\mathbf{v} \in T_{\mathbf{x}}$ on $\Omega_q$ is that, starting at the point $\mathbf{x}$, we identify another point $\mathbf{y}$ on $\Omega_q$ along the geodesic (or great circle) in the direction of $\mathbf{v}$ so that the geodesic distance between $\mathbf{x}$ and $\mathbf{y}$ is $\|\mathbf{v}\|_2$. As $\Omega_q$ is a compact Riemannian manifold, the exponential map $\mathrm{Exp}_{\mathbf{x}}$ is a diffeomorphism (smooth bijection) from a neighborhood of $\mathbf{0} \in T_{\mathbf{x}}$ to its image on $\Omega_q$; see Lemma 6.16 in [69]. The inverse of an exponential map (or logarithmic map) is defined within a neighborhood $U$ around $\mathbf{x}$ as a mapping $\mathrm{Exp}_{\mathbf{x}}^{-1}: U \to T_{\mathbf{x}}$ such that $\mathrm{Exp}_{\mathbf{x}}^{-1}(\mathbf{y})$ represents the vector in $T_{\mathbf{x}}$ starting at $\mathbf{x}$, pointing to $\mathbf{y}$, and with its length equal to the geodesic distance between $\mathbf{x}$ and $\mathbf{y}$.
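On the sphere, both maps admit closed forms: $\mathrm{Exp}_{\mathbf{x}}(\mathbf{v}) = \cos(\|\mathbf{v}\|_2)\,\mathbf{x} + \sin(\|\mathbf{v}\|_2)\,\mathbf{v}/\|\mathbf{v}\|_2$, and the logarithmic map rescales the tangential component of $\mathbf{y}$ to the geodesic distance $\arccos(\mathbf{x}^T\mathbf{y})$. A minimal sketch (our own illustration):

```python
import numpy as np

def exp_map(x, v):
    """Exponential map on the unit sphere: move from x along the geodesic
    with initial direction v (a tangent vector at x) for arc length ||v||."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def log_map(x, y):
    """Logarithmic (inverse exponential) map: the tangent vector at x that
    points toward y, with length equal to the geodesic distance."""
    t = np.clip(x @ y, -1.0, 1.0)
    w = y - t * x                       # component of y orthogonal to x
    nw = np.linalg.norm(w)
    if nw < 1e-12:
        return np.zeros_like(x)
    return np.arccos(t) * (w / nw)

x = np.array([0.0, 0.0, 1.0])           # north pole of S^2
v = np.array([np.pi / 2, 0.0, 0.0])     # tangent vector of length pi/2
y = exp_map(x, v)
print(y)                                # lands on the equator, at (1, 0, 0)
```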
3. Linear Convergence of the SCMS Algorithm With Euclidean Data
Given the definition of an order-$d$ ridge $R_d$ in (1.1) of the (smooth) density $p$ on the Euclidean space $\mathbb{R}^D$, we introduce, in this section, some commonly assumed conditions to regularize $R_d$ and its stability theorem. After revisiting the frameworks of the Euclidean mean shift and SCMS algorithms, as well as deriving the SCMS algorithm as an SCGA algorithm with an adaptive step size, we present our (linear) convergence analysis of the SCGA and SCMS algorithms.
3.1 Assumptions and Stability of Euclidean Density Ridges
Under the spectral decomposition of the Hessian $\nabla\nabla p(\mathbf{x})$ as $\nabla\nabla p(\mathbf{x}) = V(\mathbf{x}) \Lambda(\mathbf{x}) V(\mathbf{x})^T$, we know that $V(\mathbf{x}) = \left[\mathbf{v}_1(\mathbf{x}), \ldots, \mathbf{v}_D(\mathbf{x})\right]$ is a real orthogonal matrix with the eigenvectors of $\nabla\nabla p(\mathbf{x})$ as its columns and $\Lambda(\mathbf{x}) = \mathrm{diag}\left(\lambda_1(\mathbf{x}), \ldots, \lambda_D(\mathbf{x})\right)$ is a diagonal matrix with $\lambda_1(\mathbf{x}) \geq \cdots \geq \lambda_D(\mathbf{x})$. Given that $V_d(\mathbf{x}) = \left[\mathbf{v}_{d+1}(\mathbf{x}), \ldots, \mathbf{v}_D(\mathbf{x})\right]$, we let $V_d(\mathbf{x}) V_d(\mathbf{x})^T$ be the projection matrix onto the column space of $V_d(\mathbf{x})$ and $U_d(\mathbf{x}) U_d(\mathbf{x})^T$ be the projection matrix onto the complement space, where $U_d(\mathbf{x}) = \left[\mathbf{v}_1(\mathbf{x}), \ldots, \mathbf{v}_d(\mathbf{x})\right]$ and $U_d(\mathbf{x}) U_d(\mathbf{x})^T + V_d(\mathbf{x}) V_d(\mathbf{x})^T = \mathbf{I}_D$ is the identity matrix in $\mathbb{R}^{D \times D}$. Then, the order-$d$ principal gradient (or projected gradient in [22, 45]) is defined as

$$G_d(\mathbf{x}) = V_d(\mathbf{x}) V_d(\mathbf{x})^T \nabla p(\mathbf{x}), \tag{3.1}$$

and $\nabla p(\mathbf{x}) - G_d(\mathbf{x}) = U_d(\mathbf{x}) U_d(\mathbf{x})^T \nabla p(\mathbf{x})$ will be called the residual gradient. The order-$d$ density ridge can be equivalently defined as

$$R_d = \left\{\mathbf{x} \in \mathbb{R}^D : G_d(\mathbf{x}) = \mathbf{0},\ \lambda_{d+1}(\mathbf{x}) < 0\right\}. \tag{3.2}$$

It follows that the 0-ridge $R_0$ is the set of local modes of $p$, whose statistical properties and practical estimation algorithms have been well studied in [6, 25]. Thus, we only consider the case when $1 \leq d < D$ in the sequel. We define the projection from a point $\mathbf{x} \in \mathbb{R}^D$ onto a ridge $R_d$ by $\pi_{R_d}(\mathbf{x}) = \mathrm{argmin}_{\mathbf{y} \in R_d} \|\mathbf{x} - \mathbf{y}\|_2$ and the distance from $\mathbf{x}$ to $R_d$ by $d(\mathbf{x}, R_d) = \inf_{\mathbf{y} \in R_d} \|\mathbf{x} - \mathbf{y}\|_2$. Note that the projection from a point $\mathbf{x}$ to $R_d$ may not be unique. To guarantee the uniqueness of the projection, we introduce a concept called the reach [32, 42]:

$$\mathrm{reach}(R_d) = \sup\left\{r \geq 0 : \text{every point in } R_d \oplus r \text{ has a unique projection onto } R_d\right\}, \tag{3.3}$$

where $A \oplus r = \bigcup_{\mathbf{x} \in A} B(\mathbf{x}, r)$ and $B(\mathbf{x}, r)$ is a $D$-dimensional ball of radius $r$ centered at $\mathbf{x}$. To obtain a well-behaved ridge $R_d$, some assumptions need to be imposed on the underlying density $p$ around a small neighborhood of $R_d$.
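Computing the principal gradient (3.1) amounts to one symmetric eigendecomposition followed by a projection. A minimal sketch (the gradient and Hessian values below are illustrative stand-ins for $\nabla p$ and $\nabla\nabla p$ at a point):

```python
import numpy as np

def principal_gradient(grad, hess, d):
    """Order-d principal gradient (3.1): project the gradient onto the span
    of the eigenvectors tied to the D - d smallest Hessian eigenvalues."""
    D = hess.shape[0]
    eigvecs = np.linalg.eigh(hess)[1]   # columns ordered by ascending eigenvalue
    V_d = eigvecs[:, :D - d]            # the last D - d (smallest) eigenvectors
    return V_d @ (V_d.T @ grad)

# D = 2, d = 1: eigenvalues -2 (eigenvector e1) and 1 (eigenvector e2)
G = principal_gradient(np.array([3.0, 4.0]), np.diag([-2.0, 1.0]), d=1)
print(G)    # [3. 0.]: only the component along the negative-curvature axis remains
```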
- (A1) (Differentiability) We assume that $p$ is bounded and at least four times differentiable with bounded partial derivatives up to the fourth order for every $\mathbf{x} \in \mathbb{R}^D$.
- (A2) (Eigengap) We assume that there exist constants $\rho > 0$ and $\beta_0 > 0$ such that $\lambda_{d+1}(\mathbf{x}) \leq -\beta_0$ and $\lambda_d(\mathbf{x}) - \lambda_{d+1}(\mathbf{x}) \geq \beta_0$ for any $\mathbf{x} \in R_d \oplus \rho$.
- (A3) (Path Smoothness) Under the same $\rho > 0$ in (A2), we assume that there exists another constant $\beta_1 > 0$ such that $\|\nabla p(\mathbf{x})\|_2 \cdot \|\nabla\nabla\nabla p(\mathbf{x})\|_{\max} \leq \beta_1 < \frac{\beta_0^2}{2\sqrt{D}}$ for all $\mathbf{x} \in R_d \oplus \rho$ and $1 \leq d < D$.
Condition (A1) is a natural differentiability assumption in the context of ridge estimation. Condition (A2) is a curvature assumption on the true density $p$, ensuring that $p$ is 'strongly concave' around $R_d$ inside the $(D-d)$-dimensional linear space spanned by the columns of $V_d(\mathbf{x})$. We call this property 'subspace constrained strong concavity'. It is one of the most important components in establishing the linear convergence of the SCGA and SCMS algorithms; see Remark 3.3 for the reasoning. Condition (A3) prevents the gradient and third-order derivatives of $p$ from being too steep around the ridge $R_d$. These conditions are also imposed by [45] for characterizing a quadratic behavior of $p$ around $R_d$ and ensuring the stability of $R_d$, as well as by [22] to avoid degenerate normal spaces of $R_d$. Consequently, $R_d$ is a $d$-dimensional manifold that contains neither intersections nor endpoints; see also Lemma C.1 in the Appendix. Notice that the inequality assumptions in (A3) depend on both the ambient dimension $D$ and the intrinsic dimension $d$ of the ridge $R_d$. The larger the dimensions $D$ and $d$ are, the harder it is for the assumptions to hold. This phenomenon, in some sense, reflects the curse of dimensionality in nonparametric ridge estimation.
Given conditions (A1–3), the ridge $R_d$ will be stable under small perturbations of the underlying density $p$ and its derivatives, which is summarized in the following lemma. The stability of $R_d$ is generally measured by the Hausdorff distance defined as

$$\mathrm{Haus}(A, B) = \max\left\{\sup_{\mathbf{x} \in A} d(\mathbf{x}, B),\ \sup_{\mathbf{y} \in B} d(\mathbf{y}, A)\right\}, \tag{3.4}$$

where $A, B$ are two sets in $\mathbb{R}^D$.
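For finite point sets, such as the discrete outputs of an SCMS run, the Hausdorff distance (3.4) can be computed from the pairwise distance matrix. A minimal sketch (our own illustration):

```python
import numpy as np

def hausdorff(A, B):
    """Hausdorff distance (3.4) between two finite point sets in R^D,
    given as arrays of shape (m, D) and (k, D)."""
    dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)   # pairwise
    return max(dist.min(axis=1).max(), dist.min(axis=0).max())

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 0.0]])
print(hausdorff(A, B))    # 1.0: the point (1, 0) is at distance 1 from B
```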
Lemma 3.1. (Theorem 4 in [45]).
Assume conditions (A1–3) for two densities $p$ and $\tilde{p}$. When $\|p - \tilde{p}\|_{\infty,2}^{*}$ is sufficiently small, we have

$$\mathrm{Haus}\left(R_d, \tilde{R}_d\right) = O\left(\|p - \tilde{p}\|_{\infty,2}^{*}\right),$$

where $R_d$ and $\tilde{R}_d$ are the $d$-ridges of $p$ and $\tilde{p}$, respectively.
When the true density $p$ that generates the Euclidean data $\mathbf{X}_1, \ldots, \mathbf{X}_n$ is replaced by the Euclidean KDE $\hat{p}_n$ in the definition (1.1) of density ridges, we obtain a natural (plug-in) estimator of the true ridge $R_d$ as

$$\hat{R}_d = \left\{\mathbf{x} \in \mathbb{R}^D : \hat{V}_d(\mathbf{x}) \hat{V}_d(\mathbf{x})^T \nabla \hat{p}_n(\mathbf{x}) = \mathbf{0},\ \hat{\lambda}_{d+1}(\mathbf{x}) < 0\right\}.$$

To regularize the statistical behavior of the estimated ridge $\hat{R}_d$, we make the following assumptions on the kernel of the form (2.2):

- (E1) We assume that the kernel profile $k: [0, \infty) \to [0, \infty)$ is non-increasing and at least three times continuously differentiable with bounded fourth-order partial derivatives, as well as $\int_{\mathbb{R}^D} k\left(\|\mathbf{x}\|_2^2\right) d\mathbf{x} < \infty$ with $\int_{\mathbb{R}^D} \|\mathbf{x}\|_2^2\, k\left(\|\mathbf{x}\|_2^2\right) d\mathbf{x} < \infty$.
- (E2) Let
$$\mathcal{K} = \left\{\mathbf{y} \mapsto D^{\boldsymbol{\alpha}} K\left(\frac{\mathbf{x} - \mathbf{y}}{h}\right) : \mathbf{x} \in \mathbb{R}^D,\ h > 0,\ |\boldsymbol{\alpha}| \leq 4\right\}.$$
We assume that $\mathcal{K}$ is a bounded VC (subgraph) class of measurable functions on $\mathbb{R}^D$; that is, there exist constants $A, \upsilon > 0$ such that for any $0 < \epsilon < 1$,
$$\sup_{Q} N\left(\mathcal{K}, L_2(Q), \epsilon \|F\|_{L_2(Q)}\right) \leq \left(\frac{A}{\epsilon}\right)^{\upsilon},$$
where $N(\mathcal{K}, L_2(Q), \epsilon)$ is the $\epsilon$-covering number of the normed space $\left(\mathcal{K}, \|\cdot\|_{L_2(Q)}\right)$, $Q$ is any probability measure on $\mathbb{R}^D$ and $F$ is an envelope function of $\mathcal{K}$. Here, the norm $\|F\|_{L_2(Q)}$ is defined as $\left[\int_{\mathbb{R}^D} |F(\mathbf{x})|^2\, dQ(\mathbf{x})\right]^{1/2}$.
Remark 3.1.
Recall that the $\epsilon$-covering number $N(\mathcal{F}, \|\cdot\|, \epsilon)$ is defined as the minimal number of $\|\cdot\|$-balls of radius $\epsilon$ needed to cover the (function) class $\mathcal{F}$. One popular concept for controlling the uniform covering number $\sup_Q N(\mathcal{F}, L_2(Q), \epsilon)$ is the notion of Vapnik–Červonenkis (subgraph) classes, or simply VC classes. Starting from collections of sets, we say that a collection $\mathcal{C}$ of subsets of the sample space $\mathcal{X}$ picks out a certain subset $A$ of the finite set $\{x_1, \ldots, x_m\} \subset \mathcal{X}$ if it can be written as $A = \{x_1, \ldots, x_m\} \cap C$ for some $C \in \mathcal{C}$. The collection $\mathcal{C}$ is said to shatter $\{x_1, \ldots, x_m\}$ if $\mathcal{C}$ picks out each of its $2^m$ subsets. The VC-index $V(\mathcal{C})$ of $\mathcal{C}$ is the smallest $m$ for which no set of size $m$ is shattered by $\mathcal{C}$. A collection $\mathcal{C}$ of measurable sets is called a VC class if its index $V(\mathcal{C})$ is finite. To generalize this concept to a class $\mathcal{F}$ of real-valued and measurable functions defined on $\mathcal{X}$, we say that $\mathcal{F}$ is a VC subgraph class if the collection of all subgraphs of the functions in $\mathcal{F}$ forms a VC class of sets in $\mathcal{X} \times \mathbb{R}$. An important property of VC (subgraph) classes is that their $\epsilon$-covering numbers grow polynomially in $\frac{1}{\epsilon}$, as stated in condition (E2); see Theorem 2.6.4 in [108]. A more in-depth discussion of VC classes can be found in Chapter 2.6 of the same book.
Condition (E1) can be relaxed so that the kernel profile $k$ is three times continuously differentiable except at a finite number of points on $[0, \infty)$. Such a relaxation allows us to include the Epanechnikov and other compactly supported kernels. The integrability assumption on $k$ in condition (E1) is similar to the conditions (K1) and (K3) in Section 2.1, for the purpose of bounding the expectations and variances of the KDE $\hat{p}_n$ and its (partial) derivatives. Condition (E2) regularizes the complexity of the kernel and its (partial) derivatives, which is essential in establishing the uniform consistency of $\hat{p}_n$ and its derivatives to the corresponding quantities of $p$, as in equation (3.5).

Given conditions (E1) and (E2), the techniques in [20, 40, 48] can be utilized to show the uniform consistency of the Euclidean KDE $\hat{p}_n$ and its derivatives as

$$\left\|\hat{p}_n - p\right\|_{\infty}^{(j)} = O(h^2) + O_P\left(\sqrt{\frac{|\log h|}{n h^{D+2j}}}\right) \quad \text{for } j = 0, 1, \ldots, 4. \tag{3.5}$$
3.2 Mean Shift and SCMS Algorithms with Euclidean Data
We begin with a quick review of the Euclidean mean shift algorithm, as the SCMS algorithm is built on top of its formulation. Given condition (E1) and the Euclidean KDE $\hat{p}_n$ with kernel (2.2), its gradient estimator takes the form

$$\nabla \hat{p}_n(\mathbf{x}) = \frac{2 c_{k,D}}{n h^{D+2}} \left[\sum_{i=1}^{n} -k'\left(\left\|\frac{\mathbf{x} - \mathbf{X}_i}{h}\right\|_2^2\right)\right] \cdot \left[\frac{\sum_{i=1}^{n} \mathbf{X}_i \left(-k'\right)\left(\left\|\frac{\mathbf{x} - \mathbf{X}_i}{h}\right\|_2^2\right)}{\sum_{i=1}^{n} \left(-k'\right)\left(\left\|\frac{\mathbf{x} - \mathbf{X}_i}{h}\right\|_2^2\right)} - \mathbf{x}\right], \tag{3.6}$$

where the first term is a variant of KDEs and the second term is the mean shift vector

$$\Xi_h(\mathbf{x}) = \frac{\sum_{i=1}^{n} \mathbf{X}_i \left(-k'\right)\left(\left\|\frac{\mathbf{x} - \mathbf{X}_i}{h}\right\|_2^2\right)}{\sum_{i=1}^{n} \left(-k'\right)\left(\left\|\frac{\mathbf{x} - \mathbf{X}_i}{h}\right\|_2^2\right)} - \mathbf{x}. \tag{3.7}$$

This factorization suggests that the mean shift vector aligns with the direction of maximum increase in $\hat{p}_n$. Thus, moving a point along its mean shift vector successively yields an ascending path to a local mode [29, 31, 71]. Let $\left\{\hat{\mathbf{x}}^{(t)}\right\}_{t=0}^{\infty}$ be the mean shift sequence with the Euclidean KDE $\hat{p}_n$. Then, one iteration of the mean shift algorithm is written as

$$\hat{\mathbf{x}}^{(t+1)} = \hat{\mathbf{x}}^{(t)} + \Xi_h\left(\hat{\mathbf{x}}^{(t)}\right) = \hat{\mathbf{x}}^{(t)} + \eta_h\left(\hat{\mathbf{x}}^{(t)}\right) \cdot \nabla \hat{p}_n\left(\hat{\mathbf{x}}^{(t)}\right), \tag{3.8}$$

showing that the mean shift algorithm is a gradient ascent method with an adaptive step size

$$\eta_h(\mathbf{x}) = \left[\frac{2 c_{k,D}}{n h^{D+2}} \sum_{i=1}^{n} -k'\left(\left\|\frac{\mathbf{x} - \mathbf{X}_i}{h}\right\|_2^2\right)\right]^{-1}. \tag{3.9}$$

Here, we denote by $\hat{g}_n(\mathbf{x}) = \frac{2 c_{k,D}}{n h^{D+2}} \sum_{i=1}^{n} -k'\left(\left\|\frac{\mathbf{x} - \mathbf{X}_i}{h}\right\|_2^2\right)$ the denominator of the adaptive step size $\eta_h(\mathbf{x})$. Lemma 3.2 shows that under condition (E1) and the differentiability assumption on $p$, $h^2 \hat{g}_n(\mathbf{x})$ tends to a fixed constant with probability tending to 1 for any $\mathbf{x} \in \mathbb{R}^D$ as $n \to \infty$ and $h \to 0$. Therefore, the step size $\eta_h(\mathbf{x})$ has its asymptotic rate as $O(h^2)$ and tends to zero as $n \to \infty$ and $h \to 0$ as well. The proof of Lemma 3.2 can be found in Appendix D.

Lemma 3.2.

Assume conditions (A1) and (E1). The convergence rate of the adaptive step size is $\eta_h(\mathbf{x}) = O(h^2)$ for any $\mathbf{x} \in \mathbb{R}^D$ as $n \to \infty$ and $h \to 0$.
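With the Gaussian profile $k(x) = e^{-x/2}$, the weighted average in (3.7) takes a particularly simple form, and iterating (3.8) is a few lines of code. The sketch below is our own illustration (synthetic data, arbitrary bandwidth), not the paper's implementation:

```python
import numpy as np

def mean_shift_step(x, data, h):
    """One mean-shift iteration (3.8) with the Gaussian kernel: x moves to
    the kernel-weighted average of the sample (x plus the mean-shift vector)."""
    w = np.exp(-np.sum((x - data)**2, axis=1) / (2 * h**2))
    return (w[:, None] * data).sum(axis=0) / w.sum()

rng = np.random.default_rng(2)
data = rng.normal(size=(500, 2))         # unimodal sample with mode near 0
x = np.array([1.5, 1.5])
for _ in range(100):
    x = mean_shift_step(x, data, h=0.8)
print(x)                                 # near the estimated mode of the KDE
```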
As the mean shift algorithm is not the main focus of this paper, we will abuse notation and denote by $\left\{\hat{\mathbf{x}}^{(t)}\right\}_{t=0}^{\infty}$ the sequence produced by the SCMS or SCGA algorithm in the sequel. Compared with the mean shift iteration (3.8), the SCMS algorithm updates the sequence $\left\{\hat{\mathbf{x}}^{(t)}\right\}_{t=0}^{\infty}$ through the projected SCMS vector as

$$\hat{\mathbf{x}}^{(t+1)} = \hat{\mathbf{x}}^{(t)} + \hat{V}_d\left(\hat{\mathbf{x}}^{(t)}\right) \hat{V}_d\left(\hat{\mathbf{x}}^{(t)}\right)^T \Xi_h\left(\hat{\mathbf{x}}^{(t)}\right). \tag{3.10}$$

See Algorithm 1 in Appendix A for the entire procedure. This also implies that the SCMS algorithm can be viewed as a sample-based SCGA method as

$$\hat{\mathbf{x}}^{(t+1)} = \hat{\mathbf{x}}^{(t)} + \eta_h\left(\hat{\mathbf{x}}^{(t)}\right) \cdot \hat{V}_d\left(\hat{\mathbf{x}}^{(t)}\right) \hat{V}_d\left(\hat{\mathbf{x}}^{(t)}\right)^T \nabla \hat{p}_n\left(\hat{\mathbf{x}}^{(t)}\right), \tag{3.11}$$

with the same adaptive step size $\eta_h\left(\hat{\mathbf{x}}^{(t)}\right)$ as the Euclidean mean shift algorithm in (3.8). The formulation (3.11) sheds light on some (linear) convergence properties of the SCMS algorithm, as we will demonstrate in the next subsection.
3.3 Linear Convergence of Population and Sample-Based SCGA Algorithms
We have shown in (3.11) that the (usual/Euclidean) SCMS algorithm is a variant of the sample-based SCGA algorithm in $\mathbb{R}^D$ with an adaptive step size $\eta_h(\mathbf{x})$. To establish the (linear) convergence results of the SCMS algorithm with the Euclidean KDE $\hat{p}_n$, it suffices to study the (linear) convergence of the sample-based SCGA algorithm with objective function $\hat{p}_n$. To this end, we begin by studying the convergence of the population SCGA algorithm, whose objective function is the underlying density $p$.

Let $\left\{\mathbf{x}^{(t)}\right\}_{t=0}^{\infty}$ be the sequence defined by the population SCGA algorithm and $\left\{\hat{\mathbf{x}}^{(t)}\right\}_{t=0}^{\infty}$ be the sequence defined by the sample-based SCGA algorithm. The population SCGA algorithm is defined by its iterative formula as

$$\mathbf{x}^{(t+1)} = \mathbf{x}^{(t)} + \eta \cdot V_d\left(\mathbf{x}^{(t)}\right) V_d\left(\mathbf{x}^{(t)}\right)^T \nabla p\left(\mathbf{x}^{(t)}\right), \tag{3.12}$$

where $\eta > 0$ is a (fixed) step size. The sample-based SCGA algorithm has its iterative formula as in (3.11), except that the standard sample-based SCGA algorithm normally embraces a constant step size $\eta > 0$.
Remark 3.2.
In (3.10) and (3.11), we consider the SCMS algorithm as a sample-based SCGA iteration with an adaptive step size $\eta_h\left(\hat{\mathbf{x}}^{(t)}\right)$. Our Lemma 3.2 suggests that $\eta_h\left(\hat{\mathbf{x}}^{(t)}\right)$ tends to zero at the rate $O(h^2)$ as $n \to \infty$ and $h \to 0$. However, once the sample size $n$ is fixed and the bandwidth $h$ is chosen, the step size $\eta_h\left(\hat{\mathbf{x}}^{(t)}\right)$ is not only upper bounded but also uniformly lower bounded away from zero with respect to the iteration number $t$ by the differentiability condition (E1), when the current iterative point $\hat{\mathbf{x}}^{(t)}$ lies within the compact neighborhood $\hat{R}_d \oplus \rho$. Note that $\hat{R}_d \oplus \rho$ is compact because $\hat{R}_d$ is a finite union of connected and compact manifolds; see (d) of Lemma C.1. More importantly, these upper and lower bounds of $\eta_h\left(\hat{\mathbf{x}}^{(t)}\right)$ when $\hat{\mathbf{x}}^{(t)} \in \hat{R}_d \oplus \rho$ are independent of the iteration number $t$. Therefore, conditioning on the case when the sample size $n$ is sufficiently large, one can always select a small bandwidth $h$ such that the adaptive step size $\eta_h\left(\hat{\mathbf{x}}^{(t)}\right)$ of the SCMS algorithm is sufficiently small but not equal to zero.
As revealed by the following proposition, our imposed conditions (A1–3) in Section 3.1 ensure that, as long as the step size $\eta$ is small, the objective function $p$ along any population SCGA sequence $\left\{\mathbf{x}^{(t)}\right\}_{t=0}^{\infty}$ is non-decreasing and the sequence itself converges to $R_d$ when it is initialized within a small neighborhood of $R_d$.

Proposition 3.1. (Convergence of the SCGA Algorithm.)

For any SCGA sequence $\left\{\mathbf{x}^{(t)}\right\}_{t=0}^{\infty}$ defined by (3.12) with a sufficiently small step size $\eta > 0$, the following properties hold.

(a) Under condition (A1), the objective function sequence $\left\{p\left(\mathbf{x}^{(t)}\right)\right\}_{t=0}^{\infty}$ is non-decreasing and converges.

(b) Under condition (A1), $\lim_{t \to \infty} \left\|G_d\left(\mathbf{x}^{(t)}\right)\right\|_2 = 0$.

(c) Under conditions (A1–3), $\lim_{t \to \infty} d\left(\mathbf{x}^{(t)}, R_d\right) = 0$ whenever $\mathbf{x}^{(0)} \in R_d \oplus r_0$ with the convergence radius $r_0 > 0$ satisfying

$$r_0 \leq \min\left\{\rho,\ \mathrm{reach}(R_d),\ \frac{\delta_0}{A_1}\right\},$$

where $\delta_0 > 0$ is a constant defined in (h) of Lemma C.1, while $A_1 > 0$ is a quantity depending on both the dimension $D$ and the functional norm $\|p\|_{\infty,4}^{*}$ of $p$ up to its fourth-order (partial) derivatives.

The proof of Proposition 3.1 can be found in Appendix D. We make two comments on the choice of the convergence radius $r_0$ in (c) of Proposition 3.1. The first two quantities in the upper bound of $r_0$ ensure that $\mathbf{x}^{(t)} \in R_d \oplus \min\left\{\rho, \mathrm{reach}(R_d)\right\}$ and, therefore, the projection of $\mathbf{x}^{(t)}$ onto $R_d$ is well defined. The last quantity in the upper bound of $r_0$ is critical to guarantee that the distances $d\left(\mathbf{x}^{(t)}, R_d\right)$ from the SCGA sequence $\left\{\mathbf{x}^{(t)}\right\}_{t=0}^{\infty}$ to the ridge $R_d$ can be controlled by the norms $\left\|G_d\left(\mathbf{x}^{(t)}\right)\right\|_2$ of the order-$d$ principal gradients for $t = 0, 1, \ldots$.
Corollary 3.1. (Convergence of the SCMS Algorithm.)
When the fixed sample size $n$ is sufficiently large and the fixed bandwidth $h$ is chosen to be sufficiently small, the following properties hold for the SCMS sequence $\left\{\hat{\mathbf{x}}^{(t)}\right\}_{t=0}^{\infty}$ with high probability under conditions (A1–3) and (E1–2).

(a) The Euclidean KDE sequence $\left\{\hat{p}_n\left(\hat{\mathbf{x}}^{(t)}\right)\right\}_{t=0}^{\infty}$ is non-decreasing and thus converges.

(b) $\lim_{t \to \infty} \left\|\hat{G}_d\left(\hat{\mathbf{x}}^{(t)}\right)\right\|_2 = 0$.

(c) $\lim_{t \to \infty} d\left(\hat{\mathbf{x}}^{(t)}, \hat{R}_d\right) = 0$ whenever $\hat{\mathbf{x}}^{(0)} \in \hat{R}_d \oplus r_0$ with the convergence radius $r_0$ defined in (c) of Proposition 3.1.

Corollary 3.1 is the sample-based version of Proposition 3.1. On the one hand, when $n$ is sufficiently large and $h$ is small enough, the estimated ridge $\hat{R}_d$ also satisfies conditions (A1–3) with high probability; see Lemma 3.1 and the uniform bounds (3.5) of $\hat{p}_n$. On the other hand, the adaptive step size $\eta_h\left(\hat{\mathbf{x}}^{(t)}\right)$ of the SCMS algorithm can always be made smaller than the required threshold when the sample size $n$ is sufficiently large and $h$ is small; see Remark 3.2. Consequently, our arguments in Proposition 3.1 can be applied to establish the (local) convergence of the SCMS sequence here. In addition, we point out that Proposition 2 in [46] also proved the results (a–b) of Corollary 3.1 under condition (E1) and a convexity assumption on the kernel profile $k$. The difference is that our arguments hold when $n$ is large and $h$ is small, while the extra convexity assumption in [46] enables the authors to prove the results (a–b) universally for any choice of the bandwidth $h$.
By Proposition 3.1 and Corollary 3.1, it is now reasonable to denote the limit points of the population and sample-based SCGA sequences $\left\{\mathbf{x}^{(t)}\right\}_{t=0}^{\infty}$ and $\left\{\hat{\mathbf{x}}^{(t)}\right\}_{t=0}^{\infty}$ by $\mathbf{x}^{\star}$ and $\hat{\mathbf{x}}^{\star}$, respectively. Before stating our main linear convergence results, we introduce the concepts of Q-linear and R-linear convergence from the optimization literature; see, e.g. Appendix A2 in [78].
Definition 3.2. (Linear Rate of Convergence.)
We say that the convergence of a sequence $\left\{\mathbf{x}^{(t)}\right\}_{t=0}^{\infty}$ to $\mathbf{x}^{\star}$ is Q-linear if there exists a constant $\Upsilon \in (0,1)$ such that

$$\frac{\left\|\mathbf{x}^{(t+1)} - \mathbf{x}^{\star}\right\|_2}{\left\|\mathbf{x}^{(t)} - \mathbf{x}^{\star}\right\|_2} \leq \Upsilon \quad \text{for all } t \text{ sufficiently large}.$$

We say that the convergence is R-linear if there is a sequence of nonnegative scalars $\left\{\upsilon_t\right\}_{t=0}^{\infty}$ such that

$$\left\|\mathbf{x}^{(t)} - \mathbf{x}^{\star}\right\|_2 \leq \upsilon_t \ \text{ for all } t, \quad \text{where } \left\{\upsilon_t\right\}_{t=0}^{\infty} \text{ converges Q-linearly to zero}.$$

The linear convergence of the SCGA sequence $\left\{\mathbf{x}^{(t)}\right\}_{t=0}^{\infty}$ will be established under the following local condition.

- (A4) (Quadratic Behavior of Residual Vectors) We assume that the SCGA sequence $\left\{\mathbf{x}^{(t)}\right\}_{t=0}^{\infty}$ with step size $\eta > 0$ and $\mathbf{x}^{\star}$ as its limit point satisfies

$$\left\|\left(\mathbf{I}_D - V_d\left(\mathbf{x}^{(t)}\right) V_d\left(\mathbf{x}^{(t)}\right)^T\right)\left(\mathbf{x}^{(t)} - \mathbf{x}^{\star}\right)\right\|_2 \leq \frac{\beta_0}{A_4}\left\|\mathbf{x}^{(t)} - \mathbf{x}^{\star}\right\|_2^2$$

for some constant $A_4 > 0$, where $\beta_0$ is the constant defined in condition (A2).
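The Q-linear rate in Definition 3.2 can be checked empirically by monitoring the ratio of successive distances to the limit point. A toy illustration of our own (a geometric sequence, standing in for an SCGA path):

```python
import numpy as np

# A toy sequence x^(t) -> x* with ||x^(t+1) - x*|| / ||x^(t) - x*|| = 1/2,
# i.e. Q-linear convergence with constant Upsilon = 1/2 (Definition 3.2).
x_star = np.zeros(2)
xs = [np.array([1.0, -2.0]) * 0.5**t for t in range(20)]
ratios = [np.linalg.norm(xs[t + 1] - x_star) / np.linalg.norm(xs[t] - x_star)
          for t in range(19)]
print(ratios[:3])    # every ratio equals 0.5
```

In practice, an approximately constant ratio strictly below one across iterations is the empirical signature of Q-linear convergence; ratios drifting toward one indicate sublinear behavior.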
Condition (A4) imposes a direct assumption on the SCGA sequence, under which the residual vector and its inner product with the residual gradient are upper bounded by a quadratic term in the distance to the limiting point. This condition is imposed to guarantee that the density is ‘subspace constrained strongly concave’ around the limiting point; see also Remark 3.3. Our proof of Theorem 3.3 suggests that the residual vector is only required to be smaller than the corresponding first-order term; for simplicity, we require it to be quadratic. When condition (A4) fails to hold, the associated SCGA sequence can only converge sublinearly. Condition (A4) is therefore an essential element in the linear convergence of the SCGA algorithm, and we discuss some potentially weaker assumptions that imply condition (A4) in Appendix E. Intuitively, the SCGA path converges to its limit along the direction of the principal gradient. To gain more insight into the correctness of condition (A4), we consider a special density function
![]() |
(3.13) |
on the plane, whose one-dimensional ridge follows from the definition (1.1). Some careful calculations suggest that the principal gradient of this density points toward the ridge from either side; see Fig. 2 for a graphical illustration. Furthermore, the smallest eigenvalue of its Hessian is negative in a neighborhood of the ridge. Hence, the residual gradient is perpendicular to the SCGA direction, and condition (A4) naturally holds.
Fig. 2.
Contour lines of the density function (3.13) and its principal gradient flows.
We now present our linear convergence results for the population and sample-based SCGA algorithms.
Theorem 3.3. (Linear Convergence of the SCGA Algorithm.)
Assume conditions (A1–4) throughout the theorem.
(a) Q-Linear convergence of the population SCGA sequence: Consider a convergence radius satisfying the stated upper bound, in which one constant is defined in (h) of Lemma C.1 and another is a quantity defined in (c) of Proposition 3.1 that depends on both the dimension of the data and the functional norm of the density up to its fourth-order derivatives. Whenever the step size is below the stated threshold and the initial point lies within this radius of the ridge, the population SCGA sequence converges Q-linearly to its limiting point.
(b) R-Linear convergence of the distance to the ridge: Under the same radius as in (a), whenever the step size is below the stated threshold and the initial point lies within this radius of the ridge, the distances from the population SCGA sequence to the ridge converge R-linearly to zero.
We further assume conditions (E1–2) in the rest of the statements, with the sample size tending to infinity and the bandwidth tending to zero at suitable rates.
(c) Q-Linear convergence of the sample-based SCGA sequence: Under the same radius and step size threshold as in (a), the sample-based SCGA sequence converges Q-linearly to its limiting point with probability tending to 1, whenever the step size is below the threshold and the initial point lies within the radius of the estimated ridge.
(d) R-Linear convergence of the distance to the estimated ridge: Under the same radius and step size threshold as in (a), the distances from the sample-based SCGA sequence to the estimated ridge converge R-linearly to zero with probability tending to 1, under the same requirements on the step size and the initial point.
The detailed proof of Theorem 3.3 can be found in Appendix D. Note that, as in (c) of Proposition 3.1, we elucidate a threshold value for the convergence radius in (a), under which the population SCGA algorithm converges linearly. The first three quantities in this threshold value are directly adopted from the upper bound of the convergence radius in (c) of Proposition 3.1, while the last term controls the ‘subspace constrained strong concavity’ (3.15) of the density within the convergence neighborhood.
Remark 3.3.
Notice that the standard strong concavity assumption on the objective (density) function is not sufficient to establish the linear convergence of the population SCGA algorithm (3.12). This is because, under the (quasi-)strong concavity assumption [76], the objective function would satisfy the inequality (3.14) for some constant, and the standard proofs of the linear convergence of gradient ascent methods rely on this inequality; see Section 3.4 in [17]. However, as indicated in our proof of Theorem 3.3, the linear convergence of the SCGA algorithm requires the inequality (3.15) instead, for some constant, where the projection is generally taken onto the principal subspace. We say that a function satisfying (3.15) is ‘subspace constrained strongly concave’. Since the two inequalities differ by a residual gradient term, the strong concavity assumption (3.14) will not imply the key inequality (3.15) for the linear convergence of the population SCGA algorithm unless the residual gradient term can be upper bounded by the second-order error term. The imposed eigengap condition (A2), together with condition (A4) and its related discussion in Appendix E, fills this gap, ensuring that such a quadratic upper bound holds on the residual gradients along the SCGA sequence.
Corollary 3.2. (Linear Convergence of the SCMS Algorithm.)
Assume conditions (A1–4) and (E1–2). When the fixed sample size is sufficiently large and the bandwidth is chosen to be sufficiently small, there exists a convergence radius such that the SCMS sequence satisfies the stated linear convergence property with high probability, whenever the initial point lies within this radius of the estimated ridge.
Corollary 3.2 should also be regarded as the linear convergence of the sample-based SCGA algorithm to the estimated ridge defined by the Euclidean KDE. Based on conditions (E1–2) and the uniform bounds (3.5), the Euclidean KDE, together with its ridge and the sample-based SCGA sequence, satisfies conditions (A1–4) with probability tending to 1 as the sample size tends to infinity and the bandwidth tends to zero. As a result, one can follow our argument in (a) of Theorem 3.3 to establish the linear convergence of the sample-based SCGA algorithm with a fixed, sufficiently small step size. Furthermore, when the fixed sample size is sufficiently large and the bandwidth is chosen to be small, the adaptive step size of the SCMS algorithm always falls below the threshold for linear convergence but is also uniformly bounded away from zero with respect to the iteration number; see our Remark 3.2. By taking the infimum of the adaptive step size with respect to the iteration number, one can thus establish the linear convergence of the SCMS algorithm with an explicit rate of convergence.
4. The SCMS Algorithm With Directional Data and Its Linear Convergence
In this section, we generalize the definition (1.1) of density ridges to directional densities on the unit hypersphere and propose our directional SCMS algorithm to identify directional density ridges. In addition, we prove the linear convergence of our directional SCMS algorithm by adjusting the arguments in Section 3.3. Throughout this section, the data form a random sample from a directional distribution with density supported on the unit hypersphere, which is embedded in the ambient Euclidean space.
4.1 Definitions, Assumptions and Stability of Directional Density Ridges
To apply the matrix forms of the Riemannian gradient and Hessian of a directional density f in the ambient space, we first extend f from its support on the unit hypersphere to the punctured ambient space by defining

f(x) := f(x/‖x‖) for all x ≠ 0. (4.1)
Now, given the expressions of the Riemannian gradient and Hessian defined in (2.8) and (2.10), we perform the spectral decomposition on the Riemannian Hessian at each point x, where the resulting real orthogonal matrix has columns given by eigenvectors that are associated with the corresponding eigenvalues and lie within the tangent space at x. Note that the Riemannian Hessian also has the unit eigenvector x itself, which is orthogonal to the tangent space and corresponds to the eigenvalue 0.
Let the relevant submatrix consist of the trailing columns of the orthogonal matrix above, i.e. the unit eigenvectors inside the tangent space corresponding to the smallest eigenvalues of the Riemannian Hessian, and let the associated projection matrix be the one onto the linear subspace spanned by these columns. We define the principal Riemannian gradient of a given order by
![]() |
(4.2) |
where the last equality follows from the fact that these columns are orthogonal to the unit vector x. The directional density ridge (i.e. the density ridge on the unit hypersphere) of the corresponding order is the set of points defined as
![]() |
(4.3) |
Our definition of density ridges on the unit hypersphere can arguably be generalized to any smooth function supported on an arbitrary Riemannian manifold. It also follows that the 0-ridge is the set of local modes of the density on the sphere, whose statistical properties and practical estimation algorithm are discussed in [118]. Therefore, we only focus on ridges of positive order in this paper. To regularize the directional density ridge, we modify our assumptions on the Euclidean density ridge in Section 3.1 as follows:
- (A1) (Differentiability) Under the extension (4.1) of the directional density, we assume that its total gradient, total Hessian matrix and third-order derivative tensor exist, are continuous on the punctured ambient space and are square integrable on the unit hypersphere. We also assume that the density has bounded fourth-order derivatives on the sphere.
- (A2) (Eigengap) We assume that there exist constants such that the stated eigengap inequalities hold within a neighborhood of the directional ridge.
- (A3) (Path Smoothness) Under the same neighborhood as in (A2), we assume that there exists another constant such that the stated smoothness bound holds along the relevant paths.
Recall that the neighborhood above is a tubular neighborhood of the directional ridge in the ambient space. The discussions about conditions (A1–3) in Section 3.1 apply to their directional counterparts, except that the eigengap condition (A2) is imposed on eigenvalues within the tangent space at each point. However, since the only eigenvalue of the Riemannian Hessian associated with an eigenvector outside the tangent space is 0, the eigengap condition (A2) is also valid for the entire spectrum of the Riemannian Hessian in the ambient space. The extension (4.1) has also been used by [43, 44, 120]. Because the directional density f remains unchanged along every radial direction under the extension (4.1), the radial component of its total gradient vanishes, i.e. x^T ∇f(x) = 0 for all x on the sphere, and the Riemannian gradient (2.8) of f on the sphere becomes

grad f(x) = (I − x x^T) ∇f(x) = ∇f(x). (4.4)

Similarly, the Riemannian Hessian (2.10) of f on the sphere reduces to

H f(x) = (I − x x^T) ∇∇f(x) (I − x x^T). (4.5)

Both the Riemannian gradient and Hessian of f on the sphere are invariant under this extension.
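As a quick numerical sanity check of this invariance, the following sketch builds a radially constant extension of a smooth bump on the unit sphere and verifies that the radial derivative vanishes, so the tangent-space projection leaves the total gradient unchanged. The bump function, concentration value and test point are illustrative choices of ours, not objects from the paper.

```python
import numpy as np

# Radially constant extension f(x) = g(x / ||x||) of a smooth function g
# on the unit sphere; g is an (unnormalized) von Mises-Fisher-style bump.
mu = np.array([0.0, 0.0, 1.0])

def f_ext(x):
    u = x / np.linalg.norm(x)
    return np.exp(3.0 * u @ mu)

def num_grad(f, x, eps=1e-6):
    """Central finite-difference gradient."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2.0 * eps)
    return g

x = np.array([0.6, 0.0, 0.8])                 # a point on the unit sphere
grad = num_grad(f_ext, x)
proj = (np.eye(3) - np.outer(x, x)) @ grad    # Riemannian gradient, cf. (4.4)

assert abs(x @ grad) < 1e-4                   # radial component vanishes
assert np.allclose(proj, grad, atol=1e-4)     # projection is a no-op here
```

The same check fails for a generic (non-radially-constant) extension, which is precisely why the extension (4.1) is convenient.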
Remark 4.1. (Connection to Solution Manifolds.)
Example 4 in [28] showed that any Euclidean density ridge defined in (1.1) is a concrete example of a solution manifold, i.e. the zero set of a vector-valued function. It is not difficult to verify that our directional density ridge defined in (4.3) also takes the general form of a solution manifold, where the defining vector-valued function is built from the projections of the Riemannian gradient onto the trailing eigenvectors of the Riemannian Hessian of the directional density. More importantly, our imposed conditions (A1–3) in the Euclidean ridge case and their directional counterparts imply all the required assumptions in [28], i.e. the differentiability of the defining function and the non-degeneracy of the normal space of the manifold; see (d) of Lemmas C.1 and G.1 in the Appendix. Therefore, the discussions about statistical properties and (normal) gradient flows of a generic solution manifold apply to the Euclidean and directional density ridges here.
Similar to the Euclidean case, we establish the following stability theorem for directional density ridges. To measure the distance between two directional ridges defined by two directional densities, we adopt the definition (3.4) of the Hausdorff distance between two sets in the ambient Euclidean space. Note that the Euclidean norm used in the definition (3.4) is upper bounded by the geodesic distance when the sets of interest lie on the unit hypersphere; see also (2.5). We will leverage this property in our proof of Theorem 4.1; see Appendix H for details.
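For intuition, the sketch below computes the Hausdorff distance between two finite point sets on the unit sphere under both the Euclidean (chord) and geodesic (arc) distances; since the chord length never exceeds the arc length, the Euclidean Hausdorff distance is bounded by the geodesic one, which is the property used above. The random point sets are purely illustrative.

```python
import numpy as np

def hausdorff(A, B, dist):
    """Hausdorff distance between finite point sets A, B under `dist`."""
    d = np.array([[dist(a, b) for b in B] for a in A])
    return max(d.min(axis=1).max(), d.min(axis=0).max())

euclid = lambda a, b: np.linalg.norm(a - b)                    # chord length
geodesic = lambda a, b: np.arccos(np.clip(a @ b, -1.0, 1.0))   # arc length

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
A /= np.linalg.norm(A, axis=1, keepdims=True)
B = rng.normal(size=(60, 3))
B /= np.linalg.norm(B, axis=1, keepdims=True)

# Chord <= arc pointwise on the unit sphere, and Hausdorff distance is
# monotone in the underlying metric, so the inequality always holds.
assert hausdorff(A, B, euclid) <= hausdorff(A, B, geodesic)
```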
Theorem 4.1.
Suppose that conditions (A1–3) hold for the directional density and that condition (A1) holds for its perturbation. When the perturbation is sufficiently small,
(a) conditions (A2–3) hold for the perturbed density.
(b)
.
(c)
for a constant
.
One natural estimator of the directional density ridge can be obtained by plugging the directional KDE into the definition (4.3) as
![]() |
To regularize the statistical behavior of the estimated directional ridge, we consider the following assumptions, which generalize conditions (E1–2):
- (D1) Assume that the kernel is a bounded and three times continuously differentiable function with a bounded fourth-order derivative on its domain, and that the stated integrability conditions hold for some constant.
- (D2) We assume that the associated class of kernel-based functions is a bounded VC (subgraph) class of measurable functions on the unit hypersphere; that is, there exist constants such that the usual polynomial covering number bound holds for any probability measure on the sphere and any small radius, where the covering numbers are computed with respect to the corresponding weighted L2-norm and an envelope function of the class.
The differentiability assumption in condition (D1) can be relaxed so that the kernel is (three times) continuously differentiable except on a set of points of Lebesgue measure zero. Conditions (D1) and (A1) are generally required for establishing the pointwise convergence rates of the directional KDE and its derivatives [43, 44, 51, 64, 120]. Under these two conditions, the quantity appearing in the adaptive step sizes of the directional mean shift or SCMS algorithm can also be shown to diverge as the sample size tends to infinity and the bandwidth tends to zero; see Section 4.2 for details. Condition (D2) regularizes the complexity of the kernel and its derivatives, as condition (E2) does, so that uniform convergence rates of the directional KDE and its derivatives hold; see (4.6) below. One can justify via integration by parts that the von Mises kernel and many compactly supported kernels satisfy conditions (D1–2).
Given conditions (D1–2), the techniques in [7, 43, 44, 51, 118, 120] can be utilized to demonstrate that
![]() |
(4.6) |
where the derivatives are taken with respect to the Riemannian connection (covariant derivative) on the unit hypersphere; see Section 5.3 in [1] and Chapter 4 in [69].
4.2 Mean Shift and SCMS Algorithm with Directional Data
Before deriving our directional SCMS algorithm, we first review the mean shift algorithm with directional data [62, 80, 113]; the formal derivation can be found in Section 3 of [118]. Given the directional KDE in (2.4), the directional mean shift vector can be defined as
![]() |
(4.7) |
Similar to the Euclidean mean shift vector (3.7), the directional mean shift vector also points toward the direction of maximum increase in the directional KDE after being projected onto the tangent space. Thus, the directional mean shift iteration translates a point along this vector, with an extra standardization to draw the shifted point back to the sphere.
Let {y_t} denote the sequence defined by the above directional mean shift procedure; later, by abuse of notation, we will use the same notation for the directional SCGA/SCMS sequence. As each update is parallel to a weighted sum of the data points, some simple algebra shows that the directional mean shift algorithm can be written as the following fixed-point iteration formula:

y_{t+1} = − [∑_{i=1}^n X_i · L′((1 − X_i^T y_t)/h²)] / ‖∑_{i=1}^n X_i · L′((1 − X_i^T y_t)/h²)‖, (4.8)

where y_t denotes the current iterate, L the kernel profile and h the bandwidth. From (4.8), it is also possible to write the directional mean shift algorithm as a gradient ascent method on the sphere with the iteration formula [116]:
![]() |
(4.9) |
where the adaptive step size is given by
![]() |
(4.10) |
Here, we denote the angle between the current point and its mean shift update (equivalently, the angle between the current point and the shifted point before standardization) as in Section 5.2 of [118], where detailed derivations can be found. Within small neighborhoods around the local modes of the directional KDE, the cosine of this angle is close to 1, and the adaptive step size is dominated by the remaining normalizing factor. The following lemma characterizes the asymptotic behavior of this factor and, consequently, of the adaptive step size.
Lemma 4.1. (Lemma 10 in [118]).
Assume conditions (D1) and (A1). For any fixed point, the stated divergence holds as the sample size tends to infinity and the bandwidth tends to zero, where the constant involved depends only on the kernel and the dimension.
Lemma 4.1 indicates that the relevant normalizing quantity diverges with probability tending to 1 as the sample size tends to infinity and the bandwidth tends to zero. The conclusion may seem counterintuitive at first glance, but one should be aware that the consistency of the directional KDE and its derivatives holds only in the tangential components; see (4.6). The radial component of the estimated total gradient, which is perpendicular to the tangent space, diverges, despite the fact that the true directional density does not have any radial component. Using Lemma 4.1, one can argue that the adaptive step size in (4.10) of the directional mean shift algorithm, viewed as a gradient ascent method on the sphere, tends to zero as the sample size tends to infinity and the bandwidth tends to zero.
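The fixed-point iteration (4.8) is straightforward to prototype. The sketch below uses the von Mises kernel profile L(r) = exp(−r), so that −L′(r) = exp(−r); the data-generating mechanism, bandwidth and iteration count are illustrative assumptions of ours rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, 0.0, 1.0])
# Noisy cloud around the pole mu, projected onto the unit sphere S^2.
X = rng.normal(size=(500, 3)) * 0.3 + mu
X /= np.linalg.norm(X, axis=1, keepdims=True)

def mean_shift_step(y, X, h):
    """One directional mean shift update, cf. the fixed-point form (4.8)."""
    w = np.exp(-(1.0 - X @ y) / h**2)      # -L'((1 - x_i^T y)/h^2) >= 0
    v = (w[:, None] * X).sum(axis=0)       # weighted sum of data points
    return v / np.linalg.norm(v)           # standardize back to the sphere

y = np.array([1.0, 0.0, 0.0])              # arbitrary starting point
for _ in range(100):
    y = mean_shift_step(y, X, h=0.5)

assert y @ mu > 0.95    # the iterates drift toward the high-density pole
```

Note that the update never leaves the sphere, matching the standardization step in (4.8).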
In the sequel, we use the same notation for the iterative sequence generated by our directional SCMS algorithm. There are two different ways to define a directional SCMS iteration; we will demonstrate that one of them is superior.
Method 1: As in the Euclidean SCMS algorithm, one can define the directional SCMS sequence by the directional mean shift vector (4.7) as
![]() |
(4.11) |
where the projection matrix onto the estimated subspace is the estimated version of its population counterpart defined by the directional KDE. Here, we plug in (4.7) and leverage the orthogonality between the columns of the estimated eigenvector matrix and the current point in the step marked (*).
Unlike the Euclidean SCMS algorithm, we need an extra standardization step to project the updated point back to the sphere, which leads to the following fixed-point iteration:
![]() |
(4.12) |
where the two components of the update are always orthogonal at every iteration; see Fig. 3 for a graphical illustration.
Fig. 3.
An illustration of one-step iterations under two candidate directional SCMS algorithms
Method 2: The fixed-point iteration formula (4.8) of the directional mean shift algorithm suggests a more efficient formulation of the directional SCMS algorithm as
![]() |
(4.13) |
where we replace the directional mean shift vector in (4.11) with the standardized total gradient estimator. This directional SCMS update is again a fixed-point iteration:
![]() |
(4.14) |
A direct computation demonstrates that, by the non-increasing property of the kernel and elementary facts about the weights involved, we have the inequality
![]() |
(4.15) |
Because the radial components in the directional SCMS iterative formulae (4.12) and (4.14) make no contribution to the movement of the iterates on the sphere, the inequality (4.15) indicates that the directional SCMS algorithm with iterative formula (4.14) takes a larger effective step in moving the SCMS sequence on the sphere. This helps accelerate the movement of points that are far away from the ridge or lie in regions where the estimated density is low. In this sense, the directional SCMS algorithm with iterative formula (4.14) is superior to (4.12); see Fig. 3 for a graphical demonstration. We thus choose Method 2 as our directional SCMS algorithm. Algorithm 2 in Appendix A provides the detailed steps for implementing Method 2 in practice.
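To illustrate the flavor of a directional SCMS-type update, the following simplified sketch projects a mean shift step onto the estimated minor eigenspace of the Riemannian Hessian of a von Mises KDE and renormalizes. It is a stand-in in the spirit of the iterations above, not the paper's exact Method 2 formula, and the dataset, bandwidth and starting point are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
t = rng.uniform(0.0, 2 * np.pi, size=800)
X = np.column_stack([np.cos(t), np.sin(t), np.zeros_like(t)])
X += rng.normal(scale=0.1, size=X.shape)
X /= np.linalg.norm(X, axis=1, keepdims=True)  # noisy great circle (equator)

h = 0.3

def scms_step(y):
    w = np.exp((X @ y - 1.0) / h**2)           # von Mises kernel weights
    grad = (w[:, None] * X).sum(axis=0)        # unnormalized total gradient
    hess = (X.T * w) @ X / h**2                # unnormalized total Hessian
    P = np.eye(3) - np.outer(y, y)             # tangent-space projector
    rhess = P @ (hess - (y @ grad) * np.eye(3)) @ P   # Riemannian Hessian
    V = np.linalg.eigh(rhess)[1][:, :1]        # most negative tangent direction
    m = (w @ X) / w.sum() - y                  # directional mean shift vector
    y_new = y + V @ (V.T @ m)                  # subspace constrained step
    return y_new / np.linalg.norm(y_new)       # standardize back to sphere

y = np.array([0.0, np.cos(np.pi / 12), np.sin(np.pi / 12)])  # 15 deg off ridge
for _ in range(100):
    y = scms_step(y)

assert abs(y[2]) < 0.1     # the iterate settles near the equatorial ridge
```

The key design choice mirrors the text: only the component of the shift lying in the minor eigenspace of the Hessian moves the point, so the iterate slides onto the ridge rather than onto a mode.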
Inspired by Proposition 2 in [46] for the Euclidean SCMS algorithm, we derive the ascending property of our directional SCMS algorithm (4.13) and two convergent results for stopping the algorithm in the following proposition. The proof is deferred to Appendix I, in which our argument is similar to but logically different from the proof of Proposition 2 in [46].
Proposition 4.2.
Assume that the directional kernel is non-increasing, twice continuously differentiable and convex. Given the directional KDE and the directional SCMS sequence defined by (4.13) or (4.14), the following properties hold:
(a) The estimated density values along the sequence are non-decreasing and thus converge.
(b) The norms of the principal Riemannian gradient estimates along the sequence converge to zero.
(c) If the kernel is also strictly decreasing on its support, then the distance between consecutive iteration points converges to zero.
Remark 4.2.
Our results (b) and (c) in Proposition 4.2 demonstrate that the stopping criterion of our directional SCMS algorithm can be based either on the norm of the principal Riemannian gradient estimator or on the (Euclidean) distance between two consecutive iteration points, where the latter requires a strictly decreasing kernel such as the von Mises kernel.
Motivated by the iterative formula (4.9) for the gradient ascent algorithm on the sphere, we consider writing our directional SCMS algorithm as a variant of the SCGA algorithm on the sphere with an iterative formula:
![]() |
(4.16) |
where the exponential map is taken at the current iterate and the step size is adaptive. Analogous to the Euclidean SCMS algorithm and its SCGA representation (3.11), the formulation (4.16) will reveal the (linear) convergence properties of our directional SCMS algorithm in the upcoming Section 4.3. To derive an explicit formula for the adaptive step size, we recall the fixed-point equation (4.14) of our directional SCMS algorithm and compute the geodesic distance between the current iterate and its one-step directional SCMS update as
![]() |
where, in the second equality, we equate the geodesic distance between consecutive iterates to the norm of the tangent vector inside the exponential map in (4.16). This suggests that our directional SCMS algorithm is a sample-based SCGA algorithm on the sphere with adaptive step size
![]() |
(4.17) |
for every iteration, where the primed angle denotes the angle between the current iterate and its one-step update. Note that the above derivation is based on the orthogonality between the current iterate and the principal Riemannian gradient estimator
![]() |
see Fig. 3 for a graphical illustration. When our directional SCMS algorithm approaches the estimated ridge, this angle tends to 0 and its cosine is approximately equal to 1. Thus, the step size is again controlled by the remaining normalizing factor, as in the directional mean shift scenario; see Equation (4.10). Therefore, Lemma 4.1 still applies to argue that the step size converges to 0 with probability tending to 1 as the sample size tends to infinity and the bandwidth tends to zero.
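For reference, the exponential map appearing in (4.16) admits a closed form on the unit sphere: Exp_x(v) = cos(‖v‖) x + sin(‖v‖) v/‖v‖ for a tangent vector v at x. This is a standard formula, sketched below with an illustrative tangent vector:

```python
import numpy as np

def exp_map(x, v):
    """Exponential map on the unit sphere: Exp_x(v) for tangent v at x."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return x
    return np.cos(t) * x + np.sin(t) * v / t

x = np.array([0.0, 0.0, 1.0])
v = np.array([np.pi / 2, 0.0, 0.0])       # tangent at x, length pi/2
y = exp_map(x, v)

# A quarter great circle lands on the equator, staying on the sphere.
assert np.allclose(y, [1.0, 0.0, 0.0], atol=1e-12)
assert abs(np.linalg.norm(y) - 1.0) < 1e-12
```

Because the geodesic step length equals ‖v‖, this is exactly the quantity matched to the adaptive step size in the derivation above.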
4.3 Linear Convergence of Population and Sample-Based SCGA Algorithms on the Sphere
As we have shown in (4.16) that our proposed directional SCMS algorithm is an example of the sample-based SCGA method on the sphere with the directional KDE as its objective and an adaptive step size, our main focus in this subsection will be on the (linear) convergence of such SCGA algorithms on the sphere. We first consider the population SCGA algorithm on the sphere defined by its iterative formula as
![]() |
(4.18) |
with a suitable choice of the step size. The sample-based version substitutes the subspace constrained Riemannian gradient with its estimator and generally has a constant step size; see (4.16). In the sequel, we distinguish the sequence defined by the population SCGA algorithm, whose objective is the true directional density, from the sequence defined by the sample-based SCGA algorithm, whose objective is the directional KDE.
Remark 4.3.
Note that the definition (4.18) of the SCGA algorithm is adaptive to any Riemannian manifold, not restricted to the unit hypersphere. The only requirement for (4.18) to be valid is that the exponential map is well defined within a small neighborhood of the origin of the tangent space at each point. More importantly, our assumptions (A1–3) and condition (A4) are generalizable to any smooth function supported on such a manifold, and our (linear) convergence results are applicable to the SCGA algorithm (4.18) on any Riemannian manifold whose sectional curvature is lower bounded by a real number; see one of the key lemmas in our proofs (Lemma I.1).
Similar to the SCGA algorithm in the Euclidean space, the following proposition demonstrates that the SCGA algorithm (4.18) on the sphere yields a non-decreasing sequence of objective function values and a SCGA sequence converging to the directional ridge, as long as the step size is sufficiently small.
Proposition 4.3. (Convergence of the SCGA Algorithm on the Sphere.)
For any SCGA sequence defined by (4.18) with a sufficiently small step size, the following properties hold:
(a) Under condition (A1), the sequence of objective function values is non-decreasing and thus converges.
(b) Under condition (A1), the norms of the subspace constrained Riemannian gradients along the sequence converge to zero.
(c) Under conditions (A1–3), the SCGA sequence converges to the directional ridge whenever the initial point lies within a convergence radius satisfying the stated bound, where one constant is defined in (h) of Lemma G.1 while the other is a quantity depending on both the dimension and the functional norm of the density up to its fourth-order (partial) derivatives.
The proof of Proposition 4.3 can be found in Appendix I. The upper bound for the convergence radius has the same meaning as in Proposition 3.1 for the Euclidean SCGA algorithm, ensuring that the distances from the SCGA sequence on the sphere to the directional ridge can be upper bounded by the norms of the principal Riemannian gradients at every iteration.
Corollary 4.1. (Convergence of the Directional SCMS Algorithm.)
When the fixed sample size is sufficiently large and the bandwidth is chosen to be correspondingly small, the following properties hold for the directional SCMS sequence with high probability under conditions (A1–3) and (D1–2):
(a) The directional KDE values along the sequence are non-decreasing and thus converge.
(b) The norms of the principal Riemannian gradient estimates along the sequence converge to zero.
(c) The directional SCMS sequence converges to the estimated directional ridge whenever the initial point lies within the convergence radius defined in (c) of Proposition 4.3.
Corollary 4.1 should also be considered as the convergence results of the sample-based SCGA algorithm on the sphere. To justify Corollary 4.1, we know from Theorem 4.1 that conditions (A1–3) also hold with high probability for the directional KDE and its estimated directional ridge when the sample size is sufficiently large and the bandwidth is small enough. Further, by Lemma 4.1, the adaptive step size of our directional SCMS algorithm can be smaller than the threshold value in Proposition 4.3 but also universally bounded away from zero with respect to the iteration number, given a sufficiently large but fixed sample size and a sufficiently small bandwidth; recall our Remark 3.2. As a result, Corollary 4.1 follows from Proposition 4.3. Notice that the statements in Proposition 4.2 are essentially the same as the results (a–b) in Corollary 4.1 here. However, similar to Proposition 2 in [46] for the Euclidean SCMS algorithm, Proposition 4.2 for the directional SCMS algorithm is established under the convexity assumption on the directional kernel and holds for any sample size and bandwidth. On the contrary, the results (a–b) in Corollary 4.1 are asymptotic and probabilistic properties, for which we require the sample size to tend to infinity and the bandwidth to zero.
According to Proposition 4.3 and Corollary 4.1, we can speak of the limiting points of the population and sample-based SCGA algorithms on the sphere. The definition of the linear convergence of any convergent sequence on the sphere (or an arbitrary Riemannian manifold) is similar to the one in the flat Euclidean space (see Definition 3.2), except that the Euclidean distance is replaced with the geodesic distance; see Section 4.5 in [1].
Using the notation in [116], we introduce an auxiliary curvature-dependent function of the step size and the distance to the limiting point. Given that the sectional curvature is constant on the unit hypersphere, one can show by differentiation that this function is strictly increasing in its distance argument and bounded for any admissible step size. Analogous to the Euclidean SCGA algorithms, we will establish the linear convergence of the SCGA sequence on the sphere (or any Riemannian manifold whose sectional curvature is lower bounded by a real number), as well as its sample-based version, under the following local condition.
- (A4) (Quadratic Behaviors of Residual Vectors) We assume that the SCGA sequence on the sphere, with its step size and limiting point as above, satisfies the stated quadratic bounds for some constant, where a second constant is the one defined in condition (A2) and the bounds are expressed through the logarithmic map.
Condition (A4) serves as a generalization of its Euclidean counterpart to the sphere, and again requires a quadratic behavior of the residual vector within the tangent space. Under this condition, the objective (density) function is ‘subspace constrained geodesically strongly concave’ around the directional ridge; see also Remark 4.4. Some discussions about potentially weaker assumptions that imply condition (A4) in Appendix E are also applicable in the manifold setting after some modifications; see Remark E.1. One intuitive example in which condition (A4) holds is presented in the second row of Fig. 5, where the directional SCMS/SCGA iterative vector is always orthogonal to the residual space around the (estimated) ridge on the sphere.
Fig. 5.
Density ridges estimated by the directional SCMS algorithm performed on the two simulated datasets and their (linear) convergence plots. Horizontally, the first row displays the results on the simulated vMF mixture dataset, while the second row presents the results on the simulated circular dataset on the sphere. Vertically, the first column includes plots with the directional KDE, estimated ridges and trajectories of directional SCMS sequences from two (randomly) chosen initial points. The second and third columns present the convergence plots for the log-distances of points in the highlighted sequences (indicated by hollow cyan points) to their limiting points or the estimated ridges.
Theorem 4.4. (Linear Convergence of the SCGA Algorithm on the Sphere.)
Assume conditions (A1–4) throughout the theorem.
(a) Q-Linear convergence of the population SCGA sequence: Consider a convergence radius satisfying the stated upper bound, in which one constant is defined in (h) of Lemma G.1 and another is a quantity defined in (c) of Proposition 4.3 that depends on both the dimension and the functional norm of the density up to its fourth-order (partial) derivatives. Whenever the step size is below the stated threshold and the initial point lies within this radius of the directional ridge, the population SCGA sequence converges Q-linearly to its limiting point.
(b) R-Linear convergence of the distance to the directional ridge: Under the same radius as in (a), whenever the step size is below the stated threshold and the initial point lies within this radius of the directional ridge, the (geodesic) distances from the population SCGA sequence to the ridge converge R-linearly to zero.
We further assume conditions (D1–2) in the rest of the statements, with the sample size tending to infinity and the bandwidth tending to zero at suitable rates.
(c) Q-Linear convergence of the sample-based SCGA sequence: Under the same radius and step size threshold as in (a), the sample-based SCGA sequence converges Q-linearly to its limiting point with probability tending to 1, under the same requirements on the step size and the initial point relative to the estimated directional ridge.
(d) R-Linear convergence of the distance to the estimated directional ridge: Under the same radius and step size threshold as in (a), the (geodesic) distances from the sample-based SCGA sequence to the estimated directional ridge converge R-linearly to zero with probability tending to 1, under the same requirements on the step size and the initial point.
The detailed proof of Theorem 4.4 is in Appendix I. The theorem illuminates both the step size requirement and the convergence radius for the linear convergence of SCGA algorithms on the sphere. Similar to the Euclidean SCGA algorithms in Theorem 3.3, the upper bound of the convergence radius consists of three quantities adopted from Proposition 4.3 and a quantity controlling the ‘subspace constrained geodesically strong concavity’ around the directional ridge.
Remark 4.4.
Similar to the Euclidean SCGA algorithms, the geodesically strong concavity assumption [116] on the objective function is not sufficient to prove the linear convergence of the SCGA algorithm (4.18) on the sphere. We instead establish the following ‘subspace constrained geodesically strong concavity’ under the mild conditions (A1–4):
(4.19) for some constant, where the projection is generally taken onto the principal subspace. In fact, the most critical factors for establishing this property are the eigengap condition (A2) and the quadratic behaviors of residual vectors stated in condition (A4).
Corollary 4.2. (Linear Convergence of the Directional SCMS Algorithm.)
Assume conditions (A1–4) and (D1–2). When the fixed sample size is sufficiently large and the fixed bandwidth is chosen to be sufficiently small, there exists a convergence radius such that the directional SCMS sequence satisfies the stated linear convergence property with high probability, whenever the initial point lies within this radius of the estimated directional ridge.
We also identify Corollary 4.2 as the linear convergence of the sample-based SCGA algorithm on the sphere to the estimated directional ridge defined by the directional KDE. The corollary can be justified by noticing that, under conditions (D1–2) and the uniform bounds (4.6), the directional KDE satisfies conditions (A1–3) with probability tending to 1 as the sample size tends to infinity and the bandwidth tends to zero; see Theorem 4.1. With this fact, one can leverage our argument in (a) of Theorem 4.4 to prove the linear convergence of the sample-based SCGA algorithm on the sphere with a fixed, sufficiently small step size. Additionally, when the fixed sample size is sufficiently large and the bandwidth is chosen to be accordingly small, the adaptive step size of our directional SCMS algorithm in (4.17) always falls below the threshold value for linear convergence by Lemma 4.1, but is also bounded away from zero; recall Remark 3.2. Taking the infimum of the adaptive step size with respect to the iteration number, under a fixed sample size and bandwidth, yields our results in Corollary 4.2.
5. Experiments
In this section, we first validate our linear convergence results of both Euclidean and directional SCMS algorithms on some simulated datasets. Then, we apply these two algorithms to a real-world earthquake dataset so as to identify its density ridges and compare the estimated ridges with boundaries of tectonic plates and fault lines, on which earthquakes are known to happen frequently.
We leverage the Gaussian kernel profile in the Euclidean SCMS algorithm and the von Mises kernel in the directional SCMS algorithm. In addition, the logarithms of the estimated densities are utilized in our actual implementations (Step 2 in Algorithms 1 and 2 in Appendix A) of the Euclidean and directional SCMS algorithms because of two advantages. First, using the log-density in the Euclidean SCMS algorithm leads to faster convergence [46]; see our empirical illustration in Fig. A7. Second, estimating a hidden manifold with a density ridge defined by a log-density stabilizes the valid region for a well-defined ridge, compared with the corresponding ridge defined by the original density; see Theorem 7 (the surrogate theorem) in [45].
Unless stated otherwise, we set the default bandwidth parameter of the Euclidean SCMS algorithm via the normal reference rule in [20, 25], which is
![]() |
(5.1) |
where the rule involves the sample standard deviation along each coordinate and the (Euclidean) dimension of the data.
As mentioned by [25], there are two advantages of applying the normal reference rule (5.1) in our context. First, the resulting KDE tends to oversmooth [97], because the bandwidth minimizes the asymptotic MISE for estimating the first-order derivatives of a multivariate Gaussian density; see Corollary 4 in [20]. More importantly, the Euclidean SCMS algorithm with an oversmoothed KDE does not produce too many spurious ridges. Second, compared with cross-validation methods, the normal reference rule is easy to compute in practice, especially when the dimension of the data is high.
is easy to compute in practice, especially when the dimension of data is high. The default bandwidth parameter of the directional SCMS algorithm is selected via the rule of thumb in Proposition 2 of [43], which optimizes the asymptotic MISE for a
distribution. The concentration parameter
is estimated by Equation (4.4) in [9]. That is,
![]() |
(5.2) |
where the mean resultant length, i.e. the norm of the sample mean of the directional dataset, appears in the estimate, and we recall that the modified Bessel function of the first kind enters the underlying von Mises–Fisher density. As a highly concentrated von Mises–Fisher distribution behaves like a Gaussian distribution on the sphere, choosing the bandwidth based on (5.2) also helps smooth out the resulting directional KDE.
for any SCMS algorithm.
5.1 Simulation Study on the Euclidean SCMS Algorithm
To evaluate the algorithmic rate of convergence of the Euclidean SCMS algorithm (Algorithm 1), we generate the first simulated dataset by randomly drawing 1000 data points from a Gaussian mixture model with density
, where
,
and
. Another simulated dataset consists of 1000 data points randomly generated from an upper half circle with radius 2 and i.i.d. Gaussian noises
. When applying Algorithm 1 with the estimated log-density on each of these two simulated datasets, we choose the set of initial mesh points as the simulated dataset itself and, to obtain a cleaner ridge structure, remove from the mesh those initial points whose density values are below 25% of the maximum density.
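The two Euclidean simulated datasets can be generated along the following lines; the mixture means, weights, covariance, and noise scale below are illustrative assumptions, since the exact parameter values are given in the (unrendered) equations above.

```python
import numpy as np

rng = np.random.default_rng(123)

def sample_gaussian_mixture(n, means, weights, cov=np.eye(2)):
    """Draw n points from a Gaussian mixture with the given component
    means and mixing weights (illustrative stand-in for Dataset 1)."""
    labels = rng.choice(len(means), size=n, p=weights)
    return np.array([rng.multivariate_normal(means[k], cov) for k in labels])

def sample_half_circle(n, radius=2.0, noise_sd=0.2):
    """Draw n points on an upper half circle of the given radius plus
    i.i.d. Gaussian noise (illustrative stand-in for Dataset 2)."""
    theta = rng.uniform(0.0, np.pi, size=n)
    pts = radius * np.column_stack([np.cos(theta), np.sin(theta)])
    return pts + rng.normal(scale=noise_sd, size=(n, 2))

def filter_mesh(mesh, density_values, frac=0.25):
    """Keep only mesh points whose estimated density is at least
    frac * (maximum density), as in the denoising step above."""
    return mesh[density_values >= frac * density_values.max()]
```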
Figure 4 presents the Euclidean KDE plots, estimated density ridges from the Euclidean SCMS algorithm and their (linear) convergence plots on the two simulated datasets. The linear trends of those plots in the second and third columns of Fig. 4 empirically demonstrate the correctness of our Theorem 3.1 and Corollary 3.2 about the linear convergence of the Euclidean SCMS algorithm.
Fig. 4.

Density ridges estimated by the Euclidean SCMS algorithm on the two simulated datasets and their (linear) convergence plots. Horizontally, the first row displays the results of the simulated Gaussian mixture dataset, while the second row presents the results of the half circle simulated dataset. Vertically, the first column includes plots with Euclidean KDE, estimated ridges, and trajectories of SCMS sequences from two (randomly) chosen initial points. The second and third columns present the (linear) convergence plots for the log-distances of points in the highlighted sequences (indicated by hollow cyan points) to their limiting points or the estimated ridges.
5.2 Simulation Study on the Directional SCMS Algorithm
Analogous to our simulation study for the linear convergence of the Euclidean SCMS algorithm, we verify the linear convergence of our directional SCMS algorithm (Algorithm 2) on two different simulated datasets. One of them comprises 1000 data points randomly generated from a vMF mixture model
with
,
and
. The other simulated dataset is identical to the example in the right panel of Fig. 1 and the underlying dataset in Fig. B9, which consists of 1000 randomly sampled points from a circle connecting two poles on
with i.i.d. additive Gaussian noises
to their Cartesian coordinates and additional
normalization onto
. In our implementation of Algorithm 2 with the directional log-density on the two simulated datasets, we also set each initial mesh as the dataset itself and remove from each mesh those points whose density values are below 10% of the maximal density value.
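The second directional dataset can be generated as sketched below; placing the circle in the x-z plane and the noise scale are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_noisy_polar_circle(n, noise_sd=0.2):
    """Sample n points on a great circle through the poles of S^2 (here
    the circle in the x-z plane), add i.i.d. Gaussian noise to the
    Cartesian coordinates, then L2-normalize back onto S^2."""
    t = rng.uniform(0.0, 2.0 * np.pi, size=n)
    circle = np.column_stack([np.cos(t), np.zeros(n), np.sin(t)])
    noisy = circle + rng.normal(scale=noise_sd, size=(n, 3))
    return noisy / np.linalg.norm(noisy, axis=1, keepdims=True)
```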
Figure 5 shows the directional KDE plots, estimated density ridges on
from the directional SCMS algorithm and their (linear) convergence plots on the aforementioned simulated datasets. The linearly decreasing trends in the convergence plots, possibly after a few initial iterations, illustrate the local linear convergence of the directional SCMS algorithm that we proved in Theorem 4.4 and Corollary 4.2. Note that the minor perturbations at the tails of some linear convergence plots in Fig. 5 are due to numerical precision errors.
5.3 Density Ridges on Earthquake Data
It is well known that earthquakes on Earth tend to strike more frequently along the boundaries of tectonic plates and fault lines (i.e. sections where parts of a plate, or two plates, move in different directions); see [54, 103] for more details. We analyze earthquakes with magnitudes of 2.5+ occurring between 2020-10-01 00:00:00 UTC and 2021-03-31 23:59:59 UTC, which can be obtained from the Earthquake Catalog (https://earthquake.usgs.gov/earthquakes/search/) of the United States Geological Survey. The dataset
contains 15049 earthquakes worldwide in this half-year period.
The normal reference rule (5.1) leads to the bandwidth parameter
and the rule of thumb (5.2) yields
under the earthquake dataset
. However, as these bandwidths lead to oversmoothed density estimates, we decrease the bandwidths for the Euclidean and directional SCMS algorithms to
and
respectively, in order to detect more ridge structures. We generate 5000 points uniformly on the sphere
as the initial mesh points.
To compare the earthquake ridges obtained by the Euclidean and directional SCMS algorithms with the boundaries of tectonic plates, we download the boundary geometry file of the 56 tectonic plates from https://www.kaggle.com/cwthompson/tectonic-plate-boundaries according to the models of [5, 13] and overlap them with the estimated ridges in Fig. 6. The results suggest that the ridges identified by the Euclidean and directional SCMS algorithms on the earthquake dataset coincide with the boundaries of tectonic plates to a large extent. Note that the Euclidean and directional ridges on the earthquake dataset
do not differ much, because most of the observed earthquakes occur in the low-latitude region (
) where most of the world's population lives. Yet, the ridges estimated by our proposed directional SCMS algorithm do align better with the boundary of the Eurasian Plate near the North Pole than the ones estimated by the Euclidean SCMS algorithm, which confirms the advantage of our directional SCMS algorithm in the high-latitude region; see also Appendix B for a more in-depth analysis.
Fig. 6.

Comparisons between density ridges obtained by the Euclidean SCMS algorithm on angular coordinates and the directional SCMS algorithm on Cartesian coordinates from the earthquake dataset. On each panel, the ground-truth boundaries of tectonic plates are plotted as blue curves.
We further quantify the performances of earthquake ridges
and
estimated by the Euclidean and directional SCMS algorithms from two different perspectives. First, given the fact that an estimated ridge should lie on the region where earthquakes happen more intensively, we compute the mean geodesic distances from each point in the earthquake dataset
to the ridges
and
, respectively, as
![]() |
where
is the number of earthquakes in the dataset. The ridge
estimated by our directional SCMS algorithm is around 4% closer to the earthquakes in
on average. Second, we assess the estimation errors of
and
with respect to the boundaries of tectonic plates. To this end, we view the surface of the Earth as a unit sphere
and define a manifold-recovering error measure [119] between the set of boundary points
and an estimated ridge
as
![]() |
(5.3) |
where
and
are the cardinalities of
and
, respectively. Note that although the density ridge
and the boundaries of tectonic plates
are continuous structures in theory, they are generally represented by sets of discrete points in practice. That is why we can calculate their cardinalities without computing complicated integrals. Moreover, the manifold-recovering error measure is an average between the mean geodesic distances from each point in
to
and from each point in
to
. We define such a balanced error measure to avoid biasing toward an estimated ridge
that only approximates a small portion of
with high accuracy but fails to cover other parts of
; see Fig. 4 in [119] for an illustrative example. The manifold-recovering error measures of the ridges
and
estimated by the Euclidean and directional SCMS algorithms with respect to the boundaries of tectonic plates
are
![]() |
Our directional SCMS algorithm again reduces the estimation error by around 3.9%. In summary, the earthquake ridges yielded by our directional SCMS algorithm are not only closer to the earthquakes on average than the ones identified by the Euclidean SCMS algorithm but also have a lower error in approximating the boundaries of tectonic plates.
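Both error measures admit a direct implementation once the ridges and boundaries are represented as finite sets of unit vectors, as discussed above; the function names below are ours.

```python
import numpy as np

def geodesic_dist(X, Y):
    """Pairwise great-circle distances between unit vectors (rows of X, Y)."""
    cos = np.clip(X @ Y.T, -1.0, 1.0)
    return np.arccos(cos)

def mean_geodesic_to_ridge(data, ridge):
    """Mean geodesic distance from each data point to its nearest ridge point."""
    return geodesic_dist(data, ridge).min(axis=1).mean()

def manifold_recovering_error(boundary, ridge):
    """Symmetrized error measure in the spirit of (5.3): average of the two
    one-sided mean nearest-neighbour geodesic distances, following [119]."""
    D = geodesic_dist(boundary, ridge)
    return 0.5 * (D.min(axis=1).mean() + D.min(axis=0).mean())
```

The symmetrization penalizes a ridge that tracks only a small portion of the boundary, exactly as argued above.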
6. Discussions
In this paper, we have provided a rigorous proof for the linear convergence of the well-known SCMS algorithm by viewing it as an example of the SCGA algorithm. We have also generalized the definition of density ridges from the usual densities supported on compact sets in
to the directional densities supported on
with nonzero curvature. The stability theorem of directional density ridges has been established, and the linear convergence of our proposed directional SCMS algorithm has been proved. Table 1 summarizes the frameworks of considering the (directional) mean shift/SCMS algorithms as gradient ascent/SCGA methods (on
) and our results of asymptotic convergence rates of their corresponding step sizes.
Table 1.
Comparisons between the Euclidean and directional mean shift (MS) or SCMS algorithms and summary of the asymptotic convergence rates of their adaptive step sizes when viewed as GA/SCGA algorithms in
or on
.
Our theoretical analyses of the SCGA algorithm in the Euclidean space
and on the unit hypersphere
have potential implications beyond proving the linear convergence of SCMS algorithms. In the optimization literature [1, 77, 78, 116], it is well known that a standard gradient ascent method (on a smooth manifold) will converge linearly given an appropriate step size when the objective function is smooth and (geodesically) strongly concave. However, as we have discussed in Remarks 3.3 and 4.4, the smoothness and (geodesically) strong concavity assumptions are not sufficient for the linear convergence of the SCGA algorithms. Therefore, identifying density ridges with the SCGA algorithms is not only a nonconvex optimization problem, but also fundamentally more complex than standard gradient ascent methods. The assumptions and proof arguments developed in this paper may give some insights into the linear convergence of the SCGA algorithms with other forms of subspace constrained gradients.
There are still many open problems related to the SCMS algorithm. First, a central issue in determining the performance of an SCMS algorithm is the bandwidth selection. There is a variety of bandwidth selection mechanisms available to the Euclidean KDE and its derivatives in the literature [20, 96], but it is unclear how they can be applied to the SCMS algorithm. We plan to specialize or generalize such techniques to the SCMS algorithm for both Euclidean and directional data. Second, our definition of density ridges is generalizable to any density supported on an arbitrary Riemannian manifold. As [56] has formulated the principal curve on a Riemannian manifold based on its classical definition in [55], it will be interesting to propose a new definition of principal curves from the perspective of density ridges on Riemannian manifolds and derive a more general SCMS algorithm, possibly based on some existing nonlinear mean shift methods on manifolds [104, 105].
Data Availability Statement
The data and code underlying this paper are available at https://github.com/zhangyk8/EuDirSCMS. Specifically, the earthquake data in Section 5.3 were obtained from the Earthquake Catalog (https://earthquake.usgs.gov/earthquakes/search/) of the United States Geological Survey.
Supplementary Material
Acknowledgment
We thank the anonymous reviewers for their helpful comments that improved the quality of this paper.
Funding
Y.C. is supported by the National Science Foundation [DMS-1952781 and DMS-2112907] and CAREER award [DMS-2141808], and the National Institutes of Health [U24-AG072122].
A. Algorithmic Summaries of Euclidean and Directional SCMS Algorithms
In this section, we provide algorithmic summaries of the Euclidean and directional SCMS algorithms for practical reference. Algorithm 1 describes each step of the Euclidean SCMS algorithm in detail. In our actual implementation of the algorithm, we replace the density estimator
with
. To demonstrate that the (directional) SCMS algorithms under the log-density implementation give rise to a faster convergence process, we repeat our experiments in Sections 5.1 and 5.2 (i.e. Figs 4 and 5) 20 times for each simulated dataset with the (directional) SCMS algorithms under the original (estimated) density and the (estimated) log-density, respectively. The comparisons between their running times are shown in Fig. A7, in which the (directional) SCMS algorithms under the log-density implementation clearly outperform their counterparts with the original density in terms of the average elapsed time until convergence.
Fig. A7.

Running time comparisons between the (directional) SCMS algorithms with the original density and the log-density applied to our simulated datasets in Figs 4 and 5.
Additionally, when the observational data in practice are noisy, it is common to incorporate an extra denoising step before Step 2 of Algorithm 1 to remove observations in low-density areas and stabilize the (Euclidean) SCMS algorithm; see [23, 45] for comparative studies that demonstrate the significance of denoising.
We summarize the directional SCMS algorithm in Algorithm 2. Note that in Step 2-1 of Algorithm 2, we compute the scaled versions
and
for
because the estimated principal Riemannian gradient
and Hessian
are often very small. The scaling stabilizes the numerical computation. The spectral decomposition is thus performed on the scaled Hessian estimator
, and the scaled principal Riemannian gradient estimator is calculated as
![]() |
where
has its columns equal to the orthonormal eigenvectors associated with the
smallest eigenvalues of the scaled Hessian estimator
(or equivalently,
) inside the tangent space
.
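The spectral decomposition and projection in Step 2-1 can be sketched as follows. This is a simplified illustration with our own function name, not the exact scaled implementation of Algorithm 2: we project the Hessian onto the tangent space, discard the radial eigenvector, and project the gradient onto the span of the retained tangent eigenvectors.

```python
import numpy as np

def projected_gradient_step(x, grad, hess, d=1):
    """Project the (Riemannian) gradient at a unit vector x on S^q onto
    the span of the d eigenvectors of the tangent-space Hessian with the
    smallest eigenvalues -- a simplified sketch of Step 2-1 of Algorithm 2."""
    # Restrict to the tangent space by projecting out the radial direction.
    P = np.eye(len(x)) - np.outer(x, x)
    H_t = P @ hess @ P
    vals, vecs = np.linalg.eigh(H_t)  # eigenvalues in ascending order
    # Keep eigenvectors lying in the tangent space (orthogonal to x),
    # then take the d with the smallest eigenvalues.
    tangent = [v for v in vecs.T if abs(v @ x) < 1e-8]
    V = np.column_stack(tangent[:d])
    return V @ V.T @ (P @ grad)
```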
B. Limitations of Euclidean KDE in Handling Directional Data
In this section, we demonstrate with examples and simulation studies that it is inadequate to analyze angular or directional data with Euclidean KDE (2.1) and SCMS algorithm (Algorithm 1). Consider a directional data sample
generated from a directional density
on
. In real-world applications, the random observations
on
are commonly represented by their angular coordinates
with
or equivalently,
for
, where
are longitudes and
are latitudes.
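The two coordinate systems are related by the usual conversion, sketched below for reference (longitude and latitude in radians):

```python
import numpy as np

def angular_to_cartesian(lon, lat):
    """(longitude, latitude) in radians -> unit vectors in R^3 (rows)."""
    return np.column_stack([
        np.cos(lat) * np.cos(lon),
        np.cos(lat) * np.sin(lon),
        np.sin(lat),
    ])

def cartesian_to_angular(X):
    """Unit vectors in R^3 (rows) -> (longitude, latitude) in radians."""
    lon = np.arctan2(X[:, 1], X[:, 0])
    lat = np.arcsin(np.clip(X[:, 2], -1.0, 1.0))
    return lon, lat
```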
B.1 Case I: Density Estimation
As the angular coordinates
of the directional dataset
have their ranges in a subset
of the flat Euclidean space
, it is tempting to apply the Euclidean KDE on
to construct a density estimator as
![]() |
(B1) |
where
uses a radial symmetric kernel with profile
, and
leverages a product kernel. However, the Euclidean KDEs in (B1) (both
and
) exhibit two potential drawbacks when dealing with directional data.
First,
in (B1) is an estimator of the directional density
under its angular representation
. Here,
is
-periodic in its first coordinate and
-periodic in its second coordinate. Then, the bias of
in estimating
is
![]() |
where
and
is the Laplacian of
; see [27] for details. However, the second-order partial derivative
along the lines of constant latitude (or parallels) would tend to infinity as we approach the north and south poles, given that the first-order partial derivative
is bounded. One way to see this is that the curvatures of these parallels, which are the reciprocals of their radii, tend to infinity as these radii shrink. In addition, one should recall that the curvature of a function
is defined as
. Therefore, applying (B1) to estimate the angular representation
of the directional density
will produce high bias as the estimator
approaches the high-latitude regions (around the north and south poles); see also Panel (c) of Fig. B9.
Second, the Euclidean KDE
leverages the Euclidean distances between any query point
and observations
under their angular coordinates to construct the density estimates, instead of using the (intrinsic) geodesic distances. Note that the Euclidean distance in the angular coordinate system is not equivalent to the Euclidean distance in the ambient Euclidean space
containing the directional data on
. As a result, some observations that have dramatically different geodesic distances to density query points can have the same density contributions in
, as illustrated in Example B.1.
Example B.1.
Suppose that we want to estimate the density values at
and
, where
is of a small value. Consider a random sample consisting of only two observations
and
. If we use the Euclidean distance, the distance between
and the distance between
are the same. Therefore, when we use the Euclidean KDE
to estimate the underlying density, the contribution of
to
will be the same as the contribution of
to
. Nevertheless, their geodesic distances are very different, because
while
is a quantity close to zero; see Fig. B8 for a graphical illustration. This explains, from a different angle, why the Euclidean KDE
will have a large bias in estimating the underlying density when the query point
is within the high latitude region.
Fig. B8.

Graphical illustration of geodesic distances between
and
as well as
and
.
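A small numerical check of the phenomenon in Example B.1, using illustrative points near the North Pole and on the equator of our own choosing (the exact points of the example are not reproduced here):

```python
import numpy as np

def to_cart(lon, lat):
    """(longitude, latitude) in radians -> a unit vector in R^3."""
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def geodesic(u, v):
    """Great-circle distance between two unit vectors."""
    return float(np.arccos(np.clip(u @ v, -1.0, 1.0)))

eps = 0.01  # radians away from the North Pole
# Pair A: two points near the pole, 180 degrees apart in longitude.
a1, a2 = (0.0, np.pi / 2 - eps), (np.pi, np.pi / 2 - eps)
# Pair B: two points on the equator, 180 degrees apart in longitude.
b1, b2 = (0.0, 0.0), (np.pi, 0.0)

# Same Euclidean distance in angular coordinates ...
d_ang_A = np.hypot(a2[0] - a1[0], a2[1] - a1[1])
d_ang_B = np.hypot(b2[0] - b1[0], b2[1] - b1[1])
# ... but vastly different geodesic distances on the sphere:
d_geo_A = geodesic(to_cart(*a1), to_cart(*a2))  # about 2 * eps
d_geo_B = geodesic(to_cart(*b1), to_cart(*b2))  # pi (antipodal points)
```

Both pairs are at angular-coordinate Euclidean distance $\pi$, yet the near-pole pair is geodesically only about $2\varepsilon$ apart, so the Euclidean KDE on angular coordinates treats geodesically close and far observations identically.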
B.2 Case II: Ridge-Finding Problem
Consider the following simulated example of identifying a density ridge via the Euclidean SCMS algorithm (Algorithm 1) and our proposed directional SCMS algorithm (Algorithm 2). We generate 1000 data points
uniformly from a great circle connecting the North and South Poles of
with some i.i.d. additive Gaussian noises
to their Cartesian coordinates. Then, all the simulated points will be standardized back to
via
normalization. The angular coordinates of these simulated points are denoted by
accordingly. Figure B9 presents the result of applying both the Euclidean SCMS algorithm (with the Gaussian kernel) to angular coordinates and the directional SCMS algorithm (with the von Mises kernel) to Cartesian coordinates of our simulated dataset. As shown in the panel (b) of Fig. B9, the Euclidean SCMS algorithm exhibits high bias in estimating the true circular structure near two poles of
, while our directional SCMS algorithm is able to recover the true circular structure with negligible error. The density plot in panel (c) of Fig. B9 exhibits two nonsmooth peaks at the North Pole because the Hessian of the underlying density diverges in its angular coordinates; recall our discussion in Section B.1. This also explains the erratic behavior of the Euclidean KDE in high-latitude regions.
Fig. B9.

Euclidean and directional SCMS algorithms performed on the simulated dataset. Panels (a)–(c): Outcomes of the Euclidean SCMS algorithm with the contour plot for the Euclidean KDE. Panels (d)–(f): Outcomes of our directional SCMS algorithm with the contour plot for the directional KDE. Panels (a)–(b) and (d)–(e) are shown in the view of Hammer projections (page 160 in [100]), while Panels (c) and (f) are presented under the orthographic projections.
At this point, some readers may have a natural concern: why not directly apply the Euclidean SCMS algorithm to the Cartesian coordinates
of the available data points? We discuss the potential downsides of this approach from two different aspects.
The Euclidean SCMS algorithm is not intrinsically designed for handling the directional data
. Directly applying the algorithm to these Cartesian coordinates leads to an estimated ridge not lying on
. While the
normalization is able to standardize the ridge points back to
, this standardization process will inevitably introduce extra bias. When estimating the underlying density of
, we know from (3.5) and some KDE literature [20, 27, 96] that the (uniform) rates of convergence of the Euclidean KDE and its derivatives depend on the dimension
of the ambient space instead of the intrinsic dimension
of directional data. This dimensionality effect also appears in the (linear) convergence of the downstream SCMS algorithm, which, for instance, shrinks the upper bounds of the (linear) convergence radius and step size threshold in Theorem 3.1. Thus, analyzing directional data
with the Euclidean KDE and SCMS algorithm will slow down the statistical and algorithmic rates of convergence of the density estimators as well as lower the accuracy of the resulting ridge in recovering the underlying structure inside the dataset.
To support our above explanations, we extend our simulation study in Fig. B9 as follows. We vary the maximum latitude attained by the underlying (intrinsic) circular structure on
from
to
while keeping the circle parallel to the original great circle connecting the North and South Poles of
; see the panel (a) in Fig. B10 for an illustration. For each of these underlying circles, we follow the same sampling scheme as in Fig. B9, i.e. sampling 1000 points uniformly on the circle with some i.i.d. additive Gaussian noises
to their Cartesian coordinates and
normalization back to
. The Cartesian coordinates of the simulated points from each circular structure are denoted by
while their angular coordinates are represented by
. Then, we apply our directional SCMS algorithm to
from each of these simulated datasets. Moreover, the Euclidean SCMS algorithm is applied to both the angular coordinates
and Cartesian coordinates
from each of these simulated datasets, where we consider
as a dataset in the ambient space
in the latter case. Here, the sets of initial points for the Euclidean and directional SCMS algorithms are the simulated datasets themselves. Finally, we compute the average geodesic distance errors on
from the resulting ridges to the corresponding true circular structures. To reduce the randomness of our simulation studies, we also repeat the above sampling and experimental procedures 20 times for each true circular structure.
Fig. B10.

Euclidean and directional SCMS algorithms applied to the simulated datasets whose true structures are circles on
attaining their maximum latitudes from
to
, respectively. The dots on each line plot in the panels (b–d) are the means of the associated statistics for the repeated experiments, while the error bars indicate their corresponding standard deviations.
We present our comparisons of the Euclidean and directional SCMS algorithms based on three metrics in Fig. B10: (i) average geodesic distance errors between the estimated ridges and the true circular structures, (ii) the number of iteration steps and (iii) the running time. Notice that, as the latitudes of the underlying circular structures increase, the distance errors of (Euclidean) ridges based on the Euclidean SCMS algorithm applied on the angular coordinates
rise. Conversely, the distance errors of directional ridges and the ridges based on the Euclidean SCMS algorithm in
decrease when the true circular structures climb on
; see the panel (b) of Fig. B10. While the performances of our directional SCMS algorithm and the Euclidean SCMS algorithm in
are almost indistinguishable in terms of the average geodesic distance errors, our directional SCMS algorithm significantly outperforms the Euclidean SCMS algorithm with regard to time efficiency; see the panels (c–d) of Fig. B10. Note that the Euclidean SCMS algorithm exhibits high variance in the number of iteration steps under the repeated experiments, because each simulated dataset may contain some outliers that are far away from the true circular structure on
and the Euclidean SCMS algorithm requires an exceptionally large number of iterations to converge when initialized from these outliers. Our directional SCMS algorithm, however, is more stable in its iteration count because it adapts to the geometry of
.
Other potential issues of analyzing directional data with Euclidean methods and ignoring the curvature of
can be found in [30]. In summary, it is inadequate and inefficient to handle directional data with the Euclidean KDE and SCMS algorithm, which motivates the introduction of the directional KDE (2.4) and our SCMS algorithm tailored to directional data (Algorithm 2).
C. Normal Space of the Euclidean Density Ridge
As we will refer to conditions (A1–3) frequently in the next two sections, we restate them here:
(A1) (Differentiability) We assume that
is bounded and at least four times differentiable with bounded partial derivatives up to the fourth order for every
.(A2) (Eigengap) We assume that there exist constants
and
such that
and
for any
.- (A3) (Path Smoothness) Under the same
in (A2), we assume that there exists another constant
such that
for all
and
.
Given a matrix-valued function
, its gradient
will be an
array defined as
. The derivative of
in the direction of a vector
is defined as
![]() |
When the matrix
, we will use the notation
interchangeably to denote its directional derivative along
.
Recall that an order-
ridge of the density
in
is the collection of points defined as
![]() |
Lemma C.1 below shows that under conditions (A1–3), the Jacobian matrix
has rank
at every point of
, and
is a
-dimensional manifold by the implicit function theorem [92]. Consequently, the row space of
spans the normal space to
.
If we define
, the derivation in pages 60–63 of [39] shows that
![]() |
(C1) |
for
, and the column space of
spans the normal space to
. Let
![]() |
for
. Then,
![]() |
(C2) |
However, the columns of
are not orthonormal. Thus, we leverage the orthonormalization in [22] to construct
whose columns are orthonormal and span the same column space as
in the following steps. Under the condition that
has full rank
at every point
(see Lemma C.1),
is positive definite, and we perform the Cholesky decomposition on it, that is,
![]() |
(C3) |
where
is a lower triangular matrix whose diagonal elements are positive. We then define
![]() |
(C4) |
Notice that
intrinsically depend on the dimension
of the ridge
, but we do not explicate these dependencies in their notations. As discussed in [22],
might not be unique because the eigenvalues of
can have their multiplicities greater than 1. Any collection of linearly independent unit eigenvectors of
fits into the above construction for
. However, as will be shown later, this volatility of
will not affect our results, as we only require the smoothness of
to develop a lower bound of
.
Lemma C.1.
Assume conditions (A1–3). Given that
and
are defined in (C2) and (C4), we have the following properties:
(a)and
have the same column space. In addition,
That is,
is the projection matrix onto the columns of
.
(b) The columns of
are orthonormal to each other.
(c) For
, the column space of
is normal to the (tangent) direction of
at
.
(d) For all
,
. Moreover,
is a
-dimensional manifold that contains neither intersections nor endpoints. Namely,
is a finite union of connected and compact manifolds.
(e) For, all the
nonzero singular values of
are greater than
and therefore,
(f) When is sufficiently small and
,
for some constant
.
(g) Assume that another density function also satisfies conditions (A1–3) and
is sufficiently small. Then
for some constant
and any
, where
is the matrix defined in (C4) with the underlying density
.
(h) The reach of satisfies
for some constant
.
Lemma C.1 is extended from Lemma 2 in [22] to handle the density ridge
with
. As our conditions (A1–3) imply the imposed conditions of Lemma 2 in [22], our proof of Lemma C.1 essentially follows from their arguments with some minor modifications.
Proof of Lemma C.1 —
We adopt and generalize parts of the proof of Lemma 2 in [22].
(a) This property is a natural corollary of the Cholesky decomposition as
(b) Some direct calculations show that
(c) It can be proved by the argument of Lemma 1 in [22]. Or, we define an arbitrary parametrized curve
lying within
for some
. Then
aligns with the tangent direction at
. Since
, taking the derivative with respect to
gives us that
with
. Hence, by the arbitrariness of
, the column of
is normal to the tangent direction of
at
.
(d) We prove that the
nonzero singular values of
are bounded away from 0. Recall that
with
for
. Under conditions (A2-3),
It shows that all the singular values of
are less than
. Moreover, under condition (A2) again, all the
nonzero singular values of
are greater than
. By Theorem 3.3.16 in [57], we know that all the
nonzero singular values of
are greater than
. Therefore,
. The rest of the proof follows directly from Claim 4 in [22].
(e) By the proof of (d), we already know that all the
nonzero singular values of
are greater than
. Thus,
, and
where
is the smallest singular value of matrix
.
Finally, the proofs of properties (f), (g) and (h) are essentially the same as the corresponding claims in [22]. We thus omit them.
As we have discussed in Remark 4.1, property (d) of Lemma C.1 demonstrates that our imposed assumptions (A1–3) for the ridge
are sufficient to imply the critical full-rank condition of its normal space in [28] in order for
to be a well-defined solution manifold.
D. Proofs of Lemma 3.2, Proposition 3.1, and Theorem 3.1
Lemma D.1.
Assume conditions (A1) and (E1). The convergence rate of
is
for any
as
and
.
Another interpretation of Lemma 3.2 is that
diverges to infinity at the rate
![]() |
if we select the bandwidth
to minimize the asymptotic MISE [27], where ‘
’ stands for the asymptotic equivalence.
Proof of Lemma 3.2 —
Note that
(D1) Given the differentiability of
guaranteed by condition (A1), the expectation of
is given by
By condition (E1), the dominating constant
is finite and therefore,
(D2) In addition, we calculate the variance of
as
Again, by condition (E1), the dominating constant
is finite. Thus, by the central limit theorem,
(D3) where
. Combining (D1), (D2) and (D3), we conclude that
for any
as
and
.
Remark D.1.
Some previous research papers on the mean shift algorithm [3, 6, 19, 71] have already justified that the algorithm converges to a local mode of the KDE
when its local modes are isolated and the algorithm starts within some small neighborhoods of these estimated local modes. Lemma 3.2 here provides a (probabilistic) perspective on the linear convergence of the mean shift algorithm. It is well known that the set of the true local modes of
can be approximated by the set of estimated modes defined by
[25]. Moreover, around the true local modes of the density
, one can argue that
is strongly convex and has a Lipschitz gradient with probability tending to 1 by the uniform consistency of
and
as
and
; see the uniform bounds (3.5). Hence, by some standard results in optimization theory (e.g. Chapter 3 in [17]), a sample-based gradient ascent algorithm with objective function
converges linearly to (estimated) local modes around their neighborhoods as long as its step size is below some threshold value. Finally, recall that (i) the mean shift algorithm is a special variant of the sample-based gradient ascent method with an adaptive size
by (3.8) and (ii)
can be sufficiently small but bounded away from 0 when
is large and
is small by Lemma 3.2; see also Remark 3.2. Therefore, the mean shift algorithm will converge linearly with high probability around some small neighborhood of the local modes of
when the sample size
is sufficiently large and
is chosen to be small.
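The adaptive-step-size view of the mean shift iteration in this remark can be sketched with a Gaussian kernel as follows; the kernel, bandwidth, and stopping rule are illustrative choices of ours.

```python
import numpy as np

def mean_shift(x, data, h, max_iter=500, tol=1e-8):
    """Mean shift with a Gaussian kernel: each update moves x to the
    kernel-weighted average of the data, which is a gradient ascent step
    on the KDE with an adaptive step size (cf. (3.8))."""
    for _ in range(max_iter):
        w = np.exp(-np.sum((data - x) ** 2, axis=1) / (2.0 * h ** 2))
        x_new = w @ data / w.sum()     # weighted average of the sample
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x
```

Started near a region of high density, the sequence converges to the corresponding local mode of the KDE, consistent with the discussion above.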
Proposition D.1. (Convergence of the SCGA Algorithm.)
For any SCGA sequence
defined by (3.12) with
, the following properties hold.
(a) Under condition (A1), the objective function sequence
is non-decreasing and converges.
(b) Under condition (A1),
.
(c) Under conditions (A1-3),whenever
with the convergence radius
satisfying
where
is a constant defined in (h) of Lemma C.1 and
is a quantity depending on both the dimension
and functional norm
up to the fourth-order (partial) derivatives of
.
Proof of Proposition 3.1 —
(a) We first derive the following fact about the objective function
.
Fact 1. Given (A1),
is
-smooth, that is,
is
-Lipschitz.
This fact follows from the differentiability of
ensured by condition (A1) and Taylor’s theorem that
for any
, where
is within a
-neighborhood of
. Moreover,
(D4) When
, we have that
(D5) showing that the objective function
is non-decreasing along
. Given the boundedness of
guaranteed by condition (A1), we know that the sequence
is bounded. Thus,
also converges.
(b) From (a), we know that when
,
Since the sequence
converges as
, it follows that
(c) Given condition (A2) and the fact that
, we know that
Let
denote the projection of
in the SCGA sequence onto the ridge
. Since
by (h) of Lemma C.1,
is well defined when
. Given that the definition of
in (C2), we know that
(D6) when
, where we leverage (e) of Lemma C.1 to obtain the inequality (i). More specifically,
is a full row rank matrix by (d) of Lemma C.1 and
lies within the row space of
because
is normal to
at
. Since the nonzero singular values of
are lower bounded by
, it follows that
. From the above derivation, we also know that
is indeed the supremum norm of
over the line segment connecting
and
, which depends on the uniform functional norm
of the partial derivatives of
up to the fourth order. The result follows from (b).
The following Davis–Kahan theorem [36] is one of the most notable theorems in matrix perturbation theory. We present the theorem in a modified version from [45, 109]. Other useful variants of the Davis–Kahan theorem can be found in [115].
Lemma D.2. (Davis–Kahan.)
Let
and
be two symmetric matrices in
, whose spectra (Definition 1.1.4 in [58]) are
and
, and
be an interval. Denote by
the set of eigenvalues of
that are contained in
, and by
the matrix whose columns are the corresponding (unit) eigenvectors to
(more formally,
is the image of the spectral projection induced by
). Denote by
and
the analogous quantities for
. If
then the distance
between two subspaces is bounded by
for any orthogonally invariant norm
, such as the Frobenius norm
and the
-operator norm
, where
is a diagonal matrix with the ascending principal angles between the column spaces of
and
on the diagonal.
Note that when we take the Frobenius norm in Lemma D.2,
by some simple algebra. Consequently, we will utilize the following inequality from the Davis–Kahan theorem in our subsequent proofs as
(D7)
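As a numerical sanity check of a Davis–Kahan-type bound (our own sketch, not from the paper; the test matrices and the conservative factor-2 form of the bound, in the spirit of the variants in [115], are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.diag([5.0, 4.0, 1.0, 0.5, 0.2, 0.1])    # top-2 eigenspace, eigengap 4 - 1 = 3
E = 0.01 * rng.standard_normal((6, 6))
E = (E + E.T) / 2                               # symmetric perturbation

def top_subspace(M, q):
    """Columns: eigenvectors of the q largest eigenvalues of symmetric M."""
    _, vecs = np.linalg.eigh(M)
    return vecs[:, -q:]

V0, V1 = top_subspace(A, 2), top_subspace(A + E, 2)
# Frobenius sin-theta distance between the two 2-dim subspaces
sin_theta = np.linalg.norm(V1 @ V1.T - V0 @ V0.T, 'fro') / np.sqrt(2)
# conservative factor-2 form of the Davis-Kahan bound ||E|| / eigengap
bound = 2 * np.linalg.norm(E, 'fro') / (4.0 - 1.0)
assert sin_theta <= bound
```

The difference of projection matrices is used here because, as noted above, its Frobenius norm equals the sin-theta distance up to the factor √2.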
Theorem D.1. (Linear Convergence of the SCGA Algorithm.)
Assume conditions (A1–4) throughout the theorem.
(a) Q-Linear convergence of: Consider a convergence radius
satisfying
where
is the constant defined in (h) of Lemma C.1 and
is a quantity defined in (c) of Proposition 3.1 that depends on both the dimension
and the functional norm
up to the fourth-order derivative of
. Whenever
and the initial point
with
, we have that
(b) R-Linear convergence of: Under the same radius
in (a), we have that whenever
and the initial point
with
,
We further assume conditions (E1–2) in the remaining statements. If
and
,
(c) Q-Linear convergence of: under the same radius
and
in (a), we have that
with probability tending to 1 whenever
and the initial point
with
.
(d) R-Linear convergence of: under the same radius
and
in (a), we have that
with probability tending to 1 whenever
and the initial point
with
.
Proof. Proof of Theorem 3.1 —
The entire proof is inspired by some standard results in optimization theory. However, the objective function
is no longer strongly concave, and we focus on the SCGA iteration instead of the standard gradient ascent method. We first recall the following two facts.
Fact 1. Given condition (A1),
is
-smooth, that is,
is
-Lipschitz.
Fact 2. Given conditions (A1–3), we know that
for any
and
for any
.
Fact 1 has been proved in Proposition 3.1, implying that the objective function sequence
is non-decreasing when
. Fact 2 is a natural corollary of Proposition 3.1, because
and
is the objective function value after one-step SCGA iteration with step size
. The iteration will move
toward the ridge
. With the help of these two facts, we start the proofs of (a–d).
(a) We first show the following claim: for all
and the initial point
,
(D8) where
. By the differentiability of
guaranteed by condition (A1) and Taylor’s theorem, we have that
where we use the equality
in (i) and (iii), leverage conditions (A2) and (A4) to obtain that
and
in (ii), apply the quadratic bound on
from condition (A4) to obtain (iv), and use the fact that
in (v). We also use the fact that
and
in (v). Claim (D8) thus follows.
Given Fact 2 and any
,
where we apply (D4) to obtain the inequality. It implies that
(D9) Therefore,
whenever
, where we apply (D8) and (D9) in (i) and use the choice of
to argue that
in (ii). By telescoping, we conclude that when
and
,
The result follows.
(b) The result follows easily from (a) and the inequality
for all
.
(c) The proof here is partially inspired by the proof of Theorem 2 in [8]. We write the spectral decompositions of
and
as
where
and
. By Weyl’s theorem (Theorem 4.3.1 in [58]) and uniform bounds (3.5), we know that for any
,
Thus,
satisfies the first two inequalities in condition (A2) when
is sufficiently small and
is sufficiently large. According to the Davis–Kahan theorem (Lemma D.2 here) and uniform bounds (3.5),
for any
, where we use (D7) and the fact that
to obtain (i). Hence, when
and
,
(D10) with probability tending to 1.
We now claim that
and
(D11) for all
. We will prove this claim by induction on the iteration number. Note that when
, we derive from triangle inequality that
where we apply the result in (a) to obtain the last inequality. Moreover, by the choice of
and (D10), we are guaranteed that
. In the induction from
, we suppose that
and the claim (D11) holds at iteration
. The same argument then implies that the claim (D11) holds for iteration
and that
. The claim (D11) is thus proved.
Now, given that
, we iterate the claim (D11) to show that
where the fourth inequality follows by summing the geometric series, and the last equality is due to our notation
. It completes the proof.
(d) The result follows easily from (c) and the inequality
for all
.
E. Discussion on Condition (A4)
In this section, we explore several avenues to derive condition (A4) based on some potentially weaker assumptions. Recall from Section 3.3 that condition (A4) requires the following:
- (A4) (Quadratic Behaviors of Residual Vectors) We assume that the SCGA sequence
with step size
and
as its limiting point satisfies that
for some constant
, where
is the constant defined in condition (A2).
E.1 Self-Contractedness Assumption
One important assumption that connects condition (A4) with the existing conditions (A1–3) in Section 3.1 is the so-called self-contracted property [34, 35, 49]:
- (A5) (Self-Contractedness) We assume that the SCGA sequence
satisfies that 
Condition (A5) requires the SCGA sequence to move toward the ridge
under a relatively straight and shrinking path. As we have proved in Proposition 3.1 that the SCGA sequence
converges to
when the sequence is initialized near
and its step size is small, condition (A5) is indeed a mild assumption as long as the sequence
does not move erratically around
. More importantly, we demonstrate by Proposition E.1 below that condition (A5) can be implied by a subspace constrained version of the concavity assumption on the objective (density) function
.
Proposition E.1.
Assume condition (A1) and the following assumption on the objective function
:
(A6) (Subspace Constrained Concavity) For any with
being a constant radius, it holds that
Then, the SCGA sequence
defined in (3.12) with step size
and initial point
is self-contracted.
Notice that the density function (3.13) satisfies the ‘subspace constrained concavity’ condition (A6) around a small neighborhood of its ridge
. Moreover, it is straightforward to verify that condition (A6) is a weaker assumption than our established ‘subspace constrained strong concavity’ in Theorem 3.1; see also Remark 3.3.
Proof. Proof of Proposition E.1 —
The proof is inspired by Lemma 14 in [49]. We show the self-contractedness for
as follows, where
is arbitrary. For all
and
with
, we calculate that
where we apply condition (A6) in inequality (i), use the ascending property of
from (a) of Proposition 3.1 to argue that
in inequality (ii), and leverage the inequality (D5) guaranteed by condition (A1) to obtain (iii). The self-contractedness of the SCGA sequence
thus follows.
Under the self-contractedness condition (A5), we argue by the following lemma that the existing conditions (A1–3) in the literature [22, 45] are nearly sufficient to imply the quadratic behavior of the residual vector
along the SCGA sequence
. In other words, condition (A4) and the linear convergence of the SCGA algorithm hold without any extra assumption.
Lemma E.1.
Assume condition (A5) throughout the lemma.
(a) The total length of the SCGA trajectory is of the linear order, i.e.
(b) We further assume conditions (A1–2). Then, for any
with some radius
, where we recall that
is the effective radius in condition (A2) under which the underlying density
has an eigengap
between the
-th and
-th eigenvalues of its Hessian matrix
.
Proof. Proof of Lemma E.1 —
(a) This result follows directly from Theorem 15 of [49]. Note that although their results are stated for the standard gradient descent path, the associated proof only utilizes the self-contractedness property of the iterative path. Thus, their proof is applicable to our SCGA setting under condition (A5).
(b) We first decompose the vector
into an infinite sum of SCGA iterations
for
and obtain that
(E1) for any
, where we leverage the orthogonality between
and
and the idempotence of
for all
in (ii). See also Fig. E11 for a graphical illustration of the decomposition. By the Davis–Kahan theorem (Lemma D.2 and (D7) here) and conditions (A1–2), we deduce that for all
,
where we use the Taylor’s theorem in (i) as well as apply the self-contractedness condition (A5) and possibly shrink the radius
so that
in (ii). Hence, by (E1) and the fact that
, we obtain that
implying the second bound in condition (A4) with
. In addition,
The results follow.
Fig. E11.

Decomposition of the vector
into the summation
of subspace constrained gradient ascent iterative vectors.
According to (b) of Lemma E.1, condition (A4) will hold with
whenever
(E2)
The choice of
is a valid constant under the differentiability condition (A1). More importantly, (E2) is essentially the same assumption as the first inequality of condition (A3). Compared with the corresponding condition in (A3), the upper bound in (E2) for
around the ridge
is only shrunk by a dimension-dependent factor
. As conditions (A3) and (E2) are local, this adjustment does not induce too much extra strictness on the underlying density
.
E.2 Subspace Constrained Polyak-Łojasiewicz Inequality Assumption
We have demonstrated in Appendix E.1 that the crucial condition (A4) is valid under the self-contractedness assumption on the SCGA sequence
. Consequently, the linear convergence of the SCGA algorithm can be established by slightly modifying the common assumptions (A1–3) in ridge estimation. Nevertheless, the self-contractedness property of the SCGA sequence
does not always hold in practice, and it may only be implied by the subspace constrained concavity condition (A6) as proved in Proposition E.1.
Given the fact that the underlying density function
or its estimator
may not satisfy the subspace constrained concavity assumption in many practical applications of SCGA and SCMS algorithms, we present another approach to deduce condition (A4) based on the well-known Polyak–Łojasiewicz inequality [72, 87]. Given any SCGA sequence
with limiting point
and step size
, we consider the following condition:
- (A7) (Subspace Constrained Polyak–Łojasiewicz Inequality) For all
, there exists a constant
such that 
Similar to the standard Polyak–Łojasiewicz inequality, there exist some objective functions that satisfy the subspace constrained Polyak–Łojasiewicz inequality but fail to be concave in the subspace constrained sense as in condition (A6); see [21, 41] and Equation (36) in [28]. From this aspect, condition (A7) incorporates some extra SCGA sequences satisfying condition (A4) and converging linearly to the ridge
. However, as the subspace constrained Polyak–Łojasiewicz inequality does not imply condition (A5) or (A6), it should not be regarded as a more general condition. Furthermore, unlike the standard gradient ascent/descent method (Theorem 2 in [63]), the error bound condition (i.e. Equation (D6) here) does not imply the subspace constrained Polyak–Łojasiewicz inequality, indicating a challenge in validating condition (A7) in practice.
Despite these disadvantages, the subspace constrained Polyak–Łojasiewicz inequality condition does give rise to a concise proof for the linear convergence of the objective function value
along the SCGA sequence
.
Proposition E.2.
Assume conditions (A1) and (A7). Then, for any SCGA sequence
with step size
, we have that
Proof. Proof of Proposition E.2 —
The proof is inspired by Theorem 1 in [63]. From (D5) and condition (A7), we know that
for all
when
. By some rearrangements, we conclude that
The final display follows from telescoping.
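The telescoped bound can be reproduced numerically on a standard example. In the sketch below (our own illustration), f(x) = x² + 3 sin²x is a classic function that satisfies the Polyak–Łojasiewicz inequality (with μ = 1/32 and smoothness constant L = 8, as reported in the optimization literature) without being convex; gradient descent with step 1/L then contracts the function gap at the Q-linear rate 1 − μ/L (the ascent case is symmetric):

```python
import numpy as np

# f(x) = x^2 + 3 sin^2(x): satisfies the PL inequality (mu = 1/32) but is not
# convex; f is L-smooth with L = 8 since f''(x) = 2 + 6 cos(2x) lies in [-4, 8].
f = lambda x: x**2 + 3 * np.sin(x)**2
grad = lambda x: 2 * x + 3 * np.sin(2 * x)

L, mu = 8.0, 1.0 / 32.0
x, fstar = 2.0, 0.0                # global minimum value f(0) = 0
gaps = []
for _ in range(200):
    gaps.append(f(x) - fstar)
    x = x - grad(x) / L            # gradient descent with step 1/L

rate = 1 - mu / L                  # Q-linear contraction factor from PL + smoothness
assert all(g1 <= rate * g0 + 1e-12 for g0, g1 in zip(gaps, gaps[1:]))
assert gaps[-1] < 1e-6
```

Per step, L-smoothness gives a descent of at least |f′(x)|²/(2L), and the PL inequality lower-bounds |f′(x)|² by the function gap, which is exactly the one-line rearrangement used in the proof above.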
More importantly, the subspace constrained Polyak–Łojasiewicz inequality controls the total length of the SCGA path to be of the linear order and implies the quadratic behavior of residual vectors as required by condition (A4).
Lemma E.2.
Assume conditions (A1) and (A7) throughout the lemma.
(a) The total length of the SCGA trajectory is of the linear order, i.e.
(b) We further assume condition (A2). Then,for any
with some radius
, where we recall that
is the effective radius in condition (A2) under which the underlying density
has an eigengap
between the
-th and
-th eigenvalues of its Hessian matrix
.
Proof. Proof of Lemma E.2 —
(a) This part of the proof is inspired by the arguments in Theorem 9 of [49]. Based on the proof of (a) in Proposition 3.1 under condition (A1), we know from (D5) that
when
. Using this inequality and condition (A7), we derive that
where we use the inequality
to obtain (i) and apply condition (A7) in inequality (ii). Since
, some rearrangement of the above inequality suggests that
Therefore,
where we leverage condition (A7) again in (i). In addition, to obtain inequalities (ii) and (iii), we recall from the proof of (d) for Lemma C.1 that
, in which the singular values of
are bounded by
and the singular values of
are bounded by
. The result thus follows.
(b) This part of the proof is analogous to our arguments in (b) of Lemma E.1, except that the SCGA sequence
is no longer self-contracted. For completeness, we repeat some arguments and highlight the differences here. By the Davis–Kahan theorem (Lemma D.2 and (D7) here) and conditions (A1–2), we have that for all
,
where we possibly shrink the radius
so that
to obtain inequality (ii). Notice also that, since
may not hold without the self-contractedness property, we use a looser bound
from (a) to derive inequality (i). Therefore, by (E1) and the fact that
, we obtain that
which implies the second bound in condition (A4) with
. Finally,
The results follow.
The results in (b) of Lemma E.2 also imply condition (A4) with
whenever
(E3)
Once again, the choice of
is feasible under condition (A1) and the upper bound (E3) can be viewed as a variant of the first inequality in condition (A3). From this perspective, the subspace constrained Polyak–Łojasiewicz inequality (A7) also leads to an alternative assumption for condition (A4) and the linear convergence of the SCGA algorithm.
Remark E.1.
Note that the results in Proposition E.2 can be generalized to the directional or arbitrary manifold cases under conditions (A1–3). First, the subspace constrained Polyak–Łojasiewicz inequality for the SCGA sequence
on
or an arbitrary manifold can be modified as
where
is the objective (density) function. Based on the proof of (a) in Proposition 4.2 and our arguments in (a) of Lemma E.2, it follows that the total length of the SCGA trajectory on
or an arbitrary manifold is of the linear order, i.e.
Second, to establish the quadratic bounds for
and
, one can follow the arguments in the proof of (b) in Lemma E.2 and leverage the two facts:
1. The tangent vector
can be decomposed into an infinite sum of SCGA updates (4.18) on
or an arbitrary manifold as
See Fig. E12 for a graphical illustration. This equation is valid because parallel transports preserve inner products and are linear.
2. Under conditions (A1–2), we know that
for some constant
, where we leverage the fact that the vector field
with
and
has its variation
bounded by
according to the Davis–Kahan theorem for any
. However, we are not sure whether the self-contractedness condition can also be adapted to the directional or general manifold cases, given that the arguments in Theorem 15 of [49] are based on Euclidean geometry.
Fig. E12.

Decomposition of the vector
within the tangent space
into the summation
of parallel transported SCGA iterative vectors. Here, the blue curves on
are iterative paths of the SCGA algorithm, while the green vectors are tangent vectors
after being parallel transported to
.
F. Other Technical Concepts of Differential Geometry on
Taylor’s Theorem on
. Given a smooth function
on
, its Taylor’s expansion is often written as [85]:
(F1)
for any
, where
is the exponential map at
. One may replace the exponential map with a more general concept called a retraction on an arbitrary manifold; see Section 4.1 and Proposition 5.5.5 in [1].
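For concreteness, the exponential map on the unit sphere has the well-known closed form exp_x(v) = cos(‖v‖) x + sin(‖v‖) v/‖v‖. A minimal sketch (our own illustration; the base point and tangent vector are arbitrary):

```python
import numpy as np

def exp_map(x, v):
    """Exponential map on the unit sphere: exp_x(v) = cos(|v|) x + sin(|v|) v/|v|,
    i.e. follow the great circle from x with initial velocity v for unit time."""
    t = np.linalg.norm(v)
    return x if t < 1e-15 else np.cos(t) * x + np.sin(t) * v / t

x = np.array([0.0, 0.0, 1.0])
v = np.array([np.pi / 2, 0.0, 0.0])            # tangent vector at x of length pi/2
y = exp_map(x, v)
assert np.allclose(y, [1.0, 0.0, 0.0])         # a quarter of a great circle
assert np.isclose(np.linalg.norm(y), 1.0)      # stays on the sphere
```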
Parallel Transport. When comparing vectors in two different tangent spaces
on
, we leverage the notion of parallel transport
to transport vectors from one tangent space to another along a geodesic. In addition,
is a tangent vector in
after being parallel transported from
along a geodesic (or great circle) on
. The parallel transport mapping
is a linear isometry along any smooth curve on
, i.e.
for any
; see Proposition 5.5 in [69] or Proposition 1 in Section 4-4 of [37].
Sectional Curvature. Sectional curvature is the Gaussian curvature of the two-dimensional submanifold formed as the image, under the exponential map, of a two-dimensional subspace of a tangent space; see Section 3-2 in [37] for detailed discussions of the Gaussian curvature. It is known that a two-dimensional submanifold with positive, zero or negative sectional curvature is locally isometric to a two-dimensional sphere, a Euclidean plane or a hyperbolic plane with the same Gaussian curvature [116].
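The parallel transport on the sphere described above also admits a standard closed form along the geodesic connecting two non-antipodal unit vectors x and y: v ↦ v − (⟨v, y⟩/(1 + ⟨x, y⟩))(x + y). A minimal sketch checking the linear-isometry property (our own illustration; the points and tangent vector are arbitrary):

```python
import numpy as np

def transport(x, y, v):
    """Parallel transport of v in T_x S^2 to T_y S^2 along the connecting
    geodesic; standard closed form, valid when <x, y> > -1."""
    return v - (np.dot(v, y) / (1 + np.dot(x, y))) * (x + y)

x = np.array([1.0, 0.0, 0.0])
y = np.array([0.0, 1.0, 0.0])
v = np.array([0.0, 0.3, 0.4])                  # tangent at x: <v, x> = 0
w = transport(x, y, v)
assert abs(np.dot(w, y)) < 1e-12               # lands in the tangent space at y
assert np.isclose(np.linalg.norm(w), np.linalg.norm(v))   # linear isometry
```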
- Geodesically Strong Concavity. A function
is said to be geodesically concave if for any
, it holds that
for any
, where
is a geodesic with
and
. When
is differentiable, an equivalent statement of the geodesic concavity is that (Theorem 11.17 in [15])
A function
is said to be geodesically
-strongly concave if for any
, it holds that 
G. Normal Space of Directional Density Ridge
Recall that we extend the directional density
from its support
to
by defining
for all
. As we will refer to conditions (A1–3) frequently in the next three sections, we restate them here:
(A1) (Differentiability) Under the extension (4.1) of the directional density
, we assume that the total gradient
, total Hessian matrix
and third-order derivative tensor
in
exist, and are continuous on
and square integrable on
. We also assume that
has bounded fourth-order derivatives on
.(A2) (Eigengap) We assume that there exist constants
and
such that
and
for any
.- (A3) (Path Smoothness) Under the same
in (A2), we assume that there exists another constant
such that
for all
and
.
Recall that an order-
density ridge of a directional density
on
is the set of points defined as
(G1)
Lemma G.1 below shows that under conditions (A1–3), the Jacobian matrices
and
(i.e. projecting the columns of
onto the tangent space
) both have rank
at every point on
, and
will be a
-dimensional submanifold on
by the implicit function theorem [68, 92]. Analogous to the discussion about the normal space of a Euclidean density ridge in Appendix C, we define
Different from the Euclidean density ridge case, it is the column space of
(G2)
that spans the normal space of
within the ambient space
. It can be seen from our Remark 4.1 that the rows of
span the normal space of the solution manifold
; see also Lemma 1 in [28]. Consequently, the column space of
spans the normal space of
within the tangent space
at each
. The technique on pages 60–63 of [39] remains valid for arguing that
(G3)
for
, where we use the fact that
on
under the extension of
as in (A1). Let
for
. Then,
(G4)
As in the Euclidean data case, the columns of
are not orthonormal, and we again leverage the orthonormalization technique in [22] to construct
that shares the same column space with
but has orthonormal columns. That is, under the condition that
has full column rank
at every point
(see Lemma G.1),
(G5)
with the Cholesky decomposition
, where
is a lower triangular matrix whose diagonal elements are positive. Finally, the non-uniqueness of
will not affect our subsequent discussions about the properties of directional density ridges.
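The Cholesky-based orthonormalization just described can be sketched numerically. Below, a generic full-column-rank matrix V stands in for the (unavailable) ridge quantity; the construction V(Lᵀ)⁻¹ follows the technique of [22], and the assertions check the two claimed properties (our own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
V = rng.standard_normal((6, 3))        # generic matrix with full column rank 3

# Cholesky of the Gram matrix: V^T V = L L^T, L lower triangular, positive diagonal
L = np.linalg.cholesky(V.T @ V)
Vr = V @ np.linalg.inv(L).T            # rescaled basis V (L^T)^{-1}

assert np.allclose(Vr.T @ Vr, np.eye(3))       # columns are orthonormal
P = V @ np.linalg.solve(V.T @ V, V.T)          # projector onto col(V)
assert np.allclose(Vr @ Vr.T, P)               # same column space as V
```

The two checks are immediate algebraically: VrᵀVr = L⁻¹(LLᵀ)L⁻ᵀ = I, and VrVrᵀ = V(VᵀV)⁻¹Vᵀ is the projection onto the column space of V.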
Lemma G.1.
Assume conditions (A1–3). Given that
,
and
are defined in (G4) and (G5), we have the following properties:
(a) and
have the same column space. In addition,
That is,
is the projection matrix onto the columns of
.
(b) The columns of
are orthonormal to each other.
(c) For
, the column space of
is normal to the (tangent) direction of
at
.
(d) For, the smallest eigenvalue
, and
Moreover, all the nonzero singular values of
are greater than
, and
Therefore,
is a
-dimensional submanifold that contains neither intersections nor endpoints on
. Namely,
is a finite union of connected and compact submanifolds on
.
(e) For all,
(f) When is sufficiently small and
,
for some constant
.
(g) Assume that another directional density function also satisfies conditions (A1–3) after the extension
in
, and
is sufficiently small. Then,
for some constant
and any
, where
is the matrix defined in (G5) with directional density
.
(h) The reach of satisfies
for some constant
.
This lemma is a direct extension of Lemma C.1 to the directional data scenario; thus, its proof is similar to the proof of Lemma C.1.
Proof. Proof of Lemma G.1 —
The proofs of properties (a), (b) and (c) can be inherited from the corresponding ones in Lemma C.1 with mild modifications and we thus omit them.
(d) We will prove that the
nonzero singular values of
and
are bounded away from 0. Recall that
with
for
. Under condition (A2),
It shows that all the singular values of
or simply
are less than
. Moreover, under condition (A2) again, all the
singular values of
are greater than
.
By Theorem 3.3.16 in [57], we know that all the singular values of
and
are greater than
where
are singular values of a matrix
in their descending order. Therefore, the minimum eigenvalue of
satisfies
(G6) Now, given
and
, we know that
If we denote the orthonormal eigenvectors of
by
, then
are the orthonormal eigenvectors of
, whose eigenvalues are thus lower bounded by
due to (G6). Hence,
.
By the implicit function theorem and the extra constraint
,
is a
-dimensional submanifold on
. It also implies that
cannot have intersections, because otherwise the intersected points would violate the rank condition. Finally, we argue by contradiction that
has no endpoints. Assume, on the contrary, that
has an end point
. Our preceding argument has shown that
, the derivative of
, is bounded. In addition,
. However, this contradicts the implicit function theorem, which indicates that
is a
-dimensional submanifold on
, because at the end point
, there exists no local coordinate chart for
defined on an open set in
. The results follow.
(e) By the proof of (d), we already know that all the
nonzero singular values of
and
are greater than
. Also, all the
nonzero singular values of
are greater than
. Thus, the results follow easily from the argument of (e) in Lemma C.1.
Finally, the proofs of properties (f), (g) and (h) are essentially the same as those of the corresponding claims in [22], so we omit them. For (h), the reader should be aware that we have extended the directional density
from
to
. In addition, it is the columns of
that span the normal space of
in the ambient space, whose nonzero singular values are lower bounded by
. The proof of (h) can also be found in Theorem 3 of [28].
H. Stability of Directional Density Ridge
H.1 Subspace Constrained Gradient Flows
This subsection is modified from Section 4 in [45] for directional densities and their ridges on
. A map
is a subspace constrained gradient flow with the principal Riemannian gradient
if
and
(H1)
where the last equality follows from (4.4). Given the definition of the directional density ridge
in (G1), it consists of the destinations of the subspace constrained gradient flow
, i.e.
if
for some
satisfying (H1). It will be convenient to parametrize the SCGA path with
by arc length. Let
be the arc length from
to
:
Denote the inverse of
by
. Note that
With
, we have that
(H2)
which is a reparametrization of (H1) by arc length. Note that
always lies on
because its velocity is within the tangent space
for every
. Lemma 2 in [45] justifies the uniqueness of
passing through any particular point
under conditions (A1–3). The (reversed) subspace constrained gradient flow
can be lifted onto the directional function
, as we may define
(H3)
Sometimes, we may add the subscript
to the curves
if we want to emphasize that
start from or pass through the specific point
.
To analyze the behavior of the subspace constrained gradient flow
lifted on
, we need the derivative of the projection matrix
along the path
. Recall that
. The collection
defines a matrix field: there is a matrix
attached to each point
. As mentioned earlier, there is a unique path
and unique
such that
for any
. Define
(H4)
where
with
being the Riemannian connection on
. Under conditions (A1–3),
has a quadratic-like behavior near the directional ridge
, analogous to Lemma 3 in [45].
Lemma H.1.
Assume that conditions (A1–3) hold. For all
, we have the following properties:
(a)
,
, and
. Thus,
is non-decreasing in
.
(b) The second derivative of
satisfies
.
(c)
.
Proof. Proof of Lemma H.1 —
The proof is adapted from Lemma 3 in [45].
(a) The first property
is obvious from the definition (H3). Then,
since
for all
. By the definition of
in (G1),
when
. Thus,
and
is non-decreasing in
.
(b) Note that
Differentiating both sides of the equation, we have that
Since
(idempotent), we have that
, and hence the second term on the right-hand side of the above equation becomes
Thus,
By (a) and (H2), we conclude that
(H5) Now, we will bound the two terms in (H5), respectively. As for the first term
, we notice that
is in the column space of
. Hence,
where
. Therefore, from condition (A2),
and consequently,
As for the second term
, we notice that
, where
, and
. Then,
However,
. To see this, note that
and it implies that
showing that
. To bound
, we proceed as follows. As before, we let
. Then, by the Davis–Kahan theorem (Lemma D.2 here),
Note that
, because
. Thus, from condition (A3),
Therefore,
.
(c) For some
,
by (a) and (b). As
is parametrized by arc length, we conclude that
The result follows.
The statement (c) in Lemma H.1 is known as the quadratic growth condition in the optimization literature [4, 38]. Under conditions (A1–3), such a quadratic growth of the subspace constrained gradient flow
lifted onto the directional density
enables us to quantify the stability of directional ridges under small perturbations on the directional density and develop the linear convergence of the (directional) SCGA algorithms on
.
H.2 Proof of Theorem 4.1
We now show that if two directional densities
and
are close, their corresponding ridges
and
are also close. We will use, for instance,
and
, to refer to the principal (Riemannian) gradient and projection matrix with its columns as the eigenvectors corresponding to the smallest
eigenvalues of the (Riemannian) Hessian
with the tangent space of
defined by
.
Theorem H.1.
Suppose that conditions (A1–3) hold for the directional density
and that condition (A1) holds for
. When
is sufficiently small,
(a) conditions (A2–3) hold for
.
(b)
.
(c)
for a constant
.
Proof. Proof of Theorem 4.1 —
Our arguments are modified from the proof of Theorem 4 in [45] as well as Proposition 4 and Theorem 5 in [28].
(a) We write the spectral decompositions of
and
as
By Weyl’s Theorem (Theorem 4.3.1 in [58]), we know that
where we recall that there are at most
nonzero eigenvalues of the Riemannian Hessian
on
. Thus,
satisfies condition (A2). Moreover, since condition (A3) depends only on the first- and third-order derivatives of
, it holds for
when
is small enough.
(b) We present two methods, based on two different flows, to prove this statement and comment on their pros and cons in Remark H.1. Method A: By the Davis–Kahan theorem (Lemma D.2 and (D7)),
for any
. Then, given that
,
Therefore, by the differentiability of
from (A1) and the compactness of
, we obtain from the above calculations that
for some constant
that only depends on the dimension
.
Now, let
. Then,
, and
. Let
be the SCGA flow through
as defined in Section H.1 so that
for some
. Note that
. From property (a) of Lemma H.1, we have that
. Moreover, by Taylor’s theorem,
for some
between
and
. Since
, from property (b) of Lemma H.1,
and consequently,
, where
denotes the geodesic distance between
and
on
. Therefore,
Now let
. The same argument shows that
for some constant
because conditions (A1–3) hold for
.
As a result,
.
Method B: Since we are only required to bound the maximum Euclidean distance between
and
, i.e.
, we may view
and
as solution manifolds in
and tentatively ignore the manifold constraint
. Define
. Given that
, the gradient of
,
(H6) is a vector in
. Let
. We define a flow
such that
It can be argued by Theorem 7 in [28] that
when
for some small
. In addition, we can always choose
to be small enough so that
. By Theorem 3.39 in [59],
is uniquely defined because the gradient
is well defined for all
. We can also reparametrize
by arc length as
Let
be the terminal time/arc-length point and
be the destination of
on
. The above argument also demonstrates that the flows
or
converge to the manifold
from the normal direction of
, because we can write
and the column space of
spans the normal space of
at
. The goal now is to bound
because its length must be greater than or equal to
. We then define
. Differentiating
with respect to
leads to
(H7) by (d) in Lemma G.1. (Note that
because
and by the continuity of
, we can always choose
such that
for all
.) As
, by the proof of Method A, we know that
where
is some value between
and
. Hence,
, which is independent of
. This implies that
We can exchange the role of
and
and apply the same argument to show that
In total, this leads to the conclusion that
.
(c) By (h) in Lemma G.1, the reach of
has a lower bound,
. Note that
and
depend on the first three orders of derivatives of
. Thus, the lower bound for the reach of
will be identical to the one for
with an error rate
.
Note that for the stability of directional ridges, one can relax the condition (A1) by requiring
to be
-Hölder with
.
Remark H.1.
We apply two different methods to establish the stability theorem of directional density ridges. Method A utilizes the subspace constrained gradient flow constructed in Section H.1 and its quadratic behavior (Lemma H.1), while Method B defines a normal flow to the ridge
induced by the column space of
. Each of these two flows has its pros and cons. The subspace constrained gradient flow aligns more coherently with our directional SCMS algorithm (Algorithm 2) to identify the (estimated) directional ridge from data, because it relies only on the first- and second-order derivatives of the (estimated) density
. Nevertheless, the subspace constrained gradient flow does not necessarily converge to
in the optimal direction, that is, the normal direction to
. This can be seen from the explicit formula (G4) of
, which spans the normal space of
. The normal flow
defined in Method B, however, converges to
in its normal direction by construction. In general, the normal flow tends to the ridge
faster than the subspace constrained gradient flow, but it may be complicated to compute in any practical ridge-finding task due to its involvement with third-order derivatives of the (estimated) density
. Recently, [90] presented explicit formulae for finding density ridges via such a normal flow and its discrete gradient descent approximation. Additionally, they defined a smoothed version of the ridgeness function that also circumvents the computations of third-order derivatives of
.
I. Proofs of Proposition 4.1, Proposition 4.2 and Theorem 4.2
Proposition I.1.
Assume that the directional kernel
is non-increasing, twice continuously differentiable and convex with
. Given the directional KDE
and the directional SCMS sequence
defined by (4.13) or (4.14), the following properties hold:
(a) The estimated density sequence
is non-decreasing and thus converges.
(b)
.
(c) If the kernel
is also strictly decreasing on
, then
.
Proof of Proposition 4.1 —
(a) The sequence
is bounded because the kernel
is non-increasing with
. Hence, it suffices to show that it is non-decreasing. The convexity and differentiability of kernel
imply that
(I1) for all
. Then, with
and the iterative formula (4.14) in the main paper, we derive that
where we use the orthogonality between
and
in (i), multiply
into both the numerators and denominators of the two summands to obtain (ii), leverage the fact that
in (iii), and use the inequality
in (iv). This completes the proof of (a).
(b) Our derivation in (a) already shows that
Notice that, on the one hand, the differentiability of kernel
and the compactness of
imply that
for all
, where
only depends on the bandwidth
and kernel
. On the other hand, our argument in (a) already proves the convergence of
. Therefore,
as
. The result follows.
(c) Given the iterative formula (4.14) in the main paper, we deduce that
where we leverage the orthogonality between
and
to obtain (i) and (ii). Under the assumption that the kernel
is strictly decreasing and (twice) continuously differentiable, we know that
is lower bounded away from 0 on
. Therefore, with the result in (b), the above calculation indicates that
as
. The result follows.
Remark I.1.
The conditions imposed on kernel
in Proposition 4.1 are satisfied by some commonly used kernels, such as the von Mises kernel
. However, they can be further relaxed. On the one hand, it is sufficient to assume that the kernel
is twice continuously differentiable except for finitely many points on
. On the other hand, as long as the kernel
satisfies
and the true directional density
is positive almost everywhere on
, Lemma 4.1 demonstrates that
with probability tending to 1 when
and
. Therefore, our upper bound on
in our proof of (c) will be asymptotically valid for all
, even without the strictly decreasing assumption on the kernel
. Under such a relaxation, our conclusions in Proposition 4.1 apply to directional SCMS algorithms with other kernels that have bounded support on
.
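As a simplified companion to the discussion above, the fixed-point flavor of the directional updates can be illustrated with a plain directional mean shift iteration under the von Mises kernel L(r) = exp(-r), without the subspace constraint of (4.13)–(4.14). The synthetic circular sample and bandwidth below are illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sample on the unit circle, clustered around angle 0.
theta = rng.normal(0.0, 0.3, size=200)
X = np.column_stack([np.cos(theta), np.sin(theta)])
h = 0.5  # smoothing bandwidth (illustrative choice)

def fhat(x):
    # Directional KDE with the von Mises kernel L(r) = exp(-r),
    # up to a normalizing constant that does not affect the iteration.
    return np.mean(np.exp(-(1.0 - X @ x) / h**2))

def ms_step(x):
    # One directional mean shift step: a kernel-weighted average of the
    # data, rescaled back onto the unit sphere.
    w = np.exp(-(1.0 - X @ x) / h**2)
    y = X.T @ w
    return y / np.linalg.norm(y)

x = np.array([0.0, 1.0])  # initialize at angle pi/2
densities = [fhat(x)]
for _ in range(50):
    x = ms_step(x)
    densities.append(fhat(x))
```

The recorded density values are non-decreasing along the iteration, mirroring part (a) of Proposition I.1, and every iterate lies on the sphere by construction.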
Proposition I.2. Convergence of the SCGA Algorithm on
.
For any SCGA sequence
defined by (4.18) with
, the following properties hold:
(a) Under condition (A1), the objective function sequence
is non-decreasing and thus converges.
(b) Under condition (A1),
.
(c) Under conditions (A1–3), whenever
with the convergence radius
satisfying
where
is a constant defined in (h) of Lemma G.1 and
is a quantity depending on both the dimension
and the functional norm
up to the fourth-order (partial) derivatives of
.
Proof of Proposition 4.2 —
The proof is similar to our arguments in Proposition 3.1. For completeness, we still delineate the detailed steps, because the proof requires some nontrivial techniques on general Riemannian manifolds, such as parallel transports and line integrals.
(a) We first derive the following property of the objective function
supported on
, which is a counterpart of Fact 1 in the proof of Proposition 3.1.
Property 1. Given (A1), the function
is
-smooth on
, that is,
is
-Lipschitz. This property follows easily from the differentiability of
guaranteed by condition (A1) and Theorem 4.34 in [69] that
(I2) for any
, where
lies on the geodesic curve
with
and
. Then,
(I3) where the equality (i) follows from the fundamental theorem for line integrals (Theorem 11.39 in [68]), equality (ii) utilizes the isometric property of parallel transports, and inequality (iv) follows from (I2). Moreover, since the velocity of the geodesic
is always constant, we deduce that
and the equality (iii) follows. We will make use of the following direction of the inequality (I3):
(I4) Moreover, when
,
showing that the objective function
is non-decreasing along the SCGA path
on
. Given the compactness of
and the differentiability of
, we know that the sequence
is bounded. Thus, it converges.
(b) From (a), we know that when
,
Since the sequence
converges, it follows that
Recall from (2.5) that
, so
as well.
(c) Given condition (A2) and the fact that
, we know that
Let
be the projection of
in the SCGA sequence onto the directional ridge
. Since
by (h) of Lemma G.1,
is well defined when
. Recall from (G3) that the column space of
coincides with the normal space of
within the tangent space
. We define a geodesic
with
and calculate that
where we utilize the isometric properties of parallel transports in (i), note that the velocity of the geodesic is constant, i.e.
for any
to obtain (ii), leverage (d) of Lemma G.1 to deduce (iii), and use the fact that
when
in the inequality (iv). In particular for the inequality (iii),
is a full column rank matrix and
lies within the column space of
. Since the nonzero singular values of
are lower bounded by
, it follows that
In addition, we know that
comes from the supremum norm of
over the geodesic connecting
and
with
being the Riemannian connection, which in turn depends on the uniform functional norm
of the partial derivatives of
up to the fourth order. By (b), we deduce that
The results follow.
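The proof above manipulates geodesics, exponential maps and geodesic distances on the sphere, all of which have closed forms there. A small self-contained sketch (the function names are ours):

```python
import numpy as np

def geod_dist(x, y):
    # Great-circle (geodesic) distance between unit vectors.
    return np.arccos(np.clip(x @ y, -1.0, 1.0))

def exp_map(x, v):
    # Exponential map at x applied to a tangent vector v (with v @ x == 0):
    # follow the geodesic with initial velocity v for unit time.
    nv = np.linalg.norm(v)
    if nv == 0.0:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

x = np.array([0.0, 0.0, 1.0])
v = np.array([0.3, 0.4, 0.0])        # tangent vector at x with length 0.5
gamma = lambda t: exp_map(x, t * v)  # geodesic with constant speed 0.5
```

The geodesic stays on the sphere and travels at constant speed, the two properties invoked repeatedly in the derivation above.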
The nonzero curvature structure of the unit hypersphere
, on which the objective function (or density)
is defined, poses an extra challenge in establishing the linear convergence of the population and sample-based SCGA algorithms. Some useful techniques for analyzing the non-asymptotic convergence of first-order methods in
, such as the law of cosines and linearizations of the objective function, would fail on
[116]. Therefore, we first introduce a practical trigonometric distance bound for Alexandrov spaces [18] with sectional curvature bounded from below.
Lemma I.1. Lemma 5 in [116]; see also [14].
If
are the sides (i.e. side lengths) of a geodesic triangle in an Alexandrov space with sectional curvature lower bounded by
, and
is the angle between sides
and
, then
(I5)
A proof sketch of Lemma I.1 can be found in Lemma 5 of [116]. Note that the sectional curvature
on
. We inherit the notation in [116] and denote
by
for the curvature-dependent quantity in the inequality (I5). One can show by differentiating
with respect to
that
is strictly increasing and greater than 1 for any
and fixed
. With Lemma I.1 in hand, we are able to state a straightforward corollary indicating an important relation between two consecutive points in the SCGA sequence
on
defined by (4.18):
(I6)
Corollary I.1.
For any point
in a geodesically convex set on
, the update in (I6) satisfies
where
is the geodesic distance between
and
on
.
Proof of Corollary I.1 —
Recall that the (population) SCGA iterative formula on
is given by
. Note that for the geodesic triangle
with
, we have that
and
By letting
and
in Lemma I.1, we obtain that
Rearranging then yields the final display.
Note that
in our conditions (A2–3) is a geodesically convex set, in which the minimal geodesic between any two points in the set
always lies within the set. Hence, Corollary I.1 is applicable to the SCGA algorithm of interest when initialized within
.
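Assuming the curvature-dependent quantity takes the form ζ(κ, c) = √|κ| c / tanh(√|κ| c) for a sectional-curvature lower bound κ < 0, as in Lemma 5 of [116] (an assumption on our part, since the displayed formula is not reproduced here), its claimed monotonicity in c and the lower bound ζ > 1 can be checked numerically:

```python
import math

def zeta(kappa, c):
    # Curvature-dependent quantity in the trigonometric distance bound,
    # assuming the form sqrt(|kappa|) * c / tanh(sqrt(|kappa|) * c)
    # for a sectional-curvature lower bound kappa < 0 (cf. Lemma 5 of [116]).
    s = math.sqrt(abs(kappa)) * c
    return s / math.tanh(s) if s > 0.0 else 1.0

cs = [0.1 * k for k in range(1, 30)]
vals = [zeta(-1.0, c) for c in cs]
increasing = all(b > a for a, b in zip(vals, vals[1:]))
above_one = all(v > 1.0 for v in vals)
```

Both checks pass over this range, consistent with the differentiation argument sketched in the text, and ζ(κ, c) tends to 1 as c tends to 0.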
Theorem I.1. Linear Convergence of the SCGA Algorithm on
.
Assume conditions (A1–4) throughout the theorem.
(a) Q-Linear convergence of: Consider a convergence radius
satisfying
where
is the constant defined in (h) of Lemma G.1 and
is a quantity defined in (c) of Proposition 4.2 that depends on both the dimension
and the functional norm
up to the fourth-order (partial) derivatives of
. Whenever
and the initial point
with
, we have that
(b) R-Linear convergence of: Under the same radius
in (a), we have that whenever
and the initial point
with
,
We further assume (D1–2) in the rest of the statements. Suppose that
and
.
(c) Q-Linear convergence of: Under the same radius
and
in (a), we have that
with probability tending to 1 whenever
and the initial point
with
.
(d) R-Linear convergence of: Under the same radius
and
in (a), we have that
with probability tending to 1 whenever
and the initial point
with
.
Proof of Theorem 4.2 —
The proof is similar to our argument in Theorem 3.1, except that the objective function
is supported on a nonlinear manifold
here. The key additional ingredient is Corollary I.1. We first recall the following two properties.
Property 1. Given (A1), the function
is
-smooth on
, that is,
is
-Lipschitz.
Property 2. Given conditions (A1–3), we know that
and
for any
with
.
Property 1 has been established in the proof of Proposition 4.2, indicating that the objective function sequence
is non-decreasing when
Property 2 is a direct corollary of Proposition 4.2, because
and
is the objective function value after one step of the SCGA iteration on
with step size
. The iteration will move
closer to the directional ridge
. With the help of these two properties, we start the proofs of (a–d).
(a) We first prove the following claim using Lemma E.1: for all
and
,
(I7) where
. By the differentiability of
ensured by condition (A1) and Taylor’s theorem on
, we deduce that
![]()
where we leverage the equality
in (i) and (iii), use conditions (A2) and (A4) that
and
in (ii), apply the quadratic bound for
in condition (A4) to obtain (iv), and leverage the facts that
and
when
in (v); recall (2.5). Our claim (I7) is thus proved.
In addition, given Property 2 and any
, we derive that
where we apply (I3) to obtain the inequality. This indicates that
(I8) for any
. Therefore, by Corollary I.1, we obtain that
whenever
, where we utilize Corollary I.1 and the monotonicity of
with respect to
in (i), apply (I7) and (I8) to obtain (ii), and use the choice of
to argue that
in (iii). By telescoping, we conclude that when
and
,
The result follows.
(b) The result follows immediately from (a) and the fact that
for all
.
(c) The proof is logically similar to the proof of (c) in Theorem 3.1. We write the spectral decompositions of
and
as
By Weyl's theorem (Theorem 4.3.1 in [58]) and the uniform bounds (4.6),
Thus,
will satisfy condition (A2) with high probability when
is sufficiently small and
is sufficiently large. According to Davis–Kahan theorem (Lemma D.2 here), uniform bounds (4.6), and the continuity of exponential maps, we have that
for any
, where we utilize the Davis–Kahan theorem and
in (i). Hence, when
and
,
(I9) with probability tending to 1.
We now claim that
and
(I10) for all
. We again prove this claim by induction on the iteration number. Note that when
, we derive that
where we apply the triangle inequality in (i) and leverage the result in (a) and (I9) to obtain (ii). The triangle inequality is valid in this context because the geodesic distance is the minimal distance between two points on
. Moreover, by the choice of
and (I9), we are sure that
. In the induction from
, we suppose that
and the claim (I10) holds at iteration
. The same argument then implies that the claim (I10) holds for iteration
and that
. The claim (I10) is thus verified.
Now, given that
, we iterate the claim (I10) to show that
where the fourth inequality follows by summing the geometric series, and the last equality is due to our notation
. This completes the proof.
(d) The result follows directly from (c) and the inequality
for all
.
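The perturbation argument in part (c) rests on two linear-algebra facts: Weyl's theorem controls how much the eigenvalues move under a symmetric perturbation, and a Davis–Kahan-type bound controls how much the associated eigenvector subspace rotates. A self-contained numeric sketch, with a hypothetical symmetric matrix and perturbation of our own choosing standing in for the population and estimated Hessians:

```python
import numpy as np

rng = np.random.default_rng(1)

A = np.diag([3.0, 1.0, -2.0])           # "population" symmetric matrix
E = 1e-3 * rng.standard_normal((3, 3))
E = (E + E.T) / 2.0                     # small symmetric perturbation
B = A + E                               # "sample" matrix

la, Va = np.linalg.eigh(A)              # eigenvalues in ascending order
lb, Vb = np.linalg.eigh(B)

# Weyl's theorem: each eigenvalue moves by at most the spectral norm of E.
eig_shift = np.max(np.abs(la - lb))
opnorm_E = np.linalg.norm(E, 2)

# Davis-Kahan flavor: the bottom eigenvector subspace (1-dimensional here)
# rotates by O(||E|| / eigen-gap) in the projection (sin-theta) distance.
Pa = np.outer(Va[:, 0], Va[:, 0])
Pb = np.outer(Vb[:, 0], Vb[:, 0])
subspace_dist = np.linalg.norm(Pa - Pb, 2)
```

With a perturbation of spectral norm around 1e-3 and an eigen-gap of 3 between the bottom eigenvalue and the rest, the subspace distance is of the same small order, which is the mechanism exploited through Lemma D.2 in the proof of (c).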
Contributor Information
Yikun Zhang, Department of Statistics, University of Washington, Seattle, WA 98195, USA.
Yen-Chi Chen, Department of Statistics, University of Washington, Seattle, WA 98195, USA.
References
- 1. Absil, P.-A., Mahony, R. & Sepulchre, R. (2008) Optimization Algorithms on Matrix Manifolds. Princeton, NJ: Princeton University Press.
- 2. Absil, P.-A., Mahony, R. & Trumpf, J. (2013) An extrinsic look at the Riemannian Hessian. Geometric Science of Information (F. Nielsen & F. Barbaresco eds). Berlin, Heidelberg: Springer, pp. 361–368.
- 3. Aliyari Ghassabeh, Y. (2015) A sufficient condition for the convergence of the mean shift algorithm with Gaussian kernel. J. Multivariate Anal., 135, 1–10.
- 4. Anitescu, M. (2000) Degenerate nonlinear programming with a quadratic growth condition. SIAM J. Optim., 10, 1116–1135.
- 5. Argus, D. F., Gordon, R. G. & DeMets, C. (2011) Geologically current motion of 56 plates relative to the no-net-rotation reference frame. Geochemistry, Geophysics, Geosystems, 12.
- 6. Arias-Castro, E., Mason, D. & Pelletier, B. (2016) On the estimation of the gradient lines of a density and the consistency of the mean-shift algorithm. J. Mach. Learn. Res., 17, 1–28.
- 7. Bai, Z., Rao, C. & Zhao, L. (1988) Kernel estimators of density function of directional data. J. Multivariate Anal., 27, 24–39.
- 8. Balakrishnan, S., Wainwright, M. J. & Yu, B. (2017) Statistical guarantees for the EM algorithm: from population to sample-based analysis. Ann. Statist., 45, 77–120.
- 9. Banerjee, A., Dhillon, I. S., Ghosh, J. & Sra, S. (2005) Clustering on the unit hypersphere using von Mises-Fisher distributions. J. Mach. Learn. Res., 6, 1345–1382.
- 10. Banyaga, A. & Hurtubise, D. (2004) Lectures on Morse Homology. Texts in the Mathematical Sciences. Netherlands: Springer.
- 11. Beck, A. & Tetruashvili, L. (2013) On the convergence of block coordinate descent type methods. SIAM J. Optim., 23, 2037–2060.
- 12. Beran, R. (1979) Exponential models for directional data. Ann. Statist., 7, 1162–1178.
- 13. Bird, P. (2003) An updated digital model of plate boundaries. Geochemistry, Geophysics, Geosystems, 4.
- 14. Bonnabel, S. (2013) Stochastic gradient descent on Riemannian manifolds. IEEE Trans. Automat. Control, 58, 2217–2229.
- 15. Boumal, N. (2020) An introduction to optimization on smooth manifolds. Available online.
- 16. Bowman, A. W. (1984) An alternative method of cross-validation for the smoothing of density estimates. Biometrika, 71, 353–360.
- 17. Bubeck, S. (2015) Convex optimization: algorithms and complexity. Found. Trends Mach. Learn., 8, 231–357.
- 18. Burago, Y., Gromov, M. & Perel'man, G. (1992) A. D. Alexandrov spaces with curvature bounded below. Russian Math. Surveys, 47, 1–58.
- 19. Carreira-Perpiñán, M. Á. (2007) Gaussian mean-shift is an EM algorithm. IEEE Trans. Pattern Anal. Mach. Intell., 29, 767–776.
- 20. Chacón, J. E., Duong, T. & Wand, M. P. (2011) Asymptotics for general multivariate kernel density derivative estimators. Statist. Sinica, 21, 807.
- 21. Charles, Z. & Papailiopoulos, D. (2018) Stability and generalization of learning algorithms that converge to global optima. International Conference on Machine Learning. PMLR, pp. 745–754.
- 22. Chen, Y.-C., Genovese, C. R. & Wasserman, L. (2015a) Asymptotic theory for density ridges. Ann. Statist., 43, 1896–1928.
- 23. Chen, Y.-C., Ho, S., Freeman, P. E., Genovese, C. R. & Wasserman, L. (2015b) Cosmic web reconstruction through density ridges: method and algorithm. Monthly Notices of the Royal Astronomical Society, 454, 1140–1156.
- 24. Chen, Y.-C., Genovese, C. R., Ho, S. & Wasserman, L. (2015c) Optimal ridge detection using coverage risk. Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc.
- 25. Chen, Y.-C., Genovese, C. R. & Wasserman, L. (2016a) A comprehensive approach to mode clustering. Electron. J. Stat., 10, 210–241.
- 26. Chen, Y.-C., Ho, S., Brinkmann, J., Freeman, P. E., Genovese, C. R., Schneider, D. P. & Wasserman, L. (2016b) Cosmic web reconstruction through density ridges: catalogue. Monthly Notices of the Royal Astronomical Society, 461, 3896–3909.
- 27. Chen, Y.-C. (2017) A tutorial on kernel density estimation and recent advances. Biostatistics & Epidemiology, 1, 161–187.
- 28. Chen, Y.-C. (2022) Solution manifold and its statistical applications. Electron. J. Stat., 16, 408–450.
- 29. Cheng, Y. (1995) Mean shift, mode seeking, and clustering. IEEE Trans. Pattern Anal. Mach. Intell., 17, 790–799.
- 30. Chrisman, N. R. (2017) Calculating on a round planet. International Journal of Geographical Information Science, 31, 637–657.
- 31. Comaniciu, D. & Meer, P. (2002) Mean shift: a robust approach toward feature space analysis. IEEE Trans. Pattern Anal. Mach. Intell., 24, 603–619.
- 32. Cuevas, A. (2009) Set estimation: another bridge between statistics and geometry. Bol. Estad. Investig. Oper., 25, 71–85.
- 33. Damon, J. (1999) Properties of ridges and cores for two-dimensional images. J. Math. Imaging Vis., 10, 163–174.
- 34. Daniilidis, A., Ley, O. & Sabourau, S. (2010) Asymptotic behaviour of self-contracted planar curves and gradient orbits of convex functions. J. Math. Pures Appl., 94, 183–199.
- 35. Daniilidis, A., David, G., Durand-Cartagena, E. & Lemenant, A. (2015) Rectifiability of self-contracted curves in the Euclidean space and applications. J. Geom. Anal., 25, 1211–1239.
- 36. Davis, C. & Kahan, W. M. (1970) The rotation of eigenvectors by a perturbation. III. SIAM J. Numer. Anal., 7, 1–46.
- 37. do Carmo, M. (2016) Differential Geometry of Curves and Surfaces: Revised and Updated Second Edition. Dover Books on Mathematics. Dover Publications.
- 38. Drusvyatskiy, D. & Lewis, A. S. (2018) Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res., 43, 919–948.
- 39. Eberly, D. (1996) Ridges in Image and Data Analysis. Computational Imaging and Vision. Springer Netherlands.
- 40. Einmahl, U. & Mason, D. M. (2005) Uniform in bandwidth consistency of kernel-type function estimators. Ann. Statist., 33, 1380–1403.
- 41. Fazel, M., Ge, R., Kakade, S. & Mesbahi, M. (2018) Global convergence of policy gradient methods for the linear quadratic regulator. International Conference on Machine Learning. PMLR, pp. 1467–1476.
- 42. Federer, H. (1959) Curvature measures. Trans. Amer. Math. Soc., 93, 418–491.
- 43. García-Portugués, E. (2013) Exact risk improvement of bandwidth selectors for kernel density estimation with directional data. Electron. J. Stat., 7, 1655–1685.
- 44. García-Portugués, E., Crujeiras, R. M. & González-Manteiga, W. (2013) Kernel density estimation for directional-linear data. J. Multivariate Anal., 121, 152–175.
- 45. Genovese, C. R., Perone-Pacifico, M., Verdinelli, I. & Wasserman, L. (2014) Nonparametric ridge estimation. Ann. Statist., 42, 1511–1545.
- 46. Ghassabeh, Y. A., Linder, T. & Takahara, G. (2013) On some convergence properties of the subspace constrained mean shift. Pattern Recognition, 46, 3140–3147.
- 47. Ghassabeh, Y. A. & Rudzicz, F. (2020) Modified subspace constrained mean shift algorithm. J. Classification, 1–17.
- 48. Giné, E. & Guillou, A. (2002) Rates of strong uniform consistency for multivariate kernel density estimators. Annales de l'Institut Henri Poincaré (B) Probability and Statistics, 38, 907–921.
- 49. Gupta, C., Balakrishnan, S. & Ramdas, A. (2021) Path length bounds for gradient descent and flow. J. Mach. Learn. Res., 22, 1–63.
- 50. Hall, P. (1983) Large sample optimality of least squares cross-validation in density estimation. Ann. Statist., 1156–1174.
- 51. Hall, P., Watson, G. S. & Cabrera, J. (1987) Kernel density estimation with spherical data. Biometrika, 74, 751–762.
- 52. Hall, P., Qian, W. & Titterington, D. M. (1992) Ridge finding from noisy data. J. Comput. Graph. Statist., 1, 197–211.
- 53. Hall, P., Peng, L. & Rau, C. (2001) Local likelihood tracking of fault lines and boundaries. J. R. Stat. Soc. Ser. B Stat. Methodol., 63, 569–582.
- 54. Harris, R. A. (2017) Large earthquakes and creeping faults. Reviews of Geophysics, 55, 169–198.
- 55. Hastie, T. & Stuetzle, W. (1989) Principal curves. J. Amer. Statist. Assoc., 84, 502–516.
- 56. Hauberg, S. (2015) Principal curves on Riemannian manifolds. IEEE Trans. Pattern Anal. Mach. Intell., 38, 1915–1921.
- 57. Horn, R. A. & Johnson, C. R. (1991) Topics in Matrix Analysis. Cambridge Univ. Press.
- 58. Horn, R. A. & Johnson, C. R. (2012) Matrix Analysis, 2nd edn. Cambridge Univ. Press.
- 59. Irwin, M. C. (2001) Smooth Dynamical Systems, vol. 17. World Scientific.
- 60. Izenman, A. J. (2012) Introduction to manifold learning. Wiley Interdiscip. Rev. Comput. Stat., 4, 439–446.
- 61. Jones, M. C., Marron, J. S. & Sheather, S. J. (1996) A brief survey of bandwidth selection for density estimation. J. Amer. Statist. Assoc., 91, 401–407.
- 62. Kafai, M., Miao, Y. & Okada, K. (2010) Directional mean shift and its application for topology classification of local 3D structures. Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition-Workshops. IEEE, pp. 170–177.
- 63. Karimi, H., Nutini, J. & Schmidt, M. (2016) Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. Machine Learning and Knowledge Discovery in Databases. Cham: Springer International Publishing, pp. 795–811.
- 64. Klemelä, J. (2000) Estimation of densities and derivatives of densities with directional data. J. Multivariate Anal., 73, 18–40.
- 65. Kobayashi, T. & Otsu, N. (2010) Von Mises-Fisher mean shift for clustering on a hypersphere. 20th International Conference on Pattern Recognition. IEEE, pp. 2130–2133.
- 66. Kozak, D., Becker, S., Doostan, A. & Tenorio, L. (2019) Stochastic subspace descent. arXiv preprint arXiv:1904.01145.
- 67. Kozak, D., Becker, S., Doostan, A. & Tenorio, L. (2020) A stochastic subspace approach to gradient-free optimization in high dimensions. arXiv preprint arXiv:2003.02684.
- 68. Lee, J. (2012) Introduction to Smooth Manifolds. Graduate Texts in Mathematics, 2nd edn. Springer.
- 69. Lee, J. M. (2018) Introduction to Riemannian Manifolds. Springer.
- 70. Ley, C. & Verdebout, T. (2017) Modern Directional Statistics. CRC Press.
- 71. Li, X., Hu, Z. & Wu, F. (2007) A note on the convergence of the mean shift. Pattern Recognition, 40, 1756–1762.
- 72. Lojasiewicz, S. (1963) A topological property of real analytic subsets. Coll. du CNRS, Les équations aux dérivées partielles, 117, 87–89.
- 73. Luo, Z.-Q. & Tseng, P. (1992) On the convergence of the coordinate descent method for convex differentiable minimization. J. Optim. Theory Appl., 72, 7–35.
- 74. Mardia, K. & Jupp, P. (2000) Directional Statistics. Wiley Series in Probability and Statistics. Wiley.
- 75. Marzio, M. D., Panzera, A. & Taylor, C. C. (2011) Kernel density estimation on the torus. J. Statist. Plann. Inference, 141, 2156–2173.
- 76. Necoara, I., Nesterov, Y. & Glineur, F. (2019) Linear convergence of first order methods for non-strongly convex optimization. Math. Programming, 175, 69–107.
- 77. Nesterov, Y. (2018) Lectures on Convex Optimization, vol. 137. Springer.
- 78. Nocedal, J. & Wright, S. J. (2006) Numerical Optimization. Springer Series in Operations Research and Financial Engineering, 2nd edn. New York: Springer.
- 79. Norgard, G. & Bremer, P.-T. (2012) Second derivative ridges are straight lines and the implications for computing Lagrangian coherent structures. Phys. D, 241, 1475–1476.
- 80. Oba, S., Kato, K. & Ishii, S. (2005) Multi-scale clustering for gene expression profiling data. Proceedings of Fifth IEEE Symposium on Bioinformatics and Bioengineering (BIBE'05). IEEE, pp. 210–217.
- 81. Ok, E. A. (2007) Real Analysis with Economic Applications, vol. 10. Princeton University Press.
- 82. Oliveira, M., Crujeiras, R. M. & Rodríguez-Casal, A. (2012) A plug-in rule for bandwidth selection in circular density estimation. Comput. Stat. Data Anal., 56, 3898–3908.
- 83. Ozertem, U. & Erdogmus, D. (2011) Locally defined principal curves and surfaces. J. Mach. Learn. Res., 12, 1249–1286.
- 84. Peikert, R., Günther, D. & Weinkauf, T. (2013) Comment on "Second derivative ridges are straight lines and the implications for computing Lagrangian coherent structures". Phys. D, 242, 65–66.
- 85. Pennec, X. (2006) Intrinsic statistics on Riemannian manifolds: basic tools for geometric measurements. J. Math. Imaging Vision, 25, 127–154.
- 86. Pewsey, A. & García-Portugués, E. (2021) Recent advances in directional statistics. Test, 1–58.
- 87. Polyak, B. (1963) Gradient methods for the minimisation of functionals. Comput. Math. Math. Phys., 3, 864–878.
- 88. Qiao, W. (2021) Asymptotic confidence regions for density ridges. Bernoulli, 27, 946–975.
- 89. Qiao, W. & Polonik, W. (2016) Theoretical analysis of nonparametric filament estimation. Ann. Statist., 44, 1269–1297.
- 90. Qiao, W. & Polonik, W. (2021) Algorithms for ridge estimation with convergence guarantees. arXiv preprint arXiv:2104.12314.
- 91. Rudemo, M. (1982) Empirical choice of histograms and kernel density estimators. Scand. J. Statist., 65–78.
- 92. Rudin, W. (1976) Principles of Mathematical Analysis, 3rd edn. New York: McGraw-Hill.
- 93. Saavedra-Nieves, P. & Crujeiras, R. M. (2020) Nonparametric estimation of directional highest density regions. arXiv preprint arXiv:2009.08915.
- 94. Saragih, J. M., Lucey, S. & Cohn, J. F. (2009) Face alignment through subspace constrained mean-shifts. Proceedings of the IEEE 12th International Conference on Computer Vision. IEEE, pp. 1034–1041.
- 95. Sasaki, H., Kanamori, T. & Sugiyama, M. (2017) Estimating density ridges by direct estimation of density-derivative-ratios. Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, vol. 54 (A. Singh & J. Zhu eds). Fort Lauderdale, FL, USA: PMLR, pp. 204–212.
- 96. Scott, D. (2015) Multivariate Density Estimation: Theory, Practice, and Visualization. Wiley Series in Probability and Statistics. Wiley.
- 97. Sheather, S. J. (2004) Density estimation. Statist. Sci., 19, 588–597.
- 98. Sheather, S. J. & Jones, M. C. (1991) A reliable data-based bandwidth selection method for kernel density estimation. J. R. Stat. Soc. Ser. B Stat. Methodol., 53, 683–690.
- 99. Silverman, B. W. (1986) Density Estimation for Statistics and Data Analysis. London: Chapman and Hall.
- 100. Snyder, J., Voxland, P. & Geological Survey (U.S.) (1989) An Album of Map Projections. U.S. Government Printing Office.
- 101. Sousbie, T., Pichon, C., Courtois, H., Colombi, S. & Novikov, D. (2007) The three-dimensional skeleton of the SDSS. The Astrophysical Journal, 672, L1–L4.
- 102. Stone, C. J. (1984) An asymptotically optimal window selection rule for kernel density estimates. Ann. Statist., 1285–1297.
- 103. Subarya, C., Chlieh, M., Prawirodirdjo, L., Avouac, J.-P., Bock, Y., Sieh, K., Meltzner, A. J., Natawidjaja, D. H. & McCaffrey, R. (2006) Plate-boundary deformation associated with the great Sumatra-Andaman earthquake. Nature, 440, 46–51.
- 104. Subbarao, R. & Meer, P. (2006) Nonlinear mean shift for clustering over analytic manifolds. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), vol. 1. IEEE, pp. 1168–1175.
- 105. Subbarao, R. & Meer, P. (2009) Nonlinear mean shift over Riemannian manifolds. Int. J. Comput. Vis., 84, 1.
- 106. Taylor, C. C. (2008) Automatic bandwidth selection for circular density estimation. Comput. Statist. Data Anal., 52, 3493–3500.
- 107. van der Vaart, A. W. (1998) Asymptotic Statistics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge Univ. Press.
- 108. van der Vaart, A. W. & Wellner, J. A. (1996) Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Science & Business Media.
- 109. von Luxburg, U. (2007) A tutorial on spectral clustering. Statist. Comput., 17, 395–416.
- 110. Wasserman, L. (2006) All of Nonparametric Statistics. Springer Texts in Statistics. Berlin, Heidelberg: Springer-Verlag.
- 111. Wasserman, L. (2018) Topological data analysis. Annu. Rev. Stat. Appl., 5, 501–532.
- 112. Wright, S. J. (2015) Coordinate descent algorithms. Math. Programming, 151, 3–34.
- 113. Yang, M.-S., Chang-Chien, S.-J. & Kuo, H.-C. (2014) On mean shift clustering for directional data on a hypersphere. Proceedings of the Artificial Intelligence and Soft Computing. Cham: Springer International Publishing, pp. 809–818.
- 114. You, S., Bas, E., Erdogmus, D. & Kalpathy-Cramer, J. (2011) Principal curve based retinal vessel segmentation towards diagnosis of retinal diseases. Proceedings of the IEEE First International Conference on Healthcare Informatics, Imaging and Systems Biology. IEEE, pp. 331–337.
- 115. Yu, Y., Wang, T. & Samworth, R. J. (2014) A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102, 315–323.
- 116. Zhang, H. & Sra, S. (2016) First-order methods for geodesically convex optimization. Proceedings of the 29th Annual Conference on Learning Theory (V. Feldman, A. Rakhlin & O. Shamir eds). Proceedings of Machine Learning Research, vol. 49. New York, NY, USA: PMLR, pp. 1617–1638.
- 117. Zhang, Y. & Chen, Y.-C. (2021a) The EM perspective of directional mean shift algorithm. arXiv preprint arXiv:2101.10058.
- 118. Zhang, Y. & Chen, Y.-C. (2021b) Kernel smoothing, mean shift, and their learning theory with directional data. J. Mach. Learn. Res., 22, 1–92.
- 119. Zhang, Y. & Chen, Y.-C. (2021c) Mode and ridge estimation in Euclidean and directional product spaces: a mean shift approach. arXiv preprint arXiv:2110.08505.
- 120. Zhao, L. & Wu, C. (2001) Central limit theorem for integrated squared error of kernel estimators of spherical density. Sci. China Ser. A Math., 44, 474–483.
Data Availability Statement
The data and code underlying this paper are available at https://github.com/zhangyk8/EuDirSCMS. Specifically, the earthquake data in Section 5.3 were obtained from the Earthquake Catalog (https://earthquake.usgs.gov/earthquakes/search/) of the United States Geological Survey.