Abstract
Sparse dictionary learning (SDL) is a classic representation learning method that has been widely used in data analysis. Recently, $\ell_p$-norm maximization has been proposed to solve SDL, reshaping the problem into an optimization problem with orthogonality constraints. In this paper, we first propose an $\ell_p$-norm maximization model for solving dual principal component pursuit (DPCP), based on the similarities between DPCP and SDL. Then, we propose a smooth unconstrained exact penalty model and show its equivalence with the $\ell_p$-norm maximization model. Based on this penalty model, we develop an efficient first-order algorithm (PenNMF) and establish its global convergence. Extensive experiments illustrate the high efficiency of PenNMF compared with other state-of-the-art algorithms for solving $\ell_p$-norm maximization with orthogonality constraints.
Keywords: dual principal component pursuit, orthogonality constraint, sparse dictionary learning, Stiefel manifold
1. Introduction
In this paper, we focus on solving the optimization problem with orthogonality constraints:
| (1) |
where W is the variable, Y is a given data matrix, and I denotes the identity matrix of the corresponding size. Besides, the $\ell_p$-norm is defined element-wise, i.e., $\|A\|_p = \big(\sum_{i,j} |A_{ij}|^p\big)^{1/p}$ for a constant p. For brevity, the orthogonality constraints in (1) can be expressed as $W^\top W = I$, and the corresponding feasible set is the Stiefel manifold in the real matrix space, which we call the Stiefel manifold for simplicity in the rest of our paper.
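For reference, a minimal sketch of evaluating this objective and the feasibility violation in NumPy is given below; the orientation of the product ($Y^\top W$) and the norm order $p = 4$ are illustrative assumptions rather than the exact conventions of the paper.

```python
import numpy as np

def lp_objective(W, Y, p=4):
    """Element-wise l_p-norm objective ||Y^T W||_p^p (product orientation assumed)."""
    return np.sum(np.abs(Y.T @ W) ** p)

def feasibility_violation(W):
    """Distance to the orthogonality constraint, ||W^T W - I||_F."""
    k = W.shape[1]
    return np.linalg.norm(W.T @ W - np.eye(k), "fro")

# A random feasible point with orthonormal columns via the QR factorization.
rng = np.random.default_rng(0)
Y = rng.standard_normal((50, 500))                    # n x N data matrix
W, _ = np.linalg.qr(rng.standard_normal((50, 10)))    # n x p, orthonormal columns
print(lp_objective(W, Y), feasibility_violation(W))
```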
Sparse dictionary learning (SDL) exploits the low-dimensional features within a set of unlabeled data, and therefore plays an important role in unsupervised representation learning. More specifically, given a data set Y that contains N samples, SDL aims to compute a full-rank matrix, called the dictionary, and an associated sparse representation X that satisfy
| (2) |
or equivalently, find a such that
| (3) |
As a result, SDL can be solved by finding a dictionary that leads to a sparse representation. Some existing works introduce $\ell_0$- or $\ell_1$-norm penalty terms to promote the underlying sparsity of X and present various algorithms for solving the resulting optimization models; see the works in [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] for instance. Interested readers are referred to a recent paper [16] and the references therein. However, the $\ell_1$-norm minimization-based models are known to be sensitive to noise, and so far the existing approaches are not efficient enough for solving real application problems, which are often large-scale [17]. Consequently, a proper model with an efficient algorithm for SDL is desired, especially in the large-scale case.
Recently, an $\ell_4$-norm maximization model was proposed in [17], which can recover the entire dictionary in a single run. This new formulation is motivated by the fact that maximizing a higher-order norm promotes spikiness and sparsity at the same time. The authors of [17] demonstrate that the global minimizers of $\ell_4$-norm maximization with orthogonality constraints are very close to the true dictionary. Moreover, the concavity of the objective function in Equation (1) enables a fast fixed-point type algorithm, named matching, stretching, and projection (MSP). MSP achieves significant speedups compared with existing methods based on $\ell_0$- or $\ell_1$-norm penalty minimization. As maximizing any higher-order norm over a lower-order norm constraint leads to sparse and spiky solutions, Shen et al. [18] extend the $\ell_4$-norm maximization technique to a generalized $\ell_p$-norm maximization. In addition, the authors propose a gradient projection method (GPM) for solving it with guaranteed global convergence.
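A tiny numerical illustration of this principle (a sketch only): with the $\ell_2$-norm fixed, a sparser, spikier vector attains a larger $\ell_4$-norm, which is why maximizing a higher-order norm under a lower-order norm constraint favors sparse solutions.

```python
import numpy as np

rng = np.random.default_rng(0)
dense = rng.standard_normal(1000)
dense /= np.linalg.norm(dense)        # unit l2-norm, mass spread over all entries
sparse = np.zeros(1000)
sparse[:10] = rng.standard_normal(10)
sparse /= np.linalg.norm(sparse)      # unit l2-norm, mass concentrated on 10 entries
# The sparser vector has the larger l4-norm.
print(np.linalg.norm(dense, 4), np.linalg.norm(sparse, 4))
```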
However, both MSP and GPM invoke a polar decomposition to maintain feasibility in each iteration. As illustrated in [19,20,21], orthonormalization lacks concurrency, which results in low scalability in column-wise parallel computing, particularly when the number of columns is large.
Several infeasible approaches have been developed to avoid orthonormalization. Gao et al. [19] propose the proximal linearized augmented Lagrangian method (PLAM) as well as its enhanced version, PCAL. Based on the merit function used in Gao et al. [19], Xiao et al. [21] propose an exact penalty model with a convex and compact auxiliary constraint, named PenC, for optimization problems with orthogonality constraints. The authors propose an approximate gradient method named PenCF for solving PenC and show its global convergence and local convergence rate under mild conditions. The above-mentioned infeasible approaches do not require orthonormalization in each iteration. Numerical experiments illustrate the promising performance of these infeasible approaches compared with existing state-of-the-art algorithms.
Although PCAL and PenCF avoid the orthonormalization process by taking infeasible steps, these approaches require additional constraints to restrict the iterate sequence to a compact set, which can undermine their overall efficiency. Therefore, to develop an efficient algorithm for solving SDL, an infeasible model without such constraints is desired.
Similar to the $\ell_1$-norm penalty model for SDL, dual principal component pursuit (DPCP) aims to recover the normal vector of a hyperplane from samples contaminated by outliers. Specifically, DPCP solves the following nonsmooth nonconvex optimization problem with a spherical constraint:
| (4) |
Due to its ability to recover a hyperplane from corrupted samples, DPCP has wide applications in 3D computer vision, such as detecting planar structures in 3D point clouds from the KITTI dataset [22,23] and estimating relative poses in multiple-view geometry [24].
Existing approaches [25,26,27,28] for solving convex relaxations of (4) are not scalable and not competent in high relative dimension cases [29]. On the other hand, the Random Sample Consensus (RANSAC) algorithm [30] has been one of the most popular methods in computer vision for the high relative dimension setting. RANSAC alternates between fitting a subspace to a minimal number of randomly sampled points and measuring the quality of the selected subspace by the number of data points close to it. In particular, as described in [29], RANSAC can be extremely effective when the probability of sampling an outlier-free subset within the allocated time budget is large. Recently, Tsakiris and Vidal [31] introduce Denoised-DPCP (DPCP-d), a denoised formulation of the problem. In the same paper, Tsakiris and Vidal [31] propose an Iteratively-Reweighted-Least-Squares algorithm (DPCP-IRLS) for solving the non-convex DPCP problem (4). The authors illustrate that DPCP-IRLS can successfully handle a large proportion of outliers and show its high efficiency compared with RANSAC. In addition, Zhu et al. [32] propose a projected subgradient-based algorithm named DPCP-PSGM, which exhibits great efficiency in reconstructing the road plane in the KITTI dataset. There are also some approaches using smoothing techniques to approximate the $\ell_1$-norm term, such as Logcosh [8,33], the Huber loss [34], the pseudo-Huber loss [5], etc.; then, algorithms for minimizing a smooth objective function on a sphere can be applied. Nonetheless, these smoothing techniques often introduce approximation errors, as the smooth objective functions usually lead to dense solutions. Qu et al. [35] and Sun et al. [8] propose a rounding step as postprocessing to achieve exact recovery [16] by solving a linear program, which leads to additional computational cost.
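For intuition, a toy sketch of such a RANSAC scheme for a linear hyperplane through the origin is given below; the sample size of D - 1 points, the inlier threshold, and the iteration budget are illustrative assumptions rather than the settings used in the cited works.

```python
import numpy as np

def ransac_hyperplane(Y, n_iters=100, tau=1e-2, rng=None):
    """Toy RANSAC for a linear hyperplane through the origin in R^D.

    Y: D x N data matrix (columns are samples). Returns the candidate normal
    vector whose consensus set (points within distance tau of the plane) is largest.
    """
    rng = rng or np.random.default_rng(0)
    D, N = Y.shape
    best_b, best_count = None, -1
    for _ in range(n_iters):
        idx = rng.choice(N, size=D - 1, replace=False)
        # Normal vector of the subspace spanned by the sampled points: the left
        # singular vector associated with the smallest singular value.
        U, _, _ = np.linalg.svd(Y[:, idx], full_matrices=True)
        b = U[:, -1]
        count = np.sum(np.abs(b @ Y) < tau)   # size of the consensus set
        if count > best_count:
            best_b, best_count = b, count
    return best_b
```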
The main difficulties in developing efficient algorithms are the nonsmoothness and nonconvexity of the DPCP models. Observing the similarity between SDL and DPCP, we consider adopting the $\ell_p$-norm maximization to reformulate DPCP as a smooth optimization problem on the sphere.
1.1. Contribution
In this paper, we first point out that the DPCP problem can be formulated as the $\ell_p$-norm maximization (1) with a single column. Therefore, both SDL and DPCP can be unified as a smooth optimization problem on the Stiefel manifold.
Motivated by PenC [21], we propose a novel penalty function with the following expression,
| (5) |
where $\beta > 0$ is the penalty parameter and $\mathrm{sym}(\cdot)$ is the operator that symmetrizes a square matrix, defined by $\mathrm{sym}(A) = (A + A^\top)/2$. We show that this penalty function is bounded from below, so the convex compact constraint in PenC can be avoided. Therefore, we propose the following smooth unconstrained penalty model for $\ell_p$-norm maximization (PenNM),
| (6) |
We prove that, with a suitable choice of the penalty parameter, Equation (6) is an exact penalty function of Equation (1) under some mild conditions. Moreover, under an additional mild condition, we verify that PenNM does not introduce any first-order stationary point other than those of Equation (1) and the origin. Based on the new exact penalty model, we propose an efficient orthonormalization-free first-order algorithm named PenNMF with no additional constraints. In each iteration, PenNMF adopts an approximate gradient instead of the exact one, which involves the second-order derivative of the original objective. The global convergence of PenNMF can be established under mild conditions.
The numerical experiments on synthetic and real imagery data demonstrate that PenNMF outperforms PenCF and MSP/GPM in solving SDL, especially in large-scale cases. As an infeasible method, PenNMF shows superior performance when compared with MSP and GPM, which invoke an orthonormalization process to maintain feasibility. Moreover, when compared with PenCF, PenNMF also shows better performance, implying the benefit of avoiding the constraints in PenC. In our numerical experiments on DPCP, our proposed model (1) shows accuracy comparable to the $\ell_1$-norm-based penalty model (4) on road-plane recovery in the KITTI dataset; in some test examples, (1) achieves even better accuracy than (4). Besides, PenNMF takes less CPU time while achieving comparable accuracy in reconstructing the road plane in the KITTI dataset when compared with other state-of-the-art algorithms such as DPCP-PSGM and DPCP-d.
1.2. Notations and Terminologies
Norms: In this paper, $\|A\|_m$ denotes the element-wise m-th norm of a vector or matrix, i.e., $\|A\|_m = \big(\sum_{i,j} |A_{ij}|^m\big)^{1/m}$. Besides, $\|A\|_F$ denotes the Frobenius norm and $\|A\|_2$ denotes the operator 2-norm, i.e., $\|A\|_2$ equals the maximum singular value of A. Besides, we denote by $\sigma_{\min}(A)$ the smallest singular value of a given matrix A. The operator $\odot$ stands for the Hadamard product of matrices A and B of the same size. The component-wise absolute value and the component-wise l-th power of a matrix A are also used. Besides, for two symmetric matrices A and B, $A \succeq B$ denotes that $A - B$ is positive semidefinite, and $A \succ B$ denotes that $A - B$ is positive definite.
Optimality Condition: W is a first-order stationary point of (1) if and only if
| (7) |
Besides, W is a first-order stationary point of PenNM if and only if .
2. Model Description
In this section, we first discuss how to reformulate DPCP as an $\ell_p$-norm maximization with orthogonality constraints. To construct an orthonormalization-free algorithm, we minimize the penalty model (6) rather than directly solving (1). As an unconstrained penalty problem for (1), the model (6) may introduce additional infeasible first-order stationary points. Therefore, in this section, we characterize the equivalence between (1) and (6) to provide theoretical guarantees for our approach.
2.1. -Norm Maximization for DPCP Problems
Based on the fact that maximizing a higher-order norm promotes spikiness and sparsity, we maximize the $\ell_p$-norm of $Y^\top b$, where b denotes the normal vector, subject to a normalization constraint, instead of the $\ell_1$-norm minimization in (4).
Although its constraint differs from those of (1), the resulting model can be reshaped into the formulation of (1). Let $Y = QR$ be the rank-revealing QR decomposition of Y, where Q is an orthogonal matrix and R is an upper-triangular matrix; with the corresponding change of variable, the optimization model can be reshaped as
| (8) |
Clearly, problem (8) is a special case of (1). Moreover, given a global minimizer of (8), the solution of the DPCP problem can be recovered through the inverse change of variable. The detailed framework for solving DPCP by $\ell_p$-norm maximization is presented in Algorithm 1.
| Algorithm 1 Framework for Solving DPCP by -Norm Maximization. |
| Require: Data matrix |
| 1: Perform a QR factorization of Y, where R is an upper-triangular matrix and Q is an orthogonal matrix; |
| 2: Compute the solution for (1); |
| 3: Return |
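As a rough illustration of this pipeline, the sketch below assumes the DPCP reformulation maximizes $\|Y^\top b\|_p^p$ subject to $\|Y^\top b\|_2 = 1$; with the thin QR factorization $Y^\top = QR$ this becomes maximizing $\|Qw\|_p^p$ over the unit sphere, and the normal vector is recovered as $b \propto R^{-1}w$. The inner solver is a simple MSP-style fixed-point iteration [17] for the single-vector case. Both the substitution and the solver here are illustrative assumptions, not the exact implementation of the paper.

```python
import numpy as np

def dpcp_lp_max(Y, p=3, n_iters=100, rng=None):
    """Hypothetical sketch of Algorithm 1 under the assumptions stated above.

    Y: D x N homogenized data matrix (columns are samples).
    Returns a unit normal vector b of the recovered hyperplane.
    """
    rng = rng or np.random.default_rng(0)
    # Step 1: thin QR factorization of Y^T, so that ||Y^T b||_2 = ||R b||_2.
    Q, R = np.linalg.qr(Y.T)                 # Q: N x D, R: D x D upper-triangular
    # Step 2: maximize ||Q w||_p^p over the unit sphere with an MSP-style
    # fixed-point iteration [17]: w <- normalize(gradient of the objective).
    w = rng.standard_normal(R.shape[0])
    w /= np.linalg.norm(w)
    for _ in range(n_iters):
        z = Q @ w
        g = Q.T @ (np.sign(z) * np.abs(z) ** (p - 1))   # proportional to the gradient
        w = g / np.linalg.norm(g)
    # Step 3: map back to the original variable and normalize.
    b = np.linalg.solve(R, w)
    return b / np.linalg.norm(b)
```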
2.2. Equivalence
In this subsection, we first derive the expressions for the gradient and the Hessian of the objective, as well as the gradient of the penalty function.
Proposition 1.
The gradient and the Hessian of can be expressed as
respectively. Moreover, the gradient of can be formulated as
Proof.
From the work in [17] we have . Based on the expression for , the Hessian of f can be expressed as . As a result, .
Therefore, based on ([21], Equation 2.8), the gradient of can be formulated as
□
With these expressions, we can establish the equivalence between (1) and our proposed model (6). The equivalence is illustrated in Theorem 2 and Corollary 3, and the main body of the proofs is presented in Appendix A.
Theorem 2.
(First-order equivalence) Suppose and is a first-order stationary point of (6), then either holds, which further implies that is a first-order stationary point of problem (1), or the inequality holds.
Theorem 2 characterizes the relationship between the first-order stationary points of (1) and those of (6). Namely, any first-order stationary point of the penalty model that is not a first-order stationary point of the original model (1) must lie far away from the Stiefel manifold. Under an additional condition, we can derive a stronger result on these additional first-order stationary points produced by the penalty model in Corollary 3.
Corollary 3.
(Stronger first-order equivalence for ) Suppose in (1), , and is a first-order stationary point of (6), then either holds, which further implies that is a first-order stationary point of problem (1), or .
Theorem 2 characterizes the equivalence between (1) and (6) in the sense that all the infeasible first-order stationary points of (6) are relatively far away from the constraint set. Besides, Corollary 3 shows that, under its additional condition, the only infeasible first-order stationary point of (6) is 0. Therefore, when we obtain a solution near the constraint set by solving (6), we can conclude that W is a first-order stationary point of (1). Instead of directly solving (1), we can thus compute a first-order stationary point of (6) and avoid intensive orthonormalization in the computation.
3. Algorithm
3.1. Global Convergence
In this section, we focus on developing an infeasible approach for solving (6). The calculation of the gradient of the penalty function involves the second-order derivative of the original objective, which is typically even more expensive than the iterations in MSP/GPM. Therefore, we consider solving (6) by an approximate gradient descent algorithm. With an approximation of the gradient in place of the exact one, we present the detailed algorithm as Algorithm 2.
| Algorithm 2 First-Order Method for Solving (6). (PenNMF) | |
| Require:, ; | |
| 1: Randomly choose an initial point satisfying the initialization condition, and set $k = 0$; | |
| 2: while not terminate do | |
| 3: Compute inexact gradient
| |
| 4: Compute stepsize ; | |
| 5: ; | |
| 6: ; | |
| 7: end while | |
| 8: Return |
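For intuition only, the sketch below instantiates the iteration structure of Algorithm 2 with a PenCF-style approximate gradient [21], $D(W) = \nabla f(W) - W\,\mathrm{sym}(W^\top \nabla f(W)) + \beta\, W(W^\top W - I)$, applied to a sample-averaged $\ell_4$ objective $f(W) = -\frac{1}{N}\|Y^\top W\|_4^4$. The product orientation, the scaling, the fixed stepsize, and the penalty parameter are all illustrative assumptions; the exact inexact gradient used by PenNMF is the one specified in Algorithm 2.

```python
import numpy as np

def sym(A):
    """Symmetrization operator sym(A) = (A + A^T) / 2."""
    return (A + A.T) / 2

def grad_f(W, Y):
    """Gradient of f(W) = -||Y^T W||_4^4 / N (sample-averaged, orientation assumed)."""
    Z = Y.T @ W
    return -4.0 * (Y @ Z**3) / Y.shape[1]

def pennmf_style_step(W, Y, beta, step):
    """One orthonormalization-free step with a PenCF-style approximate gradient [21]."""
    G = grad_f(W, Y)
    k = W.shape[1]
    D = G - W @ sym(W.T @ G) + beta * W @ (W.T @ W - np.eye(k))
    return W - step * D

# Toy run (illustrative parameters): the iterates remain close to feasibility
# even though no orthonormalization is performed.
rng = np.random.default_rng(0)
Y = rng.standard_normal((30, 300))
W, _ = np.linalg.qr(rng.standard_normal((30, 5)))
for _ in range(300):
    W = pennmf_style_step(W, Y, beta=10.0, step=1e-2)
print(np.linalg.norm(W.T @ W - np.eye(5), "fro"))   # small feasibility violation
```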
Next, we establish the convergence of PenNMF in Theorem 4, which illustrates the global convergence and worst-case convergence rate of PenNMF under mild conditions. The main body of the proof is presented in Appendix B.
Theorem 4.
(Global convergence) Suppose and . Let be the iterate sequence generated by PenNMF, starting from any initial point satisfying , and the stepsize , where . Then, weakly converges to a first-order stationary point of (1). Moreover, for any , the convergence rate of PenNMF can be estimated by
(9)
3.2. Some Practical Settings
As illustrated in Algorithm 2, the hyperparameters in PenNMF are the penalty parameter and the stepsize. In the theoretical analysis of PenNMF, the upper bound on the stepsize adopted in Theorem 4 is too restrictive in practice. There are many adaptive stepsizes for first-order algorithms; here we consider the Barzilai–Borwein (BB) stepsize [36],
| (10) |
and alternating Barzilai–Borwein (ABB) stepsize [37],
| (11) |
where the differences between successive iterates and between successive (approximate) gradients are used. We suggest choosing the ABB stepsize in PenNMF, and we test PenNMF with the ABB stepsize in our numerical experiments.
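For concreteness, a minimal sketch of the two standard BB stepsizes and a simple alternating rule is given below; the variable names are illustrative, and the absolute values and any further safeguards are common practical choices that may differ from those used in PenNMF.

```python
import numpy as np

def bb_stepsizes(W_prev, W_curr, G_prev, G_curr):
    """Standard Barzilai-Borwein stepsizes from successive iterates and gradients."""
    S = W_curr - W_prev                       # iterate difference
    J = G_curr - G_prev                       # (approximate) gradient difference
    bb1 = np.sum(S * S) / abs(np.sum(S * J))  # "long" BB stepsize
    bb2 = abs(np.sum(S * J)) / np.sum(J * J)  # "short" BB stepsize
    return bb1, bb2

def abb_stepsize(k, W_prev, W_curr, G_prev, G_curr):
    """Alternating BB (ABB): use the long and short stepsizes in turn."""
    bb1, bb2 = bb_stepsizes(W_prev, W_curr, G_prev, G_curr)
    return bb1 if k % 2 == 0 else bb2
```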
Another hyperparameter is the penalty parameter, which controls the smooth penalty term in (5). Similarly, the lower bound for it in Theorem 4 is too large to be practical. In our numerical examples, we use a constant that approximates the relevant quantity as an upper bound. According to the work in [21], we suggest choosing the penalty parameter accordingly.
Additionally, to achieve high accuracy in feasibility, we apply a polar factorization to the final solution generated by PenCF and PenNMF as the default postprocessing. More precisely, once PenNMF returns the final solution, we compute its rank-revealing singular value decomposition and return the orthogonal polar factor. Using the same proof techniques as in [21], this postprocessing decreases both the feasibility violation and the function value. Moreover, the numerical experiments in [19] show that the introduced orthonormalization step results in little change in the computed solution. Therefore, we suggest performing the described postprocessing for PenNMF.
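A minimal sketch of this postprocessing step, assuming the polar factor is formed from the thin SVD $W = U\Sigma V^\top$ and returned as $UV^\top$:

```python
import numpy as np

def polar_postprocess(W):
    """Project the final iterate onto the Stiefel manifold via its polar factor.

    With the thin SVD W = U diag(s) V^T, the closest matrix with orthonormal
    columns (in Frobenius norm) is U V^T.
    """
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt
```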
4. Numerical Examples
In this section, we present our preliminary numerical results. We compare our algorithm with some state-of-the-art algorithms on SDL and DPCP problems, which are formulated as (1) and (8), respectively. Then, we observe the performance of our algorithm under different selections of parameters and choose the default settings accordingly. All the numerical experiments in this section are run on an Intel(R) Xeon(R) Silver 4110 CPU @ 2.1 GHz, with 32 cores and 394 GB of memory, running under Ubuntu 18.04 with MATLAB R2018a.
4.1. Numerical Results on Sparse Dictionary Learning
In this subsection, we mainly compare the numerical performance of PenNMF with some state-of-the-art algorithms on SDL. As illustrated in Table 2 of [17], MSP is significantly faster than the Riemannian subgradient method [3] and the Riemannian trust-region method [8]. Therefore, to better illustrate the performance of PenNMF, we compare PenNMF with state-of-the-art algorithms for solving (1), which is a smooth optimization problem with orthogonality constraints. We first select two state-of-the-art algorithms for solving optimization problems with orthogonality constraints. One is Manopt [38,39], a projection-based feasible method; in our numerical tests, we choose its nonlinear conjugate gradient solver with an inexact line-search strategy to accelerate Manopt. The other is PenCF [21], an infeasible approach for optimization problems with orthogonality constraints. In our experiments, we apply the alternating Barzilai–Borwein stepsize to accelerate PenNMF and use all parameters with the default settings described in [21]. Besides, we test the MSP algorithm [17] and the GPM algorithm [18]. It is worth mentioning that when the norm order equals 4, MSP and GPM are actually the same algorithm. According to the numerical examples in [18], one choice of the norm order yields better recovery quality than the other; therefore, in our numerical experiments, we test the mentioned algorithms with the better choice.
The stopping criterion for Manopt and MSP/GPM differs from that for PenCF and PenNMF. Besides, the maximum number of iterations for all compared algorithms is set to 200.
In all test examples, we randomly generate the sparse representation X and the dictionary, the latter by randomly selecting a point on the Stiefel manifold. Then, the original data matrix Y is constructed from the dictionary and X. To test the performance of all compared algorithms, we add different types of noise to Y. We first fix the level of noise and choose n from 20 to 100. Then, we test the performance of the compared algorithms with different types of noise while fixing n. In our numerical tests, “Noise” denotes Gaussian noise added to Y, “Outliers” denotes Gaussian outliers appended to the data, and “Corruption” refers to Gaussian corruption of the data matrix. Besides, “CPU time” denotes the averaged run time, while “Error” denotes the recovery error evaluated with respect to the final output of each compared algorithm.
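A minimal sketch of this synthetic setup is given below; the Bernoulli–Gaussian sparsity level and the noise scale are illustrative placeholders, not the values used in the experiments.

```python
import numpy as np

def generate_sdl_instance(n=50, N=5000, theta=0.3, sigma=0.0, rng=None):
    """Synthetic SDL data: Y = D X (+ Gaussian noise), with D orthogonal and X sparse."""
    rng = rng or np.random.default_rng(0)
    # Dictionary: a random point on the Stiefel manifold (here, a random orthogonal matrix).
    D, _ = np.linalg.qr(rng.standard_normal((n, n)))
    # Sparse representation: Bernoulli-Gaussian entries with sparsity level theta.
    X = rng.standard_normal((n, N)) * (rng.random((n, N)) < theta)
    Y = D @ X + sigma * rng.standard_normal((n, N))
    return Y, D, X
```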
The numerical results are presented in Figure 1. From Figure 1d–f,j–l, we conclude that all the compared algorithms achieve almost the same accuracy in all cases. Besides, under Gaussian noise, the performance of PenNMF is comparable to the MSP/GPM algorithm and outperforms Manopt. Moreover, under Gaussian outliers and Gaussian corruption, PenNMF performs better than PenCF, MSP/GPM, and Manopt. One possible explanation is that Manopt computes the Riemannian gradient and performs a line search in each iteration, resulting in higher computational cost than MSP/GPM. Besides, the infeasible approaches overcome the orthonormalization bottleneck of Manopt and MSP/GPM and thus achieve performance comparable to MSP/GPM. Additionally, PenCF solves a constrained model by taking approximate gradient descent steps, while the model in PenNMF is unconstrained; the absence of constraints helps to improve the performance of PenNMF.
Figure 1.
A detailed comparison among MSP, Manopt, PenCF, and PenNMF. (a)–(c) CPU time under different levels of noise; (d)–(f) errors under different levels of noise; (g)–(i) CPU time for different n; (j)–(l) errors for different n. The errors are evaluated with respect to the final output of each compared algorithm.
Besides testing on synthetic datasets, we also perform extensive experiments to verify the performance of PenNMF on real imagery data. A classic application of dictionary learning is learning sparse representations of image patches [40]. In this paper, we extend the experiments in [17] to learn patches from grayscale and color images. Based on the grayscale image “Barbara”, we construct the clean data matrix by vectorizing each patch extracted from it. Then, we use the same approach to construct the clean data matrix Y from the grayscale images “Boat” and “Lena”, together with the grayscale image “House”; the sizes of the corresponding data matrices are reported in Figure 2. Besides, we construct a data matrix by vectorizing the patches from the RGB image “Duck”. In this setting, all the compared algorithms recover the dictionary for all three channels simultaneously rather than learning it once for each channel of “Duck”. The same approach is also applied to generate the data matrix from the RGB image “Chateau”. We run MSP/GPM, PenNMF, PenCF, and Manopt to compute the dictionary under different levels of noise, where the noise is generated in the same manner as in our first numerical experiment and has the same size as these patch matrices. The numerical results are presented in Figure 2 and Figure A1. In all experiments, PenNMF takes less time than PenCF, MSP/GPM, and Manopt, which further illustrates the high efficiency of PenNMF in tackling real imagery data, especially in the large-scale case.
Figure 2.
The CPU time of PenCF, PenNMF, MSP/GPM, and Manopt on computing the dictionary. (a) Barbara, ; (b) Boat, ; (c) Duck, ; (d) House, ; (e) Lena, ; (f) Chateau, .
4.2. Dual Principal Component Pursuit
In this subsection, we first verify the recovery property of our proposed model (8), which is a special case of (1). We first compare the distance between the global minimizer of (8) and the ground truth for the DPCP problem. We fix the ambient dimension and randomly select a ground-truth normal vector. Then, we randomly generate inliers in the hyperplane whose normal vector is the ground truth. Besides, we randomly generate outliers following a Gaussian distribution. Additionally, the data are corrupted by Gaussian noise added to Y. Then, we normalize each sample in Y. The tested ranges of the outlier and noise levels are given in Figure 3. We run each test problem for 5 instances. Moreover, in each instance, we run DPCP-PSGM to solve (4) and PenNMF to solve (8) with the norm order set to 3 and 4, and obtain the solution for each model. We plot the principal angle between the recovered and ground-truth normal vectors in Figure 3. From Figure 3a,b we can conclude that (4) can tolerate outliers while achieving exact recovery, which coincides with the theoretical results presented in [32]. For model (8), the numerical experiments do not show exact recovery for either choice of the norm order. However, with some tolerance on the principal angle, we also observe that (8) can tolerate outliers. Moreover, we conclude that one choice of the norm order gives (8) a better ability to recover the normal vector than the other, which coincides with the numerical experiments in [18]; as a result, in the rest of this subsection, we only test (8) with the better choice. In addition, we analyze the number of successfully recovered instances, in which the recovery error is below a given threshold. The results are presented in Figure 4. From Figure 4, we can conclude that, with some tolerance on the errors, the $\ell_p$-norm maximization model can successfully recover the normal vector. Therefore, when applying the $\ell_p$-norm maximization model to the DPCP problem, we suggest using the better-performing norm order in (8).
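For reference, the principal angle between a recovered normal vector and the ground truth can be computed as below (a sign-invariant measure; a minimal sketch):

```python
import numpy as np

def principal_angle(b_hat, b_star):
    """Principal angle (in radians) between two normal vectors, invariant to sign and scale."""
    c = abs(np.dot(b_hat, b_star)) / (np.linalg.norm(b_hat) * np.linalg.norm(b_star))
    return np.arccos(np.clip(c, -1.0, 1.0))
```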
Figure 3.
A comparison between the models (8) and (4) on the average recovery error of 5 random trials. (a)–(c) average recovery errors with ; (d)–(g) average recovery errors with .
Figure 4.
A comparison on the number of successfully recovered instances on the different level of noise. (a) is less than ; (b) is less than .
In the rest of this subsection, we test the numerical performance of PenNMF on the DPCP problem, which plays an important role in autonomous driving applications. DPCP is applied to recover the road plane, which can be regarded as the inliers, from the 3D point clouds in the KITTI dataset [22], recorded from a moving platform while driving in and around Karlsruhe, Germany. This dataset consists of image data together with corresponding 3D points collected by a rotating 3D laser scanner [32]. Moreover, DPCP only uses the 3D point clouds, with the objective of determining the 3D points that lie on the road plane (inliers) and those off that plane (outliers): given a 3D point cloud of a road scene, the DPCP problem focuses on reconstructing an affine plane as a representation of the road. Equivalently, this task can be converted into a linear subspace learning problem by embedding the affine plane into a linear hyperplane through the homogenization mapping [29]. We use the experimental set-up in [29,32] to further compare Equations (4) and (8), RANSAC, and other alternative methods on the task of 3D road plane detection in the KITTI dataset. Each point cloud contains a large number of samples, a considerable portion of which are outliers. Besides, the samples are homogenized and normalized to unit $\ell_2$-norm.
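A minimal sketch of this preprocessing, assuming the standard homogenization $x \mapsto [x; 1]$ followed by normalization of each sample to unit $\ell_2$-norm:

```python
import numpy as np

def homogenize_and_normalize(P):
    """Embed 3D points into R^4 via [x; 1] and scale each sample to unit l2-norm.

    P: 3 x N array of 3D points. Returns a 4 x N matrix usable as the DPCP data matrix.
    """
    Y = np.vstack([P, np.ones((1, P.shape[1]))])
    return Y / np.linalg.norm(Y, axis=0, keepdims=True)
```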
We use 11 frames from the KITTI dataset annotated in [29,32]. We compare against DPCP-PSGM [29], DPCP-IRLS, and DPCP-d [31], which focus on solving the $\ell_1$-norm minimization model (4). Besides, we test RANSAC and the RPCA method in [25]. Additionally, we test PenNMF and MSP/GPM on solving our proposed model (8), which is a special case of (1). For DPCP-PSGM, DPCP-d, DPCP-IRLS, and RPCA, all parameters are set by following the suggestions in [32].
Figure 5 illustrates the numerical performance of all the compared algorithms. We present the numerical results in Figure 5d–f. Moreover, we draw the performance profiles proposed by Dolan and Moré [41] in Figure 5a–c to provide an illustrative comparison of the performance of all compared algorithms. Performance profiles can be regarded as distribution functions of a performance metric for benchmarking and comparing optimization algorithms. Besides, we show the recovery results for frames 328 and 441 in KITTI-CITY-71 in Figure 6. Here, the term “AUC” denotes the area under the ROC curve, and “iterations” denotes the total number of iterations taken by the compared algorithms. Besides, “Prob” in Figure 5d–f denotes the indices of the tested frames, which are presented in Table 1.
Figure 5.
A comparison between PenNMF, MSP, DPCP-PSGM, DPCP-d, and Random Sample Consensus (RANSAC). (a)–(c) performance profiles [41] of AUC, iterations, and CPU time; (d)–(f) the numerical results of AUC, iterations, and CPU time.
Figure 6.
Illustrations to some results in our numerical tests, with inliers in blue and outliers in red. (a) Frame 328 from KITTI-CITY-71, ; (b) Frame 441 from KITTI-CITY-71, . Inliers/outliers are detected by using a ground-truth thresholding on the distance to the hyperplane recovered by each compared method. The results are represented by projecting 3D point clouds onto the image.
Table 1.
The testing instances and their corresponding frames in the KITTI dataset.
| Dataset | KITTI-CITY-71 | KITTI-CITY-5 | KITTI-CITY-48 | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Frame id. | 221 | 328 | 441 | 881 | 1 | 45 | 120 | 137 | 153 | 0 | 21 |
| Test id. | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
From Figure 5a, we can conclude that PenNMF and MSP/GPM successfully recover the hyperplanes with comparable accuracy. Moreover, on some test problems (e.g., problem 9), PenNMF and MSP produce better classification accuracy than the other approaches. Besides, in terms of CPU time, PenNMF and MSP cost much less time than the other compared algorithms in most cases. Moreover, from Figure 5c, we can conclude that PenNMF takes less time than MSP as well as the other compared algorithms in almost all cases. As a result, we conclude that our proposed model (1) is easy to solve and that PenNMF shows better efficiency than MSP on our test examples.
5. Conclusions
Sparse dictionary learning (SDL) and dual principal component pursuit (DPCP) are two powerful tools in data science. In this paper, we formulate DPCP as a special case of the $\ell_p$-norm maximization over the Stiefel manifold proposed for SDL. Then, we propose a novel smooth unconstrained penalty model, PenNM, for the original optimization problem with orthogonality constraints. We show that PenNM is an exact penalty function of (1) under mild assumptions. We develop a novel approximate gradient approach, PenNMF, for solving PenNM. The global convergence of PenNMF as well as its sublinear convergence rate are established. Numerical experiments illustrate that our proposed approach enjoys better performance than MSP/GPM [17,18] on various test problems.
Acknowledgments
We gratefully thank Yuexiang Zhai, Hermish Mehta, Zhengyuan Zhou, and Yi Ma for sharing their codes on MSP. Besides, we also gratefully thank Tianyu Ding, Zhihui Zhu, Tianjiao Ding, Yunchen Yang, Rene Vidal, Manolis Tsakiris, and Daniel Robinson for sharing their codes on DPCP problems.
Abbreviations
The following abbreviations are used in this manuscript:
| DPCP | dual principal component pursuit |
| DPCP-d | Denoised-DPCP |
| DPCP-IRLS | Iteratively-Reweighted-Least-Squares algorithm |
| DPCP-PSGM | Projected subgradient-based algorithm for solving DPCP |
| GPM | gradient projection method |
| MSP | matching, stretching, and projection |
| PenNM | penalty model for $\ell_p$-norm maximization |
| PenNMF | first-order algorithm for solving our penalty model |
| RANSAC | Random Sample Consensus |
| SDL | Sparse dictionary learning |
Appendix A. Proof for Theorem 2 and Corollary 3
In this section, we present the proof for Theorem 2 and Corollary 3. As (6) is an unconstrained optimization problem, certain upper bounds need to be estimated. Before establishing them, we first present two linear algebra inequalities:
Lemma A1.
For any , .
Proof.
□
Lemma A2.
For any and any , we have .
Proof.
This lemma directly follows the fact that
□
Now, we present the upper bound estimation for and .
Lemma A3.
For any ,
Proof.
Due to the fact that ,
Here, the last inequality follows the fact that . □
Lemma A4.
For any ,
Proof.
From the expression of in Proposition 1,
Here, the second inequality directly uses Lemma A1 and the last inequality follows Lemma A2. □
In the rest of this section, we consider the equivalence between (1) and (6). We first establish the relationship between the first-order stationary points of (6) and those of problem (1).
From the optimality condition of (6), we derive an important equality in Lemma A5.
Lemma A5.
For any first-order stationary point of (6) and any symmetric matrix that satisfies , we have
(A1)
Proof.
Suppose is a first-order stationary point of (6), by the first-order optimality condition, . Then, for any symmetric matrix that satisfies , .
As described in Proposition 1, the gradient can be separated into three parts; we estimate the inner product of each part with the given symmetric matrix, respectively.
First,
Here, the second equality follows from the fact that holds for any symmetric B and skew-symmetric C. Besides, the last inequality follows from . As a result, we obtain the following equality:
Additionally, we estimate the inner products of the remaining parts and obtain the following equality
Based on the above two equations, multiplying on both sides of results in
(A2) and thus we complete the proof. □
Then, based on the equality in Lemma A5, the following proposition shows that all first-order stationary points of (6) are uniformly bounded.
Proposition A6.
For any first-order stationary point of (6), suppose , then .
Proof.
Let u denote the top eigenvector of , i.e., .
Suppose is a first-order stationary point that satisfies . By Lemma A5 we first have
(A3) which leads to a contradiction and shows that . Here, the second equality directly follows from Lemma A5. The first inequality uses Lemma A4 and the fact that . Besides, the second inequality follows from the fact that . The third inequality uses the fact that . The fourth inequality uses the fact that , and the last inequality follows from the fact that . □
Combining Lemma A5 and Proposition A6, we restate Theorem 2 as Theorem A7 and obtain the equivalence between (1) and (6).
Theorem A7.
Suppose , and is a first-order stationary point of (6), then either holds, which further implies that is a first-order stationary point of problem (1), or the inequality holds.
Proof.
When , any first-order stationary point of (6) satisfies that .
Suppose satisfies , then . Then from Lemma A5, we have
showing that . Then by the positive-definiteness of , we can conclude that .
As a result, we have that either or , which completes the proof. □
Corollary A8.
Suppose in (1), , and is a first-order stationary point of (6), then either holds, which further implies that is a first-order stationary point of problem (1), or .
Proof.
By the same routine as in Theorem 2, when , any first-order stationary point of (6) satisfies that . Then, following the same proof routine as in Lemma A5 and Theorem 2, we have
When , we have
(A4) holds for any .
Then, for any , we can conclude that
and thus . As a result, from (A1), when is a first-order stationary point of (6), either or . □
Appendix B. Proof for Theorem 4
In this section, we present the main body of the proof of Theorem 4. To show the convergence of PenNMF, we first present some preliminary lemmas. Then, we show that the update direction is a descent direction and thus the penalty function value decreases, as illustrated in Lemma A12. Together with Lemma A10, this shows that the iterate sequence is restricted to a neighborhood of the feasible set, and we obtain the global convergence property of PenNMF in Theorem 4. We first estimate an upper bound for a term appearing in the penalty function.
Lemma A9.
For any ,
Proof.
We first estimate the upper-bound for , which can be achieved by
Besides, from Lemma A3, we have
Combining the above two estimates, we obtain
and complete the proof. □
We then show that the penalty term builds a barrier around the feasible set, i.e., points that are sufficiently far from it have higher function values than points that are close to it.
Lemma A10.
Suppose for any and , we have
(A5)
Proof.
Let . For any satisfying and satisfying and , then
(A6) Here, the second inequality uses the fact that , and .
Moreover, when , we have . Then, .
(A7) Here, the second inequality follows the fact that .
Besides, as implies
(A8) As a result, implies . Besides, implies . Therefore,
(A9) □
Lemma A10 shows that the smooth penalty term builds a barrier around the feasible set. Moreover, we characterize the relationship between and in the following lemma.
Lemma A11.
Suppose , set , and . Then,
(A10) where .
Proof.
First, we present two linear algebra relationships. The first is the inequality , which holds for any square matrix A; it is straightforward and the proof is omitted. The second is the equality , which holds for any symmetric matrices A and B and results from the fact that .
It follows from the above facts that
where the last equality uses the fact that .
Together with the facts that and , we have
□
Let ; then the following lemma illustrates that PenNMF generates a descent sequence .
Lemma A12.
Suppose and . Let be the iterate sequence generated by PenNMF, starting from any initial point satisfying , and the stepsize , where . Then, it holds that
(A11) for any .
Proof.
By the explicit expression of , we first have
(A12) Here, the first inequality follows Lemma A3 and Lemma A4.
Besides, by the definition of , we have
(A13) Suppose , then by Lemma A11 we can conclude that
(A14) Substituting (A14) into (A13), we have
(A15) Then, by Lemma A10, as , we can conclude that . Then, by induction, we can conclude that holds for . Then, by (A15) again, we conclude that
(A16) holds for , which completes our proof. □
The following lemma shows that when our algorithm stops at , then is a first-order stationary point of (1).
Lemma A13.
Suppose and . For any satisfying and , we have that is a first-order stationary point of (1).
Proof.
Suppose , then by the same proof routine in Theorem 2, we consider the inner-product of and :
Here, the fourth equation follows the definition of and the first inequality uses the fact that , then together with Lemma A3, we can conclude that . Besides, the last inequality uses the fact that .
Then we can conclude that . By the definition of , we have
(A17) showing that . Together with Theorem 2 we can conclude that is a first-order stationary point of (1). □
Based on Lemmas A10–A13, we restate Theorem 4 as Theorem A14 and show the global convergence property of PenNMF in the following theorem.
Theorem A14.
Suppose and . Let be the iterate sequence generated by PenNMF, starting from any initial point satisfying , and the stepsize , where . Then, weakly converges to a first-order stationary point of (1). Moreover, for any , the convergence rate of PenNMF can be estimated by
(A18)
Proof.
By Lemma A12, it holds that
If is a cluster point of , we have . Together with implied by Lemma A11, we can conclude that is a first-order stationary point of problem (1).
Summing the above inequalities from to , we have
(A19) showing that , which further implies that . By Lemma A13, implies that is a first-order stationary point of (1).
Moreover, by (A19), we have that
which completes the proof. □
Appendix C. Additional Experimental Results
In this section, we present some additional numerical results. Figure A1 shows the top 12 bases computed by PenNMF from the testing instances in Section 4.1. As described in [17], the top bases are those whose coefficients have the largest norm.
Figure A1.
The top 12 bases computed from all patches of the test images without noise by PenNMF. (a) “Barbara“; (b) “Boat”; (c) “Duck“; (d) “House”; (e) “Lena“; (f) “Chateau”.
Author Contributions
Conceptualization, methodology, software, validation, formal analysis, investigation, resources, data curation, writing—original draft preparation, writing—review and editing, visualization, supervision, project administration, X.H. and X.L.; funding acquisition, X.L. All authors have read and agreed to the published version of the manuscript.
Funding
Research is supported in part by the National Natural Science Foundation of China (No. 11971466, 11991021, and 11991020); Key Research Program of Frontier Sciences, Chinese Academy of Sciences (No. ZDBS-LY-7022); the National Center for Mathematics and Interdisciplinary Sciences, Chinese Academy of Sciences; and the Youth Innovation Promotion Association, Chinese Academy of Sciences.
Conflicts of Interest
The authors declare no conflicts of interest.
References
- 1.Hansen T.L., Badiu M.A., Fleury B.H., Rao B.D. A sparse Bayesian learning algorithm with dictionary parameter estimation; Proceedings of the Sensor Array and Multichannel Signal Processing Workshop (SAM); A Coruña, Spain. 22–25 June 2014; pp. 385–388. [Google Scholar]
- 2.Shen H., Li X., Zhang L., Tao D., Zeng C. Compressed Sensing-Based Inpainting of Aqua Moderate Resolution Imaging Spectroradiometer Band 6 Using Adaptive Spectrum-Weighted Sparse Bayesian Dictionary Learning. IEEE Trans. Geosci. Remote Sens. 2014;52:894–906. doi: 10.1109/TGRS.2013.2245509. [DOI] [Google Scholar]
- 3.Bai Y., Jiang Q., Sun J. Subgradient descent learns orthogonal dictionaries. arXiv. 20181810.10702 [Google Scholar]
- 4.Gilboa D., Buchanan S., Wright J. Efficient dictionary learning with gradient descent. arXiv. 20181809.10313 [Google Scholar]
- 5.Kuo H.W., Zhang Y., Lau Y., Wright J. Geometry and symmetry in short-and-sparse deconvolution. SIAM J. Math. Data Sci. 2020;2:216–245. doi: 10.1137/19M1237569. [DOI] [Google Scholar]
- 6.Rambhatla S., Li X., Haupt J. NOODL: Provable Online Dictionary Learning and Sparse Coding. arXiv. 20191902.11261 [Google Scholar]
- 7.Song X., Wu L. A Novel Hyperspectral Endmember Extraction Algorithm Based on Online Robust Dictionary Learning. Remote Sens. 2019;11:1792. doi: 10.3390/rs11151792. [DOI] [Google Scholar]
- 8.Sun J., Qu Q., Wright J. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Trans. Inf. Theory. 2016;63:853–884. doi: 10.1109/TIT.2016.2632162. [DOI] [Google Scholar]
- 9.Wang D., Wan J., Chen J., Zhang Q. An Online Dictionary Learning-Based Compressive Data Gathering Algorithm in Wireless Sensor Networks. Sensors. 2016;16:1547. doi: 10.3390/s16101547. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 10.Yang L., Fang J., Cheng H., Li H. Sparse Bayesian dictionary learning with a Gaussian hierarchical model. Signal Process. 2017;130:93–104. doi: 10.1016/j.sigpro.2016.06.016. [DOI] [Google Scholar]
- 11.Wang Y., Wu S., Yu B. Unique Sharp Local Minimum in ℓ1-minimization Complete Dictionary Learning. arXiv. 20191902.08380 [Google Scholar]
- 12.Zhang Y., Kuo H.W., Wright J. Structured local optima in sparse blind deconvolution. IEEE Trans. Inf. Theory. 2019;66:419–452. doi: 10.1109/TIT.2019.2940657. [DOI] [Google Scholar]
- 13.Zhou Q., Feng Z., Benetos E. Adaptive Noise Reduction for Sound Event Detection Using Subband-Weighted NMF. Sensors. 2019;19:3206. doi: 10.3390/s19143206. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 14.Ling Y., Gao H., Zhou S., Yang L., Ren F. Robust Sparse Bayesian Learning-Based Off-Grid DOA Estimation Method for Vehicle Localization. Sensors. 2020;20:302. doi: 10.3390/s20010302. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Liu S., Huang Y., Wu H., Tan C., Jia J. Efficient Multi-Task Structure-Aware Sparse Bayesian Learning for Frequency-Difference Electrical Impedance Tomography. IEEE Trans. Industr. Inform. 2020 doi: 10.1109/TII.2020.2965202. [DOI] [Google Scholar]
- 16.Qu Q., Zhu Z., Li X., Tsakiris M.C., Wright J., Vidal R. Finding the Sparsest Vectors in a Subspace: Theory, Algorithms, and Applications. arXiv. 20202001.06970 [Google Scholar]
- 17.Zhai Y., Yang Z., Liao Z., Wright J., Ma Y. Complete Dictionary Learning via ℓ4-Norm Maximization over the Orthogonal Group. arXiv. 20191906.02435 [Google Scholar]
- 18.Shen Y., Xue Y., Zhang J., Letaief K.B., Lau V. Complete Dictionary Learning via ℓp-norm Maximization. arXiv. 20202002.10043 [Google Scholar]
- 19.Gao B., Liu X., Yuan Y.x. Parallelizable Algorithms for Optimization Problems with Orthogonality Constraints. SIAM J. Sci. Comput. 2019;41:A1949–A1983. doi: 10.1137/18M1221679. [DOI] [Google Scholar]
- 20.Wen Z., Yang C., Liu X., Zhang Y. Trace-penalty minimization for large-scale eigenspace computation. J. Sci. Comput. 2016;66:1175–1203. doi: 10.1007/s10915-015-0061-0. [DOI] [Google Scholar]
- 21.Xiao N., Liu X., Yuan X. A Class of Smooth Exact Penalty Function Methods for Optimization Problems with Orthogonality Constraints. [(accessed on 26 May 2020)]; Available online: http://www.optimization-online.org/DB_HTML/2020/02/7607.html.
- 22.Geiger A., Lenz P., Stiller C., Urtasun R. Vision meets robotics: The kitti dataset. Int. J. Rob. Res. 2013;32:1231–1237. doi: 10.1177/0278364913491297. [DOI] [Google Scholar]
- 23.Silberman N., Hoiem D., Kohli P., Fergus R. European Conference on Computer Vision. Springer; Berlin/Heidelberg, Germany: 2012. Indoor segmentation and support inference from RGBD images; pp. 746–760. [Google Scholar]
- 24.Hartley R., Zisserman A. Multiple View Geometry in Computer Vision. Cambridge University Press; Cambridge, UK: 2003. [Google Scholar]
- 25.Xu H., Caramanis C., Sanghavi S. Robust PCA via outlier pursuit. IEEE Trans. Inf. Theory. 2010;58:3047–3064. doi: 10.1109/TIT.2011.2173156. [DOI] [Google Scholar]
- 26.Soltanolkotabi M., Candes E.J. A geometric analysis of subspace clustering with outliers. Ann. Stat. 2012;40:2195–2238. doi: 10.1214/12-AOS1034. [DOI] [Google Scholar]
- 27.Rahmani M., Atia G.K. Coherence pursuit: Fast, simple, and robust principal component analysis. IEEE Trans. Signal Process. 2017;65:6260–6275. doi: 10.1109/TSP.2017.2749215. [DOI] [Google Scholar]
- 28.You C., Robinson D.P., Vidal R. Provable self-representation based outlier detection in a union of subspaces; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; Honolulu, HI, USA. 21–26 July 2017; pp. 3395–3404. [Google Scholar]
- 29.Ding T., Zhu Z., Ding T., Yang Y., Robinson D., Vidal R., Tsakiris M. Noisy dual principal component pursuit; Proceedings of the International Conference on Machine learning; Long Beach, CA, USA. 10–15 June 2019. [Google Scholar]
- 30.Fischler M.A., Bolles R.C. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM. 1981;24:381–395. doi: 10.1145/358669.358692. [DOI] [Google Scholar]
- 31.Tsakiris M.C., Vidal R. Dual principal component pursuit. J. Mach. Learn. Res. 2018;19:684–732. [Google Scholar]
- 32.Zhu Z., Wang Y., Robinson D.P., Naiman D.Q., Vidal R., Tsakiris M.C. Dual principal component pursuit: probability analysis and efficient algorithms. arXiv. 20181812.09924 [Google Scholar]
- 33.Shi L., Chi Y. Manifold gradient descent solves multi-channel sparse blind deconvolution provably and efficiently. arXiv. 20191911.11167 [Google Scholar]
- 34.Qu Q., Li X., Zhu Z. A nonconvex approach for exact and efficient multichannel sparse blind deconvolution; Proceedings of the Advances in Neural Information Processing Systems; Vancouver, CB, Canada. 8–14 December 2019; pp. 4017–4028. [Google Scholar]
- 35.Qu Q., Sun J., Wright J. Finding a sparse vector in a subspace: Linear sparsity using alternating directions; Proceedings of the Advances in Neural Information Processing Systems; Montreal, QB, Canada. 8–13 December 2014; pp. 3401–3409. [Google Scholar]
- 36.Barzilai J., Borwein J.M. Two-point step size gradient methods. IMA J. Numer. Anal. 1988;8:141–148. doi: 10.1093/imanum/8.1.141. [DOI] [Google Scholar]
- 37.Dai Y.H., Fletcher R. Projected Barzilai-Borwein methods for large-scale box-constrained quadratic programming. Numer. Math. 2005;100:21–47. doi: 10.1007/s00211-004-0569-y. [DOI] [Google Scholar]
- 38.Absil P.A., Mahony R., Sepulchre R. Optimization Algorithms on Matrix Manifolds. Princeton University Press; Princeton, NJ, USA: 2009. [Google Scholar]
- 39.Boumal N., Mishra B., Absil P.A., Sepulchre R. Manopt, a Matlab toolbox for optimization on manifolds. J. Mach. Learn. Res. 2014;15:1455–1459. [Google Scholar]
- 40.Mairal J., Elad M., Sapiro G. Sparse Representation for Color Image Restoration. IEEE Trans. Image Process. 2008;17:53–69. doi: 10.1109/TIP.2007.911828. [DOI] [PubMed] [Google Scholar]
- 41.Dolan E.D., Moré J.J. Benchmarking optimization software with performance profiles. Math. Program. 2002;91:201–213. doi: 10.1007/s101070100263. [DOI] [Google Scholar]