Skip to main content
Oxford University Press logoLink to Oxford University Press
. 2019 Nov 19;9(4):785–811. doi: 10.1093/imaiai/iaz028

Analysis of fast structured dictionary learningInline graphic

Saiprasad Ravishankar 1,, Anna Ma 2, Deanna Needell 3
PMCID: PMC7737167  PMID: 33343894

Abstract

Sparsity-based models and techniques have been exploited in many signal processing and imaging applications. Data-driven methods based on dictionary and sparsifying transform learning enable learning rich image features from data and can outperform analytical models. In particular, alternating optimization algorithms have been popular for learning such models. In this work, we focus on alternating minimization for a specific structured unitary sparsifying operator learning problem and provide a convergence analysis. While the algorithm converges to the critical points of the problem generally, our analysis establishes under mild assumptions, the local linear convergence of the algorithm to the underlying sparsifying model of the data. Analysis and numerical simulations show that our assumptions hold for standard probabilistic data models. In practice, the algorithm is robust to initialization.

Keywords: sparse representations, dictionary learning, transform learning, alternating minimization, convergence guarantees, generative models, fast algorithms

1. Introduction

Various models of signals and images have been exploited in signal processing and imaging applications, such as dictionary and sparsifying transform models, tensor models and manifold models. Wavelets and other analytical sparsifying transforms have been used in compression standards [21], denoising and magnetic resonance image reconstruction from compressive measurements [19]. While these approaches used fixed or analytical image models that are independent of the input data, there has been a rising interest in data-dependent or data-driven models. Learned models may outperform analytical models in various applications. For example, learned dictionaries and sparsifying transforms work well in applications such as denoising [16], in-painting [20,41] and medical image reconstruction [45]. This work focuses on analysing the convergence behaviour of a structured (unitary) sparsifying transform learning algorithm and investigates its ability to recover underlying data models. In the following, we present some background on dictionary and sparsifying operator learning, before discussing the specific learning problem and algorithm, and our contributions.

1.1 Background

Signals can be modeled as sparse in different ways such as in a synthesis dictionary or in a transform domain. In particular, the synthesis dictionary model represents a given signal Inline graphic as Inline graphic with Inline graphic denoting the synthesizing dictionary and Inline graphic denoting the sparse code, i.e. Inline graphic with the Inline graphic ‘norm’ counting the number of non-zero vector entries. The synthesis dictionary model is often referred to as a union of (low-dimensional) subspaces model for signals, wherein different signals may be approximately spanned by different subsets of dictionary columns or atoms. Finding the optimal sparse representation for a signal in the synthesis dictionary model involves solving the well-known synthesis sparse coding problem.1 This problem is known to be non-deterministic polynomial-time hard (NP-hard) in general [23] and numerous algorithms exist for approximating the solution to the sparse coding problem [1315,24,26] that provide the correct solution under certain conditions. On the other hand, the sparsifying transform model assumes that Inline graphic, where Inline graphic denotes a sparsifying transform and Inline graphic is assumed to emit a sparse structure (where zeros correspond to the transform rows that approximately annihilate the signal). The sparsifying transform model is a generalization [30] of the analysis model [22] that assumes that applying Inline graphic to a signal produces several zeros in the output. These models can be viewed as a union of null-spaces model for signals.2 For the transform model, sparse transform-domain approximations are obtained exactly by simple (e.g. hard or soft) thresholding [30].

The learning of dictionaries and sparsifying transforms from a collection of signals has been explored in many recent works [3,25,30,36,39,47]. The learning problems are often highly non-convex (e.g. non-convexity due to product of matrices structure or non-convex constraints such as using the Inline graphic “norm”), and many learning algorithms lack proven convergence guarantees or model recovery guarantees. Recent works [1,2,5,8,9,31,40,46] have studied the convergence of specific learning algorithms. Some of these works demonstrate promising results in applications for efficient synthesis dictionary [8,9,46] or transform [10,31] learning algorithms and prove convergence of the learning methods to the critical points (or generalized stationary points [34]) of the underlying costs. These works all employ the Inline graphic “norm” or other non-convex regularizers in their costs, which work well in applications. Other works such as [1,2] use the Inline graphic norm and prove the recovery of the underlying generative model for specific learning methods using an alternating minimization approach, but rely on restrictive assumptions on sparsity and the initial error. Arora et al. [4] analysed alternating minimization approaches to synthesis dictionary learning and provided a convergence radius Inline graphic (i.e. initializations within the radius provide convergence), but the upper bound on the iterate error included a non-zero offset and fresh samples may be needed in each iteration. In [5], the authors propose and analyse polynomial time algorithms for learning overcomplete dictionaries but comment that their algorithms are not suitable for large-scale applications due to computational runtime costs. Moreover, these and other schemes [40] have not been demonstrated to be practically powerful in applications such as inverse problems and can be computationally expensive.

Often, additional properties may be enforced on the model during learning such as incoherence [11,28], non-singularity [40], etc. In a recent two-part work, Sun et al. [42,43] focused on complete dictionaries and studied the geometric properties of the non-convex objective for dictionary learning over a high-dimensional sphere. Their work showed with high probability that there were no spurious local minimizers and proposed an algorithm that converged to local minimizers. While other works such as [4,6,12,38] provided theoretical guarantees for specific dictionary learning algorithms, they do not enforce structural constraints on the dictionary during learning. This work enforces the learned model to be unitary, which has been demonstrated to be both effective and computationally advantageous in practice [7,18,29,32]. While alternating minimization algorithms for general synthesis dictionary learning typically require iterative or greedy or other approximate techniques to solve the subproblems [1,3], the corresponding algorithms with unitary models, even with the Inline graphic “norm”, typically have efficient closed-form solutions [32]. Although unitary dictionary learning has shown promise empirically, there has been a lack of theoretical guarantees for proposed methods [7,18,29]. Given the recently increasing interest in such models and their effectiveness in applications such as inverse problems [32,33], our work focuses on analysing the convergence of algorithms for such structured non-convex learning problems.

In the following section, we outline the structured (unitary) operator learning approach that involves simple, computationally cheap updates. We investigate its convergence properties in the rest of the paper.

1.2 Unitary operator learning formulation and algorithm

Given an Inline graphic training data set Inline graphic, whose columns represent training signals, our goal is to find an Inline graphic sparsifying transformation matrix Inline graphic and an Inline graphic sparse coefficients (representation) matrix Inline graphic by solving the following constrained optimization problem:

graphic file with name DmEquation1.gif (1.1)

We focus on the learning of unitary sparsifying operators (Inline graphic with Inline graphic denoting the identity matrix) that have shown promise in applications such as denoising [29] and medical image reconstruction [32]. The columns Inline graphic of Inline graphic have at most Inline graphic non-zeros (measured using the Inline graphic “norm”), where Inline graphic is a given parameter. Alternatives to Problem (1.1) involve replacing the column-wise sparsity constraints with a constraint on the total sparsity (aggregate sparsity) of the entire matrix Inline graphic or using a sparsity penalty (e.g. Inline graphic penalties with Inline graphic). Problem (1.1) is an instance of sparsifying transform learning [27,30], with unitary constraint on the operator or filter set. Sparsifying transform learning generalizes conventional analysis dictionary learning. Analysis dictionary learning approaches typically minimize the Inline graphic or Inline graphic norm of Inline graphic subject to non-triviality constraints on Inline graphic that prevent trivial solutions such as the all-zero matrix [48]. Popular variations to model noisy data minimize Inline graphic subject to sparsity-type constraints on Inline graphic and constraints on Inline graphic [35,36,49]. Sparse coding in the latter variation (i.e. estimating Inline graphic for fixed Inline graphic) can be NP-hard in general. Problem (1.1) learns a different generalization of the analysis model, where Inline graphic is assumed “approximately” sparse in the transformed domain. Natural signals and images are well known to be approximately sparse in the wavelet or discrete cosine transform (DCT) domain, etc., and such sparsifying transforms have also been exploited for denoising data. Problem (1.1) with the unitary constraint on Inline graphic is also equivalent to learning a synthesis dictionary Inline graphic for sparsely approximating the training data Inline graphic as Inline graphic.

Alternating minimization algorithms are commonly used for learning synthesis dictionaries [3,17,37,39], analysis dictionaries (such as the noisy data variation above) [36,49] and sparsifying transforms [27,30,44]. In particular, unlike sparse coding in the first two models, which could be NP-hard in general, computing sparse approximations in the transform model is cheap involving thresholding, and thus, various efficient and effective algorithms have been proposed for transform learning with different properties or constraints on Inline graphic. One could alternate between solving for Inline graphic and Inline graphic in Problem (1.1) [29,31]. In this case, the solution for the Inline graphicth Inline graphic update (sparse coding step) is obtained as Inline graphic  Inline graphic, where Inline graphic and Inline graphic denote the Inline graphicth columns of Inline graphic and Inline graphic, respectively, and the operator Inline graphic zeros out all but the Inline graphic largest magnitude elements of a vector, leaving other entries unchanged (i.e. thresholding to Inline graphic largest magnitude elements). The solution for the subsequent Inline graphic update (operator update step) is obtained by first computing the full singular value decomposition (SVD) of Inline graphic as Inline graphic, and then Inline graphic. The algorithm repeats these relatively cheap updates until convergence. The overall method is provided in Algorithm 1.

Although Problem (1.1) is non-convex because of the Inline graphic sparsity and unitary operator constraints, the alternating minimization algorithm involves cheap and closed-form update steps. The thresholding-type solution for the sparse coding step readily generalizes to alternative formulations such as with an Inline graphic aggregate sparsity constraint or sparsity penalties [31]. These advantages of unitary operator learning (that also extend to general sparsifying transform learning [44]) and its effectiveness in applications [32] render it quite attractive vis-à-vis alternatives such as overcomplete synthesis dictionary learning, and hence we investigate it further in this work.

Problem (1.1) is also interpreted as training an efficient convolutional or filterbank model [27,33] for two-dimensional (or higher dimensional) images, with thresholding-type nonlinearities. To see this, we observe that if overlapping patches of an image or collection of images of size Inline graphic are (vectorized and) used for training with periodic image boundary condition (so patches at image boundaries wrap around on the opposite side of the image) and a patch stride of Inline graphic pixel in the horizontal and vertical directions (maximal patch overlap), then the transform learned by Problem (1.1) is applied to sparse code the data by first applying each row to all the image patches via inner products, followed by thresholding operations. The sparse outputs of the transform are thus generated by circularly convolving its reshaped (into two-dimensional patches and flipped) rows with the image followed by thresholding. Thus, Problem (1.1) adapts a collection of orthogonal sparsifying filters for images, and Algorithm 1 can also be implemented with filtering-based operations.

1.3 Contributions

In this work, we focus on investigating the convergence properties of the aforementioned efficient alternating minimization algorithm for unitary sparsifying operator learning. Recent works have shown convergence of the algorithm (or its variants) to critical points of the equivalent unconstrained problem [10,31,32], where the constraints are replaced with barrier penalties (that take value Inline graphic when the constraint is violated and Inline graphic otherwise). Here we further prove the fast local linear convergence of the algorithm to the underlying data models. Our results hold under mild assumptions that depend on the properties of the underlying sparse coefficients matrix Inline graphic. In addition to showing convergence, we also characterize the convergence radius and rate and discuss general and example distributions of the data for which our results hold. We also show experimentally that our assumptions and convergence guarantees hold for well-known probabilistic models of Inline graphic. Our experiments and initial arguments indicate that the learning algorithm is robust to initialization.

1.4 Organization

The rest of this paper is organized as follows. Section 2 presents the main convergence results and proofs. Section 3 presents experimental results supporting the statements in Section 2 and illustrating the empirical behaviour of the transform learning algorithm. In Section 4, we conclude with proposals for future work.

Algorithm 1 Alternating optimization for (1.1) —

  • Input: Training data matrix Inline graphic, maximum iteration count Inline graphic, sparsity Inline graphic

  • Output:  Inline graphic, Inline graphic

  • Initialize:  Inline graphic and Inline graphic

  • for  Inline graphic  do

  • Inline graphic  Inline graphicInline graphicInline graphic

  • Inline graphic =Inline graphicInline graphic

  • Inline graphic

  • Inline graphic

  • end  for

2. Convergence analysis

The main contribution of this work is the convergence analysis of Algorithm 1. We begin this section outlining notation and the assumptions under which our analysis operates. Following this, we summarize the theoretical guarantees of our work and present the proofs for these results.

2.1 Notation

We adopt the following notation in the rest of the paper. Matrix Inline graphic denotes the Inline graphic sparse coefficients matrix, Inline graphic is the Inline graphic sparsifying transform, and Inline graphic denotes the Inline graphic data set. The Inline graphicth approximation of a variable (iterate in the algorithm) is denoted Inline graphic. The capital letter Inline graphic is reserved for the transpose operator, i.e. the variable Inline graphic should be read as the transpose the of Inline graphicth approximation for Inline graphic. With the exception of Inline graphic, capitalized letters are used for matrices and lowercase letters are used for vectors, with further subscripts denoting the row, column or entry of the matrix or vector. The Inline graphicth row, Inline graphicth column and the Inline graphicth entry of a matrix Inline graphic are denoted Inline graphic, Inline graphic and Inline graphic, respectively. For any vector Inline graphic, Inline graphic denotes the function that returns the support, i.e. Inline graphic, where Inline graphic denotes the Inline graphicth entry (scalar) of Inline graphic. The operator Inline graphic leaves the Inline graphic largest magnitude elements of Inline graphic unchanged and zeros out all other entries (i.e. thresholding to Inline graphic largest magnitude elements). Matrix Inline graphic denotes an Inline graphic diagonal matrix of ones and a zero at location Inline graphic. Additionally, Inline graphic denotes an Inline graphic diagonal matrix that has ones at entries Inline graphic for Inline graphic and zeros elsewhere, and matrix Inline graphic is defined in Section 2.2 (see Assumption Inline graphic). The Frobenius norm, denoted Inline graphic, is the square root of the sum of squared elements of Inline graphic and Inline graphic denotes the spectral norm. Lastly, Inline graphic denotes the appropriately sized identity matrix.

2.2 Assumptions

We begin with the following assumptions that will be used in various results:

  • (A1) Generative model: There exists a Inline graphic and unitary Inline graphic such that Inline graphic and Inline graphic (normalized data).

  • (A2) Sparsity: The columns of Inline graphic are Inline graphic-sparse, i.e. Inline graphic  Inline graphic  Inline graphic.

  • (A3) Spectral property: The underlying Inline graphic satisfies the bound Inline graphic, where Inline graphic denotes the condition number (ratio of largest to smallest singular value).

  • (A4) Orthogonal coefficients: The rows of Inline graphic are orthonormal, i.e. Inline graphic.

  • (A5) Initialization:  Inline graphic for an appropriate small Inline graphic.

The first two assumptions are on the model for the data, i.e. we would like the algorithm to find an underlying (unitary) sparsifying transform and representation matrix such that Inline graphic holds (data generated as Inline graphic), where the columns of Inline graphic have Inline graphic non-zeros. The coefficients are assumed “structured” in Assumption Inline graphic, satisfying a spectral property, which will be used to establish our theoretical results. When Inline graphic and Inline graphic, we show that Assumption Inline graphic simplifies to very intuitive and deterministic conditions of uniqueness (the support of no two rows of Inline graphic fully coincide) and irreducibility (each row of Inline graphic has at least one non-zero, i.e. each atom or row of Inline graphic contributes to at least one non-zero in the data representation). More generally or when Inline graphic, the condition that each row of Inline graphic has at least one non-zero is still required in order for Inline graphic to hold (as otherwise Inline graphic) but the assumption does not reduce to a simple setting. We will present an analysis and empirical results showing that the spectral property holds for well-known probabilistic models. The analysis will also show that in general the underlying matrices Inline graphic and Inline graphic defining the spectral property behave similarly for the probabilistic models as Inline graphic as for the special Inline graphic case above. Assumption Inline graphic on orthogonality of coefficient matrix (normalized) rows simplifies the condition in Assumption Inline graphic (since Inline graphic) and is used in presenting/proving one version of the results, but is omitted in the generalization. For well-known probabilistic models of the coefficient matrix, we will show that the orthogonality holds asymptotically. Assumption Inline graphic on algorithm initialization states that the initial sparsifying transform, Inline graphic is sufficiently close to the solution Inline graphic. Such an assumption has also been made in other works, where the issue of good initialization is tackled separately [1,2]. Section 2.3 characterizes Inline graphic in Assumption Inline graphic in more detail. While the main results in Section 2.3.1 use Assumption Inline graphic, we also discuss a generalization in Section 2.3.4. Our theoretical results are stated next.

2.3 Results

In the following, Theorem 2.1 first presents a convergence result using all the aforementioned assumptions. Then Theorem 2.2 generalizes the result by dropping Assumption Inline graphic. Proposition 2.3 states that Assumption Inline graphic holds under a general probabilistic model on the sparse representation matrix Inline graphic. We also later show numerical results illustrating Proposition 2.3. We also provide a corollary on a special case of Theorems 2.1 and 2.2 and some remarks. In particular, Remark 2.1 discusses dropping the data normalization assumption in Inline graphic, and Remark 2.2 discusses the effect of noise on Theorems 2.1 and 2.2. Proposition 2.4 and Remark 2.3 characterize and discuss the behaviour of Inline graphic in Assumption Inline graphic.

2.3.1 Main results

Theorem 2.1

Under Assumptions Inline graphicInline graphic, the Frobenius error between the iterates generated by Algorithm 1 and the underlying model in Assumptions Inline graphic and Inline graphic is bounded as follows:


Theorem 2.1 (2.1)

where Inline graphic and Inline graphic is fixed based on the initialization.

Here, the symbol “Inline graphic” indicates equality up to first-order terms, with the other terms negligible. We will mostly only refer to the dominant component of Inline graphic in the discussions. The latter components are considered in more detail in the convergence radius analysis later (Section 2.3.3 and Appendix A).

Theorem 2.2

Under Assumptions Inline graphicInline graphic and Inline graphic, the iterates in Algorithm 1 converge linearly to the underlying model in Assumptions Inline graphic and Inline graphic, i.e. the Frobenius error between the iterates and the underlying model satisfies


Theorem 2.2 (2.2)

where Inline graphic and Inline graphic is fixed based on the initialization.

Next we discuss special cases of Theorems 2.1 and 2.2 when Inline graphic. In the case of Theorem 2.1, a simple intuitive condition that the supports of no two rows of Inline graphic fully overlap ensures linear convergence (Inline graphic), i.e. ensures Assumption Inline graphic holds.

Corollary 2.1

(Case  Inline graphic) For Theorem 2.1, when Inline graphic and no two rows of Inline graphic have identical support, then Inline graphic holds in Assumption Inline graphic. For Theorem 2.2 (without Assumption (Inline graphic)), when Inline graphic, then Inline graphic holds in Assumption (Inline graphic) if Inline graphic for all Inline graphic, where the norm is computed only with respect to the elements of Inline graphic in the support Inline graphic.

Remark 2.1 discusses the effect of dropping the data normalization assumption (in Inline graphic) on the convergence rate. In particular, the convergence rate factor Inline graphic is modified by being normalized by Inline graphic, keeping it invariant to scaling of Inline graphic.

Remark 2.1

When the unit spectral norm condition on Inline graphic in Assumption Inline graphic is dropped, the Inline graphic bound in Theorem 2.2 holds with Inline graphic. The bound Inline graphic as in (2.2) holds with the aforementioned Inline graphic but with Inline graphic replaced by Inline graphic.

As will be clear from the proofs in Section 2.4, when Assumption Inline graphic stating Inline graphic is relaxed to Inline graphic, then the (common) linear contraction factor Inline graphic for the error in each iteration in Theorem 2.2 (with respect to previous iteration’s error) is replaced with Inline graphic, where Inline graphic is defined similar to Inline graphic but with respect to Inline graphic (which is shown in Section 2.4 to contain Inline graphic).

Finally, we have the following generalization of Theorem 2.2 for noisy models. The Inline graphic in Assumption Inline graphic would be smaller in the presence of noise and the noise is assumed small enough so that the support recovery and Taylor Series convergence properties used in the proofs in Section 2.4 hold.

Remark 2.2

When a noisy model of the data is used in Assumption Inline graphic, i.e. Inline graphic, where Inline graphic denotes noise, then for sufficiently small noise, Theorem 2.2 holds, except that the term Inline graphic, where Inline graphic is a constant, is added to the right-hand side of (2.2).

2.3.2 Convergence rate

While our main results assume that the spectral property in Assumption Inline graphic holds, the next result discusses the scenario and models under which the assumption Inline graphic is generally valid.

Proposition 2.3

Suppose the locations of the Inline graphic non-zeros in each column of Inline graphic are chosen independently and uniformly at random, and the non-zero entries are i.i.d. with mean zero and variance Inline graphic. Then, for fixed Inline graphic, Inline graphic, and Inline graphic, we have that Inline graphic  Inline graphic for large enough Inline graphic with high probability. In particular, we have the following limit almost surely:


Proposition 2.3 (2.3)

Proposition 2.3 holds for several well-known distributions of Inline graphic such as when its column supports are drawn independently and uniformly at random and the non-zero entries are a) i.i.d. with Inline graphic or b) i.i.d. scaled (by Inline graphic) random signs with ‘+’ and ‘−’ being equally probable. Section 3 empirically shows the algorithm’s convergence and the behaviour of Inline graphic when Inline graphic, a commonly used sparsity criterion in many applications (i.e. with Inline graphic, where Inline graphic is a small fraction).

2.3.3 Convergence radius

While the main convergence results make use of Assumption Inline graphic, here we discuss the behaviour of the convergence radius Inline graphic, including when the number of training signals Inline graphic. The following proposition and remark characterize a sufficient Inline graphic in Assumption Inline graphic for Theorems 2.1 and 2.2.

Proposition 2.4

The iterate convergence in Theorem 2.1 holds when the radius of convergence Inline graphic in Assumption Inline graphic satisfies Inline graphic, where Inline graphic with Inline graphic computing the smallest non-zero magnitude in a vector, and Inline graphic with Inline graphic, and Inline graphic is defined as follows:


Proposition 2.4 (2.4)

In Proposition 2.4, Inline graphic arises from the sparse coding step of Algorithm 1 and ensures recovery of the support of the underlying sparse coefficients. The bound Inline graphic arises in the operator update step of Algorithm 1 and is primarily to ensure the convergence and boundedness of Taylor Series expansions discussed in the proof. The largest permissible Inline graphic that suffices for Theorem 2.1 is obtained by maximizing the function Inline graphic over Inline graphic. The end points of this interval both correspond to Inline graphic. So the maximum of the continuous (non-negative) function Inline graphic would occur inside the interval. The constant Inline graphic is monotone decreasing as Inline graphic with the limiting Inline graphic. The result indicates that the radius of convergence depends on the properties of the underlying sparse coefficients.

Remark 2.3

Proposition 2.4 holds for Theorem 2.2 but with Inline graphic depending on Inline graphic. In particular, as Inline graphic, Inline graphic takes the same form as in Proposition 2.4. Moreover, for the distributions in Proposition 2.3, Inline graphic and Inline graphic almost surely as Inline graphic.

Remark 2.3 indicates that Inline graphic in Proposition 2.4 remains unchanged for Theorem 2.2. However, the Inline graphic arising from the operator update step depends on Inline graphic. For example, a bound of Inline graphic (smaller for larger condition numbers) ensures the convergence of one of the Taylor Series in the proof in Section 2.4.2. Importantly, for the distributions of Inline graphic in Proposition 2.3, the limiting value of Inline graphic stated in Remark 2.3 depends only on the ratio Inline graphic.

The limiting behaviour of Inline graphic as Inline graphic would depend on the distribution of Inline graphic. Appendix B discusses some example distributions that satisfy the assumptions in Proposition 2.3 and have the non-zero values bounded away from zero, for which Inline graphic holds for each Inline graphic, where Inline graphic is a positive constant. Practically, peak physical intensity and numerical precision bound non-zero entries of the sparse coefficient matrix. In practice, we expect the radius Inline graphic in Proposition 2.4 to be limited more by Inline graphic, since Inline graphic depends approximately only on the ratio Inline graphic for large Inline graphic (Remark 2.3) and would be a constant when Inline graphic.

2.3.4 Discussion of generalization of convergence radius assumptions

Here we discuss the effect of Inline graphic values larger than in Proposition 2.4 (or Remark 2.3) on the convergence of Algorithm 1. The following lemma shows the behaviour of the sparse coding error for general algorithm initializations (or general Inline graphic values that may not ensure support recovery).

Lemma 2.1

For Inline graphic in Algorithm 1 and under Assumptions Inline graphic and Inline graphic and denoting Inline graphic with Inline graphic for some non-negative Inline graphic, we have that


Lemma 2.1 (2.5)

Appendix C provides the proof of Lemma 2.1. Lemma 2.1 suggests that regardless of how close the initial transform is to the underlying model, the bound on the sparse coding error is at most twice that in Theorem 2.2. In this case, the contraction factor in the operator update step would need to satisfy Inline graphic in order to consistently decrease the error. We have Inline graphic for the operator update step, where Inline graphic is a diagonal matrix with ones at entries Inline graphic for Inline graphic and zeros elsewhere. If the supports of Inline graphic and Inline graphic are mismatched, then Inline graphic could in general have more ones than Inline graphic. In other words, Inline graphic could be larger than the Inline graphic in Theorem 2.2. Thus, the (larger) effective (or overall) factor of Inline graphic could lead to slow convergence initially from more general initializations. This is also corroborated by the experiments in Section 3, where slower convergence is observed from general initializations until the underlying support is fully recovered, at which point, the linear convergence behaviour predicted in Theorem 2.2 is fully observed, with a similar rate of convergence regardless of initialization.

2.4 Proofs of theorems, corollary, propositions and remarks

We first prove Theorem 2.1 and then the proof of Theorem 2.2 is briefly presented highlighting the distinctions arising from the generalization. The proof of Corollary 2.1 is presented for the case of Theorem 2.1 (the proof for the case of Theorem 2.2 is similar). The proof of Remark 2.2 follows along the same lines as those of the theorems and is omitted. Finally, the proof of Proposition 2.3 is presented. The proofs of Proposition 2.4 and Remark 2.3 are outlined in Appendix A.

To prove Theorem 2.1, we will first prove two supporting lemmas that establish properties of the iterates. First, Lemma 2.2 shows that the error between the iterate Inline graphic and Inline graphic is bounded and the bound depends on the approximation error with respect to Inline graphic for the initial Inline graphic (bounded by Inline graphic as in Assumption Inline graphic). Lemmas 2.3 and 2.4 show that the error between the first Inline graphic iterate (Inline graphic) and Inline graphic is bounded above by Inline graphic for Theorems 2.1 and 2.2, respectively. Similar bounds are shown to hold for subsequent iterations. Therefore, for Algorithm 1 to converge linearly, one only needs Inline graphic as in Assumption Inline graphic or as established by Proposition 2.3. The scaling indicated in Remark 2.1 follows from the proofs of Lemmas 2.2 and 2.4.

2.4.1 Proof of Theorem 2.1

For our proofs, we define the sequences Inline graphic and Inline graphic such that

graphic file with name DmEquation7.gif (2.6)

 

graphic file with name DmEquation8.gif (2.7)
Lemma 2.2

(Approximation error for  Inline graphic) For Inline graphic in Algorithm 1 and under Assumptions Inline graphic, the Frobenius norm of the approximation error of the estimated sparse coefficients with respect to Inline graphic is bounded by Inline graphic as defined in Inline graphic. In particular, we have that


Lemma 2.2

where Inline graphic.

Proof.

For each column indexed by Inline graphic, of the sparse coefficients matrix Inline graphic, the following hold:


Proof.
(2.8)

where Inline graphic is a diagonal matrix with a one in the Inline graphicth entry if Inline graphic and zero otherwise and Inline graphic is as defined in (2.6). The last equality above follows from the fact that the support of Inline graphic includes that of Inline graphic, for small enough Inline graphic (Assumption Inline graphic). In particular, since Inline graphic, we have


Proof.

Therefore, whenever Inline graphic with Inline graphic being the smallest non-zero magnitude vector entry, the support of Inline graphic includes3 that of Inline graphic (the entries of the perturbation Inline graphic are not large enough to change the support). The following results then hold:


Proof.

Here, Inline graphic follows by definition of Inline graphic; step Inline graphic holds for the Frobenius norm of a matrix–matrix product, and the last equality holds because Inline graphic (Assumption Inline graphic). By Assumption Inline graphic, Inline graphic, which completes the proof.

Lemma 2.3

(Approximation error for  Inline graphic) For Inline graphic in Algorithm 1 and under Assumptions Inline graphic, the Frobenius norm of the approximation error of the estimated transform with respect to Inline graphic is bounded as


Lemma 2.3

where Inline graphic is a scalar coefficient as in Theorem 2.1.

Proof.

Denote the SVD of Inline graphic as Inline graphic. From Algorithm 1, we have


Proof.

Using the SVD of Inline graphic, we rewrite the above equations as


Proof.
(2.9)

Now the error between Inline graphic and Inline graphic satisfies


Proof.
(2.10)

where the matrix Inline graphic can be further rewritten as follows:


Proof.
(2.11)

The above equality holds for all Inline graphic, which suffices to ensure Inline graphic is invertible. Note that the matrix square root (i.e. the matrix Inline graphic in the decomposition Inline graphic) in (b) above is the positive-definite square root.

Using Taylor Series Expansions for the matrix inverse and positive-definite square root along with (2.7) and the assumption Inline graphic, we have that


Proof.
(2.12)

 


Proof.
(2.13)

where Inline graphic denotes corresponding higher order series terms and is bounded in norm by Inline graphic for some constant Inline graphic.

Substituting these expressions in (2.10), the error between the first transform iterate Inline graphic and Inline graphic is bounded as


Proof.
(2.14)

The approximation error above is bounded in norm by Inline graphic, which is negligible for small Inline graphic. So we only bound the dominant term Inline graphic on the right. The matrix Inline graphic clearly has a zero diagonal (skew-symmetric). Thus, we have the following inequalities:


Proof.
(2.15)

 


Proof.
(2.16)

 


Proof.
(2.17)

where we more simply write (ignoring higher order terms in (2.14)) Inline graphic. Since Inline graphic by Assumption Inline graphic, we obtain the desired result.

Thus, we have shown the results for the Inline graphic case. We complete the proof of Theorem 2.1 by observing that for each subsequent iteration Inline graphic, the same steps as above can be repeated along with the induction hypothesis (IH) to show that

graphic file with name DmEquation24.gif

2.4.2 Proof of Theorem 2.2

Here we present the distinctions in the proof of Theorem 2.2. When Assumption Inline graphic is dropped, Lemma 2.2 and its proof remain unaffected. The change to Lemma 2.3 and its proof are outlined next.

Lemma 2.4

(Removing Assumption  Inline graphic) For Inline graphic in Algorithm 1 and under Assumptions Inline graphic and Inline graphic, the Frobenius norm of the approximation error of the estimated transform with respect to Inline graphic is bounded as


Lemma 2.4

where Inline graphic is a scalar coefficient as in Theorem 2.2.

Proof.

The proof of Lemma 2.4 relies on the general Taylor Series Expansions for the matrix inverse and positive-definite square root. In particular, (2.13) uses these expansions under the assumption that Inline graphic. To establish a result without this assumption, we first use the general Taylor Series Expansions for matrix inverse and square root then rely on algebraic identities of the Kronecker sum and product to manipulate the error bound of Inline graphic.

To that end, let Inline graphic. First, we look at the series expansion of Inline graphic, for which the following equalities hold:


Proof.

where we factored out4  Inline graphic and then computed the series expansion of a matrix inverse. The Taylor series converges when Inline graphic or when Inline graphic.

For the series expansion of the matrix square root in (2.11), we first observe that


Proof.

Let Inline graphic, where Inline graphic denotes the remainder of terms within the square root. The Taylor Series Expansion for Inline graphic can be written as Inline graphic, where the operator Inline graphic reshapes a matrix into a vector by stacking the columns, Inline graphic undoes or inverts the Inline graphic operation by reshaping a vector into an Inline graphic matrix, and the gradient of the square root function is obtained as follows, where Inline graphic denotes the Kronecker product and Inline graphic denotes the Kronecker sum:


Proof.
(2.18)

Using the above expressions, (2.11) in this case becomes


Proof.
(2.19)

with Inline graphic denoting corresponding higher order series terms in each step above.

Now recall from (2.14) that Inline graphic, where Inline graphic  Inline graphic. To bound the required error Inline graphic, first, using the property of the Inline graphic operator that Inline graphic, we can easily obtain a simplified expression for Inline graphic ignoring the Inline graphic terms (since they are bounded in norm by Inline graphic, which is negligible for small Inline graphic and Inline graphic is a constant) in (2.19) as follows:


Proof.
(2.20)

Denoting the SVD of (positive-definite) Inline graphic as Inline graphic, it can be shown that the SVD of the Kronecker sum Inline graphic is5  Inline graphic or that Inline graphic. Using these SVDs and the standard result that


Proof.
(2.21)

the following results readily hold:


Proof.
(2.22)

 


Proof.
(2.23)

Substituting (2.22) and (2.23) in (2.20) simplifies (2.20) as follows:  


Proof.
(2.24)

Moreover, we have that


Proof.
(2.25)

Thus, equation (2.24) further simplifies to


Proof.
(2.26)

where the matrix Inline graphic is defined as


Proof.
(2.27)

Finally, we use (2.26) to obtain


Proof.
(2.28)

Here, the submultiplicativity of the spectral norm and the fact that Inline graphic ensures that


Proof.
(2.29)

where the last equality follows from the facts that Inline graphic (for unitary matrix); Inline graphic (by Assumption (Inline graphic)); Inline graphic, where Inline graphic denotes the smallest matrix singular value; and the fact that Inline graphic (using Assumption (Inline graphic)). Substituting (2.29) in (2.28) and using a similar set of inequalities as in (2.17) to bound the Inline graphic term in (2.28) provides the following bound:


Proof.
(2.30)

where we more simply write Inline graphic. Since by Assumption Inline graphic, Inline graphic, we obtain the desired result.

2.4.3 Proof of Corollary 2.1

We have Inline graphic (focusing on the dominant component) with Inline graphic by Assumptions (Inline graphic) and (Inline graphic), respectively. For brevity in notation, let Inline graphic. Here the matrix Inline graphic zeros out the Inline graphicth row of Inline graphic and Inline graphic zeros out the columns corresponding to the complement of the support of the Inline graphicth row of Inline graphic.

The matrix Inline graphic is then a diagonal matrix where the Inline graphicth entry is Inline graphic and the Inline graphicth entry for Inline graphic is Inline graphic, where Inline graphic coincides with Inline graphic on Inline graphic and is zero outside this support. Clearly, the Inline graphicth row and column of Inline graphic are zero and its other off-diagonal entries are Inline graphic  Inline graphic because each column of Inline graphic has at most Inline graphic non-zeros and Inline graphic for Inline graphic. So, we readily have that

graphic file with name DmEquation41.gif

where the last inequality bound follows from the fact that Inline graphic for all Inline graphic, which holds because each row of Inline graphic has unit Inline graphic norm (Assumption Inline graphic) and no two rows have the exact same support. Inline graphic

2.4.4 Proof of Proposition 2.3

Under the conditions stated in Proposition 2.3, the (dominant) Inline graphic factor is expected to be less than Inline graphic given sufficient training signals, i.e. large Inline graphic.

For the proof, we study the asymptotic behaviour of the matrices Inline graphic and Inline graphic, where Inline graphic, which appear in Inline graphic as defined in Remark 2.1. First, we show that Inline graphic almost surely as Inline graphic using Inline graphic. Then we will show that Inline graphic  Inline graphic almost surely as Inline graphic using Inline graphic.

Let Inline graphic. Then the non-zero entries of Inline graphic have zero mean and variance of Inline graphic. Let Inline graphic denote the indicator function that takes the value Inline graphic when Inline graphic and is zero otherwise. Since Inline graphic, using the law of large numbers, the diagonal entries of Inline graphic converge almost surely as follows:

graphic file with name DmEquation42.gif (2.31)

where Inline graphic is i.i.d. over the columns Inline graphic. The random variable Inline graphic is non-zero (the non-zero part has mean Inline graphic) with probability (w.p.)6  Inline graphic and is zero w.p. Inline graphic, implying Inline graphic. Similarly, the off-diagonal entries Inline graphic for Inline graphic converge as follows:

graphic file with name DmEquation43.gif (2.32)

where Inline graphic is non-zero w.p.7  Inline graphic and zero w.p. Inline graphic, implying Inline graphic, where Inline graphic is the product of two i.i.d. zero mean random variables. Therefore, from (2.31) and (2.32), it follows that Inline graphic converges to Inline graphic almost surely. Thus, as Inline graphic, Inline graphic in the definition of Inline graphic, converges to Inline graphic almost surely.

Now consider Inline graphic and note that the Inline graphicth row and column of the matrix Inline graphic are zero. As Inline graphic, the diagonal entries of Inline graphic have the following limit almost surely:

graphic file with name DmEquation44.gif (2.33)

which holds for all Inline graphic. The expectation follows from the fact that Inline graphic is i.i.d. over the columns8  Inline graphic, is non-zero (mean Inline graphic for non-zero part) w.p. Inline graphic and is zero otherwise.

The following limit holds almost surely for the off-diagonal entries of Inline graphic:

graphic file with name DmEquation45.gif (2.34)

which follows because the indexes Inline graphic, Inline graphic and Inline graphic all lie in the support of the Inline graphicth column (to get non-zero indicator function) w.p. Inline graphic, and the expectation of the product of zero mean i.i.d. random variables is zero. It is obvious from (2.33) and (2.34) that  

graphic file with name DmEquation46.gif (2.35)

Thus, as Inline graphic, Inline graphic  Inline graphic  Inline graphic almost surely, and the same is true for Inline graphic. Combining all the above results, the required result (2.3) is readily established.

Note that under the assumed probabilistic model of Inline graphic, the matrix Inline graphic in the proof of Proposition 2.3 above approaches a diagonal matrix as Inline graphic, whereas in the proof of Corollary 2.1 for the Inline graphic case, it is deterministically a diagonal matrix for each Inline graphic.

3. Experiments

In this section, we provide numerical results supporting our findings. We also discuss the empirical behaviour of the algorithm with respect to different initializations.

3.1 Empirical performance of algorithm

In the first two experiments, we generated the training set Inline graphic using randomly generated Inline graphic and Inline graphic, and set Inline graphic, Inline graphic, and Inline graphic. The transform Inline graphic is generated in each case by applying Matlab’s orth() function on a standard Gaussian matrix. For generating Inline graphic, the support of each column is chosen uniformly at random and the non-zero entries are drawn i.i.d. from a Gaussian distribution with mean zero and variance Inline graphic. Section 2 (Theorems 2.1 and 2.2) established model recovery guarantees for Algorithm 1. Figure 1 shows the empirical evolution of the Frobenius norm of the approximation error of the transform iterates with respect to Inline graphic, for an Inline graphic initialization (Inline graphic – see (2.8)). The plots illustrate the observed linear convergence of the iterates to the underlying true operator WInline graphic.

Fig. 1.


Fig. 1.

The performance of Algorithm 1 for recovering WInline graphic for Inline graphic and Inline graphic.

Figures 2 and 3 show the behaviour of Algorithm 1 with different initializations. We consider six different initializations and plot the evolution of the objective function over iterations. The first initialization, labelled ‘eps’, denotes an initialization as in Fig. 1 with Inline graphic. The other initializations are as follows: entries of Inline graphic drawn i.i.d. from a standard Gaussian distribution (labelled ‘rand’); an Inline graphic identity matrix Inline graphic labelled ‘id’; a discrete cosine transform (DCT) initialization labelled ‘dct’; entries of Inline graphic drawn i.i.d. from a uniform distribution ranging from 0 to 1 (labelled ‘unif’); and Inline graphic labelled ‘zero’. Note that the minimum objective value in (1.1) is Inline graphic. For non-epsilon initializations, we see that the behaviour of Algorithm 1 is split into two phases. In the first phase, the iterates slowly decrease the objective. When the iterates are close enough to a solution, the second phase occurs and during this phase, Algorithm 1 enjoys rapid convergence (towards 0). For different initializations, the algorithm converged to a scaled (by a diagonal Inline graphic matrix), row permuted version of the predetermined WInline graphic. Figures 2 and 3 also show the proportion of recovered (entry-wise) support of Inline graphic (up to row-permutation and sign changes). The grey region highlights the range of iterations in which the true support of Inline graphic is estimated well by the different initializations (i.e. where the proportion of recovered support reaches near 1 or 100%). These empirical results show that the aforementioned second phase of the convergence behaviour occurs in the iterations proceeding the point when the algorithm acquires the true support of Inline graphic. Furthermore, note that the objective’s convergence rate in the second phase is similar to that of the ‘eps’ case, where Inline graphic is selected to ensure that the support of Inline graphic is recovered in one iteration. These results concur with the analysis and discussion in Section 2.

Fig. 2.


Fig. 2.

The performance of Algorithm 1 over iterations with various initializations for Inline graphic: objective function (left) and proportion of recovered support of Inline graphic (right).

Fig. 3.


Fig. 3.

The performance of Algorithm 1 with various initializations for Inline graphic: objective function (left) and proportion of recovered support of Inline graphic (right).

The behaviour of Algorithm 1 is similar for Inline graphic and Inline graphic, with the latter case taking more iterations to enter the second phase of convergence. This makes sense since there are more coefficients to learn for larger Inline graphic. This experiment shows that Algorithm 1 is robust to initialization.

 

3.2 The Inline graphic factor in Proposition 2.3

In our last experiment, we illustrate Proposition 2.3 empirically. For each trial, we fix the signal dimension to be Inline graphic. In addition to varying Inline graphic, we vary Inline graphic. In the first experiment, the Inline graphic non-zero entries for each column of Inline graphic are selected uniformly at random where values are drawn i.i.d. from a Gaussian distribution with mean Inline graphic and variance Inline graphic. We also simulate the case when the non-zeros are i.i.d. scaled random signs with mean Inline graphic and variance Inline graphic, with ‘+’ and ‘-’ being equally probable. We then compute the following functions of Inline graphic: the condition number Inline graphic, the maximum spectral norm over choice Inline graphic for Inline graphic and the contraction factor Inline graphic that is a function of these quantities. The top and bottom rows of Fig. 4 plot these quantities for the Gaussian and scaled sign coefficients, respectively.

Fig. 4.


Fig. 4.

On the x-axis we plot the number of training data points Inline graphic and on the y-axis, (left) the condition number Inline graphic, (center) the maximum spectral norm over choice Inline graphic for Inline graphic and (right) the contraction factor Inline graphic. The top row of plots corresponds to the case when the non-zeros are i.i.d. Gaussian and the bottom plots correspond to the non-zeros being i.i.d. scaled random signs. In both cases, Inline graphic and we vary Inline graphic.

The plots clearly show that Inline graphic for large Inline graphic for each distribution and Inline graphic setting. The maximum spectral norm plots quickly converged close to their expected values of Inline graphic. Moreover, Inline graphic approaches close to Inline graphic as Inline graphic increases as expected, indicating that the probabilistic sparsity model approaches the scenario in Theorem 2.1. We have observed similar empirical behaviour for the Inline graphic factor, when the non-zero entries are drawn from other distributions.

4. Conclusion

In this work, we presented a study of the model recovery properties of the alternating minimization algorithm for structured, unitary sparsifying transform learning. The algorithm converges rapidly to the generative model(s) from local neighbourhoods under mild assumptions and the assumptions are shown to hold for various probabilistic models. In addition to showing that the algorithm convergences linearly, we also characterize the asymptotic behaviour of the convergence rate and radius with respect to the number of data points or training signals Inline graphic. In practice, the sparsifying operator learning method is robust to initialization. Our numerical results and initial analysis showed that the algorithm performs well under various initializations, with similar eventual rates of convergence. We have observed empirically that the algorithm converges to the specific WInline graphic even with quite large perturbations for the initial Inline graphic from WInline graphic (i.e. large Inline graphic values in Assumption Inline graphic). We plan to further analyse the effects of initialization and the behaviour of transform learning in inverse problems in future work.

Funding

This work was partly conceived when S.R. was at the University of Illinois at Urbana-Champaign (supported by National Science Foundation CCF 1320953 to S.R.), and was partly done when S.R. was at the University of Michigan, Ann Arbor and was supported by the Office of Naval Research (N00014-15-1-2141 to S.R.); Defense Advanced Research Projects Agency Young Faculty Award (D14AP00086 to S.R.); US Army Research Office Multidisciplinary University Research Initiative (W911NF-11-1-0391, 2015-05174-05 to S.R.); National Institutes of Health (R01 EB023618, U01 EB018753 to S.R.); and University of Michigan-Shanghai Jiao Tong University seed grant (to S.R.). This material was also supported by the National Science Foundation (DMS-1440140 to A.M. and D.N.) while the authors were in residence at the Mathematical Science Research Institute in Berkeley, California during the Fall 2017 semester; National Science Foundation CAREER (1348721 to A.M. and D.N.); and National Science Foundation BIGDATA (1740325 to A.M. and D.N.).

A. Proofs of Proposition 2.4 and Remark 2.3

Here we present the proof of Proposition 2.4 and briefly comment on Remark 2.3. The form of Inline graphic was discussed in the proof of Lemma 2.2 and ensures recovery of the support of Inline graphic. We derive the form of Inline graphic (i.e. sufficient Inline graphic for the operator update step) based on the proof of Lemma 2.3. In particular, we bound Inline graphic to ensure convergence of the Taylor Series in (2.12) and to bound the higher order terms in the product Inline graphic.

The matrix inverse series Inline graphic converges when Inline graphic. We have Inline graphic  Inline graphic, where the last inequality follows from Assumption Inline graphic and Lemma 2.2. Thus, Inline graphic suffices. Similarly, the series for Inline graphic  Inline graphic converges when the perturbation Inline graphic satisfies Inline graphic. Since Inline graphic  Inline graphic  Inline graphic, we have that Inline graphic or Inline graphic suffices, which also works for the matrix inverse series.

Using the notation above, the product in (2.13) simplifies as

graphic file with name DmEquation47.gif (A.1)

 

graphic file with name DmEquation48.gif (2.13)

where Inline graphic and Inline graphic are the remaining higher order terms in the respective series. The Inline graphic terms in (2.13) are given as Inline graphic. We bound the Frobenius norm of these summands to characterize Inline graphic in Inline graphic.

First, we have the following bound:

graphic file with name DmEquation49.gif (A.2)

We also have the next bound, which follows from Inline graphic (since Inline graphic) and Inline graphic  Inline graphic  Inline graphic:

graphic file with name DmEquation50.gif (A.3)

Third, we have for the matrix square root Taylor series that Inline graphic  Inline graphic  Inline graphic, where the right-hand side is the magnitude of the remainder of the series for Inline graphic after the first order term. Thus, we have the following standard bound for the remainder for some Inline graphic:

graphic file with name DmEquation51.gif (A.4)

Here the last inequality used Inline graphic and Inline graphic when Inline graphic. Finally, we have

graphic file with name DmEquation52.gif (A.5)

Combining (A.2)–(A.5), we easily get Inline graphic with Inline graphic as defined in (2.4). Including the Inline graphic term in (2.14), the effective convergence rate in Lemma 2.3 is Inline graphic with the dominant Inline graphic. Thus, Inline graphic suffices for linear convergence or Inline graphic. Since Inline graphic is monotone increasing in Inline graphic (the upper bound comes from the aforementioned Taylor Series convergence conditions) with Inline graphic, Inline graphic for which Inline graphic. This would be Inline graphic (largest permissible Inline graphic) for the operator update step. It is easy to see that this Inline graphic is equivalently obtained by maximizing Inline graphic in Inline graphic, where Inline graphic. Note that we ignored the higher order effects in our Assumptions, since Inline graphic is negligible for sufficiently small Inline graphic, where the effective convergence rate is approximately Inline graphic.

In the case of Remark 2.3 for Theorem 2.2, the form of Inline graphic remains the same as above. The proof of Lemma 2.4 showed that the matrix inverse series converges when the perturbation term satisfies Inline graphic, or Inline graphic suffices. Similarly, the bounds for the other series terms also depend on Inline graphic. Clearly, as Inline graphic, we approach Assumption Inline graphic for which Inline graphic takes the same form as in Proposition 2.4. The limit for Inline graphic in Remark 2.3 holds for the distributions in Proposition 2.3 because Inline graphic (see (2.35)) almost surely as Inline graphic.

B. Distributions in Section 2.3.3

Various distributions of Inline graphic lead to interesting behaviour for Inline graphic. Here we discuss example distributions and the corresponding behaviour of Inline graphic, which we show to be lower bounded by Inline graphic. The distributions below satisfy the conditions in Proposition 2.3 (i.e. the column supports of Inline graphic of cardinality Inline graphic are drawn independently and uniformly at random, and the non-zero entries are i.i.d. with mean zero and variance Inline graphic) to ensure good convergence rate properties.

  1. The non-zeros are random signs scaled by Inline graphic and ‘+’ and ‘-’ are equally probable Inline graphic.

  2. Non-zeros are uniformly distributed in Inline graphic with Inline graphic. When Inline graphic and Inline graphic, then Inline graphic.

  3. Non-zeros are drawn from the density Inline graphic when Inline graphic and Inline graphic otherwise, with Inline graphic and Inline graphic and Inline graphic. For a given Inline graphic, Inline graphic, Inline graphic, Inline graphic, and Inline graphic  Inline graphic  Inline graphic.

The non-zeros of Inline graphic above are assumed to be upper bounded (in practice, the bound is determined by the peak physical intensity in the signals considered) and lower bounded (determined by numerical precision).

We briefly show the Inline graphic bounds for the examples above. When the non-zeros of Inline graphic are random signs scaled by Inline graphic, it is obvious that Inline graphic.

When the non-zeros are uniformly distributed with Inline graphic for Inline graphic with Inline graphic, then clearly Inline graphic. The variance of the distribution is Inline graphic. Setting this to the required value of Inline graphic yields Inline graphic  Inline graphic. Solving the quadratic equation for Inline graphic yields a root Inline graphic, which is non-negative when Inline graphic (i.e. Inline graphic). Moreover, Inline graphic implies Inline graphic. Then the distributions with Inline graphic and Inline graphic readily satisfy Inline graphic and Inline graphic. Thus, clearly Inline graphic. For the special case Inline graphic and Inline graphic  Inline graphic.

When the non-zeros are drawn from Inline graphic for Inline graphic and Inline graphic otherwise, with Inline graphic and Inline graphic and Inline graphic, clearly Inline graphic and the variance is Inline graphic  Inline graphic  Inline graphic. Setting the variance to Inline graphic yields a nonlinear equation in Inline graphic, Inline graphic and Inline graphic, with many solutions. To extract one set of solutions, we set Inline graphic and Inline graphic for some Inline graphic, which implies Inline graphic. Substituting these in the variance equation simplifies it to Inline graphic. Thus, Inline graphic with Inline graphic and Inline graphic in this case. We then easily get Inline graphic.

C. Proof of Lemma 2.1

Each column Inline graphic (Inline graphic) of the sparse coefficients matrix Inline graphic satisfies

graphic file with name DmEquation53.gif (C.1)

where Inline graphic and Inline graphic is a diagonal matrix with a one at the Inline graphicth entry when Inline graphic and zero otherwise. Matrix Inline graphic is similarly defined with respect to Inline graphic, and ‘Inline graphic’ denotes element-wise multiplication.

It follows that Inline graphic  Inline graphic, where the two summands have disjoint supports because Inline graphic is diagonal with zeros and with ‘-1’ only for the portion of the support of Inline graphic left out in Inline graphic. Therefore, we have

graphic file with name DmEquation54.gif (C.2)

Let Inline graphic. To simplify and bound (C.2), we first consider the case when only one element, say Inline graphic was left out in Inline graphic. Suppose that in its place, we have a new entry Inline graphic with Inline graphic. Then we must have

graphic file with name DmEquation55.gif (C.3)

where the first inequality is necessary for the Inline graphicth entry to swap with the Inline graphicth entry in the support and the second inequality is the reverse triangle inequality. Thus, we have

graphic file with name DmEquation56.gif (C.4)

Note that this holds even if Inline graphic and Inline graphic, i.e. only the Inline graphicth entry is left out of Inline graphic without a new non-zero (Inline graphicth) entry. Using these results, (C.2) can be readily simplified for this case as

graphic file with name DmEquation57.gif (C.5)

The last equality above follows because Inline graphic includes Inline graphic as a non-zero entry, and Inline graphic is the same as Inline graphic except that its Inline graphicth entry is also Inline graphic.

In (C.5), Inline graphic  Inline graphic. The first two summands in (C.5) are bounded by Inline graphic and Inline graphic, respectively. Thus, when one element of the true support is misestimated in each column of Inline graphic, we have

graphic file with name DmEquation58.gif (C6)

This proves (2.5) for the case when (at most) one entry of the support of each Inline graphic is wrongly estimated (left out) in Inline graphic. In the general case, when multiple elements of the support of Inline graphic may be left out in Inline graphic, each such element can be paired with a corresponding ‘new’ element in Inline graphic, and (C.4) holds for each such pair.9 The proof in this general case is similar to the aforementioned case, except that there would be summations over the left out or new indices in various equations. For example, the first summand Inline graphic in (C.5) would include a summation over all ‘new’ indices Inline graphic in Inline graphic. However, this summation is still bounded by Inline graphic. Similarly, the second summand in (C.5) would be summed over the number of (disjoint) pairs, which is again bounded by Inline graphic. Thus, Inline graphic holds generally (including when the true support is correctly estimated in Inline graphic). Therefore, a bound as in (C.6) holds in the general case.

Footnotes

1

For example, one may minimize Inline graphic with respect to Inline graphic subject to Inline graphic, where Inline graphic denotes a set sparsity level or an alternative version of this problem.

2

Depending on the signal set, either a compact (i.e. without too many atoms) dictionary or sparsifying transform may be best suited for them.

3

In this case, the support of Inline graphic in fact coincides with that of Inline graphic. If we relaxed Assumption Inline graphic from Inline graphic to Inline graphic  Inline graphic  Inline graphic, then Inline graphic holds, and the lemma still holds.

4

Matrix Inline graphic must be invertible for Inline graphic to be finite and Assumption Inline graphic to hold.

5

The SVD of the Kronecker sum is established by the following equalities that use the definitions of the Kronecker sum and SVD of Inline graphic and (2.21): Inline graphic.

6

The probability that Inline graphic is Inline graphic.

7

This is the probability that the two indexes Inline graphic and Inline graphic both appear in the support of the Inline graphicth column of Inline graphic. Thus, Inline graphic.

8

Note that Inline graphic.

9

The elements left out of the support of Inline graphic can be paired with ‘new’ elements in Inline graphic one by one, i.e. no overlaps between the pairs. If multiple new elements satisfy (C.4), the pairing picks the one with the smallest magnitude.

References

  • 1. Agarwal, A., Anandkumar, A., Jain, P. & Netrapalli, P. (2016) Learning sparsely used overcomplete dictionaries via alternating minimization. SIAM J. Optim., 26, 2775–2799. [Google Scholar]
  • 2. Agarwal, A., Anandkumar, A., Jain, P., Netrapalli, P. & Tandon, R. (2014) Learning sparsely used overcomplete dictionaries. J. Mach. Learn. Res., 35, 1–15. [Google Scholar]
  • 3. Aharon, M., Elad, M. & Bruckstein, A. (2006) K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process., 54, 4311–4322. [Google Scholar]
  • 4. Arora, S., Ge, R., Ma, T. & Moitra, A. (2015) Simple, efficient, and neural algorithms for sparse coding. Conference on Learning Theory. PMLR, Paris, France. pp. 113–149. [Google Scholar]
  • 5. Arora, S., Ge, R. & Moitra, A. (2014) New algorithms for learning incoherent and overcomplete dictionaries. Proceedings of the 27th Conference on Learning Theory. PMLR, Barcelona, Spain. pp. 779–806. [Google Scholar]
  • 6. Bai, Y., Jiang, Q. & Sun, J. (2018) Subgradient descent learns orthogonal dictionaries. arXiv preprint arXiv:1810.10702. [Google Scholar]
  • 7. Bao, C., Cai, J.-F. & Ji, H. (2013) Fast sparsity-based orthogonal dictionary learning for image restoration. Proceedings of the IEEE International Conference on Computer Vision. IEEE , Sydney, Australia . pp. 3384–3391. [Google Scholar]
  • 8. Bao, C., Ji, H., Quan, Y. & Shen, Z. (2014) L0 norm based dictionary learning by proximal methods with global convergence. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE , Columbus, Ohio. pp. 3858–3865. [Google Scholar]
  • 9. Bao, C., Ji, H., Quan, Y. & Shen, Z. (2016) Dictionary learning for sparse coding: algorithms and convergence analysis. IEEE Trans. Pattern Anal. Mach. Intell., 38, 1356–1369. [DOI] [PubMed] [Google Scholar]
  • 10. Bao, C., Ji, H. & Shen, Z. (2015) Convergence analysis for iterative data-driven tight frame construction scheme. Appl. Comput. Harmon. Anal., 38, 510–523. [Google Scholar]
  • 11. Barchiesi, D. & Plumbley, M. D. (2013) Learning incoherent dictionaries for sparse approximation using iterative projections and rotations. IEEE Trans. Signal Process., 61, 2055–2065. [Google Scholar]
  • 12. Chatterji, N. & Bartlett, P. L. (2017) Alternating minimization for dictionary learning with random initialization. Advances in Neural Information Processing Systems, Curran Associates, Inc, Long Beach, CA. vol. 30. pp. 1997–2006. [Google Scholar]
  • 13. Chen, S. S., Donoho, D. L. & Saunders, M. A. (1998) Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20, 33–61. [Google Scholar]
  • 14. Dai, W. & Milenkovic, O. (2009) Subspace pursuit for compressive sensing signal reconstruction. IEEE Trans. Inf. Theory, 55, 2230–2249. [Google Scholar]
  • 15. Efron, B., Hastie, T., Johnstone, I. & Tibshirani, R. (2004) Least angle regression. Ann. Statist., 32, 407–499. [Google Scholar]
  • 16. Elad, M. & Aharon, M. (2006) Image denoising via sparse and redundant representations over learned dictionaries. IEEE Trans. Image Process., 15, 3736–3745. [DOI] [PubMed] [Google Scholar]
  • 17. Engan, K., Aase, S. & Hakon-Husoy, J. (1999) Method of optimal directions for frame design. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Phoenix, AZ, IEEE. pp. 2443–2446. [Google Scholar]
  • 18. Hanif, M. & Seghouane, A.-K. (2014) Maximum likelihood orthogonal dictionary learning. 2014 IEEE Workshop on Statistical Signal Processing (SSP). IEEE, Gold Coast, Australia. pp. 141–144. [Google Scholar]
  • 19. Lustig, M., Donoho, D. & Pauly, J. (2007) Sparse MRI: the application of compressed sensing for rapid MR Imaging. Magn. Reson. Med., 58, 1182–1195. [DOI] [PubMed] [Google Scholar]
  • 20. Mairal, J., Bach, F., Ponce, J. & Sapiro, G. (2010) Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 11, 19–60. [Google Scholar]
  • 21. Marcellin, M. W., Gormish, M. J., Bilgin, A. & Boliek, M. P. (2000) An overview of JPEG-2000. Proceedings of the Data Compression Conference. IEEE, Snowbird, Utah. : pp. 523–541. [Google Scholar]
  • 22. Nam, S., Davies, M. E., Elad, M. & Gribonval, R. (2011) Cosparse analysis modeling—uniqueness and algorithms. ICASSP. IEEE, Prague, Czech Republic . pp. 5804–5807. [Google Scholar]
  • 23. Natarajan, B. K. (1995) Sparse approximate solutions to linear systems. SIAM J. Comput., 24, 227–234. [Google Scholar]
  • 24. Needell, D. & Tropp, J. (2009) CoSaMP: iterative signal recovery from incomplete and inaccurate samples. Appl. Comput. Harmon. Anal., 26, 301–321. [Google Scholar]
  • 25. Olshausen, B. A. & Field, D. J. (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 607–609. [DOI] [PubMed] [Google Scholar]
  • 26. Pati, Y., Rezaiifar, R. & Krishnaprasad, P. (1993) Orthogonal Matching Pursuit: recursive function approximation with applications to wavelet decomposition. Asilomar Conference on Signals, Systems and Computers, vol. 1. IEEE, Pacific Grove, CA. pp. 40–44. [Google Scholar]
  • 27. Pfister, L. & Bresler, Y. (2019) Learning filter bank sparsifying transforms. IEEE Trans. Signal Process., 67, 504–519. [Google Scholar]
  • 28. Ramirez, I., Sprechmann, P. & Sapiro, G. (2010) Classification and clustering via dictionary learning with structured incoherence and shared features. Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2010. IEEE, San Francisco, CA. pp. 3501–3508. [Google Scholar]
  • 29. Ravishankar, S. & Bresler, Y. (2013a) Closed-form solutions within sparsifying transform learning. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Vancouver, Canada. pp. 5378–5382. [Google Scholar]
  • 30. Ravishankar, S. & Bresler, Y. (2013b) Learning sparsifying transforms. IEEE Trans. Signal Process., 61, 1072–1086. [DOI] [PubMed] [Google Scholar]
  • 31. Ravishankar, S. & Bresler, Y. (2015) L0 sparsifying transform learning with efficient optimal updates and convergence guarantees. IEEE Trans. Signal Process., 63, 2389–2404. [Google Scholar]
  • 32. Ravishankar, S. & Bresler, Y. (2016) Data-driven learning of a union of sparsifying transforms model for blind compressed sensing. IEEE Trans. Comput. Imaging, 2, 294–309. [Google Scholar]
  • 33. Ravishankar, S. & Wohlberg, B. (2018) Learning multi-layer transform models. 2018 56th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, Monticello, IL. pp. 160–165. [Google Scholar]
  • 34. Rockafellar, R. T. & Wets, R. J.-B. (1998) Variational Analysis. Heidelberg, Germany: Springer. [Google Scholar]
  • 35. Rubinstein, R., Faktor, T. & Elad, M. (2012) K-SVD dictionary-learning for the analysis sparse model. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Kyoto, Japan. pp. 5405–5408. [Google Scholar]
  • 36. Rubinstein, R., Peleg, T. & Elad, M. (2013) Analysis K-SVD: a dictionary-learning algorithm for the analysis sparse model. IEEE Trans. Signal Process., 61, 661–677. [Google Scholar]
  • 37. Rubinstein, R., Zibulevsky, M. & Elad, M. (2010) Double sparsity: learning sparse dictionaries for sparse signal approximation. IEEE Trans. Signal Process., 58, 1553–1564. [Google Scholar]
  • 38. Schnass, K. (2018) Convergence radius and sample complexity of ITKM algorithms for dictionary learning. App. Comp. Harm. Ana., 45, 22–58. [Google Scholar]
  • 39. Smith, L. N. & Elad, M. (2013) Improving dictionary learning: multiple dictionary updates and coefficient reuse. IEEE Signal Process. Lett., 20, 79–82. [Google Scholar]
  • 40. Spielman, D. A., Wang, H. & Wright, J. (2012) Exact recovery of sparsely-used dictionaries. Proceedings of the 25th Annual Conference on Learning Theory. PMLR, Edinburgh, Scotland. pp. 37.1–37.18. [Google Scholar]
  • 41. Studer, C. & Baraniuk, R. G. (2012) Dictionary learning from sparsely corrupted or compressed signals. ICASSP. IEEE, Kyoto, Japan. pp. 3341–3344. [Google Scholar]
  • 42. Sun, J., Qu, Q. & Wright, J. (2017a) Complete dictionary recovery over the sphere I: overview and the geometric picture. IEEE Trans. Inf. Theory, 63, 853–884. [Google Scholar]
  • 43. Sun, J., Qu, Q. & Wright, J. (2017b) Complete dictionary recovery over the sphere II: recovery by Riemannian trust-region method. IEEE Trans. Inf. Theory, 63, 885–914. [Google Scholar]
  • 44. Wen, B., Ravishankar, S. & Bresler, Y. (2015) Structured overcomplete sparsifying transform learning with convergence guarantees and applications. Int. J. Comput. Vis., 114, 137–167. [Google Scholar]
  • 45. Xu, Q., Yu, H., Mou, X., Zhang, L., Hsieh, J. & Wang, G. (2012) Low-dose X-ray CT reconstruction via dictionary learning. IEEE Trans. Med. Imaging, 31, 1682–1697. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Xu, Y. & Yin, W. (2016) A fast patch-dictionary method for whole image recovery. Inverse Probl. Imaging, 10, 563–583. [Google Scholar]
  • 47. Yaghoobi, M., Blumensath, T. & Davies, M. (2009) Dictionary learning for sparse approximations with the majorization method. IEEE Trans. Signal Process., 57, 2178–2191. [Google Scholar]
  • 48. Yaghoobi, M., Nam, S., Gribonval, R. & Davies, M. (2011) Analysis operator learning for overcomplete cosparse representations. European Signal Processing Conference (EUSIPCO). IEEE, Barcelona, Spain. pp. 1470–1474. [Google Scholar]
  • 49. Yaghoobi, M., Nam, S., Gribonval, R. & Davies, M. E. (2012) Noise aware analysis operator learning for approximately cosparse signals. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, Kyoto, Japan. pp. 5409–5412. [Google Scholar]

Articles from Information and Inference are provided here courtesy of Oxford University Press

RESOURCES