Abstract
Convolutional operator learning is gaining attention in many signal processing and computer vision applications. Learning kernels has mostly relied on so-called patch-domain approaches that extract and store many overlapping patches across training signals. Due to memory demands, patch-domain methods have limitations when learning kernels from large datasets – particularly with multi-layered structures, e.g., convolutional neural networks – or when applying the learned kernels to high-dimensional signal recovery problems. The so-called convolution approach does not store many overlapping patches, and thus overcomes the memory problems particularly with careful algorithmic designs; it has been studied within the “synthesis” signal model, e.g., convolutional dictionary learning. This paper proposes a new convolutional analysis operator learning (CAOL) framework that learns an analysis sparsifying regularizer with the convolution perspective, and develops a new convergent Block Proximal Extrapolated Gradient method using a Majorizer (BPEG-M) to solve the corresponding block multi-nonconvex problems. To learn diverse filters within the CAOL framework, this paper introduces an orthogonality constraint that enforces a tight-frame filter condition, and a regularizer that promotes diversity between filters. Numerical experiments show that, with sharp majorizers, BPEG-M significantly accelerates the CAOL convergence rate compared to the state-of-the-art block proximal gradient (BPG) method. Numerical experiments for sparse-view computed tomography show that a convolutional sparsifying regularizer learned via CAOL significantly improves reconstruction quality compared to a conventional edge-preserving regularizer. Using more and wider kernels in a learned regularizer better preserves edges in reconstructed images.
I. Introduction
LEARNING convolutional operators from large datasets is a growing trend in signal/image processing, computer vision, and machine learning. The widely known patch-domain approaches for learning kernels (e.g., filter, dictionary, frame, and transform) extract patches from training signals for simple mathematical formulation and optimization, yielding (sparse) features of training signals [1]–[9]. Due to memory demands, using many overlapping patches across the training signals hinders using large datasets and building hierarchies on the features, e.g., deconvolutional neural networks [10], convolutional neural network (CNN) [11], and multi-layer convolutional sparse coding [12]. For similar reasons, the memory requirement of patch-domain approaches discourages learned kernels from being applied to large-scale inverse problems.
To moderate these limitations of the patch-domain approach, the so-called convolution perspective has been recently introduced by learning filters and obtaining (sparse) representations directly from the original signals without storing many overlapping patches, e.g., convolutional dictionary learning (CDL) [10], [13]–[17]. For large datasets, CDL using careful algorithmic designs [16] is more suitable for learning filters than patch-domain dictionary learning [1]; in addition, CDL can learn translation-invariant filters without obtaining highly redundant sparse representations [16]. The CDL method applies the convolution perspective for learning kernels within “synthesis” signal models. Within “analysis” signal models, however, there exist no prior frameworks using the convolution perspective for learning convolutional operators, whereas patch-domain approaches for learning analysis kernels are introduced in [3], [4], [6]–[8]. (See brief descriptions about synthesis and analysis signal models in [4, Sec. I].)
Researchers interested in dictionary learning have actively studied the structures of kernels learned by the patch-domain approach [3], [4], [6]–[8], [18]–[20]. In training CNNs (see Appendix A), however, there has been less study of filter structures having non-convex constraints, e.g., the orthogonality and unit-norm constraints in Section III, although it is thought that diverse (i.e., incoherent) filters can improve performance for some applications, e.g., image recognition [9]. On the application side, researchers have applied (deep) NNs to signal/image recovery problems. Recent works combined model-based image reconstruction (MBIR) algorithms with image refining networks [21]–[30]. In these iterative NN methods, the refining NNs should be nonexpansive for fixed-point convergence [29]; however, their training lacks consideration of filter diversity constraints, e.g., the orthogonality constraint in Section III, and thus it is unclear whether the trained NNs are nonexpansive mappings [30].
This paper proposes 1) a new convolutional analysis operator learning (CAOL) framework that learns an analysis sparsifying regularizer with the convolution perspective, and 2) a new convergent Block Proximal Extrapolated Gradient method using a Majorizer (BPEG-M [16]) for solving block multi-nonconvex problems [31]. To learn diverse filters, we propose a) CAOL with an orthogonality constraint that enforces a tight-frame (TF) filter condition in convolutional perspectives, and b) CAOL with a regularizer that promotes filter diversity. BPEG-M with sharper majorizers converges significantly faster than the state-of-the-art technique, the Block Proximal Gradient (BPG) method [31], for CAOL. This paper also introduces a new X-ray computed tomography (CT) MBIR model using a convolutional sparsifying regularizer learned via CAOL [32].
The remainder of this paper is organized as follows. Section II reviews how learned regularizers can help solve inverse problems. Section III proposes the two CAOL models. Section IV introduces BPEG-M with several generalizations, analyzes its convergence, and applies a momentum coefficient formula and restarting technique from [16]. Section V applies the proposed BPEG-M methods to the CAOL models, designs two majorization matrices, and describes memory flexibility and applicability of parallel computing to BPEG-M-based CAOL. Section VI introduces the CT MBIR model using a convolutional regularizer learned via CAOL [32], along with its properties, i.e., its mathematical relation to a convolutional autoencoder, the importance of TF filters, and its algorithmic role in signal recovery. Section VII reports numerical experiments that show 1) the importance of sharp majorization in accelerating BPEG-M, and 2) the benefits of BPEG-M-based CAOL – acceleration, convergence, and memory flexibility. Additionally, Section VII reports sparse-view CT experiments that show 3) the CT MBIR using learned convolutional regularizers significantly improves the reconstruction quality compared to that using a conventional edge-preserving (EP) regularizer, and 4) more and wider filters in a learned regularizer better preserve edges in reconstructed images. Finally, Appendix A mathematically formulates unsupervised training of CNNs via CAOL, and shows that its updates attained via BPEG-M correspond to the three important CNN operators. Appendix B introduces some potential applications of CAOL to image processing, imaging, and computer vision.
II. Background: MBIR Using Learned Regularizers
To recover a signal x from a data vector y, one often considers the following MBIR optimization problem (Appendix C provides mathematical notation): min_{x ∈ X} f(x; y) + γ g(x), where X is a feasible set, f(x; y) is a data fidelity function that models imaging physics (or image formation) and noise statistics, γ > 0 is a regularization parameter, and g(x) is a regularizer, such as total variation [33, §2–3]. However, when inverse problems are extremely ill-conditioned, the MBIR approach using hand-crafted regularizers g(x) has limitations in recovering signals. Alternatively, there has been a growing trend in learning sparsifying regularizers (e.g., convolutional regularizers [16], [17], [32], [34], [35]) from training datasets and applying the learned regularizers to the following MBIR problem [33]:
x̂ ∈ arg min_{x ∈ X} f(x; y) + γ g(x),   (B1)
where the learned regularizer g quantifies consistency between any candidate x and the training data that is encapsulated in some trained sparsifying operators. The diagram in Fig. 1 shows the general process from training sparsifying operators to solving inverse problems via (B1). Such models (B1) arise in a wide range of applications; see examples in Appendix B.
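For concreteness, the following minimal Python/NumPy sketch evaluates an objective of the form (B1) with a quadratic data-fidelity term and a learned convolutional ℓ0-type sparsifying regularizer. The filters, system matrix, and penalty form here are illustrative assumptions, not the exact formulation used later in the paper.

```python
import numpy as np
from scipy.signal import convolve2d

def mbir_cost(x, y, A, filters, gamma, alpha):
    """Toy MBIR objective f(x; y) + gamma * g(x), cf. (B1).

    f(x; y) = 0.5 * ||A x - y||_2^2 (quadratic data fidelity; A acts on the vectorized image),
    g(x)    = sum_k min_z 0.5 * ||d_k (*) x - z||_2^2 + alpha * ||z||_0,
              where the inner minimization has the closed form min(0.5*c^2, alpha) elementwise.
    """
    data_fit = 0.5 * np.sum((A @ x.ravel() - y) ** 2)
    reg = 0.0
    for d in filters:  # learned sparsifying filters d_k (assumed 2D arrays)
        c = convolve2d(x, d, mode="same", boundary="wrap")
        reg += np.sum(np.minimum(0.5 * c ** 2, alpha))
    return data_fit + gamma * reg
```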
Fig. 1.

A general flowchart from learning sparsifying operators to solving inverse problems via MBIR using the learned operators; see Section II. For the lth training sample, the training cost measures its sparse representation or sparsification errors, along with the sparsity of the representation generated by the trained operators.
This paper describes multiple aspects of learning convolutional regularizers. The next section begins by proposing a new convolutional regularizer.
III. CAOL: Models Learning Convolutional Regularizers
The goal of CAOL is to find a set of filters that “best” sparsify a set of training images. Compared to hand-crafted regularizers, learned convolutional regularizers can better extract “true” features of estimated images and remove “noisy” features with thresholding operators. We propose the following CAOL model:
min_{D, {z_{l,k}}} Σ_{l=1}^{L} Σ_{k=1}^{K} (1/2) ‖d_k ⊛ x_l − z_{l,k}‖_2^2 + α ‖z_{l,k}‖_0 + β g(D)   (P0)
where ⊛ denotes a convolution operator (see details about boundary conditions in the supplementary material), {x_l ∈ ℂ^N : l = 1, …, L} is a set of training images, {d_k ∈ ℂ^R : k = 1, …, K} is a set of convolutional kernels, {z_{l,k} ∈ ℂ^N : l = 1, …, L, k = 1, …, K} is a set of sparse codes, g(D) is a regularizer or constraint that encourages filter diversity or incoherence, α > 0 is a thresholding parameter controlling the sparsity of features {z_{l,k}}, and β > 0 is a regularization parameter for g(D). We group the K filters into a matrix D ∈ ℂ^{R×K}:
D := [d_1, d_2, …, d_K] ∈ ℂ^{R×K}.   (1)
For simplicity, we fix the dimension of the training signals, i.e., x_l ∈ ℂ^N for all l, but the proposed model (P0) can use training signals of different dimensions, i.e., x_l ∈ ℂ^{N_l}. For sparse-view CT in particular, the diagram in Fig. 2 shows the process from CAOL (P0) to solving its inverse problem via MBIR using learned convolutional regularizers.
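As an illustration of (P0), the sketch below evaluates the CAOL objective (without the filter term β·g(D)) for given filters and sparse codes, assuming 2D training images, real-valued filters, and circular boundary conditions; all names are ours, not the paper's.

```python
import numpy as np
from scipy.signal import convolve2d

def caol_cost(images, filters, codes, alpha):
    """Data term of (P0): sum_{l,k} 0.5*||d_k (*) x_l - z_{l,k}||_2^2 + alpha*||z_{l,k}||_0."""
    cost = 0.0
    for l, x in enumerate(images):
        for k, d in enumerate(filters):
            c = convolve2d(x, d, mode="same", boundary="wrap")   # d_k (*) x_l
            z = codes[l][k]                                      # sparse code z_{l,k}
            cost += 0.5 * np.sum((c - z) ** 2) + alpha * np.count_nonzero(z)
    return cost
```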
Fig. 2.
A flowchart from CAOL (P0) to MBIR using a convolutional sparsifying regularizer learned via CAOL (P3) in sparse-view CT. See details of the CAOL process (P0) and its variants (P1)–(P2), and the CT MBIR process (P3) in Section III and Section VI, respectively.
The following two subsections design the constraint or regularizer g(D) to avoid redundant filters (without it, all filters could be identical).
A. CAOL with Orthogonality Constraint
We first propose a CAOL model with a nonconvex orthogonality constraint on the filter matrix D in (1):
min_{D, {z_{l,k}}} Σ_{l=1}^{L} Σ_{k=1}^{K} (1/2) ‖d_k ⊛ x_l − z_{l,k}‖_2^2 + α ‖z_{l,k}‖_0,   subject to   D D^H = (1/R) · I_R   (P1)
The orthogonality condition in (P1) enforces a TF condition on the filters {dk} in CAOL (P0). Proposition 3.1 below formally states this relation.
Proposition 3.1 (Tight-frame filters). Filters {d_k} satisfying the orthogonality constraint in (P1) satisfy the following TF condition in a convolution perspective:
Σ_{k=1}^{K} ‖d_k ⊛ x‖_2^2 = ‖x‖_2^2,   ∀x,   (2)
for both circular and symmetric boundary conditions.
Proof: See Section S.I of the supplementary material. ■
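As a quick numerical sanity check of Proposition 3.1, the sketch below builds 1D filters from a scaled orthonormal basis (so that D Dᴴ = (1/R)·I, our reading of the orthogonality constraint) and verifies the Parseval-type identity Σ_k ‖d_k ⊛ x‖² = ‖x‖² under circular convolution; the specific construction is only an illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
R, K, N = 8, 8, 64

# Filters from a scaled orthonormal basis: columns d_k with D @ D.T = (1/R) * I.
Q, _ = np.linalg.qr(rng.standard_normal((R, K)))
D = Q / np.sqrt(R)                       # each filter has squared norm 1/R

x = rng.standard_normal(N)
energy = sum(
    np.sum(np.real(np.fft.ifft(np.fft.fft(D[:, k], N) * np.fft.fft(x))) ** 2)
    for k in range(K)
)
print(np.allclose(energy, np.sum(x ** 2)))   # True: tight-frame (Parseval) behavior
```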
Proposition 3.1 corresponds to a TF result from patch-domain approaches; see Section S.I. (Note that the patch-domain approach in [6, Prop. 3] requires R = K.) However, we constrain the filter dimension to be R ≤ K to have an efficient solution for CAOL model (P1); see Proposition 5.4 later. The following section proposes a more flexible CAOL model in terms of the filter dimensions R and K.
B. CAOL with Diversity Promoting Regularizer
As an alternative to the CAOL model (P1), we propose a CAOL model with a diversity promoting regularizer and a nonconvex norm constraint on the filters {dk}:
min_{D, {z_{l,k}}} Σ_{l=1}^{L} Σ_{k=1}^{K} (1/2) ‖d_k ⊛ x_l − z_{l,k}‖_2^2 + α ‖z_{l,k}‖_0 + β g_div(D),   subject to   ‖d_k‖_2^2 = 1/R, k = 1, …, K   (P2)
In the CAOL model (P2), we consider the following:
The norm constraint ‖d_k‖_2^2 = 1/R in (P2) forces the learned filters {d_k} to have uniform energy. In addition, it avoids the “scale ambiguity” problem [36].
The regularizer in (P2), g_div(D), promotes filter diversity, i.e., incoherence between d_k and d_{k′}, measured by |⟨d_k, d_{k′}⟩|² for k ≠ k′.
When R = K and β → ∞, the model (P2) becomes (P1), since D^H D = (1/R) · I_K implies D D^H = (1/R) · I_R (for square matrices A and B, if AB = I then BA = I). Thus (P2) generalizes (P1) by relaxing the off-diagonal elements of the equality constraint in (P1). (In other words, when R = K, the orthogonality constraint in (P1) both enforces the TF condition and promotes filter diversity.) One price of this generalization is the extra tuning parameter β.
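The sketch below shows one natural way to compute such a diversity measure: the sum of squared inner products between distinct filters, which (under the norm constraint of (P2)) equals a squared Frobenius distance between the filter Gram matrix and (1/R)·I. The exact form of g_div(D) in the paper may differ by constants, so treat this as an assumption-labeled example.

```python
import numpy as np

def diversity_penalty(D):
    """sum_{k != k'} |<d_k, d_k'>|^2 for D = [d_1, ..., d_K] (R x K).

    With the (P2)-style constraint ||d_k||_2^2 = 1/R, this equals
    ||D^H D - (1/R) I_K||_F^2 (our assumed, equivalent matrix form)."""
    G = D.conj().T @ D                      # K x K Gram matrix of the filters
    off_diag = G - np.diag(np.diag(G))      # drop the diagonal energy terms
    return np.sum(np.abs(off_diag) ** 2)
```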
(P1)–(P2) are challenging nonconvex optimization problems and block optimization approaches seem suitable. The following section proposes a new block optimization method with momentum and majorizers, to rapidly solve the multiple block multi-nonconvex problems proposed in this paper, while guaranteeing convergence to critical points.
IV. BPEG-M: Solving Block Multi-Nonconvex Problems with Convergence Guarantees
This section describes a new optimization approach, BPEG-M, for solving block multi-nonconvex problems like a) CAOL (P1)–(P2),1 b) CT MBIR (P3) using learned convolutional regularizer via (P1) (see Section VI), and c) “hierarchical” CAOL (A1) (see Appendix A).
A. BPEG-M – Setup
We treat the variables of the underlying optimization problem either as a single block or multiple disjoint blocks. Specifically, consider the following block multi-nonconvex optimization problem:
min_{x} F(x_1, …, x_B) := f(x_1, …, x_B) + Σ_{b=1}^{B} g_b(x_b)   (3)
where the variable x is decomposed into B blocks x_1, …, x_B, f is assumed to be continuously differentiable, but the functions {g_b : b = 1, …, B} are not necessarily differentiable. The function g_b can incorporate the constraint x_b ∈ X_b by allowing any g_b to be extended-valued, e.g., g_b(x_b) = ∞ if x_b ∉ X_b, for b = 1, …, B. It is standard to assume that both f and {g_b} are closed and proper and that the sets {X_b} are closed and nonempty. We do not assume that f, {g_b}, or {X_b} are convex. Importantly, g_b can be a nonconvex ℓp quasi-norm, p ∈ [0, 1). The general block multi-convex problem in [16], [38] is a special case of (3).
The BPEG-M framework considers a more general concept than Lipschitz continuity of the gradient as follows:
Definition 4.1 (M-Lipschitz continuity). A function f is M-Lipschitz continuous on its domain if there exists a (symmetric) positive definite matrix M such that
‖∇f(x) − ∇f(y)‖_{M^{−1}} ≤ ‖x − y‖_M,   ∀x, y,
where ‖u‖_M := (u^H M u)^{1/2}.
Lipschitz continuity is a special case of M-Lipschitz continuity with M equal to a scaled identity matrix whose scale is a Lipschitz constant of the gradient ∇f (e.g., for f(x) = (1/2)‖Ax − b‖_2^2, the (smallest) Lipschitz constant of ∇f is the maximum eigenvalue of A^H A). If the gradient of a function is M-Lipschitz continuous, then we obtain the following quadratic majorizer (i.e., surrogate function [39], [40]) at a given point y without assuming convexity:
Lemma 4.2 (Quadratic majorization (QM) via M-Lipschitz continuous gradients). Let ∇f be M-Lipschitz continuous. Then
f(x) ≤ f(y) + ⟨∇f(y), x − y⟩ + (1/2) ‖x − y‖_M^2,   ∀x, y.
Proof: See Section S.II of the supplementary material. ■
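To make Lemma 4.2 concrete, the snippet below numerically checks the quadratic majorizer for the least-squares case f(x) = (1/2)‖Ax − b‖², comparing the exact Hessian M = AᵀA against the classical looser choice M = λ_max(AᵀA)·I; this is our own illustration, not code from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 10))
b = rng.standard_normal(30)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2)
grad = lambda x: A.T @ (A @ x - b)

def majorizer(x, y, M):
    """f(y) + <grad f(y), x - y> + 0.5 * ||x - y||_M^2 (Lemma 4.2 bound)."""
    d = x - y
    return f(y) + grad(y) @ d + 0.5 * d @ (M @ d)

H = A.T @ A                                            # sharpest valid M for this f
M_loose = np.max(np.linalg.eigvalsh(H)) * np.eye(10)   # scaled identity (Lipschitz constant)

x, y = rng.standard_normal(10), rng.standard_normal(10)
print(f(x) <= majorizer(x, y, H) + 1e-9)                        # True (tight for quadratics)
print(majorizer(x, y, H) <= majorizer(x, y, M_loose) + 1e-9)    # sharper vs. looser bound
```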
Exploiting Definition 4.1 and Lemma 4.2, the proposed method, BPEG-M, is given as follows. To solve (3), we minimize a majorizer of F cyclically over each block x_1, …, x_B, while fixing the remaining blocks at their previously updated values. Let x_b^{(i)} be the value of x_b after its ith update, and define
f_b^{(i)}(x_b) := f(x_1^{(i+1)}, …, x_{b−1}^{(i+1)}, x_b, x_{b+1}^{(i)}, …, x_B^{(i)}).
At the bth block of the ith iteration, we apply Lemma 4.2 to the function f_b^{(i)} with an M_b^{(i)}-Lipschitz continuous gradient, and minimize the majorized function.² Specifically, BPEG-M uses the updates
| (4) |
where
| (5) |
the proximal operator is defined by
Prox_g^M(v) := arg min_u g(u) + (1/2) ‖u − v‖_M^2,
the block-partial gradient of f is evaluated at the extrapolated point given by (5), and an upper-bounded majorization matrix is updated by
| (6) |
and M_b^{(i)} is a symmetric positive definite majorization matrix for the block-partial gradient of f_b^{(i)}. In (5), the matrix E_b^{(i)} is an extrapolation matrix that accelerates convergence in solving block multi-convex problems [16]. We design it in the following form:
| (7) |
for some and δ < 1, to satisfy condition (9) below. In general, choosing λb values in (6)–(7) to accelerate convergence is application-specific. Algorithm 1 summarizes these updates.
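The schematic sketch below shows the structure of one block update in the spirit of (4)–(7), with a diagonal majorization matrix; the extrapolation weight, proximal operator, and variable names are simplified placeholders, not the exact Algorithm 1.

```python
import numpy as np

def bpegm_block_update(x_curr, x_prev, grad_f, prox_g, M_diag, lam, e_weight):
    """One schematic BPEG-M block update (cf. (4)-(7)) with a diagonal majorizer.

    grad_f   : returns the block-partial gradient of f at a point
    prox_g   : prox_g(v, w) solves min_u g(u) + 0.5 * sum(w * (u - v)**2)
    M_diag   : diagonal entries of the majorization matrix M_b
    lam      : lambda_b >= 1, trading majorization sharpness for extrapolation
    e_weight : scalar extrapolation weight (simplified stand-in for the matrix E_b)
    """
    M_tilde = lam * M_diag                            # upper-bounded majorizer, cf. (6)
    x_dot = x_curr + e_weight * (x_curr - x_prev)     # extrapolated point, cf. (5)
    v = x_dot - grad_f(x_dot) / M_tilde               # majorized gradient step
    return prox_g(v, M_tilde)                         # proximal step, cf. (4)

# Example prox for g = alpha * ||.||_0 (hard thresholding under a diagonal weighting):
hard_prox = lambda v, w, alpha=0.1: v * (0.5 * w * v ** 2 > alpha)
```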
The majorization matrices in (6) influence the convergence rate of BPEG-M. A tighter majorization matrix (i.e., a matrix giving tighter bounds in the sense of Lemma 4.2) provides a faster convergence rate [41, Lem. 1], [16, Fig. 2–3]. An interesting observation in Algorithm 1 is that there exists a tradeoff between majorization sharpness via (6) and extrapolation effect via (5) and (7). For example, increasing λ_b (e.g., λ_b = 2) allows more extrapolation but results in looser majorization; setting λ_b → 1 results in sharper majorization but provides less extrapolation.
Fig. 3.
Cost minimization comparisons in CAOL (P1) with different BPG-type algorithms and datasets (R = K = 49 and α = 2.5 × 10−4; solution (31) was used for sparse code updates; BPG (Xu & Ying ‘17) [31] used the maximum eigenvalue of Hessians for Lipschitz constants; the cross mark x denotes a termination point). A sharper majorization leads to faster convergence of BPEG-M; for all the training datasets considered in this paper, the majorization matrix in Proposition 5.1 is sharper than those in Lemmas 5.2–5.3.
Remark 4.3. The proposed BPEG-M framework – with key updates (4)–(5) – generalizes the BPG method [31], and has several benefits over BPG [31] and BPEG-M introduced earlier in [16]:
The BPG setup in [31] is a particular case of BPEG-M using a scaled identity majorization matrix M_b whose scale is a Lipschitz constant of the block-partial gradient of f. The BPEG-M framework can significantly accelerate convergence by allowing sharp majorization; see [16, Fig. 2–3] and Fig. 3. This generalization was first introduced for block multi-convex problems in [16], but the proposed BPEG-M in this paper addresses the more general problem, block multi-(non)convex optimization.
BPEG-M is useful for controlling the tradeoff between majorization sharpness and extrapolation effect in different blocks, by allowing each block to use a different λ_b value. If tight majorization matrices can be designed for a certain block b, then it could be reasonable to maintain the majorization sharpness by setting λ_b very close to 1. When setting λ_b = 1 + ϵ (e.g., ϵ is a machine epsilon) and using no extrapolation, the solutions of the original and its upper-bounded problem become (almost) identical. In such cases, it is unnecessary to solve the upper-bounded problem (4), and the proposed BPEG-M framework allows using the solution of the original block problem without QM; see Section V-B. This generalization was not considered in [31].
The condition for designing the extrapolation matrix (7), i.e., (9) in Assumption 3, is more general than that in [16, (9)] (e.g., (10)). Specifically, the majorization matrices at different iterations in (7) need not be diagonalized by the same basis.
The first two generalizations lead to the question, “Under the sharp QM regime (i.e., having tight bounds in Lemma 4.2), what is the best way of controlling {λ_b} in (6)–(7) in Algorithm 1?” Our experiments show that, if sufficiently sharp majorizers are obtained for some or all blocks, then giving more weight to sharp majorization provides faster convergence than emphasizing extrapolation; for example, λ_b = 1 + ϵ gives faster convergence than λ_b = 2.
B. BPEG-M – Convergence Analysis
This section analyzes the convergence of Algorithm 1 under the following assumptions.
Assumption 1) F is proper and lower bounded in dom(F), f is continuously differentiable, and g_b is proper and lower semicontinuous, ∀b.³ Problem (3) has a critical point x̄, i.e., 0 ∈ ∂F(x̄), where ∂F(x) denotes the limiting subdifferential of F at x (see [42, §1.9], [43, §8]).
Assumption 2) The block-partial gradients of f are M_b^{(i)}-Lipschitz continuous, i.e.,
| (8) |
for all points in the corresponding block, and the (unscaled) majorization matrices satisfy M_b^{(i)} ⪰ m_b · I with 0 < m_b < ∞, ∀b, i.
Assumption 3) The extrapolation matrices satisfy
| (9) |
for any δ < 1, ∀b, i.
Condition (9) in Assumption 3 generalizes that in [16, Assumption 3]. If the eigenspaces of the majorization matrices at consecutive iterations coincide (e.g., diagonal and circulant matrices), ∀i [16, Assumption 3], then (9) becomes
| (10) |
as similarly given in [16, (9)]. This generalization allows one to consider arbitrary structures of the majorization matrices across iterations.
Lemma 4.4 (Sequence bounds). Let the majorization matrices and the extrapolation matrices {E_b : b = 1, …, B} be as in (6)–(7), respectively. The cost function decrease for the ith update satisfies:
| (11) |
Proof: See Section S.III of the supplementary material. ■
Lemma 4.4 generalizes [31, Lem. 1], which uses {λ_b = 2}. Taking the majorization matrices in (11) to be scaled identities whose scales are the corresponding Lipschitz constants, the bound (11) becomes equivalent to that in [31, (13)]. Note that BPEG-M for block multi-convex problems in [16] can be viewed within BPEG-M in Algorithm 1, for reasons similar to [31, Rem. 2] – bound (11) holds for the block multi-convex problems by choosing the matrices in (10) as in [16, Prop. 3.2].
Proposition 4.5 (Square summability). Let {x^{(i+1)} : i ≥ 0} be generated by Algorithm 1. We have
| (12) |
Proof: See Section S.IV of the supplementary material. ■
Proposition 4.5 implies that
| (13) |
and (13) is used to prove the following theorem:
Theorem 4.6 (A limit point is a critical point). Under Assumptions 1–3, let {x(i+1) : i ≥ 0} be generated by Algorithm 1. Then any limit point of {x(i+1) : i ≥ 0} is a critical point of (3). If a subsequence {x^{(i_j+1)}} converges to x̄, then {F(x^{(i_j+1)})} converges to F(x̄).
Proof: See Section S.V of the supplementary material. ■
Finite limit points exist if the generated sequence {x(i+1) : i ≥ 0} is bounded; see, for example, [44, Lem. 3.2–3.3]. For some applications, the boundedness of {x(i+1) : i ≥ 0} can be satisfied by choosing appropriate regularization parameters, e.g., [16].
C. Restarting BPEG-M
BPG-type methods [16], [31], [38] can be further accelerated by applying 1) a momentum coefficient formula similar to those used in fast proximal gradient (FPG) methods [45]– [47], and/or 2) an adaptive momentum restarting scheme [48], [49]; see [16]. This section applies these two techniques to further accelerate BPEG-M in Algorithm 1.
First, we apply the following increasing momentum-coefficient formula to (7) [45]:
| (14) |
This choice guarantees fast convergence of FPG method [45]. Second, we apply a momentum restarting scheme [48], [49], when the following gradient-mapping criterion is met [16]:
| (15) |
where the angle between two nonzero real vectors ϑ and ϑ′ is cos^{−1}(⟨ϑ, ϑ′⟩ / (‖ϑ‖_2 ‖ϑ′‖_2)) and ω ∈ [−1, 0]. This scheme restarts the algorithm whenever the momentum, i.e., the difference between consecutive block updates, is likely to lead the algorithm in an unhelpful direction, as measured by the gradient mapping at the current update. We refer to BPEG-M combined with the methods (14)–(15) as restarting BPEG-M (reBPEG-M). Section S.VI in the supplementary material summarizes the updates of reBPEG-M.
To solve the block multi-nonconvex problems proposed in this paper (e.g., (P1)–(P3)), we apply reBPEG-M (a variant of Algorithm 1; see Algorithm S.1), promoting fast convergence to a critical point.
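As a sketch of the two acceleration ingredients in (14)–(15), the snippet below uses the standard FPG momentum-coefficient recursion and a gradient-based restart test; the paper's exact formulas may include additional majorizer-dependent scaling, so this is only an assumption-labeled illustration.

```python
import numpy as np

def next_momentum(theta):
    """FPG-style coefficients (our assumed form of (14)):
    theta_{i+1} = (1 + sqrt(1 + 4*theta_i^2))/2; extrapolation weight (theta_i - 1)/theta_{i+1}."""
    theta_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * theta ** 2))
    return theta_next, (theta - 1.0) / theta_next

def should_restart(grad_map, momentum, omega=-0.1):
    """Restart test in the spirit of (15): restart when cos(angle(grad_map, momentum)) > omega,
    omega in [-1, 0]. Sign conventions here are our assumption; see [16] for the exact criterion."""
    den = np.linalg.norm(grad_map) * np.linalg.norm(momentum)
    return den > 0 and (grad_map @ momentum) / den > omega

# Typical usage: start with theta = 1 and reset theta to 1 whenever should_restart(...) fires.
```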
V. Fast and Convergent CAOL via BPEG-M
This section applies the general BPEG-M approach to CAOL. The CAOL models (P1) and (P2) satisfy the assumptions of BPEG-M; see Assumptions 1–3 in Section IV-B. CAOL models (P1) and (P2) readily satisfy Assumption 1 of BPEG-M. To show the continuous differentiability of f and the lower boundedness of F, consider that 1) the quadratic data-fitting term in (P0) is continuously differentiable with respect to D and {z_{l,k}}; 2) the sequence {D^{(i+1)}} is bounded, because it lies in the compact feasible sets of (P1) and (P2), respectively; and 3) the positive thresholding parameter α ensures that the sparse-code sequence is bounded (otherwise the cost would diverge). In addition, for both (P1) and (P2), the lower semicontinuity of the regularizer g_b holds, ∀b. For D-optimization, the indicator functions of the feasible sets are lower semicontinuous, because the sets are compact. For {z_{l,k}}-optimization, the ℓ0-quasi-norm is a lower semicontinuous function. Assumptions 2 and 3 are satisfied with the majorization matrix designs in this section – see Sections V-A–V-B later – and the extrapolation matrix design in (7), respectively.
Since CAOL models (P1) and (P2) satisfy the BPEG-M conditions, we solve (P1) and (P2) by the reBPEG-M method with a two-block scheme, i.e., we alternately update all filters D and all sparse codes {z_{l,k} : l = 1, …, L, k = 1, …, K}. Sections V-A and V-B describe the details of the D-block and {z_{l,k}}-block optimization within the BPEG-M framework, respectively. The BPEG-M-based CAOL algorithm is particularly useful for learning convolutional regularizers from large datasets because of its memory flexibility and applicability of parallel computing, as described in Section V-C and Sections V-A–V-B, respectively.
A. Filter Update: D-Block Optimization
We first investigate the structure of the system matrix in the filter update for (P0). This is useful for 1) accelerating majorization matrix computation in filter updates (e.g., Lemmas 5.2–5.3) and 2) applying the R × N-sized adjoint operators (e.g., Ψ_l in (17) below) to an N-sized vector without needing the Fourier approach [16, Sec. V-A] that uses commutativity of convolution and Parseval’s relation. Given the current estimates of the sparse codes {z_{l,k} : l = 1, …, L, k = 1, …, K}, the filter update problem of (P0) is equivalent to
min_{D} Σ_{l=1}^{L} Σ_{k=1}^{K} (1/2) ‖Ψ_l^H d_k − z_{l,k}‖_2^2 + β g(D),   (16)
where D is defined in (1), and Ψ_l ∈ ℂ^{R×N} is defined by
Ψ_l := [P_{B_1} x̃_l, P_{B_2} x̃_l, …, P_{B_R} x̃_l]^H,   (17)
P_{B_r} is the rth (rectangular) selection matrix that selects the N rows corresponding to the indices B_r = {r, …, r + N − 1} from the identity matrix of the padded-signal size, and {x̃_l} is a set of padded training data. Note that applying Ψ_l in (17) to a vector of size N is analogous to calculating cross-correlation between x̃_l and the vector, i.e., the rth output entry is the inner product between the rth length-N window of x̃_l and the vector, r = 1, …, R. In general, a tilde denotes a padded signal vector.
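The 1D sketch below builds Ψ_l from a padded training signal as described above and checks that applying it to a length-N vector produces R sliding inner products (cross-correlation values); the indexing and padding choices are our reconstruction of (17), so treat the specifics as assumptions.

```python
import numpy as np

def build_psi(x_pad, R, N):
    """Psi_l in the spirit of (17): row r is the (conjugated) length-N window of the
    padded signal starting at index r, so (Psi_l v)_r = <window_r, v>."""
    return np.stack([x_pad[r:r + N] for r in range(R)]).conj()

rng = np.random.default_rng(2)
N, R = 16, 4
x = rng.standard_normal(N)
x_pad = np.concatenate([x, x[:R - 1]])        # circular padding (one possible boundary choice)

Psi = build_psi(x_pad, R, N)                  # R x N
v = rng.standard_normal(N)

ref = np.array([x_pad[r:r + N] @ v for r in range(R)])   # sliding inner products
print(np.allclose(Psi @ v, ref))                          # True
```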
1). Majorizer Design:
This subsection designs multiple majorizers for the D-block optimization and compares their required computational complexity and tightness. The next proposition considers the structure of Ψl in (17) to obtain the Hessian in (16) for an arbitrary boundary condition.
Proposition 5.1 (Exact Hessian matrix MD). The following matrix is identical to the Hessian of the quadratic cost in (16) with respect to each filter d_k:
M_D = Σ_{l=1}^{L} Ψ_l Ψ_l^H.   (18)
A sufficiently large number of training signals L (with N ≥ R) can guarantee that M_D in Proposition 5.1 is positive definite. The drawback of using Proposition 5.1 is its polynomial computational complexity, i.e., O(LR²N) – see Table I. When L (the number of training signals) or N (the size of training signals) is large, the quadratic complexity with the size of filters – R² – can quickly increase the total computational costs when multiplied by L and N. (The BPG setup in [31] additionally requires O(R³) because it uses the eigendecomposition of (18) to calculate the Lipschitz constant.)
TABLE I.
Computational complexity of different majorization matrix designs for the filter update problem (16)
| Lemmas 5.2–5.3 | Proposition 5.1 |
|---|---|
| O(LRN) | O(LR²N) |
For the CAOL problems (P0) themselves, different from CDL [13]–[17], the complexity O(LR²N) of applying Proposition 5.1 is reasonable. In BPEG-M-based CDL [16], [17], a majorization matrix for the kernel update is calculated at every iteration because it depends on the updated sparse codes; however, in CAOL, one can precompute M_D via Proposition 5.1 (or Lemmas 5.2–5.3 below) without needing to change it at every kernel update. The polynomial computational cost of applying Proposition 5.1 becomes problematic only when the training signals change. Examples include 1) hierarchical CAOL, e.g., the CNN in Appendix A, 2) “adaptive-filter MBIR”, particularly with high-dimensional signals [2], [6], [50], and 3) online learning [51], [52]. Therefore, we also describe a more efficiently computable majorization matrix at the cost of looser bounds (i.e., slower convergence; see Fig. 3). Applying Lemma S.1, we first introduce a diagonal majorization matrix M_D for the Hessian in (16):
Lemma 5.2 (Diagonal majorization matrix MD). The following diagonal matrix satisfies M_D ⪰ Σ_{l=1}^{L} Ψ_l Ψ_l^H:
M_D = diag( Σ_{l=1}^{L} |Ψ_l| |Ψ_l^H| 1_R ),   (19)
where |·| takes the absolute values of the elements of a matrix and 1_R denotes the length-R vector of ones.
The majorization matrix design in Lemma 5.2 is more efficient to compute than that in Proposition 5.1, because no R²-factor is needed for calculating M_D in Lemma 5.2, i.e., the cost is O(LRN); see Table I. Designing M_D in Lemma 5.2 also takes fewer calculations than the Fourier approach of [16, Lem. 5.1] when the filter size R is sufficiently small relative to the signal size N. Using Lemma S.2, we next design a potentially sharper majorization matrix than (19), while maintaining the cost O(LRN):
Lemma 5.3 (Scaled identity majorization matrix MD). The following scaled identity matrix satisfies M_D ⪰ Σ_{l=1}^{L} Ψ_l Ψ_l^H:
| (20) |
for a circular boundary condition.
Proof: See Section S.VII of the supplementary material. ■
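For a random 1D example, the sketch below compares the exact Hessian Σ_l Ψ_l Ψ_lᴴ with a Lemma 5.2-type diagonal majorizer built from elementwise absolute values, and verifies the majorization by checking positive semidefiniteness of their difference; both constructions are our assumptions based on the descriptions above.

```python
import numpy as np

rng = np.random.default_rng(3)
L, N, R = 5, 32, 6

def psi(x):
    """R x N matrix whose rows are length-N windows of the circularly padded signal."""
    x_pad = np.concatenate([x, x[:R - 1]])
    return np.stack([x_pad[r:r + N] for r in range(R)])

Psis = [psi(rng.standard_normal(N)) for _ in range(L)]

hessian = sum(P @ P.T for P in Psis)                              # exact Hessian (cf. Prop. 5.1)
diag_maj = np.diag(sum(np.abs(P) @ np.abs(P.T) @ np.ones(R) for P in Psis))  # Lemma 5.2 type

# The minimum eigenvalue of (diag_maj - hessian) is nonnegative, confirming the majorization.
print(np.min(np.linalg.eigvalsh(diag_maj - hessian)) >= -1e-9)    # True
```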
For all the training datasets used in this paper, we observed that the tightness of majorization matrices in Proposition 5.1 and Lemmas 5.2–5.3 for the Hessian is given by
(18) ⪯ (20) ⪯ (19).   (21)
(Note that (18) ⪯ (19) always holds regardless of training data.) Fig. 3 illustrates the effects of the majorizer sharpness in (21) on CAOL convergence rates. As described in Section IV-A, selecting λD (see (22) and (26) below) controls the tradeoff between majorization sharpness and extrapolation effect. We found that using fixed λD = 1 + ϵ gives faster convergence than λD = 2; see Fig. 4 (this behavior is more obvious in solving the CT MBIR model in (P3) via BPEG-M – see [32, Fig. 3]). The results in Fig. 4 and [32, Fig. 3] show that, under the sharp majorization regime, maintaining sharper majorization is more critical in accelerating the convergence of BPEG-M than giving more weight to extrapolation.
Fig. 4.
Cost minimization comparisons in CAOL (P1) with different BPEG-M algorithms and datasets (Lemma 5.2 was used for MD; R = K = 49; deterministic filter initialization and random sparse code initialization). Under the sharp majorization regime, maintaining sharp majorization (i.e., λD = 1 + ϵ) provides faster convergence than giving more weight on extrapolation (i.e., λD = 2). (The same behavior was found in sparse-view CT application [32, Fig. 3].) There exist no differences in convergence between solution (31) and solution (33) using {λZ = 1 + ϵ}.
Sections V-A2 and V-A3 below apply the majorization matrices designed in this section to proximal mappings of D-optimization in (P1) and (P2), respectively.
2). Proximal Mapping with Orthogonality Constraint:
The corresponding proximal mapping problem of (16) using the orthogonality constraint in (P1) is given by
| (22) |
where
| (23) |
| (24) |
for k = 1, …, K, where the scaled majorization matrix is given by (6). One can parallelize over k = 1, …, K in computing the quantities in (23). The proposition below provides an optimal solution to (22):
Proposition 5.4. Consider the following constrained minimization problem:
| (25) |
where D is given as in (1), and the majorization matrix is given by (18), (19), or (20). The optimal solution to (25) is given by
where the corresponding matrix has (full) singular value decomposition UΣV^H.
Proof: See Section S.VIII of the supplementary material. ■
When using Proposition 5.1, the solution in Proposition 5.4 simplifies to the following update:
Similar to obtaining the quantities in (23), this computation is parallelizable over k.
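To illustrate the type of SVD-based update in Proposition 5.4, the sketch below projects an unconstrained filter-matrix estimate onto {D : D Dᴴ = (1/R)·I} in the plain Frobenius sense, giving the Procrustes-style solution (1/√R)·U Vᴴ. The paper's exact update uses the majorization-weighted quantities from (23)–(24), so this is a simplified, assumption-labeled version.

```python
import numpy as np

def project_orthogonality(V_mat, R=None):
    """Nearest matrix (Frobenius norm) to V_mat satisfying D @ D^H = (1/R) * I_R.

    If V_mat = U S W^H (SVD), the projection is (1/sqrt(R)) * U @ W^H."""
    if R is None:
        R = V_mat.shape[0]
    U, _, Wh = np.linalg.svd(V_mat, full_matrices=False)
    return (1.0 / np.sqrt(R)) * U @ Wh

rng = np.random.default_rng(4)
R, K = 6, 10                     # R <= K so that D D^H = (1/R) I_R is feasible
D = project_orthogonality(rng.standard_normal((R, K)))
print(np.allclose(D @ D.T, np.eye(R) / R))   # True
```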
3). Proximal Mapping with Diversity Promoting Regularizer:
The corresponding proximal mapping problem of (16) using the norm constraint and diversity promoting regularizer in (P2) is given by
| (26) |
where g_div(D) and the remaining quantities are given as in (P2), (23), and (24), respectively. We first decompose the regularization term g_div(D) as follows:
| (27) |
where the equality in (27) holds by using the constraint in (26), and the Hermitian matrix is defined by
| (28) |
Using (27) and (28), we rewrite (26) as
| (29) |
This is a quadratically constrained quadratic program. We apply an accelerated Newton’s method to solve (29); see Section S.IX. Similar to solving (22) in Section V-A2, solving (26) is a small-dimensional problem (K separate problems of size R).
B. Sparse Code Update: {z_{l,k}}-Block Optimization
Given the current estimate of D, the sparse code update problem for (P0) is given by
min_{{z_{l,k}}} Σ_{l=1}^{L} Σ_{k=1}^{K} (1/2) ‖d_k ⊛ x_l − z_{l,k}‖_2^2 + α ‖z_{l,k}‖_0.   (30)
This problem separates readily, allowing parallel computation with LK threads. An optimal solution to (30) is efficiently obtained by the well-known hard thresholding:
z_{l,k} = H_{√(2α)}(d_k ⊛ x_l),   (31)
for k = 1, …, K and l = 1, …, L, where
(H_a(u))_n := u_n if |u_n| ≥ a, and 0 otherwise,   (32)
for all n. Considering λ_Z (the scaling of the sparse-code majorizer in (6)) as λ_Z → 1, the solution obtained by the BPEG-M approach becomes equivalent to (31). To show this, observe first that the BPEG-M-based solution (using M_Z = I_N) to (30) is obtained by
| (33) |
The downside of applying solution (33) is that it would require additional memory to store the corresponding extrapolated points, and this memory grows with N, L, and K. Considering the sharpness of the majorizer in (30), i.e., M_Z = I_N, and the memory issue, it is reasonable to consider the solution (33) with no extrapolation (i.e., zero extrapolation matrices), which becomes equivalent to (31) as λ_Z → 1.
Solution (31) has two benefits over (33): compared to (33), (31) requires only half the memory to update all the sparse-code vectors and no additional computations related to extrapolation. While having these benefits, (31) empirically has convergence rates equivalent to (33) using {λ_Z = 1 + ϵ}; see Fig. 4. Throughout the paper, we solve the sparse coding problems (e.g., (30) and the {z_k}-block optimization in (P3)) via optimal solutions in the form of (31).
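The sparse-code step (30)–(32) is embarrassingly parallel over (l, k); the sketch below applies the closed-form hard-thresholding update to every pair, assuming 2D images, circular boundaries, and the threshold √(2α) as reconstructed above.

```python
import numpy as np
from scipy.signal import convolve2d

def update_sparse_codes(images, filters, alpha):
    """Closed-form solution of (30): z_{l,k} = hard_threshold(d_k (*) x_l, sqrt(2*alpha))."""
    thr = np.sqrt(2.0 * alpha)
    codes = []
    for x in images:
        codes_l = []
        for d in filters:
            c = convolve2d(x, d, mode="same", boundary="wrap")
            # Keep entries whose magnitude reaches the threshold; ties are equally optimal.
            codes_l.append(c * (np.abs(c) >= thr))
        codes.append(codes_l)
    return codes
```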
C. Lower Memory Use than Patch-Domain Approaches
The convolution perspective in CAOL (P0) requires much less memory than conventional patch-domain approaches; thus, it is more suitable for learning filters from large datasets or for applying the learned filters to high-dimensional MBIR problems. First, consider the training stage (e.g., (P0)). The patch-domain approaches, e.g., [1], [6], [7], require about R times more memory to store training signals. For example, 2D patches extracted by √R × √R-sized windows (with “stride” one and periodic boundaries [6], [12], as used in convolution) require about R (e.g., R = 64 [1], [7]) times more memory than storing the original image of size N. For L training images, the memory usage dramatically increases with the factor LRN. This becomes even more problematic in forming hierarchical representations, e.g., CNNs – see Appendix A. Unlike the patch-domain approaches, the memory use of CAOL (P0) depends only on the LN-factor for storing training signals. As a result, the BPEG-M algorithm for CAOL (P1) requires about half the memory of the patch-domain approach [6] (using BPEG-M). See Table II–B. (Both of the corresponding BPEG-M algorithms use identical computations per iteration that scale with LR²N; see Table II–A.)
TABLE II.
Comparisons of computational complexity and memory usages between CAOL and patch-domain approach
| A. Computational complexity per BPEG-M iteration | | |
|---|---|---|
| | Filter update | Sparse code update |
| CAOL (P1) | O(LKRN) + O(R²K) | O(LKRN) |
| Patch-domain [6]† | O(LR²N) + O(R³) | O(LR²N) |

| B. Memory usage for BPEG-M algorithm | | |
|---|---|---|
| | Filter update | Sparse code update |
| CAOL (P1) | O(LN) + O(RK) | O(LKN) |
| Patch-domain [6]† | O(LRN) + O(R²) | O(LRN) |

† The patch-domain approach [6] considers the orthogonality constraint in (P1) with R = K; see Section III-A. The estimates consider all the extracted overlapping patches of size R with stride 1 and periodic boundaries, as used in convolution.
Second, consider solving MBIR problems. Different from the training stage, the memory burden depends on how one applies the learned filters. In [53], the learned filters are applied with conventional convolutional operators – e.g., ⊛ in (P0) – and, thus, there is no additional memory burden. However, in [2], [54], [55], the √R × √R-sized learned kernels are applied with a matrix constructed from many overlapping patches extracted from the updated image at each iteration. In adaptive-filter MBIR problems [2], [6], [8], the memory issue pervades the patch-domain approaches.
VI. Sparse-View CT MBIR using Convolutional Regularizer Learned via CAOL, and BPEG-M
This section introduces a specific example of applying a convolutional regularizer learned via (P0) from a representative dataset to recover images in extreme imaging that collects highly undersampled or noisy measurements. We choose a sparse-view CT application since it poses interesting reconstruction challenges: Poisson noise in the measurements, nonuniform noise or resolution properties in the reconstructed images, and complicated (or no exploitable) structure in the system matrices. For CT, undersampling schemes can significantly reduce the radiation dose and cancer risk from CT scanning. The proposed approach can be applied to other applications (by replacing the data fidelity and spatial strength regularization terms in (P3) below).
We pre-learn TF filters via CAOL (P1) with a set of high-quality (e.g., normal-dose) CT images {x_l : l = 1, …, L}. To reconstruct a linear attenuation coefficient image x from post-log measurements y [54], [56], we apply the learned convolutional regularizer to CT MBIR and solve the following block multi-nonconvex problem [32], [35]:
| (P3) |
Here, A is a CT system matrix, W is a (diagonal) weighting matrix with elements based on a Poisson-Gaussian model for the pre-log measurements with electronic readout noise variance σ² [54]–[56], ψ is a pre-tuned spatial strength regularization vector [57] with non-negative elements⁴ that promotes uniform resolution or noise properties in the reconstructed image [54, Appx.], the indicator function ϕ(a) is equal to 0 if a = 0 and 1 otherwise, z_k is the unknown sparse code for the kth filter, and α′ > 0 is a thresholding parameter.
We solved (P3) via reBPEG-M in Section IV with a two-block scheme [32], and summarize the corresponding BPEG-M updates as
| (34) |
where
| (35) |
the scaled majorization matrix is given by (6), a diagonal majorization matrix is designed by Lemma S.1, and the filter-flipping operation flips a column vector in the vertical direction (e.g., it rotates 2D filters by 180°). Interpreting the update (34) leads to the following two remarks:
Remark 6.1. When the convolutional regularizer learned via CAOL (P1) is applied to MBIR, it works as an autoencoding CNN:
| (36) |
(obtained by generalizing the single thresholding parameter α′ in (P3) to filter-wise thresholding parameters). This is an explicit mathematical motivation for constructing architectures of iterative regression CNNs for MBIR, e.g., BCD-Net [28], [58]–[60] and Momentum-Net [29], [30]. Particularly when the learned filters in (36) satisfy the TF condition, they are useful for compacting the energy of an input signal x and removing unwanted features via the non-linear thresholding in (36).
Remark 6.2. Update (34) improves the solution x^{(i+1)} by weighting between a) the extrapolated point considering the data fidelity, i.e., η^{(i+1)} in (35), and b) the “refined” update via the (ψ-weighting) convolutional autoencoder in (36).
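The sketch below implements the autoencoding mapping that Remark 6.1 attributes to the learned convolutional regularizer: analysis filtering, hard thresholding, and synthesis with the 180°-rotated filters. The per-pixel ψ weighting and the exact parameterization of (36) are omitted here, so this is an assumption-labeled illustration rather than the paper's exact operator.

```python
import numpy as np
from scipy.signal import convolve2d

def conv_autoencoder(x, filters, alpha_prime):
    """Autoencoding CNN step in the spirit of (36):
    sum_k flip(d_k) (*) hard_threshold(d_k (*) x, sqrt(2*alpha_prime))."""
    thr = np.sqrt(2.0 * alpha_prime)
    out = np.zeros_like(x, dtype=float)
    for d in filters:
        c = convolve2d(x, d, mode="same", boundary="wrap")         # analysis (encoder)
        z = c * (np.abs(c) >= thr)                                 # hard thresholding
        out += convolve2d(z, d[::-1, ::-1], mode="same", boundary="wrap")  # synthesis (decoder)
    return out
```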
VII. Results and Discussion
A. Experimental Setup
This section examines the performance (e.g., scalability, convergence, and acceleration) and behavior (e.g., effects of model parameters on filter structures and effects of learned filter dimensions on MBIR performance) of the proposed CAOL algorithms and models, respectively.
1). CAOL:
We tested the introduced CAOL models/algorithms on four datasets: 1) the fruit dataset with L = 10 and N = 100 × 100 [10]; 2) the city dataset with L = 10 and N = 100 × 100 [14]; 3) the CT dataset with L = 80 and N = 128 × 128, created by dividing down-sampled 512 × 512 XCAT phantom slices [61] into 16 sub-images [13], [62] – referred to as the CT-(i) dataset; and 4) the CT dataset with L = 10 and N = 512 × 512 from down-sampled 512 × 512 XCAT phantom slices [61] – referred to as the CT-(ii) dataset. The preprocessing includes intensity rescaling to [0, 1] [10], [13], [14] and/or (global) mean subtraction [63, §2], [1], as conventionally used in many sparse coding studies, e.g., [1], [10], [13], [14], [63]. For the fruit and city datasets, we trained K = 49 filters of size R = 7 × 7. For the CT-(i) dataset, we trained filters of size R = 5 × 5, with K = 25 or K = 20. For the CT reconstruction experiments, we learned the filters from the CT-(ii) dataset; however, we did not apply mean subtraction because it is not modeled in (P3).
The parameters for the BPEG-M algorithms were defined as follows.5 We set the regularization parameters α, β as follows:
CAOL (P1): To investigate the effects of α, we tested (P1) with different α’s in the case R = K. For the fruit and city datasets, we used α = 2.5×{10⁻⁵, 10⁻⁴}; for the CT dataset (for CT reconstruction experiments), see details in [32, Sec. V1].
CAOL (P2): Once α is fixed from the CAOL (P1) experiments above, we tested (P2) with different β’s to see its effects in the case R > K. For the CT-(i) dataset, we fixed α = 10⁻⁴, and used β = {5×10⁶, 5×10⁴}.
We set λ_D = 1 + ϵ as the default. We initialized filters in either a deterministic or a random way. The deterministic filter initialization follows that in [6, Sec. 3.4]. When filters were randomly initialized, we used a scaled one-vector for the first filter. We initialized sparse codes mainly in a deterministic way that applies (31) based on the initial filters. If not specified, we used the random filter and deterministic sparse code initializations. For BPG [31], we used the maximum eigenvalue of Hessians for Lipschitz constants in (16), and applied the gradient-based restarting scheme in Section IV-C. We terminated the iterations if the relative error stopping criterion (e.g., [16, (44)]) was met before reaching the maximum number of iterations. We set the tolerance value to 10⁻¹³ for the CAOL algorithms using Proposition 5.1, and 10⁻⁵ for those using Lemmas 5.2–5.3, and the maximum number of iterations to 2×10⁴.
The CAOL experiments used the convolutional operator learning toolbox [64].
2). Sparse-View CT MBIR with Learned Convolutional Regularizer via CAOL:
We simulated sparse-view sinograms of size 888 × 123 (‘detectors or rays’ × ‘regularly spaced projection views or angles’, where 984 is the number of full views) with GE LightSpeed fan-beam geometry corresponding to a monoenergetic source with 10⁵ incident photons per ray and no background events, and electronic noise variance σ² = 5². We avoided an inverse crime in our imaging simulation and reconstructed images with a coarser grid with ∆x = ∆y = 0.9766 mm; see details in [32, Sec. V-A2].
For EP MBIR, we finely tuned its regularization parameter to achieve both good root mean square error (RMSE) and structural similarity index measurement [65] values. For the CT MBIR model (P3), we chose the model parameters {γ, α′} that showed a good tradeoff between the data fidelity term and the learned convolutional regularizer, and set λA = 1 + ϵ. We evaluated the reconstruction quality by the RMSE (in a modified Hounsfield unit, HU, where air is 0 HU and water is 1000 HU) in a region of interest. See further details in [32, Sec. V-A2] and Fig. 6.
Fig. 6.
Comparisons of reconstructed images from different reconstruction methods for sparse-view CT (123 views (12.5% sampling); for the MBIR model (P3), convolutional regularizers were trained by CAOL (P1) – see [32, Fig. 2]; display window is within [800, 1200] HU) [32]. The MBIR model (P3) using convolutional sparsifying regularizers trained via CAOL (P1) shows higher image reconstruction accuracy compared to the EP reconstruction; see red arrows and magnified areas. For the MBIR model (P3), the autoencoder (see Remark 6.1) using the filter dimension R=K=49 improves reconstruction accuracy of that using R=K=25; compare the results in (d) and (e). In particular, the larger dimensional filters improve the edge sharpness of reconstructed images; see circled areas. The corresponding error maps are shown in Fig. S.5 of the supplementary material.
The imaging simulation and reconstruction experiments used the Michigan image reconstruction toolbox [66].
B. CAOL with BPEG-M
Under the sharp majorization regime (i.e., partial or all blocks have sufficiently tight bounds in Lemma 4.2), the proposed convergence-guaranteed BPEG-M can achieve significantly faster CAOL convergence rates compared with the state-of-the-art BPG algorithm [31] for solving block multi-nonconvex problems, by several generalizations of BPG (see Remark 4.3) and two majorization designs (see Proposition 5.1 and Lemma 5.3). See Fig. 3. In controlling the tradeoff between majorization sharpness and extrapolation effect of BPEG-M (i.e., choosing {λb} in (6)–(7)), maintaining majorization sharpness is more critical than gaining stronger extrapolation effects to accelerate convergence under the sharp majorization regime. See Fig. 4.
While using about half the memory (see Table II), CAOL (P0) learns TF filters corresponding to those given by the patch-domain TF learning in [6, Fig. 2]. See Section V-C and Fig. S.1 with deterministic initialization. Note that BPEG-M-based CAOL (P0) requires even less memory than BPEG-M-based CDL in [16], by using exact sparse coding solutions (e.g., (31) and (34)) without saving their extrapolated points. In particular, when tested with the large CT dataset of {L = 40, N = 512×512}, the BPEG-M-based CAOL algorithm ran fine, while BPEG-M-based CDL [16] and patch-domain AOL [6] were terminated due to exceeding the available memory.⁶ In addition, the CAOL models (P1) and (P2) are easily parallelizable with K threads. Combining these results, BPEG-M-based CAOL is a reasonable choice for learning filters from large training datasets. Finally, [34] shows theoretically how using many samples can improve CAOL, accentuating the benefits of the low memory usage of CAOL.
The effects of the CAOL model parameters are as follows. In CAOL (P1), as the thresholding parameter α increases, the learned filters have more elongated structures; see Figs. 5(a) and S.2. In CAOL (P2), when α is fixed, increasing the filter-diversity-promoting regularization parameter β successfully lowers the coherence between filters (e.g., g_div(D) in (P2)); see Fig. 5(b).
Fig. 5.

Examples of learned filters with different CAOL models and parameters (Proposition 5.1 was used for MD; the CT-(i) dataset with a symmetric boundary condition).
In adaptive MBIR (e.g., [2], [6], [8]), one may apply adaptive image denoising [53], [67]–[71] to optimize thresholding parameters. However, if CAOL (P0) and the application of the learned convolutional regularizer to MBIR (e.g., (P3)) are separated, selecting “optimal” thresholding parameters in (unsupervised) CAOL is challenging – similar to existing dictionary or analysis operator learning methods. Our strategy for selecting the thresholding parameter α in CAOL (P1) (with R = K) is as follows. We first apply first-order finite difference filters (e.g., [1, −1] in 1D) to all training signals and find their sparse representations, and then find α_est that corresponds to retaining the largest 95(±1)% of non-zero elements of the sparsified training signals. This procedure defines the range from which to select a desirable α⋆ and its corresponding filter D⋆. We next ran CAOL (P1) with multiple α values within this range. Selecting {α⋆, D⋆} depends on the application. For CT MBIR, a D⋆ that both contains (short) first-order finite difference filters and captures diverse (particularly diagonal) features of the training signals gave good RMSE values and well-preserved edges; see Fig. S.2(c) and [32, Fig. 2].
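A sketch of the threshold-selection heuristic just described, assuming 2D training images, the two first-order finite-difference filters, and the hard-threshold form √(2α) reconstructed in (31)–(32); the paper's exact procedure may differ in details.

```python
import numpy as np
from scipy.signal import convolve2d

def estimate_alpha(images, keep_fraction=0.95):
    """Pick alpha so that hard thresholding at sqrt(2*alpha) keeps roughly `keep_fraction`
    of the first-order finite-difference responses (our reading of the heuristic)."""
    fd_filters = [np.array([[1.0, -1.0]]), np.array([[1.0], [-1.0]])]   # horizontal / vertical
    responses = [
        np.abs(convolve2d(x, d, mode="same", boundary="wrap")).ravel()
        for x in images for d in fd_filters
    ]
    r = np.concatenate(responses)
    thr = np.quantile(r, 1.0 - keep_fraction)   # threshold keeping ~keep_fraction of entries
    return 0.5 * thr ** 2                        # invert thr = sqrt(2 * alpha)
```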
C. Sparse-View CT MBIR with Learned Convolutional Sparsifying Regularizer (via CAOL) and BPEG-M
In sparse-view CT using only 12.5% of the full projection views, the CT MBIR (P3) using the convolutional regularizer learned via CAOL (P1) outperforms EP MBIR; it reduces RMSE by approximately 5.6–6.1 HU. See the results in Figs. 6(c)–(e). The model (P3) can better recover high-contrast regions (e.g., bones) – see the red arrows and magnified areas in Figs. 6(c)–(e). Nonetheless, the filters with R = K = 5² in the (ψ-weighting) autoencoding CNN, i.e., (36), can blur edges in low-contrast regions (e.g., soft tissues) while removing noise. See Fig. 6(d) – similar blurring issues were observed in [54], [55]. The larger-dimensional kernels (i.e., R = K = 7²) in the convolutional autoencoder can moderate this issue, while further reducing RMSE values; compare the results in Figs. 6(d)–(e). In particular, the larger-dimensional convolutional kernels capture more diverse features – see [32, Fig. 2] – and the diverse features captured in the kernels are useful for further improving the performance of the proposed MBIR model (P3). (The importance of diverse features in kernels was similarly observed in CT experiments with learned autoencoders having a fixed kernel dimension; see Fig. S.2(c).) The RMSE reduction over EP MBIR is comparable to that of CT MBIR (P3) using the R = K = 8²-dimensional filters trained via patch-domain AOL [7]; however, at each BPEG-M iteration, this MBIR model using the (non-TF) filters trained via patch-domain AOL [7] requires more computation than the proposed CT MBIR model (P3) using the convolutional regularizer learned via CAOL (P1). See the related results and discussion in Fig. S.4 and Section S.X, respectively.
On the algorithmic side, the BPEG-M framework can guarantee the convergence of CT MBIR (P3). Under the sharp majorization regime in BPEG-M, maintaining the majorization sharpness is more critical than having stronger extrapolation effects – see [32, Fig. 3], as similarly shown in CAOL experiments (see Section VII-B).
VIII. Conclusion
Developing rapidly converging and memory-efficient CAOL engines is important, since it is a basic element in training CNNs in an unsupervised learning manner (see Appendix A). Studying structures of convolutional kernels is another fundamental issue, since it can avoid learning redundant filters or provide energy compaction properties to filters. The proposed BPEG-M-based CAOL framework has several benefits. First, the orthogonality constraint and diversity promoting regularizer in CAOL are useful in learning filters with diverse structures. Second, the proposed BPEG-M algorithm significantly accelerates CAOL over the state-of-the-art method, BPG [31], with our sufficiently sharp majorizer designs. Third, BPEG-M-based CAOL uses much less memory compared to patch-domain AOL methods [3], [4], [7], and easily allows parallel computing. Finally, the learned convolutional regularizer provides the autoencoding CNN architecture in MBIR, and outperforms EP reconstruction in sparse-view CT.
Similar to existing unsupervised synthesis or analysis operator learning methods, the biggest remaining challenge of CAOL is optimizing its model parameters. This would become more challenging when one applies CAOL to train CNNs (see Appendix A). Our first future work is developing “task-driven” CAOL that is particularly useful to train thresholding values. Other future works include further acceleration of BPEG-M in Algorithm 1, designing sharper majorizers requiring only O(LRN) for the filter update problem of CAOL (P0), and applying the CNN model learned via (A1) to MBIR.
Supplementary Material
Acknowledgment
We thank Xuehang Zheng for providing CT imaging simulation setup, and Dr. Jonghoon Jin for constructive feedback on CNNs.
This work is supported in part by the Keck Foundation and NIH U01 EB018753.
Appendix
A. Training CNNs in an unsupervised manner via CAOL
This section mathematically formulates an unsupervised training cost function for classical CNN (e.g., LeNet-5 [11] and AlexNet [72]) and solves the corresponding optimization problem, via the CAOL and BPEG-M frameworks studied in Sections III–V. We model the three core modules of CNN: 1) convolution, 2) pooling, e.g., average [11] or max [63], and 3) thresholding, e.g., RELU [73], while considering the TF filter condition in Proposition 3.1. Particularly, the orthogonality constraint in CAOL (P1) leads to a sharp majorizer, and BPEG-M is useful to train CNNs with convergence guarantees. Note that it is unclear how to train such diverse (or incoherent) filters described in Section III by the most common CNN optimization method, the stochastic gradient method in which gradients are computed by back-propagation. The major challenges include a) the non-differentiable hard thresholding operator related to ℓ0-norm in (P0), b) the nonconvex filter constraints in (P1) and (P2), c) using the identical filters in both encoder and decoder (e.g., W and WH in Section S.I), and d) vanishing gradients.
For simplicity, we consider a two-layer CNN with a single training image, but one can extend the CNN model (A1) (see below) to “deep” layers with multiple images. The first layer consists of 1c) convolutional, 1t) thresholding, and 1p) pooling layers; the second layer consists of 2c) convolutional and 2t) thresholding layers. Extending CAOL (P1), we model two-layer CNN training as the following optimization problem:
| (A1) |
where x is the training data, {d_k^{[1]}} is a set of filters in the first convolutional layer, {z_k^{[1]}} is a set of features after the first thresholding layer, {d_{k,k′}^{[2]}} is a set of filters for each of the pooled features in the second convolutional layer, {z_{k,k′}^{[2]}} is a set of features after the second thresholding layer, D^{[1]} and D^{[2]} are similarly given as in (1), P denotes an average pooling [11] operator (see its definition below), and ω is the size of the pooling window. The superscripted number in brackets on vectors and matrices denotes the layer index. Here, we model a simple average pooling operator by a block diagonal matrix P whose diagonal blocks are the row vector (1/ω)·[1, …, 1]. We obtain a majorization matrix of P^T P by using Lemma S.1. For the 2D case, the structure of P changes, but an analogous bound holds.
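For the average pooling operator just described, the sketch below builds the 1D block-diagonal matrix P explicitly and checks the simple bound PᵀP ⪯ (1/ω)·I that the averaging structure yields; this is an illustration under our stated assumptions, not the paper's exact construction.

```python
import numpy as np
from scipy.linalg import block_diag

def avg_pool_matrix(n, omega):
    """1D average pooling P: block-diagonal with row vectors (1/omega) * [1, ..., 1].
    Requires omega to divide n; the output has n // omega rows."""
    assert n % omega == 0
    block = np.full((1, omega), 1.0 / omega)
    return block_diag(*([block] * (n // omega)))

n, omega = 12, 3
P = avg_pool_matrix(n, omega)
print(P.shape)                                       # (4, 12)

# Majorizer check: P^T P is majorized by (1/omega) * I.
gap = np.eye(n) / omega - P.T @ P
print(np.min(np.linalg.eigvalsh(gap)) >= -1e-12)     # True
```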
We solve the CNN training model in (A1) via the BPEG-M techniques in Section V, and relate the solutions of (A1) and modules in the two-layer CNN training. The symbols in the following items denote the CNN modules.
- 1c) Filters in the first layer, {d_k^{[1]}}: Updating the filters is straightforward via the techniques in Section V-A2.
- 1t) Features at the first layer, {z_k^{[1]}}: Using BPEG-M with the kth set of TF filters and the pooling majorizer (see above), the proximal mapping for z_k^{[1]} is
| (37) |
where the extrapolated point is given by (4). Combining the first two quadratic terms in (37) into a single quadratic term leads to an optimal update for (37) via the hard thresholding operator with a thresholding parameter a, defined in (32).
- 1p) Pooling, P: Applying the pooling operator P to the first-layer features {z_k^{[1]}} gives the input data to the second layer.
- 2c) Filters in the second layer, {d_{k,k′}^{[2]}}: We update the kth set of filters in a sequential way; updating the kth set of filters is straightforward via the techniques in Section V-A2.
- 2t) Features at the second layer, {z_{k,k′}^{[2]}}: The corresponding update is given by a hard-thresholding solution analogous to (31).
Considering the introduced mathematical formulation of training CNNs [11] via CAOL, BPEG-M-based CAOL has potential to be a basic engine to rapidly train CNNs with big data (i.e., training data consisting of many (high-dimensional) signals).
B. Examples of data fidelity functions in the MBIR model (B1) using learned regularizers
This section introduces some potential applications of the MBIR model (B1) using learned regularizers in image processing, imaging, and computer vision. We first consider a quadratic data fidelity function of the form f(x; y) = (1/2) ‖y − Ax‖_W^2. Examples include
Image deblurring (with W = I for simplicity), where y is a blurred image, A is a blurring operator, and the feasible set is a box constraint;
Image denoising (with A = I), where y is a noisy image corrupted by additive white Gaussian noise (AWGN), W is the inverse covariance matrix corresponding to the AWGN statistics, and the feasible set is a box constraint;
Compressed sensing (with W = I for simplicity) [74], [75], where y is a measurement vector, and A is a compressed sensing operator, e.g., a subgaussian random matrix, a bounded orthonormal system, subsampled isometries, or certain types of random convolutions;
Image inpainting (with W = I for simplicity), where y is an image with missing entries, A is a masking operator, and the feasible set is a box constraint;
Light-field photography from focal stack data, where y_c denotes measurements collected at the cth sensor, A_{c,s} models the camera imaging geometry at the sth angular position for the cth detector, x_s denotes the sth sub-aperture image, ∀c, s, and the feasible set is a box constraint [29], [76].
Examples that use nonlinear data fidelity function include image classification using the logistic function [77], magnetic resonance imaging considering unknown magnetic field variation [78], and positron emission tomography [59].
C. Notation
We use ‖·‖_p to denote the ℓp-norm and write ⟨·, ·⟩ for the standard inner product. The weighted ℓ2-norm with a Hermitian positive definite matrix A is denoted by ‖·‖_A := ‖A^{1/2}(·)‖_2. ‖·‖_0 denotes the ℓ0-quasi-norm, i.e., the number of nonzero elements of a vector. The Frobenius norm of a matrix is denoted by ‖·‖_F. (·)^T, (·)^H, and (·)^* indicate the transpose, complex conjugate transpose (Hermitian transpose), and complex conjugate, respectively. diag(·) denotes the conversion of a vector into a diagonal matrix or of the diagonal elements of a matrix into a vector. ⊕ denotes the matrix direct sum of matrices. [C] denotes the set {1, 2, …, C}. Distinct from the index i, we denote the imaginary unit by i. For (self-adjoint) matrices A and B, the notation A ⪰ B denotes that A − B is a positive semi-definite matrix.
Footnotes
This paper has supplementary downloadable material available at http://ieeexplore.ieee.org, provided by the author. The material includes proofs and additional experimental results that are omitted in this main paper. The prefix “S” indicates the numbers in section, equation, figure, algorithm, and footnote in the supplement.
A block coordinate descent algorithm can be applied to CAOL (P1); however, its convergence guarantee in solving CAOL (P1) is not yet known and might require stronger sufficient conditions than BPEG-M [37].
The quadratically majorized function has a unique minimizer if g_b is convex and the corresponding feasible set is convex.
F is proper if dom F ≠ Ø. F is lower bounded in dom(F) := {x : F(x) < ∞} if inf_{x∈dom(F)} F(x) > −∞. F is lower semicontinuous at a point x0 if lim inf_{x→x0} F(x) ≥ F(x0).
See [32] for details of this computation.
Their double-precision MATLAB implementations were tested on a 3.3 GHz Intel Core i5 CPU with 32 GB RAM.
Contributor Information
Il Yong Chun, Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, MI 48109 USA; now with the Department of Electrical Engineering, University of Hawai’i at Mānoa, Honolulu, HI 96822 USA.
Jeffrey A. Fessler, Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, MI 48109 USA.
References
- [1]. Aharon M, Elad M, and Bruckstein A, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, November 2006.
- [2]. Elad M and Aharon M, “Image denoising via sparse and redundant representations over learned dictionaries,” IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, November 2006.
- [3]. Yaghoobi M, Nam S, Gribonval R, and Davies ME, “Constrained overcomplete analysis operator learning for cosparse signal modelling,” IEEE Trans. Signal Process., vol. 61, no. 9, pp. 2341–2355, March 2013.
- [4]. Hawe S, Kleinsteuber M, and Diepold K, “Analysis operator learning and its application to image reconstruction,” IEEE Trans. Image Process., vol. 22, no. 6, pp. 2138–2150, June 2013.
- [5]. Mairal J, Bach F, and Ponce J, “Sparse modeling for image and vision processing,” Found. & Trends in Comput. Graph. Vis., vol. 8, no. 2–3, pp. 85–283, December 2014.
- [6]. Cai J-F, Ji H, Shen Z, and Ye G-B, “Data-driven tight frame construction and image denoising,” Appl. Comput. Harmon. Anal., vol. 37, no. 1, pp. 89–105, October 2014.
- [7]. Ravishankar S and Bresler Y, “ℓ0 sparsifying transform learning with efficient optimal updates and convergence guarantees,” IEEE Trans. Signal Process., vol. 63, no. 9, pp. 2389–2404, May 2015.
- [8]. Pfister L and Bresler Y, “Learning sparsifying filter banks,” in Proc. SPIE, vol. 9597, August 2015, pp. 959703-1–959703-10.
- [9]. Coates A and Ng AY, “Learning feature representations with K-means,” in Neural Networks: Tricks of the Trade, 2nd ed., LNCS 7700, Montavon G, Orr GB, and Müller K-R, Eds. Berlin: Springer Verlag, 2012, ch. 22, pp. 561–580.
- [10]. Zeiler MD, Krishnan D, Taylor GW, and Fergus R, “Deconvolutional networks,” in Proc. IEEE CVPR, San Francisco, CA, Jun. 2010, pp. 2528–2535.
- [11]. LeCun Y, Bottou L, Bengio Y, and Haffner P, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, November 1998.
- [12]. Papyan V, Romano Y, and Elad M, “Convolutional neural networks analyzed via convolutional sparse coding,” J. Mach. Learn. Res., vol. 18, no. 1, pp. 2887–2938, January 2017.
- [13]. Bristow H, Eriksson A, and Lucey S, “Fast convolutional sparse coding,” in Proc. IEEE CVPR, Portland, OR, Jun. 2013, pp. 391–398.
- [14]. Heide F, Heidrich W, and Wetzstein G, “Fast and flexible convolutional sparse coding,” in Proc. IEEE CVPR, Boston, MA, Jun. 2015, pp. 5135–5143.
- [15]. Wohlberg B, “Efficient algorithms for convolutional sparse representations,” IEEE Trans. Image Process., vol. 25, no. 1, pp. 301–315, January 2016.
- [16]. Chun IY and Fessler JA, “Convolutional dictionary learning: Acceleration and convergence,” IEEE Trans. Image Process., vol. 27, no. 4, pp. 1697–1712, April 2018.
- [17]. Chun IY and Fessler JA, “Convergent convolutional dictionary learning using adaptive contrast enhancement (CDL-ACE): Application of CDL to image denoising,” in Proc. Sampling Theory and Appl. (SampTA), Tallinn, Estonia, Jul. 2017, pp. 460–464.
- [18]. Barchiesi D and Plumbley MD, “Learning incoherent dictionaries for sparse approximation using iterative projections and rotations,” IEEE Trans. Signal Process., vol. 61, no. 8, pp. 2055–2065, February 2013.
- [19]. Bao C, Cai J-F, and Ji H, “Fast sparsity-based orthogonal dictionary learning for image restoration,” in Proc. IEEE ICCV, Sydney, Australia, Dec. 2013, pp. 3384–3391.
- [20]. Ravishankar S and Bresler Y, “Learning overcomplete sparsifying transforms for signal processing,” in Proc. IEEE ICASSP, Vancouver, Canada, May 2013, pp. 3088–3092.
- [21]. Yang Y, Sun J, Li H, and Xu Z, “Deep ADMM-Net for compressive sensing MRI,” in Proc. NIPS 29, Barcelona, Spain, Dec. 2016, pp. 10–18.
- [22]. Zhang K, Zuo W, Gu S, and Zhang L, “Learning deep CNN denoiser prior for image restoration,” in Proc. IEEE CVPR, Honolulu, HI, Jul. 2017, pp. 4681–4690.
- [23]. Chen Y and Pock T, “Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1256–1272, June 2017.
- [24]. Chen H, Zhang Y, Zhang W, Sun H, Liao P, He K, Zhou J, and Wang G, “Learned experts’ assessment-based reconstruction network (LEARN) for sparse-data CT,” arXiv preprint physics.med-ph/1707.09636, 2017.
- [25]. Wu D, Kim K, Fakhri GE, and Li Q, “Iterative low-dose CT reconstruction with priors trained by neural network,” in Proc. 14th Intl. Mtg. on Fully 3D Image Recon. in Rad. and Nuc. Med., Xi’an, China, Jun. 2017, pp. 195–198.
- [26]. Romano Y, Elad M, and Milanfar P, “The little engine that could: Regularization by denoising (RED),” SIAM J. Imaging Sci., vol. 10, no. 4, pp. 1804–1844, October 2017.
- [27]. Buzzard GT, Chan SH, Sreehari S, and Bouman CA, “Plug-and-play unplugged: Optimization free reconstruction using consensus equilibrium,” SIAM J. Imaging Sci., vol. 11, no. 3, pp. 2001–2020, September 2018.
- [28]. Chun IY and Fessler JA, “Deep BCD-net using identical encoding-decoding CNN structures for iterative image recovery,” in Proc. IEEE IVMSP Workshop, Zagori, Greece, Jun. 2018, pp. 1–5.
- [29]. Chun IY, Huang Z, Lim H, and Fessler JA, “Momentum-Net: Fast and convergent iterative neural network for inverse problems,” submitted, July 2019. [Online]. Available: http://arxiv.org/abs/1907.11818
- [30]. Chun IY, Lim H, Huang Z, and Fessler JA, “Fast and convergent iterative signal recovery using trained convolutional neural networks,” in Proc. Allerton Conf. on Commun., Control, and Comput., Allerton, IL, Oct. 2018, pp. 155–159.
- [31]. Xu Y and Yin W, “A globally convergent algorithm for nonconvex optimization based on block coordinate update,” J. Sci. Comput., vol. 72, no. 2, pp. 700–734, August 2017.
- [32]. Chun IY and Fessler JA, “Convolutional analysis operator learning: Application to sparse-view CT,” in Proc. Asilomar Conf. on Signals, Syst., and Comput., Pacific Grove, CA, Oct. 2018, pp. 1631–1635.
- [33]. Arridge S, Maass P, Öktem O, and Schönlieb C-B, “Solving inverse problems using data-driven models,” Acta Numer., vol. 28, pp. 1–174, May 2019.
- [34]. Chun IY, Hong D, Adcock B, and Fessler JA, “Convolutional analysis operator learning: Dependence on training data and compressed sensing recovery guarantees,” IEEE Signal Process. Lett., vol. 26, no. 8, pp. 1137–1141, June 2019. [Online]. Available: http://arxiv.org/abs/1902.08267
- [35]. Crockett C, Hong D, Chun IY, and Fessler JA, “Incorporating handcrafted filters in convolutional analysis operator learning for ill-posed inverse problems,” in Proc. IEEE Workshop CAMSAP (submitted), July 2019.
- [36]. Gribonval R and Schnass K, “Dictionary identification – sparse matrix-factorization via ℓ1-minimization,” IEEE Trans. Inf. Theory, vol. 56, no. 7, pp. 3523–3539, June 2010.
- [37]. Tseng P, “Convergence of a block coordinate descent method for nondifferentiable minimization,” J. Optimiz. Theory App., vol. 109, no. 3, pp. 475–494, June 2001.
- [38]. Xu Y and Yin W, “A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion,” SIAM J. Imaging Sci., vol. 6, no. 3, pp. 1758–1789, September 2013.
- [39]. Lange K, Hunter DR, and Yang I, “Optimization transfer using surrogate objective functions,” J. Comput. Graph. Stat., vol. 9, no. 1, pp. 1–20, March 2000.
- [40]. Jacobson MW and Fessler JA, “An expanded theoretical treatment of iteration-dependent majorize-minimize algorithms,” IEEE Trans. Image Process., vol. 16, no. 10, pp. 2411–2422, October 2007.
- [41]. Fessler JA, Clinthorne NH, and Rogers WL, “On complete-data spaces for PET reconstruction algorithms,” IEEE Trans. Nucl. Sci., vol. 40, no. 4, pp. 1055–1061, August 1993.
- [42]. Kruger AY, “On Fréchet subdifferentials,” J. Math. Sci., vol. 116, no. 3, pp. 3325–3358, July 2003.
- [43]. Rockafellar RT and Wets RJ-B, Variational Analysis. Berlin: Springer Verlag, 2009, vol. 317.
- [44]. Bao C, Ji H, and Shen Z, “Convergence analysis for iterative data-driven tight frame construction scheme,” Appl. Comput. Harmon. Anal., vol. 38, no. 3, pp. 510–523, May 2015.
- [45]. Beck A and Teboulle M, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imaging Sci., vol. 2, no. 1, pp. 183–202, March 2009.
- [46]. Nesterov Y, “Gradient methods for minimizing composite objective function,” CORE Discussion Paper 2007/76, UCL, Louvain-la-Neuve, Belgium, 2007. [Online]. Available: http://www.uclouvain.be/cps/ucl/doc/core/documents/Composit.pdf
- [47]. Tseng P, “On accelerated proximal gradient methods for convex-concave optimization,” Tech. Rep., May 2008. [Online]. Available: http://www.mit.edu/dimitrib/PTseng/papers/apgm.pdf
- [48]. O’Donoghue B and Candès E, “Adaptive restart for accelerated gradient schemes,” Found. Comput. Math., vol. 15, no. 3, pp. 715–732, June 2015.
- [49]. Giselsson P and Boyd S, “Monotonicity and restart in fast gradient methods,” in Proc. IEEE CDC, Los Angeles, CA, Dec. 2014, pp. 5058–5063.
- [50]. Xu Q, Yu H, Mou X, Zhang L, Hsieh J, and Wang G, “Low-dose X-ray CT reconstruction via dictionary learning,” IEEE Trans. Med. Imag., vol. 31, no. 9, pp. 1682–1697, September 2012.
- [51]. Liu J, Garcia-Cardona C, Wohlberg B, and Yin W, “Online convolutional dictionary learning,” arXiv preprint cs.LG:1709.00106, 2017.
- [52]. Mairal J, Bach F, Ponce J, and Sapiro G, “Online dictionary learning for sparse coding,” in Proc. ICML, Montreal, Canada, Jun. 2009, pp. 689–696.
- [53]. Pfister L and Bresler Y, “Automatic parameter tuning for image denoising with learned sparsifying transforms,” in Proc. IEEE ICASSP, Mar. 2017, pp. 6040–6044.
- [54]. Chun IY, Zheng X, Long Y, and Fessler JA, “Sparse-view X-ray CT reconstruction using ℓ1 regularization with learned sparsifying transform,” in Proc. Intl. Mtg. on Fully 3D Image Recon. in Rad. and Nuc. Med., Xi’an, China, Jun. 2017, pp. 115–119.
- [55]. Zheng X, Chun IY, Li Z, Long Y, and Fessler JA, “Sparse-view X-ray CT reconstruction using ℓ1 prior with learned transform,” submitted, February 2019. [Online]. Available: http://arxiv.org/abs/1711.00905
- [56]. Chun IY and Talavage T, “Efficient compressed sensing statistical X-ray/CT reconstruction from fewer measurements,” in Proc. Intl. Mtg. on Fully 3D Image Recon. in Rad. and Nuc. Med., Lake Tahoe, CA, Jun. 2013, pp. 30–33.
- [57]. Fessler JA and Rogers WL, “Spatial resolution properties of penalized-likelihood image reconstruction methods: Space-invariant tomographs,” IEEE Trans. Image Process., vol. 5, no. 9, pp. 1346–1358, September 1996.
- [58]. Chun IY, Zheng X, Long Y, and Fessler JA, “BCD-Net for low-dose CT reconstruction: Acceleration, convergence, and generalization,” in Proc. MICCAI (to appear), Shenzhen, China, Oct. 2019.
- [59]. Lim H, Chun IY, Dewaraja YK, and Fessler JA, “Improved low-count quantitative PET reconstruction with a variational neural network,” submitted, May 2019. [Online]. Available: http://arxiv.org/abs/1906.02327
- [60]. Lim H, Fessler JA, Dewaraja YK, and Chun IY, “Application of trained Deep BCD-Net to iterative low-count PET image reconstruction,” in Proc. IEEE NSS-MIC (to appear), Sydney, Australia, Nov. 2018.
- [61]. Segars WP, Mahesh M, Beck TJ, Frey EC, and Tsui BM, “Realistic CT simulation using the 4D XCAT phantom,” Med. Phys., vol. 35, no. 8, pp. 3800–3808, July 2008.
- [62]. Olshausen BA and Field DJ, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, no. 6583, pp. 607–609, June 1996.
- [63]. Jarrett K, Kavukcuoglu K, LeCun Y et al., “What is the best multi-stage architecture for object recognition?” in Proc. IEEE ICCV, Kyoto, Japan, Sep. 2009, pp. 2146–2153.
- [64]. Chun IY, “CONVOLT: CONVolutional Operator Learning Toolbox (for Matlab),” [GitHub repository] https://github.com/mechatoz/convolt, 2019.
- [65]. Wang Z, Bovik AC, Sheikh HR, and Simoncelli EP, “Image quality assessment: from error visibility to structural similarity,” IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, April 2004.
- [66]. Fessler JA, “Michigan image reconstruction toolbox (MIRT) for Matlab,” available from http://web.eecs.umich.edu/fessler, 2016.
- [67]. Donoho DL, “De-noising by soft-thresholding,” IEEE Trans. Inf. Theory, vol. 41, no. 3, pp. 613–627, May 1995.
- [68]. Donoho DL and Johnstone IM, “Adapting to unknown smoothness via wavelet shrinkage,” J. Amer. Stat. Assoc., vol. 90, no. 432, pp. 1200–1224, 1995.
- [69]. Chang SG, Yu B, and Vetterli M, “Adaptive wavelet thresholding for image denoising and compression,” IEEE Trans. Image Process., vol. 9, no. 9, pp. 1532–1546, September 2000.
- [70]. Blu T and Luisier F, “The SURE-LET approach to image denoising,” IEEE Trans. Image Process., vol. 16, no. 11, pp. 2778–2786, November 2007.
- [71]. Liu H, Xiong R, Zhang J, and Gao W, “Image denoising via adaptive soft-thresholding based on non-local samples,” in Proc. IEEE CVPR, Boston, MA, Jun. 2015, pp. 484–492.
- [72]. Krizhevsky A, Sutskever I, and Hinton GE, “ImageNet classification with deep convolutional neural networks,” in Proc. NIPS 25, Lake Tahoe, NV, Dec. 2012, pp. 1097–1105.
- [73]. Nair V and Hinton GE, “Rectified linear units improve restricted Boltzmann machines,” in Proc. ICML, Haifa, Israel, Jun. 2010, pp. 807–814.
- [74]. Chun IY and Adcock B, “Compressed sensing and parallel acquisition,” IEEE Trans. Inf. Theory, vol. 63, no. 7, pp. 1–23, May 2017. [Online]. Available: http://arxiv.org/abs/1601.06214
- [75]. Chun IY and Adcock B, “Uniform recovery from subgaussian multi-sensor measurements,” Appl. Comput. Harmon. Anal. (to appear), November 2018. [Online]. Available: http://arxiv.org/abs/1610.05758
- [76]. Blocker CJ, Chun IY, and Fessler JA, “Low-rank plus sparse tensor models for light-field reconstruction from focal stack data,” in Proc. IEEE IVMSP Workshop, Zagori, Greece, Jun. 2018, pp. 1–5.
- [77]. Mairal J, Ponce J, Sapiro G, Zisserman A, and Bach FR, “Supervised dictionary learning,” in Proc. NIPS 21, Vancouver, Canada, Dec. 2009, pp. 1033–1040.
- [78]. Fessler JA, “Model-based image reconstruction for MRI,” IEEE Signal Process. Mag., vol. 27, no. 4, pp. 81–89, July 2010.