Convolutional Analysis Operator Learning: Dependence on Training Data

Il Yong Chun; David Hong; Ben Adcock; Jeffrey A Fessler

doi:10.1109/lsp.2019.2921446

. Author manuscript; available in PMC: 2020 Aug 1.

Published in final edited form as: IEEE Signal Process Lett. 2019 Jun 7;26(8):1137–1141. doi: 10.1109/lsp.2019.2921446

Convolutional Analysis Operator Learning: Dependence on Training Data

Il Yong Chun ^1,^†, David Hong ^1,^†, Ben Adcock ², Jeffrey A Fessler ³

PMCID: PMC7170269 NIHMSID: NIHMS1533018 PMID: 32313415

Abstract

Convolutional analysis operator learning (CAOL) enables the unsupervised training of (hierarchical) convolutional sparsifying operators or autoencoders from large datasets. One can use many training images for CAOL, but a precise understanding of the impact of doing so has remained an open question. This paper presents a series of results that lend insight into the impact of dataset size on the filter update in CAOL. The first result is a general deterministic bound on errors in the estimated filters, and is followed by a bound on the expected errors as the number of training samples increases. The second result provides a high probability analogue. The bounds depend on properties of the training data, and we investigate their empirical values with real data. Taken together, these results provide evidence for the potential benefit of using more training data in CAOL.

I. Introduction

LEARNING convolutional operators from large datasets is a growing trend in signal/image processing, computer vision, machine learning, and artificial intelligence. The convolutional approach resolves the large memory demands of patch-based operator learning and enables unsupervised operator learning from “big data,” i.e., many high-dimensional signals. See [1], [2] and references therein. Examples include convolutional dictionary learning [2], [3] and convolutional analysis operator learning (CAOL) [1], [4]. CAOL trains an autoencoding CNN in an unsupervised manner, and is useful for training multi-layer CNNs from many training images [1]. In particular, the block proximal gradient method using a majorizer [1], [2] leads to rapidly converging and memory-efficient CAOL [1]. However, a theoretical understanding of the impact of using many training images in CAOL has remained an open question.

This paper presents new insights on this topic. Our first main result provides a deterministic bound on filter estimation error, and is followed by a bound on the expected error when “model mismatch” has zero mean. (See Theorem 1 and Corollary 2, respectively.) The expected error bound depends on the training data, and we provide empirical evidence of its decrease with an increase in training samples. Our second main result provides a high probability bound that explicitly decreases with increasingly many i.i.d. training samples. The bound improves when model mismatch and samples are uncorrelated. (See Theorem 3.) Additional empirical findings provide evidence that the correlation can indeed be small in practice. Put together, our findings provide new insight into how using many samples can improve CAOL, underscoring the benefits of the low memory usage of CAOL.

II. Backgrounds and Preliminaries

A. CAOL with orthogonality constraints

CAOL seeks a set of filters that “best” sparsify a set of training images ${x_{l} \in ℂ^{N} : l = 1, \dots, L}$ by solving the optimization problem [1, §II-A] (see Appendix for notation):

\underset{D = [d_{1}, \dots, d_{K}]}{argmin} \min_{_{{z_{l, k}}}} F (D, {z_{l, k}}), subj . to D D^{H} = \frac{1}{R} \cdot I,

(P0)

F (D, {z_{l, k}}) : = \sum_{l = 1}^{L} \sum_{k = 1}^{K} {‖ d_{k} ⊛ x_{l} - z_{l, k} ‖}_{2}^{2} + α {‖ z_{l, k} ‖}_{0},

where ⊛ denotes convolution, ${d_{k} \in ℂ^{R} : k = 1, \dots, K}$ is a set of K ≥ R convolutional kernels, ${z_{l, k} \in ℂ^{N} : l = 1, \dots, L, k = 1, \dots, K}$ is a set of sparse codes, α > 0 is a regularization parameter controlling the sparsity of features {z_l,k}, and ||·||₀ denotes the ℓ⁰-quasi-norm. We group the K filters into a matrix:

D : = [d_{1} \dots d_{K}] \in ℂ^{R \times K} .

(1)

The orthogonality condition $D D^{H} = \frac{1}{R} I$ in (P0) enforces 1) a tight-frame condition on the filters, i.e., $\sum_{k = 1}^{K} {‖ d_{k} ⊛ x ‖}_{2}^{2} = ‖ x ‖_{2}^{2}$ , ∀x [1, Prop. 2.1]; and 2) filter diversity when R = K, since $D D^{H} = \frac{1}{R} I$ implies $D^{H} D = \frac{1}{K} I$ and each pair of filters is incoherent, i.e., |〈d_k, d_k′〉|² = 0, ∀k ≠ k′. One often solves (P0) iteratively, by alternating between optimizing D (filter update) and optimizing {z_l,k : ∀l, k} (sparse code update) [1], i.e., at the i iteration, the current iterates are updated as ${z_{l, k}^{(i + 1)}} = {argmin}_{{z_{l, k}}} F (D^{(i)}, {z_{l, k}})$ and $D^{(i + 1)} = {argmin}_{D D^{H} = \frac{1}{R} \cdot I} F (D, {z_{l, k}^{(i + 1)}})$ .

B. Filter update in a matrix form

The key to our analysis lies in rewriting the filter update for (P0) in matrix form, to which we apply matrix perturbation and concentration inequalities. Observe first that

d_{k} ⊛ x_{l} = \underset{= : Ψ_{l} \in ℂ^{N \times R}}{\underset{︸}{[Π^{0} x_{l}, \dots, Π^{R - 1} x_{l}]}} d_{k} = Ψ_{l} d_{k}, l = 1, \dots, L,

(2)

where $Π : = [\begin{matrix} 0 & I_{N - 1} \\ 1 & 0 \end{matrix}] \in ℂ^{N \times N}$ is the circular shift operator and (·)ⁿ denotes the matrix product of its n copies. We consider a circular boundary condition to simplify the presentation of {Ψ_l} in (2), but our entire analysis holds for a general boundary condition with only minor modifications of {Ψ_l} as done in [1, §IV-A]. Using (2), the filter update of (P0) is rewritten as

D^{⋆} = \underset{D}{argmin} \sum_{l = 1}^{L} {‖ Ψ_{l} D - Z_{l} ‖}_{F}^{2}, subj . to D D^{H} = \frac{1}{R} \cdot I,

(P1)

where $Z_{l} : = [z_{l, 1}, \dots, z_{l, K}] \in ℂ^{N \times K}$ contains all the current sparse code estimates for the lth sample, and we drop iteration superscript indices (·)⁽ⁱ⁾ throughout. The next section uses this form to characterize the filter update solution D^⋆.

III. Main Results:

Dependence of CAOL on Training Data

The main results in this section illustrate how training with many samples can reduce errors in the filter D^⋆ from (P1) and characterize the reduction in terms of properties of the training data. Throughout we model the current sparse codes estimates as

Z_{l} = \underset{= : Z_{true, l}}{\underset{︸}{Ψ_{l} D_{true}}} + E_{l}, l = 1, \dots, L

(3)

where D_true is formed from optimal (orthogonal) filters analogously to (1), and $E_{l} \in ℂ^{N \times K}$ captures model mismatch in the current sparse codes, e.g., due to the current iterate being far from convergence or being trapped in local minima.

The following theorem provides a deterministic characterization.

Theorem 1.

Suppose that both matrices

\sum_{l = 1}^{L} Ψ_{l}^{H} Z_{l} \in ℂ^{R \times K} a n d \sum_{l = 1}^{L} Ψ_{l}^{H} Z_{true, l} \in ℂ^{R \times K}

(4)

are full row rank, where {Ψ_l, Z_l, Z_true,l : l = 1, …, L} are defined in (2)–(3). Then, the solution D^⋆ to (P1) has error with respect to D_true bounded as

‖ D^{⋆} - D_{true} ‖_{F}^{2} \leq 5 \frac{‖ \sum_{l = 1}^{L} Ψ_{l}^{H} E_{l} ‖_{F}^{2}}{λ_{min}^{2} (\sum_{l = 1}^{L} Ψ_{l}^{H} Ψ_{l})},

(5)

where λ_min(·) denotes the smallest eigenvalue of its argument.

The full row rank condition on (4) ensures that the estimated filters D^⋆ and the true filters D_true are unique, and it further guarantees that the denominator of (5) is strictly positive. When the model mismatches E₁, …, E_L are independent and mean zero, we obtain the following expected error bound:

Corollary 2.

Under the construction of Theorem 1, suppose that E_l is a zero-mean random matrix for l = 1, …, L, and is independent over l. Then,

E {‖ D^{⋆} - D_{true} ‖}_{F}^{2} \leq 5 {\bar{σ}}^{2} ρ^{2},

(6)

where $E (\cdot)$ denotes the expectation,

{\bar{σ}}^{2} : = max_{_{l = 1, \dots, L}} λ_{max} (E {E_{l} E_{l}^{H}}), ρ^{2} : = \frac{tr (\sum_{l = 1}^{L} Ψ_{l}^{H} Ψ_{l})}{λ_{min}^{2} (\sum_{l = 1}^{L} Ψ_{l}^{H} Ψ_{l})},

(7)

λ_max(·) denotes the largest eigenvalue of its argument, and the expectation is taken over the model mismatch.

Given fixed K and R, it is natural to expect that ${\bar{σ}}^{2}$ is bounded by some constant independent of L, and so the expected error bound in (6) largely depends on ρ² in (7). When training samples are i.i.d., one may further expect $(1 / L) \sum_{l = 1}^{L} Ψ_{l}^{H} Ψ_{l}$ to concentrate around its expectation, roughly resulting in ρ² ∝ 1/L, with a proportionality constant that depends on R and the statistics of the training data. Fig. 1 illustrates ρ² for various image datasets, providing empirical evidence of this decrease in real data.

Fig. 1. — Empirical values of ρ² in (7) show a decrease with L for different datasets and filter dimensions. (The fruit and city datasets with L =10 and N = 10⁴ were preprocessed with contrast enhancement and mean subtraction; see details of datasets and experiments in [1], [2] and references therein. For L < 10, the results are averaged over 50 datasets randomly selected from the full datasets.) Under the assumptions of Corollary 2, the decrease in this quantity leads to a better expected error bound in (5). Without preprocessing, the quantity ρ² increases by a factor of around 10³.

Our second theorem provides a probabilistic error bound via concentration inequalities, given i.i.d. training sample and model mismatch pairs (x₁, E₁), …, (x_L, E_L).¹ It removes the zero-mean assumption for the model mismatches {E_l : ∀l} in Corollary 2 that might be strong, e.g., if training data are not preprocessed to have zero mean.

Theorem 3.

Suppose that training sample and model mismatch pairs $(x_{1}, E_{1}), \dots, (x_{L}, E_{L}) \overset{i i d}{~} (x, E)$ , where x and E are almost surely bounded, i.e.,

‖ x ‖_{2} \leq γ a n d ‖ E ‖_{F} \leq σ,

(8)

and the matrices in (4) are almost surely full row rank. Then, for any $0 < δ < λ_{min} (\bar{Λ}) / (2 R γ^{2})$ , the solution D^⋆ to (P1) has error with respect to D_true bounded as

{‖ D^{⋆} - D_{true} ‖}_{F}^{2} \leq 5 {\frac{σ \sqrt{tr (\bar{Λ}) / L} + {‖ E (Ψ^{H} E) ‖}_{F} + 2 σ γ \sqrt{R} δ}{λ_{min} (\bar{Λ}) - 2 γ^{2} R δ}}^{2},

(9)

with probability at least

1 - 3 R exp (- L \frac{δ^{2} / 2}{3 + δ / 3}),

(10)

where $\bar{Λ} : = E (Ψ^{H} Ψ)$ and Ψ is constructed from x as in (2).

Taking δ sufficiently small, the high probability error bound (9) is primarily driven by

\bar{ρ} : = \frac{\sqrt{tr (\bar{Λ}) / L}}{λ_{min} (\bar{Λ})} and \bar{χ} : = \frac{{‖ E (Ψ^{H} E) ‖}_{F}}{λ_{min} (\bar{Λ})},

(11)

where $\bar{ρ}$ is analogous to ρ in (7), and $\bar{χ}$ captures how correlated the model mismatch is to the training samples. As the number L of training samples increases, $\bar{ρ}$ decreases as $1 / \sqrt{L}$ . On the other hand, $\bar{χ}$ is constant with respect to L and provides a floor for the bound. Fig. 2 illustrates $\bar{χ}$ for CAOL iterates from different image datasets, and provides empirical evidence that this term can indeed be small in real data. If the model mismatch is sufficiently uncorrelated with the training samples, i.e., $\bar{χ}$ is practically zero, then only the $\bar{ρ}$ term remains and this term decreases with L. Namely, if model mismatch is entirely uncorrelated with the training samples, then using many samples decreases the error bound to (effectively) zero.

Fig. 2. — Empirical estimate of $\bar{χ}$ in (11) across iterations in the alternating optimization algorithm [1] that solves CAOL (P0) with α = 10⁻³. (The fruit and city datasets with L = 10 and N = 10⁴ were preprocessed with contrast enhancement and mean subtraction; see details of datasets and experiments in [1], [2] and references therein. The model mismatches ${E_{l}^{(i)} : \forall l}$ at the ith iteration were calculated every 50 iterations based on (3), where we use the converged filters for D_true.) Observe that ${\bar{χ}}^{(i)}$ generally decreases over iterations; when $\bar{χ}$ is small, the high probability error bound (9) in Theorem 3 depends primarily on $\bar{ρ}$ defined in (11).

IV. Related Works

Sample complexity [7] and synthesis (or reconstruction) error [8] have been studied in the context of synthesis operator learning (e.g., dictionary learning [9]); see the cited papers and references therein. A similar understanding for (C)AOL has however remained largely open; existing works focus primarily on establishing (C)AOL models and their algorithmic challenges [1], [10]–[13]. The authors in [14] studied sample complexity for a patch-based AOL method, but the form of their model differs from that of ours (P0). Specifically, they consider the following AOL problem: ${min}_{D} \sum_{l} f (D^{T} {\hat{x}}_{l}) + g (D)$ , where f (·) is a sparsity promoting function (e.g., a smooth approximation of the ℓ⁰-quasi-norm [14]), g(·) is a regularizer or constraint for the filter matrix D, and ${{\hat{x}}_{l} : l = 1, \dots, L}$ is a set of training patches (not images).

V. Proof of Theorem 1

Rewriting (P1) yields that D^⋆ is a solution of the (scaled) orthogonal Procrustes problem [1, §S.VII]:

\underset{D}{argmin} {‖ \tilde{Ψ} D - \tilde{Z} ‖}_{F}^{2}, subj . to D D^{H} = \frac{1}{R} \cdot I,

(12)

where $\tilde{Ψ} \in ℂ^{L N \times R}$ arises by stacking Ψ₁, …, Ψ_L vertically and $\tilde{Z} \in ℂ^{L N \times K}$ arises likewise from Z₁, …, Z_L. Similarly, since Ψ₁ D_true = Z_true,l as in (3), D_true is a solution of the analogous (scaled) orthogonal Procrustes problem

\underset{D}{argmin} {‖ \tilde{Ψ} D - {\tilde{Z}}_{true} ‖}_{F}^{2}, subj . to D D^{H} = \frac{1}{R} \cdot I,

(13)

where ${\tilde{Z}}_{true} \in ℂ^{L N \times K}$ arises by stacking Z_true,1, …, Z_true,L vertically.

By assumption, both ${\tilde{Ψ}}^{H} \tilde{Z}$ and ${\tilde{Ψ}}^{H} {\tilde{Z}}_{true}$ are full row rank and so (12) and (13) have unique solutions given by the unique (scaled) polar factors

D^{⋆} = \frac{1}{\sqrt{R}} Q {({\tilde{Z}}^{H} \tilde{Ψ})}^{H} D_{true} = \frac{1}{\sqrt{R}} Q {({\tilde{Z}}_{true}^{H} \tilde{Ψ})}^{H}

(14)

where Q(·) denotes the polar factor of its argument, and can be computed as Q(A) = WV^H from the (thin) singular value decomposition A = WΣV^H.

Thus we have

{‖ D^{⋆} - D_{true} ‖}_{F}^{2} = \frac{1}{R} {‖ Q ({\tilde{Z}}_{true}^{H} \tilde{Ψ}) - Q ({\tilde{Z}}^{H} \tilde{Ψ}) ‖}_{F}^{2} \leq \frac{1}{R} {‖ {\tilde{E}}^{H} \tilde{Ψ} ‖}_{F}^{2} {{[\frac{2}{σ_{R} ({\tilde{Z}}_{true}^{H} \tilde{Ψ}) + σ_{R} ({\tilde{Z}}^{H} \tilde{Ψ})}]}^{2} + {[\frac{1}{max {σ_{R} ({\tilde{Z}}_{true}^{H} \tilde{Ψ}), σ_{R} ({\tilde{Z}}^{H} \tilde{Ψ})}}]}^{2}} \leq \frac{1}{R} {‖ {\tilde{E}}^{H} \tilde{Ψ} ‖}_{F}^{2} {{[\frac{2}{σ_{R} ({\tilde{Z}}_{true}^{H} \tilde{Ψ})}]}^{2} + {[\frac{1}{σ_{R} ({\tilde{Z}}_{true}^{H} \tilde{Ψ})}]}^{2}} = \frac{5}{R} \frac{{‖ {\tilde{Ψ}}^{H} \tilde{E} ‖}_{F}^{2}}{σ_{R}^{2} ({\tilde{Ψ}}^{H} {\tilde{Z}}_{true})} = \frac{5}{R} \frac{{‖ \sum_{l = 1}^{L} Ψ_{l}^{H} E_{l} ‖}_{F}^{2}}{σ_{R}^{2} (\sum_{l = 1}^{L} Ψ_{l}^{H} Z_{true, l})}

(15)

where $\tilde{E} = \tilde{Z} - {\tilde{Z}}_{true}$ is exactly E₁, …, E_L stacked vertically, and σ_r (·) denotes the rth largest singular value of its argument. The first inequality holds by the perturbation bound in [15, Thm. 3], and the second holds since $σ_{R} ({\tilde{Z}}^{H} \tilde{Ψ}) \geq 0$ . Recalling that Z_true,l = Ψ_lD_true, we rewrite the denominator of (15) as

σ_{R}^{2} (\sum_{l = 1}^{L} Ψ_{l}^{H} Z_{true, l}) = σ_{R}^{2} (\sum_{l = 1}^{L} Ψ_{l}^{H} Ψ_{l} D_{true}) = \frac{1}{R} σ_{R}^{2} (\sum_{l = 1}^{L} Ψ_{l}^{H} Ψ_{l}) = \frac{1}{R} λ_{min}^{2} (\sum_{l = 1}^{L} Ψ_{l}^{H} Ψ_{l}),

(16)

where the second equality holds because $D_{trne} D_{true}^{H} = (1 / R) I$ . Substituting (16) into (15) yields (5).

VI. Proof of Corollary 2

Taking the expectation of (5) over the model mismatch amounts to taking the expectation of the numerator of the upper bound in (5):

E {‖ \sum_{l = 1}^{L} Ψ_{l}^{H} E_{l} ‖}_{F}^{2} = \sum_{l = 1}^{L} E {‖ Ψ_{l}^{H} E_{l} ‖}_{F}^{2} = \sum_{l = 1}^{L} tr (Ψ_{l}^{H} E {E_{l} E_{l}^{H}} Ψ_{l}) \leq \sum_{l = 1}^{L} λ_{max} (E {E_{l} E_{l}^{H}}) \cdot {‖ Ψ_{l} ‖}_{F}^{2} \leq {\bar{σ}}^{2} \cdot \sum_{l = 1}^{L} {‖ Ψ_{l} ‖}_{F}^{2},

(17)

where the first equality holds by using the assumption that E_l is zero-mean and independent over l, the second equality follows by expanding the Frobenius norm then applying linearity of the trace and expectation, the first inequality holds since $v^{H} M v \leq λ_{max} (M) \cdot ‖ v ‖_{2}^{2}$ for any vector v and Hermitian matrix M, and the last inequality follows from the definition of ${\bar{σ}}^{2}$ . Rewriting (17) using the identity $\sum_{l = 1}^{L} {‖ Ψ_{l} ‖}_{F}^{2} = \sum_{l = 1}^{L} tr (Ψ_{l}^{H} Ψ_{l}) = tr (\sum_{l = 1}^{L} Ψ_{l}^{H} Ψ_{l})$ yields the result (6).

VII. Proof of Theorem 3

We derive two high probability bounds, one each for the numerator and denominator of (5). Then, the bound (9) with probability (10) follows by combining the two via a union bound. Before we begin, note that (8) implies that $‖ Ψ ‖_{2} \leq ‖ Ψ ‖_{F} \leq γ \sqrt{R}$ almost surely; our proofs use this inequality multiple times.

A. Upper bound for numerator

Observe first that

{‖ \sum_{l = 1}^{L} Ψ_{l}^{H} E_{l} ‖}_{F} = {‖ L E (Ψ_{l}^{H} E_{l}) + \sum_{l = 1}^{L} {Ψ_{l}^{H} E_{l} - E (Ψ_{l}^{H} E_{l})} ‖}_{F} \leq L {‖ E (Ψ_{l}^{H} E_{l}) ‖}_{F} + {‖ \sum_{l = 1}^{L} ξ_{l} ‖}_{2},

(18)

where $ξ_{l} : = vec {Ψ_{l}^{H} E_{l} - E (Ψ_{l}^{H} E_{l})} \in ℂ^{R K}$ for l = 1, …, L. We next bound ${‖ \sum_{l = 1}^{L} ξ_{l} ‖}_{2}$ via the vector Bernstein inequality [16, Cor. 8.44]. Note that ξ₁, …, ξ_L are i.i.d. with $E ξ_{l} = 0$ (by construction). Furthermore, ξ_l is almost surely bounded as

{‖ ξ_{l} ‖}_{2} = {‖ Ψ_{l}^{H} E_{l} - E (Ψ_{l}^{H} E_{l}) ‖}_{F} \leq {‖ Ψ_{l}^{H} E_{l} ‖}_{F} + {‖ E (Ψ_{l}^{H} E_{l}) ‖}_{F} (Triangle ineq .) \leq {‖ Ψ_{l}^{H} E_{l} ‖}_{F} + {‖ E (Ψ_{l}^{H} E_{l}) ‖}_{F} (Jensen ’ s ineq .) \leq {‖ Ψ_{l} ‖}_{F} {‖ E_{l} ‖}_{F} + E {‖ Ψ_{l} ‖}_{F} {‖ E_{l} ‖}_{F} \leq 2 σ γ \sqrt{R} .

Thus the vector Bernstein inequality [16, Cor. 8.44] yields that for any t > 0,

{‖ \sum_{l = 1}^{L} ξ_{l} ‖}_{2} \leq σ \sqrt{L} \sqrt{tr (\bar{Λ})} + t,

(19)

with probability at least

1 - exp {\frac{- t^{2} / 2}{3 L {(2 σ γ \sqrt{R})}^{2} + t (2 σ γ \sqrt{R}) / 3}} .

(20)

We obtained (19) by the following simplification:

E {‖ \sum_{l = 1}^{L} ξ_{l} ‖}_{2} \leq \sqrt{E {‖ \sum_{l = 1}^{L} ξ_{l} ‖}_{2}^{2}} = \sqrt{L E {‖ ξ_{l} ‖}_{2}^{2}} = \sqrt{L E {‖ Ψ_{l}^{H} E_{l} - E (Ψ_{l}^{H} E_{l}) ‖}_{F}^{2}} = \sqrt{L {E {‖ Ψ_{l}^{H} E_{l} ‖}_{F}^{2} - {‖ E (Ψ_{l}^{H} E_{l}) ‖}_{F}^{2}}} \leq \sqrt{L E {‖ Ψ_{l}^{H} E_{l} ‖}_{F}^{2}} \leq \sqrt{L E ({‖ Ψ_{l} ‖}_{F}^{2} {‖ E_{l} ‖}_{F}^{2})} \leq \sqrt{L σ^{2} E {‖ Ψ_{l} ‖}_{F}^{2}} = σ \sqrt{L} \sqrt{tr (\bar{Λ})},

where the third equality holds by $E ‖ A - E A ‖_{F}^{2} = \sum_{i, j} E {(A_{i, j} - E A_{i, j})}^{2} = \sum_{i, j} E A_{i, j}^{2} - {(E A_{i, j})}^{2} = E ‖ A ‖_{F}^{2} - ‖ E A ‖_{F}^{2}$ . We obtained (20) by the following simplifications:

sup_{‖ x ‖_{2} \leq 1} E {| x^{H} ξ_{l} |}^{2} \leq E {‖ ξ_{l} ‖}_{2}^{2} \leq {(2 σ γ \sqrt{R})}^{2},

E {‖ \sum_{l = 1}^{L} ξ_{l} ‖}_{2} \leq E \sum_{l = 1}^{L} {‖ ξ_{l} ‖}_{2} \leq L E {‖ ξ_{l} ‖}_{2} \leq L (2 σ γ \sqrt{R}) .

Applying (19) and (20) with $t = 2 σ γ \sqrt{R} L δ$ to the square of (18) yields

{‖ \sum_{l = 1}^{L} Ψ_{l}^{H} E_{l} ‖}_{F}^{2} \leq L^{2} {σ \sqrt{tr (\bar{Λ}) / L} + {‖ E (Ψ_{l}^{H} E_{l}) ‖}_{F} + 2 σ γ \sqrt{R} δ}^{2},

(21)

with probability at least $1 - exp (- L \frac{δ^{2} / 2}{3 + δ / 3})$ .

B. Lower bound for denominator

Observe that $\sum_{l = 1}^{L} Ψ_{l}^{H} Ψ_{l} = L \bar{Λ} + \sum_{l = 1}^{L} Λ_{l}$ , where $Λ_{l} : = Ψ_{l}^{H} Ψ_{l} - \bar{Λ}$ , so Weyl’s inequality [17] yields

λ_{min} (\sum_{l = 1}^{L} Ψ_{l}^{H} Ψ_{l}) \geq λ_{min} (L \bar{Λ}) - {‖ \sum_{l = 1}^{L} Λ_{l} ‖}_{2},

(22)

and it remains to bound ${‖ \sum_{l = 1}^{L} Λ_{l} ‖}_{2}$ . We do so by using the Matrix Bernstein inequality [16, Cor. 8.15].

Note that Λ₁, …, Λ_L are i.i.d. (since x₁, …, x_L are i.i.d.) and $E Λ_{l} = 0$ . Furthermore, Λ_l is almost surely bounded as

{‖ Λ_{l} ‖}_{2} = {‖ Ψ_{l}^{H} Ψ_{l} - E (Ψ_{l}^{H} Ψ_{l}) ‖}_{2} \leq {‖ Ψ_{l}^{H} Ψ_{l} ‖}_{2} + {‖ E (Ψ_{l}^{H} Ψ_{l}) ‖}_{2} (Triangle ineq .) \leq {‖ Ψ_{l}^{H} Ψ_{l} ‖}_{2} + E {‖ Ψ_{l}^{H} Ψ_{l} ‖}_{2} (Jensen ’ s ineq .) = {‖ Ψ_{l} ‖}_{2}^{2} + E {‖ Ψ_{l} ‖}_{2}^{2} \leq 2 γ^{2} R .

Thus, the Matrix Bernstein inequality [16, Cor. 8.15] yields that for any t > 0,

ℙ {{‖ \sum_{l = 1}^{L} Λ_{l} ‖}_{2} \geq t} \leq 2 R exp {\frac{- t^{2} / 2}{L {(2 γ^{2} R)}^{2} + 2 γ^{2} R t / 3}},

(23)

where we use the following simplification:

{‖ \sum_{l = 1}^{L} E Λ_{l}^{2} ‖}_{2} = L {‖ E Λ_{l}^{2} ‖}_{2} \leq L E {‖ Λ_{l} ‖}_{2}^{2} \leq L {(2 γ^{2} R)}^{2} .

Applying (23) with t = 2γ²RLδ to the square of (22) yields

λ_{min}^{2} (\sum_{l = 1}^{L} Ψ_{l}^{H} Ψ_{l}) \geq L^{2} {λ_{min} (\bar{Λ}) - 2 γ^{2} R δ}^{2} .

(24)

with probability at least $1 - 2 R exp (- L \frac{δ^{2} / 2}{1 + δ / 3})$ .

C. Combined bound

Combining the bounds (21) and (24) via a union bound yields (9) with probability at least

1 - exp (- L \frac{δ^{2} / 2}{3 + δ / 3}) - 2 R exp (- L \frac{δ^{2} / 2}{1 + δ / 3}),

(25)

which is greater than or equal to (10).

Acknowledgments

This work is supported in part by the Keck Foundation and NIH grant U01 EB018753. BA is supported by NSERC grant 611675.

Footnotes

We follow the natural convention in sample size analyses of assuming that {x_l : ∀l} are i.i.d. samples from an underlying training distribution; see the references cited in Section IV and [5], [6] for other examples. Model mismatches {E_l : ∀l} also become i.i.d. across samples at all iterations of CAOL, if “fresh” training samples are used for each update, e.g., as can be done when solving (P1) via mini-batch stochastic optimization.

Contributor Information

Ben Adcock, Department of Mathematics, Simon Fraser University, Burnaby, BC V5A 1S6 Canada.

Jeffrey A. Fessler, Department of Electrical Engineering and Computer Science, The University of Michigan, Ann Arbor, MI 48019 USA.

REFERENCES

[1].Chun IY and Fessler JA, “Convolutional analysis operator learning: Acceleration and convergence,” submitted, January 2018. [Online]. Available: http://arxiv.org/abs/1802.05584 [DOI] [PMC free article] [PubMed]
[2].Chun IY and Fessler JA, “Convolutional dictionary learning: Acceleration and convergence,” IEEE Trans. Image Process, vol. 27, no. 4, pp. 1697–1712, April 2018. [DOI] [PubMed] [Google Scholar]
[3].Chun IY and Fessler JA, “Convergent convolutional dictionary learning using adaptive contrast enhancement (CDL-ACE): Application of CDL to image denoising,” in Proc. Sampling Theory and Appl. (SampTA), Tallinn, Estonia, July 2017, pp. 460–464. [Google Scholar]
[4].Chun IY and Fessler JA, “Convolutional analysis operator learning: Application to sparse-view CT,” in Proc. Asilomar Conf. on Signals, Syst., and Comput, Pacific Grove, CA, October 2018, pp. 1631–1635. [Google Scholar]
[5].Hastie T, Tibshirani R, and Friedman J, The elements of statistical learning: Data mining, inference, and prediction, ser. Springer series in statistics. New York, NY: Springer, 2009. [Google Scholar]
[6].Mohri M, Rostamizadeh A, and Talwalkar A, Foundations of machine learning. Cambridge, MA: MIT Press, 2018. [Google Scholar]
[7].Shakeri Z, Sarwate AD, and Bajwa WU, “Sample complexity bounds for dictionary learning from vector- and tensor-valued data,” in Information Theoretic Methods in Data Science, Rodrigues M and Eldar Y, Eds. Cambridge, UK: Cambridge University Press, 2019, ch. 5. [Google Scholar]
[8].Singh S, Poczos B, and Ma J, “Minimax reconstruction risk of convolutional sparse dictionary learning,” in Proc. Int. Conf. on Artif. Int. and Stat, ser. Proc. Mach. Learn. Res., vol. 84, Playa Blanca, Lanzarote, Canary Islands, April 2018, pp. 1327–1336. [Google Scholar]
[9].Aharon M, Elad M, and Bruckstein A, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process, vol. 54, no. 11, pp. 4311–4322, November 2006. [Google Scholar]
[10].Yaghoobi M, Nam S, Gribonval R, and Davies ME, “Constrained overcomplete analysis operator learning for cosparse signal modelling,” IEEE Trans. Signal Process, vol. 61, no. 9, pp. 2341–2355, March 2013. [Google Scholar]
[11].Hawe S, Kleinsteuber M, and Diepold K, “Analysis operator learning and its application to image reconstruction,” IEEE Trans. Image Process, vol. 22, no. 6, pp. 2138–2150, June 2013. [DOI] [PubMed] [Google Scholar]
[12].Cai J-F, Ji H, Shen Z, and Ye G-B, “Data-driven tight frame construction and image denoising,” Appl. Comput. Harmon. Anal, vol. 37, no. 1, pp. 89–105, October 2014. [Google Scholar]
[13].Ravishankar S and Bresler Y, “ℓ₀ sparsifying transform learning with efficient optimal updates and convergence guarantees,” IEEE Trans. Sig. Process, vol. 63, no. 9, pp. 2389–2404, May 2015. [Google Scholar]
[14].Seibert M, Wörmann J, Gribonval R, and Kleinsteuber M, “Learning co-sparse analysis operators with separable structures,” IEEE Trans. Signal Process, vol. 64, no. 1, pp. 120–130, January 2016. [Google Scholar]
[15].Li R-C, “New perturbation bounds for the unitary polar factor,” SIAM J. Matrix Anal. Appl, vol. 16, no. 1, pp. 327–332, January 1995. [Google Scholar]
[16].Foucart S and Rauhut H, A mathematical introduction to compressive sensing. New York, NY: Springer, 2013. [Google Scholar]
[17].Weyl H, “Das asymptotische verteilungsgesetz der eigenwerte linearer partieller differentialgleichungen (mit einer anwendung auf die theorie der hohlraumstrahlung),” Mathematische Annalen, vol. 71, no. 4, pp. 441–479, December 1912. [Google Scholar]

[R1] [1].Chun IY and Fessler JA, “Convolutional analysis operator learning: Acceleration and convergence,” submitted, January 2018. [Online]. Available: http://arxiv.org/abs/1802.05584 [DOI] [PMC free article] [PubMed]

[R2] [2].Chun IY and Fessler JA, “Convolutional dictionary learning: Acceleration and convergence,” IEEE Trans. Image Process, vol. 27, no. 4, pp. 1697–1712, April 2018. [DOI] [PubMed] [Google Scholar]

[R3] [3].Chun IY and Fessler JA, “Convergent convolutional dictionary learning using adaptive contrast enhancement (CDL-ACE): Application of CDL to image denoising,” in Proc. Sampling Theory and Appl. (SampTA), Tallinn, Estonia, July 2017, pp. 460–464. [Google Scholar]

[R4] [4].Chun IY and Fessler JA, “Convolutional analysis operator learning: Application to sparse-view CT,” in Proc. Asilomar Conf. on Signals, Syst., and Comput, Pacific Grove, CA, October 2018, pp. 1631–1635. [Google Scholar]

[R5] [5].Hastie T, Tibshirani R, and Friedman J, The elements of statistical learning: Data mining, inference, and prediction, ser. Springer series in statistics. New York, NY: Springer, 2009. [Google Scholar]

[R6] [6].Mohri M, Rostamizadeh A, and Talwalkar A, Foundations of machine learning. Cambridge, MA: MIT Press, 2018. [Google Scholar]

[R7] [7].Shakeri Z, Sarwate AD, and Bajwa WU, “Sample complexity bounds for dictionary learning from vector- and tensor-valued data,” in Information Theoretic Methods in Data Science, Rodrigues M and Eldar Y, Eds. Cambridge, UK: Cambridge University Press, 2019, ch. 5. [Google Scholar]

[R8] [8].Singh S, Poczos B, and Ma J, “Minimax reconstruction risk of convolutional sparse dictionary learning,” in Proc. Int. Conf. on Artif. Int. and Stat, ser. Proc. Mach. Learn. Res., vol. 84, Playa Blanca, Lanzarote, Canary Islands, April 2018, pp. 1327–1336. [Google Scholar]

[R9] [9].Aharon M, Elad M, and Bruckstein A, “K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation,” IEEE Trans. Signal Process, vol. 54, no. 11, pp. 4311–4322, November 2006. [Google Scholar]

[R10] [10].Yaghoobi M, Nam S, Gribonval R, and Davies ME, “Constrained overcomplete analysis operator learning for cosparse signal modelling,” IEEE Trans. Signal Process, vol. 61, no. 9, pp. 2341–2355, March 2013. [Google Scholar]

[R11] [11].Hawe S, Kleinsteuber M, and Diepold K, “Analysis operator learning and its application to image reconstruction,” IEEE Trans. Image Process, vol. 22, no. 6, pp. 2138–2150, June 2013. [DOI] [PubMed] [Google Scholar]

[R12] [12].Cai J-F, Ji H, Shen Z, and Ye G-B, “Data-driven tight frame construction and image denoising,” Appl. Comput. Harmon. Anal, vol. 37, no. 1, pp. 89–105, October 2014. [Google Scholar]

[R13] [13].Ravishankar S and Bresler Y, “ℓ₀ sparsifying transform learning with efficient optimal updates and convergence guarantees,” IEEE Trans. Sig. Process, vol. 63, no. 9, pp. 2389–2404, May 2015. [Google Scholar]

[R14] [14].Seibert M, Wörmann J, Gribonval R, and Kleinsteuber M, “Learning co-sparse analysis operators with separable structures,” IEEE Trans. Signal Process, vol. 64, no. 1, pp. 120–130, January 2016. [Google Scholar]

[R15] [15].Li R-C, “New perturbation bounds for the unitary polar factor,” SIAM J. Matrix Anal. Appl, vol. 16, no. 1, pp. 327–332, January 1995. [Google Scholar]

[R16] [16].Foucart S and Rauhut H, A mathematical introduction to compressive sensing. New York, NY: Springer, 2013. [Google Scholar]

[R17] [17].Weyl H, “Das asymptotische verteilungsgesetz der eigenwerte linearer partieller differentialgleichungen (mit einer anwendung auf die theorie der hohlraumstrahlung),” Mathematische Annalen, vol. 71, no. 4, pp. 441–479, December 1912. [Google Scholar]

PERMALINK

Convolutional Analysis Operator Learning: Dependence on Training Data

Il Yong Chun

David Hong

Ben Adcock

Jeffrey A Fessler

Roles

Abstract

I. Introduction

II. Backgrounds and Preliminaries

A. CAOL with orthogonality constraints

B. Filter update in a matrix form

III. Main Results:

Dependence of CAOL on Training Data

Theorem 1.

Corollary 2.

Fig. 1.

Theorem 3.

Fig. 2.

IV. Related Works

V. Proof of Theorem 1

VI. Proof of Corollary 2

VII. Proof of Theorem 3

A. Upper bound for numerator

B. Lower bound for denominator

C. Combined bound

Acknowledgments

Footnotes

Contributor Information

REFERENCES

ACTIONS

PERMALINK

RESOURCES

Cite

Add to Collections

PERMALINK

Convolutional Analysis Operator Learning: Dependence on Training Data

Il Yong Chun

David Hong

Ben Adcock

Jeffrey A Fessler

Roles

Abstract

I. Introduction

II. Backgrounds and Preliminaries

A. CAOL with orthogonality constraints

B. Filter update in a matrix form

III. Main Results:

Dependence of CAOL on Training Data

Theorem 1.

Corollary 2.

Fig. 1.

Theorem 3.

Fig. 2.

IV. Related Works

V. Proof of Theorem 1

VI. Proof of Corollary 2

VII. Proof of Theorem 3

A. Upper bound for numerator

B. Lower bound for denominator

C. Combined bound

Acknowledgments

Footnotes

Contributor Information

REFERENCES

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases