Markov Chain Monte Carlo Inference of Parametric Dictionaries for Sparse Bayesian Approximations

Theodora Chaspari; Andreas Tsiartas; Panagiotis Tsilifis; Shrikanth Narayanan

doi:10.1109/TSP.2016.2539143

Author manuscript; available in PMC: 2017 Jun 23.

Published in final edited form as: IEEE Trans Signal Process. 2016 Mar 7;64(12):3077–3092. doi: 10.1109/TSP.2016.2539143


PMCID: PMC5482548 NIHMSID: NIHMS782938 PMID: 28649173

Abstract

Parametric dictionaries can increase the ability of sparse representations to meaningfully capture and interpret the underlying signal information, such as that encountered in biomedical problems. Given a mapping function from the atom parameter space to the actual atoms, we propose a sparse Bayesian framework for learning the atom parameters, because of its ability to provide full posterior estimates, take uncertainty into account and generalize on unseen data. Inference is performed with Markov Chain Monte Carlo, which uses block sampling to generate the variables of the Bayesian problem. Since the parameterization of dictionary atoms results in posteriors that cannot be analytically computed, we use a Metropolis-Hastings-within-Gibbs framework, according to which variables with closed-form posteriors are generated with the Gibbs sampler, while the remaining ones are generated with Metropolis-Hastings from appropriate candidate-generating densities. We further show that the corresponding Markov Chain is uniformly ergodic, ensuring its convergence to a stationary distribution independently of the initial state. Results on synthetic data and real biomedical signals indicate that our approach offers advantages in terms of signal reconstruction compared to previously proposed Steepest Descent and Equiangular Tight Frame methods. This paper demonstrates the ability of Bayesian learning to generate parametric dictionaries that can reliably represent the exemplar data and provides the foundation towards inferring the entire variable set of the sparse approximation problem for signal denoising, adaptation and other applications.

Index Terms: Dictionary learning, parametric dictionaries, Bayesian inference, Markov Chain Monte Carlo, sparse representation, uniform ergodicity

I. Introduction

Signal representations are fundamental for denoising, interpolation, estimation, classification and recognition. Recently there has been an increased focus on sparse representations [1], which model a signal with a small number of components from a large overcomplete set of exemplar atoms, called a dictionary. A dictionary D ∈ ℜ^P×K contains K atoms d₁, …, d_K ∈ ℜ^P that constitute the building blocks of a P-dimensional signal x ∈ ℜ^P. Specifically, a signal can be expressed as an exact or approximate linear combination of a small number of atoms from the dictionary as x ≈ Dc, where c ∈ ℜ^K contains the coefficients of the corresponding atoms. In the typical case where K > P, infinitely many solutions arise. This can be addressed by imposing a sparsity constraint on c, according to which x should be represented by the smallest number of dictionary atoms, and the sparse representation problem can be expressed as an l0-norm minimization.

The problem of minimizing ‖c‖₀ subject to the constraint x = Dc has been proven to be NP-hard and several directions have been proposed to solve it. One approach includes greedy strategies that abandon exhaustive search in favor of locally optimal updates resulting in sub-optimal solutions. Examples include matching pursuit [2] and orthogonal matching pursuit [3], [4] algorithms. An alternative is the relaxation of the discontinuous l0-norm leading to the more computationally expensive basis pursuit [5] and focal underdetermined system solver [6], [7] reaching global solutions. Bayesian methods with appropriate statistical assumptions have been further used to identify the desired sparse solution [8], [9], [10].
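As an illustration of the greedy strategies mentioned above, the following is a minimal sketch of orthogonal matching pursuit written with NumPy. The function name `omp`, the random dictionary and the chosen sparsity level are assumptions made for this example and are not taken from [3], [4]; exact recovery of the true support is not guaranteed for an arbitrary dictionary.

```python
import numpy as np

def omp(D, x, L):
    """Greedy sparse coding: pick at most L atoms of D to approximate x."""
    K = D.shape[1]
    residual = x.copy()
    support = []
    coeffs = np.zeros(0)
    for _ in range(L):
        # Select the atom most correlated with the current residual.
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        # Re-fit all selected coefficients by least squares (the "orthogonal" step).
        coeffs, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coeffs
    c = np.zeros(K)
    c[support] = coeffs
    return c

rng = np.random.default_rng(0)
P, K, L = 20, 50, 3
D = rng.standard_normal((P, K))
D /= np.linalg.norm(D, axis=0)            # unit-norm atoms
c_true = np.zeros(K)
c_true[[4, 17, 31]] = [1.5, -2.0, 1.0]    # illustrative sparse code
x = D @ c_true
c_hat = omp(D, x, L)
```

Each iteration is locally optimal given the atoms already chosen, which is why such methods are fast but may return sub-optimal supports.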

An essential step towards compact and reliable representations is the dictionary selection. Traditionally, analytic predesigned dictionaries comprising Gabor [11], wavelet [12], curvelet [13], or other atoms have been used, because of their localization, directionality and multi-resolution properties. Dictionary learning (DL) focuses on learning atoms from the available training data. It includes several well-known algorithms, such as the K-SVD [14] and the MOD [15], as well as probabilistic approaches [16]. Although non-parametric DL is effective for signal reconstruction [14], restoration [17], and classification [18], it is a highly non-convex problem, mostly yields non-structured dictionaries [19] and typically requires a large amount of training data [20].

These disadvantages can be mitigated by imposing a pre-determined structure through the use of carefully selected knowledge-driven parametric functions mapping a parameter space to structured dictionary atoms. This results in parametric dictionaries bridging the gap between pre-defined analytic dictionaries and purely numerical DL [21]. Dictionary atoms are expressed through an application-specific function ϕ of a parameter set, say d_k = ϕ(θ_k), where θ_k ∈ ℜ^Q, Q < P, are the atom parameters optimized with respect to desirable properties [22], [23], [24]. Parametric DL is more likely to converge faster and have more efficient implementations compared to the non-parametric problem [19]. It further provides higher signal interpretability yielding important metainformation [25], [26], [27], [28].

This paper proposes a Bayesian framework for learning the parameters of a dictionary given a predetermined parametric function. The formulation of the sparse representation problem from the Bayesian perspective assumes a probabilistic distribution for all variables in an effort to provide a posterior belief for their values [10], [29]. This allows the estimation of full posteriors rather than single estimates, which can lead to better handling of uncertainty and can benefit noise estimation.

Related Work

DL methods usually alternate between sparse decoding and dictionary update [19]. In the context of non-parametric DL, Ophir et al. have proposed a sequential learning algorithm by identifying the orthogonal directions to a data subset [30]. Engan et al. used matrix inversion to compute the dictionary matrix [15], while Aharon et al. introduced K-SVD, which is a constrained optimization approach performing atom-by-atom update [14]. Generalized principal component analysis models the data as a union of low-dimensional subspaces with orthogonal bases [31], while structured dictionaries have been proposed in an effort to enforce additional translation invariant, hierarchical and multiscale properties [32], [33], [34].

Parametric dictionaries offer higher interpretability, a lower density of local minima and a compact representation [19]. Dictionaries can be learnt as a result of translation of elementary signal segments over space and time [35], [36]. They can also contain atoms of predetermined structure, such as wavelets [37] or Gabor functions [22], whose parameters are adapted with steepest-descent (SD) [22] and other least-squares-based methods [25], [37]. Yaghoobi et al. proposed a method to find dictionaries close to the one with minimum coherence, called the equiangular tight frame (ETF) [24]. Finally, Thanou et al. [21] proposed a parametric DL for signals residing on weighted graphs.

Although the aforementioned deterministic DL methods perform well in various signal processing applications [14], [17], [18], they provide single-point estimates and cannot handle noise uncertainty. Bayesian approaches offer a way to address those disadvantages. They have been proposed for compressed sensing [8], [9], [38], [39] as well as for non-parametric DL.

The problem of ensuring sparseness in Bayesian approximations has been addressed in a variety of ways. Early approaches have placed continuous sparsity-promoting distributions on the atom coefficients c_n, such as the Laplacian [40], Cauchy [16], [41] and Student-t [42], [10]. Another way is the use of indicator variables that allow each atom to be sampled independently from the Bernoulli distribution [43], [44], [45]. Despite their computational efficiency, these can yield an arbitrarily large number of non-zero coefficients, which might not be practical for many applications. More in line with our problem, previous studies have explored the use of appropriate probabilistic assumptions for keeping a constant number of dictionary atoms in the representation with proposal distributions iteratively conditioning on the atoms selected at previous steps [39], [46], [38]. Gaussian assumptions are typically imposed on the dictionary atoms [43], [44] and the signal noise [38], [43], [44], [47], the latter being consistent with biomedical applications [48], [49].

Bayesian inference casts DL into an optimization problem for maximizing the posterior distribution. The overcomplete dictionaries containing many atoms in combination with the large amount of training data yield a high-dimensional framework, for which closed form solutions are usually difficult to derive and approximate inference methods are followed. Common approaches include the evidence maximization [8], [9], relevance vector machine [10], Markov Chain Monte Carlo (MCMC) [38], [44], and variational approximation [43].

Contributions

We propose a Bayesian framework for learning the parameters of dictionary atoms, given a parametric function that maps the dictionary parameters to the actual atoms. Our approach imposes probabilistic distributions to the variables of the sparse representation problem that are estimated through MCMC methods because of their simplicity and ability to fully perform Bayesian inference [50]. Compared to previous Bayesian DL [43], [44], our approach introduces parametric dictionaries where non-closed-form solutions are handled with a combination of Gibbs and Metropolis-Hastings (MH) sampling (MH-within-Gibbs). Our approach differs from previous parametric DL [22], [24] because of its stochastic framework that yields estimation of the full problem variables. This results in parametric dictionaries that take into account the structure of the training data and are less prone to overfitting. The parametric nature of our problem further requires the use of appropriate sparse-imposing priors that keep the selected number of dictionary atoms within a pre-determined range. We perform atom sampling with and without replacement formalized through the Multinomial and the Wallenius’ hypergeometric distribution, respectively.

One key challenge with MCMC is to determine its asymptotic behavior, i.e. whether it provides accurate posterior approximations. The goal is to create an aperiodic and irreducible Markov Chain (MC) whose stationary distribution is the posterior distribution of interest [51]. Irreducibility ensures that any state of the space is accessible, while aperiodicity makes sure that the chain does not return to the same state at regular times. Uniformly ergodic MCs are a special case in which the MC converges to the invariant distribution independently of the initial state. They guarantee geometrically fast convergence and are key sufficient conditions in order to establish central limit theorems for empirical averages and provide consistent estimators of MCMC standard errors [51], [52]. For these reasons, we discuss the geometric ergodicity of MCMC in our proposed Bayesian inference framework, which ensures convergence. We further perform qualitative and quantitative diagnostics to evaluate the reliability of the generated samples and resulting distributions.

We demonstrate the ability of our algorithm for parallel processing with experiments on synthetic data and real biomedical signals. DL is performed for each sample separately and the resulting dictionaries from each exemplar data are further combined into a unified model. Our results indicate that the proposed approach yields benefits in terms of superior signal reconstruction compared to previous SD [22] and ETF [24] methods. When we have precise a priori knowledge of the optimal parametric function ϕ representing the data, our parametric framework also yields better performance than the classical non-parametric approach of K-SVD [14].

In the following, we provide the problem formulation (Section II) and describe the proposed MCMC approach for learning the parameters for the considered problem (Section III). We describe the parallel implementation framework of our algorithm (Section IV) and discuss the use of various parametric functions and their effect on the sample-generating procedure (Section V). We further analyze the geometric ergodicity properties of the proposed MH-within-Gibbs algorithm yielding uniform convergence (Section VI). In Section VII, we provide experimental results and the results of MCMC diagnostics. Finally, we discuss our results and offer conclusions in Sections VIII and IX.

Notation

We denote matrices with bold uppercase letters X and vectors with bold lowercase letters x. Lowercase letters with appropriate numerical indices will either refer to the columns of a matrix, i.e. X = [x₁ … x_N] or the elements of a vector, i.e. x = [x₁ … x_N]^T. We denote (X)_ij the entry of matrix X corresponding to the i^th row and j^th column. The p^th order norm of a vector is symbolized as ‖x‖_p, while the N × 1 all-ones vector and N × N identity matrix are denoted as 1_N and I_N, respectively. The vectorization of matrix X, obtained by stacking its columns on top of one another, is defined as vec(X) = [x₁^T, …, x_N^T]^T. Finally the gradient and Hessian of a scalar valued function f(x), x ∈ ℜ^N, are denoted as ∇f(x) ∈ ℜ^N×1 and H_f = ∇²f(x) ∈ ℜ^N×N, where ${(\nabla f (x))}_{i} = \frac{\partial f}{\partial x_{i}}$ and ${(H_{f})}_{i j} = \frac{\partial^{2} f}{\partial x_{i} \partial x_{j}}$ , respectively.

II. Problem Formulation

Let X = [x₁ … x_N] ∈ ℜ^P×N be a data matrix of N examples x_n ∈ ℜ^P. We formulate the parametric DL problem by assuming an overcomplete dictionary D_n ∈ ℜ^P×K containing K prototypical atoms d_nk ∈ ℜ^P for each exemplar data x_n separately. This approach, also found in similar studies [43], enjoys computational benefits compared to batch methods, since it can yield faster reliable estimates of the considered variables (Section III-B) and can be easily parallelized to run on multiple computational threads (Section III-E).

In the case of parametric dictionaries, the atoms can be expressed with an appropriate domain-specific function ϕ : ℜ^Q → ℜ^P in terms of a parameter vector θ_nk ∈ ℜ^Q, Q < P, as d_nk = ϕ(θ_nk), therefore D_n = [ϕ(θ_n1) … ϕ(θ_nK)]. Typically dictionary atoms have a unit l2-norm, i.e. ‖ϕ(θ_nk)‖₂ = 1. Each signal x_n can be expressed as a linear combination of a small number of atoms, L ≪ K, with additive noise ε_n ∈ ℜ^P

x_{n} = D_{n} c_{n} + ε_{n}

(1)

where ε_n is the error and c_n ∈ ℜ^K, ‖c_n‖₀ = L, are the coefficients with non-zero values only for the used atoms. According to the Bayesian framework, each variable in (1) is assumed to follow an underlying probabilistic distribution.
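To make the mapping d_nk = ϕ(θ_nk) and the signal model (1) concrete, the sketch below builds a small parametric dictionary and one noisy sparse signal. The Gabor-type form of `phi`, the parameter ranges and the noise level are hypothetical choices for illustration only; the text above leaves ϕ application-specific.

```python
import numpy as np

def phi(theta, P):
    """Hypothetical Gabor-type atom: theta = (scale, shift, frequency), so Q = 3."""
    scale, shift, freq = theta
    t = np.arange(P)
    atom = np.exp(-0.5 * ((t - shift) / scale) ** 2) * np.cos(2 * np.pi * freq * (t - shift))
    return atom / np.linalg.norm(atom)       # enforce unit l2-norm

def build_dictionary(thetas, P):
    """Stack D_n = [phi(theta_n1) ... phi(theta_nK)] column by column."""
    return np.column_stack([phi(th, P) for th in thetas])

rng = np.random.default_rng(1)
P, K, L = 64, 30, 3
thetas = np.column_stack([
    rng.uniform(2.0, 8.0, K),      # scales (assumed range)
    rng.uniform(0.0, P - 1, K),    # shifts (assumed range)
    rng.uniform(0.01, 0.2, K),     # normalized frequencies (assumed range)
])
D = build_dictionary(thetas, P)

c = np.zeros(K)
active = rng.choice(K, size=L, replace=False)
c[active] = rng.standard_normal(L)            # L non-zero coefficients
x = D @ c + 0.01 * rng.standard_normal(P)     # x_n = D_n c_n + eps_n
```

Here the dictionary is fully described by K × Q = 90 numbers instead of P × K = 1920 free entries, which is the compactness that parametric DL relies on.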

Especially in parametric DL, where we jointly sample the parameters of the selected dictionary atoms (Section III-B), a small number of selected atoms can keep the implementation computationally tractable. For this reason, we are interested in imposing explicit sparseness constraints, similarly to previous studies [38], [39], [46] (Section I). We will describe two different ways to approach this using the Multinomial and the Wallenius’ hypergeometric distributions, which allow sampling with and without replacement, respectively.

A. Atom Sampling with Replacement

A straightforward method to sample the dictionary atoms is to relax the l0-sparsity norm constraint into ‖c_n‖₀ ≤ L allowing independent sampling of the dictionary atoms L times with replacement through the Multinomial distribution. Since the population size is much larger than the sample size (L≪K), duplicate atoms are rare [53]. If we assume a discrete multinomial distribution for selecting one dictionary atom out of the possible K, (1) can be re-written as

x_{n} = D_{n} \sum_{l = 1}^{L} s_{n l} z_{n l} + ε_{n}

(2)

where z_nl ∈ ∪ u_i, u_i = [0, 0, …, 1, …, 0]^T with 1 in the i^th entry (i ≤ K) and 0 in the remaining K−1 entries. The vector z_nl is binary with ‖z_nl‖₀ = 1, activating one dictionary atom at a time, and s_n = [s_n1 … s_nL]^T ∈ ℜ^L only contains the coefficients of the selected atoms. If atom d_k is the l^th representation term, then z_{nl_k} = 1, z_{nl_k′} = 0, ∀k′ ≠ k, and s_nl is the k^th entry of vector c_n in (1), i.e. s_nl = c_nk. The probability of selecting atom k for data x_n is π_nk such that z_nl ~ Multinomial (1, π_n) with π_n = [π_n1, …, π_nK]^T ∈ ℜ^K. If the same atom is selected more than once, the corresponding coefficient is estimated only once.
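The selection mechanism in (2) can be sketched with NumPy as follows. The variable names and the rule of keeping the first coefficient drawn for a duplicated atom are assumptions of this example; the text only states that a repeated atom receives a single coefficient.

```python
import numpy as np

rng = np.random.default_rng(2)
K, L = 12, 3
pi_n = rng.dirichlet(np.ones(K))      # selection probabilities pi_n

# Each row is one z_nl ~ Multinomial(1, pi_n): a one-hot vector of length K.
Z = rng.multinomial(1, pi_n, size=L)
s_n = rng.standard_normal(L)          # coefficients of the L draws

# Map back to c_n in (1). If an atom is drawn twice, keep a single coefficient
# (here the first one drawn, which is an assumption of this sketch).
c_n = np.zeros(K)
for l in range(L):
    k = int(np.argmax(Z[l]))
    if c_n[k] == 0.0:
        c_n[k] = s_n[l]
```

Because the draws are independent, the resulting code satisfies ‖c_n‖₀ ≤ L rather than ‖c_n‖₀ = L.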

B. Atom Sampling without Replacement

Sampling without replacement avoids duplicate atoms and keeps the l0-sparsity constraint intact. The problem of selecting L atoms out of the possible K can be formalized similarly to the classical experiment of taking colored balls at random from an urn without replacement [54], [55]. If the balls have a different weight, the result follows the Wallenius’ noncentral hypergeometric distribution [56].

In the considered problem, we can assume K dictionary atoms each of a different type (i.e. each ball in the urn has a different color) and selection probability π = [π₁, …, π_K]. If z_n = [z_n1, …, z_nK] ∈ ℜ^K, z_nk ∈ {0, 1} and ‖z_n‖₀ = L, indicates the selected atoms for x_n, then (1) becomes

x_{n} = D_{n} (s_{n} ◦ z_{n}) + ε_{n}

(3)

where “◦” represents the Hadamard or entrywise product and s_n ∈ ℜ^K and z_n follows the Wallenius distribution.
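With one "ball" per atom, sampling without replacement can be sketched as a sequence of weighted draws in which each selected atom is removed from the urn. The helper name `sample_without_replacement` is hypothetical, and the sketch only illustrates the urn experiment described above, not the independent MH step used later for inference.

```python
import numpy as np

def sample_without_replacement(pi, L, rng):
    """Draw L distinct atom indices one at a time; at each step an atom is
    chosen with probability proportional to its weight among those remaining."""
    weights = np.asarray(pi, dtype=float).copy()
    chosen = []
    for _ in range(L):
        k = int(rng.choice(len(weights), p=weights / weights.sum()))
        chosen.append(k)
        weights[k] = 0.0              # the atom leaves the urn
    z = np.zeros(len(weights), dtype=int)
    z[chosen] = 1
    return z

rng = np.random.default_rng(3)
K, L = 12, 3
pi_n = rng.dirichlet(np.ones(K))
z_n = sample_without_replacement(pi_n, L, rng)
s_n = rng.standard_normal(K)
c_n = s_n * z_n                       # Hadamard product in (3)
```

In contrast to the Multinomial case, every draw yields exactly L distinct atoms, so the l0 constraint holds with equality.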

C. Additional Probabilistic Assumptions

We assume independent atom coefficients s_n1, …, s_nL each following a normal distribution with mean μ_{s_nl} and precision γ_s, i.e. $s_{n} ~ Normal (μ_{s_{n}}, γ_{s}^{- 1} I_{L})$ . We have considered a different mean μ_{s_nl} for each exemplar data n and selected dictionary atom l to account for the various signal energy levels and possible atom configurations. We further hypothesize normally distributed dictionary parameters θ_nk ~ Normal (g_nk, G_n⁻¹) with mean g_nk and precision G_n. Finally, we assume zero-mean Gaussian noise with variance $γ_{ε_{n}}^{- 1}$ , i.e. $ε_{n} ~ Normal (0, γ_{ε_{n}}^{- 1} I_{P})$ . Similar distributions have also been hypothesized in prior work [16], [41], [44] (Section I).

The Bayesian framework further treats the parameters of the above variables as random components in order to better capture uncertainty. We introduce conjugate prior distributions that simplify computations. Specifically, we assume that π_n follows a Dirichlet prior, i.e. π_n ~ Dirichlet (α), where α = [α₁ … α_K]^T, and the precision of the Gaussian noise follows a Gamma prior, i.e. γ_{ε_n} ~ Gamma (e, f). The mean vector g_nk and precision matrix G_n of the dictionary atom parameters are modeled with Gaussian and Wishart distributions, respectively, i.e. g_nk ~ Normal (g₀, G₀⁻¹) and G_n ~ Wishart (ν₀, R₀).

D. Objective

The goal is to find $y_{n}^{*}$ such that

y_{n}^{*} = arg max_{y_{n}} p (y_{n} | x_{n}, ℋ_{n})

(4)

y_{n} = {[{z_{n}}^{T}, s_{n 1}, \dots, s_{n L}, {θ_{n 1}}^{T}, \dots, {θ_{n K}}^{T}, {g_{n 1}}^{T}, \dots, {g_{n K}}^{T}, vec {(G_{n})}^{T}, {ε_{n}}^{T}, {π_{n}}^{T}, γ_{ε_{n}}]}^{T}

(5)

ℋ_{n} = {α, g_{0}, G_{0}, ν_{0}, R_{0}, e, f, μ_{s_{n}}, γ_{s}}

(6)

The probabilistic assumptions for (4)–(6) are summarized in Table I.

TABLE I.

Prior distributions of Bayesian dictionary learning variables.

| Variable | Type | Expression | (Hyper) Parameters |
|---|---|---|---|
| z_nl† ∈ ∪ u_i | Multinomial (1, π_n) | $\prod_{k = 1}^{K} π_{n k}^{z_{n l_{k}}}$ | π_n: outcome probability |
| z_n‡ ∈ ℜ^K | Wallenius (1_K, L, π_n) | $\int_{0}^{1} \prod_{k = 1}^{K} {(1 - t^{π_{n k} / d})}^{z_{n k}} d t$, $d = \sum_{k = 1}^{K} π_{n k} (1 - z_{n k})$ | π_n: outcome probability |
| s_n ∈ ℜ^L | Normal (μ_{s_n}, $γ_{s}^{- 1} I_{L}$) | ${(2 π)}^{- L / 2} γ_{s}^{L / 2} exp [- \frac{γ_{s}}{2} {(s_{n} - μ_{s_{n}})}^{T} (s_{n} - μ_{s_{n}})]$ | μ_{s_n}: mean; γ_s: precision |
| θ_nk ∈ ℜ^Q | Normal (g_nk, G_n⁻¹) | ${(2 π)}^{- Q / 2} {\lvert G_{n} \rvert}^{1 / 2} exp [- \frac{1}{2} {(θ_{n k} - g_{n k})}^{T} G_{n} (θ_{n k} - g_{n k})]$ | g_nk: mean; G_n: precision |
| ε_n ∈ ℜ^P | Normal (0, $γ_{ε_{n}}^{- 1} I_{P}$) | ${(2 π)}^{- P / 2} γ_{ε_{n}}^{P / 2} exp [- \frac{γ_{ε_{n}}}{2} {ε_{n}}^{T} ε_{n}]$ | γ_{ε_n}: precision |
| π_n ∈ ℜ^K | Dirichlet (α) | $\frac{Γ (\sum_{k = 1}^{K} α_{k})}{\prod_{k = 1}^{K} Γ (α_{k})} \prod_{k = 1}^{K} π_{n k}^{α_{k} - 1}$ | α: concentration parameters |
| g_nk ∈ ℜ^Q | Normal (g₀, G₀⁻¹) | ${(2 π)}^{- Q / 2} {\lvert G_{0} \rvert}^{1 / 2} exp [- \frac{1}{2} {(g_{n k} - g_{0})}^{T} G_{0} (g_{n k} - g_{0})]$ | g₀: mean; G₀: precision |
| G_n ∈ ℜ^Q×Q | Wishart (ν₀, R₀) | $({\lvert G_{n} \rvert}^{\frac{ν_{0} - Q - 1}{2}} exp [- \frac{tr ({R_{0}}^{- 1} G_{n})}{2}]) {(2^{\frac{ν_{0} Q}{2}} {\lvert R_{0} \rvert}^{\frac{ν_{0}}{2}} Γ_{Q} (\frac{ν_{0}}{2}))}^{- 1}$ | ν₀ > Q − 1: dof; R₀: scale matrix |
| γ_{ε_n} ∈ ℜ | Gamma (e, f) | $\frac{f^{e}}{Γ (e)} γ_{ε_{n}}^{e - 1} exp [- f γ_{ε_{n}}]$ | e: shape; f: rate |

Γ, Γ_Q: (multivariate) gamma functions, dof: degrees of freedom, 1_K = [1, …, 1]^T ∈ ℜ^K

†: atom sampling with replacement (‖z_nl‖₀ = 1), u_i = [0, 0, …, 1, …, 0]^T with 1 in the i^th entry, 0 otherwise

‡: atom sampling without replacement (‖z_n‖₀ = L)

III. Inference with MCMC Sampling

The inference problem aims at finding solutions $y_{n}^{*}$ , n = 1, …, N, that maximize (4). Since (4) is not analytically tractable, we use MCMC for approximate inference because of its simple and fast implementation. We describe the inference procedure in Section III-A and provide the corresponding derivations in Sections III-B, III-C, and III-D.

A. MCMC Sampling

The large number of variables in our problem makes simultaneous sampling from the full posterior prohibitive. For this reason, we divide the variable space y into blocks, a technique which is usually referred to as “block-at-a-time” MCMC [57], [58]. Suppose that the variable space is split into B blocks specified according to the problem characteristics, i.e. y = [y₁^T, …, y_B^T]^T. Without loss of generality, we hypothesize these blocks are sampled sequentially from y₁ to y_B. This is referred to as “deterministic-scan” MCMC sampling [59], [60], [61] and will be the focus of our paper.

When the posterior probability of a block b yields a known probabilistic distribution, we use the Gibbs sampler, otherwise we sample based on the MH [62], [63]. MH generates a sample with a candidate-generating (or proposal) density q. The MC transitions to the generated sample with a predefined probability of move. A critical component is the selection of the proposal density, based on which MH samplers are divided into two categories. The first is the random-walk [62] with samples generated around the current value of the corresponding variables. Despite its simplicity, this method often exhibits slow convergence depending on the variance of q [58], [64]. The second type, called independent MH [63], samples independently of the previous state. Its proposal density is chosen to be close to the target distribution in a certain sense, which benefits convergence. Previous work has considered Student-t distributions tailored to the target density [65], [66], [67], whose long tails ensure that no areas of the state space are left unexplored. The “MH-within-Gibbs” sampler uses MH to generate samples for blocks whose posterior does not yield a known distribution, and Gibbs to generate samples for the remaining blocks.
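The independent MH sampler described above can be sketched on a one-dimensional toy problem. The Gaussian target, the proposal location and scale, and the function names are all assumptions chosen for illustration; in the acceptance step the target and proposal densities are only needed up to constants.

```python
import numpy as np

def log_target(theta):
    """Toy unnormalized log-posterior (an assumption): Normal(1, 0.5^2)."""
    return -0.5 * ((theta - 1.0) / 0.5) ** 2

def log_student_t(theta, loc, scale, nu):
    """Log-density of a location-scale Student-t, up to an additive constant."""
    u = (theta - loc) / scale
    return -0.5 * (nu + 1.0) * np.log1p(u * u / nu)

def independence_mh(n_iter, loc, scale, nu, rng):
    theta = loc
    samples = np.empty(n_iter)
    accepted = 0
    for m in range(n_iter):
        # The candidate is drawn independently of the current state.
        cand = loc + scale * rng.standard_t(nu)
        # The acceptance ratio compares importance weights target / proposal.
        log_w_cand = log_target(cand) - log_student_t(cand, loc, scale, nu)
        log_w_curr = log_target(theta) - log_student_t(theta, loc, scale, nu)
        if np.log(rng.uniform()) < log_w_cand - log_w_curr:
            theta = cand
            accepted += 1
        samples[m] = theta
    return samples, accepted / n_iter

rng = np.random.default_rng(4)
samples, acc_rate = independence_mh(5000, loc=1.0, scale=0.5, nu=5, rng=rng)
```

Because the Student-t tails are heavier than those of the target, the importance weights stay bounded, which is the property that the ergodicity argument for this type of sampler relies on.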

We will further discuss the use of “MH-within-Gibbs sampler” for our problem and the derivation of posteriors for each variable (Table II). We will assume that y_{−y_b} contains all variables except the ones included in the b^th block (i.e. y_b), which are currently being generated.

TABLE II.

Description of Metropolis-Hastings-within-Gibbs sampling distributions.

| Variable | Sampling Distribution/Proposal | Sampling Method |
|---|---|---|
| z_nl† ∈ ℜ^K | Multinomial (1, p_nl), $p_{n l_{k}} = π_{n k} exp [- \frac{1}{2} γ_{ε_{n}} s_{n l} (s_{n l} - 2 ε_{n l}^{T} ϕ (θ_{n k}))]$ | Gibbs |
| z_n‡ ∈ ℜ^K | Wallenius (1_K, L, π_n) | Metropolis-Hastings |
| s_nl ∈ ℜ | Normal $(\frac{γ_{s} μ_{s_{n l}} + γ_{ε_{n}} {(D_{n} z_{n l})}^{T} ε_{n l}}{γ_{s} + γ_{ε_{n}}}, {(γ_{s} + γ_{ε_{n}})}^{- 1})$ | Gibbs |
| ${\tilde{θ}}_{n} = vec ([θ_{n k_{1}^{'}} \dots θ_{n k_{L}^{'}}]) \in ℜ^{Q L}$ | Student-t (μ̂_{θ̃_n}, V̂_{θ̃_n}, ν₁) | Metropolis-Hastings |
| θ_nk ∈ ℜ^Q, k ∉ ℐ_{D_n} | Normal (g_nk, G_n⁻¹) | Gibbs |
| ε_n ∈ ℜ^P | Normal $(0, γ_{ε_{n}}^{- 1} I_{P})$ | Gibbs |
| π_n ∈ ℜ^K | Dirichlet $({[α_{1} + \sum_{l = 1}^{L} z_{n l_{1}}, \dots, α_{K} + \sum_{l = 1}^{L} z_{n l_{K}}]}^{T})$ | Gibbs |
| g_nk ∈ ℜ^Q | Normal $({(G_{n} + G_{0})}^{- 1} ({G_{n}}^{T} θ_{n k} + {G_{0}}^{T} g_{0}), {(G_{n} + G_{0})}^{- 1})$ | Gibbs |
| G_n ∈ ℜ^Q×Q | Wishart $(ν_{0} + K, {[{R_{0}}^{- 1} + \sum_{k = 1}^{K} (θ_{n k} - g_{n k}) {(θ_{n k} - g_{n k})}^{T}]}^{- 1})$ | Gibbs |
| γ_{ε_n} ∈ ℜ | Gamma $(e + \frac{P}{2}, f + \frac{1}{2} {‖ x_{n} - \sum_{l = 1}^{L} s_{n l} ϕ (θ_{n k_{l}^{'}}) ‖}^{2})$ | Gibbs |

$x_{n} = \sum_{l = 1}^{L} s_{n l} ϕ (θ_{n k_{l}^{'}}) + ε_{n}$ , $ℐ_{D_{n}} = \{k_{1}^{'}, \dots, k_{L}^{'}\}$ , 1_K = [1, …, 1]^T ∈ ℜ^K

†, ‡: atom sampling with/without replacement

B. Sampling Dictionary Parameters

Let $ℐ_{D_{n}} = {k_{1}^{'}, \dots, k_{L}^{'}}$ be the indices of dictionary atoms that represent signal x_n and $θ_{n k_{1}^{'}} \dots θ_{n k_{L}^{'}}$ the corresponding atom parameters. The joint posterior of ${\tilde{θ}}_{n} = vec ([θ_{n k_{1}^{'}} \dots θ_{n k_{L}^{'}}]) \in ℜ^{Q L}$ can be written as

p ({\tilde{θ}}_{n} | y_{- {\tilde{θ}}_{n}}, x_{n}, ℋ_{n}) \propto \prod_{k \in ℐ_{D_{n}}} p (θ_{n k} | g_{n k}, G_{n}) \cdot p (x_{n} - \sum_{l = 1}^{L} s_{n l} ϕ (θ_{n k_{l}^{'}}) | γ_{ε_{n}}) \propto exp [- \frac{γ_{ε_{n}}}{2} {‖ x_{n} - \sum_{l = 1}^{L} s_{n l} ϕ (θ_{n k_{l}^{'}}) ‖}_{2}^{2}] exp [- \frac{1}{2} \sum_{k \in ℐ_{D_{n}}} {(θ_{n k} - g_{n k})}^{T} G_{n} (θ_{n k} - g_{n k})] \propto exp [- \frac{γ_{ε_{n}}}{2} {‖ x_{n} - S_{n} {\tilde{ϕ}}_{n} ({\tilde{θ}}_{n}) ‖}_{2}^{2}] exp [- \frac{1}{2} {({\tilde{θ}}_{n} - {\tilde{g}}_{n})}^{T} {\tilde{G}}_{n} ({\tilde{θ}}_{n} - {\tilde{g}}_{n})]

(7)

where ${\tilde{ϕ}}_{n} ({\tilde{θ}}_{n}) = vec ([ϕ (θ_{n k_{1}^{'}}) \dots ϕ (θ_{n k_{L}^{'}})]) \in ℜ^{P L}, {\tilde{g}}_{n} = vec ([g_{n k_{1}^{'}} \dots g_{n k_{L}^{'}}]) \in ℜ^{Q L}$ and

{\tilde{G}}_{n} = \begin{pmatrix} G_{n} & & \\ & \ddots & \\ & & G_{n} \end{pmatrix} \in ℜ^{Q L \times Q L}

Because of ϕ, we cannot find a known probabilistic distribution for (7), therefore we use MH for sampling θ̃_n. The proposal density is a multivariate Student-t distribution with ν₁ degrees of freedom, location μ̂_{θ̃_n} tailored to the target density, and scaled-identity scale matrix V̂_{θ̃_n}, defined as

q ({\tilde{θ}}_{n} | x_{n}, ℋ_{n}, y_{- {\tilde{θ}}_{n}}) = \frac{Γ (\frac{ν_{1} + Q}{2})}{Γ (\frac{ν_{1}}{2})} {[π (ν_{1} + Q)]}^{- \frac{Q}{2}} {| {\hat{V}}_{{\tilde{θ}}_{n}} |}^{- \frac{1}{2}} {[1 + \frac{1}{ν_{1} + Q} {({\tilde{θ}}_{n} - {\hat{μ}}_{{\tilde{θ}}_{n}})}^{T} {\hat{V}}_{{\tilde{θ}}_{n}}^{- 1} ({\tilde{θ}}_{n} - {\hat{μ}}_{{\tilde{θ}}_{n}})]}^{- \frac{ν_{1} + Q}{2}}

(8)

{\hat{μ}}_{{\tilde{θ}}_{n}} = arg max_{{\tilde{θ}}_{n}} ln (p ({\tilde{θ}}_{n} | y_{- {\tilde{θ}}_{n}}, x_{n}, ℋ_{n}))

(9)

{\hat{V}}_{{\tilde{θ}}_{n}} = τ^{2} I_{Q L}

(10)

The choice of Student-t is crucial for the uniform ergodicity of MCMC (Section VI). In the experiments (Section VII), we find a numerical solution to (9) using the Nelder-Mead simplex method [68] because of its simplicity compared to gradient-based approaches. From (8)–(10), we are able to jointly generate the parameters of the selected atoms, rather than generating each one separately. A large number of selected atoms increases the computational cost of maximizing (9).
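Drawing a joint candidate for the selected atom parameters can be sketched as below, using the standard Gaussian scale-mixture construction of a multivariate Student-t with scale matrix τ²I. The values of `tau`, `nu` and the zero placeholder for the mode are assumptions; in the procedure described above the location would be the numerical maximizer of (9), and the standard parameterization used here may differ in normalization from (8).

```python
import numpy as np

def draw_student_t(mu_hat, tau, nu, rng):
    """Draw from a multivariate Student-t with location mu_hat, scale matrix
    tau^2 * I and nu degrees of freedom, via a Gaussian scale mixture."""
    z = rng.standard_normal(mu_hat.shape[0])
    u = rng.chisquare(nu)
    return mu_hat + tau * z / np.sqrt(u / nu)

rng = np.random.default_rng(5)
Q, L = 3, 2
mu_hat = np.zeros(Q * L)          # placeholder for the mode found in (9)
theta_cand = draw_student_t(mu_hat, tau=0.1, nu=5, rng=rng)

# vec(.) stacks the parameter vectors of the L selected atoms, so row l of the
# reshaped array holds the candidate parameters of the l-th selected atom.
theta_blocks = theta_cand.reshape(L, Q)
```

A single draw therefore proposes new parameters for all L selected atoms at once, matching the joint update of θ̃_n.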

The posterior for the remaining atoms θ_nk, k ∉ ℐ_{D_n}, generated with Gibbs, is

p (θ_{n k} | y_{- θ_{n k}}, x_{n}, ℋ_{n}) \propto p (θ_{n k} | g_{n k}, G_{n})

(11)

C. Sampling Atom Indices and Coefficients

The probabilistic selection of dictionary atoms gives the opportunity to test unseen combinations. This can alleviate disadvantages of greedy algorithms, since the randomization might overcome locally optimal solutions, as also observed in [39]. Because the atom indices and coefficients are interdependent, we will describe the sampling procedure for both.

In the case of sampling with replacement, the atom indices and coefficients are generated with the Gibbs sampler, as their posterior can be analytically derived based on the assumptions for their priors (Section II). In the following, we will denote by ε_nl = x_n − ∑_{l′≠l} s_nl′ D_n z_nl′ the error for the exemplar data x_n when we exclude the l^th atom from the representation. Then the total representation error can be expressed as

ε_{n} = ε_{n l} - s_{n l} D_{n} z_{n l} = ε_{n l} - s_{n l} \sum_{k = 1}^{K} z_{n l_{k}} ϕ (θ_{n k})

(12)

The posterior distribution of z_nl can be written as

p (z_{n l} | y_{- {z_{n l}}}, x_{n}, ℋ_{n}) \propto p (z_{n l} | π_{n}, L) \cdot p (ε_{n} | γ_{ε_{n}}) \propto \prod_{k = 1}^{K} π_{n k}^{z_{n l_{k}}} \cdot exp (- \frac{γ_{ε_{n}}}{2} {‖ ε_{n l} - s_{n l} \sum_{k = 1}^{K} z_{n l_{k}} ϕ (θ_{n k}) ‖}_{2}^{2}) \propto \prod_{k = 1}^{K} {[π_{n k} exp (γ_{ε_{n}} s_{n l} ε_{n l}^{T} ϕ (θ_{n k}))]}^{z_{n l_{k}}}

(13)

For deriving the above expression we took into account that z_{nl_k} ∈ {0, 1} is a binary variable, implying that $z_{n l_{k}}^{2} = z_{n l_{k}}$ and az_{nl_k} = a^z_{nl_k}, ∀a ∈ ℜ. Also z_nl has unit l0-norm, i.e. ‖z_nl‖₀ = 1, resulting in z_{nl_k}z_{nl_k′} = 0, ∀k′ ≠ k. As indicated in (13), this update procedure considers the similarity of dictionary atoms to the signal residual and the prior knowledge for selecting atom k. Finally, the posterior of s_nl is

p (s_{n l} | y_{- s_{n l}}, x_{n}, ℋ_{n}) \propto p (s_{n l} | γ_{s}) \cdot p (ε_{n} | γ_{ε_{n}}) \propto exp [- \frac{γ_{s}}{2} {(s_{n l} - μ_{s_{n l}})}^{2} - \frac{γ_{ε_{n}}}{2} {‖ ε_{n l} - s_{n l} D_{n} z_{n l} ‖}_{2}^{2}]

(14)

By completing the square of the above quadratic formula with respect to s_nl and assuming dictionary atoms of unit norm, s_nl can be generated from a normal distribution (Table II).
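The two Gibbs updates of this section can be sketched as follows, assuming unit-norm atoms. The function names and the toy identity dictionary are hypothetical; the probabilities follow the last line of (13) and the Gaussian parameters follow the s_nl entry of Table II.

```python
import numpy as np

def sample_z(eps_nl, s_nl, D, pi_n, gamma_eps, rng):
    """Draw one atom index from Multinomial(1, p_nl), where p_nl[k] is
    proportional to pi_n[k] * exp(gamma_eps * s_nl * eps_nl^T phi(theta_nk))."""
    log_p = np.log(pi_n) + gamma_eps * s_nl * (D.T @ eps_nl)
    log_p -= log_p.max()                  # stabilise before exponentiating
    p = np.exp(log_p)
    p /= p.sum()
    return int(rng.choice(len(p), p=p)), p

def sample_s(eps_nl, d_k, mu_s, gamma_s, gamma_eps, rng):
    """Draw s_nl from its Gaussian conditional (unit-norm atom d_k assumed)."""
    prec = gamma_s + gamma_eps
    mean = (gamma_s * mu_s + gamma_eps * (d_k @ eps_nl)) / prec
    return mean + rng.standard_normal() / np.sqrt(prec), mean

rng = np.random.default_rng(6)
D = np.eye(4)                             # toy dictionary with unit-norm atoms
eps_nl = 3.0 * D[:, 2]                    # residual aligned with atom 2
pi_n = np.full(4, 0.25)

k, p = sample_z(eps_nl, s_nl=3.0, D=D, pi_n=pi_n, gamma_eps=10.0, rng=rng)
s_new, s_mean = sample_s(eps_nl, D[:, k], mu_s=0.0, gamma_s=1.0, gamma_eps=9.0, rng=rng)
```

In this toy setting nearly all probability mass falls on the atom aligned with the residual, and the coefficient mean is a precision-weighted average of the prior mean and the projection of the residual onto the atom.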

To the best of our knowledge, there is no conjugate prior for the Wallenius’ hypergeometric distribution [53]; therefore, for sampling without replacement, we used the independent MH sampler with a proposal distribution tailored to the Wallenius prior.

D. Sampling the Parameters of the Priors

The conjugate assumptions for the distribution of the parameters of the aforementioned variables allow us to use the Gibbs sampler to generate the corresponding samples. For the sake of completeness, we briefly sketch the derivation of the posteriors with sampling distributions shown in Table II.

p (π_{n} | y_{- π_{n}}, -) \propto p (π_{n} | α) \prod_{l = 1}^{L} p (z_{n l} | π_{n}, α)

(15)

p (g_{n k} | y_{- g_{n k}}, -) \propto p (θ_{n k} | g_{n k}, G_{n}) \cdot p (g_{n k} | g_{0}, G_{0})

(16)

p (G_{n} | y_{- G_{n}}, -) \propto \prod_{k = 1}^{K} p (θ_{n k} | g_{n k}, G_{n}) \cdot p (G_{n} | ν_{0}, R_{0})

(17)

p (γ_{ε_{n}} | y_{- γ_{ε_{n}}}, -) \propto p (γ_{ε_{n}} | e, f) \cdot p (x_{n} - D_{n} \sum_{l = 1}^{L} s_{n l} z_{n l} | γ_{ε_{n}})

(18)

where the dash “−” denotes {x_n, ℋ_n}
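Two of these conjugate updates, (15) and (16), can be sketched with NumPy using the sampling distributions listed in Table II. The function names and toy values are assumptions for illustration; since the precision matrices are symmetric, the transposes in Table II are omitted.

```python
import numpy as np

def sample_pi(alpha, Z, rng):
    """pi_n | z ~ Dirichlet(alpha + counts), with counts[k] = sum_l z_nl[k]."""
    return rng.dirichlet(alpha + Z.sum(axis=0))

def sample_g(theta_nk, G_n, g0, G0, rng):
    """g_nk | theta_nk ~ Normal(mean, (G_n + G0)^{-1}) with precision-weighted mean."""
    cov = np.linalg.inv(G_n + G0)
    mean = cov @ (G_n @ theta_nk + G0 @ g0)
    return rng.multivariate_normal(mean, cov), mean

rng = np.random.default_rng(7)
K, L, Q = 5, 3, 2
Z = np.zeros((L, K), dtype=int)
Z[np.arange(L), [0, 3, 3]] = 1            # one-hot selections z_n1, z_n2, z_n3
pi_n = sample_pi(np.ones(K), Z, rng)

g_nk, g_mean = sample_g(
    theta_nk=np.array([2.0, 4.0]), G_n=np.eye(Q),
    g0=np.zeros(Q), G0=np.eye(Q), rng=rng,
)
```

With equal prior and likelihood precisions, the posterior mean of g_nk lies halfway between g₀ and θ_nk, which is an easy sanity check on the update.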

Algorithm 1.

MCMC inference of parametric dictionary learning variables

Require: Data x_n, hyperparameters ℋ_n
1: for n = 1, …, N do
2:     for m = 1, …, M do
3:         Sample atom indices $z_{n 1}^{(m)}, \dots, z_{n L}^{(m)}$ or $z_{n}^{(m)}$
4:         Find $ℐ_{D_{n}}^{(m)} = \{k_{1}^{'}, \dots, k_{L}^{'}\}$ s.t. $z_{n l_{k_{l}^{'}}}^{(m)} = 1$
5:         Sample atom coefficients $s_{n 1}^{(m)}, \dots, s_{n L}^{(m)}$
6:         Sample noise vector $ε_{n}^{(m)}$
7:         Sample dictionary parameters $θ_{n k}^{(m)}, k \notin ℐ_{D_{n}}^{(m)}$
8:         Sample dictionary priors $g_{n 1}^{(m)}, \dots, g_{n K}^{(m)}, G_{n}^{(m)}$
9:         Sample atom selection probability priors $π_{n}^{(m)}$
10:         Sample noise precision $γ_{ε_{n}}^{(m)}$
11:         Sample ${\tilde{θ}}_{n}^{(m)}$ for generating $θ_{n k}^{(m)}, k \in ℐ_{D_{n}}^{(m)}$
12:         Compute $D_{n}^{(m)} = [ϕ (θ_{n 1}^{(m)}) \dots ϕ (θ_{n K}^{(m)})]$
13:     end for
14: end for


E. Implementation of Bayesian DL

The problem variables (5) are inferred separately for each exemplar signal. This yields signal-specific estimates of the noise variance, which are useful for denoising applications. It also allows the computational cost of MCMC inference, which typically requires a large number of iterations to converge, to be shared across independent runs. Since more training samples are likely to require more MCMC iterations, training a single dictionary on the entire dataset would have significantly slowed down convergence because of the sequential nature of MCMC. Similar studies have acknowledged the difficulty of performing Bayesian inference on large datasets [43]. The MCMC inference procedure is outlined in Algorithm 1.
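The block structure of Algorithm 1 can be illustrated on a toy problem that is much simpler than the model of this paper. In the sketch below, the target is assumed to be b ~ N(0, 1) and a | b ~ N(b, 1), so that b | a ~ N(a/2, 1/2). Block a has a closed-form conditional and is drawn with Gibbs, while block b is drawn with an independence MH step whose proposal N(a/2, 1) is centered at the mode of its conditional. The target, the proposal and all names are hypothetical choices for illustration.

```python
import math
import random


def mh_within_gibbs(n_iter, seed=0):
    """Toy two-block sampler: block `a` by Gibbs, block `b` by independence MH."""
    rng = random.Random(seed)
    a, b = 0.0, 0.0
    samples, accepted = [], 0

    def log_weight(value, a_now):
        # log of target / proposal for block b, up to an additive constant:
        # target N(a/2, variance 1/2), proposal N(a/2, variance 1).
        centered = value - 0.5 * a_now
        return -centered ** 2 + 0.5 * centered ** 2

    for _ in range(n_iter):
        # Gibbs step: a | b has a closed-form Gaussian conditional.
        a = rng.gauss(b, 1.0)
        # MH step: propose b from a density centered at the conditional mode.
        candidate = rng.gauss(0.5 * a, 1.0)
        log_ratio = log_weight(candidate, a) - log_weight(b, a)
        if rng.random() < math.exp(min(0.0, log_ratio)):
            b = candidate
            accepted += 1
        samples.append((a, b))
    return samples, accepted / n_iter
```

In this toy case the weight equals exp(−(b − a/2)²/2) up to a constant, which is bounded for every a. This is the kind of condition on the conditional weight that the convergence analysis of Section VI relies on.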

IV. Combination of Generated Dictionaries

The variables of the considered Bayesian problem are inferred separately for each exemplar signal, yielding sample-specific information that is useful for denoising and other applications, and allowing parallel implementations (Sections II, III). The latter is beneficial because the large amount of data in many applications renders batch DL methods computationally expensive or even prohibitive. However, in order to obtain generalizable dictionaries that are able to reliably represent unseen data, we need to combine the corresponding results into a unified model (Algorithm 2). While the Bayesian framework aims to maximize the posterior probability of the model parameters, DL is usually evaluated based on the reconstruction error. For this reason, the combination of the dictionaries that result from the Bayesian inference is performed based on a root mean square (RMS) error criterion.

Let $D_{n}^{(m)}$ be the dictionaries generated with the MH-within-Gibbs sampler (Section III-E). In order to evaluate their performance using signal reconstruction criteria, we need a sparse decomposition algorithm that can represent the original signal based on the inferred dictionaries. We use OMP [3], [4] because of its simplicity and effectiveness. Let ${Err}_{n}^{(m)}$ be the relative RMS error that results from decomposing x_n based on dictionary $D_{n}^{(m)}$ . Also let $D_{n}^{(m_{n}^{*})} \in ℜ^{P \times K}$ be the dictionaries that yield the lowest relative RMS error for each signal n and $Θ^{(m_{n}^{*})} = [θ_{n k_{1}^{'}}^{(m_{n}^{*})}, \dots, θ_{n k_{L}^{'}}^{(m_{n}^{*})}] \in ℜ^{Q \times L}$ the parameters of the atoms that were selected by OMP based on $D_{n}^{(m_{n}^{*})}$ . The matrices $Θ^{(m_{n}^{*})}$ are concatenated into a unified parameter matrix

Θ_{U} = [Θ^{(m_{1}^{*})} \dots Θ^{(m_{N}^{*})}] \in ℜ^{Q \times N L}

m_{n}^{*} = arg min_{m \in [1, M]} {Err}_{n}^{(m)}, n = 1, \dots, N

Algorithm 2.

Combination of generated dictionaries

Require: Data x_n, generated dictionaries $D_{n}^{(m)}$ , number of K-means clusters N_b
1:	for n = 1, …, N do
2:	for m = 1, …, M do
3:	Reconstruct x_n based on $D_{n}^{(m)}$ with OMP
4:	Compute relative RMS error ${Err}_{n}^{(m)}$
5:	end for
6:	Find $m_{n}^{*}$ such that $m_{n}^{*} = arg min_{m \in [1, M]} {Err}_{n}^{(m)}$
7:	Retrieve parameters of dictionary atoms selected by OMP based on $D_{n}^{(m_{n}^{*})}$: $Θ^{(m_{n}^{*})} = [θ_{n k_{1}^{'}}^{(m_{n}^{*})}, \dots, θ_{n k_{L}^{'}}^{(m_{n}^{*})}]$
8:	end for
9:	Concatenate $Θ_{U} = [Θ^{(m_{1}^{*})} \dots Θ^{(m_{N}^{*})}]$
10:	Quantize dictionary parameters Θ_Q=K-means(Θ_U)
11:	Compute the final dictionary D_Q = ϕ (Θ_Q) where ϕ operates on each column of matrix Θ_Q


Ideally, we could have used Θ_U as the parameters of our final dictionary. However, in practice the large amount of data makes a dictionary built from all columns of Θ_U computationally expensive to use. For this reason, the parameter vectors of Θ_U are further quantized with K-means clustering using N_b centers. This results in the parameter matrix Θ_Q with corresponding final dictionary D_Q. The aforementioned procedure is performed on the training data, while the test data are not seen at all during this step (Section VII-B5).
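The selection, concatenation and quantization steps of Algorithm 2 can be sketched as follows, assuming that the relative RMS errors and the parameters of the OMP-selected atoms have already been computed for every signal n and MCMC state m. The input layout, the function names and the deterministic K-means initialization are hypothetical choices for illustration; any standard K-means implementation could take the place of the plain Lloyd iterations used here.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(points, n_clusters, n_iter=20):
    """Plain Lloyd iterations; centers start at the first points (an arbitrary choice)."""
    centers = [list(p) for p in points[:n_clusters]]
    for _ in range(n_iter):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            groups[nearest].append(p)
        for j, group in enumerate(groups):
            if group:  # keep the previous center if a cluster becomes empty
                centers[j] = [sum(column) / len(group) for column in zip(*group)]
    return centers


def combine_dictionaries(errors, selected_params, n_clusters):
    """errors[n][m]: relative RMS error of signal n under dictionary m.
    selected_params[n][m]: parameter vectors of the atoms OMP selected from dictionary m.
    """
    theta_u = []
    for errors_n, params_n in zip(errors, selected_params):
        m_star = min(range(len(errors_n)), key=lambda m: errors_n[m])
        theta_u.extend(params_n[m_star])
    return kmeans(theta_u, n_clusters)  # quantized parameters, one row per center
```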

This approach can be applied to combine dictionaries learnt from any other method, therefore it also provides a unified platform that allows comparison of the considered parametric DL approaches in our paper (Section VII).

V. Choosing the Parametric Dictionary Function

The choice of the parametric dictionary function is not trivial and is usually guided by the application of interest. If designed appropriately, parametric dictionaries can yield interpretable information about meaningful signal characteristics for a variety of applications. Previous studies have used Gabor [69] and Gammatone [24] atoms to represent speech signals because of their good localization properties and similarities to the human auditory system. Other efforts have proposed Gaussian-like functions to efficiently capture spherical stereo images [70], diffusion-based dictionaries to model MRI [25], and other wavelet-like atoms for digitizing fingerprint images [71]. Gabor dictionaries have been used for the electroencephalogram (EEG) [72], spline wavelets for the electrocardiogram (ECG) [73], and sigmoid-exponential functions for the electrodermal activity (EDA) [28].

Our Bayesian parametric DL approach generates the parameters of the dictionary atoms based on the MH sampler (Section III-B). The mean of the corresponding distribution depends on the considered parametric function and is selected so that it maximizes the corresponding posterior distribution (7). We will further show that the concavity of (7) generally depends on function ϕ. It is unusual that a function ϕ is concave with respect to all the parameters of interest, but in practice even the estimation of local optima is enough to generate useful samples (Section VII-C).

The posterior (7) of the dictionary atom parameters is:

U ({\tilde{θ}}_{n}) = ln (p ({\tilde{θ}}_{n} | y_{- {\tilde{θ}}_{n}}, X, ℋ)) \propto - \frac{γ_{ε_{n}}}{2} {‖ x_{n} - S_{n} {\tilde{ϕ}}_{n} ({\tilde{θ}}_{n}) ‖}_{2}^{2} - \frac{1}{2} {({\tilde{θ}}_{n} - {\tilde{g}}_{n})}^{T} {\tilde{G}}_{n} ({\tilde{θ}}_{n} - {\tilde{g}}_{n})

(19)

If we set ψ_n (θ̃_n) = x_n − S_n ϕ̃_n (θ̃_n), then (19) becomes

U ({\tilde{θ}}_{n}) = - \frac{1}{2} {({\tilde{θ}}_{n} - {\tilde{g}}_{n})}^{T} {\tilde{G}}_{n} ({\tilde{θ}}_{n} - {\tilde{g}}_{n}) - \frac{γ_{ε_{n}}}{2} {‖ ψ_{n} ({\tilde{θ}}_{n}) ‖}^{2}

(20)

The gradient vector and Hessian matrix of (20) are

\nabla_{{\tilde{θ}}_{n}} U ({\tilde{θ}}_{n}) = - {\tilde{G}}_{n} ({\tilde{θ}}_{n} - {\tilde{g}}_{n}) - γ_{ε_{n}} {(\nabla_{{\tilde{θ}}_{n}} ψ_{n} ({\tilde{θ}}_{n}))}^{T} ψ_{n} ({\tilde{θ}}_{n})

(21)

H_{U} = \nabla_{{\tilde{θ}}_{n}}^{2} U ({\tilde{θ}}_{n}) = - {\tilde{G}}_{n} - \frac{γ_{ε_{n}}}{2} \nabla_{{\tilde{θ}}_{n}}^{2} {‖ ψ_{n} ({\tilde{θ}}_{n}) ‖}_{2}^{2}

(22)

where

{(\nabla_{{\tilde{θ}}_{n}}^{2} {‖ ψ_{n} ({\tilde{θ}}_{n}) ‖}_{2}^{2})}_{i j} = \frac{\partial^{2}}{\partial {\tilde{θ}}_{n_{i}} \partial {\tilde{θ}}_{n_{j}}} {‖ ψ_{n} ({\tilde{θ}}_{n}) ‖}_{2}^{2} = \frac{\partial}{\partial {\tilde{θ}}_{n_{i}}} (2 \sum_{p = 1}^{P L} ψ_{n_{p}} ({\tilde{θ}}_{n}) \frac{\partial ψ_{n_{p}} ({\tilde{θ}}_{n})}{\partial {\tilde{θ}}_{n_{j}}}) = 2 \sum_{p = 1}^{P L} \frac{\partial ψ_{n_{p}} ({\tilde{θ}}_{n})}{\partial {\tilde{θ}}_{n_{i}}} \frac{\partial ψ_{n_{p}} ({\tilde{θ}}_{n})}{\partial {\tilde{θ}}_{n_{j}}} + 2 \sum_{p = 1}^{P L} ψ_{n_{p}} ({\tilde{θ}}_{n}) \frac{\partial^{2} ψ_{n_{p}} ({\tilde{θ}}_{n})}{\partial {\tilde{θ}}_{n_{i}} \partial {\tilde{θ}}_{n_{j}}}

(23)

If H_{ψ_{n_p}} is the Hessian of ψ_{n_p}, p = 1, …, PL, then from (23)

\nabla_{{\tilde{θ}}_{n}}^{2} {‖ ψ_{n} ({\tilde{θ}}_{n}) ‖}_{2}^{2} = 2 {(\nabla_{{\tilde{θ}}_{n}} ψ_{n} ({\tilde{θ}}_{n}))}^{T} (\nabla_{{\tilde{θ}}_{n}} ψ_{n} ({\tilde{θ}}_{n})) + 2 \sum_{p = 1}^{P} ψ_{n_{p}} ({\tilde{θ}}_{n}) H_{ψ_{n_{p}}}

(24)

The Hessian of ψ_{n_p} can be computed as

{(H_{ψ_{n_{p}}})}_{i j} = - \sum_{l = 1}^{L} s_{n l} \frac{\partial^{2} ϕ_{p} (θ_{n k_{l}^{'}})}{\partial θ_{n k_{l}^{'}, i} \partial θ_{n k_{l}^{'}, j}}

(25)

H_{ψ_{n_{p}}} = - \sum_{l = 1}^{L} s_{n l} H_{ϕ_{p}} |_{θ = θ_{n k_{l}^{'}}}

(26)

From (22), (24) and (26), we get

H_{U} = - {\tilde{G}}_{n} - γ_{ε_{n}} {(\nabla_{{\tilde{θ}}_{n}} ψ_{n} ({\tilde{θ}}_{n}))}^{T} (\nabla_{{\tilde{θ}}_{n}} ψ_{n} ({\tilde{θ}}_{n})) + γ_{ε_{n}} \sum_{p = 1}^{P} ψ_{n_{p}} ({\tilde{θ}}_{n}) \sum_{l = 1}^{L} s_{n l} H_{ϕ_{p}} |_{θ = θ_{n k_{l}^{'}}}

(27)

The first two terms of (27) are negative semidefinite, since the precision matrix G̃_n and the Gram matrix of the Jacobian of ψ_n are positive semidefinite and enter (27) with a negative sign. The definiteness of the third term, and therefore the concavity of U, depends on ϕ. In our experiments (Section VII) and in the literature mentioned in this section, function ϕ is selected with a knowledge-driven approach based on the application of interest. Although this does not guarantee that the generated atom parameters are the global maxima, in practice the corresponding sampling method yields satisfactory results (Section VII-C).
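A scalar instance of (27) makes the role of each term concrete. The sketch below assumes a single atom (L = 1) with one parameter θ and the illustrative decaying atom ϕ_p(θ) = exp(−t_p/θ), which is not one of the atoms used in this paper. It evaluates the three terms of the second derivative of U and compares their sum with a central finite difference. All names and numerical values are hypothetical.

```python
import math


def atom(theta, t):
    return math.exp(-t / theta)


def d_atom(theta, t):
    # First derivative of exp(-t / theta) with respect to theta.
    return atom(theta, t) * t / theta ** 2


def d2_atom(theta, t):
    # Second derivative of exp(-t / theta) with respect to theta.
    return atom(theta, t) * (t * t / theta ** 4 - 2.0 * t / theta ** 3)


def log_posterior(theta, x, times, s, g, G, gamma):
    psi = [xp - s * atom(theta, t) for xp, t in zip(x, times)]
    return -0.5 * G * (theta - g) ** 2 - 0.5 * gamma * sum(p * p for p in psi)


def hessian_terms(theta, x, times, s, G, gamma):
    """The three terms of (27) for a scalar parameter and a single atom."""
    psi = [xp - s * atom(theta, t) for xp, t in zip(x, times)]
    prior_term = -G
    jacobian_term = -gamma * s * s * sum(d_atom(theta, t) ** 2 for t in times)
    curvature_term = gamma * s * sum(p * d2_atom(theta, t) for p, t in zip(psi, times))
    return prior_term, jacobian_term, curvature_term


times = list(range(10))
x = [2.0 * atom(4.0, t) for t in times]  # toy signal built from a different parameter
theta, s, g, G, gamma, h = 3.0, 1.5, 3.5, 0.5, 2.0, 1e-3
terms = hessian_terms(theta, x, times, s, G, gamma)
numeric = (
    log_posterior(theta + h, x, times, s, g, G, gamma)
    - 2.0 * log_posterior(theta, x, times, s, g, G, gamma)
    + log_posterior(theta - h, x, times, s, g, G, gamma)
) / h ** 2
```

The first two terms are non-positive for any choice of atom, while the sign of the third depends on the residual and on the second derivative of the atom, so local concavity has to be checked for each ϕ.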

VI. MCMC Convergence for Bayesian DL

Ergodicity properties of large-dimensional MC have been the subject of several studies [60], [74]. We discuss the uniform ergodicity property of the considered MC which implies convergence independently of the initial state [51], [52]. We focus on the sampling with replacement case, although our discussion can be extended for atom sampling without replacement. We show that the MC used for inferring the variables of the Bayesian DL problem is uniformly ergodic by proving that it meets the conditions of Theorem 6.1. This theorem establishes uniform ergodicity for the MH-within-Gibbs sampler and its proof can be found in [61].

Let P be a Markov transition kernel (Section III-A) and Y^(m), Y^(m+1) be two consecutive MCMC states generating observations y^(m), y^(m+1). We will assume that $y_{b}^{(m + 1)}$ and $y_{b}^{(m)}$ are the values of block b at the current and previous MCMC state, denoted by m + 1 and m, respectively. Also, $y_{- y_{b}}^{(m + 1)}$ contains all variables that have already been sampled at state m + 1 and $y_{- y_{b}}^{(m)}$ the ones from the previous state m.

Definition 6.1 (Conditional Weight Function)

If block b is generated with the MH sampler, its conditional weight function is defined as

w_{y_{b}} ({y_{b}}^{(m + 1)} | {y_{b}}^{(m)}, {y_{{- y}_{b}}}^{(m)}, {y_{{- y}_{b}}}^{(m + 1)}) = \frac{p ({y_{b}}^{(m + 1)} | {y_{{- y}_{b}}}^{(m)}, {y_{{- y}_{b}}}^{(m + 1)})}{q ({y_{b}}^{(m + 1)} | {y_{b}}^{(m)}, {y_{{- y}_{b}}}^{(m)}, {y_{{- y}_{b}}}^{(m + 1)})}

(28)

Theorem 6.1 (MH-within-Gibbs Uniform Ergodicity)

Given p(y|X, ℋ) on state space y = [y₁^T … y_B^T]^T, let P = P₁ … P_B be the Markov kernel of the Gibbs sampler and Q_b, b ∈ ℐ_B′, where ℐ_B′ = {b_i₁, …, b_{i_B′}}, B′ < B, be the Markov kernel of the MH sampler with conditional weight w_{y_b′} as in (28). If each conditional weight w_{y_b′}, b′ ∈ ℐ_B′, of the MH sampler is bounded, sup w_{y_b′} < ∞, and the Gibbs sampler with Markov kernel P₁ … P_B is uniformly ergodic, then the MH-within-Gibbs sampler, resulting from substituting kernels P_b′ with Q_b′, b′ ∈ ℐ_B′, is also uniformly ergodic.

Based on Theorem 6.1, we show the MCMC uniform ergodicity for our case. Theorem 6.2 describes the minorization condition of the Gibbs sampler and constitutes a multivariate extension of Proposition 2 of Jones et al. [61]. Lemma 6.1 assists with its proof and ensures the minorization of the first B − 1 blocks. Lemma 6.2 ensures that the conditional weight for θ_k is bounded. Finally, Theorem 6.3 combines all of the above to prove the uniform ergodicity of the considered MC. The proofs of Theorems 6.2 and 6.3 and of Lemmas 6.1 and 6.2 can be found in Appendix A.

Lemma 6.1 (Minorization Condition of B − 1 Blocks)

Let P((Y₁, Y₂, …, Y_B), A₁ × A₂ × … × A_B) be the Markov kernel of a Gibbs sampler, where A₁, …, A_B are elements of the Borel σ-algebra on the variable space Y₁, …, Y_B, respectively. We further assume that all updates of the Gibbs sampler, except the one corresponding to Y_B, are minorisable, in the sense that for b = 1, …, B − 1, there is ε_b > 0 and a probability measure ν_b, such that P_{Y_b}(y_{−y_b}, A_{−b}) ≥ ε_b ν_b(A_{−b}), where A_{−b} = A₁ × … × A_{b−1} × A_{b+1} × … × A_B. Then the Gibbs sampler with Markov kernel P₁ … P_{B−1} is minorisable, i.e. there exist ε₀ > 0 and a probability measure ν₀ such that

P ((Y_{1}, Y_{2}, \dots, Y_{B - 1}), A_{- B}) \geq ε_{0} ν_{0} (A_{- B})

Theorem 6.2 (Partial Minorization Condition for Gibbs)

Let the assumptions of Lemma 6.1 hold. Then the Gibbs sampler with Markov kernel P₁ … P_B is minorisable.

Lemma 6.2 (Bounded Conditional Weight of Dictionary Parameters)

Let the same assumptions from Lemma 6.1 hold. The conditional weight of the B^th block that includes the “super-vector” θ̃_n of dictionary parameters and is generated with MH according to (8)–(10), is bounded, i.e. sup w_{θ̃_n} < ∞.

Theorem 6.3 (MC Uniform Ergodicity for Bayesian DL Inference)

Let y_n be the variables of the parametric DL problem (4) generated with the MH-within-Gibbs sampler (Algorithm 1), according to which all variables except θ̃_n are sampled with Gibbs (first B − 1 blocks), while θ̃_n is sampled with MH (B^th block). Then the corresponding MC {Y⁽¹⁾, Y⁽²⁾, …} is uniformly ergodic.

VII. Experiments

We compare the Bayesian DL model against the previously proposed parametric SD [22] and ETF [24], which are conceptually closer to our approach. We further perform experiments with the non-parametric K-SVD [14] yielding dictionaries of arbitrary structure, therefore not directly comparable to our approach. We use synthetic and real biomedical data from EDA signals. Because of their characteristic structure, these types of signals favor sparse representation approaches with parametric dictionaries providing interpretable information [28].

EDA is decomposed into a slow moving tonic part depicting the general trend and a phasic part which contains fast fluctuations superimposed onto the tonic signal, also called skin conductance responses (SCR). The tonic part is mathematically expressed as a straight line, while SCRs are represented by sigmoid-exponential functions with a steep rise and slow recovery. Taking this into account, dictionaries contain tonic and phasic atoms as shown in Table III. Since SCR shapes typically contain higher variability than the signal level, which remains fairly constant throughout an analysis window, for the sake of simplicity we perform DL on the phasic atoms for learning the parameters θ = [t₀, T_rise, T_decay]^T. The initial dictionaries are created from the combination of all parameters reported in Table III, resulting in 63 tonic and 144 phasic atoms. The analysis window is 5sec, i.e. 160 samples with typical sampling frequency of 32Hz [28].

TABLE III.

Description of EDA-specific dictionary atoms and initial parameters.

Tonic Atoms

ϕ₁(t) = Δ₀ + Δ · t

Δ₀ ∈ {−20, −10, 1}
Δ ∈ {−0.01, −0.009, …, 0, 0.01, 0.02, …, 0.1}

Phasic Atoms

ϕ_{2} (t) = \frac{e^{- \frac{s t - t_{0}}{T_{decay}}}}{{[1 + {(\frac{s t - t_{0}}{T_{rise}})}^{- 2}]}^{2}} u (t - t_{0})

T_rise ∈ {8, 14, 18}
T_decay ∈ {10, 15, 20}
t₀ ∈ {0, 10, 20, …, 150}


u(t) = 1, t ≥ 0 and u(t) = 0 otherwise
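The atom counts quoted in the text follow directly from the parameter grids of Table III. The sketch below enumerates them, assuming that the ellipses in the table denote uniform steps of 0.001 and 0.01 for Δ and of 10 samples for t₀. The variable names are illustrative.

```python
from itertools import product

# Tonic atoms: offset and slope grids (uniform steps are assumed from the ellipses).
delta0 = [-20, -10, 1]
delta = [round(-0.01 + 0.001 * i, 3) for i in range(11)] + [
    round(0.01 * i, 2) for i in range(1, 11)
]
tonic_params = list(product(delta0, delta))

# Phasic atoms: time offset, rise time and decay time grids.
t0_grid = list(range(0, 151, 10))
t_rise_grid = [8, 14, 18]
t_decay_grid = [10, 15, 20]
phasic_params = list(product(t0_grid, t_rise_grid, t_decay_grid))

print(len(tonic_params), len(phasic_params))
```

With 3 offsets and 21 slopes this gives 63 tonic parameter pairs, and with 16 time offsets, 3 rise times and 3 decay times it gives 144 phasic parameter triplets, matching the counts reported above.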

A critical issue of MCMC is whether a certain number of iterations is enough to stop sampling. In high-dimensional problems, all inferred variables need to converge to the target distribution. Since examining each variable separately is not always feasible, we use a combination of monitoring and diagnostic strategies to quantitatively assess MCMC convergence.

In the following, we describe the data (Section VII-A), the experimental setup (Section VII-B), and the results evaluating the learned dictionaries (Section VII-C) and providing diagnostics for MCMC convergence (Section VII-D).

A. Data Description

1) Synthetic Data

We randomly generate 1000 synthetic signals that simulate the EDA structure. Each signal is expressed as the sum of a constant c and R number of SCRs

x (t) = c + \sum_{r = 1}^{R} ϕ_{2}^{(r)} (t)

(29)

where $ϕ_{2}^{(r)}$ is given in Table III with parameters $T_{rise}^{(r)}, T_{decay}^{(r)}$ and $t_{0}^{(r)}$ . In contrast to the rest of the paper, in which superscript “(·)” denotes the MCMC state, here it symbolizes the SCR index. The parameters of the synthetic data are randomly generated within a pre-specified range $t_{0}^{(r)} \in [1, 150]$ samples, $T_{rise}^{(r)}, T_{decay}^{(r)} \in [1, 20]$ , and R ∈ [1, 5]. Since the number of SCRs for each signal is known a priori, DL was performed with K = R + 1 atoms, of which one captures the tonic part and the remaining R capture the phasic part.
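A synthetic signal of the form (29) can be generated as in the following sketch. It reads the phasic atom of Table III as a decaying exponential divided by a squared sigmoid-like term, with unit time scale (s = 1), and draws all parameters uniformly. The uniform distributions, the range of the constant c and all names are assumptions, since the text only specifies the parameter ranges.

```python
import math
import random


def phasic_atom(t, t0, t_rise, t_decay):
    """Sigmoid-exponential SCR shape; assumes s = 1 and a decaying exponential."""
    if t <= t0:
        return 0.0  # u(t - t0) = 0 before onset; the atom also tends to 0 at onset
    tau = t - t0
    ratio = t_rise / tau
    return math.exp(-tau / t_decay) / (1.0 + ratio * ratio) ** 2


def synthetic_signal(rng, n_samples=160):
    """Constant level plus R randomly parameterized SCRs, as in (29)."""
    c = rng.uniform(0.0, 1.0)  # hypothetical range: not given in the text
    n_scr = rng.randint(1, 5)  # R in [1, 5]
    scrs = [
        (rng.uniform(1, 150), rng.uniform(1, 20), rng.uniform(1, 20))
        for _ in range(n_scr)
    ]
    signal = [c + sum(phasic_atom(t, *p) for p in scrs) for t in range(n_samples)]
    return signal, c, scrs
```

A call such as `synthetic_signal(random.Random(0))` returns the 160-sample signal together with the constant and the (t₀, T_rise, T_decay) triplets that generated it, so that learned parameters can be compared with the ground truth.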

2) Real Data

We further evaluate the considered DL methods on human EDA data from the database of emotion analysis using physiological signals (DEAP) [75]. DEAP contains 40 one-minute recordings from 32 subjects watching long excerpts of music videos designed to study the relation of multimedia content with mood and temperament [75]. Because of the expected SCR rate in the considered 5sec analysis window [76], training was performed with K = 3 atoms, although similar results are obtained for different values.

B. Experimental Setup

1) Bayesian DL

The proposed MCMC inference (Algorithm 1) is performed with 1000 and 500 iterations for the synthetic and real data, respectively. The atom indices z_nl, coefficients s_n and selection probabilities π_n are initialized based on the decomposition of each exemplar signal with OMP. The mean μ_{s_n} of the coefficients’ prior was initialized with the average of the coefficients of the selected atoms from OMP. The mean and precision of dictionary atoms’ prior g₀ and G₀ were initialized with the mean and inverse covariance matrix of the initial parameters (Table III). The scale matrix R₀ of the Wishart distribution for sampling G_n was also initialized with the covariance of dictionary parameters. The remaining hyperparameters were empirically set to α_k = 2, e = 1, f = 2, ν₀ = 3, γ_s = 10, and γ_e = 8.

Dictionary combination (Algorithm 2) was performed on the generated vector parameters [T_rise, T_decay]^T with N_b = 100, 225, 400. We did not include the time offset t₀ in this procedure, since it does not contribute to the shape of the dictionary atoms; the final dictionaries should contain the learnt versions of phasic atoms shifted across the entire analysis frame. Therefore, the quantized matrix of dictionary parameters Θ_Q ∈ ℜ^{2×N_b} is replicated for all values of t₀ (Table III), yielding final dictionaries with 16N_b phasic atoms.
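The replication step can be sketched as follows, with the quantized shape parameters stored as (T_rise, T_decay) pairs. The function name and data layout are illustrative.

```python
def replicate_across_offsets(theta_q, offsets):
    """Pair every quantized (t_rise, t_decay) vector with every time offset t0."""
    return [(t0, t_rise, t_decay) for t0 in offsets for (t_rise, t_decay) in theta_q]


# 16 time offsets from Table III and a placeholder set of N_b = 100 shape vectors.
offsets = list(range(0, 151, 10))
theta_q = [(8.0 + 0.1 * i, 10.0 + 0.1 * i) for i in range(100)]
final_params = replicate_across_offsets(theta_q, offsets)
```

For N_b = 100 this produces 1600 phasic parameter triplets, in line with the 16N_b atoms stated above.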

2) Steepest descent DL

Dictionaries are trained in parallel for each exemplar signal using SD [22], in which OMP alternates with a least-squares fit step for estimating the parameters of the selected atoms. SD was run for 250 iterations. We perform the same procedure for combining the learnt parameters, yielding dictionaries of the same size as those obtained with MCMC.

3) Equiangular tight frame DL

This method alternates between finding the dictionary with minimum Frobenius distance from the ETF and updating the corresponding parameters [24]. These steps involve an ETF relaxation constraint value and a gradient descent step, both crucial for the overall performance and referred to in [24] as α_k and ε. In our experiments, dictionaries were trained with different combinations of α_k ∈ {0.1, 0.5, 0.9} and ε ∈ {0.001, 0.002, 0.003, 0.004}. The dictionary { $α_{k}^{*}$ , ε*} that showed the lowest relative RMS error on the training data was used to evaluate the corresponding test data. In order to ensure the same dictionary size as the other approaches, parameters are uniformly initialized within the intervals T_rise ∈ [8, 18], T_decay ∈ [10, 20] with $\sqrt{N_{b}} = 10, 15, 20$ different values.

4) K-SVD

K-SVD is a classical non-parametric DL method generalizing K-means and iteratively estimating each dictionary atom in order to minimize the reconstruction error [14]. The original dictionaries used in this method were the same as the initial ones of the aforementioned approaches.

5) Evaluation

Evaluation of the final dictionaries is performed based on the relative RMS error computed between the original signal and its corresponding approximation through OMP based on the final dictionaries. RMS error is a common metric in related studies [14], [15], [24]. In order to ensure that our results reflect the ability of the algorithms to represent unseen data, all experiments are performed within a 10-fold cross-validation for the synthetic data and a leave-one-subject-out cross-validation setup for the real EDA signals. We note that OMP has two functionalities in our framework. It serves as a decomposition technique, based on which the learned dictionaries are evaluated against the test data, and is also used at the dictionary combination step (Section IV) in order to “prune” the atoms of the generated dictionaries. OMP does not operate on the test data during this dictionary combination framework, but rather at the evaluation step, during which the final dictionary is already created independently of the test set.
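The evaluation step can be sketched with a compact OMP and a relative RMS error taken here as the ratio of the residual norm to the signal norm. This normalization is an assumption, since the exact definition is not restated in this section, and the pure-Python linear solver is only meant for the small supports used in sparse decompositions. All function names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def solve(matrix, rhs):
    """Solve a small square system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    solution = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * solution[c] for c in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution


def omp(signal, atoms, n_iter):
    """Orthogonal matching pursuit; assumes linearly independent selected atoms."""
    support, coeffs, residual = [], [], list(signal)
    for _ in range(n_iter):
        candidates = [k for k in range(len(atoms)) if k not in support]
        best = max(candidates, key=lambda k: abs(dot(residual, atoms[k])))
        support.append(best)
        # Least-squares refit of all selected coefficients (normal equations).
        gram = [[dot(atoms[i], atoms[j]) for j in support] for i in support]
        coeffs = solve(gram, [dot(atoms[i], signal) for i in support])
        approx = [
            sum(c * atoms[k][p] for c, k in zip(coeffs, support))
            for p in range(len(signal))
        ]
        residual = [x - a for x, a in zip(signal, approx)]
    return support, coeffs, residual


def relative_rms_error(signal, residual):
    return math.sqrt(dot(residual, residual) / dot(signal, signal))
```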

Besides signal reconstruction, dictionary coherence is an additional evaluation criterion. It is defined as the largest absolute inner product between any two distinct atoms of the dictionary and is an important property related to signal reconstruction quality [24]. We computed the resulting coherence of the dictionaries after training.
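A minimal sketch of this computation is given below. It normalizes every atom to unit l2-norm before taking inner products, which is an assumption consistent with the unit-norm atoms used elsewhere in the paper. The function name is illustrative.

```python
import math


def coherence(atoms):
    """Largest absolute inner product between two distinct unit-normalized atoms."""
    normalized = []
    for atom in atoms:
        norm = math.sqrt(sum(v * v for v in atom))
        normalized.append([v / norm for v in atom])
    return max(
        abs(sum(a * b for a, b in zip(normalized[i], normalized[j])))
        for i in range(len(atoms))
        for j in range(i + 1, len(atoms))
    )
```

Two atoms that differ only by a scale factor give a coherence of 1, which is the value a dictionary with repeated atom shapes would attain.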

C. Results

Visual inspection of the final dictionaries indicates that SD does not always preserve the initial dictionary structure (Figs. 1a–b), while ETF yields less variable dictionaries (Fig. 1c). MCMC results in a variety of atoms preserving the original shape (Figs. 1e–f). In contrast to parametric DL, non-parametric K-SVD yields very unstructured, non-interpretable atoms (Fig. 1d). Further comparison between an exemplar input and the reconstructed signals suggests that our approach provides superior signal representation (Fig. 2).

Fig. 1 — Example of initial dictionary and dictionaries learnt with Steepest Descent (SD), Equiangular Tight Frame (ETF), K-SVD and Markov Chain Monte Carlo Bayesian inference (MCMC) using atom sampling with and without replacement (w-, w-o rplcm). Dictionary combination is performed with N_b = 400. An instance of phasic atoms shifted with t₀ = 30 is shown.

Fig. 2 — Example of input EDA signal (solid cyan line) and reconstructed signals using the original dictionary (blue dashed line) and the dictionaries learnt with Steepest Descent (SD), Equiangular Tight Frame (ETF), and Markov Chain Monte Carlo Bayesian inference (MCMC) (black-triangle, green-asterisk, and red-circled lines, respectively). Reconstruction is performed using orthogonal matching pursuit with 4 iterations.

Dictionaries learnt from our proposed Bayesian DL yield lower reconstruction errors compared to the initial ones and the ones learnt through SD and ETF (Fig. 3). The results of atom sampling with and without replacement do not differ significantly, since the two distributions are very similar for K≫L. Dictionaries trained using SD perform quite poorly on unseen synthetic data (Figs. 3a–c), which might occur because the simple structure of these data makes least-squares-based methods prone to significant overfitting. Although ETF DL is not prone to overfitting, since it does not take exemplar data into account during training, it lacks adaptation to more complex real data (Figs. 3d–f). K-SVD appears more accurate for real signals than the parametric approaches (Figs. 3d–f), indicating its ability to learn more complex patterns, not necessarily of the same structure as the initial dictionaries. Similar results occur for different dictionary sizes, omitted here for the sake of simplicity.

Fig. 3 — Relative root mean square (RMS) error between original and reconstructed signal with respect to the number of (orthogonal) matching pursuit (OMP) iterations. Dictionaries are learnt with Steepest Descent (SD), Equiangular Tight Frame (ETF), K-SVD, and Markov Chain Monte Carlo Bayesian inference (MCMC). During MCMC atom sampling is performed with (w-) and without (w-o) replacement (rplcm).

The large number of atoms generally yields dictionaries of high coherence. All considered methods appear quite equivalent, with ETF achieving the lowest coherence, since this is included as an optimization metric (Table IV).

TABLE IV.

Final dictionary coherence

Method	Synthetic Data	Real Data

Initial	1	1
SD	0.9990	0.9992
ETF	0.9988	0.9990
K-SVD	0.9989	0.9991
MCMC w- rplcm	0.9989	0.9991
MCMC w-o rplcm	0.9989	0.9991


D. MCMC Diagnostics

Besides theoretical analysis (Section VI), another way to examine MCMC convergence is to see how well the MC is mixing, which is usually achieved through visual inspection. Traceplots of several problem variables from an exemplar signal indicate that the considered chains move around the parameter space and are not confined to certain areas, suggesting good mixing (Figs. 4a–d). Density plots of the generated samples further validate that the estimated posteriors are close to the target distributions (Figs. 4e–h).

Fig. 4 — Example MCMC trace plots and generated posteriors for the first element of vectors containing the (a,e) dictionary atoms θ_n, (b,f) atom coefficients s_n, (c,g) atom priors π_n, and (d,h) for the noise precision γ_{ε_n}.

Convergence diagnostics can further quantitatively assess whether the generated samples are biased. We use the Geweke diagnostic [77], because it only requires one running chain and attempts to address issues of both bias and variance [78]. It takes two non-overlapping parts of the MC (usually the first 10% and the last 50% of the samples) and compares their means using a difference of means test. If the samples are drawn from the stationary distribution of the chain, the two means are expected to be equal. The test statistic is a standard Z-score which, under convergence, lies within two standard deviations of zero, i.e. |z| < 2 for the standard normal distribution. We perform the Geweke test for both datasets and atom sampling methods (Sections II-A, II-B) with the first 100 samples used as a burn-in period.
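A simplified version of this test can be sketched as follows. It uses plain sample variances for the standard error of each segment mean, whereas the Geweke diagnostic proper relies on spectral density estimates that account for autocorrelation within the chain, so this sketch tends to overstate |z| for strongly correlated chains. The function name and defaults are illustrative.

```python
import math


def geweke_z(chain, first=0.1, last=0.5):
    """Difference-of-means z-score between the start and the end of a chain.

    Simplification: uses sample variances instead of spectral density estimates.
    """
    n = len(chain)
    head = chain[: int(first * n)]
    tail = chain[n - int(last * n):]

    def mean_and_variance(segment):
        mean = sum(segment) / len(segment)
        variance = sum((v - mean) ** 2 for v in segment) / (len(segment) - 1)
        return mean, variance

    mean_head, var_head = mean_and_variance(head)
    mean_tail, var_tail = mean_and_variance(tail)
    return (mean_head - mean_tail) / math.sqrt(var_head / len(head) + var_tail / len(tail))
```

In practice the burn-in samples would be discarded before calling the function, for example `geweke_z(chain[100:])`.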

The high dimensionality of our problem prevents us from reporting diagnostics for each variable separately, therefore we summarize the results for groups of variables. For each group of variables, we report the proportion of chains for which |z| < 2 (Table V). Most of the variables in our framework pass the Geweke diagnostic. The dictionary parameters, generated with MH, usually have lower success percentages. Although there is a small number of cases where the Geweke diagnostic fails, given the large number of variables and the signal reconstruction results (Section VII-C), the performed number of MCMC iterations appears to result in meaningful dictionaries learned through this process.

TABLE V.

Geweke MCMC diagnostic - P(|z| < 2)

		Synthetic Data		Real Data

		w- rplcm	w-o rplcm	w- rplcm	w-o rplcm

c_n		0.93	0.98	0.92	0.95
z_n		0.93	0.93	0.91	0.94
γ_{ε_n}		0.91	0.98	0.99	0.92
π_n		0.93	0.96	0.96	0.92

θ_n	t₀	0.68	0.70	0.72	0.75
	T_rise	0.91	0.92	0.87	0.85
	T_decay	0.92	0.92	0.86	0.85

g_n	t₀	0.75	0.80	0.78	0.79
	T_rise	0.90	0.87	0.89	0.87
	T_decay	0.90	0.86	0.91	0.85

G_n		0.81	0.75	0.82	0.83


MH acceptance rate refers to the fraction of candidate draws that are accepted and is important for convergence. Very high acceptance rates suggest that the chain is not mixing well, while very low rates might be inefficient. The acceptance rates for the variables of our problem (Table VI) are close to previously proposed optimal values in the literature [79] suggesting a good mixing of the considered MC.

TABLE VI.

Metropolis-Hastings acceptance rates (%)

Synthetic Data		Real Data
w- rplcm	w-o rplcm	w- rplcm	w-o rplcm

23.22	31.52	25.48	38.15


VIII. Discussion

An important benefit of Bayesian methods lies in providing estimates of the entire variable set of a problem. In the considered setup this can help infer the dictionary size, the optimal number of dictionary atoms for a given signal, and the corresponding noise levels. Estimation of the dictionary size is not as crucial in our greedy sparse representation approach, which is less affected by overcomplete dictionaries than basis pursuit methods [5]. Inferring the optimal number of dictionary atoms is important, as it can yield high compression rates and help interpret the underlying signal information. An extension of our method could impose discrete probability distributions on the number of atoms, which would then be inferred through MH.

Another advantage of Bayesian methods is that they are less prone to overfitting, which usually occurs with deterministic algorithms that might describe the random noise instead of the actual data [81]. This is reflected in our results (Section VII-C), since deterministic SD, which does not include prior knowledge of the signal structure, performs more poorly on unseen data. On the other hand, ETF avoids overfitting, but does not learn the morphology of the training data. Bayesian inference appears to strike a compromise between the two.

Inherent differences exist between parametric and non-parametric DL methods. The former impose a predetermined structure on the input space (through function ϕ) and learn the parameters that represent this structure from the training data. Their major benefit lies in their interpretability, since the considered dictionary atoms are able to meaningfully capture the characteristics of input signals and can be used for knowledge-driven classification and inference. On the other hand, the exemplar signals learned from non-parametric methods are hardly interpretable and can blindly represent the input space. Since the functionality and scope of these methods are so different, meaningful comparison is challenging. In the case of synthetic data, where function ϕ is a perfect match to the signal (i.e. input signals are built with the same functions as the dictionary atoms), our proposed parametric Bayesian approach is more reliable than K-SVD in learning the hidden atom parameters. This means that given a perfectly constructed dictionary function, Bayesian parametric DL yields better results, even compared to non-parametric approaches. However, in the case of real signals, function ϕ cannot always be perfectly selected, therefore parametric approaches seem to perform slightly worse than K-SVD. While a more precise function might have yielded lower errors, the problem of finding the optimal mapping function is still an active research topic [82].

Although our proposed approach is general for learning the parameters of the dictionary atoms, we need to have good knowledge about the appropriate mapping function ϕ between the parameters and the data. As discussed in Section V, the selection of ϕ is usually guided by the application of interest and can vary for different types of signals. In the case of 2D images, for example, the dictionary needs to be constructed using a function ϕ that captures the context of the image, such as a Gabor or wavelet-like function. In order to reduce complexity, we can convert the 2D image into a 1D vector with dimensionality equal to the number of pixels, as typically performed [14], [40], [44]. Given a function ϕ suitable for the image of interest, our Bayesian approach can learn the parameters that efficiently capture the shape of the data.

In this paper, for the sake of simplicity we considered a simple dictionary combination approach with K-means quantization (Section IV). However, there exist more sophisticated approaches of block-coordinate descent with warm restarts [17] and weighted batch averaging [43].

Convergence is a critical component of MCMC in the context of Bayesian inference. In Section VI, we discussed the uniform ergodicity of the corresponding MH-within-Gibbs sampler, ensuring that the MC converges to the invariant distribution. Convergence rate is another important issue, for which previous studies have proposed explicit theoretical bounds in simple scenarios [59], [83]. The computation of these bounds can be quite difficult in such high-dimensional problems [84], therefore it goes beyond the scope of our study.

IX. Conclusions

We propose a sparse Bayesian model for parametric DL, whose problem variables follow appropriately selected probability distributions. We use MH-within-Gibbs to infer the corresponding variables, because of its ability to handle posterior distributions that cannot be analytically computed. We further show the uniform ergodicity of the proposed MCMC through the minorization of the corresponding Gibbs sampler and the bounded conditional weights of the MH. Our experiments on synthetic and real biomedical signals indicate that this approach offers benefits in terms of signal reconstruction compared to the previously proposed SD and ETF methods and also provides a good tradeoff between learning the signal structure and avoiding overfitting.

Supplementary Material

Supplemental Material: NIHMS782938-supplement-Supplemental_Material.pdf (237.2 KB, PDF)

Acknowledgments

This work was funded by the National Science Foundation and the National Institutes of Health.

Biographies

Theodora Chaspari (S’12) received the diploma in electrical and computer engineering from the National Technical University of Athens, Greece (2010) and the master’s degree from the University of Southern California (2012), where she is currently pursuing a Ph.D. degree. Since 2010 she has been a member of the Signal Analysis and Interpretation Laboratory (SAIL). Her research interests lie in the area of biomedical signal processing, speech analysis and behavioral signal processing.

Ms Chaspari is a recipient of the USC Annenberg Graduate Fellowship, USC WiSE Merit Fellowship, and the IEEE Signal Processing Society Travel Grant.

Andreas Tsiartas (S’10-M’14) received the B.Sc. degree in electronics and computer engineering from the Technical University of Crete, Crete, Greece, in 2006, and the M.Sc. and Ph.D. degrees in the Department of Electrical Engineering, University of Southern California (USC), Los Angeles, in 2014. He is currently a research engineer at SRI International. His main research direction focuses on speech-to-speech translation. Other research interests include acoustic and language modeling for automatic speech recognition (ASR) and voice activity detection.

Dr. Tsiartas received best teaching assistant awards for the years 2009 and 2010 in the Department of Electrical Engineering, USC. In 2006, he was awarded the Viterbi School Deans Doctoral Fellowship from USC.

Panagiotis Tsilifis was born in Aigio, Greece in 1985. He received his Diploma and M.Sc. degrees in Applied Mathematical Sciences from the National Technical University of Athens, Greece, in 2009 and 2011 respectively. In between, he also spent one year as an exchange student at the Royal Institute of Technology (KTH), in Stockholm, Sweden. He also received an M.A. in Applied Mathematics in 2014 from the University of Southern California (USC), Los Angeles, CA, USA. Currently he is pursuing his Ph.D. degree in Applied Mathematics at USC, working under the supervision of Prof. Roger G. Ghanem. His research focuses on Uncertainty Quantification methods including Bayesian approaches to optimal design and inference as well as construction of Polynomial Chaos surrogates with applications in geosciences, reservoir modeling and food microbiology.

Mr. Tsilifis received the Graduate Scholarship by the State Scholarship Foundation (IKY), Greece in 2010 and the Gerondelis Graduate Fellowship in 2013. He is also a SIAM student member and has received twice the SIAM student travel award, in 2015 and 2016.

Shrikanth S. Narayanan (S’88-M’95-SM’02-F’09) is Andrew J. Viterbi Professor of Engineering at the University of Southern California (USC), and holds appointments as Professor of Electrical Engineering, Computer Science, Linguistics, Psychology Neuroscience and Pediatrics and as the founding director of the Ming Hsieh Institute. Prior to USC he was with AT&T Bell Labs and AT&T Research from 1995-2000. At USC he directs the Signal Analysis and Interpretation Laboratory (SAIL). His research focuses on human-centered signal and information processing and systems modeling with an interdisciplinary emphasis on speech, audio, language, multimodal and biomedical problems and applications with direct societal relevance. [http://sail.usc.edu]

Prof. Narayanan is a Fellow of the Acoustical Society of America and the American Association for the Advancement of Science (AAAS) and a member of Tau Beta Pi, Phi Kappa Phi, and Eta Kappa Nu. He is also an Editor in Chief for the IEEE Journal of Selected Topics in Signal Processing and an Editor for the Computer Speech and Language Journal and an Associate Editor for the IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, IEEE TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING OVER NETWORKS, APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING and the Journal of the Acoustical Society of America. He was also previously an Associate Editor of the IEEE TRANSACTIONS OF SPEECH AND AUDIO PROCESSING (2000–2004), IEEE SIGNAL PROCESSING MAGAZINE (2005–2008) and the IEEE TRANSACTIONS ON MULTIMEDIA (2008–2011). He is a recipient of a number of honors including Best Transactions Paper awards from the IEEE Signal Processing Society in 2005 (with A. Potamianos) and in 2009 (with C. M. Lee) and selection as an IEEE Signal Processing Society Distinguished Lecturer for 2010–2011 and ISCA Distinguished Lecturer for 2015–2016. Papers co-authored with his students have won awards including the 2014 Ten-year Technical Impact Award from ACM ICMI and at Interspeech 2015 Nativeness Detection Challenge, 2014 Cognitive Load Challenge, 2013 Social Signal Challenge, Interspeech 2012 Speaker Trait Challenge, Interspeech 2011 Speaker State Challenge, InterSpeech 2013 and 2010, InterSpeech 2009 Emotion Challenge, IEE DCOSS 2009, IEEE MMSP 2007, IEEE MMSP 2006, ICASSP 2005 and ICSLP 2002. He has published over 650 papers and has been granted seventeen U.S. patents.

Appendix A

MCMC Ergodicity Proofs

We provide the proofs for the theorems and lemmas of Section VI.

Proof of Lemma 6.1

We assume that the sampling order is $Y_{1}, \dots, Y_{B-1}$. Let $\{y_{1}', \dots, y_{B-1}'\}$ and $\{y_{1}, \dots, y_{B}\}$ be the block variables at the current and previous MCMC state, respectively. The conditional probability for sampling the first $B-1$ blocks is

$$
\begin{aligned}
p(y_{1}', \dots, y_{B-1}' \mid y_{1}, \dots, y_{B})
&= p(y_{B-1}' \mid y_{1}', \dots, y_{B-2}', y_{B-1}, y_{B}) \cdot p(y_{1}', \dots, y_{B-2}' \mid y_{1}, \dots, y_{B}) \\
&= p(y_{B-1}' \mid y_{1}', \dots, y_{B-2}', y_{B-1}, y_{B}) \cdot p(y_{B-2}' \mid y_{1}', \dots, y_{B-3}', y_{B-2}, y_{B-1}, y_{B}) \cdots p(y_{1}' \mid y_{1}, \dots, y_{B}) \\
&= p(y_{B-1}' \mid y_{-y_{B-1}}) \cdot p(y_{B-2}' \mid y_{-y_{B-2}}) \cdots p(y_{1}' \mid y_{-y_{1}})
\end{aligned}
$$

where $y_{-y_{b}}$ contains all variables except $y_{b}$. The Markov kernel for the first $B-1$ blocks can be written as in (30). The first and second inequalities in (30) result from the minorization of the $(B-1)$-th and $(B-2)$-th blocks. Therefore there exist $ε_{0} = \prod_{b=1}^{B-1} ε_{b} > 0$ and $ν_{0} = \prod_{b=1}^{B-1} ν_{b}(A_{b})$ (a probability measure) that satisfy the desired inequality.

$$
\begin{aligned}
P((Y_{1}, Y_{2}, \dots, Y_{B-1}), A_{-B})
&= \int_{A_{1} \dots A_{B-1}} p(y_{1}', \dots, y_{B-1}' \mid y_{1}, \dots, y_{B}) \, μ(d(y'_{-B})) \\
&= \int_{A_{1} \dots A_{B-1}} p(y_{1}', \dots, y_{B-2}' \mid y_{-y_{1}}, \dots, y_{-y_{B-2}}) \, p(y_{B-1}' \mid y_{-y_{B-1}}) \, μ_{B-1}(dy_{B-1}') \dots μ_{1}(dy_{1}') \\
&= \int_{A_{1} \dots A_{B-2}} p(y_{1}', \dots, y_{B-2}' \mid y_{-y_{1}}, \dots, y_{-y_{B-2}}) \left( \int_{A_{B-1}} p(y_{B-1}' \mid y_{-y_{B-1}}) \, μ_{B-1}(dy_{B-1}') \right) μ_{B-2}(dy_{B-2}') \dots μ_{1}(dy_{1}') \\
&\geq ε_{B-1} ν_{B-1}(A_{B-1}) \int_{A_{1} \dots A_{B-3}} p(y_{1}', \dots, y_{B-3}' \mid \dots) \left( \int_{A_{B-2}} p(y_{B-2}' \mid y_{-y_{B-2}}) \, μ_{B-2}(dy_{B-2}') \right) μ_{B-3}(dy_{B-3}') \dots μ_{1}(dy_{1}') \\
&\geq \dots \geq \prod_{b=1}^{B-1} ε_{b} ν_{b}(A_{b})
\end{aligned}
\tag{30}
$$

Proof of Theorem 6.2

If we assume that the sampling order is $Y_{1}, \dots, Y_{B}$, the Markov kernel of the Gibbs sampler can be expressed as in (31). The first inequality in (31) results from Lemma 6.1 and yields a minorization condition for the entire Gibbs sampler.

$$
\begin{aligned}
P((Y_{1}, Y_{2}, \dots, Y_{B}), A_{1} \times A_{2} \times \dots \times A_{B})
&= \int_{A_{1} \dots A_{B}} p(y_{1}', \dots, y_{B}' \mid y_{1}, \dots, y_{B}) \, μ(d(y_{1}', \dots, y_{B}')) \\
&= \int_{A_{B}} p(y_{B}' \mid y_{1}', \dots, y_{B-1}', y_{B}) \int_{A_{1} \dots A_{B-1}} p(y_{1}', \dots, y_{B-1}' \mid y_{1}, \dots, y_{B-1}, y_{B}) \, μ(d(y_{1}', \dots, y_{B-1}')) \cdot μ_{B}(dy_{B}') \\
&\geq ε_{0} \int_{A_{B}} p(y_{B}' \mid y_{1}', \dots, y_{B-1}', y_{B}) \int_{A_{1} \dots A_{B-1}} ν_{0}(d(y_{1}', \dots, y_{B-1}')) \cdot μ_{B}(dy_{B}') \\
&= ε_{0} \int_{A_{1} \dots A_{B-1}} p(y_{-y_{B}}, A_{B}) \, ν_{0}(d(y_{1}', \dots, y_{B-1}'))
\end{aligned}
\tag{31}
$$
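To make the role of the minorization condition concrete, the sketch below computes a minorization pair (ε, ν) for a small finite-state transition kernel and the resulting standard bound (1 − ε)^n on the total variation distance to the stationary distribution. The 3-state kernel and the function names are purely illustrative assumptions; the kernels in our model are defined on continuous spaces, so this is only a finite analogue of the argument.

```python
def minorization_constant(kernel):
    """Return (eps, nu) such that kernel[x][y] >= eps * nu[y] for all states x, y."""
    n_states = len(kernel[0])
    # The column-wise minimum is the largest mass that every row places on state y.
    col_min = [min(row[y] for row in kernel) for y in range(n_states)]
    eps = sum(col_min)
    nu = [m / eps for m in col_min]  # normalised to a probability measure
    return eps, nu


def tv_bound(eps, n_steps):
    """Upper bound (1 - eps)^n on the total variation distance after n steps."""
    return (1.0 - eps) ** n_steps


# Illustrative 3-state transition kernel (each row sums to one).
kernel = [
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.3, 0.2, 0.5],
]
eps, nu = minorization_constant(kernel)
bound_after_10 = tv_bound(eps, 10)
```

Because the bound does not depend on the starting state, it illustrates why a one-step minorization over the whole state space implies uniform ergodicity.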

Proof of Lemma 6.2

From Def. 6.1, Table II, and (7)–(8),

$$
w_{y_{\tilde{θ}_{n}}} \propto \exp\left(-\frac{γ_{ε_{n}}}{2} \|ε_{n}\|^{2}\right) \frac{|\tilde{G}_{n}|^{\frac{1}{2}}}{|V_{\tilde{θ}_{n}}|^{-\frac{1}{2}}} \cdot \frac{\exp\left[-\frac{1}{2} (\tilde{θ}_{n} - \tilde{g}_{n})^{T} \tilde{G}_{n} (\tilde{θ}_{n} - \tilde{g}_{n})\right]}{\left[1 + \frac{1}{ν_{1} + Q} (\tilde{θ}_{n} - \hat{μ}_{\tilde{θ}_{n}})^{T} \hat{V}_{\tilde{θ}_{n}}^{-1} (\tilde{θ}_{n} - \hat{μ}_{\tilde{θ}_{n}})\right]^{-\frac{ν_{1} + Q}{2}}}
\tag{32}
$$

Using (10), we have

$$
\frac{|\tilde{G}_{n}|^{\frac{1}{2}}}{|V_{\tilde{θ}_{n}}|^{-\frac{1}{2}}} = τ |\tilde{G}_{n}|^{\frac{1}{2}} < \infty
\quad \text{and} \quad
\frac{1}{\prod_{k=1}^{K} |V_{θ_{k}}|^{-\frac{1}{2}}} = τ^{K} < \infty
$$

since $\tilde{G}_{n}$ is the precision matrix of a Gaussian distribution and therefore has finite eigenvalues and a finite determinant, and $τ < \infty$. Moreover, $\prod_{n=1}^{N} \exp\left(-\frac{γ_{ε_{n}}}{2} \|ε_{n}\|^{2}\right) \leq 1$ because $\frac{γ_{ε_{n}}}{2} \|ε_{n}\|^{2} \geq 0$ for $n = 1, \dots, N$ and $γ_{ε_{n}} > 0$. Finally, the function

$$
f(\tilde{θ}_{n}) = \frac{\left[1 + \frac{1}{ν_{1} + Q} (\tilde{θ}_{n} - \hat{μ}_{\tilde{θ}_{n}})^{T} \hat{V}_{\tilde{θ}_{n}}^{-1} (\tilde{θ}_{n} - \hat{μ}_{\tilde{θ}_{n}})\right]^{\frac{ν_{1} + Q}{2}}}{\exp\left(\frac{1}{2} (\tilde{θ}_{n} - \tilde{g}_{n})^{T} \tilde{G}_{n} (\tilde{θ}_{n} - \tilde{g}_{n})\right)}
\tag{33}
$$

is bounded, since $f$ is continuous and $f(\tilde{θ}_{n}) \to 0$ as $\|\tilde{θ}_{n}\| \to \infty$. The limit holds because the quadratic forms $(\tilde{θ}_{n} - \tilde{g}_{n})^{T} \tilde{G}_{n} (\tilde{θ}_{n} - \tilde{g}_{n})$ and $(\tilde{θ}_{n} - \hat{μ}_{\tilde{θ}_{n}})^{T} \hat{V}_{\tilde{θ}_{n}}^{-1} (\tilde{θ}_{n} - \hat{μ}_{\tilde{θ}_{n}})$ increase to infinity at the same rate, so the denominator grows exponentially fast while the numerator grows only at a polynomial rate.
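The boundedness argument can be checked numerically in a one-dimensional analogue of (33). In the sketch below the scalar values chosen for the Gaussian mean and precision, the Student-t location and scale, and the degrees of freedom (standing in for ν₁ + Q) are arbitrary assumptions for illustration only. The ratio is evaluated in log-space to avoid overflow of the exponential for large arguments.

```python
import math


def weight_ratio(theta, g=0.0, G=1.0, mu=0.5, v=2.0, dof=5.0):
    """1D analogue of f in (33): Student-t-type polynomial term over a Gaussian exponential."""
    log_numerator = 0.5 * dof * math.log1p((theta - mu) ** 2 / (dof * v))
    log_denominator = 0.5 * G * (theta - g) ** 2
    # exp of a very negative number underflows to 0.0 instead of raising an error.
    return math.exp(log_numerator - log_denominator)


# Evaluate f on a wide grid: it vanishes in both tails and stays finite in between.
grid = [x / 10.0 for x in range(-500, 501)]
values = [weight_ratio(t) for t in grid]
sup_on_grid = max(values)
```

For these illustrative values the supremum on the grid is slightly above one and is attained near the Gaussian mean, while the ratio is numerically zero far from it.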

Proof of Theorem 6.3

The first B − 1 blocks of y_n, which are generated with the Gibbs sampler, follow well-known distributions (Table I); it is therefore straightforward to show that their updates are minorisable. From Theorem 6.2, since the partial updates for all blocks except the last one are minorisable, the entire Gibbs sampler is minorisable and therefore uniformly ergodic. From Lemma 6.2, the conditional weight of the B-th block y_{θ̃_n}, which is generated with the MH, is bounded. Thus the MH-within-Gibbs sampler meets the conditions of Theorem 6.1 and is uniformly ergodic.
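The structure of an MH-within-Gibbs sweep, with one block drawn exactly from its conditional and one block updated by an independence MH step with a heavy-tailed proposal, can be sketched on a toy problem. The target below is a bivariate normal with correlation `RHO`, and the Cauchy candidate density is an assumption chosen so that the weight (target over proposal) is bounded; none of this reproduces the model, the priors, or the Student-t candidate of the paper.

```python
import math
import random

RHO = 0.6                    # correlation of the toy bivariate normal target
COND_VAR = 1.0 - RHO ** 2    # variance of each full conditional


def log_target_y_given_x(y, x):
    """Unnormalised log density of y | x for the toy target."""
    return -0.5 * (y - RHO * x) ** 2 / COND_VAR


def log_cauchy(y, loc, scale):
    """Log density of the heavy-tailed Cauchy candidate."""
    z = (y - loc) / scale
    return -math.log(math.pi * scale * (1.0 + z * z))


def sample_cauchy(rng, loc, scale):
    """Inverse-CDF draw from a Cauchy distribution."""
    return loc + scale * math.tan(math.pi * (rng.random() - 0.5))


def mh_within_gibbs(n_samples, seed=0):
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    chain = []
    for _ in range(n_samples):
        # Gibbs block: x | y has a closed-form (Gaussian) conditional.
        x = rng.gauss(RHO * y, math.sqrt(COND_VAR))
        # MH block: independence proposal for y | x, accepted with the ratio of weights.
        loc, scale = RHO * x, 1.0
        proposal = sample_cauchy(rng, loc, scale)
        log_w_new = log_target_y_given_x(proposal, x) - log_cauchy(proposal, loc, scale)
        log_w_old = log_target_y_given_x(y, x) - log_cauchy(y, loc, scale)
        if rng.random() < math.exp(min(0.0, log_w_new - log_w_old)):
            y = proposal
        chain.append((x, y))
    return chain


chain = mh_within_gibbs(20000, seed=1)
mean_x = sum(x for x, _ in chain) / len(chain)
mean_y = sum(y for _, y in chain) / len(chain)
```

The sample means should be close to zero and the empirical correlation close to `RHO`, which gives a quick sanity check that the combined sampler targets the intended joint distribution.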

Appendix B

Computational Complexity

We analyze the computational complexity of our proposed framework for each MCMC step (Algorithm 1, Table II) using the “𝒪” notation. For an input signal and MCMC iteration, the weight of the Multinomial distribution for each atom is computed with cost 𝒪(P), i.e. 𝒪(PK) for all atoms. Sampling L indices from K dictionary atoms with replacement results in 𝒪(LK), therefore the final cost yields 𝒪(PK + LK). For sampling without replacement, re-adjusting the atom weights requires 𝒪 ((K − 1) + … + (K − L + 1)) ~ 𝒪(LK). Since each of the L iterations takes into account the previously selected atoms [55], the cost of the Wallenius distribution is 𝒪 (K + (K − 1) + … + (K − L + 1)) ~ 𝒪(LK).
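The two index-sampling schemes discussed above can be sketched as follows. Drawing with replacement uses fixed atom weights, whereas drawing without replacement removes each selected atom and implicitly renormalizes the remaining weights at every one of the L steps, in the spirit of sequential biased sampling. The function names and the toy weights are assumptions for illustration, and this is not the Wallenius sampler of [55].

```python
import random


def sample_with_replacement(weights, n_draws, rng):
    """Draw n_draws atom indices; the same atom may be selected more than once."""
    return rng.choices(range(len(weights)), weights=weights, k=n_draws)


def sample_without_replacement(weights, n_draws, rng):
    """Draw n_draws distinct atom indices by sequential weighted sampling."""
    remaining = list(range(len(weights)))
    current = list(weights)
    chosen = []
    for _ in range(n_draws):
        # Each step scans the atoms that are still available, hence O(K) work per draw.
        pick = rng.choices(range(len(remaining)), weights=current, k=1)[0]
        chosen.append(remaining.pop(pick))
        current.pop(pick)
    return chosen


rng = random.Random(0)
atom_weights = [0.4, 0.3, 0.1, 0.1, 0.05, 0.05]   # K = 6 illustrative atom weights
with_repl = sample_with_replacement(atom_weights, 3, rng)
without_repl = sample_without_replacement(atom_weights, 3, rng)
```

Both routines perform on the order of LK elementary operations, which matches the 𝒪(LK) terms in the analysis.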

Each of the L coefficients is generated from the normal distribution, whose mean requires 𝒪(P), therefore the entire complexity is 𝒪(LP). Regarding the dictionary atom parameters, their posterior (7) requires 𝒪 (L(P + Q²)), while the typical cost of the Nelder-Mead simplex method is 𝒪 (Q²) [80]. Finally, the complexity is 𝒪(1) for the noise ε_n, 𝒪(K + L) for the Dirichlet prior π_n, 𝒪(Q²) for the mean g_nk and precision G_n of the dictionary parameters prior, and 𝒪(P) for the noise precision γ_{ε_n}.

Taking these into account, the total complexity when using sampling with and without replacement, respectively, yields:

$$
\mathcal{O}(PK + LK + 2LP + (L + 3)Q^{2} + P + K + L + 1) \sim \mathcal{O}(PK + LK + LP + LQ^{2})
\tag{34}
$$

$$
\mathcal{O}(2LK + 2LP + (L + 3)Q^{2} + P + K + L + 1) \sim \mathcal{O}(LK + LP + LQ^{2})
\tag{35}
$$
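As a small worked example, the expressions inside (34) and (35) can be evaluated for concrete problem sizes. The values of P, K, L, and Q below are arbitrary and chosen only for illustration; the functions count the terms exactly as written and say nothing about constant factors or actual run time.

```python
def cost_with_replacement(P, K, L, Q):
    """Term-by-term evaluation of the expression inside (34)."""
    return P * K + L * K + 2 * L * P + (L + 3) * Q ** 2 + P + K + L + 1


def cost_without_replacement(P, K, L, Q):
    """Term-by-term evaluation of the expression inside (35)."""
    return 2 * L * K + 2 * L * P + (L + 3) * Q ** 2 + P + K + L + 1


# Illustrative sizes: signal length P, K atoms, L selected atoms, Q parameters per atom.
P, K, L, Q = 100, 50, 5, 3
ops_with = cost_with_replacement(P, K, L, Q)        # 6478
ops_without = cost_without_replacement(P, K, L, Q)  # 1728
```

For these sizes the PK term dominates (34), while the terms linear in L dominate (35).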

Our approach requires first-order polynomial time with the signal dimensionality P, the number of dictionary atoms K, and the number of selected atoms in the representation L ≪ K, and second-order polynomial cost with respect to the number of dictionary parameters Q. We note that Q ≪ P, L, therefore the latter is not too expensive.

We further compare run-time statistics of all the approaches. We report the average duration of one training iteration and the number of estimated variables for each approach (Table B1). Experiments were performed on an Intel Core i7 processor with a 2.93 GHz CPU and 7.8 GB of RAM. SD and K-SVD require a decomposition step, therefore they compute slightly more variables than ETF. Our method estimates the highest number of variables, including the priors of the considered problem. Results indicate that SD and ETF are computationally less expensive than the proposed MCMC. Consistent with previous observations [19], K-SVD also has a high computational cost. MCMC sampling with and without replacement yields computation times of the same order.

TABLE B1.

Average computation time of dictionary learning algorithms

Method	Number of variables	Computation time (s per training iteration)
SD	438	0.1173
ETF	432	0.0616
K-SVD	438	3.9536
MCMC with replacement	1024	3.5014
MCMC without replacement	1024	4.9638

Footnotes

This paper has supplementary material available including proofs and derivations.

Contributor Information

Theodora Chaspari, Signal Analysis and Interpretation Laboratory (SAIL) at the Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA, 90089, USA.

Andreas Tsiartas, SRI International, Menlo Park, CA, 94025-3493, USA.

Panagiotis Tsilifis, Department of Mathematics at the Dana and David Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, CA, 90089, USA.

Shrikanth Narayanan, Signal Analysis and Interpretation Laboratory (SAIL) at the Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA, 90089, USA.

References

1. Elad M. Sparse and redundant representations: From theory to applications in signal and image processing. Springer; 2010.
2. Mallat SG, Zhang Z. Matching pursuits with time-frequency dictionaries. IEEE TSP. 1993;41(12):3397–3415.
3. Pati YC, Ramin R, Krishnaprasad PS. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition; Conference on Signals, Systems and Computers; 1993.
4. Davis G, Mallat SG, Avellaneda M. Adaptive greedy approximations. Constructive Approximation. 1997;13(1):57–98.
5. Chen SS, Donoho DL, Saunders MA. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing. 1998;20(1):33–61.
6. Rao BD, Kreutz-Delgado K. An affine scaling methodology for best basis selection. IEEE TSP. 1999;47(1):187–200.
7. Rao BD, Engan K, Cotter SF, Palmer J, Kreutz-Delgado K. Subset selection in noise based on diversity measure minimization. IEEE TSP. 2003;51(3):760–770.
8. Wipf DP, Rao BD. Sparse Bayesian learning for basis selection. IEEE TSP. 2004;52(8):2153–2164.
9. Wipf DP, Rao BD. An empirical Bayesian strategy for solving the simultaneous sparse approximation problem. IEEE TSP. 2007;55(7):3704–3716.
10. Ji S, Xue Y, Carin L. Bayesian compressive sensing. IEEE TSP. 2008;56(6):2346–2356.
11. Daugman JG. Two-dimensional spectral analysis of cortical receptive field profiles. Vision research. 1980;20(10):847–856. doi: 10.1016/0042-6989(80)90065-6.
12. Daubechies I. The wavelet transform, time-frequency localization and signal analysis. Information Theory. 1990;36(5):961–1005.
13. Starck JL, Candès EJ, Donoho DL. The curvelet transform for image denoising. IEEE TIP. 2002;11(6):670–684. doi: 10.1109/TIP.2002.1014998.
14. Aharon M, Elad M, Bruckstein A. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE TSP. 2006;54(11):4311–4322.
15. Engan K, Aase SO, Hakon Husoy J. Method of optimal directions for frame design. Proc. ICASSP. 1999:2443–2446.
16. Olshausen BA, Field DJ. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature. 1996;381(6583):607–609. doi: 10.1038/381607a0.
17. Mairal J, Bach F, Ponce J, Sapiro G. Online dictionary learning for sparse coding. Proc. ACM. 2009:689–696.
18. Mairal J, Bach F, Ponce J, Sapiro G, Zisserman A. Supervised dictionary learning. Proc. NIPS. 2008:1033–1040.
19. Rubinstein R, Bruckstein AM, Elad M. Dictionaries for sparse representation modeling. Proc. IEEE. 2010;98(6):1045–1057.
20. Skretting K, Engan K. Image compression using learned dictionaries by RLS-DLA and compared with K-SVD. Proc. ICASSP. 2011:1517–1520.
21. Thanou D, Shuman D, Frossard P. Learning Parametric Dictionaries for Signals on Graphs. IEEE TSP. 2014;62(15):3849–3862.
22. Ataee M, Zayyani H, Babaie-Zadeh M, Jutten C. Parametric dictionary learning using steepest descent. Proc. ICASSP. 2010:1987–1981.
23. Szu HH, Telfer BA, Kadambe SL. Neural network adaptive wavelets for signal representation and classification. Optical Engineering. 1992;21(9):1907–1916.
24. Yaghoobi M, Daudet L, Davies ME. Parametric dictionary design for sparse coding. IEEE TSP. 2009;57(12):4800–4810.
25. Merlet S, Caruyer E, Ghosh A, Deriche R. A computational diffusion MRI and parametric dictionary learning framework for modeling the diffusion signal and its features. Medical Image Analysis. 2013;17(9):830–843. doi: 10.1016/j.media.2013.04.011.
26. Barthélemy Q, Gouy-Pailler C, Isaac Y, Souloumiac A, Larue A, Mars JI. Multivariate temporal dictionary learning for EEG. Journal of neuroscience methods. 2013;215(1):19–28. doi: 10.1016/j.jneumeth.2013.02.001.
27. Reisert M, Skibbe H, Kiselev VG. The diffusion dictionary in the human brain is short: Rotation invariant learning of basis functions. Proc. CDMRI. 2014:47–55.
28. Chaspari T, Tsiartas A, Stein LI, Cermak SA, Narayanan SS. Sparse representation of electrodermal activity with knowledge-driven dictionaries. IEEE TBME. 2015;62(3):960–971. doi: 10.1109/TBME.2014.2376960.
29. Tipping ME. Advanced lectures on machine Learning. Springer; 2004. Bayesian inference: An introduction to principles and practice in machine learning; pp. 41–62.
30. Ophir B, Elad M, Bertin N, Plumbley MD. Sequential minimal eigenvalues - An approach to analysis dictionary learning. Proc. EUSIPCO. 2011
31. Vidal R, Ma Y, Sastry S. Generalized principal component analysis. IEEE PAMI. 2005;27(12):1945–1959. doi: 10.1109/TPAMI.2005.244.
32. Aharon M, Elad M. Sparse and redundant modeling of image content using an image-signature-dictionary. SIIMS. 2008;1(3):228–247.
33. Rubinstein R, Zibulevsky M, Elad M. Double sparsity: Learning sparse dictionaries for sparse signal approximation. IEEE TSP. 2010;58(3):1553–1564.
34. Mazaheri JA, Guillemot C, Labit C. Learning a tree-structured dictionary for efficient image representation with adaptive sparse coding. Proc. ICASSP. 2013:1320–1324.
35. Blumensath T, Davies M. Sparse and shift-invariant representations of music. IEEE TASLP. 2006;14(1):50–57.
36. Engan K, Skretting K, Husøy JH. Family of iterative LS-based dictionary learning algorithms, ILS-DLA, for sparse signal representation. Digital Signal Processing. 2007;17(1):32–49.
37. Jasper WJ, Garnier SJ, Potlapalli H. Texture characterization and defect detection using adaptive wavelets. Optical Engineering. 1996;35(11):3140–3149.
38. Blumensath T. Monte Carlo methods for compressed sensing. Proc. ICASSP. 2014:1000–1004.
39. Elad M, Yavneh I. A plurality of sparse representations is better than the sparsest one alone. IEEE TIT. 2009;55(10):4701–4714.
40. Lewicki MS, Olshausen BA. Probabilistic framework for the adaptation and comparison of image codes. JOSA A. 1999;16(7):1587–1601.
41. Lewicki MS, Sejnowski T. Learning overcomplete representations. Neural computation. 2000;12(2):337–365. doi: 10.1162/089976600300015826.
42. Tipping ME. Sparse Bayesian learning and the relevance vector machine. The journal of machine learning research. 2001;1:211–244.
43. Li L, Silva J, Zhou M, Carin L. Online Bayesian dictionary learning for large datasets. Proc. ICASSP. 2012:2157–2160.
44. Zhou M, Chen H, Paisley J, Ren L, Li L, Xing Z, Dunson D, Sapiro G, Carin L. Nonparametric Bayesian dictionary learning for analysis of noisy and incomplete images. IEEE TIP. 2012;21(1):130–144. doi: 10.1109/TIP.2011.2160072.
45. He L, Qi H, Zaretzki R. Non-parametric Bayesian dictionary learning for image super resolution. Proc. FIIW. 2011:122–125.
46. Crandall R, Dong B, Bilgin A. Randomized iterative hard thresholding: A fast approximate MMSE estimator for sparse approximations. 2013 http://math.arizona.edu/~dongbin/Publications/RandIHT.
47. Sallee P, Olshausen BA. Learning sparse multiscale image representations. Proc. NIPS. 2002:1327–1334.
48. Reaz MBI, Hussain MS, Mohd-Yasin F. Techniques of EMG signal analysis: detection, processing, classification and applications. Biological procedures online. 2006;8(1):11–35. doi: 10.1251/bpo115.
49. Friston KJ, Holmes AP, Worsley KJ, Poline JP, Frith CD, Frackowiak RSJ. Statistical parametric maps in functional imaging: A general linear approach. Human Brain Mapping. 1995;2:189–210.
50. Grimmer J. An introduction to Bayesian inference via variational approximations. Political Analysis. 2010;19(1):32–47.
51. Meyn PS, Tweedie LR. Markov chains and stochastic stability. 2nd. New York: Cambridge University Press; 2009.
52. Mengersen KL, Tweedie RL. Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics. 1996;24(1):101–121.
53. Tsagkias M, De Rijke M, Weerkamp W. Hypergeometric language models for republished article finding. Proc. SIGIR. 2011:485–494.
54. Fog A. Calculation methods for Wallenius’ noncentral hypergeometric distribution. Communications in Statistics-Simulation and Computation. 2008;37(2):258–273.
55. Fog A. Sampling methods for Wallenius’ and Fisher’s noncentral hypergeometric distributions. Communications in Statistics-Simulation and Computation. 2008;37(2):241–257.
56. Wallenius KT. Biased sampling; the noncentral hypergeometric probability distribution. Tech. Rep., DTIC Document. 1963
57. Chib S, Jeliazkov I. Marginal likelihood from the Metropolis-Hastings output. JASA. 2001;96(453):270–281.
58. Chib S, Greenberg E. Understanding the Metropolis-Hastings algorithm. The American Statistician. 1995;49(4):327–335.
59. Johnson A. Ph.D. thesis. University of Minnesota; 2009. Geometric ergodicity of Gibbs samplers.
60. Johnson AA, Jones LG, Neath CR. Component-wise Markov chain Monte Carlo: Uniform and geometric ergodicity under mixing and composition. Statistical Science. 2013;28(3):360–375.
61. Jones LG, Roberts OG, Rosenthal JS. Convergence of conditional Metropolis-Hastings samplers. Advances in Applied Probability. 2014;46(2):422–445.
62. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equation of state calculations by fast computing machines. The journal of chemical physics. 1953;21(6):1087–1092.
63. Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970;57(1):97–109.
64. Roberts GO, Gelman A, Gilks WR. Weak convergence and optimal scaling of random walk Metropolis algorithms. The annals of applied probability. 1997;7(1):110–120.
65. Chib S, Greenberg E, Winkelmann R. Posterior simulation and Bayes factors in panel count data models. Journal of Econometrics. 1986;86(1):33–54.
66. Bédard M, Fraser DAS. On a Directionally Adjusted Metropolis-Hastings Algorithm. International Journal of Statistical Sciences. 2008;9(1):33–57.
67. Chib S, Ramamurthy S. Tailored randomized block MCMC methods with application to DSGE models. Journal of Econometrics. 2010;155(1):19–38.
68. Nelder JA, Mead R. A simplex method for function minimization. The computer journal. 1965;7(4):308–313.
69. Lobo AP, Loizou PC. Voiced/unvoiced speech discrimination in noise using Gabor atomic decomposition. Proc. ICASSP. 2003:817–820.
70. Tošić I, Frossard P. Dictionary learning for stereo image representation. IEEE TIP. 2011;20(4):921–934. doi: 10.1109/TIP.2010.2081679.
71. Demanet L, Ying L. Wave atoms and sparsity of oscillatory patterns. Applied and Computational Harmonic Analysis. 2007;23(3):368–387.
72. Barthélemy Q, Gouy-Pailler C, Isaac Y, Souloumiac A, Larue A, Mars JI. Multivariate temporal dictionary learning for EEG. Journal of neuroscience methods. 2013;215(1):19–28. doi: 10.1016/j.jneumeth.2013.02.001.
73. Ruiz-Reyes N, Vera-Candeas P, Reche-López PJ, Canadas-Quesada F. A time-frequency adaptive signal model-based approach for parametric ECG compression. Proc. EUSIPCO. 2006:1–5.
74. Jones GL, Hobert JP. Honest exploration of intractable probability distributions via Markov Chain Monte Carlo. Statistical Science. 2001:312–334.
75. Koelstra S, Muhl C, Soleymani M, Lee JS, Yazdani A, Ebrahimi T, Pun T, Nijholt A, Patras I. DEAP: A database for emotion analysis using physiological signals. IEEE TAC. 2012;3(1):18–31.
76. Dawson ME, Schell AM, Filion DL. The Electrodermal System. In: Cacioppo JT, Tassinary LG, Berntson GG, editors. Handbook of psychophysiology. 3rd. New York: Cambridge University Press; 2007. pp. 159–181.
77. Geweke J. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In: Bernardo JM, Berger J, Dawid AP, Smith AFM, editors. Bayesian Statistics. Vol. 4. Oxford University Press; 1991. pp. 169–193.
78. Cowles MK, Carlin BP. Markov chain Monte Carlo convergence diagnostics: a comparative review. JASA. 1996;91(434):883–904.
79. Rosenthal JS. Optimal proposal distributions and adaptive MCMC. Handbook of Markov Chain Monte Carlo. 2011:93–112.
80. Singer S, Singer S. Complexity analysis of Nelder-Mead search iterations; Proc. Conference on Applied Mathematics and Computation; 1999. pp. 185–196.
81. Bishop Christopher M. Model-based machine learning. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2013;371(1984) doi: 10.1098/rsta.2012.0222.
82. Qiu Q, Patel VM, Pavan T, Chellappa R. Domain adaptive dictionary learning. Proc. ECCV. 2012:631–645.
83. Rosenthal JS. Minorization conditions and convergence rates for Markov Chain Monte Carlo. JASA. 1995;90(430):558–566.
84. Rosenthal JS. Theoretical rates of convergence for Markov Chain Monte Carlo. Computing Science and Statistics. 1994:486–486.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supplemental Material

NIHMS782938-supplement-Supplemental_Material.pdf^{(237.2KB, pdf)}

[R1] 1.Elad M. Sparse and redundant representations: From theory to applications in signal and image processing. Springer; 2010. [Google Scholar]

[R2] 2.Mallat SG, Zhang Z. Matching pursuits with time-frequency dictionaries. IEEE TSP. 1993;41(12):3397–3415. [Google Scholar]

[R3] 3.Pati YC, Ramin R, Krishnaprasad PS. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition; Conference on Signals, Systems and Computers; 1993. [Google Scholar]

[R4] 4.Davis G, Mallat SG, Avellaneda M. Adaptive greedy approximations. Constructive Approximation. 1997;13(1):57–98. [Google Scholar]

[R5] 5.Chen SS, Donoho DL, Saunders MA. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing. 1998;20(1):33–61. [Google Scholar]

[R6] 6.Rao BD, Kreutz-Delgado K. An affine scaling methodology for best basis selection. IEEE TSP. 1999;47(1):187–200. [Google Scholar]

[R7] 7.Rao BD, Engan K, Cotter SF, Palmer J, Kreutz-Delgado K. Subset selection in noise based on diversity measure minimization. IEEE TSP. 2003;51(3):760–770. [Google Scholar]

[R8] 8.Wipf DP, Rao BD. Sparse Bayesian learning for basis selection. IEEE TSP. 2004;52(8):2153–2164. [Google Scholar]

[R9] 9.Wipf DP, Rao BD. An empirical Bayesian strategy for solving the simultaneous sparse approximation problem. IEEE TSP. 2007;55(7):3704–3716. [Google Scholar]

[R10] 10.Ji S, Xue Y, Carin L. Bayesian compressive sensing. IEEE TSP. 2008;56(6):2346–2356. [Google Scholar]

11. Daugman JG. Two-dimensional spectral analysis of cortical receptive field profiles. Vision research. 1980;20(10):847–856. doi: 10.1016/0042-6989(80)90065-6.

12. Daubechies I. The wavelet transform, time-frequency localization and signal analysis. Information Theory. 1990;36(5):961–1005.

13. Starck JL, Candès EJ, Donoho DL. The curvelet transform for image denoising. IEEE TIP. 2002;11(6):670–684. doi: 10.1109/TIP.2002.1014998.

14. Aharon M, Elad M, Bruckstein A. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE TSP. 2006;54(11):4311–4322.

15. Engan K, Aase SO, Hakon Husoy J. Method of optimal directions for frame design. Proc. ICASSP. 1999:2443–2446.

16. Olshausen BA, Field DJ. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature. 1996;381(6583):607–609. doi: 10.1038/381607a0.

17. Mairal J, Bach F, Ponce J, Sapiro G. Online dictionary learning for sparse coding. Proc. ACM. 2009:689–696.

18. Mairal J, Bach F, Ponce J, Sapiro G, Zisserman A. Supervised dictionary learning. Proc. NIPS. 2008:1033–1040.

19. Rubinstein R, Bruckstein AM, Elad M. Dictionaries for sparse representation modeling. Proc. IEEE. 2010;98(6):1045–1057.

20. Skretting K, Engan K. Image compression using learned dictionaries by RLS-DLA and compared with K-SVD. Proc. ICASSP. 2011:1517–1520.

21. Thanou D, Shuman D, Frossard P. Learning Parametric Dictionaries for Signals on Graphs. IEEE TSP. 2014;62(15):3849–3862.

22. Ataee M, Zayyani H, Babaie-Zadeh M, Jutten C. Parametric dictionary learning using steepest descent. Proc. ICASSP. 2010:1978–1981.

23. Szu HH, Telfer BA, Kadambe SL. Neural network adaptive wavelets for signal representation and classification. Optical Engineering. 1992;31(9):1907–1916.

24. Yaghoobi M, Daudet L, Davies ME. Parametric dictionary design for sparse coding. IEEE TSP. 2009;57(12):4800–4810.

25. Merlet S, Caruyer E, Ghosh A, Deriche R. A computational diffusion MRI and parametric dictionary learning framework for modeling the diffusion signal and its features. Medical Image Analysis. 2013;17(9):830–843. doi: 10.1016/j.media.2013.04.011.

26. Barthélemy Q, Gouy-Pailler C, Isaac Y, Souloumiac A, Larue A, Mars JI. Multivariate temporal dictionary learning for EEG. Journal of neuroscience methods. 2013;215(1):19–28. doi: 10.1016/j.jneumeth.2013.02.001.

27. Reisert M, Skibbe H, Kiselev VG. The diffusion dictionary in the human brain is short: Rotation invariant learning of basis functions. Proc. CDMRI. 2014:47–55.

28. Chaspari T, Tsiartas A, Stein LI, Cermak SA, Narayanan SS. Sparse representation of electrodermal activity with knowledge-driven dictionaries. IEEE TBME. 2015;62(3):960–971. doi: 10.1109/TBME.2014.2376960.

29. Tipping ME. Advanced lectures on machine Learning. Springer; 2004. Bayesian inference: An introduction to principles and practice in machine learning; pp. 41–62.

30. Ophir B, Elad M, Bertin N, Plumbley MD. Sequential minimal eigenvalues - An approach to analysis dictionary learning. Proc. EUSIPCO. 2011

31. Vidal R, Ma Y, Sastry S. Generalized principal component analysis. IEEE PAMI. 2005;27(12):1945–1959. doi: 10.1109/TPAMI.2005.244.

32. Aharon M, Elad M. Sparse and redundant modeling of image content using an image-signature-dictionary. SIIMS. 2008;1(3):228–247.

33. Rubinstein R, Zibulevsky M, Elad M. Double sparsity: Learning sparse dictionaries for sparse signal approximation. IEEE TSP. 2010;58(3):1553–1564.

34. Mazaheri JA, Guillemot C, Labit C. Learning a tree-structured dictionary for efficient image representation with adaptive sparse coding. Proc. ICASSP. 2013:1320–1324.

35. Blumensath T, Davies M. Sparse and shift-invariant representations of music. IEEE TASLP. 2006;14(1):50–57.

36. Engan K, Skretting K, Husøy JH. Family of iterative LS-based dictionary learning algorithms, ILS-DLA, for sparse signal representation. Digital Signal Processing. 2007;17(1):32–49.

37. Jasper WJ, Garnier SJ, Potlapalli H. Texture characterization and defect detection using adaptive wavelets. Optical Engineering. 1996;35(11):3140–3149.

38. Blumensath T. Monte Carlo methods for compressed sensing. Proc. ICASSP. 2014:1000–1004.

39. Elad M, Yavneh I. A plurality of sparse representations is better than the sparsest one alone. IEEE TIT. 2009;55(10):4701–4714.

40. Lewicki MS, Olshausen BA. Probabilistic framework for the adaptation and comparison of image codes. JOSA A. 1999;16(7):1587–1601.

41. Lewicki MS, Sejnowski T. Learning overcomplete representations. Neural computation. 2000;12(2):337–365. doi: 10.1162/089976600300015826.

42. Tipping ME. Sparse Bayesian learning and the relevance vector machine. The journal of machine learning research. 2001;1:211–244.

43. Li L, Silva J, Zhou M, Carin L. Online Bayesian dictionary learning for large datasets. Proc. ICASSP. 2012:2157–2160.

44. Zhou M, Chen H, Paisley J, Ren L, Li L, Xing Z, Dunson D, Sapiro G, Carin L. Nonparametric Bayesian dictionary learning for analysis of noisy and incomplete images. IEEE TIP. 2012;21(1):130–144. doi: 10.1109/TIP.2011.2160072.

45. He L, Qi H, Zaretzki R. Non-parametric Bayesian dictionary learning for image super resolution. Proc. FIIW. 2011:122–125.

46. Crandall R, Dong B, Bilgin A. Randomized iterative hard thresholding: A fast approximate MMSE estimator for sparse approximations. 2013 http://math.arizona.edu/~dongbin/Publications/RandIHT.

47. Sallee P, Olshausen BA. Learning sparse multiscale image representations. Proc. NIPS. 2002:1327–1334.

48. Reaz MBI, Hussain MS, Mohd-Yasin F. Techniques of EMG signal analysis: detection, processing, classification and applications. Biological procedures online. 2006;8(1):11–35. doi: 10.1251/bpo115.

49. Friston KJ, Holmes AP, Worsley KJ, Poline JP, Frith CD, Frackowiak RSJ. Statistical parametric maps in functional imaging: A general linear approach. Human Brain Mapping. 1995;2:189–210.

50. Grimmer J. An introduction to Bayesian inference via variational approximations. Political Analysis. 2010;19(1):32–47.

51. Meyn SP, Tweedie RL. Markov chains and stochastic stability. 2nd. New York: Cambridge University Press; 2009.

52. Mengersen KL, Tweedie RL. Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics. 1996;24(1):101–121.

53. Tsagkias M, De Rijke M, Weerkamp W. Hypergeometric language models for republished article finding. Proc. SIGIR. 2011:485–494.

54. Fog A. Calculation methods for Wallenius’ noncentral hypergeometric distribution. Communications in Statistics-Simulation and Computation. 2008;37(2):258–273.

55. Fog A. Sampling methods for Wallenius’ and Fisher’s noncentral hypergeometric distributions. Communications in Statistics-Simulation and Computation. 2008;37(2):241–257.

56. Wallenius KT. Biased sampling; the noncentral hypergeometric probability distribution. Tech. Rep., DTIC Document. 1963

57. Chib S, Jeliazkov I. Marginal likelihood from the Metropolis-Hastings output. JASA. 2001;96(453):270–281.

58. Chib S, Greenberg E. Understanding the Metropolis-Hastings algorithm. The American Statistician. 1995;49(4):327–335.

59. Johnson A. Ph.D. thesis. University of Minnesota; 2009. Geometric ergodicity of Gibbs samplers.

60. Johnson AA, Jones GL, Neath RC. Component-wise Markov chain Monte Carlo: Uniform and geometric ergodicity under mixing and composition. Statistical Science. 2013;28(3):360–375.

61. Jones GL, Roberts GO, Rosenthal JS. Convergence of conditional Metropolis-Hastings samplers. Advances in Applied Probability. 2014;46(2):422–445.

62. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equation of state calculations by fast computing machines. The journal of chemical physics. 1953;21(6):1087–1092.

63. Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970;57(1):97–109.

64. Roberts GO, Gelman A, Gilks WR. Weak convergence and optimal scaling of random walk Metropolis algorithms. The annals of applied probability. 1997;7(1):110–120.

65. Chib S, Greenberg E, Winkelmann R. Posterior simulation and Bayes factors in panel count data models. Journal of Econometrics. 1998;86(1):33–54.

66. Bédard M, Fraser DAS. On a Directionally Adjusted Metropolis-Hastings Algorithm. International Journal of Statistical Sciences. 2008;9(1):33–57.

67. Chib S, Ramamurthy S. Tailored randomized block MCMC methods with application to DSGE models. Journal of Econometrics. 2010;155(1):19–38.

68. Nelder JA, Mead R. A simplex method for function minimization. The computer journal. 1965;7(4):308–313.

69. Lobo AP, Loizou PC. Voiced/unvoiced speech discrimination in noise using gabor atomic decomposition. Proc. ICASSP. 2003:817–820.

70. Tošić I, Frossard P. Dictionary learning for stereo image representation. IEEE TIP. 2011;20(4):921–934. doi: 10.1109/TIP.2010.2081679.

71. Demanet L, Ying L. Wave atoms and sparsity of oscillatory patterns. Applied and Computational Harmonic Analysis. 2007;23(3):368–387.

72. Barthélemy Q, Gouy-Pailler C, Isaac Y, Souloumiac A, Larue A, Mars JI. Multivariate temporal dictionary learning for EEG. Journal of neuroscience methods. 2013;215(1):19–28. doi: 10.1016/j.jneumeth.2013.02.001.

73. Ruiz-Reyes N, Vera-Candeas P, Reche-López PJ, Canadas-Quesada F. A time-frequency adaptive signal model-based approach for parametric ecg compression. Proc. EUSIPCO. 2006:1–5.

74. Jones GL, Hobert JP. Honest exploration of intractable probability distributions via Markov Chain Monte Carlo. Statistical Science. 2001:312–334.

75. Koelstra S, Muhl C, Soleymani M, Lee JS, Yazdani A, Ebrahimi T, Pun T, Nijholt A, Patras I. DEAP: A database for emotion analysis using physiological signals. IEEE TAC. 2012;3(1):18–31.

76. Dawson ME, Schell AM, Filion DL. The Electrodermal System. In: Cacioppo JT, Tassinary LG, Berntson GG, editors. Handbook of psychophysiology. 3rd. New York: Cambridge University Press; 2007. pp. 159–181.

77. Geweke J. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In: Bernardo JM, Berger J, Dawid AP, Smith AFM, editors. Bayesian Statistics. Vol. 4. Oxford University Press; 1991. pp. 169–193.

78. Cowles MK, Carlin BP. Markov chain Monte Carlo convergence diagnostics: a comparative review. JASA. 1996;91(434):883–904.

79. Rosenthal JS. Optimal proposal distributions and adaptive mcmc. Handbook of Markov Chain Monte Carlo. 2011:93–112.

80. Singer S, Singer S. Complexity analysis of Nelder-Mead search iterations; Proc. Conference on Applied Mathematics and Computation; 1999. pp. 185–196.

81. Bishop CM. Model-based machine learning. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2013;371(1984) doi: 10.1098/rsta.2012.0222.

82. Qiu Q, Patel VM, Pavan T, Chellappa R. Domain adaptive dictionary learning. Proc. ECCV. 2012:631–645.

83. Rosenthal JS. Minorization conditions and convergence rates for Markov Chain Monte Carlo. JASA. 1995;90(430):558–566.

84. Rosenthal JS. Theoretical rates of convergence for Markov Chain Monte Carlo. Computing Science and Statistics. 1994:486–486.
