Author manuscript; available in PMC: 2017 Jun 23.
Published in final edited form as: IEEE Trans Signal Process. 2016 Mar 7;64(12):3077–3092. doi: 10.1109/TSP.2016.2539143

Markov Chain Monte Carlo Inference of Parametric Dictionaries for Sparse Bayesian Approximations

Theodora Chaspari 1, Andreas Tsiartas 2, Panagiotis Tsilifis 3, Shrikanth Narayanan 4
PMCID: PMC5482548  NIHMSID: NIHMS782938  PMID: 28649173

Abstract

Parametric dictionaries can increase the ability of sparse representations to meaningfully capture and interpret the underlying signal information, such as that encountered in biomedical problems. Given a mapping function from the atom parameter space to the actual atoms, we propose a sparse Bayesian framework for learning the atom parameters, because of its ability to provide full posterior estimates, take uncertainty into account and generalize to unseen data. Inference is performed with Markov Chain Monte Carlo, which uses block sampling to generate the variables of the Bayesian problem. Since the parameterization of dictionary atoms results in posteriors that cannot be analytically computed, we use a Metropolis-Hastings-within-Gibbs framework, in which variables with closed-form posteriors are generated with the Gibbs sampler, while the remaining ones are generated with Metropolis-Hastings from appropriate candidate-generating densities. We further show that the corresponding Markov Chain is uniformly ergodic, ensuring its convergence to a stationary distribution independently of the initial state. Results on synthetic data and real biomedical signals indicate that our approach offers advantages in terms of signal reconstruction compared to previously proposed Steepest Descent and Equiangular Tight Frame methods. This paper demonstrates the ability of Bayesian learning to generate parametric dictionaries that can reliably represent the exemplar data and provides the foundation towards inferring the entire variable set of the sparse approximation problem for signal denoising, adaptation and other applications.

Index Terms: Dictionary learning, parametric dictionaries, Bayesian inference, Markov Chain Monte Carlo, sparse representation, uniform ergodicity

I. Introduction

Signal representations are fundamental for denoising, interpolation, estimation, classification and recognition. Recently there has been an increased focus on sparse representations [1], which model a signal with a small number of components from a large overcomplete set of exemplar atoms, called a dictionary. A dictionary $\mathbf{D} \in \Re^{P \times K}$ contains $K$ atoms $\mathbf{d}_1, \ldots, \mathbf{d}_K \in \Re^P$ that constitute the building blocks of a $P$-dimensional signal $\mathbf{x} \in \Re^P$. Specifically, a signal can be expressed as an exact or approximate linear combination of a small number of atoms from the dictionary as $\mathbf{x} \approx \mathbf{D}\mathbf{c}$, where $\mathbf{c} \in \Re^K$ contains the coefficients of the corresponding atoms. In the typical case where $K > P$, infinitely many solutions exist. This can be addressed by imposing a sparsity constraint on $\mathbf{c}$, according to which $\mathbf{x}$ should be represented by the smallest number of dictionary atoms, so that the sparse representation problem can be expressed as an $l_0$-norm minimization.

The problem of minimizing $\|\mathbf{c}\|_0$ subject to the constraint $\mathbf{x} = \mathbf{D}\mathbf{c}$ has been proven to be NP-hard and several directions have been proposed to solve it. One approach includes greedy strategies that abandon exhaustive search in favor of locally optimal updates resulting in sub-optimal solutions. Examples include matching pursuit [2] and orthogonal matching pursuit [3], [4] algorithms. An alternative is the relaxation of the discontinuous $l_0$-norm leading to the more computationally expensive basis pursuit [5] and focal underdetermined system solver [6], [7] reaching global solutions. Bayesian methods with appropriate statistical assumptions have been further used to identify the desired sparse solution [8], [9], [10].

An essential step towards compact and reliable representations is the dictionary selection. Traditionally, analytic predesigned dictionaries comprising Gabor [11], wavelet [12], curvelet [13], or other atoms have been used, because of their localization, directionality and multi-resolution properties. Dictionary learning (DL) focuses on learning atoms from the available training data. It includes several well-known algorithms, such as the K-SVD [14] and the MOD [15], as well as probabilistic approaches [16]. Although non-parametric DL is effective for signal reconstruction [14], restoration [17], and classification [18], it is highly non-convex, mostly yields non-structured dictionaries [19] and typically requires a large amount of training data [20].

These disadvantages can be mitigated by imposing a pre-determined structure through the use of carefully selected knowledge-driven parametric functions mapping a parameter space to structured dictionary atoms. This results in parametric dictionaries bridging the gap between pre-defined analytic dictionaries and purely numerical DL [21]. Dictionary atoms are expressed through an application-specific function ϕ of a parameter set, say dk = ϕ(θk), where θk ∈ ℜQ, Q < P, are the atom parameters optimized with respect to desirable properties [22], [23], [24]. Parametric DL is more likely to converge faster and have more efficient implementations compared to the non-parametric problem [19]. It further provides higher signal interpretability yielding important metainformation [25], [26], [27], [28].

This paper proposes a Bayesian framework for learning the parameters of a dictionary given a predetermined parametric function. The formulation of the sparse representation problem from the Bayesian perspective assumes a probabilistic distribution for all variables in an effort to provide a posterior belief for their values [10], [29]. This allows the estimation of full posteriors rather than single estimates which can result in better handling uncertainty and benefits noise estimation.

Related Work

DL methods usually alternate between sparse decoding and dictionary update [19]. In the context of non-parametric DL, Ophir et al. have proposed a sequential learning algorithm by identifying the orthogonal directions to a data subset [30]. Engan et al. used matrix inversion to compute the dictionary matrix [15], while Aharon et al. introduced K-SVD, which is a constrained optimization approach performing atom-by-atom update [14]. Generalized principal component analysis models the data as a union of low-dimensional subspaces with orthogonal bases [31], while structured dictionaries have been proposed in an effort to enforce additional translation invariant, hierarchical and multiscale properties [32], [33], [34].

Parametric dictionaries offer higher interpretability, lower density of local minima and compact representation [19]. Dictionaries can be learnt by translating elementary signal segments over space and time [35], [36]. They can also contain atoms of predetermined structure, such as wavelets [37] or Gabor functions [22], whose parameters are adapted with steepest-descent (SD) [22] and other least-squares-based methods [25], [37]. Yaghoobi et al. proposed a method to find dictionaries close to the one with minimum coherence, called the equiangular tight frame (ETF) [24]. Finally, Thanou et al. [21] proposed a parametric DL method for signals residing on weighted graphs.

Although the aforementioned deterministic DL methods perform well in various signal processing applications [14], [17], [18], they provide single-point estimates and cannot handle noise uncertainty. Bayesian approaches offer a way to address those disadvantages. They have been proposed for compressed sensing [8], [9], [38], [39] as well as for non-parametric DL.

The problem of ensuring sparseness in Bayesian approximations has been addressed in a variety of ways. Early approaches placed continuous sparsity-promoting distributions on the atom coefficients $\mathbf{c}_n$, such as the Laplacian [40], Cauchy [16], [41] and Student-t [42], [10]. Another way is the use of indicator variables that allow each atom to be sampled independently from the Bernoulli distribution [43], [44], [45]. Despite their computational efficiency, these can yield an arbitrarily large number of non-zero coefficients, which might not be practical for many applications. More in line with our problem, previous studies have explored the use of appropriate probabilistic assumptions for keeping a constant number of dictionary atoms in the representation, with proposal distributions iteratively conditioning on the atoms selected at previous steps [39], [46], [38]. Gaussian assumptions are typically imposed on the dictionary atoms [43], [44] and the signal noise [38], [43], [44], [47], the latter being consistent with biomedical applications [48], [49].

Bayesian inference casts DL into an optimization problem for maximizing the posterior distribution. The overcomplete dictionaries containing many atoms in combination with the large amount of training data yield a high-dimensional framework, for which closed form solutions are usually difficult to derive and approximate inference methods are followed. Common approaches include the evidence maximization [8], [9], relevance vector machine [10], Markov Chain Monte Carlo (MCMC) [38], [44], and variational approximation [43].

Contributions

We propose a Bayesian framework for learning the parameters of dictionary atoms, given a parametric function that maps the dictionary parameters to the actual atoms. Our approach imposes probabilistic distributions to the variables of the sparse representation problem that are estimated through MCMC methods because of their simplicity and ability to fully perform Bayesian inference [50]. Compared to previous Bayesian DL [43], [44], our approach introduces parametric dictionaries where non-closed-form solutions are handled with a combination of Gibbs and Metropolis-Hastings (MH) sampling (MH-within-Gibbs). Our approach differs from previous parametric DL [22], [24] because of its stochastic framework that yields estimation of the full problem variables. This results in parametric dictionaries that take into account the structure of the training data and are less prone to overfitting. The parametric nature of our problem further requires the use of appropriate sparse-imposing priors that keep the selected number of dictionary atoms within a pre-determined range. We perform atom sampling with and without replacement formalized through the Multinomial and the Wallenius’ hypergeometric distribution, respectively.

One key challenge with MCMC is to determine its asymptotic behavior, i.e. whether it provides accurate posterior approximations. The goal is to create an aperiodic and irreducible Markov Chain (MC) with stationary distribution same as the posterior distribution of interest [51]. Irreducibility ensures that any state of the space is accessible, while aperiodicity makes sure that the chain does not return to the same state at regular times. Uniformly ergodic MCs are a special case in which the MC converges to the invariant distribution independently of the initial state. They guarantee geometrically fast convergence and are key sufficient conditions in order to establish central limit theorems for empirical averages and provide consistent estimators of MCMC standard errors [51], [52]. Because of these, we discuss the geometric ergodicity of MCMC in our proposed Bayesian inference framework that ensures convergence. We further perform qualitative and quantitative diagnostics to evaluate the reliability of the generated samples and resulting distributions.

We demonstrate the ability of our algorithm for parallel processing with experiments on synthetic data and real biomedical signals. DL is performed for each sample separately and the resulting dictionaries from each exemplar data are further combined into a unified model. Our results indicate that the proposed approach yields benefits in terms of superior signal reconstruction compared to previous SD [22] and ETF [24] methods. When we have precise a priori knowledge of the optimal parametric function ϕ representing the data, our parametric framework also yields better performance than the classical non-parametric approach of K-SVD [14].

In the following, we provide the problem formulation (Section II) and describe the proposed MCMC approach for learning the parameters for the considered problem (Section III). We describe the parallel implementation framework of our algorithm (Section IV) and discuss the use of various parametric functions and their effect on the sample-generating procedure (Section V). We further analyze the geometric ergodicity properties of the proposed MH-within-Gibbs algorithm yielding uniform convergence (Section VI). In Section VII, we provide experimental results and the results of MCMC diagnostics. Finally, we discuss our results and offer conclusions in Sections VIII and IX.

Notation

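We denote matrices with bold uppercase letters $\mathbf{X}$ and vectors with bold lowercase letters $\mathbf{x}$. Lowercase letters with appropriate numerical indices refer either to the columns of a matrix, i.e. $\mathbf{X} = [\mathbf{x}_1 \ldots \mathbf{x}_N]$, or to the elements of a vector, i.e. $\mathbf{x} = [x_1 \ldots x_N]^T$. We denote by $(\mathbf{X})_{ij}$ the entry of matrix $\mathbf{X}$ in the $i$th row and $j$th column. The $p$th-order norm of a vector is written $\|\mathbf{x}\|_p$, while the $N \times 1$ all-ones vector and the $N \times N$ identity matrix are denoted $\mathbf{1}_N$ and $\mathbf{I}_N$, respectively. The vectorization of matrix $\mathbf{X}$, obtained by stacking its columns on top of one another, is defined as $\text{vec}(\mathbf{X}) = [\mathbf{x}_1^T, \ldots, \mathbf{x}_N^T]^T$. Finally, the gradient and Hessian of a scalar-valued function $f(\mathbf{x})$, $\mathbf{x} \in \Re^N$, are denoted $\nabla f(\mathbf{x}) \in \Re^{N \times 1}$ and $\mathbf{H}_f = \nabla^2 f(\mathbf{x}) \in \Re^{N \times N}$, where $(\nabla f(\mathbf{x}))_i = \partial f / \partial x_i$ and $(\mathbf{H}_f)_{ij} = \partial^2 f / (\partial x_i \partial x_j)$, respectively.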

II. Problem Formulation

Let $\mathbf{X} = [\mathbf{x}_1 \ldots \mathbf{x}_N] \in \Re^{P \times N}$ be a data matrix of $N$ examples $\mathbf{x}_n \in \Re^P$. We formulate the parametric DL problem by assuming an overcomplete dictionary $\mathbf{D}_n \in \Re^{P \times K}$ containing $K$ prototypical atoms $\mathbf{d}_{nk} \in \Re^P$ for each exemplar data $\mathbf{x}_n$ separately. This approach, also found in similar studies [43], enjoys computational benefits compared to batch methods, since it can yield faster reliable estimates of the considered variables (Section III-B) and can be easily parallelized to run on multiple computational threads (Section III-E).

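In the case of parametric dictionaries, the atoms can be expressed with an appropriate domain-specific function $\boldsymbol{\phi} : \Re^Q \to \Re^P$ in terms of a parameter vector $\boldsymbol{\theta}_{nk} \in \Re^Q$, $Q < P$, as $\mathbf{d}_{nk} = \boldsymbol{\phi}(\boldsymbol{\theta}_{nk})$, therefore $\mathbf{D}_n = [\boldsymbol{\phi}(\boldsymbol{\theta}_{n1}) \ldots \boldsymbol{\phi}(\boldsymbol{\theta}_{nK})]$. Typically dictionary atoms have unit $l_2$-norm, i.e. $\|\boldsymbol{\phi}(\boldsymbol{\theta}_{nk})\|_2 = 1$. Each signal $\mathbf{x}_n$ can be expressed as a linear combination of a small number of atoms, $L \ll K$, with additive noise $\boldsymbol{\varepsilon}_n \in \Re^P$

$$\mathbf{x}_n = \mathbf{D}_n \mathbf{c}_n + \boldsymbol{\varepsilon}_n \tag{1}$$

where $\boldsymbol{\varepsilon}_n$ is the error and $\mathbf{c}_n \in \Re^K$, $\|\mathbf{c}_n\|_0 = L$, are the coefficients with non-zero values only for the used atoms. According to the Bayesian framework, each variable in (1) is assumed to follow an underlying probabilistic distribution.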
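As a minimal illustration of (1), the sketch below builds a parametric dictionary and synthesizes one $L$-sparse noisy signal. The atom shape (a Gaussian-windowed cosine with parameters center, width and frequency), the helper names `phi` and `build_dictionary`, and all numeric values are assumptions made for this example; they are not the parametric functions used in the experiments.

```python
import numpy as np

def phi(theta, P):
    """Hypothetical atom: Gaussian-windowed cosine, scaled to unit l2-norm."""
    center, width, freq = theta
    t = np.arange(P)
    window = np.exp(-0.5 * ((t - center) / width) ** 2)
    atom = window * np.cos(2 * np.pi * freq * (t - center))
    return atom / np.linalg.norm(atom)

def build_dictionary(Theta, P):
    """D = [phi(theta_1) ... phi(theta_K)] for a Q x K parameter matrix."""
    return np.column_stack([phi(Theta[:, k], P) for k in range(Theta.shape[1])])

rng = np.random.default_rng(0)
P, K, L = 64, 200, 3                 # signal length, atoms, sparsity (L << K)
Theta = np.vstack([
    rng.uniform(0, P, K),            # centers
    rng.uniform(2.0, 8.0, K),        # widths
    rng.uniform(0.02, 0.2, K),       # frequencies (cycles per sample)
])
D = build_dictionary(Theta, P)

c = np.zeros(K)                      # sparse coefficient vector c_n
support = rng.choice(K, size=L, replace=False)
c[support] = rng.normal(0.0, 1.0, L)

noise_precision = 100.0              # gamma_eps; noise variance is its inverse
eps = rng.normal(0.0, noise_precision ** -0.5, P)
x = D @ c + eps                      # equation (1)
```

Because every atom is a function of only $Q = 3$ parameters here, the dictionary is fully described by the $Q \times K$ matrix `Theta` rather than by $P \times K$ free entries.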

Especially in parametric DL, where we jointly sample the parameters of the selected dictionary atoms (Section III-B), a small number of selected atoms can keep the implementation computationally tractable. For this reason, we are interested in imposing explicit sparseness constraints, similarly to previous studies [38], [39], [46] (Section I). We will describe two different ways to approach this, using the Multinomial and the Wallenius’ hypergeometric distributions, which allow sampling with and without replacement, respectively.

A. Atom Sampling with Replacement

A straightforward method to sample the dictionary atoms is to relax the $l_0$-sparsity constraint into $\|\mathbf{c}_n\|_0 \le L$, allowing independent sampling of the dictionary atoms $L$ times with replacement through the Multinomial distribution. Since the population size is much larger than the sample size ($L \ll K$), duplicate atoms are rare [53]. If we assume a discrete multinomial distribution for selecting one dictionary atom out of the possible $K$, (1) can be re-written as

$$\mathbf{x}_n = \mathbf{D}_n \sum_{l=1}^{L} s_{nl} \mathbf{z}_{nl} + \boldsymbol{\varepsilon}_n \tag{2}$$

where $\mathbf{z}_{nl} \in \bigcup_i \{\mathbf{u}_i\}$, $\mathbf{u}_i = [0, 0, \ldots, 1, \ldots, 0]^T$ with 1 in the $i$th entry ($i \le K$) and 0 in the remaining $K-1$ entries. The binary vector $\mathbf{z}_{nl}$, with $\|\mathbf{z}_{nl}\|_0 = 1$, activates one dictionary atom at a time and $\mathbf{s}_n = [s_{n1} \ldots s_{nL}]^T \in \Re^L$ only contains the coefficients of the selected atoms. If atom $\mathbf{d}_k$ is the $l$th representation term, then $z_{nlk} = 1$, $z_{nlk'} = 0$, $\forall k' \ne k$, and $s_{nl}$ is the $k$th entry of vector $\mathbf{c}_n$ in (1), i.e. $s_{nl} = c_{nk}$. The probability of selecting atom $k$ for data $\mathbf{x}_n$ is $\pi_{nk}$ such that $\mathbf{z}_{nl} \sim \text{Multinomial}(1, \boldsymbol{\pi}_n)$ with $\boldsymbol{\pi}_n = [\pi_{n1}, \ldots, \pi_{nK}]^T \in \Re^K$. If the same atom is selected more than once, the corresponding coefficient is estimated only once.
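The indicator representation in (2) can be sketched as follows. The selection probabilities and coefficients are drawn from arbitrary illustrative distributions, and the variable names are assumptions of this example. Note that the sketch simply sums the coefficients of a repeated atom, whereas the text above estimates such a coefficient only once.

```python
import numpy as np

rng = np.random.default_rng(1)
K, L = 10, 3
pi_n = rng.dirichlet(np.ones(K))        # illustrative selection probabilities

# Each row is one indicator z_nl ~ Multinomial(1, pi_n): a one-hot vector.
Z = rng.multinomial(1, pi_n, size=L)    # shape (L, K)
s_n = rng.normal(0.0, 1.0, L)           # coefficients of the L draws

# Coefficient vector implied by (2): c_n = sum_l s_nl * z_nl.
c_n = Z.T @ s_n
selected = Z.argmax(axis=1)             # atom index chosen at each draw
```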

B. Atom Sampling without Replacement

Sampling without replacement avoids duplicate atoms and keeps the l0-sparsity constraint intact. The problem of selecting L atoms out of the possible K can be formalized similarly to the classical experiment of taking colored balls at random from an urn without replacement [54], [55]. If the balls have a different weight, the result follows the Wallenius’ noncentral hypergeometric distribution [56].

In the considered problem, we can assume $K$ dictionary atoms each of a different type (i.e. each ball in the urn has a different color) and selection probability $\boldsymbol{\pi} = [\pi_1, \ldots, \pi_K]$. If $\mathbf{z}_n = [z_{n1}, \ldots, z_{nK}] \in \Re^K$, $z_{nk} \in \{0, 1\}$ and $\|\mathbf{z}_n\|_0 = L$, indicates the selected atoms for $\mathbf{x}_n$, then (1) becomes

$$\mathbf{x}_n = \mathbf{D}_n (\mathbf{s}_n \circ \mathbf{z}_n) + \boldsymbol{\varepsilon}_n \tag{3}$$

where “◦” represents the Hadamard or entrywise product, $\mathbf{s}_n \in \Re^K$, and $\mathbf{z}_n$ follows the Wallenius distribution.
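The urn experiment behind (3) can be simulated directly: atoms are drawn one at a time, and at each draw a remaining atom is picked with probability proportional to its weight. The function name `sample_without_replacement` and the numeric values below are assumptions of this sketch.

```python
import numpy as np

def sample_without_replacement(pi, L, rng):
    """Draw L distinct indices one at a time; each draw picks a remaining
    index with probability proportional to its weight (biased-urn scheme)."""
    weights = np.asarray(pi, dtype=float).copy()
    chosen = []
    for _ in range(L):
        k = int(rng.choice(len(weights), p=weights / weights.sum()))
        chosen.append(k)
        weights[k] = 0.0              # remove the drawn atom from the urn
    return chosen

rng = np.random.default_rng(2)
K, L = 8, 3
pi = rng.dirichlet(np.ones(K))        # illustrative atom weights
idx = sample_without_replacement(pi, L, rng)

z_n = np.zeros(K, dtype=int)
z_n[idx] = 1                          # binary indicator with exactly L ones
s_n = rng.normal(size=K)
c_n = s_n * z_n                       # Hadamard product in (3)
```

Unlike the multinomial scheme, this construction can never return a duplicate atom, so the constraint $\|\mathbf{z}_n\|_0 = L$ holds exactly.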

C. Additional Probabilistic Assumptions

We assume independent atom coefficients $s_{n1}, \ldots, s_{nL}$, each following a normal distribution with mean $\mu_{s_{nl}}$ and precision $\gamma_s$, i.e. $\mathbf{s}_n \sim \text{Normal}(\boldsymbol{\mu}_{s_n}, \gamma_s^{-1}\mathbf{I}_L)$. We consider a different mean $\mu_{s_{nl}}$ for each exemplar data $n$ and selected dictionary atom $l$ to account for the various signal energy levels and possible atom configurations. We further assume normally distributed dictionary parameters $\boldsymbol{\theta}_{nk} \sim \text{Normal}(\mathbf{g}_{nk}, \mathbf{G}_n^{-1})$ with mean $\mathbf{g}_{nk}$ and precision $\mathbf{G}_n$. Finally, we assume zero-mean Gaussian noise with variance $\gamma_{\varepsilon_n}^{-1}$, i.e. $\boldsymbol{\varepsilon}_n \sim \text{Normal}(\mathbf{0}, \gamma_{\varepsilon_n}^{-1}\mathbf{I}_P)$. Similar distributions have also been hypothesized in prior work [16], [41], [44] (Section I).

The Bayesian framework further treats the parameters of the above variables as random components in order to better capture uncertainty. We introduce conjugate prior distributions that simplify computations. Specifically, we assume that $\boldsymbol{\pi}_n$ follows a Dirichlet prior, i.e. $\boldsymbol{\pi}_n \sim \text{Dirichlet}(\boldsymbol{\alpha})$, where $\boldsymbol{\alpha} = [\alpha_1 \ldots \alpha_K]^T$, and the precision of the Gaussian noise follows a Gamma prior, i.e. $\gamma_{\varepsilon_n} \sim \text{Gamma}(e, f)$. The mean vector $\mathbf{g}_{nk}$ and precision matrix $\mathbf{G}_n$ of the dictionary atom parameters are modeled with Gaussian and Wishart distributions, respectively, i.e. $\mathbf{g}_{nk} \sim \text{Normal}(\mathbf{g}_0, \mathbf{G}_0^{-1})$ and $\mathbf{G}_n \sim \text{Wishart}(\nu_0, \mathbf{R}_0)$.

D. Objective

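The goal is to find $\mathbf{y}_n^*$ such that

$$\mathbf{y}_n^* = \arg\max_{\mathbf{y}_n} \; p(\mathbf{y}_n | \mathbf{x}_n, \mathcal{H}_n) \tag{4}$$

$$\mathbf{y}_n = [\mathbf{z}_n^T, s_{n1}, \ldots, s_{nL}, \boldsymbol{\theta}_{n1}^T, \ldots, \boldsymbol{\theta}_{nK}^T, \mathbf{g}_{n1}^T, \ldots, \mathbf{g}_{nK}^T, \text{vec}(\mathbf{G}_n)^T, \boldsymbol{\varepsilon}_n^T, \boldsymbol{\pi}_n^T, \gamma_{\varepsilon_n}]^T \tag{5}$$

$$\mathcal{H}_n = \{\boldsymbol{\alpha}, \mathbf{g}_0, \mathbf{G}_0, \nu_0, \mathbf{R}_0, e, f, \boldsymbol{\mu}_{s_n}, \gamma_s\} \tag{6}$$

The probabilistic assumptions for (4)–(6) are summarized in Table I.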

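TABLE I.

Prior distributions of Bayesian dictionary learning variables.

| Variable | Type | Expression | (Hyper) Parameters |
|---|---|---|---|
| $\mathbf{z}_{nl} \in \bigcup_i \{\mathbf{u}_i\}$ † | Multinomial $(1, \boldsymbol{\pi}_n)$ | $\prod_{k=1}^{K} \pi_{nk}^{z_{nlk}}$ | $\boldsymbol{\pi}_n$: outcome probability |
| $\mathbf{z}_n \in \Re^K$ ‡ | Wallenius $(\mathbf{1}_K, L, \boldsymbol{\pi}_n)$ | $\int_0^1 \prod_{k=1}^{K} \left(1 - t^{\pi_{nk}/d}\right)^{z_{nk}} dt$, $d = \sum_{k=1}^{K} \pi_{nk}(1 - z_{nk})$ | |
| $\mathbf{s}_n \in \Re^L$ | Normal $(\boldsymbol{\mu}_{s_n}, \gamma_s^{-1}\mathbf{I}_L)$ | $(2\pi)^{-L/2} \gamma_s^{1/2} \exp\left[-\frac{\gamma_s}{2}(\mathbf{s}_n - \boldsymbol{\mu}_{s_n})^T(\mathbf{s}_n - \boldsymbol{\mu}_{s_n})\right]$ | $\boldsymbol{\mu}_{s_n}$: mean; $\gamma_s$: precision |
| $\boldsymbol{\theta}_{nk} \in \Re^Q$ | Normal $(\mathbf{g}_{nk}, \mathbf{G}_n^{-1})$ | $(2\pi)^{-Q/2} \lvert\mathbf{G}_n\rvert^{1/2} \exp\left[-\frac{1}{2}(\boldsymbol{\theta}_{nk} - \mathbf{g}_{nk})^T \mathbf{G}_n (\boldsymbol{\theta}_{nk} - \mathbf{g}_{nk})\right]$ | $\mathbf{g}_{nk}$: mean; $\mathbf{G}_n$: precision |
| $\boldsymbol{\varepsilon}_n \in \Re^P$ | Normal $(\mathbf{0}, \gamma_{\varepsilon_n}^{-1}\mathbf{I}_P)$ | $(2\pi)^{-P/2} \gamma_{\varepsilon_n}^{1/2} \exp\left[-\frac{\gamma_{\varepsilon_n}}{2} \boldsymbol{\varepsilon}_n^T \boldsymbol{\varepsilon}_n\right]$ | $\gamma_{\varepsilon_n}$: precision |
| $\boldsymbol{\pi}_n \in \Re^K$ | Dirichlet $(\boldsymbol{\alpha})$ | $\frac{\Gamma\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_{nk}^{\alpha_k - 1}$ | $\boldsymbol{\alpha}$: concentration parameters |
| $\mathbf{g}_{nk} \in \Re^Q$ | Normal $(\mathbf{g}_0, \mathbf{G}_0^{-1})$ | $(2\pi)^{-Q/2} \lvert\mathbf{G}_0\rvert^{1/2} \exp\left[-\frac{1}{2}(\mathbf{g}_{nk} - \mathbf{g}_0)^T \mathbf{G}_0 (\mathbf{g}_{nk} - \mathbf{g}_0)\right]$ | $\mathbf{g}_0$: mean; $\mathbf{G}_0$: precision |
| $\mathbf{G}_n \in \Re^{Q \times Q}$ | Wishart $(\nu_0, \mathbf{R}_0)$ | $\left(\lvert\mathbf{G}_n\rvert^{\frac{\nu_0 - Q - 1}{2}} \exp\left[-\frac{\text{tr}(\mathbf{R}_0^{-1}\mathbf{G}_n)}{2}\right]\right) \left(2^{\frac{Q\nu_0}{2}} \lvert\mathbf{R}_0\rvert^{\frac{\nu_0}{2}} \Gamma_Q\left(\frac{\nu_0}{2}\right)\right)^{-1}$ | $\nu_0 > Q - 1$: dof; $\mathbf{R}_0$: scale matrix |
| $\gamma_{\varepsilon_n} \in \Re$ | Gamma $(e, f)$ | $\frac{f^e}{\Gamma(e)} \gamma_{\varepsilon_n}^{e-1} \exp\left[-f\gamma_{\varepsilon_n}\right]$ | $e$: shape; $f$: rate |

$\Gamma$, $\Gamma_Q$: (multivariate) gamma functions, dof: degrees of freedom, $\mathbf{1}_K = [1, \ldots, 1]^T \in \Re^K$

†: atom sampling with replacement ($\|\mathbf{z}_{nl}\|_0 = 1$), $\mathbf{u}_i = [0, 0, \ldots, 1, \ldots, 0]^T$ with 1 in the $i$th entry, 0 otherwise

‡: atom sampling without replacement ($\|\mathbf{z}_n\|_0 = L$)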

III. Inference with MCMC Sampling

The inference problem aims at finding solutions $\mathbf{y}_n^*$, $n = 1, \ldots, N$, that maximize (4). Since (4) is not analytically tractable, we use MCMC for approximate inference because of its simple and fast implementation. We describe the inference procedure in Section III-A and provide the corresponding derivations in Sections III-B, III-C, and III-D.

A. MCMC Sampling

The large number of variables in our problem makes simultaneous sampling from the full posterior prohibitive. For this reason, we divide the variable space $\mathbf{y}$ into blocks, a technique which is usually referred to as “block-at-a-time” MCMC [57], [58]. Suppose that the variable space is split into $B$ blocks specified according to the problem characteristics, i.e. $\mathbf{y} = [\mathbf{y}_1^T, \ldots, \mathbf{y}_B^T]^T$. Without loss of generality, we assume these blocks are sampled sequentially from $\mathbf{y}_1$ to $\mathbf{y}_B$. This is referred to as “deterministic-scan” MCMC sampling [59], [60], [61] and will be the focus of our paper.

When the posterior probability of a block $b$ yields a known probabilistic distribution, we use the Gibbs sampler; otherwise we sample based on the MH algorithm [62], [63]. MH generates a sample with a candidate-generating (or proposal) density $q$. The MC transitions to the generated sample with a predefined probability of move. A critical component is the selection of the proposal density, based on which MH samplers are divided into two categories. The first is the random-walk MH [62], with samples generated around the current value of the corresponding variables. Despite its simplicity, this method often exhibits slow convergence depending on the variance of $q$ [58], [64]. The second type, called independent MH [63], samples independently of the previous state. Its proposal density is close to the target distribution in a certain sense, which benefits convergence. Previous work has considered Student-t distributions tailored to the target density [65], [66], [67], whose long tails ensure that no areas of the state space are left unexplored. The “MH-within-Gibbs” sampler uses MH to generate samples for blocks whose posterior does not yield a known distribution, and Gibbs to generate samples for the remaining blocks.
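To make the independent MH mechanism concrete, the following one-dimensional sketch draws candidates from a Student-t proposal that does not depend on the current state and accepts them with the usual ratio of importance weights. The toy Gaussian target, the function names and the proposal settings are assumptions chosen for illustration only.

```python
import numpy as np

def log_target(x):
    """Unnormalized log-density of an assumed toy target: N(1, 0.5^2)."""
    return -0.5 * ((x - 1.0) / 0.5) ** 2

def log_t(x, loc, scale, df):
    """Student-t log-density up to an additive constant."""
    return -0.5 * (df + 1.0) * np.log1p(((x - loc) / scale) ** 2 / df)

def independent_mh(n_samples, loc, scale, df, rng, x0=0.0):
    x, accepted = x0, 0
    samples = np.empty(n_samples)
    for m in range(n_samples):
        y = loc + scale * rng.standard_t(df)   # proposal ignores current x
        # log acceptance ratio: [p(y) / q(y)] / [p(x) / q(x)]
        log_alpha = (log_target(y) - log_t(y, loc, scale, df)) \
            - (log_target(x) - log_t(x, loc, scale, df))
        if np.log(rng.uniform()) < log_alpha:
            x, accepted = y, accepted + 1
        samples[m] = x
    return samples, accepted / n_samples

rng = np.random.default_rng(3)
samples, acc_rate = independent_mh(5000, loc=1.0, scale=0.5, df=4, rng=rng)
```

Because the proposal has heavier tails than the target, the weight $p/q$ stays bounded, which is the property that motivates the Student-t choice discussed above.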

We will further discuss the use of the “MH-within-Gibbs” sampler for our problem and the derivation of posteriors for each variable (Table II). We will assume that $\mathbf{y}_{\setminus \mathbf{y}_b}$ contains all variables except the ones included in the $b$th block (i.e. $\mathbf{y}_b$), which are currently being generated.

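TABLE II.

Description of Metropolis-Hastings-within-Gibbs sampling distributions.

| Variable | Sampling Distribution/Proposal | Sampling Method |
|---|---|---|
| $\mathbf{z}_{nl} \in \Re^K$ † | Multinomial $(1, \mathbf{p}_{nl})$, $p_{nlk} = \pi_{nk} \exp\left[-\frac{1}{2}\gamma_{\varepsilon_n} s_{nl}\left(s_{nl} - 2\boldsymbol{\varepsilon}_{nl}^T \boldsymbol{\phi}(\boldsymbol{\theta}_{nk})\right)\right]$ | Gibbs |
| $\mathbf{z}_n \in \Re^K$ ‡ | Wallenius $(\mathbf{1}_K, L, \boldsymbol{\pi}_n)$ | Metropolis-Hastings |
| $s_{nl} \in \Re$ | Normal $\left(\frac{\gamma_s \mu_{s_{nl}} + \gamma_{\varepsilon_n}(\mathbf{D}_n \mathbf{z}_{nl})^T \boldsymbol{\varepsilon}_{nl}}{\gamma_s + \gamma_{\varepsilon_n}}, (\gamma_s + \gamma_{\varepsilon_n})^{-1}\right)$ | Gibbs |
| $\tilde{\boldsymbol{\theta}}_n = \text{vec}([\boldsymbol{\theta}_{nk_1} \ldots \boldsymbol{\theta}_{nk_L}]) \in \Re^{QL}$ * | Student-t $(\hat{\boldsymbol{\mu}}_{\tilde{\theta}_n}, \hat{\mathbf{V}}_{\tilde{\theta}_n}, \nu_1)$ | Metropolis-Hastings |
| $\boldsymbol{\theta}_{nk} \in \Re^Q$, $k \notin \mathcal{I}_{D_n}$ | Normal $(\mathbf{g}_{nk}, \mathbf{G}_n^{-1})$ | Gibbs |
| $\boldsymbol{\varepsilon}_n \in \Re^P$ | Normal $(\mathbf{0}, \gamma_{\varepsilon_n}^{-1}\mathbf{I}_P)$ | Gibbs |
| $\boldsymbol{\pi}_n \in \Re^K$ | Dirichlet $\left(\left[\alpha_1 + \sum_{l=1}^{L} z_{nl1}, \ldots, \alpha_K + \sum_{l=1}^{L} z_{nlK}\right]^T\right)$ | Gibbs |
| $\mathbf{g}_{nk} \in \Re^Q$ | Normal $\left((\mathbf{G}_n + \mathbf{G}_0)^{-1}(\mathbf{G}_n^T \boldsymbol{\theta}_{nk} + \mathbf{G}_0^T \mathbf{g}_0), (\mathbf{G}_n + \mathbf{G}_0)^{-1}\right)$ | Gibbs |
| $\mathbf{G}_n \in \Re^{Q \times Q}$ | Wishart $\left(\nu_0 + K, \left[\mathbf{R}_0^{-1} + \sum_{k=1}^{K}(\boldsymbol{\theta}_{nk} - \mathbf{g}_{nk})(\boldsymbol{\theta}_{nk} - \mathbf{g}_{nk})^T\right]^{-1}\right)$ | Gibbs |
| $\gamma_{\varepsilon_n} \in \Re$ | Gamma $\left(e + \frac{1}{2}, f + \frac{1}{2}\left\|\mathbf{x}_n - \sum_{l=1}^{L} s_{nl} \boldsymbol{\phi}(\boldsymbol{\theta}_{nk_l})\right\|^2\right)$ | Gibbs |

*: $\mathbf{x}_n = \sum_{l=1}^{L} s_{nl} \boldsymbol{\phi}(\boldsymbol{\theta}_{nk_l}) + \boldsymbol{\varepsilon}_n$, $\mathcal{I}_{D_n} = \{k_1, \ldots, k_L\}$, $\mathbf{1}_K = [1, \ldots, 1]^T \in \Re^K$

†, ‡: atom sampling with/without replacement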

B. Sampling Dictionary Parameters

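Let $\mathcal{I}_{D_n} = \{k_1, \ldots, k_L\}$ be the indices of dictionary atoms that represent signal $\mathbf{x}_n$ and $\boldsymbol{\theta}_{nk_1}, \ldots, \boldsymbol{\theta}_{nk_L}$ the corresponding atom parameters. The joint posterior of $\tilde{\boldsymbol{\theta}}_n = \text{vec}([\boldsymbol{\theta}_{nk_1} \ldots \boldsymbol{\theta}_{nk_L}]) \in \Re^{QL}$ can be written as

$$
\begin{aligned}
p(\tilde{\boldsymbol{\theta}}_n | \mathbf{y}_{\setminus \tilde{\boldsymbol{\theta}}_n}, \mathbf{x}_n, \mathcal{H}_n) &\propto \prod_{k \in \mathcal{I}_{D_n}} p(\boldsymbol{\theta}_{nk} | \mathbf{g}_{nk}, \mathbf{G}_n) \cdot p\left(\mathbf{x}_n - \sum_{l=1}^{L} s_{nl} \boldsymbol{\phi}(\boldsymbol{\theta}_{nk_l}) \,\Big|\, \gamma_{\varepsilon_n}\right) \\
&\propto \exp\left[-\frac{\gamma_{\varepsilon_n}}{2} \left\|\mathbf{x}_n - \sum_{l=1}^{L} s_{nl} \boldsymbol{\phi}(\boldsymbol{\theta}_{nk_l})\right\|_2^2\right] \exp\left[-\frac{1}{2} \sum_{k \in \mathcal{I}_{D_n}} (\boldsymbol{\theta}_{nk} - \mathbf{g}_{nk})^T \mathbf{G}_n (\boldsymbol{\theta}_{nk} - \mathbf{g}_{nk})\right] \\
&\propto \exp\left[-\frac{\gamma_{\varepsilon_n}}{2} \left\|\mathbf{x}_n - \mathbf{S}_n \tilde{\boldsymbol{\phi}}_n(\tilde{\boldsymbol{\theta}}_n)\right\|_2^2\right] \exp\left[-\frac{1}{2} (\tilde{\boldsymbol{\theta}}_n - \tilde{\mathbf{g}}_n)^T \tilde{\mathbf{G}}_n (\tilde{\boldsymbol{\theta}}_n - \tilde{\mathbf{g}}_n)\right]
\end{aligned} \tag{7}
$$

where $\tilde{\boldsymbol{\phi}}_n(\tilde{\boldsymbol{\theta}}_n) = \text{vec}([\boldsymbol{\phi}(\boldsymbol{\theta}_{nk_1}) \ldots \boldsymbol{\phi}(\boldsymbol{\theta}_{nk_L})]) \in \Re^{PL}$, $\tilde{\mathbf{g}}_n = \text{vec}([\mathbf{g}_{nk_1} \ldots \mathbf{g}_{nk_L}]) \in \Re^{QL}$, $\tilde{\mathbf{G}}_n \in \Re^{QL \times QL}$ is the block-diagonal matrix with $L$ copies of $\mathbf{G}_n$ on its diagonal,

$$\tilde{\mathbf{G}}_n = \begin{pmatrix} \mathbf{G}_n & & \\ & \ddots & \\ & & \mathbf{G}_n \end{pmatrix}$$

and, consistently with the equality between the last two lines of (7), $\mathbf{S}_n = [s_{n1}\mathbf{I}_P \ldots s_{nL}\mathbf{I}_P] \in \Re^{P \times PL}$, so that $\mathbf{S}_n \tilde{\boldsymbol{\phi}}_n(\tilde{\boldsymbol{\theta}}_n) = \sum_{l=1}^{L} s_{nl} \boldsymbol{\phi}(\boldsymbol{\theta}_{nk_l})$.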

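Because of $\boldsymbol{\phi}$, we cannot find a known probabilistic distribution for (7), therefore we use MH for sampling $\tilde{\boldsymbol{\theta}}_n$. The proposal density is a multivariate Student-t distribution with $\nu_1$ degrees of freedom, location $\hat{\boldsymbol{\mu}}_{\tilde{\theta}_n}$ tailored to the target density, and scaled-identity scale matrix $\hat{\mathbf{V}}_{\tilde{\theta}_n}$, defined as

$$q(\tilde{\boldsymbol{\theta}}_n | \mathbf{x}_n, \mathcal{H}_n, \mathbf{y}_{\setminus \tilde{\boldsymbol{\theta}}_n}) = \frac{\Gamma\left(\frac{\nu_1 + Q}{2}\right)}{\Gamma\left(\frac{\nu_1}{2}\right) [\pi(\nu_1 + Q)]^{\frac{Q}{2}} \lvert\hat{\mathbf{V}}_{\tilde{\theta}_n}\rvert^{\frac{1}{2}}} \left[1 + \frac{1}{\nu_1 + Q} (\tilde{\boldsymbol{\theta}}_n - \hat{\boldsymbol{\mu}}_{\tilde{\theta}_n})^T \hat{\mathbf{V}}_{\tilde{\theta}_n}^{-1} (\tilde{\boldsymbol{\theta}}_n - \hat{\boldsymbol{\mu}}_{\tilde{\theta}_n})\right]^{-\frac{\nu_1 + Q}{2}} \tag{8}$$

$$\hat{\boldsymbol{\mu}}_{\tilde{\theta}_n} = \arg\max_{\tilde{\boldsymbol{\theta}}_n} \; \ln\left(p(\tilde{\boldsymbol{\theta}}_n | \mathbf{y}_{\setminus \tilde{\boldsymbol{\theta}}_n}, \mathbf{x}_n, \mathcal{H}_n)\right) \tag{9}$$

$$\hat{\mathbf{V}}_{\tilde{\theta}_n} = \tau^2 \mathbf{I}_{QL} \tag{10}$$

The choice of Student-t is crucial for the uniform ergodicity of MCMC (Section VI). In the experiments (Section VII), we find a numerical solution to (9) using the Nelder-Mead simplex method [68] because of its simplicity compared to gradient-based approaches. From (8)–(9), we are able to jointly generate the parameters of the selected atoms, rather than generating each one separately. A large number of selected atoms increases the computational cost of maximizing (9).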
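The joint MH update of the selected atom parameters can be sketched as below. Several simplifications are assumptions of this example and not part of the method described above: the atom is a hypothetical one-parameter Gaussian bump ($Q = 1$), the Student-t density uses the textbook parameterization rather than the exact form of (8), and the location in (9) is found with a coarse grid search standing in for the Nelder-Mead simplex.

```python
import numpy as np

def phi(theta, P):
    """Hypothetical atom with Q = 1: unit-norm Gaussian bump centered at theta[0]."""
    t = np.arange(P)
    atom = np.exp(-0.5 * ((t - theta[0]) / 3.0) ** 2)
    return atom / np.linalg.norm(atom)

def log_post(theta_tilde, x, s, g, G, gamma_eps, Q):
    """Log of (7) up to a constant: Gaussian likelihood plus Gaussian prior."""
    thetas = theta_tilde.reshape(-1, Q)          # one row per selected atom
    recon = sum(s_l * phi(th, x.size) for s_l, th in zip(s, thetas))
    d = thetas - g
    prior = -0.5 * np.einsum("lq,qr,lr->", d, G, d)
    return -0.5 * gamma_eps * np.sum((x - recon) ** 2) + prior

def log_t(v, loc, tau, df):
    """Multivariate Student-t log-density, scale tau^2 I, up to a constant."""
    return -0.5 * (df + v.size) * np.log1p(np.sum((v - loc) ** 2) / (df * tau ** 2))

def draw_t(loc, tau, df, rng):
    """Student-t draw: Gaussian vector divided by sqrt(chi-square / df)."""
    return loc + tau * rng.standard_normal(loc.size) / np.sqrt(rng.chisquare(df) / df)

def mh_step(theta, logp, loc, tau, df, rng):
    """One independent MH move; returns the new state and an acceptance flag."""
    cand = draw_t(loc, tau, df, rng)
    log_alpha = (logp(cand) - log_t(cand, loc, tau, df)) \
        - (logp(theta) - log_t(theta, loc, tau, df))
    return (cand, True) if np.log(rng.uniform()) < log_alpha else (theta, False)

rng = np.random.default_rng(4)
P, Q = 40, 1
s = np.array([1.0, -0.7])                        # coefficients of L = 2 atoms
x = s[0] * phi([12.0], P) + s[1] * phi([28.0], P) + rng.normal(0.0, 0.02, P)
g, G, gamma_eps = np.array([[10.0], [30.0]]), 0.01 * np.eye(Q), 2500.0
logp = lambda th: log_post(th, x, s, g, G, gamma_eps, Q)

# Stand-in for the maximization in (9): coarse grid search, not Nelder-Mead.
grid = np.arange(0.0, P, 1.0)
loc = max((np.array([a, b]) for a in grid for b in grid), key=logp)

theta, n_acc, chain = loc.copy(), 0, []
for _ in range(300):
    theta, acc = mh_step(theta, logp, loc, tau=0.3, df=4, rng=rng)
    n_acc += acc
    chain.append(theta)
chain = np.array(chain)
```

The grid search evaluates the log-posterior $40^2$ times for two atoms, which illustrates why the cost of locating the proposal grows quickly with the number of jointly sampled atoms.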

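The posterior for the parameters of the remaining atoms $\boldsymbol{\theta}_{nk}$, $k \notin \mathcal{I}_{D_n}$, generated with Gibbs, is

$$p(\boldsymbol{\theta}_{nk} | \mathbf{y}_{\setminus \boldsymbol{\theta}_{nk}}, \mathbf{x}_n, \mathcal{H}_n) \propto p(\boldsymbol{\theta}_{nk} | \mathbf{g}_{nk}, \mathbf{G}_n) \tag{11}$$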

C. Sampling Atom Indices and Coefficients

The probabilistic selection of dictionary atoms gives the opportunity to test unseen combinations. This can alleviate disadvantages of greedy algorithms, since the randomization might overcome locally optimal solutions, as also observed in [39]. Because the atom indices and coefficients are interdependent, we will describe the sampling procedure for both.

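In the case of sampling with replacement, the atom indices and coefficients are generated with the Gibbs sampler, as their posterior can be analytically derived based on the assumptions for their priors (Section II). In the following, we will denote by $\boldsymbol{\varepsilon}_{nl} = \mathbf{x}_n - \sum_{l' \ne l} s_{nl'} \mathbf{D}_n \mathbf{z}_{nl'}$ the error for the exemplar data $\mathbf{x}_n$ when we exclude the $l$th atom from the representation. Then the total representation error can be expressed as

$$\boldsymbol{\varepsilon}_n = \boldsymbol{\varepsilon}_{nl} - s_{nl} \mathbf{D}_n \mathbf{z}_{nl} = \boldsymbol{\varepsilon}_{nl} - s_{nl} \sum_{k=1}^{K} z_{nlk} \boldsymbol{\phi}(\boldsymbol{\theta}_{nk}) \tag{12}$$

The posterior distribution of $\mathbf{z}_{nl}$ can be written as

$$
\begin{aligned}
p(\mathbf{z}_{nl} | \mathbf{y}_{\setminus \{\mathbf{z}_{nl}\}}, \mathbf{x}_n, \mathcal{H}_n) &\propto p(\mathbf{z}_{nl} | \boldsymbol{\pi}_n, L) \cdot p(\boldsymbol{\varepsilon}_n | \gamma_{\varepsilon_n}) \\
&\propto \prod_{k=1}^{K} \pi_{nk}^{z_{nlk}} \cdot \exp\left(-\frac{\gamma_{\varepsilon_n}}{2} \left\|\boldsymbol{\varepsilon}_{nl} - s_{nl} \sum_{k=1}^{K} z_{nlk} \boldsymbol{\phi}(\boldsymbol{\theta}_{nk})\right\|_2^2\right) \\
&\propto \prod_{k=1}^{K} \left[\pi_{nk} \exp\left(\gamma_{\varepsilon_n} s_{nl} \boldsymbol{\varepsilon}_{nl}^T \boldsymbol{\phi}(\boldsymbol{\theta}_{nk})\right)\right]^{z_{nlk}}
\end{aligned} \tag{13}
$$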

For deriving the above expression we took into account that $z_{nlk} \in \{0, 1\}$ is a binary variable, implying that $z_{nlk}^2 = z_{nlk}$ and $\exp(a z_{nlk}) = [\exp(a)]^{z_{nlk}}$, $\forall a \in \Re$. Also $\mathbf{z}_{nl}$ has unit $l_0$-norm, i.e. $\|\mathbf{z}_{nl}\|_0 = 1$, resulting in $z_{nlk} z_{nlk'} = 0$, $\forall k' \ne k$. As indicated in (13), this update procedure considers the similarity of dictionary atoms to the signal residual and the prior knowledge for selecting atom $k$. Finally, the posterior of $s_{nl}$ is

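$$p(s_{nl} | \mathbf{y}_{\setminus s_{nl}}, \mathbf{x}_n, \mathcal{H}_n) \propto p(s_{nl} | \gamma_s) \cdot p(\boldsymbol{\varepsilon}_n | \gamma_{\varepsilon_n}) \propto \exp\left[-\frac{\gamma_s}{2}(s_{nl} - \mu_{s_{nl}})^2 - \frac{\gamma_{\varepsilon_n}}{2} \left\|\boldsymbol{\varepsilon}_{nl} - s_{nl} \mathbf{D}_n \mathbf{z}_{nl}\right\|_2^2\right] \tag{14}$$

By completing the square of the above quadratic expression with respect to $s_{nl}$ and assuming dictionary atoms of unit norm, $s_{nl}$ can be generated from a normal distribution (Table II).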
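The two Gibbs updates above translate into a few lines of code. The sketch follows the sampling distributions listed in Table II for the with-replacement case; the function names, the identity toy dictionary and the numeric values are assumptions of this example.

```python
import numpy as np

def sample_index(pi, D, resid, s_l, gamma_eps, rng):
    """Gibbs draw of the atom index for term l, given the residual
    computed without term l (unit-norm atoms assumed)."""
    log_p = np.log(pi) - 0.5 * gamma_eps * s_l * (s_l - 2.0 * (D.T @ resid))
    p = np.exp(log_p - log_p.max())      # stabilize before normalizing
    return int(rng.choice(len(pi), p=p / p.sum()))

def sample_coeff(d_k, resid, mu_s, gamma_s, gamma_eps, rng):
    """Gibbs draw of the coefficient of a unit-norm atom d_k,
    obtained by completing the square in (14)."""
    prec = gamma_s + gamma_eps
    mean = (gamma_s * mu_s + gamma_eps * (d_k @ resid)) / prec
    return rng.normal(mean, prec ** -0.5)

rng = np.random.default_rng(8)
K = 5
D = np.eye(K)                            # toy dictionary with unit-norm atoms
x = 2.0 * D[:, 3] + rng.normal(0.0, 0.1, K)
pi = np.full(K, 1.0 / K)
gamma_s, gamma_eps, mu_s = 1.0, 100.0, 0.0

s_l = 1.0                                # L = 1, so the residual is x itself
for _ in range(5):
    k = sample_index(pi, D, x, s_l, gamma_eps, rng)
    s_l = sample_coeff(D[:, k], x, mu_s, gamma_s, gamma_eps, rng)
```

In this toy setting the index concentrates on the atom that matches the signal, and the coefficient settles near a precision-weighted average of the prior mean and the projection of the residual onto the atom.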

To the best of our knowledge, we found no conjugate prior for the Wallenius’ hypergeometric distribution [53], therefore we used the independent MH with proposal distribution tailored to the Wallenius prior for sampling without replacement.

D. Sampling the Parameters of the Priors

The conjugate assumptions for the distribution of the parameters of the aforementioned variables allow us to use the Gibbs sampler to generate the corresponding samples. For the sake of completeness, we briefly sketch the derivation of the posteriors with sampling distributions shown in Table II.

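$$p(\boldsymbol{\pi}_n | \mathbf{y}_{\setminus \boldsymbol{\pi}_n}, -) \propto p(\boldsymbol{\pi}_n | \boldsymbol{\alpha}) \prod_{l=1}^{L} p(\mathbf{z}_{nl} | \boldsymbol{\pi}_n, \boldsymbol{\alpha}) \tag{15}$$

$$p(\mathbf{g}_{nk} | \mathbf{y}_{\setminus \mathbf{g}_{nk}}, -) \propto p(\boldsymbol{\theta}_{nk} | \mathbf{g}_{nk}, \mathbf{G}_n) \cdot p(\mathbf{g}_{nk} | \mathbf{g}_0, \mathbf{G}_0) \tag{16}$$

$$p(\mathbf{G}_n | \mathbf{y}_{\setminus \mathbf{G}_n}, -) \propto \prod_{k=1}^{K} p(\boldsymbol{\theta}_{nk} | \mathbf{g}_{nk}, \mathbf{G}_n) \cdot p(\mathbf{G}_n | \nu_0, \mathbf{R}_0) \tag{17}$$

$$p(\gamma_{\varepsilon_n} | \mathbf{y}_{\setminus \gamma_{\varepsilon_n}}, -) \propto p(\gamma_{\varepsilon_n} | e, f) \cdot p\left(\mathbf{x}_n - \mathbf{D}_n \sum_{l=1}^{L} s_{nl} \mathbf{z}_{nl} \,\Big|\, \gamma_{\varepsilon_n}\right) \tag{18}$$

where the dash “−” denotes $\{\mathbf{x}_n, \mathcal{H}_n\}$.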
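Two of these conjugate updates, for the selection probabilities in (15) and the parameter means in (16), can be sketched as follows, using the sampling distributions of Table II. The function names and the toy values are assumptions of this example; the precision matrices are taken to be symmetric.

```python
import numpy as np

def sample_pi(alpha, Z, rng):
    """Dirichlet posterior: prior concentrations plus per-atom selection counts.
    Z is the L x K matrix whose rows are the one-hot indicators z_nl."""
    return rng.dirichlet(alpha + Z.sum(axis=0))

def sample_g(theta_k, G_n, g0, G0, rng):
    """Gaussian posterior of the mean g_nk under precisions G_n and G0."""
    cov = np.linalg.inv(G_n + G0)
    mean = cov @ (G_n @ theta_k + G0 @ g0)
    return rng.multivariate_normal(mean, cov)

rng = np.random.default_rng(7)
alpha = np.ones(4)
Z = np.array([[0, 1, 0, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0]])             # L = 3 draws over K = 4 atoms
pi_draws = np.array([sample_pi(alpha, Z, rng) for _ in range(2000)])

G_n, G0 = 4.0 * np.eye(2), np.eye(2)
theta_k, g0 = np.array([1.0, 1.0]), np.zeros(2)
g_draws = np.array([sample_g(theta_k, G_n, g0, G0, rng) for _ in range(2000)])
```

The posterior mean of $\mathbf{g}_{nk}$ is a precision-weighted average of the current atom parameters and the prior mean, which is why a large $\mathbf{G}_n$ pulls it towards $\boldsymbol{\theta}_{nk}$.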

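Algorithm 1.

MCMC inference of parametric dictionary learning variables

Require: Data $\mathbf{x}_n$, hyperparameters $\mathcal{H}_n$
1: for $n = 1, \ldots, N$ do
2:   for $m = 1, \ldots, M$ do
3:     Sample atom indices $\mathbf{z}_{n1}^{(m)}, \ldots, \mathbf{z}_{nL}^{(m)}$ or $\mathbf{z}_n^{(m)}$
4:     Find $\mathcal{I}_{D_n}^{(m)} = \{k_1, \ldots, k_L\}$ s.t. $z_{nlk_l}^{(m)} = 1$
5:     Sample atom coefficients $s_{n1}^{(m)}, \ldots, s_{nL}^{(m)}$
6:     Sample noise vector $\boldsymbol{\varepsilon}_n^{(m)}$
7:     Sample dictionary parameters $\boldsymbol{\theta}_{nk}^{(m)}$, $k \notin \mathcal{I}_{D_n}^{(m)}$
8:     Sample dictionary priors $\mathbf{g}_{n1}^{(m)}, \ldots, \mathbf{g}_{nK}^{(m)}, \mathbf{G}_n^{(m)}$
9:     Sample atom selection probability priors $\boldsymbol{\pi}_n^{(m)}$
10:     Sample noise precision $\gamma_{\varepsilon_n}^{(m)}$
11:     Sample $\tilde{\boldsymbol{\theta}}_n^{(m)}$ for generating $\boldsymbol{\theta}_{nk}^{(m)}$, $k \in \mathcal{I}_{D_n}^{(m)}$
12:     Compute $\mathbf{D}_n^{(m)} = [\boldsymbol{\phi}(\boldsymbol{\theta}_{n1}^{(m)}) \ldots \boldsymbol{\phi}(\boldsymbol{\theta}_{nK}^{(m)})]$
13:   end for
14: end for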

E. Implementation of Bayesian DL

The problem variables (5) are inferred for each exemplar data separately. This method can yield signal-specific estimations of the noise variance, useful for denoising applications. It also allows sharing computational cost in MCMC inference, which typically requires a large amount of iterations to converge. Since more training samples are likely to require more MCMC iterations, if we had trained one dictionary on the entire dataset, the sequential nature of MCMC would have significantly slowed down convergence. It is worth mentioning that similar studies have acknowledged the difficulty of performing Bayesian inference in large datasets [43]. The MCMC inference procedure is outlined in Algorithm 1.

IV. Combination of Generated Dictionaries

The variables of the considered Bayesian problem are inferred for each exemplar data separately, yielding sample-specific information useful for denoising and other applications, and allowing parallel implementations (Sections II, III). The latter is beneficial because the large amount of data in many applications renders batch DL methods computationally expensive or even prohibitive. However, in order to obtain generalizable dictionaries that are able to reliably represent unseen data, we need to combine the corresponding results into a unified model (Algorithm 2). While the Bayesian framework aims to maximize the posterior probability of the model parameters, DL is usually evaluated based on the reconstruction error. For this reason, the combination of the dictionaries that result from the Bayesian inference is performed based on a root mean square (RMS) error criterion.

Let $\mathbf{D}_n^{(m)}$ be the dictionaries generated with the MH-within-Gibbs sampler (Section III-E). In order to evaluate their performance using signal reconstruction criteria, we need a sparse decomposition algorithm that can represent the original signal based on the inferred dictionaries. We use OMP [3], [4] because of its simplicity and effectiveness. Let $Err_n^{(m)}$ be the relative RMS error that results from decomposing $\mathbf{x}_n$ based on dictionary $\mathbf{D}_n^{(m)}$. Also let $\mathbf{D}_n^{(m_n^*)} \in \Re^{P \times K}$ be the dictionaries that yield the lowest relative RMS error for each signal $n$ and $\boldsymbol{\Theta}^{(m_n^*)} = [\boldsymbol{\theta}_{nk_1}^{(m_n^*)}, \ldots, \boldsymbol{\theta}_{nk_L}^{(m_n^*)}] \in \Re^{Q \times L}$ the parameters of the atoms that were selected by OMP based on $\mathbf{D}_n^{(m_n^*)}$. The matrices $\boldsymbol{\Theta}^{(m_n^*)}$ are concatenated into a unified dictionary

$$\boldsymbol{\Theta}_U = [\boldsymbol{\Theta}^{(m_1^*)} \ldots \boldsymbol{\Theta}^{(m_N^*)}] \in \Re^{Q \times NL}$$

$$m_n^* = \arg\min_{m \in [1, M]} Err_n^{(m)}, \quad n = 1, \ldots, N$$
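The selection of $m_n^*$ can be sketched with a compact OMP routine. The normalization used for the relative RMS error (residual norm divided by signal norm), the function names, and the square orthonormal toy dictionaries (chosen only so that the demo is deterministic, not because dictionaries need to be square) are assumptions of this example.

```python
import numpy as np

def omp(D, x, L):
    """Orthogonal matching pursuit: greedily pick L atoms and refit the
    coefficients of all picked atoms by least squares after each pick."""
    support, resid, coef = [], x.copy(), np.zeros(0)
    for _ in range(L):
        corr = np.abs(D.T @ resid)
        if support:
            corr[support] = -np.inf          # never reselect an atom
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        resid = x - D[:, support] @ coef
    return support, coef

def relative_rms(x, x_hat):
    """Relative RMS error; this normalization is an assumption of the sketch."""
    return np.linalg.norm(x - x_hat) / np.linalg.norm(x)

def best_dictionary(dicts, x, L):
    """Return m*, the OMP support under that dictionary, and its error."""
    results = []
    for D in dicts:
        support, coef = omp(D, x, L)
        results.append((relative_rms(x, D[:, support] @ coef), support))
    m_star = int(np.argmin([err for err, _ in results]))
    return m_star, results[m_star][1], results[m_star][0]

rng = np.random.default_rng(5)
P = 20
dicts = [np.linalg.qr(rng.standard_normal((P, P)))[0] for _ in range(3)]
x = 1.5 * dicts[1][:, 3] - 2.0 * dicts[1][:, 17]   # 2-sparse in dictionary 1
m_star, support, err = best_dictionary(dicts, x, L=2)
```

The returned support gives the column indices $k_1, \ldots, k_L$ whose parameters would be collected into $\boldsymbol{\Theta}^{(m_n^*)}$.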

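Algorithm 2.

Combination of generated dictionaries

Require: Data $\mathbf{x}_n$, generated dictionaries $\mathbf{D}_n^{(m)}$, number of K-means clusters $N_b$
1: for $n = 1, \ldots, N$ do
2:   for $m = 1, \ldots, M$ do
3:     Reconstruct $\mathbf{x}_n$ based on $\mathbf{D}_n^{(m)}$ with OMP
4:     Compute relative RMS error $Err_n^{(m)}$
5:   end for
6:   Find $m_n^*$ such that $m_n^* = \arg\min_{m \in [1, M]} Err_n^{(m)}$
7:   Retrieve parameters of dictionary atoms selected by OMP based on $\mathbf{D}_n^{(m_n^*)}$, $\boldsymbol{\Theta}^{(m_n^*)} = [\boldsymbol{\theta}_{nk_1}^{(m_n^*)}, \ldots, \boldsymbol{\theta}_{nk_L}^{(m_n^*)}]$
8: end for
9: Concatenate $\boldsymbol{\Theta}_U = [\boldsymbol{\Theta}^{(m_1^*)} \ldots \boldsymbol{\Theta}^{(m_N^*)}]$
10: Quantize dictionary parameters $\boldsymbol{\Theta}_Q = \text{K-means}(\boldsymbol{\Theta}_U)$
11: Compute the final dictionary $\mathbf{D}_Q = \boldsymbol{\phi}(\boldsymbol{\Theta}_Q)$ where $\boldsymbol{\phi}$ operates on each column of matrix $\boldsymbol{\Theta}_Q$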

Ideally, we could have used ΘU as the parameters of our final dictionary. However, in practice the large amount of data renders ΘU computationally expensive. For this reason, the parameter vectors of ΘU are further quantized with K-means clustering using Nbin centers. This results in parameter matrix ΘQ with corresponding final dictionary DQ. The aforementioned procedure is performed on the training data, while the test data are not seen at all during this step (Section VII-B5).
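The quantization step can be illustrated with plain Lloyd iterations. The farthest-point initialization, the two-parameter Gaussian-bump atom and the synthetic parameter matrix are assumptions of this sketch; any standard K-means implementation could be substituted.

```python
import numpy as np

def kmeans(points, n_clusters, n_iter=50):
    """Lloyd iterations on the columns of a Q x N matrix; returns Q x n_clusters."""
    X = points.T
    centers = [X[0]]
    for _ in range(1, n_clusters):           # farthest-point initialization
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d2)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):          # keep old center if cluster empties
                centers[j] = X[labels == j].mean(axis=0)
    return centers.T

def phi(theta, P):
    """Hypothetical atom: unit-norm Gaussian bump with center and width."""
    t = np.arange(P)
    atom = np.exp(-0.5 * ((t - theta[0]) / theta[1]) ** 2)
    return atom / np.linalg.norm(atom)

rng = np.random.default_rng(6)
# Toy unified parameter matrix (Q = 2, NL = 100): two tight parameter groups.
Theta_U = np.hstack([
    np.array([[10.0], [2.0]]) + 0.1 * rng.standard_normal((2, 50)),
    np.array([[40.0], [5.0]]) + 0.1 * rng.standard_normal((2, 50)),
])
Theta_Q = kmeans(Theta_U, n_clusters=2)
Theta_Q = Theta_Q[:, np.argsort(Theta_Q[0])]  # order centers by first parameter
D_Q = np.column_stack([phi(Theta_Q[:, j], P=64) for j in range(Theta_Q.shape[1])])
```

Quantizing in the parameter space keeps the final dictionary parametric: each column of `D_Q` is still an exact evaluation of the atom function at a cluster center.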

This approach can be applied to combine dictionaries learnt from any other method, therefore it also provides a unified platform that allows comparison of the considered parametric DL approaches in our paper (Section VII).

V. Choosing the Parametric Dictionary Function

The choice of the parametric dictionary function is not trivial and is usually guided by the application of interest. If designed appropriately, parametric dictionaries can yield interpretable information about meaningful signal characteristics for a variety of applications. Previous studies have used Gabor [69] and Gammatone [24] atoms to represent speech signals because of their good localization properties and similarities to the human auditory system. Other efforts have proposed Gaussian-like functions to efficiently capture spherical stereo images [70], diffusion-based dictionaries to model MRI [25], and other wavelet-like atoms for digitizing fingerprint images [71]. Gabor dictionaries have been used for the electroencephalogram (EEG) [72], spline wavelets for the electrocardiogram (ECG) [73], and sigmoid-exponential functions for the electrodermal activity (EDA) [28].

Our Bayesian parametric DL approach generates the parameters of the dictionary atoms based on the MH sampler (Section III-B). The mean of the corresponding distribution depends on the considered parametric function and is selected so that it maximizes the corresponding posterior distribution (7). We will further show that the concavity of (7) generally depends on function ϕ. It is unusual that a function ϕ is concave with respect to all the parameters of interest, but in practice even the estimation of local optima is enough to generate useful samples (Section VII-C).

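The logarithm of the posterior (7) of the dictionary atom parameters is:

$$U(\tilde{\theta}_n) = \ln p(\tilde{\theta}_n \mid y_{-\tilde{\theta}_n}, X, \mathcal{H}) = -\frac{\gamma_{\varepsilon_n}}{2} \left\| x_n - S_n \tilde{\phi}_n(\tilde{\theta}_n) \right\|_2^2 - \frac{1}{2} (\tilde{\theta}_n - \tilde{g}_n)^T \tilde{G}_n (\tilde{\theta}_n - \tilde{g}_n) \tag{19}$$

If we set $\psi_n(\tilde{\theta}_n) = x_n - S_n \tilde{\phi}_n(\tilde{\theta}_n)$, then (19) becomes

$$U(\tilde{\theta}_n) = -\frac{1}{2} (\tilde{\theta}_n - \tilde{g}_n)^T \tilde{G}_n (\tilde{\theta}_n - \tilde{g}_n) - \frac{\gamma_{\varepsilon_n}}{2} \left\| \psi_n(\tilde{\theta}_n) \right\|_2^2 \tag{20}$$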

The gradient vector and Hessian matrix of (20) are

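$$\nabla_{\tilde{\theta}_n} U(\tilde{\theta}_n) = -\tilde{G}_n (\tilde{\theta}_n - \tilde{g}_n) - \gamma_{\varepsilon_n} \left( \nabla_{\tilde{\theta}_n} \psi_n(\tilde{\theta}_n) \right)^T \psi_n(\tilde{\theta}_n) \tag{21}$$
$$H_U = \nabla^2_{\tilde{\theta}_n} U(\tilde{\theta}_n) = -\tilde{G}_n - \frac{\gamma_{\varepsilon_n}}{2} \nabla^2_{\tilde{\theta}_n} \left\| \psi_n(\tilde{\theta}_n) \right\|_2^2 \tag{22}$$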

where

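$$\left( \nabla^2_{\tilde{\theta}_n} \left\| \psi_n(\tilde{\theta}_n) \right\|_2^2 \right)_{ij} = \frac{\partial^2}{\partial \tilde{\theta}_{ni} \partial \tilde{\theta}_{nj}} \left\| \psi_n(\tilde{\theta}_n) \right\|_2^2 = \frac{\partial}{\partial \tilde{\theta}_{ni}} \left( 2 \sum_{p=1}^{P} \psi_{np}(\tilde{\theta}_n) \frac{\partial \psi_{np}(\tilde{\theta}_n)}{\partial \tilde{\theta}_{nj}} \right) = 2 \sum_{p=1}^{P} \frac{\partial \psi_{np}(\tilde{\theta}_n)}{\partial \tilde{\theta}_{ni}} \frac{\partial \psi_{np}(\tilde{\theta}_n)}{\partial \tilde{\theta}_{nj}} + 2 \sum_{p=1}^{P} \psi_{np}(\tilde{\theta}_n) \frac{\partial^2 \psi_{np}(\tilde{\theta}_n)}{\partial \tilde{\theta}_{ni} \partial \tilde{\theta}_{nj}} \tag{23}$$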

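If $H_{\psi_{np}}$ is the Hessian of $\psi_{np}$, p = 1, …, P, then from (23)

$$\nabla^2_{\tilde{\theta}_n} \left\| \psi_n(\tilde{\theta}_n) \right\|_2^2 = 2 \left( \nabla_{\tilde{\theta}_n} \psi_n(\tilde{\theta}_n) \right)^T \left( \nabla_{\tilde{\theta}_n} \psi_n(\tilde{\theta}_n) \right) + 2 \sum_{p=1}^{P} \psi_{np}(\tilde{\theta}_n) H_{\psi_{np}} \tag{24}$$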

The Hessian of ψnp can be computed as

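$$\left( H_{\psi_{np}} \right)_{ij} = -\sum_{l=1}^{L} s_{nl} \frac{\partial^2 \phi_p(\theta_{nk_l})}{\partial \theta_{nk_l i} \, \partial \theta_{nk_l j}} \tag{25}$$
$$H_{\psi_{np}} = -\sum_{l=1}^{L} s_{nl} \left. H_{\phi_p} \right|_{\theta = \theta_{nk_l}} \tag{26}$$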

From (22), (24) and (26), we get

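$$H_U = -\tilde{G}_n - \gamma_{\varepsilon_n} \left( \nabla_{\tilde{\theta}_n} \psi_n(\tilde{\theta}_n) \right)^T \left( \nabla_{\tilde{\theta}_n} \psi_n(\tilde{\theta}_n) \right) + \gamma_{\varepsilon_n} \sum_{p=1}^{P} \psi_{np}(\tilde{\theta}_n) \sum_{l=1}^{L} s_{nl} \left. H_{\phi_p} \right|_{\theta = \theta_{nk_l}} \tag{27}$$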

The first two terms of (27) involve positive (semi-)definite matrices that enter with a negative sign, so they favor the concavity of U, whereas the definiteness of the third term depends on ϕ. In our experiments (Section VII) and in the literature mentioned in this section, function ϕ is selected with a knowledge-driven approach based on the application of interest. Although this does not guarantee that the generated atom parameters are the global maxima, in practice the corresponding sampling method yields satisfactory results (Section VII-C).
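The gradient and Hessian expressions can be sanity-checked numerically in the simplest setting of a single selected atom (L = 1) with a scalar parameter (Q = 1). The sketch below is an illustration only: the toy atom exp(−t/θ) is an assumption chosen because its derivatives are easy to write down, it is not one of the EDA atoms of Table III, and all function names are hypothetical.

```python
import math


def atom(theta, t):
    """Toy atom exp(-t/theta) and its first two derivatives w.r.t. theta."""
    e = [math.exp(-tp / theta) for tp in t]
    d1 = [ep * tp / theta**2 for ep, tp in zip(e, t)]
    d2 = [ep * (tp**2 / theta**4 - 2 * tp / theta**3) for ep, tp in zip(e, t)]
    return e, d1, d2


def log_posterior(theta, x, t, s, g, G, gamma):
    """U(theta) as in (20) for one atom with a scalar parameter."""
    phi, _, _ = atom(theta, t)
    psi = [xp - s * fp for xp, fp in zip(x, phi)]
    return -0.5 * G * (theta - g) ** 2 - 0.5 * gamma * sum(p * p for p in psi)


def gradient(theta, x, t, s, g, G, gamma):
    """Scalar version of (21): prior term plus data term."""
    phi, d1, _ = atom(theta, t)
    psi = [xp - s * fp for xp, fp in zip(x, phi)]
    dpsi = [-s * d for d in d1]  # derivative of psi = x - s * phi
    return -G * (theta - g) - gamma * sum(dp * p for dp, p in zip(dpsi, psi))


def hessian(theta, x, t, s, g, G, gamma):
    """Scalar version of (27): two non-positive terms plus a phi-dependent term."""
    phi, d1, d2 = atom(theta, t)
    psi = [xp - s * fp for xp, fp in zip(x, phi)]
    dpsi = [-s * d for d in d1]
    return (-G - gamma * sum(dp * dp for dp in dpsi)
            + gamma * sum(p * s * h for p, h in zip(psi, d2)))
```

Comparing `gradient` and `hessian` against central finite differences of `log_posterior` and `gradient`, respectively, is a quick way to confirm the signs of the three terms for a given ϕ.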

VI. MCMC Convergence for Bayesian DL

Ergodicity properties of large-dimensional MC have been the subject of several studies [60], [74]. We discuss the uniform ergodicity property of the considered MC which implies convergence independently of the initial state [51], [52]. We focus on the sampling with replacement case, although our discussion can be extended for atom sampling without replacement. We show that the MC used for inferring the variables of the Bayesian DL problem is uniformly ergodic by proving that it meets the conditions of Theorem 6.1. This theorem establishes uniform ergodicity for the MH-within-Gibbs sampler and its proof can be found in [61].

Let P be a Markov transition kernel (Section III-A) and $Y^{(m)}$, $Y^{(m+1)}$ be two consecutive MCMC states generating observations $y^{(m)}$, $y^{(m+1)}$. We will assume that $y_b^{(m+1)}$ and $y_b^{(m)}$ are the values of block b at the current and previous MCMC state, noted as m + 1 and m, respectively. Also, $y_{-y_b}^{(m+1)}$ contains all variables that have already been sampled at state m + 1 and $y_{-y_b}^{(m)}$ the ones from the previous state m.

Definition 6.1 (Conditional Weight Function)

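If block b is generated with the MH sampler from a candidate-generating density q, its conditional weight function is defined as

$$w_{y_b}\left( y_b^{(m+1)} \mid y_b^{(m)}, y_{-y_b}^{(m)}, y_{-y_b}^{(m+1)} \right) = \frac{p\left( y_b^{(m+1)} \mid y_{-y_b}^{(m)}, y_{-y_b}^{(m+1)} \right)}{q\left( y_b^{(m+1)} \mid y_b^{(m)}, y_{-y_b}^{(m)}, y_{-y_b}^{(m+1)} \right)} \tag{28}$$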

Theorem 6.1 (MH-within-Gibbs Uniform Ergodicity)

Given p(y|X, ℋ) on state space $y = [y_1^T, \ldots, y_B^T]^T$, let $P = P_1 \cdots P_B$ be the Markov kernel of the Gibbs sampler and $Q_{b'}$, b′ ∈ ℐB, where $\mathcal{I}_B = \{b_{i_1}, \ldots, b_{i_{B'}}\}$, B′ < B, be the Markov kernel of the MH sampler with conditional weight $w_{y_{b'}}$ as in (28). If each conditional weight $w_{y_{b'}}$, b′ ∈ ℐB, of the MH sampler is bounded, $\sup w_{y_{b'}} < \infty$, and the Gibbs sampler with Markov kernel $P_1 \cdots P_B$ is uniformly ergodic, then the MH-within-Gibbs sampler, resulting from substituting kernels $P_{b'}$ with $Q_{b'}$, b′ ∈ ℐB, is also uniformly ergodic.

Based on Theorem 6.1, we show the MCMC uniform ergodicity for our case. Theorem 6.2 describes the minorization condition of the Gibbs sampler. It constitutes a multivariate extension of Proposition 2 in Jones et al. [61]. Lemma 6.1 assists with its proof and ensures the minorization of the first B − 1 blocks. Lemma 6.2 ensures that the conditional weight for θk is bounded away from infinity. Finally, Theorem 6.3 combines all of the above to prove the uniform ergodicity of the considered MC. The proofs of Theorems 6.2 and 6.3 and of Lemmas 6.1 and 6.2 can be found in Appendix A.

Lemma 6.1 (Minorization Condition of B − 1 Blocks)

Let $P((Y_1, Y_2, \ldots, Y_B), A_1 \times A_2 \times \cdots \times A_B)$ be the Markov kernel of a Gibbs sampler, where $A_1, \ldots, A_B$ are elements of the Borel σ-algebra on the variable space $Y_1, \ldots, Y_B$, respectively. We further assume that all updates of the Gibbs sampler, except the one corresponding to $Y_B$, are minorisable, in the sense that for b = 1, …, B − 1, there is $\varepsilon_b > 0$ and a probability measure $\nu_b$, such that $P_{Y_b}(y_{-y_b}, A_{-b}) \ge \varepsilon_b \nu_b(A_{-b})$, where $A_{-b} = A_1 \times \cdots \times A_{b-1} \times A_{b+1} \times \cdots \times A_B$. Then the Gibbs sampler with Markov kernel $P_1 \cdots P_{B-1}$ is minorisable, i.e. there exist $\varepsilon_0 > 0$ and a probability measure $\nu_0$ such that

$$P((Y_1, Y_2, \ldots, Y_{B-1}), A_{-B}) \ge \varepsilon_0 \nu_0(A_{-B})$$

Theorem 6.2 (Partial Minorization Condition for Gibbs)

Let the assumptions of Lemma 6.1 hold. Then the Gibbs sampler with Markov kernel $P_1 \cdots P_B$ is minorisable.

Lemma 6.2 (Bounded Conditional Weight of Dictionary Parameters)

Let the assumptions of Lemma 6.1 hold. The conditional weight of the Bth block, which includes the “super-vector” θ̃n of dictionary parameters and is generated with MH according to (8)–(10), is bounded, i.e. sup wθ̃n < ∞.

Theorem 6.3 (MC Uniform Ergodicity for Bayesian DL Inference)

Let yn be the variables of the parametric DL problem (4) generated with the MH-within-Gibbs sampler (Algorithm 1), according to which all variables except θ̃n are sampled with Gibbs (first B − 1 blocks), while θ̃n is sampled with MH (Bth block). Then the corresponding MC {Y(1), Y(2), …} is uniformly ergodic.
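The structure covered by Theorem 6.3 (closed-form Gibbs updates for some blocks, an MH step with a bounded conditional weight for the remaining one) can be illustrated on a toy problem. The sketch below is not the sampler of Algorithm 1: it targets a bivariate normal with correlation ρ, updates x with Gibbs, and updates y with an independence MH step whose Cauchy proposal is an assumption chosen because its heavy tails keep the weight p/q bounded. All names are hypothetical.

```python
import math
import random


def weight(y, center, var):
    """Conditional weight p/q up to a constant: Gaussian target over Cauchy proposal."""
    d = y - center
    return math.exp(-d * d / (2.0 * var)) * (1.0 + d * d)


def mh_within_gibbs(n_iter=20000, rho=0.6, seed=1):
    """Sample a standard bivariate normal with correlation rho."""
    rng = random.Random(seed)
    var = 1.0 - rho * rho  # conditional variance of each coordinate
    x, y, accepted, chain = 0.0, 0.0, 0, []
    for _ in range(n_iter):
        # Gibbs block: x | y has a closed-form Gaussian conditional.
        x = rng.gauss(rho * y, math.sqrt(var))
        # MH block: propose y from a Cauchy centred at the conditional mean.
        cand = rho * x + math.tan(math.pi * (rng.random() - 0.5))
        # Accept with probability min(1, w(cand) / w(y)).
        if rng.random() * weight(y, rho * x, var) < weight(cand, rho * x, var):
            y, accepted = cand, accepted + 1
        chain.append((x, y))
    return chain, accepted / n_iter
```

Because the Gaussian factor decays faster than the polynomial factor grows, `weight` stays bounded over the whole real line, which is the same mechanism that Lemma 6.2 relies on for the dictionary parameters.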

VII. Experiments

We compare the Bayesian DL model against the previously proposed parametric SD [22] and ETF [24], which are conceptually closer to our approach. We further perform experiments with the non-parametric K-SVD [14] yielding dictionaries of arbitrary structure, therefore not directly comparable to our approach. We use synthetic and real biomedical data from EDA signals. Because of their characteristic structure, these types of signals favor sparse representation approaches with parametric dictionaries providing interpretable information [28].

EDA is decomposed into a slow-moving tonic part depicting the general trend and a phasic part containing fast fluctuations superimposed onto the tonic signal, also called skin conductance responses (SCRs). The tonic part is mathematically expressed as a straight line, while SCRs are represented by sigmoid-exponential functions with a steep rise and slow recovery. Taking this into account, dictionaries contain tonic and phasic atoms as shown in Table III. Since SCR shapes typically vary more than the signal level, which remains fairly constant throughout an analysis window, for the sake of simplicity we perform DL only on the phasic atoms, learning the parameters θ = [t0, Trise, Tdecay]T. The initial dictionaries are created from the combination of all parameters reported in Table III, resulting in 63 tonic and 144 phasic atoms. The analysis window is 5 s, i.e. 160 samples at a typical sampling frequency of 32 Hz [28].

TABLE III.

Description of EDA-specific dictionary atoms and initial parameters.

Tonic Atoms
ϕ1(t) = Δ0 + Δ · t
Δ0 ∈ {−20, −10, 1}
Δ ∈ {−0.01, −0.009, …, 0, 0.01, 0.02, …, 0.1}
Phasic Atoms
ϕ2(t)=stt0eTdecay[1+(stt0Trise)2]2u(tt0)
Trise ∈ {8, 14, 18}
Tdecay ∈ {10, 15, 20}
t0 ∈ {0, 10, 20, …, 150}

u(t) = 1, t ≥ 0 and u(t) = 0 otherwise
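The atom counts quoted in the text follow directly from the parameter grids of Table III. The short enumeration below reproduces them; the step sizes of the slope grid (0.001 on the negative side, 0.01 on the positive side) are read off the ellipses in the table and are therefore an assumption, as are the variable names.

```python
from itertools import product

# Tonic atoms: offsets and slopes listed in Table III.
delta0 = [-20, -10, 1]
slopes = [k / 1000 for k in range(-10, 1)] + [k / 100 for k in range(1, 11)]
tonic_params = list(product(delta0, slopes))

# Phasic atoms: onset, rise time and decay time listed in Table III.
t0 = list(range(0, 151, 10))
t_rise = [8, 14, 18]
t_decay = [10, 15, 20]
phasic_params = list(product(t0, t_rise, t_decay))

print(len(tonic_params), len(phasic_params))  # prints "63 144"
```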

A critical issue of MCMC is whether a certain number of iterations is enough to stop sampling. In high-dimensional problems, all inferred variables need to converge to the target distribution. Since examining each variable separately is not always feasible, we use a combination of monitoring and diagnostic strategies to quantitatively assess MCMC convergence.

In the following, we describe the data (Section VII-A), the experimental setup (Section VII-B), and the results evaluating the learned dictionaries (Section VII-C) and providing diagnostics for MCMC convergence (Section VII-D).

A. Data Description

1) Synthetic Data

We randomly generate 1000 synthetic signals that simulate the EDA structure. Each signal is expressed as the sum of a constant c and R number of SCRs

$$x(t) = c + \sum_{r=1}^{R} \phi_2^{(r)}(t) \tag{29}$$

where $\phi_2^{(r)}$ is given in Table III with parameters $T_{rise}^{(r)}$, $T_{decay}^{(r)}$ and $t_0^{(r)}$. In contrast to the rest of the paper, in which superscript “(·)” denotes the MCMC state, here it symbolizes the SCR index. The parameters of the synthetic data are randomly generated within a pre-specified range: $t_0^{(r)} \in [1, 150]$ samples, $T_{rise}^{(r)}, T_{decay}^{(r)} \in [1, 20]$, and R ∈ [1, 5]. Since the number of SCRs for each signal is known a priori, DL was performed with K = R + 1 atoms, of which one captures the tonic part and the rest the phasic part.
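A generator for signals of the form (29) might look as follows. This is a sketch under several assumptions that are not taken from the paper: the function `phasic_atom` uses a generic sigmoid-exponential shape (zero before the onset, steep rise, slow exponential recovery) as a stand-in for the exact ϕ2 of Table III, the range of the constant c is arbitrary, parameters are drawn uniformly, and the function names are hypothetical.

```python
import math
import random


def phasic_atom(n_samples, t0, t_rise, t_decay):
    """Assumed sigmoid-exponential SCR shape: zero up to t0, steep rise, slow decay."""
    atom = []
    for t in range(n_samples):
        d = t - t0
        if d <= 0:
            atom.append(0.0)  # no response before (or at) the onset
        else:
            r = (d / t_rise) ** 2
            atom.append(math.exp(-d / t_decay) * (r / (1.0 + r)) ** 2)
    return atom


def synthetic_signal(rng, n_samples=160):
    """Draw one signal x(t) = c + sum of R phasic atoms, following (29)."""
    n_scr = rng.randint(1, 5)      # R in [1, 5]
    c = rng.uniform(0.0, 1.0)      # range of c is an assumption
    x = [c] * n_samples
    for _ in range(n_scr):
        atom = phasic_atom(n_samples, rng.uniform(1, 150),
                           rng.uniform(1, 20), rng.uniform(1, 20))
        x = [xi + ai for xi, ai in zip(x, atom)]
    return x, c, n_scr
```

A data set of the size used here would then be `[synthetic_signal(rng) for _ in range(1000)]` with `rng = random.Random(seed)` for some fixed seed.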

2) Real Data

We further evaluate the considered DL methods on human EDA data from the database of emotion analysis using physiological signals (DEAP) [75]. DEAP contains 40 one-minute recordings from 32 subjects watching long excerpts of music videos designed to study the relation of multimedia content with mood and temperament [75]. Because of the expected SCR rate in the considered 5 s analysis window [76], training was performed with K = 3 atoms, although other values yield similar results.

B. Experimental Setup

1) Bayesian DL

The proposed MCMC inference (Algorithm 1) is performed with 1000 and 500 iterations for the synthetic and real data, respectively. The atom indices znl, coefficients sn and selection probabilities πn are initialized based on the decomposition of each exemplar signal with OMP. The mean μsn of the coefficients’ prior was initialized with the average of the coefficients of the selected atoms from OMP. The mean and precision of dictionary atoms’ prior g0 and G0 were initialized with the mean and inverse covariance matrix of the initial parameters (Table III). The scale matrix R0 of the Wishart distribution for sampling Gn was also initialized with the covariance of dictionary parameters. The remaining hyperparameters were empirically set to αk = 2, e = 1, f = 2, ν0 = 3, γs = 10, and γe = 8.

Dictionary combination (Algorithm 2) was performed on the generated vector parameters [Trise, Tdecay]T with Nb = 100, 225, 400. We did not include the time offset t0 in this procedure, since it does not contribute to the shape of the dictionary atoms; the final dictionaries should contain the learnt versions of phasic atoms shifted across the entire analysis frame. Therefore, the quantized matrix of dictionary parameters ΘQ ∈ ℜNb is replicated for all values of t0 (Table III), yielding final dictionaries with 16Nb phasic atoms.

2) Steepest descent DL

Dictionaries are trained in parallel for each exemplar signal using SD [22], in which OMP alternates with a least-squares fit step for estimating the parameters of the selected atoms. SD was run over 250 iterations. We perform the same procedure for combining the learnt parameters, yielding dictionaries of the same size as with MCMC.

3) Equiangular tight frame DL

This method alternates between finding the dictionary with minimum Frobenius distance from the ETF and updating the corresponding parameters [24]. These steps involve an ETF relaxation constraint value and a gradient descent step, both crucial for the overall performance and referred to in [24] as αk and ε. In our experiments, dictionaries were trained with different combinations of αk ∈ {0.1, 0.5, 0.9} and ε ∈ {0.001, 0.002, 0.003, 0.004}. The dictionary {αk*, ε*} that showed the lowest relative RMS error on the training data was used to evaluate the corresponding test data. In order to ensure the same dictionary size as the other approaches, parameters are uniformly initialized within the intervals Trise ∈ [8, 18], Tdecay ∈ [10, 20] with √Nb = 10, 15, 20 different values each.

4) K-SVD

K-SVD is a classical non-parametric DL method generalizing K-means and iteratively estimating each dictionary atom in order to minimize the reconstruction error [14]. The original dictionaries used in this method were the same as the initial ones of the aforementioned approaches.

5) Evaluation

Evaluation of the final dictionaries is performed based on the relative RMS error computed between the original signal and its corresponding approximation through OMP based on the final dictionaries. RMS error is a common metric in related studies [14], [15], [24]. In order to ensure that our results reflect the ability of the algorithms to represent unseen data, all experiments are performed within a 10-fold cross-validation for the synthetic data and a leave-one-subject-out cross-validation setup for the real EDA signals. We note that OMP has two functionalities in our framework. It serves as a decomposition technique, based on which the learned dictionaries are evaluated against the test data, and is also used at the dictionary combination step (Section IV) in order to “prune” the atoms of the generated dictionaries. OMP does not operate on the test data during this dictionary combination framework, but rather at the evaluation step, during which the final dictionary is already created independently of the test set.
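The evaluation step can be sketched as a small OMP routine that returns the relative RMS error after a given number of iterations. The version below is an illustration only: it implements OMP through Gram-Schmidt orthogonalisation of the selected atoms, it takes the relative RMS error to be ‖x − x̂‖2 / ‖x‖2 (an assumption about the exact definition), it assumes all atoms are non-zero, and the function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


def omp_relative_rms(x, atoms, n_iter):
    """Greedy OMP; returns (selected atom indices, ||residual|| / ||x||)."""
    residual = list(x)
    basis, chosen = [], []
    for _ in range(n_iter):
        # Score each unused atom by its normalised correlation with the residual.
        scores = [0.0 if k in chosen else abs(dot(residual, a)) / math.sqrt(dot(a, a))
                  for k, a in enumerate(atoms)]
        k = max(range(len(atoms)), key=scores.__getitem__)
        # Orthogonalise the new atom against the atoms selected so far.
        q = list(atoms[k])
        for b in basis:
            c = dot(q, b)
            q = [qi - c * bi for qi, bi in zip(q, b)]
        norm = math.sqrt(dot(q, q))
        if norm < 1e-12:
            break  # nothing new to add to the span
        q = [qi / norm for qi in q]
        # Remove the component of the residual along the new direction.
        c = dot(residual, q)
        residual = [ri - c * qi for ri, qi in zip(residual, q)]
        basis.append(q)
        chosen.append(k)
    return chosen, math.sqrt(dot(residual, residual) / dot(x, x))
```

Evaluating this function for increasing `n_iter` on held-out signals gives error curves of the kind reported against the number of OMP iterations.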

Besides signal reconstruction, dictionary coherence is an additional evaluation criterion. It is defined as the absolute value of the largest inner-product between any atoms of the dictionary and is an important property related to signal reconstruction quality [24]. We computed the resulting coherence of the dictionaries after training.
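Coherence as defined above takes only a few lines to compute. The sketch below assumes that atoms are normalised to unit ℓ2 norm before the inner products are taken, which is the usual convention but is not stated explicitly here; the function name is hypothetical.

```python
import math


def coherence(atoms):
    """Largest absolute inner product between two distinct unit-normalised atoms."""
    unit = []
    for a in atoms:
        norm = math.sqrt(sum(v * v for v in a))
        unit.append([v / norm for v in a])
    return max(
        abs(sum(u * v for u, v in zip(unit[i], unit[j])))
        for i in range(len(unit))
        for j in range(i + 1, len(unit))
    )
```

Under this convention a dictionary that contains two atoms differing only by a scale factor has coherence 1, and an orthogonal dictionary has coherence 0.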

C. Results

Visual inspection of the final dictionaries indicates that SD does not always preserve the initial dictionary structure (Figs. 1a–b), while ETF yields less variable dictionaries (Fig. 1c). MCMC results in a variety of atoms preserving the original shape (Figs. 1e–f). In contrast to parametric DL, non-parametric K-SVD yields very unstructured, non-interpretable atoms (Fig. 1d). Further comparison between an exemplar input and the reconstructed signals suggests that our approach achieves superior signal representation (Fig. 2).

Fig. 1.

Example of initial dictionary and dictionaries learnt with Steepest Descent (SD), Equiangular Tight Frame (ETF), K-SVD and Markov Chain Monte Carlo Bayesian inference (MCMC) using atom sampling with and without replacement (w-, w-o rplcm). Dictionary combination is performed with Nb = 400. An instance of phasic atoms shifted with t0 = 30 is shown.

Fig. 2.

Example of input EDA signal (solid cyan line) and reconstructed signals using the original dictionary (blue dashed line) and the dictionaries learnt with Steepest Descent (SD), Equiangular Tight Frame (ETF), and Markov Chain Monte Carlo Bayesian inference (MCMC) (black-triangle, green-asterisk, and red-circled lines, respectively). Reconstruction is performed using orthogonal matching pursuit with 4 iterations.

Dictionaries learnt with our proposed Bayesian DL yield lower reconstruction errors than the initial ones and the ones learnt through SD and ETF (Fig. 3). Atom sampling with and without replacement does not differ significantly, since the two distributions are very similar for K ≫ L. Dictionaries trained using SD perform quite poorly on unseen synthetic data (Figs. 3a–c), possibly because the simple structure of these signals makes least-squares-based methods overfit significantly. Although ETF DL is not prone to overfitting, since it does not take exemplar data into account during training, it fails to adapt to more complex real data (Figs. 3d–f). K-SVD appears more accurate for real signals than the parametric approaches (Figs. 3d–f), indicating its ability to learn more complex patterns that do not necessarily share the structure of the initial dictionaries. Similar results occur for different dictionary sizes and are omitted here for brevity.

Fig. 3.

Relative root mean square (RMS) error between original and reconstructed signal with respect to the number of (orthogonal) matching pursuit (OMP) iterations. Dictionaries are learnt with Steepest Descent (SD), Equiangular Tight Frame (ETF), K-SVD, and Markov Chain Monte Carlo Bayesian inference (MCMC). During MCMC atom sampling is performed with (w-) and without (w-o) replacement (rplcm).

The large number of atoms generally yields dictionaries of high coherence. All considered methods appear quite equivalent, with ETF achieving the lowest coherence, since this is included as an optimization metric (Table IV).

TABLE IV.

Final dictionary coherence

Method Synthetic Data Real Data

Initial 1 1
SD 0.9990 0.9992
ETF 0.9988 0.9990
K-SVD 0.9989 0.9991
MCMC w- rplcm 0.9989 0.9991
MCMC w-o rplcm 0.9989 0.9991

D. MCMC Diagnostics

Besides theoretical analysis (Section VI), another way to examine MCMC convergence is to see how well the MC is mixing, which is usually assessed through visual inspection. Trace plots of several problem variables from an exemplar signal indicate that the considered chains move around the parameter space and are not confined to certain areas, suggesting good mixing (Figs. 4a–d). Density plots of the generated samples further validate that the estimated posteriors are close to the target distributions (Figs. 4e–h).

Fig. 4.

Example MCMC trace plots and generated posteriors for the first element of vectors containing the (a,e) dictionary atoms θn, (b,f) atom coefficients sn, (c,g) atom priors πn, and (d,h) for the noise precision γεn.

Convergence diagnostics can further quantitatively assess whether the generated samples are biased. We use the Geweke diagnostic [77], because it only requires one running chain and attempts to address issues of both bias and variance [78]. It takes two non-overlapping parts of the MC (usually the first 10% and the last 50%) and compares their means using a difference of means test. If the samples are drawn from the stationary distribution of the chain, the two means are equal. The test statistic is a standard Z-score, whose values under convergence lie within two standard deviations from zero, i.e. |z| < 2 for the standard normal distribution. We perform the Geweke test for both datasets and atom sampling methods (Sections II-A, II-B) with the first 100 samples used as a burn-in period.
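A simplified version of this test fits in a few lines. The sketch below is an approximation, not the diagnostic as implemented in standard packages: the original Geweke statistic estimates the variance of each segment mean from the spectral density at frequency zero to account for autocorrelation, whereas this version uses plain sample variances and is therefore only reasonable for weakly correlated chains. The function name and default fractions are assumptions.

```python
import math
import random
import statistics


def geweke_z(chain, first=0.1, last=0.5):
    """Z-score comparing the means of the first and last parts of one chain."""
    n = len(chain)
    a = chain[: int(first * n)]
    b = chain[n - int(last * n):]
    # Naive standard error of the difference of means (ignores autocorrelation).
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se
```

A chain whose mean drifts over time produces a large |z|, while a chain drawn from a fixed distribution usually stays within |z| < 2.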

The high dimensionality of our problem prevents us from reporting diagnostics for each variable separately, therefore we summarize the results for groups of variables. For each group of variables, we report the proportion of chains for which |z| < 2 (Table V). Most of the variables in our framework pass the Geweke diagnostic. The dictionary parameters, generated with MH, usually have lower success percentages. Although the Geweke diagnostic fails in a small number of cases, given the large number of variables and the signal reconstruction results (Section VII-C), the performed number of MCMC iterations appears sufficient for learning meaningful dictionaries.

TABLE V.

Geweke MCMC diagnostic - P(|z| < 2)

Variable        Synthetic Data            Real Data
                w- rplcm   w-o rplcm      w- rplcm   w-o rplcm
cn              0.93       0.98           0.92       0.95
zn              0.93       0.93           0.91       0.94
γεn             0.91       0.98           0.99       0.92
πn              0.93       0.96           0.96       0.92
θn: t0          0.68       0.70           0.72       0.75
θn: Trise       0.91       0.92           0.87       0.85
θn: Tdecay      0.92       0.92           0.86       0.85
gn: t0          0.75       0.80           0.78       0.79
gn: Trise       0.90       0.87           0.89       0.87
gn: Tdecay      0.90       0.86           0.91       0.85
Gn              0.81       0.75           0.82       0.83

MH acceptance rate refers to the fraction of candidate draws that are accepted and is important for convergence. Very high acceptance rates suggest that the chain is not mixing well, while very low rates might be inefficient. The acceptance rates for the variables of our problem (Table VI) are close to previously proposed optimal values in the literature [79] suggesting a good mixing of the considered MC.
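How the acceptance rate reacts to the proposal can be seen on a toy example. The sketch below runs a random-walk Metropolis sampler on a standard normal target and reports the fraction of accepted candidates for a given proposal step; it is not the candidate-generating density used for the dictionary parameters, and the function name and settings are assumptions.

```python
import math
import random


def rw_metropolis_acceptance(step, n_iter=5000, seed=0):
    """Acceptance rate of random-walk Metropolis on a standard normal target."""
    rng = random.Random(seed)
    x, accepted = 0.0, 0
    for _ in range(n_iter):
        cand = x + rng.gauss(0.0, step)
        # Acceptance probability min(1, p(cand) / p(x)) for p = N(0, 1).
        accept_prob = math.exp(min(0.0, 0.5 * (x * x - cand * cand)))
        if rng.random() < accept_prob:
            x, accepted = cand, accepted + 1
    return accepted / n_iter
```

Tiny steps are accepted almost always but explore the space slowly, while very large steps are rejected most of the time, which is why intermediate acceptance rates are generally preferred.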

TABLE VI.

Metropolis-Hastings acceptance rates (%)

Synthetic Data            Real Data
w- rplcm   w-o rplcm      w- rplcm   w-o rplcm
23.22      31.52          25.48      38.15

VIII. Discussion

An important benefit of Bayesian methods lies in providing estimates of the entire variable set of a problem. In the considered setup this can help infer the dictionary size, the optimal number of dictionary atoms for a given signal, and the corresponding noise levels. Estimation of the dictionary size is not as crucial in our greedy sparse representation approach, which is not as sensitive to overcomplete dictionaries as basis pursuit methods [5]. Inferring the optimal number of dictionary atoms is important, as it can yield high compression rates and help interpret the underlying signal information. An extension of our method could impose discrete probability distributions on the number of atoms, inferred appropriately through MH.

Another advantage of Bayesian methods is that they are less prone to overfitting, which usually occurs with deterministic algorithms that might describe the random noise instead of the actual data [81]. This is reflected in our results (Section VII-C), since deterministic SD, which does not include prior knowledge of the signal structure, performs more poorly on unseen data. On the other hand, ETF avoids overfitting but does not learn the morphology of the training data. Bayesian inference appears to strike a compromise between the two.

Inherent differences exist between parametric and non-parametric DL methods. The former impose a predetermined structure on the input space (through function ϕ) and learn the parameters that represent this structure from the training data. Their major benefit lies in their interpretability, since the considered dictionary atoms are able to meaningfully capture the characteristics of input signals and can be used for knowledge-driven classification and inference. On the other hand, the exemplar signals learned with non-parametric methods are hardly interpretable and represent the input space blindly. Since the functionality and scope of these methods are so different, meaningful comparison is challenging. In the case of synthetic data, where function ϕ is a perfect match to the signal (i.e. input signals are built with the same functions as the dictionary atoms), our proposed parametric Bayesian approach is more reliable than K-SVD in learning the hidden atom parameters. This means that given a perfectly constructed dictionary function, Bayesian parametric DL yields better results, even compared to non-parametric approaches. However, in the case of real signals, function ϕ cannot always be perfectly selected, therefore parametric approaches seem to perform slightly worse than K-SVD. While a more precise function might have yielded lower errors, finding the optimal mapping function is still an active research problem [82].

Although our proposed approach is general for learning the parameters of the dictionary atoms, we need to have good knowledge about the appropriate mapping function ϕ between the parameters and the data. As discussed in Section V, the selection of ϕ is usually guided by the application of interest and can vary for different types of signals. In the case of 2D images, for example, the dictionary needs to be constructed using a function ϕ that captures the context of the image, such as a Gabor or wavelet-like function. In order to reduce complexity, we can convert the 2D image into a 1D vector with dimensionality equal to the number of pixels, as typically performed [14], [40], [44]. Given a function ϕ suitable for the image of interest, our Bayesian approach can learn the parameters that efficiently capture the shape of the data.

In this paper, for the sake of simplicity we considered a simple dictionary combination approach with K-means quantization (Section IV). However, there exist more sophisticated approaches of block-coordinate descent with warm restarts [17] and weighted batch averaging [43].

Convergence is a critical component of MCMC in the context of Bayesian inference. In Section VI, we discussed the uniform ergodicity of the corresponding MH-within-Gibbs sampler, ensuring that the MC converges to the invariant distribution. Convergence rate is another important issue, for which previous studies have proposed explicit theoretical bounds in simple scenarios [59], [83]. The computation of these bounds can be quite difficult in such high-dimensional problems [84], therefore it goes beyond the scope of our study.

IX. Conclusions

We propose a sparse Bayesian model for parametric DL, whose problem variables follow appropriately selected probabilistic distributions. We use MH-within-Gibbs to infer the corresponding variables, because of its ability to compensate for posterior distributions that cannot be analytically computed. We further show the uniform ergodicity of the proposed MCMC through the minorization of the corresponding Gibbs sampler and the bounded conditional weights of the MH. Our experimental results performed on synthetic and real biomedical signals indicate that this approach offers benefits in terms of signal reconstruction compared to previously proposed SD and ETF methods and also provides a good tradeoff between learning the signal structure and avoiding overfitting.

Acknowledgments

This work was funded by the National Science Foundation and the National Institutes of Health.

Biographies

Theodora Chaspari (S’12) received the diploma in electrical and computer engineering from the National Technical University of Athens, Greece (2010) and the master’s degree from the University of Southern California (2012), where she is currently pursuing a Ph.D. degree. Since 2010 she has been a member of the Signal Analysis and Interpretation Laboratory (SAIL). Her research interests lie in the area of biomedical signal processing, speech analysis and behavioral signal processing.

Ms Chaspari is a recipient of the USC Annenberg Graduate Fellowship, USC WiSE Merit Fellowship, and the IEEE Signal Processing Society Travel Grant.

Andreas Tsiartas (S’10-M’14) received the B.Sc. degree in electronics and computer engineering from the Technical University of Crete, Crete, Greece, in 2006, and the M.Sc. and Ph.D. degrees in the Department of Electrical Engineering, University of Southern California (USC), Los Angeles, in 2014. He is currently a research engineer at SRI International. His main research direction focuses on speech-to-speech translation. Other research interests include acoustic and language modeling for automatic speech recognition (ASR) and voice activity detection.

Dr. Tsiartas received best teaching assistant awards for the years 2009 and 2010 in the Department of Electrical Engineering, USC. In 2006, he was awarded the Viterbi School Deans Doctoral Fellowship from USC.

Panagiotis Tsilifis was born in Aigio, Greece in 1985. He received his Diploma and M.Sc. degrees in Applied Mathematical Sciences from the National Technical University of Athens, Greece, in 2009 and 2011 respectively. In between, he also spent one year as an exchange student at the Royal Institute of Technology (KTH), in Stockholm, Sweden. He also received an M.A. in Applied Mathematics in 2014 from the University of Southern California (USC), Los Angeles, CA, USA. Currently he is pursuing his Ph.D. degree in Applied Mathematics at USC, working under the supervision of Prof. Roger G. Ghanem. His research focuses on Uncertainty Quantification methods including Bayesian approaches to optimal design and inference as well as construction of Polynomial Chaos surrogates with applications in geosciences, reservoir modeling and food microbiology.

Mr. Tsilifis received the Graduate Scholarship by the State Scholarship Foundation (IKY), Greece in 2010 and the Gerondelis Graduate Fellowship in 2013. He is also a SIAM student member and has received twice the SIAM student travel award, in 2015 and 2016.

Shrikanth S. Narayanan (S’88-M’95-SM’02-F’09) is Andrew J. Viterbi Professor of Engineering at the University of Southern California (USC), and holds appointments as Professor of Electrical Engineering, Computer Science, Linguistics, Psychology Neuroscience and Pediatrics and as the founding director of the Ming Hsieh Institute. Prior to USC he was with AT&T Bell Labs and AT&T Research from 1995-2000. At USC he directs the Signal Analysis and Interpretation Laboratory (SAIL). His research focuses on human-centered signal and information processing and systems modeling with an interdisciplinary emphasis on speech, audio, language, multimodal and biomedical problems and applications with direct societal relevance. [http://sail.usc.edu]

Prof. Narayanan is a Fellow of the Acoustical Society of America and the American Association for the Advancement of Science (AAAS) and a member of Tau Beta Pi, Phi Kappa Phi, and Eta Kappa Nu. He is also an Editor in Chief for the IEEE Journal of Selected Topics in Signal Processing and an Editor for the Computer Speech and Language Journal and an Associate Editor for the IEEE TRANSACTIONS ON AFFECTIVE COMPUTING, IEEE TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING OVER NETWORKS, APSIPA TRANSACTIONS ON SIGNAL AND INFORMATION PROCESSING and the Journal of the Acoustical Society of America. He was also previously an Associate Editor of the IEEE TRANSACTIONS OF SPEECH AND AUDIO PROCESSING (2000–2004), IEEE SIGNAL PROCESSING MAGAZINE (2005–2008) and the IEEE TRANSACTIONS ON MULTIMEDIA (2008–2011). He is a recipient of a number of honors including Best Transactions Paper awards from the IEEE Signal Processing Society in 2005 (with A. Potamianos) and in 2009 (with C. M. Lee) and selection as an IEEE Signal Processing Society Distinguished Lecturer for 2010–2011 and ISCA Distinguished Lecturer for 2015–2016. Papers co-authored with his students have won awards including the 2014 Ten-year Technical Impact Award from ACM ICMI and at Interspeech 2015 Nativeness Detection Challenge, 2014 Cognitive Load Challenge, 2013 Social Signal Challenge, Interspeech 2012 Speaker Trait Challenge, Interspeech 2011 Speaker State Challenge, InterSpeech 2013 and 2010, InterSpeech 2009 Emotion Challenge, IEE DCOSS 2009, IEEE MMSP 2007, IEEE MMSP 2006, ICASSP 2005 and ICSLP 2002. He has published over 650 papers and has been granted seventeen U.S. patents.

Appendix A

MCMC Ergodicity Proofs

We provide the proofs for the theorems and lemmas of Section VI.

Proof of Lemma 6.1

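We assume that the sampling order is $Y_1, \ldots, Y_{B-1}$. Let $\{y'_1, \ldots, y'_{B-1}\}$ and $\{y_1, \ldots, y_B\}$ be the block variables at the current and previous MCMC state, respectively. The conditional probability for sampling the first $B-1$ blocks is

$$
\begin{aligned}
p(y'_1, \ldots, y'_{B-1} \mid y_1, \ldots, y_B)
&= p(y'_{B-1} \mid y'_1, \ldots, y'_{B-2}, y_{B-1}, y_B) \cdot p(y'_1, \ldots, y'_{B-2} \mid y_1, \ldots, y_B) \\
&= p(y'_{B-1} \mid y'_1, \ldots, y'_{B-2}, y_{B-1}, y_B) \cdot p(y'_{B-2} \mid y'_1, \ldots, y'_{B-3}, y_{B-2}, y_{B-1}, y_B) \cdots p(y'_1 \mid y_1, \ldots, y_B) \\
&= p(y'_{B-1} \mid y_{\setminus y_{B-1}}) \cdot p(y'_{B-2} \mid y_{\setminus y_{B-2}}) \cdots p(y'_1 \mid y_{\setminus y_1})
\end{aligned}
$$

where $y_{\setminus y_b}$ contains all variables except $y_b$. The Markov kernel for the first $B-1$ blocks can be written as in (30). The first and second inequalities in (30) follow from the minorization of the $(B-1)$th and $(B-2)$th blocks. Therefore $\varepsilon_0 = \prod_{b=1}^{B-1} \varepsilon_b > 0$ and $\nu_0 = \prod_{b=1}^{B-1} \nu_b(A_b)$ (a probability measure) satisfy the desired inequality.

$$
\begin{aligned}
P\big((Y_1, Y_2, \ldots, Y_{B-1}),\, A_1 \times \cdots \times A_{B-1}\big)
&= \int_{A_1} \cdots \int_{A_{B-1}} p(y'_1, \ldots, y'_{B-1} \mid y_1, \ldots, y_B)\, \mu\big(d(y'_1, \ldots, y'_{B-1})\big) \\
&= \int_{A_1} \cdots \int_{A_{B-1}} p(y'_1, \ldots, y'_{B-2} \mid y_{\setminus y_1}, \ldots, y_{\setminus y_{B-2}})\, p(y'_{B-1} \mid y_{\setminus y_{B-1}})\, \mu_{B-1}(dy'_{B-1}) \cdots \mu_1(dy'_1) \\
&= \int_{A_1} \cdots \int_{A_{B-2}} p(y'_1, \ldots, y'_{B-2} \mid y_{\setminus y_1}, \ldots, y_{\setminus y_{B-2}}) \left( \int_{A_{B-1}} p(y'_{B-1} \mid y_{\setminus y_{B-1}})\, \mu_{B-1}(dy'_{B-1}) \right) \mu_{B-2}(dy'_{B-2}) \cdots \mu_1(dy'_1) \\
&\geq \varepsilon_{B-1}\, \nu_{B-1}(A_{B-1}) \int_{A_1} \cdots \int_{A_{B-3}} p(y'_1, \ldots, y'_{B-3} \mid \ldots) \left( \int_{A_{B-2}} p(y'_{B-2} \mid y_{\setminus y_{B-2}})\, \mu_{B-2}(dy'_{B-2}) \right) \mu_{B-3}(dy'_{B-3}) \cdots \mu_1(dy'_1) \\
&\geq \prod_{b=1}^{B-1} \varepsilon_b\, \nu_b(A_b)
\end{aligned}
\tag{30}
$$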
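As an illustrative special case (not part of the original proof), consider only two Gibbs-updated blocks, i.e. $B - 1 = 2$. The sketch assumes that each block update satisfies its own minorization, $\int_{A_b} p(y'_b \mid y_{\setminus y_b})\, \mu_b(dy'_b) \geq \varepsilon_b \nu_b(A_b)$, uniformly in the conditioning variables; primes mark the newly sampled values, which is a notational choice made here. Under these assumptions the chain of inequalities in (30) reduces to:

```latex
\begin{aligned}
P\big((Y_1, Y_2),\, A_1 \times A_2\big)
&= \int_{A_1} p(y'_1 \mid y_{\setminus y_1})
   \left( \int_{A_2} p(y'_2 \mid y_{\setminus y_2})\, \mu_2(dy'_2) \right) \mu_1(dy'_1) \\
&\geq \varepsilon_2\, \nu_2(A_2) \int_{A_1} p(y'_1 \mid y_{\setminus y_1})\, \mu_1(dy'_1) \\
&\geq \varepsilon_1 \varepsilon_2\, \nu_1(A_1)\, \nu_2(A_2)
\end{aligned}
```

so that $\varepsilon_0 = \varepsilon_1 \varepsilon_2$ and $\nu_0 = \nu_1 \times \nu_2$ in this two-block case.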

Proof of Theorem 6.2

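If we assume that the sampling order is $Y_1, \ldots, Y_B$, the Markov kernel of the Gibbs sampler can be expressed as in (31). The inequality in (31) results from Lemma 6.1 and yields a minorization condition for the entire Gibbs sampler.

$$
\begin{aligned}
P\big((Y_1, Y_2, \ldots, Y_B),\, A_1 \times A_2 \times \cdots \times A_B\big)
&= \int_{A_1} \cdots \int_{A_B} p(y'_1, \ldots, y'_B \mid y_1, \ldots, y_B)\, \mu\big(d(y'_1, \ldots, y'_B)\big) \\
&= \int_{A_B} p(y'_B \mid y'_1, \ldots, y'_{B-1}, y_B) \int_{A_1} \cdots \int_{A_{B-1}} p(y'_1, \ldots, y'_{B-1} \mid y_1, \ldots, y_{B-1}, y_B)\, \mu\big(d(y'_1, \ldots, y'_{B-1})\big) \cdot \mu_B(dy'_B) \\
&\geq \varepsilon_0 \int_{A_B} p(y'_B \mid y'_1, \ldots, y'_{B-1}, y_B) \int_{A_1} \cdots \int_{A_{B-1}} \nu_0\big(d(y'_1, \ldots, y'_{B-1})\big) \cdot \mu_B(dy'_B) \\
&= \varepsilon_0 \int_{A_1} \cdots \int_{A_{B-1}} p(y_{\setminus y_B}, A_B)\, \nu_0\big(d(y'_1, \ldots, y'_{B-1})\big)
\end{aligned}
\tag{31}
$$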

Proof of Lemma 6.2

From Def. 6.1, Table II, and (78)

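$$
w_{y_{\tilde{\theta}_n}} \propto \frac{\exp\left(-\frac{\gamma_{\varepsilon_n}}{2}\varepsilon_n^2\right) |\tilde{G}_n|^{\frac{1}{2}}}{|V_{\tilde{\theta}_n}|^{-1/2}} \cdot \frac{\exp\left[-\frac{1}{2}(\tilde{\theta}_n - \tilde{g}_n)^T \tilde{G}_n (\tilde{\theta}_n - \tilde{g}_n)\right]}{\left[1 + \frac{1}{\nu_1 + Q}(\tilde{\theta}_n - \hat{\mu}_{\tilde{\theta}_n})^T \hat{V}_{\tilde{\theta}_n}^{-1} (\tilde{\theta}_n - \hat{\mu}_{\tilde{\theta}_n})\right]^{-\frac{\nu_1 + Q}{2}}}
\tag{32}
$$

Using (10), we have

$$
\frac{|\tilde{G}_n|^{\frac{1}{2}}}{|V_{\tilde{\theta}_n}|^{-\frac{1}{2}}} = \tau |\tilde{G}_n|^{\frac{1}{2}} < \infty
\quad \text{and} \quad
\frac{1}{\prod_{k=1}^{K} |V_{\theta_k}|^{-\frac{1}{2}}} = \tau^K < \infty
$$

since $\tilde{G}_n$ is the precision matrix of a Gaussian distribution, and therefore has finite eigenvalues and determinant, and $\tau < \infty$. Moreover, $\prod_{n=1}^{N} \exp\left(-\frac{\gamma_{\varepsilon_n}}{2}\varepsilon_n^2\right) \leq 1$ because $-\frac{\gamma_{\varepsilon_n}}{2}\varepsilon_n^2 \leq 0$ for $n = 1, \ldots, N$ and $\gamma_{\varepsilon_n} > 0$. Finally, the function

$$
f(\tilde{\theta}_n) = \frac{\left[1 + \frac{1}{\nu_1 + Q}(\tilde{\theta}_n - \hat{\mu}_{\tilde{\theta}_n})^T \hat{V}_{\tilde{\theta}_n}^{-1} (\tilde{\theta}_n - \hat{\mu}_{\tilde{\theta}_n})\right]^{\frac{\nu_1 + Q}{2}}}{\exp\left(\frac{1}{2}(\tilde{\theta}_n - \tilde{g}_n)^T \tilde{G}_n (\tilde{\theta}_n - \tilde{g}_n)\right)}
\tag{33}
$$

is bounded, since $f$ is continuous and $f(\tilde{\theta}_n) \to 0$ as $\|\tilde{\theta}_n\| \to \infty$. The limit holds because the quadratic forms $(\tilde{\theta}_n - \tilde{g}_n)^T \tilde{G}_n (\tilde{\theta}_n - \tilde{g}_n)$ and $(\tilde{\theta}_n - \tilde{g}_n)^T \hat{V}_{\tilde{\theta}_n}^{-1} (\tilde{\theta}_n - \tilde{g}_n)$ increase to infinity at the same rate, so the denominator grows exponentially fast while the numerator grows only at a polynomial rate.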
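The boundedness argument can be checked numerically. The following is a minimal sketch, not the authors' code: it evaluates a ratio of the same shape as (33), a Student-t-type polynomial term over a Gaussian exponential term, along many directions at increasing radii. All numerical values (`Q`, `nu`, `G`, `V_inv`, `g`, `mu`) and the scaling constant `c` dividing the quadratic form are hypothetical and chosen only for illustration; the decay holds for any positive `c`.

```python
import numpy as np

def f(theta, g, G, mu, V_inv, nu, c):
    """Student-t-type polynomial term divided by a Gaussian exponential term.

    theta has shape (..., Q); the result has shape (...).
    """
    Q = theta.shape[-1]
    d_mu = theta - mu
    d_g = theta - g
    quad_V = np.einsum("...i,ij,...j->...", d_mu, V_inv, d_mu)
    quad_G = np.einsum("...i,ij,...j->...", d_g, G, d_g)
    # Work in the log domain to avoid overflow in the polynomial term.
    log_f = 0.5 * (nu + Q) * np.log1p(quad_V / c) - 0.5 * quad_G
    return np.exp(log_f)

# Hypothetical values chosen only for illustration.
Q, nu = 2, 5.0
G = np.array([[2.0, 0.3], [0.3, 1.0]])      # positive-definite precision matrix
V_inv = np.array([[1.0, 0.0], [0.0, 0.5]])  # positive-definite inverse scale matrix
g = np.zeros(Q)
mu = np.array([0.5, -0.5])

# Random unit directions in the Q-dimensional parameter space.
rng = np.random.default_rng(0)
directions = rng.normal(size=(2000, Q))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

radii = np.array([1.0, 10.0, 100.0, 1000.0])
# Largest value of f over the sampled directions at each radius.
sup_by_radius = np.array([
    f(r * directions, g, G, mu, V_inv, nu, c=nu + Q).max() for r in radii
])
print(sup_by_radius)
```

The printed maxima shrink rapidly with the radius, which is the behaviour the proof relies on: a continuous function that vanishes at infinity attains a finite supremum.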

Proof of Theorem 6.3

The first $B-1$ blocks of $y_n$, which are generated with the Gibbs sampler, follow well-known distributions (Table I), so their updates are readily shown to be minorisable. From Theorem 6.2, since the partial updates for all blocks except the last one are minorisable, the entire Gibbs sampler is minorisable and therefore uniformly ergodic. From Lemma 6.2, the conditional weight of the $B$th block $y_{\tilde{\theta}_n}$, generated with the MH step, is bounded. Thus the MH-within-Gibbs sampler meets the conditions of Theorem 6.1 and is uniformly ergodic.
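To make the structure of the argument concrete, here is a toy MH-within-Gibbs sampler, offered as a sketch under stated assumptions rather than as the paper's algorithm. The target is a standard bivariate normal with correlation `rho` (a hypothetical stand-in for the model's posterior): the block `x` has a closed-form conditional and is drawn with Gibbs, while the block `y` is updated with an independence Metropolis-Hastings step whose Student-t proposal has heavier tails than the Gaussian conditional, so the weight (target over proposal) is bounded. The function names and all parameter values are illustrative.

```python
import numpy as np

def log_target_cond_y(y, x, rho):
    """Log density (up to a constant) of y | x for a standard bivariate normal."""
    return -0.5 * (y - rho * x) ** 2 / (1.0 - rho ** 2)

def log_t_proposal(y, df, scale):
    """Log density (up to a constant) of a zero-centred, scaled Student-t."""
    return -0.5 * (df + 1.0) * np.log1p((y / scale) ** 2 / df)

def mh_within_gibbs(n_iter, rho=0.8, df=3.0, scale=1.5, seed=0):
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0
    samples = np.empty((n_iter, 2))
    accepted = 0
    for i in range(n_iter):
        # Gibbs block: x | y is Gaussian, so it is sampled exactly.
        x = rng.normal(rho * y, np.sqrt(1.0 - rho ** 2))
        # MH block: independence proposal that ignores the current y.
        y_new = scale * rng.standard_t(df)
        # Log weights = log target conditional minus log proposal density.
        log_w_new = log_target_cond_y(y_new, x, rho) - log_t_proposal(y_new, df, scale)
        log_w_old = log_target_cond_y(y, x, rho) - log_t_proposal(y, df, scale)
        if np.log(rng.uniform()) < log_w_new - log_w_old:
            y = y_new
            accepted += 1
        samples[i] = (x, y)
    return samples, accepted / n_iter

samples, acc_rate = mh_within_gibbs(20000)
print("acceptance rate:", acc_rate)
print("sample correlation:", np.corrcoef(samples.T)[0, 1])
```

Swapping the Student-t proposal for one with lighter tails than the target would make the weight unbounded, which is the situation the bounded-weight condition rules out.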

Appendix B

Computational Complexity

We analyze the computational complexity of our proposed framework for each MCMC step (Algorithm 1, Table II) using the "$\mathcal{O}$" notation. For an input signal and MCMC iteration, the weight of the Multinomial distribution for each atom is computed with cost $\mathcal{O}(P)$, i.e. $\mathcal{O}(PK)$ for all atoms. Sampling $L$ indices from $K$ dictionary atoms with replacement costs $\mathcal{O}(LK)$, so the total cost is $\mathcal{O}(PK + LK)$. For sampling without replacement, re-adjusting the atom weights requires $\mathcal{O}((K-1) + \ldots + (K-L+1)) \sim \mathcal{O}(LK)$. Since each of the $L$ iterations takes into account the previously selected atoms [55], the cost of the Wallenius distribution is $\mathcal{O}(K + (K-1) + \ldots + (K-L+1)) \sim \mathcal{O}(LK)$.
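The two atom-selection schemes can be sketched as follows. This is a minimal illustration with hypothetical function names, not the authors' implementation and not the sampling algorithms of [55]: the without-replacement variant uses plain sequential weighted draws, the urn scheme that underlies Wallenius' noncentral hypergeometric distribution, and it assumes that at least `L` atoms have a positive weight.

```python
import numpy as np

def sample_with_replacement(weights, L, rng):
    """Draw L atom indices independently from the normalised weights."""
    p = weights / weights.sum()  # O(K) normalisation
    return rng.choice(len(weights), size=L, replace=True, p=p)

def sample_without_replacement(weights, L, rng):
    """Draw L distinct atom indices one at a time.

    After each draw the selected atom is removed and the remaining
    weights are renormalised, which costs O(K) per draw and O(LK) overall.
    """
    w = np.asarray(weights, dtype=float).copy()
    chosen = []
    for _ in range(L):
        p = w / w.sum()              # renormalise over the remaining atoms
        k = int(rng.choice(len(w), p=p))
        chosen.append(k)
        w[k] = 0.0                   # the atom cannot be selected again
    return np.array(chosen)

rng = np.random.default_rng(0)
K, L = 50, 5                         # illustrative sizes
weights = rng.random(K) + 0.01       # strictly positive atom weights
idx_with = sample_with_replacement(weights, L, rng)
idx_without = sample_without_replacement(weights, L, rng)
print(idx_with, idx_without)
```

With replacement, the same atom may appear several times among the `L` indices; without replacement, every selected index is distinct by construction.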

Each of the $L$ coefficients is generated from the normal distribution, whose mean requires $\mathcal{O}(P)$, therefore the entire complexity is $\mathcal{O}(LP)$. Regarding the dictionary atom parameters, their posterior (7) requires $\mathcal{O}(L(P + Q^2))$, while the typical cost of the Nelder-Mead simplex is $\mathcal{O}(Q^2)$ [80]. Finally, the complexity is $\mathcal{O}(1)$ for the noise $\varepsilon_n$, $\mathcal{O}(K + L)$ for the Dirichlet prior $\pi_n$, $\mathcal{O}(Q^2)$ for the mean $g_{nk}$ and precision $G_n$ of the dictionary parameters prior, and $\mathcal{O}(P)$ for the noise precision $\gamma_{\varepsilon_n}$.

Taking these into account, the total complexity when using sampling with and without replacement, respectively, yields:

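$$
\mathcal{O}\left(PK + LK + 2LP + (L+3)Q^2 + P + K + L + 1\right) \sim \mathcal{O}\left(PK + LK + LP + LQ^2\right) \tag{34}
$$

$$
\mathcal{O}\left(2LK + 2LP + (L+3)Q^2 + P + K + L + 1\right) \sim \mathcal{O}\left(LK + LP + LQ^2\right) \tag{35}
$$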

The cost of our approach is therefore linear in the signal dimensionality $P$, the number of dictionary atoms $K$, and the number of selected atoms in the representation $L \ll K$, and quadratic in the number of dictionary parameters $Q$. We note that $Q \ll P, L$, so the quadratic term is not too expensive.
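As a worked illustration of the operation counts in (34) and (35), the sketch below simply evaluates the two unsimplified sums for hypothetical problem sizes (`P = 100`, `K = 50`, `L = 5`, `Q = 3`; none of these values come from the paper). The function names are illustrative, and the results are symbolic tallies, not run-time predictions.

```python
def cost_with_replacement(P, K, L, Q):
    """Unsimplified operation count from (34)."""
    return P * K + L * K + 2 * L * P + (L + 3) * Q ** 2 + P + K + L + 1

def cost_without_replacement(P, K, L, Q):
    """Unsimplified operation count from (35)."""
    return 2 * L * K + 2 * L * P + (L + 3) * Q ** 2 + P + K + L + 1

# Hypothetical problem sizes, chosen only for illustration.
P, K, L, Q = 100, 50, 5, 3

total_with = cost_with_replacement(P, K, L, Q)
total_without = cost_without_replacement(P, K, L, Q)
quadratic_term = (L + 3) * Q ** 2    # the only part that depends on Q

print(total_with, total_without, quadratic_term)
# Share of the count attributable to the Q-dependent term.
print(quadratic_term / total_with)
```

For these sizes the $Q$-dependent term contributes 72 operations out of 6478 in (34), which illustrates why the quadratic dependence on $Q$ is inexpensive when $Q$ is small.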

We further compare run-time statistics of all the approaches. We report the average duration of one training iteration and the number of estimated variables for each approach (Table B1). Experiments were performed on an Intel Core i7 processor with a 2.93 GHz CPU and 7.8 GB of RAM. SD and K-SVD require a decomposition step, therefore they compute slightly more variables than ETF. Our method estimates the highest number of variables, including the priors of the considered problem. Results indicate that SD and ETF are computationally less expensive than the proposed MCMC. Consistent with previous observations [19], K-SVD also has high computational cost. MCMC sampling with and without replacement yield computation times of the same order.

TABLE B1.

Average computation time of dictionary learning algorithms

| Method | # Variables | Computation time (sec / training iteration) |
| --- | --- | --- |
| SD | 438 | 0.1173 |
| ETF | 432 | 0.0616 |
| K-SVD | 438 | 3.9536 |
| MCMC with replacement | 1024 | 3.5014 |
| MCMC without replacement | 1024 | 4.9638 |

Footnotes

This paper has supplementary material available including proofs and derivations.

Contributor Information

Theodora Chaspari, Signal Analysis and Interpretation Laboratory (SAIL) at the Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA, 90089, USA.

Andreas Tsiartas, SRI International, Menlo Park, CA, 94025-3493, USA.

Panagiotis Tsilifis, Department of Mathematics at the Dana and David Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, CA, 90089, USA.

Shrikanth Narayanan, Signal Analysis and Interpretation Laboratory (SAIL) at the Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA, 90089, USA.

References

  • 1. Elad M. Sparse and redundant representations: From theory to applications in signal and image processing. Springer; 2010.
  • 2. Mallat SG, Zhang Z. Matching pursuits with time-frequency dictionaries. IEEE TSP. 1993;41(12):3397–3415.
  • 3. Pati YC, Ramin R, Krishnaprasad PS. Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition; Conference on Signals, Systems and Computers; 1993.
  • 4. Davis G, Mallat SG, Avellaneda M. Adaptive greedy approximations. Constructive Approximation. 1997;13(1):57–98.
  • 5. Chen SS, Donoho DL, Saunders MA. Atomic decomposition by basis pursuit. SIAM Journal on Scientific Computing. 1998;20(1):33–61.
  • 6. Rao BD, Kreutz-Delgado K. An affine scaling methodology for best basis selection. IEEE TSP. 1999;47(1):187–200.
  • 7. Rao BD, Engan K, Cotter SF, Palmer J, Kreutz-Delgado K. Subset selection in noise based on diversity measure minimization. IEEE TSP. 2003;51(3):760–770.
  • 8. Wipf DP, Rao BD. Sparse Bayesian learning for basis selection. IEEE TSP. 2004;52(8):2153–2164.
  • 9. Wipf DP, Rao BD. An empirical Bayesian strategy for solving the simultaneous sparse approximation problem. IEEE TSP. 2007;55(7):3704–3716.
  • 10. Ji S, Xue Y, Carin L. Bayesian compressive sensing. IEEE TSP. 2008;56(6):2346–2356.
  • 11. Daugman JG. Two-dimensional spectral analysis of cortical receptive field profiles. Vision research. 1980;20(10):847–856. doi: 10.1016/0042-6989(80)90065-6.
  • 12. Daubechies I. The wavelet transform, time-frequency localization and signal analysis. Information Theory. 1990;36(5):961–1005.
  • 13. Starck JL, Candès EJ, Donoho DL. The curvelet transform for image denoising. IEEE TIP. 2002;11(6):670–684. doi: 10.1109/TIP.2002.1014998.
  • 14. Aharon M, Elad M, Bruckstein A. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE TSP. 2006;54(11):4311–4322.
  • 15. Engan K, Aase SO, Hakon Husoy J. Method of optimal directions for frame design. Proc. ICASSP. 1999:2443–2446.
  • 16. Olshausen BA, Field DJ. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature. 1996;381(6583):607–609. doi: 10.1038/381607a0.
  • 17. Mairal J, Bach F, Ponce J, Sapiro G. Online dictionary learning for sparse coding. Proc. ACM. 2009:689–696.
  • 18. Mairal J, Bach F, Ponce J, Sapiro G, Zisserman A. Supervised dictionary learning. Proc. NIPS. 2008:1033–1040.
  • 19. Rubinstein R, Bruckstein AM, Elad M. Dictionaries for sparse representation modeling. Proc. IEEE. 2010;98(6):1045–1057.
  • 20. Skretting K, Engan K. Image compression using learned dictionaries by RLS-DLA and compared with K-SVD. Proc. ICASSP. 2011:1517–1520.
  • 21. Thanou D, Shuman D, Frossard P. Learning Parametric Dictionaries for Signals on Graphs. IEEE TSP. 2014;62(15):3849–3862.
  • 22. Ataee M, Zayyani H, Babaie-Zadeh M, Jutten C. Parametric dictionary learning using steepest descent. Proc. ICASSP. 2010:1987–1981.
  • 23. Szu HH, Telfer BA, Kadambe SL. Neural network adaptive wavelets for signal representation and classification. Optical Engineering. 1992;21(9):1907–1916.
  • 24. Yaghoobi M, Daudet L, Davies ME. Parametric dictionary design for sparse coding. IEEE TSP. 2009;57(12):4800–4810.
  • 25. Merlet S, Caruyer E, Ghosh A, Deriche R. A computational diffusion MRI and parametric dictionary learning framework for modeling the diffusion signal and its features. Medical Image Analysis. 2013;17(9):830–843. doi: 10.1016/j.media.2013.04.011.
  • 26. Barthélemy Q, Gouy-Pailler C, Isaac Y, Souloumiac A, Larue A, Mars JI. Multivariate temporal dictionary learning for EEG. Journal of neuroscience methods. 2013;215(1):19–28. doi: 10.1016/j.jneumeth.2013.02.001.
  • 27. Reisert M, Skibbe H, Kiselev VG. The diffusion dictionary in the human brain is short: Rotation invariant learning of basis functions. Proc. CDMRI. 2014:47–55.
  • 28. Chaspari T, Tsiartas A, Stein LI, Cermak SA, Narayanan SS. Sparse representation of electrodermal activity with knowledge-driven dictionaries. IEEE TBME. 2015;62(3):960–971. doi: 10.1109/TBME.2014.2376960.
  • 29. Tipping ME. Advanced lectures on machine Learning. Springer; 2004. Bayesian inference: An introduction to principles and practice in machine learning; pp. 41–62.
  • 30. Ophir B, Elad M, Bertin N, Plumbley MD. Sequential minimal eigenvalues - An approach to analysis dictionary learning. Proc. EUSIPCO. 2011
  • 31. Vidal R, Ma Y, Sastry S. Generalized principal component analysis. IEEE PAMI. 2005;27(12):1945–1959. doi: 10.1109/TPAMI.2005.244.
  • 32. Aharon M, Elad M. Sparse and redundant modeling of image content using an image-signature-dictionary. SIIMS. 2008;1(3):228–247.
  • 33. Rubinstein R, Zibulevsky M, Elad M. Double sparsity: Learning sparse dictionaries for sparse signal approximation. IEEE TSP. 2010;58(3):1553–1564.
  • 34. Mazaheri JA, Guillemot C, Labit C. Learning a tree-structured dictionary for efficient image representation with adaptive sparse coding. Proc. ICASSP. 2013:1320–1324.
  • 35. Blumensath T, Davies M. Sparse and shift-invariant representations of music. IEEE TASLP. 2006;14(1):50–57.
  • 36. Engan K, Skretting K, Husøy JH. Family of iterative LS-based dictionary learning algorithms, ILS-DLA, for sparse signal representation. Digital Signal Processing. 2007;17(1):32–49.
  • 37. Jasper WJ, Garnier SJ, Potlapalli H. Texture characterization and defect detection using adaptive wavelets. Optical Engineering. 1996;35(11):3140–3149.
  • 38. Blumensath T. Monte Carlo methods for compressed sensing. Proc. ICASSP. 2014:1000–1004.
  • 39. Elad M, Yavneh I. A plurality of sparse representations is better than the sparsest one alone. IEEE TIT. 2009;55(10):4701–4714.
  • 40. Lewicki MS, Olshausen BA. Probabilistic framework for the adaptation and comparison of image codes. JOSA A. 1999;16(7):1587–1601.
  • 41. Lewicki MS, Sejnowski T. Learning overcomplete representations. Neural computation. 2000;12(2):337–365. doi: 10.1162/089976600300015826.
  • 42. Tipping ME. Sparse Bayesian learning and the relevance vector machine. The journal of machine learning research. 2001;1:211–244.
  • 43. Li L, Silva J, Zhou M, Carin L. Online Bayesian dictionary learning for large datasets. Proc. ICASSP. 2012:2157–2160.
  • 44. Zhou M, Chen H, Paisley J, Ren L, Li L, Xing Z, Dunson D, Sapiro G, Carin L. Nonparametric Bayesian dictionary learning for analysis of noisy and incomplete images. IEEE TIP. 2012;21(1):130–144. doi: 10.1109/TIP.2011.2160072.
  • 45. He L, Qi H, Zaretzki R. Non-parametric Bayesian dictionary learning for image super resolution. Proc. FIIW. 2011:122–125.
  • 46. Crandall R, Dong B, Bilgin A. Randomized iterative hard thresholding: A fast approximate MMSE estimator for sparse approximations. 2013 http://math.arizona.edu/~dongbin/Publications/RandIHT.
  • 47. Sallee P, Olshausen BA. Learning sparse multiscale image representations. Proc. NIPS. 2002:1327–1334.
  • 48. Reaz MBI, Hussain MS, Mohd-Yasin F. Techniques of EMG signal analysis: detection, processing, classification and applications. Biological procedures online. 2006;8(1):11–35. doi: 10.1251/bpo115.
  • 49. Friston KJ, Holmes AP, Worsley KJ, Poline JP, Frith CD, Frackowiak RSJ. Statistical parametric maps in functional imaging: A general linear approach. Human Brain Mapping. 1995;2:189–210.
  • 50. Grimmer J. An introduction to Bayesian inference via variational approximations. Political Analysis. 2010;19(1):32–47.
  • 51. Meyn PS, Tweedie LR. Markov chains and stochastic stability. 2nd. New York: Cambridge University Press; 2009.
  • 52. Mengersen KL, Tweedie RL. Rates of convergence of the Hastings and Metropolis algorithms. The Annals of Statistics. 1996;24(1):101–121.
  • 53. Tsagkias M, De Rijke M, Weerkamp W. Hypergeometric language models for republished article finding. Proc. SIGIR. 2011:485–494.
  • 54. Fog A. Calculation methods for Wallenius’ noncentral hypergeometric distribution. Communications in Statistics-Simulation and Computation. 2008;37(2):258–273.
  • 55. Fog A. Sampling methods for Wallenius’ and Fisher’s noncentral hypergeometric distributions. Communications in Statistics-Simulation and Computation. 2008;37(2):241–257.
  • 56. Wallenius KT. Biased sampling; the noncentral hypergeometric probability distribution. Tech. Rep., DTIC Document. 1963
  • 57. Chib S, Jeliazkov I. Marginal likelihood from the Metropolis-Hastings output. JASA. 2001;96(453):270–281.
  • 58. Chib S, Greenberg E. Understanding the Metropolis-Hastings algorithm. The American Statistician. 1995;49(4):327–335.
  • 59. Johnson A. Ph.D. thesis. University of Minnesota; 2009. Geometric ergodicity of Gibbs samplers.
  • 60. Johnson AA, Jones LG, Neath CR. Component-wise Markov chain Monte Carlo: Uniform and geometric ergodicity under mixing and composition. Statistical Science. 2013;28(3):360–375.
  • 61. Jones LG, Roberts OG, Rosenthal JS. Convergence of conditional Metropolis-Hastings samplers. Advances in Applied Probability. 2014;46(2):422–445.
  • 62. Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E. Equation of state calculations by fast computing machines. The journal of chemical physics. 1953;21(6):1087–1092.
  • 63. Hastings WK. Monte Carlo sampling methods using Markov chains and their applications. Biometrika. 1970;57(1):97–109.
  • 64. Roberts GO, Gelman A, Gilks WR. Weak convergence and optimal scaling of random walk Metropolis algorithms. The annals of applied probability. 1997;7(1):110–120.
  • 65. Chib S, Greenberg E, Winkelmann R. Posterior simulation and Bayes factors in panel count data models. Journal of Econometrics. 1986;86(1):33–54.
  • 66. Bédard M, Fraser DAS. On a Directionally Adjusted Metropolis-Hastings Algorithm. International Journal of Statistical Sciences. 2008;9(1):33–57.
  • 67. Chib S, Ramamurthy S. Tailored randomized block MCMC methods with application to DSGE models. Journal of Econometrics. 2010;155(1):19–38.
  • 68. Nelder JA, Mead R. A simplex method for function minimization. The computer journal. 1965;7(4):308–313.
  • 69. Lobo AP, Loizou PC. Voiced/unvoiced speech discrimination in noise using gabor atomic decomposition. Proc. ICASSP. 2003:817–820.
  • 70. Tošić I, Frossard P. Dictionary learning for stereo image representation. IEEE TIP. 2011;20(4):921–934. doi: 10.1109/TIP.2010.2081679.
  • 71. Demanet L, Ying L. Wave atoms and sparsity of oscillatory patterns. Applied and Computational Harmonic Analysis. 2007;23(3):368–387.
  • 72. Barthélemy Q, Gouy-Pailler C, Isaac Y, Souloumiac A, Larue A, Mars JI. Multivariate temporal dictionary learning for EEG. Journal of neuroscience methods. 2013;215(1):19–28. doi: 10.1016/j.jneumeth.2013.02.001.
  • 73. Ruiz-Reyes N, Vera-Candeas P, Reche-López PJ, Canadas-Quesada F. A time-frequency adaptive signal model-based approach for parametric ecg compression. Proc. EUSIPCO. 2006:1–5.
  • 74. Jones GL, Hobert JP. Honest exploration of intractable probability distributions via Markov Chain Monte Carlo. Statistical Science. 2001:312–334.
  • 75. Koelstra S, Muhl C, Soleymani M, Lee JS, Yazdani A, Ebrahimi T, Pun T, Nijholt A, Patras I. DEAP: A database for emotion analysis using physiological signals. IEEE TAC. 2012;3(1):18–31.
  • 76. Dawson ME, Schell AM, Filion DL. The Electrodermal System. In: Cacioppo JT, Tassinary LG, Berntson GG, editors. Handbook of psychophysiology. 3rd. New York: Cambridge University Press; 2007. pp. 159–181.
  • 77. Geweke J. Evaluating the accuracy of sampling-based approaches to the calculation of posterior moments. In: Bernardo JM, Berger J, Dawid AP, Smith AFM, editors. Bayesian Statistics. Vol. 4. Oxford University Press; 1991. pp. 169–193.
  • 78. Cowles MK, Carlin BP. Markov chain Monte Carlo convergence diagnostics: a comparative review. JASA. 1996;91(434):883–904.
  • 79. Rosenthal JS. Optimal proposal distributions and adaptive mcmc. Handbook of Markov Chain Monte Carlo. 2011:93–112.
  • 80. Singer S, Singer S. Complexity analysis of Nelder-Mead search iterations; Proc. Conference on Applied Mathematics and Computation; 1999. pp. 185–196.
  • 81. Bishop Christopher M. Model-based machine learning. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences. 2013;371(1984) doi: 10.1098/rsta.2012.0222.
  • 82. Qiu Q, Patel VM, Pavan T, Chellappa R. Domain adaptive dictionary learning. Proc. ECCV. 2012:631–645.
  • 83. Rosenthal JS. Minorization conditions and convergence rates for Markov Chain Monte Carlo. JASA. 1995;90(430):558–566.
  • 84. Rosenthal JS. Theoretical rates of convergence for Markov Chain Monte Carlo. Computing Science and Statistics. 1994:486–486.
