Author manuscript; available in PMC: 2010 Sep 23.
Published in final edited form as: Neural Comput. 2003 Feb;15(2):349–396. doi: 10.1162/089976603762552951

Dictionary Learning Algorithms for Sparse Representation

Kenneth Kreutz-Delgado 1, Joseph F Murray 2, Bhaskar D Rao 3, Kjersti Engan 4, Te-Won Lee 5, Terrence J Sejnowski 6
PMCID: PMC2944020  NIHMSID: NIHMS234072  PMID: 12590811

Abstract

Algorithms for data-driven learning of domain-specific overcomplete dictionaries are developed to obtain maximum likelihood and maximum a posteriori dictionary estimates based on the use of Bayesian models with concave/Schur-concave (CSC) negative log priors. Such priors are appropriate for obtaining sparse representations of environmental signals within an appropriately chosen (environmentally matched) dictionary. The elements of the dictionary can be interpreted as concepts, features, or words capable of succinct expression of events encountered in the environment (the source of the measured signals). This is a generalization of vector quantization in that one is interested in a description involving a few dictionary entries (the proverbial “25 words or less”), but not necessarily as succinct as one entry. To learn an environmentally adapted dictionary capable of concise expression of signals generated by the environment, we develop algorithms that iterate between a representative set of sparse representations found by variants of FOCUSS and an update of the dictionary using these sparse representations.

Experiments were performed using synthetic data and natural images. For complete dictionaries, we demonstrate that our algorithms have improved performance over other independent component analysis (ICA) methods, measured in terms of signal-to-noise ratios of separated sources. In the overcomplete case, we show that the true underlying dictionary and sparse sources can be accurately recovered. In tests with natural images, learned overcomplete dictionaries are shown to have higher coding efficiency than complete dictionaries; that is, images encoded with an overcomplete dictionary have both higher compression (fewer bits per pixel) and higher accuracy (lower mean square error).

1 Introduction

FOCUSS, which stands for FOCal Underdetermined System Solver, is an algorithm designed to obtain suboptimally (and, at times, maximally) sparse solutions to the following m × n, underdetermined linear inverse problem1 (Gorodnitsky, George, & Rao, 1995; Rao & Gorodnitsky, 1997; Gorodnitsky & Rao, 1997; Adler, Rao, & Kreutz-Delgado, 1996; Rao & Kreutz-Delgado, 1997; Rao, 1997, 1998),

y=Ax, (1.1)

for known A. The sparsity of a vector is the number of zero-valued elements (Donoho, 1994), and is related to the diversity, the number of nonzero elements,

sparsity = #{x[i] = 0},   diversity = #{x[i] ≠ 0},   diversity = n − sparsity.
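As a small illustration (our own, not part of the original paper), these counts can be computed directly; the vector below is a made-up example and numpy is assumed to be available:

```python
import numpy as np

x = np.array([0.0, 1.3, 0.0, -0.7, 0.0])  # hypothetical coefficient vector
n = x.size

sparsity = int(np.sum(x == 0))    # number of zero-valued elements
diversity = int(np.sum(x != 0))   # number of nonzero elements
assert diversity == n - sparsity

print(sparsity, diversity)        # prints: 3 2
```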

Since our initial investigations into the properties of FOCUSS as an algorithm for providing sparse solutions to linear inverse problems in relatively noise-free environments (Gorodnitsky et al., 1995; Rao, 1997; Rao & Gorodnitsky, 1997; Gorodnitsky & Rao, 1997; Adler et al., 1996; Rao & Kreutz-Delgado, 1997), we now better understand the behavior of FOCUSS in noisy environments (Rao & Kreutz-Delgado, 1998a, 1998b) and as an interior point-like optimization algorithm for optimizing concave functionals subject to linear constraints (Rao & Kreutz-Delgado, 1999; Kreutz-Delgado & Rao, 1997, 1998a, 1998b, 1998c, 1999; Kreutz-Delgado, Rao, Engan, Lee, & Sejnowski, 1999a; Engan, Rao, & Kreutz-Delgado, 2000; Rao, Engan, Cotter, & Kreutz-Delgado, 2002). In this article, we consider the use of the FOCUSS algorithm in the case where the matrix A is unknown and must be learned. Toward this end, we first briefly discuss how the use of concave (and Schur concave) functionals enforces sparse solutions to equation 1.1. We also discuss the choice of the matrix, A, in equation 1.1 and its relationship to the set of signal vectors y for which we hope to obtain sparse representations. Finally, we present algorithms capable of learning an environmentally adapted dictionary, A, given a sufficiently large and statistically representative sample of signal vectors, y, building on ideas originally presented in Kreutz-Delgado, Rao, Engan, Lee, and Sejnowski (1999b), Kreutz-Delgado, Rao, and Engan (1999), and Engan, Rao, and Kreutz-Delgado (1999).

We refer to the columns of the full row-rank m × n matrix A,

A = [a1, …, an] ∈ ℝ^{m×n},   n ≥ m,   (1.2)

as a dictionary, and they are assumed to be a set of vectors capable of providing a highly succinct representation for most (and, ideally, all) statistically representative signal vectors y ∈ ℝm. Note that with the assumption that rank(A) = m, every vector y has a representation; the question at hand is whether this representation is likely to be sparse. We call the statistical generating mechanism for signals, y, the environment and a dictionary, A, within which such signals can be sparsely represented an environmentally adapted dictionary.

Environmentally generated signals typically have significant statistical structure and can be represented by a set of basis vectors spanning a lower-dimensional submanifold of meaningful signals (Field, 1994; Ruderman, 1994). These environmentally meaningful representation vectors can be obtained by maximizing the mutual information between the set of these vectors (the dictionary) and the signals generated by the environment (Comon, 1994; Bell & Sejnowski, 1995; Deco & Obradovic, 1996; Olshausen & Field, 1996; Zhu, Wu, & Mumford, 1997; Wang, Lee, & Juang, 1997). This procedure can be viewed as a natural generalization of independent component analysis (ICA) (Comon, 1994; Deco & Obradovic, 1996). As initially developed, this procedure usually results in obtaining a minimal spanning set of linearly independent vectors (i.e., a true basis). More recently, the desirability of obtaining “overcomplete” sets of vectors (or “dictionaries”) has been noted (Olshausen & Field, 1996; Lewicki & Sejnowski, 2000; Coifman & Wickerhauser, 1992; Mallat & Zhang, 1993; Donoho, 1994; Rao & Kreutz-Delgado, 1997). For example, projecting measured noisy signals onto the signal submanifold spanned by a set of dictionary vectors results in noise reduction and data compression (Donoho, 1994, 1995). These dictionaries can be structured as a set of bases from which a single basis is to be selected to represent the measured signal(s) of interest (Coifman & Wickerhauser, 1992) or as a single, overcomplete set of individual vectors from within which a vector, y, is to be sparsely represented (Mallat & Zhang, 1993; Olshausen & Field, 1996; Lewicki & Sejnowski, 2000; Rao & Kreutz-Delgado, 1997).

The problem of determining a representation from a full row-rank overcomplete dictionary, A = [a1, …, an], n ≥ m, for a specific signal measurement, y, is equivalent to solving an underdetermined inverse problem, Ax = y, which is nonuniquely solvable for any y. The standard least-squares solution to this problem has the (at times) undesirable feature of involving all the dictionary vectors in the solution2 (the “spurious artifact” problem) and does not generally allow for the extraction of a categorically or physically meaningful solution. That is, it is not generally the case that a least-squares solution yields a concise representation allowing for a precise semantic meaning.3 If the dictionary is large and rich enough in representational power, a measured signal can be matched to a very few (perhaps even just one) dictionary words. In this manner, we can obtain concise semantic content about objects or situations encountered in natural environments (Field, 1994). Thus, there has been significant interest in finding sparse solutions, x (solutions having a minimum number of nonzero elements), to the signal representation problem. Interestingly, matching a specific signal to a sparse set of dictionary words or vectors can be related to entropy minimization as a means of elucidating statistical structure (Watanabe, 1981). Finding a sparse representation (based on the use of a “few” code or dictionary words) can also be viewed as a generalization of vector quantization where a match to a single “code vector” (word) is always sought (taking “code book” = “dictionary”).4 Indeed, we can refer to a sparse solution, x, as a sparse coding of the signal instantiation, y.

1.1 Stochastic Models

It is well known (Basilevsky, 1994) that the stochastic generative model

y = Ax + ν,   (1.3)

can be used to develop algorithms enabling coding of y ∈ ℝm via solving the inverse problem for a sparse solution x ∈ ℝn for the undercomplete (n < m) and complete (n = m) cases. In recent years, there has been a great deal of interest in obtaining sparse codings of y with this procedure for the overcomplete (n > m) case (Mallat & Zhang, 1993; Field, 1994). In our earlier work, we have shown that given an overcomplete dictionary, A (with the columns of A comprising the dictionary vectors), a maximum a posteriori (MAP) estimate of the source vector, x, will yield a sparse coding of y in the low-noise limit if the negative log prior, −log(P(x)), is concave/Schur-concave (CSC) (Rao, 1998; Kreutz-Delgado & Rao, 1999), as discussed below. For P(x) factorizable into a product of marginal probabilities, the resulting code is also known to provide an independent component analysis (ICA) representation of y. More generally, a CSC prior results in a sparse representation even in the nonfactorizable case (with x then forming a dependent component analysis, or DCA, representation).

Given independently and identically distributed (i.i.d.) data, Y = YN = (y1, …, yN), assumed to be generated by the model 1.3, a maximum likelihood estimate, ÂML, of the unknown (but nonrandom) dictionary A can be determined as (Olshausen & Field, 1996; Lewicki & Sejnowski, 2000)

ÂML = arg max_A P(Y; A).

This requires integrating out the unobservable i.i.d. source vectors, X = XN = (x1, …, xN), in order to compute P(Y; A) from the (assumed) known probabilities P(x) and P(ν). In essence, X is formally treated as a set of nuisance parameters that in principle can be removed by integration. However, because the prior P(x) is generally taken to be supergaussian, this integration is intractable or computationally unreasonable. Thus, approximations to this integration are performed that result in an approximation to P(Y; A), which is then maximized with respect to A. A new, better approximation to the integration can then be made, and this process is iterated until the estimate of the dictionary A has (hopefully) converged (Olshausen & Field, 1996). We refer to the resulting estimate as an approximate maximum likelihood (AML) estimate of the dictionary A (denoted here by ÂAML). No formal proof of the convergence of this algorithm to the true maximum likelihood estimate, ÂML, has been given in the prior literature, but it appears to perform well in various test cases (Olshausen & Field, 1996). Below, we discuss the problem of dictionary learning within the framework of our recently developed log-prior model-based sparse source vector learning approach that for a known overcomplete dictionary can be used to obtain sparse codes (Rao, 1998; Kreutz-Delgado & Rao, 1997, 1998b, 1998c, 1999; Rao & Kreutz-Delgado, 1999). Such sparse codes can be found using FOCUSS, an affine scaling transformation (AST)–like iterative algorithm that finds a sparse locally optimal MAP estimate of the source vector x for an observation y. Using these results, we can develop dictionary learning algorithms within the AML framework and for obtaining a MAP-like estimate, ÂMAP, of the (now assumed random) dictionary, A, assuming in the latter case that the dictionary belongs to a compact submanifold corresponding to unit Frobenius norm. Under certain conditions, convergence to a local minimum of a MAP-loss function that combines functions of the discrepancy e = (y − Ax) and the degree of sparsity in x can be rigorously proved.

1.2 Related Work

Previous work includes efforts to solve equation 1.3 in the overcomplete case within the maximum likelihood (ML) framework. An algorithm for finding sparse codes was developed in Olshausen and Field (1997) and tested on small patches of natural images, resulting in Gabor-like receptive fields. In Lewicki and Sejnowski (2000) another ML algorithm is presented, which uses the Laplacian prior to enforce sparsity. The values of the elements of x are found with a modified conjugate gradient optimization (which has a rather complicated implementation) as opposed to the standard ICA (square mixing matrix) case where the coefficients are found by inverting the A matrix. The difficulty that arises when using ML is that finding the estimate of the dictionary A requires integrating over all possible values of the joint density P(y, x; A) as a function of x. In Olshausen and Field (1997), this is handled by assuming the prior density of x is a delta function, while in Lewicki and Sejnowski (2000), it is approximated by a gaussian. The fixed-point FastICA (Hyvärinen, Cristescu, & Oja, 1999) has also been extended to generate overcomplete representations. The FastICA algorithm can find the basis functions (columns of the dictionary A) one at a time by imposing a quasi-orthogonality condition and can be thought of as a “greedy” algorithm. It also can be run in parallel, meaning that all columns of A are updated together.

Other methods to solve equation 1.3 in the overcomplete case have been developed using a combination of the expectation-maximization (EM) algorithm and variational approximation techniques. Independent factor analysis (Attias, 1999) uses a mixture of gaussians to approximate the prior density of the sources, which avoids the difficulty of integrating out the parameters X and allows different sources to have different densities. In another method (Girolami, 2001), the source priors are assumed to be supergaussian (heavy-tailed), and a variational lower bound is developed that is used in the EM estimation of the parameters A and X. It is noted in Girolami (2001) that the mixtures used in independent factor analysis are more general than may be needed for the sparse overcomplete case, and they can be computationally expensive as the dimension of the data vector and number of mixtures increases.

In Zibulevsky and Pearlmutter (2001), the blind source separation problem is formulated in terms of a sparse source underlying each unmixed signal. These sparse sources are expanded into the unmixed signal with a predefined wavelet dictionary, which may be overcomplete. The unmixed signals are linearly combined by a different mixing matrix to create the observed sensor signals. The method is shown to give better separation performance than ICA techniques. The use of learned dictionaries (instead of being chosen a priori) is suggested.

2 FOCUSS: Sparse Solutions for Known Dictionaries

2.1 Known Dictionary Model

A Bayesian interpretation is obtained from the generative signal model, equation 1.3, by assuming that x has the parameterized (generally nongaussian) probability density function (pdf),

Pp(x) = Zp^{-1} e^{−γp dp(x)},   Zp = ∫ e^{−γp dp(x)} dx,   (2.1)

with parameter vector p. Similarly, the noise ν is assumed to have a parameterized (possibly nongaussian) density Pq(ν) of the same form as equation 2.1 with parameter vector q. It is assumed that x and ν have zero means and that their densities obey the property d(x) = d(|x|), for | · | defined component-wise. This is equivalent to assuming that the densities are symmetric with respect to sign changes in the components x, x[i] ← −x[i] and therefore that the skews of these densities are zero. We also assume that d(0) = 0. With a slight abuse of notation, we allow the differing subscripts q and p to indicate that dq and dp may be functionally different as well as parametrically different. We refer to densities like equation 2.1 for suitable additional constraints on dp(x), as hypergeneralized gaussian distributions (Kreutz-Delgado & Rao, 1999; Kreutz-Delgado et al., 1999).

If we treat A, p, and q as known parameters, then x and y are jointly distributed as

P(x,y)=P(x,y;p,q,A).

Bayes’ rule yields

P(x | y; p, q, A) = (1/β) P(y | x; q, A) · P(x; p) = (1/β) Pq(y − Ax) · Pp(x)   (2.2)
β = P(y) = P(y; p, q, A) = ∫ P(y | x) · Pp(x) dx.   (2.3)

Usually the dependence on p and q is notationally suppressed, for example, β = P(y; A). Given an observation, y, maximizing equation 2.2 with respect to x yields a MAP estimate x̂. This ideally results in a sparse coding of the observation, a requirement that places functional constraints on the pdfs, and particularly on dp. Note that β is independent of x and can be ignored when optimizing equation 2.2 with respect to the unknown source vector x.

The MAP estimate equivalently is obtained from minimizing the negative logarithm of P(x | y), which is

x̂ = arg min_x dq(y − Ax) + λdp(x),   (2.4)

where λ = γp/γq and dq(y − Ax) = dq(Ax − y) by our assumption of symmetry. The quantity 1/λ is interpretable as a signal-to-noise ratio (SNR). Furthermore, assuming that both dq and dp are concave/Schur–concave (CSC) as defined in section 2.4, then the term dq(y − Ax) in equation 2.4 will encourage sparse residuals, e = y − Ax̂, while the term dp(x) encourages sparse source vector estimates, x̂. A given value of λ then determines a trade-off between residual and source vector sparseness.

This most general formulation will not be used here. Although we are interested in obtaining sparse source vector estimates, we will not enforce sparsity on the residuals but instead, to simplify the development, will assume the q = 2 i.i.d. gaussian measurement noise case (ν gaussian with known covariance σ2 · I), which corresponds to taking,

γq dq(y − Âx̂) = (1/2σ²) ||y − Âx̂||².   (2.5)

In this case, problem 2.4 becomes

x̂ = arg min_x (1/2)||y − Ax||² + λdp(x).   (2.6)

In either case, we note that λ → 0 as the measurement noise vanishes (γq → ∞), which (consistent with the generative model, 1.3) we refer to as the low-noise limit. Because the mapping A is assumed to be onto, in the low-noise limit, the optimization, equation 2.4, is equivalent to the linearly constrained problem,

x̂ = arg min_x dp(x)   subject to   Ax = y.   (2.7)

In the low-noise limit, no sparseness constraint need be placed on the residuals, e = y − Ax̂, which are assumed to be zero. It is evident that the structure of dp(·) is critical for obtaining a sparse coding, x̂, of the observation y (Kreutz-Delgado & Rao, 1997; Rao & Kreutz-Delgado, 1999). Throughout this article, the quantity dp(x) is always assumed to be CSC (enforcing sparse solutions to the inverse problem 1.3). As noted, and as will be evident during the development of dictionary learning algorithms below, we do not impose a sparsity constraint on the residuals; instead, the measurement noise ν will be assumed to be gaussian (q = 2).

2.2 Independent Component Analysis and Sparsity Inducing Priors

An important class of densities is given by the generalized gaussians for which

dp(x) = ||x||_p^p = Σ_{k=1}^n |x[k]|^p,   (2.8)

for p > 0 (Kassam, 1982). This is a special case of the larger ℓp class (the p-class) of functions, which allows p to be negative in value (Rao & Kreutz-Delgado, 1999; Kreutz-Delgado & Rao, 1997). Note that this function has the special property of separability,

dp(x) = Σ_{k=1}^n dp(x[k]),

which corresponds to factorizability of the density Pp(x),

Pp(x) = ∏_{k=1}^n Pp(x[k]),

and hence to independence of the components of x. The assumption of independent components allows the problem of solving the generative model, equation 1.3, for x to be interpreted as an ICA problem (Comon, 1994; Pham, 1996; Olshausen & Field, 1996; Roberts, 1998). It is of interest, then, to consider the development of a large class of parameterizable separable functions dp(x) consistent with the ICA assumption (Rao & Kreutz-Delgado, 1999; Kreutz-Delgado & Rao, 1997). Given such a class, it is natural to examine the issue of finding a best fit within this class to the “true” underlying prior density of x. This is a problem of parametric density estimation of the true prior, where one attempts to find an optimal choice of the model density Pp(x) by an optimization over the parameters p that define the choice of a prior from within the class. This is, in general, a difficult problem, which may require the use of Monte Carlo, evolutionary programming, or stochastic search techniques.

Can the belief that supergaussian priors, Pp(x), are appropriate for finding sparse solutions to equation 1.3 (Field, 1994; Olshausen & Field, 1996) be clarified or made rigorous? It is well known that the generalized gaussian distribution arising from the use of equation 2.8 yields supergaussian distributions (positive kurtosis) for p < 2 and subgaussian (negative kurtosis) for p > 2. However, one can argue (see section 2.5 below) that the condition for obtaining sparse solutions in the low-noise limit is the stronger requirement that p ≤ 1, in which case the separable function dp(x) is CSC. This indicates that supergaussianness (positive kurtosis) alone is necessary but not sufficient for inducing sparse solutions. Rather, sufficiency is given by the requirement that −log Pp(x) ≈ dp(x) be CSC.

We have seen that the function dp(x) has an interpretation as a (negative logarithm of) a Bayesian prior or as a penalty function enforcing sparsity in equation 2.4, where dp(x) should serve as a “relaxed counting function” on the nonzero elements of x. Our perspective emphasizes that dp(x) serves both of these goals simultaneously. Thus, good regularizing functions, dp(x), should be flexibly parameterizable so that Pp(x) can be optimized over the parameter vector p to provide a good parametric fit to the underlying environmental pdf, and the functions should also have analytical properties consistent with the goal of enforcing sparse solutions. Such properties are discussed in the next section.

2.3 Majorization and Schur-Concavity

In this section, we discuss functions that are both concave and Schur-concave (CSC functions; Marshall & Olkin, 1979). We will call functions, dp(·), which are CSC, diversity functions, anticoncentration functions or antisparsity functions. The larger the value of the CSC function dp(x), the more diverse (i.e., the less concentrated or sparse) the elements of the vector x are. Thus, minimizing dp(x) with respect to x results in less diverse (more concentrated or sparse) vectors x.

2.3.1 Schur-Concave Functions

A measure of the sparsity of the elements of a solution vector x (or the lack thereof, which we refer to as the diversity of x) is given by a partial ordering on vectors known as the Lorentz order. For any vector in the positive orthant, x ∈ ℝ_+^n, define the decreasing rearrangement

x̃ = (x̃_1, …, x̃_n),   x̃_1 ≥ ⋯ ≥ x̃_n ≥ 0,

and the partial sums (Marshall & Olkin, 1979; Wickerhauser, 1994),

Sx[k] = Σ_{i=1}^k x̃_i,   k = 1, …, n.

We say that y majorizes x, y ≻ x, iff for k = 1, …, n,

Sy[k] ≥ Sx[k];   Sy[n] = Sx[n],

and the vector y is said to be more concentrated, or less diverse, than x. This partial order defined by majorization then defines the Lorentz order.

We are interested in scalar-valued functions of x that are consistent with majorization. Such functions are known as Schur-concave functions, d(·): ℝ_+^n → ℝ. They are defined to be precisely the class of functions consistent with the Lorentz order,

y ≻ x  ⟹  d(y) < d(x).

In words, if y is less diverse than x (according to the Lorentz order) then d(y) is less than d(x) for d(·) Schur-concave. We assume that Schur-concavity is a necessary condition for d(·) to be a good measure of diversity (antisparsity).
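A small numerical sketch of these definitions (our own illustration; the function names are arbitrary) checks majorization through the partial sums Sx[k]:

```python
import numpy as np

def partial_sums(x):
    """Partial sums S_x[k] of the decreasing rearrangement of a vector in the positive orthant."""
    return np.cumsum(np.sort(x)[::-1])

def majorizes(y, x):
    """True if y majorizes x, i.e., y is more concentrated (less diverse) than x."""
    Sy, Sx = partial_sums(y), partial_sums(x)
    return bool(np.all(Sy >= Sx) and np.isclose(Sy[-1], Sx[-1]))

# A concentrated vector majorizes a spread-out vector with the same total mass.
print(majorizes(np.array([3.0, 1.0, 0.0, 0.0]),
                np.array([1.0, 1.0, 1.0, 1.0])))   # True
```

A Schur-concave diversity measure must then assign the first (more concentrated) vector a smaller value than the second.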

2.3.2 Concavity Yields Sparse Solutions

Recall that a function d(·) is concave on the positive orthant ℝ_+^n iff (Rockafellar, 1970)

d((1 − γ)x + γy) ≥ (1 − γ)d(x) + γd(y),

x, yR+n, ∀γ, 0 ≤ γ ≤ 1. In addition, a scalar function is said to be permutation invariant if its value is independent of rearrangements of its components. An important fact is that for permutation invariant functions, concavity is a sufficient condition for Schur-concavity:

Concavity + permutation invariance  ⟹  Schur-concavity.

Now consider the low-noise sparse inverse problem, 2.7. It is well known that subject to linear constraints, a concave function on ℝ_+^n takes its minima on the boundary of ℝ_+^n (Rockafellar, 1970), and as a consequence these minima are therefore sparse. We take concavity to be a sufficient condition for a permutation invariant d(·) to be a measure of diversity and obtain sparsity as constrained minima of d(·). More generally, a diversity measure should be somewhere between Schur-concave and concave. In this spirit, one can define almost concave functions (Kreutz-Delgado & Rao, 1997), which are Schur-concave and (locally) concave in all n directions but one, which also are good measures of diversity.

2.3.3 Separability, Schur-Concavity, and ICA

The simplest way to ensure that d(x) be permutation invariant (a necessary condition for Schur-concavity) is to use functions that are separable. Recall that separability of dp(x) corresponds to factorizability of Pp(x). Thus, separability of d(x) corresponds to the assumption of independent components of x under the model 1.3. We see that from a Bayesian perspective, separability of d(x) corresponds to a generative model for y that assumes a source, x, with independent components. With this assumption, we are working within the framework of ICA (Nadal & Parga, 1994; Pham, 1996; Roberts, 1998). We have developed effective algorithms for solving the optimization problem 2.7 for sparse solutions when dp(x) is separable and concave (Kreutz-Delgado & Rao, 1997; Rao & Kreutz-Delgado, 1999).

It is now evident that relaxing the restriction of separability generalizes the generative model to the case where the source vector, x, has dependent components. We can reasonably call an approach based on a nonseparable diversity measure d(x) a dependent component analysis (DCA). Unless care is taken, this relaxation can significantly complicate the analysis and development of optimization algorithms. However, one can solve the low-noise DCA problem, at least in principle, provided appropriate choices of nonseparable diversity functions are made.

2.4 Supergaussian Priors and Sparse Coding

The p-class of diversity measures for 0 < p ≤ 1 results in sparse solutions to the low-noise coding problem, 2.7. These separable and concave (and thus Schur-concave) diversity measures correspond to supergaussian priors, consistent with the folk theorem that supergaussian priors are sparsity-enforcing priors. However, taking 1 < p < 2 results in supergaussian priors that are not sparsity enforcing. Taking p to be between 1 and 2 yields a dp(x) that is convex and therefore not concave. This is consistent with the well-known fact that for this range of p, the pth-root of dp(x) is a norm. Minimizing dp(x) in this case drives x toward the origin, favoring concentrated rather than sparse solutions. We see that if a sparse coding is to be found based on obtaining a MAP estimate to the low-noise generative model, 1.3, then, in a sense, supergaussianness is a necessary but not sufficient condition for a prior to be sparsity enforcing. A sufficient condition for obtaining a sparse MAP coding is that the negative log prior be CSC.
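The contrast between the two ranges of p can be seen on a toy constraint (our own example, not from the paper): among the solutions of x[0] + x[1] = 1, the concave measure (p = 0.5) prefers the sparse solution, while the convex one (p = 1.5) prefers the spread-out solution.

```python
import numpy as np

def d_p(x, p):
    """Generalized gaussian diversity measure d_p(x) = sum_k |x[k]|^p (equation 2.8)."""
    return float(np.sum(np.abs(x) ** p))

# Two feasible solutions of the toy constraint x[0] + x[1] = 1 (A = [1, 1], y = 1):
x_sparse = np.array([1.0, 0.0])
x_spread = np.array([0.5, 0.5])

for p in (0.5, 1.5):
    print(p, d_p(x_sparse, p), d_p(x_spread, p))
# p = 0.5: d_p(x_sparse) = 1.00 < d_p(x_spread) ~ 1.41, so the sparse solution is preferred.
# p = 1.5: d_p(x_sparse) = 1.00 > d_p(x_spread) ~ 0.71, so sparsity is not enforced.
```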

2.5 The FOCUSS Algorithm

Locally optimal solutions to the known-dictionary sparse inverse problems in gaussian noise, equations 2.6 and 2.7, are given by the FOCUSS algorithm. This is an affine-scaling transformation (AST)-like (interior point) algorithm originally proposed for the low-noise case 2.7 in Rao and Kreutz-Delgado (1997, 1999) and Kreutz-Delgado and Rao (1997); and extended by regularization to the nontrivial noise case, equation 2.6, in Rao and Kreutz-Delgado (1998a), Engan et al. (2000), and Rao et al. (2002). In these references, it is shown that the FOCUSS algorithm has excellent behavior for concave functions (which includes the CSC concentration functions) dp(·). For such functions, FOCUSS quickly converges to a local minimum, yielding a sparse solution to problems 2.7 and 2.6.

One can quickly motivate the development of the FOCUSS algorithm appropriate for solving the optimization problem 2.6 by considering the problem of obtaining the stationary points of the objective function. These are given as solutions, x*, to

A^T(Ax − y) + λ∇_x dp(x) = 0.   (2.9)

In general, equation 2.9 is nonlinear and cannot be explicitly solved for a solution x*. However, we proceed by assuming the existence of a gradient factorization,

∇_x dp(x) = α(x)Π(x)x,   (2.10)

where α(x) is a positive scalar function and Π(x) is symmetric, positive-definite, and diagonal. As discussed in Kreutz-Delgado and Rao (1997, 1998c) and Rao and Kreutz-Delgado (1999), this assumption is generally true for CSC sparsity functions dp(·) and is key to understanding FOCUSS as a sparsity-inducing interior-point (AST-like) optimization algorithm.5

With the gradient factorization 2.10, the stationary points of equation 2.9 are readily shown to be solutions to the (equally nonlinear and implicit) system,

x = (A^TA + β(x)Π(x))^{-1} A^Ty   (2.11)
  = Π^{-1}(x) A^T (β(x)I + AΠ^{-1}(x)A^T)^{-1} y,   (2.12)

where β(x) = λα(x) and the second equation follows from identity A.18. Although equation 2.12 is also not generally solvable in closed form, it does suggest the following relaxation algorithm,

x̂ ← Π^{-1}(x̂) A^T (β(x̂)I + AΠ^{-1}(x̂)A^T)^{-1} y,   (2.13)

which is to be iterated until convergence.

Taking β ≡ 0 in equation 2.13 yields the FOCUSS algorithm proved in Kreutz-Delgado and Rao (1997, 1998c) and Rao and Kreutz-Delgado (1999) to converge to a sparse solution of equation 2.7 for CSC sparsity functions dp(·). The case β ≠ 0 yields the regularized FOCUSS algorithm that will converge to a sparse solution of equation 2.6 (Rao, 1998; Engan et al., 2000; Rao et al., 2002). More computationally robust variants of equation 2.13 are discussed elsewhere (Gorodnitsky & Rao, 1997; Rao & Kreutz-Delgado, 1998a).
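The iteration 2.13 is compact enough to sketch in a few lines. The sketch below is our own illustration with hypothetical function and variable names: it specializes to the case dp(x) = ||x||_p^p, for which Π^{-1}(x) = diag(|x[i]|^{2−p}) (see equation 3.19 below), keeps β fixed rather than adapting it across iterations, and adds a small eps so that components are not frozen prematurely at exactly zero.

```python
import numpy as np

def focuss(A, y, p=1.0, beta=1e-6, n_iter=50, eps=1e-12):
    """Regularized FOCUSS relaxation (equation 2.13) for d_p(x) = ||x||_p^p.

    Uses the gradient factorization Pi^{-1}(x) = diag(|x[i]|^(2-p)); beta is
    held constant here (beta = 0 would give the low-noise iteration for 2.7).
    """
    m, n = A.shape
    x = A.T @ np.linalg.solve(A @ A.T, y)          # minimum 2-norm initialization
    for _ in range(n_iter):
        pi_inv = np.abs(x) ** (2.0 - p) + eps      # diagonal of Pi^{-1}(x)
        APi = A * pi_inv                           # A Pi^{-1}(x): scale the columns of A
        x = pi_inv * (A.T @ np.linalg.solve(beta * np.eye(m) + APi @ A.T, y))
    return x

# Toy usage: a 3-sparse source in an underdetermined 20 x 30 system.
rng = np.random.default_rng(0)
A = rng.standard_normal((20, 30))
x_true = np.zeros(30)
x_true[[2, 11, 25]] = [1.0, -0.8, 0.5]
print(np.round(focuss(A, A @ x_true), 2))
```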

Note that for the general regularized FOCUSS algorithm, 2.13, we have β(x̂) = λα(x̂), where λ is the regularization parameter in equation 2.4. The function β(x) is usually generalized to be a function of x̂, y, and the iteration number. Methods for choosing λ include the quality-of-fit criteria, the sparsity criteria, and the L-curve (Engan, 2000; Engan et al., 2000; Rao et al., 2002). The quality-of-fit criterion attempts to minimize the residual error y − Ax̂ (Rao, 1997), which can be shown to converge to a sparse solution (Rao & Kreutz-Delgado, 1999). The sparsity criterion requires that a certain number of elements of each xk be nonzero.

The L-curve method adjusts λ to optimize the trade-off between the residual and sparsity of x. The plot of dp(x) versus dq(y − Ax̂) has an L shape, the corner of which provides the best trade-off. The corner of the L-curve is the point of maximum curvature and can be found by a one-dimensional maximization of the curvature function (Hansen & O’Leary, 1993).

A hybrid approach known as the modified L-curve method combines the L-curve method on a linear scale and the quality-of-fit criterion, which is used to place limits on the range of λ that can be chosen by the L-curve (Engan, 2000). The modified L-curve method was shown to have good performance, but it requires a one-dimensional numerical optimization step for each xk at each iteration, which can be computationally expensive for large vectors.
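A rough sketch of the corner search is given below. It is our own finite-difference illustration, not the procedure of Hansen and O’Leary (1993) or the modified L-curve of Engan (2000), which additionally bounds the admissible λ range with the quality-of-fit criterion; the function and argument names are ours.

```python
import numpy as np

def lcurve_corner(lambdas, residuals, diversities):
    """Pick lambda at the point of maximum curvature of the L-curve.

    residuals[i]   ~ d_q(y - A x_hat(lambdas[i]))
    diversities[i] ~ d_p(x_hat(lambdas[i]))
    Curvature of the parametric curve (rho(lambda), eta(lambda)) is estimated
    with finite differences on a linear scale.
    """
    lam = np.asarray(lambdas, dtype=float)
    rho = np.asarray(residuals, dtype=float)
    eta = np.asarray(diversities, dtype=float)
    drho, deta = np.gradient(rho, lam), np.gradient(eta, lam)
    d2rho, d2eta = np.gradient(drho, lam), np.gradient(deta, lam)
    curvature = np.abs(drho * d2eta - deta * d2rho) / (drho**2 + deta**2) ** 1.5
    return lam[np.argmax(curvature)]
```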

3 Dictionary Learning

3.1 Unknown, Nonrandom Dictionaries

The MLE framework treats parameters to be estimated as unknown but deterministic (nonrandom). In this spirit, we take the dictionary, A, to be the set of unknown but deterministic parameters to be estimated from the observation set Y = YN. In particular, given YN, the maximum likelihood estimate ÂML is found from maximizing the likelihood function L(A | YN) = P(YN; A). Under the assumption that the observations are i.i.d., this corresponds to the optimization

ÂML = arg max_A ∏_{k=1}^N P(yk; A),   (3.1)
P(yk; A) = ∫ P(yk, x; A) dx = ∫ P(yk | x; A) · Pp(x) dx = ∫ Pq(yk − Ax) · Pp(x) dx.   (3.2)

Defining the sample average of a function f (y) over the sample set YN = (y1, …, yN) by

⟨f(y)⟩N = (1/N) Σ_{k=1}^N f(yk),

the optimization 3.1 can be equivalently written as

ÂML = arg min_A ⟨−log P(y; A)⟩N.   (3.3)

Note that P(yk; A) is equal to the normalization factor β already encountered, but now with the dependence of β on A and the particular sample, yk, made explicit. The integration in equation 3.2 in general is intractable, and various approximations have been proposed to obtain an approximate maximum likelihood estimate, ÂAML (Olshausen & Field, 1996; Lewicki & Sejnowski, 2000).

In particular, the following approximation has been proposed (Olshausen & Field, 1996),

Pp(x) ≈ δ(x − x̂k(Â)),   (3.4)

where

x̂k(Â) = arg max_x P(yk, x; Â),   (3.5)

for k = 1, …, N, assuming a current estimate, Â, for A. This approximation corresponds to assuming that the source vector xk for which yk = Axk is known and equal to x̂k(Â). With this approximation, the optimization 3.3 becomes

ÂAML = arg min_A ⟨dq(y − Ax̂)⟩N,   (3.6)

which is an optimization over the sample average 〈 · 〉N of the functional 2.4 encountered earlier. Updating our estimate for the dictionary,

Â ← ÂAML,   (3.7)

we can iterate the procedure (3.5)–(3.6) until ÂAML has converged, hopefully (at least in the limit of large N) to ÂML = ÂML(YN) as the maximum likelihood estimate ÂML(YN) has well-known desirable asymptotic properties in the limit N → ∞.

Performing the optimization in equation 3.6 for the q = 2 i.i.d. gaussian measurement noise case (ν gaussian with known covariance σ2 · I) corresponds to taking

dq(y − Âx̂) = (1/2σ²)||y − Âx̂||²,   (3.8)

in equation 3.6. In appendix A, it is shown that we can readily obtain the unique batch solution,

ÂAML = Σyx̂ Σx̂x̂^{-1},   (3.9)
Σyx̂ = (1/N) Σ_{k=1}^N yk x̂k^T,   Σx̂x̂ = (1/N) Σ_{k=1}^N x̂k x̂k^T.   (3.10)

Appendix A derives the maximum likelihood estimate of A for the ideal case of known source vectors X = (x1, …, xN),

Known source vector case:   AML = Σyx Σxx^{-1},

which is, of course, actually not computable since the actual source vectors are assumed to be hidden.

As an alternative to using the explicit solution 3.9, which requires an often prohibitive n × n inversion, we can obtain AAML iteratively by gradient descent on equations 3.6 and 3.8,

ÂAML ← ÂAML − μ (1/N) Σ_{k=1}^N ek x̂k^T,   ek = ÂAML x̂k − yk,   k = 1, …, N,   (3.11)

for an appropriate choice of the (possibly adaptive) positive step-size parameter μ. The iteration 3.11 can be initialized as ÂAML = Â.

A general iterative dictionary learning procedure is obtained by nesting the iteration 3.11 entirely within the iteration defined by repeatedly solving equation 3.5 every time a new estimate, ÂAML, of the dictionary becomes available. However, performing the optimization required in equation 3.5 is generally nontrivial (Olshausen & Field, 1996; Lewicki & Sejnowski, 2000). Recently, we have shown how the use of the FOCUSS algorithm results in an effective algorithm for performing the optimization required in equation 3.5 for the case when ν is gaussian (Rao & Kreutz-Delgado, 1998a; Engan et al., 1999). This approach solves equation 3.5 using the affine-scaling transformation (AST)-like algorithms recently proposed for the low-noise case (Rao & Kreutz-Delgado, 1997, 1999; Kreutz-Delgado & Rao, 1997) and extended by regularization to the nontrivial noise case (Rao & Kreutz-Delgado, 1998a; Engan et al., 1999). As discussed in section 2.5, for the current dictionary estimate, Â, a solution to the optimization problem, 3.5, is provided by the repeated iteration,

x̂k ← Π^{-1}(x̂k) Â^T (β(x̂k)I + ÂΠ^{-1}(x̂k)Â^T)^{-1} yk,   (3.12)

k = 1, …, N, with Π(x) defined as in equation 3.18, given below. This is the regularized FOCUSS algorithm (Rao, 1998; Engan et al., 1999), which has an interpretation as an AST-like concave function minimization algorithm. The proposed dictionary learning algorithm alternates between iteration 3.12 and iteration 3.11 (or the direct batch solution given by equation 3.9 if the inversion is tractable). Extensive simulations show the ability of the AST-based algorithm to recover an unknown 20 × 30 dictionary matrix A completely (Engan et al., 1999).
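A compact sketch of this alternation is given below. It is our own illustration with hypothetical names, not the implementation of Engan et al. (1999): each pass applies the regularized FOCUSS update 3.12 to every source vector and then re-solves the batch dictionary equation 3.9 (a pseudoinverse replaces the inverse for robustness, β is held constant, and the dictionary is initialized from the first n training vectors as in section 5.1, so N ≥ n is assumed).

```python
import numpy as np

def focuss_update(A, Y, X, p=1.0, beta=1e-4, eps=1e-12):
    """One regularized FOCUSS pass (equation 3.12) over all columns of X."""
    m = A.shape[0]
    X_new = np.empty_like(X)
    for k in range(Y.shape[1]):
        pi_inv = np.abs(X[:, k]) ** (2.0 - p) + eps
        APi = A * pi_inv
        X_new[:, k] = pi_inv * (A.T @ np.linalg.solve(beta * np.eye(m) + APi @ A.T, Y[:, k]))
    return X_new

def aml_dictionary_learning(Y, n, n_iter=100, p=1.0, beta=1e-4):
    """Alternate source updates (eq. 3.12) with the batch AML dictionary solution (eq. 3.9)."""
    m, N = Y.shape
    A = Y[:, :n].copy()                       # initialize the dictionary from training vectors
    X = np.linalg.pinv(A) @ Y                 # pseudoinverse initialization of the sources
    for _ in range(n_iter):
        X = focuss_update(A, Y, X, p=p, beta=beta)
        Sigma_yx = (Y @ X.T) / N              # eq. 3.10
        Sigma_xx = (X @ X.T) / N
        A = Sigma_yx @ np.linalg.pinv(Sigma_xx)   # eq. 3.9
    return A, X
```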

3.2 Unknown, Random Dictionaries

We now generalize to the case where the dictionary A and the source vector set X = XN = (x1, …, xN) are jointly random and unknown. We add the requirement that the dictionary is known to obey the constraint

A ∈ 𝒜 = compact submanifold of ℝ^{m×n}.

A compact submanifold of ℝm×n is necessarily closed and bounded. On the constraint submanifold, the dictionary A has the prior probability density function P(A), which in the sequel we assume has the simple (uniform on 𝒜) form,

P(A) = c · χ(A ∈ 𝒜),   (3.13)

where χ(·) is the indicator function and c is a positive constant chosen to ensure that

P(𝒜) = ∫_𝒜 P(A) dA = 1.

The dictionary A and the elements of the set X are also all assumed to be mutually independent,

P(A, X) = P(A)P(X) = P(A)Pp(x1) ⋯ Pp(xN).

With the set of i.i.d. noise vectors, (ν1, …, νN) also taken to be jointly random with, and independent of, A and X, the observation set Y = YN = (y1, …, yN) is assumed to be generated by the model 1.3. With these assumptions, we have

P(A, X | Y) = P(Y | A, X)P(A, X)/P(Y)
= c·χ(A ∈ 𝒜) P(Y | A, X)P(X)/P(Y)
= (c·χ(A ∈ 𝒜)/P(Y)) ∏_{k=1}^N P(yk | A, xk) Pp(xk)
= (c·χ(A ∈ 𝒜)/P(Y)) ∏_{k=1}^N Pq(yk − Axk) Pp(xk),   (3.14)

using the facts that the observations are conditionally independent and P(yk | A, X) = P(yk | A, xk).

The jointly MAP estimates

(ÂMAP, X̂MAP) = (ÂMAP, x̂1,MAP, …, x̂N,MAP)

are found by maximizing the a posteriori probability density P(A, X | Y) simultaneously with respect to A ∈ 𝒜 and X. This is equivalent to minimizing the negative logarithm of P(A, X | Y), yielding the optimization problem,

(ÂMAP, X̂MAP) = arg min_{A∈𝒜, X} ⟨dq(y − Ax) + λdp(x)⟩N.   (3.15)

Note that this is a joint minimization of the sample average of the functional 2.4, and as such is a natural generalization of the single (with respect to the set of source vectors) optimization previously encountered in equation 3.6. By finding joint MAP estimates of A and X, we obtain a problem that is much more tractable than the one of finding the single MAP estimate of A (which involves maximizing the marginal posterior density P(A | Y)).

The requirement that A ∈ 𝒜, where 𝒜 is a compact and hence bounded subset of ℝm×n, is sufficient for the optimization problem 3.15 to avoid the degenerate solution,6

for k = 1, …, N,   yk = Axk,   with ||A|| → ∞ and ||xk|| → 0.   (3.16)

This solution is possible for unbounded A because y = Ax is almost always solvable for x since learned overcomplete A’s are (generically) onto, and for any solution pair (A, x) the pair ((1/α)A, αx) is also a solution. This fact shows that the inverse problem of finding a solution pair (A, x) is generally ill posed unless A is constrained to be bounded (as we have explicitly done here) or the cost functional is chosen to ensure that bounded A’s are learned (e.g., by adding a term monotonic in the matrix norm ||A|| to the cost function in equation 3.15).

A variety of choices for the compact set 𝒜 are available. Obviously, since different choices of 𝒜 correspond to different a priori assumptions on the set of admissible matrices, A, the choice of this set can be expected to affect the performance of the resulting dictionary learning algorithm. We will consider two relatively simple forms for 𝒜.

3.3 Unit Frobenius–Norm Dictionary Prior

For the i.i.d. q = 2 gaussian measurement noise case of equation 3.8, algorithms that provably converge (in the low step-size limit) to a local minimum of equation 3.15 can be readily developed for the very simple choice,

𝒜F = {A : ||A||F = 1} ⊂ ℝm×n,   (3.17)

where ||A||F denotes the Frobenius norm of the matrix A,

||A||F² = tr(A^TA),

and it is assumed that the prior P(A) is uniformly distributed on 𝒜F as per condition 3.13. As discussed in appendix A, 𝒜F is simply connected, and there exists a path in 𝒜F between any two matrices in 𝒜F.

Following the gradient factorization procedure (Kreutz-Delgado & Rao, 1997; Rao & Kreutz-Delgado, 1999), we factor the gradient of d(x) as

∇d(x) = α(x)Π(x)x,   α(x) > 0,   (3.18)

where it is assumed that Π(x) is diagonal and positive definite for all nonzero x. For example, in the case where d(x) = ||x||_p^p,

Π^{-1}(x) = diag(|x[i]|^{2−p}).   (3.19)

Factorizations for other diversity measures d(x) are given in Kreutz-Delgado and Rao (1997). We also define β(x) = λα(x). As derived and proved in appendix A, a learning law that provably converges to a minimum of equation 3.15 on the manifold 3.17 is then given by

(d/dt)x̂k = −Ωk{(Â^TÂ + β(x̂k)Π(x̂k))x̂k − Â^Tyk},
(d/dt)Â = −μ(δÂ − tr(Â^TδÂ)Â),   μ > 0,   (3.20)

for k = 1, …, N, where Â is initialized to ||Â||F = 1, the Ωk are n × n positive definite matrices, and the “error” δÂ is

δÂ = ⟨e(x̂)x̂^T⟩N = (1/N) Σ_{k=1}^N e(x̂k)x̂k^T,   e(x̂k) = Âx̂k − yk,   (3.21)

which can be rewritten in the perhaps more illuminating form (cf. equations 3.9 and 3.10),

δÂ = ÂΣx̂x̂ − Σyx̂.   (3.22)

A formal convergence proof of equation 3.20 is given in appendix A, where it is also shown that the right-hand side of the second equation in 3.20 corresponds to projecting the error term δÂ onto the tangent space of 𝒜F, thereby ensuring that the derivative of Â lies in the tangent space. Convergence of the algorithm to a local optimum of equation 3.15 is formally proved by interpreting the loss functional as a Lyapunov function whose time derivative along the trajectories of the adapted parameters (Â, X̂) is guaranteed to be negative definite by the choice of parameter time derivatives shown in equation 3.20. As a consequence of the La Salle invariance principle, the loss functional will decrease in value, and the parameters will converge to the largest invariant set for which the time derivative of the loss functional is identically zero (Khalil, 1996).

Equation 3.20 is a set of coupled (between Â and the vectors x̂k) nonlinear differential equations that correspond to simultaneous, parallel updating of the estimates Â and x̂k. This should be compared to the alternated separate (nonparallel) update rules 3.11 and 3.12 used in the AML algorithm described in section 3.1. Note also that (except for the trace term) the right-hand side of the dictionary learning update in equation 3.20 is of the same form as for the AML update law given in equation 3.11 (see also the discretized version of equation 3.20 given in equation 3.28 below). The key difference is the additional trace term in equation 3.20. This difference corresponds to a projection of the update onto the tangent space of the manifold 3.17, thereby ensuring a unit Frobenius norm (and hence boundedness) of the dictionary estimate at all times and avoiding the ill-posedness problem indicated in equation 3.16. It is also of interest to note that choosing Ωk to be the positive-definite matrix,

Ωk = ηk(Â^TÂ + β(x̂k)Π(x̂k))^{-1},   ηk > 0,   (3.23)

in equation 3.20, followed by some matrix manipulations (see equation A.18 in appendix A), yields the alternative algorithm,

(d/dt)x̂k = −ηk{x̂k − Π^{-1}(x̂k)Â^T(β(x̂k)I + ÂΠ^{-1}(x̂k)Â^T)^{-1}yk}   (3.24)

with ηk > 0. In any event (regardless of the specific choice of the positive definite matrices Ωk as shown in appendix A), the proposed algorithm outlined here converges to a solution (x̂k, Â), which satisfies the implicit and nonlinear relationships,

x̂k = Π^{-1}(x̂k)Â^T(β(x̂k)I + ÂΠ^{-1}(x̂k)Â^T)^{-1}yk,   Â = Σyx̂(Σx̂x̂ − cI)^{-1} ∈ 𝒜F,   (3.25)

for scalar c = tr(ÂTδÂ).

To implement the algorithm 3.20 (or the variant using equation 3.24) in discrete time, a first-order forward difference approximation at time t = tl can be used,

(d/dt)x̂k(tl) ≈ (x̂k(tl+1) − x̂k(tl))/(tl+1 − tl) = (x̂k[l + 1] − x̂k[l])/Δl.   (3.26)

Applied to equation 3.24, this yields

x̂k[l + 1] = (1 − μl)x̂k[l] + μl x̂k^FOCUSS[l],
x̂k^FOCUSS[l] = Π^{-1}(x̂k)Â^T(β(x̂k)I + ÂΠ^{-1}(x̂k)Â^T)^{-1}yk,
μl = ηkΔl ≥ 0.   (3.27)

Similarly, discretizing the Â-update equation yields the A-learning rule, equation 3.28, given below. More generally, taking μl to have a value between zero and one, 0 ≤ μl ≤ 1, yields an updated value x̂k[l + 1], which is a linear interpolation between the previous value x̂k[l] and x̂k^FOCUSS[l].

When implemented in discrete time, and setting μl = 1, the resulting Bayesian learning algorithm has the form of a combined iteration where we loop over the operations,

x̂k ← Π^{-1}(x̂k)Â^T(β(x̂k)I + ÂΠ^{-1}(x̂k)Â^T)^{-1}yk,   k = 1, …, N,
and
Â ← Â − γ(δÂ − tr(Â^TδÂ)Â),   γ > 0.   (3.28)

We call this FOCUSS-based, Frobenius-normalized dictionary-learning algorithm the FOCUSS-FDL algorithm. Again, this merged procedure should be compared to the separate iterations involved in the maximum likelihood approach given in equations 3.11 and 3.12. Equation 3.28, with δÂ given by equation 3.21, corresponds to performing a finite step-size gradient descent on the manifold 𝒜F. This projection in equation 3.28 of the dictionary update onto the tangent plane of 𝒜F (see the discussion in appendix A) ensures the well-behavedness of the MAP algorithm.7 The specific step-size choice μl = 1, which results in the first equation in equation 3.28, is discussed at length for the low-noise case in Rao and Kreutz-Delgado (1999).
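For concreteness, the dictionary half of iteration 3.28 can be sketched as follows (our own illustration with hypothetical names; the explicit Frobenius renormalization after the step follows section 4.1):

```python
import numpy as np

def fdl_dictionary_step(A, Y, X, gamma=1.0):
    """One FOCUSS-FDL dictionary update (equations 3.21, 3.22, and 3.28).

    A : current dictionary (m x n), Y : data (m x N), X : current sparse codes (n x N).
    """
    N = Y.shape[1]
    Sigma_yx = (Y @ X.T) / N
    Sigma_xx = (X @ X.T) / N
    dA = A @ Sigma_xx - Sigma_yx                    # error delta A_hat, eq. 3.22
    A = A - gamma * (dA - np.trace(A.T @ dA) * A)   # tangent-space projected step, eq. 3.28
    return A / np.linalg.norm(A, 'fro')             # renormalize to ||A||_F = 1 (section 4.1)
```

The trace term is what keeps the step in the tangent space of the unit Frobenius sphere; dropping it recovers the unconstrained AML-style update 3.11.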

3.4 Column-Normalized Dictionary Prior

Although mathematically very tractable, the unit-Frobenius norm prior, equation 3.17, appears to be somewhat too loose, judging from simulation results given below. In simulations with the Frobenius norm constraint 𝒜F, some columns of A can tend toward zero, a phenomenon that occurs more often in highly overcomplete A. This problem can be understood by remembering that we are using the dp(x), p ≠ 0 diversity measure, which penalizes columns associated with terms in x with large magnitudes. If a column ai has a small relative magnitude, the weight of its coefficient x[i] can be large, and it will be penalized more than a column with a larger norm. This leads to certain columns being underused, which is especially problematic in the overcomplete case.

An alternative, and more restrictive, form of the constraint set 𝒜 is obtained by enforcing the requirement that the columns ai of A each be normalized (with respect to the Euclidean 2-norm) to the same constant value (Murray & Kreutz-Delgado, 2001). This constraint can be justified by noting that Ax can be written as the nonunique weighted sum of the columns ai,

Ax = Σ_{i=1}^n ai x[i] = Σ_{i=1}^n (ai/αi)(αi x[i]) = A′x′,   for any αi > 0,  i = 1, …, n,

showing that there is a column-wise ambiguity that remains even after the overall unit-Frobenius norm normalization has occurred, as one can now Frobenius-normalize the new matrix A′.

Therefore, consider the set of matrices on which has been imposed the column-wise constraint that

𝒜C = {A : ||ai||² = ai^Tai = 1/n,  i = 1, …, n}.   (3.29)

The set 𝒜C is an mn − n = n(m − 1)–dimensional submanifold of ℝm×n. Note that every column of a matrix in 𝒜C has been normalized so that ||ai||² = 1/n. In fact, any constant value for the column normalization can be used (including the unit normalization), but, as shown in appendix B, the particular normalization ||ai||² = 1/n results in 𝒜C being a proper submanifold of the mn − 1 dimensional unit Frobenius manifold 𝒜F,

𝒜C ⊂ 𝒜F,

indicating that a tighter constraint on the matrix A is being imposed. Again, it is assumed that the prior P(A) is uniformly distributed on 𝒜C in the manner of equation 3.13. As shown in appendix B, 𝒜C is simply connected.

A learning algorithm is derived for the constraint 𝒜C in appendix B, following much the same approach as in appendix A. Because the derivation of the x̂k update to find sparse solutions does not depend on the form of the constraint 𝒜, only the Â update in algorithm 3.28 needs to be modified. Each column ai is now updated independently (see equation B.17),

ai ← ai − γ(I − âiâi^T)δai,   i = 1, …, n,   (3.30)

where δai is the ith column of δÂ in equation 3.21. We call the resulting column-normalized dictionary-learning algorithm the FOCUSS-CNDL algorithm. The implementation details of the FOCUSS-CNDL algorithm are presented in section 4.2.

4 Algorithm Implementation

The dictionary learning algorithms derived above are an extension of the FOCUSS algorithm used for obtaining sparse solutions to the linear inverse problem y = Ax to the case where dictionary learning is now required. We refer to these algorithms generally as FOCUSS-DL algorithms, with the unit Frobenius-norm prior-based algorithm denoted by FOCUSS-FDL and the column-normalized prior-based algorithm by FOCUSS-CNDL. In this section, the algorithms are stated in the forms implemented in the experimental tests, where it is shown that the column normalization-based algorithm achieves higher performance in the overcomplete dictionary case.

4.1 Unit Frobenius-Norm Dictionary Learning Algorithm

We now summarize the FOCUSS-FDL algorithm derived in section 3.3. For each of the data vectors yk, we update the sparse source vectors xk using the FOCUSS algorithm:

Π^{-1}(x̂k) = diag(|x̂k[i]|^{2−p}),
x̂k ← Π^{-1}(x̂k)Â^T(λkI + ÂΠ^{-1}(x̂k)Â^T)^{-1}yk   (FOCUSS),   (4.1)

where λk = β(x̂k) is the regularization parameter. After updating the N source vectors xk, k = 1, …, N, the dictionary Â is reestimated,

Σyx̂ = (1/N) Σ_{k=1}^N yk x̂k^T,
Σx̂x̂ = (1/N) Σ_{k=1}^N x̂k x̂k^T,
δÂ = ÂΣx̂x̂ − Σyx̂,
Â ← Â − γ(δÂ − tr(Â^TδÂ)Â),   γ > 0,   (4.2)

where γ controls the learning rate. For the experiments in section 5, the data block size is N = 100. During each iteration all training vectors are updated using equation 4.1, with a corresponding number of dictionary updates using equation 4.2. After each update of the dictionary Â, it is renormalized to have unit Frobenius norm, ||Â||F = 1.

The learning algorithm is a combined iteration, meaning that the FOCUSS algorithm is allowed to run for only one iteration (not until full convergence) before the A update step. This means that during early iterations, the x̂k are in general not sparse. To facilitate learning A, the covariances Σyx̂ and Σx̂x̂ are calculated with sparsified x̂k in which all but the largest elements are set to zero. The number of elements retained is usually set to the largest desired number of nonzero elements, but this choice does not appear to be critical.

The regularization parameter λk is taken to be a monotonically increasing function of the iteration number,

λk = λmax(tanh(10^{-3} · (iter − 1500)) + 1).   (4.3)

While this choice of λk does not have the optimality properties of the modified L-curve method (see section 2.5), it does not require a one-dimensional optimization for each k and so is much less computationally expensive. This is further discussed below.

4.2 Column Normalized Dictionary Learning Algorithm

The improved version of the algorithm called FOCUSS-CNDL, which provides increased accuracy especially in the overcomplete case, was proposed in Murray and Kreutz-Delgado (2001). The three key improvements are column normalization that restricts the learned Â, an efficient way of adjusting the regularization parameter λk, and reinitialization to escape from local optima.

The column-normalized learning algorithm discussed in section 3.4 and derived in appendix B is used. Because the x̂k update does not depend on the constraint set 𝒜, the FOCUSS algorithm in equation 4.1 is used to update the N vectors, as discussed in section 4.1. After every N source vectors are updated, each column of the dictionary is then updated as

ai ← ai − γ(I − âiâi^T)δai,   i = 1, …, n,   (4.4)

where δai are the columns of δÂ, which is found using equation 4.2. After updating each ai, it is renormalized to ||ai||² = 1/n by

ai ← ai/(√n ||ai||),   (4.5)

which also ensures that ||Â||F = 1 as shown in section B.1.
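A short sketch of this column-wise update (our own illustration; variable names are hypothetical, and âi is taken here to be the unit-normalized direction of column i) is:

```python
import numpy as np

def cndl_column_step(A, dA, gamma=1.0):
    """FOCUSS-CNDL column updates (equations 4.4 and 4.5).

    A  : current dictionary estimate (m x n)
    dA : the error matrix delta A_hat from equation 4.2
    """
    m, n = A.shape
    A_new = np.empty_like(A)
    for i in range(n):
        a = A[:, i]
        u = a / np.linalg.norm(a)                           # unit direction taken as a_hat_i (assumption)
        a = a - gamma * (dA[:, i] - u * (u @ dA[:, i]))     # eq. 4.4: remove the component along a_hat_i
        A_new[:, i] = a / (np.sqrt(n) * np.linalg.norm(a))  # eq. 4.5: renormalize so ||a_i||^2 = 1/n
    return A_new
```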

The regularization parameter λk may be set independently for each vector in the training set, and a number of methods have been suggested, including quality of fit (which requires a certain level of reconstruction accuracy), sparsity (requiring a certain number of nonzero elements), and the L-curve which attempts to find an optimal trade-off (Engan, 2000). The L-curve method works well, but it requires solving a one-dimensional optimization for each λk, which becomes computationally expensive for large problems. Alternatively, we use a heuristic method that allows the trade-off between error and sparsity to be tuned for each application, while letting each training vector yk have its own regularization parameter λk to improve the quality of the solution,

λk = λmax(1 − ||yk − Âx̂k|| / ||yk||),   λk, λmax > 0.   (4.6)

For data vectors that are represented accurately, λk will be large, driving the algorithm to find sparser solutions. If the SNR can be estimated, we can set λmax = (SNR)^{-1}.

Because the diversity term in the optimization problem 3.15 is concave when p ≤ 1, the cost function has multiple local minima. The FOCUSS algorithm is guaranteed to converge only to one of these local minima, but in some cases, it is possible to determine when that has happened by noticing that the sparsity of the solution is too low. Periodically (after a large number of iterations), the sparsity of the solutions x̂k is checked, and if found too low, x̂k is reinitialized randomly. The algorithm is also sensitive to initial conditions, and prior information may be incorporated into the initialization to help convergence to the global solution.

5 Experimental Results

Experiments were performed using complete dictionaries (n = m) and overcomplete dictionaries (n > m) on both synthetically generated data and natural images. Performance was measured in a number of ways. With synthetic data, performance measures include the SNR of the recovered sources x̂k compared to the true generating sources and a comparison of the learned dictionary with the true dictionary. For images of natural scenes, the true underlying sources are not known, so the accuracy and efficiency of the image coding are found.

5.1 Complete Dictionaries: Comparison with ICA

To test the FOCUSS-FDL and FOCUSS-CNDL algorithms, simulated data were created following the method of Engan et al. (1999) and Engan (2000). The dictionary A of size 20 × 20 was created by drawing each element aij from a normal distribution with μ = 0, σ² = 1 (written as 𝒩(0, 1)) followed by a normalization to ensure that ||A||F = 1. Sparse source vectors xk, k = 1, …, 1000 were created with r = 4 nonzero elements, where the r nonzero locations are selected at random (uniformly) from the 20 possible locations. The magnitudes of each nonzero element were also drawn from 𝒩(0, 1) and limited so that |xk[i]| > 0.1. The input data yk were generated using y = Ax (no noise was added).

For the first iteration of the algorithm, the columns of the initialization estimate, Âinit, were taken to be the first n = 20 training vectors yk. The initial xk estimates were then set to the pseudoinverse solution x̂k = Âinit^T(ÂinitÂinit^T)^{-1}yk. The constant parameters of the algorithm were set as follows: p = 1.0, γ = 1.0, and λmax = 2 × 10^{-3} (low noise, assumed SNR ≈ 27 dB). The algorithms were run for 200 iterations through the entire data set, and during each iteration, Â was updated after updating 100 data vectors x̂k.

To measure performance, the SNR between the recovered sources x̂k and the true sources xk was calculated. Each element xk[i] for fixed i was considered as a time-series vector with 1000 elements, and SNRi for each was found using

SNRi = 10 log10(||xk[i]||² / ||xk[i] − x̂k[i]||²).   (5.1)

The final SNR is found by averaging SNRi over the i = 1, …, 20 vectors and 20 trials of the algorithms. Because the dictionary A is learned only to within a scaling factor and column permutations, the learned sources must be matched with corresponding true sources and scaled to unit norm before the SNR calculation is done.
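A minimal sketch of this measurement (our own code; it assumes the learned sources have already been permuted and scaled to match the true ones, as described above) is:

```python
import numpy as np

def average_source_snr(X_true, X_hat):
    """Average per-source SNR (equation 5.1).

    X_true, X_hat : n x N arrays; row i is source i across the N training vectors.
    """
    num = np.sum(X_true ** 2, axis=1)
    den = np.sum((X_true - X_hat) ** 2, axis=1)
    return float(np.mean(10.0 * np.log10(num / den)))
```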

The FOCUSS-FDL and FOCUSS-CNDL algorithms were compared with extended ICA (Lee, Girolami, & Sejnowski, 1999) and FastICA8 (Hyvärinen et al., 1999). Figure 1 shows the SNR for the tested algorithms. The SNR for FOCUSS-FDL is 27.7 dB, which is a 4.7 dB improvement over extended ICA, and for FOCUSS-CNDL, the SNR is 28.3 dB. The average run time for FOCUSS-FDL/CNDL was 4.3 minutes, for FastICA 0.10 minutes, and for extended ICA 0.19 minutes on a 1.0 GHz Pentium III Xeon computer.

Figure 1. Comparison between FOCUSS-FDL, FOCUSS-CNDL, extended ICA, and FastICA on synthetically generated data with a complete dictionary A, size 20 × 20. The SNR was computed between the recovered sources x̂k and the true sources xk. The mean SNR and standard deviation were computed over 20 trials.

5.2 Overcomplete Dictionaries

To test the ability to recover the true A and xk solutions in the overcomplete case, dictionaries of size 20 × 30 and 64 × 128 were generated. Diversity r was set to fixed values (4 and 7) and uniformly randomly (5–10 and 10–15). The elements of A and the sources xk were created as in section 5.1.

The parameters were set as follows: p = 1.0, γ = 1.0, λmax = 2 × 10^{-3}. The algorithms were run for 500 iterations through the entire data set, and during each iteration, Â was updated after updating 100 data vectors x̂k.

As a measure of performance, we find the number of columns of A that were matched during learning. Because A can be learned only to within column permutations and sign and scale changes, the columns are normalized so that ||âi|| = ||aj|| = 1, and Â is rearranged columnwise so that âj is given the index of the closest match in A (in the minimum two-norm sense). A match is counted if

$$1 - a_i^T\hat{a}_i < 0.01. \qquad (5.2)$$

Similarly, the number of matching x̂k is counted (after rearranging the elements in accordance with the indices of the rearranged Â),

$$1 - x_i^T\hat{x}_i < 0.05. \qquad (5.3)$$

If the data are generated by an A that is not column normalized, other measures of performance need to be used to compare xk and x̂k.
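A simplified sketch of this matching count is given below. It greedily assigns each true column to its most correlated learned column and uses an absolute-valued inner product to absorb sign flips; this is an illustrative shortcut rather than the exact bookkeeping used to produce Table 1.

```python
import numpy as np

def count_matched_columns(A_true, A_hat, tol=0.01):
    """Count columns satisfying 1 - |a_i^T a_hat_i| < tol after normalization and matching."""
    A_true = A_true / np.linalg.norm(A_true, axis=0, keepdims=True)
    A_hat = A_hat / np.linalg.norm(A_hat, axis=0, keepdims=True)
    matched, used = 0, set()
    for i in range(A_true.shape[1]):
        corr = np.abs(A_true[:, i] @ A_hat)       # |a_i^T a_hat_j| for all candidate j
        corr[list(used)] = -np.inf                # each learned column used at most once
        j = int(np.argmax(corr))
        used.add(j)
        if 1.0 - corr[j] < tol:
            matched += 1
    return matched
```

The source-vector count is analogous, using the tolerance 0.05 of equation 5.3 after reindexing the elements of each x̂k.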

The performance is summarized in Table 1, which compares the FOCUSS-FDL with the column-normalized algorithm (FOCUSS-CNDL). For the 20 × 30 dictionary, 1000 training vectors were used, and for the 64×128 dictionary 10,000 were used. Results are averaged over four or more trials. For the 64 × 128 matrix and r = 10–15, FOCUSS-CNDL is able to recover 99.5% (127.4/128) of the columns of A and 94.6% (9463/10,000) of the solutions xk to within the tolerance given above. This shows a clear improvement over FOCUSS-FDL, which learns only 80.3% of the A columns and 40.1% of the solutions xk.

Table 1.

Synthetic Data Results.

Algorithm      Size of A   Diversity, r   Learned A Columns            Learned x
                                          Average   SD     %           Average   SD      %
FOCUSS-FDL     20 × 30     7              25.3      3.4    84.2        675.9     141.0   67.6
FOCUSS-CNDL    20 × 30     7              28.9      1.6    96.2        846.8     97.6    84.7
FOCUSS-CNDL    64 × 128    7              125.3     2.1    97.9        9414.0    406.5   94.1
FOCUSS-CNDL    64 × 128    5–10           126.3     1.3    98.6        9505.5    263.8   95.1
FOCUSS-FDL     64 × 128    10–15          102.8     4.5    80.3        4009.6    499.6   40.1
FOCUSS-CNDL    64 × 128    10–15          127.4     1.3    99.5        9463.4    330.3   94.6

Learning curves for one of the trials of this experiment (see Figure 2) show that most of the columns of A are learned quickly within the first 100 iterations, and that the diversity of the solutions drops to the desired level. Figure 2b shows that it takes somewhat longer to learn the xk correctly and that reinitialization of the low sparsity solutions (at iterations 175 and 350) helps to learn additional solutions. Figure 2c shows the diversity at each iteration, measured as the average number of elements of each k that are larger than 1 × 10−4.

Figure 2.

Performance of the FOCUSS-CNDL algorithm with overcomplete dictionary A, size 64 × 128. (a) Number of correctly learned columns of A at each iteration. (b) Number of sources xk learned. (c) Average diversity (n − sparsity) of the x̂k. The spikes in b and c indicate where some solutions x̂k were reinitialized because they were not sparse enough.

5.3 Image Data Experiments

Previous work has shown that learned basis functions can be used to code data more efficiently than traditional Fourier or wavelet bases (Lewicki & Olshausen, 1999). The algorithm for finding overcomplete bases in Lewicki and Olshausen (1999) is also designed to solve problem 1.1 but differs from our method in a number of ways, including using only the Laplacian prior (p = 1), and using conjugate gradient optimization for finding sparse solutions (whereas we use the FOCUSS algorithm). It is widely believed that overcomplete representations are more efficient than complete bases, but in Lewicki and Olshausen (1999), the overcomplete code was less efficient for image data (measured in bits per pixel entropy), and it was suggested that different priors could be used to improve the efficiency. Here, we show that our algorithm is able to learn more efficient overcomplete codes for priors with p < 1.

The training data consisted of 10,000 8 × 8 image patches drawn at random from black and white images of natural scenes. The parameter p was varied from 0.5 to 1.0, and the FOCUSS-CNDL algorithm was trained for 150 iterations. The complete dictionary (64 × 64) was compared with the 2× overcomplete dictionary (64 × 128). The other parameters were set to γ = 0.01 and λmax = 2 × 10−3. The coding efficiency was measured using the entropy (bits per pixel) method described in Lewicki and Olshausen (1999). Figure 3 plots the entropy versus reconstruction error (root mean square error, RMSE) and shows that when p < 0.9, the entropy is less for the overcomplete representation at the same RMSE.
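The assembly of such a training set can be sketched as follows; `images` is assumed to be a list of two-dimensional gray-scale arrays, and any preprocessing (e.g., mean removal) applied in the actual experiments is not shown.

```python
import numpy as np

def random_patches(images, patch=8, num=10000, seed=0):
    """Stack `num` randomly located patch x patch image patches as columns of Y."""
    rng = np.random.default_rng(seed)
    Y = np.empty((patch * patch, num))
    for k in range(num):
        img = images[rng.integers(len(images))]
        r = rng.integers(img.shape[0] - patch + 1)
        c = rng.integers(img.shape[1] - patch + 1)
        Y[:, k] = img[r:r + patch, c:c + patch].ravel()
    return Y
```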

Figure 3.

Comparing the coding efficiency of complete and 2× overcomplete representations on 8×8 pixel patches drawn from natural images. The points on the curve are the results from different values of p, at the bottom right, p = 1.0, and at the top left, p = 0.5. For smaller p, the overcomplete case is more efficient at the same level of reconstruction error (RMSE).

An example of coding an entire image is shown in Figure 4. The original test image (see Figure 4a) of size 256 × 256 was encoded using the learned dictionaries. Patches from the test image were not used during training. Table 2 gives results for low- and high-compression cases. In both cases, coding with the overcomplete dictionary (64×128) gives higher compression (lower bits per pixel) and lower error (RMSE). For the high-compression case (see Figures 4b and 4c), the 64 × 128 overcomplete dictionary gives compression of 0.777 bits per pixel at error 0.328, compared to the 64 × 64 complete dictionary at 0.826 bits per pixel at error 0.329. The amount of compression was selected by adjusting λmax (the upper limit of the regularization parameter). For high compression, λmax = 0.02, and for low compression, λmax = 0.002.

Figure 4.

Image compression using complete and overcomplete dictionaries. Coding with an overcomplete dictionary is more efficient (fewer bits per pixel) and more accurate (lower RMSE). (a) Original image of size 256 × 256 pixels. (b) Compressed with 64×64 complete dictionary to 0.826 bits per pixel at RMSE = 0.329. (c) Compressed with 64 × 128 overcomplete dictionary to 0.777 bits per pixel at RMSE = 0.328.

Table 2.

Image Compression Results.

Dictionary Size   p     Compression (bits/pixel)   RMSE    Average Diversity
64 × 64           0.5   2.450                      0.148   17.3
64 × 128          0.6   2.410                      0.141   15.4
64 × 64           0.5   0.826                      0.329   4.5
64 × 128          0.6   0.777                      0.328   4.0

6 Discussion and Conclusions

We have applied a variety of tools and perspectives (including ideas drawn from Bayesian estimation theory, nonlinear regularized optimization, and the theory of majorization and convex analysis) to the problem of developing algorithms capable of simultaneously learning overcomplete dictionaries and solving sparse source-vector inverse problems.

The test experiment described in section 5.2 is a difficult problem designed to determine whether the proposed learning algorithm can solve for the known true solutions for A and the sparse source vectors xk of the underdetermined inverse problem yk = Axk. Such testing, which does not appear to be regularly done in the literature, shows how well an algorithm can extract stable and categorically meaningful solutions from synthetic data. The ability to perform well on such test inverse problems would appear to be at least a necessary condition for an algorithm to be trustworthy in domains where a physically or biologically meaningful sparse solution is sought, such as occurs in biomedical imaging, geophysical seismic sounding, and multitarget tracking.

The experimental results presented in section 5.2 show that the FOCUSS-DL algorithms can recover the dictionary and the sparse source vectors. This is particularly gratifying when one considers that little or no optimization of the algorithm parameters has been done. Furthermore, the convergence proofs given in the appendixes show convergence only to local optima, whereas one expects the cost function to have many local optima because of the concave prior and its generally multimodal form.

One should note that algorithm 3.28 was constructed precisely with the goal of solving inverse problems of the type considered here, and therefore one must be careful when comparing the results given here with other algorithms reported in the literature. For instance, the mixture-of-gaussians prior used in Attias (1999) does not necessarily enforce sparsity. While other algorithms in the literature might perform well on this test experiment, to the best of our knowledge, possible comparably performing algorithms such as Attias (1999), Girolami (2001), Hyvärinen et al. (1999), and Lewicki and Olshausen (1999) have not been tested on large, overcomplete matrices to determine their accuracy in recovering A, and so any comparison along these lines would be premature. In section 5.1, the FOCUSS-DL algorithms were compared to the well-known extended ICA and FastICA algorithms in a more conventional test with complete dictionaries. Performance was measured in terms of the accuracy (SNR) of the recovered sources xk, and both FOCUSS-DL algorithms were found to have significantly better performance (albeit with longer run times).

We have also shown that the FOCUSS-CNDL algorithm can learn an overcomplete representation, which can encode natural images more efficiently than complete bases learned from data (which in turn are more efficient than standard nonadaptive bases, such as Fourier or wavelet bases; Lewicki & Olshausen, 1999). Studies of the human visual cortex have shown a higher degree of overrepresentation of the fovea than in other mammals, which suggests an interesting connection between overcomplete representations and visual acuity and recognition abilities (Popovic & Sjöstrand, 2001).

Because the coupled dictionary learning and sparse-inverse solving algorithms are merged and run in parallel, it should be possible to run the algorithms in real time to track dictionary evolution in quasistationary environments once the algorithm has essentially converged. One way to do this would be to constantly present randomly encountered new signals, yk, to the algorithm at each iteration instead of the original training set. One also has to ensure that the dictionary learning algorithm is sensitive to the new data so that dictionary tracking can occur. This would be done by an appropriate adaptive filtering of the current dictionary estimate driven by the new-data derived corrections, similar to techniques used in the adaptive filtering literature (Kalouptsidis & Theodoridis, 1993).
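One possible concrete form of such a tracking variant is sketched below. The exponential forgetting of running correlations (factor ρ), the step size, and the routine name are illustrative assumptions, not a procedure specified in this article.

```python
import numpy as np

def tracking_dictionary_step(A, Sxx, Syx, x_new, y_new, mu=0.01, rho=0.99):
    """Fold one new (y, x_hat) pair into forgetting-factor correlations, then update A."""
    Sxx = rho * Sxx + (1 - rho) * np.outer(x_new, x_new)   # running Sigma_xx
    Syx = rho * Syx + (1 - rho) * np.outer(y_new, x_new)   # running Sigma_yx
    dA = A @ Sxx - Syx                                     # running estimate of deltaA
    A = A - mu * (dA - np.trace(A.T @ dA) * A)             # Frobenius-sphere step
    return A / np.linalg.norm(A, 'fro'), Sxx, Syx
```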

Acknowledgments

This research was partially supported by NSF grant CCR-9902961. K. K.-D. also acknowledges the support provided by the Computational Neurobiology Laboratory, Salk Institute, during a 1998–1999 sabbatical visit.

Appendix A: The Frobenius-Normalized Prior Learning Algorithm

Here we provide a derivation of algorithms 3.20–3.21 and prove convergence to a local minimum of equation 3.15 on the manifold $\mathcal{A}_F = \{A \mid \|A\|_F = 1\} \subset \mathbb{R}^{m\times n}$ defined in equation 3.17. Although we focus on the development of the learning algorithm on $\mathcal{A}_F$, the derivations in sections A.2 and A.3, and the beginning of section A.4, are done for a general constraint manifold $\mathcal{A}$.

A.1 Admissible Matrix Derivatives

A.1.1 The Constraint Manifold $\mathcal{A}_F$

In order to determine the structural form of admissible derivatives, $\dot{A} = \frac{d}{dt}A$, for matrices belonging to $\mathcal{A}_F$,9 it is useful to view $\mathcal{A}_F$ as embedded in the finite-dimensional Hilbert space of matrices, ℝ^{m×n}, with inner product

$$\langle A, B\rangle = \mathrm{tr}(A^TB) = \mathrm{tr}(B^TA) = \mathrm{tr}(AB^T) = \mathrm{tr}(BA^T).$$

The corresponding matrix norm is the Frobenius norm,

$$\|A\| = \|A\|_F = \sqrt{\mathrm{tr}(A^TA)} = \sqrt{\mathrm{tr}(AA^T)}.$$

We will call this space the Frobenius space and the associated inner product the Frobenius inner product. It is useful to note the isometry,

$$A \in \mathbb{R}^{m\times n} \;\longleftrightarrow\; \mathbf{A} = \mathrm{vec}(A) \in \mathbb{R}^{mn},$$

where $\mathbf{A}$ is the mn-vector formed by stacking the columns of A. Henceforth, bold type represents the stacked version of a matrix (e.g., $\mathbf{B} = \mathrm{vec}(B)$). The stacked vector $\mathbf{A}$ belongs to the standard Hilbert space ℝ^{mn}, which we shall henceforth refer to as the stacked space. This space has the standard Euclidean inner product and norm,

$$\langle\mathbf{A}, \mathbf{B}\rangle = \mathbf{A}^T\mathbf{B}, \qquad \|\mathbf{A}\| = \sqrt{\mathbf{A}^T\mathbf{A}}.$$

It is straightforward to show that

$$\langle A, B\rangle = \langle\mathbf{A}, \mathbf{B}\rangle \quad\text{and}\quad \|A\| = \|\mathbf{A}\|.$$

In particular, we have

$$A \in \mathcal{A}_F \;\Longleftrightarrow\; \|A\| = \|\mathbf{A}\| = 1.$$

Thus, the manifold in equation 3.17 corresponds to the (mn − 1)–dimensional unit sphere in the stacked space, ℝ^{mn} (which, with a slight abuse of notation, we will continue to denote by $\mathcal{A}_F$). It is evident that $\mathcal{A}_F$ is simply connected so that a path exists between any two elements of $\mathcal{A}_F$ and, in particular, a path exists between any initial value for a dictionary, $A_{init} \in \mathcal{A}_F$, used to initialize a learning algorithm, and a desired target value, $A_{final} \in \mathcal{A}_F$.10

A.1.2 Derivatives on $\mathcal{A}_F$: The Tangent Space $T\mathcal{A}_F$

Determining the form of admissible derivatives on equation 3.17 is equivalent to determining the form of admissible derivatives on the unit ℝ^{mn}–sphere. On the unit sphere, we have the well-known fact that

$$\mathbf{A} \in \mathcal{A}_F \;\Longrightarrow\; \frac{d}{dt}\|\mathbf{A}\|^2 = 2\,\mathbf{A}^T\dot{\mathbf{A}} = 0 \;\Longleftrightarrow\; \dot{\mathbf{A}} \perp \mathbf{A}.$$

This shows that the general form of $\dot{\mathbf{A}}$ is $\dot{\mathbf{A}} = \Lambda\mathbf{Q}$, where $\mathbf{Q}$ is arbitrary and

$$\Lambda = \left(I - \frac{\mathbf{A}\mathbf{A}^T}{\|\mathbf{A}\|^2}\right) = \left(I - \mathbf{A}\mathbf{A}^T\right) \qquad (A.1)$$

is the stacked-space projection operator onto the tangent space of the unit ℝ^{mn}–sphere at the point $\mathbf{A}$ (note that we used the fact that $\|\mathbf{A}\| = 1$). The projection operator Λ is necessarily idempotent, Λ = Λ². Λ is also self-adjoint, Λ = Λ*, where the adjoint operator Λ* is defined by the requirement that

$$\langle\Lambda\mathbf{Q}_1, \mathbf{Q}_2\rangle = \langle\mathbf{Q}_1, \Lambda\mathbf{Q}_2\rangle, \quad\text{for all } \mathbf{Q}_1, \mathbf{Q}_2 \in \mathbb{R}^{mn},$$

showing that Λ is an orthogonal projection operator. In this case, Λ* = ΛT, so that self-adjointness corresponds to Λ being symmetric. One can readily show that an idempotent, self-adjoint operator is nonnegative, which in this case corresponds to the symmetric, idempotent operator Λ being a positive semidefinite matrix.

This projection can be easily rewritten in the Frobenius space,

$$\dot{A} = \Lambda Q = Q - \langle A, Q\rangle A = Q - \mathrm{tr}(A^TQ)\,A. \qquad (A.2)$$

Of course, this result can be derived directly in the Frobenius space using the fact that

$$A \in \mathcal{A}_F \;\Longrightarrow\; \frac{d}{dt}\|A\|^2 = 2\langle A, \dot{A}\rangle = 2\,\mathrm{tr}(A^T\dot{A}) = 0,$$

from which it is directly evident that

$$\dot{A} \in T\mathcal{A}_F \;\text{at}\; A \in \mathcal{A}_F \;\Longleftrightarrow\; \langle A, \dot{A}\rangle = \mathrm{tr}(A^T\dot{A}) = 0, \qquad (A.3)$$

and therefore Ȧ must be of the form11

$$\dot{A} = \Lambda Q = Q - \frac{\mathrm{tr}(A^TQ)}{\mathrm{tr}(A^TA)}\,A = Q - \mathrm{tr}(A^TQ)\,A. \qquad (A.4)$$

One can verify that Λ is idempotent and self-adjoint and is therefore a nonnegative, orthogonal projection operator. It is the orthogonal projection operator from ℝ^{m×n} onto the tangent space $T\mathcal{A}_F$.

In the stacked space (with some additional abuse of notation), we represent the quadratic form for positive semidefinite symmetric matrices W as

$$\|\mathbf{A}\|_W^2 = \mathbf{A}^TW\mathbf{A}.$$

Note that this defines a weighted norm if, and only if, W is positive definite, which need not be the case here. In particular, when W = Λ, the quadratic form $\|\mathbf{A}\|_\Lambda^2$ is only positive semidefinite. Finally, note from equation A.4 that for all $A \in \mathcal{A}_F$,

$$\Lambda Q = 0 \;\Longleftrightarrow\; Q = cA, \quad\text{with } c = \mathrm{tr}(A^TQ). \qquad (A.5)$$

A.2 Minimizing the Loss Function over a General Manifold $\mathcal{A}$

Consider the Lyapunov function,

$$V_N(X, A) = \left\langle d_q(y - Ax) + \lambda d_p(x)\right\rangle_N, \qquad A \in \mathcal{A}, \qquad (A.6)$$

where $\mathcal{A}$ is some arbitrary but otherwise appropriately defined constraint manifold associated with the prior, equation 3.13. Note that this is precisely the loss function to be minimized in equation 3.15. If we can determine smooth parameter trajectories (i.e., a parameter-vector adaptation rule) $(\dot{X}, \dot{A})$ such that along these trajectories $\dot{V}_N(X, A) \leq 0$, then as a consequence of the La Salle invariance principle (Khalil, 1996), the parameter values will converge to the largest invariant set (of the adaptation rule viewed as a nonlinear dynamical system) contained in the set

$$\Gamma = \left\{(X, A) \mid \dot{V}_N(X, A) \equiv 0 \;\text{and}\; A \in \mathcal{A}\right\}. \qquad (A.7)$$

The set Γ contains the local minima of VN. With some additional technical assumptions (generally dependent on the choice of adaptation rule), the elements of Γ will contain only local minima of VN.

Assuming the i.i.d. q = 2 gaussian measurement noise case of equation 3.8,12 the loss (Lyapunov) function to be minimized is then

$$V_N(X, A) = \left\langle\tfrac{1}{2}\|Ax - y\|^2 + \lambda d_p(x)\right\rangle_N, \qquad A \in \mathcal{A}, \qquad (A.8)$$

which is essentially the loss function to be minimized in equation 3.15.13

Suppose for the moment, as in equations 3.4 through 3.9, that X is assumed to be known, and note that then (ignoring constant terms depending on X and Y) VN can be rewritten as

$$V_N(A) = \left\langle\mathrm{tr}\left[(Ax - y)(Ax - y)^T\right]\right\rangle_N = \left\langle\mathrm{tr}(Axx^TA^T) - 2\,\mathrm{tr}(Axy^T) + \mathrm{tr}(yy^T)\right\rangle_N,$$
so that, up to a constant independent of A,
$$V_N(A) = \mathrm{tr}\left\{A\Sigma_{xx}A^T - 2A\Sigma_{xy}\right\},$$

for $\Sigma_{xx}$ and $\Sigma_{xy} = \Sigma_{yx}^T$ defined as in equation 3.10. Using standard results from matrix calculus (Dhrymes, 1984), we can show that VN(A) is minimized by the solution 3.9. This is done by setting

$$\nabla_A V_N(\hat{A}) = 0,$$

and using the identities (valid for W symmetric)

$$\frac{\partial}{\partial A}\,\mathrm{tr}(AWA^T) = 2AW \quad\text{and}\quad \frac{\partial}{\partial A}\,\mathrm{tr}(AB) = B^T.$$

This yields (assuming that Σxx is invertible),

$$\nabla_A V_N(\hat{A}) = 2\hat{A}\Sigma_{xx} - 2\Sigma_{xy}^T = 2\left(\hat{A}\Sigma_{xx} - \Sigma_{yx}\right) = 0 \;\Longrightarrow\; \hat{A} = \Sigma_{y\hat{x}}\Sigma_{\hat{x}\hat{x}}^{-1},$$

which is equation 3.9, as claimed. For Σxx nonsingular, the solution is unique and globally optimal. This is, of course, a well-known result.
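Numerically, this known-source solution is a one-line computation; the sketch below assumes Σxx is nonsingular, as noted above.

```python
import numpy as np

def dictionary_given_sources(Y, X):
    """Closed-form minimizer of V_N(A) for known X: A = Sigma_yx Sigma_xx^{-1}."""
    N = X.shape[1]
    Sigma_yx = Y @ X.T / N
    Sigma_xx = X @ X.T / N
    return Sigma_yx @ np.linalg.inv(Sigma_xx)   # assumes Sigma_xx invertible
```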

Now return to the general case, equation A.8, where both X and A are unknown. For the data indexed by k = 1, …, N, define the quantities

$$d_k = d_p(x_k), \qquad e(x) = Ax - y, \qquad\text{and}\qquad e_k = Ax_k - y_k.$$

The loss function and its time derivative can be written,

$$V_N(X, A) = \left\langle\tfrac{1}{2}e^T(x)e(x) + \lambda d_p(x)\right\rangle_N$$
$$\dot{V}_N(X, A) = \left\langle e^T(x)\dot{e}(x) + \lambda\nabla^Td_p(x)\,\dot{x}\right\rangle_N = \left\langle e^T(x)\left(A\dot{x} + \dot{A}x\right) + \lambda\nabla^Td_p(x)\,\dot{x}\right\rangle_N.$$

Then, to determine an appropriate adaptation rule, note that

$$\dot{V}_N = T_1 + T_2, \qquad (A.9)$$

where

$$T_1 = \left\langle\left(e^T(x)A + \lambda\nabla^Td_p(x)\right)\dot{x}\right\rangle_N = \frac{1}{N}\sum_{k=1}^N\left(e_k^TA + \lambda\nabla^Td_k\right)\dot{x}_k \qquad (A.10)$$

and

$$T_2 = \left\langle e^T(x)\dot{A}x\right\rangle_N = \frac{1}{N}\sum_{k=1}^N e_k^T\dot{A}x_k. \qquad (A.11)$$

Enforcing the separate conditions

$$T_1 \le 0 \quad\text{and}\quad T_2 \le 0, \qquad (A.12)$$

(as well as the additional condition that $A \in \mathcal{A}$) will be sufficient to ensure that $\dot{V}_N \le 0$ on $\mathcal{A}$. In this case, the solution-containing set Γ of equation A.7 is given by

$$\Gamma = \left\{(X, A) \mid T_1(X, A) \equiv 0,\; T_2(X, A) \equiv 0 \;\text{and}\; A \in \mathcal{A}\right\}. \qquad (A.13)$$

Note that if A is known and fixed, then T2 ≡ 0, and only the first condition of equation A.12 (which enforces learning of the source vectors, xk) is of concern. Contrariwise, if source vectors xk that ensure e(xk) = 0 are fixed and known, then T1 ≡ 0, and the second condition of equation A.12 (which enforces learning of the dictionary matrix, A) is at issue.

A.3 Obtaining the xk Solutions with the FOCUSS Algorithm

We now develop the gradient factorization-based derivation of the FOCUSS algorithm, which provides estimates of xk while satisfying the first convergence condition of equation A.12. The constraint manifold $\mathcal{A}$ is still assumed to be arbitrary. To enforce the condition T1 ≤ 0 and derive the first adaptation rule given in equation 3.20, we note that we can factor ∇dk = ∇d(xk) as (Kreutz-Delgado & Rao, 1997; Rao & Kreutz-Delgado, 1999)

$$\nabla d_k = \alpha_k\Pi_kx_k,$$

with αk = αxk > 0 and Πxk = Πk positive definite and diagonal for all nonzero xk. Then, defining βk = λαk > 0 and selecting an arbitrary set of (adaptable) symmetric positive-definite matrices Ωk, we choose the learning rule

$$\dot{x}_k = -\Omega_k\left\{A^Te_k + \lambda\nabla d_k\right\} = -\Omega_k\left\{\left(A^TA + \beta_k\Pi_k\right)x_k - A^Ty_k\right\}, \quad k = 1, \ldots, N, \qquad (A.14)$$

which is the adaptation rule for the state estimates xk = x̂k given in the first line of equation 3.20. With this choice, we obtain

$$T_1 = -\left\langle\left\|A^Te(x) + \lambda\nabla d(x)\right\|_\Omega^2\right\rangle_N = -\frac{1}{N}\sum_{k=1}^N\left\|A^Te_k + \lambda\nabla d_k\right\|_{\Omega_k}^2 = -\frac{1}{N}\sum_{k=1}^N\left\|\left(A^TA + \beta_k\Pi_k\right)x_k - A^Ty_k\right\|_{\Omega_k}^2 \le 0, \qquad (A.15)$$

as desired. Assuming convergence to the set A.13 (which will be seen to be the case after we show below how to ensure that we also have T2 ≤ 0), we will asymptotically obtain (reintroducing the “hat” notation to now denote converged parameter estimates)

$$\left\|\left(\hat{A}^T\hat{A} + \beta_k\Pi_k\right)\hat{x}_k - \hat{A}^Ty_k\right\|_{\Omega_k}^2 \equiv 0, \quad k = 1, \ldots, N,$$

which is equivalent to

$$\hat{x}_k = \left(\hat{A}^T\hat{A} + \beta_k\Pi_k\right)^{-1}\hat{A}^Ty_k, \quad k = 1, \ldots, N, \qquad (A.16)$$

at convergence. This is also equivalent to the condition given in the first line of equation 3.25, as shown below.

Exploiting the fact that Ωk in equation A.14 are arbitrary (subject to the symmetry and positive-definiteness constraint), let us make the specific choice shown in equation 3.23,

$$\Omega_k = \eta_k\left(A^TA + \beta_k\Pi_k\right)^{-1}, \quad \eta_k > 0, \quad k = 1, \ldots, N. \qquad (A.17)$$

Also note the (trivial) identity,

$$\left(A^TA + \beta\Pi\right)\Pi^{-1}A^T = A^T\left(A\Pi^{-1}A^T + \beta I\right),$$

which can be recast nontrivially as

$$\left(A^TA + \beta\Pi\right)^{-1}A^T = \Pi^{-1}A^T\left(\beta I + A\Pi^{-1}A^T\right)^{-1}. \qquad (A.18)$$

With equations A.17 and A.18, the learning rule, A.14, can be recast as

$$\dot{x}_k = -\eta_k\left\{x_k - \Pi_k^{-1}A^T\left(\beta_kI + A\Pi_k^{-1}A^T\right)^{-1}y_k\right\}, \quad k = 1, \ldots, N, \qquad (A.19)$$

which is the alternative learning algorithm, 3.24. At convergence (when T1 ≡ 0) we have the condition shown in the first line of equation 3.25,

$$\hat{x}_k = \Pi_k^{-1}\hat{A}^T\left(\beta_kI + \hat{A}\Pi_k^{-1}\hat{A}^T\right)^{-1}y_k. \qquad (A.20)$$

This also follows from the convergence condition A.16 and the identity A.18, showing that the result, A.20, is independent of the specific choice of Ωk > 0. Note from equations A.14 and A.15 that T1 ≡ 0 also results in ẋk ≡ 0 for k = 1, …, N, so that we will have converged to constant values, x̂k, which satisfy equation A.20.

For the case of known, fixed A, the learning rule derived here will converge to sparse solutions, xk, and when discretized as in section 3.3, yields equation 3.27, which is the known dictionary FOCUSS algorithm (Rao & Kreutz-Delgado, 1998a, 1999).
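For reference, the fixed point implied by equation A.20 can be iterated directly when A is known, as in the sketch below. It assumes the diversity measure d_p(x) = Σ_i |x_i|^p, so that Π^{-1}(x) = diag(|x_i|^{2−p}), and uses a single fixed regularization value β in place of the β_k schedule; the article's actual discretization is equation 3.27.

```python
import numpy as np

def focuss_known_dictionary(A, y, p=1.0, beta=1e-3, iters=50):
    """Relaxed FOCUSS iteration x <- Pi^{-1} A^T (beta I + A Pi^{-1} A^T)^{-1} y."""
    m, _ = A.shape
    x = A.T @ np.linalg.solve(A @ A.T, y)        # pseudoinverse initialization
    for _ in range(iters):
        Pi_inv = np.abs(x) ** (2.0 - p)          # Pi^{-1}(x) for d_p(x) = sum |x_i|^p
        APA = (A * Pi_inv) @ A.T                 # A Pi^{-1} A^T
        x = Pi_inv * (A.T @ np.linalg.solve(beta * np.eye(m) + APA, y))
    return x
```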

Note that the derivation of the sparse source-vector learning algorithm here, which enforces the condition T1 ≤ 0, is entirely independent of any constraints placed on A (such as, for example, the unit Frobenius-norm and column-norm constraints considered in this article) or of the form of the A-learning rule. Thus alternative choices of constraints placed on A, as considered in Murray and Kreutz-Delgado (2001) and described in appendix B, will not change the form of the xk-learning rule derived here. Of course, because the xk learning rule is strongly coupled to the A-learning rule, algorithmic performance and speed of convergence may well be highly sensitive to conditions placed on A and the specific A learning algorithm used.

A.4 Learning the Dictionary A

A.4.1 General Results

We now turn to the enforcement of the second convergence condition, T2 ≤ 0 and the development of the dictionary adaptation rule shown in equation 3.20. First, as in equation 3.21, we define the error term δA as

$$\delta A = \left\langle e(x)x^T\right\rangle_N = \frac{1}{N}\sum_{k=1}^N e(x_k)x_k^T = A\Sigma_{xx} - \Sigma_{yx}, \qquad (A.21)$$

using the fact that e(x) = Axy. Then, from equation A.11, we have

$$T_2 = \left\langle e^T(x)\dot{A}x\right\rangle_N = \left\langle\mathrm{tr}\left(xe^T(x)\dot{A}\right)\right\rangle_N = \mathrm{tr}\left(\left\langle xe^T(x)\right\rangle_N\dot{A}\right) = \mathrm{tr}\left(\delta A^T\dot{A}\right) = \left\langle\delta A, \dot{A}\right\rangle. \qquad (A.22)$$

So far, these steps are independent of any specific constraints that may be placed on A, other than that the manifold $\mathcal{A}$ be smooth and compact. With A constrained to lie on a specified smooth, compact manifold, to ensure correct learning behavior, it is sufficient to impose the constraint that Ȧ lies in the tangent space to the manifold and the condition that T2 ≤ 0.

A.4.2 Learning on the Unit Frobenius Sphere, $\mathcal{A}_F$

To ensure that T2 ≤ 0 and that Ȧ is in the tangent space of the unit sphere $\mathcal{A}_F$ in the Frobenius space ℝ^{m×n}, we take

$$\dot{A} = -\mu\Lambda\delta A = -\mu\left(\delta A - \mathrm{tr}\left(A^T\delta A\right)A\right), \quad \mu > 0, \qquad (A.23)$$

which is the adaptation rule given in equation 3.20. With this choice, and using the positive semidefiniteness of Λ, we have

$$T_2 = -\mu\left\|\delta A\right\|_\Lambda^2 \le 0,$$

as required. Note that at convergence, the condition T2 ≡ 0, yields Ȧ ≡ 0, so that we will have converged to constant values for the dictionary elements, and

$$0 = \Lambda\delta\hat{A} = \Lambda\left(\hat{A}\Sigma_{\hat{x}\hat{x}} - \Sigma_{y\hat{x}}\right) \;\Longleftrightarrow\; \delta\hat{A} = \left(\hat{A}\Sigma_{\hat{x}\hat{x}} - \Sigma_{y\hat{x}}\right) = c\hat{A}, \qquad (A.24)$$

from equation A.5, where c = tr(ÂTδÂ). Thus, the steady-state solution is

$$\hat{A} = \Sigma_{y\hat{x}}\left(\Sigma_{\hat{x}\hat{x}} - cI\right)^{-1} \in \mathcal{A}_F. \qquad (A.25)$$

Note that equations A.20 and A.25 are the steady-state values given earlier in equation 3.25.
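A discretized sketch of one step of the update A.23, followed by the re-normalization of note 7, might read as follows; the step size μ and the routine name are illustrative.

```python
import numpy as np

def frobenius_dictionary_step(A, X, Y, mu=0.01):
    """One tangent-space step of Adot = -mu(deltaA - tr(A^T deltaA) A), then renormalize."""
    N = Y.shape[1]
    dA = (A @ X - Y) @ X.T / N                     # deltaA = <e(x) x^T>_N = A Sigma_xx - Sigma_yx
    A = A - mu * (dA - np.trace(A.T @ dA) * A)     # projected (tangent-space) step
    return A / np.linalg.norm(A, 'fro')            # guard against drift off ||A||_F = 1 (note 7)
```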

Appendix B: Convergence of the Column–Normalized Learning Algorithm

The derivation and proof of convergence of the column-normalized learning algorithm, applicable to learning members of $\mathcal{A}_C$, is accomplished by appropriately modifying key steps of the development given in appendix A. As in appendix A, the standard Euclidean norm and inner product apply to column vectors, while the Frobenius norm and inner product apply to m × n matrix elements of ℝ^{m×n}.

B.1 Admissible Matrix Derivatives

B.1.1 The Constraint Manifold $\mathcal{A}_C$

Let ei = (0 · · · 0 1 0 · · ·0)T ∈ ℝn be the canonical unit vector whose components are all zero except for the value 1 in the ith location. Then

$$I = \sum_{i=1}^n e_ie_i^T \quad\text{and}\quad a_i = Ae_i.$$

Note that

$$\|a_i\|^2 = a_i^Ta_i = (Ae_i)^TAe_i = e_i^TA^TAe_i = \mathrm{tr}\left(e_ie_i^TA^TA\right) = \mathrm{tr}\left(M_i^TA\right) = \left\langle M_i, A\right\rangle,$$

where

$$M_i \doteq Ae_ie_i^T = a_ie_i^T = \left[\,0\ \cdots\ 0\ \ a_i\ \ 0\ \cdots\ 0\,\right] \in \mathbb{R}^{m\times n}. \qquad (B.1)$$

Note that only the ith column of Mi is nonzero and is equal to ai = ith column of A. We therefore have that

$$A = \sum_{i=1}^n M_i. \qquad (B.2)$$

Also, for i, j = 1, …, n,

$$\left\langle M_i, M_j\right\rangle = \mathrm{tr}\left(M_i^TM_j\right) = \mathrm{tr}\left(e_ie_i^TA^TAe_je_j^T\right) = \mathrm{tr}\left(e_i^TA^TAe_je_j^Te_i\right) = \|a_i\|^2\,\delta_{i,j}, \qquad (B.3)$$

where δi,j is the Kronecker delta. Note, in particular, that ||ai|| = ||Mi||.

Let $A \in \mathcal{A}_C$, where $\mathcal{A}_C$ is the set of column-normalized matrices, as defined by equation 3.29. $\mathcal{A}_C$ is an $mn - n = n(m - 1)$–dimensional submanifold of the Frobenius space ℝ^{m×n}, as each of the n columns of A is normalized as

$$\|a_i\|^2 = \|M_i\|^2 \equiv \frac{1}{n},$$

and

$$\|A\|^2 = \mathrm{tr}\left(A^TA\right) = \mathrm{tr}\left(\sum_{i=1}^nM_i^T\sum_{j=1}^nM_j\right) = \sum_{i=1}^n\|a_i\|^2 = n\cdot\frac{1}{n} = 1,$$

using linearity of the trace function and property B.3. It is evident, therefore, that

$$\mathcal{A}_C \subset \mathcal{A}_F,$$

for $\mathcal{A}_F$ defined by equation 3.17.

We can readily show that $\mathcal{A}_C$ is simply connected, so that a continuous path exists from any matrix $A = A_{init} \in \mathcal{A}_C$ to any other column-normalized matrix $A' = A_{final} \in \mathcal{A}_C$. Indeed, let A and A′ be such that

$$A = [a_1, \ldots, a_n], \quad A' = [a_1', \ldots, a_n'], \quad \|a_i\| = \|a_j'\| = 1/\sqrt{n}, \quad i, j = 1, \ldots, n.$$

There is obviously a continuous path entirely in $\mathcal{A}_C$ from $A \in \mathcal{A}_C$ to the intermediate matrix $[a_1', a_2, \ldots, a_n] \in \mathcal{A}_C$. Similarly, there is a continuous path entirely in $\mathcal{A}_C$ from $[a_1', a_2, \ldots, a_n]$ to $[a_1', a_2', a_3, \ldots, a_n]$, and so on to $[a_1', \ldots, a_n'] = A'$. To summarize, $\mathcal{A}_C$ is a simply connected $(nm - n)$–dimensional submanifold of the simply connected $(nm - 1)$–dimensional manifold $\mathcal{A}_F$, and they are both submanifolds of the $(nm)$–dimensional Frobenius space ℝ^{m×n}.

B.1.2 Derivatives on $\mathcal{A}_C$: The Tangent Space $T\mathcal{A}_C$

For convenience, for i = 1, …, n define

$$\hat{a}_i = \frac{a_i}{\|a_i\|} = \sqrt{n}\,a_i, \qquad \|\hat{a}_i\| = 1,$$

and

$$\hat{M}_i = \frac{M_i}{\|M_i\|} = \sqrt{n}\,M_i = \hat{a}_ie_i^T, \qquad \|\hat{M}_i\| = 1.$$

Note that equation B.3 yields

$$\left\langle\hat{M}_i, \hat{M}_j\right\rangle = \mathrm{tr}\left(\hat{M}_i^T\hat{M}_j\right) = \delta_{i,j}. \qquad (B.4)$$

For any $A \in \mathcal{A}_C$ we have, for each i = 1, …, n,

$$0 = \frac{d}{dt}\|a_i\|^2 = \frac{d}{dt}a_i^Ta_i = \frac{d}{dt}e_i^TA^TAe_i = 2e_i^TA^T\dot{A}e_i = 2\,\mathrm{tr}\left(e_ie_i^TA^T\dot{A}\right) = 2\,\mathrm{tr}\left(M_i^T\dot{A}\right),$$

or

$$A \in \mathcal{A}_C \;\Longrightarrow\; \left\langle M_i, \dot{A}\right\rangle = \mathrm{tr}\left(M_i^T\dot{A}\right) = 0 \quad\text{for } i = 1, \ldots, n. \qquad (B.5)$$

In fact,

$$\dot{A} \in T\mathcal{A}_C \;\text{at}\; A \in \mathcal{A}_C \;\Longleftrightarrow\; \left\langle\hat{M}_i, \dot{A}\right\rangle = \mathrm{tr}\left(\hat{M}_i^T\dot{A}\right) = 0 \quad\text{for } i = 1, \ldots, n. \qquad (B.6)$$

Note from equations B.2 and B.5 that

$$\left\langle A, \dot{A}\right\rangle = \sum_{i=1}^n\left\langle M_i, \dot{A}\right\rangle = 0,$$

showing that (see equation A.3) $T\mathcal{A}_C \subset T\mathcal{A}_F$, as expected from the fact that $\mathcal{A}_C \subset \mathcal{A}_F$.

An important and readily proven fact is that for each i = 1, …, n,

$$\mathrm{tr}\left(\hat{M}_i^T\dot{A}\right) = 0 \;\Longleftrightarrow\; \dot{A} = P_iQ_i \qquad (B.7)$$

for some matrix Qi and projection operator, Pi, defined by

$$P_iQ \doteq Q - \hat{M}_i\left\langle\hat{M}_i, Q\right\rangle = Q - \hat{M}_i\,\mathrm{tr}\left(\hat{M}_i^TQ\right). \qquad (B.8)$$

From equation B.4, it can be shown that the projection operators commute,

$$P_iP_j = P_jP_i, \quad i \neq j, \quad i, j = 1, \ldots, n, \qquad (B.9)$$

and are idempotent,

$$P_i^2 = P_i, \quad i = 1, \ldots, n. \qquad (B.10)$$

Indeed, it is straightforward to show from equation B.4 that for all Q,

$$P_iP_jQ = P_jP_iQ = Q - \hat{M}_i\left\langle\hat{M}_i, Q\right\rangle - \hat{M}_j\left\langle\hat{M}_j, Q\right\rangle, \quad\text{for all } i \neq j, \qquad (B.11)$$

and

$$P_i^2Q = Q - \hat{M}_i\left\langle\hat{M}_i, Q\right\rangle = P_iQ, \quad i = 1, \ldots, n. \qquad (B.12)$$

It can also be shown that

$$\left\langle P_iQ_1, Q_2\right\rangle = \left\langle Q_1, P_iQ_2\right\rangle,$$

showing that $P_i$ is self-adjoint, $P_i = P_i^*$.

Define the operator,

$$P \doteq P_1\cdots P_n. \qquad (B.13)$$

Note that because of the commutativity of the Pi, the order of multiplication on the right-hand side of equation B.13 is immaterial, and it is easily determined that P is idempotent,

$$P^2 = P.$$

By induction on equation B.11, it is readily shown that

$$PQ = Q - \hat{M}_1\left\langle\hat{M}_1, Q\right\rangle - \cdots - \hat{M}_n\left\langle\hat{M}_n, Q\right\rangle = Q - \sum_{i=1}^n\hat{M}_i\left\langle\hat{M}_i, Q\right\rangle. \qquad (B.14)$$

Either from the self-adjointness and idempotency of each Pi, or directly from equation B.14, it can be shown that P itself is self–adjoint, P = P*,

$$\left\langle PQ_1, Q_2\right\rangle = \left\langle Q_1, PQ_2\right\rangle.$$

Thus, P is the orthogonal projection operator from ℝ^{m×n} onto $T\mathcal{A}_C$.

As a consequence, we have the key result that

$$\dot{A} \in T\mathcal{A}_C \;\text{at}\; A \;\Longleftrightarrow\; \dot{A} = PQ, \qquad (B.15)$$

for some matrix Q ∈ ℝ^{m×n}. This follows from the fact that given Q, then the right-hand side of equation B.6 is true for Ȧ = PQ. On the other hand, if the right-hand side of equation B.6 is true for Ȧ, we can take Q = Ȧ in equation B.15. Note that for the $T\mathcal{A}_F$–projection operator Λ given by equation A.2, we have that

$$P = \Lambda P = P\Lambda,$$

consistent with the fact that $T\mathcal{A}_C \subset T\mathcal{A}_F$.

Let qi, i = 1, …, n, be the columns of Q = [q1, …, qn]. The operation Q′ = PjQ corresponds to

$$q_i' = q_i, \quad i \neq j, \qquad\text{and}\qquad q_j' = \left(I - \hat{a}_j\hat{a}_j^T\right)q_j,$$

while the operation Q′ = PQ corresponds to

$$q_i' = \left(I - \hat{a}_i\hat{a}_i^T\right)q_i, \quad i = 1, \ldots, n.$$
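This column-wise action of P can be written compactly; the sketch below assumes A has nonzero columns.

```python
import numpy as np

def project_onto_tangent(A, Q):
    """Apply P of equation B.14 column-wise: q_i -> (I - a_hat_i a_hat_i^T) q_i."""
    A_hat = A / np.linalg.norm(A, axis=0, keepdims=True)         # unit-norm columns a_hat_i
    return Q - A_hat * np.sum(A_hat * Q, axis=0, keepdims=True)  # remove component along a_hat_i
```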

B.2 Learning on the Manifold $\mathcal{A}_C$

The development of equations A.6 through A.22 is independent of the precise nature of the constraint manifold $\mathcal{A}$ and can be applied here to the specific case of $\mathcal{A} = \mathcal{A}_C$. To ensure that T2 ≤ 0 in equation A.22 and that $\dot{A} \in T\mathcal{A}_C$, we can use the learning rule,

$$\dot{A} = -\mu P\delta A = -\mu\left(\delta A - \sum_{i=1}^n\hat{M}_i\,\mathrm{tr}\left(\hat{M}_i^T\delta A\right)\right), \qquad (B.16)$$

for μ > 0 and δA given by equation A.21. With the ith column of δA denoted by δai, this corresponds to the learning rule,

$$\dot{a}_i = -\mu\left(I - \hat{a}_i\hat{a}_i^T\right)\delta a_i, \quad i = 1, \ldots, n. \qquad (B.17)$$

With rule B.16, we have

$$T_2 = \left\langle\delta A, \dot{A}\right\rangle = -\mu\left\langle\delta A, P\delta A\right\rangle = -\mu\left\langle\delta A, P^2\delta A\right\rangle = -\mu\left\langle P\delta A, P\delta A\right\rangle = -\mu\left\|P\delta A\right\|^2 \le 0,$$

where we have explicitly shown that the idempotency and self-adjointness of P correspond to its being a nonnegative operator. Thus, we will have convergence to the largest invariant set for which Ȧ ≡ 0, which from equation B.16 is equivalent to the set for which

$$P\delta A = \delta A - \sum_{i=1}^n\hat{M}_i\left\langle\hat{M}_i, \delta A\right\rangle \equiv 0. \qquad (B.18)$$

This, in turn, is equivalent to

$$\delta A = \sum_{i=1}^n\hat{M}_i\left\langle\hat{M}_i, \delta A\right\rangle = \sum_{i=1}^n nM_i\left\langle M_i, \delta A\right\rangle = \sum_{i=1}^n c_iM_i, \qquad (B.19)$$

with

$$c_i \doteq n\left\langle M_i, \delta A\right\rangle = n\,a_i^T\delta a_i, \quad i = 1, \ldots, n.$$

An equivalent statement to equation B.19 is

$$\delta a_i = c_ia_i, \quad i = 1, \ldots, n.$$

Defining the diagonal matrix

$$C = \mathrm{diag}\left[c_1, \ldots, c_n\right]$$

and recalling the definitions A.21 and B.1, we obtain from equation B.19 that

$$A\Sigma_{xx} - \Sigma_{yx} = \delta A = AC,$$

which can be solved as

$$A = \Sigma_{yx}\left(\Sigma_{xx} - C\right)^{-1}. \qquad (B.20)$$

This is the general form of the solution found by the $\mathcal{A}_C$–learning algorithm.
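For reference, one discretized step of the column-normalized update, equation B.17, followed by a column re-normalization (an assumption analogous to the Frobenius re-normalization of note 7), might be sketched as:

```python
import numpy as np

def column_normalized_step(A, X, Y, mu=0.01):
    """One step of adot_i = -mu(I - a_hat_i a_hat_i^T) delta_a_i, then renormalize columns."""
    N = Y.shape[1]
    n = A.shape[1]
    dA = (A @ X - Y) @ X.T / N                                        # deltaA = A Sigma_xx - Sigma_yx
    A_hat = A / np.linalg.norm(A, axis=0, keepdims=True)
    A = A - mu * (dA - A_hat * np.sum(A_hat * dA, axis=0, keepdims=True))
    return A / (np.sqrt(n) * np.linalg.norm(A, axis=0, keepdims=True))  # restore ||a_i|| = 1/sqrt(n)
```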

Footnotes

1

For notational simplicity, we consider the real case only.

2

This fact comes as no surprise when the solution is interpreted within a Bayesian framework using a gaussian (maximum entropy) prior.

3

Taking “semantic” here to mean categorically or physically interpretable.

4

For example, n = 100 corresponds to 100 features encoded via vector quantization (“one column = one concept”). If we are allowed to represent features using up to four columns, we can encode $\binom{100}{1} + \binom{100}{2} + \binom{100}{3} + \binom{100}{4} = 4{,}087{,}975$ concepts, showing a combinatorial boost in expressive power.

5
This interpretation, which is not elaborated on here, follows from defining a diagonal positive-definite affine scaling transformation matrix W(x) by the relation
$$\Pi(x) = W^{-2}(x).$$
6

||A|| is any suitable matrix norm on A.

7
Because of the discrete-time approximation in equation 3.28, and even more generally because of numerical round-off effects in equation 3.20, a renormalization,
$$\hat{A} \leftarrow \hat{A}/\|\hat{A}\|_F,$$
is usually performed at regular intervals.
8

Matlab and C versions of extended ICA can be found on-line at http://www.cnl.salk.edu/~tewon/ICA/code.html. Matlab code for FastICA can be found at: http://www.cis.hut.fi/projects/ica/fastica/.

9

Equivalently, we want to determine the structure of elements Ȧ of the tangent space, TInline graphic, to the smooth manifold Inline graphic at the point A.

10
For example, for 0 ≤ t ≤ 1, take the t–parameterized path,
$$A(t) = \frac{(1 - t)A_{init} + tA_{final}}{\left\|(1 - t)A_{init} + tA_{final}\right\|}.$$
11

It must be the case that $\Lambda = I - \frac{|\mathbf{A}\rangle\langle\mathbf{A}|}{\|\mathbf{A}\|^2} = I - |\mathbf{A}\rangle\langle\mathbf{A}|$, using the physicist’s “bra-ket” notation (Dirac, 1958).

12

Note that in the appendix, unlike the notation used in equations 3.8 et seq., the “hat” notation has been dropped. Nonetheless, it should be understood that the quantities A and X are unknown parameters to be optimized over in equation 3.15, while the measured signal vectors Y are known.

13

The factor 12 is added for notational convenience and does not materially affect the derivation.

Contributor Information

Kenneth Kreutz-Delgado, Email: kreutz@ece.ucsd.edu, Electrical and Computer Engineering, Jacobs School of Engineering, University of California, San Diego, La Jolla, California 92093-0407, U.S.A.

Joseph F. Murray, Email: jfmurray@ucsd.edu, Electrical and Computer Engineering, Jacobs School of Engineering, University of California, San Diego, La Jolla, California 92093-0407, U.S.A.

Bhaskar D. Rao, Email: brao@ece.ucsd.edu, Electrical and Computer Engineering, Jacobs School of Engineering, University of California, San Diego, La Jolla, California 92093-0407, U.S.A.

Kjersti Engan, Email: kjersti.engan@tn.his.no, Stavanger University College, School of Science and Technology Ullandhaug, N-4091 Stavanger, Norway.

Te-Won Lee, Email: tewon@salk.edu, Howard Hughes Medical Institute, Computational Neurobiology Laboratory, Salk Institute, La Jolla, California 92037, U.S.A.

Terrence J. Sejnowski, Email: terry@salk.edu, Howard Hughes Medical Institute, Computational Neurobiology Laboratory, Salk Institute, La Jolla, California 92037, U.S.A.

References

1. Adler J, Rao B, Kreutz-Delgado K. Comparison of basis selection methods. Proceedings of the 30th Asilomar Conference on Signals, Systems, and Computers; Pacific Grove, CA: IEEE; 1996.
2. Attias H. Independent factor analysis. Neural Computation. 1999;11(4):803–851. doi: 10.1162/089976699300016458.
3. Basilevsky A. Statistical factor analysis and related methods: Theory and applications. New York: Wiley; 1994.
4. Bell A, Sejnowski T. An information-maximization approach to blind separation and blind deconvolution. Neural Computation. 1995;7:1129–1159. doi: 10.1162/neco.1995.7.6.1129.
5. Coifman R, Wickerhauser M. Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory. 1992;IT-38(2):713–718.
6. Comon P. Independent component analysis: A new concept? Signal Processing. 1994;36:287–314.
7. Deco G, Obradovic D. An information-theoretic approach to neural computing. Berlin: Springer-Verlag; 1996.
8. Dhrymes P. Mathematics for econometrics. 2nd ed. New York: Springer-Verlag; 1984.
9. Dirac PAM. The principles of quantum mechanics. Oxford: Clarendon Press; 1958.
10. Donoho D. On minimum entropy segmentation. In: Chui C, Montefusco L, Puccio L, editors. Wavelets: Theory, algorithms, and applications. San Diego, CA: Academic Press; 1994. pp. 233–269.
11. Donoho D. De-noising by soft-thresholding. IEEE Transactions on Information Theory. 1995;41:613–627.
12. Engan K. Frame based signal representation and compression. Unpublished doctoral dissertation, Stavanger University College, Norway; 2000.
13. Engan K, Rao B, Kreutz-Delgado K. Frame design using FOCUSS with method of optimal directions (MOD). Proc. NORSIG-99; Trondheim, Norway: IEEE; 1999.
14. Engan K, Rao BD, Kreutz-Delgado K. Regularized FOCUSS for subset selection in noise. Proc. NORSIG 2000; Kolmarden, Sweden: IEEE; June 2000. pp. 247–250.
15. Field D. What is the goal of sensory coding? Neural Computation. 1994;6:559–599.
16. Girolami M. A variational method for learning sparse and overcomplete representations. Neural Computation. 2001;13:2517–2532. doi: 10.1162/089976601753196003.
17. Gorodnitsky I, George J, Rao B. Neuromagnetic source imaging with FOCUSS: A recursive weighted minimum norm algorithm. Journal of Electroencephalography and Clinical Neurophysiology. 1995;95(4):231–251. doi: 10.1016/0013-4694(95)00107-a.
18. Gorodnitsky I, Rao B. Sparse signal reconstruction from limited data using FOCUSS: A re-weighted minimum norm algorithm. IEEE Transactions on Signal Processing. 1997;45(3):600–616.
19. Hansen PC, O’Leary DP. The use of the L-curve in the regularization of discrete ill-posed problems. SIAM Journal on Scientific Computing. 1993;14:1487–1503.
20. Hyvärinen A, Cristescu R, Oja E. A fast algorithm for estimating overcomplete ICA bases for image windows. Proc. Int. Joint Conf. on Neural Networks (IJCNN’99); Washington, DC; 1999. pp. 894–899.
21. Kalouptsidis N, Theodoridis S. Adaptive system identification and signal processing algorithms. Upper Saddle River, NJ: Prentice Hall; 1993.
22. Kassam S. Signal detection in non-gaussian noise. New York: Springer-Verlag; 1982.
23. Khalil H. Nonlinear systems. 2nd ed. Upper Saddle River, NJ: Prentice Hall; 1996.
24. Kreutz-Delgado K, Rao B. A general approach to sparse basis selection: Majorization, concavity, and affine scaling (full report with proofs). Center for Information Engineering Report No. UCSD-CIE-97-7-1. San Diego: University of California; 1997. Available on-line at: http://dsp.ucsd.edu/~kreutz.
25. Kreutz-Delgado K, Rao B. A new algorithm and entropy-like measures for sparse coding. Proc. 5th Joint Symp. Neural Computation; La Jolla, CA; 1998a.
26. Kreutz-Delgado K, Rao B. Gradient factorization-based algorithm for best-basis selection. Proceedings of the 8th IEEE Digital Signal Processing Workshop; New York: IEEE; 1998b.
27. Kreutz-Delgado K, Rao B. Measures and algorithms for best basis selection. Proceedings of the 1998 ICASSP; New York: IEEE; 1998c.
28. Kreutz-Delgado K, Rao B. Sparse basis selection, ICA, and majorization: Towards a unified perspective. Proc. 1999 ICASSP; Phoenix, AZ; 1999.
29. Kreutz-Delgado K, Rao B, Engan K. Novel algorithms for learning overcomplete dictionaries. Proc. 1999 Asilomar Conference; Pacific Grove, CA: IEEE; 1999.
30. Kreutz-Delgado K, Rao B, Engan K, Lee T-W, Sejnowski T. Convex/Schur-convex (CSC) log-priors and sparse coding. Proc. 6th Joint Symposium on Neural Computation; Pasadena, CA; 1999a.
31. Kreutz-Delgado K, Rao B, Engan K, Lee T-W, Sejnowski T. Learning overcomplete dictionaries: Convex/Schur-convex (CSC) log-priors, factorial codes, and independent/dependent component analysis (I/DCA). Proc. 6th Joint Symposium on Neural Computation; Pasadena, CA; 1999b.
32. Lee TW, Girolami M, Sejnowski T. Independent component analysis using an extended infomax algorithm for mixed sub-gaussian and super-gaussian sources. Neural Computation. 1999;11(2):417–441. doi: 10.1162/089976699300016719.
33. Lewicki MS, Olshausen BA. A probabilistic framework for the adaptation and comparison of image codes. Journal of the Optical Society of America A. 1999;16(7):1587–1601.
34. Lewicki MS, Sejnowski TJ. Learning overcomplete representations. Neural Computation. 2000;12(2):337–365. doi: 10.1162/089976600300015826.
35. Mallat S, Zhang Z. Matching pursuits with time-frequency dictionaries. IEEE Trans ASSP. 1993;41(12):3397–3415.
36. Marshall A, Olkin I. Inequalities: Theory of majorization and its applications. San Diego, CA: Academic Press; 1979.
37. Murray JF, Kreutz-Delgado K. An improved FOCUSS-based learning algorithm for solving sparse linear inverse problems. Conference Record of the 35th Asilomar Conference on Signals, Systems and Computers; Pacific Grove, CA: IEEE; 2001.
38. Nadal JP, Parga N. Nonlinear neurons in the low noise limit: A factorial code maximizes information transfer. Network. 1994;5:565–581.
39. Olshausen B, Field D. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature. 1996;381:607–609. doi: 10.1038/381607a0.
40. Olshausen B, Field D. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research. 1997;37:3311–3325. doi: 10.1016/s0042-6989(97)00169-7.
41. Pham D. Blind separation of instantaneous mixture of sources via an independent component analysis. Signal Processing. 1996;44(11):2768–2779.
42. Popovic Z, Sjöstrand J. Resolution, separation of retinal ganglion cells, and cortical magnification in humans. Vision Research. 2001;41:1313–1319. doi: 10.1016/s0042-6989(00)00290-x.
43. Rao B. Analysis and extensions of the FOCUSS algorithm. Proceedings of the 1996 Asilomar Conference on Circuits, Systems, and Computers; New York: IEEE; 1997. pp. 1218–1223.
44. Rao B. Signal processing with the sparseness constraint. Proc. 1998 ICASSP; New York: IEEE; 1998.
45. Rao BD, Engan K, Cotter SF, Kreutz-Delgado K. Subset selection in noise based on diversity measure minimization. 2002. Manuscript submitted for publication.
46. Rao B, Gorodnitsky I. Affine scaling transformation based methods for computing low complexity sparse solutions. Proc. 1996 ICASSP; New York: IEEE; 1997. pp. 1783–1786.
47. Rao B, Kreutz-Delgado K. Deriving algorithms for computing sparse solutions to linear inverse problems. In: Neyman J, editor. Proceedings of the 1997 Asilomar Conference on Circuits, Systems, and Computers; New York: IEEE; 1997.
48. Rao B, Kreutz-Delgado K. Basis selection in the presence of noise. Proceedings of the 1998 Asilomar Conference; New York: IEEE; 1998a.
49. Rao B, Kreutz-Delgado K. Sparse solutions to linear inverse problems with multiple measurement vectors. Proceedings of the 8th IEEE Digital Signal Processing Workshop; New York: IEEE; 1998b.
50. Rao B, Kreutz-Delgado K. An affine scaling methodology for best basis selection. IEEE Trans Signal Processing. 1999;1:187–202.
51. Roberts S. Independent component analysis: Source assessment and separation, a Bayesian approach. IEE (Brit) Proc Vis Image Signal Processing. 1998;145(3):149–154.
52. Rockafellar R. Convex analysis. Princeton, NJ: Princeton University Press; 1970.
53. Ruderman D. The statistics of natural images. Network: Computation in Neural Systems. 1994;5:517–548.
54. Wang K, Lee C, Juang B. Selective feature extraction via signal decomposition. IEEE Signal Processing Letters. 1997;4(1):8–11.
55. Watanabe S. Pattern recognition as a quest for minimum entropy. Pattern Recognition. 1981;13(5):381–387.
56. Wickerhauser M. Adapted wavelet analysis from theory to software. Wellesley, MA: A. K. Peters; 1994.
57. Zhu S, Wu Y, Mumford D. Minimax entropy principle and its application to texture modeling. Neural Computation. 1997;9:1627–1660.
58. Zibulevsky M, Pearlmutter BA. Blind source separation by sparse decomposition in a signal dictionary. Neural Computation. 2001;13(4):863–882. doi: 10.1162/089976601300014385.
