Neurocomputing. 2012 Mar 15;80(1):38–46. doi: 10.1016/j.neucom.2011.09.024

Sparse nonnegative matrix factorization with ℓ0-constraints

Robert Peharz, Franz Pernkopf
PMCID: PMC3312776  PMID: 22505792

Abstract

Although nonnegative matrix factorization (NMF) favors a sparse and part-based representation of nonnegative data, there is no guarantee for this behavior. Several authors have proposed NMF methods which enforce sparseness by constraining or penalizing the ℓ1-norm of the factor matrices. On the other hand, little work has been done using a more natural sparseness measure, the ℓ0-pseudo-norm. In this paper, we propose a framework for approximate NMF which constrains the ℓ0-norm of the basis matrix, or the coefficient matrix, respectively. For this purpose, techniques for unconstrained NMF can be easily incorporated, such as multiplicative update rules or the alternating nonnegative least-squares scheme. In experiments we demonstrate the benefits of our methods, which compete with or outperform existing approaches.

Keywords: NMF, Sparse coding, Nonnegative least squares

1. Introduction

Nonnegative matrix factorization (NMF) aims to factorize a nonnegative matrix X into a product of nonnegative matrices W and H. We can distinguish between exact NMF, i.e. X = WH, and approximate NMF, i.e. X ≈ WH. For approximate NMF, which seems more relevant for practical applications, one needs to define a divergence measure between the data X and its reconstruction WH, such as the Frobenius norm or the generalized Kullback–Leibler divergence [1]. Approaches using more general measures, such as Bregman divergences or the α-divergence, can be found in [2,3]. In this paper, we focus on approximate NMF using the Frobenius norm as objective. Let the dimensions of X, W and H be D×N, D×K, and K×N, respectively. When the columns of X are multidimensional measurements of some process, the columns of W gain the interpretation of basis vectors, while H contains the corresponding weights. The number of basis vectors (or inner approximation rank) K is typically assumed to satisfy K ≪ min(D,N). Hence, NMF is typically used as a compressive technique.

Originally proposed by Paatero and Tapper under the term positive matrix factorization [4], NMF became widely known due to the work of Lee and Seung [5,1]. One reason for its popularity is that the multiplicative update rules proposed in [1] are easy to implement. Furthermore, since these algorithms rely only on matrix multiplication and element-wise multiplication, they are fast on systems with well-tuned linear algebra methods. The main reason for its popularity, however, is that NMF tends to return a sparse and part-based representation of its input data, which makes its application interesting in areas such as computer vision [5], speech and audio processing [6–9], document clustering [10], to name but a few. This naturally occurring sparseness gives NMF a special status compared to other matrix factorization methods such as principal/independent component analysis or k-means clustering.

However, sparsity in NMF occurs as a by-product of the nonnegativity constraints, rather than being a design objective of its own. Various authors have proposed modified NMF algorithms which explicitly enforce sparseness. These methods usually penalize [11,12] or constrain [13] the ℓ1-norm of H or W, which is known to yield a sparse representation [14,15]. An explanation for the sparseness-inducing nature of the ℓ1-norm is that it can be interpreted as a convex relaxation of the ℓ0-(pseudo)-norm, i.e. the number of non-zero entries in a vector. Indeed, the ℓ0-norm is a more intuitive sparseness measure, which allows one to specify a certain number of non-zero entries, while similar statements cannot be made via the ℓ1-norm. Introducing the non-convex ℓ0-norm as a constraint function typically renders a problem NP-hard, requiring exhaustive combinatorial search. However, Vavasis [16] has shown that NMF is NP-hard per se,1 so we most probably have to accept that any practical algorithm for NMF is suboptimal. Hence, a heuristic method for NMF with ℓ0-constraints might be just as appropriate and efficient as ℓ1-sparse NMF.

Little work is concerned with ℓ0-sparse NMF. The K-SVD algorithm [17] aims to find an overcomplete dictionary for sparse representation of a set of training signals X. The algorithm minimizes the approximation error between the data X and its reconstruction WH, where ℓ0-constraints are imposed on the columns of H. The nonnegative version of K-SVD [18] additionally constrains all matrices to be nonnegative. Hence, nonnegative K-SVD can be interpreted as NMF with sparseness constraints on the columns of H. Probabilistic sparse matrix factorization (PSMF) [19] is closely related to K-SVD; however, no nonnegative PSMF has been proposed so far. Morup et al. [20] proposed an approximate NMF algorithm which constrains the ℓ0-norm of the H columns, using a nonnegative version of least angle regression and selection [21]. For the W update, they used the normalization-invariant update rule described in [12]. In [22], a method for NMF was described which penalizes a smoothed ℓ0-norm of the matrix H. Using this smooth approximation of the ℓ0-norm, they were able to derive multiplicative update rules, similar to those in [1]. More details about prior art concerned with ℓ0-sparse NMF can be found in Section 2.4.

In this paper, we propose two generic strategies to compute an approximate NMF with ℓ0-sparseness constraints on the columns of H or W, respectively. For NMF with constraints on H, the key challenge is to find a good approximate solution for the nonnegative sparse coding problem, which is NP-hard in general [23]. For sparse coding without nonnegativity constraints, a popular approximation algorithm is orthogonal matching pursuit (OMP) [24], due to its simplicity and theoretical guarantees [25,26]. Here, we show a close relation between OMP and the active-set algorithm for nonnegative least squares (NNLS) [27], which, to the best of our knowledge, has been overlooked in the literature so far. As a consequence, we propose a simple modification of NNLS, called sparse NNLS (sNNLS), which represents a natural integration of nonnegativity constraints in OMP. Furthermore, we propose an algorithm called reverse sparse NNLS (rsNNLS), which uses a reversed matching pursuit principle. This algorithm shows the best performance of all sparse coders in our experiments, and competes with or outperforms nonnegative basis pursuit (NNBP). Note that basis pursuit usually delivers better results than algorithms from the matching pursuit family, while requiring more computational resources [28,25]. For the second stage in our framework, which updates W and the non-zero coefficients in H, we show that the standard multiplicative update rules [1] can be used without any modification. Also, we propose a sparseness-maintaining active-set algorithm for NNLS, which allows us to apply an alternating least-squares scheme for ℓ0-sparse NMF.

Furthermore, we propose an algorithm for NMF with ℓ0-constrained columns in W. As far as we know, no method exists for this problem so far. The proposed algorithm follows a similar approach to ℓ1-constrained NMF [13], projecting the columns of W onto the closest vectors with the desired sparseness after each update step. In experiments, the algorithm runs much faster than the ℓ1-constrained method. Furthermore, the results of the proposed method are significantly sparser in terms of the ℓ0-norm while achieving the same reconstruction quality.

Throughout the paper, we use the following notation. Upper-case boldface letters denote matrices. For sets we use Fraktur letters, e.g. P. A lower-case boldface letter denotes a column vector. A lower-case boldface letter with a subscript index denotes a specific column of the matrix denoted by the same upper-case letter, e.g. x_i is the ith column of matrix X. Lower-case letters with subscripts denote specific entries of a vector or a matrix, e.g. x_i is the ith entry of x and x_{i,j} is the entry in the ith row and the jth column of X. A matrix symbol subscripted with a set symbol denotes the submatrix consisting of the columns which are indexed by the elements of the set, e.g. W_P is the sub-matrix of W containing the columns indexed by P. Similarly, a vector subscripted by a set symbol is the sub-vector containing the entries indexed by the set. With ‖·‖_p, p ≥ 1, we denote the p-norm: ‖x‖_p = (Σ_i |x_i|^p)^{1/p}. Further, ‖·‖_0 denotes the ℓ0-pseudo-norm, i.e. the number of non-zero entries in the argument. The Frobenius norm is defined as ‖X‖_F = (Σ_{i,j} |x_{i,j}|²)^{1/2}.

The paper is organized as follows. In Section 2 we discuss NMF techniques related to our work. We present our framework for ℓ0-sparse NMF in Section 3. Experiments are presented in Section 4 and Section 5 concludes the paper.

2. Related work

Let us formalize NMF as the following optimization problem:

minimize_{W,H} ‖X − WH‖_F   subject to   W ≥ 0, H ≥ 0,   (1)

where ≥ denotes the element-wise greater-or-equal operator.

2.1. NMF via multiplicative updates

Lee and Seung [1] showed that ‖X − WH‖_F is nonincreasing under the multiplicative update rules

H ← H ⊙ (WᵀX) / (WᵀWH)   (2)

and

W ← W ⊙ (XHᵀ) / (WHHᵀ),   (3)

where ⊙ and / denote element-wise multiplication and division, respectively. Obviously, these update rules preserve nonnegativity of W and H, given that X is element-wise nonnegative.
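For concreteness, the following minimal numpy sketch (not taken from the paper; the function name is ours, and the small constant added to the denominators is a common practical safeguard against division by zero) shows one pass of the update rules (2) and (3):

import numpy as np

def multiplicative_update(X, W, H, eps=1e-12):
    # One pass of the multiplicative updates (2) and (3); eps avoids division by zero.
    H = H * (W.T @ X) / (W.T @ W @ H + eps)   # rule (2)
    W = W * (X @ H.T) / (W @ H @ H.T + eps)   # rule (3)
    return W, H

Iterating this function drives ‖X − WH‖_F downwards (up to the negligible effect of eps).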

2.2. NMF via alternating nonnegative least squares

Paatero and Tapper [4] originally suggested to solve (1) alternately for W and H, i.e. to iterate the following two steps:

H ← argmin_H ‖X − WH‖_F   s.t.   H ≥ 0,
W ← argmin_W ‖X − WH‖_F   s.t.   W ≥ 0.

Note that ‖X − WH‖_F is convex in either W or H, respectively, but non-convex in W and H jointly. An optimal solution for each sub-problem is found by solving a nonnegatively constrained least-squares problem (NNLS) with multiple right hand sides (i.e. one for each column of X). Consequently, this scheme is called alternating nonnegative least-squares (ANLS). For this purpose, we can use the well known active-set algorithm by Lawson and Hanson [27], which for convenience is shown in Algorithm 1. The symbol † denotes the pseudo-inverse.

Algorithm 1

Active-set NNLS [27].

1: Z = {1, …, K}, P = ∅
2: h = 0
3: r = x − Wh
4: a = Wᵀr
5: while |Z| > 0 and ∃ i ∈ Z: a_i > 0 do
6:  i = argmax_i a_i
7:  Z ← Z \ {i}
8:  P ← P ∪ {i}
9:  z_P = W_P^† x
10:  z_Z = 0
11:  while ∃ j ∈ P: z_j ≤ 0 do
12:   α = min_{k ∈ P: z_k ≤ 0} h_k / (h_k − z_k)
13:   h ← h + α (z − h)
14:   Z ← {i | h_i = 0}
15:   P ← {i | h_i > 0}
16:   z_P = W_P^† x
17:   z_Z = 0
18:  end while
19:  h ← z
20:  r = x − Wh
21:  a = Wᵀr
22: end while

This algorithm solves NNLS for a single right hand side, i.e. it returns argmin_h ‖x − Wh‖_2 s.t. h ≥ 0, for an arbitrary vector x. The active set Z and the in-active set P disjointly partition the indices of the entries of h. Entries with indices in Z are held at zero (i.e. the nonnegativity constraints are active), while entries with indices in P have positive values. In the outer loop of Algorithm 1, indices are moved from Z to P, until an optimal solution is obtained (cf. [27] for details). The inner loop (steps 11–18) corrects the tentative, possibly infeasible (i.e. negative) solution z to a feasible one. Note that i is guaranteed to be an element of Z in step 6, since the residual x − Wh is orthogonal to every column in W_P, and hence a_P = 0. Lawson and Hanson [27] showed that the outer loop has to terminate eventually, since the residual error strictly decreases in each iteration, which implies that no in-active set P is considered twice. Unfortunately, there is no polynomial runtime guarantee for NNLS. However, they report that the algorithm typically behaves well in practice.

For ANLS, one can apply Algorithm 1 to all columns of X independently, in order to solve for H.2 However, a more efficient variant of NNLS with multiple right hand sides was proposed in [29]. This algorithm executes Algorithm 1 quasi-parallel for all columns in X, and solves the least-squares step jointly for all columns sharing the same in-active set P. Kim and Park [30] applied this more efficient variant to ANLS-NMF. Alternatively, NNLS can be solved using numerical approaches, such as the projected gradient algorithm proposed in [31]. For the remainder of the paper, we use the notation h = NNLS(x, W) to state that x is approximated by Wh in the NNLS sense. Similarly, H = NNLS(X, W) denotes the solution of an NNLS problem with multiple right hand sides.
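As an illustration, the following sketch (ours) implements the ANLS scheme of Section 2.2 with SciPy's scipy.optimize.nnls, which implements the Lawson–Hanson active-set algorithm; the simple column-wise loops stand in for the fast combinatorial variant [29]:

import numpy as np
from scipy.optimize import nnls

def anls_nmf(X, K, num_iter=50, seed=0):
    # Alternating NNLS for problem (1): each sub-problem is solved column-wise.
    rng = np.random.default_rng(seed)
    D, N = X.shape
    W = rng.random((D, K))
    H = np.zeros((K, N))
    for _ in range(num_iter):
        for n in range(N):                 # H-step: H = NNLS(X, W)
            H[:, n], _ = nnls(W, X[:, n])
        for d in range(D):                 # W-step: W^T = NNLS(X^T, H^T)
            W[d, :], _ = nnls(H.T, X[d, :])
    return W, H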

2.3. Sparse NMF

Various extensions have been proposed in order to incorporate sparsity in NMF, where sparseness is typically measured via some function of the ℓ1-norm. Hoyer [11] proposed an algorithm to minimize the objective ‖X − WH‖²_F + λ Σ_{i,j} |h_{i,j}|, which penalizes the ℓ1-norm of the coefficient matrix H. Eggert and Koerner [12] used the same objective, but proposed an alternative update which implicitly normalizes the columns of W to unit length. Furthermore, Hoyer [13] defined the following sparseness function for an arbitrary vector x:

sparseness(x) = (√D − ‖x‖_1/‖x‖_2) / (√D − 1),   (4)

where D is the dimensionality of x. Indeed, sparseness(x) is 0, if all entries of x are non-zero and their absolute values are all equal, and 1 when only one entry is non-zero. For all other x, the function smoothly interpolates between these extreme cases. Hoyer provided an NMF algorithm which constrains the sparseness of the columns of W, the rows of H, or both, to any desired sparseness value according to (4). There are further approaches which aim to achieve a part-based and sparse representation, such as local NMF [32] and non-smooth NMF [33].
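A direct transcription of (4) (a small sketch of ours; it assumes x is not the all-zero vector) reads:

import numpy as np

def hoyer_sparseness(x):
    # Sparseness measure (4): 0 for a constant vector, 1 if only one entry is non-zero.
    x = np.asarray(x, dtype=float)
    D = x.size
    return (np.sqrt(D) - np.linalg.norm(x, 1) / np.linalg.norm(x, 2)) / (np.sqrt(D) - 1)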

2.4. Prior art for ℓ0-sparse NMF

As mentioned in the Introduction, relatively few approaches exist for ℓ0-sparse NMF. The K-SVD algorithm [17] aims to minimize_{W,H} ‖X − WH‖_F, subject to ‖h_i‖_0 ≤ L, ∀i, where L ∈ ℕ is the maximal number of non-zero coefficients per column of H. Hence, the nonnegative version of K-SVD (NNK-SVD) [18] can be considered as an NMF algorithm with ℓ0-sparseness constraints on the columns of H. However, the sparse coding stage in nonnegative K-SVD is rather an ad hoc solution, using an approximate version of nonnegative basis pursuit [28]. For the W update stage, the K-SVD dictionary update is modified, by simply truncating negative values to zero after each iteration of an SVD approximation.

In [20], an algorithm for NMF with ℓ0-sparseness constraints on the H columns was proposed. For the sparse coding stage, they used a nonnegative version of least angle regression and selection (LARS) [21], called NLARS. This algorithm returns a so-called solution path of the ℓ1-regularized objective ‖x − Wh‖²_2 + λ‖h‖_1 using an active-set algorithm, i.e. it returns several solution vectors h with varying λ. For a specific column x out of X, one takes the solution h with the desired ℓ0-sparseness (when there are several such vectors, one selects the solution with the smallest regularization parameter λ). Repeating this for each column of X, one obtains a nonnegative coding matrix with ℓ0-sparseness on its columns. To update W, the authors used the self-normalizing multiplicative update rule described in [12].

In [22], the objective ‖X − WH‖²_F + α Σ_{i,j} f_σ(h_{i,j}) is considered, where f_σ(h) = 1 − exp(−h²/(2σ²)). For σ → 0, the second term in the objective converges to α times the number of non-zero entries in H. Contrary to the first two approaches, which constrain the ℓ0-norm, this method calculates an NMF which penalizes the smoothed ℓ0-norm, where the penalization strength is controlled with the trade-off parameter α. Therefore, the latter approach proceeds similarly to the NMF methods which penalize the ℓ1-norm [11,12]. However, note that a trade-off parameter for a penalization term is generally not easy to choose, while ℓ0-sparseness constraints have an immediate meaning.

3. Sparse NMF with ℓ0-constraints

We now introduce our methods for NMF with ℓ0-sparseness constraints on the columns of W and H, respectively. Formally, we consider the problems

minimize_{W,H} ‖X − WH‖_F   subject to   W ≥ 0, H ≥ 0, ‖h_i‖_0 ≤ L ∀i   (5)

and

minimize_{W,H} ‖X − WH‖_F   subject to   W ≥ 0, H ≥ 0, ‖w_i‖_0 ≤ L ∀i.   (6)

We refer to problems (5) and (6) as NMF ℓ0-H and NMF ℓ0-W, respectively. The parameter L ∈ ℕ is the maximal allowed number of non-zero entries in w_i or h_i.

For NMF ℓ0-H, the sparseness constraints imply that each column in X is represented by a conical combination of at most L nonnegative basis vectors. When we interpret the columns of W as features, this means that each data sample is represented by at most L features, where typically L ≪ K. Nevertheless, if the reconstruction error is small, this implies that the extracted features are important to some extent. Furthermore, as noted in the Introduction, for unconstrained NMF it is typically assumed that K ≪ min(D,N). We do not have to make this restriction for NMF ℓ0-H, and can even choose K > D, i.e. we can allow an overcomplete basis matrix W (however, we require K < N).

For NMF ℓ0-W, the sparseness constraints enforce basis vectors with limited support. If for example the columns of X contain image data, sparseness constraints on W encourage a part-based representation.

NMF algorithms usually proceed in a two-stage iterative manner, i.e. they alternately update W and H (cf. Section 2). We apply the same principle to NMF ℓ0-H and NMF ℓ0-W, where we take care that the sparseness constraints are maintained.

3.1. NMF ℓ0-H

A generic alternating update scheme for NMF ℓ0-H is illustrated in Algorithm 2. In the first stage, the sparse coding stage, we aim to solve the nonnegative sparse coding problem. Unfortunately, the sparse coding problem is NP-hard [23], and an approximation is required. We discuss several approaches in Section 3.1.1. In the second stage we aim to enhance the basis matrix W. Here we allow that non-zero values in H are adapted during this step, but we do not allow that zero values become non-zero, i.e. we require that the sparse structure of H is maintained. We will see in Section 3.1.2 that all NMF techniques discussed so far can be used for this purpose. Dependent on the methods used for each stage, we can derive different algorithms from the generic scheme. Note that NNK-SVD [18] and the ℓ0-constrained NMF proposed by Morup et al. [20] also follow this framework.

Algorithm 2

NMF ℓ0-H.

1: Initialize W randomly
2: for i = 1 : numIter do
3:  Nonnegative Sparse Coding: Sparsely encode data X, using the fixed basis matrix W, resulting in a sparse, nonnegative matrix H.
4:  Basis Matrix Update: Enhance the basis matrix W and coding matrix H, maintaining the sparse structure of H.
5: end for
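The scheme can be made concrete with a few lines of numpy. The sketch below (ours, not code from the paper) plugs an arbitrary sparse coder into stage 1 and uses the multiplicative rules of Section 2.1 for stage 2, which, as discussed in Section 3.1.2, automatically keep the zeros of H fixed:

import numpy as np

def nmf_l0_H(X, K, L, sparse_coder, num_iter=30, inner_iter=10, seed=0, eps=1e-12):
    # sparse_coder(X, W, L) must return a nonnegative H with at most L non-zeros
    # per column (e.g. sNNLS or rsNNLS applied column-wise).
    rng = np.random.default_rng(seed)
    D, N = X.shape
    W = rng.random((D, K))
    H = np.zeros((K, N))
    for _ in range(num_iter):
        H = sparse_coder(X, W, L)                     # stage 1: nonnegative sparse coding
        for _ in range(inner_iter):                   # stage 2: basis matrix update
            H = H * (W.T @ X) / (W.T @ W @ H + eps)   # zeros in H remain zero
            W = W * (X @ H.T) / (W @ H @ H.T + eps)
    return W, H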

3.1.1. Nonnegative sparse coding

The nonnegative sparse coding problem is formulated as

minimize_H ‖X − WH‖_F   subject to   H ≥ 0, ‖h_i‖_0 ≤ L ∀i.   (7)

Without loss of generality, we assume that the columns of W are normalized to unit length. A well known and simple sparse coding technique without nonnegativity constraints is orthogonal matching pursuit (OMP) [24], which is shown in Algorithm 3. OMP is a popular technique, due to its simplicity, its low computational complexity, and its theoretical optimality bounds [25,26].

Algorithm 3

Orthogonal matching pursuit (OMP).

1: r ← x
2: h = 0
3: P = ∅
4: for l = 1 : L do
5:  a = Wᵀr
6:  i = argmax_i |a_i|
7:  P ← P ∪ {i}
8:  h_P ← W_P^† x
9:  r ← x − W_P h_P
10: end for

In each iteration, OMP selects the basis vector which reduces the residual r most (steps 6 and 7). After each selection step, OMP projects the data vector into the space spanned by the basis vectors selected so far (step 8).
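In numpy, OMP can be sketched as follows (our transcription of Algorithm 3; it assumes unit-norm columns of W):

import numpy as np

def omp(x, W, L):
    # Orthogonal matching pursuit (Algorithm 3).
    K = W.shape[1]
    h = np.zeros(K)
    P = []                                            # indices of selected basis vectors
    r = x.copy()
    for _ in range(L):
        a = W.T @ r
        i = int(np.argmax(np.abs(a)))                 # step 6: most correlated atom
        if i not in P:                                # safeguard; in exact arithmetic
            P.append(i)                               # the selected atom is always new
        h[:] = 0.0
        h[P] = np.linalg.lstsq(W[:, P], x, rcond=None)[0]   # step 8: project onto span
        r = x - W[:, P] @ h[P]                        # step 9: new residual
    return h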

Several authors proposed nonnegative variants of OMP [26,34,35], where all of them replace step 6 with i = argmax_i a_i, i.e. the absolute value function is removed in order to select a basis vector with a positive coefficient. The second point where nonnegativity can be violated is in step 8, the projection step. Bruckstein et al. [26] used NNLS [27] (Algorithm 1) instead of ordinary least squares, in order to maintain nonnegativity. Since this variant is very slow, we used the multiplicative NMF update rule for H (cf. (2)), in order to approximate NNLS [35]. Yang et al. [34] left step 8 unchanged, which violates nonnegativity in general.

However, there is a more natural approach for nonnegative OMP: note that the body of the outer loop of the active-set algorithm for NNLS (Algorithm 1) and of OMP (Algorithm 3) perform identical computations, except that OMP selects i = argmax_i |a_i|, while NNLS selects i = argmax_i a_i, exactly as in the nonnegative OMP variants proposed so far. NNLS additionally contains the inner loop (steps 11–18) to correct a possibly negative solution. Hence, a straightforward modification for nonnegative sparse coding, which we call sNNLS (and which could equally well be called nonnegative OMP), is simply to stop NNLS as soon as L basis vectors have been selected (i.e. as soon as |P| = L in Algorithm 1). Note that NNLS can also terminate before it selects L basis vectors. In this case, however, the solution is guaranteed to be optimal [27]. Hence, sNNLS is guaranteed to find either a nonnegative solution with exactly L positive coefficients, or an optimal nonnegative solution with fewer than L positive coefficients. Note further that OMP and NNLS have been proposed in distinct communities, namely sparse representation versus nonnegatively constrained least squares. It seems that this connection has been missed so far. The computational complexity of sNNLS is upper bounded by that of the original NNLS algorithm [27] (see Section 2.2), since it only adds a stopping criterion (|P| = L).
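The following sketch (ours) spells out sNNLS as the active-set algorithm of Algorithm 1 with the additional stopping criterion |P| = L; degenerate cases of the inner correction loop are not handled:

import numpy as np

def snnls(x, W, L, tol=1e-10):
    # Sparse NNLS: Lawson-Hanson NNLS, stopped once the in-active set P holds L entries.
    K = W.shape[1]
    h = np.zeros(K)
    P = np.zeros(K, dtype=bool)                       # True = in-active (positive entry)
    a = W.T @ (x - W @ h)
    while P.sum() < L and (~P).any() and np.max(a[~P]) > tol:
        i = int(np.argmax(np.where(~P, a, -np.inf)))  # entry with most positive gradient
        P[i] = True
        z = np.zeros(K)
        z[P] = np.linalg.lstsq(W[:, P], x, rcond=None)[0]
        while (z[P] <= tol).any():                    # inner correction loop (steps 11-18)
            mask = P & (z <= tol)
            alpha = np.min(h[mask] / (h[mask] - z[mask]))
            h = h + alpha * (z - h)
            P = h > tol
            z = np.zeros(K)
            z[P] = np.linalg.lstsq(W[:, P], x, rcond=None)[0]
        h = z
        a = W.T @ (x - W @ h)
    return h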

We further propose a variant of matching pursuit which does not only apply to the nonnegative framework, but to the matching pursuit principle in general. We call this variant reverse matching pursuit: instead of adding basis vectors to an initially empty set, we remove basis vectors from an optimal, non-sparse solution. For nonnegative sparse coding, this method is illustrated in Algorithm 4, to which we refer as reverse sparse NNLS (rsNNLS). The algorithm starts with an optimal, non-sparse solution in step 1. While the ℓ0-norm of the solution is greater than L, the smallest entry in the solution vector is set to zero and its index is moved from the in-active set P to the active set Z (steps 4–7). Subsequently, the data vector is approximated in the NNLS sense by the remaining basis vectors in P, using steps 9–19 of Algorithm 1 (inner correction loop), where possibly additional basis vectors are removed from P. For simplicity, Algorithm 4 is shown for a single data vector x and corresponding output vector h. An implementation for data matrices X, using the fast combinatorial NNLS algorithm [29], is straightforward. As for sNNLS, the computational complexity of rsNNLS is not worse than that of active-set NNLS [27] (Section 2.2), since in each iteration one index is irreversibly removed from P. The NNLS algorithm (step 1) needs at least the same number of iterations to build the non-sparse solution. Hence, rsNNLS needs at most twice as many operations as NNLS.

Algorithm 4

Reverse sparse NNLS (rsNNLS).

1: h = NNLS(x, W)
2: Z = {i | h_i = 0}, P = {i | h_i > 0}
3: while ‖h‖_0 > L do
4:  j = argmin_{i ∈ P} h_i
5:  h_j ← 0
6:  Z ← Z ∪ {j}
7:  P ← P \ {j}
8:  Perform steps 9–19 of Algorithm 1.
9: end while
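A compact sketch of rsNNLS (ours) is given below. For simplicity it starts from SciPy's NNLS solver and, instead of literally re-using the inner correction loop of Algorithm 1, it re-solves NNLS restricted to the remaining support after each removal, which likewise keeps the solution feasible and may drop further coefficients:

import numpy as np
from scipy.optimize import nnls

def rsnnls(x, W, L, tol=1e-10):
    # Reverse sparse NNLS: start from the optimal non-sparse solution and remove
    # the smallest positive coefficient until at most L non-zeros remain.
    h, _ = nnls(W, x)                                 # step 1: optimal, non-sparse solution
    while np.count_nonzero(h > tol) > L:
        P = np.flatnonzero(h > tol)
        j = P[np.argmin(h[P])]                        # smallest positive entry
        h[j] = 0.0
        P = np.flatnonzero(h > tol)
        h[:] = 0.0
        h[P], _ = nnls(W[:, P], x)                    # refit on the remaining support
    return h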

The second main approach for sparse coding without nonnegativity constraints is basis pursuit (BP) [28], which relaxes the ℓ0-norm with the convex ℓ1-norm. As for matching pursuit, there exist theoretical optimality guarantees and error bounds for BP [36,15]. Typically, BP produces better results than OMP [28,25]. Nonnegative BP (NNBP) can be formulated as

minimize_h ‖x − Wh‖_2 + λ‖h‖_1   subject to   h ≥ 0,   (8)

where λ controls the trade-off between reconstruction quality and sparseness. Alternatively, we can formulate NNBP as

minimize_h ‖h‖_1   subject to   ‖x − Wh‖_2 ≤ ϵ, h ≥ 0,   (9)

where ϵ is the desired maximal reconstruction error. Formulation (9) is more convenient than (8), since the parameter ϵ is more intuitive and easier to choose than λ. Both (8) and (9) are convex optimization problems and can readily be solved [37]. To obtain an ℓ0-sparse solution from a solution of (8) or (9), we select the L basis vectors with the largest coefficients in h and calculate new, optimal coefficients for these basis vectors, using NNLS. All other coefficients are set to zero. The whole procedure is repeated for each column in X, in order to obtain a coding matrix H for problem (7). NNK-SVD follows this approach for nonnegative sparse coding, using the algorithm described in [11] as an approximation for problem (8).
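As an illustration of this two-step procedure, the sketch below (ours) solves formulation (9) with the cvxpy modelling package (our choice here; the paper itself uses an interior point solver [38]) and then performs the ℓ0 truncation and the NNLS refit; it assumes that (9) is feasible for the chosen ϵ:

import numpy as np
import cvxpy as cp
from scipy.optimize import nnls

def nnbp_l0(x, W, L, eps):
    # Solve (9): minimize ||h||_1 subject to ||x - W h||_2 <= eps, h >= 0,
    # then keep the L largest coefficients and refit them with NNLS.
    K = W.shape[1]
    h = cp.Variable(K, nonneg=True)
    problem = cp.Problem(cp.Minimize(cp.norm(h, 1)),
                         [cp.norm(x - W @ h, 2) <= eps])
    problem.solve()
    h_bp = np.asarray(h.value).ravel()
    P = np.argsort(h_bp)[-L:]                         # indices of the L largest coefficients
    h_sparse = np.zeros(K)
    h_sparse[P], _ = nnls(W[:, P], x)                 # optimal refit on the selected atoms
    return h_sparse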

3.1.2. Enhancing the basis matrix

Once we have obtained a sparse encoding matrix H, we aim to enhance the basis matrix W (step 4 in Algorithm 2). We also allow the non-zero values in H to vary, but require that the coding scheme is maintained, i.e. that the “zeros” in H have to remain zero during the enhancement stage.

For this purpose, we can use a technique for unconstrained NMF as discussed in Section 2. Note that the multiplicative update rules (2) and (3) can be applied without any modification, since they consist of element-wise multiplications of the old factors with some update term. Hence, if an entry in H is zero before the update, it remains zero after the update. Nevertheless, the multiplicative updates do not increase ‖X − WH‖_F, and typically reduce the objective. Therefore, step 4 in Algorithm 2 can be implemented by executing (2) and (3) for several iterations.

We can also perform an update according to ANLS. Since there are no constraints on the basis matrix, we can proceed exactly as in Section 2.2 in order to update W. To update H, we have to make some minor modifications of the fast combinatorial NNLS algorithm [29], in order to maintain the sparse structure of H. Let Z̄ be the set of indices of the zero entries in H after the sparse coding stage. We have to take care that these entries remain zero during the active-set algorithm, i.e. we simply do not allow an entry whose index is in Z̄ to be moved to the in-active set. The convergence criterion of the algorithm (cf. [27]) has to be evaluated considering only entries whose indices are not in Z̄. Similarly, we can modify numerical approaches for NNLS such as the projected gradient algorithm [31]. Generally, the W and H matrices of the previous iteration should be used as the initial solution for each ANLS iteration, which significantly enhances the overall speed of NMF ℓ0-H.
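A simple per-column sketch of this sparseness-maintaining H update (ours; the fast combinatorial algorithm [29] would batch the least-squares solves) looks as follows:

import numpy as np
from scipy.optimize import nnls

def update_H_keep_zeros(X, W, H):
    # ANLS-style H update with a fixed zero pattern: every column is re-solved by NNLS
    # restricted to the columns of W corresponding to its current non-zeros, so entries
    # indexed by Z-bar can never become non-zero.
    H_new = np.zeros_like(H)
    for n in range(X.shape[1]):
        P = np.flatnonzero(H[:, n] > 0)
        if P.size > 0:
            H_new[P, n], _ = nnls(W[:, P], X[:, n])
    return H_new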

3.2. NMF ℓ0-W

We now address problem NMF ℓ0-W (6), where we again follow a two-stage iterative approach, as illustrated in Algorithm 5. We first calculate an optimal, unconstrained solution for the basis matrix W (with fixed H) in step 3. Next, we project the basis vectors onto the closest nonnegative vectors in Euclidean space satisfying the desired ℓ0-constraints. This step is easy, since we simply have to delete all entries except the L largest ones. Step 7 enhances H, where also the non-zero entries of W can be adapted, but maintaining the sparse structure of W. As in Section 3.1.2, we can use the multiplicative update rules due to their sparseness-maintaining property. Alternatively, we can also use the ANLS scheme, similar as in Section 3.1.2. Our framework is inspired by Hoyer's ℓ1-sparse NMF [13], which uses gradient descent to minimize the Frobenius norm. After each gradient step, that algorithm projects the basis vectors onto the closest vector with the desired ℓ1-sparseness (cf. (4)).

Algorithm 5

NMF ℓ0-W.

1: Initialize H randomly
2: for i = 1 : numIter do
3:  Wᵀ = NNLS(Xᵀ, Hᵀ)
4:  for j = 1 : K do
5:   Set the D − L smallest values in w_j to zero
6:  end for
7:  Coding Matrix Update: Enhance the coding matrix H and basis matrix W, maintaining the sparse structure of W.
8: end for
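A minimal numpy sketch of Algorithm 5 (ours; it uses SciPy's NNLS column-wise for step 3 and the multiplicative rules for step 7, which cannot re-activate zeros in W):

import numpy as np
from scipy.optimize import nnls

def nmf_l0_W(X, K, L, num_iter=30, inner_iter=10, seed=0, eps=1e-12):
    rng = np.random.default_rng(seed)
    D, N = X.shape
    W = np.zeros((D, K))
    H = rng.random((K, N))
    for _ in range(num_iter):
        for d in range(D):                            # step 3: W^T = NNLS(X^T, H^T)
            W[d, :], _ = nnls(H.T, X[d, :])
        for j in range(K):                            # steps 4-6: keep the L largest entries
            W[np.argsort(W[:, j])[:D - L], j] = 0.0
        for _ in range(inner_iter):                   # step 7: enhance H (and non-zeros of W)
            H = H * (W.T @ X) / (W.T @ W @ H + eps)
            W = W * (X @ H.T) / (W @ H @ H.T + eps)
    return W, H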

4. Experiments

4.1. Nonnegative sparse coding

In this section we empirically compare the nonnegative sparse coding techniques discussed in this paper. To be able to evaluate the quality of the sparse coders, we created synthetic sparse data as follows. We considered random overcomplete basis matrices with D = 100 dimensions, containing K ∈ {200, 400, 800} basis vectors, respectively. For each K, we generated 10 random basis matrices using isotropic Gaussian noise; then the sign was discarded and each vector was normalized to unit length. Further, for each basis matrix, we generated “true” K×100 sparse encoding matrices H with sparseness factors L ∈ {5, 10, …, 50}, i.e. we varied the sparseness from L = 5 (very sparse) to L = 50 (rather dense). The values of the non-zero entries in H were the absolute values of samples from a Gaussian distribution with standard deviation 10. The sparse synthetic data X was generated as X = WH. We executed NMP [35], NNBP, sNNLS, rsNNLS and NLARS3 [20] on each data set. For NNBP we used formulation (9), where ϵ was chosen such that an SNR of 120 dB was achieved, and we used an interior point solver [38] for optimization. All algorithms were executed on a quad-core processor (3.2 GHz, 12 GB memory).
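The data generation recipe can be reproduced roughly as follows (a sketch under our reading of the description above; the default arguments correspond to one of the settings):

import numpy as np

def make_sparse_data(D=100, K=200, N=100, L=10, seed=0):
    # Nonnegative, unit-norm random basis and an L-sparse nonnegative coding matrix
    # whose non-zero values are absolute Gaussian samples with standard deviation 10.
    rng = np.random.default_rng(seed)
    W = np.abs(rng.standard_normal((D, K)))
    W /= np.linalg.norm(W, axis=0, keepdims=True)
    H = np.zeros((K, N))
    for n in range(N):
        support = rng.choice(K, size=L, replace=False)
        H[support, n] = np.abs(rng.standard_normal(L)) * 10.0
    return W, H, W @ H                                # "true" factors and data X = W H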

Fig. 1 shows the performance of the sparse coders in terms of reconstruction quality (SNR = 10 log₁₀(‖X‖²_F / ‖X − WH‖²_F) dB), percentage of correctly identified basis vectors, and runtime, averaged over the 10 independent data sets per combination of K and L, where the SNR was averaged in the linear domain (i.e. arithmetic mean). Curve parts outside the plot area correspond to SNR values larger than 120 dB (which we considered as perfect reconstruction). NLARS clearly shows the worst performance, while requiring the most computational time after NNBP. sNNLS performs consistently better than NMP, while being approximately as efficient in terms of computational requirements. The proposed rsNNLS clearly shows the best performance, except that it selects slightly fewer correct basis vectors than NNBP for K = 800 and L > 40. Note that NNBP requires by far the most computational time,4 and that the effort for NNBP depends heavily on the number of basis vectors K. rsNNLS is orders of magnitude faster than NNBP.

Fig. 1. Results of sparse coders, applied to sparse synthetic data. All results are shown as a function of the number of non-zero coefficients L (sparseness factor), and are averaged over 10 independent data sets. First row: reconstruction quality in terms of SNR. Markers outside the plot area correspond to perfect reconstruction (>120 dB). Second row: percentage of correctly identified basis vectors. Third row: runtime on a logarithmic scale. Abbreviations: nonnegative matching pursuit (NMP) [35], nonnegative basis pursuit (NNBP) [28,18], sparse nonnegative least squares (sNNLS) (this paper), reverse sparse nonnegative least squares (rsNNLS) (this paper), nonnegative least angle regression and selection (NLARS) [20].

We repeated the experiment with positive, uniformly distributed noise added to the synthetic sparse data, such that SNR = 10 dB. Note that the choice of ϵ for NNBP is now more difficult: a too small ϵ renders problem (9) infeasible, while a too large ϵ delivers poor results. Therefore, for each column in X, we initially selected ϵ according to SNR = 19 dB. For the cases where problem (9) turned out to be infeasible, we relaxed the SNR constraint by 3 dB until a feasible problem was obtained. The results of the experiment with noisy data are shown in Fig. 2. As expected, all sparse coders achieve a lower reconstruction quality and identify fewer correct basis vectors than in the noise-free case. The proposed rsNNLS shows the best performance for noisy data.

Fig. 2. Results of sparse coders, applied to sparse synthetic data contaminated with noise (SNR = 10 dB). All results are shown as a function of L and are averaged over 10 independent data sets. First row: reconstruction quality in terms of SNR. Second row: percentage of correctly identified basis vectors. Third row: runtime on a logarithmic scale. Abbreviations: nonnegative matching pursuit (NMP) [35], nonnegative basis pursuit (NNBP) [28,18], sparse nonnegative least squares (sNNLS) (this paper), reverse sparse nonnegative least squares (rsNNLS) (this paper), nonnegative least angle regression and selection (NLARS) [20].

4.2. NMF ℓ0-H applied to spectrogram data

In this section, we compare methods for the update stage in NMF ℓ0-H (Algorithm 2). As data we used a magnitude spectrogram of 2 min of speech from the database by Cooke et al. [39]. The speech was sampled at 8 kHz and a 512-point FFT with an overlap of 256 samples was used. The data matrix finally had dimensions 257×3749. We executed NMF ℓ0-H for 200 iterations, where rsNNLS was used as the sparse coder. We compared the update method from NNK-SVD [18], the multiplicative update rules [1] (Section 3.1.2), and ANLS using the sparseness-maintaining active-set algorithm (Section 3.1.2). Note that these methods are difficult to compare, since one update iteration of ANLS fully optimizes first W, then H, while one iteration of NNK-SVD or of the multiplicative rules only reduces the objective, which is significantly faster. Therefore, we proceeded as follows: we executed 10 (inner) ANLS iterations per (outer) NMF ℓ0-H iteration, and recorded the required time. Then we executed the versions using NNK-SVD and the multiplicative updates, respectively, where both were allowed to update W and H for the same amount of time as ANLS in each outer iteration. Since NNK-SVD updates the columns of W separately [18], we let NNK-SVD update each column for a Kth fraction of the ANLS time.

For the number of basis vectors K and the sparseness value L, we used each combination of K ∈ {100, 250, 500} and L ∈ {5, 10, 20}. Since the error depends strongly on K and L, we calculated the root mean square error (RMSE) of each update method relative to the error of the ANLS method, and averaged over K and L: RMSE_rel(i) = (1/9) Σ_{K,L} ‖X − W(K,L,i) H(K,L,i)‖_F / ‖X − W_ANLS(K,L,i) H_ANLS(K,L,i)‖_F, where i denotes the iteration number of NMF ℓ0-H, and W(K,L,i) and H(K,L,i) are the factor matrices of the method under test in iteration i, for parameters K and L. Similarly, W_ANLS(K,L,i) and H_ANLS(K,L,i) are the factor matrices of the ANLS update method. Fig. 3 shows the averaged relative RMSE for each update method as a function of i. The ANLS approach shows the best performance. The multiplicative update rules are a constant factor worse than ANLS, but perform better than NNK-SVD. After 200 iterations, the multiplicative update rules and NNK-SVD achieve an average error which is approximately 1.35% and 1.9% worse than the ANLS error, respectively. NNK-SVD converges more slowly than the other update methods in the first 50 iterations.

Fig. 3. Averaged relative RMSE (compared to the ANLS update) of the multiplicative update rules (MU) and nonnegative K-SVD (NNK-SVD) over the number of NMF ℓ0-H iterations. The shaded bars correspond to the standard deviation.

4.3. NMF ℓ0-W applied to face images

Lee and Seung [5] reported that NMF returns a part-based representation when applied to a database of face images. However, as noted by Hoyer [13], this effect does not occur when the face images are not aligned, as in the ORL database [40]. In this case, the representation happens to be rather global and holistic. In order to enforce a part-based representation, Hoyer constrained the basis vectors to be sparse. Similarly, we applied NMF ℓ0-W to the ORL database, where we constrained the basis vectors to have sparseness values L corresponding to 33%, 25% and 10% of the total number of pixels per image (denoted as sparseness classes a, b and c, respectively). As in [13], we trained 25 basis vectors per sparseness value. We executed the algorithm for 30 iterations, using 10 ANLS coding matrix updates per iteration. In order to compare our results to ℓ1-sparse NMF [13], we calculated the average ℓ1-sparseness (using (4)) of our basis vectors. We then executed ℓ1-sparse NMF on the same data, where we required the basis vectors to have the same ℓ1-sparseness as our NMF ℓ0-W basis vectors. We executed this algorithm for 2500 iterations, which were necessary for convergence. Fig. 4 shows the resulting basis vectors returned by NMF ℓ0-W and ℓ1-sparse NMF, where dark pixels indicate high values and white pixels indicate low values.

Fig. 4. Top: basis images trained by NMF ℓ0-W. Bottom: basis images trained by ℓ1-sparse NMF. Sparseness: 33% (a), 25% (b), 10% (c), 52.4% (d), 43% (e), 18.6% (f) of non-zero pixels.

We see that the results are qualitatively similar, and that the representation becomes more local when sparseness is increased (from left to right). We repeated this experiment 10 times, calculated the ℓ0-sparseness (in % of non-zero pixels) and the SNR = 10 log₁₀(‖X‖²_F / ‖X − WH‖²_F) dB, and measured the runtime for both methods. The averaged results are shown in Table 1, where the SNR was averaged in the linear domain.

Table 1.

Comparison of ℓ0-sparseness (in percent of non-zero pixels), ℓ1-sparseness (cf. (4)), reconstruction quality in terms of SNR, and runtime for ℓ1-sparse NMF [13] and NMF ℓ0-W. SNR* denotes the SNR value for ℓ1-NMF when the same ℓ0-sparseness as for NMF ℓ0-W is enforced.

Method     ℓ0 (%)   ℓ1     SNR (dB)   SNR* (dB)   Time (s)
ℓ1-NMF     52.4     0.55   15.09      14.55       2440
NMF ℓ0-W   33       0.55   15.07      –           186
ℓ1-NMF     43       0.6    14.96      14.31       2495
NMF ℓ0-W   25       0.6    14.94      –           164
ℓ1-NMF     18.6     0.73   14.31      13.52       2598
NMF ℓ0-W   10       0.73   14.33      –           50

We see that the two methods achieve essentially the same reconstruction quality, while NMF ℓ0-W uses a significantly lower number of non-zero pixels. One might ask whether the larger portion of non-zeros in the ℓ1-NMF basis vectors stems from over-counting entries which are extremely small, yet numerically non-zero. Therefore, Table 1 additionally contains a column showing SNR* for ℓ1-NMF, which denotes the SNR value when the target ℓ0-sparseness is enforced, i.e. all but the 33%, 25%, 10% of pixels with the largest values are set to zero. We see that when a strict ℓ0-constraint is required, NMF ℓ0-W achieves a significantly better SNR (at least 0.5 dB in this example). Furthermore, NMF ℓ0-W is more than an order of magnitude faster than ℓ1-sparse NMF in this setup.

5. Conclusion

We proposed a generic alternating update scheme for ℓ0-sparse NMF, which naturally incorporates existing approaches for unconstrained NMF, such as the classic multiplicative update rules or the ANLS scheme. For the key problem in NMF ℓ0-H, the nonnegative sparse coding problem, we proposed sNNLS, whose interpretation is twofold: it can be regarded as sparse nonnegative least-squares or as nonnegative orthogonal matching pursuit. From the matching-pursuit perspective, sNNLS represents a natural implementation of nonnegativity constraints in OMP. We further proposed the simple but astonishingly well-performing rsNNLS, which competes with or outperforms nonnegative basis pursuit. Generally, ℓ0-sparse NMF is a powerful technique with a large number of potential applications. NMF ℓ0-H aims to find a (possibly overcomplete) representation using sparsely activated basis vectors. NMF ℓ0-W encourages a part-based representation, which might be particularly interesting for applications in computer vision.

Acknowledgment

This work was supported by the Austrian Science Fund (Project number P22488-N23). The authors would like to thank Sebastian Tschiatschek for his support concerning the Ipopt library and for his useful comments on this paper.

Biographies


Robert Peharz received his MSc degree in 2010. He currently pursues his PhD studies at the SPSC Lab, Graz University of Technology. His research interests include nonnegative matrix factorization, sparse coding, machine learning and graphical models, with applications to signal processing, audio engineering and computer vision.


Franz Pernkopf received his PhD degree in 2002. In 2002 he received the Erwin Schrödinger Fellowship. From 2004 to 2006, he was Research Associate in the Department of Electrical Engineering at the University of Washington. He received the young investigator award of the province of Styria in 2010. Currently, he is Associate Professor at the SPSC Lab, Graz University of Technology, leading the Intelligent Systems research group. His research interests include machine learning, discriminative learning, graphical models, with applications to image- and speech processing.

Footnotes

1. He actually showed that the decision problem of whether an exact NMF of a certain rank exists is NP-hard. An optimal algorithm for approximate NMF could be used to solve this decision problem; hence NP-hardness also follows for approximate NMF.

2. To solve for W, one transposes X and H and executes the algorithm for each column of Xᵀ.

3. We used the MATLAB implementation available at http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/5523/zip/imm5523.zip.

4. For NNBP we used a C++ implementation [38], while all other algorithms were implemented in MATLAB. We also tried an implementation using the MATLAB optimization toolbox for NNBP, which was slower by a factor of 3–4.

References

1. Lee D.D., Seung H.S. Algorithms for non-negative matrix factorization. In: NIPS, 2001, pp. 556–562.
2. Dhillon I.S., Sra S. Generalized nonnegative matrix approximations with Bregman divergences. In: NIPS, 2005, pp. 283–290.
3. Cichocki A., Lee H., Kim Y.D., Choi S. Non-negative matrix factorization with α-divergence. Pattern Recognition Lett. 2008;29(9):1433–1440.
4. Paatero P., Tapper U. Positive matrix factorization: a non-negative factor model with optimal utilization of error estimates of data values. Environmetrics. 1994;5:111–126.
5. Lee D.D., Seung H.S. Learning the parts of objects by nonnegative matrix factorization. Nature. 1999;401:788–791. doi: 10.1038/44565.
6. Smaragdis P., Brown J. Non-negative matrix factorization for polyphonic music transcription. In: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2003, pp. 177–180.
7. Virtanen T. Monaural sound source separation by nonnegative matrix factorization with temporal continuity and sparseness criteria. IEEE Trans. Audio Speech Lang. Process. 2007;15(3):1066–1074.
8. Févotte C., Bertin N., Durrieu J.L. Nonnegative matrix factorization with the Itakura–Saito divergence: with application to music analysis. Neural Comput. 2009;21(3):793–830. doi: 10.1162/neco.2008.04-08-771.
9. Peharz R., Stark M., Pernkopf F. A factorial sparse coder model for single channel source separation. In: Proceedings of Interspeech, 2010, pp. 386–389.
10. Xu W., Liu X., Gong Y. Document clustering based on non-negative matrix factorization. In: Proceedings of ACM SIGIR'03, 2003, pp. 267–273.
11. Hoyer P.O. Non-negative sparse coding. In: Proceedings of Neural Networks for Signal Processing, 2002, pp. 557–565.
12. Eggert J., Koerner E. Sparse coding and NMF. In: International Joint Conference on Neural Networks, 2004, pp. 2529–2533.
13. Hoyer P.O. Non-negative matrix factorization with sparseness constraints. J. Mach. Learn. Res. 2004;5:1457–1469.
14. Donoho D., Elad M. Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization. Proc. Natl. Acad. Sci. 2003;100(5):2197–2202. doi: 10.1073/pnas.0437847100.
15. Tropp J. Just relax: convex programming methods for identifying sparse signals. IEEE Trans. Inf. Theory. 2006;51:1030–1051.
16. Vavasis S. On the complexity of nonnegative matrix factorization. SIAM J. Optim. 2009;20(3):1364–1377.
17. Aharon M., Elad M., Bruckstein A. K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 2006;54:4311–4322.
18. Aharon M., Elad M., Bruckstein A. K-SVD and its non-negative variant for dictionary design. In: Proceedings of the SPIE Conference, Curvelet, Directional, and Sparse Representations II, vol. 5914, 2005, pp. 11.1–11.13.
19. Dueck D., Frey B. Probabilistic Sparse Matrix Factorization. Technical Report PSI TR 2004-023, Department of Computer Science, University of Toronto, 2004.
20. Morup M., Madsen K., Hansen L. Approximate L0 constrained non-negative matrix and tensor factorization. In: Proceedings of ISCAS, 2008, pp. 1328–1331.
21. Efron B., Hastie T., Johnstone I., Tibshirani R. Least angle regression. Ann. Stat. 2004;32:407–499.
22. Yang Z., Chen X., Zhou G., Xie S. Spectral unmixing using nonnegative matrix factorization with smoothed L0 norm constraint. In: Proceedings of SPIE, 2009.
23. Davis G., Mallat S., Avellaneda M. Adaptive greedy approximations. J. Constr. Approx. 1997;13:57–98.
24. Pati Y.C., Rezaiifar R., Krishnaprasad P.S. Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition. In: Proceedings of the 27th Asilomar Conference on Signals, Systems and Computers, 1993, pp. 40–44.
25. Tropp J. Greed is good: algorithmic results for sparse approximation. IEEE Trans. Inf. Theory. 2004;50(10):2231–2242.
26. Bruckstein A.M., Elad M., Zibulevsky M. On the uniqueness of nonnegative sparse solutions to underdetermined systems of equations. IEEE Trans. Inf. Theory. 2008;54:4813–4820.
27. Lawson C., Hanson R. Solving Least Squares Problems. Prentice-Hall, 1974.
28. Chen S., Donoho D., Saunders M. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 1998;20(1):33–61.
29. Van Benthem M., Keenan M. Fast algorithm for the solution of large-scale non-negativity-constrained least-squares problems. J. Chemometrics. 2004;18:441–450.
30. Kim H., Park H. Non-negative matrix factorization based on alternating non-negativity constrained least squares and active set method. SIAM J. Matrix Anal. Appl. 2008;30:713–730.
31. Lin C.J. Projected gradient methods for nonnegative matrix factorization. Neural Comput. 2007;19:2756–2779. doi: 10.1162/neco.2007.19.10.2756.
32. Li S., Hou X., Zhang H., Cheng Q. Learning spatially localized, parts-based representation. In: Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR), 2001.
33. Pascual-Montano A., Carazo J., Kochi K., Lehmann D., Pascual-Marqui R. Nonsmooth nonnegative matrix factorization (nsNMF). IEEE Trans. Pattern Anal. Mach. Intell. 2006.
34. Yang A.Y., Maji S., Hong K., Yan P., Sastry S.S. Distributed compression and fusion of nonnegative sparse signals for multiple-view object recognition. In: International Conference on Information Fusion, 2009.
35. Peharz R., Stark M., Pernkopf F. Sparse nonnegative matrix factorization using ℓ0-constraints. In: Proceedings of MLSP, 2010, pp. 83–88.
36. Donoho D.L., Tanner J. Sparse nonnegative solutions of underdetermined linear equations by linear programming. Proc. Natl. Acad. Sci. 2005;102(27):9446–9451. doi: 10.1073/pnas.0502269102.
37. Boyd S., Vandenberghe L. Convex Optimization. Cambridge University Press, 2004.
38. Wächter A., Biegler L.T. On the implementation of an interior-point filter line-search algorithm for large-scale nonlinear programming. Math. Programming. 2006;106:25–57.
39. Cooke M.P., Barker J., Cunningham S.P., Shao X. An audio-visual corpus for speech perception and automatic speech recognition. J. Acoust. Soc. Am. 2006;120:2421–2424. doi: 10.1121/1.2229005.
40. Samaria F.S., Harter A. Parameterisation of a stochastic model for human face identification. In: Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, 1994, pp. 138–142.
