Abstract
In this paper, we study the theoretical properties of a class of iteratively re-weighted least squares (IRLS) algorithms for sparse signal recovery in the presence of noise. We demonstrate a one-to-one correspondence between this class of algorithms and a class of Expectation-Maximization (EM) algorithms for constrained maximum likelihood estimation under a Gaussian scale mixture (GSM) distribution. The IRLS algorithms we consider are parametrized by 0 < ν ≤ 1 and ε > 0. The EM formalism, together with the connection to GSMs, allows us to establish that the IRLS(ν, ε) algorithms minimize ε-smooth versions of the ℓν ‘norms’. We leverage EM theory to show that, for each 0 < ν ≤ 1, the limit points of the sequence of IRLS(ν, ε) iterates are stationary points of the ε-smooth ℓν ‘norm’ minimization problem on the constraint set. Furthermore, we employ techniques from compressive sampling (CS) theory to show that the class of IRLS(ν, ε) algorithms is stable for each 0 < ν ≤ 1, provided that the limit point of the iterates coincides with the global minimizer. For the case ν = 1, we show that the algorithm converges exponentially fast to a neighborhood of the stationary point, and we outline its generalization to super-exponential convergence for ν < 1. We demonstrate our claims via simulation experiments. The simplicity of IRLS, together with the theoretical guarantees provided in this contribution, makes a compelling case for its adoption as a standard tool for sparse signal recovery.
I. Introduction
Compressive sampling (CS) has been among the most active areas of research in signal processing in recent years [1], [2]. CS provides a framework for efficient sampling and reconstruction of sparse signals, and has found applications in communication systems, medical imaging, geophysical data analysis, and computational biology.
The main approaches to CS can be categorized as optimization-based methods, greedy/pursuit methods, coding-theoretic methods, and Bayesian methods (see [2] for detailed discussions and references). In particular, convex optimization-based methods such as ℓ1-minimization, the Dantzig selector, and the LASSO have proven successful for CS, with theoretical performance guarantees both in the absence and in the presence of observation noise. Although these programs can be solved using standard optimization tools, iteratively re-weighted least squares (IRLS) has been suggested as an attractive alternative in the literature. Indeed, a number of authors have demonstrated that IRLS is an efficient solution technique rivalling standard state-of-the-art algorithms based on convex optimization principles [3], [4], [5], [6], [7], [8]. Gorodnitsky and Rao [3] proposed an IRLS-type algorithm (FOCUSS) years prior to the advent of CS and demonstrated its utility in neuroimaging applications. Donoho et al. [4] have suggested the usage of IRLS for solving the basis pursuit de-noising (BPDN) problem in the Lagrangian form. Saab et al. [5] and Chartrand et al. [6] have employed IRLS for non-convex programs for CS. Carrillo and Barner [7] have applied IRLS to the minimization of a smoothed version of the ℓ0 ‘norm’ for CS. Wang et al. [8] have used IRLS for solving the ℓν-minimization problem for sparse recovery, with 0 < ν ≤ 1. Most of the above-mentioned papers lack a rigorous analysis of the convergence and stability of IRLS in the presence of noise, and merely employ IRLS as a solution technique for other convex and non-convex optimization programs. However, IRLS has also been studied in detail as a stand-alone optimization-based approach to sparse reconstruction in the absence of noise by Daubechies et al. [9].
In [10], Candès, Wakin and Boyd have called CS the “modern least-squares”: the ease of implementation of IRLS algorithms, along with their inherent connection to ordinary least-squares, provides a compelling argument in favor of their adoption as a standard algorithm for the recovery of sparse signals [9].
In this work, we extend the utility of IRLS to compressive sampling in the presence of observation noise. For this purpose, we use the Expectation-Maximization (EM) theory for Normal/Independent (N/I) random variables and show that IRLS applied to noisy compressive sampling is an instance of the EM algorithm for constrained maximum likelihood estimation under an N/I assumption on the distribution of its components. This important connection has a two-fold advantage. First, the EM formalism allows us to study the convergence of IRLS in the context of EM theory. Second, one can evaluate the stability of IRLS, viewed as a constrained maximum likelihood problem, in the context of noisy CS. More specifically, we show that the said class of IRLS algorithms, parametrized by 0 < ν ≤ 1 and ε > 0, are iterative procedures for minimizing ε-smooth approximations to the ℓν ‘norms’. We use EM theory to prove convergence of the algorithms to stationary points of the objective, for each 0 < ν ≤ 1. We employ techniques from CS theory to show that the IRLS(ν, ε) algorithms are stable for each 0 < ν ≤ 1, if the limit point of the iterates coincides with the global minimizer (which is trivially the case for ν = 1, under mild conditions standard for CS). For the case ν = 1, we show that the algorithm converges exponentially fast to a neighborhood of the stationary point, for small enough observation noise. We further outline the generalization of this result to super-exponential convergence for the case of ν < 1. Finally, through numerical simulations we demonstrate the validity of our claims.
The rest of our treatment begins with Section II, where we introduce a fairly large class of EM algorithms for likelihood maximization within the context of N/I random variables. In the following section, we show a one-to-one correspondence between the said class of EM algorithms and IRLS algorithms which have been proposed in the CS literature for sparse recovery. In Sections IV and V, we prove the convergence and stability of the IRLS algorithms identified previously in Section III. We derive rates of convergence in Section VI and demonstrate our theoretical predictions through numerical experiments in Section VII. Finally, we give concluding remarks in Section VIII.
II. Normal/Independent random variables and the Expectation-Maximization algorithm
A. N/I random variables
Consider a positive random variable U with probability distribution function pU(u), and an M-variate normal random vector Z with mean zero and non-singular covariance matrix Σ. For any constant M-dimensional vector μ, the random vector
(1)  Y = μ + Z/√U
is said to be a normal/independent (N/I) random vector [11]. N/I random vectors encompass large classes of multi-variate distributions such as the Generalized Laplacian and multi-variate t distributions. Many important properties of N/I random vectors can be found in [12] and [11]. In particular, the density of the random vector Y is given by
(2)  pY(y) = (2π)−M/2 |Σ|−1/2 exp( −κ( (y − μ)TΣ−1(y − μ) ) / 2 )
with
(3)  κ(x) = −2 log ∫0∞ uM/2 exp(−ux/2) pU(u) du
for x ≥ 0 [11].
N/I random vectors are also commonly referred to as Gaussian scale mixtures (GSMs). In the remainder of our treatment, we use the two terminologies interchangeably.
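To make the construction concrete, the following short Python snippet (our illustration; the function name and the particular mixing distribution are our choices, not the paper's) draws samples of the form given in Eq. (1). With U following a Gamma(α/2, rate α/2) distribution, Y is multivariate t with α degrees of freedom, a classical member of the GSM family.

```python
# Illustration of the N/I construction Y = mu + Z / sqrt(U) of Eq. (1).
# With U ~ Gamma(shape=alpha/2, rate=alpha/2), Y is multivariate t with
# alpha degrees of freedom; other mixing laws yield other GSM members.
import numpy as np

def sample_gsm(mu, Sigma, alpha=3.0, n_samples=1000, seed=0):
    rng = np.random.default_rng(seed)
    M = len(mu)
    U = rng.gamma(shape=alpha / 2.0, scale=2.0 / alpha, size=n_samples)  # rate = alpha/2
    Z = rng.multivariate_normal(np.zeros(M), Sigma, size=n_samples)
    return mu + Z / np.sqrt(U)[:, None]

# Heavy-tailed samples centered at mu:
Y = sample_gsm(mu=np.array([0.0, 1.0]), Sigma=np.eye(2))
```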
Eq. (2) is a representation of the density of an elliptically-symmetric random vector Y [13]. Eq. (3) gives a canonical form of the function κ(·) that arises from a given N/I distribution. However, when substituted in Eq. (2), not all choices of κ(·) lead to a distribution in the GSM family, i.e., to random vectors which admit a decomposition as in Eq. (1). This will be important in our treatment because we will show that the IRLS algorithms which have been proposed for sparse signal recovery correspond to specific choices of κ(x) which do lead to GSMs. In [14], Andrews et al. give necessary and sufficient conditions under which a symmetric density belongs to the family of GSMs. In [11], Lange et al. generalize these results by giving necessary and sufficient conditions under which a spherically-symmetric random vector is a GSM (note that any elliptically-symmetric density as in Eq. (2), with non-singular covariance matrix, can be linearly transformed into a spherically-symmetric density). The following proposition gives necessary and sufficient conditions under which a given choice of κ(x) leads to a density in the N/I family.
Proposition 1 (Conditions for a GSM)
A function f(x) : ℝ ↦ ℝ is called completely monotone iff it is infinitely differentiable and (−1)kf(k)(x) ≥ 0 for all non-negative integers k and all x ≥ 0. If Y is an elliptically-symmetric random vector with representation as in Eq. (2), then Y is an N/I random vector iff κ′(x) is completely monotone.
We refer the reader to [11] for a proof of this result.
B. EM algorithm
Now, suppose that one is given a total of P samples from multi-variate N/I random vectors, yi, with means and covariances μi(θ) and Σi(θ), respectively, for i = 1, 2, · · ·, P, all parametrized by an unknown parameter vector θ. Let
(4)  δi(θ) := (yi − μi(θ))T Σi(θ)−1 (yi − μi(θ))
for i = 1, 2, · · ·, P. Then, the log-likelihood of the P samples, parametrized by θ, is given by:
(5)  L(θ) = −(1/2) Σi=1P [ log |Σi(θ)| + κ(δi(θ)) ] + const
An Expectation-Maximization algorithm for maximizing the log-likelihood results if one linearizes the function κ(x) at the current estimate of θ. This is due to the fact that
(6)  E[Ui | yi; θ] = κ′(δi(θ)),   i = 1, 2, · · ·, P
and hence linearization of κ(x) at the current estimate of θ gives the Q-function by taking the scale variable U as the unobserved data [15], [11]. If θ(ℓ) is the current estimate of θ, then the (ℓ + 1)th iteration of an EM algorithm maximizes the following Q-function:
(7)  Q(θ | θ(ℓ)) = −(1/2) Σi=1P [ log |Σi(θ)| + κ′(δi(θ(ℓ))) δi(θ) ] + const
Maximization of the above Q-function is usually more tractable than maximizing the original likelihood function.
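As a concrete illustration (a standard example treated in [11], included here for the reader's convenience), take Ui ∼ Gamma(α/2, α/2), so that each yi is multivariate t with α degrees of freedom. The E-step weight appearing in the Q-function above then takes the familiar form

```latex
% E-step weight for the multivariate t distribution with alpha degrees of freedom:
\kappa'\!\big(\delta_i(\theta^{(\ell)})\big)
  \;=\; \frac{\alpha + M}{\alpha + \delta_i(\theta^{(\ell)})},
\qquad
\delta_i(\theta) \;=\; \big(y_i - \mu_i(\theta)\big)^{T}\,\Sigma_i(\theta)^{-1}\,\big(y_i - \mu_i(\theta)\big),
```

so that samples with large Mahalanobis distance are automatically down-weighted in the weighted least-squares M-step; this recovers the classical EM algorithm for the multivariate t distribution.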
The EM algorithm is an instance of the more general class of Majorization-Minimization (MM) algorithms [16]. The EM algorithm above can also be derived in the MM formalism, that is, without recourse to missing data or other statistical constructs such as marginal and complete data likelihoods. In [11], Lange et al. take the MM approach (without missing data) and point out that the key ingredient in the MM algorithm is the κ(·) function, which is related to the missing data formulation of the algorithm through Eq. (6).
III. Iterative Re-weighted Least Squares
In this section, we define a class of IRLS algorithms and show that they correspond to a specific class of EM algorithms under GSM assumptions.
A. Definition
Let x ∈ ℝM be such that |{i : xi ≠ 0}| ≤ s, for some s < M. Then, x is said to be an s-sparse vector. Consider the following observation model
(8)  b = Ax + n
where b ∈ ℝN, with N < M, is the observation vector, A ∈ ℝN×M is the measurement matrix, and n ∈ ℝN is the observation noise. The noisy compressive sampling problem is concerned with the estimation of x given b, A and a model for n. Suppose that the observation noise n is bounded such that ||n||2 ≤ η, for some fixed η > 0. Let 𝒞 := {x : ||b − Ax||2 ≤ η}. Let w ∈ ℝM be such that wi > 0 for all i = 1, 2, · · ·, M. Then, for all x, y ∈ ℝM, the inner-product defined by
(9)  ⟨x, y⟩w := Σi=1M wi xi yi
induces a norm ||x||w := (⟨x, x⟩w)1/2.
Definition 2
Let ν ∈ (0, 1] be a fixed constant. Given an initial guess x(0) of x (e.g. the least-squares solution), the class of IRLS(ν, ε) algorithms for estimating x generates a sequence of iterates/refined estimates of x as follows:
(10)  x(ℓ+1) = arg min { Σi=1M wi(ℓ) xi2 : x ∈ 𝒞 }
with
(11)  wi(ℓ) = ((xi(ℓ))2 + ε2)ν/2−1
for i = 1, 2, · · ·, M and some fixed ε > 0.
Each iteration of the IRLS algorithm corresponds to a weighted least-squares problem constrained to the closed convex set 𝒞 (defined by a quadratic constraint), and can be solved efficiently using standard convex optimization methods. The Lagrangian formulation of the IRLS step has a simple closed-form expression, which makes it very appealing for implementation purposes [4]. Moreover, if the output SNR is greater than 1, that is, ||n||2 ≤ η < ||b||2, then 0 is not a feasible solution. Hence, the gradient of the weighted least-squares objective is non-vanishing over 𝒞. Therefore, the solutions lie on the boundary of 𝒞, given by ||b − Ax||2 = η. Such a problem has been studied extensively in the optimization literature in its dual form, for which several robust and efficient solutions exist (see [17] and references therein). Finally, note that when η = 0, the above algorithm is similar to the one studied by Daubechies et al. [9]. Throughout the paper, we may drop the dependence of IRLS(ν, ε) on ν and ε and simply denote it by IRLS wherever there is no ambiguity.
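For concreteness, the following Python sketch shows one possible realization of the IRLS(ν, ε) iteration of Definition 2, using the cvxpy package (a Python analogue of the CVX toolbox used in Section VII) to solve each constrained weighted least-squares subproblem. The function name, stopping rule, and default parameters are illustrative and not taken from the paper.

```python
# Sketch of IRLS(nu, eps) for min f_nu(x) subject to ||b - Ax||_2 <= eta.
# Each iteration solves the weighted least-squares step of Eq. (10) with the
# weights of Eq. (11), using cvxpy for the constrained quadratic subproblem.
import numpy as np
import cvxpy as cp

def irls(A, b, eta, nu=1.0, eps=1e-4, max_iter=50, tol=1e-8, x0=None):
    N, M = A.shape
    # Initial guess: minimum-norm least-squares solution, unless one is supplied.
    x = np.linalg.pinv(A) @ b if x0 is None else np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        w = (x**2 + eps**2) ** (nu / 2.0 - 1.0)          # weights of Eq. (11)
        z = cp.Variable(M)
        problem = cp.Problem(cp.Minimize(cp.sum(cp.multiply(w, cp.square(z)))),
                             [cp.norm(A @ z - b, 2) <= eta])
        problem.solve()
        x_new = z.value
        if np.linalg.norm(x_new - x) < tol:               # simple stopping rule
            return x_new
        x = x_new
    return x
```

Each subproblem is a convex quadratic program over 𝒞, so any standard solver can be substituted for cvxpy.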
B. IRLS as an EM algorithm
Consider an M-dimensional random vector y ∈ ℝM with independent elements distributed according to
(12)  p(yi) = (2π)−1/2 exp( −κ( (yi − xi)2 ) / 2 ),   i = 1, 2, · · ·, M
for some function κ(x) with completely monotone derivative. Note that y is parametrized by θ := (x1, x2, · · ·, xM)T ∈ 𝒞. The Q-function of the form (7) given the observation y = 0 ∈ ℝM is given by:
(13)  Q(θ | θ(ℓ)) = −(1/2) Σi=1M κ′((xi(ℓ))2) xi2 + const
Identifying the weights κ′((xi(ℓ))2) in Eq. (13) with the weights wi(ℓ) of Eq. (11) appearing in Eq. (10), we have
(14)  κ′(x) = (x + ε2)ν/2−1
for x ≥ 0. It is not hard to show that κ′(x) is completely monotone [11] and hence, according to Proposition 1, κ(x) given by
(15)  κ(x) = (2/ν) (x + ε2)ν/2
defines an N/I univariate random variable with density given by Eq. (12). The log-likelihood corresponding to the zero observation is then given by
(16)  −(1/2) Σi=1M κ(xi2) + const = −(1/ν) Σi=1M (xi2 + ε2)ν/2 + const
Therefore, the IRLS algorithm can be viewed as an iterative solution, which is an EM algorithm [11], for the following program:
(17)  maximize −(1/2) Σi=1M κ(xi2)   subject to   x ∈ 𝒞
Note that the above program corresponds to minimizing an ε-smoothed version of the ℓν ‘norm’ of x (Figure 1) subject to the constraint ||b − Ax||2 ≤ η, that is
(18)  minimize fν(x) := Σi=1M (xi2 + ε2)ν/2   subject to   ||b − Ax||2 ≤ η
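A simple sandwich bound, which we add here for intuition (it follows from the subadditivity of t ↦ tν/2 on [0, ∞) for 0 < ν ≤ 1 and is not part of the original argument), quantifies how closely fν approximates the ℓν ‘norm’:

```latex
\|x\|_{\nu}^{\nu}
  \;\le\; f_{\nu}(x) \;=\; \sum_{i=1}^{M}\big(x_i^2 + \varepsilon^2\big)^{\nu/2}
  \;\le\; \|x\|_{\nu}^{\nu} + M\,\varepsilon^{\nu},
\qquad x \in \mathbb{R}^{M},
```

so fν converges to the ℓν ‘norm’ uniformly on ℝM as ε → 0.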
The function fν(x) has also been considered in [9] in the analysis of the IRLS algorithm for noiseless CS. However, the above parallel to EM theory can be generalized to any other weighting scheme arising from a κ(·) with completely monotone derivative. For instance, consider the IRLS algorithm with the weighting:
(19) |
for some ε > 0. Using the connection to EM theory [11], it can be shown that this IRLS is an iterative solution to
(20) |
which is a perturbed version of ℓ1 minimization subject to ||b − Ax||2 ≤ η.
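As a quick check of the complete-monotonicity claims made above (a routine calculation that we include for completeness), differentiating κ′(x) = (x + ε2)ν/2−1, the derivative associated with the weights of Eq. (11), k times gives

```latex
\frac{d^{k}}{dx^{k}}\,(x+\varepsilon^{2})^{\frac{\nu}{2}-1}
  \;=\; \Big(\tfrac{\nu}{2}-1\Big)\Big(\tfrac{\nu}{2}-2\Big)\cdots\Big(\tfrac{\nu}{2}-k\Big)\,
        (x+\varepsilon^{2})^{\frac{\nu}{2}-1-k},
\qquad x \ge 0,
```

and each of the k factors is negative for 0 < ν ≤ 1, so the k-th derivative has sign (−1)k; together with the positivity of κ′ itself, this is exactly the condition required by Proposition 1.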
IV. Convergence
The convergence of the IRLS iterates in the absence of noise has been studied in [9], where the proofs rely on the null space property of the measurement matrix. The connection to EM theory allows us to derive convergence results in the presence of noise using the rich convergence theory of EM algorithms.
A. Convergence of IRLS as an EM Algorithm
It is not hard to show that the EM algorithm provides a sequence of iterates such that the corresponding sequence of log-likelihoods converges. However, one needs to be more prudent when making statements about the convergence of the iterates {x(ℓ)} themselves. Let 𝒞 denote a non-empty, closed, strictly convex subset of ℝM. Let 𝒯 : ℝM ↦ ℝM be the map
(21)  𝒯(z) := arg min { Σi=1M wi(z) xi2 : x ∈ 𝒞 }
for all z ∈ ℝM, where
(22)  wi(z) = (zi2 + ε2)ν/2−1,   i = 1, 2, · · ·, M
Results from convex analysis [18] imply the following necessary and sufficient optimality condition for x* ∈ 𝒞, the unique minimizer of Σi wi(z) xi2 over 𝒞:
(23)  ⟨x*, x − x*⟩w(z) ≥ 0   for all x ∈ 𝒞
Moreover, continuity of the weighted objective in x and z implies that 𝒯 is a continuous map [19]. We prove this latter fact formally in Appendix A. The proof of convergence of the EM iterates to a stationary point of the likelihood function can be deduced from variations on the global convergence theorem of Zangwill [20] (see [19] and [21]). For completeness, we present a convergence theorem tailored to the problem at hand:
Theorem 3 (Convergence of the sequence of IRLS iterates)
Let x(0) ∈ 𝒞 and let {x(ℓ)} be the sequence defined by x(ℓ+1) = 𝒯(x(ℓ)) for all ℓ. Then, (i) {x(ℓ)} is bounded and ||x(ℓ) − x(ℓ+1)||2 → 0, (ii) every limit point of {x(ℓ)} is a fixed point of 𝒯, (iii) every limit point of {x(ℓ)} is a stationary point of the function fν over 𝒞, and (iv) fν(x(ℓ)) converges monotonically to fν(x*), for some stationary point x*.
Proof
(i) is a simple extension of Lemmas 4.4 and 5.1 in [9], where one substitutes 〈x(ℓ+1), x(ℓ) − x(ℓ+1)〉w(x(ℓ)) ≥ 0 for the optimality conditions at each iterate ℓ.
(ii) From (i), {x(ℓ)} is a bounded sequence. The Bolzano-Weierstrass theorem establishes that {x(ℓ)} has at least one convergent subsequence. Let x(ℓk) → x̄ be one such convergent subsequence:
(24)  limk→∞ x(ℓk) = x̄  and, by (i),  limk→∞ x(ℓk+1) = x̄
Since x(ℓk+1) = 𝒯(x(ℓk)), the continuity of the map 𝒯 implies that
(25)  limk→∞ x(ℓk+1) = limk→∞ 𝒯(x(ℓk)) = 𝒯(x̄)
Therefore, x̄ is a fixed point of the mapping 𝒯.
(iii) To establish (iii), we show that the limit point of any convergent subsequence {x(ℓk)}k of {x(ℓ)} satisfies the necessary conditions for a stationary point of the minimization of fν over 𝒞. Note that x̄ = 𝒯(x̄) if and only if ⟨x̄, x − x̄⟩w(x̄) ≥ 0 for all x ∈ 𝒞. Moreover,
(26)  ⟨∇fν(x̄), x − x̄⟩ = ν ⟨x̄, x − x̄⟩w(x̄) ≥ 0   for all x ∈ 𝒞
Note that ⟨∇fν(x̄), x − x̄⟩ ≥ 0 for all x ∈ 𝒞 is the necessary condition for a stationary point x̄ of fν(x) over the strictly convex set 𝒞 [18]. Finally, (iv) follows from the continuity of the weighted objective in x and z and the convexity of 𝒞 (see Theorem 2 of [21]). This concludes the proof of the theorem.
B. Discussion
Note that Theorem 3 implies that if the minimizer of fν(x) over 𝒞 is unique, then the IRLS iterates converge to this unique minimizer. Moreover, by Theorem 5 of [21], the limit points of IRLS lie in a compact and connected subset of the set {x : fν(x) = fν(x*)}. In particular, if the set of stationary points of fν(x) is finite, the IRLS sequence of iterates converges to a unique stationary point. However, in general IRLS is not guaranteed to converge (i.e., the set of limit points of the sequence of iterates is not necessarily a singleton).
There are various ways to choose ε adaptively or in a static fashion. Daubechies et al. [9] suggest a scheme where ε is possibly decreased in each step. This way fν(x) provides a better approximation to the ℓν norm. Saab et al. [5] cascade a series of IRLS with fixed but decreasing ε, so that the output of each is used as the initialization of the next.
The result of Theorem 3 can be generalized to incorporate iteration-dependent changes of ε. Let {ε(ℓ)} be a sequence such that limℓ→∞ ε(ℓ) = ε̄ ≥ 0. It is not hard to show that Theorem 3 holds for such a choice of {ε(ℓ)}, by defining the limiting objective and mapping with ε̄ in place of ε, if ε̄ > 0. Let x̄ be a limit point of the IRLS iterates. If ε̄ = 0 and Tx̄ := supp(x̄) ⊆ {1, 2, · · ·, M}, then the results of parts (iii) and (iv) of Theorem 3 hold for the minimization of the corresponding limiting objective over 𝒞 ∩ {x : xi = 0 for i ∉ Tx̄}. This general result encompasses both the approach of Daubechies et al. [9] (in the absence of noise) and that of Saab et al. [5] as special cases. The technicality under such iteration-dependent choices of ε arises in showing that x̄ is a fixed point of the limiting mapping, which can be established by invoking the uniform convergence of the iteration-dependent weighted objectives to the limiting objective at x̄, for any given subsequence {z(ℓ)} converging to x̄. A formal proof is given in Appendix B. For simplicity and clarity of presentation, the remaining results of this paper are presented under the assumption that ε > 0 is fixed.
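The following Python fragment sketches the cascaded strategy attributed to [5]; the schedule of ε values is illustrative, and irls refers to the sketch given after Definition 2 (which accepts a warm start x0).

```python
# Cascade of IRLS runs with a fixed but decreasing eps, each warm-started
# from the output of the previous run (cf. the strategy of Saab et al. [5]).
def irls_cascade(A, b, eta, nu=1.0, eps_schedule=(1e-1, 1e-2, 1e-3, 1e-4)):
    x = None
    for eps in eps_schedule:
        x = irls(A, b, eta, nu=nu, eps=eps, x0=x)  # warm start
    return x
```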
V. Stability of IRLS for noisy CS
Recall that fν(x) is a smoothed version of the ℓν ‘norm’ ||x||νν. Hence, for 0 < ν ≤ 1, the global minimizer of fν(x) over 𝒞 is expected to be close to the s-sparse x, given sufficient regularity conditions on the matrix A [8], [5]. Bounding the distance of this minimizer to the s-sparse x provides the desired stability bounds. For ν = 1, f1(x) is strictly convex. Therefore, the solution of the minimization of f1(x) over the convex set 𝒞 is unique [18]. Hence, the IRLS iterates converge to the unique minimizer in this case. However, for ν < 1, the IRLS iterates do not necessarily converge to a global minimizer of fν(x) over 𝒞. In practice, IRLS is applied with randomly chosen initial values, and the limit point with the highest log-likelihood is chosen [7].
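A minimal sketch of this random-restart heuristic for ν < 1 follows (again relying on the irls sketch from Section III; the number of restarts is arbitrary):

```python
# Run IRLS from several random initializations and keep the limit point with
# the smallest f_nu value (equivalently, the highest log-likelihood).
import numpy as np

def irls_multistart(A, b, eta, nu=0.5, eps=1e-4, n_starts=10, seed=1):
    rng = np.random.default_rng(seed)
    f_nu = lambda x: np.sum((x**2 + eps**2) ** (nu / 2.0))
    candidates = [irls(A, b, eta, nu=nu, eps=eps, x0=rng.standard_normal(A.shape[1]))
                  for _ in range(n_starts)]
    return min(candidates, key=f_nu)
```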
Recall that the matrix A ∈ ℝN×M is said to have Restricted Isometry Property (RIP) [22] of order s < M with constant δs ∈ (0, 1), if for all x ∈ ℝM supported on any index set T ⊂ {1, 2, · · ·, M} satisfying |T| ≤ s, we have
(27)  (1 − δs) ||x||22 ≤ ||Ax||22 ≤ (1 + δs) ||x||22
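Certifying the RIP constant exactly is a combinatorial problem; the following Python snippet (our illustration, not part of the paper's method) merely computes a Monte Carlo lower bound on δs by sampling random supports.

```python
# Monte Carlo lower bound on the RIP constant delta_s of Eq. (27): sample
# random supports T with |T| = s and record how far the squared singular
# values of the submatrix A_T deviate from 1.
import numpy as np

def estimate_rip_constant(A, s, n_trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    N, M = A.shape
    delta = 0.0
    for _ in range(n_trials):
        T = rng.choice(M, size=s, replace=False)
        sv = np.linalg.svd(A[:, T], compute_uv=False)
        delta = max(delta, sv.max()**2 - 1.0, 1.0 - sv.min()**2)
    return delta  # a lower bound on delta_s
```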
The following theorem establishes the stability of the minimization of fν(x) over 𝒞 in the noisy setting:
Theorem 4
Let b = Ax+n be given such that x ∈ ℝM is s-sparse. Let m be a fixed integer and suppose that A ∈ ℝN×M satisfies
(28) |
Suppose that ||n||2 ≤ η and let 𝒞 := {x : ||b − Ax||2 ≤ η}. Let ε > 0 be a fixed constant. Then, the solution to the following program
(29)  minimize fν(x)   subject to   x ∈ 𝒞
satisfies
(30) |
where C1 and C2 are constants depending only on ν, s/m, δm and δm+s.
Proof
The proof is a modification of the proof of Theorem 4 in [5], which is based on the proof of the main result of [23]. Let T0 := {i : xi ≠ 0}. Let S ⊆ {1, 2, ···, M}. We define
(31) |
Let x̄ be a global minimizer of fν(x) over 𝒞 and let h := x̄ − x. It is not hard to verify the following fact:
(32) |
The above inequality is the equivalent of the cone constraint in [23]. Moreover, it can be shown that
(33) |
for any x ∈ ℝM and S ⊆ {1, 2, ···, M} such that |S| ≤ s. By dividing the set into the sets T1, T2, ··· of size m, sorted according to decreasing magnitudes of the elements of , it can be shown that
(34) |
where T01 := T0 ∪ T1. Also, by the construction of and the hypothesis of the theorem about A, one can show that [5], [23]:
(35) |
Combining Eqs. (34) and (35) with the fact that ||Ah||2 ≤ 2η yields:
(36) |
where
(37) |
and
(38) |
Remark
Note that the result of Theorem 4 can be extended to compressible signals in a straightforward fashion [5]. Moreover, as will be shown in the next section, the hypothesis of Eq. (28) can be relaxed to the sparse approximation property developed in [24], with a similar characterization of the global minimizer under study.
VI. Convergence Rate of IRLS for noisy CS
In presenting our results on the convergence rate of IRLS in the presence of noise, it is more convenient to employ a slightly weaker notion of near-isometry of the matrix A, developed in [24]. This is due to the structure of the IRLS algorithm, which makes it more convenient to analyze the convergence rate in the ℓ1 sense; as will become clear shortly, the sparse approximation property is the more appropriate choice of regularity condition on the matrix A for this purpose.
A. Sparse approximation property and its consequences
We say that a matrix A has the sparse approximation property of order s if
(39) |
for all x ∈ ℝM, where S is an index set such that |S| ≤ s, and D and β are positive constants. Note that RIP of order 2s implies the sparse approximation property [24], but the converse is not necessarily true. The error bounds obtained in Theorem 4 can be expressed in terms of D and β in a straightforward fashion [24]. A useful consequence of the sparse approximation property is the following reverse triangle inequality in the presence of noise:
Proposition 5
Let A satisfy the sparse approximation property of order s with constants β < 1 and D. Let x1, x2 ∈ 𝒞 := {x : ||b − Ax||2 ≤ η} and suppose that x1 is s-sparse. Then, we have:
Proof
Let T be the support of x1. Then, we have:
(40) |
The sparse approximation property implies that
(41) |
Moreover, ||A(x2 − x1)||2 ≤ 2η, by the construction of 𝒞. Hence, combining Eqs. (40) and (41) yields:
(42) |
which together with Eq. (40) gives the statement of the proposition.
The above reverse triangle inequality allows us to characterize the stability of IRLS in the ℓ1 sense. This is indeed the method used by Daubechies et al. [9] in the absence of noise. Let x̄ be the minimizer of f1(x) over 𝒞. Then, it is straightforward to show that
(43) |
Combining the above inequality with the statement of Proposition 5 yields:
(44) |
Note that the above bound is optimal in terms of η up to a constant, since , and in fact ||n||1 may achieve the value of .
B. Convergence rate of the IRLS
Let {x(ℓ)} be a sequence of IRLS iterates that converges to a stationary point x̄, for ν = 1. We have the following theorem regarding the convergence rate of IRLS:
Theorem 6
Suppose that the matrix A satisfies the sparse approximation property of order s with constants D and β. Suppose that for some ρ < 1 we have
(45) |
and let R0 be the right hand side of Eq. (44), so that ||x̄ − x||1 ≤ R0. Assume that
(46) |
Let
(47) |
Then, there exists a finite ℓ0 such that for all ℓ > ℓ0 we have:
(48) |
for some R1 comparable to R0, which is given explicitly in the proof.
Proof
The proof of the theorem is mainly based on the proof of Theorem 6.4 of [9]. The convergence of the IRLS iterates implies that e(ℓ) := x(ℓ) − x̄ → 0. Let T be the support of the s-sparse vector x. Therefore, there exists ℓ0 such that
(49) |
Clearly the right hand side of the above inequality is positive, since
(50) |
by hypothesis. Following the proof method of [9], we want to show (by induction) that for all ℓ > ℓ0, we have
(51) |
for some R1 that we will specify later. Consider e(ℓ+1) = x(ℓ+1) − x̄. The first order necessary conditions on x(ℓ+1) give:
(52) |
Substituting x(ℓ+1) by x̄ + e(ℓ+1) yields
(53) |
We intend to bound the term on the right hand side. First, note that
(54) |
since
by hypothesis. Moreover, the sparse approximation property implies that
(55) |
thanks to the tube constraint ||b − Ax||2 ≤ η. Hence, the left hand side of Eq. (54) can be bounded as:
(56) |
Also, we have:
(57) |
Note that γ(ℓ) → 0, since e(ℓ) → 0. Therefore, we have
(58) |
An application of the Cauchy-Schwarz inequality yields:
(59) |
Hence,
(60) |
First, note that γ(ℓ) is a bounded sequence for ℓ > ℓ0. Let γ0 be an upper bound on γ(ℓ) for all ℓ > ℓ0. We also have:
(61) |
Now, we have:
(62) |
where
(63) |
This concludes the proof of the theorem.
C. Discussion
Eq. (62) implies that lim supℓ→∞ ||e(ℓ)||1 ≤ (1 − μ)−1R1. Therefore, the IRLS iterates approach a neighborhood of radius (1 − μ)−1R1 (in the ℓ1 sense) around the stationary point x̄ exponentially fast. Note that the radius of this neighborhood is comparable to the upper bound R0 on the distance of x̄ to the s-sparse vector x (in the ℓ1 sense). Hence, it is expected that with relatively few iterations of IRLS, one obtains a reasonable estimate of x (indeed, the numerical studies in Section VII-C confirm this observation). Although the bound of the theorem holds for all ℓ > ℓ0, it is most useful when (1 − μ)−1R1 is less than ρ mini∈T |x̄i|. A sufficient condition to guarantee this can be expressed as an upper bound on the noise level η and on ε.
It is straightforward to extend the above theorem to the case of ν < 1. As shown in [9], the local convergence of the IRLS iterates in the case of ν < 1 and in the absence of noise is super-linear, with exponent 2 − ν. It is not hard to show that in the presence of noise, one can recover the super-linear local convergence with exponent 2 − ν. We refer the reader to Theorem 7.9 of [9], which can be extended to the noisy case with the corresponding modifications to the proof of Theorem 6.
VII. Numerical Experiments
In this section, we use numerical simulations to explore and validate the stability and convergence rate analyses of the previous sections. In particular, we compare ℓ1-minimization to fν(·)-minimization, in both cases in the presence of noise, for different values of ν, ε, and signal-to-noise ratio (SNR).
A. Experimental set-up
For fixed ν, ε and η:
1. Select N and M so that A is an N × M matrix; sample A with independent Gaussian entries.
2. Select 1 ≤ S < M/2.
3. Select T0 of size S uniformly at random and set xj = 1 for all j ∈ T0, and 0 otherwise.
4. Make b = Ax + n, where each entry of n is drawn uniformly in (−α, α), for some α that depends on η; find the solution x̄ to the program of Eq. (18) by IRLS.
5. Compare x̄ to x.
6. Repeat 50 times.
For each ν, ε and η, we compare the program solved in Step 4 to solving the program of Eq. (19) for ν = 1 (ℓ1-minimization). We solve each IRLS iteration, as well as the ℓ1-minimization problem, using CVX, a package for specifying and solving convex programs [25], [26].
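In Python, one trial of this set-up can be sketched as follows; the dimensions, the matrix normalization, and the error metric are illustrative choices on our part, and irls is the sketch from Section III rather than the CVX implementation used for the figures.

```python
# One trial of the experimental set-up of Section VII-A.
import numpy as np

def one_trial(N=64, M=256, S=10, eta=0.1, nu=1.0, eps=1e-4, seed=0):
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((N, M)) / np.sqrt(N)      # Gaussian measurement matrix
    x = np.zeros(M)
    T0 = rng.choice(M, size=S, replace=False)          # random support of size S
    x[T0] = 1.0
    alpha = eta / np.sqrt(N)                           # guarantees ||n||_2 <= eta
    n = rng.uniform(-alpha, alpha, size=N)
    b = A @ x + n
    x_hat = irls(A, b, eta, nu=nu, eps=eps)            # solve Eq. (18) by IRLS
    return np.mean((x_hat - x) ** 2)                   # per-entry mean-squared error
```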
Remark
Modulo some constants, both η and ε appear in the same proportion in the stability bound derived in Theorem 4. Intuitively, this means that the higher the SNR (i.e., the smaller the η), the smaller the value of ε one should pick to solve the program. In our experiment, we start with a fixed ε for the smallest SNR value (5 dB), and scale this value linearly for each subsequent value of the SNR. In particular, we choose ε(SNR) proportional to η(SNR), where we use the loose notation η(SNR) to reflect the fact that each choice of SNR corresponds to a choice of η, and vice versa. In summary, our experimental set-up remains the same, except that we choose values of ε which depend on η.
B. Analysis of Stability
Figures 2 and 3 demonstrate the stability of IRLS for ε = 10−4, for ν = 1 and ν = 1/2, respectively. Figure 2 shows (as expected) that the stability of IRLS is comparable to that of ℓ1-minimization for ν = 1 and small ε. Moreover, only a small number of IRLS iterations is required to reach a satisfactory value of the MSE. These observations also apply to Figure 3, which further highlights the sparsifying properties of fν(·)-minimization for ν < 1. Indeed, the MSE values achieved for ν = 1/2 are smaller than those achieved for ν = 1. Figure 4 shows that the approximation to the ℓ1-norm improves as one decreases the value of ε. In all three figures, we can clearly identify the log-linear dependence of the MSE on η, which is predicted by the bound we derived in Theorem 4.
C. Convergence rate analysis
Figure 5 shows that the IRLS algorithm for f1(·)-minimization converges exponentially fast to a neighborhood of the fixed-point of the algorithm. Moreover, the larger the SNR, the faster the convergence. These two observations are as predicted by the bound of Theorem 6. Figure 6 shows an alternate depiction of these observations in the log scale. Figure 7 shows that the IRLS algorithm for f1/2(·)-minimization converges super-exponentially fast to a neighborhood of the fixed-point of the algorithm. As observed in Figure 5, the larger the SNR, the faster the convergence. Figure 8 shows an alternate depiction of these observations in the log scale.
VIII. Discussion
In this paper, we provided a rigorous theoretical analysis of various iteratively re-weighted least-squares algorithms which have been proposed in the literature for recovery of sparse signals in the presence of noise [3], [4], [5], [6], [7], [8], [9]. We framed the recovery problem as one of constrained likelihood maximization using EM under Gaussian scale mixture assumptions. On the one hand, we were able to leverage the power of the EM theory to prove convergence of the said IRLS algorithms, and on the other hand, we were able to employ tools from CS theory to prove the stability of these IRLS algorithms and to derive explicit rates of convergence. We supplemented our theoretical analysis with numerical experiments which confirmed our predictions.
The EM interpretation of the IRLS algorithms, along with the derivation of the objective functions maximized by these IRLS algorithms, are novel. The proof of convergence is novel and uses ideas from Zangwill [20] which, in a sense, are more general than those underlying the proof presented by Daubechies et al. [9] in the noiseless case. We have not presented the proof in the most general setting. However, we believe that the key ideas in the proof could be useful in various other settings involving iterative procedures to solve optimization problems. The proof of stability of the algorithms is novel; it relies on various properties of the function fν(·), along with techniques developed by Candès et al. [23]. The analysis of the rates of convergence is novel and makes interesting use of the sparse approximation property [24], along with some of the techniques introduced in [9].
Although we have opted for a fairly theoretical treatment, we would like to emphasize that the beauty of IRLS lies in its simplicity, not in its theoretical properties. Indeed, the simplicity of IRLS alone makes it appealing, especially for those who do not possess formal training in numerical optimization: no doubt, it is easier to implement least-squares, constrained or otherwise, than it is to implement a solver based on barrier or interior-point methods. Our hope is that a firm theoretical understanding of the IRLS algorithms considered here will increase their adoption as a standard framework for sparse approximation.
Appendix A. Continuity of 𝒯(z)
The proof of Theorem 3 relies on the continuity of 𝒯(·) as a map from ℝM into 𝒞. We establish continuity by showing that, for every sequence zn → z as n → ∞, 𝒯(zn) → 𝒯(z). To this end, we show that every convergent subsequence of {𝒯(zn)} converges to 𝒯(z). Since 𝒞 is non-empty, there exists x̄ ∈ 𝒞 such that
(64) |
On the other hand, zn is bounded because it is convergent. This implies that there exists B such that maxi |zni| ≤ B, so that wi(zn) ≥ (B2 + ε2)ν/2−1 > 0 for all i and n. Therefore,
(65) |
so that {𝒯(zn)} is uniformly bounded. Therefore, there exists a convergent subsequence of {𝒯(zn)}. Now, let {𝒯(znk)} be any convergent subsequence, and let x̂ be its limit. By the definition of 𝒯(znk) and results from convex optimization [18], for each nk, 𝒯(znk) is the unique element of 𝒞 satisfying
(66) |
Taking limits and invoking continuity of the inner-product, we obtain
(67) |
Continuity of 𝒯(·) then follows from the fact that 𝒯(z) is the unique element of 𝒞 which satisfies
(68) |
Therefore, x̂ = 𝒯(z), which establishes the continuity of 𝒯(·).
Appendix B. Iteration-dependent choices of ε
Let {ε(ℓ)} be a non-increasing sequence such that limℓ→∞ ε(ℓ) = ε̄ ≥ 0. In this case, the mapping 𝒯 must be replaced by the iteration-dependent mapping 𝒯(ℓ) : ℝM ↦ ℝM:
(69)  𝒯(ℓ)(z) := arg min { Σi=1M wi(ℓ)(z) xi2 : x ∈ 𝒞 }
for all z ∈ ℝM, where
(70)  wi(ℓ)(z) = (zi2 + (ε(ℓ))2)ν/2−1,   i = 1, 2, · · ·, M
The main difference in the proof is in part (ii), where it is shown that x̄ is a fixed point of the appropriate limiting mapping. We consider two cases: 1) ε̄ > 0, and 2) ε̄ = 0.
Case 1
Suppose that ε̄ > 0 and that {x(ℓk)}k is a converging subsequence of the IRLS iterates with the limit x̄. Since the sequence {x(ℓ)} is bounded, there exists an L, such that for all ℓ > L, all the iterates x(ℓ) lie in a bounded and closed ball B0 ⊂ ℝM. Moreover, let
(71)  𝒯ε̄(z) := arg min { Σi=1M (zi2 + ε̄2)ν/2−1 xi2 : x ∈ 𝒞 }
Clearly, 𝒯ε̄(x̄) is bounded (since the true vector x ∈ 𝒞 is bounded). Let B ⊂ ℝM be a closed ball such that B0 ⊆ B and 𝒯ε̄(x̄) ∈ B. Then, we have
(72) |
Now, recall that
(73) |
It is easy to show that the objective function of the ℓk-th weighted least-squares step is uniformly convergent to that of the limiting mapping 𝒯ε̄, i.e., to
(74) |
for all x ∈ B. To see this, note that
(75) |
where Lt denotes the Lipschitz constant of the function (x2 + t2)ν/2−1. Since ε(ℓk), ε̄ > 0, the Lipschitz constants are uniformly bounded. Moreover, since x ∈ B, each xi2 is bounded; hence the uniform convergence follows. Given the uniform convergence, a result from variational analysis (Theorem 7.33 of [27]) establishes that
(76) |
Note that the minimizer of the limiting objective over the convex set 𝒞 ∩ B is unique. Therefore, the above inclusion is in fact an equality. Hence,
(77)  x̄ = 𝒯ε̄(x̄)
by the construction of B. The rest of the proof remains the same by substituting ε̄ for ε.
Case 2
Suppose that ε̄ = 0 and that supp(x̄) =: T ⊆ {1, 2, · · ·, M}. In this case, if T ≠ {1, 2, · · ·, M}, the limit limℓ→∞ 𝒯(ℓ)(z) does not exist. Hence, the proof technique used for Case 1 no longer applies. However, with a careful examination of the limiting behavior of the mapping 𝒯(ℓ)(z), we will show that x̄ is a fixed point of the mapping:
(78) |
for zi ≠ 0 for all i ∈ T. Due to the closedness of 𝒞, x̄ ∈ 𝒞. Hence, the set 𝒞 ∩ {x : xTc = 0} is non-empty. If this set is a singleton {x̄}, then x̄ is clearly the fixed point. If not, then there exists z ∈ 𝒞 ∩ {x : xTc = 0} such that z ≠ x̄. Then, the necessary conditions for each minimization at step ℓk give:
(79) |
First consider the terms over T. We have:
(80) |
Next, consider the terms over Tc:
(81) |
Hence, we have:
(82) |
which is the necessary and sufficient condition for x̄ to be a fixed point of the mapping in Eq. (78) over all z ∈ 𝒞 ∩ {x : xTc = 0}. Similar to the proof of Theorem 3, it can be shown that x̄ satisfies the necessary optimality conditions for the function:
(83) |
over the set 𝒞 ∩ {x : xTc = 0}. This concludes the proof. Note that the case ε̄ = 0 is not favorable in general. Carefully chosen sequences {ε(ℓ)} as in [9], together with the assumption that the s-sparse vector x ∈ 𝒞 is unique, can result in convergence of IRLS to the true s-sparse x. However, for general sequences {ε(ℓ)} with limℓ→∞ ε(ℓ) = 0, this is not necessarily the case.
Contributor Information
Behtash Babadi, Email: behtash@nmr.mgh.harvard.edu.
Demba Ba, Email: demba@mit.edu.
Patrick L. Purdon, Email: patrickp@nmr.mgh.harvard.edu.
Emery N. Brown, Email: enb@neurostat.mit.edu.
References
- 1. Donoho DL. Compressed sensing. IEEE Transactions on Information Theory. 2006 Apr;52:1289–1306.
- 2. Bruckstein A, Donoho D, Elad M. From sparse solutions of systems of equations to sparse modeling of signals and images. SIAM Review. 2009;51(1):34–81.
- 3. Gorodnitsky I, Rao BD. Sparse signal reconstruction from limited data using FOCUSS: a recursive weighted norm minimization algorithm. IEEE Transactions on Signal Processing. 1997;45(3):600–616.
- 4. Donoho DL, Elad M, Temlyakov VN. Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Transactions on Information Theory. 2006 Jan;52:6–18.
- 5. Saab R, Chartrand R, Yilmaz O. Stable sparse approximations via nonconvex optimization. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2008); 2008. pp. 3885–3888.
- 6. Chartrand R, Yin W. Iteratively reweighted algorithms for compressive sensing. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2008); 2008. pp. 3869–3872.
- 7. Carrillo RE, Barner K. Iteratively re-weighted least squares for sparse signal reconstruction from noisy measurements. 43rd Annual Conference on Information Sciences and Systems (CISS 2009); March 2009. pp. 448–453.
- 8. Wang W, Xu W, Tang A. On the performance of sparse recovery via ℓp-minimization. IEEE Transactions on Information Theory. 2011 Nov;57:7255–7278.
- 9. Daubechies I, DeVore R, Fornasier M, Güntürk CS. Iteratively reweighted least squares minimization for sparse recovery. Comm Pure Appl Math. 2010;63(1):1–38.
- 10. Candès EJ, Wakin M, Boyd S. Enhancing sparsity by reweighted ℓ1 minimization. J Fourier Anal Appl. 2008 Dec;14:877–905.
- 11. Lange K, Sinsheimer JS. Normal/independent distributions and their applications in robust regression. Journal of Computational and Graphical Statistics. 1993;2(2):175–198.
- 12. Dempster AP, Laird NM, Rubin DB. Iteratively reweighted least squares for linear regression when errors are normal/independent distributed. In: Krishnaiah PR, editor. Multivariate Analysis V. Elsevier Science Publishers; 1980. pp. 35–57.
- 13. Huber P, Ronchetti E. Robust Statistics. Wiley; 1981.
- 14. Andrews D, Mallows C. Scale mixtures of normal distributions. Journal of the Royal Statistical Society, Series B (Methodological). 1974:99–102.
- 15. Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological). 1977;39(1):1–38.
- 16. Lange K. Optimization. Springer; 2004.
- 17. Golub GH, von Matt U. Quadratically constrained least squares and quadratic problems. Numerische Mathematik. 1992;59(1):561–580.
- 18. Bertsekas DP. Convex Optimization Theory. 1st ed. Athena Scientific; 2009.
- 19. Wu CFJ. On the convergence properties of the EM algorithm. Ann Statist. 1983;11(1):95–103.
- 20. Zangwill WI. Nonlinear Programming: A Unified Approach. Prentice-Hall; 1969.
- 21. Nettleton D. Convergence properties of the EM algorithm in constrained parameter spaces. The Canadian Journal of Statistics/La Revue Canadienne de Statistique. 1999;27(3):639–648.
- 22. Candès EJ, Tao T. Decoding by linear programming. IEEE Trans on Information Theory. 2005 Dec;51:4203–4215.
- 23. Candès EJ, Romberg J, Tao T. Stable signal recovery from incomplete and inaccurate measurements. Commun Pure Appl Math. 2006 Aug;59:1207–1223.
- 24. Sun Q. Sparse approximation property and stable recovery of sparse signals from noisy measurements. IEEE Transactions on Signal Processing. 2011;59(10):5086–5090.
- 25. Grant M, Boyd S. CVX: Matlab software for disciplined convex programming, version 1.21. 2011 Apr; http://www.stanford.edu/~boyd/software.html.
- 26. Grant M, Boyd S. Graph implementations for nonsmooth convex programs. In: Blondel V, Boyd S, Kimura H, editors. Recent Advances in Learning and Control. Lecture Notes in Control and Information Sciences. Springer-Verlag Limited; 2008. pp. 95–110. http://stanford.edu/~boyd/graphdcp.html.
- 27. Rockafellar RT, Wets RJB. Variational Analysis. Springer-Verlag; 1997.