. 2020 Oct 22;85(2):33. doi: 10.1007/s10915-020-01332-8

Variable Smoothing for Convex Optimization Problems Using Stochastic Gradients

Radu Ioan Boţ 1, Axel Böhm 1
PMCID: PMC7581594  PMID: 33122873

Abstract

We aim to solve a structured convex optimization problem, where a nonsmooth function is composed with a linear operator. When opting for full splitting schemes, usually, primal–dual type methods are employed as they are effective and also well studied. However, under the additional assumption of Lipschitz continuity of the nonsmooth function which is composed with the linear operator we can derive novel algorithms through regularization via the Moreau envelope. Furthermore, we tackle large scale problems by means of stochastic oracle calls, very similar to stochastic gradient techniques. Applications to total variational denoising and deblurring, and matrix factorization are provided.

Keywords: Structured convex optimization problem, Variable smoothing algorithm, Convergence rate, Stochastic gradients

Introduction

The problem at hand is the following structured convex optimization problem

$$\min_{x \in H} \; f(x) + g(Kx), \tag{1}$$

for real Hilbert spaces $H$ and $G$, $f: H \to \overline{\mathbb{R}} := \mathbb{R} \cup \{\pm\infty\}$ a proper, convex and lower semicontinuous function, $g: G \to \mathbb{R}$ a, possibly nonsmooth, convex and Lipschitz continuous function, and $K: H \to G$ a linear continuous operator.

Our aim will be to devise an algorithm for solving (1) following the full splitting paradigm (see [5, 6, 8, 9, 15, 17, 29]). In other words, we allow only proximal evaluations for simple nonsmooth functions, but no proximal evaluations for compositions with linear continuous operators, like, for instance, for $g \circ K$.

We will accomplish this by means of a smoothing strategy, which, for the purpose of this paper, means making use of the Moreau-Yosida approximation. The approach can be described as follows: we “smooth” $g$, i.e. we replace it by its Moreau envelope, and solve the resulting optimization problem by an accelerated proximal-gradient algorithm (see [3, 13, 21]). This approach is similar to those in [7, 10, 11, 20, 22], where a convergence rate of $O\left(\frac{\log(k)}{k}\right)$ is proved. However, our techniques (for the deterministic case) resemble more the ones in [28], where an improved rate of $O\left(\frac{1}{k}\right)$ is shown, the most notable differences to our work being that we use a simpler stepsize and treat the stochastic case.

The only other family of methods able to solve problems of type (1) are the so-called primal–dual algorithms, first and foremost the primal–dual hybrid gradient (PDHG) introduced in [15]. In comparison, this method does not need the Lipschitz continuity of $g$ in order to prove convergence. However, in this very general case, convergence rates can only be shown for the so-called restricted primal–dual gap function. In order to derive from here convergence rates for the primal objective function, either Lipschitz continuity of $g$ or finite dimensionality of the problem plus the condition that $g$ must have full domain are necessary (see, for instance, [5, Theorem 9]). This means that for infinite dimensional problems the assumptions required by both PDHG and our method for deriving convergence rates for the primal objective function are in fact equal, but for finite dimensional problems the assumptions of PDHG are weaker. In either case, however, we are able to prove these rates for the sequence of iterates $(x_k)_{k \ge 1}$ itself, whereas PDHG only has them for the sequence of so-called ergodic iterates, i.e. $\left(\frac{1}{k}\sum_{i=1}^{k} x_i\right)_{k \ge 1}$, which is naturally undesirable as the averaging slows the convergence down. Furthermore, we do not show any convergence for the iterates, as such results are notoriously hard to obtain for accelerated methods, whereas PDHG gets these in the strongly convex setting via standard fixed point arguments (see e.g. [29]).

Furthermore, we will also consider the case where only a stochastic oracle of the proximal operator of g is available to us. This setup corresponds e.g. to the case where the objective function is given as

$$\min_{x \in H} \; f(x) + \sum_{i=1}^{m} g_i(K_i x), \tag{2}$$

where, for $i = 1, \dots, m$, $G_i$ are real Hilbert spaces, $g_i: G_i \to \mathbb{R}$ are convex and Lipschitz continuous functions and $K_i: H \to G_i$ are linear continuous operators, but, the number of summands being large, we wish to avoid computing the proximal operators of all $g_i$, $i = 1, \dots, m$, in order to make iterations cheaper to compute.

For the finite sum case (2), there exist algorithms of similar spirit such as those in [14, 24]. Some algorithms do in fact deal with a similar setup of stochastic gradient like evaluations, see [26], but only for smooth terms in the objective function.

In Sect. 2 we will cover the preliminaries about the Moreau-Yosida envelope as well as useful identities and estimates connected to it. In Sect. 3 we will deal with the deterministic case and prove a convergence rate of $O\left(\frac{1}{k}\right)$ for the function values at the iterates. Next up, in Sect. 4, we will consider the stochastic case as described above and prove a convergence rate of $O\left(\frac{\log(k)}{\sqrt{k}}\right)$. Last but not least, we will look at some numerical examples in image processing in Sect. 5.

It is important to note that the proof for the deterministic setting differs considerably from the one for the stochastic setting. The technique for the stochastic setting is less refined in the sense that there is no coupling between the smoothing parameter and the extrapolation parameter. Whereas this technique also works for the deterministic setting, it gives a worse convergence rate of $O\left(\frac{\log(k)}{k}\right)$. The tight coupling of the two sequences of parameters, however, does not work in the proof of the stochastic algorithm, as it does not allow for the particular choice of the smoothing parameters needed there.

Preliminaries

In the main problem (1), the nonsmooth regularizer $g$ is supposed to be Lipschitz continuous. This assumption is necessary to ensure our main convergence results; however, many of the preliminary lemmata of this section hold true similarly if the function is only assumed to be lower semicontinuous. We will point this out in every statement of this section individually.

Definition 2.1

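For a proper, convex and lower semicontinuous function $g: H \to \overline{\mathbb{R}}$, its convex conjugate is denoted by $g^*$ and defined as a function from $H$ to $\overline{\mathbb{R}}$, given by

$$g^*(x) := \sup_{p \in H} \left\{ \langle x, p \rangle - g(p) \right\} \quad \forall x \in H.$$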

As mentioned in the introduction, we want to smooth a nonsmooth function by considering its Moreau envelope. The next definition will clarify exactly what object we are talking about.

Definition 2.2

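For a proper, convex and lower semicontinuous function $g: H \to \overline{\mathbb{R}}$, its Moreau envelope with the parameter $\mu \ge 0$ is defined as a function from $H$ to $\mathbb{R}$, given by

$${}^{\mu}g(\cdot) := \left( g^* + \frac{\mu}{2}\|\cdot\|^2 \right)^{*}(\cdot) = \sup_{p \in H} \left\{ \langle \cdot, p \rangle - g^*(p) - \frac{\mu}{2}\|p\|^2 \right\}.$$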

From this definition, however, it is not completely evident that the Moreau envelope indeed fulfills its purpose in being a smooth representation of the original function. The next lemma will remedy this fact.

Lemma 2.1

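(see [2, Proposition 12.29]) Let $g: H \to \overline{\mathbb{R}}$ be a proper, convex and lower semicontinuous function and $\mu > 0$. Then its Moreau envelope is Fréchet differentiable on $H$. In particular, the gradient itself is given by

$$\nabla({}^{\mu}g)(x) = \frac{1}{\mu}\left( x - \operatorname{prox}_{\mu g}(x) \right) = \operatorname{prox}_{\frac{1}{\mu} g^*}\left( \frac{x}{\mu} \right) \quad \forall x \in H$$

and is $\mu^{-1}$-Lipschitz continuous.

In particular, for all $\mu > 0$, a gradient step with respect to the Moreau envelope corresponds to a proximal step

$$x - \mu \nabla({}^{\mu}g)(x) = \operatorname{prox}_{\mu g}(x) \quad \forall x \in H.$$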

The previous lemma establishes two things. Not only does it clarify the smoothness of the Moreau envelope, but it also gives a way of computing its gradient. Obviously, a smooth representation whose gradient we would not be able to compute would not be any good.
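As a small numerical illustration of Lemma 2.1, the following sketch takes the one-dimensional example $g = |\cdot|$ (our choice, not one singled out in the text), whose Moreau envelope is the Huber function. The helper names are hypothetical. It evaluates the gradient through both formulas of the lemma and recovers the proximal step from a gradient step.

```python
import numpy as np

def prox_abs(x, mu):
    """Proximal map of mu * |.| (soft-thresholding), applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def prox_conj_abs(x):
    """Prox of any positive multiple of |.|^*, the indicator of [-1, 1]: a projection."""
    return np.clip(x, -1.0, 1.0)

def grad_moreau_env_abs(x, mu):
    """Gradient of the Moreau envelope of |.| via (x - prox_{mu g}(x)) / mu."""
    return (x - prox_abs(x, mu)) / mu

x = np.linspace(-3.0, 3.0, 13)
mu = 0.5

grad_primal = grad_moreau_env_abs(x, mu)   # first formula of Lemma 2.1
grad_dual = prox_conj_abs(x / mu)          # second formula, through the conjugate
prox_step = x - mu * grad_primal           # gradient step on the envelope
```

Under these assumptions both gradient formulas coincide, and `prox_step` equals `prox_abs(x, mu)`.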

As mentioned in the introduction, we want to smooth the nonsmooth summand of the objective function which is composed with the linear operator, as this can be considered the crux of problem (1). The function $g \circ K$ will be smoothed by considering instead ${}^{\mu}g \circ K: H \to \mathbb{R}$. Clearly, by the chain rule, this function is continuously differentiable with gradient given for every $x \in H$ by

$$\nabla\left({}^{\mu}g \circ K\right)(x) = K^* \nabla({}^{\mu}g)(Kx) = \frac{1}{\mu} K^*\left( Kx - \operatorname{prox}_{\mu g}(Kx) \right) = K^* \operatorname{prox}_{\frac{1}{\mu} g^*}\left( \frac{Kx}{\mu} \right),$$

and is thus Lipschitz continuous with Lipschitz constant $\frac{\|K\|^2}{\mu}$, where $\|K\|$ denotes the operator norm of $K$.
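The gradient formula and its Lipschitz constant can be checked numerically. The sketch below assumes $g = \|\cdot\|_1$ and a random matrix in place of $K$ (both illustrative choices with hypothetical names). It compares the closed-form gradient with central finite differences of the smoothed function.

```python
import numpy as np

rng = np.random.default_rng(0)
K = rng.standard_normal((5, 3))   # stand-in for the linear operator K
mu = 0.3

def smoothed(x):
    """Value of (mu-envelope of the l1 norm) composed with K."""
    z = K @ x
    p = np.sign(z) * np.maximum(np.abs(z) - mu, 0.0)   # prox_{mu g}(Kx)
    return np.sum(np.abs(p)) + np.sum((z - p) ** 2) / (2.0 * mu)

def grad_smoothed(x):
    """K^T prox_{(1/mu) g^*}(Kx / mu); for the l1 norm this prox is a clip to [-1, 1]."""
    return K.T @ np.clip(K @ x / mu, -1.0, 1.0)

x, y = rng.standard_normal(3), rng.standard_normal(3)

# Central finite differences as an independent check of the gradient formula.
h = 1e-6
fd_grad = np.array([(smoothed(x + h * e) - smoothed(x - h * e)) / (2.0 * h)
                    for e in np.eye(3)])

lipschitz_bound = np.linalg.norm(K, 2) ** 2 / mu
ratio = np.linalg.norm(grad_smoothed(x) - grad_smoothed(y)) / np.linalg.norm(x - y)
```

Here `ratio` is one sampled difference quotient of the gradient and should not exceed `lipschitz_bound`.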

Lipschitz continuity will play an integral role in our investigations, as can be seen by the following lemmas.

Lemma 2.2

(see [4, Proposition 4.4.6]) Let $g: H \to \mathbb{R}$ be a convex and $L_g$-Lipschitz continuous function. Then, the domain of its Fenchel conjugate is bounded, i.e.

$$\operatorname{dom} g^* \subseteq B(0, L_g),$$

where $B(0, L_g)$ denotes the closed ball with radius $L_g$ around the origin.

The Moreau envelope even preserves the Lipschitzness of the original function.

Lemma 2.3

(see [18, Lemma 2.1]) Let $g: H \to \mathbb{R}$ be a convex and $L_g$-Lipschitz continuous function. Then its Moreau envelope ${}^{\mu}g$ is $L_g$-Lipschitz as well, i.e.

$$\left| {}^{\mu}g(x) - {}^{\mu}g(y) \right| \le L_g \|x - y\| \quad \forall x, y \in H.$$

Proof

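We observe that for all $x \in H$

$$\nabla {}^{\mu}g(x) \in \partial g\left( \operatorname{prox}_{\mu g}(x) \right).$$

Therefore we can bound the gradient norm

$$\left\| \nabla {}^{\mu}g(x) \right\| \le \sup\left\{ \|v\| : y \in H,\ v \in \partial g(y) \right\} \le L_g \quad \forall x \in H, \tag{3}$$

where we used in the last step the Lipschitz continuity of $g$. The statement follows from the mean-value theorem.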

The following lemmata are not new, but we provide proofs anyways in order to remain self-contained and to shed insight on how to use the Moreau envelope for the interested reader.

Lemma 2.4

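(see [28, Lemma 10 (a)]) Let $g: H \to \overline{\mathbb{R}}$ be proper, convex and lower semicontinuous. The maximizing argument in the definition of the Moreau-Yosida envelope is given by its gradient, i.e. for $\mu > 0$ it holds that

$$\operatorname*{arg\,max}_{p \in H} \left\{ \langle \cdot, p \rangle - g^*(p) - \frac{\mu}{2}\|p\|^2 \right\} = \nabla {}^{\mu}g(\cdot).$$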

Proof

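Let $x \in H$ be fixed. It holds

$$\begin{aligned}
\operatorname*{arg\,max}_{p \in H} \left\{ \langle x, p \rangle - g^*(p) - \frac{\mu}{2}\|p\|^2 \right\}
&= \operatorname*{arg\,max}_{p \in H} \left\{ -\frac{1}{2\mu}\|x\|^2 + \langle x, p \rangle - \frac{\mu}{2}\|p\|^2 - g^*(p) \right\} \\
&= \operatorname*{arg\,max}_{p \in H} \left\{ -\frac{\mu}{2}\left\| \frac{x}{\mu} - p \right\|^2 - g^*(p) \right\} \\
&= \operatorname*{arg\,min}_{p \in H} \left\{ g^*(p) + \frac{\mu}{2}\left\| \frac{x}{\mu} - p \right\|^2 \right\}
= \operatorname{prox}_{\frac{1}{\mu} g^*}\left( \frac{x}{\mu} \right)
\end{aligned}$$

and the conclusion follows by using Lemma 2.1.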

Lemma 2.5

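(see [28, Lemma 10 (a)]) For a proper, convex and lower semicontinuous function $g: H \to \overline{\mathbb{R}}$ and every $x \in H$ we can consider the mapping from $(0, +\infty)$ to $\mathbb{R}$ given by

$$\mu \mapsto {}^{\mu}g(x). \tag{4}$$

This mapping is convex and differentiable and its derivative is given by

$$\frac{\partial}{\partial \mu}\, {}^{\mu}g(x) = -\frac{1}{2}\left\| \nabla {}^{\mu}g(x) \right\|^2 \quad \forall x \in H\ \forall \mu \in (0, +\infty).$$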

Proof

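Let $x \in H$ be fixed. From the definition of the Moreau-Yosida envelope we can see that the mapping given in (4) is a pointwise supremum of functions which are linear in $\mu$. It is therefore convex. Furthermore, since the objective function is strongly concave, this supremum is uniquely attained at $\nabla {}^{\mu}g(x) = \operatorname*{arg\,max}_{p \in H} \left\{ \langle x, p \rangle - g^*(p) - \frac{\mu}{2}\|p\|^2 \right\}$. According to the Danskin Theorem, the function $\mu \mapsto {}^{\mu}g(x)$ is differentiable and its derivative is given by

$$\frac{\partial}{\partial \mu}\, {}^{\mu}g(x) = \frac{\partial}{\partial \mu} \sup_{p \in H} \left\{ \langle x, p \rangle - g^*(p) - \frac{\mu}{2}\|p\|^2 \right\} = -\frac{1}{2}\left\| \nabla {}^{\mu}g(x) \right\|^2 \quad \forall \mu \in (0, +\infty).$$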
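The derivative formula of Lemma 2.5 can be sanity-checked on the scalar example $g = |\cdot|$, whose envelope is the Huber function. This is a sketch with hypothetical helper names, comparing a central finite difference in $\mu$ with $-\frac{1}{2}\|\nabla {}^{\mu}g(x)\|^2$.

```python
import numpy as np

def huber(x, mu):
    """Moreau envelope of |.| with parameter mu > 0."""
    return np.where(np.abs(x) <= mu, x ** 2 / (2.0 * mu), np.abs(x) - mu / 2.0)

def grad_huber(x, mu):
    """Gradient of the envelope with respect to x."""
    return np.clip(x / mu, -1.0, 1.0)

x = np.array([-2.0, -0.3, 0.1, 0.8, 1.5])   # sample points away from |x| = mu
mu, h = 0.5, 1e-6

# Finite difference of mu -> huber(x, mu) versus the closed form of Lemma 2.5.
fd_derivative = (huber(x, mu + h) - huber(x, mu - h)) / (2.0 * h)
formula = -0.5 * grad_huber(x, mu) ** 2
```

The sample points are deliberately kept away from the kink at $|x| = \mu$ so that the finite difference stays within a single smooth branch.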

Lemma 2.6

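(see [28, Lemma 10 (b)]) Let $g: H \to \overline{\mathbb{R}}$ be proper, convex and lower semicontinuous. For $\mu_1, \mu_2 > 0$ and every $x \in H$ it holds

$${}^{\mu_1}g(x) \le {}^{\mu_2}g(x) + (\mu_2 - \mu_1)\frac{1}{2}\left\| \nabla {}^{\mu_1}g(x) \right\|^2. \tag{5}$$

If $g$ is additionally $L_g$-Lipschitz and if $\mu_2 \ge \mu_1 > 0$, then

$${}^{\mu_2}g(x) \le {}^{\mu_1}g(x) \le {}^{\mu_2}g(x) + (\mu_2 - \mu_1)\frac{L_g^2}{2}. \tag{6}$$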

Proof

Let $x \in H$ be fixed. Via Lemma 2.5 we know that the map $\mu \mapsto {}^{\mu}g(x)$ is convex and differentiable. We can therefore use the gradient inequality to deduce that

$${}^{\mu_2}g(x) \ge {}^{\mu_1}g(x) + (\mu_2 - \mu_1) \left. \frac{\partial}{\partial \mu}\, {}^{\mu}g(x) \right|_{\mu = \mu_1} = {}^{\mu_1}g(x) - (\mu_2 - \mu_1)\frac{1}{2}\left\| \nabla {}^{\mu_1}g(x) \right\|^2,$$

which is exactly the first statement of the lemma. The first inequality of (6) follows directly from the definition of the Moreau envelope and the second one from (5) and (3).

By applying a limiting argument it is easy to see that (6) implies that for any $\mu > 0$

$${}^{\mu}g(x) \le g(x) \le {}^{\mu}g(x) + \mu \frac{L_g^2}{2}, \tag{7}$$

which shows that the Moreau envelope is always a lower approximation of the original function.
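For a concrete feel of (6) and (7), the sketch below again uses $g = |\cdot|$ with $L_g = 1$ (an illustrative choice; the function name is hypothetical). It measures how far the envelope sits below $g$, and how two envelopes with different parameters compare.

```python
import numpy as np

def huber(x, mu):
    """Moreau envelope of |.| with parameter mu > 0."""
    return np.where(np.abs(x) <= mu, x ** 2 / (2.0 * mu), np.abs(x) - mu / 2.0)

x = np.linspace(-4.0, 4.0, 81)
L_g = 1.0
mu_small, mu_large = 0.2, 1.5

# Estimate (7): 0 <= g - envelope <= mu * L_g^2 / 2.
approximation_gap = np.abs(x) - huber(x, mu_large)

# Estimate (6): 0 <= envelope(mu_small) - envelope(mu_large) <= (mu_large - mu_small) * L_g^2 / 2.
envelope_difference = huber(x, mu_small) - huber(x, mu_large)
```

In this example the upper bounds are attained for $|x| \ge \mu$, which suggests that the constants in (6) and (7) cannot be improved in general.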

Lemma 2.7

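(see [28, Lemma 10 (c)]) Let $g: H \to \overline{\mathbb{R}}$ be proper, convex and lower semicontinuous. Then, for $\mu > 0$ and every $x, y \in H$ we have that

$${}^{\mu}g(x) + \langle \nabla {}^{\mu}g(x), y - x \rangle \le g(y) - \frac{\mu}{2}\left\| \nabla {}^{\mu}g(x) \right\|^2.$$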

Proof

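Using Lemma 2.4 and the definition of the Moreau-Yosida envelope we get that

$$\begin{aligned}
{}^{\mu}g(x) + \langle \nabla {}^{\mu}g(x), y - x \rangle
&= \langle x, \nabla {}^{\mu}g(x) \rangle - g^*\left( \nabla {}^{\mu}g(x) \right) - \frac{\mu}{2}\left\| \nabla {}^{\mu}g(x) \right\|^2 + \langle \nabla {}^{\mu}g(x), y - x \rangle \\
&= \langle \nabla {}^{\mu}g(x), y \rangle - g^*\left( \nabla {}^{\mu}g(x) \right) - \frac{\mu}{2}\left\| \nabla {}^{\mu}g(x) \right\|^2 \\
&\le \sup_{p \in H} \left\{ \langle p, y \rangle - g^*(p) \right\} - \frac{\mu}{2}\left\| \nabla {}^{\mu}g(x) \right\|^2
= g(y) - \frac{\mu}{2}\left\| \nabla {}^{\mu}g(x) \right\|^2.
\end{aligned}$$

In the convergence proof of Lemma 3.3 we will need the inequality in the above lemma at the points $Kx$ and $Ky$, namely

$$\begin{aligned}
g(Ky) - \frac{\mu}{2}\left\| \nabla {}^{\mu}g(Kx) \right\|^2
&\ge {}^{\mu}g(Kx) + \langle \nabla {}^{\mu}g(Kx), Ky - Kx \rangle \\
&= {}^{\mu}g(Kx) + \langle K^* \nabla {}^{\mu}g(Kx), y - x \rangle \\
&= {}^{\mu}g(Kx) + \langle \nabla({}^{\mu}g \circ K)(x), y - x \rangle \quad \forall x, y \in H.
\end{aligned} \tag{8}$$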

The following lemma is a standard result for convex and Fréchet differentiable functions.

Lemma 2.8

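(see [23]) For a convex and Fréchet differentiable function $h: H \to \mathbb{R}$ with $L_h$-Lipschitz continuous gradient we have that

$$h(x) + \langle \nabla h(x), y - x \rangle \le h(y) - \frac{1}{2L_h}\left\| \nabla h(x) - \nabla h(y) \right\|^2 \quad \forall x, y \in H.$$

By applying Lemma 2.8 with ${}^{\mu}g$, $Kx$ and $Ky$ instead of $h$, $x$ and $y$ respectively, we obtain

$${}^{\mu}g(Kx) + \langle \nabla({}^{\mu}g \circ K)(x), y - x \rangle \le {}^{\mu}g(Ky) - \frac{\mu}{2}\left\| \nabla {}^{\mu}g(Kx) - \nabla {}^{\mu}g(Ky) \right\|^2 \quad \forall x, y \in H. \tag{9}$$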

The following technical result will be used in the proof of the convergence statement.

Lemma 2.9

For $\alpha \in (0, 1)$ and every $x, y \in H$ we have that

$$(1 - \alpha)\|x - y\|^2 + \alpha\|y\|^2 \ge \alpha(1 - \alpha)\|x\|^2.$$
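Lemma 2.9 is stated without proof. One possible short verification, offered here as a sketch rather than as the authors' argument, is to expand the difference of both sides and complete the square:

```latex
\begin{aligned}
(1-\alpha)\|x-y\|^2 + \alpha\|y\|^2 - \alpha(1-\alpha)\|x\|^2
&= (1-\alpha)^2\|x\|^2 - 2(1-\alpha)\langle x, y\rangle + \|y\|^2 \\
&= \left\| y - (1-\alpha)x \right\|^2 \;\ge\; 0 .
\end{aligned}
```

Equality holds exactly when $y = (1-\alpha)x$.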

Deterministic Method

Problem 3.1

The problem at hand reads

$$\min_{x \in H} F(x) := f(x) + g(Kx),$$

for a proper, convex and lower semicontinuous function $f: H \to \overline{\mathbb{R}}$, a convex and $L_g$-Lipschitz continuous ($L_g > 0$) function $g: G \to \mathbb{R}$, and a nonzero linear continuous operator $K: H \to G$.

The idea of the algorithm which we propose to solve (1) is to smooth g and then to solve the resulting problem by means of an accelerated proximal-gradient method.

Algorithm 3.1

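(Variable Accelerated SmooThing (VAST)) Let $y_0 = x_0 \in H$, $(\mu_k)_{k \ge 0} \subseteq (0, +\infty)$, and $(t_k)_{k \ge 1}$ a sequence of real numbers with $t_1 = 1$ and $t_k \ge 1$ for every $k \ge 2$. Consider the following iterative scheme

$$(\forall k \ge 1) \quad \left\lfloor \begin{array}{l}
L_k = \dfrac{\|K\|^2}{\mu_k} \\[2mm]
\gamma_k = \dfrac{1}{L_k} \\[2mm]
x_k = \operatorname{prox}_{\gamma_k f}\left( y_{k-1} - \gamma_k K^* \operatorname{prox}_{\frac{1}{\mu_k} g^*}\left( \dfrac{K y_{k-1}}{\mu_k} \right) \right) \\[2mm]
y_k = x_k + \dfrac{t_k - 1}{t_{k+1}}(x_k - x_{k-1}).
\end{array} \right.$$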

Remark 3.1

The assumption $t_1 = 1$ can be removed, but it guarantees easier computation and is also in line with classical choices of $(t_k)_{k \ge 1}$ in [13, 21].

Remark 3.2

The sequence $(u_k)_{k \ge 1}$ given by

$$u_k := x_{k-1} + t_k(x_k - x_{k-1}) \quad \forall k \ge 1,$$

despite not appearing in the algorithm, will feature a prominent role in the convergence proof. Due to the convention $t_1 = 1$ we have that

$$u_1 := x_0 + t_1(x_1 - x_0) = x_1.$$

We also denote

$$F^k = f + {}^{\mu_k}g \circ K \quad \forall k \ge 0.$$

The next theorem is the main result of this section and it will play a fundamental role when proving a convergence rate of $O\left(\frac{1}{k}\right)$ for the sequence $(F(x_k))_{k \ge 0}$.

Theorem 3.1

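Consider the setup of Problem 3.1 and let $(x_k)_{k \ge 0}$ and $(y_k)_{k \ge 0}$ be the sequences generated by Algorithm 3.1. Assume that for every $k \ge 1$

$$\mu_k - \mu_{k+1} - \frac{\mu_{k+1}}{t_{k+1}} \le 0$$

and

$$\left( 1 - \frac{1}{t_{k+1}} \right)\gamma_{k+1} t_{k+1}^2 = \gamma_k t_k^2.$$

Then, for every optimal solution $x^*$ of Problem 3.1, it holds

$$F(x_N) - F(x^*) \le \frac{\|x_0 - x^*\|^2}{2\gamma_N t_N^2} + \mu_N \frac{L_g^2}{2} \quad \forall N \ge 1.$$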

The proof of this result relies on several partial results which we will prove as follows.

Lemma 3.1

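The following statement holds for every $z \in H$ and every $k \ge 0$

$$F^{k+1}(x_{k+1}) + \frac{1}{2\gamma_{k+1}}\|x_{k+1} - z\|^2 \le f(z) + {}^{\mu_{k+1}}g(Ky_k) + \langle \nabla({}^{\mu_{k+1}}g \circ K)(y_k), z - y_k \rangle + \frac{1}{2\gamma_{k+1}}\|z - y_k\|^2.$$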

Proof

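Let $k \ge 0$ be fixed. Since, by the definition of the proximal map, $x_{k+1}$ is the minimizer of a $\frac{1}{\gamma_{k+1}}$-strongly convex function, we know that for every $z \in H$

$$\begin{aligned}
& f(x_{k+1}) + {}^{\mu_{k+1}}g(Ky_k) + \langle \nabla({}^{\mu_{k+1}}g \circ K)(y_k), x_{k+1} - y_k \rangle + \frac{1}{2\gamma_{k+1}}\|x_{k+1} - y_k\|^2 + \frac{1}{2\gamma_{k+1}}\|x_{k+1} - z\|^2 \\
&\quad \le f(z) + {}^{\mu_{k+1}}g(Ky_k) + \langle \nabla({}^{\mu_{k+1}}g \circ K)(y_k), z - y_k \rangle + \frac{1}{2\gamma_{k+1}}\|z - y_k\|^2.
\end{aligned}$$

Next we use the $L_{k+1}$-smoothness of ${}^{\mu_{k+1}}g \circ K$ and the fact that $\frac{1}{\gamma_{k+1}} = L_{k+1}$ to deduce

$$f(x_{k+1}) + {}^{\mu_{k+1}}g(Kx_{k+1}) + \frac{1}{2\gamma_{k+1}}\|x_{k+1} - z\|^2 \le f(z) + {}^{\mu_{k+1}}g(Ky_k) + \langle \nabla({}^{\mu_{k+1}}g \circ K)(y_k), z - y_k \rangle + \frac{1}{2\gamma_{k+1}}\|z - y_k\|^2.$$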

Lemma 3.2

Let $x^*$ be an optimal solution of Problem 3.1. Then it holds

$$\gamma_1\left( F^1(x_1) - F(x^*) \right) + \frac{1}{2}\|u_1 - x^*\|^2 \le \frac{1}{2}\|x^* - x_0\|^2.$$

Proof

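We use the gradient inequality to deduce that for every $z \in H$ and every $k \ge 0$

$${}^{\mu_{k+1}}g(Ky_k) + \langle \nabla({}^{\mu_{k+1}}g \circ K)(y_k), z - y_k \rangle \le {}^{\mu_{k+1}}g(Kz) \le g(Kz)$$

and plug this into the statement of Lemma 3.1 to conclude that

$$F^{k+1}(x_{k+1}) + \frac{1}{2\gamma_{k+1}}\|x_{k+1} - z\|^2 \le F(z) + \frac{1}{2\gamma_{k+1}}\|z - y_k\|^2.$$

For $k = 0$ and $z = x^*$ we get that

$$F^1(x_1) + \frac{1}{2\gamma_1}\|x_1 - x^*\|^2 \le F(x^*) + \frac{1}{2\gamma_1}\|x^* - y_0\|^2.$$

Now we use the fact that $u_1 = x_1$ and $y_0 = x_0$ to obtain the conclusion.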

Lemma 3.3

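Let $x^*$ be an optimal solution of Problem 3.1. The following descent-type inequality holds for every $k \ge 0$

$$\begin{aligned}
F^{k+1}(x_{k+1}) - F(x^*) + \frac{\|u_{k+1} - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2}
\le{} & \left( 1 - \frac{1}{t_{k+1}} \right)\left( F^k(x_k) - F(x^*) \right) + \frac{\|u_k - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2} \\
& + \left( 1 - \frac{1}{t_{k+1}} \right)\left( \mu_k - \mu_{k+1} - \frac{\mu_{k+1}}{t_{k+1}} \right)\frac{1}{2}\left\| \nabla {}^{\mu_{k+1}}g(Kx_k) \right\|^2.
\end{aligned}$$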

Proof

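Let $k \ge 0$ be fixed. We apply Lemma 3.1 with $z := \left( 1 - \frac{1}{t_{k+1}} \right)x_k + \frac{1}{t_{k+1}}x^*$ to deduce that

$$\begin{aligned}
F^{k+1}(x_{k+1}) + \frac{\|u_{k+1} - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2}
\le{} & f\left( \left( 1 - \frac{1}{t_{k+1}} \right)x_k + \frac{1}{t_{k+1}}x^* \right) + {}^{\mu_{k+1}}g(Ky_k) \\
& + \left( 1 - \frac{1}{t_{k+1}} \right)\langle \nabla({}^{\mu_{k+1}}g \circ K)(y_k), x_k - y_k \rangle + \frac{1}{t_{k+1}}\langle \nabla({}^{\mu_{k+1}}g \circ K)(y_k), x^* - y_k \rangle \\
& + \frac{1}{2\gamma_{k+1} t_{k+1}^2}\|u_k - x^*\|^2.
\end{aligned}$$

Using the convexity of $f$ gives

$$\begin{aligned}
F^{k+1}(x_{k+1}) + \frac{\|u_{k+1} - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2}
\le{} & \left( 1 - \frac{1}{t_{k+1}} \right)f(x_k) + \frac{1}{t_{k+1}}f(x^*) \\
& + \left( 1 - \frac{1}{t_{k+1}} \right){}^{\mu_{k+1}}g(Ky_k) + \left( 1 - \frac{1}{t_{k+1}} \right)\langle \nabla({}^{\mu_{k+1}}g \circ K)(y_k), x_k - y_k \rangle \\
& + \frac{1}{t_{k+1}}{}^{\mu_{k+1}}g(Ky_k) + \frac{1}{t_{k+1}}\langle \nabla({}^{\mu_{k+1}}g \circ K)(y_k), x^* - y_k \rangle + \frac{\|u_k - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2}.
\end{aligned} \tag{10}$$

Now, we use (8) to deduce that

$$\frac{1}{t_{k+1}}{}^{\mu_{k+1}}g(Ky_k) + \frac{1}{t_{k+1}}\langle \nabla({}^{\mu_{k+1}}g \circ K)(y_k), x^* - y_k \rangle \le \frac{1}{t_{k+1}}g(Kx^*) - \frac{1}{t_{k+1}}\frac{\mu_{k+1}}{2}\left\| \nabla {}^{\mu_{k+1}}g(Ky_k) \right\|^2 \tag{11}$$

and (9) to conclude that

$$\begin{aligned}
& \left( 1 - \frac{1}{t_{k+1}} \right){}^{\mu_{k+1}}g(Ky_k) + \left( 1 - \frac{1}{t_{k+1}} \right)\langle \nabla({}^{\mu_{k+1}}g \circ K)(y_k), x_k - y_k \rangle \\
&\quad \le \left( 1 - \frac{1}{t_{k+1}} \right){}^{\mu_{k+1}}g(Kx_k) - \left( 1 - \frac{1}{t_{k+1}} \right)\frac{\mu_{k+1}}{2}\left\| \nabla {}^{\mu_{k+1}}g(Ky_k) - \nabla {}^{\mu_{k+1}}g(Kx_k) \right\|^2.
\end{aligned} \tag{12}$$

Combining (10), (11) and (12) gives

$$\begin{aligned}
F^{k+1}(x_{k+1}) + \frac{\|u_{k+1} - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2}
\le{} & \left( 1 - \frac{1}{t_{k+1}} \right){}^{\mu_{k+1}}g(Kx_k) + \left( 1 - \frac{1}{t_{k+1}} \right)f(x_k) + \frac{1}{t_{k+1}}g(Kx^*) + \frac{1}{t_{k+1}}f(x^*) \\
& - \left( 1 - \frac{1}{t_{k+1}} \right)\frac{\mu_{k+1}}{2}\left\| \nabla {}^{\mu_{k+1}}g(Ky_k) - \nabla {}^{\mu_{k+1}}g(Kx_k) \right\|^2 \\
& - \frac{1}{t_{k+1}}\frac{\mu_{k+1}}{2}\left\| \nabla {}^{\mu_{k+1}}g(Ky_k) \right\|^2 + \frac{\|u_k - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2}.
\end{aligned}$$

The first term on the right hand side is ${}^{\mu_{k+1}}g(Kx_k)$ but we would like it to be ${}^{\mu_k}g(Kx_k)$. Therefore we use Lemma 2.6 to deduce that

$$\begin{aligned}
F^{k+1}(x_{k+1}) + \frac{\|u_{k+1} - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2}
\le{} & \left( 1 - \frac{1}{t_{k+1}} \right){}^{\mu_k}g(Kx_k) + \left( 1 - \frac{1}{t_{k+1}} \right)f(x_k) + \frac{1}{t_{k+1}}g(Kx^*) + \frac{1}{t_{k+1}}f(x^*) \\
& + \left( 1 - \frac{1}{t_{k+1}} \right)(\mu_k - \mu_{k+1})\frac{1}{2}\left\| \nabla {}^{\mu_{k+1}}g(Kx_k) \right\|^2 \\
& - \left( 1 - \frac{1}{t_{k+1}} \right)\frac{\mu_{k+1}}{2}\left\| \nabla {}^{\mu_{k+1}}g(Ky_k) - \nabla {}^{\mu_{k+1}}g(Kx_k) \right\|^2 \\
& - \frac{1}{t_{k+1}}\frac{\mu_{k+1}}{2}\left\| \nabla {}^{\mu_{k+1}}g(Ky_k) \right\|^2 + \frac{\|u_k - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2}.
\end{aligned} \tag{13}$$

Next up we want to estimate all the norms of gradients by using Lemma 2.9, which says that

$$\left( 1 - \frac{1}{t_{k+1}} \right)\left\| \nabla {}^{\mu_{k+1}}g(Ky_k) - \nabla {}^{\mu_{k+1}}g(Kx_k) \right\|^2 + \frac{1}{t_{k+1}}\left\| \nabla {}^{\mu_{k+1}}g(Ky_k) \right\|^2 \ge \left( 1 - \frac{1}{t_{k+1}} \right)\frac{1}{t_{k+1}}\left\| \nabla {}^{\mu_{k+1}}g(Kx_k) \right\|^2. \tag{14}$$

Combining (13) and (14) gives

$$\begin{aligned}
F^{k+1}(x_{k+1}) + \frac{\|u_{k+1} - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2}
\le{} & \left( 1 - \frac{1}{t_{k+1}} \right){}^{\mu_k}g(Kx_k) + \left( 1 - \frac{1}{t_{k+1}} \right)f(x_k) + \frac{1}{t_{k+1}}g(Kx^*) + \frac{1}{t_{k+1}}f(x^*) \\
& + \left( 1 - \frac{1}{t_{k+1}} \right)(\mu_k - \mu_{k+1})\frac{1}{2}\left\| \nabla {}^{\mu_{k+1}}g(Kx_k) \right\|^2 \\
& - \frac{\mu_{k+1}}{2}\left( 1 - \frac{1}{t_{k+1}} \right)\frac{1}{t_{k+1}}\left\| \nabla {}^{\mu_{k+1}}g(Kx_k) \right\|^2 + \frac{\|u_k - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2}.
\end{aligned}$$

Now we combine the two terms containing $\left\| \nabla {}^{\mu_{k+1}}g(Kx_k) \right\|^2$ and get that

$$\begin{aligned}
F^{k+1}(x_{k+1}) + \frac{\|u_{k+1} - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2}
\le{} & \left( 1 - \frac{1}{t_{k+1}} \right){}^{\mu_k}g(Kx_k) + \left( 1 - \frac{1}{t_{k+1}} \right)f(x_k) + \frac{1}{t_{k+1}}g(Kx^*) + \frac{1}{t_{k+1}}f(x^*) \\
& + \frac{\|u_k - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2} + \left( 1 - \frac{1}{t_{k+1}} \right)\left( \mu_k - \mu_{k+1} - \frac{\mu_{k+1}}{t_{k+1}} \right)\frac{1}{2}\left\| \nabla {}^{\mu_{k+1}}g(Kx_k) \right\|^2.
\end{aligned}$$

By subtracting $F(x^*) = f(x^*) + g(Kx^*)$ on both sides we finally obtain

$$\begin{aligned}
F^{k+1}(x_{k+1}) - F(x^*) + \frac{\|u_{k+1} - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2}
\le{} & \left( 1 - \frac{1}{t_{k+1}} \right)\left( F^k(x_k) - F(x^*) \right) + \frac{\|u_k - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2} \\
& + \left( 1 - \frac{1}{t_{k+1}} \right)\left( \mu_k - \mu_{k+1} - \frac{\mu_{k+1}}{t_{k+1}} \right)\frac{1}{2}\left\| \nabla {}^{\mu_{k+1}}g(Kx_k) \right\|^2.
\end{aligned}$$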

Now we are in the position to prove Theorem 3.1.

Proof of Theorem 3.1

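We start with the statement of Lemma 3.3 and use the assumption that

$$\mu_k - \mu_{k+1} - \frac{\mu_{k+1}}{t_{k+1}} \le 0$$

to make the last term in the inequality disappear for every $k \ge 0$

$$F^{k+1}(x_{k+1}) - F(x^*) + \frac{\|u_{k+1} - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2} \le \left( 1 - \frac{1}{t_{k+1}} \right)\left( F^k(x_k) - F(x^*) \right) + \frac{\|u_k - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2}.$$

Now we use the assumption that

$$\left( 1 - \frac{1}{t_{k+1}} \right)\gamma_{k+1} t_{k+1}^2 = \gamma_k t_k^2$$

to get that for every $k \ge 0$

$$\gamma_{k+1} t_{k+1}^2\left( F^{k+1}(x_{k+1}) - F(x^*) \right) + \frac{\|u_{k+1} - x^*\|^2}{2} \le \gamma_k t_k^2\left( F^k(x_k) - F(x^*) \right) + \frac{\|u_k - x^*\|^2}{2}. \tag{15}$$

Let $N \ge 2$. Summing (15) from $k = 1$ to $N - 1$ and getting rid of the nonnegative term $\|u_N - x^*\|^2$ gives

$$\gamma_N t_N^2\left( F^N(x_N) - F(x^*) \right) \le \gamma_1\left( F^1(x_1) - F(x^*) \right) + \frac{\|u_1 - x^*\|^2}{2} \quad \forall N \ge 2.$$

Since $t_1 = 1$, the above inequality is fulfilled also for $N = 1$. Using Lemma 3.2 shows that

$$F^N(x_N) - F(x^*) \le \frac{\|x_0 - x^*\|^2}{2\gamma_N t_N^2} \quad \forall N \ge 1.$$

The above inequality, however, is still in terms of the smoothed objective function. In order to go to the actual objective function we apply (7) and deduce that

$$F(x_N) - F(x^*) \le F^N(x_N) - F(x^*) + \mu_N \frac{L_g^2}{2} \le \frac{\|x_0 - x^*\|^2}{2\gamma_N t_N^2} + \mu_N \frac{L_g^2}{2} \quad \forall N \ge 1.$$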

Corollary 3.1

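By choosing the parameters $(\mu_k)_{k \ge 1}$, $(t_k)_{k \ge 1}$, $(\gamma_k)_{k \ge 1}$ in the following way,

$$t_1 = 1, \quad \mu_1 = b\|K\|^2, \quad \text{for } b > 0,$$

and for every $k \ge 1$

$$t_{k+1} := \sqrt{t_k^2 + 2t_k}, \quad \mu_{k+1} := \mu_k \frac{t_k^2}{t_{k+1}^2 - t_{k+1}}, \quad \gamma_k := \frac{\mu_k}{\|K\|^2}, \tag{16}$$

they fulfill

$$\mu_k - \mu_{k+1} - \frac{\mu_{k+1}}{t_{k+1}} \le 0 \tag{17}$$

and

$$\left( 1 - \frac{1}{t_{k+1}} \right)\gamma_{k+1} t_{k+1}^2 = \gamma_k t_k^2. \tag{18}$$

For this choice of the parameters we have that

$$F(x_N) - F(x^*) \le \frac{\|x_0 - x^*\|^2}{b(N + 1)} + \frac{b L_g^2 \|K\|^2}{N + 1}\exp\left( \frac{4\pi^2}{6} \right) \quad \forall N \ge 1.$$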

Proof

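Since $\gamma_k$ and $\mu_k$ are a scalar multiple of each other, (18) is equivalent to

$$\left( 1 - \frac{1}{t_{k+1}} \right)\mu_{k+1} t_{k+1}^2 = \mu_k t_k^2 \quad \forall k \ge 1$$

and further to (by taking into account that $t_{k+1} > 1$ for every $k \ge 1$)

$$\mu_{k+1} = \mu_k \frac{t_k^2}{t_{k+1}^2}\frac{t_{k+1}}{t_{k+1} - 1} = \mu_k \frac{t_k^2}{t_{k+1}^2 - t_{k+1}} \quad \forall k \ge 1. \tag{19}$$

Our update choice in (16) for the sequence $(\mu_k)_{k \ge 1}$ is chosen exactly in such a way that it satisfies this. Plugging (19) into (17) gives for every $k \ge 1$ the condition

$$1 \le \left( 1 + \frac{1}{t_{k+1}} \right)\frac{t_k^2}{t_{k+1}^2}\frac{t_{k+1}}{t_{k+1} - 1} = \frac{t_k^2}{t_{k+1}^2}\frac{t_{k+1} + 1}{t_{k+1} - 1},$$

which is equivalent to

$$0 \ge t_{k+1}^3 - t_{k+1}^2 - t_k^2 t_{k+1} - t_k^2$$

and further to

$$t_{k+1}^2 + t_k^2 \ge t_{k+1}\left( t_{k+1}^2 - t_k^2 \right).$$

Plugging in $t_{k+1} = \sqrt{t_k^2 + 2t_k}$ we get that this is equivalent to

$$t_{k+1}^2 + t_k^2 \ge 2 t_{k+1} t_k \quad \forall k \ge 1,$$

which is evidently fulfilled. Thus, the choices in (16) are indeed feasible for our algorithm.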
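The feasibility argument can also be checked numerically. The following sketch generates the sequences from (16) for arbitrarily chosen values of $b$ and $\|K\|$ (the function name and defaults are ours). It exposes the quantities entering (17) and (18); list index `k` corresponds to the paper's index `k + 1`.

```python
import math

def vast_parameters(n, b=1.0, norm_K=1.0):
    """Return lists (t, mu, gamma) holding t_k, mu_k, gamma_k for k = 1, ..., n as in (16)."""
    t = [1.0]
    mu = [b * norm_K ** 2]
    for _ in range(n - 1):
        t_next = math.sqrt(t[-1] ** 2 + 2.0 * t[-1])
        mu.append(mu[-1] * t[-1] ** 2 / (t_next ** 2 - t_next))
        t.append(t_next)
    gamma = [m / norm_K ** 2 for m in mu]
    return t, mu, gamma

t, mu, gamma = vast_parameters(200, b=2.0, norm_K=3.0)

# Left-hand side of (17); every entry should be nonpositive.
condition_17 = [mu[k] - mu[k + 1] - mu[k + 1] / t[k + 1] for k in range(len(t) - 1)]

# Both sides of (18); they should agree up to rounding.
lhs_18 = [(1.0 - 1.0 / t[k + 1]) * gamma[k + 1] * t[k + 1] ** 2 for k in range(len(t) - 1)]
rhs_18 = [gamma[k] * t[k] ** 2 for k in range(len(t) - 1)]
```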

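Now we want to prove the claimed convergence rates. Via induction we show that

$$\frac{k + 1}{2} \le t_k \le k \quad \forall k \ge 1. \tag{20}$$

Evidently, this holds for $t_1 = 1$. Assuming that (20) holds for $k \ge 1$, we easily see that

$$t_{k+1} = \sqrt{t_k^2 + 2t_k} \le \sqrt{k^2 + 2k} \le \sqrt{k^2 + 2k + 1} = k + 1$$

and, on the other hand,

$$t_{k+1} = \sqrt{t_k^2 + 2t_k} \ge \sqrt{\frac{(k + 1)^2}{4} + k + 1} = \frac{1}{2}\sqrt{k^2 + 6k + 5} \ge \frac{1}{2}\sqrt{k^2 + 4k + 4} = \frac{k + 2}{2}.$$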

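In the following we prove a similar estimate for the sequence $(\mu_k)_{k \ge 1}$. To this end we show, again by induction, the following identity for every $k \ge 2$

$$\mu_k = \mu_1 \frac{\prod_{j=1}^{k-1} t_j}{\prod_{j=2}^{k}(t_j - 1)}\frac{1}{t_k}. \tag{21}$$

For $k = 2$ this follows from the definition (19). Assume now that (21) holds for $k \ge 2$. From here we have that

$$\mu_{k+1} = \mu_k \frac{t_k^2}{t_{k+1}(t_{k+1} - 1)} = \mu_1 \frac{\prod_{j=1}^{k-1} t_j}{\prod_{j=2}^{k}(t_j - 1)}\frac{1}{t_k}\frac{t_k^2}{t_{k+1}(t_{k+1} - 1)} = \mu_1 \frac{\prod_{j=1}^{k} t_j}{\prod_{j=2}^{k+1}(t_j - 1)}\frac{1}{t_{k+1}}.$$

Using (21) together with (20) we can check that for every $k \ge 1$

$$\mu_{k+1} = \mu_1 \frac{\prod_{j=1}^{k} t_j}{\prod_{j=2}^{k+1}(t_j - 1)}\frac{1}{t_{k+1}} = \frac{\mu_1}{t_{k+1}}\prod_{j=1}^{k}\frac{t_j}{t_{j+1} - 1} \ge \frac{\mu_1}{t_{k+1}} = b\|K\|^2\frac{1}{t_{k+1}}, \tag{22}$$

where we used in the last step the fact that $t_{k+1} \le t_k + 1$.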

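The last thing to check is the fact that $\mu_k$ goes to zero like $\frac{1}{k}$. First we check that for every $k \ge 1$

$$\frac{t_k}{t_{k+1} - 1} \le 1 + \frac{1}{t_{k+1}(t_{k+1} - 1)}. \tag{23}$$

This can be seen via

$$(t_k + 1)t_{k+1} \le (t_k + 1)^2 = t_{k+1}^2 + 1 \quad \forall k \ge 1.$$

By bringing $t_{k+1}$ to the other side we get that

$$t_{k+1} t_k \le t_{k+1}^2 - t_{k+1} + 1,$$

from which we can deduce (23) by dividing by $t_{k+1}^2 - t_{k+1}$.

We plug the estimate (23) into (21) and get for every $k \ge 2$

$$\begin{aligned}
\mu_k &= \mu_1 \frac{\prod_{j=1}^{k-1} t_j}{\prod_{j=1}^{k-1}(t_{j+1} - 1)}\frac{1}{t_k}
\le \mu_1 \prod_{j=1}^{k-1}\left( 1 + \frac{1}{t_{j+1}(t_{j+1} - 1)} \right)\frac{1}{t_k}
\le \mu_1 \prod_{j=1}^{k-1}\left( 1 + \frac{4}{(j + 2)j} \right)\frac{1}{t_k} \\
&\le \mu_1 \prod_{j=1}^{k-1}\left( 1 + \frac{4}{j^2} \right)\frac{1}{t_k}
\le \mu_1 \exp\left( \frac{4\pi^2}{6} \right)\frac{1}{t_k}
= b\|K\|^2 \exp\left( \frac{4\pi^2}{6} \right)\frac{1}{t_k}.
\end{aligned}$$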

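With the above inequalities we can deduce the claimed convergence rates. First note that from Theorem 3.1 we have

$$F(x_N) - F(x^*) \le \frac{\|x_0 - x^*\|^2}{2\gamma_N t_N^2} + \mu_N \frac{L_g^2}{2} \quad \forall N \ge 1.$$

Now, in order to obtain the desired conclusion, we use the above estimates and deduce for every $N \ge 1$

$$\frac{\|x_0 - x^*\|^2}{2\gamma_N t_N^2} + \mu_N \frac{L_g^2}{2} \le \frac{\|x_0 - x^*\|^2}{2 b\, t_N} + \frac{b L_g^2 \|K\|^2}{2 t_N}\exp\left( \frac{4\pi^2}{6} \right) \le \frac{\|x_0 - x^*\|^2}{b(N + 1)} + \frac{b L_g^2 \|K\|^2}{N + 1}\exp\left( \frac{4\pi^2}{6} \right),$$

where we used that

$$\gamma_N t_N = \frac{\mu_N t_N}{\|K\|^2} \ge b,$$

as shown in (22).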
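To make the interplay of Algorithm 3.1 with the parameter choice (16) concrete, here is a minimal NumPy sketch on a toy one-dimensional problem. The problem instance is our own assumption, not one of the paper's experiments: $f = \frac{1}{2}\|\cdot - d\|^2$, $g = \|\cdot\|_1$ and $K$ a forward-difference matrix. The data `d` are chosen so that the constant vector `d.mean()` is the minimizer, which lets the optimality gap be evaluated exactly.

```python
import math
import numpy as np

def forward_difference(n):
    """(n-1) x n matrix K with (Kx)_i = x_{i+1} - x_i."""
    return np.diff(np.eye(n), axis=0)

def objective(x, d, K):
    """F(x) = 0.5 * ||x - d||^2 + ||Kx||_1 (toy instance, an assumption of this sketch)."""
    return 0.5 * np.sum((x - d) ** 2) + np.sum(np.abs(K @ x))

def vast(d, K, n_iter=500, b=1.0):
    """Sketch of Algorithm 3.1 with the parameters (16), started at x_0 = y_0 = 0."""
    norm_K_sq = np.linalg.norm(K, 2) ** 2
    x_prev = np.zeros_like(d)
    y = x_prev.copy()
    t, mu = 1.0, b * norm_K_sq
    for _ in range(n_iter):
        gamma = mu / norm_K_sq
        # Gradient of the smoothed term: K^T prox_{(1/mu) g^*}(Ky / mu), a clip for g = l1 norm.
        grad = K.T @ np.clip(K @ y / mu, -1.0, 1.0)
        v = y - gamma * grad
        x = (v + gamma * d) / (1.0 + gamma)        # prox of gamma * f for the quadratic f
        t_next = math.sqrt(t * t + 2.0 * t)
        y = x + (t - 1.0) / t_next * (x - x_prev)  # extrapolation step
        mu = mu * t * t / (t_next * t_next - t_next)
        x_prev, t = x, t_next
    return x_prev

d = np.array([0.3, -0.2, 0.1, 0.4, -0.3, 0.0, 0.2, -0.1])
K = forward_difference(d.size)
n_iter, b = 500, 1.0
x_N = vast(d, K, n_iter=n_iter, b=b)
```

For this instance the bound of Corollary 3.1 can be evaluated with $L_g^2 = n - 1$ (the Lipschitz constant of the $\ell_1$ norm with respect to the Euclidean norm) and compared with the observed gap `objective(x_N, d, K) - objective(np.full_like(d, d.mean()), d, K)`.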

Remark 3.3

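Consider the choice (see [21])

$$t_1 = 1, \quad t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2} \quad \forall k \ge 1$$

and

$$\mu_1 = b\|K\|^2, \quad \text{for } b > 0.$$

Since

$$t_k^2 = t_{k+1}^2 - t_{k+1} \quad \forall k \ge 1,$$

we see that in this setting we have to choose

$$\mu_k = b\|K\|^2 \quad \text{and} \quad \gamma_k = b \quad \forall k \ge 1.$$

Thus, the sequence of function values $(F(x_N))_{N \ge 1}$ approaches a $b\|K\|^2 L_g^2$-approximation of the optimal objective value $F(x^*)$ with a convergence rate of $O\left( \frac{1}{N^2} \right)$, i.e.

$$F(x_N) - F(x^*) \le \frac{2\|x_0 - x^*\|^2}{b(N + 1)^2} + \frac{b\|K\|^2 L_g^2}{2} \quad \forall N \ge 1.$$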

Stochastic Method

Problem 4.1

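The problem is the same as in the deterministic case

$$\min_{x \in H} f(x) + g(Kx)$$

other than the fact that at each iteration we are only given a stochastic estimator of the quantity

$$\nabla\left( {}^{\mu_k}g \circ K \right)(\cdot) = K^* \operatorname{prox}_{\frac{1}{\mu_k} g^*}\left( \frac{1}{\mu_k}K\,\cdot \right) \quad \forall k \ge 1.$$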

Remark 4.1

See Algorithm 4.3 for a setting where such an estimator is easily computed.

For the stochastic quantities arising in this section we will use the following notation. For every $k \ge 0$, we denote by $\sigma(x_0, \dots, x_k)$ the smallest $\sigma$-algebra generated by the family of random variables $\{x_0, \dots, x_k\}$ and by $\mathbb{E}_k(\cdot) := \mathbb{E}(\cdot \mid \sigma(x_0, \dots, x_k))$ the conditional expectation with respect to this $\sigma$-algebra.

Algorithm 4.1

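(stochastic Variable Accelerated SmooThing (sVAST)) Let $y_0 = x_0 \in H$, $(\mu_k)_{k \ge 1}$ a sequence of positive and nonincreasing real numbers, and $(t_k)_{k \ge 1}$ a sequence of real numbers with $t_1 = 1$ and $t_k \ge 1$ for every $k \ge 2$. Consider the following iterative scheme

$$(\forall k \ge 1) \quad \left\lfloor \begin{array}{l}
L_k = \dfrac{\|K\|^2}{\mu_k} \\[2mm]
\gamma_k = \dfrac{1}{L_k} \\[2mm]
x_k = \operatorname{prox}_{\gamma_k f}\left( y_{k-1} - \gamma_k \xi_{k-1} \right) \\[2mm]
y_k = x_k + \dfrac{t_k - 1}{t_{k+1}}(x_k - x_{k-1}),
\end{array} \right.$$

where we make the standard assumptions about our gradient estimator of being unbiased, i.e.

$$\mathbb{E}_k(\xi_k) = \nabla\left( {}^{\mu_{k+1}}g \circ K \right)(y_k),$$

and having bounded variance

$$\mathbb{E}_k\left( \left\| \xi_k - \nabla\left( {}^{\mu_{k+1}}g \circ K \right)(y_k) \right\|^2 \right) \le \sigma^2$$

for every $k \ge 0$.

Note that we use the same notations as in the deterministic case

$$u_k := x_{k-1} + t_k(x_k - x_{k-1}) \quad \text{and} \quad F^k(\cdot) := f + {}^{\mu_k}g \circ K \quad \forall k \ge 1.$$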

Lemma 4.1

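The following statement holds for every (deterministic) $z \in H$ and every $k \ge 0$

$$\mathbb{E}_k\left( F^{k+1}(x_{k+1}) + \frac{\|x_{k+1} - z\|^2}{2\gamma_{k+1}} \right) \le F^{k+1}(z) + \frac{\|z - y_k\|^2}{2\gamma_{k+1}} + \gamma_{k+1}\left( \sigma^2 + \frac{\|K\|^2 L_g^2}{2} \right).$$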

Proof

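Here we have to proceed a little differently from Lemma 3.1. Namely, we have to treat the gradient step and the proximal step separately. For this purpose we define the auxiliary variable

$$z_k := y_{k-1} - \gamma_k \xi_{k-1} \quad \forall k \ge 1.$$

Let $k \ge 1$ be fixed. From the gradient step we get

$$\|z - z_k\|^2 = \|y_{k-1} - \gamma_k \xi_{k-1} - z\|^2 = \|y_{k-1} - z\|^2 - 2\gamma_k \langle \xi_{k-1}, y_{k-1} - z \rangle + \gamma_k^2 \|\xi_{k-1}\|^2.$$

Taking the conditional expectation gives

$$\mathbb{E}_{k-1}\left( \|z - z_k\|^2 \right) = \|y_{k-1} - z\|^2 - 2\gamma_k \langle \nabla({}^{\mu_k}g \circ K)(y_{k-1}), y_{k-1} - z \rangle + \gamma_k^2\, \mathbb{E}_{k-1}\left( \|\xi_{k-1}\|^2 \right).$$

Using the gradient inequality we deduce

$$\mathbb{E}_{k-1}\left( \|z - z_k\|^2 \right) \le \|y_{k-1} - z\|^2 - 2\gamma_k\left( ({}^{\mu_k}g \circ K)(y_{k-1}) - ({}^{\mu_k}g \circ K)(z) \right) + \gamma_k^2\, \mathbb{E}_{k-1}\left( \|\xi_{k-1}\|^2 \right)$$

and therefore

$$\gamma_k ({}^{\mu_k}g \circ K)(y_{k-1}) + \frac{1}{2}\mathbb{E}_{k-1}\left( \|z - z_k\|^2 \right) \le \frac{1}{2}\|y_{k-1} - z\|^2 + \gamma_k ({}^{\mu_k}g \circ K)(z) + \frac{\gamma_k^2}{2}\mathbb{E}_{k-1}\left( \|\xi_{k-1}\|^2 \right). \tag{24}$$

Also from the smoothness of ${}^{\mu_k}g \circ K$ we deduce via the Descent Lemma that

$${}^{\mu_k}g(Kz_k) \le {}^{\mu_k}g(Ky_{k-1}) + \langle \nabla({}^{\mu_k}g \circ K)(y_{k-1}), z_k - y_{k-1} \rangle + \frac{L_k}{2}\|z_k - y_{k-1}\|^2.$$

Plugging in the definition of $z_k$ and using the fact that $L_k = \frac{1}{\gamma_k}$ we get

$${}^{\mu_k}g(Kz_k) \le {}^{\mu_k}g(Ky_{k-1}) - \gamma_k \langle \nabla({}^{\mu_k}g \circ K)(y_{k-1}), \xi_{k-1} \rangle + \frac{\gamma_k}{2}\|\xi_{k-1}\|^2.$$

Now we take the conditional expectation to deduce that

$$\mathbb{E}_{k-1}\left( {}^{\mu_k}g(Kz_k) \right) \le {}^{\mu_k}g(Ky_{k-1}) - \gamma_k \left\| \nabla({}^{\mu_k}g \circ K)(y_{k-1}) \right\|^2 + \frac{\gamma_k}{2}\mathbb{E}_{k-1}\left( \|\xi_{k-1}\|^2 \right). \tag{25}$$

Multiplying (25) by $\gamma_k$ and adding it to (24) gives

$$\gamma_k \mathbb{E}_{k-1}\left( {}^{\mu_k}g(Kz_k) \right) + \frac{1}{2}\mathbb{E}_{k-1}\left( \|z - z_k\|^2 \right) \le \gamma_k\, {}^{\mu_k}g(Kz) + \frac{1}{2}\|y_{k-1} - z\|^2 - \gamma_k^2 \left\| \nabla({}^{\mu_k}g \circ K)(y_{k-1}) \right\|^2 + \gamma_k^2\, \mathbb{E}_{k-1}\left( \|\xi_{k-1}\|^2 \right).$$

Now we use the assumption about the bounded variance, which together with unbiasedness gives $\mathbb{E}_{k-1}\left( \|\xi_{k-1}\|^2 \right) \le \left\| \nabla({}^{\mu_k}g \circ K)(y_{k-1}) \right\|^2 + \sigma^2$, to deduce that

$$\gamma_k \mathbb{E}_{k-1}\left( {}^{\mu_k}g(Kz_k) \right) + \frac{1}{2}\mathbb{E}_{k-1}\left( \|z - z_k\|^2 \right) \le \gamma_k\, {}^{\mu_k}g(Kz) + \frac{1}{2}\|y_{k-1} - z\|^2 + \gamma_k^2 \sigma^2. \tag{26}$$

Next up, for the proximal step we deduce

$$f(x_k) + \frac{1}{2\gamma_k}\|x_k - z_k\|^2 + \frac{1}{2\gamma_k}\|x_k - z\|^2 \le f(z) + \frac{1}{2\gamma_k}\|z - z_k\|^2. \tag{27}$$

Taking the conditional expectation and combining (26) and (27) we get

$$\mathbb{E}_{k-1}\left( \gamma_k\left( {}^{\mu_k}g(Kz_k) + f(x_k) \right) + \frac{1}{2}\|x_k - z_k\|^2 + \frac{1}{2}\|x_k - z\|^2 \right) \le \gamma_k F^k(z) + \frac{1}{2}\|y_{k-1} - z\|^2 + \gamma_k^2 \sigma^2.$$

From here, using now Lemma 2.3, we get that

$$\mathbb{E}_{k-1}\left( \gamma_k F^k(x_k) - \gamma_k L_g \|K\| \|x_k - z_k\| + \frac{1}{2}\|x_k - z_k\|^2 + \frac{1}{2}\|x_k - z\|^2 \right) \le \gamma_k F^k(z) + \frac{1}{2}\|y_{k-1} - z\|^2 + \gamma_k^2 \sigma^2.$$

Now we use

$$-\frac{1}{2}\gamma_k^2 L_g^2 \|K\|^2 \le \frac{1}{2}\|x_k - z_k\|^2 - \gamma_k L_g \|K\| \|x_k - z_k\|$$

to obtain that

$$\mathbb{E}_{k-1}\left( \gamma_k F^k(x_k) + \frac{1}{2}\|x_k - z\|^2 \right) \le \gamma_k F^k(z) + \frac{1}{2}\|y_{k-1} - z\|^2 + \gamma_k^2 \sigma^2 + \frac{1}{2}\gamma_k^2 L_g^2 \|K\|^2.$$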

Lemma 4.2

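Let $x^*$ be an optimal solution of Problem 4.1. Then it holds

$$\mathbb{E}\left( \gamma_1\left( F^1(x_1) - F^1(x^*) \right) + \frac{1}{2}\|u_1 - x^*\|^2 \right) \le \frac{1}{2}\|x_0 - x^*\|^2 + \gamma_1^2 \sigma^2 + \frac{1}{2}\gamma_1^2 L_g^2 \|K\|^2.$$

Proof

Applying the previous lemma with $k = 0$ and $z = x^*$, we get that

$$\mathbb{E}\left( \gamma_1 F^1(x_1) + \frac{1}{2}\|x_1 - x^*\|^2 \right) \le \gamma_1 F^1(x^*) + \frac{1}{2}\|y_0 - x^*\|^2 + \gamma_1^2 \sigma^2 + \frac{1}{2}\gamma_1^2 L_g^2 \|K\|^2.$$

Therefore, using the fact that $y_0 = x_0$ and $u_1 = x_1$,

$$\mathbb{E}\left( \gamma_1\left( F^1(x_1) - F^1(x^*) \right) + \frac{1}{2}\|u_1 - x^*\|^2 \right) \le \frac{1}{2}\|x_0 - x^*\|^2 + \gamma_1^2 \sigma^2 + \frac{1}{2}\gamma_1^2 L_g^2 \|K\|^2,$$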

which finishes the proof.

Theorem 4.1

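Consider the setup of Problem 4.1 and let $(x_k)_{k \ge 0}$ and $(y_k)_{k \ge 0}$ denote the sequences generated by Algorithm 4.1. Assume that for all $k \ge 1$

$$\rho_{k+1} := t_k^2 - t_{k+1}^2 + t_{k+1} \ge 0.$$

Then, for every optimal solution $x^*$ of Problem 4.1, it holds

$$\mathbb{E}\left( F(x_N) - F(x^*) \right) \le \frac{1}{\gamma_N t_N^2}\frac{1}{2}\|x_0 - x^*\|^2 + \frac{1}{\gamma_N t_N^2}\frac{\|K\|^2 L_g^2}{2}\sum_{k=1}^{N}\gamma_k^2(t_k + \rho_k) + \frac{1}{\gamma_N t_N^2}\left( \sigma^2 + \frac{\|K\|^2 L_g^2}{2} \right)\sum_{k=1}^{N} t_k^2 \gamma_k^2 \quad \forall N \ge 1.$$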

Proof of Theorem 4.1

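Let $k \ge 0$ be fixed and abbreviate $c := \sigma^2 + \frac{\|K\|^2 L_g^2}{2}$. Lemma 4.1 for $z := \left( 1 - \frac{1}{t_{k+1}} \right)x_k + \frac{1}{t_{k+1}}x^*$ gives

$$\mathbb{E}_k\left( F^{k+1}(x_{k+1}) + \frac{1}{2\gamma_{k+1}}\left\| \frac{1}{t_{k+1}}u_{k+1} - \frac{1}{t_{k+1}}x^* \right\|^2 \right) \le F^{k+1}\left( \left( 1 - \frac{1}{t_{k+1}} \right)x_k + \frac{1}{t_{k+1}}x^* \right) + \frac{1}{2\gamma_{k+1}}\left\| \frac{1}{t_{k+1}}x^* - \frac{1}{t_{k+1}}u_k \right\|^2 + \gamma_{k+1} c.$$

From here and from the convexity of $F^{k+1}$ it follows that

$$\mathbb{E}_k\left( F^{k+1}(x_{k+1}) - F^{k+1}(x^*) - \left( 1 - \frac{1}{t_{k+1}} \right)\left( F^{k+1}(x_k) - F^{k+1}(x^*) \right) \right) \le \frac{\|u_k - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2} - \mathbb{E}_k\left( \frac{\|u_{k+1} - x^*\|^2}{2\gamma_{k+1} t_{k+1}^2} \right) + \gamma_{k+1} c.$$

Now, by multiplying both sides by $t_{k+1}^2$, we deduce

$$\mathbb{E}_k\left( t_{k+1}^2\left( F^{k+1}(x_{k+1}) - F^{k+1}(x^*) \right) + (t_{k+1} - t_{k+1}^2)\left( F^{k+1}(x_k) - F^{k+1}(x^*) \right) \right) \le \frac{1}{2\gamma_{k+1}}\left( \|u_k - x^*\|^2 - \mathbb{E}_k\left( \|u_{k+1} - x^*\|^2 \right) \right) + t_{k+1}^2 \gamma_{k+1} c. \tag{28}$$

Next, adding $t_k^2\left( F^{k+1}(x_k) - F^{k+1}(x^*) \right)$ on both sides of (28) gives

$$\begin{aligned}
& \mathbb{E}_k\left( t_{k+1}^2\left( F^{k+1}(x_{k+1}) - F^{k+1}(x^*) \right) + \rho_{k+1}\left( F^{k+1}(x_k) - F^{k+1}(x^*) \right) \right) \\
&\quad \le t_k^2\left( F^{k+1}(x_k) - F^{k+1}(x^*) \right) + \frac{1}{2\gamma_{k+1}}\left( \|u_k - x^*\|^2 - \mathbb{E}_k\left( \|u_{k+1} - x^*\|^2 \right) \right) + t_{k+1}^2 \gamma_{k+1} c.
\end{aligned}$$

Utilizing (6) together with the assumption that $(\mu_k)_{k \ge 1}$ is nonincreasing leads to

$$\begin{aligned}
& \mathbb{E}_k\left( t_{k+1}^2\left( F^{k+1}(x_{k+1}) - F^{k+1}(x^*) \right) + \rho_{k+1}\left( F^{k+1}(x_k) - F^{k+1}(x^*) \right) \right) \\
&\quad \le t_k^2\left( F^k(x_k) - F^k(x^*) \right) + \frac{1}{2\gamma_{k+1}}\left( \|u_k - x^*\|^2 - \mathbb{E}_k\left( \|u_{k+1} - x^*\|^2 \right) \right) + t_k^2(\mu_k - \mu_{k+1})\frac{L_g^2}{2} + t_{k+1}^2 \gamma_{k+1} c.
\end{aligned}$$

Now, using that $t_k^2 \ge t_{k+1}^2 - t_{k+1}$, we get

$$\begin{aligned}
& \mathbb{E}_k\left( t_{k+1}^2\left( F^{k+1}(x_{k+1}) - F^{k+1}(x^*) \right) + \rho_{k+1}\left( F^{k+1}(x_k) - F^{k+1}(x^*) \right) \right) \\
&\quad \le t_k^2\left( F^k(x_k) - F^k(x^*) \right) + \frac{1}{2\gamma_{k+1}}\left( \|u_k - x^*\|^2 - \mathbb{E}_k\left( \|u_{k+1} - x^*\|^2 \right) \right) \\
&\qquad + t_k^2 \mu_k \frac{L_g^2}{2} - t_{k+1}^2 \mu_{k+1}\frac{L_g^2}{2} + t_{k+1}\mu_{k+1}\frac{L_g^2}{2} + t_{k+1}^2 \gamma_{k+1} c.
\end{aligned}$$

Multiplying both sides by $\gamma_{k+1}$ and rearranging yields

$$\begin{aligned}
& \mathbb{E}_k\left( \gamma_{k+1} t_{k+1}^2\left( F^{k+1}(x_{k+1}) - F^{k+1}(x^*) + \mu_{k+1}\frac{L_g^2}{2} \right) + \frac{1}{2}\|u_{k+1} - x^*\|^2 \right) + \gamma_{k+1}\rho_{k+1}\left( F^{k+1}(x_k) - F^{k+1}(x^*) \right) \\
&\quad \le \gamma_{k+1} t_k^2\left( F^k(x_k) - F^k(x^*) + \mu_k \frac{L_g^2}{2} \right) + \frac{1}{2}\|u_k - x^*\|^2 + \gamma_{k+1} t_{k+1}\mu_{k+1}\frac{L_g^2}{2} + t_{k+1}^2 \gamma_{k+1}^2 c.
\end{aligned} \tag{29}$$

At this point we would like to discard the term $\gamma_{k+1}\rho_{k+1}\left( F^{k+1}(x_k) - F^{k+1}(x^*) \right)$, which we currently cannot do as the nonnegativity of $F^{k+1}(x_k) - F^{k+1}(x^*)$ is not ensured. So we add $\gamma_{k+1}\rho_{k+1}\mu_{k+1}\frac{L_g^2}{2}$ on both sides of (29) and get

$$\begin{aligned}
& \mathbb{E}_k\left( \gamma_{k+1} t_{k+1}^2\left( F^{k+1}(x_{k+1}) - F^{k+1}(x^*) + \mu_{k+1}\frac{L_g^2}{2} \right) + \frac{1}{2}\|u_{k+1} - x^*\|^2 \right) + \gamma_{k+1}\rho_{k+1}\left( F^{k+1}(x_k) - F^{k+1}(x^*) + \mu_{k+1}\frac{L_g^2}{2} \right) \\
&\quad \le \gamma_{k+1} t_k^2\left( F^k(x_k) - F^k(x^*) + \mu_k \frac{L_g^2}{2} \right) + \frac{1}{2}\|u_k - x^*\|^2 + \gamma_{k+1}\mu_{k+1}\frac{L_g^2}{2}(t_{k+1} + \rho_{k+1}) + t_{k+1}^2 \gamma_{k+1}^2 c.
\end{aligned} \tag{30}$$

Using again (6) to deduce that

$$\gamma_{k+1}\rho_{k+1}\left( F^{k+1}(x_k) - F^{k+1}(x^*) + \mu_{k+1}\frac{L_g^2}{2} \right) \ge \gamma_{k+1}\rho_{k+1}\left( F(x_k) - F(x^*) \right) \ge 0$$

we can now discard said term from (30), giving

$$\begin{aligned}
& \mathbb{E}_k\left( \gamma_{k+1} t_{k+1}^2\left( F^{k+1}(x_{k+1}) - F^{k+1}(x^*) + \mu_{k+1}\frac{L_g^2}{2} \right) + \frac{1}{2}\|u_{k+1} - x^*\|^2 \right) \\
&\quad \le \gamma_{k+1} t_k^2\left( F^k(x_k) - F^k(x^*) + \mu_k \frac{L_g^2}{2} \right) + \frac{1}{2}\|u_k - x^*\|^2 + \gamma_{k+1}\mu_{k+1}\frac{L_g^2}{2}(t_{k+1} + \rho_{k+1}) + t_{k+1}^2 \gamma_{k+1}^2 c.
\end{aligned} \tag{31}$$

Last but not least we use that $F^k(x_k) - F^k(x^*) + \mu_k \frac{L_g^2}{2} \ge F(x_k) - F(x^*) \ge 0$ and $\gamma_{k+1} \le \gamma_k$ to conclude that

$$\gamma_{k+1} t_k^2\left( F^k(x_k) - F^k(x^*) + \mu_k \frac{L_g^2}{2} \right) \le \gamma_k t_k^2\left( F^k(x_k) - F^k(x^*) + \mu_k \frac{L_g^2}{2} \right). \tag{32}$$

Combining (31) and (32) yields

$$\begin{aligned}
& \mathbb{E}_k\left( \gamma_{k+1} t_{k+1}^2\left( F^{k+1}(x_{k+1}) - F^{k+1}(x^*) + \mu_{k+1}\frac{L_g^2}{2} \right) + \frac{1}{2}\|u_{k+1} - x^*\|^2 \right) \\
&\quad \le \gamma_k t_k^2\left( F^k(x_k) - F^k(x^*) + \mu_k \frac{L_g^2}{2} \right) + \frac{1}{2}\|u_k - x^*\|^2 + \gamma_{k+1}\mu_{k+1}\frac{L_g^2}{2}(t_{k+1} + \rho_{k+1}) + t_{k+1}^2 \gamma_{k+1}^2 c.
\end{aligned} \tag{33}$$

Let $N \ge 2$. We take the expected value on both sides of (33) and sum from $k = 1$ to $N - 1$. Getting rid of the nonnegative term $\|u_N - x^*\|^2$ gives

$$\mathbb{E}\left( \gamma_N t_N^2\left( F^N(x_N) - F^N(x^*) + \mu_N \frac{L_g^2}{2} \right) \right) \le \mathbb{E}\left( \gamma_1\left( F^1(x_1) - F^1(x^*) + \mu_1 \frac{L_g^2}{2} \right) + \frac{1}{2}\|u_1 - x^*\|^2 \right) + \sum_{k=2}^{N}\gamma_k \mu_k \frac{L_g^2}{2}(t_k + \rho_k) + \sum_{k=2}^{N} t_k^2 \gamma_k^2 c.$$

Since $t_1 = 1$, the above inequality holds also for $N = 1$. Now, using Lemma 4.2 we get that for every $N \ge 1$

$$\mathbb{E}\left( \gamma_N t_N^2\left( F^N(x_N) - F^N(x^*) + \mu_N \frac{L_g^2}{2} \right) \right) \le \frac{1}{2}\|x_0 - x^*\|^2 + \sum_{k=1}^{N}\gamma_k \mu_k \frac{L_g^2}{2}(t_k + \rho_k) + \sum_{k=1}^{N} t_k^2 \gamma_k^2 c.$$

From (7) we deduce that

$$\gamma_N t_N^2\left( F(x_N) - F(x^*) \right) \le \gamma_N t_N^2\left( F^N(x_N) - F^N(x^*) + \mu_N \frac{L_g^2}{2} \right),$$

therefore, for every $N \ge 1$

$$\mathbb{E}\left( \gamma_N t_N^2\left( F(x_N) - F(x^*) \right) \right) \le \frac{1}{2}\|x_0 - x^*\|^2 + \sum_{k=1}^{N}\gamma_k \mu_k \frac{L_g^2}{2}(t_k + \rho_k) + \sum_{k=1}^{N} t_k^2 \gamma_k^2 c.$$

Using the fact that $\mu_k = \gamma_k \|K\|^2$ for every $k \ge 1$ gives

$$\mathbb{E}\left( \gamma_N t_N^2\left( F(x_N) - F(x^*) \right) \right) \le \frac{1}{2}\|x_0 - x^*\|^2 + \frac{\|K\|^2 L_g^2}{2}\sum_{k=1}^{N}\gamma_k^2(t_k + \rho_k) + \left( \sigma^2 + \frac{\|K\|^2 L_g^2}{2} \right)\sum_{k=1}^{N} t_k^2 \gamma_k^2 \quad \forall N \ge 1.$$

Thus,

$$\mathbb{E}\left( F(x_N) - F(x^*) \right) \le \frac{1}{\gamma_N t_N^2}\frac{1}{2}\|x_0 - x^*\|^2 + \frac{1}{\gamma_N t_N^2}\frac{\|K\|^2 L_g^2}{2}\sum_{k=1}^{N}\gamma_k^2(t_k + \rho_k) + \frac{1}{\gamma_N t_N^2}\left( \sigma^2 + \frac{\|K\|^2 L_g^2}{2} \right)\sum_{k=1}^{N} t_k^2 \gamma_k^2 \quad \forall N \ge 1.$$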

Corollary 4.1

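Let

$$t_1 = 1, \quad t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2} \quad \forall k \ge 1,$$

and, for $b > 0$,

$$\mu_k = \frac{b}{k^{3/2}}\|K\|^2 \quad \text{and} \quad \gamma_k = \frac{b}{k^{3/2}} \quad \forall k \ge 1.$$

Then,

$$\mathbb{E}\left( F(x_N) - F(x^*) \right) \le \frac{2\|x_0 - x^*\|^2}{b\sqrt{N}} + b\|K\|^2 L_g^2 \frac{\pi^2}{3}\frac{1}{\sqrt{N}} + 2b\left( 2\sigma^2 + \|K\|^2 L_g^2 \right)\frac{1 + \log(N)}{\sqrt{N}} \quad \forall N \ge 1.$$

Furthermore, we have that $F(x_N)$ converges almost surely to $F(x^*)$ as $N \to +\infty$.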

Proof

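First we notice that the choice of $t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2}$ fulfills

$$\rho_{k+1} = t_k^2 - t_{k+1}^2 + t_{k+1} = 0 \quad \forall k \ge 1.$$

Now we derive the stated convergence result by first showing via induction that

$$\frac{1}{k} \le \frac{1}{t_k} \le \frac{2}{k} \quad \forall k \ge 1.$$

Assuming that this holds for $k \ge 1$, we have that

$$t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2} \le \frac{1 + \sqrt{1 + 4k^2}}{2} \le \frac{1 + \sqrt{1 + 4k + 4k^2}}{2} = k + 1$$

and

$$t_{k+1} = \frac{1 + \sqrt{1 + 4t_k^2}}{2} \ge \frac{1 + \sqrt{1 + 4\left( \frac{k}{2} \right)^2}}{2} \ge \frac{1 + \sqrt{k^2}}{2} = \frac{k + 1}{2}.$$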
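The two properties of this extrapolation sequence used in the proof, $\rho_{k+1} = 0$ and $\frac{k}{2} \le t_k \le k$, are easy to confirm numerically. The sketch below (function name ours) generates the first terms of the recursion.

```python
import math

def fista_sequence(n):
    """Return [t_1, ..., t_n] with t_1 = 1 and t_{k+1} = (1 + sqrt(1 + 4 t_k^2)) / 2."""
    t = [1.0]
    for _ in range(n - 1):
        t.append((1.0 + math.sqrt(1.0 + 4.0 * t[-1] ** 2)) / 2.0)
    return t

t = fista_sequence(100)

# rho_{k+1} = t_k^2 - t_{k+1}^2 + t_{k+1}; should vanish up to rounding.
rho = [t[k] ** 2 - t[k + 1] ** 2 + t[k + 1] for k in range(len(t) - 1)]
```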

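Furthermore, for every $N \ge 1$ we have that

$$\frac{1}{\gamma_N t_N^2}\frac{\|K\|^2 L_g^2}{2}\sum_{k=1}^{N}\gamma_k^2(t_k + \rho_k) \le \frac{4}{b\sqrt{N}}\frac{\|K\|^2 L_g^2}{2}\sum_{k=1}^{N}\frac{b^2}{k^3}k = \frac{2b\|K\|^2 L_g^2}{\sqrt{N}}\sum_{k=1}^{N}k^{-2} \le \frac{2b\|K\|^2 L_g^2}{\sqrt{N}}\sum_{k=1}^{\infty}k^{-2} = b\|K\|^2 L_g^2 \frac{\pi^2}{3}\frac{1}{\sqrt{N}}. \tag{34}$$

The statement of the convergence rate in expectation follows now by plugging our parameter choices into the statement of Theorem 4.1, using the estimate (34) and checking that

$$\sum_{k=1}^{N} t_k^2 \gamma_k^2 \le b^2 \sum_{k=1}^{N}\frac{1}{k} \le b^2(1 + \log(N)) \quad \forall N \ge 1.$$

The almost sure convergence of $(F(x_N))_{N \ge 1}$ can be deduced by looking at (33), dividing by $\gamma_{k+1} t_{k+1}^2$ and using that $\gamma_{k+1} t_{k+1}^2 \ge \gamma_k t_k^2$ as well as $\rho_k = 0$, which gives for every $k \ge 0$

$$\begin{aligned}
& \mathbb{E}_k\left( F^{k+1}(x_{k+1}) - F^{k+1}(x^*) + \mu_{k+1}\frac{L_g^2}{2} + \frac{1}{2\gamma_{k+1} t_{k+1}^2}\|u_{k+1} - x^*\|^2 \right) \\
&\quad \le F^k(x_k) - F^k(x^*) + \mu_k \frac{L_g^2}{2} + \frac{1}{2\gamma_k t_k^2}\|u_k - x^*\|^2 + \frac{\mu_{k+1}}{t_{k+1}}\frac{L_g^2}{2} + \gamma_{k+1}\left( \sigma^2 + \frac{\|K\|^2 L_g^2}{2} \right).
\end{aligned}$$

Plugging in our choice of parameters gives for every $k \ge 0$

$$\mathbb{E}_k\left( F^{k+1}(x_{k+1}) - F^{k+1}(x^*) + \mu_{k+1}\frac{L_g^2}{2} + \frac{1}{2\gamma_{k+1} t_{k+1}^2}\|u_{k+1} - x^*\|^2 \right) \le F^k(x_k) - F^k(x^*) + \mu_k \frac{L_g^2}{2} + \frac{1}{2\gamma_k t_k^2}\|u_k - x^*\|^2 + \frac{C}{k^{3/2}},$$

where $C > 0$.

Thus, by the Robbins-Siegmund Theorem (see [25, Theorem 1]) we get that $\left( F^{k+1}(x_{k+1}) - F^{k+1}(x^*) + \mu_{k+1}\frac{L_g^2}{2} \right)_{k \ge 0}$ converges almost surely. In particular, from the convergence to 0 in expectation we know that the almost sure limit must also be the constant zero.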

Finite Sum

The formulation of the previous section can be used to deal e.g. with problems of the form

$$\min_{x \in H} f(x) + \sum_{i=1}^{m} g_i(K_i x) \tag{35}$$

for $f: H \to \overline{\mathbb{R}}$ a proper, convex and lower semicontinuous function, $g_i: G_i \to \mathbb{R}$ convex and $L_{g_i}$-Lipschitz continuous functions and $K_i: H \to G_i$ linear continuous operators for $i = 1, \dots, m$.

Clearly one could consider

$$K: H \to G_1 \times \dots \times G_m, \quad Kx := (K_1 x, \dots, K_m x),$$

with $\|K\|^2 = \sum_{i=1}^{m}\|K_i\|^2$ and

$$g: G_1 \times \dots \times G_m \to \mathbb{R}, \quad g(y_1, \dots, y_m) := \sum_{i=1}^{m} g_i(y_i),$$

in order to reformulate the problem as

minxHf(x)+g(Kx)

and use Algorithm 3.1 together with the parameter choices described in Corollary 3.1 on this. This results in the following algorithm.

Algorithm 4.2

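Let $y_0 = x_0 \in H$, $\mu_1 = b\|K\|^2$, for $b > 0$, and $t_1 = 1$. Consider the following iterative scheme

$$(\forall k \ge 1) \quad \left\lfloor \begin{array}{l}
\gamma_k = \dfrac{\mu_k}{\sum_{i=1}^{m}\|K_i\|^2} \\[2mm]
x_k = \operatorname{prox}_{\gamma_k f}\left( y_{k-1} - \gamma_k \sum_{i=1}^{m} K_i^* \operatorname{prox}_{\frac{1}{\mu_k} g_i^*}\left( \dfrac{K_i y_{k-1}}{\mu_k} \right) \right) \\[2mm]
t_{k+1} = \sqrt{t_k^2 + 2t_k} \\[2mm]
y_k = x_k + \dfrac{t_k - 1}{t_{k+1}}(x_k - x_{k-1}) \\[2mm]
\mu_{k+1} = \mu_k \dfrac{t_k^2}{t_{k+1}^2 - t_{k+1}}.
\end{array} \right.$$

However, Problem (35) also lends itself to being tackled via the stochastic version of our method, Algorithm 4.1, by randomly choosing a subset of the summands. Together with the parameter choices described in Corollary 4.1, this results in the following scheme.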

Algorithm 4.3

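Let $y_0 = x_0 \in H$, $b > 0$, and $t_1 = 1$. Consider the following iterative scheme

$$(\forall k \ge 1) \quad \left\lfloor \begin{array}{l}
\mu_k = b \sum_{i=1}^{m}\|K_i\|^2\, k^{-3/2} \\[2mm]
\gamma_k = b\, k^{-3/2} \\[2mm]
x_k = \operatorname{prox}_{\gamma_k f}\left( y_{k-1} - \gamma_k \sum_{i=1}^{m} \dfrac{\epsilon_{i,k}}{p_i} K_i^* \operatorname{prox}_{\frac{1}{\mu_k} g_i^*}\left( \dfrac{K_i y_{k-1}}{\mu_k} \right) \right) \\[2mm]
t_{k+1} = \dfrac{1 + \sqrt{1 + 4t_k^2}}{2} \\[2mm]
y_k = x_k + \dfrac{t_k - 1}{t_{k+1}}(x_k - x_{k-1}),
\end{array} \right.$$

with $\epsilon_k := (\epsilon_{1,k}, \epsilon_{2,k}, \dots, \epsilon_{m,k})$ a sequence of i.i.d. $\{0, 1\}^m$-valued random variables and $p_i = \mathbb{P}[\epsilon_{i,1} = 1]$.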
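The reweighting by $p_i^{-1}$ is what makes the subsampled direction an unbiased estimator of the full smoothed gradient. The following sketch illustrates this for $g_i = \|\cdot\|_1$ and random matrices $K_i$, under the additional assumption (ours, for the purpose of enumeration) that the coordinates of $\epsilon_k$ are independent. It computes the exact expectation by enumerating all outcomes in $\{0,1\}^m$.

```python
import itertools
import numpy as np

def subsampled_gradient(y, Ks, mu, eps, p):
    """sum_i (eps_i / p_i) K_i^T prox_{(1/mu) g_i^*}(K_i y / mu) for g_i = l1 norm."""
    out = np.zeros_like(y)
    for K_i, e_i, p_i in zip(Ks, eps, p):
        if e_i:
            out += K_i.T @ np.clip(K_i @ y / mu, -1.0, 1.0) / p_i
    return out

rng = np.random.default_rng(1)
Ks = [rng.standard_normal((4, 3)) for _ in range(3)]   # stand-ins for K_1, K_2, K_3
p = np.array([0.5, 0.25, 0.8])                          # sampling probabilities p_i
y = rng.standard_normal(3)
mu = 0.2

# Full gradient: every summand selected with weight one.
full_gradient = subsampled_gradient(y, Ks, mu, eps=(1, 1, 1), p=np.ones(3))

# Exact expectation over all 2^m outcomes (independent coordinates assumed).
expectation = np.zeros_like(y)
for eps in itertools.product((0, 1), repeat=len(Ks)):
    prob = np.prod([p_i if e_i else 1.0 - p_i for e_i, p_i in zip(eps, p)])
    expectation += prob * subsampled_gradient(y, Ks, mu, eps, p)
```

Unbiasedness only depends on the marginals $p_i$, so the independence assumption is a convenience of this sketch rather than a requirement.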

The above two methods were not explicitly developed for this separable case and can therefore not make use of a more refined estimation of the constant $\|K\|$, as is done e.g. in [14]. However, in the stochastic case, this fact is remedied due to the scaling of the stepsize with respect to the $i$-th component by $p_i^{-1}$.

Remark 4.2

In theory Algorithm 4.1 could be used to treat more general stochastic problems than finite sums like (35), but in the former case it is not clear anymore how a gradient estimator can be found, so we do not discuss it here.

Numerical Examples

We will focus our numerical experiments on image processing problems. The examples are implemented in Python using the operator discretization library (ODL) [1]. We define the discrete gradient operators $D_1$ and $D_2$ representing the discretized derivative in the first and second coordinate respectively, which we will need for the numerical examples. Both map from $\mathbb{R}^{m \times n}$ to $\mathbb{R}^{m \times n}$ and are defined by

$$(D_1 u)_{i,j} := \begin{cases} u_{i+1,j} - u_{i,j} & 1 \le i < m, \\ 0 & \text{else}, \end{cases}$$

and

$$(D_2 u)_{i,j} := \begin{cases} u_{i,j+1} - u_{i,j} & 1 \le j < n, \\ 0 & \text{else}. \end{cases}$$

The operator norm of $D_1$ and $D_2$, respectively, is 2 (where we equipped $\mathbb{R}^{m \times n}$ with the Frobenius norm). This yields an operator norm of $\sqrt{8}$ for the total gradient $D := D_1 \times D_2$ as a map from $\mathbb{R}^{m \times n}$ to $\mathbb{R}^{m \times n} \times \mathbb{R}^{m \times n}$, see also [12].
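A plain NumPy sketch of these operators follows; it is not the ODL implementation used for the experiments, and all function names are ours. It implements $D_1$, $D_2$ and their adjoints and estimates $\|D\|^2$ by power iteration on a small grid, which can be compared with the value $8$.

```python
import numpy as np

def D1(u):
    """Forward difference along the first coordinate, zero in the last row."""
    out = np.zeros_like(u)
    out[:-1, :] = u[1:, :] - u[:-1, :]
    return out

def D2(u):
    """Forward difference along the second coordinate, zero in the last column."""
    out = np.zeros_like(u)
    out[:, :-1] = u[:, 1:] - u[:, :-1]
    return out

def D1_adj(p):
    """Adjoint of D1 with respect to the Frobenius inner product."""
    out = np.zeros_like(p)
    out[:-1, :] -= p[:-1, :]
    out[1:, :] += p[:-1, :]
    return out

def D2_adj(p):
    """Adjoint of D2 with respect to the Frobenius inner product."""
    out = np.zeros_like(p)
    out[:, :-1] -= p[:, :-1]
    out[:, 1:] += p[:, :-1]
    return out

def estimate_norm_sq(shape, n_iter=200, seed=0):
    """Power iteration on D^* D; returns the Rayleigh quotient ||Du||^2 for ||u|| = 1."""
    u = np.random.default_rng(seed).standard_normal(shape)
    u /= np.linalg.norm(u)
    for _ in range(n_iter):
        u = D1_adj(D1(u)) + D2_adj(D2(u))
        u /= np.linalg.norm(u)
    return np.sum(D1(u) ** 2) + np.sum(D2(u) ** 2)

norm_sq = estimate_norm_sq((16, 16))
```

On a finite grid the estimate stays slightly below $8$ and approaches it as the grid grows.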

We will compare our methods, i.e. the Variable Accelerated SmooThing (VAST) and its stochastic counterpart (sVAST) to the Primal Dual Hybrid Gradient (PDHG) of [15] as well as its stochastic version (sPDHG) from [14]. Furthermore, we will illustrate another competitor, the method by Pesquet and Repetti, see [24], which is another stochastic version of PDHG (see also [29]).

In all examples we choose the parameters in accordance with [14]:

  • for PDHG and Pesquet & Repetti: $\tau = \sigma_i = \frac{\gamma}{\|K\|}$

  • for sPDHG: $\sigma_i = \frac{\gamma}{\|K\|}$ and $\tau = \frac{\gamma}{n \max_i \|K_i\|}$,

where γ=0.99.

Total Variation Denoising

The task at hand is to reconstruct an image from its noisy observation. We do this by solving

$$\min_{x \in \mathbb{R}^{m\times n}} \alpha\|x - b\|_2 + \|D_1 x\|_1 + \|D_2 x\|_1,$$

with $\alpha > 0$ as regularization parameter, in the following setting: $f = \alpha\|\cdot - b\|_2$, $g_1 = g_2 = \|\cdot\|_1$, $K_1 = D_1$, $K_2 = D_2$.
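Both building blocks needed for this setting have closed forms. The sketch below assumes the data term is the (non-squared) Euclidean norm $f = \alpha\|\cdot - b\|_2$, whose prox is a block soft-thresholding around $b$, and uses the fact that for $g = \|\cdot\|_1$ the gradient of the Moreau envelope, $\operatorname{prox}_{\frac{1}{\mu} g^*}(z/\mu)$, is a componentwise clip. Function names are hypothetical.

```python
import numpy as np

def prox_f(v, gamma, alpha, b):
    """prox of gamma * alpha * ||. - b||_2: shrink v towards b by gamma * alpha."""
    r = v - b
    nrm = np.linalg.norm(r)
    if nrm <= gamma * alpha:
        return b.copy()                          # v is close enough to b: snap to b
    return b + (1.0 - gamma * alpha / nrm) * r

def soft_threshold(z, mu):
    """prox of mu * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - mu, 0.0)

def grad_smoothed_l1(z, mu):
    """prox_{(1/mu) g^*}(z / mu) for g = ||.||_1: projection of z / mu onto [-1, 1]."""
    return np.clip(z / mu, -1.0, 1.0)

z = np.array([-2.0, 0.05, 0.3])
mu = 0.1
via_moreau = (z - soft_threshold(z, mu)) / mu    # Moreau decomposition, same quantity
```

The last line checks the clip formula against the Moreau decomposition $\nabla g^{\mu}(z) = (z - \operatorname{prox}_{\mu g}(z))/\mu$, which is a convenient way to validate any hand-derived conjugate prox.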

Figure 1 illustrates the images (of dimension $m=442$ and $n=331$) used for this example. These include the ground truth, i.e. the uncorrupted image, as well as the data $b$ for the optimization problem, which visualizes the level of noise. In Fig. 2 we can see that in the deterministic setting our method is as good as PDHG. For the objective function values, Fig. 2b, this is not too surprising as both algorithms share the same convergence rate. For the distance to a solution, however, we completely lack a convergence result. Nevertheless, in Fig. 2a we can see that our method also performs well with respect to this measure.

Fig. 1. TV denoising. Images used. The approximate solution is computed by running PDHG for 7000 iterations

Fig. 2. TV denoising. Plots illustrating the performance of different methods

In the stochastic setting we can see in Fig. 2 that, while sPDHG provides some benefit over its deterministic counterpart, the stochastic version of our method, although significantly increasing the variance, provides great benefit, at least for the objective function values.

Furthermore, Fig. 3 shows the reconstructions of sPDHG and our method, which are, despite the different objective function values, quite comparable.

Fig. 3. TV denoising. A comparison of the reconstruction for the stochastic variable smoothing method and the stochastic PDHG

Total Variation Deblurring

For this example we want to reconstruct an image from a blurred and noisy observation. We assume that the blurring operator $C:\mathbb{R}^{m\times n}\to\mathbb{R}^{m\times n}$ is known. The reconstruction is done by solving

$$\min_{x \in \mathbb{R}^{m\times n}} \alpha\|Cx - b\|_2 + \|D_1 x\|_1 + \|D_2 x\|_1, \tag{36}$$

for $\alpha > 0$ as regularization parameter, in the following setting: $f = 0$, $g_1 = \alpha\|\cdot - b\|_2$, $g_2 = g_3 = \|\cdot\|_1$, $K_1 = C$, $K_2 = D_1$, $K_3 = D_2$.
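For the data term $g_1 = \alpha\|\cdot - b\|_2$, the quantity $\operatorname{prox}_{\frac{1}{\mu} g_1^*}(z/\mu)$ required by the smoothing schemes has a closed form. The short derivation below is a sketch based on standard conjugacy rules; the ball $B_\alpha$ and its projection $P_{B_\alpha}$ are notation introduced here for illustration.

```latex
\begin{aligned}
g_1^*(p) &= \langle p, b\rangle + \delta_{B_\alpha}(p),
  \qquad B_\alpha := \{\, p : \|p\|_2 \le \alpha \,\}, \\
\operatorname{prox}_{\frac{1}{\mu} g_1^*}\!\left(\tfrac{z}{\mu}\right)
  &= \operatorname*{arg\,min}_{p \in B_\alpha}
     \left\{ \tfrac{1}{\mu}\langle p, b\rangle
     + \tfrac{1}{2}\left\| p - \tfrac{z}{\mu} \right\|_2^2 \right\}
   = P_{B_\alpha}\!\left(\tfrac{z - b}{\mu}\right), \\
P_{B_\alpha}(w) &= \min\!\left\{ 1, \tfrac{\alpha}{\|w\|_2} \right\} w
  \quad (w \neq 0), \qquad P_{B_\alpha}(0) = 0 .
\end{aligned}
```

In words: subtract the data, rescale by $1/\mu$, and project onto the Euclidean ball of radius $\alpha$. Since the result always has norm at most $\alpha$, this also makes the Lipschitz constant of $g_1$ visible.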

Figure 4 shows the images used to set up the optimization problem (36), in particular Fig. 4b which corresponds to b in said problem.

Fig. 4. TV deblurring. The approximate solution is computed by running PDHG for 3000 iterations

In Fig. 5 we see that while PDHG performs better in the deterministic setting, in particular in the later iterations, the stochastic variable smoothing method provides a significant improvement, whereas the sPDHG method does not seem to converge. It is interesting to note that in this setting even the deterministic version of our algorithm exhibits a slightly chaotic behaviour. Although neither of the two methods is monotone in the primal objective function, PDHG seems much more stable here.

Fig. 5. TV deblurring. Plots illustrating the performance of different methods

Matrix Factorization

In this section we want to solve a nonconvex and nonsmooth optimization problem of completely positive matrix factorization, see [16, 19, 27]. For an observed matrix $A \in \mathbb{R}^{d\times d}$ we want to find a completely positive low rank factorization, meaning we are looking for $x \in \mathbb{R}_{\ge 0}^{r\times d}$ with $r \ll d$ such that $x^T x = A$. This can be formulated as the following optimization problem

$$\min_{x \in \mathbb{R}_{\ge 0}^{r\times d}} \|x^T x - A\|_1, \tag{37}$$

where $x^T$ denotes the transpose of the matrix $x$. A more natural approach might be to use a smooth formulation where $\|\cdot\|_2^2$ is used instead of the 1-norm we are suggesting. However, the former choice of distance measure, albeit smooth, comes with its own set of problems (mainly a non-Lipschitz gradient).

The so-called Prox-Linear method presented in [18] solves the above problem (37) by linearizing the smooth ($\mathbb{R}^{d\times d}$-valued) function $x \mapsto x^T x$ inside the nonsmooth distance function. In particular, for the problem

$$\min_x g(c(x)),$$

for a smooth vector valued function c and a convex and Lipschitz function g, [18] proposes to iteratively solve the subproblem

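$$x_{k+1} = \operatorname*{arg\,min}_x \left\{ g\big(c(x_k) + \nabla c(x_k)(x - x_k)\big) + \frac{1}{2t}\|x - x_k\|_2^2 \right\} \quad \forall k \ge 0, \tag{38}$$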

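for a stepsize $t \le (L_g L_{\nabla c})^{-1}$, where $L_g$ and $L_{\nabla c}$ denote the Lipschitz constants of $g$ and $\nabla c$, respectively. For our particular problem described in (37) the subproblem looks as follows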

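$$x_{k+1} = \operatorname*{arg\,min}_{x \in \mathbb{R}_{\ge 0}^{r\times d}} \left\{ \|x_k^T x - A\|_1 + \frac{1}{2}\|x - x_k\|_2^2 \right\}, \tag{39}$$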

and therefore fits our general setup described in (1) with the identification $f = \frac{1}{2}\|\cdot - x_k\|_2^2 + \delta_{\mathbb{R}_{\ge 0}^{r\times d}}$, $g = \|\cdot - A\|_1$ and $K = x_k^T$. Moreover, due to its separable structure, the subproblem (39) fits the special case described in (35) and can therefore be tackled by the stochastic version of our algorithm presented in Algorithm 4.3. In particular, reformulating (38) for the stochastic finite sum setting, we interpret the subproblem as

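$$x_{k+1} = \operatorname*{arg\,min}_{x \in \mathbb{R}_{\ge 0}^{r\times d}} \left\{ \sum_{i=1}^d \big\|x_k^T[i,:]\,x - A[i,:]\big\|_1 + \frac{1}{2}\|x - x_k\|_2^2 \right\},$$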

where $A[i,:]$ denotes the $i$-th row of the matrix $A$ (Fig. 6).
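The row-wise splitting can be illustrated with a small NumPy sketch. It assumes the summands are $g_i = \|\cdot - A[i,:]\|_1$ with $K_i = x_k^T[i,:]$, so that the gradient of each smoothed summand is a clipped residual row, and it samples rows independently with probabilities $p_i$ (an assumption made here for simplicity). All function names are hypothetical. For a tiny $d$ the expectation of the sampled gradient can be enumerated exactly, which confirms that the $1/p_i$ weighting makes the estimator unbiased.

```python
import numpy as np

def smoothed_rows(xk, x, A, mu):
    """Row i holds prox_{(1/mu) g_i^*}(K_i x / mu) = clip((x_k^T[i,:] x - A[i,:]) / mu)."""
    return np.clip((xk.T @ x - A) / mu, -1.0, 1.0)

def full_grad(xk, x, A, mu):
    """sum_i K_i^* (row i), which collapses to a single matrix product."""
    return xk @ smoothed_rows(xk, x, A, mu)

def sampled_grad(xk, x, A, mu, eps, p):
    """sum_i (eps_i / p_i) K_i^* (row i) for a boolean row mask eps."""
    weights = np.where(eps, 1.0 / p, 0.0)
    return xk @ (weights[:, None] * smoothed_rows(xk, x, A, mu))

def expected_sampled_grad(xk, x, A, mu, p):
    """Exact expectation over independent Bernoulli(p_i) masks (small d only)."""
    d = A.shape[0]
    total = np.zeros_like(x)
    for code in range(2 ** d):
        eps = np.array([(code >> i) & 1 for i in range(d)], dtype=bool)
        prob = np.prod(np.where(eps, p, 1.0 - p))
        total += prob * sampled_grad(xk, x, A, mu, eps, p)
    return total

rng = np.random.default_rng(0)
r, d, mu = 2, 4, 0.1
x_true = rng.random((r, d))
A = x_true.T @ x_true                            # exactly factorizable by construction
xk, x = rng.random((r, d)), rng.random((r, d))
p = np.array([0.5, 0.25, 0.75, 1.0])
```

In practice only the sampled rows of $x_k^T x - A$ would be formed, which is where the per-iteration savings come from; the sketch computes all rows for clarity.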

Fig. 6. Comparison of the evolution of the objective function values for different starting points. We run 40 epochs with 5 iterations each. For each epoch we choose the last iterate of the previous epoch as the linearization point. For the stochastic methods we fix a priori the number of rows (batch size) that are randomly chosen in each update, and count $d$ divided by this number as one iteration. For the randomly chosen initial point we use a batch size of 3 (to allow for more exploration), and for the one close to the solution we use 5 in order to obtain more accuracy. The parameter $b$ in the variable smoothing method was chosen with minimal tuning to be 0.1 for both the deterministic and the stochastic version

In comparison to Sects. 5.1 and 5.2 a new aspect becomes important when evaluating methods for solving (38). Now, it is not only relevant how well subproblem (39) is solved, but also which trajectory is taken in doing so, as different paths might lead to different local minima. This can be seen in Fig. 6, where PDHG gets stuck early on in bad local minima. The variable smoothing method (especially the stochastic version) is able to move further from the starting point and find better local minima. Note that in general the methods have difficulty finding the global minimum $x_{\mathrm{true}} \in \mathbb{R}^{3\times 60}$ (with optimal objective function value zero, as we constructed $A := x_{\mathrm{true}}^T x_{\mathrm{true}} \in \mathbb{R}^{60\times 60}$ in all examples).

Acknowledgements

The authors are thankful to two anonymous reviewers for comments and remarks which improved the quality of the presentation and led to the numerical experiment on matrix factorization.

Funding

Open access funding provided by Austrian Science Fund (FWF).

Footnotes

Research partially supported by FWF (Austrian Science Fund) project I 2419-N32. Research supported by the doctoral programme Vienna Graduate School on Computational Optimization (VGSCO), FWF (Austrian Science Fund), Project W 1260.

Contributor Information

Radu Ioan Boţ, Email: radu.bot@univie.ac.at.

Axel Böhm, Email: axel.boehm@univie.ac.at.

References

  • 1. Adler J, Kohr H, Öktem O. Operator Discretization Library. https://odlgroup.github.io/odl/. 2017.
  • 2. Bauschke HH, Combettes PL. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. New York: Springer; 2011.
  • 3. Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009;2(1):183–202. doi: 10.1137/080716542.
  • 4. Borwein JM, Vanderwerff JD. Convex Functions: Constructions, Characterizations and Counterexamples. Cambridge: Cambridge University Press; 2010.
  • 5. Boţ RI, Csetnek ER. On the convergence rate of a forward–backward type primal–dual splitting algorithm for convex optimization problems. Optimization. 2015;64(1):5–23. doi: 10.1080/02331934.2014.966306.
  • 6. Boţ RI, Csetnek ER, Heinrich A, Hendrich C. On the convergence rate improvement of a primal–dual splitting algorithm for solving monotone inclusion problems. Math. Program. 2015;150(2):251–279. doi: 10.1007/s10107-014-0766-0.
  • 7. Boţ RI, Hendrich C. A double smoothing technique for solving unconstrained nondifferentiable convex optimization problems. Comput. Optim. Appl. 2013;54(2):239–262. doi: 10.1007/s10589-012-9523-6.
  • 8. Boţ RI, Hendrich CC. A Douglas-Rachford type primal–dual method for solving inclusions with mixtures of composite and parallel-sum type monotone operators. SIAM J. Optim. 2013;23(4):2541–2565. doi: 10.1137/120901106.
  • 9. Boţ RI, Hendrich C. Convergence analysis for a primal–dual monotone+ skew splitting algorithm with applications to total variation minimization. J. Math. Imaging Vis. 2014;49(3):551–568. doi: 10.1007/s10851-013-0486-8.
  • 10. Boţ RI, Hendrich C. On the acceleration of the double smoothing technique for unconstrained convex optimization problems. Optimization. 2015;64(2):265–288. doi: 10.1080/02331934.2012.745530.
  • 11. Boţ RI, Hendrich C. A variable smoothing algorithm for solving convex optimization problems. TOP. 2015;23(1):124–150. doi: 10.1007/s11750-014-0326-z.
  • 12. Chambolle A. An algorithm for total variation minimization and applications. J. Math. Imaging Vis. 2004;20(1–2):89–97.
  • 13. Chambolle A, Dossal C. On the convergence of the iterates of the Fast Iterative Shrinkage/Thresholding Algorithm. J. Optim. Theory Appl. 2015;166(3):968–982. doi: 10.1007/s10957-015-0746-4.
  • 14. Chambolle A, Ehrhardt MJ, Richtárik P, Schönlieb CB. Stochastic primal–dual hybrid gradient algorithm with arbitrary sampling and imaging applications. SIAM J. Optim. 2018;28(4):2783–2808. doi: 10.1137/17M1134834.
  • 15. Chambolle A, Pock T. A first-order primal–dual algorithm for convex problems with applications to imaging. J. Math. Imaging Vis. 2011;40(1):120–145. doi: 10.1007/s10851-010-0251-1.
  • 16. Chen C, Pong TK, Tan L, Zeng L. A difference-of-convex approach for split feasibility with applications to matrix factorizations and outlier detection. J. Glob. Optim. 2020. doi: 10.1007/s10898-020-00899-8.
  • 17. Condat L. A primal–dual splitting method for convex optimization involving Lipschitzian, proximable and linear composite terms. J. Optim. Theory Appl. 2013;158(2):460–479. doi: 10.1007/s10957-012-0245-9.
  • 18. Drusvyatskiy D, Paquette C. Efficiency of minimizing compositions of convex functions and smooth maps. Math. Program. 2019;178:1–56. doi: 10.1007/s10107-018-1311-3.
  • 19. Groetzner P, Dür M. A factorization method for completely positive matrices. Linear Algebra Appl. 2020;591:1–24. doi: 10.1016/j.laa.2019.12.024.
  • 20. Nesterov Y. Smooth minimization of non-smooth functions. Math. Program. 2005;103(1):127–152. doi: 10.1007/s10107-004-0552-5.
  • 21. Nesterov Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). Doklady Akademija Nauk USSR. 1983;269:543–547.
  • 22. Nesterov Y. Smoothing technique and its applications in semidefinite optimization. Math. Program. 2007;110(2):245–259. doi: 10.1007/s10107-006-0001-8.
  • 23. Nesterov Y. Introductory Lectures on Convex Optimization: A Basic Course. New York: Springer; 2013.
  • 24. Pesquet J-C, Repetti A. A class of randomized primal–dual algorithms for distributed optimization. J. Nonlinear Convex Anal. 2015;16(12):2453–2490.
  • 25. Robbins H, Siegmund D. A convergence theorem for non negative almost supermartingales and some applications. In: Optimizing Methods in Statistics, Proceedings of a Symposium Held at the Center for Tomorrow, Ohio State University, June 14–16. Elsevier; 1971. pp. 233–257.
  • 26. Rosasco L, Villa S, Vũ BC. A first-order stochastic primal-dual algorithm with correction step. Numer. Funct. Anal. Optim. 2017;38(5):602–626. doi: 10.1080/01630563.2016.1254243.
  • 27. Shi Q, Sun H, Songtao L, Hong M, Razaviyayn M. Inexact block coordinate descent methods for symmetric nonnegative matrix factorization. IEEE Trans. Signal Process. 2017;65(22):5995–6008. doi: 10.1109/TSP.2017.2731321.
  • 28. Tran-Dinh Q, Fercoq O, Cevher V. A smooth primal–dual optimization framework for nonsmooth composite convex minimization. SIAM J. Optim. 2018;28(1):96–134. doi: 10.1137/16M1093094.
  • 29. Vũ BC. A splitting algorithm for dual monotone inclusions involving cocoercive operators. Adv. Comput. Math. 2013;38(3):667–681. doi: 10.1007/s10444-011-9254-8.

Articles from Journal of Scientific Computing are provided here courtesy of Springer
