Skip to main content
Springer logoLink to Springer
. 2022 Jun 14;199(1-2):305–341. doi: 10.1007/s10107-022-01833-4

Subgradient ellipsoid method for nonsmooth convex problems

Anton Rodomanov 1,, Yurii Nesterov 2
PMCID: PMC10121548  PMID: 37155414

Abstract

In this paper, we present a new ellipsoid-type algorithm for solving nonsmooth problems with convex structure. Examples of such problems include nonsmooth convex minimization problems, convex-concave saddle-point problems and variational inequalities with monotone operator. Our algorithm can be seen as a combination of the standard Subgradient and Ellipsoid methods. However, in contrast to the latter one, the proposed method has a reasonable convergence rate even when the dimensionality of the problem is sufficiently large. For generating accuracy certificates in our algorithm, we propose an efficient technique, which ameliorates the previously known recipes (Nemirovski in Math Oper Res 35(1):52–78, 2010).

Keywords: Subgradient method, Ellipsoid method, Accuracy certificates, Separating oracle, Convex optimization, Nonsmooth optimization, Saddle-point problems, Variational inequalities

Introduction

The Ellipsoid Method is a classical algorithm in Convex Optimization. It was proposed in 1976 by Yudin and Nemirovski [23] as the modified method of centered cross-sections and then independently rediscovered a year later by Shor [21] in the form of the subgradient method with space dilation. However, the popularity came to the Ellipsoid Method only when Khachiyan used it in 1979 for proving his famous result on polynomial solvability of Linear Programming [10]. Shortly after, several polynomial algorithms, based on the Ellipsoid Method, were developed for some combinatorial optimization problems [9]. For more details and historical remarks on the Ellipsoid Method, see [2, 3, 14].

Despite its long history, the Ellipsoid Method still has some issues which have not been fully resolved or have been resolved only recently. One of them is the computation of accuracy certificates which is important for generating approximate solutions to dual problems or for solving general problems with convex structure (saddle-point problems, variational inequalities, etc.). For a long time, the procedure for calculating an accuracy certificate in the Ellipsoid Method required solving an auxiliary piecewise linear optimization problem (see, e.g., sect. 5 and 6 in [14]). Although this auxiliary computation did not use any additional calls to the oracle, it was still computationally expensive and, in some cases, could take even more time than the Ellipsoid Method itself. Only recently an efficient alternative has been proposed [16].

Another issue with the Ellipsoid Method is related to its poor dependency on the dimensionality of the problem. Consider, e.g., the minimization problem

minxQf(x), 1

where f:RnR is a convex function and Q:={xRn:xR} is the Euclidean ball of radius R>0. The Ellipsoid Method for solving (1) can be written as follows (see, e.g., sect. 3.2.8 in [19]):

xk+1:=xk-1n+1Wkgkgk,Wkgk1/2,Wk+1:=n2n2-1(Wk-2n+1WkgkgkTWkgk,Wkgk),k0, 2

where x0:=0, W0:=R2I (I is the identity matrix) and gk:=f(xk) is an arbitrary nonzero subgradient if xkQ, and gk is an arbitrary separator1 of xk from Q if xkQ.

To solve problem (1) with accuracy ϵ>0 (in terms of the function value), the Ellipsoid Method needs

O(n2lnMRϵ) 3

iterations, where M>0 is the Lipschitz constant of f on Q (see theorem 3.2.11 in [19]). Looking at this estimate, we can see an immediate drawback: it directly depends on the dimension and becomes useless when n. In particular, we cannot guarantee any reasonable rate of convergence for the Ellipsoid Method when the dimensionality of the problem is sufficiently big.

Note that the aforementioned drawback is an artifact of the method itself, not its analysis. Indeed, when n, iteration (2) reads

xk+1:=xk,Wk+1:=Wk,k0.

Thus, the method stays at the same point and does not make any progress.

On the other hand, the simplest Subgradient Method for solving (1) possesses the “dimension-independent” O(M2R2/ϵ2) iteration complexity bound (see, e.g., sect. 3.2.3 in [19]). Comparing this estimate with (3), we see that the Ellipsoid Method is significantly faster than the Subgradient Method only when n is not too big compared to MR/ϵ and significantly slower otherwise. Clearly, this situation is strange because the former algorithm does much more work at every iteration by “improving” the “metric” Wk which is used for measuring the norm of the subgradients.

In this paper, we propose a new ellipsoid-type algorithm for solving nonsmooth problems with convex structure, which does not have the discussed above drawback. Our algorithm can be seen as a combination of the Subgradient and Ellipsoid methods and its convergence rate is basically as good as the best of the corresponding rates of these two methods (up to some logarithmic factors). In particular, when n, the convergence rate of our algorithm coincides with that of the Subgradient Method.

Contents

This paper is organized as follows. In Sect. 2.1, we review the general formulation of a problem with convex structure and the associated with it notions of accuracy certificate and residual. Our presentation mostly follows [16] with examples taken from [18]. Then, in Sect. 2.2, we introduce the notions of accuracy semicertificate and gap and discuss their relation with those of accuracy certificate and residual.

In Sect. 3, we present the general algorithmic scheme of our methods. To measure the convergence rate of this scheme, we introduce the notion of sliding gap and establish some preliminary bounds on it.

In Sect. 4, we discuss different choices of parameters in our general scheme. First, we show that, by setting some of the parameters to zero, we obtain the standard Subgradient and Ellipsoid methods. Then we consider a couple of other less trivial choices which lead to two new algorithms. The principal of these new algorithms is the latter one, which we call the Subgradient Ellipsoid Method. We demonstrate that the convergence rate of this algorithm is basically as good as the best of those of the Subgradient and Ellipsoid methods.

In Sect. 5, we show that, for both our new methods, it is possible to efficiently generate accuracy semicertificates whose gap is upper bounded by the sliding gap. We also compare our approach with the recently proposed technique from [16] for building accuracy certificates for the standard Ellipsoid Method.

In Sect. 6, we discuss how to efficiently implement our general scheme and the procedure for generating accuracy semicertificates. In particular, we show that the time and memory requirements of our scheme are the same as in the standard Ellipsoid Method.

Finally, in Sect. 7, we discuss some open questions.

Notation and generalities

In this paper, E denotes an arbitrary n-dimensional real vector space. Its dual space, composed of all linear functionals on E, is denoted by E. The value of sE, evaluated at xE, is denoted by s,x. See [19, sect. 4.2.1] for the supporting discussion of abstract real vector spaces in Optimization.

Let us introduce in the spaces E and E a pair of conjugate Euclidean norms. To this end, let us fix a self-adjoint positive definite linear operator B:EE and define

x:=Bx,x1/2,xE,s:=s,B-1s1/2,sE.

Note that, for any sE and xE, we have the Cauchy-Schwarz inequality

|s,x|sx,

which becomes an equality if and only if s and Bx are collinear. In addition to x and ·, we often work with other Euclidean norms defined in the same way but using another reference operator instead of B. In this case, we write ·G and ·G, where G:EE is the corresponding self-adjoint positive definite linear operator.

Sometimes, in the formulas, involving products of linear operators, it is convenient to treat xE as a linear operator from R to E, defined by xα:=αx, and x as a linear operator from E to R, defined by xs:=s,x. Likewise, any sE can be treated as a linear operator from R to E, defined by sα:=αs, and s as a linear operator from E to R, defined by sx:=s,x. Then, xx and ss are rank-one self-adjoint linear operators from E to E and from E to E respectively, acting as follows: (xx)s=s,xx and (ss)x=s,xs for any xE and sE.

For a self-adjoint linear operator G:EE, by trG and detG, we denote the trace and determinant of G with respect to our fixed operator B:

trG:=tr(B-1G),detG:=det(B-1G).

Note that, in these definitions, B-1G is a linear operator from E to E, so tr(B-1G) and det(B-1G) are the standard well-defined notions of trace and determinant of a linear operator acting on the same space. For example, they can be defined as the trace and determinant of the matrix representation of B-1G with respect to an arbitrary chosen basis in E (the result is independent of the particular choice of basis). Alternatively, trG and detG can be equivalently defined as the sum and product, respectively, of the eigenvalues of G with respect to B.

For a point xE and a real r>0, by

Bx,r:={yE:xr},

we denote the closed Euclidean ball with center x and radius r.

Given two solids2Q,Q0E, we can define the relative volume of Q with respect to Q0 by vol(Q/Q0):=volQe/volQ0e, where e is an arbitrary basis in E, Qe,Q0eRn are the coordinate representations of the sets Q,Q0 in the basis e and vol is the Lebesgue measure in Rn. Note that the relative volume is independent of the particular choice of the basis e. Indeed, for any other basis f, we have Qe=TfeQf, Q0e=TfeQ0f, where Tfe is the n×n change-of-basis matrix, so volQe=(detTfe)(volQf), volQ0e=(detTfe)(volQ0f) and hence volQe/volQ0e=volQf/volQ0f.

For us, it will be convenient to define the volume of a solid QE as the relative volume of Q with respect to the unit ball:

volQ:=vol(Q/B0,1).

For an ellipsoid W:={xE:Gx,x1}, where G:EE is a self-adjoint positive definite linear operator, we have volW=(detG)-1/2.

Convex problems and accuracy certificates

Description and examples

In this paper, we consider numerical algorithms for solving problems with convex structure. The main examples of such problems are convex minimization problems, convex-concave saddle-point problems, convex Nash equilibrium problems, and variational inequalities with monotone operators.

The general formulation of a problem with convex structure involves two objects:

  • Solid QE (called the feasible set), represented by the Separation Oracle: given any point xE, this oracle can check whether xintQ, and if not, it reports a vector gQ(x)E\{0} which separates x from Q:
    gQ(x),x-y0,yQ. 4
  • Vector field g:intQE, represented by the First-Order Oracle: given any point xintQ, this oracle returns the vector g(x).

In what follows, we only consider the problems satisfying the following condition:

xQ:g(x),x-x0,xintQ. 5

Remark 1

A careful reader may note that the notation x overlaps with our general notation for the linear operator generated by a point x (see Sect. 1). However, there should be no risk of confusion since the precise meaning of x can usually be easily inferred from the context.

A numerical algorithm for solving a problem with convex structure starts at some point x0E. At each step k0, it queries the oracles at the current test point xk to obtain the new information about the problem, and then somehow uses this new information to form the next test point xk+1. Depending on whether xkintQ, the kth step of the algorithm is called productive or nonproductive.

The total information, obtained by the algorithm from the oracles after k1 steps, comprises its execution protocol which consists of:

  • The test points x0,,xk-1E.

  • The set of productive steps Ik:={0ik-1:xiintQ}.

  • The vectors g0,,gk-1E reported by the oracles: gi:=g(xi), if iIk, and gi:=gQ(xi), if iIk, 0ik-1.

An accuracy certificate, associated with the above execution protocol, is a nonnegative vector λ:=(λ0,,λk-1) such that Sk(λ):=iIkλi>0 (and, in particular, Ik). Given any solid Ω, containing Q, we can define the following residual of λ on Ω:

ϵk(λ):=maxxΩ1Sk(λ)i=0k-1λigi,xi-x, 6

which is easily computable whenever Ω is a simple set (e.g., a Euclidean ball). Note that

ϵk(λ)maxxQ1Sk(λ)i=0k-1λigi,xi-xmaxxQ1Sk(λ)iIkλigi,xi-x 7

and, in particular, ϵk(λ)0 in view of (5).

In what follows, we will be interested in the algorithms, which can produce accuracy certificates λ(k) with ϵk(λ(k))0 at a certain rate. This is a meaningful goal because, for all known instances of problems with convex structure, the residual ϵk(λ) upper bounds a certain natural inaccuracy measure for the corresponding problem. Let us briefly review some standard examples (for more examples, see [16, 18] and the references therein).

Example 1

(Convex minimization problem) Consider the problem

f:=minxQf(x), 8

where QE is a solid and f:ER{+} is closed convex and finite on intQ.

The First-Order Oracle for (8) is g(x):=f(x), xintQ, where f(x) is an arbitrary subgradient of f at x. Clearly, (5) holds for x being any solution of (8).

One can verify that, in this example, the residual ϵk(λ) upper bounds the functional residual: for x^k:=1Sk(λ)iIkλixi or xk:=argmin{f(x):xXk}, where Xk:={xi:iIk}, we have f(x^k)-fϵk(λ) and f(xk)-fϵk(λ).

Moreover, ϵk(λ), in fact, upper bounds the primal-dual gap for a certain dual problem for (8). Indeed, let f:ER{+} be the conjugate function of f. Then, we can represent (8) in the following dual form:

f=minxQmaxsdomf[s,x-f(s)]=maxsdomf[-f(s)-ξQ(-s)], 9

where domf:={sE:f(s)<+} and ξQ(-s):=maxxQ-s,x. Denote sk:=1Sk(λ)iIkλigi. Then, using (7) and the convexity of f and f, we obtain

ϵk(λ)1Sk(λ)iIkλigi,xi+ξQ(-sk)=1Sk(λ)iIkλi[f(xi)+f(gi)]+ξQ(-sk)f(x^k)+f(sk)+ξQ(-sk).

Thus, x^k and sk are ϵk(λ)-approximate solutions (in terms of function value) to problems (8) and (9), respectively. Note that the same is true if we replace x^k with xk.

Example 2

(Convex-concave saddle-point problem) Consider the following problem: Find (u,v)U×V such that

f(u,v)f(u,v)f(u,v),(u,v)U×V, 10

where U, V are solids in some finite-dimensional vector spaces Eu, Ev, respectively, and f:U×VR is a continuous function which is convex-concave, i.e., f(·,v) is convex and f(u,·) is concave for any uU and any vV.

In this example, we set E:=Eu×Ev, Q:=U×V and use the First-Order Oracle

g(x):=(fu(x),-fv(x)),x:=(u,v)intQ,

where fu(x) is an arbitrary subgradient of f(·,v) at u and fv(y) is an arbitrary supergradient of f(u,·) at v. Then, for any x:=(u,v)intQ and any x:=(u,v)Q,

g(x),x-x=fu(x),u-u-fv(x),v-vf(u,v)-f(u,v). 11

In particular, (5) holds for x:=(u,v) in view of (10).

Let ϕ:UR and ψ:VR be the functions

ϕ(u):=maxvVf(u,v),ψ(v):=minuUf(u,v).

In view of (10), we have ψ(v)f(u,v)ϕ(u) for all (u,v)U×V. Therefore, the difference ϕ(u)-ψ(v) (called the primal-dual gap) can be used for measuring the quality of an approximate solution x:=(u,v)Q to problem (10).

Denoting x^k:=1Sk(λ)iIkλixi=:(u^k,v^k) and using (7), we obtain

ϵk(λ)maxxQ1Sk(λ)iIkλigi,xi-xmaxuU,vV1Sk(λ)iIkλi[f(ui,v)-f(u,vi)]maxuU,vV[f(u^k,v)-f(u,v^k)]=ϕ(u^k)-ψ(v^k),

where the second inequality is due to (11) and the last one follows from the convexity-concavity of f. Thus, the residual ϵk(λ) upper bounds the primal-dual gap for the approximate solution x^k.

Example 3

(Variational inequality with monotone operator) Let QE be a solid and let V:QE be a continuous operator which is monotone, i.e., V(x)-V(y),x-y0 for all x,yQ. The goal is to solve the following (weak) variational inequality:

FindxQ:V(x),x-x0,xQ. 12

Since V is continuous, this problem is equivalent to its strong variant: find xQ such that V(x),x-x0 for all xQ.

A standard tool for measuring the quality of an approximate solution to (12) is the dual gap function, introduced in [1]:

f(x):=maxyQV(y),x-y,xQ.

It is easy to see that f is a convex nonnegative function which equals 0 exactly at the solutions of (12).

In this example, the First-Order Oracle is defined by g(x):=V(x) for any xintQ. Denote x^k:=1Sk(λ)iIkλixi. Then, using (7) and the monotonicity of V, we obtain

ϵk(λ)maxxQ1Sk(λ)iIkλiV(xi),xi-xmaxxQ1Sk(λ)iIkλiV(x),xi-x=f(x^k).

Thus, ϵk(λ) upper bounds the dual gap function for the approximate solution x^k.

Establishing convergence of residual

For the algorithms, considered in this paper, instead of accuracy certificates and residuals, it turns out to be more convenient to speak about closely related notions of accuracy semicertificates and gaps, which we now introduce.

As before, let x0,,xk-1 be the test points, generated by the algorithm after k1 steps, and let g0,,gk-1 be the corresponding oracle outputs. An accuracy semicertificate, associated with this information, is a nonnegative vector λ:=(λ0,,λk-1) such that Γk(λ):=i=0k-1λigi>0. Given any solid Ω, containing Q, the gap of λ on Ω is defined in the following way:

δk(λ):=maxxΩ1Γk(λ)i=0k-1λigi,xi-x. 13

Comparing these definitions with those of accuracy certificate and residual, we see that the only difference between them is that now we use a different “normalizing” coefficient: Γk(λ) instead of Sk(λ). Also, in the definitions of semicertificate and gap, we do not make any distinction between productive and nonproductive steps. Note that δk(λ)0.

Let us demonstrate that by making the gap sufficiently small, we can make the corresponding residual sufficiently small as well. For this, we need the following standard assumption about our problem with convex structure (see, e.g., [16]).

Assumption 1

The vector field g, reported by the First-Order Oracle, is semibounded:

g(x),y-xV,xintQ,yQ.

A classical example of a semibounded field is a bounded one: if there is M0, such that g(x)M for all xintQ, then g is semibounded with V:=MD, where D is the diameter of Q. However, there exist other examples. For instance, if g is the subgradient field of a convex function f:ER{+}, which is finite and continuous on Q, then g is semibounded with V:=maxQf-minQf (variation of f on Q); however, g is not bounded if f is not Lipschitz continuous (e.g., f(x):=-x on Q:=[0,1]). Another interesting example is the subgradient field g of a ν-self-concordant barrier f:ER{+} for the set Q; in this case, g is semibounded with V:=ν (see, e.g., [19, Theorem 5.3.7]), while f(x)+ at the boundary of Q.

Lemma 1

Let λ be a semicertificate such that δk(λ)<r, where r is the largest of the radii of Euclidean balls contained in Q. Then, λ is a certificate and

ϵk(λ)δk(λ)r-δk(λ)V.

Proof

Denote δk:=δk(λ), Γk:=Γk(λ), Sk:=Sk(λ). Let x¯Q be such that Bx¯,rQ. For each 0ik-1, let zi be a maximizer of zgi,z-x¯ on Bx¯,r. Then, for any 0ik-1, we have gi,x¯-xi=gi,zi-xi-rgi with ziQ. Therefore,

i=0k-1λigi,x¯-xi=i=0k-1λigi,zi-xi-rΓkSkV-rΓk, 14

where the inequality follows from the separation property (4) and Assumption 1.

Let xΩ be arbitrary. Denoting y:=(δkx¯+(r-δk)x)/rΩ, we obtain

(r-δk)i=0k-1λigi,xi-x=ri=0k-1λigi,xi-y+δki=0k-1λigi,x¯-xirδkΓk+δki=0k-1λigi,x¯-xiδkSkV, 15

where the inequalities follow from the definition (13) of δk and (14), respectively.

It remains to show that λ is a certificate, i.e., Sk>0. But this is simple. Indeed, if Sk=0, then, taking x:=x¯ in (15) and using (14), we get 0i=0k-1λigi,xi-x¯rΓk, which contradicts our assumption that λ is a semicertificate, i.e., Γk>0.

According to Lemma 1, from the convergence rate of the gap δk(λ(k)) to zero, we can easily obtain the corresponding convergence rate of the residual ϵk(λ(k)). In particular, to ensure that ϵk(λ(k))ϵ for some ϵ>0, it suffices to make δk(λ(k))δ(ϵ):=ϵr/(ϵ+V). For this reason, in the rest of this paper, we can focus our attention on studying the convergence rate only for the gap.

General algorithmic scheme

Consider the general scheme presented in Algorithm 1. This scheme works with an arbitrary oracle G:EE satisfying the following condition:

xBx0,R:G(x),x-x0,xE. 16

The point x from (16) is typically called a solution of our problem. For the general problem with convex structure, represented by the First-Order Oracle g and the Separation Oracle gQ for the solid Q, the oracle G is usually defined as follows: G(x):=g(x), if xintQ, and G(x):=gQ(x), otherwise. To ensure that (16) holds, the constant R needs to be chosen sufficiently big so that QBx0,R.graphic file with name 10107_2022_1833_Figa_HTML.jpg

Note that, in Algorithm 1, ωk are strictly convex quadratic functions and k are affine functions. Therefore, the sets Ωk are certain ellipsoids and Lk- are certain halfspaces (possibly degenerate).

Let us show that Algorithm 1 is a cutting-plane scheme in which the sets ΩkLk- are the localizers of the solution x.

Lemma 2

In Algorithm 1, for all k0, we have xΩkLk- and Q^k+1Ωk+1Lk+1-, where Q^k+1:={xΩkLk-:gk,x-xk0}.

Proof

Let us prove the claim by induction. Clearly, Ω0=Bx0,R, L0-=E, hence Ω0L0-=Bx0,Rx by (16). Suppose we have already proved that xΩkLk- for some k0. Combining this with (16), we obtain xQ^k+1, so it remains to show that Q^k+1Ωk+1Lk+1-. Let xQ^k+1 (ΩkLk-) be arbitrary. Note that 0gk,xk-xUk. Hence, by (17), k+1(x)k(x)0 and ωk+1(x)ωk(x)12R2, which means that xΩk+1Lk+1-.

Next, let us establish an important representation of the ellipsoids Ωk via the functions k and the test points xk. For this, let us define Gk:=2ωk(0) for each k0. Observe that these operators satisfy the following simple relations (cf. (17)):

G0=B,Gk+1=Gk+bkgkgk,k0. 18

Also, let us define the sequence Rk>0 by the recurrence

R0=R,Rk+12=Rk2+(ak+12bkUk)2gkGk21+bkgkGk2,k0. 19

Lemma 3

In Algorithm 1, for all k0, we have

Ωk={xE:-k(x)+12x-xkGk212Rk2}.

In particular, for all k0 and all xΩkLk-, we have x-xkGkRk.

Proof

Let ψk:ER be the function ψk(x):=k(x)+ωk(x). Note that ψk is a quadratic function with Hessian Gk and minimizer xk. Hence, for any xE, we have

ψk(x)=ψk+12x-xkGk2, 20

where ψk:=minxEψk(x).

Let us compute ψk. Combining (17), (18) and (20), for any xE, we obtain

ψk+1(x)=ψk(x)+(ak+12bkUk)gk,x-xk+12bkgk,x-xk2=ψk+12x-xkGk2+(ak+12bkUk)gk,x-xk+12bkgk,x-xk2=ψk+12x-xkGk+12+(ak+12bkUk)gk,x-xk, 21

Therefore,

ψk+1=ψk-12(ak+12bkUk)2gkGk+12=ψk-12(ak+12bkUk)2gkGk21+bkgkGk2, 22

where the last identity follows from the fact that Gk+1-1gk=Gk-1gk/(1+bkgkGk2) (since Gk+1Gk-1gk=(1+bkgkGk2)gk in view of (18)). Since (22) is true for any k0 and since ψ0=0, we thus obtain, in view of (19),

ψk=12(R2-Rk2). 23

Let xΩk be arbitrary. Using the definition of ψk(x) and (23), we obtain

-k(x)+12x-xkGk2=ωk(x)-ψk=ωk(x)+12(Rk2-R2).

Thus, xΩkωk(x)12R2-k(x)+12x-xkGk212Rk2. In particular, for any xΩkLk-, we have k(x)0 and x-xkGkRk.

Lemma 3 has several consequences. First, we see that the localizers ΩkLk- are contained in the ellipsoids {x:x-xkGkRk} whose centers are the test points xk.

Second, we get a uniform upper bound on the function -k on the ellipsoid Ωk: -k(x)12Rk2 for all xΩk. This observation leads us to the following definition of the sliding gap:

Δk:=maxxΩk1Γk[-k(x)]=maxxΩk1Γki=0k-1aigi,xi-x,k1, 24

provided that Γk:=i=0k-1aigi>0. According to our observation, we have

ΔkRk22Γk. 25

At the same time, Δk0 in view of Lemma 2 and 16

Comparing the definition (24) of the sliding gap Δk with the definition (13) of the gap δk(a(k)) for the semicertificate a(k):=(a0,,ak-1), we see that they are almost identical. The only difference between them is that the solid Ωk, over which the maximum is taken in the definition of the sliding gap, depends on the iteration counter k. This seems to be unfortunate because we cannot guarantee that each Ωk contains the feasible set Q (as required in the definition of gap) even if so does the initial solid Ω0=Bx0,R. However, this problem can be dealt with. Namely, in Sect. 5, we will show that the semicertificate a(k) can be efficiently converted into another semicertificate λ(k) for which δk(λ(k))Δk when taken over the initial solid Ω:=Ω0. Thus, the sliding gap Δk is a meaningful measure of convergence rate of Algorithm 1, and it makes sense to call the coefficients a(k) a preliminary semicertificate.

Let us now demonstrate that, for a suitable choice of the coefficients ak and bk in Algorithm 1, we can ensure that the sliding gap Δk converges to zero.

Remark 2

From now on, in order to avoid taking into account some trivial degenerate cases, it will be convenient to make the following minor technical assumption:

In Algorithm1,gk0for allk0.

Indeed, when the oracle reports gk=0 for some k0, it usually means that the test point xk, at which the oracle was queried, is, in fact, an exact solution to our problem. For example, if the standard oracle for a problem with convex structure has reported gk=0, we can terminate the method and return the certificate λ:=(0,,0,1) for which the residual ϵk(λ)=0.

Let us choose the coefficients ak and bk in the following way:

ak:=αkR+12θγRkgkGk,bk:=γgkGk2,k0, 26

where αk,θ,γ0 are certain coefficients to be chosen later.

According to (25), to estimate the convergence rate of the sliding gap, we need to estimate the rate of growth of the coefficients Rk and Γk from above and below, respectively. Let us do this.

Lemma 4

In Algorithm 1 with parameters (26), for all k0, we have

Rk2[qc(γ)]kCkR2, 27

where qc(γ):=1+cγ22(1+γ), c:=12(τ+1)(θ+1)2, Ck:=1+τ+1τi=0k-1αi2 and τ>0 can be chosen arbitrarily. Moreover, if αk=0 for all k0, then, Rk2=[qc(γ)]kR2 for all k0 with c:=12(θ+1)2.

Proof

By the definition of Uk and Lemma 3, we have

Uk=maxxΩkLk-gk,xk-xmaxx-xkGkRkgk,xk-x=RkgkGk. 28

At the same time, Uk0 in view of Lemma 2 and (16). Hence,

(ak+12bkUk)2gkGk21+bkgkGk2(ak+12bkRkgkGk)2gkGk21+bkgkGk2=11+γ(αkR+12(θ+1)γRk)2,

where the identity follows from (26). Combining this with (19), we obtain

Rk+12Rk2+11+γ(αkR+12(θ+1)γRk)2. 29

Note that, for any ξ1,ξ20 and any τ>0, we have

(ξ1+ξ2)2=ξ12+2ξ1ξ2+ξ22τ+1τξ12+(τ+1)ξ22=(τ+1)(1τξ12+ξ22)

(look at the minimum of the right-hand side in τ). Therefore, for arbitrary τ>0,

Rk+12Rk2+τ+11+γ(1ταk2R2+14(θ+1)2γ2Rk2)=qRk2+βkR2,

where we denote q:=qc(γ)1 and βk:=τ+1τ(1+γ)αk2. Dividing both sides by qk+1, we get

Rk+12qk+1Rk2qk+βkR2qk+1.

Since this is true for any k0, we thus obtain, in view of (19), that

Rk2qkR02q0+R2i=0k-1βiqi+1=(1+i=0k-1βiqi+1)R2,

Multiplying both sides by qk and using that βiqi+1τ+1ταi2, we come to (27).

When αk=0 for all k0, we have k=0 and Lk-=E for all k0. Therefore, by Lemma 3, Ωk={x:x-xkGkRk} and hence (28) is, in fact, an equality. Consequently, (29) becomes Rk+12=Rk2+cγ22(1+γ)Rk2=qc(γ)Rk2, where c:=12(θ+1)2.

Remark 3

From the proof, one can see that the quantity Ck in Lemma 4 can be improved up to Ck:=1+τ+1τ(1+γ)i=0k-1αi2[qc(γ)]i+1.

Lemma 5

In Algorithm 1 with parameters (26), for all k1, we have

ΓkR(i=0k-1αi+12θγn[(1+γ)k/n-1]). 30

Proof

By the definition of Γk and (26), we have

Γk=i=0k-1aigi=Ri=0k-1αiρi+12θγi=0k-1Riρi,

where ρi:=gi/giGi. Let us estimate each sum from below separately.

For the first sum, we can use the trivial bound ρi1, which is valid for any i0 (since GiB in view of (18)). This gives us i=0k-1αiρii=0k-1αi.

Let us estimate the second sum. According to (19), for any i0, we have RiR. Hence, i=0k-1RiρiRi=0k-1ρiRi=0k-1ρi21/2 and it remains to lower bound i=0k-1ρi2. By 18 and 26, G0=B and Gi+1=Gi+γgigi/giGi2 for all i0. Therefore,

i=0k-1ρi2=1γi=0k-1(trGi+1-trGi)=1γ(trGk-trB)=1γ(trGk-n)nγ[(detGk)1/n-1]=nγ[(1+γ)k/n-1],

where we have applied the arithmetic-geometric mean inequality. Combining the obtained estimates, we get (30).

Main instances of general scheme

Let us now consider several possibilities for choosing the coefficients αk, θ and γ in (26).

Subgradient method

The simplest possibility is to choose

αk>0,θ:=0,γ:=0.

In this case, bk=0 for all k0, so Gk=B and ωk(x)=ω0(x)=12x2 for all xE and all k0 (see (17) and (18)). Consequently, the new test points xk+1 in Algorithm 1 are generated according to the following rule:

xk+1=argminxE[i=0kaigi,x-xi+12x2],

where ai=αiR/gi. Thus, Algorithm 1 is the Subgradient Method: xk+1=xk-akgk.

In this example, each ellipsoid Ωk is simply a ball: Ωk=Bx0,R for all k0. Hence, the sliding gap Δk, defined in (24), does not “slide” and coincides with the gap of the semicertificate a:=(a0,,ak-1) on the solid Bx0,R:

Δk=maxxBx0,R1Γki=0k-1aigi,xi-x.

In view of Lemmas 4 and 5, for all k1, we have

Rk2(1+i=0k-1αi2)R2,ΓkRi=0k-1αi

(tend τ+ in Lemma 4). Substituting these estimates into (25), we obtain the following well-known estimate for the gap in the Subgradient Method:

Δk1+i=0k-1αi22i=0k-1αiR.

The standard strategies for choosing the coefficients αi are as follows (see, e.g., sect. 3.2.3 in [19]):

  1. We fix in advance the number of iterations k1 of the method and use constant coefficients αi:=1k, 0ik-1. This corresponds to the so-called Short-Step Subgradient Method, for which we have
    ΔkRk.
  2. Alternatively, we can use time-varying coefficients αi:=1i+1, i0. This approach does not require us to fix in advance the number of iterations k. However, the corresponding convergence rate estimate becomes slightly worse:
    Δklnk+22kR.
    (Indeed, i=0k-1αi2=i=1k1ilnk+1, while i=0k-1αik.)

Remark 4

If we allow projections onto the feasible set, then, for the resulting Subgradient Method with time-varying coefficients αi, one can establish the O(1/k) convergence rate for the “truncated” gap

Δk0,k:=maxxBx0,R1Γk0,ki=k0kaigi,xi-x,

where Γk0,k:=i=k0kaigi, k0:=k/2. For more details, see sect. 5.2.1 in [2] or sect. 3.1.1 in [12].

Standard ellipsoid method

Another extreme choice is the following one:

αk:=0,θ:=0,γ>0. 31

For this choice, we have ak=0 for all k0. Hence, k=0 and Lk-=E for all k0. Therefore, the localizers in this method are the following ellipsoids (see Lemma 3):

ΩkLk-=Ωk={xE:x-xkGkRk},k0. 32

Observe that, in this example, Γki=0k-1aigi=0 for all k1, so there is no preliminary semicertificate and the sliding gap is undefined. However, we can still ensure the convergence to zero of a certain meaningful measure of optimality, namely, the average radius of the localizers Ωk:

avradΩk:=(volΩk)1/n,k0. 33

Indeed, let us define the following functions for any real c,p>0:

qc(γ):=1+cγ22(1+γ),ζp,c(γ):=[qc(γ)]p1+γ,γ>0. 34

According to Lemma 4, for any k0, we have

Rk2=[q1/2(γ)]kR2. 35

At the same time, in view of (18) and (26), detGk=i=0k-1(1+bigiGi2)=(1+γ)k for all k0. Combining this with (32)–(34), we obtain, for any k0, that

avradΩk=Rk(detGk)1/(2n)=[q1/2(γ)]k/2R(1+γ)k/(2n)=[ζn,1/2(γ)]k/(2n)R. 36

Let us now choose γ which minimizes avradΩk. For such computations, the following auxiliary result is useful (see Sect. A for the proof).

Lemma 6

For any c1/2 and any p2, the function ζp,c, defined in (34), attains its minimum at a unique point

γc(p):=2c2p2-(2c-1)+cp-11cp2cp 37

with the corresponding value ζp,c(γc(p))e-1/(2cp).

Applying Lemma 6 to 36, we see that the optimal value of γ is

γ:=γ1/2(n)=2n/2+n/2-1=2n-1, 38

for which ζn,1/2(γ)e-1/n. With this choice of γ, we obtain, for all k0, that

avradΩke-k/(2n2)R. 39

One can check that Algorithm 1 with parameters (26), (31) and (38) is, in fact, the standard Ellipsoid Method (see Remark 6).

Ellipsoid method with preliminary semicertificate

As we have seen, we cannot measure the convergence rate of the standard Ellipsoid Method using the sliding gap because there is no preliminary semicertificate in this method. Let us present a modification of the standard Ellipsoid Method which does not have this drawback but still enjoys the same convergence rate as the original method (up to some absolute constants).

For this, let us choose the coefficients in the following way:

αk:=0,θ:=2-1(0.41),γ>0. 40

Then, in view of Lemma 4, for all k0, we have

Rk2=[q1(γ)]kR2, 41

Also, by Lemma 5, Γk12θRγn[(1+γ)k/n-1] for all k1. Thus, for each k1, we obtain the following estimate for the sliding gap (see (25)):

Δk[q1(γ)]kRθγn[(1+γ)k/n-1]=1θκk(γ,n)[ζ2n,1(γ)]k/(2n)R, 42

where κk(γ,n):=γn(1-1(1+γ)k/n) and ζ2n,1(γ) is defined in (34).

Note that the main factor in estimate (42) is [ζ2n,1(γ)]k/(2n). Let us choose γ by minimizing this expression. Applying Lemma 6, we obtain

γ:=γ1(2n)12n1n. 43

Theorem 1

In Algorithm 1 with parameters (26), (40), (43), for all k1,

Δk6e-k/(8n2)R.

Proof

Suppose kn2. According to Lemma 6, we have ζ2n,1(γ)e-1/(4n). Hence, by (42), Δk1θκk(γ,n)e-k/(8n2)R. It remains to estimate from below θκk(γ,n).

Since kn2, we have (1+γ)k/n(1+γ)n1+γn. Hence, κk(γ,n)γn1+γn. Note that the function ττ1+τ is increasing on R+. Therefore, using (43), we obtain κk(γ,n)1/21+1/2=16. Thus, θκk(γ,n)2-1616 for our choice of θ.

Now suppose kn2. Then, 6e-k/(8n2)6e-1/85. Therefore, it suffices to prove that Δk5R or, in view of (24), that gi,xi-x5Rgi, where xΩkLk- and 0ik-1 are arbitrary. Note that gi,xi-xgiGixi-xGigixi-xGi since GiB (see (18)). Hence, it remains to prove that xi-xGi5R.

Recall from (18) and (19) that GiGk and RiRk. Therefore,

xi-xGixi-xGi+x-xGixi-xGi+x-xGkxi-xGi+xk-xGk+xk-xGkRi+2Rk3Rk,

where the penultimate inequality follows from Lemma 2 and 3. According to (41), Rk=[q1(γ)]k/2R[q1(γ)]n2/2R (recall that q1(γ)1). Thus, it remains to show that 3[q1(γ)]n2/25. But this is immediate. Indeed, by (34) and (43), we have [q1(γ)]n2/2en2γ2/(4(1+γ))e1/4, so 3[q1(γ)]n2/23e1/45.

Subgradient ellipsoid method

The previous algorithm still shares the drawback of the original Ellipsoid Method, namely, it does not work when n. To eliminate this drawback, let us choose αk similarly to how this is done in the Subgradient Method.

Consider the following choice of parameters:

αi:=βiθθ+1,θ:=23-1(0.26),γ:=γ1(2n)12n1n, 44

where βi>0 are certain coefficients (to be specified later) and γ1(2n) is defined in (37).

Theorem 2

In Algorithm 1 with parameters (26) and (44), where β01, we have, for all k1,

Δk2i=0k-1βi(1+i=0k-1βi2)R,ifkn2,6e-k/(8n2)(1+i=0k-1βi2)R,ifkn2. 45

Proof

Applying Lemma 4 with τ:=θ and using (44), we obtain

Rk2[q1(γ)]kCkR2,Ck=1+i=0k-1βi2. 46

At the same time, by Lemma 5, we have

ΓkR(θθ+1i=0k-1βi+12θγn[(1+γ)k/n-1]). 47

Note that 12θγn12θθ/(θ+1) by (44). Since β01, we thus obtain

Γk12Rθγn(1+(1+γ)k/n-1)12Rθγn(1+γ)k/(2n)122Rθ(1+γ)k/(2n)112R(1+γ)k/(2n), 48

where the last two inequalities follow from (44). Therefore, by (25), (46) and (48),

ΔkRk22Γk6[q1(γ)]k(1+γ)k/(2n)CkR=6[ζ2n,1(γ)]k/(2n)CkR,

where ζ2n,1(γ) is defined in (34). Observe that, for our choice of γ, by Lemma 6, we have ζ2n,1(γ)e-1/(4n). This proves the second estimate3 in (45).

On the other hand, dropping the second term in (47), we can write

ΓkRθθ+1i=0k-1βi. 49

Suppose kn2. Then, from (34) and (44), it follows that

[q1(γ)]k[q1(γ)]n2eγ2n2/(2(1+γ))e.

Hence, by (46), RkeCkR2. Combining this with (25) and (49), we obtain

Δk12e(θ+1)θ1i=0k-1βiCkR.

By numerical evaluation, one can verify that, for our choice of θ, we have 12e(θ+1)θ2. This proves the first estimate in (45).

Exactly as in the Subgradient Method, we can use the following two strategies for choosing the coefficients βi:

  1. We fix in advance the number of iterations k1 of the method and use constant coefficients βi:=1k, 0ik-1. In this case,
    Δk4R/k,ifkn2,12Re-k/(8n2),ifkn2. 50
  2. We use time-varying coefficients βi:=1i+1, i0. In this case,
    Δk2(lnk+2)R/k,ifkn2,6(lnk+2)Re-k/(8n2),ifkn2.

Let us discuss convergence rate estimate (50). Up to absolute constants, this estimate is exactly the same as in the Subgradient Method when kn2 and as in the Ellipsoid Method when kn2. In particular, when n, we recover the convergence rate of the Subgradient Method.

To provide a better interpretation of the obtained results, let us compare the convergence rates of the Subgradient and Ellipsoid methods:

Subgradient Method:1/kEllipsoid Method:e-k/(2n2).

To compare these rates, let us look at their squared ratio:

ρk:=(1/ke-k/(2n2))2=ek/n2k.

Let us find out for which values of k the rate of the Subgradient Method is better than that of the Ellipsoid Method and vice versa. We assume that n2.

Note that the function τeτ/τ is strictly decreasing on 0,1 and strictly increasing on 1,+ (indeed, its derivative equals eτ(τ-1)/τ2). Hence, ρk is strictly decreasing in k for 1kn2 and strictly increasing in k for kn2. Since n2, we have ρ2=e2/n2/2e1/2/21. At the same time, ρk+ when k. Therefore, there exists a unique integer K02 such that ρk1 for all kK0 and ρk1 for all kK0.

Let us estimate K0. Clearly, for any n2kn2ln(2n), we have

ρken2ln(2n)/n2n2ln(2n)=2nln(2n)1,

while, for any k3n2ln(2n), we have

ρke3n2ln(2n)/n23n2ln(2n)=(2n)33n2ln(2n)=8n3ln(2n)1.

Hence,

n2ln(2n)K03n2ln(2n).

Thus, up to an absolute constant, n2ln(2n) is the switching moment, starting from which the rate of the Ellipsoid Method becomes better than that of the Subgradient Method.

Returning to our obtained estimate (50), we see that, ignoring absolute constants and ignoring the “small” region of the values of k between n2 and n2lnn, our convergence rate is basically the best of the corresponding convergence rates of the Subgradient and Ellipsoid methods.

Constructing accuracy semicertificate

Let us show how to convert a preliminary accuracy semicertificate, produced by Algorithm 1, into a semicertificate whose gap on the initial solid is upper bounded by the sliding gap. The key ingredient here is the following auxiliary algorithm which was first proposed in [16] for building accuracy certificates in the standard Ellipsoid Method.

Augmentation algorithm

Let k0 be an integer and let Q0,,Qk be solids in E such that

Q^i:={xQi:gi,x-xi0}Qi+1,0ik-1, 51

where xiE, giE. Further, suppose that, for any sE and any 0ik-1, we can compute a dual multiplier μ0 such that

maxxQ^is,x=maxxQi[s,x+μgi,xi-x] 52

(provided that certain regularity conditions hold). Let us abbreviate any solution μ of this problem by μ(s,Qi,xi,gi).

Consider now the following routine. graphic file with name 10107_2022_1833_Figb_HTML.jpg

Lemma 7

Let μ0,,μk-10 be generated by Algorithm 2. Then,

maxxQ0[sk,x+i=0k-1μigi,xi-x]maxxQksk,x.

Proof

Indeed, at every iteration i=k-1,,0, we have

maxxQi+1si+1,xmaxxQ^isi+1,x=maxxQi[si+1,x+μigi,xi-x]=maxxQisi,x+μigi,xi.

Summing up these inequalities for i=0,,k-1, we obtain

maxxQksk,xmaxxQ0s0,x+i=0k-1μigi,xi=maxxQ0[sk,x+i=0k-1gi,xi-x],

where the identity follows from the fact that s0=sk-i=0k-1μigi.

Methods with preliminary certificate

Let us apply the Augmentation Algorithm for building an accuracy semicertificate for Algorithm 1. We only consider those instances for which Γk:=i=0k-1aigi>0 so that the sliding gap Δk is well-defined:

Δk:=maxxΩk1Γk[-k(x)]=maxxΩkLk-1Γk[-k(x)]=maxxΩkLk-1Γki=0k-1aigi,xi-x.

Recall that the vector a:=(a0,,ak-1) is called a preliminary semicertificate.

For technical reasons, it will be convenient to add the following termination criterion into Algorithm 1:

Terminate Algorithm1at Step2ifUkδgk, 53

where δ>0 is a fixed constant. Depending on whether this termination criterion has been satisfied at iteration k, we call it a terminal or nonterminal iteration, respectively.

Remark 5

In practice, one can set δ to an arbitrarily small value (within machine precision) if the desired target accuracy is unknown. As can be seen from the subsequent discussion, the main purpose of the termination criterion (53) is to ensure that Uk never becomes equal to zero during the iterations of Algorithm 1. This guarantees the existence of dual multiplier in (52) for any sE at every nonterminal iteration. The case Uk=0 corresponds to the degenerate situation when Algorithm 1 has “accidentally” found an exact solution.

Let k1 be an iteration of Algorithm 1. According to Lemma 2, the sets Qi:=ΩiLi- satisfy (51). Since the method has not been terminated during the course of the previous iterations, we have4Ui>0 for all 0ik-1. Therefore, for any 0ik-1, there exists xQi such that gi,x-xi<0. This guarantees the existence of dual multiplier in (52).

Let us apply Algorithm 2 to sk:=-i=0k-1aigi in order to obtain dual multipliers μ:=(μ0,,μk-1). From Lemma 7, it follows that

maxxBx0,Ri=0k-1(ai+μi)gi,xi-xmaxxQki=0k-1aigi,xi-x=ΓkΔk,

(note that Q0=Ω0L0-=Bx0,R). Thus, defining λ:=a+μ, we obtain Γk(λ)i=0k-1λigii=0k-1aigiΓk>0 and

δk(λ)maxxBx0,R1Γk(λ)i=0k-1λigi,xi-xΓkΓk(λ)ΔkΔk,

Thus, λ is a semicertificate whose gap on Bx0,R is bounded by the sliding gap Δk.

If k0 is a terminal iteration, then, by the termination criterion and the definition of Uk (see Algorithm 1), we have maxxΩkLk-1gkgk,xk-xδ. In this case, we apply Algorithm 2 to sk:=-gk to obtain dual multipliers μ0,,μk-1. By the same reasoning as above but with the vector (0,,0,1) instead of (a0,,ak-1), we can obtain that δk+1(λ)δ, where λ:=(μ0,,μk-1,1).

Standard ellipsoid method

In the standard Ellipsoid Method, there is no preliminary semicertificate. Therefore, we cannot apply the above procedure. However, in this method, it is still possible to generate an accuracy semicertificate, although the corresponding procedure is slightly more involved. Let us now briefly describe this procedure and discuss how it differs from the previous approach. For details, we refer the reader to [16].

Let k1 be an iteration of the method. There are two main steps. The first step is to find a direction sk, in which the “width” of the ellipsoid Ωk (see (32)) is minimal:

sk:=argmins=1maxx,yΩks,x-y=argmins=1[maxxΩks,x-minxΩks,x].

It is not difficult to see that sk is given by the unit eigenvector5 of the operator Gk, corresponding to the largest eigenvalue. For the corresponding minimal “width” of the ellipsoid, we have the following bound via the average radius:

maxx,yΩksk,x-yρk, 54

where ρk:=2avradΩk. Recall that avradΩke-k/(2n2)R in view of (39).

At the second step, we apply Algorithm 2 two times with the sets Qi:=Ωi: first, to the vector sk to obtain dual multipliers μ:=(μ0,,μk-1) and then to the vector -sk to obtain dual multipliers μ:=(μ0,,μk-1). By Lemma 7 and (54), we have

maxxBx0,R[sk,x-xk+i=0k-1μigi,xi-x]maxxΩksk,x-xkρk,maxxBx0,R[sk,xk-x+i=0k-1μigi,xi-x]maxxΩksk,xk-xρk

(note that Q0=Ω0=Bx0,R). Consequently, for λ:=μ+μ, we obtain

maxxBx0,Ri=0k-1λigi,xi-x2ρk.

Finally, one can show that

Γk(λ)i=0k-1λigir-ρkD,

where D is the diameter of Q and r is the maximal of the radii of Euclidean balls contained in Q. Thus, whenever ρk<r, λ is a semicertificate with the following gap on Bx0,R:

δk(λ)maxxBx0,R1Γk(λ)i=0k-1λigi,xi-x2ρkDr-ρk.

Compared to the standard Ellipsoid Method, we see that, in the Subgradient Ellipsoid methods, the presence of the preliminary semicertificate removes the necessity in finding the minimal-“width” direction and requires only one run of the Augmentation Algorithm.

Implementation details

Explicit representations

In the implementation of Algorithm 1, instead of the operators Gk, it is better to work with their inverses Hk:=Gk-1. Applying the Sherman-Morrison formula to (18), we obtain the following update rule for Hk:

Hk+1=Hk-bkHkgkgkHk1+bkgk,Hkgk,k0. 55

Let us now obtain an explicit formula for the next test point xk+1. This has already been partly done in the proof of Lemma 3. Indeed, recall that xk+1 is the minimizer of the function ψk+1(x). From (21), we see that xk+1=xk-(ak+12bkUk)Hk+1gk. Combining it with (55), we obtain

xk+1=xk-ak+12bkUk1+bkgk,HkgkHkgk,k0. 56

Finally, one can obtain the following explicit representations for Lk- and Ωk:

Lk-={xE:ck,xσk},Ωk={xE:x-zkHk-12Dk}, 57

where, for any k0,

c0:=0,σ0:=0,ck+1:=ck+akgk,σk+1:=σk+akgk,xk,zk:=xk-Hkck,Dk:=Rk2+2(σk-ck,xk)+ck,Hkck. 58

Indeed, recalling the definition of functions k, we see that k(x)=ck,x-σk for all xE. Therefore, Lk-{x:k(x)0}={x:ck,xσk}. Further, by Lemma 3, Ωk={x:ck,x+12x-xkGk212Rk2+σk}. Note that ck,x+12x-xkGk2=12x-zkGk2+ck,xk-12ckGk2 for any xE. Hence, Ωk={x:12x-zkGk212Dk}.

Remark 6

Now we can justify the claim made in Sect. 4.2 that Algorithm 1 with parameters (26), (31) and (38) is the standard Ellipsoid Method. Indeed, from (26) and (32), we see that bk=γgk,Hkgk and Uk=Rkgk,Hkgk1/2. Also, in view of (38), γ1+γ=2n+1. Hence, by (56) and (55),

xk+1=xk-Rkn+1Hkgkgk,Hkgk1/2,Hk+1=Hk-2n+1HkgkgkHkgk,Hkgk,k0. 59

Further, according to (35) and (38), for any k0, we have Rk2=qkR2, where q=1+1(n-1)(n+1)=n2n2-1. Thus, method (59) indeed coincides6 with the standard Ellipsoid Method (2) under the change of variables Wk:=Rk2Hk.

Computing support function

To calculate Uk in Algorithm 1, we need to compute the following quantity (see (57)):

Uk=maxx{gk,xk-x:x-zkHk-12Dk,ck,xσk}.

Let us discuss how to do this.

First, let us introduce the following support function to simplify our notation:

ξ(H,s,a,β):=maxx{s,x:xH-121,a,xβ},

where H:EE is a self-adjoint positive definite linear operator, s,aE and βR. In this notation, assuming that Dk>0, we have

Uk=gk,xk-zk+ξ(DkHk,-gk,ck,σk-ck,zk).

Let us show how to compute ξ(H,s,a,β). Dualizing the linear constraint, we obtain

ξ(H,s,a,β)=minτ0[s-τaH-1+τβ], 60

provided that there exists some xE such that xH-1<1, a,xβ (Slater condition). One can show that (60) has the following solution (see Lemma 10):

τ(H,s,a,β):=0,ifa,HsβsH-1,u(H,s,a,β),otherwise, 61

where u(H,s,a,β) is the unconstrained minimizer of the objective function in (60).

Let us present an explicit formula for u(H,s,a,β). For future use, it will be convenient to write down this formula in a slightly more general form for the following multidimensional7 variant of problem (60):

minuRm[s-AuH-1+u,b], 62

where sE, H:EE is a self-adjoint positive definite linear operator, A:RmE is a linear operator with trivial kernel and bRm, b,(AHA)-1b<1. It is not difficult to show that problem (62) has the following unique solution (see Lemma 9):

u(H,s,A,b):=(AHA)-1(As-rb),r:=s,Hs-s,A(AHA)-1As1-b,(AHA)-1b. 63

Note that, in order for the above approach to work, we need to guarantee that the sets Ωk and Lk- satisfy a certain regularity condition, namely, intΩkLk-. This condition can be easily fulfilled by adding into Algorithm 1 the termination criterion (53).

Lemma 8

Consider Algorithm 1 with termination criterion (53). Then, at each iteration k0, at the beginning of Step , we have intΩkLk-. Moreover, if k is a nonterminal iteration, we also have gk,x-xk0 for some xintΩkLk-.

Proof

Note that intΩ0L0-=intBx0,R. Now suppose intΩkLk- for some nonterminal iteration k0. Denote Pk-:={xE:gk,x-xk0}. Since iteration k is nonterminal, Uk>0 and hence ΩkLk-intPk-. Combining it with the fact that intΩkLk-, we obtain intΩkLk-intPk- and, in particular, intΩkLk-Pk-. At the same time, slightly modifying the proof of Lemma 2 (using that intΩi={xE:ωi(x)<12R2} for any i0 since ωi is a strictly convex quadratic function), it is not difficult to show that intΩkLk-Pk-intΩk+1Lk+1-. Thus, intΩk+1Lk+1-, and we can continue by induction.

Computing dual multipliers

Recall from Sect. 5 that the procedure for generating an accuracy semicertificate for Algorithm 1 requires one to repeatedly carry out the following operation: given sE and some iteration number i0, compute a dual multiplier μ0 such that

maxxΩiLi-{s,x:gi,x-xi0}=maxxΩiLi-[s,x+μgi,xi-x].

This can be done as follows.

First, using (57), let us rewrite the above primal problem more explicitly:

maxx{s,x:x-ziHi-12Di,ci,xσi,gi,x-xi0}.

Our goal is to dualize the second linear constraint and find the corresponding multiplier. However, for the sake of symmetry, it is better to dualize both linear constraints, find the corresponding multipliers and then keep only the second one.

Let us simplify our notation by introducing the following problem:

maxx{s,x:xH-11,a1,xb1,a2,xb2}, 64

where H:EE is a self-adjoint positive definite linear operator, s,a1,a2E and b1,b2R. Clearly, our original problem can be transformed into this form by setting H:=DiHi, a1:=ci, a2:=gi, b1:=σi-ci,zi, b2:=gi,xi-zi. Note that this transformation does not change the dual multipliers.

Dualizing the linear constraints in (64), we obtain the following dual problem:

minμR+2[s-μ1a1-μ2a2H-1+μ1b1+μ2b2], 65

which is solvable provided the following Slater condition holds:

xE:xH-1<1,a1,xb1,a2,xb2. 66

Note that (66) can be ensured by adding termination criterion (53) into Algorithm 1 (see Lemma 8).

A solution of (65) can be found using Algorithm 3. In this routine, τ(·), ξ(·) and u(·) are the auxiliary operations, defined in Sect. 6.2, and A:=(a1,a2) is the linear operator Au:=u1a1+u2a2 acting from R2 to E. The correctness of Algorithm 3 is proved in Theorem 3.graphic file with name 10107_2022_1833_Figc_HTML.jpg

Time and memory requirements

Let us discuss the time and memory requirements of Algorithm 1, taking into account the previously mentioned implementation details.

The main objects in Algorithm 1, which need to be stored and updated between iterations, are the test points xk, matrices Hk, scalars Rk, vectors ck and scalars σk, see (19), (55), 56 and (58) for the corresponding updating formulas. To store all these objects, we need O(n2) memory.

Consider now what happens at each iteration k. First, we compute Uk. For this, we calculate zk and Dk according to (58) and then perform the calculations described in Sect. 6.2. The most difficult operation there is computing the matrix-vector product, which takes O(n2) time. After that, we calculate the coefficients ak and bk according to (26), where αk, θ and γ are certain scalars, easily computable for all main instances of Algorithm 1 (see Sects. 4.14.4). The most expensive step there is computing the norm gkGk, which can be done in O(n2) operations by evaluating the product Hkgk. Finally, we update our main objects, which takes O(n2) time.

Thus, each iteration of Algorithm 1 has O(n2) time and memory complexities, exactly as in the standard Ellipsoid Method.

Now let us analyze the complexity of the auxiliary procedure from Sect. 5 for converting a preliminary semicertificate into a semicertificate. The main operation in this procedure is running Algorithm 2, which iterates “backwards”, computing some dual multiplier μi at each iteration i=k-1,,0. Using the approach from Sect. 6.3, we can compute μi in O(n2) time, provided that the objects xi, gi, Hi, zi, Di, ci, σi are stored in memory. Note, however, that, in contrast to the “forward” pass, when iterating “backwards”, there is no way to efficiently recompute all these objects without storing in memory a certain “history” of the main process from iteration 0 up to k. The simplest choice is to keep in this “history” all the objects mentioned above, which requires O(kn2) memory. A slightly more efficient idea is to keep the matrix-vector products Higi instead of Hi and then use (55) to recompute Hi from Hi+1 in O(n2) operations. This allows us to reduce the size of the “history” down to O(kn) while still keeping the O(kn2) total time complexity of the auxiliary procedure. Note that these estimates are exactly the same as those for the best currently known technique for generating accuracy certificates in the standard Ellipsoid Method [16]. In particular, if we generate a semicertificate only once at the very end, then the time complexity of our procedure is comparable to that of running the standard Ellipsoid Method without computing any certificates. Alternatively, as suggested in [16], one can generate semicertificates, say, every 2,4,8,16, iterations. Then, the total “overhead” of the auxiliary procedure for generating semicertificates will be comparable to the time complexity of the method itself.

Conclusion

In this paper, we have addressed one of the issues of the standard Ellipsoid Method, namely, its poor convergence for problems of large dimension n. For this, we have proposed a new algorithm which can be seen as the combination of the Subgradient and Ellipsoid methods.

Our developments can be considered as a first step towards constructing universal methods for nonsmooth problems with convex structure. Such methods could significantly improve the practical efficiency of solving various applied problems.

Note that there are still some open questions. First, the convergence estimate of our method with time-varying coefficients contains an extra factor proportional to the logarithm of the iteration counter. We have seen that this logarithmic factor has its roots yet in the Subgradient Method. However, as discussed in Remark 4, for the Subgradient Method, this issue can be easily resolved by allowing projections onto the feasible set and working with “truncated” gaps. An even better alternative, which does not require any of this machinery, is to use Dual Averaging [18] instead of the Subgradient Method. It is an interesting question whether one can combine the Dual Averaging with the Ellipsoid Method similarly to how we have combined the Subgradient and Ellipsoid methods.

Second, the convergence rate estimate, which we have obtained for our method, is not continuous in the dimension n. Indeed, for small values of the iteration counter k, this estimate behaves as that of the Subgradient Method and then, at some moment (around n2), it switches to the estimate of the Ellipsoid Method. As discussed at the end of Sect. 4.4, there exists some “small” gap between these two estimates around the switching moment. Nevertheless, the method itself is continuous in n and does not contain any explicit switching rules. Therefore, there should be some continuous convergence rate estimate for our method, and it is an open question to find it.

Another interesting question is to understand what happens with the proposed method on other (less general) classes of convex problems than those, considered in this paper. For example, it is well-known that, on smooth and/or strongly convex problems, (sub)gradient methods have much better convergence rates than on the general nonsmooth problems. We expect that similar conclusions should also be valid for the proposed Subgradient Ellipsoid Method. However, to achieve the acceleration, it may be necessary to introduce some modifications in the algorithm such as using different step sizes. We leave this direction for future research.

Finally, apart from the Ellipsoid Method, there exist other “dimension-dependent” methods (e.g., the Center-of-Gravity Method8 [13, 20], the Inscribed Ellipsoid Method [22], the Circumscribed Simplex Method [6], etc.). Similarly, the Subgradient Method is not the only “dimension-independent” method and there exist numerous alternatives which are better suited for certain problem classes (e.g., the Fast Gradient Method [17] for Smooth Convex Optimization or methods for Stochastic Programming [7, 8, 11, 15]). Of course, it is interesting to consider different combinations of the aforementioned “dimension-dependent” and “dimension-independent” methods. In this regard, it is also worth mentioning the works [4, 5], where the authors propose new variants of gradient-type methods for smooth strongly convex minimization problems inspired by the geometric construction of the Ellipsoid Method.

Acknowledgements

We would like to thank the anonymous reviewers for their valuable time and efforts spent on reviewing this manuscript. Their feedback was very useful.

Proof of Lemma 6

Proof

Everywhere in the proof, we assume that the parameter c is fixed and drop all the indices related to it.

Let us show that ζp is a convex function. Indeed, the function ω:R×R++R, defined by ω(x,t):=x2t, is convex. Hence, the function q, defined in (34), is also convex. Further, since ω is increasing in its first argument on R+, the function ωp:R+×R++R, defined by ωp(x,t):=xpt, is also convex as the composition of ω with the mapping (x,t)(xp/2,t), whose first component is convex (since p2) and the second one is affine. Note that ωp is increasing in its first argument. Hence, ζp is indeed a convex function as the composition of ωp with the mapping γ(q(γ),1+γ), whose first part is convex and the second one is affine.

Differentiating, for any γ>0, we obtain

ζp(γ)=p[q(γ)]p-1q(γ)(1+γ)-[q(γ)]p(1+γ)2=[q(γ)]p-1(pq(γ)(1+γ)-q(γ))(1+γ)2.

Therefore, the minimizers of ζp are exactly solutions to the following equation:

pq(γ)(1+γ)=q(γ). 67

Note that q(γ)=c[2γ(1+γ)-γ2]2(1+γ)2=cγ(2+γ)2(1+γ)2 (see (34)). Hence, (67) can be written as cpγ(2+γ)=2(1+γ)+cγ2 or, equivalently, c(p-1)γ2+2(cp-1)γ=2. Clearly, γ=0 is not a solution of this equation. Making the change of variables γ=2u, u0, we come the quadratic equation u2-2(cp-1)u=2c(p-1) or, equivalently, to [u-(cp-1)]2=2c(p-1)+(cp-1)2=c2p2-(2c-1). This equation has two solutions: u1:=cp-1+c2p2-(2c-1) and u2:=cp-1-c2p2-(2c-1). Note that u2cp-1-c2p2+1cp-1-(cp+1)=-2. Hence, γ2:=2u2-1 cannot be a minimizer of ζp. Consequently, only u1 is an acceptable solution (note that u1>0 in view of our assumptions on c and p). Thus, (37) is proved.

Let us show that γ(p) belongs to the interval specified in (37). For this, we need to prove that 1cpγ(p)2. Note that the function ha(t):=tt2-a+t-1, where a0, is decreasing in t. Indeed, 1ha(t)=1-at2-1t+1 is an increasing function in t. Hence, cpγ(p)=2h2c-1(cp)2limth2c-1(t)=1. On the other hand, using that p2 and denoting α:=2c1, we get cpγ(p)=2hα-1(cp)2g(α), where g(α):=hα-1(α)=αα2-α+1+α-1. Note that g is decreasing in α. Indeed, denoting τ:=1α0,1, we obtain 1g(α)=1-τ+τ2-τ+1, which is a decreasing function in τ. Thus, cpγ(p)2g(1)=2.

It remains to prove that $\zeta_p(\gamma(p)) \le e^{-1/(2cp)}$. Let $\phi \colon [2,+\infty) \to \mathbb{R}$ be the function

\[ \phi(p) := -\ln \zeta_p(\gamma(p)) = \ln(1+\gamma(p)) - p \ln q(\gamma(p)). \tag{68} \]

We need to show that $\phi(p) \ge \frac{1}{2cp}$ for all $p \ge 2$ or, equivalently, that the function $\chi \colon (0,\tfrac12] \to \mathbb{R}$, defined by $\chi(\tau) := \phi(\tfrac{1}{\tau})$, satisfies $\chi(\tau) \ge \frac{\tau}{2c}$ for all $\tau \in (0,\tfrac12]$. For this, it suffices to show that $\chi$ is convex, $\lim_{\tau\to 0}\chi(\tau) = 0$ and $\lim_{\tau\to 0}\chi'(\tau) = \frac{1}{2c}$. Differentiating, we see that $\chi'(\tau) = -\frac{1}{\tau^2}\phi'(\tfrac{1}{\tau})$ and $\chi''(\tau) = \frac{2}{\tau^3}\phi'(\tfrac{1}{\tau}) + \frac{1}{\tau^4}\phi''(\tfrac{1}{\tau})$ for all $\tau \in (0,\tfrac12]$. Thus, we need to justify that

\[ 2\phi'(p) + p\,\phi''(p) \ge 0 \tag{69} \]

for all $p \ge 2$ and that

\[ \lim_{p\to\infty}\phi(p) = 0, \qquad \lim_{p\to\infty}\bigl[-p^2 \phi'(p)\bigr] = \frac{1}{2c}. \tag{70} \]

Let $p \ge 2$ be arbitrary. Differentiating and using (67), we obtain

\[ \phi'(p) = \frac{\gamma'(p)}{1+\gamma(p)} - \ln q(\gamma(p)) - \frac{p\,q'(\gamma(p))\,\gamma'(p)}{q(\gamma(p))} = -\ln q(\gamma(p)), \qquad \phi''(p) = -\frac{q'(\gamma(p))\,\gamma'(p)}{q(\gamma(p))} = -\frac{\gamma'(p)}{p(1+\gamma(p))}. \tag{71} \]

Therefore,

\[ 2\phi'(p) + p\,\phi''(p) = -2\ln q(\gamma(p)) - \frac{\gamma'(p)}{1+\gamma(p)} \ge -\frac{c\gamma^2(p) + \gamma'(p)}{1+\gamma(p)}, \]

where the inequality follows from (34) and the fact that $\ln(1+\tau) \le \tau$ for any $\tau > -1$. Thus, to show (69), we need to prove that $-\gamma'(p) \ge c\gamma^2(p)$ or, equivalently, $\frac{d}{dp}\frac{1}{\gamma(p)} \ge c$. But this is immediate. Indeed, using (37), we obtain $\frac{d}{dp}\frac{1}{\gamma(p)} = \frac{c}{2}\Bigl(\frac{cp}{\sqrt{c^2p^2-(2c-1)}} + 1\Bigr) \ge c$ since the function $\tau \mapsto \frac{\tau}{\sqrt{\tau^2-1}}$ is decreasing. Thus, (69) is proved.

It remains to show (70). From (37), we see that $\gamma(p) \to 0$ and $p\gamma(p) \to \frac{1}{c}$ as $p \to \infty$. Hence, using (34), we obtain

\[ \lim_{p\to\infty} p^2 \ln q(\gamma(p)) = \lim_{p\to\infty} \frac{c p^2 \gamma^2(p)}{2(1+\gamma(p))} = \frac{c}{2}\lim_{p\to\infty} p^2\gamma^2(p) = \frac{1}{2c}. \]

Consequently, in view of (68) and (71), we have

\[ \lim_{p\to\infty}\phi(p) = \lim_{p\to\infty}\bigl[\ln(1+\gamma(p)) - p\ln q(\gamma(p))\bigr] = 0, \qquad \lim_{p\to\infty}\bigl[-p^2\phi'(p)\bigr] = \lim_{p\to\infty} p^2 \ln q(\gamma(p)) = \frac{1}{2c}, \]

which is exactly (70).
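The bound just proved, $\zeta_p(\gamma(p)) \le e^{-1/(2cp)}$, and its asymptotic tightness can also be observed numerically. As before, the script assumes the explicit form $q(\gamma) = 1 + \frac{c\gamma^2}{2(1+\gamma)}$ inferred from the expression for $q'$ used in this proof.

    # Numerical illustration of zeta_p(gamma(p)) <= exp(-1/(2*c*p)) and of its tightness
    # for large p (phi(p) := -ln zeta_p(gamma(p)) ~ 1/(2*c*p), in line with (70)).
    import numpy as np

    def q(gamma, c):
        return 1.0 + c * gamma**2 / (2.0 * (1.0 + gamma))

    def gamma_star(c, p):
        return 2.0 / (c * p - 1.0 + np.sqrt(c**2 * p**2 - (2.0 * c - 1.0)))

    def zeta_min(c, p):
        g = gamma_star(c, p)
        return q(g, c)**p / (1.0 + g)

    for c in [0.5, 0.7, 1.0, 2.0]:
        for p in [2.0, 5.0, 20.0, 100.0]:
            assert zeta_min(c, p) <= np.exp(-1.0 / (2.0 * c * p)) + 1e-12
        p_large = 1e4
        phi = -np.log(zeta_min(c, p_large))
        assert abs(2.0 * c * p_large * phi - 1.0) < 1e-3
    print("zeta_p(gamma(p)) <= exp(-1/(2cp)); the bound is tight for large p")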

Support function and dual multipliers: proofs

For brevity, everywhere in this section, we write $\|\cdot\|$ and $\|\cdot\|_*$ instead of $\|\cdot\|_{H^{-1}}$ and $\|\cdot\|_{H^{-1}}^*$, respectively. We also denote $B_0 := \{x \in E : \|x\| \le 1\}$.

Auxiliary operations

Lemma 9

Let $s \in E^*$, let $A \colon \mathbb{R}^m \to E^*$ be a linear operator with trivial kernel and let $b \in \mathbb{R}^m$, $\langle b, (A^* H A)^{-1} b\rangle < 1$. Then, problem (62) has a unique solution given by (63).

Proof

Note that the sublevel sets of the objective function in (62) are bounded:

\[ \|s - Au\|_* + \langle u, b\rangle \ge \|Au\|_* - \|s\|_* + \langle u, b\rangle \ge \bigl(1 - \langle b, (A^*HA)^{-1}b\rangle^{1/2}\bigr)\|Au\|_* - \|s\|_* \]

for all $u \in \mathbb{R}^m$. Hence, problem (62) has a solution.

Let $u \in \mathbb{R}^m$ be a solution of problem (62). If $s = Au$, then $u = (A^*HA)^{-1}A^*Hs$, which coincides with the solution given by (63) (note that, in this case, $r = 0$).

Now suppose $s \ne Au$. Then, from the first-order optimality condition, we obtain that $b = A^*H(s - Au)/\rho$, where $\rho := \|s - Au\|_* > 0$. Hence, $u = (A^*HA)^{-1}(A^*Hs - \rho b)$ and

\[ \rho^2 = \|s - Au\|_*^2 = \|s\|_*^2 - 2\langle A^*Hs, u\rangle + \langle A^*HAu, u\rangle = \|s\|_*^2 - 2\langle A^*Hs, (A^*HA)^{-1}(A^*Hs - \rho b)\rangle + \langle A^*Hs - \rho b, (A^*HA)^{-1}(A^*Hs - \rho b)\rangle = \|s\|_*^2 - \langle A^*Hs, (A^*HA)^{-1}A^*Hs\rangle + \rho^2 \langle b, (A^*HA)^{-1}b\rangle. \]

Thus, $\rho = r$ and $u = u(H, s, A, b)$ given by (63).
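In coordinates (identifying $E$ and $E^*$ with $\mathbb{R}^n$, so that $A$ is an $n \times m$ matrix, $A^* = A^T$ and $\|v\|_* = \langle v, Hv\rangle^{1/2}$), the computation above can be checked numerically. The script below is only a sanity check: it takes problem (62) in the form $\min_{u\in\mathbb{R}^m} \|s - Au\|_* + \langle u, b\rangle$, as it is used in this proof, and builds the minimizer from the identities derived above; the explicit statement of (63) is given earlier in the paper and is not repeated here.

    # Sanity check of the closed-form minimizer derived in the proof of Lemma 9.
    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    n, m = 6, 3
    G = rng.standard_normal((n, n)); H = G @ G.T + n * np.eye(n)   # H > 0
    A = rng.standard_normal((n, m))                                # trivial kernel (generic)
    s = rng.standard_normal(n)
    M = A.T @ H @ A                                                # plays the role of A^* H A
    b = rng.standard_normal(m)
    b *= 0.9 / np.sqrt(b @ np.linalg.solve(M, b))                  # <b, M^{-1} b> = 0.81 < 1

    def objective(u):                                              # problem (62) in coordinates
        v = s - A @ u
        return np.sqrt(v @ H @ v) + u @ b

    # rho^2 = (||s||_*^2 - <A^T H s, M^{-1} A^T H s>) / (1 - <b, M^{-1} b>),
    # u = M^{-1} (A^T H s - rho * b), exactly as obtained from the first-order condition.
    c_vec = A.T @ (H @ s)
    rho2 = (s @ H @ s - c_vec @ np.linalg.solve(M, c_vec)) / (1.0 - b @ np.linalg.solve(M, b))
    rho = np.sqrt(max(rho2, 0.0))
    u_closed = np.linalg.solve(M, c_vec - rho * b)

    # First-order condition b = A^T H (s - A u)/rho, and comparison with a generic solver.
    assert np.allclose(A.T @ (H @ (s - A @ u_closed)), rho * b)
    res = minimize(objective, np.zeros(m), method="Nelder-Mead",
                   options={"xatol": 1e-10, "fatol": 1e-12, "maxiter": 20000})
    assert objective(u_closed) <= res.fun + 1e-9
    print("closed-form u matches the numerical minimizer:", u_closed)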

Lemma 10

Let $s, a \in E^*$ and $\beta \in \mathbb{R}$ be such that $\langle a, x\rangle \le \beta$ for some $x \in \operatorname{int} B_0$. Then, problem (60) has a solution given by (61). Moreover, this solution is unique if $\beta < \|a\|_*$.

Proof

Let $\phi \colon \mathbb{R} \to \mathbb{R}$ be the function $\phi(\tau) := \|s - \tau a\|_* + \tau\beta$. By our assumptions, $\beta > -\|a\|_*$ if $a \ne 0$ and $\beta \ge 0$ if $a = 0$. If additionally $\beta < \|a\|_*$, then $|\beta| < \|a\|_*$.

If $s = 0$, then $\phi(\tau) = \tau(\|a\|_* + \beta) \ge \phi(0)$ for all $\tau \ge 0$, so $0$ is a solution of (60). Clearly, this solution is unique when $\beta < \|a\|_*$ because then $|\beta| < \|a\|_*$.

From now on, suppose $s \ne 0$. Then, $\phi$ is differentiable at $0$ with $\phi'(0) = \beta - \langle a, Hs\rangle/\|s\|_*$. If $\langle a, Hs\rangle \le \beta\|s\|_*$, then $\phi'(0) \ge 0$, so $0$ is a solution of (60). Note that this solution is unique if $\langle a, Hs\rangle < \beta\|s\|_*$ because then $\phi'(0) > 0$, i.e., $\phi$ is strictly increasing on $\mathbb{R}_+$.

Suppose $\langle a, Hs\rangle > \beta\|s\|_*$. Then, $\beta < \|a\|_*$ and thus $|\beta| < \|a\|_*$. Note that, for any $\tau \ge 0$, we have $\phi(\tau) \ge \tau(\|a\|_* + \beta) - \|s\|_*$. Hence, the sublevel sets of $\phi$, intersected with $\mathbb{R}_+$, are bounded, so problem (60) has a solution. Since $\phi'(0) < 0$, any solution of (60) is strictly positive and so must be a solution of problem (62) for $A := a$ and $b := \beta$. But, by Lemma 9, the latter solution is unique and equals $u(H, s, a, \beta)$.

We have proved that (61) is indeed a solution of (60). Moreover, when $\langle a, Hs\rangle \ne \beta\|s\|_*$, we have shown that this solution is unique. It remains to prove the uniqueness of the solution when $\langle a, Hs\rangle = \beta\|s\|_*$, assuming additionally that $\beta < \|a\|_*$. But this is simple. Indeed, by our assumptions, $|\beta| < \|a\|_*$, so $|\langle a, Hs\rangle| = |\beta|\,\|s\|_* < \|a\|_*\|s\|_*$. Hence, $a$ and $s$ are linearly independent. But then $\phi$ is strictly convex, and thus its minimizer is unique.
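The one-dimensional case can be checked in the same way. The script below follows the case analysis of this proof in coordinates: problem (60) is taken as $\min_{\tau\ge 0}\|s - \tau a\|_* + \tau\beta$, the candidate solution is $\tau = 0$ when $\langle a, Hs\rangle \le \beta\|s\|_*$ and otherwise the one-dimensional instance of the formula from Lemma 9; the explicit formula (61) is stated earlier in the paper and is not repeated here.

    # Sanity check of the case analysis in the proof of Lemma 10 (in coordinates).
    import numpy as np

    def norm_H(v, H):
        return np.sqrt(v @ H @ v)          # the dual norm ||v||_* in coordinates

    def tau_candidate(H, s, a, beta):
        if a @ (H @ s) <= beta * norm_H(s, H):
            return 0.0
        M = a @ H @ a                      # A^* H A for A := a
        c = a @ (H @ s)                    # A^* H s
        rho = np.sqrt(max((s @ H @ s - c**2 / M) / (1.0 - beta**2 / M), 0.0))
        return (c - rho * beta) / M

    rng = np.random.default_rng(2)
    n = 5
    grid = np.linspace(0.0, 50.0, 20001)
    for _ in range(50):
        G = rng.standard_normal((n, n)); H = G @ G.T + n * np.eye(n)
        s, a = rng.standard_normal(n), rng.standard_normal(n)
        # Any beta > -||a||_* is admissible (then <a, x> <= beta for some x in int B_0);
        # we also keep beta < ||a||_* so that the solution is unique.
        beta = 0.8 * (2.0 * rng.random() - 1.0) * norm_H(a, H)
        tau = tau_candidate(H, s, a, beta)
        # Compare with a grid search over tau >= 0 for phi(t) = ||s - t*a||_* + t*beta.
        V = s[None, :] - grid[:, None] * a[None, :]
        phi_grid = np.sqrt(np.einsum('ij,jk,ik->i', V, H, V)) + grid * beta
        phi_tau = norm_H(s - tau * a, H) + tau * beta
        assert tau >= 0.0 and phi_tau <= phi_grid.min() + 1e-8
    print("the candidate tau is never beaten by the grid search")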

Computation of dual multipliers

In this section, we prove the correctness of Algorithm 3.

For $s \in E^*$, let $X(s)$ be the subdifferential of $\|\cdot\|_*$ at the point $s$:

\[ X(s) := \begin{cases} \{Hs/\|s\|_*\}, & \text{if } s \ne 0, \\ B_0, & \text{if } s = 0. \end{cases} \tag{72} \]

Clearly, $X(s) \subseteq B_0$ for any $s \in E^*$. When $s \ne 0$, we denote the unique element of $X(s)$ by $x(s)$.

Let us formulate a convenient optimality condition.

Lemma 11

Let $A$ be the linear operator from $\mathbb{R}^m$ to $E^*$, defined by $Au := \sum_{i=1}^m u_i a_i$, where $a_1, \dots, a_m \in E^*$, and let $b \in \mathbb{R}^m$, $s \in E^*$. Then, $\mu \in \mathbb{R}^m_+$ is a minimizer of $\psi(\mu) := \|s - A\mu\|_* + \langle \mu, b\rangle$ over $\mathbb{R}^m_+$ if and only if $X(s - A\mu) \cap L_1(\mu_1) \cap \dots \cap L_m(\mu_m) \ne \emptyset$, where, for each $1 \le i \le m$ and $\tau \ge 0$, we denote $L_i(\tau) := \{x \in E : \langle a_i, x\rangle \le b_i\}$, if $\tau = 0$, and $L_i(\tau) := \{x \in E : \langle a_i, x\rangle = b_i\}$, if $\tau > 0$.

Proof

Indeed, the standard optimality condition for a convex function over the nonnegative orthant is as follows: $\mu \in \mathbb{R}^m_+$ is a minimizer of $\psi$ on $\mathbb{R}^m_+$ if and only if there exists $g \in \partial\psi(\mu)$ such that $g_i \ge 0$ and $g_i \mu_i = 0$ for all $1 \le i \le m$. It remains to note that $\partial\psi(\mu) = b - A^* X(s - A\mu)$.
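The expression for $\partial\psi(\mu)$ used in the last step can be confirmed by finite differences in coordinates, at a point where $s \ne A\mu$, so that $\psi$ is differentiable and $X(s - A\mu) = \{H(s - A\mu)/\|s - A\mu\|_*\}$ (with $A^* = A^T$ and $\|v\|_* = \langle v, Hv\rangle^{1/2}$ in this coordinate realization):

    # Finite-difference check of grad psi(mu) = b - A^T x(s - A mu) at a smooth point.
    import numpy as np

    rng = np.random.default_rng(3)
    n, m = 6, 3
    G = rng.standard_normal((n, n)); H = G @ G.T + n * np.eye(n)
    A = rng.standard_normal((n, m))
    b = rng.standard_normal(m)
    s = rng.standard_normal(n)
    mu = rng.random(m)                          # generic point, s != A mu

    def psi(u):
        v = s - A @ u
        return np.sqrt(v @ H @ v) + u @ b

    v = s - A @ mu
    x_v = H @ v / np.sqrt(v @ H @ v)            # the unique element of X(s - A mu), cf. (72)
    grad_formula = b - A.T @ x_v

    eps = 1e-6
    grad_fd = np.array([(psi(mu + eps * e) - psi(mu - eps * e)) / (2.0 * eps)
                        for e in np.eye(m)])
    assert np.allclose(grad_formula, grad_fd, atol=1e-5)
    print("grad psi(mu) = b - A^T x(s - A mu) confirmed at a smooth point")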

Theorem 3

Algorithm 3 is well-defined and returns a solution of (65).

Proof

i. For each $i = 1, 2$ and $\tau \ge 0$, denote $L_i^- := \{x \in E : \langle a_i, x\rangle \le b_i\}$, $L_i := \{x \in E : \langle a_i, x\rangle = b_i\}$, $L_i(\tau) := L_i^-$, if $\tau = 0$, and $L_i(\tau) := L_i$, if $\tau > 0$.

ii. From (66) and Lemma 10, it follows that the step of Algorithm 3 computing $\tau_1$ and $\tau_2$ is well-defined and, for each $i = 1, 2$, $\tau_i$ is a solution of (60) with parameters $(s, a_i, b_i)$. Hence, by Lemma 11,

\[ X(s - \tau_i a_i) \cap L_i(\tau_i) \ne \emptyset, \qquad i = 1, 2. \tag{73} \]

iii. Consider the step of Algorithm 3 at which $\xi_1$ is compared with $b_2$. Note that the condition $\xi_1 \le b_2$ is equivalent to $B_0 \cap L_1^- \subseteq L_2^-$ since $\xi_1 = \max_{x \in B_0 \cap L_1^-} \langle a_2, x\rangle$. If $B_0 \cap L_1^- \subseteq L_2^-$, then, by (73), $X(s - \tau_1 a_1) \cap L_1(\tau_1) \cap L_2^- = X(s - \tau_1 a_1) \cap L_1(\tau_1) \ne \emptyset$, so, by Lemma 11, $(\tau_1, 0)$ is indeed a solution of (65).

Similarly, if $\xi_2 \le b_1$, then $B_0 \cap L_2^- \subseteq L_1^-$ and $(0, \tau_2)$ is a solution of (65).

iv. From now on, we can assume that $B_0 \cap L_1^- \cap \operatorname{int} L_2^+ \ne \emptyset$ and $B_0 \cap L_2^- \cap \operatorname{int} L_1^+ \ne \emptyset$, where $\operatorname{int} L_i^+ := \{x \in E : \langle a_i, x\rangle > b_i\}$, $i = 1, 2$. Combining this with (66), we obtain9

\[ \operatorname{int} B_0 \cap L_1 \cap L_2^- \ne \emptyset, \qquad \operatorname{int} B_0 \cap L_2 \cap L_1^- \ne \emptyset. \tag{74} \]

Suppose that the test $\langle a_2, H(s - \tau_1 a_1)\rangle \le b_2 \|s - \tau_1 a_1\|_*$ in Algorithm 3 is satisfied. 1) If $s \ne \tau_1 a_1$, then $X(s - \tau_1 a_1)$ is a singleton, $x(s - \tau_1 a_1) = H(s - \tau_1 a_1)/\|s - \tau_1 a_1\|_*$, so we obtain $x(s - \tau_1 a_1) \in L_2^-$. Combining this with (73), we get $x(s - \tau_1 a_1) \in L_1(\tau_1) \cap L_2^-$. 2) If $s = \tau_1 a_1$, then $X(s - \tau_1 a_1) \cap L_1(\tau_1) \cap L_2^- = B_0 \cap L_1(\tau_1) \cap L_2^- \ne \emptyset$ in view of the first claim in (74) (recall that $L_1 \subseteq L_1(\tau_1)$). Thus, in any case, $X(s - \tau_1 a_1) \cap L_1(\tau_1) \cap L_2^- \ne \emptyset$, and so, by Lemma 11, $(\tau_1, 0)$ is a solution of (65).

Similarly, one can consider the case when the test $\langle a_1, H(s - \tau_2 a_2)\rangle \le b_1 \|s - \tau_2 a_2\|_*$ in Algorithm 3 is satisfied.

Suppose we have reached the final step of Algorithm 3. From now on, we can assume that

\[ X(s - \tau_1 a_1) \subseteq L_1(\tau_1) \cap \operatorname{int} L_2^+, \qquad X(s - \tau_2 a_2) \subseteq L_2(\tau_2) \cap \operatorname{int} L_1^+. \tag{75} \]

Indeed, since neither of the two preceding tests has been satisfied, $s \ne \tau_i a_i$, $i = 1, 2$, and $x(s - \tau_1 a_1) \notin L_2^-$, $x(s - \tau_2 a_2) \notin L_1^-$. Also, by (73), $x(s - \tau_i a_i) \in L_i(\tau_i)$, $i = 1, 2$.

Let $\mu \in \mathbb{R}^2_+$ be any solution of (65). By Lemma 11, $X(s - A\mu) \cap L_1(\mu_1) \cap L_2(\mu_2) \ne \emptyset$. Note that we cannot have $\mu_2 = 0$. Indeed, otherwise, we get $X(s - \mu_1 a_1) \cap L_1(\mu_1) \cap L_2^- \ne \emptyset$, so $\mu_1$ must be a solution of (60) with parameters $(s, a_1, b_1)$. But, by Lemma 10, such a solution is unique (in view of the second claim in (75), $\langle a_1, x\rangle > b_1$ for some $x \in B_0$, so $b_1 < \|a_1\|_*$). Hence, $\mu_1 = \tau_1$, and we obtain a contradiction with (75). Similarly, we can show that $\mu_1 \ne 0$. Consequently, $\mu_1, \mu_2 > 0$, which means that $\mu$ is a solution of (62).

Thus, at this point, any solution of (65) must be a solution of (62). In view of Lemma 9, to finish the proof, it remains to show that the vectors $a_1$, $a_2$ are linearly independent and $\langle b, (A^*HA)^{-1}b\rangle < 1$. But this is simple. Indeed, from (75), it follows that

\[ \text{either} \quad B_0 \cap L_1 \cap \operatorname{int} L_2^+ \ne \emptyset \quad \text{or} \quad B_0 \cap L_2 \cap \operatorname{int} L_1^+ \ne \emptyset \tag{76} \]

since $\tau_1$ and $\tau_2$ cannot both be equal to $0$. Combining (76) and (74), we see that $\operatorname{int} B_0 \cap L_1 \cap L_2 \ne \emptyset$ and, in particular, $L_1 \cap L_2 \ne \emptyset$. Hence, $a_1$, $a_2$ are linearly independent (otherwise, $L_1 = L_2$, which contradicts (76)). Taking any $x \in \operatorname{int} B_0 \cap L_1 \cap L_2$, we obtain $\|x\| < 1$ and $A^* x = b$, hence $\langle b, (A^*HA)^{-1}b\rangle = \langle A^*x, (A^*HA)^{-1}A^*x\rangle \le \|x\|^2 < 1$, where we have used $A(A^*HA)^{-1}A^* \preceq H^{-1}$.
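For illustration, the case analysis just proved can be assembled into a short coordinate-form routine and compared against a generic solver. The sketch below is not a verbatim transcription of Algorithm 3: problem (65) is taken as $\min_{\mu\in\mathbb{R}^2_+}\|s - \mu_1 a_1 - \mu_2 a_2\|_* + \mu_1 b_1 + \mu_2 b_2$, the tests on $\xi_1$, $\xi_2$ are omitted (whenever they would succeed, the subsequent tests on $H(s - \tau_i a_i)$ succeed as well, which is enough for a sanity check), and the data are generated so that some $x$ with $\|x\| < 1$ satisfies $\langle a_i, x\rangle \le b_i$ for $i = 1, 2$, which is how condition (66) is used in this proof.

    # Sketch of the two-constraint dual-multiplier computation, following this proof.
    import numpy as np
    from scipy.optimize import minimize

    def norm_H(v, H):
        return np.sqrt(v @ H @ v)

    def solve_60(H, s, a, beta):
        # One constraint: tau = 0, or the one-dimensional formula from Lemma 9.
        if a @ (H @ s) <= beta * norm_H(s, H):
            return 0.0
        M, c = a @ H @ a, a @ (H @ s)
        rho = np.sqrt(max((s @ H @ s - c**2 / M) / (1.0 - beta**2 / M), 0.0))
        return (c - rho * beta) / M

    def solve_62(H, s, A, b):
        # Unconstrained problem (62): closed form from the proof of Lemma 9.
        M, c = A.T @ H @ A, A.T @ (H @ s)
        rho2 = (s @ H @ s - c @ np.linalg.solve(M, c)) / (1.0 - b @ np.linalg.solve(M, b))
        return np.linalg.solve(M, c - np.sqrt(max(rho2, 0.0)) * b)

    def dual_multipliers(H, s, a1, b1, a2, b2):
        tau1, tau2 = solve_60(H, s, a1, b1), solve_60(H, s, a2, b2)
        v1, v2 = s - tau1 * a1, s - tau2 * a2
        if a2 @ (H @ v1) <= b2 * norm_H(v1, H):
            return np.array([tau1, 0.0])
        if a1 @ (H @ v2) <= b1 * norm_H(v2, H):
            return np.array([0.0, tau2])
        return solve_62(H, s, np.column_stack([a1, a2]), np.array([b1, b2]))

    rng = np.random.default_rng(4)
    n = 5
    for _ in range(100):
        G = rng.standard_normal((n, n)); H = G @ G.T + n * np.eye(n)
        s, a1, a2 = rng.standard_normal(n), rng.standard_normal(n), rng.standard_normal(n)
        # Choose b1, b2 so that some x0 with ||x0|| = 1/2 satisfies <a_i, x0> <= b_i.
        z = rng.standard_normal(n)
        x0 = 0.5 * z / np.sqrt(z @ np.linalg.solve(H, z))
        b1 = a1 @ x0 + 0.3 * rng.random() * norm_H(a1, H)
        b2 = a2 @ x0 + 0.3 * rng.random() * norm_H(a2, H)
        mu = dual_multipliers(H, s, a1, b1, a2, b2)
        psi = lambda u: norm_H(s - u[0] * a1 - u[1] * a2, H) + u[0] * b1 + u[1] * b2
        ref = minimize(psi, np.ones(2), bounds=[(0.0, None), (0.0, None)], method="L-BFGS-B")
        assert mu.min() >= -1e-9 and psi(mu) <= ref.fun + 1e-6
    print("the case analysis reproduces the numerically computed optimal value")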

Footnotes

1

More precisely, $g_k$ must be a non-zero vector such that $\langle g_k, x_k - x\rangle \ge 0$ for all $x \in Q$. In particular, for the Euclidean ball, one can take $g_k := x_k$.

2

Hereinafter, a solid is any convex compact set with nonempty interior.

3

In fact, we have proved the second estimate in (45) for all $k \ge 1$ (not only for $k \ge n^2$).

4

Recall that $g_i \ne 0$ for all $i \ge 0$ by (2).

5

Here eigenvectors and eigenvalues are defined with respect to the operator $B$ inducing the norm $\|\cdot\|$.

6

Note that, in (2), we identify the spaces $E$, $E^*$ with $\mathbb{R}^n$ in such a way that $\langle \cdot, \cdot \rangle$ coincides with the standard dot-product and $\|x\|$ coincides with the standard Euclidean norm. Therefore, $B$ becomes the identity matrix and $g_k^*$ becomes $g_k^T$.

7

Hereinafter, we identify $(\mathbb{R}^m)^*$ with $\mathbb{R}^m$ in such a way that $\langle \cdot, \cdot \rangle$ is the standard dot product.

8

Although this method is not practical, it is still interesting from an academic point of view.

9

Take an appropriate convex combination of two points from the specified nonempty convex sets.

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement No. 788368).

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Anton Rodomanov, Email: anton.rodomanov@uclouvain.be.

Yurii Nesterov, Email: yurii.nesterov@uclouvain.be.

References

1. Auslender, A.: Résolution numérique d'inégalités variationnelles. RAIRO 7(2), 67–72 (1973)
2. Ben-Tal, A., Nemirovski, A.: Lectures on modern convex optimization. Lecture notes (2021)
3. Bland, R., Goldfarb, D., Todd, M.: The ellipsoid method: a survey. Oper. Res. 29(6), 1039–1091 (1981). doi:10.1287/opre.29.6.1039
4. Bubeck, S., Lee, Y.T.: Black-box optimization with a politician. In: International Conference on Machine Learning, pp. 1624–1631. PMLR (2016)
5. Bubeck, S., Lee, Y.T., Singh, M.: A geometric alternative to Nesterov's accelerated gradient descent. arXiv preprint arXiv:1506.08187 (2015)
6. Bulatov, V., Shepot'ko, L.: Method of centers of orthogonal simplexes for solving convex programming problems. Methods Optim. Appl. (1982)
7. Duchi, J., Hazan, E., Singer, Y.: Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12(7) (2011)
8. Dvurechensky, P., Gasnikov, A.: Stochastic intermediate gradient method for convex problems with stochastic inexact oracle. J. Optim. Theory Appl. 171(1), 121–145 (2016). doi:10.1007/s10957-016-0999-6
9. Grötschel, M., Lovász, L., Schrijver, A.: The ellipsoid method and its consequences in combinatorial optimization. Combinatorica 1(2), 169–197 (1981). doi:10.1007/BF02579273
10. Khachiyan, L.: A polynomial algorithm in linear programming. Soviet Math. Dokl. 244(5), 1093–1096 (1979)
11. Lan, G.: An optimal method for stochastic composite optimization. Math. Program. 133(1), 365–397 (2012). doi:10.1007/s10107-010-0434-y
12. Lan, G.: First-Order and Stochastic Optimization Methods for Machine Learning. Springer, Switzerland (2020)
13. Levin, A.: An algorithm for minimizing convex functions. Soviet Math. Dokl. 160(6), 1244–1247 (1965)
14. Nemirovski, A.: Information-based complexity of convex programming. Lecture notes (1995)
15. Nemirovski, A., Juditsky, A., Lan, G., Shapiro, A.: Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19(4), 1574–1609 (2009). doi:10.1137/070704277
16. Nemirovski, A., Onn, S., Rothblum, U.G.: Accuracy certificates for computational problems with convex structure. Math. Oper. Res. 35(1), 52–78 (2010). doi:10.1287/moor.1090.0427
17. Nesterov, Y.: A method for solving the convex programming problem with convergence rate O(1/k²). Soviet Math. Dokl. 269, 543–547 (1983)
18. Nesterov, Y.: Primal-dual subgradient methods for convex problems. Math. Program. 120(1), 221–259 (2009). doi:10.1007/s10107-007-0149-x
19. Nesterov, Y.: Lectures on Convex Optimization. Springer, Berlin (2018)
20. Newman, D.: Location of the maximum on unimodal surfaces. J. ACM 12(3), 395–398 (1965). doi:10.1145/321281.321291
21. Shor, N.: Cut-off method with space extension in convex programming problems. Cybernetics 13(1), 94–96 (1977). doi:10.1007/BF01071394
22. Tarasov, S., Khachiyan, L., Erlikh, I.: The method of inscribed ellipsoids. Soviet Math. Dokl. 37(1), 226–230 (1988)
23. Yudin, D., Nemirovskii, A.: Informational complexity and efficient methods for the solution of convex extremal problems. Matekon 13(2), 22–45 (1976)
