Journal of Optimization Theory and Applications. 2021 Mar 10;189(1):317–339. doi: 10.1007/s10957-021-01838-7

Minimizing Uniformly Convex Functions by Cubic Regularization of Newton Method

Nikita Doikov 1, Yurii Nesterov 2

Abstract

In this paper, we study the iteration complexity of the cubic regularization of the Newton method for solving composite minimization problems with a uniformly convex objective. We introduce the notion of the second-order condition number of a certain degree and justify the linear rate of convergence in a nondegenerate case for the method with an adaptive estimate of the regularization parameter. The algorithm automatically achieves the best possible global complexity bound among different problem classes of uniformly convex objective functions with Hölder continuous Hessian of the smooth part of the objective. As a byproduct of our developments, we justify an intuitively plausible result that the global iteration complexity of the Newton method is always better than that of the gradient method on the class of strongly convex functions with uniformly bounded second derivative.

Keywords: Newton method, Cubic regularization, Global complexity bounds, Strong convexity, Uniform convexity

Introduction

A big step in second-order optimization theory is related to the global complexity guarantees that were justified in [17] for the cubic regularization of the Newton method. Subsequent results provide a good perspective for the development of this approach, discovering accelerated [14], adaptive [4, 5] and universal [10] schemes. The latter methods can automatically adjust to the smoothness properties of the particular objective function. In the same vein, second-order algorithms for solving systems of nonlinear equations were developed in [13], and randomized variants for solving large-scale optimization problems were proposed in [7–9, 12, 18].

Despite a number of nice properties, the global complexity bounds of the cubically regularized Newton method for the cases of strongly convex and uniformly convex objectives are still not fully investigated, as well as the notion of second-order nondegeneracy (see the discussion in Sect. 5 of [14]). We are going to address this issue in the current paper.

The rest of the paper is organized as follows. Section 2 contains all necessary definitions and main properties of the classes of uniformly convex functions and twice-differentiable functions with Hölder continuous Hessian. We introduce the notion of the condition number $\gamma_f(\nu)$ of a certain degree $\nu \in [0,1]$ and present some basic examples.

In Sect. 3, we describe a general regularized Newton scheme and show the linear rate of convergence for this method on the class of uniformly convex functions with a known degree $\nu \in [0,1]$ of nondegeneracy. Then, we introduce the adaptive cubically regularized Newton method and collect useful inequalities and properties related to this algorithm.

In Sect. 4, we study the global iteration complexity of the cubically regularized Newton method on the classes of uniformly convex functions with Hölder continuous Hessian. We show that for nondegeneracy of any degree $\nu \in [0,1]$, which is formalized by the condition $\gamma_f(\nu) > 0$, the algorithm automatically achieves the linear rate of convergence with the value $\gamma_f(\nu)$ being the main complexity factor.

Finally, in Sect. 5 we compare our complexity bounds with the known bounds for other methods and discuss the results. In particular, we justify an intuitively plausible (but long-delayed) result that the global complexity of the cubically regularized Newton method is always better than that of the gradient method on the class of strongly convex functions with uniformly bounded second derivative.

Uniformly Convex Functions with Hölder Continuous Hessian

Let us start with some notation. In what follows, we denote by $\mathbb{E}$ a finite-dimensional real vector space and by $\mathbb{E}^*$ its dual space, which is the space of linear functions on $\mathbb{E}$. The value of a function $s \in \mathbb{E}^*$ at a point $x \in \mathbb{E}$ is denoted by $\langle s, x\rangle$. Let us fix some linear self-adjoint positive-definite operator $B: \mathbb{E} \to \mathbb{E}^*$ and introduce the following Euclidean norms in the primal and dual spaces:

$\|x\| := \langle Bx, x\rangle^{1/2}, \ x \in \mathbb{E}, \qquad \|s\|_* := \langle s, B^{-1}s\rangle^{1/2}, \ s \in \mathbb{E}^*.$

For any linear operator $A: \mathbb{E} \to \mathbb{E}^*$, its norm is induced in the standard way:

$\|A\| := \max_{x \in \mathbb{E}}\{\|Ax\|_* : \|x\| \le 1\}.$
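For readers who want to experiment with these definitions numerically, here is a small sketch (our addition, not part of the paper; the matrix B and the test data below are arbitrary placeholders) that evaluates the primal norm, the dual norm and the induced operator norm with NumPy:

import numpy as np

# Primal norm ||x|| = <Bx, x>^{1/2}, dual norm ||s||_* = <s, B^{-1} s>^{1/2},
# and induced operator norm ||A|| = max{ ||Ax||_* : ||x|| <= 1 } for a
# self-adjoint operator A.  B is any symmetric positive-definite matrix.

def primal_norm(B, x):
    return np.sqrt(x @ (B @ x))

def dual_norm(B, s):
    return np.sqrt(s @ np.linalg.solve(B, s))

def operator_norm(B, A):
    # For self-adjoint A, ||A|| equals the largest absolute eigenvalue of
    # L^{-1} A L^{-T}, where B = L L^T is the Cholesky factorization.
    L = np.linalg.cholesky(B)
    M = np.linalg.solve(L, np.linalg.solve(L, A).T)   # L^{-1} A L^{-T}
    return np.max(np.abs(np.linalg.eigvalsh(M)))

rng = np.random.default_rng(0)
n = 5
C = rng.standard_normal((n, n))
B = C @ C.T + np.eye(n)        # positive-definite placeholder
x, s = rng.standard_normal(n), rng.standard_normal(n)
A = C + C.T                    # symmetric placeholder operator
print(primal_norm(B, x), dual_norm(B, s), operator_norm(B, A))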

Our goal is to solve the convex optimization problem in the composite form:

$\min_{x \in \mathrm{dom}\, F} \big[ F(x) := f(x) + h(x) \big],$  (1)

where $f$ is a uniformly convex function, twice differentiable on its open domain, and $h$ is a simple closed convex function with $\mathrm{dom}\, h \subseteq \mathrm{dom}\, f$. Simple means that all auxiliary subproblems with an explicit presence of $h$ are easily solvable.

For a smooth function $f$, its gradient at a point $x$ is denoted by $\nabla f(x) \in \mathbb{E}^*$, and its Hessian is denoted by $\nabla^2 f(x): \mathbb{E} \to \mathbb{E}^*$. For a convex but not necessarily differentiable function $h$, we denote by $\partial h(x) \subseteq \mathbb{E}^*$ its subdifferential at the point $x \in \mathrm{dom}\, h$.

We say that a differentiable function $f$ is uniformly convex of degree $p \ge 2$ on a convex set $C \subseteq \mathrm{dom}\, f$ if for some constant $\sigma > 0$ it satisfies the inequality

$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\sigma}{p}\|y - x\|^p, \quad \forall x, y \in C.$  (2)

Uniformly convex functions of degree $p = 2$ are known as strongly convex. If inequality (2) holds with $\sigma = 0$, the function $f$ is called just convex. The following convenient condition is sufficient for a function $f$ to be uniformly convex on a convex set $C \subseteq \mathrm{dom}\, f$:

Lemma 2.1

(Lemma 1 in [14]) Let for some $\sigma > 0$ and $p \ge 2$ the following inequality hold:

$\langle \nabla f(x) - \nabla f(y), x - y\rangle \ge \sigma \|x - y\|^p, \quad \forall x, y \in C.$  (3)

Then, the function $f$ is uniformly convex of degree $p$ on the set $C$ with parameter $\sigma$.

From now on, we assume $C := \mathrm{dom}\, F \subseteq \mathrm{dom}\, f$. By the composite representation (1), we have for every $x \in \mathrm{dom}\, F$ and for all $F'(x) \in \partial F(x)$:

$F(y) \ge F(x) + \langle F'(x), y - x\rangle + \frac{\sigma}{p}\|x - y\|^p, \quad \forall y \in \mathrm{dom}\, F.$  (4)

Therefore, if $\sigma > 0$, then there can be only one point $x^* \in \mathrm{dom}\, F$ with $F(x^*) = F^*$, and such a point always exists since $F$ is uniformly convex and closed. A useful consequence of uniform convexity is the following upper bound for the residual.

Lemma 2.2

Let $f$ be uniformly convex of degree $p \ge 2$ with constant $\sigma > 0$ on the set $\mathrm{dom}\, F$. Then, for every $x \in \mathrm{dom}\, F$ and for all $F'(x) \in \partial F(x)$ we have

$F(x) - F^* \le \frac{p-1}{p}\Big(\frac{1}{\sigma}\Big)^{\frac{1}{p-1}}\|F'(x)\|_*^{\frac{p}{p-1}}.$  (5)

Proof

In view of (4), bound (5) follows as in the proof of Lemma 3 in [14].

It is reasonable to define the best possible constant σ in inequality (3) for a certain degree p. This leads us to a system of constants:

$\sigma_f(p) := \inf_{x, y \in \mathrm{dom}\, F,\ x \ne y} \frac{\langle \nabla f(x) - \nabla f(y), x - y\rangle}{\|x - y\|^p}, \quad p \ge 2.$  (6)

We prefer to use inequality (3) for the definition of σf(p), instead of (2), because of its symmetry in x and y. Note that the value σf(p) also depends on the domain of F. However, we omit this dependence in our notation since it is always clear from the context.

It is easy to see that the univariate function $\sigma_f(\cdot)$ is log-concave. Thus, for all $p_2 > p_1 \ge 2$ we have:

$\sigma_f(p) \ge (\sigma_f(p_1))^{\frac{p_2 - p}{p_2 - p_1}} \cdot (\sigma_f(p_2))^{\frac{p - p_1}{p_2 - p_1}}, \quad p \in [p_1, p_2].$  (7)

For a twice-differentiable function $f$, we say that it has Hölder continuous Hessian of degree $\nu \in [0,1]$ on a convex set $C \subseteq \mathrm{dom}\, f$ if for some constant $H \ge 0$ it holds:

$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le H \|x - y\|^{\nu}, \quad \forall x, y \in C.$  (8)

Two simple consequences of (8) are as follows:

$\|\nabla f(y) - \nabla f(x) - \nabla^2 f(x)(y - x)\|_* \le \frac{H\|x - y\|^{1+\nu}}{1+\nu},$  (9)
$|f(y) - Q(x; y)| \le \frac{H\|x - y\|^{2+\nu}}{(1+\nu)(2+\nu)},$  (10)

where $Q(x; y)$ is the quadratic model of $f$ at the point $x$:

$Q(x; y) := f(x) + \langle \nabla f(x), y - x\rangle + \tfrac{1}{2}\langle \nabla^2 f(x)(y - x), y - x\rangle.$
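As a small numerical sanity check (our own addition, not taken from the paper), the following sketch verifies bound (10) on random points for the cubed Euclidean norm $f(x) = \frac{1}{3}\|x\|^3$ with $B = I$, for which $H_f(1) \le 2$ by Example 2.2 below; the dimension and sample size are arbitrary choices:

import numpy as np

def f(x):                      # f(x) = ||x||^3 / 3; H_f(1) <= 2 (Example 2.2 with nu = 1)
    return np.linalg.norm(x) ** 3 / 3

def grad(x):
    return np.linalg.norm(x) * x

def hess(x):
    r = np.linalg.norm(x)
    return r * np.eye(len(x)) + np.outer(x, x) / r

def Q(x, y):                   # quadratic model of f at x, evaluated at y
    d = y - x
    return f(x) + grad(x) @ d + 0.5 * d @ (hess(x) @ d)

rng = np.random.default_rng(1)
H, nu = 2.0, 1.0
for _ in range(1000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    lhs = abs(f(y) - Q(x, y))
    rhs = H * np.linalg.norm(x - y) ** (2 + nu) / ((1 + nu) * (2 + nu))
    assert lhs <= rhs + 1e-9   # this is exactly inequality (10)
print("bound (10) verified on all sampled pairs")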

In order to characterize the level of smoothness of the function $f$ on the set $C := \mathrm{dom}\, F$, let us define the system of Hölder constants (see [10]):

$H_f(\nu) := \sup_{x, y \in \mathrm{dom}\, F,\ x \ne y} \frac{\|\nabla^2 f(x) - \nabla^2 f(y)\|}{\|x - y\|^{\nu}}, \quad \nu \in [0,1].$  (11)

We allow $H_f(\nu)$ to be equal to $+\infty$ for some $\nu$. Note that the function $H_f(\cdot)$ is log-convex. Thus, any $0 \le \nu_1 < \nu_2 \le 1$ such that $H_f(\nu_i) < +\infty$, $i = 1, 2$, provide us with the following upper bounds for the whole interval:

$H_f(\nu) \le (H_f(\nu_1))^{\frac{\nu_2 - \nu}{\nu_2 - \nu_1}} \cdot (H_f(\nu_2))^{\frac{\nu - \nu_1}{\nu_2 - \nu_1}}, \quad \nu \in [\nu_1, \nu_2].$  (12)

If for some specific $\nu \in [0,1]$ we have $H_f(\nu) = 0$, this implies that $\nabla^2 f(x) = \nabla^2 f(y)$ for all $x, y \in \mathrm{dom}\, F$. In this case, the restriction of $f$ to $\mathrm{dom}\, F$ is a quadratic function, and we conclude that $H_f(\nu) = 0$ for all $\nu \in [0,1]$. At the same time, having two points $x, y \in \mathrm{dom}\, F$ with $0 < \|x - y\| \le 1$, we get a simple uniform lower bound for all constants $H_f(\nu)$:

$H_f(\nu) \ge \|\nabla^2 f(x) - \nabla^2 f(y)\|, \quad \nu \in [0,1].$

Let us give an example of a function that has Hölder continuous Hessian for all $\nu \in [0,1]$.

Example 2.1

For given $a_i \in \mathbb{E}^*$, $1 \le i \le m$, consider the following convex function:

$f(x) = \ln\Big(\sum_{i=1}^m e^{\langle a_i, x\rangle}\Big), \quad x \in \mathbb{E}.$

Let us fix the Euclidean norm $\|x\| = \langle Bx, x\rangle^{1/2}$, $x \in \mathbb{E}$, with the operator $B := \sum_{i=1}^m a_i a_i^*$. Without loss of generality, we assume that $B \succ 0$ (otherwise we can reduce the dimension of the problem). Then,

$H_f(0) \le 1, \qquad H_f(1) \le 2.$

Therefore, by (12) we get, for any $\nu \in [0,1]$:

$H_f(\nu) \le 2^{\nu}.$

Proof

Denote $\kappa(x) := \sum_{i=1}^m e^{\langle a_i, x\rangle}$. Let us fix arbitrary $x, y \in \mathbb{E}$ and a direction $h \in \mathbb{E}$. Then, a straightforward computation gives:

$\langle \nabla f(x), h\rangle = \frac{1}{\kappa(x)}\sum_{i=1}^m e^{\langle a_i, x\rangle}\langle a_i, h\rangle,$
$\langle \nabla^2 f(x)h, h\rangle = \frac{1}{\kappa(x)}\sum_{i=1}^m e^{\langle a_i, x\rangle}\langle a_i, h\rangle^2 - \Big(\frac{1}{\kappa(x)}\sum_{i=1}^m e^{\langle a_i, x\rangle}\langle a_i, h\rangle\Big)^2 = \frac{1}{\kappa(x)}\sum_{i=1}^m e^{\langle a_i, x\rangle}\big(\langle a_i, h\rangle - \langle \nabla f(x), h\rangle\big)^2 \ge 0.$

Hence, we get

$\|\nabla^2 f(x)\| = \max_{\|h\| \le 1}\langle \nabla^2 f(x)h, h\rangle \le \max_{\|h\| \le 1}\sum_{i=1}^m \langle a_i, h\rangle^2 = \max_{\|h\| \le 1}\|h\|^2 = 1.$

Since all Hessians of the function $f$ are positive semidefinite, we conclude that $H_f(0) \le 1$. The inequality $H_f(1) \le 2$ can be easily obtained from the following representation of the third derivative:

$D^3 f(x)[h, h, h] = \frac{1}{\kappa(x)}\sum_{i=1}^m e^{\langle a_i, x\rangle}\big(\langle a_i, h\rangle - \langle \nabla f(x), h\rangle\big)^3 \le \langle \nabla^2 f(x)h, h\rangle \cdot \max_{1 \le i, j \le m}\langle a_i - a_j, h\rangle \le 2\|h\|^3.$
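A quick empirical check of Example 2.1 (our addition; the vectors $a_i$ below are random placeholders) can be done by sampling points and comparing the operator norm of Hessian differences, measured with respect to $B$, against the bound $2^{\nu}\|x - y\|^{\nu}$:

import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 4
A = rng.standard_normal((m, n))          # rows play the role of the vectors a_i
B = A.T @ A                              # B = sum_i a_i a_i^*; full rank for random A
L = np.linalg.cholesky(B)

def hess(x):                             # Hessian of f(x) = ln(sum_i exp(<a_i, x>))
    w = np.exp(A @ x - np.max(A @ x))
    w /= w.sum()
    g = A.T @ w
    return A.T @ (w[:, None] * A) - np.outer(g, g)

def op_norm(M):                          # operator norm w.r.t. the B-norm
    return np.max(np.abs(np.linalg.eigvalsh(np.linalg.solve(L, np.linalg.solve(L, M).T))))

def b_norm(x):
    return np.sqrt(x @ (B @ x))

for nu in (0.0, 0.5, 1.0):
    worst = max(op_norm(hess(x) - hess(y)) / b_norm(x - y) ** nu
                for x, y in (rng.standard_normal((2, n)) for _ in range(300)))
    print(f"nu = {nu}: empirical ratio {worst:.3f} <= theoretical bound {2 ** nu:.3f}")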

Let us imagine now that we want to describe the iteration complexity of some method which solves the composite optimization problem (1) up to an absolute accuracy $\varepsilon > 0$ in the function value. We assume that the smooth part $f$ of its objective is uniformly convex and has Hölder continuous Hessian. Which degrees $p$ and $\nu$ should be used in our analysis? Suppose that, for the number of calls of the oracle, we are interested in obtaining a polynomial-time bound of the form:

$O\Big((H_f(\nu))^{\alpha} \cdot (\sigma_f(p))^{\beta} \cdot \log\frac{F(x_0) - F^*}{\varepsilon}\Big), \quad \alpha, \beta \ne 0.$

Denote by $[x]$ the physical dimension of the variable $x \in \mathbb{E}$, and by $[f]$ the physical dimension of the value $f(x)$. Then, we have $[\nabla f(x)] = [f]/[x]$ and $[\nabla^2 f(x)] = [f]/[x]^2$. This gives us

$[H_f(\nu)] = \frac{[f]}{[x]^{2+\nu}}, \qquad [\sigma_f(p)] = \frac{[f]}{[x]^{p}}, \qquad \big[(H_f(\nu))^{\alpha} \cdot (\sigma_f(p))^{\beta}\big] = \frac{[f]^{\alpha+\beta}}{[x]^{\alpha(2+\nu)+\beta p}}.$

While $x$ and $f(x)$ can be measured in arbitrary physical quantities, the value “number of iterations” cannot have a physical dimension. This leads to the following relations:

$\alpha + \beta = 0 \qquad \text{and} \qquad \alpha(2+\nu) + \beta p = 0.$

Therefore, despite the fact that our function can belong to several problem classes simultaneously, from the physical point of view only one option is available:

$p = 2 + \nu.$

Hence, for a twice-differentiable convex function $f$ with $\inf_{\nu \in [0,1]} H_f(\nu) > 0$, we can define only one meaningful condition number of degree $\nu \in [0,1]$:

$\gamma_f(\nu) := \frac{\sigma_f(2+\nu)}{H_f(\nu)}.$  (13)

If for some particular $\nu$ we have $H_f(\nu) = +\infty$, then by our definition $\gamma_f(\nu) = 0$.
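For instance, anticipating Example 2.2 below, for the cubed norm $f(x) = \frac{1}{3}\|x\|^3$ (so $p = 3$ and $\nu = 1$) the quantities entering (13) are explicit:

$\sigma_f(3) = 2^{2-3} = \tfrac{1}{2}, \qquad H_f(1) \le (1+\nu)2^{1-\nu}\big|_{\nu=1} = 2, \qquad\Longrightarrow\qquad \gamma_f(1) = \frac{\sigma_f(3)}{H_f(1)} \ge \frac{1/2}{2} = \frac{1}{4} = \frac{1}{2(1+\nu)}\Big|_{\nu=1}.$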

It will be shown that the condition number $\gamma_f(\nu)$ serves as the main factor in the global iteration complexity bounds for the regularized Newton method as applied to problem (1). Let us prove that this number cannot be too big.

Lemma 2.3

Let $\inf_{\nu \in [0,1]} H_f(\nu) > 0$, so that the condition number $\gamma_f(\cdot)$ is well defined. Then,

$\gamma_f(\nu) \le \frac{1}{1+\nu} + \inf_{x, y \in \mathrm{dom}\, F}\frac{\|\nabla^2 f(x)\|}{\|\nabla^2 f(y) - \nabla^2 f(x)\|}, \quad \nu \in [0,1].$  (14)

In the case when $\mathrm{dom}\, F$ is unbounded, i.e. $\sup_{x \in \mathrm{dom}\, F}\|x\| = +\infty$, we have

$\gamma_f(\nu) \le \frac{1}{1+\nu}, \quad \nu \in (0,1].$  (15)

Proof

Indeed, for any $x, y \in \mathrm{dom}\, F$, $x \ne y$, we have:

$\sigma_f(2+\nu) \overset{(6)}{\le} \frac{\langle \nabla f(y) - \nabla f(x), y - x\rangle}{\|y - x\|^{2+\nu}} = \frac{\langle \nabla f(y) - \nabla f(x) - \nabla^2 f(x)(y - x), y - x\rangle}{\|y - x\|^{2+\nu}} + \frac{\langle \nabla^2 f(x)(y - x), y - x\rangle}{\|y - x\|^{2+\nu}} \overset{(9)}{\le} \frac{H_f(\nu)}{1+\nu} + \frac{\|\nabla^2 f(x)\|}{\|y - x\|^{\nu}}.$

Now, dividing both sides of this inequality by $H_f(\nu)$, we get inequality (14) from the definition (11) of $H_f(\nu)$. Inequality (15) can be obtained by taking the limit $\|y\| \to +\infty$.

From inequalities (7) and (12), we can get the following lower bound:

$\gamma_f(\nu) \ge (\gamma_f(\nu_1))^{\frac{\nu_2 - \nu}{\nu_2 - \nu_1}} \cdot (\gamma_f(\nu_2))^{\frac{\nu - \nu_1}{\nu_2 - \nu_1}}, \quad \nu \in [\nu_1, \nu_2],$

where $0 \le \nu_1 < \nu_2 \le 1$. However, it turns out that in the unbounded case we can have a nonzero condition number $\gamma_f(\nu)$ only for a single degree.

Lemma 2.4

Let $\mathrm{dom}\, F$ be unbounded: $\sup_{x \in \mathrm{dom}\, F}\|x\| = +\infty$. Assume that for a fixed $\nu \in [0,1]$ we have $\gamma_f(\nu) > 0$. Then,

$\gamma_f(\alpha) = 0 \quad \text{for all } \alpha \in [0,1] \setminus \{\nu\}.$

Proof

Consider first the case $\alpha > \nu$. From the condition $\gamma_f(\nu) > 0$, we conclude that $H_f(\nu) < +\infty$. Then, for any $x, y \in \mathrm{dom}\, F$ we have:

$\frac{\sigma_f(2+\alpha)\|y - x\|^{2+\alpha}}{2+\alpha} \overset{(2)}{\le} f(y) - f(x) - \langle \nabla f(x), y - x\rangle \overset{(10)}{\le} \tfrac{1}{2}\langle \nabla^2 f(x)(y - x), y - x\rangle + \frac{H_f(\nu)\|y - x\|^{2+\nu}}{(1+\nu)(2+\nu)}.$

Dividing both sides of this inequality by $\|y - x\|^{2+\alpha}$ and letting $\|y - x\| \to +\infty$, we get $\sigma_f(2+\alpha) = 0$. Therefore, $\gamma_f(\alpha) = 0$. For the second case, $\alpha < \nu$, we cannot have $\gamma_f(\alpha) > 0$, since the previous reasoning would then give $\gamma_f(\nu) = 0$.

Let us look now at an important example of a uniformly convex function with Hölder continuous Hessian. It is convenient to start with some properties of powers of the Euclidean norm.

Lemma 2.5

For a fixed real $p \ge 1$, consider the following function:

$f_p(x) = \frac{1}{p}\|x\|^p, \quad x \in \mathbb{E}.$

1. For $p \ge 2$, the function $f_p(\cdot)$ is uniformly convex of degree $p$ (see footnote 1):

$\langle \nabla f_p(x) - \nabla f_p(y), x - y\rangle \ge 2^{2-p}\|x - y\|^p, \quad \forall x, y \in \mathbb{E}.$  (16)

2. If $1 \le p \le 2$, then the function $f_p(\cdot)$ has $\nu$-Hölder continuous gradient with $\nu = p - 1$:

$\|\nabla f_p(x) - \nabla f_p(y)\|_* \le 2^{1-\nu}\|x - y\|^{\nu}, \quad \forall x, y \in \mathbb{E}.$  (17)

Proof

Firstly, recall two useful inequalities, which are valid for all $a, b \ge 0$:

$|a^{\alpha} - b^{\alpha}| \le |a - b|^{\alpha}, \quad \text{when } 0 \le \alpha \le 1,$  (18)
$|a^{\alpha} - b^{\alpha}| \ge |a - b|^{\alpha}, \quad \text{when } \alpha \ge 1.$  (19)

Let us fix arbitrary $x, y \in \mathbb{E}$. The left-hand side of inequality (16) equals

$\langle \|x\|^{p-2}Bx - \|y\|^{p-2}By, x - y\rangle = \|x\|^p + \|y\|^p - \langle Bx, y\rangle\big(\|x\|^{p-2} + \|y\|^{p-2}\big),$

and we need to verify that it is not smaller than $2^{2-p}\big[\|x\|^2 + \|y\|^2 - 2\langle Bx, y\rangle\big]^{p/2}$. The case $x = 0$ or $y = 0$ is trivial. Therefore, assume $x \ne 0$ and $y \ne 0$. Denoting $\tau := \frac{\|y\|}{\|x\|}$ and $r := \frac{\langle Bx, y\rangle}{\|x\|\cdot\|y\|}$, we have the following statement to prove:

$1 + \tau^p \ge r\tau(1 + \tau^{p-2}) + 2^{2-p}\big[1 + \tau^2 - 2r\tau\big]^{p/2}, \quad \tau > 0, \ |r| \le 1.$

Since the function in the right-hand side is convex in $r$, we need to check only two marginal cases:

  1. $r = 1$: $1 + \tau^p \ge \tau(1 + \tau^{p-2}) + 2^{2-p}|1 - \tau|^p$, which is equivalent to $(1 - \tau)(1 - \tau^{p-1}) \ge 2^{2-p}|1 - \tau|^p$. This is true by (19).

  2. $r = -1$: $1 + \tau^p \ge -\tau(1 + \tau^{p-2}) + 2^{2-p}(1 + \tau)^p$, which is equivalent to $1 + \tau^{p-1} \ge 2^{2-p}(1 + \tau)^{p-1}$. This is true in view of the convexity of the function $\tau^{p-1}$ for $\tau \ge 0$.

Thus, we have proved (16). Let us prove the second statement. Consider the function $\hat{f}_q(s) = \frac{1}{q}\|s\|_*^q$, $s \in \mathbb{E}^*$, with $q = \frac{p}{p-1} \ge 2$. In view of our first statement, we have:

$\langle s_1 - s_2, \nabla \hat{f}_q(s_1) - \nabla \hat{f}_q(s_2)\rangle \ge \frac{1}{2^{q-2}}\|s_1 - s_2\|_*^q, \quad \forall s_1, s_2 \in \mathbb{E}^*.$  (20)

For arbitrary $x_1, x_2 \in \mathbb{E}$, define $s_i = \nabla f_p(x_i) = \frac{Bx_i}{\|x_i\|^{2-p}}$, $i = 1, 2$. Then $\|s_i\|_* = \|x_i\|^{p-1}$, and consequently,

$x_i = \|x_i\|^{2-p}B^{-1}s_i = \|s_i\|_*^{\frac{2-p}{p-1}}B^{-1}s_i = \nabla \hat{f}_q(s_i).$

Therefore, substituting these vectors into (20), we get

$\frac{1}{2^{q-2}}\|\nabla f_p(x_1) - \nabla f_p(x_2)\|_*^q \le \langle \nabla f_p(x_1) - \nabla f_p(x_2), x_1 - x_2\rangle.$

Thus, $\|\nabla f_p(x_1) - \nabla f_p(x_2)\|_* \le 2^{\frac{q-2}{q-1}}\|x_1 - x_2\|^{\frac{1}{q-1}}$. It remains to note that $\frac{1}{q-1} = p - 1 = \nu$.
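These two inequalities are easy to test numerically; the following sketch (our addition, with $B = I$ and random placeholder points) checks (16) for one value $p \ge 2$ and (17) for one value $1 \le p \le 2$:

import numpy as np

rng = np.random.default_rng(3)

def grad_fp(x, p):                 # gradient of f_p(x) = ||x||^p / p with B = I
    return np.linalg.norm(x) ** (p - 2) * x

for _ in range(2000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    d = x - y
    p = 3.5                        # (16): uniform convexity of degree p, p >= 2
    assert (grad_fp(x, p) - grad_fp(y, p)) @ d >= 2 ** (2 - p) * np.linalg.norm(d) ** p - 1e-9
    p = 1.5                        # (17): Hoelder continuous gradient, nu = p - 1
    nu = p - 1
    assert np.linalg.norm(grad_fp(x, p) - grad_fp(y, p)) <= 2 ** (1 - nu) * np.linalg.norm(d) ** nu + 1e-9
print("inequalities (16) and (17) hold on all sampled pairs")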

Example 2.2

For a real $p \ge 2$ and arbitrary $x_0 \in \mathbb{E}$, consider the following function:

$f(x) = \frac{1}{p}\|x - x_0\|^p = f_p(x - x_0), \quad x \in \mathbb{E}.$

Then, $\sigma_f(p) = \frac{1}{2^{p-2}}$. Moreover, if $p = 2 + \nu$ for some $\nu \in (0,1]$, then it holds

$H_f(\nu) \le (1+\nu)2^{1-\nu},$

and $H_f(\alpha) = +\infty$ for all $\alpha \in [0,1] \setminus \{\nu\}$. Therefore, in this case we have $\gamma_f(\nu) \ge \frac{1}{2(1+\nu)}$, and $\gamma_f(\alpha) = 0$ for all $\alpha \in [0,1] \setminus \{\nu\}$.

Proof

Without loss of generality, assume $x_0 = 0$ (both $\sigma_f(\cdot)$ and $H_f(\cdot)$ are invariant under translation of the argument). Let us take an arbitrary $x \ne 0$ and set $y := -x$. Then,

$\langle \nabla f(x) - \nabla f(y), x - y\rangle = \langle \|x\|^{p-2}Bx + \|x\|^{p-2}Bx, 2x\rangle = 4\|x\|^p.$

On the other hand, $\|y - x\|^p = 2^p\|x\|^p$. Therefore, $\sigma_f(p) \overset{(6)}{\le} 2^{2-p}$, and (16) tells us that this inequality is satisfied as an equality.

Let us prove now that $H_f(\nu) \le (1+\nu)2^{1-\nu}$ for $p = 2 + \nu$ with some $\nu \in (0,1]$. This is

$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le (1+\nu)2^{1-\nu}\|x - y\|^{\nu}, \quad \forall x, y \in \mathbb{E}.$  (21)

The corresponding Hessians can be represented as follows:

$\nabla^2 f(x) = \|x\|^{\nu}B + \nu\,\frac{Bx(Bx)^*}{\|x\|^{2-\nu}}, \quad x \in \mathbb{E}\setminus\{0\}, \qquad \nabla^2 f(0) = 0.$

For the case $x = y = 0$, inequality (21) is trivial. Assume now that $x \ne 0$. If $0 \in [x, y]$, then $y = -\beta x$ for some $\beta \ge 0$ and we have:

$\|\nabla^2 f(x) - \nabla^2 f(-\beta x)\| \le |1 - \beta^{\nu}|(1+\nu)\|x\|^{\nu} \le (1+\beta)^{\nu}(1+\nu)2^{1-\nu}\|x\|^{\nu} = (1+\nu)2^{1-\nu}\|x - y\|^{\nu},$

which is (21). Let now $0 \notin [x, y]$. For an arbitrary fixed direction $h \in \mathbb{E}$, we get:

$|\langle(\nabla^2 f(x) - \nabla^2 f(y))h, h\rangle| = \Big|\big(\|x\|^{\nu} - \|y\|^{\nu}\big)\cdot\|h\|^2 + \nu\cdot\Big(\frac{\langle Bx, h\rangle^2}{\|x\|^{2-\nu}} - \frac{\langle By, h\rangle^2}{\|y\|^{2-\nu}}\Big)\Big|.$

Consider the points $u = \frac{Bx}{\|x\|^{1-\nu}} = \nabla f_q(x)$ and $v = \frac{By}{\|y\|^{1-\nu}} = \nabla f_q(y)$ with $q = 1 + \nu$. Then,

$\|x\|^{\nu} = \|u\|_*, \quad \frac{\langle Bx, h\rangle^2}{\|x\|^{2-\nu}} = \frac{\langle u, h\rangle^2}{\|u\|_*} \qquad \text{and} \qquad \|y\|^{\nu} = \|v\|_*, \quad \frac{\langle By, h\rangle^2}{\|y\|^{2-\nu}} = \frac{\langle v, h\rangle^2}{\|v\|_*}.$

Therefore,

$|\langle(\nabla^2 f(x) - \nabla^2 f(y))h, h\rangle| = \Big|\big(\|u\|_* - \|v\|_*\big)\cdot\|h\|^2 + \nu\cdot\Big(\frac{\langle u, h\rangle^2}{\|u\|_*} - \frac{\langle v, h\rangle^2}{\|v\|_*}\Big)\Big|.$  (22)

Let us estimate the right-hand side of (22) from above. Consider a continuously differentiable univariate function:

$\varphi(\tau) := \|u(\tau)\|_*\cdot\|h\|^2 + \nu\cdot\frac{\langle u(\tau), h\rangle^2}{\|u(\tau)\|_*}, \qquad u(\tau) := u + \tau(v - u), \quad \tau \in [0,1].$

Note that

$\varphi'(\tau) = \frac{\langle u(\tau), B^{-1}(v - u)\rangle}{\|u(\tau)\|_*}\cdot\|h\|^2 + \frac{2\nu\langle u(\tau), h\rangle\langle v - u, h\rangle}{\|u(\tau)\|_*} - \frac{\nu\langle u(\tau), h\rangle^2\langle u(\tau), B^{-1}(v - u)\rangle}{\|u(\tau)\|_*^3} = \frac{\langle u(\tau), B^{-1}(v - u)\rangle}{\|u(\tau)\|_*}\cdot\Big(\|h\|^2 - \frac{\nu\langle u(\tau), h\rangle^2}{\|u(\tau)\|_*^2}\Big) + \frac{2\nu\langle u(\tau), h\rangle\langle v - u, h\rangle}{\|u(\tau)\|_*},$

where the expression in the parentheses is nonnegative. Denote $\gamma := \frac{\langle u(\tau), h\rangle}{\|u(\tau)\|_*\cdot\|h\|} \in [-1, 1]$. Then,

$|\varphi'(\tau)| \le \|v - u\|_*\cdot\|h\|^2\cdot\big(1 - \nu\gamma^2 + 2\nu|\gamma|\big) \le (1+\nu)\cdot\|v - u\|_*\cdot\|h\|^2.$

Thus, we have:

$|\langle(\nabla^2 f(x) - \nabla^2 f(y))h, h\rangle| = |\varphi(1) - \varphi(0)| \le (1+\nu)\cdot\|v - u\|_*\cdot\|h\|^2.$  (23)

It remains to use the definition of $u$ and $v$ and apply inequality (17) with $p = q$. Thus, we have proved that for $p = 2 + \nu$ the Hessian of $f$ is Hölder continuous of degree $\nu$. At the same time, taking $y = 0$, we get $\|\nabla^2 f(x) - \nabla^2 f(y)\| = \|\nabla^2 f(x)\| = (1+\nu)\|x\|^{\nu}$. These values cannot be uniformly bounded in $x \in \mathbb{E}$ by any multiple of $\|x\|^{\alpha}$ with $\alpha \ne \nu$. So, the Hessian of $f$ is not Hölder continuous of any degree different from $\nu$.

Remark 2.1

Inequalities (16) and (17) have the following symmetric consequences:

$p \ge 2 \ \Longrightarrow\ \|\nabla f_p(x) - \nabla f_p(y)\|_* \ge 2^{2-p}\|x - y\|^{p-1}, \qquad p \le 2 \ \Longrightarrow\ \|\nabla f_p(x) - \nabla f_p(y)\|_* \le 2^{2-p}\|x - y\|^{p-1},$

which are valid for all $x, y \in \mathbb{E}$.

Regularized Newton Method

Let us start from the case when we know that for a specific $\nu \in [0,1]$ the function $f$ has Hölder continuous Hessian: $H_f(\nu) < +\infty$. Then, from (10), we have the following global upper bound for the objective function:

$F(y) \le M_{\nu,H}(x; y) := Q(x; y) + \frac{H\|x - y\|^{2+\nu}}{(1+\nu)(2+\nu)} + h(y), \quad \forall x, y \in \mathrm{dom}\, F,$

where $H > 0$ is large enough: $H \ge H_f(\nu)$. Thus, it is natural to employ the minimum of the regularized quadratic model:

$T_{\nu,H}(x) := \arg\min_{y \in \mathrm{dom}\, F} M_{\nu,H}(x; y), \qquad M_{\nu,H}(x) := \min_{y \in \mathrm{dom}\, F} M_{\nu,H}(x; y),$

and define the following general iteration process [10]:

$x_{k+1} := T_{\nu,H_k}(x_k), \quad k \ge 0,$  (24)

where the value $H_k$ is chosen either to be a constant from the interval $[0, 2H_f(\nu)]$ or by some adaptive procedure.

For the class of uniformly convex functions of degree $p = 2 + \nu$, we can justify the following global convergence result for this process.

Theorem 3.1

Assume that for some $\nu \in [0,1]$ we have $0 < H_f(\nu) < +\infty$ and $\sigma_f(2+\nu) > 0$. Let the coefficients $\{H_k\}_{k \ge 0}$ in the process (24) satisfy the following conditions:

$0 \le H_k \le \beta H_f(\nu), \qquad F(x_{k+1}) \le M_{\nu,H_k}(x_k), \quad k \ge 0,$  (25)

with some constant $\beta \ge 0$. Then, for the sequence $\{x_k\}_{k \ge 0}$ generated by the process we have:

$F(x_{k+1}) - F^* \le \Big(1 - \frac{1+\nu}{2+\nu}\cdot\min\Big\{\frac{\gamma_f(\nu)(1+\nu)}{(1+\beta)(2+\nu)}, 1\Big\}^{\frac{1}{1+\nu}}\Big)\big(F(x_k) - F^*\big).$  (26)

Thus, the rate of convergence is linear, and for reaching the gap $F(x_K) - F^* \le \varepsilon$ it is enough to perform $K = \frac{2+\nu}{1+\nu}\cdot\max\Big\{\frac{(1+\beta)(2+\nu)}{\gamma_f(\nu)(1+\nu)}, 1\Big\}^{\frac{1}{1+\nu}}\log\frac{F(x_0) - F^*}{\varepsilon}$ iterations.

Proof

As in the proof of Theorem 3.1 in [10], from (25) one can see that

$F(x_{k+1}) \le F(x_k) - \alpha\big(F(x_k) - F^*\big) + \frac{\alpha^{2+\nu}(1+\beta)H_f(\nu)\|x_k - x^*\|^{2+\nu}}{(1+\nu)(2+\nu)},$

for any $\alpha \in [0,1]$. Then, taking into account the uniform convexity (4), we get

$F(x_{k+1}) \le F(x_k) - \Big(\alpha - \frac{\alpha^{2+\nu}(1+\beta)H_f(\nu)}{(1+\nu)\sigma_f(2+\nu)}\Big)\big(F(x_k) - F^*\big).$

The minimum of the right-hand side is attained at $\alpha^* = \min\Big\{\frac{\gamma_f(\nu)(1+\nu)}{(2+\nu)(1+\beta)}, 1\Big\}^{\frac{1}{1+\nu}}$. Plugging this value into the bound above, we get inequality (26).

Unfortunately, in practice it is difficult to decide on an appropriate value of $\nu \in [0,1]$ with $H_f(\nu) < +\infty$. Therefore, it is interesting to develop universal methods which are not based on such particular parameters. Recently, it was shown in [10] that one good choice for such a universal scheme is the cubic regularization of the Newton method [17]. This is actually the process (24) with the fixed parameter $\nu = 1$. For this choice, in the rest of the paper we omit the corresponding index in the definitions of all necessary objects: $M_H(x; y) := M_{1,H}(x; y)$, $T_H(x) := T_{1,H}(x)$, and $M_H(x) := M_{1,H}(x) = M_H(x; T_H(x))$. The adaptive scheme of our method with dynamic estimation of the constant $H$ is as follows.

Algorithm 1: Adaptive Cubic Regularization of Newton Method
Initialization. Choose $x_0 \in \mathrm{dom}\, F$, $H_0 > 0$.
Iteration $k \ge 0$.
   1: Find the minimal integer $i_k \ge 0$ such that $F(T_{H_k 2^{i_k}}(x_k)) \le M_{H_k 2^{i_k}}(x_k)$.
   2: Perform the Cubic Step: $x_{k+1} = T_{H_k 2^{i_k}}(x_k)$.
   3: Set $H_{k+1} := 2^{i_k - 1}H_k$.
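To make the scheme concrete, here is a minimal Python sketch of Algorithm 1 for the noncomposite case ($h \equiv 0$, $B = I$); the cubic-step solver, the tolerances and the test problem below are our own illustrative choices, not prescriptions from the paper:

import numpy as np

def cubic_step(g, Hx, H):
    """Minimize <g, d> + 0.5 <Hx d, d> + (H/6) ||d||^3 over d (h = 0, B = I).

    Uses the stationarity condition (Hx + (H r / 2) I) d = -g with r = ||d||,
    solved by bisection on the scalar r; Hx is assumed positive semidefinite."""
    n = len(g)
    d_of = lambda r: np.linalg.solve(Hx + 0.5 * H * r * np.eye(n), -g)
    lo, hi = 0.0, np.sqrt(2 * np.linalg.norm(g) / H) + 1e-12
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(d_of(mid)) > mid else (lo, mid)
    return d_of(hi)

def adaptive_cubic_newton(f, grad, hess, x0, H0=1.0, n_iters=30):
    """Sketch of Algorithm 1: adaptive cubic regularization of the Newton method."""
    x, H = x0.astype(float), H0
    for _ in range(n_iters):
        g, Hx = grad(x), hess(x)
        i = 0
        while True:                          # line 1: find the minimal i_k >= 0
            Hi = H * 2 ** i
            d = cubic_step(g, Hx, Hi)
            model = f(x) + g @ d + 0.5 * d @ (Hx @ d) + Hi * np.linalg.norm(d) ** 3 / 6
            if f(x + d) <= model:            # F(T_{H 2^i}(x_k)) <= M_{H 2^i}(x_k)
                break
            i += 1
        x = x + d                            # line 2: cubic step
        H = Hi / 2                           # line 3: H_{k+1} := 2^{i_k - 1} H_k
    return x

# Placeholder test problem: f(x) = ||x||^4 / 4 + 0.5 ||x - e||^2 (strongly convex).
e = np.ones(5)
f = lambda x: 0.25 * np.linalg.norm(x) ** 4 + 0.5 * np.linalg.norm(x - e) ** 2
grad = lambda x: np.linalg.norm(x) ** 2 * x + (x - e)
hess = lambda x: np.linalg.norm(x) ** 2 * np.eye(5) + 2 * np.outer(x, x) + np.eye(5)
x_sol = adaptive_cubic_newton(f, grad, hess, x0=np.zeros(5))
print("final gradient norm:", np.linalg.norm(grad(x_sol)))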

Let us present the main properties of the composite Cubic Newton step $x \mapsto T_H(x)$. Denote

$r_H(x) := \|T_H(x) - x\|.$

Since the point $T_H(x)$ is a minimum of the strictly convex function $M_H(x; \cdot)$, it satisfies the following first-order optimality condition:

$\Big\langle \nabla f(x) + \nabla^2 f(x)(T_H(x) - x) + \frac{H r_H(x)}{2}B(T_H(x) - x),\ y - T_H(x)\Big\rangle + h(y) \ge h(T_H(x)), \quad \forall y \in \mathrm{dom}\, F.$  (27)

In other words, the vector

$h'(T_H(x)) := -\nabla f(x) - \nabla^2 f(x)(T_H(x) - x) - \frac{H r_H(x)}{2}B(T_H(x) - x)$

belongs to the subdifferential of $h$:

$h'(T_H(x)) \in \partial h(T_H(x)).$  (28)

Computation of a point $T = T_H(x)$ satisfying condition (28) requires some standard techniques of Convex Optimization and Linear Algebra (see [1, 3, 16, 17]). The arithmetical complexity of such a procedure is usually similar to that of the standard Newton step.
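One standard way to carry out this step in the noncomposite case ($h \equiv 0$, $B = I$), sketched below as our own illustration rather than the authors' implementation, is to compute a single factorization of $\nabla^2 f(x)$ (the same $O(n^3)$ cost as a Newton step) and then solve a one-dimensional equation in $r = \|T_H(x) - x\|$:

import numpy as np

def cubic_newton_step(g, Hess, H):
    """Return d = T_H(x) - x for the model Q(x; y) + (H/6) ||y - x||^3 (h = 0, B = I).

    After one eigendecomposition Hess = U diag(lam) U^T (lam >= 0 for convex f),
    the stationarity condition (Hess + (H r / 2) I) d = -g with r = ||d|| becomes
    a scalar equation in r, solved here by bisection."""
    lam, U = np.linalg.eigh(Hess)
    gt = U.T @ g
    step_norm = lambda r: np.linalg.norm(gt / (lam + 0.5 * H * r))
    lo, hi = 0.0, np.sqrt(2 * np.linalg.norm(g) / H) + 1e-12
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if step_norm(mid) > mid else (lo, mid)
    return U @ (-gt / (lam + 0.5 * H * hi))

rng = np.random.default_rng(4)
n = 6
C = rng.standard_normal((n, n))
Hess = C @ C.T                             # positive semidefinite placeholder Hessian
g = rng.standard_normal(n)
d = cubic_newton_step(g, Hess, 1.0)
# residual of the first-order optimality condition (27) with h = 0 and B = I:
print(np.linalg.norm(g + Hess @ d + 0.5 * 1.0 * np.linalg.norm(d) * d))

After the factorization, each trial value of $r$ costs only $O(n)$ operations, which is why the overall arithmetical cost stays comparable to that of one Newton step; in the composite case with a nontrivial $h$, the regularized subproblem has to be handled by other standard means.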

Plugging $y := x \in \mathrm{dom}\, F$ into (27), we get:

$\langle \nabla f(x), x - T_H(x)\rangle \ge \langle \nabla^2 f(x)(T_H(x) - x), T_H(x) - x\rangle + \frac{H r_H^3(x)}{2} + h(T_H(x)) - h(x).$  (29)

Thus, we obtain the following bound for the minimal value $M_H(x)$ of the cubic model:

$M_H(x) \overset{(29)}{\le} f(x) - \frac{1}{2}\langle \nabla^2 f(x)(T_H(x) - x), T_H(x) - x\rangle - \frac{H r_H^3(x)}{3} + h(x) = F(x) - \frac{1}{2}\langle \nabla^2 f(x)(T_H(x) - x), T_H(x) - x\rangle - \frac{H r_H^3(x)}{3}.$  (30)

If for some value $\nu \in [0,1]$ the Hessian is Hölder continuous, $H_f(\nu) < +\infty$, then by (9) and (28) we get the following bound for the subgradient

$F'(T_H(x)) := \nabla f(T_H(x)) + h'(T_H(x))$

at the new point:

$\|F'(T_H(x))\|_* \le \|\nabla f(T_H(x)) - \nabla f(x) - \nabla^2 f(x)(T_H(x) - x)\|_* + \frac{H r_H^2(x)}{2} \overset{(9)}{\le} \frac{H_f(\nu) r_H^{1+\nu}(x)}{1+\nu} + \frac{H r_H^2(x)}{2} = r_H^{1+\nu}(x)\cdot\Big(\frac{H_f(\nu)}{1+\nu} + \frac{H r_H^{1-\nu}(x)}{2}\Big).$  (31)

One of the main strong points of the classical Newton method is its local quadratic convergence for the class of strongly convex functions with Lipschitz continuous Hessian: $\sigma_f(2) > 0$ and $0 < H_f(1) < +\infty$ (see, for example, [15]). This property holds for the cubically regularized Newton method as well [14, 17]. Indeed, ensuring $F(T_H(x)) \le M_H(x)$ as in Algorithm 1, and having $H \le \beta H_f(1)$ with some $\beta \ge 0$, we get:

$F(T_H(x)) - F^* \overset{(5)}{\le} \frac{1}{2\sigma_f(2)}\|F'(T_H(x))\|_*^2 \overset{(31)}{\le} \frac{(1+\beta)^2 H_f^2(1)}{8\sigma_f(2)}r_H^4(x) \le \frac{(1+\beta)^2 H_f^2(1)}{8\sigma_f^3(2)}\langle \nabla^2 f(x)(T_H(x) - x), T_H(x) - x\rangle^2 \overset{(30)}{\le} \frac{(1+\beta)^2 H_f^2(1)}{2\sigma_f^3(2)}\big(F(x) - F^*\big)^2.$

And the region of quadratic convergence is as follows:

$\mathcal{Q} = \Big\{x \in \mathrm{dom}\, F : F(x) - F^* \le \frac{2\sigma_f^3(2)}{(1+\beta)^2 H_f^2(1)}\Big\}.$

After reaching it, the method starts doubling the number of correct digits of the answer at every step, so this phase cannot last long. Therefore, from now on we are mainly interested in the global complexity bounds of Algorithm 1, which work for an arbitrary starting point $x_0$.

For the noncomposite case, as was shown in [10], if for some $\nu \in [0,1]$ we have $0 < H_f(\nu) < +\infty$ and the objective is just convex, then Algorithm 1 with a small initial parameter $H_0$ generates a solution $\hat{x}$ with $f(\hat{x}) - f^* \le \varepsilon$ in $O\big((H_f(\nu)D_0^{2+\nu}/\varepsilon)^{\frac{1}{1+\nu}}\big)$ iterations, where $D_0 := \max_x\{\|x - x^*\| : f(x) \le f(x_0)\}$. Thus, the method in [10] has a sublinear rate of convergence on the class of convex functions with Hölder continuous Hessian. It can automatically adapt to the actual level of smoothness. In what follows, we show that the same algorithm achieves a linear rate of convergence on the class of uniformly convex functions of degree $p = 2 + \nu$, namely for functions with strictly positive condition number: $\sup_{\nu \in [0,1]}\gamma_f(\nu) > 0$.

In the remaining part of the paper, we usually assume that the smooth part of our objective is not purely quadratic. This is equivalent to the condition $\inf_{\nu \in [0,1]} H_f(\nu) > 0$. However, to conclude this section, let us briefly discuss the case $\min_{\nu \in [0,1]} H_f(\nu) = 0$. If we knew in advance that $f$ is a convex quadratic function, then no regularization would be needed, since a single step $x \mapsto T_H(x)$ with $H := 0$ solves the problem. However, if our function is given by a black-box oracle and we do not know a priori that its smooth part is quadratic, then we can still use Algorithm 1. For this case, we prove the following simple result.

Proposition 3.1

Let $A: \mathbb{E} \to \mathbb{E}^*$ be a self-adjoint positive semidefinite linear operator and $b \in \mathbb{E}^*$. Assume that $f(x) := \frac{1}{2}\langle Ax, x\rangle - \langle b, x\rangle$, and that the minimum $x^* \in \mathrm{Argmin}_{x \in \mathrm{dom}\, F}\{F(x) := f(x) + h(x)\}$ does exist. Then, in order to get $F(x_K) - F^* \le \varepsilon$ with arbitrary $\varepsilon > 0$, it is enough to perform

$K = \log_2\frac{H_0\|x_0 - x^*\|^3}{6\varepsilon} + 1$  (32)

iterations of Algorithm 1.

Proof

In our case, the quadratic model coincides with the smooth part of the objective: $Q(x; y) \equiv f(y)$, $\forall x, y \in \mathbb{E}$. Therefore, at every iteration $k \ge 0$ of Algorithm 1 we have $i_k = 0$ and $H_k = 2^{-k}H_0$. Note that $x_{k+1} = T_{2^{-k}H_0}(x_k) = \arg\min_{y \in \mathrm{dom}\, F}\{F(y) + \frac{2^{-k}H_0}{6}\|y - x_k\|^3\}$, and

$F(x_{k+1}) \le F(y) + \frac{2^{-k}H_0}{6}\|y - x_k\|^3, \quad \forall y \in \mathrm{dom}\, F.$  (33)

Let us prove that $\|x_{k+1} - x^*\| \le \|x_k - x^*\|$ for all $k \ge 0$. If this is true, then plugging $y \equiv x^*$ into (33), we get $F(x_{k+1}) - F^* \le \frac{2^{-k}H_0}{6}\|x_0 - x^*\|^3$, which results in the estimate (32). Indeed,

$\|x_k - x^*\|^2 = \|(x_k - x_{k+1}) + (x_{k+1} - x^*)\|^2 = \|x_{k+1} - x^*\|^2 + \|x_k - x_{k+1}\|^2 + 2\langle B(x_k - x_{k+1}), x_{k+1} - x^*\rangle,$

and it is enough to show that $\langle B(x_k - x_{k+1}), x^* - x_{k+1}\rangle \le 0$. Since $x_{k+1}$ satisfies the first-order optimality condition

$-2^{-(k+1)}H_0\|x_{k+1} - x_k\|B(x_{k+1} - x_k) =: F'(x_{k+1}) \in \partial F(x_{k+1}),$  (34)

we have:

$\langle B(x_k - x_{k+1}), x^* - x_{k+1}\rangle \overset{(34)}{=} \frac{2^{k+1}}{H_0\|x_k - x_{k+1}\|}\langle F'(x_{k+1}), x^* - x_{k+1}\rangle \le 0,$

where the last inequality follows from the convexity of the objective.

Complexity Results for Uniformly Convex Functions

In this section, we are going to justify the global linear rate of convergence of Algorithm 1 for the class of twice differentiable uniformly convex functions with Hölder continuous Hessian. Universality of this method is ensured by the adaptive estimation of the parameter $H$ over the whole sequence of iterations. It is important to distinguish two cases: $H_{k+1} < H_k$ and $H_{k+1} \ge H_k$.

First, we need to estimate the progress in the objective function after minimizing the cubic model. There are two different situations here:

$\text{either} \quad H r_H^{1-\nu}(x) \le \frac{2H_f(\nu)}{1+\nu}, \qquad \text{or} \quad H r_H^{1-\nu}(x) > \frac{2H_f(\nu)}{1+\nu}.$

Lemma 4.1

Let $0 < H_f(\nu) < +\infty$ and $\sigma_f(2+\nu) > 0$ for some $\nu \in [0,1]$. Then, for arbitrary $x \in \mathrm{dom}\, F$ and $H > 0$ we have:

$F(x) - M_H(x) \ge \min\bigg[\big(F(x) - F^*\big)\cdot\frac{1+\nu}{2+\nu}\cdot\min\Big\{\Big(\frac{(1+\nu)\gamma_f(\nu)}{2(2+\nu)}\Big)^{\frac{1}{1+\nu}}, 1\Big\},\ \big(F(T_H(x)) - F^*\big)^{\frac{3(1+\nu)}{2(2+\nu)}}\cdot\Big(\frac{2+\nu}{1+\nu}\Big)^{\frac{3(1+\nu)}{2(2+\nu)}}\cdot\frac{(\sigma_f(2+\nu))^{\frac{3}{2(2+\nu)}}}{3\sqrt{H}}\bigg].$  (35)

Proof

Let us consider two cases. 1) $H r_H^{1-\nu}(x) \le \frac{2H_f(\nu)}{1+\nu}$. Then, for arbitrary $y \in \mathrm{dom}\, F$, we have:

$M_H(x) := Q(x; T_H(x)) + \frac{H}{6}\|T_H(x) - x\|^3 + h(T_H(x)) \le Q(x; y) + \frac{H r_H^{1-\nu}(x)\|y - x\|^{2+\nu}}{2(2+\nu)} + h(y) \overset{(10)}{\le} F(y) + \frac{H_f(\nu)\|y - x\|^{2+\nu}}{(1+\nu)(2+\nu)} + \frac{H r_H^{1-\nu}(x)\|y - x\|^{2+\nu}}{2(2+\nu)} \le F(y) + \frac{2H_f(\nu)\|y - x\|^{2+\nu}}{(1+\nu)(2+\nu)},$

where the first inequality follows from the fact that

$T_H(x) = \arg\min_{y \in \mathrm{dom}\, F}\Big\{Q(x; y) + \frac{H r_H^{1-\nu}(x)\|y - x\|^{2+\nu}}{2(2+\nu)} + h(y)\Big\}.$

Let us restrict $y$ to the segment $y = \alpha x^* + (1-\alpha)x$ with $\alpha \in [0,1]$. Taking into account the uniform convexity, we get:

$M_H(x) \le F(x) - \alpha\big(F(x) - F^*\big) + \frac{2\alpha^{2+\nu}H_f(\nu)\|x - x^*\|^{2+\nu}}{(1+\nu)(2+\nu)} \overset{(4)}{\le} F(x) - \Big(\alpha - \frac{2\alpha^{2+\nu}H_f(\nu)}{(1+\nu)\sigma_f(2+\nu)}\Big)\big(F(x) - F^*\big).$

The minimum of the right-hand side is attained at $\alpha^* = \min\Big\{\frac{(1+\nu)\gamma_f(\nu)}{2(2+\nu)}, 1\Big\}^{\frac{1}{1+\nu}}$. Plugging this value into the bound, we have:

$M_H(x) \le F(x) - \min\Big\{\Big(\frac{(1+\nu)\gamma_f(\nu)}{2(2+\nu)}\Big)^{1/(1+\nu)}, 1\Big\}\cdot\frac{1+\nu}{2+\nu}\cdot\big(F(x) - F^*\big),$

and this is the first argument of the minimum in (35).

2) $H r_H^{1-\nu}(x) > \frac{2H_f(\nu)}{1+\nu}$. By (31), we have the bound:

$\|F'(T_H(x))\|_* < H r_H^2(x).$  (36)

Using the fact that $\nabla^2 f(x) \succeq 0$, we get the second argument of the minimum:

$F(x) - M_H(x) \overset{(30)}{\ge} \frac{H r_H^3(x)}{3} \overset{(36)}{\ge} \frac{\|F'(T_H(x))\|_*^{3/2}}{3\sqrt{H}} \overset{(5)}{\ge} \Big(\frac{2+\nu}{1+\nu}\Big)^{\frac{3(1+\nu)}{2(2+\nu)}}\cdot\frac{(\sigma_f(2+\nu))^{\frac{3}{2(2+\nu)}}}{3\sqrt{H}}\cdot\big(F(T_H(x)) - F^*\big)^{\frac{3(1+\nu)}{2(2+\nu)}}.$

Denote by $\kappa_f(\nu)$ the following auxiliary value:

$\kappa_f(\nu) := \frac{(H_f(\nu))^{\frac{2}{1+\nu}}}{(\sigma_f(2+\nu))^{\frac{1-\nu}{(1+\nu)(2+\nu)}}}\cdot\frac{6\,(8+\nu)^{\frac{1-\nu}{1+\nu}}}{((1+\nu)(2+\nu))^{\frac{2}{1+\nu}}}\cdot\Big(\frac{1+\nu}{2+\nu}\Big)^{\frac{1-\nu}{2+\nu}}, \quad \nu \in [0,1].$  (37)

The next lemma shows what happens when the parameter $H$ is increased during the iterations.

Lemma 4.2

Assume that for a fixed $x \in \mathrm{dom}\, F$ the parameter $H > 0$ is such that:

$F(T_H(x)) > M_H(x).$  (38)

If for some $\nu \in [0,1]$ we have $\sigma_f(2+\nu) > 0$, then it holds:

$H\big(F(T_{2H}(x)) - F^*\big)^{\frac{1-\nu}{2+\nu}} < \kappa_f(\nu).$  (39)

Proof

Firstly, let us prove that from (38) we have:

$H r_H^{1-\nu}(x) < \frac{6H_f(\nu)}{(1+\nu)(2+\nu)}.$  (40)

Assuming by contradiction that $H r_H^{1-\nu}(x) \ge \frac{6H_f(\nu)}{(1+\nu)(2+\nu)}$, we get:

$M_H(x) := \frac{H\|T_H(x) - x\|^3}{6} + Q(x; T_H(x)) + h(T_H(x)) \ge \frac{H_f(\nu)\|T_H(x) - x\|^{2+\nu}}{(1+\nu)(2+\nu)} + Q(x; T_H(x)) + h(T_H(x)) \overset{(10)}{\ge} F(T_H(x)),$

which contradicts (38). Secondly, by its definition, $M_H(x)$ is a concave function of $H$. Therefore, its derivative $\frac{d}{dH}M_H(x) = \frac{1}{6}r_H^3(x)$ is nonincreasing. Hence, it holds:

$r_{2H}(x) \le r_H(x) \overset{(40)}{<} \Big(\frac{6H_f(\nu)}{(1+\nu)(2+\nu)H}\Big)^{\frac{1}{1-\nu}}.$  (41)

Finally, by the smoothness and the uniform convexity, we obtain:

$H\big(F(T_{2H}(x)) - F^*\big)^{\frac{1-\nu}{2+\nu}} \overset{(5)}{\le} H\Big(\frac{1+\nu}{2+\nu}\Big(\frac{1}{\sigma_f(2+\nu)}\Big)^{\frac{1}{1+\nu}}\Big)^{\frac{1-\nu}{2+\nu}}\|F'(T_{2H}(x))\|_*^{\frac{1-\nu}{1+\nu}} \overset{(31)}{\le} H\Big(\frac{1+\nu}{2+\nu}\Big(\frac{1}{\sigma_f(2+\nu)}\Big)^{\frac{1}{1+\nu}}\Big)^{\frac{1-\nu}{2+\nu}}\Big(r_{2H}^{1+\nu}(x)\cdot\Big(\frac{H_f(\nu)}{1+\nu} + H r_{2H}^{1-\nu}(x)\Big)\Big)^{\frac{1-\nu}{1+\nu}} \overset{(41)}{<} H\Big(\frac{1+\nu}{2+\nu}\Big(\frac{1}{\sigma_f(2+\nu)}\Big)^{\frac{1}{1+\nu}}\Big)^{\frac{1-\nu}{2+\nu}}\Big(r_{2H}^{1+\nu}(x)\cdot\frac{(8+\nu)H_f(\nu)}{(1+\nu)(2+\nu)}\Big)^{\frac{1-\nu}{1+\nu}} \overset{(41)}{<} \Big(\frac{1+\nu}{2+\nu}\Big(\frac{1}{\sigma_f(2+\nu)}\Big)^{\frac{1}{1+\nu}}\Big)^{\frac{1-\nu}{2+\nu}}\Big(\frac{H_f(\nu)}{(1+\nu)(2+\nu)}\Big)^{\frac{2}{1+\nu}}6\,(8+\nu)^{\frac{1-\nu}{1+\nu}} =: \kappa_f(\nu).$

We are ready to prove the main result of this paper.

Theorem 4.1

Assume that for a fixed $\nu \in [0,1]$ we have $0 < H_f(\nu) < +\infty$ and $\sigma_f(2+\nu) > 0$. Let the parameter $H_0$ in Algorithm 1 be small enough:

$H_0 \le \frac{\kappa_f(\nu)}{(F(x_0) - F^*)^{(1-\nu)/(2+\nu)}},$  (42)

where $\kappa_f(\nu)$ is defined by (37). Let the sequence $\{x_k\}_{k=0}^{K}$ generated by the method satisfy the condition:

$F(T_{H_k 2^j}(x_k)) - F^* \ge \varepsilon > 0, \quad 0 \le j \le i_k, \quad 0 \le k \le K - 1.$  (43)

Then, for every $0 \le k \le K - 1$, we have:

$F(x_{k+1}) - F^* \le \bigg(1 - \min\Big\{\frac{(2+\nu)\,((1+\nu)(2+\nu))^{1/(1+\nu)}\,(\gamma_f(\nu))^{\frac{1}{1+\nu}}}{(1+\nu)\,6^{3/2}\cdot 2^{1/2}\cdot(8+\nu)^{(1-\nu)/(2+2\nu)}},\ \frac{1}{2}\Big\}\bigg)\cdot\big(F(x_k) - F^*\big).$  (44)

Therefore, the rate of convergence is linear, and

$K \le \max\Big\{(\gamma_f(\nu))^{-\frac{1}{1+\nu}}\cdot\frac{1+\nu}{2+\nu}\cdot\frac{6^{3/2}\cdot 2^{1/2}\cdot(8+\nu)^{(1-\nu)/(2+2\nu)}}{((1+\nu)(2+\nu))^{1/(1+\nu)}},\ 1\Big\}\cdot\log\frac{F(x_0) - F^*}{\varepsilon}.$

Moreover, we have the following bound for the total number of oracle calls $N_K$ during the first $K$ iterations:

$N_K \le 2K + \log_2\frac{\kappa_f(\nu)}{\varepsilon^{(1-\nu)/(2+\nu)}} - \log_2 H_0.$  (45)

Proof

The proof is based on Lemmas 4.1 and 4.2 and the monotonicity of the sequence $\{F(x_k)\}_{k \ge 0}$. Firstly, we need to show that every iteration of the method is well defined. Namely, we are going to verify that for a fixed $0 \le k \le K - 1$ there exists a finite integer $\ell \ge 0$ such that either $F(T_{H_k 2^{\ell}}(x_k)) \le M_{H_k 2^{\ell}}(x_k)$ or $F(T_{H_k 2^{\ell+1}}(x_k)) - F^* < \varepsilon$. Indeed, let us set

$\ell := \max\Big\{0,\ \Big\lceil\log_2\frac{\kappa_f(\nu)}{H_k\,\varepsilon^{(1-\nu)/(2+\nu)}}\Big\rceil\Big\}, \qquad \text{and} \qquad H := H_k 2^{\ell} \ge \frac{\kappa_f(\nu)}{\varepsilon^{(1-\nu)/(2+\nu)}}.$  (46)

Then, if we had both $F(T_H(x_k)) > M_H(x_k)$ and $F(T_{2H}(x_k)) - F^* \ge \varepsilon$, we would get by Lemma 4.2:

$H \overset{(39)}{<} \frac{\kappa_f(\nu)}{(F(T_{2H}(x_k)) - F^*)^{(1-\nu)/(2+\nu)}} \le \frac{\kappa_f(\nu)}{\varepsilon^{(1-\nu)/(2+\nu)}},$

which contradicts (46). Therefore, if we are unable to find the value $i_k \ge 0$ (see line 1 of the Algorithm) in a finite number of steps, this only means that we have already solved the problem up to accuracy $\varepsilon$.

Now, let us show that for every $0 \le k \le K$ it holds:

$H_k\big(F(x_k) - F^*\big)^{\frac{1-\nu}{2+\nu}} \le \max\Big\{\kappa_f(\nu),\ H_0\big(F(x_0) - F^*\big)^{\frac{1-\nu}{2+\nu}}\Big\}.$  (47)

This inequality is obviously valid for $k = 0$. Assume it is also valid for some $k \ge 0$. Then, by the definition of $H_{k+1}$ (see line 3 of the Algorithm), we have $H_{k+1} = H_k 2^{i_k - 1}$. There are two cases. 1) $i_k = 0$. Then, $H_{k+1} < H_k$. By the monotonicity of $\{F(x_k)\}_{k \ge 0}$ and by induction, we get:

$H_{k+1}\big(F(x_{k+1}) - F^*\big)^{\frac{1-\nu}{2+\nu}} < H_k\big(F(x_k) - F^*\big)^{\frac{1-\nu}{2+\nu}} \le \max\Big\{\kappa_f(\nu),\ H_0\big(F(x_0) - F^*\big)^{\frac{1-\nu}{2+\nu}}\Big\}.$

2) $i_k > 0$. Then, applying Lemma 4.2 with $H := H_k 2^{i_k - 1} = H_{k+1}$ and $x := x_k$, we have:

$H_{k+1}\big(F(x_{k+1}) - F^*\big)^{\frac{1-\nu}{2+\nu}} = H\big(F(T_{2H}(x)) - F^*\big)^{\frac{1-\nu}{2+\nu}} \overset{(39)}{<} \kappa_f(\nu).$

Thus, (47) is true by induction. Choosing $H_0$ small enough (42), we have:

$2H_k\big(F(x_k) - F^*\big)^{\frac{1-\nu}{2+\nu}} \le 2\kappa_f(\nu), \quad 0 \le k \le K.$  (48)

From Lemma 4.1, we know that one of the two following estimates is true (denote $\delta_k := F(x_k) - F^*$):

  1. $F(x_k) - F(x_{k+1}) \ge \alpha\cdot\delta_k \ \Rightarrow\ \delta_{k+1} \le (1 - \alpha)\cdot\delta_k$, or

  2. $F(x_k) - F(x_{k+1}) \ge \beta\cdot\delta_{k+1} \ \Rightarrow\ \delta_{k+1} \le (1+\beta)^{-1}\delta_k \le (1 - \min\{\beta, 1\}/2)\cdot\delta_k$,

where $\alpha := \frac{1+\nu}{2+\nu}\cdot\min\Big\{\Big(\frac{(1+\nu)\gamma_f(\nu)}{2(2+\nu)}\Big)^{\frac{1}{1+\nu}}, 1\Big\}$, and

$\beta := \Big(\frac{2+\nu}{1+\nu}\Big)^{\frac{3(1+\nu)}{2(2+\nu)}}\cdot\frac{(\sigma_f(2+\nu))^{\frac{3}{2(2+\nu)}}}{3(2\kappa_f(\nu))^{1/2}} \overset{(37)}{=} \frac{2+\nu}{1+\nu}\cdot\frac{2^{1/2}\,((1+\nu)(2+\nu))^{\frac{1}{1+\nu}}}{6^{3/2}\cdot(8+\nu)^{(1-\nu)/(2+2\nu)}}\cdot(\gamma_f(\nu))^{\frac{1}{1+\nu}}.$

It remains to note that $\alpha \ge \min\{\beta, 1\}/2$. Thus, we obtain (44).

Finally, let us estimate the total number of oracle calls $N_K$ during the first $K$ iterations. At each iteration, the oracle is called $i_k + 1$ times, and we have $H_{k+1} = H_k 2^{i_k - 1}$. Therefore,

$N_K = \sum_{k=0}^{K-1}(i_k + 1) = \sum_{k=0}^{K-1}\Big(\log_2\frac{H_{k+1}}{H_k} + 2\Big) = 2K + \log_2 H_K - \log_2 H_0 \overset{(48),(43)}{\le} 2K + \log_2\frac{\kappa_f(\nu)}{\varepsilon^{(1-\nu)/(2+\nu)}} - \log_2 H_0.$

Note that condition (42) on the initial choice of $H_0$ can be seen as a definition of the moment after which we can guarantee the linear rate of convergence (44). In practice, we can launch Algorithm 1 with an arbitrary $H_0 > 0$. There are two possible options: either the method halves $H_k$ at every step in the beginning, so $H_k$ becomes small very quickly, or this value is increased at least once, and the required bound is guaranteed by Lemma 4.2. It can easily be proved that this initial phase requires no more than $K_0 = \log_2\frac{H_0\,\varepsilon^{(1-\nu)/(2+\nu)}}{\kappa_f(\nu)}$ oracle calls.

Discussion

Let us discuss the global complexity results provided by Theorem 4.1 for the Cubic Regularization of the Newton Method with the adaptive adjustment of the regularization parameter.

For the class of twice continuously differentiable strongly convex functions with Lipschitz continuous gradient, $f \in \mathcal{S}^{2,1}_{\mu,L}(\mathrm{dom}\, F)$, it is well known that the classical gradient descent method needs

$O\Big(\frac{L}{\mu}\log\frac{F(x_0) - F^*}{\varepsilon}\Big)$  (49)

iterations for computing an $\varepsilon$-solution of the problem (e.g., [15]). As was shown in [6], this result is shared by a variant of the Cubic Regularization of the Newton method. This is much better than the bound $O\big((\frac{L}{\mu})^2\log\frac{F(x_0) - F^*}{\varepsilon}\big)$ known for the damped Newton method (e.g., [2]).

For the class of uniformly convex functions of degree $p = 2 + \nu$ having Hölder continuous Hessian of degree $\nu \in [0,1]$, we have proved the following parametric estimates: $O\big(\max\{(\gamma_f(\nu))^{-\frac{1}{1+\nu}}, 1\}\cdot\log\frac{F(x_0) - F^*}{\varepsilon}\big)$, where $\gamma_f(\nu) := \frac{\sigma_f(2+\nu)}{H_f(\nu)}$ is the condition number of degree $\nu$. However, in practice we may not know exactly an appropriate value of the parameter $\nu$. It is important that our algorithm automatically adjusts to the best possible complexity bound:

$O\Big(\max\Big\{\inf_{\nu \in [0,1]}(\gamma_f(\nu))^{-\frac{1}{1+\nu}},\ 1\Big\}\cdot\log\frac{F(x_0) - F^*}{\varepsilon}\Big).$  (50)

Note that for $f \in \mathcal{S}^{2,1}_{\mu,L}(\mathrm{dom}\, F)$ we have:

$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L - \mu, \quad \forall x, y \in \mathrm{dom}\, F.$

Thus, $H_f(0) \le L - \mu$ and $\gamma_f(0) \ge \frac{\mu}{L - \mu}$. So we can conclude that the estimate (50) is better than (49). Moreover, adding to our objective an arbitrary convex quadratic function does not change any of the constants $H_f(\nu)$, $\nu \in [0,1]$. Thus, it can only improve the condition number $\gamma_f(\nu)$, while the ratio $L/\mu$ may become arbitrarily bad. This confirms the intuition that a natural Newton-type minimization scheme should not be affected by the quadratic parts of the objective, and that the notion of well-conditioned and ill-conditioned problems for second-order methods should be different from that for first-order ones.
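This effect is easy to observe numerically. The sketch below (our own addition; all data are placeholders, and the sampled quantities are only empirical proxies for $\mu$, $L$ and $H_f(0)$ on random points) adds quadratics of different conditioning to a log-sum-exp term: the estimate of $H_f(0)$ does not move, the proxy for $\gamma_f(0)$ improves, while $L/\mu$ deteriorates:

import numpy as np

rng = np.random.default_rng(5)
m, n = 8, 4
A = rng.standard_normal((m, n))            # data of a log-sum-exp term (placeholder)

def hess_lse(x):                            # Hessian of ln(sum_i exp(<a_i, x>)), B = I
    w = np.exp(A @ x - np.max(A @ x))
    w /= w.sum()
    g = A.T @ w
    return A.T @ (w[:, None] * A) - np.outer(g, g)

def empirical_stats(Quad):
    """Sample-based proxies for mu, L and H_f(0) of f = log-sum-exp + 0.5 x^T Quad x."""
    pts = [rng.standard_normal(n) for _ in range(200)]
    hs = [hess_lse(x) + Quad for x in pts]
    eigs = np.concatenate([np.linalg.eigvalsh(h) for h in hs])
    H0 = max(np.max(np.abs(np.linalg.eigvalsh(h1 - h2))) for h1 in hs[:20] for h2 in hs[:20])
    return eigs.min(), eigs.max(), H0

for Quad, name in [(1e-2 * np.eye(n), "well-scaled quadratic"),
                   (np.diag([1e4, 1.0, 1.0, 1.0]), "ill-conditioned quadratic")]:
    mu, L, H0 = empirical_stats(Quad)
    print(f"{name}: L/mu ~ {L / mu:.1e},  H_f(0) ~ {H0:.2f},  gamma_f(0) proxy mu/H_f(0) ~ {mu / H0:.3f}")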

Note that in the recent paper [11], a linear rate of convergence was also proved for an accelerated second-order scheme, with the complexity bound:

$O\Big(\max\big\{(\gamma_f(\nu))^{-\frac{1}{2+\nu}}, 1\big\}\cdot\log\frac{H_f(\nu)D_0^{2+\nu}}{\varepsilon}\Big).$  (51)

This is a better rate than (50). However, the method requires knowledge of the parameter $\nu$ and of the constant of uniform convexity. Thus, one theoretical question remains open: is it possible to construct a universal second-order scheme matching (51) in the uniformly convex case?

Looking at the definitions of $H_f(\nu)$ and $\sigma_f(2+\nu)$, we can see that, for all $x, y \in \mathrm{dom}\, F$, $x \ne y$,

$\sigma_f(2+\nu) \le \frac{\langle \nabla f(x) - \nabla f(y), x - y\rangle}{\|x - y\|^{2+\nu}}, \qquad \frac{1}{H_f(\nu)} \le \frac{\|x - y\|^{\nu}}{\|\nabla^2 f(x) - \nabla^2 f(y)\|},$

and

$\gamma_f(\nu) := \frac{\sigma_f(2+\nu)}{H_f(\nu)} \le \frac{\langle \nabla f(x) - \nabla f(y), x - y\rangle}{\|\nabla^2 f(x) - \nabla^2 f(y)\|\cdot\|x - y\|^2}.$

The last fraction does not depend on any particular $\nu$. So, for any twice-differentiable convex function, we can define the following number:

$\gamma_f := \inf_{x, y \in \mathrm{dom}\, F,\ x \ne y} \frac{\langle \nabla f(x) - \nabla f(y), x - y\rangle}{\|\nabla^2 f(x) - \nabla^2 f(y)\|\cdot\|x - y\|^2}.$

If it is positive, then it could serve as an indicator of second-order nondegeneracy, for which we have the lower bound $\gamma_f \ge \gamma_f(\nu)$, $\nu \in [0,1]$.

Conclusions

In this work, we have introduced the second-order condition number of a certain degree, which serves as the main complexity factor for solving uniformly convex minimization problems with Hölder continuous Hessian of the objective by second-order optimization schemes.

We have proved that the cubically regularized Newton method with an adaptive estimation of the regularization parameter achieves a global linear rate of convergence on this class of functions. The algorithm does not require knowledge of any parameters of the problem class and automatically adjusts to the best possible degree of nondegeneracy.

Using this technique, we have justified that the global iteration complexity of the cubic Newton method is always better than the corresponding complexity of the gradient method for the standard class of strongly convex functions with uniformly bounded second derivative.

Data Availability

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

Acknowledgements

The research results of this paper were obtained with support of ERC Advanced Grant 788368.

Footnotes

1. For the integer values of $p$, this inequality was proved in [14].

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Nikita Doikov, Email: nikita.doikov@uclouvain.be.

Yurii Nesterov, Email: yurii.nesterov@uclouvain.be.

References

  • 1. Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., Ma, T.: Finding approximate local minima faster than gradient descent. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199. ACM (2017)
  • 2. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
  • 3. Carmon, Y., Duchi, J.C.: Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv:1612.00547 (2016)
  • 4. Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Math. Program. 130(2), 295–319 (2011). doi: 10.1007/s10107-009-0337-y
  • 5. Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. 127(2), 245–295 (2011). doi: 10.1007/s10107-009-0286-5
  • 6. Cartis, C., Gould, N.I., Toint, P.L.: Evaluation complexity of adaptive cubic regularization methods for convex unconstrained optimization. Optim. Methods Softw. 27(2), 197–219 (2012). doi: 10.1080/10556788.2011.602076
  • 7. Cartis, C., Scheinberg, K.: Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math. Program. 169(2), 337–375 (2018). doi: 10.1007/s10107-017-1137-4
  • 8. Doikov, N., Richtárik, P.: Randomized block cubic Newton method. In: International Conference on Machine Learning, pp. 1289–1297 (2018)
  • 9. Ghadimi, S., Liu, H., Zhang, T.: Second-order methods with cubic regularization under inexact information. arXiv:1710.05782 (2017)
  • 10. Grapiglia, G.N., Nesterov, Y.: Regularized Newton methods for minimizing functions with Hölder continuous Hessians. SIAM J. Optim. 27(1), 478–506 (2017). doi: 10.1137/16M1087801
  • 11. Grapiglia, G.N., Nesterov, Y.: Accelerated regularized Newton methods for minimizing composite convex functions. SIAM J. Optim. 29(1), 77–99 (2019). doi: 10.1137/17M1142077
  • 12. Kohler, J.M., Lucchi, A.: Sub-sampled cubic regularization for non-convex optimization. In: International Conference on Machine Learning, pp. 1895–1904 (2017)
  • 13. Nesterov, Y.: Modified Gauss–Newton scheme with worst case guarantees for global performance. Optim. Methods Softw. 22(3), 469–483 (2007). doi: 10.1080/08927020600643812
  • 14. Nesterov, Y.: Accelerating the cubic regularization of Newton’s method on convex problems. Math. Program. 112(1), 159–181 (2008). doi: 10.1007/s10107-006-0089-x
  • 15. Nesterov, Y.: Lectures on Convex Optimization. Springer, Berlin (2018)
  • 16. Nesterov, Y.: Implementable tensor methods in unconstrained convex optimization. Math. Program., pp. 1–27 (2019)
  • 17. Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton’s method and its global performance. Math. Program. 108(1), 177–205 (2006). doi: 10.1007/s10107-006-0706-8
  • 18. Tripuraneni, N., Stern, M., Jin, C., Regier, J., Jordan, M.I.: Stochastic cubic regularization for fast nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 2899–2908 (2018)
