Journal of Optimization Theory and Applications. 2021 Mar 10;189(1):317–339. doi: 10.1007/s10957-021-01838-7

Minimizing Uniformly Convex Functions by Cubic Regularization of Newton Method

Nikita Doikov 1, Yurii Nesterov 2

Abstract

In this paper, we study the iteration complexity of the cubic regularization of the Newton method for solving composite minimization problems with a uniformly convex objective. We introduce the notion of the second-order condition number of a certain degree and justify the linear rate of convergence in a nondegenerate case for the method with an adaptive estimate of the regularization parameter. The algorithm automatically achieves the best possible global complexity bound among different problem classes of uniformly convex objective functions with Hölder continuous Hessian of the smooth part of the objective. As a byproduct of our developments, we justify an intuitively plausible result that the global iteration complexity of the Newton method is always better than that of the gradient method on the class of strongly convex functions with uniformly bounded second derivative.

Keywords: Newton method, Cubic regularization, Global complexity bounds, Strong convexity, Uniform convexity

Introduction

A big step in second-order optimization theory is related to the global complexity guarantees that were justified in [17] for the cubic regularization of the Newton method. Subsequent results provide a good perspective for the development of this approach, discovering accelerated [14], adaptive [4, 5] and universal [10] schemes. The latter methods can automatically adjust to the smoothness properties of the particular objective function. In the same vein, second-order algorithms for solving systems of nonlinear equations were developed in [13], and randomized variants for solving large-scale optimization problems were proposed in [7–9, 12, 18].

Despite a number of nice properties, the global complexity bounds of the cubically regularized Newton method for the cases of strongly convex and uniformly convex objectives are still not fully investigated, as well as the notion of second-order nondegeneracy (see the discussion in Sect. 5 of [14]). We are going to address this issue in the current paper.

The rest of the paper is organized as follows. Section 2 contains all necessary definitions and main properties of the classes of uniformly convex functions and twice-differentiable functions with Hölder continuous Hessian. We introduce the notion of the condition number $\gamma_f(\nu)$ of a certain degree $\nu \in [0,1]$ and present some basic examples.

In Sect. 3, we describe a general regularized Newton scheme and show the linear rate of convergence for this method on the class of uniformly convex functions with a known degree $\nu \in [0,1]$ of nondegeneracy. Then, we introduce the adaptive cubically regularized Newton method and collect useful inequalities and properties related to this algorithm.

In Sect. 4, we study the global iteration complexity of the cubically regularized Newton method on the classes of uniformly convex functions with Hölder continuous Hessian. We show that for nondegeneracy of any degree $\nu \in [0,1]$, which is formalized by the condition $\gamma_f(\nu) > 0$, the algorithm automatically achieves the linear rate of convergence with the value $\gamma_f(\nu)$ being the main complexity factor.

Finally, in Sect. 5 we compare our complexity bounds with the known bounds for other methods and discuss the results. In particular, we justify an intuitively plausible (but long-delayed) result that the global complexity of the cubically regularized Newton method is always better than that of the gradient method on the class of strongly convex functions with uniformly bounded second derivative.

Uniformly Convex Functions with Hölder Continuous Hessian

Let us start with some notation. In what follows, we denote by $\mathbb{E}$ a finite-dimensional real vector space and by $\mathbb{E}^*$ its dual space, which is the space of linear functions on $\mathbb{E}$. The value of a function $s \in \mathbb{E}^*$ at a point $x \in \mathbb{E}$ is denoted by $\langle s, x\rangle$. Let us fix some linear self-adjoint positive-definite operator $B: \mathbb{E} \to \mathbb{E}^*$ and introduce the following Euclidean norms in the primal and dual spaces:

$\|x\| := \langle Bx, x\rangle^{1/2}, \ x \in \mathbb{E}, \qquad \|s\|_* := \langle s, B^{-1}s\rangle^{1/2}, \ s \in \mathbb{E}^*.$

For any linear operator $A: \mathbb{E} \to \mathbb{E}^*$, its norm is induced in the standard way:

$\|A\| := \max_{x \in \mathbb{E}}\{\|Ax\|_* : \|x\| \le 1\}.$
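For readers who want to experiment with these definitions numerically, here is a small sketch (our addition, not part of the paper; the matrix B and the test data below are arbitrary placeholders) that evaluates the primal norm, the dual norm and the induced operator norm with NumPy:

import numpy as np

# Primal norm ||x|| = <Bx, x>^{1/2}, dual norm ||s||_* = <s, B^{-1} s>^{1/2},
# and induced operator norm ||A|| = max{ ||Ax||_* : ||x|| <= 1 } for a
# self-adjoint operator A.  B is any symmetric positive-definite matrix.

def primal_norm(B, x):
    return np.sqrt(x @ (B @ x))

def dual_norm(B, s):
    return np.sqrt(s @ np.linalg.solve(B, s))

def operator_norm(B, A):
    # For self-adjoint A, ||A|| equals the largest absolute eigenvalue of
    # L^{-1} A L^{-T}, where B = L L^T is the Cholesky factorization.
    L = np.linalg.cholesky(B)
    M = np.linalg.solve(L, np.linalg.solve(L, A).T)   # L^{-1} A L^{-T}
    return np.max(np.abs(np.linalg.eigvalsh(M)))

rng = np.random.default_rng(0)
n = 5
C = rng.standard_normal((n, n))
B = C @ C.T + np.eye(n)        # positive-definite placeholder
x, s = rng.standard_normal(n), rng.standard_normal(n)
A = C + C.T                    # symmetric placeholder operator
print(primal_norm(B, x), dual_norm(B, s), operator_norm(B, A))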

Our goal is to solve the convex optimization problem in the composite form:

$\min_{x \in \mathrm{dom}\, F} \big[ F(x) := f(x) + h(x) \big],$  (1)

where $f$ is a uniformly convex function, twice differentiable on its open domain, and $h$ is a simple closed convex function with $\mathrm{dom}\, h \subseteq \mathrm{dom}\, f$. Simple means that all auxiliary subproblems with an explicit presence of $h$ are easily solvable.

For a smooth function $f$, its gradient at a point $x$ is denoted by $\nabla f(x) \in \mathbb{E}^*$, and its Hessian is denoted by $\nabla^2 f(x): \mathbb{E} \to \mathbb{E}^*$. For a convex but not necessarily differentiable function $h$, we denote by $\partial h(x) \subseteq \mathbb{E}^*$ its subdifferential at the point $x \in \mathrm{dom}\, h$.

We say that a differentiable function $f$ is uniformly convex of degree $p \ge 2$ on a convex set $C \subseteq \mathrm{dom}\, f$ if for some constant $\sigma > 0$ it satisfies the inequality

$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \frac{\sigma}{p}\|y - x\|^p, \quad \forall x, y \in C.$  (2)

Uniformly convex functions of degree $p = 2$ are known as strongly convex. If inequality (2) holds with $\sigma = 0$, the function $f$ is called just convex. The following convenient condition is sufficient for a function $f$ to be uniformly convex on a convex set $C \subseteq \mathrm{dom}\, f$:

Lemma 2.1

(Lemma 1 in [14]) Let for some $\sigma > 0$ and $p \ge 2$ the following inequality hold:

$\langle \nabla f(x) - \nabla f(y), x - y\rangle \ge \sigma \|x - y\|^p, \quad \forall x, y \in C.$  (3)

Then, the function $f$ is uniformly convex of degree $p$ on the set $C$ with parameter $\sigma$.

From now on, we assume $C := \mathrm{dom}\, F \subseteq \mathrm{dom}\, f$. By the composite representation (1), we have for every $x \in \mathrm{dom}\, F$ and for all $F'(x) \in \partial F(x)$:

$F(y) \ge F(x) + \langle F'(x), y - x\rangle + \frac{\sigma}{p}\|x - y\|^p, \quad \forall y \in \mathrm{dom}\, F.$  (4)

Therefore, if $\sigma > 0$, then there can be only one point $x^* \in \mathrm{dom}\, F$ with $F(x^*) = F^*$, and such a point always exists since $F$ is uniformly convex and closed. A useful consequence of uniform convexity is the following upper bound for the residual.

Lemma 2.2

Let $f$ be uniformly convex of degree $p \ge 2$ with constant $\sigma > 0$ on the set $\mathrm{dom}\, F$. Then, for every $x \in \mathrm{dom}\, F$ and for all $F'(x) \in \partial F(x)$ we have

$F(x) - F^* \le \frac{p-1}{p}\Big(\frac{1}{\sigma}\Big)^{\frac{1}{p-1}}\|F'(x)\|_*^{\frac{p}{p-1}}.$  (5)

Proof

In view of (4), bound (5) follows as in the proof of Lemma 3 in [14].

It is reasonable to define the best possible constant σ in inequality (3) for a certain degree p. This leads us to a system of constants:

$\sigma_f(p) := \inf_{x, y \in \mathrm{dom}\, F,\ x \ne y} \frac{\langle \nabla f(x) - \nabla f(y), x - y\rangle}{\|x - y\|^p}, \quad p \ge 2.$  (6)

We prefer to use inequality (3) for the definition of σf(p), instead of (2), because of its symmetry in x and y. Note that the value σf(p) also depends on the domain of F. However, we omit this dependence in our notation since it is always clear from the context.

It is easy to see that the univariate function $\sigma_f(\cdot)$ is log-concave. Thus, for all $p_2 > p_1 \ge 2$ we have:

$\sigma_f(p) \ge (\sigma_f(p_1))^{\frac{p_2 - p}{p_2 - p_1}} \cdot (\sigma_f(p_2))^{\frac{p - p_1}{p_2 - p_1}}, \quad p \in [p_1, p_2].$  (7)

For a twice-differentiable function $f$, we say that it has Hölder continuous Hessian of degree $\nu \in [0,1]$ on a convex set $C \subseteq \mathrm{dom}\, f$ if for some constant $H \ge 0$ it holds:

$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le H \|x - y\|^{\nu}, \quad \forall x, y \in C.$  (8)

Two simple consequences of (8) are as follows:

$\|\nabla f(y) - \nabla f(x) - \nabla^2 f(x)(y - x)\|_* \le \frac{H\|x - y\|^{1+\nu}}{1+\nu},$  (9)
$|f(y) - Q(x; y)| \le \frac{H\|x - y\|^{2+\nu}}{(1+\nu)(2+\nu)},$  (10)

where $Q(x; y)$ is the quadratic model of $f$ at the point $x$:

$Q(x; y) := f(x) + \langle \nabla f(x), y - x\rangle + \tfrac{1}{2}\langle \nabla^2 f(x)(y - x), y - x\rangle.$
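As a small numerical sanity check (our own addition, not taken from the paper), the following sketch verifies bound (10) on random points for the cubed Euclidean norm $f(x) = \frac{1}{3}\|x\|^3$ with $B = I$, for which $H_f(1) \le 2$ by Example 2.2 below; the dimension and sample size are arbitrary choices:

import numpy as np

def f(x):                      # f(x) = ||x||^3 / 3; H_f(1) <= 2 (Example 2.2 with nu = 1)
    return np.linalg.norm(x) ** 3 / 3

def grad(x):
    return np.linalg.norm(x) * x

def hess(x):
    r = np.linalg.norm(x)
    return r * np.eye(len(x)) + np.outer(x, x) / r

def Q(x, y):                   # quadratic model of f at x, evaluated at y
    d = y - x
    return f(x) + grad(x) @ d + 0.5 * d @ (hess(x) @ d)

rng = np.random.default_rng(1)
H, nu = 2.0, 1.0
for _ in range(1000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    lhs = abs(f(y) - Q(x, y))
    rhs = H * np.linalg.norm(x - y) ** (2 + nu) / ((1 + nu) * (2 + nu))
    assert lhs <= rhs + 1e-9   # this is exactly inequality (10)
print("bound (10) verified on all sampled pairs")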

In order to characterize the level of smoothness of the function $f$ on the set $C := \mathrm{dom}\, F$, let us define the system of Hölder constants (see [10]):

$H_f(\nu) := \sup_{x, y \in \mathrm{dom}\, F,\ x \ne y} \frac{\|\nabla^2 f(x) - \nabla^2 f(y)\|}{\|x - y\|^{\nu}}, \quad \nu \in [0,1].$  (11)

We allow $H_f(\nu)$ to be equal to $+\infty$ for some $\nu$. Note that the function $H_f(\cdot)$ is log-convex. Thus, any $0 \le \nu_1 < \nu_2 \le 1$ such that $H_f(\nu_i) < +\infty$, $i = 1, 2$, provide us with the following upper bounds for the whole interval:

$H_f(\nu) \le (H_f(\nu_1))^{\frac{\nu_2 - \nu}{\nu_2 - \nu_1}} \cdot (H_f(\nu_2))^{\frac{\nu - \nu_1}{\nu_2 - \nu_1}}, \quad \nu \in [\nu_1, \nu_2].$  (12)

If for some specific $\nu \in [0,1]$ we have $H_f(\nu) = 0$, this implies that $\nabla^2 f(x) = \nabla^2 f(y)$ for all $x, y \in \mathrm{dom}\, F$. In this case, the restriction of $f$ to $\mathrm{dom}\, F$ is a quadratic function, and we conclude that $H_f(\nu) = 0$ for all $\nu \in [0,1]$. At the same time, having two points $x, y \in \mathrm{dom}\, F$ with $0 < \|x - y\| \le 1$, we get a simple uniform lower bound for all constants $H_f(\nu)$:

$H_f(\nu) \ge \|\nabla^2 f(x) - \nabla^2 f(y)\|, \quad \nu \in [0,1].$

Let us give an example of a function that has Hölder continuous Hessian for all $\nu \in [0,1]$.

Example 2.1

For given $a_i \in \mathbb{E}^*$, $1 \le i \le m$, consider the following convex function:

$f(x) = \ln\Big(\sum_{i=1}^m e^{\langle a_i, x\rangle}\Big), \quad x \in \mathbb{E}.$

Let us fix the Euclidean norm $\|x\| = \langle Bx, x\rangle^{1/2}$, $x \in \mathbb{E}$, with the operator $B := \sum_{i=1}^m a_i a_i^*$. Without loss of generality, we assume that $B \succ 0$ (otherwise we can reduce the dimension of the problem). Then,

$H_f(0) \le 1, \qquad H_f(1) \le 2.$

Therefore, by (12) we get, for any $\nu \in [0,1]$:

$H_f(\nu) \le 2^{\nu}.$

Proof

Denote $\kappa(x) := \sum_{i=1}^m e^{\langle a_i, x\rangle}$. Let us fix arbitrary $x, y \in \mathbb{E}$ and a direction $h \in \mathbb{E}$. Then, a straightforward computation gives:

$\langle \nabla f(x), h\rangle = \frac{1}{\kappa(x)}\sum_{i=1}^m e^{\langle a_i, x\rangle}\langle a_i, h\rangle,$
$\langle \nabla^2 f(x)h, h\rangle = \frac{1}{\kappa(x)}\sum_{i=1}^m e^{\langle a_i, x\rangle}\langle a_i, h\rangle^2 - \Big(\frac{1}{\kappa(x)}\sum_{i=1}^m e^{\langle a_i, x\rangle}\langle a_i, h\rangle\Big)^2 = \frac{1}{\kappa(x)}\sum_{i=1}^m e^{\langle a_i, x\rangle}\big(\langle a_i, h\rangle - \langle \nabla f(x), h\rangle\big)^2 \ge 0.$

Hence, we get

$\|\nabla^2 f(x)\| = \max_{\|h\| \le 1}\langle \nabla^2 f(x)h, h\rangle \le \max_{\|h\| \le 1}\sum_{i=1}^m \langle a_i, h\rangle^2 = \max_{\|h\| \le 1}\|h\|^2 = 1.$

Since all Hessians of the function $f$ are positive semidefinite, we conclude that $H_f(0) \le 1$. The inequality $H_f(1) \le 2$ can be easily obtained from the following representation of the third derivative:

$D^3 f(x)[h, h, h] = \frac{1}{\kappa(x)}\sum_{i=1}^m e^{\langle a_i, x\rangle}\big(\langle a_i, h\rangle - \langle \nabla f(x), h\rangle\big)^3 \le \langle \nabla^2 f(x)h, h\rangle \cdot \max_{1 \le i, j \le m}\langle a_i - a_j, h\rangle \le 2\|h\|^3.$
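A quick empirical check of Example 2.1 (our addition; the vectors $a_i$ below are random placeholders) can be done by sampling points and comparing the operator norm of Hessian differences, measured with respect to $B$, against the bound $2^{\nu}\|x - y\|^{\nu}$:

import numpy as np

rng = np.random.default_rng(2)
m, n = 6, 4
A = rng.standard_normal((m, n))          # rows play the role of the vectors a_i
B = A.T @ A                              # B = sum_i a_i a_i^*; full rank for random A
L = np.linalg.cholesky(B)

def hess(x):                             # Hessian of f(x) = ln(sum_i exp(<a_i, x>))
    w = np.exp(A @ x - np.max(A @ x))
    w /= w.sum()
    g = A.T @ w
    return A.T @ (w[:, None] * A) - np.outer(g, g)

def op_norm(M):                          # operator norm w.r.t. the B-norm
    return np.max(np.abs(np.linalg.eigvalsh(np.linalg.solve(L, np.linalg.solve(L, M).T))))

def b_norm(x):
    return np.sqrt(x @ (B @ x))

for nu in (0.0, 0.5, 1.0):
    worst = max(op_norm(hess(x) - hess(y)) / b_norm(x - y) ** nu
                for x, y in (rng.standard_normal((2, n)) for _ in range(300)))
    print(f"nu = {nu}: empirical ratio {worst:.3f} <= theoretical bound {2 ** nu:.3f}")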

Let us imagine now that we want to describe the iteration complexity of some method which solves the composite optimization problem (1) up to an absolute accuracy $\varepsilon > 0$ in the function value. We assume that the smooth part $f$ of its objective is uniformly convex and has Hölder continuous Hessian. Which degrees $p$ and $\nu$ should be used in our analysis? Suppose that, for the number of calls of the oracle, we are interested in obtaining a polynomial-time bound of the form:

$O\Big((H_f(\nu))^{\alpha} \cdot (\sigma_f(p))^{\beta} \cdot \log\frac{F(x_0) - F^*}{\varepsilon}\Big), \quad \alpha, \beta \ne 0.$

Denote by $[x]$ the physical dimension of the variable $x \in \mathbb{E}$, and by $[f]$ the physical dimension of the value $f(x)$. Then, we have $[\nabla f(x)] = [f]/[x]$ and $[\nabla^2 f(x)] = [f]/[x]^2$. This gives us

$[H_f(\nu)] = \frac{[f]}{[x]^{2+\nu}}, \qquad [\sigma_f(p)] = \frac{[f]}{[x]^{p}}, \qquad \big[(H_f(\nu))^{\alpha} \cdot (\sigma_f(p))^{\beta}\big] = \frac{[f]^{\alpha+\beta}}{[x]^{\alpha(2+\nu)+\beta p}}.$

While $x$ and $f(x)$ can be measured in arbitrary physical quantities, the value “number of iterations” cannot have a physical dimension. This leads to the following relations:

$\alpha + \beta = 0 \qquad \text{and} \qquad \alpha(2+\nu) + \beta p = 0.$

Therefore, despite the fact that our function can belong to several problem classes simultaneously, from the physical point of view only one option is available:

$p = 2 + \nu.$

Hence, for a twice-differentiable convex function $f$ with $\inf_{\nu \in [0,1]} H_f(\nu) > 0$, we can define only one meaningful condition number of degree $\nu \in [0,1]$:

$\gamma_f(\nu) := \frac{\sigma_f(2+\nu)}{H_f(\nu)}.$  (13)

If for some particular $\nu$ we have $H_f(\nu) = +\infty$, then by our definition $\gamma_f(\nu) = 0$.
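For instance, anticipating Example 2.2 below, for the cubed norm $f(x) = \frac{1}{3}\|x\|^3$ (so $p = 3$ and $\nu = 1$) the quantities entering (13) are explicit:

$\sigma_f(3) = 2^{2-3} = \tfrac{1}{2}, \qquad H_f(1) \le (1+\nu)2^{1-\nu}\big|_{\nu=1} = 2, \qquad\Longrightarrow\qquad \gamma_f(1) = \frac{\sigma_f(3)}{H_f(1)} \ge \frac{1/2}{2} = \frac{1}{4} = \frac{1}{2(1+\nu)}\Big|_{\nu=1}.$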

It will be shown that the condition number $\gamma_f(\nu)$ serves as the main factor in the global iteration complexity bounds for the regularized Newton method as applied to problem (1). Let us prove that this number cannot be too big.

Lemma 2.3

Let $\inf_{\nu \in [0,1]} H_f(\nu) > 0$, so that the condition number $\gamma_f(\cdot)$ is well defined. Then,

$\gamma_f(\nu) \le \frac{1}{1+\nu} + \inf_{x, y \in \mathrm{dom}\, F}\frac{\|\nabla^2 f(x)\|}{\|\nabla^2 f(y) - \nabla^2 f(x)\|}, \quad \nu \in [0,1].$  (14)

In the case when $\mathrm{dom}\, F$ is unbounded, i.e. $\sup_{x \in \mathrm{dom}\, F}\|x\| = +\infty$, we have

$\gamma_f(\nu) \le \frac{1}{1+\nu}, \quad \nu \in (0,1].$  (15)

Proof

Indeed, for any $x, y \in \mathrm{dom}\, F$, $x \ne y$, we have:

$\sigma_f(2+\nu) \overset{(6)}{\le} \frac{\langle \nabla f(y) - \nabla f(x), y - x\rangle}{\|y - x\|^{2+\nu}} = \frac{\langle \nabla f(y) - \nabla f(x) - \nabla^2 f(x)(y - x), y - x\rangle}{\|y - x\|^{2+\nu}} + \frac{\langle \nabla^2 f(x)(y - x), y - x\rangle}{\|y - x\|^{2+\nu}} \overset{(9)}{\le} \frac{H_f(\nu)}{1+\nu} + \frac{\|\nabla^2 f(x)\|}{\|y - x\|^{\nu}}.$

Now, dividing both sides of this inequality by $H_f(\nu)$, we get inequality (14) from the definition (11) of $H_f(\nu)$. Inequality (15) can be obtained by taking the limit $\|y\| \to +\infty$.

From inequalities (7) and (12), we can get the following lower bound:

$\gamma_f(\nu) \ge (\gamma_f(\nu_1))^{\frac{\nu_2 - \nu}{\nu_2 - \nu_1}} \cdot (\gamma_f(\nu_2))^{\frac{\nu - \nu_1}{\nu_2 - \nu_1}}, \quad \nu \in [\nu_1, \nu_2],$

where $0 \le \nu_1 < \nu_2 \le 1$. However, it turns out that in the unbounded case we can have a nonzero condition number $\gamma_f(\nu)$ only for a single degree.

Lemma 2.4

Let $\mathrm{dom}\, F$ be unbounded: $\sup_{x \in \mathrm{dom}\, F}\|x\| = +\infty$. Assume that for a fixed $\nu \in [0,1]$ we have $\gamma_f(\nu) > 0$. Then,

$\gamma_f(\alpha) = 0 \quad \text{for all } \alpha \in [0,1] \setminus \{\nu\}.$

Proof

Consider first the case $\alpha > \nu$. From the condition $\gamma_f(\nu) > 0$, we conclude that $H_f(\nu) < +\infty$. Then, for any $x, y \in \mathrm{dom}\, F$ we have:

$\frac{\sigma_f(2+\alpha)\|y - x\|^{2+\alpha}}{2+\alpha} \overset{(2)}{\le} f(y) - f(x) - \langle \nabla f(x), y - x\rangle \overset{(10)}{\le} \tfrac{1}{2}\langle \nabla^2 f(x)(y - x), y - x\rangle + \frac{H_f(\nu)\|y - x\|^{2+\nu}}{(1+\nu)(2+\nu)}.$

Dividing both sides of this inequality by $\|y - x\|^{2+\alpha}$ and letting $\|y - x\| \to +\infty$, we get $\sigma_f(2+\alpha) = 0$. Therefore, $\gamma_f(\alpha) = 0$. For the second case, $\alpha < \nu$, we cannot have $\gamma_f(\alpha) > 0$, since the previous reasoning would then give $\gamma_f(\nu) = 0$.

Let us look now at an important example of a uniformly convex function with Hölder continuous Hessian. It is convenient to start with some properties of powers of the Euclidean norm.

Lemma 2.5

For a fixed real $p \ge 1$, consider the following function:

$f_p(x) = \frac{1}{p}\|x\|^p, \quad x \in \mathbb{E}.$

1. For $p \ge 2$, the function $f_p(\cdot)$ is uniformly convex of degree $p$ (see footnote 1):

$\langle \nabla f_p(x) - \nabla f_p(y), x - y\rangle \ge 2^{2-p}\|x - y\|^p, \quad \forall x, y \in \mathbb{E}.$  (16)

2. If $1 \le p \le 2$, then the function $f_p(\cdot)$ has $\nu$-Hölder continuous gradient with $\nu = p - 1$:

$\|\nabla f_p(x) - \nabla f_p(y)\|_* \le 2^{1-\nu}\|x - y\|^{\nu}, \quad \forall x, y \in \mathbb{E}.$  (17)

Proof

Firstly, recall two useful inequalities, which are valid for all $a, b \ge 0$:

$|a^{\alpha} - b^{\alpha}| \le |a - b|^{\alpha}, \quad \text{when } 0 \le \alpha \le 1,$  (18)
$|a^{\alpha} - b^{\alpha}| \ge |a - b|^{\alpha}, \quad \text{when } \alpha \ge 1.$  (19)

Let us fix arbitrary $x, y \in \mathbb{E}$. The left-hand side of inequality (16) equals

$\langle \|x\|^{p-2}Bx - \|y\|^{p-2}By, x - y\rangle = \|x\|^p + \|y\|^p - \langle Bx, y\rangle\big(\|x\|^{p-2} + \|y\|^{p-2}\big),$

and we need to verify that it is not smaller than $2^{2-p}\big[\|x\|^2 + \|y\|^2 - 2\langle Bx, y\rangle\big]^{p/2}$. The case $x = 0$ or $y = 0$ is trivial. Therefore, assume $x \ne 0$ and $y \ne 0$. Denoting $\tau := \frac{\|y\|}{\|x\|}$ and $r := \frac{\langle Bx, y\rangle}{\|x\|\cdot\|y\|}$, we have the following statement to prove:

$1 + \tau^p \ge r\tau(1 + \tau^{p-2}) + 2^{2-p}\big[1 + \tau^2 - 2r\tau\big]^{p/2}, \quad \tau > 0, \ |r| \le 1.$

Since the function in the right-hand side is convex in $r$, we need to check only two marginal cases:

  1. $r = 1$: $1 + \tau^p \ge \tau(1 + \tau^{p-2}) + 2^{2-p}|1 - \tau|^p$, which is equivalent to $(1 - \tau)(1 - \tau^{p-1}) \ge 2^{2-p}|1 - \tau|^p$. This is true by (19).

  2. $r = -1$: $1 + \tau^p \ge -\tau(1 + \tau^{p-2}) + 2^{2-p}(1 + \tau)^p$, which is equivalent to $1 + \tau^{p-1} \ge 2^{2-p}(1 + \tau)^{p-1}$. This is true in view of the convexity of the function $\tau^{p-1}$ for $\tau \ge 0$.

Thus, we have proved (16). Let us prove the second statement. Consider the function $\hat{f}_q(s) = \frac{1}{q}\|s\|_*^q$, $s \in \mathbb{E}^*$, with $q = \frac{p}{p-1} \ge 2$. In view of our first statement, we have:

$\langle s_1 - s_2, \nabla \hat{f}_q(s_1) - \nabla \hat{f}_q(s_2)\rangle \ge \frac{1}{2^{q-2}}\|s_1 - s_2\|_*^q, \quad \forall s_1, s_2 \in \mathbb{E}^*.$  (20)

For arbitrary $x_1, x_2 \in \mathbb{E}$, define $s_i = \nabla f_p(x_i) = \frac{Bx_i}{\|x_i\|^{2-p}}$, $i = 1, 2$. Then $\|s_i\|_* = \|x_i\|^{p-1}$, and consequently,

$x_i = \|x_i\|^{2-p}B^{-1}s_i = \|s_i\|_*^{\frac{2-p}{p-1}}B^{-1}s_i = \nabla \hat{f}_q(s_i).$

Therefore, substituting these vectors into (20), we get

$\frac{1}{2^{q-2}}\|\nabla f_p(x_1) - \nabla f_p(x_2)\|_*^q \le \langle \nabla f_p(x_1) - \nabla f_p(x_2), x_1 - x_2\rangle.$

Thus, $\|\nabla f_p(x_1) - \nabla f_p(x_2)\|_* \le 2^{\frac{q-2}{q-1}}\|x_1 - x_2\|^{\frac{1}{q-1}}$. It remains to note that $\frac{1}{q-1} = p - 1 = \nu$.
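These two inequalities are easy to test numerically; the following sketch (our addition, with $B = I$ and random placeholder points) checks (16) for one value $p \ge 2$ and (17) for one value $1 \le p \le 2$:

import numpy as np

rng = np.random.default_rng(3)

def grad_fp(x, p):                 # gradient of f_p(x) = ||x||^p / p with B = I
    return np.linalg.norm(x) ** (p - 2) * x

for _ in range(2000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    d = x - y
    p = 3.5                        # (16): uniform convexity of degree p, p >= 2
    assert (grad_fp(x, p) - grad_fp(y, p)) @ d >= 2 ** (2 - p) * np.linalg.norm(d) ** p - 1e-9
    p = 1.5                        # (17): Hoelder continuous gradient, nu = p - 1
    nu = p - 1
    assert np.linalg.norm(grad_fp(x, p) - grad_fp(y, p)) <= 2 ** (1 - nu) * np.linalg.norm(d) ** nu + 1e-9
print("inequalities (16) and (17) hold on all sampled pairs")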

Example 2.2

For a real $p \ge 2$ and arbitrary $x_0 \in \mathbb{E}$, consider the following function:

$f(x) = \frac{1}{p}\|x - x_0\|^p = f_p(x - x_0), \quad x \in \mathbb{E}.$

Then, $\sigma_f(p) = \frac{1}{2^{p-2}}$. Moreover, if $p = 2 + \nu$ for some $\nu \in (0,1]$, then it holds

$H_f(\nu) \le (1+\nu)2^{1-\nu},$

and $H_f(\alpha) = +\infty$ for all $\alpha \in [0,1] \setminus \{\nu\}$. Therefore, in this case we have $\gamma_f(\nu) \ge \frac{1}{2(1+\nu)}$, and $\gamma_f(\alpha) = 0$ for all $\alpha \in [0,1] \setminus \{\nu\}$.

Proof

Without loss of generality, assume $x_0 = 0$ (both $\sigma_f(\cdot)$ and $H_f(\cdot)$ are invariant under translation of the argument). Let us take an arbitrary $x \ne 0$ and set $y := -x$. Then,

$\langle \nabla f(x) - \nabla f(y), x - y\rangle = \langle \|x\|^{p-2}Bx + \|x\|^{p-2}Bx, 2x\rangle = 4\|x\|^p.$

On the other hand, $\|y - x\|^p = 2^p\|x\|^p$. Therefore, $\sigma_f(p) \overset{(6)}{\le} 2^{2-p}$, and (16) tells us that this inequality is satisfied as an equality.

Let us prove now that $H_f(\nu) \le (1+\nu)2^{1-\nu}$ for $p = 2 + \nu$ with some $\nu \in (0,1]$. This is

$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le (1+\nu)2^{1-\nu}\|x - y\|^{\nu}, \quad \forall x, y \in \mathbb{E}.$  (21)

The corresponding Hessians can be represented as follows:

$\nabla^2 f(x) = \|x\|^{\nu}B + \nu\,\frac{Bx(Bx)^*}{\|x\|^{2-\nu}}, \quad x \in \mathbb{E}\setminus\{0\}, \qquad \nabla^2 f(0) = 0.$

For the case $x = y = 0$, inequality (21) is trivial. Assume now that $x \ne 0$. If $0 \in [x, y]$, then $y = -\beta x$ for some $\beta \ge 0$ and we have:

$\|\nabla^2 f(x) - \nabla^2 f(-\beta x)\| \le |1 - \beta^{\nu}|(1+\nu)\|x\|^{\nu} \le (1+\beta)^{\nu}(1+\nu)2^{1-\nu}\|x\|^{\nu} = (1+\nu)2^{1-\nu}\|x - y\|^{\nu},$

which is (21). Let now $0 \notin [x, y]$. For an arbitrary fixed direction $h \in \mathbb{E}$, we get:

$|\langle(\nabla^2 f(x) - \nabla^2 f(y))h, h\rangle| = \Big|\big(\|x\|^{\nu} - \|y\|^{\nu}\big)\cdot\|h\|^2 + \nu\cdot\Big(\frac{\langle Bx, h\rangle^2}{\|x\|^{2-\nu}} - \frac{\langle By, h\rangle^2}{\|y\|^{2-\nu}}\Big)\Big|.$

Consider the points $u = \frac{Bx}{\|x\|^{1-\nu}} = \nabla f_q(x)$ and $v = \frac{By}{\|y\|^{1-\nu}} = \nabla f_q(y)$ with $q = 1 + \nu$. Then,

$\|x\|^{\nu} = \|u\|_*, \quad \frac{\langle Bx, h\rangle^2}{\|x\|^{2-\nu}} = \frac{\langle u, h\rangle^2}{\|u\|_*} \qquad \text{and} \qquad \|y\|^{\nu} = \|v\|_*, \quad \frac{\langle By, h\rangle^2}{\|y\|^{2-\nu}} = \frac{\langle v, h\rangle^2}{\|v\|_*}.$

Therefore,

$|\langle(\nabla^2 f(x) - \nabla^2 f(y))h, h\rangle| = \Big|\big(\|u\|_* - \|v\|_*\big)\cdot\|h\|^2 + \nu\cdot\Big(\frac{\langle u, h\rangle^2}{\|u\|_*} - \frac{\langle v, h\rangle^2}{\|v\|_*}\Big)\Big|.$  (22)

Let us estimate the right-hand side of (22) from above. Consider a continuously differentiable univariate function:

$\varphi(\tau) := \|u(\tau)\|_*\cdot\|h\|^2 + \nu\cdot\frac{\langle u(\tau), h\rangle^2}{\|u(\tau)\|_*}, \qquad u(\tau) := u + \tau(v - u), \quad \tau \in [0,1].$

Note that

$\varphi'(\tau) = \frac{\langle u(\tau), B^{-1}(v - u)\rangle}{\|u(\tau)\|_*}\cdot\|h\|^2 + \frac{2\nu\langle u(\tau), h\rangle\langle v - u, h\rangle}{\|u(\tau)\|_*} - \frac{\nu\langle u(\tau), h\rangle^2\langle u(\tau), B^{-1}(v - u)\rangle}{\|u(\tau)\|_*^3} = \frac{\langle u(\tau), B^{-1}(v - u)\rangle}{\|u(\tau)\|_*}\cdot\Big(\|h\|^2 - \frac{\nu\langle u(\tau), h\rangle^2}{\|u(\tau)\|_*^2}\Big) + \frac{2\nu\langle u(\tau), h\rangle\langle v - u, h\rangle}{\|u(\tau)\|_*},$

where the expression in the parentheses is nonnegative. Denote $\gamma := \frac{\langle u(\tau), h\rangle}{\|u(\tau)\|_*\cdot\|h\|} \in [-1, 1]$. Then,

$|\varphi'(\tau)| \le \|v - u\|_*\cdot\|h\|^2\cdot\big(1 - \nu\gamma^2 + 2\nu|\gamma|\big) \le (1+\nu)\cdot\|v - u\|_*\cdot\|h\|^2.$

Thus, we have:

$|\langle(\nabla^2 f(x) - \nabla^2 f(y))h, h\rangle| = |\varphi(1) - \varphi(0)| \le (1+\nu)\cdot\|v - u\|_*\cdot\|h\|^2.$  (23)

It remains to use the definition of $u$ and $v$ and apply inequality (17) with $p = q$. Thus, we have proved that for $p = 2 + \nu$ the Hessian of $f$ is Hölder continuous of degree $\nu$. At the same time, taking $y = 0$, we get $\|\nabla^2 f(x) - \nabla^2 f(y)\| = \|\nabla^2 f(x)\| = (1+\nu)\|x\|^{\nu}$. These values cannot be uniformly bounded in $x \in \mathbb{E}$ by any multiple of $\|x\|^{\alpha}$ with $\alpha \ne \nu$. So, the Hessian of $f$ is not Hölder continuous of any degree different from $\nu$.

Remark 2.1

Inequalities (16) and (17) have the following symmetric consequences:

$p \ge 2 \ \Longrightarrow\ \|\nabla f_p(x) - \nabla f_p(y)\|_* \ge 2^{2-p}\|x - y\|^{p-1}, \qquad p \le 2 \ \Longrightarrow\ \|\nabla f_p(x) - \nabla f_p(y)\|_* \le 2^{2-p}\|x - y\|^{p-1},$

which are valid for all $x, y \in \mathbb{E}$.

Regularized Newton Method

Let us start from the case when we know that for a specific $\nu \in [0,1]$ the function $f$ has Hölder continuous Hessian: $H_f(\nu) < +\infty$. Then, from (10), we have the following global upper bound for the objective function:

$F(y) \le M_{\nu,H}(x; y) := Q(x; y) + \frac{H\|x - y\|^{2+\nu}}{(1+\nu)(2+\nu)} + h(y), \quad \forall x, y \in \mathrm{dom}\, F,$

where $H > 0$ is large enough: $H \ge H_f(\nu)$. Thus, it is natural to employ the minimum of the regularized quadratic model:

$T_{\nu,H}(x) := \arg\min_{y \in \mathrm{dom}\, F} M_{\nu,H}(x; y), \qquad M_{\nu,H}(x) := \min_{y \in \mathrm{dom}\, F} M_{\nu,H}(x; y),$

and define the following general iteration process [10]:

$x_{k+1} := T_{\nu,H_k}(x_k), \quad k \ge 0,$  (24)

where the value $H_k$ is chosen either to be a constant from the interval $[0, 2H_f(\nu)]$ or by some adaptive procedure.

For the class of uniformly convex functions of degree $p = 2 + \nu$, we can justify the following global convergence result for this process.

Theorem 3.1

Assume that for some $\nu \in [0,1]$ we have $0 < H_f(\nu) < +\infty$ and $\sigma_f(2+\nu) > 0$. Let the coefficients $\{H_k\}_{k \ge 0}$ in the process (24) satisfy the following conditions:

$0 \le H_k \le \beta H_f(\nu), \qquad F(x_{k+1}) \le M_{\nu,H_k}(x_k), \quad k \ge 0,$  (25)

with some constant $\beta \ge 0$. Then, for the sequence $\{x_k\}_{k \ge 0}$ generated by the process we have:

$F(x_{k+1}) - F^* \le \Big(1 - \frac{1+\nu}{2+\nu}\cdot\min\Big\{\frac{\gamma_f(\nu)(1+\nu)}{(1+\beta)(2+\nu)}, 1\Big\}^{\frac{1}{1+\nu}}\Big)\big(F(x_k) - F^*\big).$  (26)

Thus, the rate of convergence is linear, and for reaching the gap $F(x_K) - F^* \le \varepsilon$ it is enough to perform $K = \frac{2+\nu}{1+\nu}\cdot\max\Big\{\frac{(1+\beta)(2+\nu)}{\gamma_f(\nu)(1+\nu)}, 1\Big\}^{\frac{1}{1+\nu}}\log\frac{F(x_0) - F^*}{\varepsilon}$ iterations.

Proof

As in the proof of Theorem 3.1 in [10], from (25) one can see that

$F(x_{k+1}) \le F(x_k) - \alpha\big(F(x_k) - F^*\big) + \frac{\alpha^{2+\nu}(1+\beta)H_f(\nu)\|x_k - x^*\|^{2+\nu}}{(1+\nu)(2+\nu)},$

for any $\alpha \in [0,1]$. Then, taking into account the uniform convexity (4), we get

$F(x_{k+1}) \le F(x_k) - \Big(\alpha - \frac{\alpha^{2+\nu}(1+\beta)H_f(\nu)}{(1+\nu)\sigma_f(2+\nu)}\Big)\big(F(x_k) - F^*\big).$

The minimum of the right-hand side is attained at $\alpha^* = \min\Big\{\frac{\gamma_f(\nu)(1+\nu)}{(2+\nu)(1+\beta)}, 1\Big\}^{\frac{1}{1+\nu}}$. Plugging this value into the bound above, we get inequality (26).

Unfortunately, in practice it is difficult to decide on an appropriate value of $\nu \in [0,1]$ with $H_f(\nu) < +\infty$. Therefore, it is interesting to develop universal methods which are not based on such particular parameters. Recently, it was shown in [10] that one good choice for such a universal scheme is the cubic regularization of the Newton method [17]. This is actually the process (24) with the fixed parameter $\nu = 1$. For this choice, in the rest of the paper we omit the corresponding index in the definitions of all necessary objects: $M_H(x; y) := M_{1,H}(x; y)$, $T_H(x) := T_{1,H}(x)$, and $M_H(x) := M_{1,H}(x) = M_H(x; T_H(x))$. The adaptive scheme of our method with dynamic estimation of the constant $H$ is as follows.

Algorithm 1: Adaptive Cubic Regularization of Newton Method
Initialization. Choose $x_0 \in \mathrm{dom}\, F$, $H_0 > 0$.
Iteration $k \ge 0$.
   1: Find the minimal integer $i_k \ge 0$ such that $F(T_{H_k 2^{i_k}}(x_k)) \le M_{H_k 2^{i_k}}(x_k)$.
   2: Perform the Cubic Step: $x_{k+1} = T_{H_k 2^{i_k}}(x_k)$.
   3: Set $H_{k+1} := 2^{i_k - 1}H_k$.
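To make the scheme concrete, here is a minimal Python sketch of Algorithm 1 for the noncomposite case ($h \equiv 0$, $B = I$); the cubic-step solver, the tolerances and the test problem below are our own illustrative choices, not prescriptions from the paper:

import numpy as np

def cubic_step(g, Hx, H):
    """Minimize <g, d> + 0.5 <Hx d, d> + (H/6) ||d||^3 over d (h = 0, B = I).

    Uses the stationarity condition (Hx + (H r / 2) I) d = -g with r = ||d||,
    solved by bisection on the scalar r; Hx is assumed positive semidefinite."""
    n = len(g)
    d_of = lambda r: np.linalg.solve(Hx + 0.5 * H * r * np.eye(n), -g)
    lo, hi = 0.0, np.sqrt(2 * np.linalg.norm(g) / H) + 1e-12
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if np.linalg.norm(d_of(mid)) > mid else (lo, mid)
    return d_of(hi)

def adaptive_cubic_newton(f, grad, hess, x0, H0=1.0, n_iters=30):
    """Sketch of Algorithm 1: adaptive cubic regularization of the Newton method."""
    x, H = x0.astype(float), H0
    for _ in range(n_iters):
        g, Hx = grad(x), hess(x)
        i = 0
        while True:                          # line 1: find the minimal i_k >= 0
            Hi = H * 2 ** i
            d = cubic_step(g, Hx, Hi)
            model = f(x) + g @ d + 0.5 * d @ (Hx @ d) + Hi * np.linalg.norm(d) ** 3 / 6
            if f(x + d) <= model:            # F(T_{H 2^i}(x_k)) <= M_{H 2^i}(x_k)
                break
            i += 1
        x = x + d                            # line 2: cubic step
        H = Hi / 2                           # line 3: H_{k+1} := 2^{i_k - 1} H_k
    return x

# Placeholder test problem: f(x) = ||x||^4 / 4 + 0.5 ||x - e||^2 (strongly convex).
e = np.ones(5)
f = lambda x: 0.25 * np.linalg.norm(x) ** 4 + 0.5 * np.linalg.norm(x - e) ** 2
grad = lambda x: np.linalg.norm(x) ** 2 * x + (x - e)
hess = lambda x: np.linalg.norm(x) ** 2 * np.eye(5) + 2 * np.outer(x, x) + np.eye(5)
x_sol = adaptive_cubic_newton(f, grad, hess, x0=np.zeros(5))
print("final gradient norm:", np.linalg.norm(grad(x_sol)))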

Let us present the main properties of the composite Cubic Newton step $x \mapsto T_H(x)$. Denote

$r_H(x) := \|T_H(x) - x\|.$

Since the point $T_H(x)$ is a minimum of the strictly convex function $M_H(x; \cdot)$, it satisfies the following first-order optimality condition:

$\Big\langle \nabla f(x) + \nabla^2 f(x)(T_H(x) - x) + \frac{H r_H(x)}{2}B(T_H(x) - x),\ y - T_H(x)\Big\rangle + h(y) \ge h(T_H(x)), \quad \forall y \in \mathrm{dom}\, F.$  (27)

In other words, the vector

$h'(T_H(x)) := -\nabla f(x) - \nabla^2 f(x)(T_H(x) - x) - \frac{H r_H(x)}{2}B(T_H(x) - x)$

belongs to the subdifferential of $h$:

$h'(T_H(x)) \in \partial h(T_H(x)).$  (28)

Computation of a point $T = T_H(x)$ satisfying condition (28) requires some standard techniques of Convex Optimization and Linear Algebra (see [1, 3, 16, 17]). The arithmetical complexity of such a procedure is usually similar to that of the standard Newton step.
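One standard way to carry out this step in the noncomposite case ($h \equiv 0$, $B = I$), sketched below as our own illustration rather than the authors' implementation, is to compute a single factorization of $\nabla^2 f(x)$ (the same $O(n^3)$ cost as a Newton step) and then solve a one-dimensional equation in $r = \|T_H(x) - x\|$:

import numpy as np

def cubic_newton_step(g, Hess, H):
    """Return d = T_H(x) - x for the model Q(x; y) + (H/6) ||y - x||^3 (h = 0, B = I).

    After one eigendecomposition Hess = U diag(lam) U^T (lam >= 0 for convex f),
    the stationarity condition (Hess + (H r / 2) I) d = -g with r = ||d|| becomes
    a scalar equation in r, solved here by bisection."""
    lam, U = np.linalg.eigh(Hess)
    gt = U.T @ g
    step_norm = lambda r: np.linalg.norm(gt / (lam + 0.5 * H * r))
    lo, hi = 0.0, np.sqrt(2 * np.linalg.norm(g) / H) + 1e-12
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if step_norm(mid) > mid else (lo, mid)
    return U @ (-gt / (lam + 0.5 * H * hi))

rng = np.random.default_rng(4)
n = 6
C = rng.standard_normal((n, n))
Hess = C @ C.T                             # positive semidefinite placeholder Hessian
g = rng.standard_normal(n)
d = cubic_newton_step(g, Hess, 1.0)
# residual of the first-order optimality condition (27) with h = 0 and B = I:
print(np.linalg.norm(g + Hess @ d + 0.5 * 1.0 * np.linalg.norm(d) * d))

After the factorization, each trial value of $r$ costs only $O(n)$ operations, which is why the overall arithmetical cost stays comparable to that of one Newton step; in the composite case with a nontrivial $h$, the regularized subproblem has to be handled by other standard means.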

Plugging $y := x \in \mathrm{dom}\, F$ into (27), we get:

$\langle \nabla f(x), x - T_H(x)\rangle \ge \langle \nabla^2 f(x)(T_H(x) - x), T_H(x) - x\rangle + \frac{H r_H^3(x)}{2} + h(T_H(x)) - h(x).$  (29)

Thus, we obtain the following bound for the minimal value $M_H(x)$ of the cubic model:

$M_H(x) \overset{(29)}{\le} f(x) - \frac{1}{2}\langle \nabla^2 f(x)(T_H(x) - x), T_H(x) - x\rangle - \frac{H r_H^3(x)}{3} + h(x) = F(x) - \frac{1}{2}\langle \nabla^2 f(x)(T_H(x) - x), T_H(x) - x\rangle - \frac{H r_H^3(x)}{3}.$  (30)

If for some value $\nu \in [0,1]$ the Hessian is Hölder continuous, $H_f(\nu) < +\infty$, then by (9) and (28) we get the following bound for the subgradient

$F'(T_H(x)) := \nabla f(T_H(x)) + h'(T_H(x))$

at the new point:

$\|F'(T_H(x))\|_* \le \|\nabla f(T_H(x)) - \nabla f(x) - \nabla^2 f(x)(T_H(x) - x)\|_* + \frac{H r_H^2(x)}{2} \overset{(9)}{\le} \frac{H_f(\nu) r_H^{1+\nu}(x)}{1+\nu} + \frac{H r_H^2(x)}{2} = r_H^{1+\nu}(x)\cdot\Big(\frac{H_f(\nu)}{1+\nu} + \frac{H r_H^{1-\nu}(x)}{2}\Big).$  (31)

One of the main strong points of the classical Newton method is its local quadratic convergence for the class of strongly convex functions with Lipschitz continuous Hessian: $\sigma_f(2) > 0$ and $0 < H_f(1) < +\infty$ (see, for example, [15]). This property holds for the cubically regularized Newton method as well [14, 17]. Indeed, ensuring $F(T_H(x)) \le M_H(x)$ as in Algorithm 1, and having $H \le \beta H_f(1)$ with some $\beta \ge 0$, we get:

$F(T_H(x)) - F^* \overset{(5)}{\le} \frac{1}{2\sigma_f(2)}\|F'(T_H(x))\|_*^2 \overset{(31)}{\le} \frac{(1+\beta)^2 H_f^2(1)}{8\sigma_f(2)}r_H^4(x) \le \frac{(1+\beta)^2 H_f^2(1)}{8\sigma_f^3(2)}\langle \nabla^2 f(x)(T_H(x) - x), T_H(x) - x\rangle^2 \overset{(30)}{\le} \frac{(1+\beta)^2 H_f^2(1)}{2\sigma_f^3(2)}\big(F(x) - F^*\big)^2.$

And the region of quadratic convergence is as follows:

$\mathcal{Q} = \Big\{x \in \mathrm{dom}\, F : F(x) - F^* \le \frac{2\sigma_f^3(2)}{(1+\beta)^2 H_f^2(1)}\Big\}.$

After reaching it, the method starts doubling the number of correct digits of the answer at every step, so this phase cannot last long. Therefore, from now on we are mainly interested in the global complexity bounds of Algorithm 1, which work for an arbitrary starting point $x_0$.

For the noncomposite case, as was shown in [10], if for some $\nu \in [0,1]$ we have $0 < H_f(\nu) < +\infty$ and the objective is just convex, then Algorithm 1 with a small initial parameter $H_0$ generates a solution $\hat{x}$ with $f(\hat{x}) - f^* \le \varepsilon$ in $O\big((H_f(\nu)D_0^{2+\nu}/\varepsilon)^{\frac{1}{1+\nu}}\big)$ iterations, where $D_0 := \max_x\{\|x - x^*\| : f(x) \le f(x_0)\}$. Thus, the method in [10] has a sublinear rate of convergence on the class of convex functions with Hölder continuous Hessian. It can automatically adapt to the actual level of smoothness. In what follows, we show that the same algorithm achieves a linear rate of convergence on the class of uniformly convex functions of degree $p = 2 + \nu$, namely for functions with strictly positive condition number: $\sup_{\nu \in [0,1]}\gamma_f(\nu) > 0$.

In the remaining part of the paper, we usually assume that the smooth part of our objective is not purely quadratic. This is equivalent to the condition $\inf_{\nu \in [0,1]} H_f(\nu) > 0$. However, to conclude this section, let us briefly discuss the case $\min_{\nu \in [0,1]} H_f(\nu) = 0$. If we knew in advance that $f$ is a convex quadratic function, then no regularization would be needed, since a single step $x \mapsto T_H(x)$ with $H := 0$ solves the problem. However, if our function is given by a black-box oracle and we do not know a priori that its smooth part is quadratic, then we can still use Algorithm 1. For this case, we prove the following simple result.

Proposition 3.1

Let $A: \mathbb{E} \to \mathbb{E}^*$ be a self-adjoint positive semidefinite linear operator and $b \in \mathbb{E}^*$. Assume that $f(x) := \frac{1}{2}\langle Ax, x\rangle - \langle b, x\rangle$, and that the minimum $x^* \in \mathrm{Argmin}_{x \in \mathrm{dom}\, F}\{F(x) := f(x) + h(x)\}$ does exist. Then, in order to get $F(x_K) - F^* \le \varepsilon$ with arbitrary $\varepsilon > 0$, it is enough to perform

$K = \log_2\frac{H_0\|x_0 - x^*\|^3}{6\varepsilon} + 1$  (32)

iterations of Algorithm 1.

Proof

In our case, the quadratic model coincides with the smooth part of the objective: $Q(x; y) \equiv f(y)$, $\forall x, y \in \mathbb{E}$. Therefore, at every iteration $k \ge 0$ of Algorithm 1 we have $i_k = 0$ and $H_k = 2^{-k}H_0$. Note that $x_{k+1} = T_{2^{-k}H_0}(x_k) = \arg\min_{y \in \mathrm{dom}\, F}\{F(y) + \frac{2^{-k}H_0}{6}\|y - x_k\|^3\}$, and

$F(x_{k+1}) \le F(y) + \frac{2^{-k}H_0}{6}\|y - x_k\|^3, \quad \forall y \in \mathrm{dom}\, F.$  (33)

Let us prove that $\|x_{k+1} - x^*\| \le \|x_k - x^*\|$ for all $k \ge 0$. If this is true, then plugging $y \equiv x^*$ into (33), we get $F(x_{k+1}) - F^* \le \frac{2^{-k}H_0}{6}\|x_0 - x^*\|^3$, which results in the estimate (32). Indeed,

$\|x_k - x^*\|^2 = \|(x_k - x_{k+1}) + (x_{k+1} - x^*)\|^2 = \|x_{k+1} - x^*\|^2 + \|x_k - x_{k+1}\|^2 + 2\langle B(x_k - x_{k+1}), x_{k+1} - x^*\rangle,$

and it is enough to show that $\langle B(x_k - x_{k+1}), x^* - x_{k+1}\rangle \le 0$. Since $x_{k+1}$ satisfies the first-order optimality condition

$-2^{-(k+1)}H_0\|x_{k+1} - x_k\|B(x_{k+1} - x_k) =: F'(x_{k+1}) \in \partial F(x_{k+1}),$  (34)

we have:

$\langle B(x_k - x_{k+1}), x^* - x_{k+1}\rangle \overset{(34)}{=} \frac{2^{k+1}}{H_0\|x_k - x_{k+1}\|}\langle F'(x_{k+1}), x^* - x_{k+1}\rangle \le 0,$

where the last inequality follows from the convexity of the objective.

Complexity Results for Uniformly Convex Functions

In this section, we are going to justify the global linear rate of convergence of Algorithm 1 for the class of twice differentiable uniformly convex functions with Hölder continuous Hessian. Universality of this method is ensured by the adaptive estimation of the parameter $H$ over the whole sequence of iterations. It is important to distinguish two cases: $H_{k+1} < H_k$ and $H_{k+1} \ge H_k$.

First, we need to estimate the progress in the objective function after minimizing the cubic model. There are two different situations here:

$\text{either} \quad H r_H^{1-\nu}(x) \le \frac{2H_f(\nu)}{1+\nu}, \qquad \text{or} \quad H r_H^{1-\nu}(x) > \frac{2H_f(\nu)}{1+\nu}.$

Lemma 4.1

Let $0 < H_f(\nu) < +\infty$ and $\sigma_f(2+\nu) > 0$ for some $\nu \in [0,1]$. Then, for arbitrary $x \in \mathrm{dom}\, F$ and $H > 0$ we have:

$F(x) - M_H(x) \ge \min\bigg[\big(F(x) - F^*\big)\cdot\frac{1+\nu}{2+\nu}\cdot\min\Big\{\Big(\frac{(1+\nu)\gamma_f(\nu)}{2(2+\nu)}\Big)^{\frac{1}{1+\nu}}, 1\Big\},\ \big(F(T_H(x)) - F^*\big)^{\frac{3(1+\nu)}{2(2+\nu)}}\cdot\Big(\frac{2+\nu}{1+\nu}\Big)^{\frac{3(1+\nu)}{2(2+\nu)}}\cdot\frac{(\sigma_f(2+\nu))^{\frac{3}{2(2+\nu)}}}{3\sqrt{H}}\bigg].$  (35)

Proof

Let us consider two cases. 1) $H r_H^{1-\nu}(x) \le \frac{2H_f(\nu)}{1+\nu}$. Then, for arbitrary $y \in \mathrm{dom}\, F$, we have:

$M_H(x) := Q(x; T_H(x)) + \frac{H}{6}\|T_H(x) - x\|^3 + h(T_H(x)) \le Q(x; y) + \frac{H r_H^{1-\nu}(x)\|y - x\|^{2+\nu}}{2(2+\nu)} + h(y) \overset{(10)}{\le} F(y) + \frac{H_f(\nu)\|y - x\|^{2+\nu}}{(1+\nu)(2+\nu)} + \frac{H r_H^{1-\nu}(x)\|y - x\|^{2+\nu}}{2(2+\nu)} \le F(y) + \frac{2H_f(\nu)\|y - x\|^{2+\nu}}{(1+\nu)(2+\nu)},$

where the first inequality follows from the fact that

$T_H(x) = \arg\min_{y \in \mathrm{dom}\, F}\Big\{Q(x; y) + \frac{H r_H^{1-\nu}(x)\|y - x\|^{2+\nu}}{2(2+\nu)} + h(y)\Big\}.$

Let us restrict $y$ to the segment $y = \alpha x^* + (1-\alpha)x$ with $\alpha \in [0,1]$. Taking into account the uniform convexity, we get:

$M_H(x) \le F(x) - \alpha\big(F(x) - F^*\big) + \frac{2\alpha^{2+\nu}H_f(\nu)\|x - x^*\|^{2+\nu}}{(1+\nu)(2+\nu)} \overset{(4)}{\le} F(x) - \Big(\alpha - \frac{2\alpha^{2+\nu}H_f(\nu)}{(1+\nu)\sigma_f(2+\nu)}\Big)\big(F(x) - F^*\big).$

The minimum of the right-hand side is attained at $\alpha^* = \min\Big\{\frac{(1+\nu)\gamma_f(\nu)}{2(2+\nu)}, 1\Big\}^{\frac{1}{1+\nu}}$. Plugging this value into the bound, we have:

$M_H(x) \le F(x) - \min\Big\{\Big(\frac{(1+\nu)\gamma_f(\nu)}{2(2+\nu)}\Big)^{1/(1+\nu)}, 1\Big\}\cdot\frac{1+\nu}{2+\nu}\cdot\big(F(x) - F^*\big),$

and this is the first argument of the minimum in (35).

2) $H r_H^{1-\nu}(x) > \frac{2H_f(\nu)}{1+\nu}$. By (31), we have the bound:

$\|F'(T_H(x))\|_* < H r_H^2(x).$  (36)

Using the fact that $\nabla^2 f(x) \succeq 0$, we get the second argument of the minimum:

$F(x) - M_H(x) \overset{(30)}{\ge} \frac{H r_H^3(x)}{3} \overset{(36)}{\ge} \frac{\|F'(T_H(x))\|_*^{3/2}}{3\sqrt{H}} \overset{(5)}{\ge} \Big(\frac{2+\nu}{1+\nu}\Big)^{\frac{3(1+\nu)}{2(2+\nu)}}\cdot\frac{(\sigma_f(2+\nu))^{\frac{3}{2(2+\nu)}}}{3\sqrt{H}}\cdot\big(F(T_H(x)) - F^*\big)^{\frac{3(1+\nu)}{2(2+\nu)}}.$

Denote by $\kappa_f(\nu)$ the following auxiliary value:

$\kappa_f(\nu) := \frac{(H_f(\nu))^{\frac{2}{1+\nu}}}{(\sigma_f(2+\nu))^{\frac{1-\nu}{(1+\nu)(2+\nu)}}}\cdot\frac{6\,(8+\nu)^{\frac{1-\nu}{1+\nu}}}{((1+\nu)(2+\nu))^{\frac{2}{1+\nu}}}\cdot\Big(\frac{1+\nu}{2+\nu}\Big)^{\frac{1-\nu}{2+\nu}}, \quad \nu \in [0,1].$  (37)

The next lemma shows what happens when the parameter $H$ is increased during the iterations.

Lemma 4.2

Assume that for a fixed $x \in \mathrm{dom}\, F$ the parameter $H > 0$ is such that:

$F(T_H(x)) > M_H(x).$  (38)

If for some $\nu \in [0,1]$ we have $\sigma_f(2+\nu) > 0$, then it holds:

$H\big(F(T_{2H}(x)) - F^*\big)^{\frac{1-\nu}{2+\nu}} < \kappa_f(\nu).$  (39)

Proof

Firstly, let us prove that from (38) we have:

$H r_H^{1-\nu}(x) < \frac{6H_f(\nu)}{(1+\nu)(2+\nu)}.$  (40)

Assuming by contradiction that $H r_H^{1-\nu}(x) \ge \frac{6H_f(\nu)}{(1+\nu)(2+\nu)}$, we get:

$M_H(x) := \frac{H\|T_H(x) - x\|^3}{6} + Q(x; T_H(x)) + h(T_H(x)) \ge \frac{H_f(\nu)\|T_H(x) - x\|^{2+\nu}}{(1+\nu)(2+\nu)} + Q(x; T_H(x)) + h(T_H(x)) \overset{(10)}{\ge} F(T_H(x)),$

which contradicts (38). Secondly, by its definition, $M_H(x)$ is a concave function of $H$. Therefore, its derivative $\frac{d}{dH}M_H(x) = \frac{1}{6}r_H^3(x)$ is nonincreasing. Hence, it holds:

$r_{2H}(x) \le r_H(x) \overset{(40)}{<} \Big(\frac{6H_f(\nu)}{(1+\nu)(2+\nu)H}\Big)^{\frac{1}{1-\nu}}.$  (41)

Finally, by the smoothness and the uniform convexity, we obtain:

$H\big(F(T_{2H}(x)) - F^*\big)^{\frac{1-\nu}{2+\nu}} \overset{(5)}{\le} H\Big(\frac{1+\nu}{2+\nu}\Big(\frac{1}{\sigma_f(2+\nu)}\Big)^{\frac{1}{1+\nu}}\Big)^{\frac{1-\nu}{2+\nu}}\|F'(T_{2H}(x))\|_*^{\frac{1-\nu}{1+\nu}} \overset{(31)}{\le} H\Big(\frac{1+\nu}{2+\nu}\Big(\frac{1}{\sigma_f(2+\nu)}\Big)^{\frac{1}{1+\nu}}\Big)^{\frac{1-\nu}{2+\nu}}\Big(r_{2H}^{1+\nu}(x)\cdot\Big(\frac{H_f(\nu)}{1+\nu} + H r_{2H}^{1-\nu}(x)\Big)\Big)^{\frac{1-\nu}{1+\nu}} \overset{(41)}{<} H\Big(\frac{1+\nu}{2+\nu}\Big(\frac{1}{\sigma_f(2+\nu)}\Big)^{\frac{1}{1+\nu}}\Big)^{\frac{1-\nu}{2+\nu}}\Big(r_{2H}^{1+\nu}(x)\cdot\frac{(8+\nu)H_f(\nu)}{(1+\nu)(2+\nu)}\Big)^{\frac{1-\nu}{1+\nu}} \overset{(41)}{<} \Big(\frac{1+\nu}{2+\nu}\Big(\frac{1}{\sigma_f(2+\nu)}\Big)^{\frac{1}{1+\nu}}\Big)^{\frac{1-\nu}{2+\nu}}\Big(\frac{H_f(\nu)}{(1+\nu)(2+\nu)}\Big)^{\frac{2}{1+\nu}}6\,(8+\nu)^{\frac{1-\nu}{1+\nu}} =: \kappa_f(\nu).$

We are ready to prove the main result of this paper.

Theorem 4.1

Assume that for a fixed $\nu \in [0,1]$ we have $0 < H_f(\nu) < +\infty$ and $\sigma_f(2+\nu) > 0$. Let the parameter $H_0$ in Algorithm 1 be small enough:

$H_0 \le \frac{\kappa_f(\nu)}{(F(x_0) - F^*)^{(1-\nu)/(2+\nu)}},$  (42)

where $\kappa_f(\nu)$ is defined by (37). Let the sequence $\{x_k\}_{k=0}^{K}$ generated by the method satisfy the condition:

$F(T_{H_k 2^j}(x_k)) - F^* \ge \varepsilon > 0, \quad 0 \le j \le i_k, \quad 0 \le k \le K - 1.$  (43)

Then, for every $0 \le k \le K - 1$, we have:

$F(x_{k+1}) - F^* \le \bigg(1 - \min\Big\{\frac{(2+\nu)\,((1+\nu)(2+\nu))^{1/(1+\nu)}\,(\gamma_f(\nu))^{\frac{1}{1+\nu}}}{(1+\nu)\,6^{3/2}\cdot 2^{1/2}\cdot(8+\nu)^{(1-\nu)/(2+2\nu)}},\ \frac{1}{2}\Big\}\bigg)\cdot\big(F(x_k) - F^*\big).$  (44)

Therefore, the rate of convergence is linear, and

$K \le \max\Big\{(\gamma_f(\nu))^{-\frac{1}{1+\nu}}\cdot\frac{1+\nu}{2+\nu}\cdot\frac{6^{3/2}\cdot 2^{1/2}\cdot(8+\nu)^{(1-\nu)/(2+2\nu)}}{((1+\nu)(2+\nu))^{1/(1+\nu)}},\ 1\Big\}\cdot\log\frac{F(x_0) - F^*}{\varepsilon}.$

Moreover, we have the following bound for the total number of oracle calls $N_K$ during the first $K$ iterations:

$N_K \le 2K + \log_2\frac{\kappa_f(\nu)}{\varepsilon^{(1-\nu)/(2+\nu)}} - \log_2 H_0.$  (45)

Proof

The proof is based on Lemmas 4.1 and 4.2 and the monotonicity of the sequence $\{F(x_k)\}_{k \ge 0}$. Firstly, we need to show that every iteration of the method is well defined. Namely, we are going to verify that for a fixed $0 \le k \le K - 1$ there exists a finite integer $\ell \ge 0$ such that either $F(T_{H_k 2^{\ell}}(x_k)) \le M_{H_k 2^{\ell}}(x_k)$ or $F(T_{H_k 2^{\ell+1}}(x_k)) - F^* < \varepsilon$. Indeed, let us set

$\ell := \max\Big\{0,\ \Big\lceil\log_2\frac{\kappa_f(\nu)}{H_k\,\varepsilon^{(1-\nu)/(2+\nu)}}\Big\rceil\Big\}, \qquad \text{and} \qquad H := H_k 2^{\ell} \ge \frac{\kappa_f(\nu)}{\varepsilon^{(1-\nu)/(2+\nu)}}.$  (46)

Then, if we had both $F(T_H(x_k)) > M_H(x_k)$ and $F(T_{2H}(x_k)) - F^* \ge \varepsilon$, we would get by Lemma 4.2:

$H \overset{(39)}{<} \frac{\kappa_f(\nu)}{(F(T_{2H}(x_k)) - F^*)^{(1-\nu)/(2+\nu)}} \le \frac{\kappa_f(\nu)}{\varepsilon^{(1-\nu)/(2+\nu)}},$

which contradicts (46). Therefore, if we are unable to find the value $i_k \ge 0$ (see line 1 of the Algorithm) in a finite number of steps, this only means that we have already solved the problem up to accuracy $\varepsilon$.

Now, let us show that for every $0 \le k \le K$ it holds:

$H_k\big(F(x_k) - F^*\big)^{\frac{1-\nu}{2+\nu}} \le \max\Big\{\kappa_f(\nu),\ H_0\big(F(x_0) - F^*\big)^{\frac{1-\nu}{2+\nu}}\Big\}.$  (47)

This inequality is obviously valid for $k = 0$. Assume it is also valid for some $k \ge 0$. Then, by the definition of $H_{k+1}$ (see line 3 of the Algorithm), we have $H_{k+1} = H_k 2^{i_k - 1}$. There are two cases. 1) $i_k = 0$. Then, $H_{k+1} < H_k$. By the monotonicity of $\{F(x_k)\}_{k \ge 0}$ and by induction, we get:

$H_{k+1}\big(F(x_{k+1}) - F^*\big)^{\frac{1-\nu}{2+\nu}} < H_k\big(F(x_k) - F^*\big)^{\frac{1-\nu}{2+\nu}} \le \max\Big\{\kappa_f(\nu),\ H_0\big(F(x_0) - F^*\big)^{\frac{1-\nu}{2+\nu}}\Big\}.$

2) $i_k > 0$. Then, applying Lemma 4.2 with $H := H_k 2^{i_k - 1} = H_{k+1}$ and $x := x_k$, we have:

$H_{k+1}\big(F(x_{k+1}) - F^*\big)^{\frac{1-\nu}{2+\nu}} = H\big(F(T_{2H}(x)) - F^*\big)^{\frac{1-\nu}{2+\nu}} \overset{(39)}{<} \kappa_f(\nu).$

Thus, (47) is true by induction. Choosing $H_0$ small enough (42), we have:

$2H_k\big(F(x_k) - F^*\big)^{\frac{1-\nu}{2+\nu}} \le 2\kappa_f(\nu), \quad 0 \le k \le K.$  (48)

From Lemma 4.1, we know that one of the two following estimates is true (denote $\delta_k := F(x_k) - F^*$):

  1. $F(x_k) - F(x_{k+1}) \ge \alpha\cdot\delta_k \ \Rightarrow\ \delta_{k+1} \le (1 - \alpha)\cdot\delta_k$, or

  2. $F(x_k) - F(x_{k+1}) \ge \beta\cdot\delta_{k+1} \ \Rightarrow\ \delta_{k+1} \le (1+\beta)^{-1}\delta_k \le (1 - \min\{\beta, 1\}/2)\cdot\delta_k$,

where $\alpha := \frac{1+\nu}{2+\nu}\cdot\min\Big\{\Big(\frac{(1+\nu)\gamma_f(\nu)}{2(2+\nu)}\Big)^{\frac{1}{1+\nu}}, 1\Big\}$, and

$\beta := \Big(\frac{2+\nu}{1+\nu}\Big)^{\frac{3(1+\nu)}{2(2+\nu)}}\cdot\frac{(\sigma_f(2+\nu))^{\frac{3}{2(2+\nu)}}}{3(2\kappa_f(\nu))^{1/2}} \overset{(37)}{=} \frac{2+\nu}{1+\nu}\cdot\frac{2^{1/2}\,((1+\nu)(2+\nu))^{\frac{1}{1+\nu}}}{6^{3/2}\cdot(8+\nu)^{(1-\nu)/(2+2\nu)}}\cdot(\gamma_f(\nu))^{\frac{1}{1+\nu}}.$

It remains to note that $\alpha \ge \min\{\beta, 1\}/2$. Thus, we obtain (44).

Finally, let us estimate the total number of oracle calls $N_K$ during the first $K$ iterations. At each iteration, the oracle is called $i_k + 1$ times, and we have $H_{k+1} = H_k 2^{i_k - 1}$. Therefore,

$N_K = \sum_{k=0}^{K-1}(i_k + 1) = \sum_{k=0}^{K-1}\Big(\log_2\frac{H_{k+1}}{H_k} + 2\Big) = 2K + \log_2 H_K - \log_2 H_0 \overset{(48),(43)}{\le} 2K + \log_2\frac{\kappa_f(\nu)}{\varepsilon^{(1-\nu)/(2+\nu)}} - \log_2 H_0.$

Note that condition (42) on the initial choice of $H_0$ can be seen as a definition of the moment after which we can guarantee the linear rate of convergence (44). In practice, we can launch Algorithm 1 with an arbitrary $H_0 > 0$. There are two possible options: either the method halves $H_k$ at every step in the beginning, so $H_k$ becomes small very quickly, or this value is increased at least once, and the required bound is guaranteed by Lemma 4.2. It can easily be proved that this initial phase requires no more than $K_0 = \log_2\frac{H_0\,\varepsilon^{(1-\nu)/(2+\nu)}}{\kappa_f(\nu)}$ oracle calls.

Discussion

Let us discuss the global complexity results provided by Theorem 4.1 for the Cubic Regularization of the Newton Method with the adaptive adjustment of the regularization parameter.

For the class of twice continuously differentiable strongly convex functions with Lipschitz continuous gradient, $f \in \mathcal{S}^{2,1}_{\mu,L}(\mathrm{dom}\, F)$, it is well known that the classical gradient descent method needs

$O\Big(\frac{L}{\mu}\log\frac{F(x_0) - F^*}{\varepsilon}\Big)$  (49)

iterations for computing an $\varepsilon$-solution of the problem (e.g., [15]). As was shown in [6], this result is shared by a variant of the Cubic Regularization of the Newton method. This is much better than the bound $O\big((\frac{L}{\mu})^2\log\frac{F(x_0) - F^*}{\varepsilon}\big)$ known for the damped Newton method (e.g., [2]).

For the class of uniformly convex functions of degree $p = 2 + \nu$ having Hölder continuous Hessian of degree $\nu \in [0,1]$, we have proved the following parametric estimates: $O\big(\max\{(\gamma_f(\nu))^{-\frac{1}{1+\nu}}, 1\}\cdot\log\frac{F(x_0) - F^*}{\varepsilon}\big)$, where $\gamma_f(\nu) := \frac{\sigma_f(2+\nu)}{H_f(\nu)}$ is the condition number of degree $\nu$. However, in practice we may not know exactly an appropriate value of the parameter $\nu$. It is important that our algorithm automatically adjusts to the best possible complexity bound:

$O\Big(\max\Big\{\inf_{\nu \in [0,1]}(\gamma_f(\nu))^{-\frac{1}{1+\nu}},\ 1\Big\}\cdot\log\frac{F(x_0) - F^*}{\varepsilon}\Big).$  (50)

Note that for $f \in \mathcal{S}^{2,1}_{\mu,L}(\mathrm{dom}\, F)$ we have:

$\|\nabla^2 f(x) - \nabla^2 f(y)\| \le L - \mu, \quad \forall x, y \in \mathrm{dom}\, F.$

Thus, $H_f(0) \le L - \mu$ and $\gamma_f(0) \ge \frac{\mu}{L - \mu}$. So we can conclude that the estimate (50) is better than (49). Moreover, adding to our objective an arbitrary convex quadratic function does not change any of the constants $H_f(\nu)$, $\nu \in [0,1]$. Thus, it can only improve the condition number $\gamma_f(\nu)$, while the ratio $L/\mu$ may become arbitrarily bad. This confirms the intuition that a natural Newton-type minimization scheme should not be affected by the quadratic parts of the objective, and that the notion of well-conditioned and ill-conditioned problems for second-order methods should be different from that for first-order ones.
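This effect is easy to observe numerically. The sketch below (our own addition; all data are placeholders, and the sampled quantities are only empirical proxies for $\mu$, $L$ and $H_f(0)$ on random points) adds quadratics of different conditioning to a log-sum-exp term: the estimate of $H_f(0)$ does not move, the proxy for $\gamma_f(0)$ improves, while $L/\mu$ deteriorates:

import numpy as np

rng = np.random.default_rng(5)
m, n = 8, 4
A = rng.standard_normal((m, n))            # data of a log-sum-exp term (placeholder)

def hess_lse(x):                            # Hessian of ln(sum_i exp(<a_i, x>)), B = I
    w = np.exp(A @ x - np.max(A @ x))
    w /= w.sum()
    g = A.T @ w
    return A.T @ (w[:, None] * A) - np.outer(g, g)

def empirical_stats(Quad):
    """Sample-based proxies for mu, L and H_f(0) of f = log-sum-exp + 0.5 x^T Quad x."""
    pts = [rng.standard_normal(n) for _ in range(200)]
    hs = [hess_lse(x) + Quad for x in pts]
    eigs = np.concatenate([np.linalg.eigvalsh(h) for h in hs])
    H0 = max(np.max(np.abs(np.linalg.eigvalsh(h1 - h2))) for h1 in hs[:20] for h2 in hs[:20])
    return eigs.min(), eigs.max(), H0

for Quad, name in [(1e-2 * np.eye(n), "well-scaled quadratic"),
                   (np.diag([1e4, 1.0, 1.0, 1.0]), "ill-conditioned quadratic")]:
    mu, L, H0 = empirical_stats(Quad)
    print(f"{name}: L/mu ~ {L / mu:.1e},  H_f(0) ~ {H0:.2f},  gamma_f(0) proxy mu/H_f(0) ~ {mu / H0:.3f}")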

Note that in the recent paper [11], a linear rate of convergence was also proved for an accelerated second-order scheme, with the complexity bound:

$O\Big(\max\big\{(\gamma_f(\nu))^{-\frac{1}{2+\nu}}, 1\big\}\cdot\log\frac{H_f(\nu)D_0^{2+\nu}}{\varepsilon}\Big).$  (51)

This is a better rate than (50). However, the method requires knowledge of the parameter $\nu$ and of the constant of uniform convexity. Thus, one theoretical question remains open: is it possible to construct a universal second-order scheme matching (51) in the uniformly convex case?

Looking at the definitions of $H_f(\nu)$ and $\sigma_f(2+\nu)$, we can see that, for all $x, y \in \mathrm{dom}\, F$, $x \ne y$,

$\sigma_f(2+\nu) \le \frac{\langle \nabla f(x) - \nabla f(y), x - y\rangle}{\|x - y\|^{2+\nu}}, \qquad \frac{1}{H_f(\nu)} \le \frac{\|x - y\|^{\nu}}{\|\nabla^2 f(x) - \nabla^2 f(y)\|},$

and

$\gamma_f(\nu) := \frac{\sigma_f(2+\nu)}{H_f(\nu)} \le \frac{\langle \nabla f(x) - \nabla f(y), x - y\rangle}{\|\nabla^2 f(x) - \nabla^2 f(y)\|\cdot\|x - y\|^2}.$

The last fraction does not depend on any particular $\nu$. So, for any twice-differentiable convex function, we can define the following number:

$\gamma_f := \inf_{x, y \in \mathrm{dom}\, F,\ x \ne y} \frac{\langle \nabla f(x) - \nabla f(y), x - y\rangle}{\|\nabla^2 f(x) - \nabla^2 f(y)\|\cdot\|x - y\|^2}.$

If it is positive, then it could serve as an indicator of second-order nondegeneracy, for which we have the lower bound $\gamma_f \ge \gamma_f(\nu)$, $\nu \in [0,1]$.

Conclusions

In this work, we have introduced the second-order condition number of a certain degree, which serves as the main complexity factor for solving uniformly convex minimization problems with Hölder continuous Hessian of the objective by second-order optimization schemes.

We have proved that the cubically regularized Newton method with an adaptive estimation of the regularization parameter achieves a global linear rate of convergence on this class of functions. The algorithm does not require knowledge of any parameters of the problem class and automatically adjusts to the best possible degree of nondegeneracy.

Using this technique, we have justified that the global iteration complexity of the cubic Newton method is always better than the corresponding complexity of the gradient method for the standard class of strongly convex functions with uniformly bounded second derivative.

Data Availability

Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.

Acknowledgements

The research results of this paper were obtained with support of ERC Advanced Grant 788368.

Footnotes

1. For the integer values of $p$, this inequality was proved in [14].

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Nikita Doikov, Email: nikita.doikov@uclouvain.be.

Yurii Nesterov, Email: yurii.nesterov@uclouvain.be.

References

  • 1. Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., Ma, T.: Finding approximate local minima faster than gradient descent. In: Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pp. 1195–1199. ACM (2017)
  • 2. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge (2004)
  • 3. Carmon, Y., Duchi, J.C.: Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv:1612.00547 (2016)
  • 4. Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function- and derivative-evaluation complexity. Math. Program. 130(2), 295–319 (2011). doi: 10.1007/s10107-009-0337-y
  • 5. Cartis, C., Gould, N.I., Toint, P.L.: Adaptive cubic regularisation methods for unconstrained optimization. Part I: motivation, convergence and numerical results. Math. Program. 127(2), 245–295 (2011). doi: 10.1007/s10107-009-0286-5
  • 6. Cartis, C., Gould, N.I., Toint, P.L.: Evaluation complexity of adaptive cubic regularization methods for convex unconstrained optimization. Optim. Methods Softw. 27(2), 197–219 (2012). doi: 10.1080/10556788.2011.602076
  • 7. Cartis, C., Scheinberg, K.: Global convergence rate analysis of unconstrained optimization methods based on probabilistic models. Math. Program. 169(2), 337–375 (2018). doi: 10.1007/s10107-017-1137-4
  • 8. Doikov, N., Richtárik, P.: Randomized block cubic Newton method. In: International Conference on Machine Learning, pp. 1289–1297 (2018)
  • 9. Ghadimi, S., Liu, H., Zhang, T.: Second-order methods with cubic regularization under inexact information. arXiv:1710.05782 (2017)
  • 10. Grapiglia, G.N., Nesterov, Y.: Regularized Newton methods for minimizing functions with Hölder continuous Hessians. SIAM J. Optim. 27(1), 478–506 (2017). doi: 10.1137/16M1087801
  • 11. Grapiglia, G.N., Nesterov, Y.: Accelerated regularized Newton methods for minimizing composite convex functions. SIAM J. Optim. 29(1), 77–99 (2019). doi: 10.1137/17M1142077
  • 12. Kohler, J.M., Lucchi, A.: Sub-sampled cubic regularization for non-convex optimization. In: International Conference on Machine Learning, pp. 1895–1904 (2017)
  • 13. Nesterov, Y.: Modified Gauss–Newton scheme with worst case guarantees for global performance. Optim. Methods Softw. 22(3), 469–483 (2007). doi: 10.1080/08927020600643812
  • 14. Nesterov, Y.: Accelerating the cubic regularization of Newton’s method on convex problems. Math. Program. 112(1), 159–181 (2008). doi: 10.1007/s10107-006-0089-x
  • 15. Nesterov, Y.: Lectures on Convex Optimization. Springer, Berlin (2018)
  • 16. Nesterov, Y.: Implementable tensor methods in unconstrained convex optimization. Math. Program., pp. 1–27 (2019)
  • 17. Nesterov, Y., Polyak, B.T.: Cubic regularization of Newton’s method and its global performance. Math. Program. 108(1), 177–205 (2006). doi: 10.1007/s10107-006-0706-8
  • 18. Tripuraneni, N., Stern, M., Jin, C., Regier, J., Jordan, M.I.: Stochastic cubic regularization for fast nonconvex optimization. In: Advances in Neural Information Processing Systems, pp. 2899–2908 (2018)
