New Results on Superlinear Convergence of Classical Quasi-Newton Methods

Anton Rodomanov; Yurii Nesterov

doi:10.1007/s10957-020-01805-8

. 2021 Jan 9;188(3):744–769. doi: 10.1007/s10957-020-01805-8

New Results on Superlinear Convergence of Classical Quasi-Newton Methods

Anton Rodomanov ^1,^✉, Yurii Nesterov ²

PMCID: PMC7929971 PMID: 33746292

Abstract

We present a new theoretical analysis of local superlinear convergence of classical quasi-Newton methods from the convex Broyden class. As a result, we obtain a significant improvement in the currently known estimates of the convergence rates for these methods. In particular, we show that the corresponding rate of the Broyden–Fletcher–Goldfarb–Shanno method depends only on the product of the dimensionality of the problem and the logarithm of its condition number.

Keywords: Quasi-Newton methods, Convex Broyden class, DFP, BFGS, Superlinear convergence, Local convergence, Rate of convergence

Introduction

We study local superlinear convergence of classical quasi-Newton methods for smooth unconstrained optimization. These algorithms can be seen as an approximation of the standard Newton method, in which the exact Hessian is replaced by some operator, which is updated in iterations by using the gradients of the objective function. The two most famous examples of quasi-Newton algorithms are the Davidon–Fletcher–Powell (DFP) [1, 2] and the Broyden–Fletcher–Goldfarb–Shanno (BFGS) [3–7] methods, which together belong to the Broyden family [8] of quasi-Newton algorithms. For an introduction into the topic, see [9] and [10, Chapter 6]. See also [11] for the discussion of quasi-Newton algorithms in the context of nonsmooth optimization.

The superlinear convergence of quasi-Newton methods was established as early as in 1970s, firstly by Powell [12] and Dixon [13, 14] for the methods with exact line search, and then by Broyden, Dennis and Moré [15] and Dennis and Moré [16] for the methods without line search. The latter two approaches have been extended onto more general methods under various settings (see, e.g., [17–25]).

However, explicit rates of superlinear convergence for quasi-Newton algorithms were obtained only recently. The first results were presented in [26] for the greedy quasi-Newton methods. After that, in [27], the classical quasi-Newton methods were considered, for which the authors established certain superlinear convergence rates, depending on the problem dimension and its condition number. The analysis was based on the trace potential function, which was then augmented by the logarithm of determinant of the inverse Hessian approximation to extend the proof onto the general nonlinear case.

In this paper, we further improve the results of [27]. For the classical quasi-Newton methods, we obtain new convergence rate estimates, which have better dependency on the condition number of the problem. In particular, we show that the superlinear convergence rate of BFGS depends on the condition number only through the logarithm. As compared to the previous work, the main difference in the analysis is the choice of the potential function: now the main part is formed by the logarithm of determinant of Hessian approximation, which is then augmented by the trace of inverse Hessian approximation.

It is worth noting that recently, in [28], another analysis of local superlinear convergence of the classical DFP and BFGS methods was presented with the resulting rate, which is independent of the dimensionality of the problem and its condition number. However, to obtain such a rate, the authors had to make an additional assumption that the methods start from a sufficiently good initial Hessian approximation. Without this assumption, to our knowledge, their proof technique, based on the Frobenius-norm potential function, leads only to the rates, which are weaker than those in [27].

This paper is organized as follows. In Sect. 2, we introduce our notation. In Sect. 3, we study the convex Broyden class of quasi-Newton updates for approximating a self-adjoint positive definite operator. In Sect. 4, we analyze the rate of convergence of the classical quasi-Newton methods from the convex Broyden class as applied to minimizing a quadratic function. On this simple example, where the Hessian is constant, we illustrate the main ideas of our analysis. In Sect. 5, we consider the general unconstrained optimization problem. Finally, in Sect. 6, we discuss why the new superlinear convergence rates, obtained in this paper, are better than the previously known ones.

Notation

In what follows, $E$ denotes an n-dimensional real vector space. Its dual space, composed of all linear functionals on $E$ , is denoted by $E^{*}$ . The value of a linear function $s \in E^{*}$ , evaluated at a point $x \in E$ , is denoted by $⟨ s, x ⟩$ .

For a smooth function $f : E \to R$ , we denote by $\nabla f (x)$ and $\nabla^{2} f (x)$ its gradient and Hessian, respectively, evaluated at a point $x \in E$ . Note that $\nabla f (x) \in E^{*}$ , and $\nabla^{2} f (x)$ is a self-adjoint linear operator from $E$ to $E^{*}$ .

The partial ordering of self-adjoint linear operators is defined in the standard way. We write $A_{1} ⪯ A_{2}$ for $A_{1}, A_{2} : E \to E^{*}$ , if $⟨ (A_{2} - A_{1}) x, x ⟩ \geq 0$ for all $x \in E$ , and $H_{1} ⪯ H_{2}$ for $H_{1}, H_{2} : E^{*} \to E$ , if $⟨ s, (H_{2} - H_{1}) s ⟩ \geq 0$ for all $s \in E^{*}$ .

Any self-adjoint positive definite linear operator $A : E \to E^{*}$ induces in the spaces $E$ and $E^{*}$ the following pair of conjugate Euclidean norms:

\begin{matrix} {‖ h ‖}_{A} : = {⟨ A h, h ⟩}^{1 / 2}, h \in E, {‖ s ‖}_{A}^{*} : = {⟨ s, A^{- 1} s ⟩}^{1 / 2}, s \in E^{*} . \end{matrix}

When $A = \nabla^{2} f (x)$ , where $f : E \to R$ is a smooth function with positive definite Hessian, and $x \in E$ , we prefer to use notation ${‖ \cdot ‖}_{x}$ and ${‖ \cdot ‖}_{x}^{*}$ , provided that there is no ambiguity with the reference function f.

Sometimes, in the formulas, involving products of linear operators, it is convenient to treat $x \in E$ as a linear operator from $R$ to $E$ , defined by $x α = α x$ , and $x^{*}$ as a linear operator from $E^{*}$ to $R$ , defined by $x^{*} s = ⟨ s, x ⟩$ . Likewise, any $s \in E^{*}$ can be treated as a linear operator from $R$ to $E^{*}$ , defined by $s α = α s$ , and $s^{*}$ as a linear operator from $E$ to $R$ , defined by $s^{*} x = ⟨ s, x ⟩$ . In this case, $x x^{*}$ and $s s^{*}$ are rank-one self-adjoint linear operators from $E^{*}$ to $E$ and from $E^{*}$ to $E$ , respectively, acting as follows: $(x x^{*}) s = ⟨ s, x ⟩ x$ and $(s s^{*}) x = ⟨ s, x ⟩ s$ for $x \in E$ and $s \in E^{*}$ .

Given two self-adjoint linear operators $A : E \to E^{*}$ and $H : E^{*} \to E$ , we define the trace and the determinant of A with respect to H as follows: $⟨ H, A ⟩ : = Tr (H A)$ , and $Det (H, A) : = Det (H A)$ . Note that HA is a linear operator from $E$ to itself, and hence, its trace and determinant are well defined by the eigenvalues (they coincide with the trace and determinant of the matrix representation of HA with respect to an arbitrary chosen basis in the space $E$ , and the result is independent of the particular choice of the basis). In particular, if H is positive definite, then $⟨ H, A ⟩$ and $Det (H, A)$ are, respectively, the sum and the product of the eigenvalues of A relative to $H^{- 1}$ . Observe that $⟨ \cdot, \cdot ⟩$ is a bilinear form, and for any $x \in E$ , we have $⟨ A x, x ⟩ = ⟨ x x^{*}, A ⟩$ . When A is invertible, we also have $⟨ A^{- 1}, A ⟩ = n$ and $Det (A^{- 1}, δ A) = δ^{n}$ for any $δ \in R$ . Also recall the following multiplicative formula for the determinant: $Det (H, A) = Det (H, G) \cdot Det (G^{- 1}, A)$ , which is valid for any invertible linear operator $G : E \to E^{*}$ . If the operator H is positive semidefinite, and $A_{1} ⪯ A_{2}$ for some self-adjoint linear operators $A_{1}, A_{2} : E \to E^{*}$ , then $⟨ H, A_{1} ⟩ \leq ⟨ H, A_{2} ⟩$ and $Det (H, A_{1}) \leq Det (H, A_{2})$ . Similarly, if A is positive semidefinite and $H_{1} ⪯ H_{2}$ for some self-adjoint linear operators $H_{1}, H_{2} : E^{*} \to E$ , then $⟨ H_{1}, A ⟩ \leq ⟨ H_{2}, A ⟩$ and $Det (H_{1}, A) \leq Det (H_{2}, A)$ .

Convex Broyden Class

Let A and G be two self-adjoint positive definite linear operators from $E$ to $E^{*}$ , where A is the target operator, which we want to approximate, and G is its current approximation. The Broyden class of quasi-Newton updates of G with respect to A along a direction $u \in E \ {0}$ is the following family of updating formulas, parameterized by a scalar $τ \in R$ :

\begin{matrix} \begin{matrix} {Broyd}_{τ} (A, G, u) & = & ϕ_{τ} [G - \frac{A u u^{*} G + G u u^{*} A}{⟨ A u, u ⟩} + (\frac{⟨ G u, u ⟩}{⟨ A u, u ⟩} + 1) \frac{A u u^{*} A}{⟨ A u, u ⟩}] \\ + (1 - ϕ_{τ}) [G - \frac{G u u^{*} G}{⟨ G u, u ⟩} + \frac{A u u^{*} A}{⟨ A u, u ⟩}], \end{matrix} \end{matrix}

where

\begin{matrix} \begin{matrix} ϕ_{τ} : = ϕ_{τ} (A, G, u) : = & \frac{τ \frac{⟨ A u, u ⟩}{⟨ A G^{- 1} A u, u ⟩}}{τ \frac{⟨ A u, u ⟩}{⟨ A G^{- 1} A u, u ⟩} + (1 - τ) \frac{⟨ G u, u ⟩}{⟨ A u, u ⟩}} . \end{matrix} \end{matrix}

If the denominator in (3) is zero, we left both $ϕ_{τ}$ and ${Broyd}_{τ} (A, G, u)$ undefined. For the sake of convenience, we also set ${Broyd}_{τ} (A, G, u) = G$ for $u = 0$ .

In this paper, we are interested in the convex Broyden class, which is described by the values of $τ \in [0, 1]$ . Note that for all such $τ$ the denominator in (3) is always positive for any $u \neq 0$ , so both $ϕ_{τ}$ and ${Broyd}_{τ} (A, G, u)$ are well defined; moreover, $ϕ_{τ} \in [0, 1]$ . For $τ = 1$ , we have $ϕ_{τ} = 1$ , and (2) becomes the DFP update; for $τ = 0$ , we have $ϕ_{τ} = 0$ , and (2) becomes the BFGS update.

Remark 3.1

Usually the Broyden class is defined directly in terms of the parameter $ϕ$ . However, in the context of this paper, it is more convenient to work with $τ$ instead of $ϕ$ . As can be seen from (66), $τ$ is exactly the weight of the DFP component in the updating formula for the inverse operator.

A basic property of an update from the convex Broyden class is that it preserves the bounds on the eigenvalues with respect to the target operator.

Lemma 3.1

(see [27, Lemma 2.1]) If $\frac{1}{ξ} A ⪯ G ⪯ η A$ for some $ξ, η \geq 1$ , then, for any $u \in E$ , and any $τ \in [0, 1]$ , we have $\frac{1}{ξ} A ⪯ {Broyd}_{τ} (A, G, u) ⪯ η A$ .

Consider the measure of closeness of G to A along direction $u \in E \ {0}$ :

\begin{matrix} ν (A, G, u) : = \frac{{⟨ (G - A) G^{- 1} (G - A) u, u ⟩}^{1 / 2}}{{⟨ A u, u ⟩}^{1 / 2}} \overset{(1)}{=} \frac{{‖ (G - A) u ‖}_{G}^{*}}{{‖ u ‖}_{A}} . \end{matrix}

Let us present two potential functions, whose improvement after one update from the convex Broyden class can be bounded from below by a certain nonnegative monotonically increasing function of $ν$ , vanishing at zero.

First, consider the log-det barrier

\begin{matrix} V (A, G) = ln Det (A^{- 1}, G) . \end{matrix}

It will be useful when $A ⪯ G$ . Note that in this case $V (A, G) \geq 0$ .

Lemma 3.2

Let $A, G : E \to E^{*}$ be self-adjoint positive definite linear operators, $A ⪯ G ⪯ η A$ for some $η \geq 1$ . Then, for any $τ \in [0, 1]$ and $u \in E \ {0}$ :

\begin{matrix} V (A, G) - V (A, {Broyd}_{τ} (A, G, u)) \geq ln (1 + (τ \frac{1}{η} + 1 - τ) ν^{2} (A, G, u)) . \end{matrix}

Proof

Indeed, denoting $G_{+} : = {Broyd}_{τ} (A, G, u)$ , we obtain

\begin{matrix} \begin{matrix} V (A, G) - V (A, G_{+}) \overset{(5)}{=} ln Det (G_{+}^{- 1}, G) \\ \overset{(67)}{=} & ln (τ \frac{⟨ A u, u ⟩}{⟨ A G^{- 1} A u, u ⟩} + (1 - τ) \frac{⟨ G u, u ⟩}{⟨ A u, u ⟩}) \\ = & ln (1 + τ \frac{⟨ A (A^{- 1} - G^{- 1}) A u, u ⟩}{⟨ A G^{- 1} A u, u ⟩} + (1 - τ) \frac{⟨ (G - A) u, u ⟩}{⟨ A u, u ⟩}) . \end{matrix} \end{matrix}

Since1 $0 ⪯ G - A ⪯ (1 - \frac{1}{η}) G$ , we have

\begin{matrix} (G - A) G^{- 1} (G - A) ⪯ (1 - \frac{1}{η}) (G - A) ⪯ \frac{1}{1 + \frac{1}{η}} (G - A) ⪯ G - A . \end{matrix}

Therefore, denoting $ν : = ν (A, G, u)$ , we can write that

\begin{matrix} \frac{⟨ (G - A) u, u ⟩}{⟨ A u, u ⟩} \overset{(7)}{\geq} \frac{⟨ (G - A) G^{- 1} (G - A) u, u ⟩}{⟨ A u, u ⟩} \overset{(4)}{=} ν^{2}, \end{matrix}

and, since $A (A^{- 1} - G^{- 1}) A = G - A - (G - A) G^{- 1} (G - A)$ , that

\begin{matrix} \begin{matrix} \frac{⟨ A (A^{- 1} - G^{- 1}) A u, u ⟩}{⟨ A G^{- 1} A u, u ⟩} & = & \frac{⟨ (G - A - (G - A) G^{- 1} (G - A)) u, u ⟩}{⟨ A G^{- 1} A u, u ⟩} \overset{(7)}{\geq} \frac{1}{η} \frac{⟨ (G - A) G^{- 1} (G - A) u, u ⟩}{⟨ A G^{- 1} A u, u ⟩} \\ \geq & \frac{1}{η} \frac{⟨ (G - A) G^{- 1} (G - A) u, u ⟩}{⟨ A u, u ⟩} \overset{(4)}{=} \frac{1}{η} ν^{2} . \end{matrix} \end{matrix}

Substituting the above two inequalities into (6), we obtain the claim. $□$

Now consider another potential function, the augmented log-det barrier:

\begin{matrix} ψ (G, A) : = ln Det (A^{- 1}, G) - ⟨ G^{- 1}, G - A ⟩ . \end{matrix}

As compared to the log-det barrier, this potential function is more universal since it works even if the condition $A ⪯ G$ is violated. Note that the augmented log-det barrier is in fact the Bregman divergence, generated by the strictly convex function $d (A) : = - ln Det (B^{- 1}, A)$ , defined on the set of self-adjoint positive definite linear operators from $E$ to $E^{*}$ , where $B : E \to E^{*}$ is an arbitrary fixed self-adjoint positive definite linear operator. Indeed,

\begin{matrix} \begin{matrix} ψ (G, A) & = & - ln Det (B^{- 1}, A) + ln Det (B^{- 1}, G) - ⟨ - G^{- 1}, A - G ⟩ \\ = & d (A) - d (G) - ⟨ \nabla d (G), A - G ⟩ \geq 0 . \end{matrix} \end{matrix}

Remark 3.2

The idea of combining the trace with the logarithm of determinant to form a potential function for the analysis of quasi-Newton methods can be traced back to [29]. Note also that in [27], the authors studied the evolution of $ψ (A, G)$ , i.e. the Bregman divergence was centered at A instead of G.

Lemma 3.3

For any real $α \geq β > 0$ , we have $α + \frac{1}{β} - 1 \geq 1$ , and

\begin{matrix} \begin{matrix} α - ln β - 1 \geq & \frac{\sqrt{3}}{2 + \sqrt{3}} ln (α + \frac{1}{β} - 1) \geq & \frac{6}{13} ln (α + \frac{1}{β} - 1) . \end{matrix} \end{matrix}

Proof

We only need to prove the first inequality in (10) since the second one follows from it and the fact that $\frac{\sqrt{3} + 2}{\sqrt{3}} = 1 + \frac{2}{\sqrt{3}} \leq 1 + \frac{7}{6} = \frac{13}{6}$ (since $2 \leq \frac{7}{2 \sqrt{3}}$ ).

Let $β > 0$ be fixed, and let $ζ_{1} : (1 - \frac{1}{β}, + \infty) \to R$ be the function, defined by $ζ_{1} (α) : = α - \frac{\sqrt{3}}{2 + \sqrt{3}} ln (α + \frac{1}{β} - 1)$ . Note that the domain of $ζ_{1}$ includes the point $α = β$ since $β \geq 2 - \frac{1}{β} > 1 - \frac{1}{β}$ . Let us show that $ζ_{1}$ increases on the interval $[β, + \infty)$ . Indeed, for any $α \geq β$ , we have

\begin{matrix} \begin{matrix} ζ_{1}^{'} (α) = & 1 - \frac{\sqrt{3}}{2 + \sqrt{3}} \frac{1}{α + \frac{1}{β} - 1} > 1 - \frac{1}{α + \frac{1}{β} - 1} = \frac{α + \frac{1}{β} - 2}{α + \frac{1}{β} - 1} \geq \frac{β + \frac{1}{β} - 2}{α + \frac{1}{β} - 1} \geq 0 . \end{matrix} \end{matrix}

Thus, it is sufficient to prove (10) only in the case when $α = β$ . Equivalently, we need to show that the function $ζ_{2} : (0, + \infty) \to R$ , defined by the formula $ζ_{2} (α) : = α - ln α - 1 - \frac{\sqrt{3}}{2 + \sqrt{3}} ln (α + \frac{1}{α} - 1)$ , is nonnegative. Differentiating, we find that, for all $α > 0$ , we have

\begin{matrix} \begin{matrix} ζ_{2}^{'} (α) & = & 1 - \frac{1}{α} - \frac{\sqrt{3}}{2 + \sqrt{3}} \frac{1 - \frac{1}{α^{2}}}{α + \frac{1}{α} - 1} = (1 - \frac{1}{α}) (1 - \frac{\sqrt{3}}{2 + \sqrt{3}} \frac{1 + \frac{1}{α}}{α + \frac{1}{α} - 1}) \\ = & (1 - \frac{1}{α}) \frac{α + \frac{1}{α} - 1 - (2 \sqrt{3} - 3) (1 + \frac{1}{α})}{α + \frac{1}{α} - 1} = (1 - \frac{1}{α}) \frac{α - 2 (\sqrt{3} - 1) + {(\sqrt{3} - 1)}^{2} \frac{1}{α}}{1 + \frac{1}{α} - 1} \\ = & (1 - \frac{1}{α}) \frac{{(\sqrt{α} - (\sqrt{3} - 1) \frac{1}{\sqrt{α}})}^{2}}{α + \frac{1}{α} - 1} . \end{matrix} \end{matrix}

Hence, $ζ_{2}^{'} (α) \leq 0$ for $0 < α \leq 1$ , and $ζ_{2}^{'} (α) \geq 0$ for $α \geq 1$ . Thus, the minimum of $ζ_{2}$ is attained at $α = 1$ . Consequently, $ζ_{2} (α) \geq ζ_{2} (1) = 0$ for all $α > 0$ . $□$

It turns out that, up to some constants, the improvement in the augmented log-det barrier can be bounded from below by exactly the same logarithmic function of $ν$ , which was used for the simple log-det barrier.

Lemma 3.4

Let $A, G : E \to E^{*}$ be self-adjoint positive definite linear operators, $\frac{1}{ξ} A ⪯ G ⪯ η A$ for some $ξ, η \geq 1$ . Then, for any $τ \in [0, 1]$ and $u \in E \ {0}$ :

\begin{matrix} ψ (G, A) - ψ ({Broyd}_{τ} (A, G, u), A) \geq \frac{6}{13} ln (1 + (τ \frac{1}{ξ η} + 1 - τ) ν^{2} (A, G, u)) . \end{matrix}

Proof

Indeed, denoting $G_{+} : = {Broyd}_{τ} (A, G, u)$ , we obtain

\begin{matrix} \begin{matrix} ⟨ G^{- 1} - G_{+}^{- 1}, A ⟩ & \overset{(66)}{=} & τ [\frac{⟨ A G^{- 1} A G^{- 1} A u, u ⟩}{⟨ A G^{- 1} A u, u ⟩} - 1] + (1 - τ) [\frac{⟨ A G^{- 1} A u, u ⟩}{⟨ A u, u ⟩} - 1], \end{matrix} \end{matrix}

and

\begin{matrix} Det (G_{+}^{- 1}, G) \overset{(67)}{=} τ \frac{⟨ A u, u ⟩}{⟨ A G^{- 1} A u, u ⟩} + (1 - τ) \frac{⟨ A u, u ⟩}{⟨ G u, u ⟩} . \end{matrix}

Thus,

\begin{matrix} \begin{matrix} ψ (G, A) - ψ (G_{+}, A) \overset{(8)}{=} ⟨ G^{- 1} - G_{+}^{- 1}, A ⟩ + ln Det (G_{+}^{- 1}, G) \\ = τ α_{1} + (1 - τ) α_{0} + ln (τ β_{1}^{- 1} + (1 - τ) β_{0}^{- 1}) - 1 = α - ln β - 1, \end{matrix} \end{matrix}

where we denote $α_{1} : = \frac{⟨ A G^{- 1} A G^{- 1} A u, u ⟩}{⟨ A G^{- 1} A u, u ⟩}$ , $β_{1} : = \frac{⟨ A G^{- 1} A u, u ⟩}{⟨ A u, u ⟩}$ , $α_{0} : = \frac{⟨ A G^{- 1} A u, u ⟩}{⟨ A u, u ⟩}$ , $β_{0} : = \frac{⟨ A u, u ⟩}{⟨ G u, u ⟩}$ , $α : = τ α_{1} + (1 - τ) α_{0}$ , $β : = {(τ β_{1}^{- 1} + (1 - τ) β_{0}^{- 1})}^{- 1}$ . Note that $α_{1} \geq β_{1}$ and $α_{0} \geq β_{0}$ by the Cauchy–Schwartz inequality. At the same time, $τ β_{1} + (1 - τ) β_{2} \geq β$ by the convexity of the inverse function $t \mapsto t^{- 1}$ . Hence, we can apply Lemma 3.3 to estimate (11) from below. Note that

\begin{matrix} \begin{matrix} α + \frac{1}{β} - 1 & = & τ \frac{⟨ (A + A G^{- 1} A G^{- 1} A) u, u ⟩}{⟨ A G^{- 1} A u, u ⟩} + (1 - τ) \frac{⟨ (G + A) u, u ⟩}{⟨ A u, u ⟩} - 1 \\ = & 1 + τ \frac{⟨ (G - A) G^{- 1} A G^{- 1} (G - A) ⟩}{⟨ A G^{- 1} A u, u ⟩} + (1 - τ) \frac{⟨ (G - A) G^{- 1} (G - A) u, u ⟩}{⟨ A u, u ⟩} \\ \geq & 1 + (τ \frac{1}{ξ η} + 1 - τ) \frac{⟨ (G - A) G^{- 1} (G - A) u, u ⟩}{⟨ A u, u ⟩} \\ \overset{(4)}{=} & 1 + (τ \frac{1}{ξ η} + 1 - τ) ν^{2} (A, G, u) . \end{matrix} □ \end{matrix}

The measure $ν (A, G, u)$ , defined in (4), is the ratio of the norm of $(G - A) u$ , measured with respect to G, and the norm of u, measured with respect to A. It is important that we can change the corresponding metrics to $G_{+}$ and G, respectively, by paying only with the minimal eigenvalue of G relative to A.

Lemma 3.5

Let $A, G : E \to E^{*}$ be self-adjoint positive definite linear operators such that $\frac{1}{ξ} A ⪯ G$ for some $ξ > 0$ . Then, for any $τ \in [0, 1]$ , any $u \in E \ {0}$ , and $G_{+} : = {Broyd}_{τ} (A, G, u)$ , we have

\begin{matrix} ν^{2} (A, G, u) \geq \frac{1}{1 + ξ} \frac{⟨ (G - A) G_{+}^{- 1} (G - A) u, u ⟩}{⟨ G u, u ⟩} . \end{matrix}

Proof

From (66), it is easy to see that $G_{+}^{- 1} A u = u$ . Hence,

\begin{matrix} \begin{matrix} \frac{⟨ (G - A) G_{+}^{- 1} (G - A) u, u ⟩}{⟨ G u, u ⟩} & = & \frac{⟨ G G_{+}^{- 1} G u, u ⟩}{⟨ G u, u ⟩} + \frac{⟨ A u, G_{+}^{- 1} A u ⟩}{⟨ G u, u ⟩} - 2 \frac{⟨ G u, G_{+}^{- 1} A u ⟩}{⟨ G u, u ⟩} \\ = & \frac{⟨ G G_{+}^{- 1} G u, u ⟩}{⟨ G u, u ⟩} + \frac{⟨ A u, u ⟩}{⟨ G u, u ⟩} - 2 . \end{matrix} \end{matrix}

Since $1 - t \leq \frac{1}{t} - 1$ for all $t > 0$ , we further have

\begin{matrix} \begin{matrix} \frac{⟨ G G_{+}^{- 1} G u, u ⟩}{⟨ G u, u ⟩} & \overset{(66)}{=} & τ [1 - \frac{{⟨ A u, u ⟩}^{2}}{⟨ G u, u ⟩ ⟨ A G^{- 1} A u, u ⟩} + \frac{⟨ G u, u ⟩}{⟨ A u, u ⟩}] \\ + (1 - τ) [(\frac{⟨ A G^{- 1} A u, u ⟩}{⟨ A u, u ⟩} + 1) \frac{⟨ G u, u ⟩}{⟨ A u, u ⟩} - 1] \\ \leq & (\frac{⟨ A G^{- 1} A u, u ⟩}{⟨ A u, u ⟩} + 1) \frac{⟨ G u, u ⟩}{⟨ A u, u ⟩} - 1 . \end{matrix} \end{matrix}

Denote $ν : = ν (A, G, u)$ . Then,

\begin{matrix} \begin{matrix} ν^{2} & \overset{(4)}{=} & \frac{⟨ (G - A) G^{- 1} (G - A) u, u ⟩}{⟨ A u, u ⟩} = \frac{⟨ G u, u ⟩}{⟨ A u, u ⟩} + \frac{⟨ A G^{- 1} A u, u ⟩}{⟨ A u, u ⟩} - 2 . \end{matrix} \end{matrix}

Consequently,

\begin{matrix} \begin{matrix} (1 + ξ) ν^{2} & \geq & (\frac{⟨ A G^{- 1} A u, u ⟩}{⟨ A u, u ⟩} + 1) ν^{2} \\ \overset{(14)}{=} & (\frac{⟨ A G^{- 1} A u, u ⟩}{⟨ A u, u ⟩} + 1) \frac{⟨ G u, u ⟩}{⟨ A u, u ⟩} + \frac{{⟨ A G^{- 1} A u, u ⟩}^{2}}{{⟨ A u, u ⟩}^{2}} - \frac{⟨ A G^{- 1} A u, u ⟩}{⟨ A u, u ⟩} - 2 \\ \overset{(13)}{\geq} & \frac{⟨ G G_{+}^{- 1} G u, u ⟩}{⟨ A u, u ⟩} + \frac{{⟨ A G^{- 1} A u, u ⟩}^{2}}{⟨ A u, u ⟩} - \frac{⟨ A G^{- 1} A u, u ⟩}{⟨ A u, u ⟩} - 1 . \end{matrix} \end{matrix}

Thus,

\begin{matrix} \begin{matrix} (1 + ξ) ν^{2} - \frac{⟨ (G - A) G_{+}^{- 1} (G - A) u, u ⟩}{⟨ G u, u ⟩} & \overset{(12)}{=} & (1 + ξ) ν^{2} - \frac{⟨ G G_{+}^{- 1} G u, u ⟩}{G u, u ⟩} - \frac{⟨ A u, u ⟩}{⟨ G u, u ⟩} + 2 \\ \overset{(15)}{\geq} & \frac{{⟨ A G^{- 1} A u, u ⟩}^{2}}{{⟨ A u, u ⟩}^{2}} - \frac{⟨ A G^{- 1} A u, u ⟩}{⟨ A u, u ⟩} - \frac{⟨ A u, u ⟩}{⟨ G u, u ⟩} + 1 \\ \geq & \frac{{⟨ A G^{- 1} A u, u ⟩}^{2}}{{⟨ A u, u ⟩}^{2}} - 2 \frac{⟨ A G^{- 1} A u, u ⟩}{⟨ A u, u ⟩} + 1 \geq 0, \end{matrix} \end{matrix}

where we have used the Cauchy–Schwartz inequality $\frac{⟨ A u, u ⟩}{⟨ G u, u ⟩} \leq \frac{⟨ A G^{- 1} A u, u ⟩}{⟨ A u, u ⟩}$ . $□$

Unconstrained Quadratic Minimization

Let us study the convergence properties of the classical quasi-Newton methods from the convex Broyden class, as applied to minimizing the quadratic function

\begin{matrix} f (x) : = \frac{1}{2} ⟨ A x, x ⟩ - ⟨ b, x ⟩, \end{matrix}

where $A : E \to E^{*}$ is a self-adjoint positive definite linear operator, and $b \in E^{*}$ .

Let $B : E \to E^{*}$ be a fixed self-adjoint positive definite linear operator, and let $μ, L > 0$ be such that

\begin{matrix} μ B ⪯ A ⪯ L B . \end{matrix}

Thus, $μ$ is the strong convexity parameter of f, and L is the constant of Lipschitz continuity of the gradient of f, both measured relative to B.

Consider the following standard quasi-Newton process for minimizing (16): For measuring its rate of convergence, we use the norm of the gradient, taken with respect to the Hessian:

\begin{matrix} λ_{k} : = {‖ \nabla f (x_{k}) ‖}_{A}^{*} \overset{(1)}{=} {⟨ \nabla f (x_{k}), A^{- 1} \nabla f (x_{k}) ⟩}^{1 / 2} . \end{matrix}

It is known that the process () has at least a linear convergence rate of the standard gradient method:

Theorem 4.1

(see [27, Theorem 3.1]) In scheme (), for all $k \geq 0$ :

\begin{matrix} A ⪯ G_{k} ⪯ \frac{L}{μ} A, λ_{k} \leq {(1 - \frac{μ}{L})}^{k} λ_{0} . \end{matrix}

Let us establish the superlinear convergence. According to (19), for the quadratic function, we have $A ⪯ G_{k}$ for all $k \geq 0$ . Therefore, in our analysis, we can use both potential functions: the log-det barrier and the augmented log-det barrier. Let us consider both options. We start with the first one.

Theorem 4.2

In scheme (), for all $k \geq 1$ , we have

\begin{matrix} λ_{k} \leq {[\frac{2}{\prod_{i = 0}^{k - 1} {(τ_{i} \frac{μ}{L} + 1 - τ_{i})}^{1 / k}}, (e^{\frac{n}{k} ln \frac{L}{μ}} - 1)]}^{k / 2} \sqrt{\frac{L}{μ}} \cdot λ_{0} . \end{matrix}

Proof

Without loss of generality, we can assume that $u_{i} \neq 0$ for all $0 \leq i \leq k$ . Denote $V_{i} : = V (A, G_{i})$ , $ν_{i} : = ν (A, G_{i}, u_{i})$ , $p_{i} : = τ_{i} \frac{μ}{L} + 1 - τ_{i}$ , $g_{i} : = {‖ \nabla f (x_{i}) ‖}_{G_{i}}^{*}$ for any $0 \leq i \leq k$ . By Lemma 3.2 and (19), for all $0 \leq i \leq k - 1$ , we have $ln (1 + p_{i} ν_{i}^{2}) \leq V_{i} - V_{i + 1}$ . Summing up, we obtain

\begin{matrix} \begin{matrix} \sum_{i = 0}^{k - 1} ln (1 + p_{k} ν_{k}^{2}) \leq V_{0} - V_{k} \overset{(19)}{\leq} V_{0} \overset{(18)}{=} V (A, L B) \\ \overset{(5)}{=} & ln Det (A^{- 1}, L B) \overset{(17)}{\leq} ln Det (\frac{1}{μ} B^{- 1}, L B) = n ln \frac{L}{μ} . \end{matrix} \end{matrix}

Hence, by the convexity of function $t \mapsto ln (1 + e^{t})$ , we get

\begin{matrix} \begin{matrix} \frac{n}{k} ln \frac{L}{μ} & \overset{(21)}{\geq} & \frac{1}{k} \sum_{i = 0}^{k - 1} ln (1 + p_{i} ν_{i}^{2}) = \frac{1}{k} \sum_{i = 0}^{k - 1} ln (1 + e^{ln (p_{i} ν_{i}^{2})}) \\ \geq & ln (1 + e^{\frac{1}{k} \sum_{i = 0}^{k - 1} ln (p_{i} ν_{i}^{2})}) = ln (1 + {[\prod_{i = 0}^{k - 1}, p_{i}, ν_{i}^{2}]}^{1 / k}) . \end{matrix} \end{matrix}

But, for all $0 \leq i \leq k - 1$ , we have $ν_{i}^{2} \geq \frac{1}{2} \frac{⟨ (G_{i} - A) G_{i + 1}^{- 1} (G_{i} - A) u_{i}, u_{i} ⟩}{⟨ G_{i} u_{i}, u_{i} ⟩} = \frac{1}{2} \frac{g_{i + 1}^{2}}{g_{i}^{2}}$ by Lemma 3.5, (19), and since $G_{i} u_{i} = - \nabla f (x_{i})$ , $A u_{i} = \nabla f (x_{i + 1}) - \nabla f (x_{i})$ . Hence, $\prod_{i = 0}^{k - 1} ν_{i}^{2} \geq \frac{1}{2^{k}} \frac{g_{k}^{2}}{g_{0}^{2}}$ , and so $\frac{n}{k} ln \frac{L}{μ} \overset{(22)}{\geq} ln (1 + \frac{1}{2} {[\prod_{i = 0}^{k - 1}, p_{i}]}^{1 / k} {[\frac{g_{k}}{g_{0}}]}^{2 / k})$ . Rearranging, we obtain $g_{k} \leq {[\frac{2}{\prod_{i = 0}^{k - 1} p_{i}^{1 / k}}, (e^{\frac{n}{k} ln \frac{L}{μ}} - 1)]}^{k / 2} g_{0}$ . It remains to note that $λ_{k} \leq \sqrt{\frac{L}{μ}} \cdot g_{k}$ and $g_{0} \leq λ_{0}$ in view of (19). $□$

Remark 4.1

As can be seen from (21), the factor $n ln \frac{L}{μ}$ in (20) can be improved up to $ln Det (A^{- 1}, L B) = \sum_{i = 1}^{n} ln \frac{L}{λ_{i}}$ , where $λ_{1}, \dots, λ_{n}$ are the eigenvalues of A relative to B. This improved factor can be significantly smaller than the original one if the majority of the eigenvalues $λ_{i}$ are much larger than $μ$ .

Let us briefly present another approach, which is based on the augmented log-det barrier. The resulting efficiency estimate will be the same as in Theorem 4.2 up to a slightly worse absolute constant under the exponent. However, this proof can be extended onto general nonlinear functions.

Theorem 4.3

In scheme (), for all $k \geq 1$ , we have

\begin{matrix} λ_{k} \leq {[\frac{2}{\prod_{i = 0}^{k - 1} {(τ_{i} \frac{μ}{L} + 1 - τ_{i})}^{1 / k}}, (e^{\frac{13}{6} \frac{n}{k} ln \frac{L}{μ}} - 1)]}^{k / 2} \sqrt{\frac{L}{μ}} \cdot λ_{0} . \end{matrix}

Proof

Without loss of generality, we can assume that $u_{i} \neq 0$ for all $0 \leq i \leq k$ . Denote $ψ_{i} : = ψ (G_{i}, A)$ , $ν_{i} : = ν (A, G_{i}, u_{i})$ , $p_{i} = τ_{i} \frac{μ}{L} + 1 - τ_{i}$ , $g_{i} : = {‖ \nabla f (x_{i}) ‖}_{G_{i}}^{*}$ for all $0 \leq i \leq k$ . By Lemma 3.4 and (19), for all $0 \leq i \leq k - 1$ , we have $\frac{6}{13} ln (1 + p_{i} ν_{i}^{2}) \leq ψ_{i} - ψ_{i + 1}$ . Hence,

\begin{matrix} \begin{matrix} \frac{6}{13} \sum_{i = 0}^{k - 1} ln (1 + p_{i} ν_{i}^{2}) & \leq & ψ_{0} - ψ_{k} \overset{(9)}{\leq} ψ_{0} \overset{(18)}{=} ψ (L B, A) \\ \overset{(8)}{=} & ln Det (A^{- 1}, L B) - ⟨ \frac{1}{L} B^{- 1}, L B - A ⟩ \overset{(17)}{\leq} n ln \frac{L}{μ}, \end{matrix} \end{matrix}

and we can continue exactly as in the proof of Theorem 4.2. $□$

Minimization of General Functions

In this section, we consider the general unconstrained minimization problem:

\begin{matrix} \begin{matrix} min_{x \in E} f (x), \end{matrix} \end{matrix}

where $f : E \to R$ is a twice continuously differentiable function with positive definite second derivative. Our goal is to study the convergence properties of the following standard quasi-Newton scheme for solving (24): Here, $B : E \to E^{*}$ is a self-adjoint positive definite linear operator, and L is a positive constant, which together define the initial Hessian approximation $G_{0}$ .

We assume that there exist constants $μ > 0$ and $M \geq 0$ , such that

\begin{matrix} μ B ⪯ \nabla^{2} f (x) ⪯ L B, \end{matrix}

\begin{matrix} \nabla^{2} f (y) - \nabla^{2} f (x) ⪯ M {‖ y - x ‖}_{z} \nabla^{2} f (w) \end{matrix}

for all $x, y, z, w \in E$ . The first assumption (26) specifies that, relative to the operator B, the objective function f is $μ$ -strongly convex and its gradient is L-Lipschitz continuous. The second assumption (27) means that f is M-strongly self-concordant. This assumption was recently introduced in [26] as a convenient affine-invariant alternative to the standard assumption of the Lipschitz second derivative and is satisfied at least for any strongly convex function with Lipschitz continuous Hessian (see [26, Example 4.1]). The main facts, which we use about strongly self-concordant functions, are summarized in the following lemma (see [26, Lemma 4.1]):

Lemma 5.1

For any $x, y \in E$ , $J : = \int_{0}^{1} \nabla^{2} f (x + t (y - x)) d t$ , $r : = {‖ y - x ‖}_{x}$ :

\begin{matrix} {(1 + \frac{Mr}{2})}^{- 1} \nabla^{2} f (x) ⪯ J ⪯ (1 + \frac{Mr}{2}) \nabla^{2} f (x), \end{matrix}

\begin{matrix} {(1 + \frac{Mr}{2})}^{- 1} \nabla^{2} f (y) ⪯ J ⪯ (1 + \frac{Mr}{2}) \nabla^{2} f (y) . \end{matrix}

Note that for a quadratic function, we have $M = 0$ .

For measuring the convergence rate of (), we use the local gradient norm:

\begin{matrix} λ_{k} : = {‖ \nabla f (x_{k}) ‖}_{x_{k}}^{*} \overset{(1)}{=} {⟨ \nabla f (x_{k}), \nabla^{2} f {(x_{k})}^{- 1} \nabla f (x_{k}) ⟩}^{1 / 2} . \end{matrix}

The local convergence analysis of the scheme () is, in general, the same as the corresponding analysis in the quadratic case. However, it is much more technical due to the fact that, in the nonlinear case, the Hessian is no longer constant. This causes a few problems.

First, there are now several different ways how one can treat the Hessian approximation $G_{k}$ . One can view it as an approximation to the Hessian $\nabla^{2} f (x_{k})$ at the current iterate $x_{k}$ , to the Hessian $\nabla^{2} f (x^{*})$ at the minimizer $x^{*}$ , to the integral Hessian $J_{k}$ , etc. Of course, locally, due to strong self-concordancy, all these variants are equivalent since the corresponding Hessians are close to each other. Nevertheless, from the viewpoint of technical simplicity of the analysis, some options are slightly more preferable than others. We find it to be the most convenient to always think of $G_{k}$ as an approximation to the integral Hessian $J_{k}$ .

The second issue is as follows. Suppose we already know what is the connection between our current Hessian approximation $G_{k}$ and the actual integral Hessian $J_{k}$ , e.g., in terms of the relative eigenvalues and the value of the augmented log-det barrier potential function (8). Naturally, we want to know how these quantities change after we update $G_{k}$ into $G_{k + 1}$ at Step 4 of the scheme (). For this, we apply Lemma 3.1 and Lemma 3.4, respectively. However, the problem is that both of these lemmas will provide us only with the information on the connection between the update result $G_{k + 1}$ and the current integral Hessian $J_{k}$ (which was used for performing the update), not the next one $J_{k + 1}$ . Therefore, we need to additionally take into account the errors, resulting from approximating $J_{k + 1}$ by $J_{k}$ .

For estimating the errors, which accumulate as a result of approximating one Hessian by another, it is convenient to introduce the following quantities2:

\begin{matrix} r_{k} : = {‖ u_{k} ‖}_{x_{k}}, ξ_{k} : = e^{M \sum_{i = 0}^{k - 1} r_{i}} (\geq 1), k \geq 0 . \end{matrix}

Remark 5.1

The general framework of our analysis is the same as in the previous paper [27]. The main difference is that now another potential function is used for establishing the rate of superlinear convergence (Lemma 5.4). However, in order to properly incorporate the new potential function into the analysis, many parts in the proof had to be appropriately modified, most notably the part, related to estimating the region of local convergence. In any case, the analysis, presented below, is fully self-contained and does not require the reader first go through [27].

We analyze the method () in several steps. The first step is to establish the bounds on the relative eigenvalues of the Hessian approximations with respect to the corresponding Hessians.

Lemma 5.2

For all $k \geq 0$ , we have

\begin{matrix} \frac{1}{ξ_{k}} \nabla^{2} f (x_{k}) ⪯ G_{k} ⪯ ξ_{k} \frac{L}{μ} \nabla^{2} f (x_{k}), \end{matrix}

\begin{matrix} \frac{1}{ξ_{k + 1}} J_{k} ⪯ G_{k} ⪯ ξ_{k + 1} \frac{L}{μ} J_{k} . \end{matrix}

Proof

For $k = 0$ , (32) follows from (26) and the fact that $G_{0} = L B$ and $ξ_{0} = 1$ . Now suppose that $k \geq 0$ , and that (32) has already been proved for all indices up to k. Then, applying Lemma 5.1 to (32), we obtain

\begin{matrix} \frac{1}{ξ_{k} (1 + \frac{M r_{k}}{2})} J_{k} ⪯ G_{k} ⪯ (1 + \frac{M r_{k}}{2}) ξ_{k} \frac{L}{μ} J_{k} . \end{matrix}

Since $(1 + \frac{M r_{k}}{2}) ξ_{k} \leq ξ_{k + 1}$ by (31), this proves (33) for the index k. Applying Lemma 3.1 to (34), we get $\frac{1}{ξ_{k} (1 + \frac{M r_{k}}{2})} J_{k} ⪯ G_{k + 1} ⪯ (1 + \frac{M r_{k}}{2}) ξ_{k} \frac{L}{μ} J_{k}$ , and so

\begin{matrix} \begin{matrix} G_{k + 1} & \overset{(29)}{⪯} & {(1 + \frac{M r_{k}}{2})}^{2} ξ_{k} \frac{L}{μ} \nabla^{2} f (x_{k + 1}) \overset{(31)}{⪯} ξ_{k + 1} \frac{L}{μ} \nabla^{2} f (x_{k + 1}), \\ G_{k + 1} & \overset{(29)}{⪰} & \frac{1}{{(1 + \frac{M r_{k}}{2})}^{2} ξ_{k}} \nabla^{2} f (x_{k + 1}) \overset{(31)}{⪰} \frac{1}{ξ_{k + 1}} \nabla^{2} f (x_{k + 1}) . \end{matrix} \end{matrix}

This proves (32) for the index $k + 1$ , and we can continue by induction. $□$

Corollary 5.1

For all $k \geq 0$ , we have

\begin{matrix} r_{k} \leq ξ_{k} λ_{k} . \end{matrix}

Proof

Indeed,

\begin{matrix} \begin{matrix} r_{k} \overset{(31)}{=} {‖ u_{k} ‖}_{x_{k}} & \overset{(35)}{=} & {⟨ \nabla f (x_{k}), G_{k}^{- 1} \nabla^{2} f (x_{k}) G_{k}^{- 1} \nabla f (x_{k}) ⟩}^{1 / 2} \\ \overset{(32)}{\leq} & ξ_{k} {⟨ \nabla f (x_{k}), \nabla^{2} f {(x_{k})}^{- 1} \nabla f (x_{k}) ⟩}^{1 / 2} \overset{(30)}{=} ξ_{k} λ_{k} . □ \end{matrix} \end{matrix}

The second step in our analysis is to establish a preliminary version of the linear convergence theorem for the scheme ().

Lemma 5.3

For all $k \geq 0$ , we have

\begin{matrix} λ_{k} \leq \sqrt{ξ_{k}} λ_{0} \prod_{i = 0}^{k - 1} q_{i}, \end{matrix}

where

\begin{matrix} q_{i} : = max \{1 - \frac{μ}{ξ_{i + 1} L}, ξ_{i + 1} - 1\} . \end{matrix}

Proof

Let $k, i \geq 0$ be arbitrary. By Taylor’s formula, we have

\begin{matrix} \nabla f (x_{i + 1}) \overset{(25)}{=} \nabla f (x_{i}) + J_{i} u_{i} \overset{(25)}{=} J_{i} (J_{i}^{- 1} - G_{i}^{- 1}) \nabla f (x_{i}) . \end{matrix}

Hence,

\begin{matrix} \begin{matrix} λ_{i + 1} & \overset{(30)}{=} & {⟨ \nabla f (x_{i + 1}), \nabla^{2} f {(x_{i + 1})}^{- 1} \nabla f (x_{i + 1}) ⟩}^{1 / 2} \\ \overset{(29)}{\leq} & \sqrt{1 + \frac{M r_{i}}{2}} {⟨ \nabla f (x_{i + 1}), J_{i}^{- 1} \nabla f (x_{i + 1}) ⟩}^{1 / 2} \\ \overset{(38)}{=} & \sqrt{1 + \frac{M r_{i}}{2}} {⟨ \nabla f (x_{i}), (J_{i}^{- 1} - G_{i}^{- 1}) J_{i} (J_{i}^{- 1} - G_{i}^{- 1}) \nabla f (x_{i}) ⟩}^{1 / 2} . \end{matrix} \end{matrix}

Note that $- (ξ_{i + 1} - 1) J_{i}^{- 1} \overset{(33)}{⪯} J_{i}^{- 1} - G_{i}^{- 1} \overset{(33)}{⪯} (1 - \frac{μ}{ξ_{i + 1} L}) J_{i}^{- 1}$ . Therefore,

\begin{matrix} (J_{i}^{- 1} - G_{i}^{- 1}) J_{i} (J_{i}^{- 1} - G_{i}^{- 1}) \overset{(37)}{⪯} q_{i}^{2} J_{i}^{- 1} \overset{(28)}{⪯} q_{i}^{2} (1 + \frac{M r_{i}}{2}) \nabla^{2} f {(x_{i})}^{- 1} . \end{matrix}

Thus, $λ_{i + 1} \leq (1 + \frac{M r_{i}}{2}) q_{i} λ_{i}$ in view of (39) and (30). Consequently,

\begin{matrix} λ_{k} \leq λ_{0} \prod_{i = 0}^{k - 1} (1 + \frac{M r_{i}}{2}) q_{i} \leq λ_{0} \prod_{i = 0}^{k - 1} e^{\frac{M r_{i}}{2}} q_{i} \overset{(31)}{=} \sqrt{ξ_{k}} λ_{0} \prod_{i = 0}^{k - 1} q_{i} . □ \end{matrix}

Next, we establish a preliminary version of the theorem on superlinear convergence of the scheme (). The proof uses the augmented log-det barrier potential function and is essentially a generalization of the corresponding proof of Theorem 4.3.

Lemma 5.4

For all $k \geq 1$ , we have

\begin{matrix} λ_{k} \leq {[\frac{1 + ξ_{k}}{\prod_{i = 0}^{k - 1} {(τ_{i} \frac{μ}{ξ_{i + 1}^{2} L} + 1 - τ_{i})}^{1 / k}}, (e^{\frac{13}{6} \frac{n}{k} ln (ξ_{k + 1}^{ξ_{k + 1}}, \frac{L}{μ})} - 1)]}^{k / 2} \sqrt{ξ_{k} \frac{L}{μ}} \cdot λ_{0} . \end{matrix}

Proof

Without loss of generality, assume that $u_{i} \neq 0$ for all $0 \leq i \leq k$ . Denote $ψ_{i} : = ψ (G_{i}, J_{i})$ , ${\tilde{ψ}}_{i + 1} : = ψ (G_{i + 1}, J_{i})$ , $ν_{i} : = ν (J_{i}, G_{i}, u_{i})$ , $p_{i} : = τ_{i} \frac{μ}{ξ_{i + 1}^{2} L} + 1 - τ_{i}$ , and $g_{i} : = {‖ \nabla f (x_{i}) ‖}_{G_{i}}^{*}$ for any $0 \leq i \leq k$ .

Let $0 \leq i \leq k - 1$ be arbitrary. By Lemma 3.4 and (33), we have

\begin{matrix} \frac{6}{13} ln (1 + p_{i} ν_{i}^{2}) \leq ψ_{i} - {\tilde{ψ}}_{i + 1} = ψ_{i} - ψ_{i + 1} + Δ_{i}, \end{matrix}

where

\begin{matrix} Δ_{i} : = ψ_{i + 1} - {\tilde{ψ}}_{i + 1} \overset{(8)}{=} ⟨ G_{i + 1}^{- 1}, J_{i + 1} - J_{i} ⟩ + ln Det (J_{i + 1}^{- 1}, J_{i}) . \end{matrix}

Note that $J_{i} ⪰ {(1 + \frac{M r_{i}}{2})}^{- 1} \nabla^{2} f (x_{i + 1}) ⪰ {(1 + \frac{M r_{i}}{2})}^{- 1} {(1 + \frac{M r_{i + 1}}{2})}^{- 1} J_{i + 1}$ in view of (29) and (28). In particular, $J_{i} ⪰ e^{- \frac{M}{2} (r_{i} + r_{i + 1})} J_{i + 1} ⪰ (1 - \frac{M}{2} (r_{i} + r_{i + 1})) J_{i + 1}$ . Therefore, $J_{i + 1} - J_{i} ⪯ \frac{M}{2} (r_{i} + r_{i + 1}) J_{i + 1}$ , and so

\begin{matrix} \begin{matrix} \sum_{i = 0}^{k - 1} ⟨ G_{i + 1}^{- 1}, J_{i + 1} - J_{i} ⟩ & \leq & \frac{M}{2} \sum_{i = 0}^{k - 1} (r_{i} + r_{i + 1}) ⟨ G_{i + 1}^{- 1}, J_{i + 1} ⟩ \\ \overset{(33)}{\leq} & n \frac{M}{2} \sum_{i = 0}^{k - 1} ξ_{i + 2} (r_{i} + r_{i + 1}) \overset{(31)}{\leq} n ξ_{k + 1} \frac{M}{2} \sum_{i = 0}^{k - 1} (r_{i} + r_{i + 1}) \\ \leq & n ξ_{k + 1} M \sum_{i = 0}^{k} r_{i} \overset{(31)}{=} n ξ_{k + 1} ln ξ_{k + 1} . \end{matrix} \end{matrix}

Consequently,

\begin{matrix} \sum_{i = 0}^{k - 1} Δ_{i} \overset{(42)}{\leq} n ξ_{k + 1} ln ξ_{k + 1} + ln Det (J_{k}^{- 1}, J_{0}) . \end{matrix}

Summing up (41), we thus obtain

\begin{matrix} \begin{matrix} \frac{6}{13} \sum_{i = 0}^{k - 1} ln (1 + p_{i} ν_{i}^{2}) \leq ψ_{0} - ψ_{k} + \sum_{i = 0}^{k - 1} Δ_{i} \overset{(9)}{\leq} ψ_{0} + \sum_{i = 0}^{k - 1} Δ_{i} \\ \overset{(8)}{=} & ln Det (J_{0}^{- 1}, L B) - ⟨ \frac{1}{L} B^{- 1}, L B - J_{0} ⟩ + \sum_{i = 0}^{k - 1} Δ_{i} \\ \overset{(43)}{\leq} & ln Det (J_{k}^{- 1}, L B) - ⟨ \frac{1}{L} B^{- 1}, L B - J_{0} ⟩ + n ξ_{k + 1} ln ξ_{k + 1} \\ \overset{(26)}{\leq} & n ln \frac{L}{μ} + n ξ_{k + 1} ln ξ_{k + 1} = n ln (ξ_{k + 1}^{ξ_{k + 1}}, \frac{L}{μ}) . \end{matrix} \end{matrix}

By the convexity of function $t \mapsto ln (1 + e^{t})$ , it follows that

\begin{matrix} \begin{matrix} \frac{13}{6} \frac{n}{k} ln (ξ_{k + 1}^{ξ_{k + 1}}, \frac{L}{μ}) \geq \frac{1}{k} \sum_{i = 0}^{k - 1} ln (1 + p_{i} ν_{i}^{2}) = \frac{1}{k} \sum_{i = 0}^{k - 1} ln (1 + e^{ln (p_{i} ν_{i}^{2})}) \\ \geq & ln (1 + e^{\frac{1}{k} \sum_{i = 0}^{k - 1} ln (p_{i} ν_{i}^{2})}) = ln (1 + {[\prod_{i = 0}^{k - 1}, p_{i}, ν_{i}^{2}]}^{1 / k}) . \end{matrix} \end{matrix}

At the same time, $ν_{i}^{2} \geq \frac{1}{1 + ξ_{i + 1}} \frac{⟨ (G_{i} - J_{i}) G_{i + 1}^{- 1} (G_{i} - J_{i}) u_{i}, u_{i} ⟩}{⟨ G_{i} u_{i}, u_{i} ⟩} = \frac{1}{1 + ξ_{i + 1}} \frac{g_{i + 1}^{2}}{g_{i}^{2}}$ in view of Lemma 3.5, (33) and since $G_{i} u_{i} = - \nabla f (x_{i})$ , $J_{i} u_{i} = \nabla f (x_{i + 1}) - \nabla f (x_{i})$ . Hence, we can write $\prod_{i = 0}^{k - 1} ν_{i}^{2} \geq \frac{g_{k}^{2}}{g_{0}^{2}} \prod_{i = 0}^{k - 1} \frac{1}{1 + ξ_{i + 1}} \overset{(31)}{\geq} \frac{1}{{(1 + ξ_{k})}^{k}} \frac{g_{k}^{2}}{g_{0}^{2}}$ . Consequently, $\frac{13}{6} \frac{n}{k} ln (ξ_{k + 1}^{ξ_{k + 1}} \frac{L}{μ}) \overset{(44)}{\geq} ln (1 + \frac{\prod_{i = 0}^{k - 1} p_{i}^{1 / k}}{1 + ξ_{k}} {[\frac{g_{k}}{g_{0}}]}^{2 / k})$ . Rearranging, we obtain that $g_{k} \leq {[\frac{1 + ξ_{k}}{\prod_{i = 0}^{k - 1} p_{i}^{1 / k}}, (e^{\frac{13}{6} \frac{n}{k} ln (ξ_{k + 1}^{ξ_{k + 1}} \frac{L}{μ})} - 1)]}^{k / 2} g_{0}$ . But $λ_{k} \leq \sqrt{ξ_{k} \frac{L}{μ}} \cdot g_{k}$ by (32), and $g_{0} \leq λ_{0}$ in view of (26) and the fact that $G_{0} = L B$ . $□$

In the quadratic case ( $M = 0$ ), we have $ξ_{k} \equiv 1$ (see (31)), and Lemmas 5.2 and 5.3 reduce to the already known Theorem 4.1, and Lemma 5.4 reduces to the already known Theorem 4.2. In the general case, the quantities $ξ_{k}$ can grow with iterations. However, as we will see in a moment, by requiring the initial point $x_{0}$ in the scheme () to be sufficiently close to the solution, we can still ensure that $ξ_{k}$ stay uniformly bounded by a sufficiently small absolute constant. This allows us to recover all the main results of the quadratic case.

To write down the region of local convergence of (), we need to introduce one more quantity, related to the starting moment of superlinear convergence3:

\begin{matrix} K_{0} : = ⌈\frac{1}{τ \frac{4 μ}{9 L} + 1 - τ} 8 n ln \frac{2 L}{μ}⌉, τ : = sup_{k \geq 0} τ_{k} (\leq 1) . \end{matrix}

For DFP ( $τ_{k} \equiv 1$ ) and BFGS ( $τ_{k} \equiv 0)$ , we have, respectively,

\begin{matrix} K_{0}^{DFP} = ⌈\frac{18 n L}{μ} ln \frac{2 L}{μ}⌉, K_{0}^{BFGS} = ⌈8 n ln \frac{2 L}{μ}⌉ . \end{matrix}

Now we are ready to prove the main result of this section.

Theorem 5.1

Suppose that, in scheme (), we have

\begin{matrix} M λ_{0} \leq \frac{ln \frac{3}{2}}{{(\frac{3}{2})}^{\frac{3}{2}}} max \{\frac{μ}{2 L}, \frac{1}{K_{0} + 9}\} . \end{matrix}

Then, for all $k \geq 0$ ,

\begin{matrix} \frac{2}{3} \nabla^{2} f (x_{k}) ⪯ G_{k} ⪯ \frac{3 L}{2 μ} \nabla^{2} f (x_{k}), \end{matrix}

\begin{matrix} λ_{k} \leq {(1 - \frac{μ}{2 L})}^{k} \sqrt{\frac{3}{2}} \cdot λ_{0}, \end{matrix}

and, for all $k \geq 1$ ,

\begin{matrix} λ_{k} \leq {[\frac{5}{2 \prod_{i = 0}^{k - 1} {(τ_{i} \frac{4 μ}{9 L} + 1 - τ_{i})}^{1 / k}}, (e^{\frac{13}{6} \frac{n}{k} ln \frac{2 L}{μ}} - 1)]}^{k / 2} \sqrt{\frac{3 L}{2 μ}} \cdot λ_{0} . \end{matrix}

Proof

Let us prove by induction that, for all $k \geq 0$ , we have

\begin{matrix} ξ_{k} \leq \frac{3}{2} . \end{matrix}

Clearly, (51) is satisfied for $k = 0$ since $ξ_{0} = 1$ . It is also satisfied for $k = 1$ since $ξ_{1} \overset{(31)}{=} e^{M r_{0}} \overset{(35)}{\leq} e^{ξ_{0} M λ_{0}} \overset{(31)}{=} e^{M λ_{0}} \overset{(47)}{\leq} \frac{3}{2}$ .

Now let $k \geq 0$ , and suppose that (51) has already been proved for all indices up to $k + 1$ . Then, applying Lemma 5.2, we obtain (48) for all indices up to $k + 1$ . Applying now Lemma 5.3 and using for all $0 \leq i \leq k$ the relation $q_{i} \overset{(37)}{=} max {1 - \frac{μ}{ξ_{i + 1} L}, ξ_{i + 1} - 1} \overset{(51)}{\leq} max {1 - \frac{2 μ}{3 L}, \frac{1}{2}} \leq 1 - \frac{μ}{2 L}$ , we obtain (49) for all indices up to $k + 1$ . Finally, if $k \geq 1$ , then, applying Lemma 5.4 and using that $ξ_{i + 1}^{ξ_{i + 1}} \overset{(51)}{\leq} {(\frac{3}{2})}^{\frac{3}{2}} = \frac{3}{2} \sqrt{\frac{3}{2}} \leq \frac{3}{2} (1 + \frac{1}{4}) = \frac{15}{8} \leq 2$ for all $0 \leq i \leq k$ , we obtain (50) for all indices up to k. Thus, at this moment, (48) and (49) are proved for all indices up to $k + 1$ , while (50) is proved only up to k.

To finish the inductive step, it remains to prove that (51) is satisfied for the index $k + 2$ , or, equivalently, in view of (31), that $M \sum_{i = 0}^{k + 1} r_{i} \leq ln \frac{3}{2}$ . Since $M \sum_{i = 0}^{k + 1} r_{i} \leq M \sum_{i = 0}^{k + 1} ξ_{i} λ_{i} \leq \frac{3}{2} M \sum_{i = 0}^{k + 1} λ_{i}$ in view of (35) and (51), respectively, it suffices to show that $\frac{3}{2} M \sum_{i = 0}^{k + 1} λ_{i} \leq ln \frac{3}{2}$ .

Note that

\begin{matrix} \frac{3}{2} M \sum_{i = 0}^{k + 1} λ_{i} \overset{(49)}{\leq} {(\frac{3}{2})}^{\frac{3}{2}} M λ_{0} \sum_{i = 0}^{k + 1} {(1 - \frac{μ}{2 L})}^{i} \leq {(\frac{3}{2})}^{\frac{3}{2}} \frac{2 L}{μ} M λ_{0} . \end{matrix}

Therefore, if we could prove that

\begin{matrix} \frac{3}{2} M \sum_{i = 0}^{k + 1} λ_{i} \leq {(\frac{3}{2})}^{\frac{3}{2}} (K_{0} + 9) M λ_{0}, \end{matrix}

then, combining (52) and (53), we would obtain

\begin{matrix} \frac{3}{2} M \sum_{i = 0}^{k + 1} λ_{i} \leq {(\frac{3}{2})}^{\frac{3}{2}} min \{\frac{2 L}{μ}, K_{0} + 9\} M λ_{0} \overset{(47)}{\leq} ln \frac{3}{2}, \end{matrix}

which is exactly what we need. Let us prove (53). If $k \leq K_{0}$ , in view of (49), we have $\frac{3}{2} M \sum_{i = 0}^{k + 1} λ_{i} \leq {(\frac{3}{2})}^{\frac{3}{2}} (k + 2) M λ_{0} \leq {(\frac{3}{2})}^{\frac{3}{2}} (K_{0} + 2) M λ_{0}$ , and (53) follows. Therefore, from now on, we can assume that $k \geq K_{0}$ . Then4,

\begin{matrix} \begin{matrix} \frac{3}{2} M \sum_{i = 0}^{k + 1} λ_{i} & = & \frac{3}{2} M (\sum_{i = 0}^{K_{0} - 1} λ_{i} + λ_{k + 1}) + \frac{3}{2} M \sum_{i = K_{0}}^{k} λ_{i} \\ \overset{(49)}{\leq} & {(\frac{3}{2})}^{\frac{3}{2}} (K_{0} + 1) M λ_{0} + \frac{3}{2} M \sum_{i = K_{0}}^{k} λ_{i} . \end{matrix} \end{matrix}

It remains to show $\frac{3}{2} M \sum_{i = K_{0}}^{k} λ_{i} \leq {(\frac{3}{2})}^{\frac{3}{2}} 8 M λ_{0}$ . We can do this using (50).

First, let us make some estimations. Clearly, for all $0 < t < 1$ , we have $e^{t} = \sum_{j = 0}^{\infty} \frac{t^{j}}{j!} \leq 1 + t + \frac{t^{2}}{2} \sum_{j = 0}^{\infty} t^{j} = 1 + t (1 + \frac{t}{2 (1 - t)})$ . Hence, for all $0 < t \leq 1$ , we obtain $e^{\frac{13 t}{48}} - 1 \leq \frac{13 t}{48} (1 + \frac{\frac{13}{48}}{2 (1 - \frac{13}{48})}) = \frac{13 t}{48} \cdot \frac{83}{70} \leq \frac{13 t}{48} \cdot \frac{6}{5} = \frac{13 t}{40}$ , and so

\begin{matrix} {[\frac{5}{2 t}, (e^{\frac{13 t}{48}} - 1)]}^{1 / 2} \leq \sqrt{\frac{5}{2 t} \cdot \frac{13 t}{40}} = \sqrt{\frac{13}{16}} \leq \frac{11}{12} . \end{matrix}

At the same time, $\frac{11}{12} = 1 - \frac{1}{12} \leq e^{- \frac{1}{12}}$ . Hence,

\begin{matrix} \begin{matrix} {(\frac{11}{12})}^{K_{0}} \sqrt{\frac{L}{μ}} & \overset{(45)}{\leq} & {(\frac{11}{12})}^{8 ln \frac{2 L}{μ}} \sqrt{\frac{L}{μ}} \leq e^{- \frac{2}{3} ln \frac{2 L}{μ}} \sqrt{\frac{L}{μ}} = {(\frac{2 L}{μ})}^{- \frac{2}{3}} \sqrt{\frac{L}{μ}} \\ = & 2^{- \frac{2}{3}} {(\frac{L}{μ})}^{- \frac{1}{6}} \leq 2^{- \frac{2}{3}} \leq \frac{2}{3} . \end{matrix} \end{matrix}

Thus, for all $K_{0} \leq i \leq k$ , and $p : = τ \frac{4 μ}{9 L} + 1 - τ \overset{(45)}{\leq} \prod_{j = 0}^{i - 1} {(τ_{i} \frac{4 μ}{9 L} + 1 - τ_{i})}^{1 / i}$ :

\begin{matrix} \begin{matrix} λ_{i} & \overset{(50)}{\leq} & {[\frac{5}{2 p}, (e^{\frac{13}{6} \frac{n}{i} ln \frac{2 L}{μ}} - 1)]}^{i / 2} \sqrt{\frac{3 L}{2 μ}} \cdot λ_{0} \\ \overset{(45)}{\leq} & {[\frac{5}{2 p}, (e^{\frac{13 p}{48}} - 1)]}^{i / 2} \sqrt{\frac{3 L}{2 μ}} \cdot λ_{0} \overset{(54)}{\leq} {(\frac{11}{12})}^{i} \sqrt{\frac{3 L}{2 μ}} \cdot λ_{0} \\ = & {(\frac{11}{12})}^{i - K_{0}} {(\frac{11}{12})}^{K_{0}} \sqrt{\frac{3 L}{2 μ}} \cdot λ_{0} \overset{(55)}{\leq} {(\frac{11}{12})}^{i - K_{0}} \frac{2}{3} \cdot \sqrt{\frac{3}{2}} \cdot λ_{0} . \end{matrix} \end{matrix}

Hence, $\frac{3}{2} M \sum_{i = K_{0}}^{k} λ_{i} \leq {(\frac{3}{2})}^{\frac{3}{2}} M λ_{0} \cdot \frac{2}{3} \sum_{i = K_{0}}^{k} {(\frac{11}{12})}^{i - K_{0}} \leq {(\frac{3}{2})}^{\frac{3}{2}} 8 M λ_{0}$ . $□$

Remark 5.2

In accordance with Theorem 5.1, the parameter M of strong self-concordancy affects only the size of the region of local convergence of the process (), and not its rate of convergence. We do not know whether this is an artifact of the analysis or not, but it might be an interesting topic for future research. For a quadratic function, we have $M = 0$ , and so the scheme () is globally convergent.

The region of local convergence, specified by (47), depends on the maximum of two quantities: $\frac{μ}{L}$ and $\frac{1}{K_{0}}$ . For DFP, the $\frac{1}{K_{0}}$ part in this maximum is in fact redundant, and its region of local convergence is simply inversely proportional to the condition number: $O (\frac{μ}{L})$ . However, for BFGS, the $\frac{1}{K_{0}}$ part does not disappear, and we obtain the following region of local convergence:

\begin{matrix} M λ_{0} \leq max \{O (\frac{μ}{L}), O (\frac{1}{n ln \frac{2 L}{μ}})\} . \end{matrix}

Clearly, the latter region can be much bigger than the former when the condition number $\frac{L}{μ}$ is significantly larger than the dimension n.

Remark 5.3

The previous estimate of the size of the region of local convergence, established in [27], was $O (\frac{μ}{L})$ for both DFP and BFGS.

Example 5.1

Consider the functions

\begin{matrix} f (x) : = f_{0} (x) + \frac{μ}{2} {‖ x ‖}^{2}, f_{0} (x) : = ln (\sum_{i = 1}^{m}, e^{⟨ a_{i}, x ⟩ + b_{i}}), x \in E, \end{matrix}

where $a_{i} \in E^{*}$ , $b_{i} \in R$ , $i = 1, \dots, m$ , $μ > 0$ , and $‖ \cdot ‖$ is the Euclidean norm, induced by the operator B. Let $γ > 0$ be such that

\begin{matrix} ‖ a_{i} ‖_{*} \leq γ, i = 1, \dots, m, \end{matrix}

where ${‖ \cdot ‖}_{*}$ is the norm conjugate to $‖ \cdot ‖$ . Define

\begin{matrix} π_{i} (x) : = \frac{e^{⟨ a_{i}, x ⟩ + b_{i}}}{\sum_{j = 1}^{m} e^{⟨ a_{j}, x ⟩ + b_{j}}}, x \in E, i = 1, \dots, m . \end{matrix}

Clearly, $\sum_{i = 1}^{m} π_{i} (x) = 1$ , $π_{i} (x) > 0$ for all $x \in E$ , $i = 1, \dots, m$ . It is not difficult to check that, for all $x, h \in E$ , we have5

\begin{matrix} \begin{matrix} ⟨ \nabla f_{0} (x), h ⟩ & = & \sum_{i = 1}^{m} π_{i} (x) ⟨ a_{i}, h ⟩ \leq γ . \\ ⟨ \nabla^{2} f_{0} (x) h, h ⟩ & = & \sum_{i = 1}^{m} π_{i} (x) {⟨ a_{i} - \nabla f_{0} (x), h ⟩}^{2} \\ = & \sum_{i = 1}^{m} π_{i} (x) {⟨ a_{i}, h ⟩}^{2} - {⟨ \nabla f_{0} (x), h ⟩}^{2} \leq γ^{2} {‖ h ‖}^{2}, \\ D^{3} f_{0} (x) [h, h, h] & = & \sum_{i = 1}^{m} π_{i} (x) {⟨ a_{i} - \nabla f_{0} (x), h ⟩}^{3} \\ \leq & 2 γ ‖ h ‖ ⟨ \nabla^{2} f_{0} (x) h, h ⟩ \leq 2 γ^{3} {‖ h ‖}^{3} . \end{matrix} \end{matrix}

Thus, $f_{0}$ is a convex function with $γ^{2}$ -Lipschitz gradient and $(2 γ^{3})$ -Lipschitz Hessian. Consequently, the function f is $μ$ -strongly convex with L-Lipschitz gradient, $(2 γ^{3})$ -Lipschitz Hessian, and, in view of [26, Example 4.1], M-strongly self-concordant, where

\begin{matrix} L : = γ^{2} + μ, M : = & \frac{2 γ^{3}}{μ^{3 / 2}} . \end{matrix}

Let the regularization parameter $μ$ be sufficiently small, namely $μ \leq γ^{2}$ . Denote $Q : = \frac{γ^{2}}{μ} \geq 1$ . Then, $Q \leq \frac{L}{μ} \leq 2 Q$ , $M = 2 Q^{3 / 2}$ , so, according to (47), the region of local convergence of BFGS can be described as follows:

\begin{matrix} λ_{0} \leq max \{O (\frac{1}{Q^{5 / 2}}), O (\frac{1}{n Q^{3 / 2} ln (4 Q)})\} . □ \end{matrix}

Discussion

Let us compare the new convergence rates, obtained in this paper for the classical DFP and BFGS methods, with the previously known ones from [27]. Since the estimates for the general nonlinear case differ from those for the quadratic one just in absolute constants, we only discuss the latter case.

In what follows, we use our standard notation: n is the dimension of the space, $μ$ is the strong convexity parameter, L is the Lipschitz constant of the gradient, and $λ_{k}$ is the local norm of the gradient at the kth iteration.

For BFGS, the previously known rate (see [27, Theorem 3.2]) is

\begin{matrix} λ_{k} \leq {(\frac{nL}{μ k})}^{k / 2} λ_{0} . \end{matrix}

Although (56) is formally valid for all $k \geq 1$ , it becomes useful6 only after

\begin{matrix} {\hat{K}}_{0}^{BFGS} : = \frac{nL}{μ} \end{matrix}

iterations. Thus, ${\hat{K}}_{0}^{BFGS}$ can be thought of as the starting moment of the superlinear convergence, according to the estimate (56).

In this paper, we have obtained a new estimate (Theorem 4.2):

\begin{matrix} λ_{k} \leq {[2, (e^{\frac{n}{k} ln \frac{L}{μ}} - 1)]}^{k / 2} \sqrt{\frac{L}{μ}} \cdot λ_{0} . \end{matrix}

Its starting moment of superlinear convergence can be described as follows:

\begin{matrix} K_{0}^{BFGS} : = 4 n ln \frac{L}{μ} . \end{matrix}

Indeed, since $e^{t} \leq \frac{1}{1 - t} = 1 + \frac{t}{1 - t}$ for any $t < 1$ , we have, for all $k \geq K_{0}^{BFGS}$ ,

\begin{matrix} e^{\frac{n}{k} ln \frac{L}{μ}} - 1 \leq \frac{\frac{n}{k} ln \frac{L}{μ}}{1 - \frac{n}{k} ln \frac{L}{μ}} \overset{(59)}{\leq} \frac{\frac{n}{k} ln \frac{L}{μ}}{1 - \frac{1}{4}} = \frac{4 n}{3 k} ln \frac{L}{μ} . \end{matrix}

At the same time, for all $k \geq K_{0}^{BFGS}$ :

\begin{matrix} \sqrt{\frac{L}{μ}} = e^{\frac{1}{2} ln \frac{L}{μ}} \overset{(59)}{\leq} e^{\frac{k}{8}} = {(e^{\frac{1}{4}})}^{k / 2} \leq {(\frac{4}{3})}^{k / 2} \leq {(\frac{3}{2})}^{k / 2} . \end{matrix}

Hence, according the new estimate (58), for all $k \geq K_{0}^{BFGS}$ :

\begin{matrix} λ_{k} \overset{(60)}{\leq} {(\frac{8 n}{3 k} ln \frac{L}{μ})}^{k / 2} \sqrt{\frac{L}{μ}} \cdot λ_{0} \overset{(61)}{\leq} {(\frac{4 n}{k} ln \frac{L}{μ})}^{k / 2} λ_{0} (\overset{(59)}{\leq} λ_{0}) . \end{matrix}

Comparing the previously known efficiency estimate (56) and its starting moment of superlinear convergence (57) with the new ones (62), (59), we thus conclude that we manage to put the condition number $\frac{L}{μ}$ under the logarithm.

For DFP, the previously known rate (see [27, Theorem 3.2]) is

\begin{matrix} λ_{k} \leq {(\frac{n L^{2}}{μ^{2} k})}^{k / 2} λ_{0} \end{matrix}

with the following starting moment of the superlinear convergence:

\begin{matrix} {\hat{K}}_{0}^{DFP} : = \frac{n L^{2}}{μ^{2}} . \end{matrix}

The new rate, which we have obtained in this paper (Theorem 4.2), is

\begin{matrix} λ_{k} \leq {[\frac{2 L}{μ}, (e^{\frac{n}{k} ln \frac{L}{μ}} - 1)]}^{k / 2} \sqrt{\frac{L}{μ}} \cdot λ_{0} . \end{matrix}

Repeating the same reasoning as above, we can easily obtain that the new starting moment of the superlinear convergence can be described as follows:

\begin{matrix} K_{0}^{DFP} : = \frac{4 n L}{μ} ln \frac{L}{μ}, \end{matrix}

and, for all $k \geq K_{0}^{DFP}$ , the new estimate (64) takes the following form:

\begin{matrix} λ_{k} \leq {(\frac{4 n L}{μ k} ln \frac{L}{μ})}^{k / 2} λ_{0} (\overset{(65)}{\leq} λ_{0}) . \end{matrix}

Thus, compared to the old result, we have improved the factor $\frac{L^{2}}{μ^{2}}$ up to $\frac{L}{μ} ln \frac{L}{μ}$ . Interestingly enough, the ratio between the old starting moments (63), (57) of the superlinear convergence of DFP and BFGS and the new ones (65), (59) have remained the same, $\frac{L}{μ}$ , although the both estimates have been improved.

It is also interesting whether the results, obtained in this paper, can be applied to limited-memory quasi-Newton methods such as L-BFGS [30]. Unfortunately, it seems like the answer is negative. The main problem is that we cannot say anything interesting about just a few iterations of BFGS. Indeed, according to our main result, after k iterations of BFGS, the initial residual is contracted by the factor of the form ${[exp (\frac{n}{k} ln \frac{L}{μ}) - 1]}^{k}$ . For all values $k \leq n ln \frac{L}{μ}$ , this contraction factor is in fact bigger than 1, so the result becomes useless.

Conclusions

We have presented a new theoretical analysis of local superlinear convergence of classical quasi-Newton methods from the convex Broyden class. Our analysis has been based on the potential function involving the logarithm of determinant of Hessian approximation and the trace of inverse Hessian approximation. Compared to the previous works, we have obtained new convergence rate estimates, which have much better dependency on the condition number of the problem.

Note that all our results are local, i.e. they are valid under the assumption that the starting point is sufficiently close to a minimizer. In particular, there is no contradiction between our results and the fact that the DFP method is not known to be globally convergent with inexact line search (see, e.g., [31]).

Let us mention several open questions. First, looking at the starting moment of superlinear convergence of the BFGS method, in addition to the dimension of the problem, we see the presence of the logarithm of its condition number. Although typically such logarithmic factors are considered small, it is still interesting to understand whether this factor can be completely removed.

Second, all the superlinear convergence rates, which we have obtained for the convex Broyden class in this paper, are expressed in terms of the parameter $τ$ , which controls the weight of the DFP component in the updating formula for the inverse operator. At the same time, in [27], the corresponding estimates were presented in terms of the parameter $ϕ$ , which controls the weight of the DFP component in the updating formula for the primal operator. Of course, for the extreme members of the convex Broyden class, DFP and BFGS, $ϕ$ and $τ$ coincide. However, in general, they could be quite different. We do not know if it is possible to express the results of this paper in terms of $ϕ$ instead of $τ$ .

Finally, in all the methods, which we considered, the initial Hessian approximation $G_{0}$ was LB, where L is the Lipschitz constant of the gradient, measured relative to the operator B. We always assume that this constant is known. Of course, it is interesting to develop some adaptive algorithms, which could start from any initial guess $L_{0}$ for the constant L, and then somehow dynamically adjust the Hessian approximations in iterations, yet retaining all the original efficiency estimates.

Acknowledgements

The presented results were supported by ERC Advanced Grant 788368. The authors are thankful to the anonymous reviewers for their valuable time and comments.

Appendix

Lemma A.1

Let $A, G : E \to E^{*}$ be self-adjoint positive definite linear operators, let $u \in E$ be nonzero, and let $τ \in R$ be such that $G_{+} : = {Broyd}_{τ} (A, G, u)$ is well defined. Then,

\begin{matrix} \begin{matrix} G_{+}^{- 1} & = & τ [G^{- 1} - \frac{G^{- 1} A u u^{*} A G^{- 1}}{⟨ A G^{- 1} A u, u ⟩} + \frac{u u^{*}}{⟨ A u, u ⟩}] \\ + (1 - τ) [G^{- 1} - \frac{G^{- 1} A u u^{*} + u u^{*} A G^{- 1}}{⟨ A u, u ⟩} + (\frac{⟨ A G^{- 1} A u, u ⟩}{⟨ A u, u ⟩} + 1) \frac{u u^{*}}{⟨ A u, u ⟩}], \end{matrix} \end{matrix}

and

\begin{matrix} Det (G_{+}^{- 1}, G) = τ \frac{⟨ A u, u ⟩}{⟨ A G^{- 1} A u, u ⟩} + (1 - τ) \frac{⟨ G u, u ⟩}{⟨ A u, u ⟩} . \end{matrix}

Proof

Denote $ϕ : = ϕ_{τ} (A, G, u)$ . According to Lemma 6.2 in [27], we have

\begin{matrix} Det (G^{- 1}, G_{+}) = ϕ \frac{⟨ A G^{- 1} A u, u ⟩}{⟨ A u, u ⟩} + (1 - ϕ) \frac{⟨ A u, u ⟩}{⟨ G u, u ⟩} \\ \overset{(3)}{=} {[τ \frac{⟨ A u, u ⟩}{⟨ A G^{- 1} A u, u ⟩} + (1 - τ) \frac{⟨ G u, u ⟩}{⟨ A u, u ⟩}]}^{- 1} . \end{matrix}

This proves (67) since $Det (G_{+}^{- 1}, G) = \frac{1}{Det (G^{- 1}, G_{+})}$ . Let us prove (66). Denote

\begin{matrix} G_{0} : = G - \frac{G u u^{*} G}{⟨ G u, u ⟩} + \frac{A u u^{*} A}{⟨ A u, u ⟩}, s : = \frac{Au}{⟨ A u, u ⟩} - \frac{Gu}{⟨ G u, u ⟩} . \end{matrix}

Note that

\begin{matrix} G_{+} & \overset{(2)}{=} G_{0} + ϕ [\frac{⟨ G u, u ⟩ A u u^{*} A}{{⟨ A u, u ⟩}^{2}} + \frac{G u u^{*} G}{⟨ G u, u ⟩} - \frac{⟨ A u u^{*} G + G u u^{*} A}{⟨ A u, u ⟩}] \\ = G_{0} + ϕ ⟨ G u, u ⟩ s s^{*} . \end{matrix}

Let $I_{E}$ and $I_{E^{*}}$ be the identity operators in $E$ and $E^{*}$ . Since $G_{0} u = A u$ , we have

\begin{matrix} \begin{matrix} [(I_{E} - \frac{u u^{*} A}{⟨ A u, u ⟩}) G^{- 1} (I_{E^{*}} - \frac{A u u^{*}}{⟨ A u, u ⟩}) + \frac{u u^{*}}{⟨ A u, u ⟩}] G_{0} \\ = & (I_{E} - \frac{u u^{*} A}{⟨ A u, u ⟩}) G^{- 1} (G_{0} - \frac{A u u^{*} A}{⟨ A u, u ⟩}) + \frac{u u^{*} A}{⟨ A u, u ⟩} \\ \overset{(68)}{=} & (I_{E} - \frac{u u^{*} A}{⟨ A u, u ⟩}) G^{- 1} (G - \frac{G u u^{*} G}{⟨ G u, u ⟩}) + \frac{u u^{*} A}{⟨ A u, u ⟩} = I_{E} . \end{matrix} \end{matrix}

Hence, we can conclude that

\begin{matrix} \begin{matrix} G_{0}^{- 1} & = & (I_{E} - \frac{u u^{*} A}{⟨ A u, u ⟩}) G^{- 1} (I_{E^{*}} - \frac{A u u^{*}}{⟨ A u, u ⟩}) + \frac{u u^{*}}{⟨ A u, u ⟩} \\ = & G^{- 1} - \frac{G^{- 1} A u u^{*} + u u^{*} A G^{- 1}}{⟨ A u, u ⟩} + (\frac{⟨ A G^{- 1} A u, u ⟩}{⟨ A u, u ⟩} + 1) \frac{u u^{*}}{⟨ A u, u ⟩} . \end{matrix} \end{matrix}

Thus, we see that the right-hand side of (66) equals

\begin{matrix} H_{+} : = & G_{0}^{- 1} - τ [\frac{⟨ A G^{- 1} A u, u ⟩ u u^{*}}{{⟨ A u, u ⟩}^{2}} + \frac{G^{- 1} A u u^{*} A G^{- 1}}{⟨ A G^{- 1} A u, u ⟩} - \frac{G^{- 1} A u u^{*} + u u^{*} A G^{- 1}}{⟨ A u, u ⟩}] \\ = & G_{0}^{- 1} - τ ⟨ A G^{- 1} A u, u ⟩ w w^{*}, \end{matrix}

where

\begin{matrix} w : = \frac{G^{- 1} A u}{⟨ A G^{- 1} A u, u ⟩} - \frac{u}{⟨ A u, u ⟩} . \end{matrix}

It remains to verify that $H_{+} G_{+} = I_{E}$ . Clearly,

\begin{matrix} \begin{matrix} ⟨ A G^{- 1} A u, u ⟩ G_{0} w & \overset{(71)}{=} & G_{0} G^{- 1} A u - \frac{⟨ A G^{- 1} A u, u ⟩ G_{0} u}{⟨ A u, u ⟩} \\ \overset{(68)}{=} & A u - \frac{⟨ A u, u ⟩ G u}{⟨ G u, u ⟩} \overset{(68)}{=} ⟨ A u, u ⟩ s . \end{matrix} \end{matrix}

Hence,

\begin{matrix} \begin{matrix} ⟨ A G^{- 1} A u, u ⟩ ⟨ G_{0} w, w ⟩ \overset{(72)}{=} ⟨ A u, u ⟩ ⟨ s, w ⟩ \overset{(71)}{=} \frac{⟨ A u, u ⟩ ⟨ s, G^{- 1} A u ⟩}{⟨ A G^{- 1} A u, u ⟩} - ⟨ s, u ⟩ \\ \overset{(68)}{=} \frac{⟨ A u, u ⟩}{⟨ A G^{- 1} A u, u ⟩} (\frac{⟨ A G^{- 1} A u, u ⟩}{⟨ A u, u ⟩} - \frac{⟨ A u, u ⟩}{⟨ G u, u ⟩}) = 1 - \frac{{⟨ A u, u ⟩}^{2}}{⟨ A G^{- 1} A u, u ⟩ ⟨ G u, u ⟩} . \end{matrix} \end{matrix}

Consequently,

\begin{matrix} \begin{matrix} \frac{⟨ G u, u ⟩}{⟨ A u, u ⟩} H_{+} G_{0} w w^{*} G_{0} & \overset{(70)}{=} & \frac{⟨ G u, u ⟩}{⟨ A u, u ⟩} (G_{0}^{- 1} - τ ⟨ A G^{- 1} A u, u ⟩ w w^{*}) G_{0} w w^{*} G_{0} \\ = & \frac{⟨ G u, u ⟩}{⟨ A u, u ⟩} (1 - τ ⟨ A G^{- 1} A u, u ⟩ ⟨ G_{0} w, w ⟩) w w^{*} G_{0} \\ \overset{(73)}{=} & \frac{⟨ G u, u ⟩}{⟨ A u, u ⟩} (1 - τ + τ \frac{{⟨ A u, u ⟩}^{2}}{⟨ A G^{- 1} A u, u ⟩ ⟨ G u, u ⟩}) w w^{*} G_{0} \\ = & [τ \frac{⟨ A u, u ⟩}{⟨ A G^{- 1} A u, u ⟩} + (1 - τ) \frac{⟨ G u, u ⟩}{⟨ A u, u ⟩}] w w^{*} G_{0} . \end{matrix} \end{matrix}

Thus,

\begin{matrix} \begin{matrix} H_{+} G_{+} & \overset{(69)}{=} & H_{+} (G_{0} + ϕ ⟨ G u, u ⟩ s s^{*}) \overset{(72)}{=} H_{+} (G_{0} + ϕ \frac{{⟨ A G^{- 1} A u, u ⟩}^{2}}{⟨ A u, u ⟩} \frac{⟨ G u, u ⟩}{⟨ A u, u ⟩} G_{0} w w^{*} G_{0}) \\ \overset{(74)}{=} & H_{+} G_{0} + ϕ \frac{{⟨ A G^{- 1} A u, u ⟩}^{2}}{⟨ A u, u ⟩} [τ \frac{⟨ A u, u ⟩}{⟨ A G^{- 1} A u, u ⟩} + (1 - τ) \frac{⟨ G u, u ⟩}{⟨ A u, u ⟩}] \\ \overset{(3)}{=} & H_{+} G_{0} + τ ⟨ A G^{- 1} A u, u ⟩ w w^{*} G_{0} \overset{(70)}{=} I_{E} . \end{matrix} □ \end{matrix}

Footnotes

This is obvious when $G - A$ is nondegenerate. The general case follows by continuity.

We follow the standard convention that the sum over the empty set is defined as 0, so $ξ_{0} = 1$ . Similarly, the product over the empty set is defined as 1.

Hereinafter, $⌈ t ⌉$ for $t > 0$ denotes the smallest positive integer greater or equal to t.

⁴

We will estimate the second sum using (50). However, recall that, at this moment, (50) is proved only up to the index k. This is the reason why we move $λ_{k + 1}$ into the first sum.

⁵

$D^{3} f_{0} (x) [h, h, h] = \frac{d^{3}}{d t^{3}} f_{0} (x + t h) |_{t = 0}$ is the third derivative of f along the direction h.

⁶

Indeed, according to Theorem 4.1, we have at least $λ_{k} \leq {(1 - \frac{μ}{L})}^{k} λ_{0}$ for all $k \geq 0$ .

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Anton Rodomanov, Email: anton.rodomanov@uclouvain.be.

Yurii Nesterov, Email: yurii.nesterov@uclouvain.be.

References

1.Davidon, W.: Variable metric method for minimization. Argonne National Laboratory Research and Development Report 5990, (1959)
2.Fletcher R, Powell M. A rapidly convergent descent method for minimization. Comput. J. 1963;6(2):163–168. doi: 10.1093/comjnl/6.2.163. [DOI] [Google Scholar]
3.Broyden C. The convergence of a class of double-rank minimization algorithms: 1. General considerations. IMA J. Appl. Math. 1970;6(1):76–90. doi: 10.1093/imamat/6.1.76. [DOI] [Google Scholar]
4.Broyden C. The convergence of a class of double-rank minimization algorithms: 2. The new algorithm. IMA J. Appl. Math. 1970;6(3):222–231. doi: 10.1093/imamat/6.3.222. [DOI] [Google Scholar]
5.Fletcher R. A new approach to variable metric algorithms. Comput. J. 1970;13(3):317–322. doi: 10.1093/comjnl/13.3.317. [DOI] [Google Scholar]
6.Goldfarb D. A family of variable-metric methods derived by variational means. Math. Comput. 1970;24(109):23–26. doi: 10.1090/S0025-5718-1970-0258249-6. [DOI] [Google Scholar]
7.Shanno D. Conditioning of quasi-Newton methods for function minimization. Math. Comput. 1970;24(111):647–656. doi: 10.1090/S0025-5718-1970-0274029-X. [DOI] [Google Scholar]
8.Broyden C. Quasi-Newton methods and their application to function minimization. Math. Comput. 1967;21(99):368–381. doi: 10.1090/S0025-5718-1967-0224273-2. [DOI] [Google Scholar]
9.Dennis J, Moré J. Quasi-Newton methods, motivation and theory. SIAM Rev. 1977;19(1):46–89. doi: 10.1137/1019005. [DOI] [Google Scholar]
10.Nocedal J, Wright S. Numerical Optimization. New York: Springer; 2006. [Google Scholar]
11.Lewis A, Overton M. Nonsmooth optimization via quasi-Newton methods. Math. Program. 2013;141(1–2):135–163. doi: 10.1007/s10107-012-0514-2. [DOI] [Google Scholar]
12.Powell M. On the convergence of the variable metric algorithm. IMA J. Appl. Math. 1971;7(1):21–36. doi: 10.1093/imamat/7.1.21. [DOI] [Google Scholar]
13.Dixon L. Quasi-Newton algorithms generate identical points. Math. Program. 1972;2(1):383–387. doi: 10.1007/BF01584554. [DOI] [Google Scholar]
14.Dixon L. Quasi Newton techniques generate identical points II: the proofs of four new theorems. Math. Program. 1972;3(1):345–358. doi: 10.1007/BF01585007. [DOI] [Google Scholar]
15.Broyden C, Dennis J, Moré J. On the local and superlinear convergence of quasi-Newton methods. IMA J. Appl. Math. 1973;12(3):223–245. doi: 10.1093/imamat/12.3.223. [DOI] [Google Scholar]
16.Dennis J, Moré J. A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 1974;28(126):549–560. doi: 10.1090/S0025-5718-1974-0343581-1. [DOI] [Google Scholar]
17.Stachurski A. Superlinear convergence of Broyden’s bounded $θ$ -class of methods. Math. Program. 1981;20(1):196–212. doi: 10.1007/BF01589345. [DOI] [Google Scholar]
18.Griewank A, Toint P. Local convergence analysis for partitioned quasi-Newton updates. Numer. Math. 1982;39(3):429–448. doi: 10.1007/BF01407874. [DOI] [Google Scholar]
19.Engels J, Martínez H. Local and superlinear convergence for partially known quasi-Newton methods. SIAM J. Optim. 1991;1(1):42–56. doi: 10.1137/0801005. [DOI] [Google Scholar]
20.Byrd R, Liu D, Nocedal J. On the behavior of Broyden’s class of quasi-Newton methods. SIAM J. Optim. 1992;2(4):533–557. doi: 10.1137/0802026. [DOI] [Google Scholar]
21.Yabe H, Yamaki N. Local and superlinear convergence of structured quasi-Newton methods for nonlinear optimization. J. Oper. Res. Soc. Jpn. 1996;39(4):541–557. [Google Scholar]
22.Wei Z, Yu G, Yuan G, Lian Z. The superlinear convergence of a modified BFGS-type method for unconstrained optimization. Comput. Optim. Appl. 2004;29(3):315–332. doi: 10.1023/B:COAP.0000044184.25410.39. [DOI] [Google Scholar]
23.Yabe H, Ogasawara H, Yoshino M. Local and superlinear convergence of quasi-Newton methods based on modified secant conditions. J. Comput. Appl. Math. 2007;205(1):617–632. doi: 10.1016/j.cam.2006.05.018. [DOI] [Google Scholar]
24.Mokhtari A, Eisen M, Ribeiro A. IQN: an incremental quasi-Newton method with local superlinear convergence rate. SIAM J. Optim. 2018;28(2):1670–1698. doi: 10.1137/17M1122943. [DOI] [Google Scholar]
25.Gao W, Goldfarb D. Quasi-Newton methods: superlinear convergence without line searches for self-concordant functions. Optim. Methods Softw. 2019;34(1):194–217. doi: 10.1080/10556788.2018.1510927. [DOI] [Google Scholar]
26.Rodomanov, A., Nesterov, Y.: Greedy quasi-Newton methods with explicit superlinear convergence. CORE Discussion Papers. 06 (2020)
27.Rodomanov, A., Nesterov, Y.: Rates of Superlinear Convergence for Classical Quasi-Newton Methods. CORE Discussion Papers. 11 (2020) [DOI] [PMC free article] [PubMed]
28.Jin, Q., Mokhtari, A.: Non-asymptotic Superlinear Convergence of Standard Quasi-Newton Methods. arXiv preprint arXiv:2003.13607 (2020)
29.Byrd R, Nocedal J. A tool for the analysis of quasi-Newton methods with application to unconstrained minimization. SIAM J. Numer. Anal. 1989;26(3):727–739. doi: 10.1137/0726042. [DOI] [Google Scholar]
30.Liu D, Nocedal J. On the limited memory BFGS method for large scale optimization. Math. Program. 1989;45(1–3):503–528. doi: 10.1007/BF01589116. [DOI] [Google Scholar]
31.Byrd R, Nocedal J, Yuan Y. Global convergence of a class of quasi-Newton methods on convex problems. SIAM J. Numer. Anal. 1987;24(5):1171–1190. doi: 10.1137/0724077. [DOI] [Google Scholar]

[CR1] 1.Davidon, W.: Variable metric method for minimization. Argonne National Laboratory Research and Development Report 5990, (1959)

[CR2] 2.Fletcher R, Powell M. A rapidly convergent descent method for minimization. Comput. J. 1963;6(2):163–168. doi: 10.1093/comjnl/6.2.163. [DOI] [Google Scholar]

[CR3] 3.Broyden C. The convergence of a class of double-rank minimization algorithms: 1. General considerations. IMA J. Appl. Math. 1970;6(1):76–90. doi: 10.1093/imamat/6.1.76. [DOI] [Google Scholar]

[CR4] 4.Broyden C. The convergence of a class of double-rank minimization algorithms: 2. The new algorithm. IMA J. Appl. Math. 1970;6(3):222–231. doi: 10.1093/imamat/6.3.222. [DOI] [Google Scholar]

[CR5] 5.Fletcher R. A new approach to variable metric algorithms. Comput. J. 1970;13(3):317–322. doi: 10.1093/comjnl/13.3.317. [DOI] [Google Scholar]

[CR6] 6.Goldfarb D. A family of variable-metric methods derived by variational means. Math. Comput. 1970;24(109):23–26. doi: 10.1090/S0025-5718-1970-0258249-6. [DOI] [Google Scholar]

[CR7] 7.Shanno D. Conditioning of quasi-Newton methods for function minimization. Math. Comput. 1970;24(111):647–656. doi: 10.1090/S0025-5718-1970-0274029-X. [DOI] [Google Scholar]

[CR8] 8.Broyden C. Quasi-Newton methods and their application to function minimization. Math. Comput. 1967;21(99):368–381. doi: 10.1090/S0025-5718-1967-0224273-2. [DOI] [Google Scholar]

[CR9] 9.Dennis J, Moré J. Quasi-Newton methods, motivation and theory. SIAM Rev. 1977;19(1):46–89. doi: 10.1137/1019005. [DOI] [Google Scholar]

[CR10] 10.Nocedal J, Wright S. Numerical Optimization. New York: Springer; 2006. [Google Scholar]

[CR11] 11.Lewis A, Overton M. Nonsmooth optimization via quasi-Newton methods. Math. Program. 2013;141(1–2):135–163. doi: 10.1007/s10107-012-0514-2. [DOI] [Google Scholar]

[CR12] 12.Powell M. On the convergence of the variable metric algorithm. IMA J. Appl. Math. 1971;7(1):21–36. doi: 10.1093/imamat/7.1.21. [DOI] [Google Scholar]

[CR13] 13.Dixon L. Quasi-Newton algorithms generate identical points. Math. Program. 1972;2(1):383–387. doi: 10.1007/BF01584554. [DOI] [Google Scholar]

[CR14] 14.Dixon L. Quasi Newton techniques generate identical points II: the proofs of four new theorems. Math. Program. 1972;3(1):345–358. doi: 10.1007/BF01585007. [DOI] [Google Scholar]

[CR15] 15.Broyden C, Dennis J, Moré J. On the local and superlinear convergence of quasi-Newton methods. IMA J. Appl. Math. 1973;12(3):223–245. doi: 10.1093/imamat/12.3.223. [DOI] [Google Scholar]

[CR16] 16.Dennis J, Moré J. A characterization of superlinear convergence and its application to quasi-Newton methods. Math. Comput. 1974;28(126):549–560. doi: 10.1090/S0025-5718-1974-0343581-1. [DOI] [Google Scholar]

[CR17] 17.Stachurski A. Superlinear convergence of Broyden’s bounded $θ$ -class of methods. Math. Program. 1981;20(1):196–212. doi: 10.1007/BF01589345. [DOI] [Google Scholar]

[CR18] 18.Griewank A, Toint P. Local convergence analysis for partitioned quasi-Newton updates. Numer. Math. 1982;39(3):429–448. doi: 10.1007/BF01407874. [DOI] [Google Scholar]

[CR19] 19.Engels J, Martínez H. Local and superlinear convergence for partially known quasi-Newton methods. SIAM J. Optim. 1991;1(1):42–56. doi: 10.1137/0801005. [DOI] [Google Scholar]

[CR20] 20.Byrd R, Liu D, Nocedal J. On the behavior of Broyden’s class of quasi-Newton methods. SIAM J. Optim. 1992;2(4):533–557. doi: 10.1137/0802026. [DOI] [Google Scholar]

[CR21] 21.Yabe H, Yamaki N. Local and superlinear convergence of structured quasi-Newton methods for nonlinear optimization. J. Oper. Res. Soc. Jpn. 1996;39(4):541–557. [Google Scholar]

[CR22] 22.Wei Z, Yu G, Yuan G, Lian Z. The superlinear convergence of a modified BFGS-type method for unconstrained optimization. Comput. Optim. Appl. 2004;29(3):315–332. doi: 10.1023/B:COAP.0000044184.25410.39. [DOI] [Google Scholar]

[CR23] 23.Yabe H, Ogasawara H, Yoshino M. Local and superlinear convergence of quasi-Newton methods based on modified secant conditions. J. Comput. Appl. Math. 2007;205(1):617–632. doi: 10.1016/j.cam.2006.05.018. [DOI] [Google Scholar]

[CR24] 24.Mokhtari A, Eisen M, Ribeiro A. IQN: an incremental quasi-Newton method with local superlinear convergence rate. SIAM J. Optim. 2018;28(2):1670–1698. doi: 10.1137/17M1122943. [DOI] [Google Scholar]

[CR25] 25.Gao W, Goldfarb D. Quasi-Newton methods: superlinear convergence without line searches for self-concordant functions. Optim. Methods Softw. 2019;34(1):194–217. doi: 10.1080/10556788.2018.1510927. [DOI] [Google Scholar]

[CR26] 26.Rodomanov, A., Nesterov, Y.: Greedy quasi-Newton methods with explicit superlinear convergence. CORE Discussion Papers. 06 (2020)

[CR27] 27.Rodomanov, A., Nesterov, Y.: Rates of Superlinear Convergence for Classical Quasi-Newton Methods. CORE Discussion Papers. 11 (2020) [DOI] [PMC free article] [PubMed]

[CR28] 28.Jin, Q., Mokhtari, A.: Non-asymptotic Superlinear Convergence of Standard Quasi-Newton Methods. arXiv preprint arXiv:2003.13607 (2020)

[CR29] 29.Byrd R, Nocedal J. A tool for the analysis of quasi-Newton methods with application to unconstrained minimization. SIAM J. Numer. Anal. 1989;26(3):727–739. doi: 10.1137/0726042. [DOI] [Google Scholar]

[CR30] 30.Liu D, Nocedal J. On the limited memory BFGS method for large scale optimization. Math. Program. 1989;45(1–3):503–528. doi: 10.1007/BF01589116. [DOI] [Google Scholar]

[CR31] 31.Byrd R, Nocedal J, Yuan Y. Global convergence of a class of quasi-Newton methods on convex problems. SIAM J. Numer. Anal. 1987;24(5):1171–1190. doi: 10.1137/0724077. [DOI] [Google Scholar]

PERMALINK

New Results on Superlinear Convergence of Classical Quasi-Newton Methods

Anton Rodomanov

Yurii Nesterov

Abstract

Introduction

Notation

Convex Broyden Class

Remark 3.1

Lemma 3.1

Lemma 3.2

Proof

Remark 3.2

Lemma 3.3

Proof

Lemma 3.4

Proof

Lemma 3.5

Proof

Unconstrained Quadratic Minimization

Theorem 4.1

Theorem 4.2

Proof

Remark 4.1

Theorem 4.3

Proof

Minimization of General Functions

Lemma 5.1

Remark 5.1

Lemma 5.2

Proof

Corollary 5.1

Proof

Lemma 5.3

Proof

Lemma 5.4

Proof

Theorem 5.1

Proof

Remark 5.2

Remark 5.3

Example 5.1

Discussion

Conclusions

Acknowledgements

Appendix

Lemma A.1

Proof

Footnotes

Contributor Information

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases