ANOTHER LOOK AT THE FAST ITERATIVE SHRINKAGE/THRESHOLDING ALGORITHM (FISTA)

Donghwan Kim; Jeffrey A Fessler

doi:10.1137/16M108940X

. Author manuscript; available in PMC: 2018 May 23.

Published in final edited form as: SIAM J Optim. 2018 Jan 30;28(1):223–250. doi: 10.1137/16M108940X

ANOTHER LOOK AT THE FAST ITERATIVE SHRINKAGE/THRESHOLDING ALGORITHM (FISTA)^*

Donghwan Kim ^†, Jeffrey A Fessler ^†

PMCID: PMC5966151 NIHMSID: NIHMS923572 PMID: 29805242

Abstract

This paper provides a new way of developing the “Fast Iterative Shrinkage/Thresholding Algorithm (FISTA)” [3] that is widely used for minimizing composite convex functions with a nonsmooth term such as the ℓ₁ regularizer. In particular, this paper shows that FISTA corresponds to an optimized approach to accelerating the proximal gradient method with respect to a worst-case bound of the cost function. This paper then proposes a new algorithm that is derived by instead optimizing the step coefficients of the proximal gradient method with respect to a worst-case bound of the composite gradient mapping. The proof is based on the worst-case analysis called Performance Estimation Problem in [11].

1. Introduction

The “Fast Iterative Shrinkage/Thresholding Algorithm” (FISTA) [3], also known as a fast proximal gradient method (FPGM) in general, is a very widely used fast first-order method. FISTA’s speed arises from Nesterov’s accelerating technique in [23, 24] that improves the O(1/N) cost function worst-case bound of a proximal gradient method (PGM) to the optimal O(1/N²) rate where N denotes the number of iterations [3].

This paper first provides a new way to develop Nesterov’s acceleration approach, i.e., FISTA (FPGM). In particular, we show that FPGM corresponds to an optimized approach to accelerating PGM with respect to a worst-case bound of the cost function. We then propose a new fast algorithm that is derived from PGM by instead optimizing a worst-case bound of the composite gradient mapping. We call this new method FPGM-OCG (OCG for optimized over composite gradient mapping). This new method provides the best known analytical worst-case bound for decreasing the composite gradient mapping with rate $O (1 / N^{\frac{3}{2}})$ among fixed-step first-order methods. The proof is based on the worst-case bound analysis called Performance Estimation Problem (PEP) in [11], which we briefly review next.

Drori and Teboulle’s PEP [11] casts a worst-case analysis for a given optimization method and a given class of optimization problems into a meta-optimization problem. The original PEP has been intractable to solve exactly, so [11] introduced a series of tractable relaxations, focusing on first-order methods and smooth convex minimization problems; this PEP and its relaxations were studied for various algorithms and minimization problem classes in [12, 16, 17, 18, 19, 29, 30]. Drori and Teboulle [11] further proposed to optimize the step coefficients of a given class of optimization methods using a PEP. This approach was studied for first-order methods on unconstrained smooth convex minimization problems in [11], and the authors [17] derived a new first-order method, called an optimized gradient method (OGM) that has an analytic worst-case bound on the cost function that is twice smaller than the previously best known bounds of [23, 24]. Recently, Drori [10] showed that the OGM exactly achieves the optimal cost function worst-case bound among first-order methods for smooth convex minimization (in high-dimensional problems).

Building upon [11] and its successors, Taylor et al. [29] expanded the use of PEP to first-order (proximal gradient) methods for minimizing nonsmooth composite convex functions. They used a tight relaxation¹ for PEP and studied the tight (exact) numerical worst-case bounds of FPGM, a proximal gradient version of OGM, and some variants versus number of iterations N. Their numerical results suggest that there exists an OGM-type acceleration of PGM that has a worst-case cost function bound that is about twice smaller than that of FPGM, showing room for improvement in accelerating PGM. However, it is difficult to derive an analytical worst-case bound for the tightly relaxed PEP in [29], so optimizing the step coefficients of PGM remains an open problem, unlike [11, 17] for smooth convex minimization.

Different from the tightly relaxed PEP in [29], this paper suggests a new (looser) relaxation of a cost function form of PEP for nonsmooth composite convex minimization that simplifies analysis and optimization of step coefficients of PGM, although yields loose worst-case bounds. Interestingly, the resulting optimized PGM numerically appears to be the FPGM. Then, we further provide a new generalized version of FPGM using our relaxed PEP that extends our understanding of the FPGM variants.

This paper next extends the PEP analysis of the gradient norm in [29, 30]. For unconstrained smooth convex minimization, the authors [16] used such PEP to optimize the step coefficients with respect to the gradient norm. The corresponding optimized algorithm can be useful particularly when dealing with dual problems where the gradient norm decrease is important in addition to the cost function minimization (see e.g., [9, 22, 26]). By extending [16], this paper optimizes the step coefficients of the PGM for the composite gradient mapping form of PEP for nonsmooth composite convex minimization. The resulting optimized algorithm differs somewhat from Nesterov’s acceleration and turns out to belong to the proposed generalized FPGM class.

Sec. 2 describes a nonsmooth composite convex minimization problem and first-order (proximal gradient) methods. Sec. 3 proposes a new relaxation of PEP for nonsmooth composite convex minimization problems and the proximal gradient methods, and suggests that the FPGM (FISTA) [3] is the optimized method of the cost function form of this relaxed PEP. Sec. 3 further proposes a generalized version of FPGM using the relaxed PEP. Sec. 4 studies the composite gradient mapping form of the relaxed PEP and describes a new optimized method for decreasing the norm of composite gradient mapping. Sec. 5 compares the various algorithms considered, and Sec. 6 concludes.

2. Problem, methods, and contribution

We consider first-order algorithms for solving the nonsmooth composite convex minimization problem:

min_{x \in ℝ^{d}} {F (x) ≔ f (x) + ϕ (x)},

(M)

under the following assumptions:

f : ℝ^d → ℝ is is a convex function of the type $𝒞_{L}^{1, 1} (ℝ^{d})$ , i.e., continuously differentiable with Lipschitz continuous gradient:
$‖ \nabla f (x) - \nabla f (y) ‖ \leq L ‖ x - y ‖, \forall x, y \in ℝ^{d},$ (1)
where L > 0 is the Lipschitz constant.
ϕ : ℝ^d → ℝ is proper, closed, convex and “proximal-friendly” [6].
The optimal set X_*(F) = arg min_x∈ℝ^d F(x) is nonempty, i.e., the problem (M) is solvable.

We use ℱ_L(ℝ^d) to denote the class of functions F that satisfy the above conditions. We additionally assume that the distance between the initial point x₀ and an optimal solution x_* ∈ X(F) is bounded by R > 0, i.e., ‖x₀ − x_*‖ ≤ R.

PGM is a standard first-order method for solving the problem (M) [3, 6], particularly when the following proximal gradient update (that consists of a gradient descent step and a proximal operation [6]) is relatively simple:

p_{L} (y) ≔ \underset{x}{arg min} {f (y) + 〈 x - y, \nabla f (y) 〉 + \frac{L}{2} {‖ x - y ‖}^{2} + ϕ (x)} = \underset{x}{arg min} {\frac{L}{2} {‖ x - (y - \frac{1}{L} \nabla f (y)) ‖}^{2} + ϕ (x)} .

(2)

For ϕ(x) = ‖x‖₁, the update (2) becomes a simple shrinkage/thresholding update, and PGM reduces to an iterative shrinkage/thresholding algorithm (ISTA) [8]. (See [6, Table 10.2] for more functions ϕ(x) that lead to simple proximal operations.) PGM has the following bound on the cost function [3, Thm. 3.1] for any N ≥ 1:

F (x_{N}) - F (x_{*}) \leq \frac{{LR}^{2}}{2 N} .

(3)

Algorithm PGM

Input: f ∈ ℱ_L(ℝ^d), x₀ ∈ ℝ^d.

For i = 0, …, N − 1

x_{i + 1} = p_{L} (x_{i})

Open in a new tab

For simplicity in later derivations, we use the following definition of the composite gradient mapping [27]:

{\tilde{\nabla}}_{L} F (x) ≔ - L (p_{L} (x) - x) .

(4)

The composite gradient mapping reduces to the usual function gradient ∇f(x) when ϕ(x) = 0. We can then rewrite the PGM update in the following form reminiscent of a gradient method:

x_{i + 1} = p_{L} (x_{i}) = x_{i} - \frac{1}{L} {\tilde{\nabla}}_{L} F (x_{i}),

(5)

where each update guarantees the following monotonic cost function descent [27, Thm. 1]:

F (x_{i}) - F (x_{i + 1}) \geq \frac{1}{2 L} {‖ {\tilde{\nabla}}_{L} F (x_{i}) ‖}^{2} .

(6)

For any x ∈ ℝ^d, there exists a subgradient ϕ′(p_L(x)) ∈ ∂ϕ(p_L(x)) that satisfies the following equality [3, Lemma 2.2]:

{\tilde{\nabla}}_{L} F (x) = \nabla f (x) + ϕ' (p_{L} (x)) .

(7)

This equality implies that any point x̄ with a zero composite gradient mapping (∇̃_LF(x̄) = 0, i.e., x̄ = p_L(x̄)) satisfies 0 ∈ ∂F(x̄) and is a minimizer of (M). As discussed, minimizing the composite gradient mapping is noteworthy in addition to decreasing the cost function. This property becomes particularly important when dealing with dual problems. In particular, it is known that the norm of the dual (sub)gradient is related to the primal feasibility (see e.g., [9, 22, 26]). Furthermore, the norm of the subgradient is upper bounded by the norm of the composite gradient mapping, i.e., for any given subgradients ϕ′(p_L(x)) in (7) and F′(p_L(x)) ≔ ∇f(p_L(x)) + ϕ′(p_L(x)) ∈ ∂F(p_L(x)), we have

‖ F' (p_{L} (x)) ‖ \leq ‖ \nabla f (x) - \nabla f (p_{L} (x)) ‖ + ‖ \nabla f (x) + ϕ' (p_{L} (x)) ‖ \leq 2 L ‖ x - p_{L} (x) ‖ = 2 ‖ {\tilde{\nabla}}_{L} F (p_{L} (x)) ‖,

(8)

where the first inequality uses the triangle inequality and the second inequality uses (1) and (7). This inequality provides a close relationship between the primal feasibility and the dual composite gradient mapping. Therefore, we next analyze the worst-case bound of the composite gradient mapping of PGM; Sec. 4 below discusses a first-order algorithm that is optimized with respect to the composite gradient mapping.²

The following lemma shows that PGM monotonically decreases the norm of the composite gradient mapping.

Lemma 1

The PGM monotonically decreases the norm of composite gradient mapping, i.e., for all x:

‖ {\tilde{\nabla}}_{L} F (p_{L} (x)) ‖ \leq ‖ {\tilde{\nabla}}_{L} F (x) ‖ .

(9)

Proof

The proof in [22, Lemma 2.4] can be easily extended to prove (9) using the nonexpansiveness of the proximal mapping (proximity operator) [6].

The following theorem provides a O(1/N) bound on the norm of composite gradient mapping for the PGM, using the idea in [26] and Lemma 1.

Theorem 2

Let F : ℝ^d → ℝ be in ℱ_L(ℝ^d) and let x₀, ⋯, x_N ∈ ℝ^d be generated by PGM. Then for N ≥ 2,

min_{i \in {0, \dots, N}} ‖ {\tilde{\nabla}}_{L} F (x_{i}) ‖ = ‖ {\tilde{\nabla}}_{L} F (x_{N}) ‖ \leq \frac{2 LR}{\sqrt{(N - 1) (N + 2)}} .

(10)

Proof

Let $m = ⌊ \frac{N}{2} ⌋$ , and we have

\frac{{LR}^{2}}{2 m} \overset{(3)}{\geq} F (x_{m}) - F (x_{*}) \overset{(6)}{\geq} F (x_{N + 1}) - F (x_{*}) + \frac{1}{2 L} \sum_{i = m}^{N} {‖ {\tilde{\nabla}}_{L} F (x_{i}) ‖}^{2} \overset{(9)}{\geq} \frac{N - m + 1}{2 L} {‖ {\tilde{\nabla}}_{L} F (x_{N}) ‖}^{2},

which is equivalent to (10) using $m \geq \frac{N - 1}{2}$ and $N - m \geq \frac{N}{2}$ .

Despite its inexpensive per-iteration computational cost, PGM suffers from the slow rate O(1/N) for decreasing both the cost function and the norm of composite gradient mapping.³ Therefore for acceleration, this paper considers the following class of fixed-step first-order methods (FSFOM), where the (i + 1)th iteration consists of one proximal gradient evaluation, just like PGM, and a weighted summation of previous and current proximal gradient updates ${x_{k + 1} - y_{k}}_{k = 0}^{i}$ with step coefficients ${h_{i + 1, k}}_{k = 0}^{i}$ .

Algorithm Class FSFOM

Input: f ∈ ℱ_L(ℝ^d), x₀ ∈ ℝ^d, y₀ = x₀.

For i = 0, …, N − 1

x_{i + 1} = p_{L} (y_{i}) = y_{i} - \frac{1}{L} {\tilde{\nabla}}_{L} F (y_{i})

y_{i + 1} = y_{i} + \sum_{k = 0}^{i} h_{i + 1, k} (x_{k + 1} - y_{k}) = y_{i} - \frac{1}{L} \sum_{k = 0}^{i} h_{i + 1, k} {\tilde{\nabla}}_{L} F (y_{k}) .

Open in a new tab

Although the weighted summation in FSFOM seems at first to be inefficient both computationally and memory-wise, the optimized FSFOM presented in this paper have equivalent recursive forms that have memory and computation requirements that are similar to PGM. Note that this class FSFOM includes PGM but excludes accelerated algorithms in [13, 25, 27] that combine the proximal operations and the gradient steps in other ways.

Among FSFOM⁴, FISTA [3], also known as FPGM, is widely used since it has computation and memory requirements that are similar to PGM yet it achieves the optimal O(1/N²) worst-case rate for decreasing the cost function, using Nesterov’s acceleration technique [23, 24].

Algorithm FPGM (FISTA)

Input: f ∈ ℱ_L(ℝ^d), x₀ ∈ ℝ^d, y₀ = x₀, t₀ = 1.

For i = 0, …, N − 1

x_{i + 1} = p_{L} (y_{i})

t_{i + 1} = \frac{1 + \sqrt{1 + 4 t_{i}^{2}}}{2}

(11)

y_{i + 1} = x_{i + 1} + \frac{t_{i} - 1}{t_{i + 1}} (x_{i + 1} - x_{i})

Open in a new tab

FPGM has the following bound for the cost function [3, Thm. 4.4] for any N ≥ 1:

F (x_{N}) - F (x_{*}) \leq \frac{{LR}^{2}}{2 t_{N - 1}^{2}} \leq \frac{2 {LR}^{2}}{{(N + 1)}^{2}},

(12)

where the parameters t_i (11) satisfy

t_{i}^{2} = \sum_{l = 0}^{i} t_{l} and t_{i} \geq \frac{i + 2}{2} .

(13)

Sec. 3 provides a new proof of the cost function bound (12) of FPGM using a new relaxation of PEP, and illustrates that this particular acceleration of PGM results from optimizing a relaxed version of the cost function form of PEP. In addition, it is shown in [3, 5] that FPGM and its bound (12) generalize to any t_i such that t₀ = 1 and $t_{i}^{2} \leq t_{i - 1}^{2} + t_{i}$ for all i ≥ 1 with corresponding bound for any N ≥ 1:

F (x_{N}) - F (x_{*}) \leq \frac{{LR}^{2}}{2 t_{N - 1}^{2}},

(14)

which includes the choice $t_{i} = \frac{i + a}{a}$ for any a ≥ 2. Using our relaxed PEP, Sec. 3 further describes similar but different generalizations of FPGM that complement our understanding of FPGM.

We are often interested in the worst-case analysis of the norm of the (composite) gradient (mapping) in addition to that of the cost function, particularly when dealing with dual problems. To improve the rate O(1/N) of the gradient norm bound of a gradient method, Nesterov [26] suggested performing his fast gradient method (FGM) [23, 24], a non-proximal version of FPGM, for the first m iterations and a gradient method for remaining N − m iterations for smooth convex problems (when ϕ(x) = 0). Here we extend this idea to the nonsmooth composite convex problem (M) and use FPGM-m to denote the resulting algorithm. The following theorem provides a $O (1 / N^{\frac{3}{2}})$ worst-case bound for the norm of composite gradient mapping of FPGM-m, using the idea in [26] and Lemma 1.

Algorithm FPGM-m

Input: f ∈ ℱ_L(ℝ^d), x₀ ∈ ℝ^d, y₀ = x₀, t₀ = 1.

For i = 0, …, N − 1

x_{i + 1} = p_{L} (y_{i})

t_{i + 1} = \frac{1 + \sqrt{1 + 4 t_{i}^{2}}}{2}, i \leq m - 1

y_{i + 1} = {\begin{matrix} x_{i + 1} + \frac{t_{i} - 1}{t_{i} + 1} (x_{i + 1} - x_{i}), & i \leq m - 1, \\ x_{i + 1}, & otherwise . \end{matrix}

Open in a new tab

Theorem 3

Let F : ℝ^d → ℝ be in ℱ_L(ℝ^d) and let x₀, ⋯, x_N ∈ ℝ^d be generated by FPGM-m for 1 ≤ m ≤ N. Then for N ≥ 1,

min_{i \in {0, \dots, N}} ‖ {\tilde{\nabla}}_{L} F (x_{i}) ‖ \leq ‖ {\tilde{\nabla}}_{L} F (x_{N}) ‖ \leq \frac{2 LR}{(m + 1) \sqrt{N - m + 1}} .

(15)

Proof

We have

\frac{2 {LR}^{2}}{{(m + 1)}^{2}} \overset{(12)}{\geq} F (x_{m}) - F (x_{*}) \overset{(6)}{\geq} F (x_{N + 1}) - F (x_{*}) + \frac{1}{2 L} \sum_{i = m}^{N} {‖ {\tilde{\nabla}}_{L} F (x_{i}) ‖}^{2} \overset{(9)}{\geq} \frac{N - m + 1}{2 L} {‖ {\tilde{\nabla}}_{L} F (x_{N}) ‖}^{2},

which is equivalent to (15).

As noticed by a reviewer, when $m = ⌊ \frac{2 N}{3} ⌋$ , the worst-case bound (15) of the composite gradient mapping roughly has its smallest constant $3 \sqrt{3}$ for the rate $O (1 / N^{\frac{3}{2}})$ , which is better than the choice $m = ⌊ \frac{N}{2} ⌋$ in [26].

Monteiro and Svaiter [21] considered a variant of FPGM that replaces p_L(·) of FPGM by p_L/σ²(·) for 0 < σ < 1; that variant, which we denote FPGM-σ, satisfies the $O (1 / N^{\frac{3}{2}})$ rate for the composite gradient mapping. This FPGM-σ algorithm satisfies the following cost function and composite gradient mapping worst-case bounds⁵ [21, Prop. 5.2] for N ≥ 1:

F (x_{N}) - F (x_{*}) \leq \frac{2 {LR}^{2}}{σ^{2} N^{2}},

(16)

min_{i \in {0, \dots, N}} ‖ {\tilde{\nabla}}_{L / σ^{2}} F (y_{i}) ‖ \leq \frac{2 \sqrt{3}}{σ} \sqrt{\frac{1 + σ}{1 - σ}} \frac{LR}{N^{\frac{3}{2}}} .

(17)

The worst-case bound (17) of the composite gradient mapping has its smallest constant $\frac{2 \sqrt{3}}{σ^{2}} \sqrt{\frac{1 + σ}{1 - σ}} \approx 16.2$ when $σ = \frac{\sqrt{17} - 1}{4} \approx 0.78$ , which makes the bound (17) about $\frac{16}{3 \sqrt{3}}$ ≈ 3-times larger than the bound (15) of $FPGM - (m = ⌊ \frac{2 N}{3} ⌋)$ at best. However, since FPGM-σ does not require one to select the number of total iterations N in advance unlike FPGM-m, the FPGM-σ algorithm could be useful in practice, as discussed further in Sec. 4.4. Ghadimi and Lan [13] also showed the $O (1 / N^{\frac{3}{2}})$ rate for a composite gradient mapping worst-case bound of another variant of FPGM, but the corresponding algorithm in [13] requires two proximal gradient updates per iteration, combining the proximal operations and the gradient steps in a way that differs from the class FSFOM and could be less attractive in terms of the per-iteration computational complexity.

FPGM has been used in dual problems [1, 2, 4, 14]; using FPGM-m and the algorithms in [13, 21] that guarantee $O (1 / N^{\frac{3}{2}})$ rate for minimizing the norm of the composite gradient mapping could be potentially useful for solving dual problems. (Using F(P)GM-m for (dual) smooth convex problems was discussed in [9, 22, 26].) However, FPGM-m and the algorithms in [13, 21] are not necessarily the best possible methods with respect to the worst-case bound of the norm of the composite gradient mapping. Therefore, Sec. 4 seeks to optimize the step coefficients of FSFOM for minimizing the norm of the composite gradient mapping using a relaxed PEP.

The next section first provides a new proof of FPGM using our new relaxation on PEP, and proposes the new generalized FPGM.

3. Relaxation and optimization of the cost function form of PEP

3.1. Relaxation for the cost function form of PEP

For FSFOM with given step-size coefficients h ≔ {h_i+1,k}, in principle the worst-case bound on the cost function after N iterations corresponds to the solution of the following PEP problem [11]:

\begin{matrix} ℬ_{P} (h, N, d, L, R) ≔ & max_{\begin{matrix} F \in ℱ_{L} (ℝ^{d}), \\ x_{0}, \dots, x_{N} \in ℝ^{d}, x_{*} \in X_{*} (F) \\ y_{0}, \dots, y_{N - 1} \in ℝ^{d} \end{matrix}} F (x_{N}) - F (x_{*}) \\ s . t . x_{i + 1} = p_{L} (y_{i}), i = 0, \dots, N - 1, ‖ x_{0} - x_{*} ‖ \leq R, \\ y_{i + 1} = y_{i} + \sum_{k = 0}^{i} h_{i + 1, k} (x_{k + 1} - y_{k}), i = 0, \dots, N - 2 . \end{matrix}

(P)

Since (non-relaxed) PEP problems like (P) are difficult to solve due to the (infinite-dimensional) functional constraint on F, Drori and Teboulle [11] suggested (for smooth convex problems) replacing the functional constraint by a property of F related to the update such as p_L(·) in (P). Taylor et al. [29, 30] discussed properties of F that can replace the functional constraint of PEP without strictly relaxing (P), and provided tight numerical worst-case analysis for any given N. However, analytical solutions remain unknown for (P) and most PEP problems.

Instead, this paper proposes an alternate relaxation that is looser than that in [29, 30] but provides tractable and useful analytical results. We consider the following property of F involving the proximal gradient update p_L(·) [3, Lemma 2.3]:

F (x) - F (p_{L} (y)) \leq \frac{L}{2} {‖ p_{L} (y) - y ‖}^{2} + L 〈 y - x, p_{L} (y) - y 〉 \forall_{x, y} \in ℝ^{d}

(18)

to replace the functional constraint on F. In particular, we use the following property:

\frac{L}{2} {‖ p_{L} (y) - y ‖}^{2} - L 〈 p_{L} (x) - x, p_{L} (y) - y 〉 \leq F (p_{L} (x)) - F (p_{L} (y)) + L 〈 p_{L} (y) - y, x - y 〉, \forall_{x, y} \in ℝ^{d}

(19)

that results from replacing x in (18) by p_L(x). When ϕ(x) = 0, the property (19) reduces to

\frac{1}{2 L} {‖ \nabla f (y) ‖}^{2} - \frac{1}{L} 〈 \nabla f (x), \nabla f (y) 〉 \leq f (x - \frac{1}{L} \nabla f (x)) - f (y - \frac{1}{L} \nabla f (y)) - 〈 \nabla f (y), x - y 〉, \forall_{x, y} \in ℝ^{d} .

(20)

Note that the relaxation of PEP in [11, 16, 17, 18, 30] for unconstrained smooth convex minimization (ϕ(x) = 0) uses a well-known property of f in [24, Thm. 2.1.5] that differs from (20) and does not strictly relax the PEP as discussed in [30], whereas our relaxation using (19) and (20) does not guarantee a tight relaxation of (P). Finding a tight relaxation that leads to useful (or even optimal) algorithms remains an open problem for nonsmooth composite convex problems.

Similar to [11, Problem (Q′)], we (strictly) relax problem (P) as follows using a set of constraint inequalities (19) at the points (x, y) = (y_i−1, y_i) for i = 1, …, N − 1 and (x, y) = (x_*, y_i) for i = 0, …, N − 1:

\begin{matrix} ℬ_{P 1} (h, N, d, L, R) ≔ & max_{\begin{matrix} G \in ℝ^{N \times d}, \\ δ \in ℝ^{N} \end{matrix}} {LR}^{2} δ_{N - 1} \\ s . t . Tr {G^{⊤} {\overset{ˇ}{A}}_{i - 1, i} (h) G} \leq δ_{i - 1} - δ_{i}, i = 1, \dots, N - 1, \\ Tr {G^{⊤} {\overset{ˇ}{D}}_{i} (h) G + ν u_{i}^{⊤} G} \leq - δ_{i}, i = 0, \dots, N - 1, \end{matrix}

(P1)

for any given unit vector ν ∈ ℝ^d, by defining the (i + 1)th standard basis vector u_i = e_i+1 ∈ ℝ^N, the matrix G = [g₀, ⋯, g_{N − 1}]^⊤ ∈ ℝ^{N × d} and the vector δ = [δ₀, ⋯, δ_{N − 1}]^⊤ ∈ ℝ^N, where

{\begin{matrix} g_{i} ≔ - \frac{1}{‖ y_{0} - x_{*} ‖} (p_{L} (y_{i}) - y_{i}) = \frac{1}{L ‖ y_{0} - x_{*} ‖} {\tilde{\nabla}}_{L} F (y_{i}), \\ δ_{i} ≔ \frac{1}{L {‖ y_{0} - x_{*} ‖}^{2}} (F (p_{L} (y_{i})) - F (x_{*})), \end{matrix}

(21)

for i = 0, …, N − 1, *. Note that g_* = [0, ⋯, 0]^⊤, δ_* = 0 and $Tr {G^{⊤} u_{i} u_{j}^{⊤} G} = 〈 g_{i}, g_{j} 〉$ by definition. The matrices Ă_i−1,i(h) and Ď_i(h) are defined as

{\begin{matrix} {\overset{ˇ}{A}}_{i - 1, i} (h) ≔ \frac{1}{2} u_{i} u_{i}^{⊤} - \frac{1}{2} u_{i - 1} u_{i}^{⊤} - \frac{1}{2} u_{i} u_{i - 1}^{⊤} + \frac{1}{2} \sum_{k = 0}^{i - 1} h_{i, k} (u_{i} u_{k}^{⊤} + u_{k} u_{i}^{⊤}), \\ {\overset{ˇ}{D}}_{i} (h) ≔ \frac{1}{2} u_{i} u_{i}^{⊤} + \frac{1}{2} \sum_{j = 1}^{i} \sum_{k = 0}^{j - 1} h_{j, k} (u_{i} u_{k}^{⊤} + u_{k} u_{i}^{⊤}), \end{matrix}

(22)

which results from the inequalities (19) at the points (x, y) = (y_i−1, y_i) and (x, y) = (x_*, y_i) respectively.

As in [11, Problem (DQ′)], problem (P1) has a dual formulation that one can solve numerically for any given N using a semidefinite program (SDP) to determine an upper bound on the cost function worst-case bound for any FSFOM:⁶

F (x_{N}) - F (x_{*}) \leq ℬ_{P} (h, N, d, L, R) \leq ℬ_{D} (h, N, L, R) ≔ max_{\begin{matrix} (λ, τ) \in Λ, \\ γ \in ℝ \end{matrix}} {\frac{1}{2} {LR}^{2} γ : (\begin{matrix} S (h, λ, τ) & \frac{1}{2} τ \\ \frac{1}{2} τ^{⊤} & \frac{1}{2} γ \end{matrix}) ⪰ 0},

(D)

where $λ = {[λ_{1}, \dots, λ_{N - 1}]}^{⊤} \in ℝ_{+}^{N - 1}, τ = {[τ_{0}, \dots, τ_{N - 1}]}^{⊤} \in ℝ_{+}^{N}$ , and

Λ = {(λ, τ) \in ℝ_{+}^{2 N - 1} : \begin{matrix} τ_{0} = λ_{1}, λ_{N - 1} + τ_{N - 1} = 1, \\ λ_{i} - λ_{i + 1} + τ_{i} = 0, i = 1, \dots, N - 2, \end{matrix}},

(23)

S (h, λ, τ) = \sum_{i = 1}^{N - 1} λ_{i} {\overset{ˇ}{A}}_{i - 1, i} (h) + \sum_{i = 0}^{N - 1} τ_{i} {\overset{ˇ}{D}}_{i} (h) .

(24)

This means that one can compute a valid upper bound (D) of (P) for given step coefficients h using a SDP. The next two sections provide an analytical solution to (D) for FPGM and similarly for our new generalized FPGM, superseding the use of numerical SDP solvers.

3.2. Generalized FPGM

We specify a feasible point of (D) that leads to our new generalized form of FPGM.

Lemma 4

For the following step coefficients:

h_{i + 1, k} = {\begin{matrix} \frac{t_{i + 1}}{T_{i + 1}} (t_{k} - \sum_{j = k + 1}^{i} h_{j, k}), & k = 0, \dots, i - 1, \\ 1 + \frac{(t_{i} - 1) t_{i + 1}}{T_{i + 1}}, & k = i, \end{matrix}

(25)

the choice of variables:

λ_{i} = \frac{T_{i - 1}}{T_{N - 1}}, i = 1, \dots, N - 1, τ_{i} = \frac{t_{i}}{T_{N - 1}}, i = 0, \dots, N - 1, γ = \frac{1}{T_{N - 1}},

(26)

is a feasible point of (D) for any choice of t_i such that

t_{0} = 1, t_{i} > 0, and t_{i}^{2} \leq T_{i} ≔ \sum_{l = 0}^{i} t_{l} .

(27)

Proof

It is obvious that (λ, τ) in (26) with (27) is in Λ (23). Using (22), the (i, k)th entry of the symmetric matrix S(h, λ, τ) in (24) can be written as

S_{i, k} (h, λ, τ) = {\begin{matrix} \frac{1}{2} ((λ_{i} + τ_{i}) h_{i, k} + τ_{i} \sum_{j = k + 1}^{i - 1} h_{j, k}), & i = 2, \dots, N - 1, k = 0, \dots, i - 2, \\ \frac{1}{2} ((λ_{i} + τ_{i}) h_{i, i - 1} - λ_{i}), & i = 1, \dots, N - 1, k = i - 1, \\ \frac{1}{2} λ_{i + 1}, & i = 0, \dots, N - 2, k = i, \\ \frac{1}{2}, & i = N - 1, k = i, \end{matrix}

where each element S_i,k(h, λ, τ) corresponds to the coefficient of the term $u_{i} u_{k}^{⊤}$ of S(h, λ, τ) in (24). Then, inserting (25) and (26) to the above yields

S_{i, k} (h, λ, τ) = {\begin{matrix} \frac{1}{2} (\frac{T_{i}}{T_{N - 1}} \frac{t_{i}}{T_{i}} (t_{k} - \sum_{j = k + 1}^{i - 1} h_{j, k}) + \frac{t_{i}}{T_{N - 1}} \sum_{j = k + 1}^{i - 1} h_{j, k}), & i = 2, \dots, N - 1, k = 0, \dots, i - 2, \\ \frac{1}{2} (\frac{T_{i}}{T_{N - 1}} (1 + \frac{(t_{i - 1} - 1) t_{i}}{T_{i}}) - \frac{T_{i - 1}}{T_{N - 1}}), & i = 1, \dots, N - 1, k = i - 1, \\ \frac{T_{i}}{2 T_{N - 1}}, & i = 0, \dots, N - 1, k = i . \end{matrix} = {\begin{matrix} \frac{t_{i} t_{k}}{2 T_{N - 1}}, & i = 1, \dots, N - 1, k = 0, \dots, i - 1, \\ \frac{T_{i}}{2 T_{N - 1}}, & i = 0, \dots, N - 1, k = i . \end{matrix}

Then, using (26) and (27), we finally show the feasibility condition of (D):

(\begin{matrix} S (h, λ, τ) & \frac{1}{2} τ \\ \frac{1}{2} τ^{⊤} & \frac{1}{2} γ \end{matrix}) = \frac{1}{2 T_{N - 1}} (diag {T - t^{2}} + {tt}^{⊤}) ⪰ 0,

where t = (t₀, ⋯, t_N−1, 1)^⊤ and T = (T₀, ⋯, T_N−1, 1)^⊤.

FSFOM with the step coefficients (25) would be both computationally and memory-wise inefficient, so we next present an equivalent recursive form of FSFOM with (25), named Generalized FPGM (GFPGM).

Algorithm GFPGM

Input: f ∈ ℱ_L(ℝ^d), x₀ ∈ ℝ^d, y₀ = x₀, t₀ = T₀ = 1.

For i = 0, …, N − 1

x_{i + 1} = p_{L} (y_{i})

Choose t_{i + 1} s . t . t_{i + 1} > 0 and t_{i + 1}^{2} \leq T_{i + 1} ≔ \sum_{l = 0}^{i + 1} t_{l}

y_{i + 1} = x_{i + 1} + \frac{(T_{i} - t_{i}) t_{i + 1}}{t_{i} T_{i + 1}} (x_{i + 1} - x_{i}) + \frac{(t_{i}^{2} - T_{i}) t_{i + 1}}{t_{i} T_{i + 1}} (x_{i + 1} - y_{i})

Open in a new tab

Proposition 5

The sequence {x₀, ⋯, x_N} generated by FSFOM with step sizes (25) is identical to the corresponding sequence generated by GFPGM.

Proof

See Appendix B.

Using Lemma 4, the following theorem bounds the cost function for the GFPGM iterates.

Theorem 6

Let F : ℝ^d → ℝ be in ℱ_L(ℝ^d) and let x₀, ⋯, x_N ∈ ℝ^d be generated by GFPGM. Then for N ≥ 1,

F (x_{N}) - F (x_{*}) \leq \frac{{LR}^{2}}{2 T_{N - 1}} .

(28)

Proof

Using (D), Lemma 4 and Prop. 5, we have

F (x_{N}) - F (x_{*}) \leq ℬ_{D} (h, N, L, R) = \frac{1}{2} {LR}^{2} γ = \frac{{LR}^{2}}{2 T_{N - 1}} .

(29)

The GFPGM and Thm. 6 reduce to FPGM and (12) when $t_{i}^{2} = T_{i}$ for all i, and Sec. 3.4 describes that FPGM results from optimizing the step coefficients of FSFOM with respect to the cost function form of the relaxed PEP (D). This GFPGM also includes the choice $t_{i} = \frac{i + a}{a}$ for any a ≥ 2 as used in [5], which we denote as FPGM-a that differs from the algorithm in [5]. The following corollary provides a cost function worst-case bound for FPGM-a.

Corollary 7

Let F : ℝ^d → ℝ be in ℱ_L(ℝ^d) and let x₀, ⋯, x_N ∈ ℝ^d be generated by GFPGM with $t_{i} = \frac{i + a}{a}$ (FPGM-a) for any a ≥ 2. Then for N ≥ 1,

F (x_{N}) - F (x_{*}) \leq \frac{{aLR}^{2}}{N (N + 2 a - 1)} .

(30)

Proof

Thm. 6 implies (30), since $t_{i} = \frac{i + a}{a}$ satisfies (27), i.e.,

T_{i} - t_{i}^{2} = \frac{(i + 1) (i + 2 a)}{2 a} - \frac{{(i + a)}^{2}}{a^{2}} = \frac{(a - 2) i^{2} + a (2 a - 3) i}{2 a^{2}} \geq 0

(31)

for any a ≥ 2 and all i ≥ 0.

3.3. Related work of GFPGM

This section shows that the GFPGM has a close connection to the accelerated algorithm in [25] that was developed specifically for a constrained smooth convex problem with a closed convex set Q, i.e.,

ϕ (x) = I_{Q} (x) ≔ {\begin{matrix} 0, & x \in Q, \\ \infty, & otherwise . \end{matrix}

(32)

The projection operator P_Q(x) ≔ arg min_y∈Q ‖x − y‖ is used for the proximal gradient update (2).

We show that the GFPGM can be written in the following equivalent form, named GFPGM′, which is similar to that of the accelerated algorithm in [25] shown below. Note that the accelerated algorithm in [25] satisfies the bound (28) of the GFPGM in [25, Thm. 2] when ϕ(x) = I_Q(x).

Algorithm GFPGM′

Input: f ∈ ℱ_L(ℝ^d), x₀ ∈ ℝ^d, y₀ = x₀, t₀ = T₀ = 1.

For i = 0, …, N − 1

x_{i + 1} = p_{L} (y_{i}) = y_{i} - \frac{1}{L} {\tilde{\nabla}}_{L} F (y_{i})

z_{i + 1} = y_{0} - \frac{1}{L} \sum_{k = 0}^{i} t_{k} {\tilde{\nabla}}_{L} F (y_{k})

Choose t_{i + 1} s . t . t_{i + 1} > 0 and t_{i + 1}^{2} \leq T_{i + 1} = \sum_{l = 0}^{i + 1} t_{l}

y_{i + 1} = (1 - \frac{t_{i + 1}}{T_{i + 1}}) x_{i + 1} + \frac{t_{i + 1}}{T_{i + 1}} z_{i + 1}

Open in a new tab

Algorithm [25] for ϕ(x) = I_Q(x)

Input: f ∈ ℱ_L(ℝ^d), x₀ ∈ ℝ^d, y₀ = x₀, t₀ = T₀ = 1.

For i = 0, …, N − 1

x_{i + 1} = p_{L} (y_{i}) = P_{Q} (y_{i} - \frac{1}{L} \nabla f (y_{i}))

z_{i + 1} = P_{Q} (y_{0} - \frac{1}{L} \sum_{k = 0}^{i} t_{k} \nabla f (y_{k}))

Choose t_{i + 1} s . t . t_{i + 1} > 0 and t_{i + 1}^{2} \leq T_{i + 1} = \sum_{l = 0}^{i + 1} t_{l}

y_{i + 1} = (1 - \frac{t_{i + 1}}{T_{i + 1}}) x_{i + 1} + \frac{t_{i + 1}}{T_{i + 1}} z_{i + 1}

Open in a new tab

Proposition 8

The sequence {x₀, ⋯, x_N} generated by GFPGM is identical to the corresponding sequence generated by GFPGM′.

Proof

See Appendix C.

Clearly GFPGM′ and the accelerated algorithm in [25] are equivalent for the unconstrained smooth convex problem (Q = ℝ^d). However, when the operation P_Q(x) is relatively expensive, our GFPGM and GFPGM′ that use one projection per iteration could be preferred over the accelerated algorithm in [25] that uses two projections per iteration.

3.4. Optimizing step coefficients of FSFOM using the cost function form of PEP

To find the step coefficients in the class FSFOM that are optimal in terms of the cost function form of PEP, we would like to solve the following problem:

{\hat{h}}_{P} ≔ \underset{h \in ℝ^{N (N + 1) / 2}}{arg min} ℬ_{P} (h, N, d, L, R) .

(HP)

Because (HP) seems intractable, we instead optimize the step coefficients using the relaxed bound in (D):

{\hat{h}}_{D} ≔ \underset{h \in ℝ^{N (N + 1) / 2}}{arg min} ℬ_{D} (h, N, L, R) .

(HD)

The problem (HD) is bilinear, and a convex relaxation technique in [11, Thm. 3] makes it solvable using numerical methods. We optimized (HD) numerically for many choices of N using a SDP solver [7, 15] and based on our numerical results (not shown) we conjecture that the feasible point in Lemma 4 with $t_{i}^{2} = T_{i}$ that corresponds to FPGM (FISTA) is a global minimizer of (HD). It is straightforward to show that the step coefficients in Lemma 4 with $t_{i}^{2} = T_{i}$ give the smallest bound of (D) and (28) among all feasible points in Lemma 4, but showing optimality among all possible feasible points of (HD) may require further derivations as in [17, Lemma 3] using KKT conditions, which we leave as future work.

This section has provided a new worst-case bound proof of FPGM using the relaxed PEP, and suggested that FPGM corresponds to FSFOM with optimized step coefficients using the cost function form of the relaxed PEP. The next section provides a different optimization of the step coefficients of FSFOM that targets the norm of the composite gradient mapping, because minimizing the norm of the composite gradient mapping is important in dual problems (see [9, 22, 26] and (8)).

4. Relaxation and optimization of the composite gradient mapping form of PEP

4.1. Relaxation for the composite gradient mapping form of PEP

To form a worst-case bound on the norm of the composite gradient mapping for a given h of FSFOM, we use the following PEP that replaces F(x_N)−F(x_*) in (P) by the norm squared of the composite gradient mapping. Here, we consider the smallest composite gradient mapping norm squared among all iterates⁷ (min_{x∈Ω_N} ‖L(p_L(x) − x)‖² = min_{x∈Ω_N} ‖∇̃_LF(x)‖² where Ω_N ≔ {y₀, ⋯, y_N−1, x_N}) as follows:

\begin{matrix} ℬ_{P'} (h, N, d, L, R) ≔ & max_{\begin{matrix} F \in ℱ_{L} (ℝ^{d}), \\ x_{0}, \dots, x_{N} \in ℝ^{d}, x_{*} \in X_{*} (F), \\ y_{0}, \dots, y_{N - 1} \in ℝ^{d} \end{matrix}} min_{x \in Ω_{N}} {‖ L (p_{L} (x) - x) ‖}^{2} \\ s . t . x_{i + 1} = p_{L} (y_{i}), i = 0, \dots, N - 1, ‖ x_{0} - x_{*} ‖ \leq R, \\ y_{i + 1} = y_{i} + \sum_{k = 0}^{i} h_{i + 1, k} (x_{k + 1} - y_{k}), i = 0, \dots, N - 2 . \end{matrix}

(P′)

Because this infinite-dimensional max-min problem appears intractable, similar to the relaxation from (P) to (P1), we relax (P′) to a finite-dimensional problem with an additional constraint resulting from (6) that is equivalent to

\frac{L}{2} {‖ p_{L} (x_{N}) - x_{N} ‖}^{2} \leq F (x_{N}) - F (x_{*})

(33)

and conditions that are equivalent to α ≤ ‖L(p_L(x) − x)‖² for all x ∈ Ω_N after replacing min_{x∈Ω_N} ‖L(p_L(x) − x)‖² by α as in [30].⁸ This relaxation leads to

\begin{matrix} ℬ_{P 1'} (h, N, d, L, R) ≔ & max_{\begin{matrix} \bar{G} \in ℝ^{(N + 1) \times d}, \\ δ \in ℝ^{N}, α \in ℝ \end{matrix}} L^{2} R^{2} α \\ s . t . Tr {{\bar{G}}^{⊤} {\bar{A}}_{i - 1, i} (h) \bar{G}} \leq δ_{i - 1} - δ_{i}, i = 1, \dots, N - 1, \\ Tr {{\bar{G}}^{⊤} {\bar{D}}_{i} (h) \bar{G} + ν {\bar{u}}_{i}^{⊤} \bar{G}} \leq - δ_{i}, i = 0, \dots, N - 1, \\ Tr {\frac{1}{2} {\bar{G}}^{⊤} {\bar{u}}_{N} {\bar{u}}_{N}^{⊤} \bar{G}} \leq δ_{N - 1}, \\ Tr {- {\bar{G}}^{⊤} {\bar{u}}_{i} {\bar{u}}_{i}^{⊤} \bar{G}} \leq - α, i = 0, \dots, N, \end{matrix}

(P1′)

for any given unit vector ν ∈ ℝ^d, by defining the (i + 1)th standard basis vector ū_i = e_i+1 ∈ ℝ^N+1, the matrices

{\bar{A}}_{i - 1, i} (h) = (\begin{matrix} {\overset{ˇ}{A}}_{i - 1, i} (h) & 0 \\ 0^{⊤} & 0 \end{matrix}), {\bar{D}}_{i} (h) = (\begin{matrix} {\overset{ˇ}{D}}_{i} (h) & 0 \\ 0^{⊤} & 0 \end{matrix})

(34)

where 0 = [0, …, 0]^⊤ ∈ ℝ^N, and the matrix Ḡ = [G^⊤, ḡ_N]^⊤ ∈ ℝ^(N+1)×d where

{\bar{g}}_{N} ≔ - \frac{1}{‖ y_{0} - x_{*} ‖} (p_{L} (x_{N}) - x_{N}) = \frac{1}{L ‖ y_{0} - x_{*} ‖} {\tilde{\nabla}}_{L} F (x_{N}) .

(35)

Similar to (D) and [16, Problem (D″)], we have the following dual formulation of (P1′) that could be solved using SDP:

ℬ_{D'} (h, N, L, R) ≔ min_{\begin{matrix} (λ, τ, η, β) \in Λ', \\ γ \in ℝ \end{matrix}} {\frac{1}{2} L^{2} R^{2} γ : (\begin{matrix} S' (h, λ, τ, η, β) & \frac{1}{2} {[τ^{⊤}, 0]}^{⊤} \\ \frac{1}{2} [τ^{⊤}, 0] & \frac{1}{2} γ \end{matrix}) ⪰ 0}

(D′)

where η ∈ ℝ₊, $β = {[β_{0}, \dots, β_{N}]}^{⊤} \in ℝ_{+}^{N + 1}$ , and

Λ' = {\begin{matrix} (λ, τ, η, β) \in ℝ_{+}^{3 N + 1} : & \begin{matrix} τ_{0} = λ_{1}, λ_{N - 1} + τ_{N - 1} = η, \sum_{i = 0}^{N} β_{i} = 1, \\ λ_{i} - λ_{i + 1} + τ_{i} = 0, i = 1, \dots, N - 2 \end{matrix} \end{matrix}},

(36)

S' (h, λ, τ, η, β) = \sum_{i = 1}^{N - 1} λ_{i} {\bar{A}}_{i - 1, i} (h) + \sum_{i = 0}^{N - 1} τ_{i} {\bar{D}}_{i} (h) + \frac{1}{2} η {\bar{u}}_{N} {\bar{u}}_{N}^{⊤} - \sum_{i = 0}^{N} β_{i} {\bar{u}}_{i} {\bar{u}}_{i}^{⊤} .

(37)

The next section specifies a feasible point of interest that is in the class of GFPGM and analyzes the worst-case bound of the norm of the composite gradient mapping. Then we optimize the step coefficients of FSFOM with respect to the composite gradient mapping form of PEP leading to a new algorithm that differs from Nesterov’s acceleration for decreasing the cost function.

4.2. Worst-case analysis of the composite gradient mapping of GF-PGM

The following lemma provides feasible point of (D′) for the step coefficients (25) of GFPGM.

Lemma 9

For the step coefficients {h_i+1,k} in (25), the choice of variables

λ_{i} = T_{i - 1} τ_{0}, i = 1, \dots, N - 1, τ_{i} = {\begin{matrix} {(\frac{1}{2} (\sum_{k = 0}^{N - 1} (T_{k} - t_{k}^{2}) + T_{N - 1}))}^{- 1}, & i = 0, \\ t_{i} τ_{0}, & i = 1, \dots, N - 1, \end{matrix}

(38)

η = T_{N - 1} τ_{0}, β_{i} = {\begin{matrix} \frac{1}{2} (T_{i} - t_{i}^{2}) τ_{0}, & i = 0, \dots, N - 1, \\ \frac{1}{2} T_{N - 1} τ_{0}, & i = N, \end{matrix} γ = τ_{0} .

(39)

is a feasible point of (D′) for any choice of t_i and T_i satisfying (27).

Proof

It is obvious that (λ, τ, η, β) in (38) and (39) with (27) is in Λ′ (36). Using (22) and (34), the (i, k)th entry of the symmetric matrix S′(h, λ, τ, η, β) in (37) can be written as

S_{i, k}^{'} (h, λ, τ, η, β) = {\begin{matrix} \frac{1}{2} ((λ_{i} + τ_{i}) h_{i, k} + τ_{i} \sum_{j = k + 1}^{i - 1} h_{j, k}), & i = 2, \dots, N - 1, k = 0, \dots, i - 2, \\ \frac{1}{2} ((λ_{i} + τ_{i}) h_{i, i - 1} - λ_{i}), & i = 1, \dots, N - 1, k = i - 1, \\ \frac{1}{2} λ_{i + 1} - β_{i}, & i = 0, \dots, N - 2, k = i, \\ \frac{1}{2} η - β_{i}, & i = N - 1, N, k = i, \\ 0, & i = N, k = 0, \dots, i - 1, \end{matrix}

and inserting (25), (38), and (39) yields

S_{i, k}^{'} (h, λ, τ, η, β) = {\begin{matrix} \frac{1}{2} (T_{i} τ_{0} \frac{t_{i}}{T_{i}} (t_{k} - \sum_{j = k + 1}^{i - 1} h_{j, k}) + t_{i} τ_{0} \sum_{j = k + 1}^{i - 1} h_{j, k}), & i = 2, \dots, N - 1, k = 0, \dots, i - 2, \\ \frac{1}{2} (T_{i} τ_{0} (1 + \frac{(t_{i - 1} - 1) t_{i}}{T_{i}}) - T_{i - 1} τ_{0}), & i = 1, \dots, N - 1, k = i - 1, \\ \frac{1}{2} T_{i} τ_{0} - \frac{1}{2} (T_{i} - t_{i}^{2}) τ_{0}, & i = 0, \dots, N - 1, k = i, \\ 0, & i = N, k = 0, \dots, i, \end{matrix} = {\begin{matrix} \frac{1}{2} t_{i} t_{k} τ_{0}, & i = 0, \dots, N - 1, k = 0, \dots, i, \\ 0, & i = N, k = 0, \dots, i . \end{matrix}

Finally, by defining t̄ = (t₀, ⋯, t_N−1, 0, 1)^⊤ we have the feasibility condition of (D′):

(\begin{matrix} S' (h, λ, τ, η, β) & \frac{1}{2} {[τ^{⊤}, 0]}^{⊤} \\ \frac{1}{2} [τ^{⊤}, 0] & \frac{1}{2} γ \end{matrix}) = \frac{1}{2} \bar{t} \bar{t} τ_{0} ⪰ 0 .

Using Lemma 9, the following theorem bounds the (smallest) norm of the composite gradient mapping for the GFPGM iterates.

Theorem 10

Let f : ℝ^d → ℝ be in ℱ_L(ℝ^d) and let x₀, ⋯, x_N, y₀, ⋯, y_N−1 ∈ ℝ^d be generated by GFPGM. Then for N ≥ 1,

min_{i \in {0, \dots, N}} ‖ {\tilde{\nabla}}_{L} F (x_{i}) ‖ \leq min_{x \in Ω_{N}} ‖ {\tilde{\nabla}}_{L} F (x) ‖ \leq \frac{LR}{\sqrt{\sum_{k = 0}^{N - 1} (T_{k} - t_{k}^{2}) + T_{N - 1}}} .

(40)

Proof

Lemma 1 implies the first inequality of (40). Using (D′), Lemma 9 and Prop. 5, we have

min_{x \in Ω_{N}} {‖ {\tilde{\nabla}}_{L} F (x) ‖}^{2} \leq ℬ_{D'} (h, N, L, R) = \frac{1}{2} L^{2} R^{2} γ = \frac{L^{2} R^{2}}{\sum_{k = 0}^{N - 1} (T_{k} - t_{k}^{2}) + T_{N - 1}},

which is equivalent to (40).

Although the bound (40) is not tight due to the relaxation on PEP, next two sections show that there exists choices of t_i that provide a rate $O (1 / N^{\frac{3}{2}})$ for decreasing the composite gradient mapping, including the choice that optimizes the composite gradient mapping form of PEP.

FGM for smooth convex minimization was shown to achieve the rate $O (1 / N^{\frac{3}{2}})$ for the decrease of the usual gradient in [16]. In contrast, Thm. 10 provides only a O(1/N) bound for FPGM (or GFPGM with t_i (11)) on the decrease of the composite gradient mapping since $T_{i} = t_{i}^{2}$ for all i and the value of T_N−1 is O(N²) for t_i (11). Sec. 5 below numerically studies a tight bound on the composite gradient mapping of FPGM and illustrates that it has a rate that is faster than the rate O(1/N) of Thm. 10, indicating there is a room for improvement in the composite gradient mapping form of the relaxed PEP

4.3. Optimizing step coefficients of FSFOM using the composite gradient mapping form of PEP

To optimize the step coefficients in the class FSFOM in terms of the composite gradient mapping form of the relaxed PEP (D′), we would like to solve the following problem:

{\hat{h}}_{D'} ≔ \underset{h \in ℝ^{N (N + 1) / 2}}{arg min} ℬ_{D'} (h, N, L, R) .

(HD′)

Similar to (HD), we use a convex relaxation [11, Thm. 3] to make the bilinear problem (HD′) solvable using numerical methods. We then numerically optimized (HD′) for many choices of N using a SDP solver [7, 15] and found that the following choice of t_i:

t_{i} = {\begin{matrix} 1, & i = 0, \\ \frac{1 + \sqrt{1 + 4 t_{i - 1}^{2}}}{2}, & i = 1, \dots, ⌊ \frac{N}{2} ⌋ - 1, \\ \frac{N - i + 1}{2}, & i = ⌊ \frac{N}{2} ⌋, \dots, N - 1, \end{matrix}

(41)

makes the feasible point in Lemma 9 optimal empirically with respect to the relaxed bound (HD′). Interestingly, whereas the usual t_i factors (such as (11) and $t_{i} = \frac{i + a}{a}$ for any a ≥ 2) increase with i indefinitely, here, the factors begin decreasing after $i = ⌊ \frac{N}{2} ⌋ - 1$ .

We also noticed numerically that finding the t_i that minimizes the bound (40), i.e., solving the following constrained quadratic problem:

max_{{t_{i}}} {\sum_{k = 0}^{N - 1} (\sum_{l = 0}^{k} t_{l} - t_{k}^{2}) + \sum_{l = 0}^{N - 1} t_{l}} s . t . t_{i} satisfies (27) for all i,

(42)

is equivalent to optimizing (HD′). This means that the solution of (42) numerically appears equivalent to (41), the (conjectured) solution of (HD′). Interestingly, the unconstrained maximizer of (42) without the constraint (27) is $t_{i} = \frac{N - i + 1}{2}$ , and this partially appears in the constrained maximizer (41) of the problem (42).

Based on this numerical evidence, we conjecture that the solution ĥ_D′ of problem (HD′) corresponds to (25) with (41). Using Prop. 5, the following GFPGM form with (41) is equivalent to FSFOM with the step coefficients (25) for (41) that are optimized step coefficients of FSFOM with respect to the norm of the composite gradient mapping, which we name FPGM-OCG (OCG for optimized over composite gradient mapping).

Algorithm FPGM-OCG (GFPGM with t_i in (41))

Input:

f \in C_{L}^{1, 1} (ℝ^{d})

convex, x₀ ∈ ℝ^d, y₀ = x₀, t₀ = T₀ = 1.

For i = 0, …, N − 1

x_{i + 1} = p_{L} (y_{i})

t_{i + 1} = {\begin{matrix} \frac{1 + \sqrt{1 + 4 t_{i}^{2}}}{2}, & i = 1, \dots, ⌊ \frac{N}{2} ⌋ - 2, \\ \frac{N - i}{2}, & i = ⌊ \frac{N}{2} ⌋ - 1, \dots, N - 2, \end{matrix}

y_{i + 1} = x_{i + 1} + \frac{(T_{i} - t_{i}) t_{i + 1}}{t_{i} T_{i + 1}} (x_{i + 1} - x_{i}) + \frac{(t_{i}^{2} - T_{i}) t_{i + 1}}{t_{i} T_{i + 1}} (x_{i + 1} - y_{i}), i < N = 1

Open in a new tab

The following theorem bounds the cost function and the (smallest) norm of the composite gradient mapping for the FPGM-OCG iterates.

Theorem 11

Let F : ℝ^d → ℝ be in ℱ_L(ℝ^d) and let x₀, ⋯, x_N, y₀, ⋯, y_N−1 ∈ ℝ^d be generated by FPGM-OCG. Then for N ≥ 1,

F (x_{N}) - F (x_{*}) \leq \frac{4 L {‖ x_{0} - x_{*} ‖}^{2}}{N (N + 4)},

(43)

and for N ≥ 3,

min_{i \in {0, \dots, N}} ‖ {\tilde{\nabla}}_{L} F (x_{i}) ‖ \leq min_{x \in Ω_{N}} ‖ {\tilde{\nabla}}_{L} F (x) ‖ \leq \frac{2 \sqrt{6} LR}{N \sqrt{N - 2}} .

(44)

Proof

FPGM-OCG is an instance of the GFPGM, and thus Thm. 6 implies (43) using

T_{N - 1} = T_{m - 1} + \sum_{k = m}^{N - 1} t_{k} = t_{m - 1}^{2} + \sum_{k = m}^{N - 1} \frac{N - k + 1}{2} = t_{m - 1}^{2} + \sum_{k' = 2}^{N - m + 1} \frac{k'}{2} \geq \frac{{(m + 1)}^{2}}{4} + \frac{(N - m + 1) (N - m + 2)}{4} - \frac{1}{2} \geq \frac{2 N^{2} + 8 N + 1}{16},

where $m = ⌊ \frac{N}{2} ⌋ \geq \frac{N - 1}{2}, N - m \geq \frac{N}{2}$ , and $T_{m - 1} = t_{m - 1}^{2} \geq \frac{{(m + 1)}^{2}}{4}$ (13).

In addition, Thm. 10 implies (44), using

\sum_{k = 0}^{N - 1} (T_{k} - t_{k}^{2}) + T_{N - 1} \geq \frac{1}{24} (N - 2) N^{2},

(45)

which we prove in the Appendix E.

The composite gradient mapping bound (44) of FPGM-OCG is asymptotically $\frac{2 \sqrt{2}}{3}$ -times smaller than the bound (15) of $FPGM - (m = ⌊ \frac{2 N}{3} ⌋)$ . In addition, the cost function bound (43) of FPGM-OCG satisfies the optimal rate O(1/N²), although the bound (43) is two-times larger than the analogous bound (12) of FPGM.

4.4. Decreasing the composite gradient mapping with a rate $O (1 / N^{\frac{3}{2}})$ without selecting N in advance

FPGM-OCG and FPGM-m satisfy a fast rate $O (1 / N^{\frac{3}{2}})$ for decreasing the norm of the composite gradient mapping but require one to select the total number of iterations N in advance, which could be undesirable in practice. One could use FPGM-σ in [21] that does not require selecting N in advance, but instead we suggest a new choice of t_i in GFPGM that satisfies a composite gradient mapping bound that is lower than the bound (17) of FPGM-σ.

Based on Thm. 10, the following corollary shows that GFPGM with $t_{i} = \frac{i + a}{a}$ (FPGM-a) for any a > 2 satisfies the rate $O (1 / N^{\frac{3}{2}})$ of the norm of the composite gradient mapping without selecting N in advance. (Cor. 7 showed that FPGM-a for any a ≥ 2 satisfies the optimal rate O(1/N²) of the cost function.)

Corollary 12

Let f : ℝ^d → ℝ be in ℱ^L(ℝ^d) and let x₀, ⋯, x_N, y₀, ⋯, y_N−1 ∈ ℝ^d be generated by GFPGM with $t_{i} = \frac{i + a}{a}$ (FPGM-a) for any a ≥ 2. Then for N ≥ 1, we have the following bound on the (smallest) composite gradient mapping:

min_{i \in {0, \dots, N}} ‖ {\tilde{\nabla}}_{L} F (x_{i}) ‖ \leq min_{x \in Ω_{N}} ‖ {\tilde{\nabla}}_{L} F (x) ‖ \leq \frac{a \sqrt{6} LR}{\sqrt{N ((a - 2) N^{2} + 3 (a^{2} - a + 1) N + (3 a^{2} + 2 a - 1))}} .

(46)

Proof

With $T_{i} = \frac{(i + 1) (i + 2 a)}{2 a}$ and (31), Thm. 10 implies (46) using

\sum_{k = 0}^{N - 1} (T_{k} - t_{k}^{2}) + T_{N - 1} = \sum_{k = 0}^{N - 1} (\frac{(k + 1) (k + 2 a)}{2 a} - \frac{{(k + a)}^{2}}{a^{2}}) + \frac{N (N + 2 a - 1)}{2 a} = \sum_{k = 0}^{N - 1} (\frac{(a - 2) k^{2} + a (2 a - 3) k}{2 a^{2}}) + \frac{N (N + 2 a - 1)}{2 a} = \frac{N}{2 a^{2}} (\frac{(a - 2) (N - 1) (2 N - 1)}{6} + \frac{a (2 a - 3) (N - 1)}{2} + a (N + 2 a - 1)) = \frac{N ((a - 2) N^{2} + 3 (a^{2} - a + 1) N + (3 a^{2} + 2 a - 1))}{6 a^{2}} .

FPGM-a for any a > 2 has a composite gradient mapping bound (46) that is asymptotically $\frac{a}{2 \sqrt{a - 2}}$ -times larger than the bound (44) of FPGM-OCG. This gap reduces to $\sqrt{2}$ at best when a = 4, which is clearly better than that of FPGM-σ. Therefore, this FPGM-a algorithm will be useful for minimizing the composite gradient mapping with a rate $O (1 / N^{\frac{3}{2}})$ without selecting N in advance.

5. Discussion

5.1. Summary of analytical worst-case bounds on the cost function and the composite gradient mapping

Table 1 summarizes the asymptotic worst-case bounds of all algorithms discussed in this paper. (Note that the bounds are not guaranteed to be tight.) In Table 1, FPGM and FPGM-OCG provide the best known analytical worst-case bounds for decreasing the cost function and the composite gradient mapping respectively. When one does not want to select N in advance for decreasing the composite gradient mapping, FPGM-a will be a useful alternative to FPGM-OCG.

Table 1.

Asymptotic worst-case bounds on the cost function F(x_N) − F(x_*) and the norm of the composite gradient mapping min_{x∈Ω_N} ‖∇̃_LF(x)‖ of PGM, FPGM, FPGM-σ, FPGM-m, FPGM-OCG, and FPGM-a. (The cost function bound for FPGM-m in the table corresponds to the bound for FPGM after m iterations because a tight bound for the final N th iteration is unknown. The bound on min_{i∈{0, …, N}} ‖∇̃_L/σ²F(y_i)‖ is used for FPGM-σ.)

Algorithm

Asymptotic worst-case bound

Require selecting
N in advance

Cost function (×LR²)

Proximal gradient (×LR)

PGM

\frac{1}{2} N^{- 1}

2N⁻¹

FPGM

2N⁻²

2N⁻¹

FPGM-σ (0 < σ < 1)

\frac{2}{σ^{2}} N^{- 2}

\frac{2 \sqrt{3}}{σ^{2}} \sqrt{\frac{1 + σ}{1 - σ}} N^{- \frac{3}{2}}

FPGM-(σ = 0.78)

3.3N⁻²

16.2 N^{- \frac{3}{2}}

FPGM - (m = ⌊ \frac{2 N}{3} ⌋)

4.5N⁻²

5.2 N^{- \frac{3}{2}}

Yes

FPGM-OCG

4N−²

4.9 N^{- \frac{3}{2}}

Yes

FPGM-a (a > 2)

aN ⁻²

\frac{a \sqrt{6}}{\sqrt{a - 2}} N^{- \frac{3}{2}}

FPGM-(a = 4)

4N⁻²

6.9 N^{- \frac{3}{2}}

Open in a new tab

5.2. Tight worst-case bounds on the cost function and the smallest composite gradient mapping norm

Since none of the bounds presented in Table 1 are guaranteed to be tight, we modified the code⁹ (using SDP solvers [20, 28]) in Taylor et al. [29] to compare tight (numerical) bounds for the cost function and the composite gradient mapping in Tables 2 and 3 respectively for N = 1, 2, 4, 10, 20, 30, 40, 47, 50. This numerical bound is guaranteed to be tight when the large-scale condition is satisfied [29]. Taylor et al. [29, Fig. 1] already studied a tight worst-case bound on the cost function decrease of FPGM numerically, and found that the analytical bound (12) is asymptotically tight. Table 2 additionally provides numerical tight bounds on the cost function of all algorithms presented in this paper, also suggesting that our relaxation of the cost function form of the PEP from (P) to (D) is asymptotically tight (for some algorithms). In addition, the trend of the tight bounds of the composite gradient mapping in Table 3 follows that of the bounds in Table 1. However, there is gap between them that is not asymptotically tight, unlike the gap between the bounds of the cost function in Tables 1 and 2. In particular, the numerical tight bound for the composite gradient mapping of FPGM in Table 3 has a rate faster than the known rate O(1/N) in Thm. 10. We leave reducing this gap for the bounds on the norm of the composite gradient mapping as future work, possibly with a tighter relaxation of PEP. In addition, $FPGM - (m = ⌊ \frac{2 N}{3} ⌋)$ has a numerical tight bound in Table 3 that is even slightly better than that of FPGM-OCG, unlike our expectation from the analytical bounds in Table 1 and Sec. 4.3. This shows room for improvement in optimizing the step coefficients of FSFOM with respect to the composite gradient mapping, again possibly with a tighter relaxation of PEP.

Table 2.

Tight worst-case bounds on the cost function LR²/(F(x_N) − F(x_*)) of PGM, FPGM, FPGM-(σ = 0.78), $FPGM - (m = ⌊ \frac{2 N}{3} ⌋)$ , FPGM-OCG, and FPGM-(a = 4). We computed empirical rates by assuming that the bounds follow the form bN^c with constants b and c, and then by estimating c from points N = 47, 50. Note that the corresponding empirical rates are underestimated due to the simplified exponential model.

PGM

FPGM

FPGM
-(σ = 0.78)

FPGM - (m = ⌊ \frac{2 N}{3} ⌋)

FPGM
-OCG

FPGM
-(a = 4)

4.00

2.43

4.00

8.00

4.87

8.00

16.00

19.35

11.77

17.13

17.60

17.23

40.00

79.07

48.11

56.47

59.25

55.88

80.00

261.66

159.19

163.75

170.10

159.17

120.00

546.51

332.49

321.56

331.97

312.03

160.00

932.89

567.57

502.37

544.55

514.73

188.00

1263.58

768.76

675.68

723.06

686.33

200.00

1420.45

864.20

752.90

807.66

767.37

Empi. O(·)

N ^−1.00

N ^−1.89

N ^−1.75

N ^−1.79

N ^−1.80

Known O(·)

N ⁻¹

N ⁻²

Open in a new tab

Table 3.

Tight worst-case bounds on the norm of the composite gradient mapping LR/(min_{x∈Ω_N} ‖∇̃_LF(x)‖) of PGM, FPGM, FPGM-(σ = 0.78), $FPGM - (m = ⌊ \frac{2 N}{3} ⌋)$ , FPGM-OCG, and FPGM-(a = 4). Empirical rates were computed as described in Table 2. (The bound for FPGM-σ uses min_{x∈Ω_N} ‖∇̃_L/σ²F(x)‖.)

PGM

FPGM

FPGM
-(σ = 0.78)

FPGM - (m = ⌊ \frac{2 N}{3} ⌋)

FPGM
-OCG

FPGM
-(a=4)

1.84

1.18

1.84

2.83

1.78

2.83

4.81

5.65

3.50

5.09

5.21

5.12

10.80

13.24

8.74

14.91

15.60

14.76

20.78

27.19

18.83

39.70

39.61

29.21

30.78

43.49

30.82

64.45

64.40

47.14

40.78

61.76

44.39

92.82

91.99

67.82

47.77

75.60

54.73

113.92

113.41

83.67

50.77

81.78

59.35

123.54

123.17

90.78

Empi. O(·)

N ^−0.98

N ^−1.27

N ^−1.31

N ^−1.33

N ^−1.32

Known O(·)

N ⁻¹

N^{- \frac{3}{2}}

N^{- \frac{3}{2}}

N^{- \frac{3}{2}}

N^{- \frac{3}{2}}

Open in a new tab

5.3. Tight worst-case bounds on the final compo site gradient mapping

This paper focused on analyzing the worst-case bound of the smallest composite gradient mapping among all iterates (min_{x∈Ω_N} ‖∇̃_LF(x)‖) in addition to the cost function, whereas the composite gradient mapping at the final iterate (‖∇̃_LF(x_N)‖) could be also considered (see Appendix D). For example, the composite gradient mapping bounds (10) and (15) for PGM and FPGM-m also apply to the final composite gradient mapping, and using (6) we can easily derive a (loose) worst-case bound on the final composite gradient mapping for other algorithms, e.g., such a final composite gradient mapping bound for GFPGM is as follows:

‖ {\tilde{\nabla}}_{L} F (x_{N}) ‖ \overset{(6)}{\leq} \sqrt{2 L (F (x_{N}) - F (p_{L} (x_{N})))} \leq \sqrt{2 L (F (x_{N}) - F (x_{*}))} \overset{(28)}{\leq} \frac{LR}{\sqrt{T_{N - 1}}} .

(47)

Since the optimal rate for decreasing the cost function is O(1/N²), the composite gradient mapping worst-case bound (47) can provide only a rate O(1/N) at best. For completeness of the discussion, Table 4 reports tight numerical bounds for the final composite gradient mapping. Here, FPGM, FPGM-(σ = 0.78), and FPGM-(a = 4) have empirical rates of the worst-case bounds in Table 4 that are slower than those in Table 3, unlike the other three including FPGM-OCG.

Table 4.

Tight worst-case bounds on the norm of the final composite gradient mapping LR/‖∇̃_LF(x_N)‖ of PGM, FPGM, FPGM-(σ = 0.78), $FPGM - (m = ⌊ \frac{2 N}{3} ⌋)$ , FPGM-OCG, and FPGM-(a = 4). Empirical rates were computed as described in Table 2. (The bound for FPGM-σ uses ‖∇̃_L/σ²F(x_N)‖.)

PGM

FPGM

FPGM
-(σ = 0.78)

FPGM - (m = ⌊ \frac{2 N}{3} ⌋)

FPGM
-OCG

FPGM
-(a = 4)

1.84

1.18

1.84

2.83

1.78

2.83

4.81

5.65

3.50

5.09

5.21

5.12

10.80

12.68

8.41

14.91

15.60

14.76

20.78

22.02

14.26

39.65

39.10

25.96

30.78

31.26

20.12

64.40

63.40

34.21

40.78

40.46

25.97

92.78

90.16

42.39

47.77

46.89

30.06

113.92

110.12

48.13

50.77

49.65

31.81

123.53

118.99

50.59

Empi. O(·)

N ^−0.98

N ^−0.92

N ^−1.31

N ^−1.25

N ^−0.81

Known O(·)

N ⁻¹

N^{- \frac{3}{2}}

N ⁻¹

Open in a new tab

To best of our knowledge, FPGM-m (or algorithms that similarly perform accelerated algorithms in the beginning and run a proximal gradient method for the remaining iterations) is known only to have a rate $O (1 / N^{\frac{3}{2}})$ in (15) for decreasing the final composite gradient mapping, while FPGM-OCG was also found to inherit such fast rate in Table 4. Therefore, searching for first-order methods that have a worst-case bound on the final composite gradient mapping that is lower than that of FPGM-m (and FPGM-OCG), and that possibly do not require knowing N in advance is an interesting open problem. Note that a regularization technique in [26] that provides a faster rate O(1/N²) (up to a logarithmic factor) for decreasing the final gradient norm for smooth convex minimization can be easily extended for rapidly minimizing the final composite gradient mapping with such rate for the composite problem (M); however, that approach requires knowing R in advance.

5.4. Tight worst-case bounds on the final subgradient

This paper has mainly focused on the norm of the composite gradient mapping based on (8), instead of the subgradient norm that is of primary interest in the dual problem (see e.g., [9, 22, 26]). Therefore to have a better sense of subgradient norm bounds, we computed tight numerical bounds on the final¹⁰ subgradient norm ‖F′(x_N)‖ in Table 5 and compared them with Table 4.

Table 5.

Tight worst-case bounds on the subgradient norm LR/‖F′(x_N)‖ of PGM, FPGM, FPGM-(σ = 0.78), $FPGM - (m = ⌊ \frac{2 N}{3} ⌋)$ , FPGM-OCG, and FPGM-(a = 4), where F′(x) ∈ ∂F(x) is a subgradient. Empirical rates were computed as described in Table 2.

PGM

FPGM

FPGM
-(σ = 0.78)

FPGM - (m = ⌊ \frac{2 N}{3} ⌋)

FPGM
-OCG

FPGM
-(a = 4)

1.00

0.61

1.00

2.00

1.22

2.00

4.00

4.83

2.94

4.28

4.40

4.31

10.00

7.60

4.67

14.12

14.81

12.10

20.00

12.58

7.67

38.29

36.65

16.85

30.00

17.63

10.74

62.71

60.40

21.61

40.00

22.67

13.80

91.00

86.62

26.47

47.00

26.20

15.94

112.01

106.21

29.91

50.00

27.71

16.86

121.53

114.93

31.39

Empi. O(·)

N ^−1.00

N ^−0.91

N ^−1.32

N ^−1.27

N ^−0.78

Known O(·)

N ⁻¹

N^{- \frac{3}{2}}

N ⁻¹

Open in a new tab

For all six algorithms, empirical rates in Table 5 are similar to those for the final composite gradient mapping in Table 4. In particular, the subgradient norm bounds for the three algorithms PGM, $FPGM - (m = ⌊ \frac{2 N}{3} ⌋)$ , and FPGM-OCG are almost identical to those in Table 4 except for the first few iterations, eliminating the concern of using (8) for such cases. On the other hand, the other three algorithms FPGM, FPGM-(σ = 0.78), and FPGM-(a = 4) almost tightly satisfy the inequality (8) for most N, and thus have bounds on the final subgradient that are about twice larger than those on the final composite gradient mapping. Therefore, regardless of (8), Table 5 further supports the use of $FPGM - (m = ⌊ \frac{2 N}{3} ⌋)$ and FPGM-OCG over FPGM and other algorithms in dual problems.

6. Conclusion

This paper analyzed and developed fixed-step first-order methods (FSFOM) for nonsmooth composite convex cost functions. We showed an alternate proof of FPGM (FISTA) using PEP, and suggested that FPGM (FISTA) results from optimizing the step coefficients of FSFOM with respect to the cost function form of the (relaxed) PEP. We then described a new generalized version of FPGM and analyzed its worst-case bound using the (relaxed) PEP over both the cost function and the norm of the composite gradient mapping. Furthermore, we optimized the step coefficients of FSFOM with respect to the composite gradient mapping form of the (relaxed) PEP, yielding FPGM-OCG, which could be useful particularly when tackling dual problems.

Our relaxed PEP provided tractable analysis of the optimized step coefficients of FSFOM with respect to the cost function and the norm of the composite gradient mapping, but the relaxation is not guaranteed to be tight and the corresponding accelerations of PGM (FPGM and FPGM-OCG) are thus unlikely to be optimal. Therefore, finding optimal step coefficients of FSFOM over the cost function and the norm of the composite gradient mapping remain as future work. Nevertheless, the proposed FPGM-OCG that optimizes the composite gradient mapping form of the relaxed PEP and the FPGM-a (for any a > 2) may be useful in dual problems.

Software

Matlab codes for the SDP approaches in Sec. 3.4, Sec. 4.3 and Sec. 5 are available at https://gitlab.eecs.umich.edu/michigan-fast-optimization.

Acknowledgments

The authors would like to thank the anonymous referees for very useful comments that have improved the quality of this paper.

Funding: This research was supported in part by NIH grant U01 EB018753.

Appendix A

Derivation of the dual formulation (D) of (P1)

The derivation below is similar to [11, Lemma 2].

We replace max_G,δ LR²δ_N−1 of (P1) by min_G,δ{−δ_N−1} for convenience in this section. The corresponding dual function of such (P1) is then defined as

H (λ, τ; h) = min_{\begin{matrix} G \in ℝ^{N \times d}, \\ δ \in ℝ^{N} \end{matrix}} {ℒ (G, δ, λ, τ; h) ≔ ℒ_{1} (δ, λ, τ) + ℒ_{2} (G, λ, τ; h)}

for dual variables $λ = {[λ_{1}, \dots, λ_{N - 1}]}^{⊤} \in ℝ_{+}^{N - 1}$ and $τ = {[τ_{0}, \dots, τ_{N - 1}]}^{⊤} \in ℝ_{+}^{N}$ , where ℒ(G, δ, λ, τ; h) is a Lagrangian function, and

ℒ_{1} (δ, λ, τ) ≔ - δ_{N - 1} + \sum_{i = 1}^{N - 1} λ_{i} (δ_{i} - δ_{i - 1}) + \sum_{i = 0}^{N 1} τ_{i} δ_{i},

ℒ_{2} (G, λ, τ; h) ≔ \sum_{i = 1}^{N - 1} λ_{i} Tr {G^{⊤} {\overset{ˇ}{A}}_{i - 1, i} (h) G} + \sum_{i = 0}^{N - 1} τ_{i} Tr {G^{⊤} {\overset{ˇ}{D}}_{i} (h) G + ν u_{i}^{⊤} G} .

Here, min_δ ℒ₁(δ, λ, τ) = 0 for any (λ, τ) ∈ Λ where Λ is defined in (23), and min_δ ℒ₁(δ, λ, τ) = −∞ otherwise.

For any given unit vector ν, [11, Lemma 1] implies

min_{G \in ℝ^{N \times d}} ℒ_{2} (G, λ, τ) = min_{w \in ℝ^{N}} ℒ_{2} (w ν^{⊤}, λ, τ),

and thus for any (λ, τ) ∈ Λ, we can rewrite the dual function as

H (λ, τ; h) = min_{w \in ℝ^{N}} {w^{⊤} S (h, λ, τ) w + τ^{⊤} w} = max_{γ \in ℝ} {- \frac{1}{2} γ : w^{⊤} S (h, λ, τ) w + τ^{⊤} w \geq - \frac{1}{2} γ, \forall w \in ℝ^{N}} = max_{γ \in ℝ} {- \frac{1}{2} γ : (\begin{matrix} S (h, λ, τ) & \frac{1}{2} τ \\ \frac{1}{2} τ^{⊤} & \frac{1}{2} γ \end{matrix}) ⪰ 0},

where S(h, λ, τ) is defined in (24). Therefore the dual problem of (P1) becomes (D), recalling that we previously replaced max_G,δ LR²δ_N−1 of (P1) by min_G,δ{−δ_N−1}.

Appendix B

Proof of Prop. 5

The proof is similar to [17, Prop. 2, 3 and 4].

We first show that {h_i+1,k} in (25) is equivalent to

h_{i + 1, k} = {\begin{matrix} \frac{(T_{i} - t_{i}) t_{i + 1}}{t_{i} T_{i + 1}} h_{i, k} & i = 0, \dots, N - 1, k = 0, \dots, i - 2, \\ \frac{(T_{i} - t_{i}) t_{i + 1}}{t_{i} T_{i + 1}} (h_{i, i - 1} - 1), & i = 0, \dots, N - 1, k = i - 1, \\ 1 + \frac{(t_{i} - 1) t_{i + 1}}{T_{i + 1}}, & i = 0, \dots, N - 1, k = i, \end{matrix}

(48)

We use the notation $h_{i, k}^{'}$ for the coefficients (25) to distinguish from (48). It is obvious that $h_{i + 1, i}^{'} = h_{i + 1, i}$ , i = 0, …, N − 1, and we clearly have

h_{i + 1, i - 1}^{'} = \frac{t_{i + 1}}{T_{i + 1}} (t_{i - 1} - h_{i, i - 1}^{'}) = \frac{t_{i + 1}}{T_{i + 1}} (t_{i - 1} - (1 + \frac{(t_{i - 1} - 1) t_{i}}{T_{i}})) = \frac{(t_{i - 1} - 1) (T_{i} - t_{i}) t_{i + 1}}{T_{i} T_{i + 1}} = \frac{(T_{i} - t_{i}) t_{i + 1}}{t_{i} T_{i + 1}} (h_{i, i - 1} - 1) = h_{i + 1, i - 1}

We next use induction by assuming $h_{i + 1, k}^{'} = h_{i + 1, k}$ for i = 0, …, n − 1, k = 0, …, i. We then have

h_{n + 1, k}^{'} = \frac{t_{n + 1}}{T_{n + 1}} (t_{k} - \sum_{j = k + 1}^{n} h_{j, k}^{'}) = \frac{t_{n + 1}}{T_{n + 1}} (t_{k} - \sum_{j = k + 1}^{n - 1} h_{j, k}^{'} - h_{n, k}^{'}) = \frac{t_{n + 1}}{T_{n + 1}} (\frac{T_{n}}{t_{n}} h_{n, k}^{'} - h_{n, k}^{'}) = \frac{(T_{n} - t_{n}) t_{n + 1}}{t_{n} T_{n + 1}} h_{n, k} = h_{n + 1, k}

Next, using (48), we show that FSFOM with (25) is equivalent to the GFPGM. We use induction, and for clarity, we use the notation $y_{0}^{'}, \dots, y_{N}^{'}$ for FSFOM with (48). It is obvious that $y_{0}^{'} = y_{0}$ , and we have

y_{1}^{'} = y_{0}^{'} - \frac{1}{L} h_{1, 0} {\tilde{\nabla}}_{L} F (y_{0}^{'}) = y_{0} - \frac{1}{L} (1 + \frac{(t_{0} - 1) t_{1}}{T_{1}}) {\tilde{\nabla}}_{L} F (y_{0}) = x_{1} + \frac{(T_{0} - t_{0}) t_{1}}{t_{0} T_{1}} (x_{1} - x_{0}) + \frac{(t_{0}^{2} - T_{0}) t_{1}}{t_{0} T_{1}} (x_{1} - y_{0}) = y_{1},

since T₀ = t₀. Assuming $y_{i}^{'} = y_{i}$ for i = 0, …, n, we then have

y_{n + 1}^{'} = y_{n}^{'} - \frac{1}{L} h_{n + 1, n} {\tilde{\nabla}}_{L} F (y_{n}^{'}) - \frac{1}{L} h_{n + 1, n - 1} {\tilde{\nabla}}_{L} F (y_{n - 1}^{'}) - \frac{1}{L} \sum_{k = 0}^{n - 2} h_{n + 1, k} {\tilde{\nabla}}_{L} F (y_{k}^{'}) = y_{n} - \frac{1}{L} (1 + \frac{(t_{n} - 1) t_{n + 1}}{T_{n + 1}}) {\tilde{\nabla}}_{L} F (y_{n}) - \frac{1}{L} \frac{(T_{n} - t_{n}) t_{n + 1}}{t_{n} T_{n + 1}} (h_{n, n - 1} - 1) {\tilde{\nabla}}_{L} F (y_{n - 1}) - \frac{1}{L} \sum_{k = 0}^{n - 2} \frac{(T_{n} - t_{n}) t_{n + 1}}{t_{n} T_{n + 1}} h_{n, k} {\tilde{\nabla}}_{L} F (y_{k}) = x_{n + 1} - \frac{1}{L} \frac{(t_{n}^{2} - T_{n}) t_{n + 1}}{t_{n} T_{n + 1}} {\tilde{\nabla}}_{L} F (y_{n}) - \frac{1}{L} \frac{(T_{n} - t_{n}) t_{n + 1}}{t_{n} T_{n + 1}} ({\tilde{\nabla}}_{L} F (y_{n}) - {\tilde{\nabla}}_{L} F (y_{n - 1}) + \sum_{k = 0}^{n - 1} h_{n, k} {\tilde{\nabla}}_{L} F (y_{k})) = x_{n + 1} + \frac{(t_{n}^{2} - T_{n}) t_{n + 1}}{t_{n} T_{n + 1}} (x_{n + 1} - y_{n}) + \frac{(T_{n} - t_{n}) t_{n + 1}}{t_{n} T_{n + 1}} (- \frac{1}{L} {\tilde{\nabla}}_{L} F (y_{n}) + \frac{1}{L} {\tilde{\nabla}}_{L} F (y_{n - 1}) + y_{n} - y_{n - 1}) = x_{n + 1} + \frac{(T_{n} - t_{n}) t_{n + 1}}{t_{n} T_{n + 1}} (x_{n + 1} - x_{n}) + \frac{(t_{n}^{2} - T_{n}) t_{n + 1}}{t_{n} T_{n + 1}} (x_{n + 1} - y_{n}) = y_{n + 1} .

Appendix C

Proof of Prop. 8

The proof is similar to [17, Prop. 1 and 5].

We use induction, and for clarity, we use the notation $y_{0}^{'}, \dots, y_{N}^{'}$ for FSFOM with (25) that is equivalent to GFPGM by Prop. 5. It is obvious that $y_{0}^{'} = y_{0}$ , and we have

y_{1}^{'} = y_{0}^{'} - \frac{1}{L} h_{1, 0} {\tilde{\nabla}}_{L} F (y_{0}^{'}) = y_{0} - \frac{1}{L} (1 + \frac{(t_{0} - 1) t_{1}}{T_{1}}) {\tilde{\nabla}}_{L} F (y_{0}) = (1 - \frac{t_{1}}{T_{1}}) (y_{0} - \frac{1}{L} {\tilde{\nabla}}_{L} F (y_{0})) + \frac{t_{1}}{T_{1}} (y_{0} - \frac{1}{L} t_{0} {\tilde{\nabla}}_{L} F (y_{0})) = (1 - \frac{t_{1}}{T_{1}}) x_{1} + \frac{t_{1}}{T_{1}} z_{1} = y_{1} .

Assuming $y_{0}^{'} = y_{i}$ for i = 0, …, n, we then have

y_{n + 1}^{'} = y_{n}^{'} - \frac{1}{L} h_{n + 1, n} {\tilde{\nabla}}_{L} F (y_{n}^{'}) - \frac{1}{L} \sum_{k = 0}^{n - 1} h_{n + 1, k} {\tilde{\nabla}}_{L} F (y_{k}^{'}) = y_{n} - \frac{1}{L} (1 + \frac{(t_{n} - 1) t_{n + 1}}{T_{n + 1}}) {\tilde{\nabla}}_{L} F (y_{n}) - \frac{1}{L} \sum_{k = 0}^{n - 1} \frac{t_{n + 1}}{T_{n + 1}} (t_{k} - \sum_{j = k + 1}^{n} h_{j, k}) {\tilde{\nabla}}_{L} F (y_{k}) = (1 - \frac{t_{n + 1}}{T_{n + 1}}) (y_{n} - \frac{1}{L} {\tilde{\nabla}}_{L} F (y_{n})) + \frac{t_{n + 1}}{T_{n + 1}} (y_{n} - \frac{1}{L} \sum_{k = 0}^{n} t_{k} {\tilde{\nabla}}_{L} F (y_{k}) + \frac{1}{L} \sum_{k = 0}^{n - 1} \sum_{j = k + 1}^{n} h_{j, k} {\tilde{\nabla}}_{L} F (y_{k})) = (1 - \frac{t_{n + 1}}{T_{n + 1}}) (y_{n} - \frac{1}{L} {\tilde{\nabla}}_{L} F (y_{n})) + \frac{t_{n + 1}}{T_{n + 1}} (y_{0} - \frac{1}{L} \sum_{k = 0}^{n} t_{k} {\tilde{\nabla}}_{L} F (y_{k})) = (1 - \frac{t_{n + 1}}{T_{n + 1}}) x_{n + 1} + \frac{t_{n + 1}}{T_{n + 1}} z_{n + 1} .

Appendix D

Discussion on the choice of Ω_N in Sec. 4.1

Our formulation (P′) examines the set Ω_N = {y₀, ⋯, y_N−1, x_N} and eventually leads to the best known analytical bound on the norm of the composite gradient mapping in Thm. 11 among fixed-step first-order methods.

An alternative formulation would be to use the set {y₀, ⋯, y_N−1} (i.e., excluding the point x_N). For this alternative, we could simply replace the inequality (33) with the condition 0 ≤ F(y_N−1) − F(x_*) to derive a slightly different relaxation. (One could use other conditions at the point y_N−1 as in [29] for a tight relaxation, but this is beyond the scope of this paper.) We found that the corresponding (loose) relaxation (P1′) using {y₀, ⋯, y_N−1} leads to a larger upper bound than (40) in Thm. 10 for the set Ω_N.

Another alternative would be to use the set {x₀, ⋯, x_N}, which we leave as future work. Nevertheless, the inequality in Lemma 1 provides a bound for that set {x₀, ⋯, x_N} as seen in Thm. 10 and 11.

We could also consider the final point x_N (or y_N) in (P′) instead of the minimum over a set of points. However, the corresponding (loose) relaxation (P1′) yielded only an O(1/N) bound at best (even for the corresponding optimized step coefficients of (HD′)) on the final composite gradient mapping norm. So we leave finding its tighter relaxation as future work. Note that Table 4 reports tight numerical bounds on the composite gradient mapping norm at the final point x_N of algorithms considered.

Appendix E

Proof of Equation (45) in Thm. 11

\sum_{k = 0}^{N - 1} (T_{k} - t_{k}^{2}) + T_{N - 1} = \sum_{k = m}^{N - 1} (t_{m - 1}^{2} + \sum_{l = m}^{k} t_{l} - t_{k}^{2}) + t_{m - 1}^{2} + \sum_{l = m}^{N - 1} t_{l} = (N - m + 1) t_{m - 1}^{2} + \sum_{k = m}^{N - 1} (\sum_{l = m}^{k} \frac{N - l + 1}{2} - {(\frac{N - k + 1}{2})}^{2}) + \sum_{l = m}^{N - 1} \frac{N - l + 1}{2} = (N - m + 1) t_{m - 1}^{2} + \sum_{k' = 0}^{N - m - 1} (\sum_{l' = 0}^{k'} \frac{N - l' - m + 1}{2} - {(\frac{N - k' - m + 1}{2})}^{2}) + \sum_{l' = 0}^{N - m - 1} \frac{N - l' - m + 1}{2} = (N - m + 1) t_{m - 1}^{2} + \sum_{k = 0}^{N - m - 1} (\frac{(N - m + 1) (k + 1)}{2} - \frac{k (k + 1)}{4} - \frac{{(N - m + 1)}^{2} - 2 (N - m + 1) k + k^{2}}{4}) + \frac{(N - m + 1) (N - m)}{2} - \frac{(N - m - 1) (N - m)}{4} = (N - m + 1) t_{m - 1}^{2} + \sum_{k = 0}^{N - m - 1} (- \frac{k^{2}}{2} + (N - m + 3 / 4) k - \frac{(N - m - 1) (N - m + 1)}{4}) + \frac{(N - m) (N - m + 3)}{4} = (N - m + 1) t_{m - 1}^{2} - \frac{(N - m - 1) (N - m - 1 / 2) (N - m)}{6} + \frac{(N - m - 1) (N - m) (N - m + 3 / 4)}{2} - \frac{(N - m - 1) (N - m) (N - m + 1)}{4} + \frac{(N - m) (N - m + 3)}{4} \geq \frac{(N - m + 1) {(m + 1)}^{2}}{4} + \frac{(N - m - 1) {(N - m)}^{2}}{3} - \frac{{(N - m)}^{2} (N - m + 1)}{4} \geq \frac{(N - m - 1) {(N - m)}^{2}}{3} \geq \frac{1}{24} (N - 2) N^{2},

where $m = ⌊ \frac{N}{2} ⌋ \geq \frac{N - 1}{2}, N - m \geq \frac{N}{2}$ , and $t_{m - 1} \geq \frac{m + 1}{2}$ (13).

Footnotes

Submitted to the editors July 31, 2017.

Tight relaxation here denotes transforming (relaxing) an optimization problem into a solvable problem while their solutions remain the same. [29] tightly relaxes the PEP into a solvable equivalent problem under a large-dimensional condition.

One could develop a first-order algorithm that is optimized with respect to the norm of the subgradient (rather than its upper bound in Sec. 4), which we leave as future work.

[11, Thm. 2] and [16, Thm. 2] imply that the O(1/N) rates of both the cost function bound (3) and the composite gradient mapping norm bound (10) of PGM are tight up to a constant respectively.

⁴

The step coefficients of FSFOM for FPGM are [11, 17]

h_{i + 1, k} = {\begin{matrix} \frac{1}{t_{i + 1}} (t_{k} - \sum_{j = k + 1}^{i} h_{j, k}), & k = 0, \dots, i - 1, \\ 1 + \frac{t_{i} - 1}{t_{i + 1}}, & k = i . \end{matrix}

⁵

The bound for min_{i∈{0, …, N}} ‖∇̃_L/σ²F(y_i)‖ of FPGM-σ is described in a big-O sense in [21, Prop. 5.2(c)], and we further computed the constant in (17) by following the derivation of [21, Prop. 5.2(c)].

⁶

See Appendix A for the derivation of the dual formulation (D) of (P1).

⁷

See Appendix D for the discussion on the choice of Ω_N.

⁸

Here, we simply relaxed (P′) into (P1′) in a way that is similar to the relaxation from (P) to (P1). This relaxation resulted in a constructive analytical worst-case analysis on the composite gradient mapping in this section that is somewhat similar to that on the cost function in Section 3. However, this relaxation on (P1′) turned out to be relatively loose compared to the relaxation on (P1) (see Sec. 5), suggesting there is room for improvement in the future with a tighter relaxation.

⁹

The code in Taylor et al. [29] currently does not provide a tight bound of the norm of the composite gradient mapping (and the subgradient), so we simply added a few lines to compute a tight bound.

¹⁰

Using modifications of the code in [29] to compute tight bounds on the final subgradient norm was easier than for the smallest subgradient norm among all iterates. Even without the smallest subgradient norm bounds, the bounds on the final subgradient norm in Table 5 (compared to Table 4) provide some insights (beyond (8)) on the relationship between the bounds on the subgradient norm and the composite gradient mapping norm as discussed in Sec. 5.4. We leave further modifying the code in [29] for computing tight bounds on the smallest subgradient norm or other criteria as future work.

Contributor Information

Donghwan Kim, Email: kimdongh@umich.edu.

Jeffrey A. Fessler, Email: fessler@umich.edu.

References

1.Beck A, Nedic A, Ozdaglar A, Teboulle M. An O(1/k) gradient method for network resource allocation problems. IEEE Trans. Control of Network Systems. 2014;1:64–73. doi: 10.1109/TCNS.2014.2309751. [DOI] [Google Scholar]
2.Beck A, Teboulle M. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Im. Proc. 2009;18:2419–34. doi: 10.1109/TIP.2009.2028250. [DOI] [PubMed] [Google Scholar]
3.Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009;2:183–202. doi: 10.1137/080716542. [DOI] [Google Scholar]
4.Beck A, Teboulle M. A fast dual proximal gradient algorithm for convex minimization and applications. Operations Research Letters. 2014;42:1–6. doi: 10.1016/j.orl.2013.10.007. [DOI] [Google Scholar]
5.Chambolle A, Dossal C. On the convergence of the iterates of the “Fast iterative shrinkage/Thresholding algorithm”. J. Optim. Theory Appl. 2015;166:968–82. doi: 10.1007/s10957-015-0746-4. [DOI] [Google Scholar]
6.Combettes PL, Pesquet J-C. Proximal splitting methods in signal processing. Fixed-Point Algorithms for Inverse Problems in Science and Engineering, Springer, Optimization and Its Applications; 2011. pp. 185–212. [DOI] [Google Scholar]
7.CVX Research Inc. CVX: Matlab software for disciplined convex programming, version 2.0. Aug., 2012. http://cvxr.com/cvx .
8.Daubechies I, Defrise M, De Mol C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math. 2004;57:1413–57. doi: 10.1002/cpa.20042. [DOI] [Google Scholar]
9.Devolder O, Glineur F, Nesterov Y. Double smoothing technique for large-scale linearly constrained convex optimization. SIAM J. Optim. 2012;22:702–27. doi: 10.1137/110826102. [DOI] [Google Scholar]
10.Drori Y. The exact information-based complexity of smooth convex minimization. Journal of Complexity. 2017;39:1–16. doi: 10.1016/j.jco.2016.11.001. [DOI] [Google Scholar]
11.Drori Y, Teboulle M. Performance of first-order methods for smooth convex minimization: A novel approach. Mathematical Programming. 2014;145:451–82. doi: 10.1007/s10107-013-0653-0. [DOI] [Google Scholar]
12.Drori Y, Teboulle M. An optimal variant of Kelley’s cutting-plane method. Mathematical Programming. 2016;160:321–51. doi: 10.1007/s10107-016-0985-7. [DOI] [Google Scholar]
13.Ghadimi S, Lan G. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming. 2016;156:59–99. doi: 10.1007/s10107-015-0871-8. [DOI] [Google Scholar]
14.Goldstein T, O’Donoghue B, Setzer S, Baraniuk R. Fast alternating direction optimization methods. SIAM J. Imaging Sci. 2014;7:1588–623. doi: 10.1137/120896219. [DOI] [Google Scholar]
15.Grant M, Boyd S. Graph implementations for nonsmooth convex programs. In: Blondel V, Boyd S, Kimura H, editors. Recent Advances in Learning and Control. Lecture Notes in Control and Information Sciences, Springer-Verlag Limited; 2008. pp. 95–110. http://stanford.edu/~boyd/graph_dcp.html . [Google Scholar]
16.Kim D, Fessler JA. Generalizing the optimized gradient method for smooth convex minimization. 2016. http://arxiv.org/abs/1607.06764. arxiv 1607.06764.
17.Kim D, Fessler JA. Optimized first-order methods for smooth convex minimization. Mathematical Programming. 2016;159:81–107. doi: 10.1007/s10107-015-0949-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
18.Kim D, Fessler JA. On the convergence analysis of the optimized gradient method. J. Optim. Theory Appl. 2017;172:187–205. doi: 10.1007/s10957-016-1018-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
19.Lessard L, Recht B, Packard A. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 2016;26:57–95. doi: 10.1137/15M1009597. [DOI] [Google Scholar]
20.Löfberg J. Yalmip : A toolbox for modeling and optimization in matlab. In Proceedings of the CACSD Conference; Taipei, Taiwan. 2004. [Google Scholar]
21.Monteiro RDC, Svaiter BF. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM J. Optim. 2013;23:1092–1125. doi: 10.1137/110833786. [DOI] [Google Scholar]
22.Necoara I, Patrascu A. Iteration complexity analysis of dual first order methods for conic convex programming. Optimization Methods and Software. 2016;31:645–78. doi: 10.1080/10556788.2016.1161763. [DOI] [Google Scholar]
23.Nesterov Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k2) Dokl. Akad. Nauk. USSR. 1983;269:543–7. [Google Scholar]
24.Nesterov Y. Introductory lectures on convex optimization: A basic course. Kluwer; 2004. http://books.google.com/books?id=VyYLem-l3CgC . [Google Scholar]
25.Nesterov Y. Smooth minimization of non-smooth functions. Mathematical Programming. 2005;103:127–52. doi: 10.1007/s10107-004-0552-5. [DOI] [Google Scholar]
26.Nesterov Y. How to make the gradients small. Optima. 2012;88:10–11. [Google Scholar]
27.Nesterov Y. Gradient methods for minimizing composite functions. Mathematical Programming. 2013;140:125–61. doi: 10.1007/s10107-012-0629-5. [DOI] [Google Scholar]
28.Sturm J. Using SeDuMi 1.02, A MATLAB toolbox for optimization over symmetric cones. Optim. Meth. Software. 1999;11:625–53. doi: 10.1080/10556789908805766. [DOI] [Google Scholar]
29.Taylor AB, Hendrickx JM, Glineur F. Exact worst-case performance of first-order methods for composite convex optimization. SIAM J. Optim. 2017;27:1283–1313. doi: 10.1137/16M108104X. [DOI] [Google Scholar]
30.Taylor AB, Hendrickx JM, Glineur F. Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Mathematical Programming. 2017;161:307–45. doi: 10.1007/s10107-016-1009-3. [DOI] [Google Scholar]

[R1] 1.Beck A, Nedic A, Ozdaglar A, Teboulle M. An O(1/k) gradient method for network resource allocation problems. IEEE Trans. Control of Network Systems. 2014;1:64–73. doi: 10.1109/TCNS.2014.2309751. [DOI] [Google Scholar]

[R2] 2.Beck A, Teboulle M. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Im. Proc. 2009;18:2419–34. doi: 10.1109/TIP.2009.2028250. [DOI] [PubMed] [Google Scholar]

[R3] 3.Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009;2:183–202. doi: 10.1137/080716542. [DOI] [Google Scholar]

[R4] 4.Beck A, Teboulle M. A fast dual proximal gradient algorithm for convex minimization and applications. Operations Research Letters. 2014;42:1–6. doi: 10.1016/j.orl.2013.10.007. [DOI] [Google Scholar]

[R5] 5.Chambolle A, Dossal C. On the convergence of the iterates of the “Fast iterative shrinkage/Thresholding algorithm”. J. Optim. Theory Appl. 2015;166:968–82. doi: 10.1007/s10957-015-0746-4. [DOI] [Google Scholar]

[R6] 6.Combettes PL, Pesquet J-C. Proximal splitting methods in signal processing. Fixed-Point Algorithms for Inverse Problems in Science and Engineering, Springer, Optimization and Its Applications; 2011. pp. 185–212. [DOI] [Google Scholar]

[R7] 7.CVX Research Inc. CVX: Matlab software for disciplined convex programming, version 2.0. Aug., 2012. http://cvxr.com/cvx .

[R8] 8.Daubechies I, Defrise M, De Mol C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math. 2004;57:1413–57. doi: 10.1002/cpa.20042. [DOI] [Google Scholar]

[R9] 9.Devolder O, Glineur F, Nesterov Y. Double smoothing technique for large-scale linearly constrained convex optimization. SIAM J. Optim. 2012;22:702–27. doi: 10.1137/110826102. [DOI] [Google Scholar]

[R10] 10.Drori Y. The exact information-based complexity of smooth convex minimization. Journal of Complexity. 2017;39:1–16. doi: 10.1016/j.jco.2016.11.001. [DOI] [Google Scholar]

[R11] 11.Drori Y, Teboulle M. Performance of first-order methods for smooth convex minimization: A novel approach. Mathematical Programming. 2014;145:451–82. doi: 10.1007/s10107-013-0653-0. [DOI] [Google Scholar]

[R12] 12.Drori Y, Teboulle M. An optimal variant of Kelley’s cutting-plane method. Mathematical Programming. 2016;160:321–51. doi: 10.1007/s10107-016-0985-7. [DOI] [Google Scholar]

[R13] 13.Ghadimi S, Lan G. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming. 2016;156:59–99. doi: 10.1007/s10107-015-0871-8. [DOI] [Google Scholar]

[R14] 14.Goldstein T, O’Donoghue B, Setzer S, Baraniuk R. Fast alternating direction optimization methods. SIAM J. Imaging Sci. 2014;7:1588–623. doi: 10.1137/120896219. [DOI] [Google Scholar]

[R15] 15.Grant M, Boyd S. Graph implementations for nonsmooth convex programs. In: Blondel V, Boyd S, Kimura H, editors. Recent Advances in Learning and Control. Lecture Notes in Control and Information Sciences, Springer-Verlag Limited; 2008. pp. 95–110. http://stanford.edu/~boyd/graph_dcp.html . [Google Scholar]

[R16] 16.Kim D, Fessler JA. Generalizing the optimized gradient method for smooth convex minimization. 2016. http://arxiv.org/abs/1607.06764. arxiv 1607.06764.

[R17] 17.Kim D, Fessler JA. Optimized first-order methods for smooth convex minimization. Mathematical Programming. 2016;159:81–107. doi: 10.1007/s10107-015-0949-3. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R18] 18.Kim D, Fessler JA. On the convergence analysis of the optimized gradient method. J. Optim. Theory Appl. 2017;172:187–205. doi: 10.1007/s10957-016-1018-7. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R19] 19.Lessard L, Recht B, Packard A. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 2016;26:57–95. doi: 10.1137/15M1009597. [DOI] [Google Scholar]

[R20] 20.Löfberg J. Yalmip : A toolbox for modeling and optimization in matlab. In Proceedings of the CACSD Conference; Taipei, Taiwan. 2004. [Google Scholar]

[R21] 21.Monteiro RDC, Svaiter BF. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM J. Optim. 2013;23:1092–1125. doi: 10.1137/110833786. [DOI] [Google Scholar]

[R22] 22.Necoara I, Patrascu A. Iteration complexity analysis of dual first order methods for conic convex programming. Optimization Methods and Software. 2016;31:645–78. doi: 10.1080/10556788.2016.1161763. [DOI] [Google Scholar]

[R23] 23.Nesterov Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k2) Dokl. Akad. Nauk. USSR. 1983;269:543–7. [Google Scholar]

[R24] 24.Nesterov Y. Introductory lectures on convex optimization: A basic course. Kluwer; 2004. http://books.google.com/books?id=VyYLem-l3CgC . [Google Scholar]

[R25] 25.Nesterov Y. Smooth minimization of non-smooth functions. Mathematical Programming. 2005;103:127–52. doi: 10.1007/s10107-004-0552-5. [DOI] [Google Scholar]

[R26] 26.Nesterov Y. How to make the gradients small. Optima. 2012;88:10–11. [Google Scholar]

[R27] 27.Nesterov Y. Gradient methods for minimizing composite functions. Mathematical Programming. 2013;140:125–61. doi: 10.1007/s10107-012-0629-5. [DOI] [Google Scholar]

[R28] 28.Sturm J. Using SeDuMi 1.02, A MATLAB toolbox for optimization over symmetric cones. Optim. Meth. Software. 1999;11:625–53. doi: 10.1080/10556789908805766. [DOI] [Google Scholar]

[R29] 29.Taylor AB, Hendrickx JM, Glineur F. Exact worst-case performance of first-order methods for composite convex optimization. SIAM J. Optim. 2017;27:1283–1313. doi: 10.1137/16M108104X. [DOI] [Google Scholar]

[R30] 30.Taylor AB, Hendrickx JM, Glineur F. Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Mathematical Programming. 2017;161:307–45. doi: 10.1007/s10107-016-1009-3. [DOI] [Google Scholar]

PERMALINK

ANOTHER LOOK AT THE FAST ITERATIVE SHRINKAGE/THRESHOLDING ALGORITHM (FISTA)*

Donghwan Kim

Jeffrey A Fessler

Abstract

1. Introduction

2. Problem, methods, and contribution

Lemma 1

Proof

Theorem 2

Proof

Theorem 3

Proof

3. Relaxation and optimization of the cost function form of PEP

3.1. Relaxation for the cost function form of PEP

3.2. Generalized FPGM

Lemma 4

Proof

Proposition 5

Proof

Theorem 6

Proof

Corollary 7

Proof

3.3. Related work of GFPGM

Proposition 8

Proof

3.4. Optimizing step coefficients of FSFOM using the cost function form of PEP

4. Relaxation and optimization of the composite gradient mapping form of PEP

4.1. Relaxation for the composite gradient mapping form of PEP

4.2. Worst-case analysis of the composite gradient mapping of GF-PGM

Lemma 9

Proof

Theorem 10

Proof

4.3. Optimizing step coefficients of FSFOM using the composite gradient mapping form of PEP

Theorem 11

Proof

4.4. Decreasing the composite gradient mapping with a rate O(1/N32) without selecting N in advance

Corollary 12

Proof

5. Discussion

5.1. Summary of analytical worst-case bounds on the cost function and the composite gradient mapping

Table 1.

5.2. Tight worst-case bounds on the cost function and the smallest composite gradient mapping norm

Table 2.

Table 3.

5.3. Tight worst-case bounds on the final compo site gradient mapping

Table 4.

5.4. Tight worst-case bounds on the final subgradient

Table 5.

6. Conclusion

Software

Acknowledgments

Appendix A

Derivation of the dual formulation (D) of (P1)

Appendix B

Proof of Prop. 5

Appendix C

Proof of Prop. 8

Appendix D

Discussion on the choice of ΩN in Sec. 4.1

Appendix E

Proof of Equation (45) in Thm. 11

Footnotes

Contributor Information

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

ANOTHER LOOK AT THE FAST ITERATIVE SHRINKAGE/THRESHOLDING ALGORITHM (FISTA)^*

4.4. Decreasing the composite gradient mapping with a rate $O (1 / N^{\frac{3}{2}})$ without selecting N in advance

Discussion on the choice of Ω_N in Sec. 4.1