Abstract
This paper provides a new way of developing the “Fast Iterative Shrinkage/Thresholding Algorithm (FISTA)” [3] that is widely used for minimizing composite convex functions with a nonsmooth term such as the ℓ1 regularizer. In particular, this paper shows that FISTA corresponds to an optimized approach to accelerating the proximal gradient method with respect to a worst-case bound of the cost function. This paper then proposes a new algorithm that is derived by instead optimizing the step coefficients of the proximal gradient method with respect to a worst-case bound of the composite gradient mapping. The proof is based on the worst-case analysis called Performance Estimation Problem in [11].
1. Introduction
The “Fast Iterative Shrinkage/Thresholding Algorithm” (FISTA) [3], also known as a fast proximal gradient method (FPGM) in general, is a very widely used fast first-order method. FISTA’s speed arises from Nesterov’s accelerating technique in [23, 24] that improves the O(1/N) cost function worst-case bound of a proximal gradient method (PGM) to the optimal O(1/N2) rate where N denotes the number of iterations [3].
This paper first provides a new way to develop Nesterov’s acceleration approach, i.e., FISTA (FPGM). In particular, we show that FPGM corresponds to an optimized approach to accelerating PGM with respect to a worst-case bound of the cost function. We then propose a new fast algorithm that is derived from PGM by instead optimizing a worst-case bound of the composite gradient mapping. We call this new method FPGM-OCG (OCG for optimized over composite gradient mapping). This new method provides the best known analytical worst-case bound for decreasing the composite gradient mapping with rate among fixed-step first-order methods. The proof is based on the worst-case bound analysis called Performance Estimation Problem (PEP) in [11], which we briefly review next.
Drori and Teboulle’s PEP [11] casts a worst-case analysis for a given optimization method and a given class of optimization problems into a meta-optimization problem. The original PEP is intractable to solve exactly, so [11] introduced a series of tractable relaxations, focusing on first-order methods and smooth convex minimization problems; this PEP and its relaxations were studied for various algorithms and minimization problem classes in [12, 16, 17, 18, 19, 29, 30]. Drori and Teboulle [11] further proposed to optimize the step coefficients of a given class of optimization methods using a PEP. This approach was studied for first-order methods on unconstrained smooth convex minimization problems in [11], and the authors of [17] derived a new first-order method, called the optimized gradient method (OGM), that has an analytical worst-case cost function bound that is a factor of two smaller than the previously best known bounds of [23, 24]. Recently, Drori [10] showed that the OGM exactly achieves the optimal cost function worst-case bound among first-order methods for smooth convex minimization (in high-dimensional problems).
Building upon [11] and its successors, Taylor et al. [29] expanded the use of PEP to first-order (proximal gradient) methods for minimizing nonsmooth composite convex functions. They used a tight relaxation1 of PEP and studied the tight (exact) numerical worst-case bounds of FPGM, a proximal gradient version of OGM, and some variants versus the number of iterations N. Their numerical results suggest that there exists an OGM-type acceleration of PGM whose worst-case cost function bound is about a factor of two smaller than that of FPGM, showing room for improvement in accelerating PGM. However, it is difficult to derive an analytical worst-case bound for the tightly relaxed PEP in [29], so optimizing the step coefficients of PGM remains an open problem, unlike in [11, 17] for smooth convex minimization.
Different from the tightly relaxed PEP in [29], this paper suggests a new (looser) relaxation of a cost function form of PEP for nonsmooth composite convex minimization that simplifies the analysis and optimization of the step coefficients of PGM, although it yields looser worst-case bounds. Interestingly, the resulting optimized PGM numerically appears to be FPGM. We then provide a new generalized version of FPGM using our relaxed PEP that extends our understanding of the FPGM variants.
This paper next extends the PEP analysis of the gradient norm in [29, 30]. For unconstrained smooth convex minimization, the authors of [16] used such a PEP to optimize the step coefficients with respect to the gradient norm. The corresponding optimized algorithm can be useful particularly when dealing with dual problems, where decreasing the gradient norm is important in addition to minimizing the cost function (see e.g., [9, 22, 26]). By extending [16], this paper optimizes the step coefficients of PGM for the composite gradient mapping form of PEP for nonsmooth composite convex minimization. The resulting optimized algorithm differs somewhat from Nesterov’s acceleration and turns out to belong to the proposed generalized FPGM class.
Sec. 2 describes a nonsmooth composite convex minimization problem and first-order (proximal gradient) methods. Sec. 3 proposes a new relaxation of PEP for nonsmooth composite convex minimization problems and the proximal gradient methods, and suggests that the FPGM (FISTA) [3] is the optimized method of the cost function form of this relaxed PEP. Sec. 3 further proposes a generalized version of FPGM using the relaxed PEP. Sec. 4 studies the composite gradient mapping form of the relaxed PEP and describes a new optimized method for decreasing the norm of composite gradient mapping. Sec. 5 compares the various algorithms considered, and Sec. 6 concludes.
2. Problem, methods, and contribution
We consider first-order algorithms for solving the nonsmooth composite convex minimization problem:
(M) |
under the following assumptions:
- f : ℝd → ℝ is a convex function that is continuously differentiable with Lipschitz continuous gradient:

  ‖∇f(x) − ∇f(y)‖ ≤ L‖x − y‖ for all x, y ∈ ℝd,   (1)

  where L > 0 is the Lipschitz constant.
- ϕ : ℝd → ℝ is proper, closed, convex and “proximal-friendly” [6].
- The optimal set X*(F) = arg minx∈ℝd F(x) is nonempty, i.e., the problem (M) is solvable.
We use ℱL(ℝd) to denote the class of functions F that satisfy the above conditions. We additionally assume that the distance between the initial point x0 and an optimal solution x* ∈ X*(F) is bounded by some R > 0, i.e., ‖x0 − x*‖ ≤ R.
PGM is a standard first-order method for solving the problem (M) [3, 6], particularly when the following proximal gradient update (that consists of a gradient descent step and a proximal operation [6]) is relatively simple:
pL(x) ≔ arg minz {ϕ(z) + (L/2)‖z − (x − (1/L)∇f(x))‖²},   xi+1 = pL(xi)   (2)
For ϕ(x) = ‖x‖1, the update (2) becomes a simple shrinkage/thresholding update, and PGM reduces to an iterative shrinkage/thresholding algorithm (ISTA) [8]. (See [6, Table 10.2] for more functions ϕ(x) that lead to simple proximal operations.) PGM has the following bound on the cost function [3, Thm. 3.1] for any N ≥ 1:
F(xN) − F(x*) ≤ L‖x0 − x*‖²/(2N) ≤ LR²/(2N)   (3)
Algorithm PGM
Input: f ∈ ℱL(ℝd), x0 ∈ ℝd.
For i = 0, …, N − 1
    xi+1 = pL(xi)
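To make the proximal gradient update (2) concrete, the following minimal NumPy sketch applies PGM (ISTA) to the ℓ1-regularized least-squares problem min_x ½‖Ax − b‖² + λ‖x‖₁; it is an illustration only (the function names soft_threshold, prox_grad_step, and pgm and the least-squares instance are ours, not part of the paper or its software).

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||.||_1 (soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def prox_grad_step(x, grad_f, L, prox):
    """One proximal gradient update p_L(x): gradient step, then proximal step."""
    return prox(x - grad_f(x) / L, 1.0 / L)

def pgm(A, b, lam, x0, n_iter):
    """PGM (ISTA) for min_x 0.5 * ||A x - b||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2                   # Lipschitz constant of grad f
    grad_f = lambda x: A.T @ (A @ x - b)
    prox = lambda z, t: soft_threshold(z, lam * t)  # prox of (lam/L)*||.||_1 when t = 1/L
    x = x0.copy()
    for _ in range(n_iter):
        x = prox_grad_step(x, grad_f, L, prox)
    return x
```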
For simplicity in later derivations, we use the following definition of the composite gradient mapping [27]:
∇̃LF(x) ≔ L(x − pL(x))   (4)
The composite gradient mapping reduces to the usual function gradient ∇f(x) when ϕ(x) = 0. We can then rewrite the PGM update in the following form reminiscent of a gradient method:
xi+1 = pL(xi) = xi − (1/L)∇̃LF(xi)   (5)
where each update guarantees the following monotonic cost function descent [27, Thm. 1]:
F(pL(x)) ≤ F(x) − (1/(2L))‖∇̃LF(x)‖²   (6)
For any x ∈ ℝd, there exists a subgradient ϕ′(pL(x)) ∈ ∂ϕ(pL(x)) that satisfies the following equality [3, Lemma 2.2]:
∇̃LF(x) = ∇f(x) + ϕ′(pL(x))   (7)
This equality implies that any point x̄ with a zero composite gradient mapping (∇̃LF(x̄) = 0, i.e., x̄ = pL(x̄)) satisfies 0 ∈ ∂F(x̄) and is a minimizer of (M). As discussed, minimizing the composite gradient mapping is noteworthy in addition to decreasing the cost function. This property becomes particularly important when dealing with dual problems. In particular, it is known that the norm of the dual (sub)gradient is related to the primal feasibility (see e.g., [9, 22, 26]). Furthermore, the norm of the subgradient is upper bounded by the norm of the composite gradient mapping, i.e., for any given subgradients ϕ′(pL(x)) in (7) and F′(pL(x)) ≔ ∇f(pL(x)) + ϕ′(pL(x)) ∈ ∂F(pL(x)), we have
‖F′(pL(x))‖ ≤ ‖∇f(pL(x)) − ∇f(x)‖ + ‖∇f(x) + ϕ′(pL(x))‖ ≤ L‖pL(x) − x‖ + ‖∇̃LF(x)‖ = 2‖∇̃LF(x)‖   (8)
where the first inequality uses the triangle inequality and the second inequality uses (1) and (7). This inequality provides a close relationship between the primal feasibility and the dual composite gradient mapping. Therefore, we next analyze the worst-case bound of the composite gradient mapping of PGM; Sec. 4 below discusses a first-order algorithm that is optimized with respect to the composite gradient mapping.2
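As a sanity check of (4), (7), and (8), the short sketch below computes the composite gradient mapping ∇̃LF(x) = L(x − pL(x)) for the ℓ1 case and verifies numerically that the particular subgradient F′(pL(x)) from (7) satisfies ‖F′(pL(x))‖ ≤ 2‖∇̃LF(x)‖; the random test problem and variable names are our own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
A, b, lam = rng.standard_normal((20, 50)), rng.standard_normal(20), 0.1
L = np.linalg.norm(A, 2) ** 2                       # Lipschitz constant of grad f

grad_f = lambda x: A.T @ (A @ x - b)
soft = lambda z, tau: np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)
p_L = lambda x: soft(x - grad_f(x) / L, lam / L)    # proximal gradient update (2)

x = rng.standard_normal(50)
gmap = L * (x - p_L(x))                             # composite gradient mapping (4)
phi_sub = gmap - grad_f(x)                          # subgradient of lam*||.||_1 at p_L(x), from (7)
F_sub = grad_f(p_L(x)) + phi_sub                    # a subgradient of F at p_L(x)

assert np.linalg.norm(F_sub) <= 2 * np.linalg.norm(gmap) + 1e-10   # inequality (8)
```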
The following lemma shows that PGM monotonically decreases the norm of the composite gradient mapping.
Lemma 1
The PGM monotonically decreases the norm of the composite gradient mapping, i.e., for all x ∈ ℝd:

‖∇̃LF(pL(x))‖ ≤ ‖∇̃LF(x)‖   (9)
Proof
The proof in [22, Lemma 2.4] can be easily extended to prove (9) using the nonexpansiveness of the proximal mapping (proximity operator) [6].
The following theorem provides an O(1/N) bound on the norm of the composite gradient mapping for PGM, using the idea in [26] and Lemma 1.
Theorem 2
Let F : ℝd → ℝ be in ℱL(ℝd) and let x0, ⋯, xN ∈ ℝd be generated by PGM. Then for N ≥ 2,
(10) |
Proof
Let , and we have
which is equivalent to (10) using and .
Despite its inexpensive per-iteration computational cost, PGM suffers from the slow rate O(1/N) for decreasing both the cost function and the norm of the composite gradient mapping.3 Therefore, for acceleration, this paper considers the following class of fixed-step first-order methods (FSFOM), where the (i + 1)th iteration consists of one proximal gradient evaluation, just like PGM, and a weighted summation of the previous and current proximal gradient updates with step coefficients {hi+1,k}.
Algorithm Class FSFOM | |
Input: f ∈ ℱL(ℝd), x0 ∈ ℝd, y0 = x0. | |
For i = 0, …, N − 1 | |
| |
|
Although the weighted summation in FSFOM seems at first to be inefficient both computationally and memory-wise, the optimized FSFOMs presented in this paper have equivalent recursive forms with memory and computation requirements similar to those of PGM. Note that the class FSFOM includes PGM but excludes the accelerated algorithms in [13, 25, 27] that combine the proximal operations and the gradient steps in other ways.

Among FSFOM4, FISTA [3], also known as FPGM, is widely used since it has computation and memory requirements similar to those of PGM, yet it achieves the optimal O(1/N²) worst-case rate for decreasing the cost function using Nesterov’s acceleration technique [23, 24].
Algorithm FPGM (FISTA)
Input: f ∈ ℱL(ℝd), x0 ∈ ℝd, y0 = x0, t0 = 1.
For i = 0, …, N − 1
    yi+1 = pL(xi)
    ti+1 = (1 + √(1 + 4ti²))/2   (11)
    xi+1 = yi+1 + ((ti − 1)/ti+1)(yi+1 − yi)
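A compact NumPy sketch of the FPGM (FISTA) updates above is given below for reference; it uses gradient and proximal handles like those in the earlier PGM sketch, and the function name fpgm is our own illustration rather than the authors’ software.

```python
import numpy as np

def fpgm(grad_f, prox, L, x0, n_iter):
    """FPGM (FISTA): proximal gradient step plus Nesterov momentum via the t_i sequence."""
    x, y_old, t = x0.copy(), x0.copy(), 1.0
    for _ in range(n_iter):
        y = prox(x - grad_f(x) / L, 1.0 / L)              # y_{i+1} = p_L(x_i)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0  # t_{i+1} as in (11)
        x = y + ((t - 1.0) / t_new) * (y - y_old)         # momentum (extrapolation) step
        y_old, t = y, t_new
    return y
```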
FPGM has the following bound for the cost function [3, Thm. 4.4] for any N ≥ 1:
F(xN) − F(x*) ≤ LR²/(2tN−1²) ≤ 2LR²/(N + 1)²   (12)
where the parameters ti (11) satisfy
ti ≥ (i + 2)/2 for all i ≥ 0   (13)
Sec. 3 provides a new proof of the cost function bound (12) of FPGM using a new relaxation of PEP, and illustrates that this particular acceleration of PGM results from optimizing a relaxed version of the cost function form of PEP. In addition, it is shown in [3, 5] that FPGM and its bound (12) generalize to any ti such that t0 = 1 and for all i ≥ 1 with corresponding bound for any N ≥ 1:
(14) |
which includes the choice for any a ≥ 2. Using our relaxed PEP, Sec. 3 further describes similar but different generalizations of FPGM that complement our understanding of FPGM.
We are often interested in the worst-case analysis of the norm of the (composite) gradient (mapping) in addition to that of the cost function, particularly when dealing with dual problems. To improve the rate O(1/N) of the gradient norm bound of a gradient method, Nesterov [26] suggested performing his fast gradient method (FGM) [23, 24], a non-proximal version of FPGM, for the first m iterations and a gradient method for the remaining N − m iterations for smooth convex problems (when ϕ(x) = 0). Here we extend this idea to the nonsmooth composite convex problem (M) and use FPGM-m to denote the resulting algorithm. The following theorem provides a worst-case bound on the norm of the composite gradient mapping of FPGM-m, using the idea in [26] and Lemma 1.
Algorithm FPGM-m | |
Input: f ∈ ℱL(ℝd), x0 ∈ ℝd, y0 = x0, t0 = 1. | |
For i = 0, …, N − 1 | |
| |
| |
|
Theorem 3
Let F : ℝd → ℝ be in ℱL(ℝd) and let x0, ⋯, xN ∈ ℝd be generated by FPGM-m for 1 ≤ m ≤ N. Then for N ≥ 1,
(15) |
Proof
We have
which is equivalent to (15).
As noticed by a reviewer, when , the worst-case bound (15) of the composite gradient mapping roughly has its smallest constant for the rate , which is better than the choice in [26].
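FPGM-m is straightforward to implement: run the momentum (FPGM) updates for the first m iterations and plain proximal gradient (PGM) steps afterwards. The sketch below illustrates the idea; the exact indexing of the switch in the FPGM-m algorithm box above may differ slightly, and the function name fpgm_m is our own.

```python
import numpy as np

def fpgm_m(grad_f, prox, L, x0, n_iter, m):
    """FPGM-m: FPGM (momentum) for the first m iterations, then plain PGM."""
    x, y_old, t = x0.copy(), x0.copy(), 1.0
    for i in range(n_iter):
        y = prox(x - grad_f(x) / L, 1.0 / L)                  # proximal gradient update p_L(x_i)
        if i < m:                                             # accelerated (FPGM) phase
            t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
            x = y + ((t - 1.0) / t_new) * (y - y_old)
            t = t_new
        else:                                                 # plain PGM phase: no extrapolation
            x = y
        y_old = y
    return y
```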
Monteiro and Svaiter [21] considered a variant of FPGM that replaces pL(·) of FPGM by pL/σ2(·) for 0 < σ < 1; that variant, which we denote FPGM-σ, satisfies the rate for the composite gradient mapping. This FPGM-σ algorithm satisfies the following cost function and composite gradient mapping worst-case bounds5 [21, Prop. 5.2] for N ≥ 1:
(16) |
(17) |
The worst-case bound (17) of the composite gradient mapping has its smallest constant when , which makes the bound (17) about 3-times larger than the bound (15) of  at best. However, since FPGM-σ, unlike FPGM-m, does not require one to select the total number of iterations N in advance, the FPGM-σ algorithm could be useful in practice, as discussed further in Sec. 4.4. Ghadimi and Lan [13] also showed the rate  for a composite gradient mapping worst-case bound of another variant of FPGM, but the corresponding algorithm in [13] requires two proximal gradient updates per iteration, combining the proximal operations and the gradient steps in a way that differs from the class FSFOM, and could be less attractive in terms of per-iteration computational complexity.
FPGM has been used in dual problems [1, 2, 4, 14]; using FPGM-m and the algorithms in [13, 21] that guarantee rate for minimizing the norm of the composite gradient mapping could be potentially useful for solving dual problems. (Using F(P)GM-m for (dual) smooth convex problems was discussed in [9, 22, 26].) However, FPGM-m and the algorithms in [13, 21] are not necessarily the best possible methods with respect to the worst-case bound of the norm of the composite gradient mapping. Therefore, Sec. 4 seeks to optimize the step coefficients of FSFOM for minimizing the norm of the composite gradient mapping using a relaxed PEP.
The next section first provides a new proof of FPGM using our new relaxation on PEP, and proposes the new generalized FPGM.
3. Relaxation and optimization of the cost function form of PEP
3.1. Relaxation for the cost function form of PEP
For FSFOM with given step-size coefficients h ≔ {hi+1,k}, in principle the worst-case bound on the cost function after N iterations corresponds to the solution of the following PEP problem [11]:
(P) |
Since (non-relaxed) PEP problems like (P) are difficult to solve due to the (infinite-dimensional) functional constraint on F, Drori and Teboulle [11] suggested (for smooth convex problems) replacing the functional constraint by a property of F related to the update such as pL(·) in (P). Taylor et al. [29, 30] discussed properties of F that can replace the functional constraint of PEP without strictly relaxing (P), and provided tight numerical worst-case analysis for any given N. However, analytical solutions remain unknown for (P) and most PEP problems.
Instead, this paper proposes an alternate relaxation that is looser than that in [29, 30] but provides tractable and useful analytical results. We consider the following property of F involving the proximal gradient update pL(·) [3, Lemma 2.3]:
(18) |
to replace the functional constraint on F. In particular, we use the following property:
(19) |
that results from replacing x in (18) by pL(x). When ϕ(x) = 0, the property (19) reduces to
(20) |
Note that the relaxation of PEP in [11, 16, 17, 18, 30] for unconstrained smooth convex minimization (ϕ(x) = 0) uses a well-known property of f in [24, Thm. 2.1.5] that differs from (20) and does not strictly relax the PEP, as discussed in [30]; in contrast, our relaxation using (19) and (20) is not guaranteed to be a tight relaxation of (P). Finding a tight relaxation that leads to useful (or even optimal) algorithms remains an open problem for nonsmooth composite convex problems.
Similar to [11, Problem (Q′)], we (strictly) relax problem (P) as follows using a set of constraint inequalities (19) at the points (x, y) = (yi−1, yi) for i = 1, …, N − 1 and (x, y) = (x*, yi) for i = 0, …, N − 1:
(P1) |
for any given unit vector ν ∈ ℝd, by defining the (i + 1)th standard basis vector ui = ei+1 ∈ ℝN, the matrix G = [g0, ⋯, gN − 1]⊤ ∈ ℝN × d and the vector δ = [δ0, ⋯, δN − 1]⊤ ∈ ℝN, where
(21) |
for i = 0, …, N − 1, *. Note that g* = [0, ⋯, 0]⊤, δ* = 0 and by definition. The matrices Ăi−1,i(h) and Ďi(h) are defined as
(22) |
which results from the inequalities (19) at the points (x, y) = (yi−1, yi) and (x, y) = (x*, yi) respectively.
As in [11, Problem (DQ′)], problem (P1) has a dual formulation that one can solve numerically for any given N using a semidefinite program (SDP) to determine an upper bound on the cost function worst-case bound for any FSFOM:6
(D) |
where , and
(23) |
(24) |
This means that one can compute a valid upper bound (D) of (P) for given step coefficients h using an SDP. The next two sections provide an analytical solution to (D) for FPGM, and similarly for our new generalized FPGM, superseding the use of numerical SDP solvers.
3.2. Generalized FPGM
We specify a feasible point of (D) that leads to our new generalized form of FPGM.
Lemma 4
For the following step coefficients:
(25) |
the choice of variables:
(26) |
is a feasible point of (D) for any choice of ti such that
(27) |
Proof
It is obvious that (λ, τ) in (26) with (27) is in Λ (23). Using (22), the (i, k)th entry of the symmetric matrix S(h, λ, τ) in (24) can be written as
where each element Si,k(h, λ, τ) corresponds to the coefficient of the term of S(h, λ, τ) in (24). Then, inserting (25) and (26) to the above yields
Then, using (26) and (27), we finally show the feasibility condition of (D):
where t = (t0, ⋯, tN−1, 1)⊤ and T = (T0, ⋯, TN−1, 1)⊤.
FSFOM with the step coefficients (25) would be both computationally and memory-wise inefficient, so we next present an equivalent recursive form of FSFOM with (25), named Generalized FPGM (GFPGM).
Algorithm GFPGM | |
Input: f ∈ ℱL(ℝd), x0 ∈ ℝd, y0 = x0, t0 = T0 = 1. | |
For i = 0, …, N − 1 | |
| |
| |
|
Proposition 5
The sequence {x0, ⋯, xN} generated by FSFOM with step sizes (25) is identical to the corresponding sequence generated by GFPGM.
Proof
See Appendix B.
Using Lemma 4, the following theorem bounds the cost function for the GFPGM iterates.
Theorem 6
Let F : ℝd → ℝ be in ℱL(ℝd) and let x0, ⋯, xN ∈ ℝd be generated by GFPGM. Then for N ≥ 1,
(28) |
Proof
Using (D), Lemma 4 and Prop. 5, we have
(29) |
The GFPGM and Thm. 6 reduce to FPGM and (12) when  for all i, and Sec. 3.4 shows that FPGM results from optimizing the step coefficients of FSFOM with respect to the cost function form of the relaxed PEP (D). The GFPGM also includes the choice  for any a ≥ 2, as used in [5]; we denote this choice by FPGM-a, which differs from the algorithm in [5]. The following corollary provides a cost function worst-case bound for FPGM-a.
Corollary 7
Let F : ℝd → ℝ be in ℱL(ℝd) and let x0, ⋯, xN ∈ ℝd be generated by GFPGM with (FPGM-a) for any a ≥ 2. Then for N ≥ 1,
(30) |
Proof
Thm. 6 implies (30), since satisfies (27), i.e.,
(31) |
for any a ≥ 2 and all i ≥ 0.
3.3. Related work of GFPGM
This section shows that the GFPGM has a close connection to the accelerated algorithm in [25] that was developed specifically for a constrained smooth convex problem with a closed convex set Q, i.e.,
(32) |
The projection operator PQ(x) ≔ arg miny∈Q ‖x − y‖ is used for the proximal gradient update (2).
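For the constrained smooth problem (32), ϕ(x) = IQ(x) and the proximal operation in (2) reduces to the projection PQ. As a small illustration (with Q chosen as a Euclidean ball, our own example), the earlier PGM/FPGM sketches apply with the prox handle replaced by a projection:

```python
import numpy as np

def project_ball(x, radius=1.0):
    """Euclidean projection P_Q onto the ball Q = {x : ||x|| <= radius} (radius chosen arbitrarily)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x

# With phi = indicator of Q, the proximal gradient update (2) is a projected gradient
# step, so the earlier pgm/fpgm sketches can be reused with this prox handle:
prox = lambda z, t: project_ball(z)   # the prox of an indicator ignores the step size t
```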
We show that the GFPGM can be written in the following equivalent form, named GFPGM′, which is similar to that of the accelerated algorithm in [25] shown below. Note that the accelerated algorithm in [25] satisfies the bound (28) of the GFPGM in [25, Thm. 2] when ϕ(x) = IQ(x).
Algorithm GFPGM′ | |
Input: f ∈ ℱL(ℝd), x0 ∈ ℝd, y0 = x0, t0 = T0 = 1. | |
For i = 0, …, N − 1 | |
| |
| |
| |
|
Algorithm [25] for ϕ(x) = IQ(x) | |
Input: f ∈ ℱL(ℝd), x0 ∈ ℝd, y0 = x0, t0 = T0 = 1. | |
For i = 0, …, N − 1 | |
| |
| |
| |
|
Proposition 8
The sequence {x0, ⋯, xN} generated by GFPGM is identical to the corresponding sequence generated by GFPGM′.
Proof
See Appendix C.
Clearly GFPGM′ and the accelerated algorithm in [25] are equivalent for the unconstrained smooth convex problem (Q = ℝd). However, when the operation PQ(x) is relatively expensive, our GFPGM and GFPGM′ that use one projection per iteration could be preferred over the accelerated algorithm in [25] that uses two projections per iteration.
3.4. Optimizing step coefficients of FSFOM using the cost function form of PEP
To find the step coefficients in the class FSFOM that are optimal in terms of the cost function form of PEP, we would like to solve the following problem:
(HP) |
Because (HP) seems intractable, we instead optimize the step coefficients using the relaxed bound in (D):
(HD) |
The problem (HD) is bilinear, and a convex relaxation technique in [11, Thm. 3] makes it solvable using numerical methods. We optimized (HD) numerically for many choices of N using an SDP solver [7, 15], and based on our numerical results (not shown) we conjecture that the feasible point in Lemma 4 with  that corresponds to FPGM (FISTA) is a global minimizer of (HD). It is straightforward to show that the step coefficients in Lemma 4 with  give the smallest bound of (D) and (28) among all feasible points in Lemma 4, but showing optimality among all possible feasible points of (HD) may require further derivations as in [17, Lemma 3] using KKT conditions, which we leave as future work.
This section has provided a new worst-case bound proof of FPGM using the relaxed PEP, and suggested that FPGM corresponds to FSFOM with optimized step coefficients using the cost function form of the relaxed PEP. The next section provides a different optimization of the step coefficients of FSFOM that targets the norm of the composite gradient mapping, because minimizing the norm of the composite gradient mapping is important in dual problems (see [9, 22, 26] and (8)).
4. Relaxation and optimization of the composite gradient mapping form of PEP
4.1. Relaxation for the composite gradient mapping form of PEP
To form a worst-case bound on the norm of the composite gradient mapping for a given h of FSFOM, we use the following PEP that replaces F(xN)−F(x*) in (P) by the norm squared of the composite gradient mapping. Here, we consider the smallest composite gradient mapping norm squared among all iterates7 (minx∈ΩN ‖L(pL(x) − x)‖2 = minx∈ΩN ‖∇̃LF(x)‖2 where ΩN ≔ {y0, ⋯, yN−1, xN}) as follows:
(P′) |
Because this infinite-dimensional max-min problem appears intractable, similar to the relaxation from (P) to (P1), we relax (P′) to a finite-dimensional problem with an additional constraint resulting from (6) that is equivalent to
(33) |
and conditions that are equivalent to α ≤ ‖L(pL(x) − x)‖2 for all x ∈ ΩN after replacing minx∈ΩN ‖L(pL(x) − x)‖2 by α as in [30].8 This relaxation leads to
(P1′) |
for any given unit vector ν ∈ ℝd, by defining the (i + 1)th standard basis vector ūi = ei+1 ∈ ℝN+1, the matrices
(34) |
where 0 = [0, …, 0]⊤ ∈ ℝN, and the matrix Ḡ = [G⊤, ḡN]⊤ ∈ ℝ(N+1)×d where
(35) |
Similar to (D) and [16, Problem (D″)], we have the following dual formulation of (P1′) that could be solved using SDP:
(D′) |
where η ∈ ℝ+, , and
(36) |
(37) |
The next section specifies a feasible point of interest that is in the class of GFPGM and analyzes the worst-case bound of the norm of the composite gradient mapping. Then we optimize the step coefficients of FSFOM with respect to the composite gradient mapping form of PEP leading to a new algorithm that differs from Nesterov’s acceleration for decreasing the cost function.
4.2. Worst-case analysis of the composite gradient mapping of GFPGM
The following lemma provides a feasible point of (D′) for the step coefficients (25) of GFPGM.
Lemma 9
For the step coefficients {hi+1,k} in (25), the choice of variables
(38) |
(39) |
is a feasible point of (D′) for any choice of ti and Ti satisfying (27).
Proof
It is obvious that (λ, τ, η, β) in (38) and (39) with (27) is in Λ′ (36). Using (22) and (34), the (i, k)th entry of the symmetric matrix S′(h, λ, τ, η, β) in (37) can be written as
and inserting (25), (38), and (39) yields
Finally, by defining t̄ = (t0, ⋯, tN−1, 0, 1)⊤ we have the feasibility condition of (D′):
Using Lemma 9, the following theorem bounds the (smallest) norm of the composite gradient mapping for the GFPGM iterates.
Theorem 10
Let F : ℝd → ℝ be in ℱL(ℝd) and let x0, ⋯, xN, y0, ⋯, yN−1 ∈ ℝd be generated by GFPGM. Then for N ≥ 1,
(40) |
Proof
Lemma 1 implies the first inequality of (40). Using (D′), Lemma 9 and Prop. 5, we have
which is equivalent to (40).
Although the bound (40) is not tight due to the relaxation on PEP, the next two sections show that there exist choices of ti that provide a rate  for decreasing the composite gradient mapping, including the choice that optimizes the composite gradient mapping form of PEP.
FGM for smooth convex minimization was shown in [16] to achieve the rate  for decreasing the usual gradient. In contrast, Thm. 10 provides only an O(1/N) bound for FPGM (or GFPGM with ti (11)) on the decrease of the composite gradient mapping, since  for all i and the value of TN−1 is O(N²) for ti (11). Sec. 5 below numerically studies a tight bound on the composite gradient mapping of FPGM and illustrates that it has a rate  that is faster than the rate O(1/N) of Thm. 10, indicating there is room for improvement in the composite gradient mapping form of the relaxed PEP.
4.3. Optimizing step coefficients of FSFOM using the composite gradient mapping form of PEP
To optimize the step coefficients in the class FSFOM in terms of the composite gradient mapping form of the relaxed PEP (D′), we would like to solve the following problem:
(HD′) |
Similar to (HD), we use a convex relaxation [11, Thm. 3] to make the bilinear problem (HD′) solvable using numerical methods. We then numerically optimized (HD′) for many choices of N using an SDP solver [7, 15] and found that the following choice of ti:
(41) |
makes the feasible point in Lemma 9 optimal empirically with respect to the relaxed bound (HD′). Interestingly, whereas the usual ti factors (such as (11) and for any a ≥ 2) increase with i indefinitely, here, the factors begin decreasing after .
We also noticed numerically that finding the ti that minimizes the bound (40), i.e., solving the following constrained quadratic problem:
(42) |
is equivalent to optimizing (HD′). This means that the solution of (42) numerically appears equivalent to (41), the (conjectured) solution of (HD′). Interestingly, the unconstrained maximizer of (42) without the constraint (27) is , and this partially appears in the constrained maximizer (41) of the problem (42).
Based on this numerical evidence, we conjecture that the solution ĥD′ of problem (HD′) corresponds to (25) with (41). By Prop. 5, FSFOM with the step coefficients (25) and (41), i.e., the step coefficients optimized with respect to the norm of the composite gradient mapping, is equivalent to the following GFPGM form with (41), which we name FPGM-OCG (OCG for optimized over composite gradient mapping).
Algorithm FPGM-OCG (GFPGM with ti in (41)) | |
Input: f ∈ ℱL(ℝd), x0 ∈ ℝd, y0 = x0, t0 = T0 = 1. |
For i = 0, …, N − 1 | |
| |
| |
|
The following theorem bounds the cost function and the (smallest) norm of the composite gradient mapping for the FPGM-OCG iterates.
Theorem 11
Let F : ℝd → ℝ be in ℱL(ℝd) and let x0, ⋯, xN, y0, ⋯, yN−1 ∈ ℝd be generated by FPGM-OCG. Then for N ≥ 1,
(43) |
and for N ≥ 3,
(44) |
Proof
FPGM-OCG is an instance of the GFPGM, and thus Thm. 6 implies (43) using
where , and (13).
In addition, Thm. 10 implies (44), using
(45) |
which we prove in the Appendix E.
The composite gradient mapping bound (44) of FPGM-OCG is asymptotically -times smaller than the bound (15) of . In addition, the cost function bound (43) of FPGM-OCG satisfies the optimal rate O(1/N²), although the bound (43) is twice as large as the analogous bound (12) of FPGM.
4.4. Decreasing the composite gradient mapping with a rate without selecting N in advance
FPGM-OCG and FPGM-m satisfy a fast rate for decreasing the norm of the composite gradient mapping but require one to select the total number of iterations N in advance, which could be undesirable in practice. One could use FPGM-σ in [21] that does not require selecting N in advance, but instead we suggest a new choice of ti in GFPGM that satisfies a composite gradient mapping bound that is lower than the bound (17) of FPGM-σ.
Based on Thm. 10, the following corollary shows that GFPGM with  (FPGM-a) for any a > 2 satisfies the rate  for the norm of the composite gradient mapping without selecting N in advance. (Cor. 7 showed that FPGM-a for any a ≥ 2 satisfies the optimal rate O(1/N²) for the cost function.)
Corollary 12
Let F : ℝd → ℝ be in ℱL(ℝd) and let x0, ⋯, xN, y0, ⋯, yN−1 ∈ ℝd be generated by GFPGM with  (FPGM-a) for any a ≥ 2. Then for N ≥ 1, we have the following bound on the (smallest) composite gradient mapping:
(46) |
Proof
With and (31), Thm. 10 implies (46) using
FPGM-a for any a > 2 has a composite gradient mapping bound (46) that is asymptotically -times larger than the bound (44) of FPGM-OCG. This gap reduces to at best when a = 4, which is clearly better than that of FPGM-σ. Therefore, this FPGM-a algorithm will be useful for minimizing the composite gradient mapping with a rate without selecting N in advance.
5. Discussion
5.1. Summary of analytical worst-case bounds on the cost function and the composite gradient mapping
Table 1 summarizes the asymptotic worst-case bounds of all algorithms discussed in this paper. (Note that the bounds are not guaranteed to be tight.) In Table 1, FPGM and FPGM-OCG provide the best known analytical worst-case bounds for decreasing the cost function and the composite gradient mapping respectively. When one does not want to select N in advance for decreasing the composite gradient mapping, FPGM-a will be a useful alternative to FPGM-OCG.
Table 1.
| Algorithm | Cost function (×LR²) | Proximal gradient (×LR) | Requires selecting N in advance |
|---|---|---|---|
| PGM | | 2N^−1 | No |
| FPGM | 2N^−2 | 2N^−1 | No |
| FPGM-σ (0 < σ < 1) | | | No |
| FPGM-(σ = 0.78) | 3.3N^−2 | | |
| FPGM-m | 4.5N^−2 | | Yes |
| FPGM-OCG | 4N^−2 | | Yes |
| FPGM-a (a > 2) | aN^−2 | | No |
| FPGM-(a = 4) | 4N^−2 | | |
5.2. Tight worst-case bounds on the cost function and the smallest composite gradient mapping norm
Since none of the bounds presented in Table 1 are guaranteed to be tight, we modified the code9 (using SDP solvers [20, 28]) in Taylor et al. [29] to compare tight (numerical) bounds for the cost function and the composite gradient mapping in Tables 2 and 3 respectively for N = 1, 2, 4, 10, 20, 30, 40, 47, 50. This numerical bound is guaranteed to be tight when the large-scale condition is satisfied [29]. Taylor et al. [29, Fig. 1] already studied a tight worst-case bound on the cost function decrease of FPGM numerically, and found that the analytical bound (12) is asymptotically tight. Table 2 additionally provides numerical tight bounds on the cost function of all algorithms presented in this paper, also suggesting that our relaxation of the cost function form of the PEP from (P) to (D) is asymptotically tight (for some algorithms). In addition, the trend of the tight bounds of the composite gradient mapping in Table 3 follows that of the bounds in Table 1. However, there is a gap between them that is not asymptotically tight, unlike the gap between the bounds of the cost function in Tables 1 and 2. In particular, the numerical tight bound for the composite gradient mapping of FPGM in Table 3 has a rate faster than the known rate O(1/N) in Thm. 10. We leave reducing this gap for the bounds on the norm of the composite gradient mapping as future work, possibly with a tighter relaxation of PEP. In addition, FPGM-m has a numerical tight bound in Table 3 that is even slightly better than that of FPGM-OCG, unlike our expectation from the analytical bounds in Table 1 and Sec. 4.3. This shows room for improvement in optimizing the step coefficients of FSFOM with respect to the composite gradient mapping, again possibly with a tighter relaxation of PEP.
Table 2.
| N | PGM | FPGM | FPGM-(σ = 0.78) | FPGM-m | FPGM-OCG | FPGM-(a = 4) |
|---|---|---|---|---|---|---|
| 1 | 4.00 | 4.00 | 2.43 | 4.00 | 4.00 | 4.00 |
| 2 | 8.00 | 8.00 | 4.87 | 8.00 | 8.00 | 8.00 |
| 4 | 16.00 | 19.35 | 11.77 | 17.13 | 17.60 | 17.23 |
| 10 | 40.00 | 79.07 | 48.11 | 56.47 | 59.25 | 55.88 |
| 20 | 80.00 | 261.66 | 159.19 | 163.75 | 170.10 | 159.17 |
| 30 | 120.00 | 546.51 | 332.49 | 321.56 | 331.97 | 312.03 |
| 40 | 160.00 | 932.89 | 567.57 | 502.37 | 544.55 | 514.73 |
| 47 | 188.00 | 1263.58 | 768.76 | 675.68 | 723.06 | 686.33 |
| 50 | 200.00 | 1420.45 | 864.20 | 752.90 | 807.66 | 767.37 |
| Empirical O(·) | N^−1.00 | N^−1.89 | N^−1.89 | N^−1.75 | N^−1.79 | N^−1.80 |
| Known O(·) | N^−1 | N^−2 | N^−2 | N^−2 | N^−2 | N^−2 |
Table 3.
| N | PGM | FPGM | FPGM-(σ = 0.78) | FPGM-m | FPGM-OCG | FPGM-(a = 4) |
|---|---|---|---|---|---|---|
| 1 | 1.84 | 1.84 | 1.18 | 1.84 | 1.84 | 1.84 |
| 2 | 2.83 | 2.83 | 1.78 | 2.83 | 2.83 | 2.83 |
| 4 | 4.81 | 5.65 | 3.50 | 5.09 | 5.21 | 5.12 |
| 10 | 10.80 | 13.24 | 8.74 | 14.91 | 15.60 | 14.76 |
| 20 | 20.78 | 27.19 | 18.83 | 39.70 | 39.61 | 29.21 |
| 30 | 30.78 | 43.49 | 30.82 | 64.45 | 64.40 | 47.14 |
| 40 | 40.78 | 61.76 | 44.39 | 92.82 | 91.99 | 67.82 |
| 47 | 47.77 | 75.60 | 54.73 | 113.92 | 113.41 | 83.67 |
| 50 | 50.77 | 81.78 | 59.35 | 123.54 | 123.17 | 90.78 |
| Empirical O(·) | N^−0.98 | N^−1.27 | N^−1.31 | N^−1.31 | N^−1.33 | N^−1.32 |
| Known O(·) | N^−1 | N^−1 | | | | |
5.3. Tight worst-case bounds on the final composite gradient mapping
This paper focused on analyzing the worst-case bound of the smallest composite gradient mapping among all iterates (minx∈ΩN ‖∇̃LF(x)‖) in addition to the cost function, whereas the composite gradient mapping at the final iterate (‖∇̃LF(xN)‖) could also be considered (see Appendix D). For example, the composite gradient mapping bounds (10) and (15) for PGM and FPGM-m also apply to the final composite gradient mapping, and using (6) we can easily derive a (loose) worst-case bound on the final composite gradient mapping for other algorithms; e.g., such a final composite gradient mapping bound for GFPGM is as follows:
(47) |
Since the optimal rate for decreasing the cost function is O(1/N2), the composite gradient mapping worst-case bound (47) can provide only a rate O(1/N) at best. For completeness of the discussion, Table 4 reports tight numerical bounds for the final composite gradient mapping. Here, FPGM, FPGM-(σ = 0.78), and FPGM-(a = 4) have empirical rates of the worst-case bounds in Table 4 that are slower than those in Table 3, unlike the other three including FPGM-OCG.
Table 4.
| N | PGM | FPGM | FPGM-(σ = 0.78) | FPGM-m | FPGM-OCG | FPGM-(a = 4) |
|---|---|---|---|---|---|---|
| 1 | 1.84 | 1.84 | 1.18 | 1.84 | 1.84 | 1.84 |
| 2 | 2.83 | 2.83 | 1.78 | 2.83 | 2.83 | 2.83 |
| 4 | 4.81 | 5.65 | 3.50 | 5.09 | 5.21 | 5.12 |
| 10 | 10.80 | 12.68 | 8.41 | 14.91 | 15.60 | 14.76 |
| 20 | 20.78 | 22.02 | 14.26 | 39.65 | 39.10 | 25.96 |
| 30 | 30.78 | 31.26 | 20.12 | 64.40 | 63.40 | 34.21 |
| 40 | 40.78 | 40.46 | 25.97 | 92.78 | 90.16 | 42.39 |
| 47 | 47.77 | 46.89 | 30.06 | 113.92 | 110.12 | 48.13 |
| 50 | 50.77 | 49.65 | 31.81 | 123.53 | 118.99 | 50.59 |
| Empirical O(·) | N^−0.98 | N^−0.92 | N^−0.92 | N^−1.31 | N^−1.25 | N^−0.81 |
| Known O(·) | N^−1 | N^−1 | N^−1 | | N^−1 | N^−1 |
To the best of our knowledge, FPGM-m (or algorithms that similarly perform accelerated iterations in the beginning and run a proximal gradient method for the remaining iterations) was the only algorithm known to have a rate  in (15) for decreasing the final composite gradient mapping, while FPGM-OCG was also found to inherit such a fast rate in Table 4. Therefore, searching for first-order methods that have a worst-case bound on the final composite gradient mapping that is lower than that of FPGM-m (and FPGM-OCG), and that possibly do not require knowing N in advance, is an interesting open problem. Note that a regularization technique in [26] that provides a faster rate O(1/N²) (up to a logarithmic factor) for decreasing the final gradient norm for smooth convex minimization can easily be extended to rapidly minimize the final composite gradient mapping with such a rate for the composite problem (M); however, that approach requires knowing R in advance.
5.4. Tight worst-case bounds on the final subgradient
This paper has mainly focused on the norm of the composite gradient mapping based on (8), instead of the subgradient norm that is of primary interest in the dual problem (see e.g., [9, 22, 26]). Therefore to have a better sense of subgradient norm bounds, we computed tight numerical bounds on the final10 subgradient norm ‖F′(xN)‖ in Table 5 and compared them with Table 4.
Table 5.
| N | PGM | FPGM | FPGM-(σ = 0.78) | FPGM-m | FPGM-OCG | FPGM-(a = 4) |
|---|---|---|---|---|---|---|
| 1 | 1.00 | 1.00 | 0.61 | 1.00 | 1.00 | 1.00 |
| 2 | 2.00 | 2.00 | 1.22 | 2.00 | 2.00 | 2.00 |
| 4 | 4.00 | 4.83 | 2.94 | 4.28 | 4.40 | 4.31 |
| 10 | 10.00 | 7.60 | 4.67 | 14.12 | 14.81 | 12.10 |
| 20 | 20.00 | 12.58 | 7.67 | 38.29 | 36.65 | 16.85 |
| 30 | 30.00 | 17.63 | 10.74 | 62.71 | 60.40 | 21.61 |
| 40 | 40.00 | 22.67 | 13.80 | 91.00 | 86.62 | 26.47 |
| 47 | 47.00 | 26.20 | 15.94 | 112.01 | 106.21 | 29.91 |
| 50 | 50.00 | 27.71 | 16.86 | 121.53 | 114.93 | 31.39 |
| Empirical O(·) | N^−1.00 | N^−0.91 | N^−0.91 | N^−1.32 | N^−1.27 | N^−0.78 |
| Known O(·) | N^−1 | N^−1 | N^−1 | | N^−1 | N^−1 |
For all six algorithms, the empirical rates in Table 5 are similar to those for the final composite gradient mapping in Table 4. In particular, the subgradient norm bounds for the three algorithms PGM, FPGM-m, and FPGM-OCG are almost identical to those in Table 4 except for the first few iterations, eliminating the concern of using (8) for such cases. On the other hand, the other three algorithms FPGM, FPGM-(σ = 0.78), and FPGM-(a = 4) almost tightly satisfy the inequality (8) for most N, and thus have bounds on the final subgradient that are about twice as large as those on the final composite gradient mapping. Therefore, regardless of (8), Table 5 further supports the use of FPGM-m and FPGM-OCG over FPGM and other algorithms in dual problems.
6. Conclusion
This paper analyzed and developed fixed-step first-order methods (FSFOM) for nonsmooth composite convex cost functions. We showed an alternate proof of FPGM (FISTA) using PEP, and suggested that FPGM (FISTA) results from optimizing the step coefficients of FSFOM with respect to the cost function form of the (relaxed) PEP. We then described a new generalized version of FPGM and analyzed its worst-case bound using the (relaxed) PEP over both the cost function and the norm of the composite gradient mapping. Furthermore, we optimized the step coefficients of FSFOM with respect to the composite gradient mapping form of the (relaxed) PEP, yielding FPGM-OCG, which could be useful particularly when tackling dual problems.
Our relaxed PEP provided tractable analysis of the optimized step coefficients of FSFOM with respect to the cost function and the norm of the composite gradient mapping, but the relaxation is not guaranteed to be tight, and the corresponding accelerations of PGM (FPGM and FPGM-OCG) are thus unlikely to be optimal. Therefore, finding optimal step coefficients of FSFOM for the cost function and the norm of the composite gradient mapping remains future work. Nevertheless, the proposed FPGM-OCG, which optimizes the composite gradient mapping form of the relaxed PEP, and FPGM-a (for any a > 2) may be useful in dual problems.
Software
Matlab codes for the SDP approaches in Sec. 3.4, Sec. 4.3 and Sec. 5 are available at https://gitlab.eecs.umich.edu/michigan-fast-optimization.
Acknowledgments
The authors would like to thank the anonymous referees for very useful comments that have improved the quality of this paper.
Funding: This research was supported in part by NIH grant U01 EB018753.
Appendix A
Derivation of the dual formulation (D) of (P1)
The derivation below is similar to [11, Lemma 2].
We replace maxG,δ LR2δN−1 of (P1) by minG,δ{−δN−1} for convenience in this section. The corresponding dual function of such (P1) is then defined as
for dual variables and , where ℒ(G, δ, λ, τ; h) is a Lagrangian function, and
Here, minδ ℒ1(δ, λ, τ) = 0 for any (λ, τ) ∈ Λ where Λ is defined in (23), and minδ ℒ1(δ, λ, τ) = −∞ otherwise.
For any given unit vector ν, [11, Lemma 1] implies
and thus for any (λ, τ) ∈ Λ, we can rewrite the dual function as
where S(h, λ, τ) is defined in (24). Therefore the dual problem of (P1) becomes (D), recalling that we previously replaced maxG,δ LR2δN−1 of (P1) by minG,δ{−δN−1}.
Appendix B
Proof of Prop. 5
The proof is similar to [17, Prop. 2, 3 and 4].
We first show that {hi+1,k} in (25) is equivalent to
(48) |
We use the notation for the coefficients (25) to distinguish from (48). It is obvious that , i = 0, …, N − 1, and we clearly have
We next use induction by assuming for i = 0, …, n − 1, k = 0, …, i. We then have
Next, using (48), we show that FSFOM with (25) is equivalent to the GFPGM. We use induction, and for clarity, we use the notation for FSFOM with (48). It is obvious that , and we have
since T0 = t0. Assuming for i = 0, …, n, we then have
Appendix C
Proof of Prop. 8
The proof is similar to [17, Prop. 1 and 5].
We use induction, and for clarity, we use the notation for FSFOM with (25) that is equivalent to GFPGM by Prop. 5. It is obvious that , and we have
Assuming for i = 0, …, n, we then have
Appendix D
Discussion on the choice of ΩN in Sec. 4.1
Our formulation (P′) examines the set ΩN = {y0, ⋯, yN−1, xN} and eventually leads to the best known analytical bound on the norm of the composite gradient mapping in Thm. 11 among fixed-step first-order methods.
An alternative formulation would be to use the set {y0, ⋯, yN−1} (i.e., excluding the point xN). For this alternative, we could simply replace the inequality (33) with the condition 0 ≤ F(yN−1) − F(x*) to derive a slightly different relaxation. (One could use other conditions at the point yN−1 as in [29] for a tight relaxation, but this is beyond the scope of this paper.) We found that the corresponding (loose) relaxation (P1′) using {y0, ⋯, yN−1} leads to a larger upper bound than (40) in Thm. 10 for the set ΩN.
Another alternative would be to use the set {x0, ⋯, xN}, which we leave as future work. Nevertheless, the inequality in Lemma 1 provides a bound for that set {x0, ⋯, xN} as seen in Thm. 10 and 11.
We could also consider the final point xN (or yN) in (P′) instead of the minimum over a set of points. However, the corresponding (loose) relaxation (P1′) yielded only an O(1/N) bound at best (even for the corresponding optimized step coefficients of (HD′)) on the final composite gradient mapping norm. So we leave finding its tighter relaxation as future work. Note that Table 4 reports tight numerical bounds on the composite gradient mapping norm at the final point xN of algorithms considered.
Appendix E
Proof of Equation (45) in Thm. 11
where , and (13).
Footnotes
Submitted to the editors July 31, 2017.
Tight relaxation here denotes transforming (relaxing) an optimization problem into a solvable problem while their solutions remain the same. [29] tightly relaxes the PEP into a solvable equivalent problem under a large-dimensional condition.
One could develop a first-order algorithm that is optimized with respect to the norm of the subgradient (rather than its upper bound in Sec. 4), which we leave as future work.
[11, Thm. 2] and [16, Thm. 2] imply that the O(1/N) rates of both the cost function bound (3) and the composite gradient mapping norm bound (10) of PGM are tight up to a constant respectively.
The bound for mini∈{0, …, N} ‖∇̃L/σ2F(yi)‖ of FPGM-σ is described in a big-O sense in [21, Prop. 5.2(c)], and we further computed the constant in (17) by following the derivation of [21, Prop. 5.2(c)].
See Appendix A for the derivation of the dual formulation (D) of (P1).
See Appendix D for the discussion on the choice of ΩN.
Here, we simply relaxed (P′) into (P1′) in a way that is similar to the relaxation from (P) to (P1). This relaxation resulted in a constructive analytical worst-case analysis on the composite gradient mapping in this section that is somewhat similar to that on the cost function in Section 3. However, this relaxation on (P1′) turned out to be relatively loose compared to the relaxation on (P1) (see Sec. 5), suggesting there is room for improvement in the future with a tighter relaxation.
The code in Taylor et al. [29] currently does not provide a tight bound of the norm of the composite gradient mapping (and the subgradient), so we simply added a few lines to compute a tight bound.
Using modifications of the code in [29] to compute tight bounds on the final subgradient norm was easier than for the smallest subgradient norm among all iterates. Even without the smallest subgradient norm bounds, the bounds on the final subgradient norm in Table 5 (compared to Table 4) provide some insights (beyond (8)) on the relationship between the bounds on the subgradient norm and the composite gradient mapping norm as discussed in Sec. 5.4. We leave further modifying the code in [29] for computing tight bounds on the smallest subgradient norm or other criteria as future work.
Contributor Information
Donghwan Kim, Email: kimdongh@umich.edu.
Jeffrey A. Fessler, Email: fessler@umich.edu.
References
1. Beck A, Nedic A, Ozdaglar A, Teboulle M. An O(1/k) gradient method for network resource allocation problems. IEEE Trans. Control of Network Systems. 2014;1:64–73. doi: 10.1109/TCNS.2014.2309751.
2. Beck A, Teboulle M. Fast gradient-based algorithms for constrained total variation image denoising and deblurring problems. IEEE Trans. Im. Proc. 2009;18:2419–34. doi: 10.1109/TIP.2009.2028250.
3. Beck A, Teboulle M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009;2:183–202. doi: 10.1137/080716542.
4. Beck A, Teboulle M. A fast dual proximal gradient algorithm for convex minimization and applications. Operations Research Letters. 2014;42:1–6. doi: 10.1016/j.orl.2013.10.007.
5. Chambolle A, Dossal C. On the convergence of the iterates of the “Fast iterative shrinkage/Thresholding algorithm”. J. Optim. Theory Appl. 2015;166:968–82. doi: 10.1007/s10957-015-0746-4.
6. Combettes PL, Pesquet J-C. Proximal splitting methods in signal processing. In: Fixed-Point Algorithms for Inverse Problems in Science and Engineering. Springer Optimization and Its Applications; 2011. pp. 185–212.
7. CVX Research Inc. CVX: Matlab software for disciplined convex programming, version 2.0. Aug. 2012. http://cvxr.com/cvx.
8. Daubechies I, Defrise M, De Mol C. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Comm. Pure Appl. Math. 2004;57:1413–57. doi: 10.1002/cpa.20042.
9. Devolder O, Glineur F, Nesterov Y. Double smoothing technique for large-scale linearly constrained convex optimization. SIAM J. Optim. 2012;22:702–27. doi: 10.1137/110826102.
10. Drori Y. The exact information-based complexity of smooth convex minimization. Journal of Complexity. 2017;39:1–16. doi: 10.1016/j.jco.2016.11.001.
11. Drori Y, Teboulle M. Performance of first-order methods for smooth convex minimization: A novel approach. Mathematical Programming. 2014;145:451–82. doi: 10.1007/s10107-013-0653-0.
12. Drori Y, Teboulle M. An optimal variant of Kelley’s cutting-plane method. Mathematical Programming. 2016;160:321–51. doi: 10.1007/s10107-016-0985-7.
13. Ghadimi S, Lan G. Accelerated gradient methods for nonconvex nonlinear and stochastic programming. Mathematical Programming. 2016;156:59–99. doi: 10.1007/s10107-015-0871-8.
14. Goldstein T, O’Donoghue B, Setzer S, Baraniuk R. Fast alternating direction optimization methods. SIAM J. Imaging Sci. 2014;7:1588–623. doi: 10.1137/120896219.
15. Grant M, Boyd S. Graph implementations for nonsmooth convex programs. In: Blondel V, Boyd S, Kimura H, editors. Recent Advances in Learning and Control. Lecture Notes in Control and Information Sciences, Springer-Verlag Limited; 2008. pp. 95–110. http://stanford.edu/~boyd/graph_dcp.html.
16. Kim D, Fessler JA. Generalizing the optimized gradient method for smooth convex minimization. 2016. arXiv:1607.06764. http://arxiv.org/abs/1607.06764.
17. Kim D, Fessler JA. Optimized first-order methods for smooth convex minimization. Mathematical Programming. 2016;159:81–107. doi: 10.1007/s10107-015-0949-3.
18. Kim D, Fessler JA. On the convergence analysis of the optimized gradient method. J. Optim. Theory Appl. 2017;172:187–205. doi: 10.1007/s10957-016-1018-7.
19. Lessard L, Recht B, Packard A. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM J. Optim. 2016;26:57–95. doi: 10.1137/15M1009597.
20. Löfberg J. YALMIP: A toolbox for modeling and optimization in MATLAB. In: Proceedings of the CACSD Conference; Taipei, Taiwan; 2004.
21. Monteiro RDC, Svaiter BF. An accelerated hybrid proximal extragradient method for convex optimization and its implications to second-order methods. SIAM J. Optim. 2013;23:1092–1125. doi: 10.1137/110833786.
22. Necoara I, Patrascu A. Iteration complexity analysis of dual first order methods for conic convex programming. Optimization Methods and Software. 2016;31:645–78. doi: 10.1080/10556788.2016.1161763.
23. Nesterov Y. A method for unconstrained convex minimization problem with the rate of convergence O(1/k²). Dokl. Akad. Nauk USSR. 1983;269:543–7.
24. Nesterov Y. Introductory lectures on convex optimization: A basic course. Kluwer; 2004. http://books.google.com/books?id=VyYLem-l3CgC.
25. Nesterov Y. Smooth minimization of non-smooth functions. Mathematical Programming. 2005;103:127–52. doi: 10.1007/s10107-004-0552-5.
26. Nesterov Y. How to make the gradients small. Optima. 2012;88:10–11.
27. Nesterov Y. Gradient methods for minimizing composite functions. Mathematical Programming. 2013;140:125–61. doi: 10.1007/s10107-012-0629-5.
28. Sturm J. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optim. Meth. Software. 1999;11:625–53. doi: 10.1080/10556789908805766.
29. Taylor AB, Hendrickx JM, Glineur F. Exact worst-case performance of first-order methods for composite convex optimization. SIAM J. Optim. 2017;27:1283–1313. doi: 10.1137/16M108104X.
30. Taylor AB, Hendrickx JM, Glineur F. Smooth strongly convex interpolation and exact worst-case performance of first-order methods. Mathematical Programming. 2017;161:307–45. doi: 10.1007/s10107-016-1009-3.