Proceedings of the National Academy of Sciences of the United States of America. 2019 Oct 30;116(46):22924–22930. doi: 10.1073/pnas.1908018116

The importance of better models in stochastic optimization

Hilal Asi a,1, John C Duchi a,b
PMCID: PMC6859306  PMID: 31666325

Significance

Sensitivity of optimization algorithms to problem and algorithmic parameters leads to tremendous waste in time and energy, especially in applications with millions of parameters, such as deep learning. We address this by developing stochastic optimization methods that are demonstrably—both by theory and by experimental evidence—more robust, enjoying optimal convergence guarantees for a variety of stochastic optimization problems. Additionally, we highlight the importance of evaluating a method's sensitivity to problem difficulty and algorithmic parameters.

Keywords: stochastic optimization, large-scale optimization

Abstract

Standard stochastic optimization methods are brittle, sensitive to stepsize choice and other algorithmic parameters, and they exhibit instability outside of well-behaved families of objectives. To address these challenges, we investigate models for stochastic optimization and learning problems that exhibit better robustness to problem families and algorithmic parameters. With appropriately accurate models—which we call the aprox family—stochastic methods can be made stable, provably convergent, and asymptotically optimal; even modeling that the objective is nonnegative is sufficient for this stability. We extend these results beyond convexity to weakly convex objectives, which include compositions of convex losses with smooth functions common in modern machine learning. We highlight the importance of robustness and accurate modeling with experimental evaluation of convergence time and algorithm sensitivity.


A major challenge in stochastic optimization—the algorithmic workhorse for much of modern statistical and machine-learning applications—is in setting algorithm parameters (or hyperparameter tuning). This sensitivity causes multiple issues. It results in thousands to millions of wasted engineering and computational hours. It also leads to a lack of clarity in research and development of algorithms: in claiming that one algorithm is better than another, it is unclear whether this is due to judicious choice of dataset or parameter settings or whether the algorithm indeed exhibits new desirable behavior. Consequently, in this paper we pursue 2 main thrusts: First, by using models more accurate than the first-order models common in stochastic gradient methods, we develop families of algorithms that are provably more robust to input parameter choices, with several corresponding optimality properties. Second, we argue for a different type of experimental evidence in evaluating stochastic optimization methods, where one jointly evaluates convergence speed and sensitivity of the methods.

The wasted computational and engineering energy is especially pronounced in deep learning, where engineers use models with millions of parameters, requiring days to weeks to train a single model. To get a sense of this energy use, we consider a few recent papers we view as exemplars of this broader trend: In searching for optimal neural network architectures and hyperparameters, the papers (1–3) used approximately 3,150 graphics processing unit (GPU) days, 22,000 GPU days, and 750,000 central processing unit (CPU) days of computation, respectively. To put this in perspective, assuming standard CPU energy use of between 60 and 100 W, the energy (ignoring network interconnect, monitors, etc.) for the paper (3) is roughly between $4 \times 10^{12}$ and $6 \times 10^{12}$ J. At $10^9$ J per tank of gas, this is sufficient to drive 4,000 Toyota Camrys the 380 miles between San Francisco and Los Angeles.
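As a rough sanity check on this arithmetic, the following sketch (in Python; the wattage range and per-tank energy are the assumptions stated above) reproduces the estimate:

```python
# Back-of-the-envelope check of the energy estimate for the paper (3).
SECONDS_PER_DAY = 86_400
CPU_DAYS = 750_000        # computation reported for the paper (3)
JOULES_PER_TANK = 1e9     # assumed energy content of one tank of gas

for watts in (60, 100):
    joules = CPU_DAYS * SECONDS_PER_DAY * watts
    print(f"{watts} W: {joules:.1e} J, about {joules / JOULES_PER_TANK:,.0f} tanks")
# 60 W: 3.9e+12 J (about 3,900 tanks); 100 W: 6.5e+12 J (about 6,500 tanks)
```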

To address these challenges, we develop stochastic optimization procedures that exhibit similar convergence to classical approaches—when the classical approaches have good tuning parameters—while enjoying better robustness, achieving this performance over a range of parameters. We argue too for evaluation of optimization algorithms based not only on convergence time but also on robustness to input choices. Briefly, a fast algorithm that converges for a small range of stepsizes is too brittle; we argue instead for (potentially slightly slower) algorithms that converge for broad ranges of stepsizes and other parameters. Our theory and experiments demonstrate the effectiveness of our methods for applications including phase retrieval, matrix completion, and deep learning.

Problem Setting and Approach

We begin by making our setting concrete. We study the stochastic optimization problem

$$\underset{x}{\text{minimize}} \;\; F(x) := \mathbb{E}_P[f(x;S)] = \int_{\mathcal{S}} f(x;s)\, dP(s) \quad \text{subject to} \;\; x \in X. \tag{1}$$

In problem 1, the set $\mathcal{S}$ is a sample space, $X \subset \mathbb{R}^n$ is closed convex, and $f(x;s)$ is the loss $x$ suffers on sample $s$. In this paper, we move beyond convex optimization by considering $\rho(s)$-weakly convex functions $f$, meaning (cf. refs. 4 and 5) that $f(x;s) + \frac{\rho(s)}{2}\|x\|_2^2$ is convex. We recover convexity when $\rho(s) \equiv 0$. Examples include linear regression, $f(x;(a,b)) = (\langle a, x\rangle - b)^2$, and phase retrieval, $f(x;(a,b)) = |\langle a, x\rangle^2 - b|$, which is $2\|a\|_2^2$-weakly convex.

Most optimization methods iterate by making an approximation—a model—of the objective at the current iterate, minimizing this model, and reapproximating. Stochastic (sub)gradient methods (6, 7) instantiate this approach using a linear approximation; following initial work of our own and others (5, 8, 9), we study the modeling approach in more depth for stochastic optimization. Thus, the aprox algorithms we develop iterate as follows: For $k = 1, 2, \ldots$, we draw a random $S_k \sim P$ and then update the iterate $x_k$ by minimizing a regularized approximation to $f(\cdot\,; S_k)$, setting

$$x_{k+1} = \operatorname*{argmin}_{x \in X} \Big\{ f_{x_k}(x; S_k) + \frac{1}{2\alpha_k}\|x - x_k\|_2^2 \Big\}. \tag{2}$$

We call $f_x(\cdot\,;s)$ the model of $f$ at $x$, where $f_x$ satisfies 3 conditions (cf. refs. 5, 8, and 9):

  • C.i) (Model convexity): The function $y \mapsto f_x(y;s)$ is convex and subdifferentiable.

  • C.ii) (Weak lower bound): The model $f_x$ satisfies
$$f_x(y;s) \le f(y;s) + \frac{\rho(s)}{2}\|y - x\|_2^2 \quad \text{for all } y \in X.$$

  • C.iii) (Local accuracy): We have $f_x(x;s) = f(x;s)$.

The containment $\partial_y f_x(y;s)\big|_{y=x} \subset \partial_x f(x;s)$ is immediate from condition C.iii. We provide examples presently.
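To fix ideas, the following is a minimal sketch (in Python; the `model_step` interface is our illustration, not part of the formal development) of the aprox iteration 2, with the model-minimization step abstracted so that any of the models in Methods below may be plugged in:

```python
import numpy as np

def aprox(x0, sample, model_step, alpha0=1.0, beta=0.6, iters=1000):
    """Iterate [2]: draw S_k ~ P, then minimize a model plus a proximal term.

    model_step(x, s, alpha) returns
      argmin_y { f_x(y; s) + ||y - x||^2 / (2 * alpha) }.
    """
    x = np.asarray(x0, dtype=float)
    for k in range(1, iters + 1):
        s = sample()                 # draw the sample S_k
        alpha = alpha0 * k ** -beta  # polynomially decaying stepsize
        x = model_step(x, s, alpha)
    return x
```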

We show that models slightly more accurate than the first-order model used by the stochastic gradient method—sometimes as simple as recognizing that if f is nonnegative, we should truncate the approximation at zero—achieve substantially better theoretical guarantees and practical performance. While the iterates of gradient methods can (superexponentially) diverge for misspecified stepsizes, our methods guarantee the iterates never diverge. Even more, this stability guarantees convergence and, in convex cases, optimal asymptotic normality of the averaged iterates. Finally, we evaluate the performance of our methods, validating our theoretical findings on convergence and robustness for a range of problems, including matrix completion, phase retrieval, and classification with neural networks. We defer proofs to SI Appendix.

In optimization broadly, proximal point methods and their related robust convergence are classical (10–12), and their role in smoothing and Moreau–Yosida regularization is also central in convex and variational analysis (13–15). In signal processing, least-mean squares for adaptive filtering is an important instance of the stochastic proximal point method (16, 17). More recent work in large-scale optimization and machine learning revisits Moreau smoothing and regularization, extending acceleration and stability properties of proximal-point-type methods to finite-sum and stochastic problems (18–20).

Notation and Basic Assumptions

For a weakly convex function $f$, we let $\partial f(x)$ denote its Fréchet subdifferential at the point $x$, and $f'(x) \in \partial f(x)$ denotes an arbitrary element of the subdifferential. Throughout, we let $x^\star$ denote a minimizer of problem 1 and $X^\star = \operatorname{argmin}_{x \in X} F(x)$ denote the optimal set for problem 1. We let $\mathcal{F}_k := \sigma(S_1, \ldots, S_k)$ denote the $\sigma$-field generated by the first $k$ random variables $S_i$. Note that $x_k \in \mathcal{F}_{k-1}$ for all $k$. Unless stated otherwise, we assume that the function $f(x;s)$ is $\rho(s)$-weakly convex for each $s \in \mathcal{S}$. Finally, the following assumption implicitly holds throughout.

Assumption A1.

The set $X^\star := \operatorname{argmin}_{x \in X} F(x)$ is nonempty, and there exists $\sigma^2 < \infty$ such that for each $x^\star \in X^\star$ and selection $f'(x^\star; s) \in \partial f(x^\star; s)$, we have $\mathbb{E}[\|f'(x^\star; S)\|_2^2] \le \sigma^2$.

Methods

To make our approach more concrete, we identify several models that fit into our framework. These have appeared in refs. 5, 8, and 9, but we believe a self-contained presentation is beneficial. Each one satisfies our conditions C.i to C.iii. The most common model in stochastic optimization is the first-order model.

Stochastic Subgradient Methods.

The stochastic subgradient method uses the model

$$f_x(y;s) := f(x;s) + \big\langle f'(x;s), y - x \big\rangle. \tag{3}$$

Proximal Point Methods.

In the convex setting (8, 20, 21), the stochastic proximal point method uses the model $f_x(y;s) := f(y;s)$; in the weakly convex setting, we regularize and use

$$f_x(y;s) := f(y;s) + \frac{\rho(s)}{2}\|y - x\|_2^2. \tag{4}$$

Other models require less knowledge than the proximal model 4 but preserve structural properties of the original function.

Prox-Linear Model.

Let the function $f$ have the composite structure $f(x;s) = h(c(x;s); s)$, where $h(\cdot\,;s)$ is convex and $c(\cdot\,;s)$ is smooth. The stochastic prox-linear method applies $h$ to a first-order approximation of $c$, using

$$f_x(y;s) := h\big(c(x;s) + \nabla c(x;s)^T (y - x);\, s\big). \tag{5}$$

In the nonstochastic setting, these models are classical (22), while recent work establishes convergence and convergence rates in restrictive stochastic settings (5, 9). When $h$ is $L_h$-Lipschitz and $c$ has an $L_c$-Lipschitz gradient, $f$ is $\rho = L_h L_c$-weakly convex.

Example 1 (phase retrieval):

In phase retrieval (23), we wish to recover an object $x^\star \in \mathbb{C}^n$ from a diffraction pattern $Ax^\star$, where $A \in \mathbb{C}^{m \times n}$, but physical sensor limitations mean we observe only the amplitudes $b = |Ax^\star|^2$. A natural objective is

$$\underset{x \in \mathbb{C}^n}{\text{minimize}} \;\; \frac{1}{m} \sum_{i=1}^m f(x; (a_i, b_i)), \qquad f(x; (a_i, b_i)) = \big|\, |\langle a_i, x \rangle|^2 - b_i \,\big|.$$

This is the composition of $h(z) = |z|$ and $c(x; (a_i, b_i)) = |\langle a_i, x \rangle|^2 - b_i$, so $f(\cdot\,; (a_i, b_i))$ is $2\|a_i\|_2^2$-weakly convex (24).

Example 2 (matrix completion):

In the matrix completion problem (25), which arises (for example) in the design of recommendation systems, we have a matrix $M \in \mathbb{R}^{m \times n}$ with decomposition $M = XY^T$ for $X \in \mathbb{R}^{m \times r}$ and $Y \in \mathbb{R}^{n \times r}$. Based on the incomplete set of known entries $\Omega \subset [m] \times [n]$, our goal is to recover the matrix $M$, giving rise to the objective

$$\underset{X \in \mathbb{R}^{m \times r},\, Y \in \mathbb{R}^{n \times r}}{\text{minimize}} \;\; \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} f(x_i, y_j; M_{i,j}),$$

where $f(x, y; z) := |\langle x, y \rangle - z|$ and $x_i$, $y_j$ are the $i$th and $j$th rows of $X$ and $Y$. This is the composition of $h(z) = |z|$ and $c(x, y; z) = \langle x, y \rangle - z$, so that $f = h \circ c$ is 1-weakly convex.
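As a concrete illustration (a sketch; the function name is ours), the sampled loss of Example 2 and one of its subgradients with respect to the pair $(x_i, y_j)$ take only a few lines:

```python
import numpy as np

def mc_loss_and_subgrad(x, y, z):
    """Loss f(x, y; z) = |<x, y> - z| and a subgradient in (x, y)."""
    u = float(x @ y) - z
    s = np.sign(u)  # sign(0) = 0 is a valid subgradient of |.| at 0
    return abs(u), np.concatenate([s * y, s * x])  # (d/dx, d/dy)
```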

Truncated Models.

The prox-linear model 5 may be challenging to implement for complex compositions (e.g., deep learning). If instead we know a lower bound on f, we may incorporate this into the model

$$f_x(y;s) := \max\Big\{f(x;s) + \big\langle f'(x;s), y - x \big\rangle,\; \inf_{z \in X} f(z;s)\Big\}. \tag{6}$$

In our examples—linear and logistic regression, phase retrieval, and matrix completion (more generally, typical loss functions in machine learning)—we have $\inf_z f(z;s) = 0$. The assumption that we have a lower bound is thus rarely restrictive. This model satisfies the conditions C.i to C.iii, also satisfying the following condition.

  • C.iv) (Lower optimality): For all $s \in \mathcal{S}$ and $x, y \in X$,
$$f_x(y;s) \ge \inf_{z \in X} f(z;s).$$

As we show, condition C.iv is sufficient to derive several optimality and stability properties.
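When $X = \mathbb{R}^n$, the update 2 with the truncated model 6 has a simple closed form: the minimizer lies along the subgradient direction $-g$, and the resulting steplength is an ordinary subgradient steplength truncated so the model never predicts a value below the lower bound. A minimal sketch (assuming an unconstrained domain and a known lower bound, often 0):

```python
import numpy as np

def truncated_step(x, loss, grad, alpha, lower=0.0):
    """One aprox step [2] with the truncated model [6] on X = R^n.

    loss = f(x; s), grad = a subgradient f'(x; s), lower = inf_z f(z; s).
    """
    gnorm2 = float(grad @ grad)
    if gnorm2 == 0.0:
        return x  # x already minimizes the model
    t = min(alpha, (loss - lower) / gnorm2)  # truncated steplength
    return x - t * grad
```

For the absolute-value composites of Examples 1 and 2, whose losses are 0 at interpolating points, this step coincides with the prox-linear update 5.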

Stability and Its Consequences

In our initial study of stability in optimization (8), we defined an algorithm as stable if its iterates remain bounded and then showed several consequences of this in convex optimization (which we review presently). Here, we develop 2 important extensions. First, we show that any model satisfying condition C.iv has stable iterates under mild assumptions, in strong contrast to models (e.g., linear) that fail the condition. Second, we develop an analogous stability theory for weakly convex functions, proving that accurate enough models are stable. In parallel to the convex case, stability suffices for more: It implies convergence (with an asymptotic rate) to stationary points for any model-based method on weakly convex functions. Let us formalize stability (8). A pair $(\mathcal{F}, \mathcal{P})$ is a collection of problems if $\mathcal{P}$ consists of probability measures on a sample space $\mathcal{S}$ and $\mathcal{F}$ of functions $f : X \times \mathcal{S} \to \mathbb{R}$.

Definition 1.

An algorithm generating iterates $x_k$ according to the model-based update 2 is stable in probability for the class of problems $(\mathcal{F}, \mathcal{P})$ if for all $f \in \mathcal{F}$ and $P \in \mathcal{P}$ defining $F(x) = \mathbb{E}_P[f(x;S)]$ and $X^\star = \operatorname{argmin}_{x \in X} F(x)$,

$$\sup_k \mathrm{dist}(x_k, X^\star) < \infty \quad \text{with probability } 1. \tag{7}$$

Typically, stability 7 requires the standard assumptions

$$\alpha_k > 0, \qquad \sum_{k \ge 1} \alpha_k = \infty, \qquad \text{and} \qquad \sum_{k \ge 1} \alpha_k^2 < \infty. \tag{8}$$

Even under these, models such as the linear model 3 and consequent subgradient method are unstable (ref. 8, section 3). They may even cause superexponential divergence.

Example 3 (divergence):

Let $F(x) = e^x + e^{-x}$, $p < \infty$, and $\alpha_0 > 0$, and let $\alpha_k$ satisfy $\alpha_k \ge \alpha_0 k^{-p}$. Let $x_{k+1} = x_k - \alpha_k F'(x_k) = x_k - \alpha_k (e^{x_k} - e^{-x_k})$ be generated by the gradient method. For large $x_1$, $\log|x_{k+1}| \gtrsim |x_k| \ge 2^k$ for all $k$.
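A short numerical sketch makes the divergence vivid; because the iterates overflow immediately, we track $\log_{10}|x_k|$ via the approximation $|x_{k+1}| \approx \alpha_k e^{|x_k|}$, which is accurate once $|x_k|$ is large:

```python
import math

# Gradient method on F(x) = exp(x) + exp(-x) with x_1 = 10 and alpha_k = 1/k.
log10_abs_x = math.log10(10.0)  # log10 |x_1|
for k in range(1, 6):
    try:
        abs_x = 10 ** log10_abs_x
        # log10 |x_{k+1}| ~ |x_k| * log10(e) + log10(alpha_k)
        log10_abs_x = abs_x * math.log10(math.e) + math.log10(1.0 / k)
        print(f"k={k + 1}: log10 |x_k| ~ {log10_abs_x:.4g}")
    except OverflowError:
        print(f"k={k + 1}: log10 |x_k| itself no longer fits in a float")
        break
# k=2: ~4.3 (|x_2| ~ 2e4); k=3: ~9565; k=4: even the logarithm overflows.
```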

The Importance of Stability in Stochastic Convex Optimization.

To set the stage for what follows, we begin by motivating the importance of stable procedures. Briefly, any stable aprox model converges for any convex function under weak assumptions, which we now elucidate. First, we make an assumption.

Assumption A2.

There exists $G_{\mathrm{big}} : \mathbb{R}_+ \to \mathbb{R}_+$ such that for all $x \in X$ and each measurable selection $f'(x;s) \in \partial f(x;s)$,

$$\mathbb{E}\Big[\big\|f'(x;S)\big\|_2^2\Big] \le G_{\mathrm{big}}\big(\|x\|_2\big).$$

Assumption A2 is equivalent to assuming $\mathbb{E}[\|f'(x;S)\|_2^2]$ is bounded on compact sets; it allows arbitrary growth as long as the subgradients have second moments.

Proposition 1 [Asi and Duchi (8), proposition 1].

Assume that $f(\cdot\,;s)$ is convex for each $s \in \mathcal{S}$ and let Assumption A2 hold. Let the iterates $x_k$ be generated by any method satisfying conditions C.i to C.iii and [8]. On the event $\sup_k \|x_k\| < \infty$, we have $\sum_k \alpha_k (F(x_k) - F(x^\star)) < \infty$ and $\mathrm{dist}(x_k, X^\star) \stackrel{a.s.}{\to} 0$.

Proposition 1 establishes convergence of stable procedures and also (via Jensen's inequality) provides asymptotic rates of convergence for the weighted averages $\sum_k \alpha_k x_k / \sum_k \alpha_k$.

Stability is additionally important when the functions $f$ are smooth: Any stable aprox method achieves asymptotically optimal convergence. In particular, let us assume $F$ is $\mathcal{C}^2$ near $x^\star = \operatorname{argmin}_{x \in X} F(x)$ with $\nabla^2 F(x^\star) \succ 0$, and that the $f(\cdot\,;s)$ have an $L(s)$-Lipschitz gradient near $x^\star$ with $\mathbb{E}[L(S)^2] < \infty$.

Proposition 2 [Asi and Duchi (8), theorem 2].

In addition to the conditions of Proposition 1, let the conditions of the previous paragraph hold. Then $\bar{x}_k = \frac{1}{k}\sum_{i=1}^k x_i$ satisfies

$$\sqrt{k}\,(\bar{x}_k - x^\star) \stackrel{d}{\to} N\Big(0,\; \nabla^2 F(x^\star)^{-1}\, \mathrm{Cov}\big(\nabla f(x^\star; S)\big)\, \nabla^2 F(x^\star)^{-1}\Big).$$

This convergence is optimal for any method (26).

Stability of Lower-Bounded Models for Convex Functions.

With these consequences of stability in hand—convergence and asymptotic optimality—it behooves us to provide conditions sufficient to guarantee stability. To that end, we show that lower-bounded models satisfying condition C.iv are stable in probability (Definition 1) for functions whose (sub)gradients grow at most polynomially. We begin with an assumption.

Assumption A3.

There exist $C < \infty$ and $2 \le p < \infty$ such that

$$\mathbb{E}\Big[\big\|f'(x;S)\big\|_2^2\Big] \le C\big(1 + \mathrm{dist}(x, X^\star)^p\big) \quad \text{for all } x \in X,$$

and $\mathbb{E}\big[(f(x;S) - \inf_{z \in X} f(z;S))^{p/2}\big] \le C$ for all $x \in X$.

The analogous condition (27) for stochastic gradient methods holds for $p = 2$, or quadratic growth, without which the method may diverge. In contrast, Assumption A3 allows polynomial growth; for example, the function $f(x) = x^4$ is permissible, while the gradient method may exponentially diverge even for stepsizes $\alpha_k = 1/k$. The key consequence of Assumption A3 is that if it holds, truncated models are stable:

Theorem 1.

Assume the function $f(\cdot\,;s)$ is convex for each $s \in \mathcal{S}$. Let Assumption A3 hold and $\alpha_k = \alpha_0 k^{-\beta}$ with $\frac{p+2}{p+4} < \beta < 1$. Let $x_k$ be generated by the iteration 2 with a model satisfying conditions C.i to C.iv. Then

$$\sup_{k \in \mathbb{N}} \mathrm{dist}(x_k, X^\star) < \infty \quad \text{with probability } 1.$$

Theorem 1 shows that truncated methods enjoy the benefits of stability we outline in Propositions 1 and 2 above. Thus, these models, whose updates are typically as cheap to compute as a stochastic gradient step (especially in the common case that $\inf_z f(z;s) = 0$), provide substantial advantages over methods using only (sub)gradient approximations.

Stability and Its Consequences for Weakly Convex Functions.

We continue our argument that—if possible—it is beneficial to use more accurate models, even in situations beyond convexity, investigating the stability of proximal models (Eq. 4) for weakly convex functions. Establishing stability in the weakly convex case requires a different approach from the convex case, as the iterates may not make progress toward a fixed optimal set. In this case, to show stability, we require an assumption bounding the size of $f'(x;S)$ relative to the population subgradient $F'$.

Assumption A4.

There exist $C_1, C_2 < \infty$ such that for all measurable selections $f'(x;s) \in \partial f(x;s)$ and $F'(x) \in \partial F(x)$,

$$\mathbb{E}\Big[\big\|f'(x;S) - F'(x)\big\|_2^2\Big] \le C_1 \big\|F'(x)\big\|_2^2 + C_2.$$

By providing a relative noise condition on $f'$, Assumption A4 allows for more than the typical class of functions with global Lipschitz properties (cf. ref. 5), such as the phase retrieval and matrix completion objectives (Examples 1 and 2). It can allow exponential growth, addressing the challenges in Example 3. For example, let $f(x;1) = e^x$ and $f(x;2) = e^{-x}$, where $S$ is uniform in $\{1, 2\}$ so that $F(x) = \frac{1}{2}(e^x + e^{-x})$; then $\mathbb{E}[f'(x;S)^2] = 2F'(x)^2 + 1$.

To describe convergence and stability in nonconvex (even nonsmooth) settings, we require appropriate definitions. Finding global minima of nonconvex functions is computationally infeasible (28), so we follow established practice and consider convergence to stationary points via the Moreau envelope (5, 29). To formalize, for $x \in \mathbb{R}^n$ and $\lambda \ge 0$, the Moreau envelope and associated proximal map are

$$F_\lambda(x) := \inf_{y \in X}\Big\{F(y) + \frac{\lambda}{2}\|y - x\|_2^2\Big\} \quad \text{and} \quad \mathrm{prox}_{F/\lambda}(x) := \operatorname*{argmin}_{y \in X}\Big\{F(y) + \frac{\lambda}{2}\|y - x\|_2^2\Big\}.$$

For large enough $\lambda$, the minimizer $x_\lambda := \mathrm{prox}_{F/\lambda}(x)$ is unique whenever $F$ is weakly convex. Adopting the techniques Davis and Drusvyatskiy (5) pioneer for weakly convex problems, we rely on the Moreau envelope's connections to (near) stationarity:

$$\nabla F_\lambda(x) = \lambda(x - x_\lambda), \qquad F(x_\lambda) \le F(x), \qquad \mathrm{dist}\big(0, \partial F(x_\lambda)\big) \le \big\|\nabla F_\lambda(x)\big\|_2. \tag{9}$$

The 3 properties in [9] imply that any nearly stationary point $x$ of $F_\lambda$—when $\|\nabla F_\lambda(x)\|_2$ is small—is close to a nearly stationary point $x_\lambda$ of $F$. To prove convergence for weakly convex $F$, then, it suffices to show $\nabla F_\lambda(x_k) \to 0$.
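These quantities are easy to explore numerically; the following sketch (using scipy for the inner minimization; the example loss is our choice) evaluates $F_\lambda$ and $\mathrm{prox}_{F/\lambda}$ for the 1-dimensional weakly convex loss $F(y) = |y^2 - 1|$, for which $\rho = 2$:

```python
from scipy.optimize import minimize_scalar

F = lambda y: abs(y ** 2 - 1.0)  # 2-weakly convex (cf. Example 1)

def moreau(x, lam):
    """Return (F_lam(x), prox_{F/lam}(x)); lam > rho makes the prox unique."""
    res = minimize_scalar(lambda y: F(y) + 0.5 * lam * (y - x) ** 2)
    return res.fun, res.x

val, x_lam = moreau(x=0.9, lam=4.0)
grad_env = 4.0 * (0.9 - x_lam)  # grad F_lam(x) = lam * (x - x_lam), cf. [9]
print(val, x_lam, grad_env)     # x_lam is near 1.0, a stationary point of F
```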

Using full proximal models guarantees convergence.

Theorem 2.

Let Assumption A4 hold, let $\lambda < \infty$ satisfy $\mathbb{E}[\rho(S)] < \lambda$, and assume $\inf_{x \in X} F(x) > -\infty$ and $\mathbb{E}[\rho(S)^2] < \infty$. Let $x_k$ follow the iteration 2 with proximal model 4 and stepsizes 8. Then there exists a random variable $G_\lambda < \infty$ satisfying

$$F_\lambda(x_k) \to G_\lambda \quad \text{and} \quad \sum_k \alpha_k \big\|\nabla F_\lambda(x_k)\big\|_2^2 < \infty \quad \text{w.p. } 1.$$

Theorem 2 shows that $F_\lambda(x_k)$ is bounded almost surely. Thus, if $F$ is coercive, meaning $F(x) \to \infty$ as $\|x\| \to \infty$, the Moreau envelope $F_\lambda$ is coercive, yielding the following.

Corollary 1.

Let the conditions of Theorem 2 hold and let F be coercive. Then

$$\sup_{k \in \mathbb{N}} \mathrm{dist}(x_k, X^\star) < \infty \quad \text{with probability } 1.$$

In parallel with our development of the convex case, stability is sufficient to develop convergence results for any aprox method, highlighting its importance. Indeed, we can show that stable methods guarantee convergence, although for probability 1 convergence of the iterates we require a slightly more elaborate assumption (cf. refs. 9 and 30), which rules out pathological limits.

Assumption A5 (Weak Sard).

Let $X^{\mathrm{stat}} := \{x \in X : 0 \in \partial F(x)\}$ be the collection of stationary points of $F$ over $X$. The Lebesgue measure of the image $F(X^{\mathrm{stat}})$ is zero.

Under this assumption, aprox methods converge to stationary points whenever the iterates are stable.

Proposition 3.

Let Assumption A2 hold and the iterates $x_k$ be generated by any method satisfying conditions C.i to C.iii and [8]. Assume that $\lambda$ is large enough that $\mathbb{E}[\rho(S)] < \lambda$. There exists a finite random variable $G_\lambda$ such that on the event that $\sup_k \|x_k\|_2 < \infty$, with probability 1 we have

$$\sum_k \alpha_k \big\|\nabla F_\lambda(x_k)\big\|_2^2 < \infty \quad \text{and} \quad F_\lambda(x_k) \to G_\lambda. \tag{10}$$

Under Assumption A5, $\mathrm{dist}(x_k, X^{\mathrm{stat}}) \stackrel{a.s.}{\to} 0$ and $\|\nabla F_\lambda(x_k)\|_2 \stackrel{a.s.}{\to} 0$.

The condition 10 is enough to develop a conditional $\ell_2$-convergence guarantee similar to what stochastic (sub)gradient methods achieve to stationary points for Lipschitz $F$ (5, 31). Indeed, assume $\alpha_k = \alpha_0 k^{-\beta}$ for some $\beta \in (\frac{1}{2}, 1)$ and that the iterates $x_k$ are stable; choose $I_k \in \{1, \ldots, k\}$ with probability $P(I_k = i) = \alpha_i / \sum_{j=1}^k \alpha_j$. Then inequality 10 shows

$$\limsup_{k}\; k^{1-\beta}\, \mathbb{E}\Big[\big\|\nabla F_\lambda(x_{I_k})\big\|_2^2 \,\Big|\, \mathcal{F}_k\Big] < \infty \quad \text{with probability } 1.$$

Fast Convergence for Easy Problems

In many engineering and learning applications, solutions interpolate the data. Consider, for example, signal recovery problems with $b = Ax^\star$ or modern machine-learning applications, where frequently the training error is zero (32, 33). We consider such problems here, showing how models that satisfy the lower-bound condition C.iv enjoy linear convergence, extending our earlier results (8) beyond convex optimization.

Definition 2.

Let $F(x) := \mathbb{E}_P[f(x;S)]$. Then $F$ is easy to optimize if for each $x^\star \in X^\star$ and $P$-almost all $s \in \mathcal{S}$,

$$\inf_{x \in X} f(x;s) = f(x^\star; s).$$

For such problems, we can guarantee progress toward minimizers for appropriate f, as the following lemma shows.

Lemma 1.

Let $F$ be easy to optimize (Definition 2). Let $x_k$ be generated by the updates 2 using a model satisfying conditions C.i to C.iv. Then for any $x^\star \in X^\star$,

$$\|x_{k+1} - x^\star\|_2^2 \le \big(1 + \alpha_k \rho(S_k)\big)\|x_k - x^\star\|_2^2 - \big[f(x_k; S_k) - f(x^\star; S_k)\big]\min\bigg\{\alpha_k,\; \frac{f(x_k; S_k) - f(x^\star; S_k)}{\|f'(x_k; S_k)\|_2^2}\bigg\}.$$

Lemma 1 allows us to prove fast convergence as long as $f$ grows quickly enough away from $x^\star$; a sufficient condition for us is a sharp growth condition away from the optimal set $X^\star$. To meld with Lemma 1, we consider the following:

Assumption A6 (Expected Sharp Growth).

There exist constants $\lambda_0, \lambda_1 > 0$ such that for $\alpha \in \mathbb{R}_+$, $x \in X$, and $x^\star \in X^\star$,

$$\mathbb{E}\Bigg[\min\bigg\{\alpha,\; \frac{f(x;S) - f(x^\star;S)}{\|f'(x;S)\|_2^2}\bigg\}\big(f(x;S) - f(x^\star;S)\big)\Bigg] \ge \mathrm{dist}(x, X^\star)\,\min\big\{\lambda_0 \alpha,\; \lambda_1\, \mathrm{dist}(x, X^\star)\big\}.$$

Assumption A6 is tailored to Lemma 1, so we discuss a few situations where it holds. One sufficient condition is the small-ball condition that there exists $C$ such that $P\big(f(x;S) - f(x^\star;S) \ge \epsilon\, \mathrm{dist}(x, X^\star)\big) \ge 1 - C\epsilon$ for $\epsilon > 0$ and $\mathbb{E}[\|f'(x;S)\|_2^2] \le C(1 + \mathrm{dist}(x, X^\star)^2)$. We can be more explicit:

Example 4 (Example 1 continued):

Consider the (real-valued) phase retrieval problem with objective $f(x;(a,b)) = |\langle a, x\rangle^2 - b|$. Assume the vectors $a_i \in \mathbb{R}^n$ are drawn from a distribution satisfying the small-ball condition $P(|\langle a_i, u\rangle| \ge \epsilon \|u\|_2) \ge 1 - \epsilon$ for $\epsilon > 0$ and any $u \in \mathbb{R}^n$ and additionally that $\mathbb{E}[\|a_i\|_2^2] \le M^2 n$ and $\mathbb{E}[\langle a_i, x\rangle^2] \le M^2 \|x\|_2^2$ for some $M < \infty$. For a sample of size $m$, Assumption A6 holds with high probability for the objective $F(x) = \frac{1}{m}\sum_{i=1}^m f(x;(a_i, b_i))$ with $\lambda_0 = c\|x^\star\|_2$ and $\lambda_1 = \frac{c}{16 M^4 n}$ for a numerical constant $c > 0$. The full calculation is in SI Appendix.

The following proposition is our main result in this section, showing lower-bounded models may enjoy linear convergence.

Proposition 4.

Let Assumption A6 hold and $x_k$ be generated by the stochastic iteration 2 using any model satisfying conditions C.i to C.iv, where the stepsizes $\alpha_k$ satisfy $\alpha_k = \alpha_0 k^{-\beta}$ for some $\beta \in (0,1)$. If $f(\cdot\,;S_k)$ is $\rho(S_k)$-weakly convex with $\mathbb{E}[\rho(S_k)] = \bar{\rho}$, then for any $m \in \mathbb{N}$ and $\epsilon > 0$, there exists a finite random variable $V_{\epsilon,m} < \infty$ such that

$$\mathrm{dist}(x_k, X^\star)^2\, (1 - \lambda_1)^{-(k-1)}\, 1\bigg\{\max_{m \le i \le k-1} \mathrm{dist}(x_i, X^\star) \le \frac{\lambda_0}{(1+\epsilon)\bar{\rho}}\bigg\} \stackrel{a.s.}{\to} V_{\epsilon,m}.$$

When the functions $f$ are convex, we have $\bar{\rho} = 0$, so that Proposition 4 guarantees linear convergence for easy problems. In the case that $\bar{\rho} > 0$, the result is conditional: If an aprox method converges to one of the sharp minimizers of $f$, then this convergence is linear (i.e., geometrically fast). In the case of phase retrieval, we can guarantee convergence:

Example 5 (Example 4 continued):

Let $A \in \mathbb{R}^{m \times n}$ be a matrix with rows $a_i$ that satisfy the conditions of Example 4. For $F(x) = \frac{1}{m}\big\|\, |Ax|^2 - |Ax^\star|^2 \,\big\|_1$ where $m \gtrsim n$, the truncated model 6 requires overall computation time $O(mn \log \frac{1}{\epsilon})$ to achieve an $\epsilon$-accurate solution to phase retrieval, which is the best-known time complexity. See proof in SI Appendix.

Experiments

An important question in the development of any optimization method is its sensitivity to algorithm parameters. Consequently, we conclude by experimentally examining convergence time and robustness of each of our optimization methods. We consider each of the models in this paper: the stochastic gradient method (i.e., the linear model 3), the proximal model 4, the prox-linear model 5, and the (lower) truncated model 6.

We test both convergence time and, dovetailing with our focus in this paper, robustness to stepsize for several problems: phase retrieval, matrix completion, and 2 classification problems using deep learning. We consider stepsize sequences of the form $\alpha_k = \alpha_0 k^{-\beta}$ and perform $K$ iterations over a wide range of different initial stepsizes $\alpha_0$. (For brevity, we present results only for the power $\beta = 0.6$; experiments with varied $\beta \in (\frac{1}{2}, 1)$ were similar.) For a fixed accuracy $\epsilon > 0$, we record the number of steps $k$ to achieve $F(x_k) - F(x^\star) \le \epsilon$, reporting these times (where we terminate each run at iteration $K$). We perform $T$ experiments for each initial stepsize choice, reporting median time to $\epsilon$ accuracy and 90% confidence intervals.
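In code, the protocol is a simple sweep; the sketch below (the `run_method` callable and its signature are hypothetical) records, for each $\alpha_0$, the median and a 90% interval of the time to $\epsilon$ accuracy over $T$ trials:

```python
import numpy as np

def time_to_accuracy(run_method, alpha0_grid, T=20, K=100_000, eps=1e-3):
    """run_method(alpha0, beta, max_iters, eps) -> first k with
    F(x_k) - F(x*) <= eps, or max_iters if the run never reaches it."""
    results = {}
    for alpha0 in alpha0_grid:
        times = [run_method(alpha0=alpha0, beta=0.6, max_iters=K, eps=eps)
                 for _ in range(T)]
        lo, med, hi = np.percentile(times, [5, 50, 95])
        results[alpha0] = (med, (lo, hi))  # median and 90% interval
    return results
```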

Phase Retrieval.

We start our experiments with the phase retrieval problem in Examples 1 and 4, focusing on the real case for simplicity, where we are given $A \in \mathbb{R}^{m \times n}$ with rows $a_i \in \mathbb{R}^n$ and $b = (Ax^\star)^2 \in \mathbb{R}_+^m$ for some $x^\star \in \mathbb{R}^n$. Our objective is the nonconvex and nonsmooth function

$$F(x) = \frac{1}{m} \sum_{i=1}^m \big| \langle a_i, x \rangle^2 - b_i \big|.$$

We sample the vectors $a_i$ and $x^\star$ i.i.d. $N(0, I_n)$.

We present the results in Fig. 1, comparing the stochastic gradient method 3, the proximal method 4, and the truncated method 6 (whose updates are identical to those of the prox-linear model 5 in this case). The plots demonstrate the expected result that the stochastic gradient method has good performance in a narrow range of stepsizes (roughly $\alpha_0 \in [1, 10]$ in this case), while better approximations for aprox yield convergence over a large range of stepsizes. The truncated model 6 exhibits oscillation for large stepsizes, in contrast to the exact model 4.

Fig. 1. The number of iterations to achieve $\epsilon$ accuracy versus initial stepsize $\alpha_0$ for phase retrieval with $n = 50$, $m = 1{,}000$. SGM, stochastic gradient method.
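A toy version of this comparison (a sketch with small, hypothetical problem sizes, not the configuration of Fig. 1) already reproduces the qualitative behavior: plain subgradient steps are accurate only for a narrow band of $\alpha_0$, while truncated steps degrade gracefully:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 200
A = rng.normal(size=(m, n))
x_star = rng.normal(size=n)
b = (A @ x_star) ** 2

def run(alpha0, truncated, K=5000):
    x = rng.normal(size=n)
    for k in range(1, K + 1):
        i = rng.integers(m)
        u = (A[i] @ x) ** 2 - b[i]
        g = 2 * (A[i] @ x) * np.sign(u) * A[i]  # subgradient of |<a,x>^2 - b|
        alpha = alpha0 * k ** -0.6
        t = min(alpha, abs(u) / max(float(g @ g), 1e-12)) if truncated else alpha
        x = x - t * g
    return np.mean(np.abs((A @ x) ** 2 - b))  # final objective F(x)

for alpha0 in (0.1, 1.0, 10.0, 100.0):
    # large alpha0 makes the plain subgradient method blow up (inf/nan)
    print(alpha0, run(alpha0, truncated=False), run(alpha0, truncated=True))
```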

Matrix Completion.

For our second experiment, we investigate aprox procedures for the matrix completion problem of Example 2. In this setting, we are given $M = XY^T$, for $X \in \mathbb{R}^{m \times r}$ and $Y \in \mathbb{R}^{n \times r}$, and a set of indexes $\Omega \subset [m] \times [n]$. We aim to recover $M$ observing only $\{M_{ij}\}_{(i,j) \in \Omega}$, so our goal is to

$$\text{minimize} \;\; F(X, Y) := \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} \big| X_i^T Y_j - M_{i,j} \big|.$$

We optimize over matrices $X \in \mathbb{R}^{m \times \hat{r}}$ and $Y \in \mathbb{R}^{n \times \hat{r}}$, where the estimated rank $\hat{r} \ge r$. We generate $X$ and $Y$ by drawing their entries i.i.d. $N(0,1)$, choosing $\Omega$ uniformly at random of size $|\Omega| = 5(nr + mr)$. We present the timing results in Fig. 2, which tells a similar story to Fig. 1: Better approximations, such as the truncated models (which again yield updates identical to the prox-linear models 5), are significantly more robust to stepsize specification. The proximal update requires solving a nontrivial quartic, so we omit it.

Fig. 2. Number of iterations to achieve $\epsilon$ accuracy versus initial stepsize $\alpha_0$ for matrix completion with $m = 2{,}000$, $n = 2{,}400$, $r = 5$. Shown are estimated ranks (A) $\hat{r} = 5$ and (B) $\hat{r} = 10$.

Neural Networks.

As one of our main motivations is to address the extraordinary effort—in computational and engineering hours—spent carefully tuning optimization methods, we would be remiss to avoid experiments on deep neural networks. Therefore, in our last set of experiments, we test the performance of our models for training neural networks for classification tasks over the CIFAR10 dataset (34) and the fine-grained 120-class Stanford dogs multiclass recognition task (35). For our CIFAR10 experiment, we use the Resnet18 architecture (36); we replace the rectified linear unit (RELU) activations internal to the architecture with exponentiated linear units (ELUs) (37) so that the loss is of composite form $f = h \circ c$ for $h$ convex and $c$ smooth. For Stanford dogs we use the VGG16 architecture (38) pretrained on Imagenet (39), again substituting ELUs for RELU activations. For this experiment, we also test a modified version of the truncated method, truncadagrad, which uses the truncated model in iteration 2 and a diagonally scaled Euclidean distance (40), updating at iteration $k$ by setting $x_{k+1}$ to minimize

$$\Big[ f(x_k; S_k) + \big\langle g_k, x - x_k \big\rangle \Big]_+ + \frac{1}{2\alpha_0}(x - x_k)^T H_k (x - x_k),$$

where $H_k = \mathrm{diag}\big(\sum_{i=1}^k g_i g_i^T\big)^{1/2}$ for $g_i = f'(x_i; S_i)$. This update requires no more of standard deep-learning software than computing a gradient (backpropagation) and a loss. We also compare to ADAM, the default optimizer in TensorFlow (41).
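A sketch of this step (assuming an unconstrained domain and lower bound 0; the variable names are ours): minimizing the displayed model along the direction $-H_k^{-1} g_k$ gives an AdaGrad-scaled gradient step whose length is truncated so the model stays nonnegative.

```python
import numpy as np

def truncadagrad_step(x, loss, grad, sum_sq, alpha0, eps=1e-12):
    """One truncadagrad step; sum_sq accumulates the squared gradients."""
    sum_sq += grad ** 2               # running sum defining H_k
    h = np.sqrt(sum_sq) + eps         # diagonal of H_k (eps avoids 0-division)
    hinv_grad = grad / h              # H_k^{-1} g_k
    # steplength min{alpha0, f_k / (g_k^T H_k^{-1} g_k)} truncates the model at 0
    t = min(alpha0, loss / max(float(grad @ hinv_grad), eps))
    return x - t * hinv_grad, sum_sq
```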

Figs. 3 and 4 show our results for the CIFAR10 and Stanford dogs datasets, respectively. Figs. 3A and 4A give the number of iterations required to achieve $\epsilon$ test-classification error (on the highest or "top-1" predicted class), while Figs. 3B and 4B show the maximal accuracy each procedure achieves for a given initial stepsize $\alpha_0$. The plots demonstrate the sensitivity of the standard stochastic gradient method to stepsize choice, which converges only for a small range of stepsizes in both experiments. ADAM exhibits better robustness for CIFAR10, while it is extremely sensitive in the second experiment (Fig. 4), converging only for a small range of stepsizes—this difference in sensitivities highlights the importance of robustness. In contrast, our procedures using the truncated model are apparently robust for all large enough stepsizes. Figs. 3B and 4B show additionally that the maximal accuracy the 2 truncated methods achieve changes only slightly for $\alpha_0 \ge 10^{-1}$, again in strong contrast to the other methods, which achieve their best accuracy only for a small range of stepsizes.

Fig. 3. (A) The number of iterations to achieve $\epsilon$ test error versus initial stepsize $\alpha_0$ for CIFAR10. (B) The best achieved accuracy after $T = 50$ epochs.

Fig. 4. (A) The number of iterations to achieve $\epsilon$ test error versus initial stepsize $\alpha_0$ for the Stanford dogs dataset. (B) The best achieved accuracy after $T = 30$ epochs.

These results reaffirm the insights from our theoretical results and experiments: It is important and possible to develop methods that enjoy good convergence guarantees and are robust to algorithm parameters.

Data Availability.

All data discussed in this paper are available at GitHub (https://github.com/HilalAsi/APROX-Robust-Stochastic-Optimization-Algorithms) (42).

Supplementary Material

Supplementary File
pnas.1908018116.sapp.pdf (325.1KB, pdf)

Acknowledgments

H.A. and J.C.D. were supported by National Science Foundation (NSF)-CAREER Award CCF-1553086, Office of Naval Research Young Investigator Program Award N00014-19-2288, and the Stanford DAWN Consortium.

Footnotes

The authors declare no competing interest.

This article is a PNAS Direct Submission.

Data deposition: Data and code for this work have been deposited in GitHub (https://github.com/HilalAsi/APROX-Robust-Stochastic-Optimization-Algorithms).

This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1908018116/-/DCSupplemental.

References

  • 1. Real E., Aggarwal A., Huang Y., Le Q. V., "Regularized evolution for image classifier architecture search" in Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Stone P., Ed. (AAAI Press, Palo Alto, CA, 2019), vol. 33, pp. 4780–4789.
  • 2. Zoph B., Le Q. V., "Neural architecture search with reinforcement learning" in Proceedings of the Fifth International Conference on Learning Representations, Bengio Y., LeCun Y., Eds. (ICLR, 2017).
  • 3. Collins J., Sohl-Dickstein J., Sussillo D., Capacity and trainability in recurrent neural networks. arXiv:1611.09913 [stat.ML] (29 November 2016).
  • 4. Rockafellar R. T., Wets R. J. B., Variational Analysis (Springer, New York, NY, 1998).
  • 5. Davis D., Drusvyatskiy D., Stochastic model-based minimization of weakly convex functions. SIAM J. Optim. 29, 207–239 (2019).
  • 6. Robbins H., Monro S., A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951).
  • 7. Nemirovski A., Juditsky A., Lan G., Shapiro A., Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19, 1574–1609 (2009).
  • 8. Asi H., Duchi J. C., Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM J. Optim. 29, 2257–2290 (2019).
  • 9. Duchi J. C., Ruan F., Stochastic methods for composite and weakly convex optimization problems. SIAM J. Optim. 28, 3229–3259 (2018).
  • 10. Martinet B., Régularisation d'inéquations variationnelles par approximations successives. Revue Française d'Informatique et de Recherche Opérationnelle 4, 154–158 (1970).
  • 11. Rockafellar R. T., Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14, 877–898 (1976).
  • 12. Güler O., On the convergence of the proximal point algorithm for convex minimization. SIAM J. Control Optim. 29, 403–419 (1991).
  • 13. Bauschke H. H., Combettes P. L., Convex Analysis and Monotone Operator Theory in Hilbert Spaces (Springer, 2011), vol. 408.
  • 14. Hiriart-Urruty J., Lemaréchal C., Convex Analysis and Minimization Algorithms I & II (Springer, New York, NY, 1993).
  • 15. Bonnans J. F., Shapiro A., Perturbation Analysis of Optimization Problems (Springer, 2000).
  • 16. Widrow B., Hoff M. E., "Adaptive switching circuits" in 1960 IRE WESCON Convention Record (IRE [Institute of Radio Engineers], 1960), pp. 96–104. Reprinted in Neurocomputing, 1988.
  • 17. Sayed A. H., Fundamentals of Adaptive Filtering (John Wiley & Sons, 2003).
  • 18. Shalev-Shwartz S., Zhang T., "Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization" in Proceedings of the 31st International Conference on Machine Learning, Xing E. P., Jebara T., Eds. (PMLR, 2014), vol. 32, pp. 64–72.
  • 19. Lin H., Mairal J., Harchaoui Z., Catalyst acceleration for first-order convex optimization: From theory to practice. J. Mach. Learn. Res. 18, 1–54 (2018).
  • 20. Patrascu A., Necoara I., Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization. J. Mach. Learn. Res. 18, 1–42 (2018).
  • 21. Bertsekas D. P., Incremental proximal methods for large scale convex optimization. Math. Program. Ser. B 129, 163–195 (2011).
  • 22. Fletcher R., A model algorithm for composite nondifferentiable optimization problems. Math. Program. Study 17, 67–76 (1982).
  • 23. Schechtman Y., et al., Phase retrieval with application to optical imaging. IEEE Signal Process. Mag. 32, 87–109 (2015).
  • 24. Duchi J., Ruan F., Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval. Inform. Infer. J. IMA 8, 471–529 (2018).
  • 25. Candes E. J., Recht B., Exact matrix completion via convex optimization. Found. Comput. Math. 9, 717–772 (2008).
  • 26. Duchi J. C., Ruan F., Asymptotic optimality in stochastic optimization. arXiv:1612.05612 (16 December 2016).
  • 27. Polyak B. T., Juditsky A. B., Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30, 838–855 (1992).
  • 28. Nemirovski A., Yudin D., Problem Complexity and Method Efficiency in Optimization (Wiley, 1983).
  • 29. Drusvyatskiy D., Lewis A., Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res. 43, 919–948 (2018).
  • 30. Davis D., Drusvyatskiy D., Kakade S., Lee J. D., Stochastic subgradient method converges on tame functions (Springer, New York, NY, 2019).
  • 31. Ghadimi S., Lan G., Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23, 2341–2368 (2013).
  • 32. LeCun Y., Bengio Y., Hinton G., Deep learning. Nature 521, 436–444 (2015).
  • 33. Belkin M., Hsu D., Mitra P., "Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate" in Advances in Neural Information Processing Systems, Bengio S., Ed. (Curran Associates, Inc., 2018), vol. 31, pp. 2300–2311.
  • 34. Krizhevsky A., "Learning multiple layers of features from tiny images" (Tech. Rep., University of Toronto, Toronto, ON, Canada, 2009).
  • 35. Khosla A., Jayadevaprakash N., Yao B., Li F. F., "Novel dataset for fine-grained image categorization" in First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Pinto N., Ed. (IEEE, Piscataway, NJ, 2011).
  • 36. He K., Zhang X., Ren S., Sun J., "Deep residual learning for image recognition" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Mortensen E., Saenko K., Eds. (IEEE, Piscataway, NJ, 2016), pp. 770–778.
  • 37. Clevert D. A., Unterthiner T., Hochreiter S., "Fast and accurate deep network learning by exponential linear units (ELUs)" in Proceedings of the Fourth International Conference on Learning Representations, Bengio Y., LeCun Y., Eds. (ICLR, 2016).
  • 38. Simonyan K., Zisserman A., "Very deep convolutional networks for large-scale image recognition" in Proceedings of the Third International Conference on Learning Representations, Bengio Y., LeCun Y., Eds. (ICLR, 2015).
  • 39. Deng J., et al., "ImageNet: A large-scale hierarchical image database" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Flynn P., Mortensen E., Eds. (IEEE, Piscataway, NJ, 2009), pp. 248–255.
  • 40. Duchi J. C., Hazan E., Singer Y., Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011).
  • 41. Abadi M., et al., "TensorFlow: A system for large-scale machine learning" in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Keeton K., Roscoe T., Eds. (USENIX Association, 2016), pp. 265–283.
  • 42. Asi H., Duchi J., APROX: Robust Stochastic Optimization Algorithms. GitHub. https://github.com/HilalAsi/APROX-Robust-Stochastic-Optimization-Algorithms. Deposited 18 October 2019.
