Significance
Sensitivity of optimization algorithms to problem and algorithmic parameters leads to tremendous waste in time and energy, especially in applications with millions of parameters, such as deep learning. We address this by developing stochastic optimization methods that are demonstrably—both by theory and by experimental evidence—more robust, enjoying optimal convergence guarantees for a variety of stochastic optimization problems. Additionally, we highlight the importance of evaluating a method's sensitivity to problem difficulty and algorithmic parameters.
Keywords: stochastic optimization, large-scale optimization
Abstract
Standard stochastic optimization methods are brittle, sensitive to stepsize choice and other algorithmic parameters, and they exhibit instability outside of well-behaved families of objectives. To address these challenges, we investigate models for stochastic optimization and learning problems that exhibit better robustness to problem families and algorithmic parameters. With appropriately accurate models—which we call the aprox family—stochastic methods can be made stable, provably convergent, and asymptotically optimal; even modeling that the objective is nonnegative is sufficient for this stability. We extend these results beyond convexity to weakly convex objectives, which include compositions of convex losses with smooth functions common in modern machine learning. We highlight the importance of robustness and accurate modeling with experimental evaluation of convergence time and algorithm sensitivity.
A major challenge in stochastic optimization—the algorithmic workhorse for much of modern statistical and machine-learning applications—is setting algorithm parameters (or hyperparameter tuning). This sensitivity causes multiple issues. It results in thousands to millions of wasted engineering and computational hours. It also leads to a lack of clarity in research and development of algorithms: in claiming that one algorithm is better than another, it is unclear whether this is due to judicious choice of dataset or judicious parameter settings or whether the algorithm indeed exhibits new desirable behavior. Consequently, in this paper we pursue 2 main thrusts: First, by using models more accurate than the first-order models common in stochastic gradient methods, we develop families of algorithms that are provably more robust to input parameter choices, with several corresponding optimality properties. Second, we argue for a different type of experimental evidence in evaluating stochastic optimization methods, where one jointly evaluates convergence speed and sensitivity of the methods.
The wasted computational and engineering energy is especially pronounced in deep learning, where engineers use models with millions of parameters, requiring days to weeks to train a single model. To get a sense of this energy use, we consider a few recent papers we view as exemplars of this broader trend: In searching for optimal neural network architectures and hyperparameters, the papers (1–3) used approximately 3,150 graphics processing unit (GPU) days, 22,000 GPU days, and 750,000 central processing unit (CPU) days of computation, respectively. To put this in perspective, assuming standard CPU energy use of between 60 and 100 W, the energy (ignoring network interconnect, monitors, etc.) for the paper (3) is roughly between $4 \times 10^{12}$ and $6.5 \times 10^{12}$ J. At $1.6 \times 10^{9}$ J per tank of gas, this is sufficient to drive 4,000 Toyota Camrys the 380 miles between San Francisco and Los Angeles.
To address these challenges, we develop stochastic optimization procedures that exhibit similar convergence to classical approaches—when the classical approaches have good tuning parameters—while enjoying better robustness, achieving this performance over a range of parameters. We argue too for evaluation of optimization algorithms based not only on convergence time but also on robustness to input choices. Briefly, a fast algorithm that converges for a small range of stepsizes is too brittle; we argue instead for (potentially slightly slower) algorithms that converge for broad ranges of stepsizes and other parameters. Our theory and experiments demonstrate the effectiveness of our methods for applications including phase retrieval, matrix completion, and deep learning.
Problem Setting and Approach
We begin by making our setting concrete. We study the stochastic optimization problem
$$\operatorname*{minimize}_{x \in \mathcal{X}} \; f(x) := \mathbb{E}_P[f(x; S)] = \int_{\mathcal{S}} f(x; s)\, dP(s). \tag{1}$$
In problem 1, the set $\mathcal{S}$ is a sample space, $\mathcal{X} \subset \mathbb{R}^n$ is closed convex, and $f(x; s)$ is the loss the parameter $x$ suffers on the sample $s$. In this paper, we move beyond convex optimization by considering $\rho$-weakly convex functions $f(\cdot; s)$, meaning (cf. refs. 4 and 5) that $x \mapsto f(x; s) + \frac{\rho}{2}\|x\|_2^2$ is convex. We recover convexity when $\rho = 0$. Examples include linear regression, $f(x; (a, b)) = \frac{1}{2}(\langle a, x \rangle - b)^2$, and phase retrieval, $f(x; (a, b)) = |\langle a, x \rangle^2 - b|$, which is $2\|a\|_2^2$-weakly convex.
Most optimization methods iterate by making an approximation—a model—of the objective at the current iterate, minimizing this model and reapproximating. Stochastic (sub)gradient methods (6, 7) instantiate this approach using a linear approximation; following initial work of our own and others (5, 8, 9), we study the modeling approach in more depth for stochastic optimization. Thus, the aprox algorithms we develop iterate as follows: For $k = 0, 1, 2, \ldots$, we draw a random sample $S_k \sim P$ and then update the iterate by minimizing a regularized approximation to $f(\cdot; S_k)$, setting
$$x_{k+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ f_{x_k}(x; S_k) + \frac{1}{2\alpha_k}\|x - x_k\|_2^2 \Big\}. \tag{2}$$
We call $f_{x_k}(\cdot; S_k)$ the model of $f(\cdot; S_k)$ at $x_k$, where the model $f_x(\cdot; s)$ satisfies 3 conditions (cf. refs. 5, 8, and 9):
- C.i) (Model convexity): The function $y \mapsto f_x(y; s)$ is convex and subdifferentiable on $\mathcal{X}$.
- C.ii) (Weak lower bound): The model satisfies $f_x(y; s) \le f(y; s) + \frac{\rho}{2}\|y - x\|_2^2$ for all $y \in \mathcal{X}$.
- C.iii) (Local accuracy): We have $f_x(x; s) = f(x; s)$.

The containment $\partial f_x(x; s) \subset \partial f(x; s)$ of model subgradients in function subgradients is immediate from conditions C.ii and C.iii. We provide examples presently.
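To fix ideas, a minimal sketch of the generic iteration 2 follows (this is our illustration, not code from the article; `sample`, `stepsizes`, and `model_step` are assumed callables we name for exposition):

```python
import numpy as np

def aprox_iterate(x0, sample, model_step, stepsizes, num_iters):
    """Generic model-based (aprox) iteration, Eq. [2].

    model_step(x, s, alpha) must return
        argmin_y { f_x(y; s) + ||y - x||^2 / (2 * alpha) },
    i.e., the minimizer of the chosen model plus the proximal term.
    """
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        s = sample()                  # draw S_k ~ P
        alpha = stepsizes(k)          # e.g., alpha_0 * (k + 1) ** -beta
        x = model_step(x, s, alpha)   # regularized model minimization
    return x
```

Each model below simply supplies a different `model_step`.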
We show that models slightly more accurate than the first-order model used by the stochastic gradient method—sometimes as simple as recognizing that if $f$ is nonnegative, we should truncate the approximation at zero—achieve substantially better theoretical guarantees and practical performance. While the iterates of gradient methods can (superexponentially) diverge for misspecified stepsizes, our methods guarantee the iterates never diverge. Even more, this stability guarantees convergence and, in convex cases, optimal asymptotic normality of the averaged iterates. Finally, we evaluate the performance of our methods, validating our theoretical findings on convergence and robustness for a range of problems, including matrix completion, phase retrieval, and classification with neural networks. We defer proofs to SI Appendix.
In optimization broadly, proximal point methods and their related robust convergence are classical (10–12), and their role in smoothing and Moreau–Yosida regularization is also central in convex and variational analysis (13–15). In signal processing, least-mean squares for adaptive filtering is an important instance of the stochastic proximal point method (16, 17). More recent work in large-scale optimization and machine learning revisits Moreau smoothing and regularization, extending acceleration and stability properties of proximal-point-type methods to finite sum and stochastic problems (18–20).
Notation and Basic Assumptions
For a weakly convex function $h$, we let $\partial h(x)$ denote its Fréchet subdifferential at the point $x$, and $h'(x) \in \partial h(x)$ denotes an arbitrary element of the subdifferential. Throughout, we let $x^\star$ denote a minimizer of problem 1 and $\mathcal{X}^\star$ denote the optimal set for problem 1. We let $\mathcal{F}_k$ denote the $\sigma$-field generated by the first $k$ random variables $S_0, \ldots, S_{k-1}$. Note that $x_k$ is $\mathcal{F}_k$-measurable for all $k$. Unless stated otherwise, we assume that the function $f(\cdot; s)$ is $\rho$-weakly convex for each $s \in \mathcal{S}$. Finally, the following assumption implicitly holds throughout.
Assumption A1.
The set $\mathcal{X}^\star$ is nonempty, and there exists $\sigma^2 < \infty$ such that for each $x^\star \in \mathcal{X}^\star$ and selection $f'(x^\star; s) \in \partial f(x^\star; s)$, we have $\mathbb{E}[\|f'(x^\star; S)\|_2^2] \le \sigma^2$.
Methods
To make our approach more concrete, we identify several models that fit into our framework. These have appeared in refs. 5, 8, and 9, but we believe a self-contained presentation is beneficial. Each one satisfies our conditions C.i to C.iii. The most common model in stochastic optimization is the first-order model.
Stochastic Subgradient Methods.
The stochastic subgradient method uses the model
$$f_x(y; s) = f(x; s) + \langle f'(x; s), y - x \rangle, \qquad f'(x; s) \in \partial f(x; s). \tag{3}$$
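For intuition, when $\mathcal{X} = \mathbb{R}^n$ the update 2 under this model reduces to the familiar subgradient step; a minimal sketch (our illustration; `subgrad` is an assumed callable returning an element of $\partial f(x; s)$):

```python
def sgm_step(x, s, alpha, subgrad):
    """Update [2] under the linear model [3] when X = R^n:
    minimizing the linear model plus the proximal term gives
    the subgradient step x - alpha * f'(x; s)."""
    return x - alpha * subgrad(x, s)
```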
Proximal Point Methods.
In the convex setting (8, 20, 21), the stochastic proximal point method uses the model $f_x(y; s) = f(y; s)$; in the weakly convex setting, we regularize and use
$$f_x(y; s) = f(y; s) + \frac{\rho}{2}\|y - x\|_2^2. \tag{4}$$
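The proximal subproblem generally requires an inner solve, but for some losses it is closed form. As a sketch (ours, not the article's code), for the linear regression loss $f(x; (a, b)) = \frac{1}{2}(\langle a, x \rangle - b)^2$ (here $\rho = 0$):

```python
import numpy as np

def prox_point_step_linreg(x, a, b, alpha):
    """Exact stochastic proximal point step for f(x;(a,b)) = 0.5*(<a,x> - b)^2.

    Solves argmin_y { 0.5*(<a,y> - b)^2 + ||y - x||^2 / (2*alpha) },
    whose solution is the damped residual step below (cf. least-mean squares).
    """
    r = a @ x - b                                 # residual at the current iterate
    return x - (alpha * r / (1.0 + alpha * (a @ a))) * a

# usage: x = prox_point_step_linreg(np.zeros(3), np.array([1., 2., 0.]), 1.0, 0.5)
```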
Other models require less knowledge than the proximal model 4 but preserve structural properties of the original function.
Prox-Linear Model.
Let the function have the composite structure $f(x; s) = h(c(x; s); s)$, where $h(\cdot; s)$ is convex and $c(\cdot; s)$ is smooth. The stochastic prox-linear method applies $h$ to a first-order approximation of $c$, using
$$f_x(y; s) = h\big(c(x; s) + \nabla c(x; s)^\top (y - x); s\big). \tag{5}$$
In the nonstochastic setting, these models are classical (22), while recent work establishes convergence and convergence rates in restrictive stochastic settings (5, 9). When $h(\cdot; s)$ is $M$-Lipschitz and $c(\cdot; s)$ has an $L$-Lipschitz gradient, then $f(\cdot; s)$ is $ML$-weakly convex.
Example 1 (phase retrieval):
In phase retrieval (23), we wish to recover an object $x^\star \in \mathbb{C}^n$ from a diffraction pattern $Ax^\star$, where $A \in \mathbb{C}^{m \times n}$ has rows $a_i$, but physical sensor limitations mean we observe only amplitudes $b_i = |\langle a_i, x^\star \rangle|^2$. A natural objective is
$$f(x) = \frac{1}{m} \sum_{i=1}^m \big| |\langle a_i, x \rangle|^2 - b_i \big|.$$
This is the composition of $h(t) = |t|$ and the smooth function $c(x; (a, b)) = |\langle a, x \rangle|^2 - b$, so $f(\cdot; (a, b))$ is $2\|a\|_2^2$-weakly convex (24).
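As a concrete illustration (our sketch, for the real-valued case), the single-sample loss and one of its subgradients are simple to compute:

```python
import numpy as np

def pr_loss(x, a, b):
    """Single-sample phase retrieval loss f(x; (a, b)) = |<a, x>^2 - b|."""
    return abs((a @ x) ** 2 - b)

def pr_subgrad(x, a, b):
    """A subgradient: sign(<a, x>^2 - b) * 2<a, x> * a (zero at an exact fit)."""
    ax = a @ x
    return np.sign(ax ** 2 - b) * 2.0 * ax * a
```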
Example 2 (matrix completion):
In the matrix completion problem (25), which arises (for example) in the design of recommendation systems, we have a matrix $M \in \mathbb{R}^{m \times n}$ with decomposition $M = U^\star V^{\star\top}$ for $U^\star \in \mathbb{R}^{m \times r^\star}$ and $V^\star \in \mathbb{R}^{n \times r^\star}$. Based on the incomplete set of known entries $\Omega \subset \{1, \ldots, m\} \times \{1, \ldots, n\}$, our goal is to recover the matrix $M$, giving rise to the objective
$$f(U, V) = \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} |\langle u_i, v_j \rangle - M_{ij}|,$$
where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$, and $u_i$, $v_j$ are the rows of $U$ and $V$. This is the composition of $h(t) = |t|$ and the bilinear $c(U, V; (i, j)) = \langle u_i, v_j \rangle - M_{ij}$, so that $f(\cdot; (i, j))$ is 1-weakly convex.
Truncated Models.
The prox-linear model 5 may be challenging to implement for complex compositions (e.g., deep learning). If instead we know a lower bound $f_{\mathrm{lb}}(s) \le \inf_{x \in \mathcal{X}} f(x; s)$ on the function, we may incorporate this into the model
$$f_x(y; s) = \max\Big\{ f(x; s) + \langle f'(x; s), y - x \rangle, \; f_{\mathrm{lb}}(s) \Big\}. \tag{6}$$
In our examples—linear and logistic regression, phase retrieval, and matrix completion (more generally, typical loss functions in machine learning)—we have $f_{\mathrm{lb}}(s) = 0$. The assumption that we have a lower bound is thus rarely restrictive. This model satisfies the conditions C.i to C.iii, also satisfying the following condition.
- C.iv) (Lower optimality): For all $x \in \mathcal{X}$ and $s \in \mathcal{S}$, $\inf_{y \in \mathcal{X}} f_x(y; s) \ge \inf_{y \in \mathcal{X}} f(y; s)$.
As we show, condition C.iv is sufficient to derive several optimality and stability properties.
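When $\mathcal{X} = \mathbb{R}^n$, the update 2 with the truncated model 6 has a simple closed form (cf. ref. 8): a subgradient step whose length is clipped so that the linear approximation never drops below the known lower bound. A minimal sketch under these assumptions (`loss` and `subgrad` are illustrative callables):

```python
def truncated_step(x, s, alpha, loss, subgrad, lb=0.0):
    """Update [2] with the truncated model [6] on X = R^n:
    x+ = x - min(alpha, (f(x; s) - lb) / ||g||^2) * g for g in df(x; s),
    a subgradient step clipped so the model never predicts below lb."""
    g = subgrad(x, s)
    gsq = float(g @ g)
    if gsq == 0.0:                 # x already minimizes this sample's model
        return x
    return x - min(alpha, (loss(x, s) - lb) / gsq) * g
```

The step thus costs one loss and one (sub)gradient evaluation, the same as a stochastic gradient step.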
Stability and Its Consequences
In our initial study of stability in optimization (8), we defined an algorithm as stable if its iterates remain bounded and then showed several consequences of this in convex optimization (which we review presently). Here, we develop 2 important extensions. First, we show that any model satisfying condition C.iv has stable iterates under mild assumptions, in strong contrast to models (e.g., linear) that fail the condition. Second, we develop an analogous stability theory for weakly convex functions, proving that accurate enough models are stable. In parallel to the convex case, stability suffices for more: It implies convergence (with an asymptotic rate) to stationary points for any model-based method on weakly convex functions. Let us formalize stability (8). A pair $(\mathcal{P}, \mathcal{F})$ is a collection of problems if $\mathcal{P}$ consists of probability measures on a sample space $\mathcal{S}$ and $\mathcal{F}$ of functions $f : \mathbb{R}^n \times \mathcal{S} \to \mathbb{R}$.
Definition 1.
An algorithm generating iterates $x_k$ according to the model-based update 2 is stable in probability for the class of problems $(\mathcal{P}, \mathcal{F})$ if for all $P \in \mathcal{P}$ and $f \in \mathcal{F}$, defining $f(x) = \mathbb{E}_P[f(x; S)]$, and $\mathcal{X}^\star = \operatorname{argmin}_{x \in \mathcal{X}} f(x)$,
$$\sup_k \operatorname{dist}(x_k, \mathcal{X}^\star) < \infty \quad \text{with probability 1.} \tag{7}$$
Typically, stability 7 requires the standard stepsize assumptions
$$\alpha_k > 0, \qquad \sum_k \alpha_k = \infty, \qquad \sum_k \alpha_k^2 < \infty. \tag{8}$$
Even under these, models such as the linear model 3 and consequent subgradient method are unstable (ref. 8, section 3). They may even cause superexponential divergence.
Example 3 (divergence):
Let $\mathcal{X} = \mathbb{R}$, $f(x) = \frac{1}{4}x^4$, and $\alpha_k = \alpha_0 k^{-\beta}$ for $\beta \in (1/2, 1]$, and let $x_0$ satisfy $\alpha_0 |x_0| \ge 2$ and $|x_0| \ge 2$. Let $x_k$ be generated by the gradient method, $x_{k+1} = x_k - \alpha_k x_k^3$. For large $|x_0|$, $|x_k| \ge |x_0|^{2^k}$ for all $k$.
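A tiny numerical illustration of this contrast on the quartic above (our sketch; the initial point and stepsize power are arbitrary choices consistent with the example):

```python
import numpy as np

def run(step, x0=3.0, alpha0=1.0, beta=0.6, iters=10):
    x = np.float64(x0)
    for k in range(1, iters + 1):
        x = step(x, alpha0 * k ** -beta)   # alpha_k = alpha_0 * k**-beta
    return x

def grad_step(x, a):      # stochastic gradient step on f(x) = x**4 / 4
    return x - a * x ** 3

def trunc_step(x, a):     # truncated-model step with lower bound 0
    g = x ** 3
    if g == 0.0:
        return x
    return x - min(a, (x ** 4 / 4) / g ** 2) * g

print(run(grad_step))     # blows up: inf/nan within a few iterations
print(run(trunc_step))    # stays bounded and decreases toward 0
```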
The Importance of Stability in Stochastic Convex Optimization.
To set the stage for what follows, we begin by motivating the importance of stable procedures. Briefly, any stable aprox model converges for any convex function under weak assumptions, which we now elucidate. First, we make an assumption.
Assumption A2.
There exists a function $\mathsf{G} : \mathbb{R}_+ \to \mathbb{R}_+$ such that for all $x \in \mathcal{X}$ and each measurable selection $f'(x; s) \in \partial f(x; s)$, $\mathbb{E}[\|f'(x; S)\|_2^2] \le \mathsf{G}(\|x\|_2)$.
Assumption A2 is equivalent to assuming $x \mapsto \mathbb{E}[\|f'(x; S)\|_2^2]$ is bounded on compact sets; it allows arbitrary growth of the objective as long as the subgradients have second moments.
Proposition 1 [Asi and Duchi (8), proposition 1].
Assume that $f(\cdot; s)$ is convex for each $s \in \mathcal{S}$ and let Assumption A2 hold. Let the iterates $x_k$ be generated by any method satisfying conditions C.i to C.iii and 8. On the event $\sup_k \|x_k\|_2 < \infty$, $f(x_k) \to f(x^\star)$ and $\operatorname{dist}(x_k, \mathcal{X}^\star) \to 0$.
Proposition 1 establishes convergence of stable procedures and also (via Jensen's inequality) provides asymptotic rates of convergence for weighted averages $\bar{x}_k = \big(\sum_{i \le k} \alpha_i\big)^{-1} \sum_{i \le k} \alpha_i x_i$.
Stability is additionally important when the functions are smooth: Any stable aprox method achieves asymptotically optimal convergence. In particular, let us assume $f$ is $\mathcal{C}^2$ near $x^\star$ with $\nabla^2 f(x^\star) \succ 0$, and the functions $f(\cdot; s)$ have an $L(s)$-Lipschitz gradient near $x^\star$ with $\mathbb{E}[L(S)^2] < \infty$.
Proposition 2 [Asi and Duchi (8), theorem 2].
In addition to the conditions of Proposition 1, let the conditions of the previous paragraph hold. Then the average $\bar{x}_k = \frac{1}{k}\sum_{i=1}^k x_i$ satisfies
$$\sqrt{k}\,(\bar{x}_k - x^\star) \stackrel{d}{\to} \mathsf{N}\Big(0, \; \nabla^2 f(x^\star)^{-1}\, \mathrm{Cov}\big(\nabla f(x^\star; S)\big)\, \nabla^2 f(x^\star)^{-1}\Big)$$
on the event $\sup_k \|x_k\|_2 < \infty$.
This convergence is optimal: no method can achieve a better asymptotic covariance (26).
Stability of Lower-Bounded Models for Convex Functions.
With these consequences of stability in hand—convergence and asymptotic optimality—it behooves us to provide conditions sufficient to guarantee stability. To that end, we show that lower-bounded models satisfying condition C.iv are stable in probability (Definition 1) for functions whose (sub)gradients grow at most polynomially. We begin with an assumption.
Assumption A3.
There exist $C < \infty$ and $p \in \mathbb{N}$ such that
$$\mathbb{E}\big[\|f'(x; S)\|_2^2\big] \le C\big(1 + \|x - x^\star\|_2^{p}\big)$$
and $\mathbb{E}\big[(f(x; S) - f(x^\star; S))^2\big] \le C\big(1 + \|x - x^\star\|_2^{p}\big)$ for all $x \in \mathcal{X}$.
The analogous condition (27) for stochastic gradient methods holds for $p = 2$, or quadratic growth, without which the method may diverge. In contrast, Assumption A3 allows polynomial growth of arbitrary degree; for example, the function $f(x) = \|x\|_2^4$ is permissible, while the gradient method may exponentially diverge on such objectives even for stepsizes $\alpha_k \propto k^{-1}$. The key consequence of Assumption A3 is that if it holds, truncated models are stable:
Theorem 1.
Assume the function $f(\cdot; s)$ is convex for each $s \in \mathcal{S}$. Let Assumption A3 hold and $\alpha_k = \alpha_0 k^{-\beta}$ with $\beta \in (1/2, 1)$. Let $x_k$ be generated by the iteration 2 with a model satisfying conditions C.i to C.iv. Then
$$\sup_k \operatorname{dist}(x_k, \mathcal{X}^\star) < \infty \quad \text{with probability 1.}$$
Theorem 1 shows that truncated methods enjoy the benefits of stability we outline in Propositions 1 and 2 above. Thus, these models, whose updates are typically as cheap to compute as a stochastic gradient step (especially in the common case that $f_{\mathrm{lb}}(s) = 0$), provide substantial advantage over methods using only (sub)gradient approximations.
Stability and Its Consequences for Weakly Convex Functions.
We continue our argument that—if possible—it is beneficial to use more accurate models, even in situations beyond convexity, investigating the stability of proximal models (Eq. 4) for weakly convex functions. Establishing stability in the weakly convex case requires a different approach than in the convex case, as the iterates may not make progress toward a fixed optimal set. In this case, to show stability, we require an assumption bounding the size of the stochastic subgradients $f'(x; S)$ relative to the population subgradient $f'(x)$.
Assumption A4.
There exist $C_1, C_2 < \infty$ such that for all measurable selections $f'(x; s) \in \partial f(x; s)$ and all $x \in \mathcal{X}$,
$$\mathbb{E}\big[\|f'(x; S)\|_2^2\big] \le C_1 + C_2 \operatorname{dist}\big(0, \partial f(x)\big)^2.$$
By providing a relative noise condition on the subgradients, Assumption A4 allows for more than the typical class of functions with global Lipschitz properties (cf. ref. 5), such as the phase retrieval and matrix completion objectives (Examples 1 and 2). It can allow exponential growth, addressing the challenges in Example 3. For example, let $\mathcal{X} = \mathbb{R}$ and $f(x; s) = s e^x$, where $S$ is uniform in $[1, 2]$ so that $f(x) = \frac{3}{2} e^x$; then $\mathbb{E}[f'(x; S)^2] = \frac{7}{3} e^{2x} \le \frac{28}{27} f'(x)^2$.
To describe convergence and stability in nonconvex (even nonsmooth) settings, we require appropriate definitions. Finding global minima of nonconvex functions is computationally infeasible (28), so we follow established practice and consider convergence to stationary points via the Moreau envelope (5, 29). To formalize, for $\lambda > 0$ and a function $f$, the Moreau envelope and associated proximal map are
$$f_\lambda(x) := \min_y \Big\{ f(y) + \frac{1}{2\lambda}\|y - x\|_2^2 \Big\}, \qquad \operatorname{prox}_{\lambda f}(x) := \operatorname*{argmin}_y \Big\{ f(y) + \frac{1}{2\lambda}\|y - x\|_2^2 \Big\}.$$
For $\lambda < \rho^{-1}$, the minimizer $\hat{x} = \operatorname{prox}_{\lambda f}(x)$ is unique whenever $f$ is $\rho$-weakly convex. Adopting the techniques Davis and Drusvyatskiy (5) pioneer for weakly convex problems, we rely on the Moreau envelope's connections to (near) stationarity: $f_\lambda$ is differentiable, and
$$f(\hat{x}) \le f(x), \qquad \|x - \hat{x}\|_2 = \lambda \|\nabla f_\lambda(x)\|_2, \qquad \operatorname{dist}\big(0, \partial f(\hat{x})\big) \le \|\nabla f_\lambda(x)\|_2. \tag{9}$$
The 3 properties in 9 imply that any nearly stationary point of $f_\lambda$—a point $x$ where $\|\nabla f_\lambda(x)\|_2$ is small—is close to a nearly stationary point of $f$. To prove convergence for weakly convex $f$, then, it suffices to show $\|\nabla f_\lambda(x_k)\|_2 \to 0$.
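This near-stationarity measure is easy to compute numerically for simple functions. The following sketch is ours (it assumes scipy's scalar minimizer as a convenience and uses the 2-weakly convex $f(x) = |x^2 - 1|$):

```python
from scipy.optimize import minimize_scalar

def moreau_grad_norm(f, x, lam):
    """|grad f_lam(x)| = |x - prox_{lam f}(x)| / lam for scalar x."""
    obj = lambda y: f(y) + (y - x) ** 2 / (2 * lam)
    xhat = minimize_scalar(obj).x        # prox_{lam f}(x); unique for lam < 1/rho
    return abs(x - xhat) / lam

f = lambda x: abs(x ** 2 - 1)            # 2-weakly convex, minimized at x = +/- 1
print(moreau_grad_norm(f, 1.3, 0.1))     # positive away from stationary points
print(moreau_grad_norm(f, 1.0, 0.1))     # ~0 at the minimizer x = 1
```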
Using full proximal models guarantees convergence.
Theorem 2.
Let Assumption A4 hold, let $\lambda$ satisfy $0 < \lambda < \rho^{-1}$, and assume $\inf_x f(x) > -\infty$ and $\mathcal{X} = \mathbb{R}^n$. Let $x_k$ follow the iteration 2 with proximal model 4 and stepsizes 8. Then there exists a random variable $V < \infty$ satisfying
$$f_\lambda(x_k) \to V \quad \text{with probability 1.}$$
Theorem 2 shows that $f_\lambda(x_k)$ is bounded almost surely. Thus, if $f$ is coercive, meaning $f(x) \to \infty$ as $\|x\|_2 \to \infty$, the Moreau envelope $f_\lambda$ is coercive, yielding the following.
Corollary 1.
Let the conditions of Theorem 2 hold and let $f$ be coercive. Then
$$\sup_k \|x_k\|_2 < \infty \quad \text{with probability 1.}$$
In parallel with our development of the convex case, stability is sufficient to develop convergence results for any aprox method, highlighting its importance. Indeed, we can show that stable methods guarantee convergence, although for probability 1 convergence of the iterates, we require a slightly more elaborate assumption (cf. refs. 9 and 30), which rules out pathological limits.
Assumption A5 (Weak Sard).
Let $\mathcal{X}^{\mathrm{stat}}$ be the collection of stationary points of $f$ over $\mathcal{X}$. The Lebesgue measure of the image $f(\mathcal{X}^{\mathrm{stat}})$ is zero.
Under this assumption, aprox methods converge to stationary points whenever the iterates are stable.
Proposition 3.
Let Assumption A2 hold and the iterates $x_k$ be generated by any method satisfying conditions C.i to C.iii and 8. Assume that $\lambda^{-1}$ is large enough that $\lambda^{-1} > \rho$. There exists a finite random variable $V$ such that on the event that $\sup_k \|x_k\|_2 < \infty$, with probability 1 we have
$$f_\lambda(x_k) \to V \quad \text{and} \quad \sum_k \alpha_k \|\nabla f_\lambda(x_k)\|_2^2 < \infty. \tag{10}$$
Under Assumption A5, then $\|\nabla f_\lambda(x_k)\|_2 \to 0$ and $\operatorname{dist}(x_k, \mathcal{X}^{\mathrm{stat}}) \to 0$.
The condition 10 is enough to develop a conditional $O(1/\sqrt{k})$-convergence guarantee similar to what stochastic (sub)gradient methods achieve to stationary points for Lipschitz $f$ (5, 31). Indeed, assume $\sup_k \|x_k\|_2 \le B$ for some $B < \infty$ and that the iterates are stable; choose an index $K \in \{1, \ldots, k\}$ with probability $\alpha_K / \sum_{i \le k} \alpha_i$. Then inequality 10 shows
$$\mathbb{E}\big[\|\nabla f_\lambda(x_K)\|_2^2\big] \le \frac{C_B}{\sum_{i \le k} \alpha_i} \to 0 \quad \text{as } k \to \infty.$$
Fast Convergence for Easy Problems
In many engineering and learning applications, solutions interpolate the data. Consider, for example, signal recovery problems with noiseless measurements or modern machine-learning applications, where frequently the training error is zero (32, 33). We consider such problems here, showing how models that satisfy the lower-bound condition C.iv enjoy linear convergence, extending our earlier results (8) beyond convex optimization.
Definition 2.
Let $\mathcal{X}^\star$ be nonempty. Then $f$ is easy to optimize if for each $x^\star \in \mathcal{X}^\star$ and almost all $s \in \mathcal{S}$,
$$\inf_{x \in \mathcal{X}} f(x; s) = f(x^\star; s).$$
For such problems, we can guarantee progress toward minimizers for appropriate models, as the following lemma shows.
Lemma 1.
Let $f$ be easy to optimize (Definition 2). Let $x_k$ be generated by the updates 2 using a model satisfying conditions C.i to C.iv. Then for any $x^\star \in \mathcal{X}^\star$,
$$\|x_{k+1} - x^\star\|_2^2 \le (1 + \rho \alpha_k)\|x_k - x^\star\|_2^2 - \min\left\{ \alpha_k \big(f(x_k; S_k) - f(x^\star; S_k)\big), \; \frac{\big(f(x_k; S_k) - f(x^\star; S_k)\big)^2}{\|f'(x_k; S_k)\|_2^2} \right\}.$$
Lemma 1 allows us to prove fast convergence as long as $f$ grows quickly enough away from $\mathcal{X}^\star$; a sufficient condition for us is a sharp growth condition away from the optimal set $\mathcal{X}^\star$. To meld with Lemma 1, we consider the following:
Assumption A6 (Expected Sharp Growth).
There exist constants $\lambda_0, \lambda_1 > 0$ such that for all $x \in \mathcal{X}$, $x^\star \in \mathcal{X}^\star$, and $\alpha > 0$,
$$\mathbb{E}\left[\min\left\{ \alpha \big(f(x; S) - f(x^\star; S)\big), \; \frac{\big(f(x; S) - f(x^\star; S)\big)^2}{\|f'(x; S)\|_2^2} \right\}\right] \ge \min\Big\{ \lambda_0\, \alpha \operatorname{dist}(x, \mathcal{X}^\star), \; \lambda_1 \operatorname{dist}(x, \mathcal{X}^\star)^2 \Big\}.$$
Assumption A6 is tailored to Lemma 1, so we discuss a few situations where it holds. One sufficient condition is the small-ball condition that there exist $\lambda > 0$ and $p_0 > 0$ such that $P\big(f(x; S) - f(x^\star; S) \ge \lambda \operatorname{dist}(x, \mathcal{X}^\star)\big) \ge p_0$ for all $x \in \mathcal{X}$ and $x^\star \in \mathcal{X}^\star$, coupled with suitably bounded subgradient moments. We can be more explicit:
Example 4 (Example 1 continued):
Consider the (real-valued) phase retrieval problem with objective $f(x; (a, b)) = |\langle a, x \rangle^2 - b|$. Assume the vectors $a_i$ are drawn i.i.d. from a distribution satisfying the small-ball condition $P(|\langle a, u \rangle| \le \epsilon \|u\|_2) \le c_0 \epsilon$ for some $c_0 < \infty$ and any $\epsilon > 0$ and $u \in \mathbb{R}^n$ and additionally that $\mathbb{E}[aa^\top] \succeq \lambda_{\min} I$ and $\mathbb{E}[\|a\|_2^4] < \infty$ for some $\lambda_{\min} > 0$. For a sample of size $m \gtrsim n$, Assumption A6 holds with high probability for the objective $f(x) = \frac{1}{m}\sum_{i=1}^m |\langle a_i, x \rangle^2 - b_i|$ with $\lambda_0 = c\|x^\star\|_2$ and $\lambda_1 = c$, for a numerical constant $c > 0$. The full calculation is in SI Appendix.
The following proposition is our main result in this section, showing lower-bounded models may enjoy linear convergence.
Proposition 4.
Let Assumption A6 hold and $x_k$ be generated by the stochastic iteration 2 using any model satisfying conditions C.i to C.iv, where the stepsizes satisfy $\alpha_k \ge \alpha > 0$ for some constant $\alpha$. If $f$ is $\rho$-weakly convex with $\rho\alpha < \lambda_1$, then for any $x^\star \in \mathcal{X}^\star$ and $\lambda \in (0, \lambda_1 - \rho\alpha)$, there exists a finite random variable $V$ such that
$$\operatorname{dist}(x_k, \mathcal{X}^\star)^2 \le V(1 - \lambda)^k \quad \text{for all } k, \text{ on the event } x_k \to x^\star.$$
When the functions are convex, we have $\rho = 0$, so that Proposition 4 guarantees linear convergence for easy problems. In the case that $\rho > 0$, the result is conditional: If an aprox method converges to one of the sharp minimizers of $f$, then this convergence is linear (i.e., geometrically fast). In the case of phase retrieval, we can guarantee convergence:
Example 5 (Example 4 continued):
Let $A \in \mathbb{R}^{m \times n}$ be a matrix with rows $a_i$ that satisfy the conditions of Example 4. For $m \ge Cn$ where $C$ is a numerical constant, the truncated model 6 requires overall computation time $O(mn \log \frac{1}{\epsilon})$ to achieve an $\epsilon$-accurate solution to phase retrieval, which is the best-known time complexity. See proof in SI Appendix.
Experiments
An important question in the development of any optimization method is its sensitivity to algorithm parameters. Consequently, we conclude by experimentally examining convergence time and robustness of each of our optimization methods. We consider each of the models in this paper: the stochastic gradient method (i.e., the linear model 3), the proximal model 4, the prox-linear model 5, and the (lower) truncated model 6.
We test both convergence time and, dovetailing with our focus in this paper, robustness to stepsize for several problems: phase retrieval, matrix completion, and 2 classification problems using deep learning. We consider stepsize sequences of the form $\alpha_k = \alpha_0 k^{-\beta}$ and perform iterations over a wide range of different initial stepsizes $\alpha_0$. (For brevity, we present results only for a single power $\beta$; experiments with varied $\beta$ were similar.) For a fixed accuracy $\epsilon$, we record the number of steps $k$ required to achieve $f(x_k) - \inf_x f(x) \le \epsilon$, reporting these times (where we terminate each run at a fixed maximum iteration). We perform multiple experiments for each initial stepsize choice, reporting median time to accuracy and 90% confidence intervals.
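The following sketch outlines this protocol (our illustration; `step`, `init`, and `gap` are assumed callables, and the accuracy and budget values are placeholders):

```python
import numpy as np

def time_to_accuracy(step, init, gap, alpha0, beta=0.6,
                     eps=1e-2, max_iters=10**5):
    """Iterations until gap(x_k) <= eps with stepsizes alpha0 * k**-beta."""
    x = init()
    for k in range(1, max_iters + 1):
        x = step(x, alpha0 * k ** -beta)   # one stochastic aprox update
        if gap(x) <= eps:
            return k
    return max_iters                       # censor runs at the budget

def sensitivity_profile(step, init, gap, alphas, trials=20):
    """Median time-to-accuracy and a 90% interval, per initial stepsize."""
    profile = {}
    for a in alphas:
        t = [time_to_accuracy(step, init, gap, a) for _ in range(trials)]
        profile[a] = (np.median(t), np.percentile(t, 5), np.percentile(t, 95))
    return profile
```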
Phase Retrieval.
We start our experiments with the phase retrieval problem in Examples 1 and 4, focusing on the real case for simplicity, where we are given $A \in \mathbb{R}^{m \times n}$ with rows $a_i$ and $b_i = \langle a_i, x^\star \rangle^2$ for some $x^\star \in \mathbb{R}^n$. Our objective is the nonconvex and nonsmooth function
$$f(x) = \frac{1}{m} \sum_{i=1}^m |\langle a_i, x \rangle^2 - b_i|.$$
We sample the entries of the vectors $a_i$ and of $x^\star$ i.i.d. $\mathsf{N}(0, 1)$.
We present the results in Fig. 1, comparing the stochastic gradient method 3, the proximal method 4, and the truncated method 6 (whose updates are identical to the prox-linear model 5 in this case). The plots demonstrate the expected result that the stochastic gradient method has good performance only in a narrow range of stepsizes, while the better approximations of the aprox family yield convergence over a large range of stepsizes. The truncated model 6 exhibits oscillation for large stepsizes, in contrast to the exact model 4.
Fig. 1.
The number of iterations to achieve accuracy $\epsilon$ versus initial stepsize $\alpha_0$ for phase retrieval. SGM, stochastic gradient method.
Matrix Completion.
For our second experiment, we investigate aprox procedures for the matrix completion problem of Example 2. In this setting, we are given $M = U^\star V^{\star\top}$, for $U^\star \in \mathbb{R}^{m \times r^\star}$ and $V^\star \in \mathbb{R}^{n \times r^\star}$, and a set of indexes $\Omega \subset \{1, \ldots, m\} \times \{1, \ldots, n\}$. We aim to recover $M$ observing only the entries $\{M_{ij}\}_{(i,j) \in \Omega}$, so our goal is to
$$\operatorname*{minimize}_{U \in \mathbb{R}^{m \times r},\, V \in \mathbb{R}^{n \times r}} \; f(U, V) = \frac{1}{|\Omega|} \sum_{(i,j) \in \Omega} |\langle u_i, v_j \rangle - M_{ij}|.$$
We optimize over matrices $U$ and $V$, where the estimated rank $r \ge r^\star$. We generate $U^\star$ and $V^\star$ by drawing their entries i.i.d. $\mathsf{N}(0, 1)$, choosing $\Omega$ uniformly at random among index sets of a fixed size. We present the timing results in Fig. 2, which tells a similar story to Fig. 1: Better approximations, such as the truncated models (which again yield identical updates to the prox-linear models 5), are significantly more robust to stepsize specification. The proximal update requires solving a nontrivial quartic, so we omit it.
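For reference, a sketch of one stochastic truncated-model step for this objective (ours, not the article's code; it samples a single index $(i, j) \in \Omega$ per step and uses lower bound 0):

```python
import numpy as np

def mc_truncated_step(U, V, M, i, j, alpha):
    """One truncated-model step on f(U, V; (i, j)) = |<u_i, v_j> - M_ij|."""
    ui, vj = U[i].copy(), V[j].copy()
    r = ui @ vj - M[i, j]                  # signed residual
    if r == 0.0:
        return U, V                        # this entry is already interpolated
    g_sq = ui @ ui + vj @ vj               # squared subgradient norm
    t = min(alpha, abs(r) / g_sq)          # truncation at lower bound 0
    U[i] -= t * np.sign(r) * vj
    V[j] -= t * np.sign(r) * ui
    return U, V
```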
Fig. 2.
Number of iterations to achieve accuracy $\epsilon$ versus initial stepsize $\alpha_0$ for matrix completion. Shown are results for 2 choices of the estimated rank $r$ (A and B).
Neural Networks.
As one of our main motivations is to address the extraordinary effort—in computational and engineering hours—spent carefully tuning optimization methods, we would be remiss to avoid experiments on deep neural networks. Therefore, in our last set of experiments, we test the performance of our models for training neural networks for classification tasks over the CIFAR10 dataset (34) and the fine-grained 128-class Stanford dogs multiclass recognition task (35). For our CIFAR10 experiment, we use the Resnet18 architecture (36); we replace the rectified linear unit (RELU) activations internal to the architecture with exponentiated linear units (ELUs) (37) so that the loss is of composite form $f(x; s) = h(c(x; s); s)$ for $h$ convex and $c$ smooth. For Stanford dogs we use the VGG16 architecture (38) pretrained on Imagenet (39), again substituting ELUs for RELU activations. For this experiment, we also test a modified version of the truncated method, truncadagrad, which uses the truncated model in iteration 2 and a diagonally scaled Euclidean distance (40), updating at iteration $k$ by setting $x_{k+1}$ to minimize
$$f_{x_k}(x; S_k) + \frac{1}{2\alpha_k}(x - x_k)^\top H_k (x - x_k),$$
where $H_k = \operatorname{diag}\big(\sum_{i \le k} g_i g_i^\top\big)^{1/2}$ for $g_i = f'(x_i; S_i)$. This update requires no more of standard deep-learning software than computing a gradient (backpropagation) and loss. We also compare to ADAM, the default optimizer in TensorFlow (41).
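A numpy sketch of this update on a flat parameter vector (our simplification; in a deep-learning framework the loss value and gradient g come from a forward and backward pass, and eps is a numerical-stability constant we add):

```python
import numpy as np

class TruncAdagrad:
    """Sketch of truncadagrad: truncated model + diagonal AdaGrad metric."""

    def __init__(self, dim, alpha, eps=1e-8, lb=0.0):
        self.alpha, self.eps, self.lb = alpha, eps, lb
        self.accum = np.zeros(dim)          # running sum of squared gradients

    def step(self, x, loss, g):
        """One update: x - min(alpha, (loss - lb) / g^T H^{-1} g) * H^{-1} g."""
        self.accum += g * g
        h = np.sqrt(self.accum) + self.eps  # diagonal of H_k
        hinv_g = g / h
        ghg = float(g @ hinv_g)             # g^T H^{-1} g
        if ghg == 0.0:
            return x
        return x - min(self.alpha, (loss - self.lb) / ghg) * hinv_g
```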
Figs. 3 and 4 show our results for the CIFAR10 and Stanford dogs datasets, respectively. Figs. 3A and 4A give the number of iterations required to achieve a fixed test-classification error (on the highest or “top-1” predicted class), while Figs. 3B and 4B show the maximal accuracy each procedure achieves for a given initial stepsize $\alpha_0$. The plots demonstrate the sensitivity of the standard stochastic gradient method to stepsize choice: it converges only for a small range of stepsizes in both experiments. ADAM exhibits better robustness for CIFAR10, while it is extremely sensitive in the second experiment (Fig. 4), converging only for a small range of stepsizes—this difference in sensitivities highlights the importance of robustness. In contrast, our procedures using the truncated model are apparently robust for all large enough stepsizes. Figs. 3B and 4B show additionally that the maximal accuracy the 2 truncated methods achieve changes only slightly across a wide range of $\alpha_0$, again in strong contrast to the other methods, which achieve their best accuracy only for a small range of stepsizes.
Fig. 3.
(A) The number of iterations to achieve a fixed test error versus initial stepsize $\alpha_0$ for CIFAR10. (B) The best accuracy achieved within the epoch budget.
Fig. 4.
(A) The number of iterations to achieve a fixed test error versus initial stepsize $\alpha_0$ for the Stanford dogs dataset. (B) The best accuracy achieved within the epoch budget.
These results reaffirm the insights from our theoretical results and experiments: It is important and possible to develop methods that enjoy good convergence guarantees and are robust to algorithm parameters.
Data Availability.
All data discussed in this paper are available at GitHub (https://github.com/HilalAsi/APROX-Robust-Stochastic-Optimization-Algorithms) (42).
Acknowledgments
H.A. and J.C.D. were supported by National Science Foundation (NSF)-CAREER Award CCF-1553086, Office of Naval Research Young Investigator Program Award N00014-19-2288, and the Stanford DAWN Consortium.
Footnotes
The authors declare no competing interest.
This article is a PNAS Direct Submission.
Data deposition: Data and code for this work have been deposited in GitHub (https://github.com/HilalAsi/APROX-Robust-Stochastic-Optimization-Algorithms).
This article contains supporting information online at www.pnas.org/lookup/suppl/doi:10.1073/pnas.1908018116/-/DCSupplemental.
References
- 1. Real E., Aggarwal A., Huang Y., Le Q. V., "Regularized evolution for image classifier architecture search" in Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, Stone P., Ed. (AAAI Press, Palo Alto, CA, 2019), vol. 33, pp. 4780–4789.
- 2. Zoph B., Le Q. V., "Neural architecture search with reinforcement learning" in Proceedings of the Fifth International Conference on Learning Representations, Bengio Y., LeCun Y., Eds. (ICLR, 2017).
- 3. Collins J., Sohl-Dickstein J., Sussillo D., Capacity and trainability in recurrent neural networks. arXiv:1611.09913 [stat.ML] (29 November 2016).
- 4. Rockafellar R. T., Wets R. J. B., Variational Analysis (Springer, New York, NY, 1998).
- 5. Davis D., Drusvyatskiy D., Stochastic model-based minimization of weakly convex functions. SIAM J. Optim. 29, 207–239 (2019).
- 6. Robbins H., Monro S., A stochastic approximation method. Ann. Math. Stat. 22, 400–407 (1951).
- 7. Nemirovski A., Juditsky A., Lan G., Shapiro A., Robust stochastic approximation approach to stochastic programming. SIAM J. Optim. 19, 1574–1609 (2009).
- 8. Asi H., Duchi J. C., Stochastic (approximate) proximal point methods: Convergence, optimality, and adaptivity. SIAM J. Optim. 29, 2257–2290 (2019).
- 9. Duchi J. C., Ruan F., Stochastic methods for composite and weakly convex optimization problems. SIAM J. Optim. 28, 3229–3259 (2018).
- 10. Martinet B., Régularisation d'inéquations variationnelles par approximations successives. Revue Française d'Informatique et de Recherche Opérationnelle 4, 154–158 (1970).
- 11. Rockafellar R. T., Monotone operators and the proximal point algorithm. SIAM J. Control Optim. 14, 877–898 (1976).
- 12. Güler O., On the convergence of the proximal point algorithm for convex minimization. SIAM J. Control Optim. 29, 403–419 (1991).
- 13. Bauschke H. H., Combettes P. L., Convex Analysis and Monotone Operator Theory in Hilbert Spaces (Springer, 2011), vol. 408.
- 14. Hiriart-Urruty J., Lemaréchal C., Convex Analysis and Minimization Algorithms I & II (Springer, New York, NY, 1993).
- 15. Bonnans J. F., Shapiro A., Perturbation Analysis of Optimization Problems (Springer, 2000).
- 16. Widrow B., Hoff M. E., "Adaptive switching circuits" in 1960 IRE WESCON Convention Record (IRE [Institute of Radio Engineers], 1960), pp. 96–104. Reprinted in Neurocomputing, 1988.
- 17. Sayed A. H., Fundamentals of Adaptive Filtering (John Wiley & Sons, 2003).
- 18. Shalev-Shwartz S., Zhang T., "Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization" in Proceedings of the 31st International Conference on Machine Learning, Xing E. P., Jebara T., Eds. (PMLR, 2014), vol. 32, pp. 64–72.
- 19. Lin H., Mairal J., Harchaoui Z., Catalyst acceleration for first-order convex optimization: From theory to practice. J. Mach. Learn. Res. 18, 1–54 (2018).
- 20. Patrascu A., Necoara I., Nonasymptotic convergence of stochastic proximal point algorithms for constrained convex optimization. J. Mach. Learn. Res. 18, 1–42 (2018).
- 21. Bertsekas D. P., Incremental proximal methods for large scale convex optimization. Math. Program. Ser. B 129, 163–195 (2011).
- 22. Fletcher R., A model algorithm for composite nondifferentiable optimization problems. Math. Program. Study 17, 67–76 (1982).
- 23. Schechtman Y., et al., Phase retrieval with application to optical imaging. IEEE Signal Process. Mag. 32, 87–109 (2015).
- 24. Duchi J., Ruan F., Solving (most) of a set of quadratic equalities: Composite optimization for robust phase retrieval. Inform. Infer. J. IMA 8, 471–529 (2018).
- 25. Candes E. J., Recht B., Exact matrix completion via convex optimization. Found. Comput. Math. 9, 717–772 (2008).
- 26. Duchi J. C., Ruan F., Asymptotic optimality in stochastic optimization. arXiv:1612.05612 (16 December 2016).
- 27. Polyak B. T., Juditsky A. B., Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30, 838–855 (1992).
- 28. Nemirovski A., Yudin D., Problem Complexity and Method Efficiency in Optimization (Wiley, 1983).
- 29. Drusvyatskiy D., Lewis A., Error bounds, quadratic growth, and linear convergence of proximal methods. Math. Oper. Res. 43, 919–948 (2018).
- 30. Davis D., Drusvyatskiy D., Kakade S., Lee J. D., Stochastic subgradient method converges on tame functions. Found. Comput. Math. (2019).
- 31. Ghadimi S., Lan G., Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM J. Optim. 23, 2341–2368 (2013).
- 32. LeCun Y., Bengio Y., Hinton G., Deep learning. Nature 521, 436–444 (2015).
- 33. Belkin M., Hsu D., Mitra P., "Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate" in Advances in Neural Information Processing Systems, Bengio S., Ed. (Curran Associates, Inc., 2018), vol. 31, pp. 2300–2311.
- 34. Krizhevsky A., "Learning multiple layers of features from tiny images" (Tech. Rep., University of Toronto, Toronto, ON, Canada, 2009).
- 35. Khosla A., Jayadevaprakash N., Yao B., Li F. F., "Novel dataset for fine-grained image categorization" in First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Pinto N., Ed. (IEEE, Piscataway, NJ, 2011).
- 36. He K., Zhang X., Ren S., Sun J., "Deep residual learning for image recognition" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Mortensen E., Saenko K., Eds. (IEEE, Piscataway, NJ, 2016), pp. 770–778.
- 37. Clevert D. A., Unterthiner T., Hochreiter S., "Fast and accurate deep network learning by exponential linear units (ELUs)" in Proceedings of the Fourth International Conference on Learning Representations, Bengio Y., LeCun Y., Eds. (ICLR, 2016).
- 38. Simonyan K., Zisserman A., "Very deep convolutional networks for large-scale image recognition" in Proceedings of the Third International Conference on Learning Representations, Bengio Y., LeCun Y., Eds. (ICLR, 2015).
- 39. Deng J., et al., "ImageNet: A large-scale hierarchical image database" in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Flynn P., Mortensen E., Eds. (IEEE, Piscataway, NJ, 2009), pp. 248–255.
- 40. Duchi J. C., Hazan E., Singer Y., Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011).
- 41. Abadi M., et al., "TensorFlow: A system for large-scale machine learning" in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), Keeton K., Roscoe T., Eds. (USENIX Association, 2016), pp. 265–283.
- 42. Asi H., Duchi J., APROX: Robust Stochastic Optimization Algorithms. GitHub. https://github.com/HilalAsi/APROX-Robust-Stochastic-Optimization-Algorithms. Deposited 18 October 2019.