Published in final edited form as: Neurocomputing (Amst). 2022 Jan 21;481:333–356. doi: 10.1016/j.neucom.2022.01.014

Calibrating the Adaptive Learning Rate to Improve Convergence of ADAM

Qianqian Tong 1, Guannan Liang 1, Jinbo Bi 1,*

Abstract

Adaptive gradient methods (AGMs) have become popular for optimizing the nonconvex problems that arise in deep learning. We revisit AGMs and identify that the adaptive learning rate (A-LR) used by AGMs varies significantly across the dimensions of the problem over epochs (i.e., has an anisotropic scale), which may lead to issues in convergence and generalization. All existing modified AGMs actually represent efforts to revise the A-LR. Theoretically, we provide a new way to analyze the convergence of AGMs and prove that the convergence rate of Adam also depends on its hyper-parameter є, which has been overlooked previously. Based on these two facts, we propose a new AGM that calibrates the A-LR with an activation (softplus) function, resulting in the Sadam and SAMSGrad methods. We further prove that these algorithms enjoy better convergence speed under the nonconvex, non-strongly convex, and Polyak-Łojasiewicz conditions compared with Adam. Empirical studies support our observation of the anisotropic A-LR and show that the proposed methods outperform existing AGMs and generalize even better than S-Momentum in multiple deep learning tasks.

Keywords: ADAM, Deep learning, Adaptive methods, Stochastic methods

1. Introduction

Many machine learning problems can be formulated as the minimization of an objective function f of the form $\min_{x\in\mathbb{R}^d} f(x)=\frac{1}{n}\sum_{i=1}^{n}f_i(x)$, where both f and the $f_i$ may be nonconvex in deep learning. Stochastic gradient descent (SGD), its variants such as SGD with momentum (S-Momentum) [1, 2, 3, 4], and adaptive gradient methods (AGMs) [5, 6, 7] play important roles in deep learning due to their simplicity and wide applicability. In particular, AGMs often exhibit fast initial progress in training and are easy to implement when solving large scale optimization problems. The updating rule of AGMs can be generally written as:

$x_{t+1}=x_t-\frac{\eta_t}{\sqrt{v_t}}\odot m_t$, (1)

where ⊙ denotes the element-wise product of the first-order momentum $m_t$ and the learning rate (LR) $\eta_t/\sqrt{v_t}$. There is fairly broad agreement on how to compute $m_t$: it is a convex combination of the previous $m_{t-1}$ and the current stochastic gradient $g_t$, i.e., $m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t$, $\beta_1\in[0,1]$. The LR consists of two parts: the base learning rate (B-LR) $\eta_t$ is a scalar which can be constant or decay over iterations (in our convergence analysis, we consider the B-LR as a constant $\eta$); the adaptive learning rate (A-LR) $1/\sqrt{v_t}$ varies adaptively across the dimensions of the problem, where $v_t\in\mathbb{R}^d$ is the second-order momentum, calculated as a combination of previous and current squared stochastic gradients. Unlike the first-order momentum, the formula to estimate the second-order momentum varies across AGMs. As the core technique in AGMs, the A-LR opens a new regime of controlling the LR, and allows the algorithm to move with different step sizes along the search direction at different coordinates.
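For concreteness, a minimal sketch of the generic update (1) is given below (NumPy; the function and variable names are ours, and the method-specific rule for $v_t$ is assumed to be supplied elsewhere):

import numpy as np

def agm_step(x, m, grad, v, lr=1e-3, beta1=0.9):
    # One generic AGM step following Eq. (1).
    # `v` is the second-order momentum produced by the method-specific rule
    # (Adam, AMSGrad, ...) and is assumed to be strictly positive here.
    m = beta1 * m + (1 - beta1) * grad      # first-order momentum (shared by AGMs)
    x = x - lr / np.sqrt(v) * m             # anisotropic, per-coordinate step
    return x, m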

The first known AGM is Adagrad [5], where the second-order momentum is estimated as $v_t=\sum_{i=1}^{t}g_i^2$. It works well in sparse settings, but the A-LR often decays rapidly for dense gradients. To tackle this issue, Adadelta [7], Rmsprop [8], and Adam [6] have been proposed to use exponential moving averages of past squared gradients, i.e., $v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2$, $\beta_2\in[0,1]$, and to calculate the A-LR by $1/(\sqrt{v_t}+\epsilon)$, where $\epsilon>0$ is used in case $v_t$ vanishes to zero. In particular, Adam has become the most popular optimizer in the deep learning area due to its effectiveness in the early training stage. Nevertheless, it has been empirically shown that Adam generalizes worse than S-Momentum to unseen data and leaves a clear generalization gap [9, 10, 11], and it even fails to converge in some cases [12, 13]. AGMs decrease the objective value rapidly in early iterations and then stay at a plateau, whereas SGD and S-Momentum continue to show dips in the training error curves, and thus continue to improve test accuracy over iterations. It is essential to understand what happens to Adam in the later learning process, so we can revise AGMs to enhance their generalization performance.

Recently, a few modified AGMs have been developed, such as AMSGrad [12], Yogi [14], and AdaBound [13]. AMSGrad is the first method to theoretically address the non-convergence issue of Adam, by taking the largest second-order momentum estimated in the past iterations, i.e., $v_t=\max(v_{t-1},\tilde v_t)$ where $\tilde v_t=\beta_2\tilde v_{t-1}+(1-\beta_2)g_t^2$, and it proves convergence in the convex case. The analysis is later extended to other AGMs (such as RMSProp and AMSGrad) in nonconvex settings [15, 16, 17, 18]. Yogi claims that the past $g_t^2$'s are forgotten in a fairly fast manner in Adam and proposes $v_t=v_{t-1}-(1-\beta_2)\,\mathrm{sign}(v_{t-1}-g_t^2)\odot g_t^2$ to adjust the decay rate of the A-LR. However, the parameter є in the A-LR is set to $10^{-3}$ instead of the default $10^{-8}$ of Adam, so ϵ dominates the A-LR in later iterations when $v_t$ becomes small and can be responsible for the performance improvement. This hyper-parameter has rarely been discussed previously, and our analysis shows that the convergence rate is closely related to є, which is further verified in our experiments. PAdam [19, 15] claims that the A-LR in Adam and AMSGrad is "overadapted", and proposes to replace the A-LR formula by $1/(v_t^p+\epsilon)$ where $p\in(0,1/2]$. AdaBound confines the LR to a predefined range by applying $\mathrm{Clip}(\eta/\sqrt{v_t},\eta_l,\eta_r)$, where LR values outside the interval $[\eta_l,\eta_r]$ are clipped to the interval edges. However, a more effective way is to softly and smoothly calibrate the A-LR rather than hard-thresholding it at all coordinates. Our main contributions are summarized as follows (the different second-moment rules are sketched in code right after this list):

  1. We study AGMs from a new perspective: the range of the A-LR. Through experimental studies, we find that the A-LR is always anisotropic. This anisotropy may lead the algorithm to focus on a few dimensions (those with a large A-LR), which may hurt generalization performance. We analyze the existing modified AGMs to help explain how they close the generalization gap.

  2. Theoretically, we are the first to include the hyper-parameter є in the convergence analysis, and we clearly show that the convergence rate is upper bounded by a $1/\epsilon^2$ term, verifying prior empirical observations that є affects the performance of Adam. We provide a new approach to the convergence analysis of AGMs under the nonconvex, non-strongly convex, or Polyak-Łojasiewicz (P-L) condition.

  3. Based on the above two results, we propose to calibrate the A-LR using an activation function; in particular, we implement the softplus function with a hyper-parameter β, which can be combined with any AGM. In this work, we combine it with Adam and AMSGrad to form the Sadam and SAMSGrad methods.

  4. We also provide theoretical guarantees for our methods, which enjoy better convergence speed than Adam and recover the same convergence rate as SGD in terms of the maximum iteration T, i.e., $O(1/\sqrt{T})$ rather than the known result $O(\log T/\sqrt{T})$ in [16]. Empirical evaluations show that our methods noticeably increase test accuracy, and outperform many AGMs and even S-Momentum in multiple deep learning models.
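As referenced above, the second-order momentum rules that distinguish these AGMs can be compared in a few lines (a NumPy sketch; the function name and default constants are ours, and AMSGrad is simplified to track only the running maximum):

import numpy as np

def second_moment_and_a_lr(rule, v_prev, g, beta2=0.999, eps=1e-8, p=0.25):
    # Returns (v_t, A-LR vector) for several AGM variants, for illustration only.
    g2 = g * g
    if rule == "adagrad":                  # v_t = sum of all past squared gradients
        v = v_prev + g2
    elif rule == "yogi":                   # v_t = v_{t-1} - (1-beta2)*sign(v_{t-1}-g_t^2)*g_t^2
        v = v_prev - (1 - beta2) * np.sign(v_prev - g2) * g2
    else:                                  # adam / amsgrad / padam: exponential moving average
        v = beta2 * v_prev + (1 - beta2) * g2
    if rule == "amsgrad":                  # keep the element-wise maximum seen so far
        v = np.maximum(v_prev, v)
    if rule == "padam":                    # "partially adaptive" A-LR, p in (0, 1/2]
        return v, 1.0 / (v ** p + eps)
    return v, 1.0 / (np.sqrt(v) + eps)     # Adam-style A-LR used by the other rules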

2. Preliminaries

Notations.

For any vectors $a,b\in\mathbb{R}^d$, we use $a\odot b$ for the element-wise product, $a^2$ for the element-wise square, $\sqrt{a}$ for the element-wise square root, and $a/b$ for element-wise division; we use $a^k$ to denote the element-wise k-th power and $\|a\|$ to denote the $\ell_2$-norm. We use $\langle a,b\rangle$ to denote the inner product and $\max\{a,b\}$ to compute the element-wise maximum. $e$ is the Euler number, $\log(\cdot)$ denotes the logarithm with base $e$, and $O(\cdot)$ hides constants which do not rely on the problem parameters.

Optimization Terminology.

In the convex setting, the optimality gap, $f(x_t)-f^*$, is examined, where $x_t$ is the iterate at iteration t and $f^*$ is the optimal value attained at $x^*$, assuming that f does have a minimum. When $f(x_t)-f^*\le\delta$, the method is said to reach an optimal solution with δ-accuracy. However, in the study of AGMs, the average regret $\frac{1}{T}\sum_{t=1}^{T}(f(x_t)-f^*)$ (where the maximum iteration number T is pre-specified) is used to approximate the optimality gap to define δ-accuracy. Our analysis moves one step further to examine whether $f\big(\frac{1}{T}\sum_{t=1}^{T}x_t\big)-f^*\le\delta$ by applying Jensen's inequality to the regret.

In the nonconvex setting, finding the global minimum or even a local minimum is NP-hard, so the optimality gap is not examined. Rather, it is common to evaluate whether a first-order stationary point has been achieved [20, 12, 14]. More precisely, we evaluate whether $\mathbb{E}\|\nabla f(x_t)\|^2\le\delta$ (e.g., as in the analysis of SGD [1]). The convergence rate of SGD is $O(1/\sqrt{T})$ in both the non-strongly convex and nonconvex settings. Requiring $O(1/\sqrt{T})\le\delta$ yields the maximum number of iterations $T=O(1/\delta^2)$. Thus, SGD can obtain a δ-accurate solution in $O(1/\delta^2)$ steps in the non-strongly convex and nonconvex settings. Our results recover the rate of SGD and S-Momentum in terms of T.

Assumption 1. The loss fi and the objective f satisfy:

  1. L-smoothness. $\forall x,y\in\mathbb{R}^d$, $\forall i\in\{1,\dots,n\}$, $\|\nabla f_i(x)-\nabla f_i(y)\|\le L\|x-y\|$.

  2. Gradient bounded. $\forall x\in\mathbb{R}^d$, $\forall i\in\{1,\dots,n\}$, $\|\nabla f_i(x)\|\le G$, $G\ge 0$.

  3. Variance bounded. $\forall x\in\mathbb{R}^d$, $\forall t\ge 1$, $\mathbb{E}[g_t]=\nabla f(x_t)$, $\mathbb{E}\|g_t-\nabla f(x_t)\|^2\le\sigma^2$.

Definition 1. Suppose f has the global minimum, denoted as $f^*=f(x^*)$. Then for any $x,y\in\mathbb{R}^d$,

  1. Non-strongly convex. $f(y)\ge f(x)+\nabla f(x)^T(y-x)$.

  2. Polyak-Łojasiewicz (P-L) condition. $\exists\lambda>0$ such that $\|\nabla f(x)\|^2\ge 2\lambda\,(f(x)-f^*)$.

  3. Strongly convex. $\exists\mu>0$ such that $f(y)\ge f(x)+\nabla f(x)^T(y-x)+\frac{\mu}{2}\|y-x\|^2$.

3. Our New Analysis of Adam

First, we empirically observe that Adam has an anisotropic A-LR caused by є, which may lead to poor generalization performance. Second, we theoretically show that Adam is sensitive to є, supporting observations in previous work.

3.1. Anisotropic A-LR.

We investigate how the A-LR in Adam varies over time and across problem dimensions, and plot four examples in Figure 1 (more figures in the Appendix), where we run Adam to optimize a convolutional neural network (CNN) on the MNIST dataset, and ResNets or DenseNets on the CIFAR-10 dataset. The curves in Figure 1 exhibit very irregular shapes, and the median value is hardly ever placed in the middle of the range; the range of the A-LR across the problem dimensions is anisotropic for AGMs. As a general trend, the A-LR becomes larger when $v_t$ approaches 0 over iterations. The elements in the A-LR vary significantly across dimensions, and there are always some coordinates of the A-LR that reach the maximum $10^8$ determined by є (because we use $\epsilon=10^{-8}$ in Adam).

Figure 1: Range of the A-LR in Adam over iterations in four settings: (a) CNN on MNIST, (b) ResNet20 on CIFAR-10, (c) ResNet56 on CIFAR-10, (d) DenseNets on CIFAR-10. We plot the min, max, median, and the 25 and 75 percentiles of the A-LR across dimensions (the elements in $1/(\sqrt{v_t}+\epsilon)$).

This anisotropic scale of the A-LR across dimensions makes it difficult to determine the B-LR, η. On the one hand, η should be set small enough so that the LR $\eta/(\sqrt{v_t}+\epsilon)$ is appropriate, for otherwise some coordinates will take very large updates because the corresponding A-LR's are big, likely resulting in performance oscillation [21]. This may be because the exponential moving averages of past gradients and past squared gradients behave differently, so the speed at which $m_t$ diminishes to zero is different from the speed at which $v_t$ diminishes to zero. Besides, the noise generated in stochastic algorithms has a nonnegligible influence on the learning process. On the other hand, a very small η may harm the later stage of the learning process, since the small magnitude of $m_t$ multiplied by a small step size (at some coordinates) will be too small to escape sharp local minima, which have been shown to lead to poor generalization [22, 23, 24]. Further, in many deep learning tasks, stage-wise policies are often adopted to decay the LR after several epochs, making the LR even smaller. To address this dilemma, it is essential to control the A-LR, especially when stochastic gradients get close to 0.
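The anisotropy is easy to inspect directly: the per-coordinate statistics plotted in Figure 1 can be computed with a few lines (a sketch with synthetic gradients standing in for real per-layer gradients; the coordinate scales are our own choice):

import numpy as np

rng = np.random.default_rng(0)
d, T, beta2, eps = 10_000, 2_000, 0.999, 1e-8
coord_scale = rng.uniform(1e-6, 1.0, size=d)     # some coordinates get tiny gradients
v = np.zeros(d)
for t in range(1, T + 1):
    g = coord_scale * rng.normal(size=d) / t     # gradients shrinking over iterations
    v = beta2 * v + (1 - beta2) * g * g
a_lr = 1.0 / (np.sqrt(v) + eps)                  # element-wise A-LR of Adam
print(a_lr.min(), np.percentile(a_lr, [25, 50, 75]), a_lr.max())
# with these synthetic scales the max approaches 1/eps = 1e8 while the median
# stays orders of magnitude smaller, i.e., the A-LR is strongly anisotropic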

By analyzing previous modified AGMs that aim to close the generalization gap, we find that all these works can be summarized as one technique: constraining the A-LR, $1/(\sqrt{v_t}+\epsilon)$, to a reasonable range. Based on the observation of the anisotropic A-LR, we propose a more effective way to calibrate the A-LR with an activation function rather than hard-thresholding the A-LR at all coordinates, which empirically improves generalization performance while retaining theoretical guarantees for the optimization.

3.2. Sensitivity to є.

As a hyper-parameter in AGMs, ϵ was originally introduced to avoid a zero denominator when $v_t$ goes to 0, and it has never been studied in the convergence analysis of AGMs. However, it has been empirically observed that AGMs can be sensitive to the choice of є [17, 14]. As shown in Figure 1, a smaller $\epsilon=10^{-8}$ leads to a wide span of the A-LR across the different dimensions, whereas a bigger $\epsilon=10^{-3}$, as used in Yogi, reduces the span. To better understand the effect of є, we conduct experiments on multiple datasets; the results are shown in Tables 1 and 2. The setting of є is a main cause of the anisotropy, yet, unsatisfyingly, there is no theoretical result that explains the effect of є on AGMs. Inspired by our observation, we believe that the current convergence analysis for Adam is incomplete if it omits є.

Table 1:

Test Accuracy(%) of Adam for different є.

є ResNets 20 ResNets 56 DenseNets ResNet 18 VGG
10−1 92.51 ± 0.13 94.29 ± 0.10 94.78 ± 0.19 77.21 ± 0.26 76.05 ± 0.27
10−2 92.88 ± 0.21 94.15 ± 0.17 94.35 ± 0.10 76.64 ± 0.24 75.69 ± 0.16
10−4 92.03 ± 0.21 93.62 ± 0.18 94.15 ± 0.12 76.19 ± 0.20 74.45 ± 0.19
10−6 92.99 ± 0.22 93.56 ± 0.15 94.24 ± 0.24 76.09 ± 0.20 74.20 ± 0.33
10−8 91.68 ± 0.12 92.82 ± 0.09 93.32 ± 0.06 76.14 ± 0.24 74.18 ± 0.15

Table 2:

Test Accuracy(%) of AMSGrad for different є.

є ResNets 20 ResNets 56 DenseNets ResNet 18 VGG
10−1 92.80 ± 0.22 94.12 ± 0.07 94.92 ± 0.10 77.26 ± 0.30 75.84 ± 0.16
10−2 92.89 ± 0.07 94.20 ± 0.18 94.43 ± 0.22 76.23 ± 0.26 75.37 ± 0.18
10−4 91.85 ± 0.10 93.50 ± 0.14 94.02 ± 0.18 76.30 ± 0.31 74.44 ± 0.16
10−6 91.98 ± 0.23 93.54 ± 0.16 94.17 ± 0.10 76.14 ± 0.16 74.17 ± 0.28
10−8 91.70 ± 0.12 93.10 ± 0.11 93.71 ± 0.05 76.32 ± 0.11 74.26 ± 0.18

Most of the existing convergence analyses follow the line in [12], which first casts the iterate sequence as the solution of a minimization problem, $x_{t+1}=x_t-\frac{\eta}{\sqrt{v_t}}\odot m_t=\arg\min_{x}\big\|v_t^{1/4}\odot\big(x-(x_t-\frac{\eta}{\sqrt{v_t}}\odot m_t)\big)\big\|$, and then examines whether $\|v_t^{1/4}\odot(x_{t+1}-x^*)\|$ decreases over iterations. Hence, є is not discussed in this line of proof because it is not included in the step size. In our later convergence analysis section, we introduce an important lemma, the bounded A-LR, and by using the bounds of the A-LR (specifically, the lower bound µ1 and upper bound µ2, both of which contain є for Adam), we give a new general framework of proof (details in the Appendix) to show the convergence rate for reaching an x that satisfies $\mathbb{E}\|\nabla f(x_t)\|^2\le\delta$ in the nonconvex setting. We also derive the optimality gap from the stationary point in the convex and P-L (strongly convex) settings.

Theorem 3.1. [Nonconvex] Suppose f(x) is a nonconvex function that satisfies Assumption 1. Let $\eta_t=\eta=O(1/\sqrt{T})$. Then Adam has

$\min_{t=1,\dots,T}\mathbb{E}\|\nabla f(x_t)\|^2\le O\Big(\frac{1}{\epsilon^2\sqrt{T}}+\frac{d}{\epsilon T}+\frac{d}{\epsilon^2 T\sqrt{T}}\Big).$

Theorem 3.2. [Non-strongly Convex] Suppose f(x) is a convex function that satisfies Assumption 1. Assume that $\forall t$, $\mathbb{E}\|x_t-x^*\|\le D$ and, for any $m\ne n$, $\mathbb{E}\|x_m-x_n\|\le D_\infty$. Let $\eta_t=\eta=O(1/\sqrt{T})$. Then Adam has the convergence rate $f(\bar{x}_t)-f^*\le O\big(\frac{d}{\epsilon^2\sqrt{T}}\big)$, where $\bar{x}_t=\frac{1}{T}\sum_{t=1}^{T}x_t$.

Theorem 3.3. [P-L Condition] Suppose f(x) satisfies the P-L condition (with parameter λ) in the convex case and satisfies Assumption 1. Let $\eta_t=\eta=O(1/T^2)$. Then Adam has the convergence rate $\mathbb{E}f(x_{T+1})-f^*\le\big(1-\frac{2\lambda\mu_1}{T^2}\big)^{T}\,\mathbb{E}[f(x_1)-f^*]+O\big(\frac{1}{T}\big)$.

The P-L condition is weaker than strong convexity, and for the strongly convex case we also have:

Corollary 3.3.1. [Strongly Convex] Suppose f(x) is a µ-strongly convex function that satisfies Assumption 1. Let $\eta_t=\eta=O(1/T^2)$. Then Adam has the convergence rate $\mathbb{E}f(x_{T+1})-f^*\le\big(1-\frac{2\mu\mu_1}{T^2}\big)^{T}\,\mathbb{E}[f(x_1)-f^*]+O\big(\frac{1}{T}\big)$.

This is the first time є has been included in the theoretical analysis. As expected, the convergence rate of Adam is highly related to є. A bigger є enjoys a better convergence rate, since є then dominates the A-LR and the method behaves like S-Momentum; a smaller є preserves stronger "adaptivity". We need to find a better way to control є.

4. The Proposed Algorithms

We propose to use activation functions to calibrate AGMs, and specifically focus on applying the softplus function on top of the Adam and AMSGrad methods.

4.1. Activation Functions Help

Activation functions (such as sigmoid, ELU, and tanh), which transform inputs to outputs, are widely used in deep learning. As a well-studied activation function, $\mathrm{softplus}(x)=\frac{1}{\beta}\log(1+e^{\beta x})$ is known to keep large values unchanged (behaving like the function y = x) while smoothing out small values (see Figure 2 (a)). The magnitude below which values are smoothed out can be adjusted by a hyper-parameter β. In our new algorithms, we introduce $\mathrm{softplus}(\sqrt{v_t})=\frac{1}{\beta}\log(1+e^{\beta\sqrt{v_t}})$ to smoothly calibrate the A-LR. This calibration brings the following benefits: (1) it constrains the extremely large values of the A-LR in some coordinates (corresponding to small values in $v_t$) while keeping others untouched, with an appropriate β; for the undesirably large values in the A-LR, the softplus function condenses them smoothly instead of hard-thresholding them, while for other coordinates the A-LR remains largely unchanged; (2) it removes the sensitive parameter є, because the softplus function is lower-bounded by a nonzero number when applied to non-negative inputs, $\mathrm{softplus}(\cdot)\ge\frac{1}{\beta}\log 2$.
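Numerically, the calibration floors the A-LR at β/log 2 for coordinates where $\sqrt{v_t}$ is tiny and leaves larger coordinates essentially untouched (a small sketch with β = 50, the value recommended later; the sample values of $\sqrt{v_t}$ are ours):

import numpy as np

def softplus(x, beta=50.0):
    return np.logaddexp(0.0, beta * x) / beta   # numerically stable (1/beta)*log(1+e^{beta*x})

eps = 1e-8
for s in [0.0, 1e-6, 1e-3, 1e-1, 1.0]:          # sample values of sqrt(v_t)
    print(s, 1.0 / (s + eps), 1.0 / softplus(s))
# sqrt(v_t) = 0    -> Adam A-LR = 1e8,  calibrated A-LR = beta/log(2) ~ 72
# sqrt(v_t) = 0.1  -> Adam A-LR ~ 10,   calibrated A-LR ~ 10 (left untouched)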

Figure 2: Behavior of the softplus function, and the test performance of our Sadam algorithm.

After calibrating $\sqrt{v_t}$ with the softplus function, the anisotropic A-LR becomes much more regulated (see Figure 3 and the Appendix), and we clearly observe improved test accuracy (Figure 2 (b) and more figures in the Appendix). We name this method "Sadam" to represent Adam calibrated with the softplus function; we recommend the softplus function, but the scheme is not limited to it, and the later theoretical analysis can be easily extended to other activation functions. Further empirical evaluations show that the proposed methods significantly improve the generalization performance of Adam and AMSGrad.

Figure 3: Behavior of the A-LR in the Sadam method with different choices of β (CNN on the MNIST data).

4.2. Calibrated AGMs

Using this activation function, we develop two new variants of AGMs: Sadam and SAMSGrad (Algorithms 1 and 2), which are built on Adam and AMSGrad respectively.

Algorithm 1 Sadam
Input: $x_1\in\mathbb{R}^d$, learning rates $\{\eta_t\}_{t=1}^{T}$, parameters $0\le\beta_1,\beta_2<1$, β.
Initialize $m_0=0$, $v_0=0$
for t = 1 to T do
    Compute stochastic gradient $g_t$
    $m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t$
    $v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2$
    $x_{t+1}=x_t-\frac{\eta_t}{\mathrm{softplus}(\sqrt{v_t})}\odot m_t$
end for

Algorithm 2 SAMSGrad
Input: $x_1\in\mathbb{R}^d$, learning rates $\{\eta_t\}_{t=1}^{T}$, parameters $0\le\beta_1,\beta_2<1$, β.
Initialize $m_0=0$, $\tilde v_0=0$
for t = 1 to T do
    Compute stochastic gradient $g_t$
    $m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t$
    $\tilde v_t=\beta_2\tilde v_{t-1}+(1-\beta_2)g_t^2$
    $v_t=\max(v_{t-1},\tilde v_t)$
    $x_{t+1}=x_t-\frac{\eta_t}{\mathrm{softplus}(\sqrt{v_t})}\odot m_t$
end for
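A minimal, self-contained sketch of both algorithms is given below (NumPy; the toy quadratic objective, the function names, and the omission of bias correction are our choices):

import numpy as np

def softplus(x, beta):
    return np.logaddexp(0.0, beta * x) / beta       # stable (1/beta)*log(1+e^{beta*x})

def sadam(grad_fn, x, T=5000, lr=1e-2, beta1=0.9, beta2=0.999, beta=50.0,
          amsgrad=False):
    # amsgrad=False gives Sadam (Algorithm 1); amsgrad=True gives SAMSGrad (Algorithm 2)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    v_tilde = np.zeros_like(x)
    for _ in range(T):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g
        if amsgrad:
            v_tilde = beta2 * v_tilde + (1 - beta2) * g * g
            v = np.maximum(v, v_tilde)              # v_t = max(v_{t-1}, v~_t)
        else:
            v = beta2 * v + (1 - beta2) * g * g
        x = x - lr / softplus(np.sqrt(v), beta) * m # calibrated A-LR replaces 1/(sqrt(v)+eps)
    return x

# toy usage on a poorly scaled quadratic f(x) = 0.5 * sum_i s_i * x_i^2
s = np.array([1e-4, 1.0, 100.0])
print(sadam(lambda x: s * x, x=np.ones(3)))         # should move toward the minimizer at 0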

The key step lies in the design of the adaptive function: instead of using only the generalized square root function, we apply softplus(·) on top of the square root of the second-order momentum, which serves to regulate the anisotropic behavior of the A-LR and replaces the tolerance parameter є with the hyper-parameter β used in the softplus function.

In our algorithms, the hyper-parameters are recommended as $\beta_1=0.9$, $\beta_2=0.999$. For clarity, we omit the bias correction step proposed in the original Adam; however, our arguments and theoretical analysis apply to the bias-corrected version as well [6, 25, 14]. Using the softplus function, we introduce a new hyper-parameter β, which acts as a controller to smooth out the anisotropic A-LR and connects the Adam and S-Momentum methods automatically. When β is set small, Sadam and SAMSGrad perform similarly to S-Momentum; when β is set large, $\mathrm{softplus}(\sqrt{v_t})=\frac{1}{\beta}\log(1+e^{\beta\sqrt{v_t}})\approx\frac{1}{\beta}\log(e^{\beta\sqrt{v_t}})=\sqrt{v_t}$, and the updating formula becomes $x_{t+1}=x_t-\frac{\eta_t}{\sqrt{v_t}}\odot m_t$, which degenerates to the original AGMs. The hyper-parameter β can be tuned to achieve the best performance for different datasets and tasks. Based on our empirical observations, we recommend β = 50.
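These two limiting regimes are easy to verify numerically (a small check; the choices of β and of $\sqrt{v_t}=0.2$ are ours):

import numpy as np

def softplus(x, beta):
    return np.logaddexp(0.0, beta * x) / beta

sqrt_v = 0.2
for beta in (1.0, 50.0, 1e4):
    print(beta, softplus(sqrt_v, beta))
# beta = 1e4 -> ~0.200 : softplus(sqrt(v_t)) ~ sqrt(v_t), recovering the original AGM step
# beta = 1.0 -> ~0.798 : the A-LR is heavily damped, behaving closer to S-Momentum
# since softplus(0) = log(2)/beta, the A-LR can never exceed beta/log(2)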

As a calibration method, the softplus function adapts better than simply setting a larger є. More precisely, when є is large or β is small, Adam and AMSGrad amount to S-Momentum; but when є is small, as the commonly suggested $10^{-8}$, or β is taken large, the two schemes differ: comparing Figures 1 and 3 shows that Sadam has a more regulated A-LR distribution. The proposed calibration scheme brings the massive range of the A-LR back down to a moderate scale, and the median of the A-LR across dimensions is now well positioned in the middle of the 25–75 percentile zone. Our approach opens up a new direction of examining other activation functions (not limited to the softplus function) to calibrate the A-LR.

The proposed Sadam and SAMSGrad can be treated as members of a class of AGMs that use the softplus (or another suitable activation) function to better adapt the step size. The scheme can be readily combined with any other AGM, e.g., Rmsprop, Yogi, and PAdam, and the calibrated methods easily revert to the original ones by choosing a big β.

5. Convergence Analysis

We first demonstrate an important lemma to highlight that every coordinate in the A-LR is both upper and lower bounded at all iterations, which is consistent with empirical observations (Figure 1), and forms the foundation of our proof.

Lemma 5.1. [Bounded A-LR] With Assumption 1, for any t ≥ 1, j ∈ [1,d], β2 ∈ [0,1], fixed є in Adam and fixed β in Sadam, the anisotropic A-LR is bounded in AGMs. Adam has a (µ1,µ2)-bounded A-LR:

$\mu_1\le\frac{1}{\sqrt{v_{t,j}}+\epsilon}\le\mu_2,$

and Sadam has a (µ3,µ4)-bounded A-LR:

$\mu_3\le\frac{1}{\mathrm{softplus}(\sqrt{v_{t,j}})}\le\mu_4,$

where $0<\mu_1\le\mu_2$ and $0<\mu_3\le\mu_4$.

Remark 5.2. Besides the square root function and the softplus function, the A-LR calibrated by any positive, monotonically increasing function can be bounded. All of the bounds can be shown to be related to є or β (see Appendix). The bounded A-LR is an essential foundation of our analysis; we provide a way of proof different from previous works, and the proof procedure can be easily extended to other gradient methods as long as a bounded LR is satisfied.

Remark 5.3. These bounds apply to all AGMs, including Adagrad. In fact, the lower bounds are not the same in Adam and Adagrad, because Adam has a smaller $v_{t,j}$ due to the momentum decay parameter β2. To achieve a unified result, we use the same relaxation to derive the fixed lower bound µ1.

We now describe our main results for Sadam (and SAMSGrad) in the nonconvex case. Similar to Theorem 3.1, the convergence rate of Sadam is related to the bounds of the A-LR. Our methods improve over the convergence rate of Adam when comparing the self-contained parameters є and β.

Theorem 5.4. [Nonconvex] Suppose f(x) is a nonconvex function that satisfies Assumption 1. Let $\eta_t=\eta=O(1/\sqrt{T})$. Then the Sadam method has

$\min_{t=1,\dots,T}\mathbb{E}\|\nabla f(x_t)\|^2\le O\Big(\frac{\beta^2}{\sqrt{T}}+\frac{d\beta}{T}+\frac{d\beta^2}{T\sqrt{T}}\Big).$

Remark 5.5. Compared with the rate in Theorem 3.1, the convergence rate of Sadam depends on β, which can be a much smaller number (β = 50 as recommended) than $1/\epsilon$ (commonly $\epsilon=10^{-8}$ in AGMs), showing that our methods have a better convergence rate than Adam. When β is huge, Sadam's rate is comparable to that of classic Adam. When β is small, the convergence rate becomes $O(1/\sqrt{T})$, which recovers that of SGD [1].

Corollary 5.5.1. Treating є or β as a constant, the Adam, Sadam (and SAMSGrad) methods with fixed $L,\sigma,G,\beta_1$ and $\eta=O(1/\sqrt{T})$ have complexity $O(1/\sqrt{T})$, and thus call for $O(1/\delta^2)$ iterations to achieve δ-accurate solutions.

Theorem 5.6. [Non-strongly Convex] Suppose f(x) is a convex function that satisfies Assumption 1. Assume that $\mathbb{E}\|x_t-x^*\|\le D$, $\forall t$, and $\mathbb{E}\|x_m-x_n\|\le D_\infty$, $\forall m\ne n$. Let $\eta_t=\eta=O(1/\sqrt{T})$. Then Sadam has $f(\bar{x}_t)-f^*\le O(1/\sqrt{T})$, where $\bar{x}_t=\frac{1}{T}\sum_{t=1}^{T}x_t$.

The precise convergence rate is $O\big(\frac{d}{\epsilon^2\sqrt{T}}\big)$ for Adam and $O\big(\frac{d\beta^2}{\sqrt{T}}\big)$ for Sadam with fixed $L,\sigma,G,\beta_1,D,D_\infty$. Some works impose an additional sparsity assumption on the stochastic gradients, in other words, requiring $\sum_{t=1}^{T}\sum_{j=1}^{d}|g_{t,j}|\ll\sqrt{dT}$ [5, 12, 15, 19], to reduce the order from d to $\sqrt{d}$. Some works use element-wise bounds $\sigma_j$ or $G_j$ and apply $\sum_{j=1}^{d}\sigma_j=\sigma$ and $\sum_{j=1}^{d}G_j=G$ to hide d. In our work, we do not assume sparsity, so we use σ and G throughout the proof; otherwise, those techniques could also be used to hide d in our convergence rate.

Corollary 5.6.1. If є or β is treated as a constant, then the Adam, Sadam (and SAMSGrad) methods with fixed $L,\sigma,G,\beta_1$ and $\eta=O(1/\sqrt{T})$ in the convex case call for $O(1/\delta^2)$ iterations to achieve δ-accurate solutions.

Theorem 5.7. [P-L Condition] Suppose f(x) satisfies the P-L condition (with parameter λ) and Assumption 1 in the convex case. Let $\eta_t=\eta=O(1/T^2)$. Then Sadam has:

$\mathbb{E}f(x_{T+1})-f^*\le\Big(1-\frac{2\lambda\mu_3}{T^2}\Big)^{T}\,\mathbb{E}[f(x_1)-f^*]+O\Big(\frac{1}{T}\Big).$

Corollary 5.7.1. [Strongly Convex] Suppose f(x) is a µ-strongly convex function that satisfies Assumption 1. Let $\eta_t=\eta=O(1/T^2)$. Then Sadam has the convergence rate:

$\mathbb{E}f(x_{T+1})-f^*\le\Big(1-\frac{2\mu\mu_3}{T^2}\Big)^{T}\,\mathbb{E}[f(x_1)-f^*]+O\Big(\frac{1}{T}\Big).$

In summary, our methods share the same convergence rate as Adam in terms of T, and enjoy even better convergence speed when comparing the values commonly chosen for the parameters є and β. Our convergence rate recovers that of SGD and S-Momentum in terms of T for a small β.

6. Experiments

We compare Sadam and SAMSGrad against several state-of-the-art optimizers including S-Momentum, Adam, AMSGrad, Yogi, PAdam, PAMSGrad, Adabound, and Amsbound. More results and architecture details are in Appendix.

Experimental Setup.

We use three datasets for image classification: MNIST, CIFAR-10, and CIFAR-100, and two datasets for LSTM language models: the Penn Treebank (PTB) dataset and the WikiText-2 (WT2) dataset. The MNIST dataset is tested on a CNN with 5 hidden layers. The CIFAR-10 dataset is tested on Residual Neural Networks with 20 layers (ResNets 20) and 56 layers (ResNets 56) [9], and on DenseNets with 40 layers [11]. The CIFAR-100 dataset is tested on VGGNet [26] and a Residual Neural Network with 18 layers (ResNets 18) [9]. The PTB and WT2 datasets are tested on 3-layer LSTM models [27].

We train the CNN on MNIST for 100 epochs; ResNets/DenseNets on CIFAR-10 for 300 epochs with a weight decay factor of $5\times10^{-4}$ and a batch size of 128; VGGNet/ResNets on CIFAR-100 for 300 epochs with a weight decay factor of 0.025 and a batch size of 128; and the LSTM language models for 200 epochs. For the CIFAR tasks, we use a fixed multi-stage LR decaying scheme: the B-LR decays by 0.1 at the 150-th and 225-th epochs, a popular decaying scheme used in many works [28, 18]. For the language tasks, we use a fixed multi-stage LR decaying scheme in which the B-LR decays by 0.1 at the 100-th and 150-th epochs. All algorithms use grid search for hyper-parameters, choosing from {10, 1, 0.1, 0.01, 0.001, 0.0001} for the B-LR, {0.9, 0.99} for β1, and {0.99, 0.999} for β2. Algorithm-specific hyper-parameters are tuned around their recommended values, such as $p\in\{1/8,1/16\}$ in PAdam and PAMSGrad. For our algorithms, β is selected from {10, 50, 100} in Sadam and SAMSGrad, though we observe that fine-tuning β can achieve better test accuracy most of the time. All experiments on the CIFAR tasks are repeated 6 times to obtain the mean and standard deviation for each algorithm.
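For reference, the shared search space just described amounts to a small grid (a sketch; the dictionary layout is ours and each configuration corresponds to one full training run):

from itertools import product

grid = {
    "base_lr": [10, 1, 0.1, 0.01, 0.001, 0.0001],
    "beta1":   [0.9, 0.99],
    "beta2":   [0.99, 0.999],
    "beta":    [10, 50, 100],       # Sadam / SAMSGrad only; other AGMs tune eps or p instead
}
configs = [dict(zip(grid, values)) for values in product(*grid.values())]
print(len(configs), configs[0])     # 72 configurations for the softplus-calibrated methods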

Image Classification Tasks.

As a sanity check, an experiment on MNIST was conducted; its results are in Figure 4, which shows the learning curves of all baseline algorithms and our algorithms on both the training and test sets. As expected, all methods reach zero training loss quickly, while for test accuracy our SAMSGrad shows an increase and outperforms its competitors within 50 epochs.

Figure 4: Training loss and test accuracy on MNIST.

Using the PyTorch framework, we first run the ResNets 20 model on CIFAR-10; the results are shown in Table 3. The original Adam and AMSGrad have lower test accuracy than S-Momentum, leaving a clear generalization gap, exactly as previously reported. Our Sadam and SAMSGrad clearly close the gap, and Sadam achieves the best test accuracy among the competitors. We further test all methods on CIFAR-10 with ResNets 56, which has greater network depth, and the overall performance of every algorithm improves. For the experiments with DenseNets, we use a DenseNet with 40 layers and a growth rate k = 12, without bottleneck, channel reduction, or dropout. The results are reported in the last column of Table 3: SAMSGrad still achieves the best test performance, and the two proposed methods largely improve on Adam and AMSGrad and close the gap with S-Momentum.

Table 3:

Test Accuracy(%) of CIFAR-10 for ResNets 20, ResNets 56 and DenseNets.

Method B-LR є β ResNets 20 ResNets 56 DenseNets
S-Momentum [9, 11] - - - 91.25 93.03 94.76
Adam [14] 10−3 10−3 - 92.56 ± 0.14 93.42 ± 0.16 93.35 ± 0.21
Yogi [14] 10−2 10−3 - 92.62 ± 0.17 93.90 ± 0.21 94.38 ± 0.26

S-Momentum 10−1 - - 92.73 ± 0.05 94.11 ± 0.15 95.03 ± 0.15
Adam 10−3 10−8 - 91.68 ± 0.12 92.82 ± 0.09 93.32 ± 0.06
AMSGrad 10−3 10−8 - 91.7 ± 0.12 93.10 ± 0.11 93.71 ± 0.05
PAdam 10−1 10−8 - 92.7 ± 0.10 94.12 ± 0.12 95.06 ± 0.06
PAMSGrad 10−1 10−8 - 92.74 ± 0.12 94.18 ± 0.06 95.21 ± 0.10
AdaBound 10−2 10−8 - 91.59 ± 0.24 93.09 ± 0.14 94.16 ± 0.10
AmsBound 10−2 10−8 - 91.76 ± 0.16 93.08 ± 0.09 94.03 ± 0.11

Adam+ 10−1 0.013 - 92.89 ± 0.13 92.24 ± 0.10 94.54 ± 0.13
AMSGrad+ 10−1 0.013 - 92.95 ± 0.17 94.32 ± 0.10 94.58 ± 0.18
Sadam 10−2 - 50 93.01 ± 0.16 94.26 ± 0.10 95.19 ± 0.18
SAMSGrad 10−2 - 50 92.88 ±0.10 94.32 ± 0.18 95.31 ± 0.15

Furthermore, two popular CNN architectures, VGGNet [26] and ResNets 18 [9], are tested on the CIFAR-100 dataset to compare the different algorithms. Results can be found in Figure 5, and repeated results are in the Appendix. Our proposed methods again perform slightly better than S-Momentum in terms of test accuracy.

Figure 5: Training loss and test accuracy of two CNN architectures on CIFAR-100.

LSTM Language Models.

Having observed significant improvements on deep neural networks for image classification, we further conduct experiments on LSTM language models. To compare the efficiency of our proposed methods, two 3-layer LSTM models are trained on the Penn Treebank (PTB) dataset [29] and the WikiText-2 (WT2) dataset [30]. We present the single-model perplexity results for both our proposed methods and the competing methods in Figure 6; our methods achieve both fast convergence and the best generalization performance.

Figure 6: Perplexity curves on the test set of 3-layer LSTM models over the PTB and WT2 datasets.

In summary, our proposed methods show great efficacy on several standard benchmarks in both training and testing results, and outperform most optimizers in terms of generalization performance.

7. Conclusion

In this paper, we study adaptive gradient methods from a new perspective driven by the observation that the adaptive learning rates are anisotropic at each iteration. Inspired by this observation, we propose to calibrate the adaptive learning rates using an activation function, and in this work we examine the softplus function. We combine this calibration scheme with the Adam and AMSGrad methods, and empirical evaluations show clear improvements in their generalization performance on multiple deep learning tasks. Using this calibration scheme, we replace the hyper-parameter є of the original methods with a new parameter β in the softplus function. A new mathematical framework is proposed to analyze the convergence of adaptive gradient methods. Our analysis shows that the convergence rate is related to є or β, which has not been previously revealed, and this dependence on є or β helps justify the advantage of the proposed methods. In the future, the calibration scheme can be designed based on other suitable activation functions and used in conjunction with any other adaptive gradient method to improve generalization performance.

Acknowledgments

This work was funded by NSF grants CCF-1514357, DBI-1356655, and IIS1718738 to Jinbo Bi, who was also supported by NIH grants K02-DA043063 and R01-DA037349.

Appendix

A.1. Architecture Used in Our Experiments

Here we mainly describe the MNIST architecture (in PyTorch) used in our empirical study; ResNets and DenseNets are well-known architectures used in many works and we do not include their details here.

layer                          layer setting
x = F.relu(self.conv1(x))      self.conv1 = nn.Conv2d(1, 6, 5)
x = F.max_pool2d(x, 2, 2)
x = F.relu(self.conv2(x))      self.conv2 = nn.Conv2d(6, 16, 5)
x = x.view(-1, 16*4*4)
x = F.relu(self.fc1(x))        self.fc1 = nn.Linear(16*4*4, 120)
x = F.relu(self.fc2(x))        self.fc2 = nn.Linear(120, 84)
x = self.fc3(x)                self.fc3 = nn.Linear(84, 10)
x = F.log_softmax(x, dim=1)
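Assembled into a module, the rows above correspond to the following sketch (the class name is ours; the second max_pool2d after conv2 is an assumption that is not listed in the table but is needed for 28x28 MNIST inputs to produce the 16*4*4 features expected by fc1):

import torch.nn as nn
import torch.nn.functional as F

class MnistCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 6, 5)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 4 * 4, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, 10)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2, 2)          # assumed second pooling, see note above
        x = x.view(-1, 16 * 4 * 4)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return F.log_softmax(x, dim=1)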

B.2. More Empirical Results

In this section, we present multiple experiments to study the anisotropic A-LR existing in AGMs and the effect of the softplus function on the A-LR. We first show the A-LR range of popular Adam-type methods, and then present how the parameter β in Sadam and SAMSGrad reduces the range of the A-LR and improves both training and test performance.

B.2.1. A-LR Range of AGMs

Besides the A-LR range of the Adam method, which is shown in the main paper, we also study other Adam-type methods, focusing experiments on AMSGrad, PAdam, and PAMSGrad on different tasks (Figures B.2.1, B.2.2, and B.2.3). AMSGrad also has coordinates with extremely large values and encounters the "small learning rate dilemma" just as Adam does. With the partial parameter p, the value range of the A-LR can be narrowed down substantially: the maximum range is reduced to around $10^2$ with PAdam, and to less than $10^2$ with PAMSGrad. This reduced range, which avoids the "small learning rate dilemma", may help us understand which "trick" applied to Adam's A-LR can indeed improve generalization performance. In addition, the range of the A-LR in Yogi, AdaBound, and AmsBound is reduced or controlled by a specific є or a clip function, so we do not show further details here.

B.2.2. Parameter β Reduces the Range of A-LR

The main paper discusses the softplus function and mentions that it helps to constrain the large-valued coordinates of the A-LR while keeping the others untouched; here we give more empirical support. No matter how β is set, the modified A-LR has a reduced range. By trying various β's, we can find an appropriate β that performs best for a specific task and dataset. Besides the results on the A-LR range of Sadam on MNIST with different choices of β, we also study Sadam and SAMSGrad on ResNets 20 and DenseNets.

Here we use grid search to choose an appropriate β from {10, 50, 100, 200, 500, 1000}. In summary, with the softplus function, Sadam and SAMSGrad narrow down the range of the A-LR and make the A-LR vector more regular, avoiding the "small learning rate dilemma" and finally achieving better performance.

B.2.3. Parameter β Matters in Both Training and Testing

After studying the existing Adam-type methods and the effect of different β in adjusting the A-LR, we focus on the training and test accuracy of our softplus framework, especially Sadam and SAMSGrad, with different choices of β.

Figure B.2.1: A-LR range of AMSGrad (a), PAdam (b), and PAMSGrad (c) on MNIST.

Figure B.2.2: A-LR range of AMSGrad (a), PAdam (b), and PAMSGrad (c) on ResNets 20.

Figure B.2.3: A-LR range of AMSGrad (a), PAdam (b), and PAMSGrad (c) on DenseNets.

Figure B.2.4: The range of the A-LR $1/\mathrm{softplus}(\sqrt{v_t})$ over iterations for different choices of β. The maximum ranges in all figures are compressed to a reasonably small value compared with $10^8$.

Figure B.2.5: The range of the A-LR $1/\mathrm{softplus}(\sqrt{v_t})$, $v_t=\max(v_{t-1},\tilde v_t)$, over iterations for SAMSGrad on MNIST with different choices of β. The maximum ranges in all figures are compressed to a reasonably small value compared with those of AMSGrad on MNIST.

Figure B.2.6: The range of the A-LR $1/\mathrm{softplus}(\sqrt{v_t})$ over iterations for Sadam on ResNets 20 with different choices of β.

Figure B.2.7: The range of the A-LR $1/\mathrm{softplus}(\sqrt{v_t})$, $v_t=\max(v_{t-1},\tilde v_t)$, over iterations for SAMSGrad on ResNets 20 with different choices of β.

Figure B.2.8: The range of the A-LR $1/\mathrm{softplus}(\sqrt{v_t})$ over iterations for Sadam on DenseNets with different choices of β.

Figure B.2.9: The range of the A-LR $1/\mathrm{softplus}(\sqrt{v_t})$, $v_t=\max(v_{t-1},\tilde v_t)$, over iterations for SAMSGrad on DenseNets with different choices of β.

Figure B.2.10: Performance of Sadam on CIFAR-10 with different choices of β.

Figure B.2.11: Performance of SAMSGrad on CIFAR-10 with different choices of β.

C.3. CIFAR100

Two popular CNN architectures, VGGNet [26] and ResNets 18 [9], are tested on the CIFAR-100 dataset to compare the different algorithms. Besides the figures in the main text, we have repeated the experiments and show the results below. Our proposed methods again perform slightly better than S-Momentum in terms of test accuracy.

Table C.3.1:

Test Accuracy(%) of CIFAR100 for VGGNet.

Method 50th epoch 150th epoch 250th epoch best performance
S-Momentum 59.09 ± 2.09 61.25 ± 1.51 76.14 ± 0.12 76.43 ± 0.15
Adam 60.21 ± 0.81 62.98 ± 0.10 73.81 ± 0.17 74.18 ± 0.15
AMSGrad 61.00 ± 1.17 63.27 ± 1.18 74.04 ± 0.16 74.26 ± 0.18
PAdam 53.62 ± 1.70 56.02 ± 0.86 75.85 ± 0.20 76.36 ± 0.16
PAMSGrad 52.49 ± 3.07 57.39 ± 1.40 75.82 ± 0.31 76.26 ± 0.30
AdaBound 60.27 ± 0.99 60.36 ± 1.71 75.86 ± 0.23 76.10 ± 0.22
AmsBound 59.88 ± 0.56 60.11 ± 1.92 75.74 ± 0.23 75.99 ± 0.20

Adam+ 43.59 ± 2.71 44.46 ± 4.39 74.91 ± 0.36 75.58 ± 0.33
AMSGrad+ 44.45 ± 2.83 45.61 ± 3.67 74.85 ± 0.08 75.56 ± 0.24
Sadam 58.59 ± 1.60 61.27 ± 1.67 76.35 ± 0.18 76.64 ± 0.18
SAMSGrad 59.16 ± 1.20 60.86 ± 0.39 76.27 ± 0.23 76.47 ± 0.26

Table C.3.2:

Test Accuracy(%) of CIFAR100 for ResNets18.

Method 50th epoch 150th epoch 250th epoch best performance
S-Momentum 59.98 ± 1.31 63.32 ± 1.61 77.19 ± 0.36 77.50 ± 0.25
Adam 63.40 ± 1.42 66.18 ± 1.02 75.68 ± 0.49 76.14 ± 0.24
AMSGrad 63.16 ± 0.47 66.59 ± 1.42 75.92 ± 0.26 76.32 ± 0.11
PAdam 56.28 ± 0.87 58.71 ± 1.66 77.18 ± 0.21 77.51 ± 0.19
PAMSGrad 54.34 ± 2.21 58.81 ± 1.95 77.41 ± 0.17 77.67 ± 0.14
AdaBound 61.13 ± 0.84 64.30 ± 1.84 77.18 ± 0.38 77.50 ± 0.29
AmsBound 61.05 ± 1.59 62.04 ± 2.10 77.08 ± 0.19 77.34 ± 0.13

Adam+ 46.5 ± 2.12 48.68 ± 4.06 76.86 ± 0.36 77.19 ± 0.28
AMSGrad+ 49.06 ± 3.23 50.75 ± 2.45 76.58 ± 0.21 76.91 ± 0.12
Sadam 59.00 ± 1.09 62.75 ± 1.03 77.26 ± 0.30 77.61 ± 0.19
SAMSGrad 59.63 ± 1.27 63.44 ± 1.84 77.31 ± 0.40 77.70 ± 0.31

D.4. Theoretical Analysis Details

We analyze the convergence rates of Adam and Sadam under different settings and derive competitive results for our methods. The following table gives an overview of the convergence rates of stochastic gradient methods under various conditions; in our work we provide a way of proof different from previous works and also associate the analysis with the hyper-parameters of the Adam-type methods.

D.4.1. Prepared Lemmas

We first establish a series of preparatory lemmas to help with the analysis of the optimization convergence rate; some of them may also be used in generalization error bound analyses.

Lemma D.4.1. For any vectors $a,b,c\in\mathbb{R}^d$ with $b\ge 0$ element-wise, $\langle a,b\odot c\rangle=\langle a\odot b,c\rangle=\langle a\odot\sqrt{b},c\odot\sqrt{b}\rangle$, where ⊙ is the element-wise product and $\sqrt{b}$ is the element-wise square root.

Proof.

$\langle a,b\odot c\rangle=\langle(a_1,\dots,a_d),(b_1c_1,\dots,b_dc_d)\rangle=a_1b_1c_1+\dots+a_db_dc_d$
$\langle a\odot b,c\rangle=\langle(a_1b_1,\dots,a_db_d),(c_1,\dots,c_d)\rangle=a_1b_1c_1+\dots+a_db_dc_d$
$\langle a\odot\sqrt{b},c\odot\sqrt{b}\rangle=\langle(a_1\sqrt{b_1},\dots,a_d\sqrt{b_d}),(\sqrt{b_1}c_1,\dots,\sqrt{b_d}c_d)\rangle=a_1b_1c_1+\dots+a_db_dc_d$ ◻

Lemma D.4.2. For any vector a, we have

$\|a\|_\infty^2\le\|a\|^2$. (2)

Lemma D.4.3. For the unbiased stochastic gradient, we have

$\mathbb{E}\|g_t\|^2\le\sigma^2+G^2$. (3)

Proof. From the gradient bounded and variance bounded assumptions,

$\mathbb{E}\|g_t\|^2=\mathbb{E}\|g_t-\nabla f(x_t)+\nabla f(x_t)\|^2=\mathbb{E}\|g_t-\nabla f(x_t)\|^2+\|\nabla f(x_t)\|^2\le\sigma^2+G^2.$ ◻

Lemma D.4.4. All momentum-based optimizers using the first momentum $m_t=\beta_1 m_{t-1}+(1-\beta_1)g_t$ satisfy

$\mathbb{E}\|m_t\|^2\le\sigma^2+G^2$. (4)

Proof. From the updating rule of the first momentum estimator, we can derive

$m_t=\sum_{i=1}^{t}(1-\beta_1)\beta_1^{t-i}g_i$. (5)

Let $\Gamma_t=\sum_{i=1}^{t}\beta_1^{t-i}=\frac{1-\beta_1^t}{1-\beta_1}$. By Jensen's inequality and Lemma D.4.3,

$\mathbb{E}\|m_t\|^2=\mathbb{E}\Big\|\sum_{i=1}^{t}(1-\beta_1)\beta_1^{t-i}g_i\Big\|^2=\Gamma_t^2\,\mathbb{E}\Big\|\sum_{i=1}^{t}\frac{(1-\beta_1)\beta_1^{t-i}}{\Gamma_t}g_i\Big\|^2\le\Gamma_t^2\sum_{i=1}^{t}\frac{(1-\beta_1)^2\beta_1^{t-i}}{\Gamma_t}\mathbb{E}\|g_i\|^2\le\Gamma_t(1-\beta_1)^2\sum_{i=1}^{t}\beta_1^{t-i}(\sigma^2+G^2)\le\sigma^2+G^2.$ ◻

Lemma D.4.5. Each coordinate of the vector $v_t=\beta_2 v_{t-1}+(1-\beta_2)g_t^2$ satisfies

$\mathbb{E}[v_{t,j}]\le\sigma^2+G^2,$

where j ∈ [1,d] is the coordinate index.

Proof. From the updating rule of the second momentum estimator, we can derive

$v_{t,j}=\sum_{i=1}^{t}(1-\beta_2)\beta_2^{t-i}g_{i,j}^2\ge 0$. (6)

Since the decay parameter $\beta_2\in[0,1]$, $\sum_{i=1}^{t}(1-\beta_2)\beta_2^{t-i}=1-\beta_2^t\le 1$. From Lemma D.4.3,

$\mathbb{E}[v_{t,j}]=\mathbb{E}\sum_{i=1}^{t}(1-\beta_2)\beta_2^{t-i}g_{i,j}^2\le\sum_{i=1}^{t}(1-\beta_2)\beta_2^{t-i}(\sigma^2+G^2)\le\sigma^2+G^2.$ ◻

And we can derive the following important lemma:

Lemma D.4.6. [Bounded A-LR] For any t ≥ 1, j ∈ [1,d], β2 ∈ [0,1], fixed є in Adam and β defined in the softplus function in Sadam, the following bounds always hold.

Adam has a (µ1,µ2)-bounded A-LR:

$\mu_1\le\frac{1}{\sqrt{v_{t,j}}+\epsilon}\le\mu_2$; (7)

Sadam has a (µ3,µ4)-bounded A-LR:

$\mu_3\le\frac{1}{\mathrm{softplus}(\sqrt{v_{t,j}})}\le\mu_4$; (8)

where $0<\mu_1\le\mu_2$, $0<\mu_3\le\mu_4$. For brevity, we use $\mu_l,\mu_u$ to denote a generic lower bound and upper bound respectively, and both Adam and Sadam are analyzed with the help of $(\mu_l,\mu_u)$.

Proof. For Adam, let $\mu_1=\frac{1}{\sqrt{\sigma^2+G^2}+\epsilon}$ and $\mu_2=\frac{1}{\epsilon}$; then we get the result in (7).

For Sadam, notice that softplus(·) is a monotonically increasing function and $\sqrt{v_{t,j}}$ is both upper- and lower-bounded; then we have (8), where $\mu_3=\frac{1}{\frac{1}{\beta}\log\big(1+e^{\beta\sqrt{\sigma^2+G^2}}\big)}$ and $\mu_4=\frac{1}{\frac{1}{\beta}\log(1+e^{\beta\cdot 0})}=\frac{\beta}{\log 2}$. ◻
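To get a feel for the magnitudes of these bounds, one can plug in illustrative values, e.g., σ = G = 1, є = 10⁻⁸, β = 50 (our numbers, not the paper's):

import math

sigma2_plus_G2 = 2.0                   # sigma^2 + G^2 with sigma = G = 1 (illustrative)
eps, beta = 1e-8, 50.0

mu1 = 1.0 / (math.sqrt(sigma2_plus_G2) + eps)                                  # ~0.707
mu2 = 1.0 / eps                                                                # 1e8
mu3 = 1.0 / (math.log1p(math.exp(beta * math.sqrt(sigma2_plus_G2))) / beta)    # ~0.707
mu4 = beta / math.log(2.0)                                                     # ~72.1
print(mu1, mu2, mu3, mu4)
# the (mu3, mu4) interval of Sadam is orders of magnitude tighter than (mu1, mu2) of Adam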

Lemma D.4.7. Define $z_t=x_t+\frac{\beta_1}{1-\beta_1}(x_t-x_{t-1})$, $t\ge 1$, $\beta_1\in[0,1)$. Let $\eta_t=\eta$. Then the following updating formulas hold.

Gradient-based optimizer:

$z_t=x_t,\qquad z_{t+1}=z_t-\eta g_t$; (9)

Adam optimizer:

$z_{t+1}=z_t+\frac{\eta\beta_1}{1-\beta_1}\Big(\frac{1}{\sqrt{v_{t-1}}+\epsilon}-\frac{1}{\sqrt{v_t}+\epsilon}\Big)\odot m_{t-1}-\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t$; (10)

Sadam optimizer:

$z_{t+1}=z_t+\frac{\eta\beta_1}{1-\beta_1}\Big(\frac{1}{\mathrm{softplus}(\sqrt{v_{t-1}})}-\frac{1}{\mathrm{softplus}(\sqrt{v_t})}\Big)\odot m_{t-1}-\frac{\eta}{\mathrm{softplus}(\sqrt{v_t})}\odot g_t$. (11)

Proof. We consider the Adam optimizer; letting β1 = 0, we can easily derive the gradient-based case.

$z_{t+1}=x_{t+1}+\frac{\beta_1}{1-\beta_1}(x_{t+1}-x_t)$
$z_{t+1}=z_t+\frac{1}{1-\beta_1}(x_{t+1}-x_t)-\frac{\beta_1}{1-\beta_1}(x_t-x_{t-1})=z_t-\frac{1}{1-\beta_1}\frac{\eta}{\sqrt{v_t}+\epsilon}\odot m_t+\frac{\beta_1}{1-\beta_1}\frac{\eta}{\sqrt{v_{t-1}}+\epsilon}\odot m_{t-1}=z_t+\frac{\eta\beta_1}{1-\beta_1}\Big(\frac{1}{\sqrt{v_{t-1}}+\epsilon}-\frac{1}{\sqrt{v_t}+\epsilon}\Big)\odot m_{t-1}-\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t.$

Similarly, for the Sadam optimizer:

$z_{t+1}=z_t+\frac{1}{1-\beta_1}(x_{t+1}-x_t)-\frac{\beta_1}{1-\beta_1}(x_t-x_{t-1})=z_t-\frac{1}{1-\beta_1}\frac{\eta}{\mathrm{softplus}(\sqrt{v_t})}\odot m_t+\frac{\beta_1}{1-\beta_1}\frac{\eta}{\mathrm{softplus}(\sqrt{v_{t-1}})}\odot m_{t-1}=z_t+\frac{\eta\beta_1}{1-\beta_1}\Big(\frac{1}{\mathrm{softplus}(\sqrt{v_{t-1}})}-\frac{1}{\mathrm{softplus}(\sqrt{v_t})}\Big)\odot m_{t-1}-\frac{\eta}{\mathrm{softplus}(\sqrt{v_t})}\odot g_t.$ ◻

Lemma D.4.8. As defined in Lemma D.4.7, under the condition $v_t\ge v_{t-1}$ (i.e., for AMSGrad and SAMSGrad), the distance $\|z_{t+1}-z_t\|^2$ can be bounded as follows.

Adam optimizer:

$\mathbb{E}\|z_{t+1}-z_t\|^2\le\frac{2\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^d\Big[\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big]+2\eta^2\mu_2^2(\sigma^2+G^2)$ (12)

Sadam optimizer:

$\mathbb{E}\|z_{t+1}-z_t\|^2\le\frac{2\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^d\Big[\frac{1}{\mathrm{softplus}(\sqrt{v_{t-1,j}})^2}-\frac{1}{\mathrm{softplus}(\sqrt{v_{t,j}})^2}\Big]+2\eta^2\mu_4^2(\sigma^2+G^2)$ (13)

Proof. Adam case:

$\mathbb{E}\|z_{t+1}-z_t\|^2=\mathbb{E}\Big\|\frac{\eta\beta_1}{1-\beta_1}\Big(\frac{1}{\sqrt{v_{t-1}}+\epsilon}-\frac{1}{\sqrt{v_t}+\epsilon}\Big)\odot m_{t-1}-\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t\Big\|^2$
$\le 2\,\mathbb{E}\Big\|\frac{\eta\beta_1}{1-\beta_1}\Big(\frac{1}{\sqrt{v_{t-1}}+\epsilon}-\frac{1}{\sqrt{v_t}+\epsilon}\Big)\odot m_{t-1}\Big\|^2+2\,\mathbb{E}\Big\|\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t\Big\|^2$
$\le\frac{2\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^d\Big(\frac{1}{\sqrt{v_{t-1,j}}+\epsilon}-\frac{1}{\sqrt{v_{t,j}}+\epsilon}\Big)^2+2\eta^2\mu_2^2(\sigma^2+G^2)$
$\le\frac{2\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^d\Big[\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big]+2\eta^2\mu_2^2(\sigma^2+G^2)$

The first inequality holds because $\|a-b\|^2\le 2\|a\|^2+2\|b\|^2$, the second because of Lemmas D.4.3, D.4.4, and D.4.6, and the third because $(a-b)^2\le a^2-b^2$ when $a\ge b\ge 0$, which applies since we assume $v_t\ge v_{t-1}$.

Sadam case:

$\mathbb{E}\|z_{t+1}-z_t\|^2=\mathbb{E}\Big\|\frac{\eta\beta_1}{1-\beta_1}\Big(\frac{1}{\mathrm{softplus}(\sqrt{v_{t-1}})}-\frac{1}{\mathrm{softplus}(\sqrt{v_t})}\Big)\odot m_{t-1}-\frac{\eta}{\mathrm{softplus}(\sqrt{v_t})}\odot g_t\Big\|^2$
$\le\frac{2\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^d\Big(\frac{1}{\mathrm{softplus}(\sqrt{v_{t-1,j}})}-\frac{1}{\mathrm{softplus}(\sqrt{v_{t,j}})}\Big)^2+2\eta^2\mu_4^2(\sigma^2+G^2)$
$\le\frac{2\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^d\Big[\frac{1}{\mathrm{softplus}(\sqrt{v_{t-1,j}})^2}-\frac{1}{\mathrm{softplus}(\sqrt{v_{t,j}})^2}\Big]+2\eta^2\mu_4^2(\sigma^2+G^2)$

Because the softplus function is monotonically increasing, the third inequality holds here as well. ◻

Lemma D.4.9. As defined in Lemma D.4.7, under the condition $v_t\ge v_{t-1}$, the inner product can be bounded as follows.

Adam optimizer:

$\mathbb{E}\Big\langle\nabla f(z_t)-\nabla f(x_t),\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t\Big\rangle\le\frac{1}{2}L^2\eta^2\mu_2^2\Big(\frac{\beta_1}{1-\beta_1}\Big)^2(\sigma^2+G^2)+\frac{1}{2}\eta^2\mu_2^2(\sigma^2+G^2);$ (14)

Sadam optimizer:

$\mathbb{E}\Big\langle\nabla f(z_t)-\nabla f(x_t),\frac{\eta}{\mathrm{softplus}(\sqrt{v_t})}\odot g_t\Big\rangle\le\frac{1}{2}L^2\eta^2\mu_4^2\Big(\frac{\beta_1}{1-\beta_1}\Big)^2(\sigma^2+G^2)+\frac{1}{2}\eta^2\mu_4^2(\sigma^2+G^2).$ (15)

Proof. Since the stochastic gradient is unbiased, $\mathbb{E}[g_t]=\nabla f(x_t)$.

Adam case:

$\mathbb{E}\Big\langle\nabla f(z_t)-\nabla f(x_t),\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t\Big\rangle\le\frac{1}{2}\mathbb{E}\|\nabla f(z_t)-\nabla f(x_t)\|^2+\frac{1}{2}\mathbb{E}\Big\|\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t\Big\|^2\le\frac{L^2}{2}\mathbb{E}\|z_t-x_t\|^2+\frac{1}{2}\mathbb{E}\Big\|\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t\Big\|^2=\frac{L^2}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\mathbb{E}\|x_t-x_{t-1}\|^2+\frac{1}{2}\mathbb{E}\Big\|\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t\Big\|^2=\frac{L^2}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\mathbb{E}\Big\|\frac{\eta}{\sqrt{v_{t-1}}+\epsilon}\odot m_{t-1}\Big\|^2+\frac{1}{2}\mathbb{E}\Big\|\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t\Big\|^2\le\frac{1}{2}L^2\eta^2\mu_2^2\Big(\frac{\beta_1}{1-\beta_1}\Big)^2(\sigma^2+G^2)+\frac{1}{2}\eta^2\mu_2^2(\sigma^2+G^2)$

The first inequality holds because $\frac{1}{2}\|a\|^2+\frac{1}{2}\|b\|^2\ge\langle a,b\rangle$, the second follows from L-smoothness, and the last inequalities follow from Lemmas D.4.4 and D.4.6.

Similarly, for Sadam we have:

$\mathbb{E}\Big\langle\nabla f(z_t)-\nabla f(x_t),\frac{\eta}{\mathrm{softplus}(\sqrt{v_t})}\odot g_t\Big\rangle\le\frac{L^2}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\mathbb{E}\Big\|\frac{\eta}{\mathrm{softplus}(\sqrt{v_{t-1}})}\odot m_{t-1}\Big\|^2+\frac{1}{2}\mathbb{E}\Big\|\frac{\eta}{\mathrm{softplus}(\sqrt{v_t})}\odot g_t\Big\|^2\le\frac{1}{2}L^2\eta^2\mu_4^2\Big(\frac{\beta_1}{1-\beta_1}\Big)^2(\sigma^2+G^2)+\frac{1}{2}\eta^2\mu_4^2(\sigma^2+G^2).$ ◻

D.4.2. Adam Convergence in Nonconvex Setting

Proof. All the analyses hold under the condition $v_t\ge v_{t-1}$. From L-smoothness and Lemma D.4.7, we have

$f(z_{t+1})\le f(z_t)+\langle\nabla f(z_t),z_{t+1}-z_t\rangle+\frac{L}{2}\|z_{t+1}-z_t\|^2=f(z_t)+\frac{\eta\beta_1}{1-\beta_1}\Big\langle\nabla f(z_t),\Big(\frac{1}{\sqrt{v_{t-1}}+\epsilon}-\frac{1}{\sqrt{v_t}+\epsilon}\Big)\odot m_{t-1}\Big\rangle-\Big\langle\nabla f(z_t),\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t\Big\rangle+\frac{L}{2}\|z_{t+1}-z_t\|^2$

Taking expectation on both sides, splitting $-\mathbb{E}\langle\nabla f(z_t),\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t\rangle=-\mathbb{E}\langle\nabla f(z_t)-\nabla f(x_t),\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t\rangle-\mathbb{E}\langle\nabla f(x_t),\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t\rangle$, plugging in the prepared lemmas (Lemmas D.4.8 and D.4.9), and applying the bounds on $m_t$ and $\nabla f(z_t)$ ($\|\nabla f(z_t)\|\le G$, $\mathbb{E}\|m_{t-1}\|^2\le\sigma^2+G^2$), we have

$\mathbb{E}f(z_{t+1})\le\mathbb{E}f(z_t)+\frac{\eta\beta_1}{1-\beta_1}G\sqrt{\sigma^2+G^2}\,\mathbb{E}\sum_{j=1}^d\Big(\frac{1}{\sqrt{v_{t-1,j}}+\epsilon}-\frac{1}{\sqrt{v_{t,j}}+\epsilon}\Big)+\frac{L^2\eta^2\mu_2^2}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2(\sigma^2+G^2)+\frac{\eta^2\mu_2^2}{2}(\sigma^2+G^2)-\mathbb{E}\Big\langle\nabla f(x_t),\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t\Big\rangle+\frac{L\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\mathbb{E}\sum_{j=1}^d\Big[\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big]+L\eta^2\mu_2^2(\sigma^2+G^2)$

For the inner-product term,

$\mathbb{E}\Big\langle\nabla f(x_t),\frac{1}{\sqrt{v_t}+\epsilon}\odot g_t\Big\rangle\ge\mathbb{E}\Big[\sum_{j:\nabla f(x_t)_j g_{t,j}\ge 0}\mu_1\nabla f(x_t)_j g_{t,j}+\sum_{j:\nabla f(x_t)_j g_{t,j}<0}\mu_2\nabla f(x_t)_j g_{t,j}\Big]\ge\sum_{j:\nabla f(x_t)_j g_{t,j}\ge 0}\mu_1\nabla f(x_t)_j^2+\sum_{j:\nabla f(x_t)_j g_{t,j}<0}\mu_2\nabla f(x_t)_j^2\ge\mu_1\|\nabla f(x_t)\|^2$

By rearranging, we obtain

$\eta\mu_1\,\mathbb{E}\|\nabla f(x_t)\|^2\le\mathbb{E}[f(z_t)-f(z_{t+1})]+\frac{\eta\beta_1}{1-\beta_1}G\sqrt{\sigma^2+G^2}\,\mathbb{E}\sum_{j=1}^d\Big(\frac{1}{\sqrt{v_{t-1,j}}+\epsilon}-\frac{1}{\sqrt{v_{t,j}}+\epsilon}\Big)+\frac{L^2\eta^2\mu_2^2}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2(\sigma^2+G^2)+\frac{\eta^2\mu_2^2}{2}(\sigma^2+G^2)+\frac{L\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\mathbb{E}\sum_{j=1}^d\Big[\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big]+L\eta^2\mu_2^2(\sigma^2+G^2)$

Dividing both sides by $\eta\mu_1$, summing from t = 1 to T (the coordinate-wise sums telescope), and using $v_0=0$ and $\mu_2=1/\epsilon$ so that $\mathbb{E}\sum_{j}\big(\frac{1}{\sqrt{v_{0,j}}+\epsilon}-\frac{1}{\sqrt{v_{T,j}}+\epsilon}\big)\le d(\mu_2-\mu_1)$ and $\mathbb{E}\sum_{j}\big[\frac{1}{(\sqrt{v_{0,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{T,j}}+\epsilon)^2}\big]\le d(\mu_2^2-\mu_1^2)$, we get

$\sum_{t=1}^{T}\mathbb{E}\|\nabla f(x_t)\|^2\le\frac{1}{\eta\mu_1}\mathbb{E}[f(z_1)-f^*]+\frac{\beta_1 d}{(1-\beta_1)\mu_1}G\sqrt{\sigma^2+G^2}(\mu_2-\mu_1)+\frac{L^2\eta\mu_2^2 T}{2\mu_1}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2(\sigma^2+G^2)+\frac{\eta\mu_2^2 T}{2\mu_1}(\sigma^2+G^2)+\frac{L\eta\beta_1^2 d(\sigma^2+G^2)}{(1-\beta_1)^2\mu_1}(\mu_2^2-\mu_1^2)+\frac{L\eta\mu_2^2 T}{\mu_1}(\sigma^2+G^2)$

Dividing by T and using $G\sqrt{\sigma^2+G^2}\le\sigma^2+G^2$,

$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(x_t)\|^2\le\frac{1}{\eta\mu_1 T}\mathbb{E}[f(z_1)-f^*]+\frac{\beta_1 d(\mu_2-\mu_1)}{(1-\beta_1)\mu_1 T}+\Big[\frac{L^2\eta\mu_2^2}{2\mu_1}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2+\frac{\eta\mu_2^2}{2\mu_1}+\frac{L\eta\beta_1^2 d(\mu_2^2-\mu_1^2)}{(1-\beta_1)^2\mu_1 T}+\frac{L\eta\mu_2^2}{\mu_1}\Big](\sigma^2+G^2)$

Setting $\eta=\frac{1}{\sqrt{T}}$ and $x_0=x_1$ (so that $z_1=x_1$ and $f(z_1)=f(x_1)$), we derive the final result:

$\min_{t=1,\dots,T}\mathbb{E}\|\nabla f(x_t)\|^2\le\frac{1}{\mu_1\sqrt{T}}\mathbb{E}[f(x_1)-f^*]+\frac{\beta_1 d(\mu_2-\mu_1)}{(1-\beta_1)\mu_1 T}+\Big[\frac{L^2\mu_2^2}{2\mu_1\sqrt{T}}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2+\frac{\mu_2^2}{2\mu_1\sqrt{T}}+\frac{L\beta_1^2 d(\mu_2^2-\mu_1^2)}{(1-\beta_1)^2\mu_1 T\sqrt{T}}+\frac{L\mu_2^2}{\mu_1\sqrt{T}}\Big](\sigma^2+G^2)=\frac{C_1}{\sqrt{T}}+\frac{C_2}{T}+\frac{C_3}{T\sqrt{T}}$

where

$C_1=\frac{1}{\mu_1}\big(f(x_1)-f^*\big)+\Big[\frac{L^2\mu_2^2}{2\mu_1}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2+\frac{\mu_2^2}{2\mu_1}+\frac{L\mu_2^2}{\mu_1}\Big](\sigma^2+G^2),$
$C_2=\frac{\beta_1(\mu_2-\mu_1)d}{(1-\beta_1)\mu_1},\qquad C_3=\frac{L\beta_1^2 d(\mu_2^2-\mu_1^2)}{(1-\beta_1)^2\mu_1}.$

With fixed $L,\sigma,G,\beta_1$, we have $C_1=O(1/\epsilon^2)$, $C_2=O(d/\epsilon)$, $C_3=O(d/\epsilon^2)$. Therefore,

$\min_{t=1,\dots,T}\mathbb{E}\|\nabla f(x_t)\|^2\le O\Big(\frac{1}{\epsilon^2\sqrt{T}}+\frac{d}{\epsilon T}+\frac{d}{\epsilon^2 T\sqrt{T}}\Big)$

Thus, we obtain the sublinear convergence rate of Adam in the nonconvex setting, which recovers the well-known result of SGD ([1]) in nonconvex optimization in terms of T.

Remark D.4.10. The leading term of the above convergence bound is $C_1/\sqrt{T}$; є plays an essential role in the complexity, and we derive the more accurate order $O\big(\frac{1}{\epsilon^2\sqrt{T}}\big)$. At present, є is usually underestimated and considered not to be associated with the accuracy of the solution ([14]). However, it is closely related to the complexity, and with a bigger є the computational complexity is better. This also supports the analysis of Adam's A-LR, $1/(\sqrt{v_t}+\epsilon)$, in our main paper.

In some other works, element-wise bounds $\sigma_j$ or $G_j$ are used, and $\sum_{j=1}^{d}\sigma_j:=\sigma$, $\sum_{j=1}^{d}G_j:=G$ are then applied to hide d in the complexity. In our work, we do not write out $\sigma_j$ or $G_j$ explicitly; instead we use σ and G throughout the procedure.

D.4.3. Sadam Convergence in Nonconvex Setting

Since Sadam also has the bound pair (µ3,µ4), we can follow the proof for the Adam method, which provides a general framework for this kind of adaptive method. Similar to the Adam proof, we start from L-smoothness and Lemma D.4.7.

Proof. All the analyses hold under the condition $v_t\ge v_{t-1}$. From L-smoothness and Lemma D.4.7, we have

$f(z_{t+1})\le f(z_t)+\langle\nabla f(z_t),z_{t+1}-z_t\rangle+\frac{L}{2}\|z_{t+1}-z_t\|^2=f(z_t)+\frac{\eta\beta_1}{1-\beta_1}\Big\langle\nabla f(z_t),\Big(\frac{1}{\mathrm{softplus}(\sqrt{v_{t-1}})}-\frac{1}{\mathrm{softplus}(\sqrt{v_t})}\Big)\odot m_{t-1}\Big\rangle-\Big\langle\nabla f(z_t),\frac{\eta}{\mathrm{softplus}(\sqrt{v_t})}\odot g_t\Big\rangle+\frac{L}{2}\|z_{t+1}-z_t\|^2$

Taking expectation on both sides, plugging in the prepared lemmas (Lemmas D.4.8 and D.4.9 with $(\mu_3,\mu_4)$), and applying the bounds on $m_t$ and $\nabla f(z_t)$, we have

$\mathbb{E}f(z_{t+1})\le\mathbb{E}f(z_t)+\frac{\eta\beta_1}{1-\beta_1}G\sqrt{\sigma^2+G^2}\,\mathbb{E}\sum_{j=1}^d\Big(\frac{1}{\mathrm{softplus}(\sqrt{v_{t-1,j}})}-\frac{1}{\mathrm{softplus}(\sqrt{v_{t,j}})}\Big)+\frac{L^2\eta^2\mu_4^2}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2(\sigma^2+G^2)+\frac{\eta^2\mu_4^2}{2}(\sigma^2+G^2)-\mathbb{E}\Big\langle\nabla f(x_t),\frac{\eta}{\mathrm{softplus}(\sqrt{v_t})}\odot g_t\Big\rangle+\frac{L\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\mathbb{E}\sum_{j=1}^d\Big[\frac{1}{\mathrm{softplus}(\sqrt{v_{t-1,j}})^2}-\frac{1}{\mathrm{softplus}(\sqrt{v_{t,j}})^2}\Big]+L\eta^2\mu_4^2(\sigma^2+G^2)$

For the inner-product term,

$\mathbb{E}\Big\langle\nabla f(x_t),\frac{1}{\mathrm{softplus}(\sqrt{v_t})}\odot g_t\Big\rangle\ge\mathbb{E}\Big[\sum_{j:\nabla f(x_t)_j g_{t,j}\ge 0}\mu_3\nabla f(x_t)_j g_{t,j}+\sum_{j:\nabla f(x_t)_j g_{t,j}<0}\mu_4\nabla f(x_t)_j g_{t,j}\Big]\ge\mu_3\|\nabla f(x_t)\|^2$

By rearranging, we obtain

$\eta\mu_3\,\mathbb{E}\|\nabla f(x_t)\|^2\le\mathbb{E}[f(z_t)-f(z_{t+1})]+\frac{\eta\beta_1}{1-\beta_1}G\sqrt{\sigma^2+G^2}\,\mathbb{E}\sum_{j=1}^d\Big(\frac{1}{\mathrm{softplus}(\sqrt{v_{t-1,j}})}-\frac{1}{\mathrm{softplus}(\sqrt{v_{t,j}})}\Big)+\frac{L^2\eta^2\mu_4^2}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2(\sigma^2+G^2)+\frac{\eta^2\mu_4^2}{2}(\sigma^2+G^2)+\frac{L\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\mathbb{E}\sum_{j=1}^d\Big[\frac{1}{\mathrm{softplus}(\sqrt{v_{t-1,j}})^2}-\frac{1}{\mathrm{softplus}(\sqrt{v_{t,j}})^2}\Big]+L\eta^2\mu_4^2(\sigma^2+G^2)$

Dividing both sides by $\eta\mu_3$, summing from t = 1 to T (the coordinate-wise sums telescope), using $v_0=0$ with $\frac{1}{\mathrm{softplus}(0)}=\mu_4$, dividing by T, and applying $G\sqrt{\sigma^2+G^2}\le\sigma^2+G^2$, we get

$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla f(x_t)\|^2\le\frac{1}{\eta\mu_3 T}\mathbb{E}[f(z_1)-f^*]+\frac{\beta_1 d(\mu_4-\mu_3)}{(1-\beta_1)\mu_3 T}+\Big[\frac{L^2\eta\mu_4^2}{2\mu_3}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2+\frac{\eta\mu_4^2}{2\mu_3}+\frac{L\eta\beta_1^2 d(\mu_4^2-\mu_3^2)}{(1-\beta_1)^2\mu_3 T}+\frac{L\eta\mu_4^2}{\mu_3}\Big](\sigma^2+G^2)$

Setting $\eta=\frac{1}{\sqrt{T}}$ and $x_0=x_1$ (so that $z_1=x_1$, $f(z_1)=f(x_1)$), we derive the final result for the Sadam method:

$\min_{t=1,\dots,T}\mathbb{E}\|\nabla f(x_t)\|^2\le\frac{1}{\mu_3\sqrt{T}}\mathbb{E}[f(x_1)-f^*]+\frac{\beta_1 d(\mu_4-\mu_3)}{(1-\beta_1)\mu_3 T}+\Big[\frac{L^2\mu_4^2}{2\mu_3\sqrt{T}}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2+\frac{\mu_4^2}{2\mu_3\sqrt{T}}+\frac{L\beta_1^2 d(\mu_4^2-\mu_3^2)}{(1-\beta_1)^2\mu_3 T\sqrt{T}}+\frac{L\mu_4^2}{\mu_3\sqrt{T}}\Big](\sigma^2+G^2)=\frac{C_1}{\sqrt{T}}+\frac{C_2}{T}+\frac{C_3}{T\sqrt{T}}$

where

$C_1=\frac{1}{\mu_3}\big(f(x_1)-f^*\big)+\Big[\frac{L^2\mu_4^2}{2\mu_3}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2+\frac{\mu_4^2}{2\mu_3}+\frac{L\mu_4^2}{\mu_3}\Big](\sigma^2+G^2),\qquad C_2=\frac{\beta_1(\mu_4-\mu_3)d}{(1-\beta_1)\mu_3},\qquad C_3=\frac{L\beta_1^2 d(\mu_4^2-\mu_3^2)}{(1-\beta_1)^2\mu_3}.$

With fixed $L,\sigma,G,\beta_1$, we have $C_1=O(\beta^2)$, $C_2=O(d\beta)$, $C_3=O(d\beta^2)$. Therefore,

$\min_{t=1,\dots,T}\mathbb{E}\|\nabla f(x_t)\|^2\le O\Big(\frac{\beta^2}{\sqrt{T}}+\frac{d\beta}{T}+\frac{d\beta^2}{T\sqrt{T}}\Big)$

Thus, we obtain the sublinear convergence rate of Sadam in the nonconvex setting, which is of the same order as Adam's and recovers the well-known result of SGD [1] in nonconvex optimization in terms of T.

Remark D.4.11. The leading term of the above bound is $C_1/\sqrt{T}$; β plays an essential role in the complexity, and a more accurate rate is $O\big(\frac{\beta\log(1+e^{\beta})}{\sqrt{T}}\big)$. When β is chosen big, this becomes $O\big(\frac{\beta^2}{\sqrt{T}}\big)$, somewhat like Adam's $O\big(\frac{1}{\epsilon^2\sqrt{T}}\big)$, which guides us toward a suitable range of β; when β is chosen small, this becomes $O\big(\frac{1}{\sqrt{T}}\big)$, so the computational complexity gets close to the SGD case, and β is a much smaller number than 1/є, indicating that Sadam converges faster. This also supports the analysis of the range of the A-LR, $1/\mathrm{softplus}(\sqrt{v_t})$, in our main paper.

D.4.4. Non-strongly Convex

In previous works, the convex case has been well studied for adaptive gradient methods. AMSGrad and the later PAMSGrad both use a projection when minimizing the objective function; here we show a different way of proof in the non-strongly convex case. For consistency, we still follow the construction of the sequence {z_t}.

Starting from convexity,

$f(y)\ge f(x)+\nabla f(x)^T(y-x).$

Then, for any $x\in\mathbb{R}^d$, ∀t ∈ [1,T],

$\langle\nabla f(x_t),x_t-x^*\rangle\ge f(x_t)-f^*,$ (16)

where $f^*=f(x^*)$ and $x^*$ is the optimal solution.

Proof. Adam case:

In the updating rule of the Adam optimizer, $x_{t+1}=x_t-\frac{\eta_t}{\sqrt{v_t}+\epsilon}\odot m_t$; we set the step size to be fixed, $\eta_t=\eta$, and assume $v_t\ge v_{t-1}$ holds. Using the previous results,

$\mathbb{E}\|z_{t+1}-x^*\|^2=\mathbb{E}\Big\|z_t+\frac{\eta\beta_1}{1-\beta_1}\Big(\frac{1}{\sqrt{v_{t-1}}+\epsilon}-\frac{1}{\sqrt{v_t}+\epsilon}\Big)\odot m_{t-1}-\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t-x^*\Big\|^2$
$=\mathbb{E}\|z_t-x^*\|^2+\mathbb{E}\Big\|\frac{\eta\beta_1}{1-\beta_1}\Big(\frac{1}{\sqrt{v_{t-1}}+\epsilon}-\frac{1}{\sqrt{v_t}+\epsilon}\Big)\odot m_{t-1}-\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t\Big\|^2+2\mathbb{E}\Big\langle\frac{\eta\beta_1}{1-\beta_1}\Big(\frac{1}{\sqrt{v_{t-1}}+\epsilon}-\frac{1}{\sqrt{v_t}+\epsilon}\Big)\odot m_{t-1},z_t-x^*\Big\rangle-2\mathbb{E}\Big\langle\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t,z_t-x^*\Big\rangle$
$\le\mathbb{E}\|z_t-x^*\|^2+\frac{2\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\mathbb{E}\sum_{j=1}^d\Big[\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big]+2\eta^2\mu_2^2(\sigma^2+G^2)+\frac{2\eta\beta_1}{1-\beta_1}\mathbb{E}\Big\langle\Big(\frac{1}{\sqrt{v_{t-1}}+\epsilon}-\frac{1}{\sqrt{v_t}+\epsilon}\Big)\odot m_{t-1},z_t-x^*\Big\rangle-2\eta\,\mathbb{E}\Big\langle\frac{1}{\sqrt{v_t}+\epsilon}\odot g_t,z_t-x^*\Big\rangle$

The inequality holds because $\|a-b\|^2\le 2\|a\|^2+2\|b\|^2$ and because of Lemmas D.4.3, D.4.4, and D.4.6.

Since $\langle a,b\rangle\le\frac{1}{2\eta}\|a\|^2+\frac{\eta}{2}\|b\|^2$,

$2\mathbb{E}\Big\langle\Big(\frac{1}{\sqrt{v_{t-1}}+\epsilon}-\frac{1}{\sqrt{v_t}+\epsilon}\Big)\odot m_{t-1},z_t-x^*\Big\rangle\le\frac{1}{\eta}(\sigma^2+G^2)\,\mathbb{E}\sum_{j=1}^d\Big[\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big]+\eta\,\mathbb{E}\|z_t-x^*\|^2$

From the definition of $z_t$ and convexity, $\langle\nabla f(x_t),x_t-x^*\rangle\ge f(x_t)-f^*\ge 0$, and since $x_t-x_{t-1}=-\frac{\eta}{\sqrt{v_{t-1}}+\epsilon}\odot m_{t-1}$,

$-2\eta\,\mathbb{E}\Big\langle\frac{1}{\sqrt{v_t}+\epsilon}\odot g_t,z_t-x^*\Big\rangle=-2\eta\,\mathbb{E}\Big\langle\frac{1}{\sqrt{v_t}+\epsilon}\odot g_t,x_t-x^*\Big\rangle+\frac{2\eta^2\beta_1}{1-\beta_1}\mathbb{E}\Big\langle\frac{1}{\sqrt{v_t}+\epsilon}\odot g_t,\frac{1}{\sqrt{v_{t-1}}+\epsilon}\odot m_{t-1}\Big\rangle\le-2\eta\mu_1\,\mathbb{E}\langle\nabla f(x_t),x_t-x^*\rangle+\frac{2\eta^2\beta_1\mu_2^2}{1-\beta_1}(\sigma^2+G^2)\le-2\eta\mu_1\,\mathbb{E}[f(x_t)-f^*]+\frac{2\eta^2\beta_1\mu_2^2}{1-\beta_1}(\sigma^2+G^2)$

Plugging in the previous two inequalities, rearranging, and dividing by $2\eta\mu_1$,

$\mathbb{E}[f(x_t)-f^*]\le\frac{1}{2\eta\mu_1}\big(\mathbb{E}\|z_t-x^*\|^2-\mathbb{E}\|z_{t+1}-x^*\|^2\big)+\frac{\eta\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2\mu_1}\mathbb{E}\sum_{j=1}^d\Big[\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big]+\frac{\eta\mu_2^2}{\mu_1}(\sigma^2+G^2)+\frac{\beta_1(\sigma^2+G^2)}{2\eta\mu_1(1-\beta_1)}\mathbb{E}\sum_{j=1}^d\Big[\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big]+\frac{\eta\beta_1}{2\mu_1(1-\beta_1)}\mathbb{E}\|z_t-x^*\|^2+\frac{\eta\beta_1\mu_2^2}{(1-\beta_1)\mu_1}(\sigma^2+G^2)$

Assume that $\mathbb{E}\|x_t-x^*\|\le D$ for all t and $\mathbb{E}\|x_m-x_n\|\le D_\infty$ for any m ≠ n; then $\mathbb{E}\|z_t-x^*\|^2$ can be bounded:

$\mathbb{E}\|z_1-x^*\|^2=\mathbb{E}\|x_1-x^*\|^2\le D^2$ (17)
$\mathbb{E}\|z_t-x^*\|^2=\mathbb{E}\Big\|x_t-x^*+\frac{\beta_1}{1-\beta_1}(x_t-x_{t-1})\Big\|^2\le 2\,\mathbb{E}\|x_t-x^*\|^2+\frac{2\beta_1^2}{(1-\beta_1)^2}\mathbb{E}\|x_t-x_{t-1}\|^2\le 2D^2+\frac{2\beta_1^2}{(1-\beta_1)^2}D_\infty^2.$ (18)

Summing from t = 1 to T (the coordinate-wise sums telescope; the $-\mathbb{E}\|z_T-x^*\|^2$ term is dropped based on the fact that, when iteration t reaches the maximum number T, $x_t$ is the optimal solution and $z_T=x^*$), and using $v_0=0$, we get

$\sum_{t=1}^{T}\mathbb{E}[f(x_t)-f^*]\le\frac{D^2}{2\eta\mu_1}+\frac{\eta\beta_1^2 d(\sigma^2+G^2)(\mu_2^2-\mu_1^2)}{(1-\beta_1)^2\mu_1}+\frac{\eta\mu_2^2 T}{\mu_1}(\sigma^2+G^2)+\frac{\beta_1 d(\sigma^2+G^2)(\mu_2^2-\mu_1^2)}{2\eta\mu_1(1-\beta_1)}+\frac{\eta\beta_1 D^2 T}{\mu_1(1-\beta_1)}+\frac{\eta\beta_1^3 D_\infty^2 T}{\mu_1(1-\beta_1)^3}+\frac{\eta\beta_1\mu_2^2 T}{(1-\beta_1)\mu_1}(\sigma^2+G^2)$

By Jensen's inequality, $\frac{1}{T}\sum_{t=1}^{T}\big(f(x_t)-f^*\big)\ge f(\bar{x}_t)-f^*$, where $\bar{x}_t=\frac{1}{T}\sum_{t=1}^{T}x_t$. Then,

$f(\bar{x}_t)-f^*\le\frac{D^2}{2\eta\mu_1 T}+\frac{\eta\beta_1^2 d(\sigma^2+G^2)(\mu_2^2-\mu_1^2)}{(1-\beta_1)^2\mu_1 T}+\frac{\eta\mu_2^2(\sigma^2+G^2)}{\mu_1}+\frac{\beta_1 d(\sigma^2+G^2)(\mu_2^2-\mu_1^2)}{2\eta\mu_1(1-\beta_1)T}+\frac{\eta\beta_1 D^2}{\mu_1(1-\beta_1)}+\frac{\eta\beta_1^3 D_\infty^2}{\mu_1(1-\beta_1)^3}+\frac{\eta\beta_1\mu_2^2(\sigma^2+G^2)}{(1-\beta_1)\mu_1}$

Plugging in the step size $\eta=O\big(\frac{1}{\sqrt{T}}\big)$ completes the proof of Adam in the non-strongly convex case:

$f(\bar{x}_t)-f^*\le\frac{D^2}{2\mu_1\sqrt{T}}+\frac{\beta_1^2 d(\sigma^2+G^2)(\mu_2^2-\mu_1^2)}{(1-\beta_1)^2\mu_1 T\sqrt{T}}+\frac{\mu_2^2(\sigma^2+G^2)}{\mu_1\sqrt{T}}+\frac{\beta_1 d(\sigma^2+G^2)(\mu_2^2-\mu_1^2)}{2\mu_1(1-\beta_1)\sqrt{T}}+\frac{\beta_1 D^2}{\mu_1(1-\beta_1)\sqrt{T}}+\frac{\beta_1^3 D_\infty^2}{\mu_1(1-\beta_1)^3\sqrt{T}}+\frac{\beta_1\mu_2^2(\sigma^2+G^2)}{(1-\beta_1)\mu_1\sqrt{T}}=O\Big(\frac{1}{\sqrt{T}}\Big)+O\Big(\frac{1}{T\sqrt{T}}\Big)=O\Big(\frac{1}{\sqrt{T}}\Big).$ ◻

Remark D.4.12. The leading term of the convergence bound of Adam is $O\big(\frac{\tilde C}{\sqrt{T}}\big)$, where $\tilde C=\frac{D^2}{2\mu_1}+\frac{\mu_2^2(\sigma^2+G^2)}{\mu_1}+\frac{\beta_1 d(\sigma^2+G^2)(\mu_2^2-\mu_1^2)}{2\mu_1(1-\beta_1)}+\frac{\beta_1 D^2}{\mu_1(1-\beta_1)}+\frac{\beta_1^3 D_\infty^2}{\mu_1(1-\beta_1)^3}+\frac{\beta_1\mu_2^2(\sigma^2+G^2)}{(1-\beta_1)\mu_1}$. With fixed $L,\sigma,G,\beta_1,D,D_\infty$, $\tilde C=O\big(\frac{d}{\epsilon^2}\big)$, which contains $1/\epsilon^2$ as well as the dimension d; with a bigger є the order is better, which also supports the discussion in our main paper.

The analysis of Sadam is similar to that of Adam; replacing the bound pair (µ1,µ2) with (µ3,µ4), we briefly give the convergence result below.

Proof. Sadam case:

$f(\bar{x}_t)-f^*\le\frac{D^2}{2\eta\mu_3 T}+\frac{\eta\beta_1^2 d(\sigma^2+G^2)(\mu_4^2-\mu_3^2)}{(1-\beta_1)^2\mu_3 T}+\frac{\eta\mu_4^2(\sigma^2+G^2)}{\mu_3}+\frac{\beta_1 d(\sigma^2+G^2)(\mu_4^2-\mu_3^2)}{2\eta\mu_3(1-\beta_1)T}+\frac{\eta\beta_1 D^2}{\mu_3(1-\beta_1)}+\frac{\eta\beta_1^3 D_\infty^2}{\mu_3(1-\beta_1)^3}+\frac{\eta\beta_1\mu_4^2(\sigma^2+G^2)}{(1-\beta_1)\mu_3}$

Plugging in the step size $\eta=O\big(\frac{1}{\sqrt{T}}\big)$, we get the convergence rate of Sadam in the non-strongly convex case:

$f(\bar{x}_t)-f^*\le\frac{D^2}{2\mu_3\sqrt{T}}+\frac{\beta_1^2 d(\sigma^2+G^2)(\mu_4^2-\mu_3^2)}{(1-\beta_1)^2\mu_3 T\sqrt{T}}+\frac{\mu_4^2(\sigma^2+G^2)}{\mu_3\sqrt{T}}+\frac{\beta_1 d(\sigma^2+G^2)(\mu_4^2-\mu_3^2)}{2\mu_3(1-\beta_1)\sqrt{T}}+\frac{\beta_1 D^2}{\mu_3(1-\beta_1)\sqrt{T}}+\frac{\beta_1^3 D_\infty^2}{\mu_3(1-\beta_1)^3\sqrt{T}}+\frac{\beta_1\mu_4^2(\sigma^2+G^2)}{(1-\beta_1)\mu_3\sqrt{T}}=O\Big(\frac{1}{\sqrt{T}}\Big)+O\Big(\frac{1}{T\sqrt{T}}\Big)=O\Big(\frac{1}{\sqrt{T}}\Big).$

For brevity,

$f(\bar{x}_t)-f^*=O\Big(\frac{1}{\sqrt{T}}\Big).$ ◻

Remark D.4.13. The leading term of the convergence bound of Sadam is $O\big(\frac{\tilde C}{\sqrt{T}}\big)$, where $\tilde C=\frac{D^2}{2\mu_3}+\frac{\mu_4^2(\sigma^2+G^2)}{\mu_3}+\frac{\beta_1 d(\sigma^2+G^2)(\mu_4^2-\mu_3^2)}{2\mu_3(1-\beta_1)}+\frac{\beta_1 D^2}{\mu_3(1-\beta_1)}+\frac{\beta_1^3 D_\infty^2}{\mu_3(1-\beta_1)^3}+\frac{\beta_1\mu_4^2(\sigma^2+G^2)}{(1-\beta_1)\mu_3}$. With fixed $L,\sigma,G,\beta_1,D,D_\infty$, $\tilde C=O\big(d\beta\log(1+e^{\beta})\big)=O(d\beta^2)$; with small β, Sadam approaches the SGD convergence rate, and β is a much smaller number than 1/є, showing that the Sadam method performs better than Adam in terms of the convergence rate.

D.4.5. P-L Condition

If the strong convexity assumption holds, we can easily deduce the P-L condition (see Lemma D.4.14), which shows that the P-L condition is much weaker than strong convexity. We further prove the convergence of the Adam-type optimizers (Adam and Sadam) under the P-L condition in the non-strongly convex case, and the result extends to the strongly convex case as well.

Lemma D.4.14. Suppose that f is continuously differentiable and strongly convex with parameter \gamma, and let x^* denote its unique minimizer with f^* = f(x^*). Then for any x \in \mathbb{R}^d, we have

\|\nabla f(x)\|^2 \ge 2\gamma\big(f(x)-f^*\big).

Proof. From the strong convexity assumption,

f^* \ge f(x) + \nabla f(x)^{\top}(x^*-x) + \frac{\gamma}{2}\|x^*-x\|^2
  \ge f(x) + \min_{\xi}\Big[\nabla f(x)^{\top}\xi + \frac{\gamma}{2}\|\xi\|^2\Big]
  = f(x) - \frac{1}{2\gamma}\|\nabla f(x)\|^2

Here the second inequality sets \xi = x^*-x, and the quadratic function of \xi attains its minimum at \xi = -\frac{\nabla f(x)}{\gamma}, which gives the last equality; rearranging yields the stated bound. ◻
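As a quick check of this bound (our own worked example, not part of the original lemma), consider the quadratic f(x) = \frac{\gamma}{2}\|x-x^*\|^2, for which the inequality holds with equality:

\|\nabla f(x)\|^2 = \gamma^2\|x-x^*\|^2 = 2\gamma\cdot\frac{\gamma}{2}\|x-x^*\|^2 = 2\gamma\big(f(x)-f^*\big).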

We restate our theorems under the P-L condition.

Theorem D.4.15. Suppose that f(x) satisfies Assumption 1 and the P-L condition (with parameter \lambda) in the non-strongly convex case, and that v_t \succeq v_{t-1}. Let \eta_t = \eta = O\!\left(\frac{1}{T^2}\right);

then Adam and Sadam have the convergence rate

\mathbb{E}\big[f(x_T)\big]-f^* \le O\!\left(\frac{1}{T}\right).

Proof. Adam case:

Starting from L-smoothness and borrowing the previous results, we already have

\mathbb{E}\big[f(z_{t+1})\big] - \mathbb{E}\big[f(z_t)\big] \le \frac{\eta\beta_1}{1-\beta_1}G\sqrt{\sigma^2+G^2}\,\mathbb{E}\sum_{j=1}^{d}\Big[\frac{1}{\sqrt{v_{t-1,j}}+\epsilon}-\frac{1}{\sqrt{v_{t,j}}+\epsilon}\Big]
  + \frac{L\eta^2\mu_2^2}{2}\cdot\frac{2\beta_1}{(1-\beta_1)^2}(\sigma^2+G^2) + \frac{\eta^2\mu_2^2}{2}(\sigma^2+G^2)
  - \mathbb{E}\Big\langle\nabla f(x_t),\,\frac{\eta}{\sqrt{v_t}+\epsilon}\odot g_t\Big\rangle
  + \frac{L\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^{d}\Big[\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big]
  + L\eta^2\mu_2^2(\sigma^2+G^2)
\mathbb{E}\Big\langle\nabla f(x_t),\,\frac{1}{\sqrt{v_t}+\epsilon}\odot g_t\Big\rangle \ge \mu_1\,\mathbb{E}\|\nabla f(x_t)\|^2

Therefore, we get:

\mathbb{E}\big[f(z_{t+1})\big] - \mathbb{E}\big[f(z_t)\big] \le \frac{\eta\beta_1}{1-\beta_1}G\sqrt{\sigma^2+G^2}\,\mathbb{E}\sum_{j=1}^{d}\Big[\frac{1}{\sqrt{v_{t-1,j}}+\epsilon}-\frac{1}{\sqrt{v_{t,j}}+\epsilon}\Big]
  + \frac{L\eta^2\mu_2^2}{2}\cdot\frac{2\beta_1}{(1-\beta_1)^2}(\sigma^2+G^2) + \frac{\eta^2\mu_2^2}{2}(\sigma^2+G^2)
  - \eta\mu_1\,\mathbb{E}\|\nabla f(x_t)\|^2
  + \frac{L\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^{d}\Big[\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big]
  + L\eta^2\mu_2^2(\sigma^2+G^2)

From the P-L condition,

\mathbb{E}\big[f(z_{t+1})\big] \le \mathbb{E}\big[f(z_t)\big] + \frac{\eta\beta_1}{1-\beta_1}G\sqrt{\sigma^2+G^2}\,\mathbb{E}\sum_{j=1}^{d}\Big[\frac{1}{\sqrt{v_{t-1,j}}+\epsilon}-\frac{1}{\sqrt{v_{t,j}}+\epsilon}\Big]
  + \frac{L\eta^2\mu_2^2}{2}\cdot\frac{2\beta_1}{(1-\beta_1)^2}(\sigma^2+G^2) + \frac{\eta^2\mu_2^2}{2}(\sigma^2+G^2)
  - 2\lambda\eta\mu_1\,\mathbb{E}\big[f(x_t)-f^*\big]
  + \frac{L\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^{d}\Big[\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big]
  + L\eta^2\mu_2^2(\sigma^2+G^2)

From convexity,

f(z_{t+1}) \ge f(x_{t+1}) + \frac{\beta_1}{1-\beta_1}\big\langle\nabla f(x_{t+1}),\,x_{t+1}-x_t\big\rangle
  = f(x_{t+1}) - \frac{\beta_1}{1-\beta_1}\Big\langle\nabla f(x_{t+1}),\,\frac{\eta}{\sqrt{v_t}+\epsilon}\odot m_t\Big\rangle

From L-smoothness,

f(z_t) \le f(x_t) + \frac{\beta_1}{1-\beta_1}\big\langle\nabla f(x_t),\,x_t-x_{t-1}\big\rangle + \frac{L}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\|x_t-x_{t-1}\|^2.

Then we can obtain

\mathbb{E}\big[f(x_{t+1})\big] - \frac{\beta_1}{1-\beta_1}\,\mathbb{E}\Big\langle\nabla f(x_{t+1}),\,\frac{\eta}{\sqrt{v_t}+\epsilon}\odot m_t\Big\rangle
\le \mathbb{E}\big[f(x_t)\big] + \frac{\beta_1}{1-\beta_1}\,\mathbb{E}\big\langle\nabla f(x_t),\,x_t-x_{t-1}\big\rangle + \frac{L}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\mathbb{E}\|x_t-x_{t-1}\|^2
  + \frac{\eta\beta_1}{1-\beta_1}G\sqrt{\sigma^2+G^2}\,\mathbb{E}\sum_{j=1}^{d}\Big[\frac{1}{\sqrt{v_{t-1,j}}+\epsilon}-\frac{1}{\sqrt{v_{t,j}}+\epsilon}\Big]
  + \frac{L\eta^2\mu_2^2}{2}\cdot\frac{2\beta_1}{(1-\beta_1)^2}(\sigma^2+G^2) + \frac{\eta^2\mu_2^2}{2}(\sigma^2+G^2)
  - 2\lambda\eta\mu_1\,\mathbb{E}\big[f(x_t)-f^*\big]
  + \frac{L\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^{d}\Big[\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big]
  + L\eta^2\mu_2^2(\sigma^2+G^2)
= \mathbb{E}\big[f(x_t)\big] - \frac{\beta_1}{1-\beta_1}\,\mathbb{E}\Big\langle\nabla f(x_t),\,\frac{\eta}{\sqrt{v_{t-1}}+\epsilon}\odot m_{t-1}\Big\rangle
  + \frac{L\eta^2}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\mathbb{E}\Big\|\frac{1}{\sqrt{v_{t-1}}+\epsilon}\odot m_{t-1}\Big\|^2
  + \frac{\eta\beta_1}{1-\beta_1}G\sqrt{\sigma^2+G^2}\,\mathbb{E}\sum_{j=1}^{d}\Big[\frac{1}{\sqrt{v_{t-1,j}}+\epsilon}-\frac{1}{\sqrt{v_{t,j}}+\epsilon}\Big]
  + \frac{L\eta^2\mu_2^2}{2}\cdot\frac{2\beta_1}{(1-\beta_1)^2}(\sigma^2+G^2) + \frac{\eta^2\mu_2^2}{2}(\sigma^2+G^2)
  - 2\lambda\eta\mu_1\,\mathbb{E}\big[f(x_t)-f^*\big]
  + \frac{L\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^{d}\Big[\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big]
  + L\eta^2\mu_2^2(\sigma^2+G^2)

By rearranging,

\mathbb{E}\big[f(x_{t+1})\big] \le \mathbb{E}\big[f(x_t)\big]
  + \frac{\beta_1\eta}{1-\beta_1}\Big[\mathbb{E}\Big\langle\nabla f(x_{t+1}),\,\frac{1}{\sqrt{v_t}+\epsilon}\odot m_t\Big\rangle - \mathbb{E}\Big\langle\nabla f(x_t),\,\frac{1}{\sqrt{v_{t-1}}+\epsilon}\odot m_{t-1}\Big\rangle\Big]
  + \frac{L\eta^2}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2\mathbb{E}\Big\|\frac{1}{\sqrt{v_{t-1}}+\epsilon}\odot m_{t-1}\Big\|^2
  + \frac{\eta\beta_1}{1-\beta_1}G\sqrt{\sigma^2+G^2}\,\mathbb{E}\sum_{j=1}^{d}\Big[\frac{1}{\sqrt{v_{t-1,j}}+\epsilon}-\frac{1}{\sqrt{v_{t,j}}+\epsilon}\Big]
  + \frac{L\eta^2\mu_2^2}{2}\cdot\frac{2\beta_1}{(1-\beta_1)^2}(\sigma^2+G^2) + \frac{\eta^2\mu_2^2}{2}(\sigma^2+G^2)
  - 2\lambda\eta\mu_1\,\mathbb{E}\big[f(x_t)-f^*\big]
  + \frac{L\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^{d}\Big[\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big]
  + L\eta^2\mu_2^2(\sigma^2+G^2)

From the fact that \pm\langle a,b\rangle \le \frac{1}{2}\|a\|^2 + \frac{1}{2}\|b\|^2, together with Lemmas D.4.1 and D.4.4,

\mathbb{E}\Big\langle\nabla f(x_t),\,\frac{1}{\sqrt{v_{t-1}}+\epsilon}\odot m_{t-1}\Big\rangle
  = \mathbb{E}\Big\langle\nabla f(x_t)\odot\sqrt{\tfrac{1}{\sqrt{v_{t-1}}+\epsilon}},\; m_{t-1}\odot\sqrt{\tfrac{1}{\sqrt{v_{t-1}}+\epsilon}}\Big\rangle
  \le \frac{G^2\mu_2}{2} + \frac{(\sigma^2+G^2)\mu_2}{2} \le (\sigma^2+G^2)\mu_2

Similarly,

\mathbb{E}\Big\langle\nabla f(x_{t+1}),\,\frac{1}{\sqrt{v_t}+\epsilon}\odot m_t\Big\rangle
  = \mathbb{E}\Big\langle\nabla f(x_{t+1})\odot\sqrt{\tfrac{1}{\sqrt{v_t}+\epsilon}},\; m_t\odot\sqrt{\tfrac{1}{\sqrt{v_t}+\epsilon}}\Big\rangle
  \le \frac{G^2\mu_2}{2} + \frac{(\sigma^2+G^2)\mu_2}{2} \le (\sigma^2+G^2)\mu_2

Then,

\mathbb{E}\big[f(x_{t+1})\big] \le \mathbb{E}\big[f(x_t)\big] + \frac{2\beta_1\eta\mu_2}{1-\beta_1}(\sigma^2+G^2) + \frac{L\eta^2\mu_2^2}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2(\sigma^2+G^2)
  + \frac{\eta\beta_1}{1-\beta_1}G\sqrt{\sigma^2+G^2}\,\mathbb{E}\sum_{j=1}^{d}\Big[\frac{1}{\sqrt{v_{t-1,j}}+\epsilon}-\frac{1}{\sqrt{v_{t,j}}+\epsilon}\Big]
  + \frac{L\eta^2\mu_2^2}{2}\cdot\frac{2\beta_1}{(1-\beta_1)^2}(\sigma^2+G^2) + \frac{\eta^2\mu_2^2}{2}(\sigma^2+G^2)
  - 2\lambda\eta\mu_1\,\mathbb{E}\big[f(x_t)-f^*\big]
  + \frac{L\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^{d}\Big[\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big]
  + L\eta^2\mu_2^2(\sigma^2+G^2)
\mathbb{E}\big[f(x_{t+1})\big]-f^* \le \big(1-2\lambda\eta\mu_1\big)\,\mathbb{E}\big[f(x_t)-f^*\big] + \frac{2\beta_1\eta\mu_2}{1-\beta_1}(\sigma^2+G^2) + \frac{L\eta^2\mu_2^2}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2(\sigma^2+G^2)
  + \frac{\eta\beta_1}{1-\beta_1}G\sqrt{\sigma^2+G^2}\,\mathbb{E}\sum_{j=1}^{d}\Big[\frac{1}{\sqrt{v_{t-1,j}}+\epsilon}-\frac{1}{\sqrt{v_{t,j}}+\epsilon}\Big]
  + \frac{L\eta^2\mu_2^2}{2}\cdot\frac{2\beta_1}{(1-\beta_1)^2}(\sigma^2+G^2) + \frac{\eta^2\mu_2^2}{2}(\sigma^2+G^2)
  + \frac{L\eta^2\beta_1^2(\sigma^2+G^2)}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^{d}\Big[\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big]
  + L\eta^2\mu_2^2(\sigma^2+G^2)
\le \big(1-2\lambda\eta\mu_1\big)\,\mathbb{E}\big[f(x_t)-f^*\big]
  + \Bigg[\frac{2\beta_1\eta\mu_2}{1-\beta_1} + \frac{L\eta^2\mu_2^2}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2
  + \frac{\eta\beta_1}{1-\beta_1}\,\mathbb{E}\sum_{j=1}^{d}\Big(\frac{1}{\sqrt{v_{t-1,j}}+\epsilon}-\frac{1}{\sqrt{v_{t,j}}+\epsilon}\Big)
  + \frac{L\eta^2\mu_2^2}{2}\cdot\frac{2\beta_1}{(1-\beta_1)^2} + \frac{\eta^2\mu_2^2}{2}
  + \frac{L\eta^2\beta_1^2}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^{d}\Big(\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big)
  + L\eta^2\mu_2^2\Bigg](\sigma^2+G^2)

The last inequality holds because G\sqrt{\sigma^2+G^2} \le \sigma^2+G^2.

Let

\theta = 1 - 2\lambda\eta\mu_1,
\Theta_t = \Bigg[\frac{2\beta_1\eta\mu_2}{1-\beta_1} + \frac{L\eta^2\mu_2^2}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2
  + \frac{\eta\beta_1}{1-\beta_1}\,\mathbb{E}\sum_{j=1}^{d}\Big(\frac{1}{\sqrt{v_{t-1,j}}+\epsilon}-\frac{1}{\sqrt{v_{t,j}}+\epsilon}\Big)
  + \frac{L\eta^2\mu_2^2}{2}\cdot\frac{2\beta_1}{(1-\beta_1)^2} + \frac{\eta^2\mu_2^2}{2}
  + \frac{L\eta^2\beta_1^2}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^{d}\Big(\frac{1}{(\sqrt{v_{t-1,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{t,j}}+\epsilon)^2}\Big)
  + L\eta^2\mu_2^2\Bigg](\sigma^2+G^2),

then we have

\mathbb{E}\big[f(x_{t+1})\big]-f^* \le \theta\,\mathbb{E}\big[f(x_t)-f^*\big] + \Theta_t.

Let \Phi_t = \mathbb{E}\big[f(x_t)\big]-f^*, so that \Phi_1 = \mathbb{E}\big[f(x_1)\big]-f^*. Then

\Phi_{t+1} \le \theta\Phi_t + \Theta_t \le \theta^2\Phi_{t-1} + \theta\Theta_{t-1} + \Theta_t
  \le \cdots \le \theta^{t}\Phi_1 + \theta^{t-1}\Theta_1 + \cdots + \theta\Theta_{t-1} + \Theta_t
  \overset{\theta<1}{\le} \theta^{t}\Phi_1 + \Theta_1 + \cdots + \Theta_{t-1} + \Theta_t.

Let t = T,

\Phi_{T+1} \le \theta^{T}\Phi_1 + \Theta_1 + \cdots + \Theta_{T-1} + \Theta_T
\le \theta^{T}\Phi_1 + \Bigg[\frac{2\beta_1\eta\mu_2 T}{1-\beta_1} + \frac{L\eta^2\mu_2^2 T}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2
  + \frac{\eta\beta_1}{1-\beta_1}\,\mathbb{E}\sum_{j=1}^{d}\Big(\frac{1}{\sqrt{v_{0,j}}+\epsilon}-\frac{1}{\sqrt{v_{T,j}}+\epsilon}\Big)
  + \frac{L\eta^2\mu_2^2 T}{2}\cdot\frac{2\beta_1}{(1-\beta_1)^2} + \frac{\eta^2\mu_2^2 T}{2}
  + \frac{L\eta^2\beta_1^2}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^{d}\Big(\frac{1}{(\sqrt{v_{0,j}}+\epsilon)^2}-\frac{1}{(\sqrt{v_{T,j}}+\epsilon)^2}\Big)
  + L\eta^2\mu_2^2 T\Bigg](\sigma^2+G^2)
\le \theta^{T}\Phi_1 + \Bigg[\frac{2\beta_1\eta\mu_2 T}{1-\beta_1} + \frac{L\eta^2\mu_2^2 T}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2
  + \frac{\eta\beta_1 d}{1-\beta_1}\big(\mu_2-\mu_1\big) + \frac{L\eta^2\mu_2^2 T}{2}\cdot\frac{2\beta_1}{(1-\beta_1)^2}
  + \frac{\eta^2\mu_2^2 T}{2} + \frac{L\eta^2\beta_1^2 d}{(1-\beta_1)^2}\big(\mu_2^2-\mu_1^2\big) + L\eta^2\mu_2^2 T\Bigg](\sigma^2+G^2)
= \theta^{T}\Phi_1 + O(\eta T) + O(\eta^2 T) + O(\eta) + O(\eta^2)

From the above inequality, \eta should be set smaller than O\!\left(\frac{1}{T}\right) so that all terms on the right-hand side become sufficiently small.

Set \eta = \frac{1}{T^2}; then \theta = 1 - 2\lambda\eta\mu_1 = 1 - \frac{2\lambda\mu_1}{T^2}, and

\Phi_{T+1} \le \theta^{T}\Phi_1 + O\!\left(\frac{1}{T}\right) + O\!\left(\frac{1}{T^3}\right) + O\!\left(\frac{1}{T^2}\right) + O\!\left(\frac{1}{T^4}\right)
  = \theta^{T}\Phi_1 + O\!\left(\frac{1}{T}\right) \rightarrow 0

With an appropriate \eta, we can derive the convergence rate under the P-L condition (and hence in the strongly convex case).

The proof for Sadam is exactly the same as that for Adam after replacing the bound pair (\mu_3, \mu_4) for (\mu_1, \mu_2) (so that \theta = 1 - 2\lambda\eta\mu_3), and we can also get:

\Phi_{T+1} \le \theta^{T}\Phi_1 + \Theta_1 + \cdots + \Theta_{T-1} + \Theta_T
\le \theta^{T}\Phi_1 + \Bigg[\frac{2\beta_1\eta\mu_4 T}{1-\beta_1} + \frac{L\eta^2\mu_4^2 T}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2
  + \frac{\eta\beta_1}{1-\beta_1}\,\mathbb{E}\sum_{j=1}^{d}\Big(\frac{1}{\mathrm{softplus}(\sqrt{v_{0,j}})}-\frac{1}{\mathrm{softplus}(\sqrt{v_{T,j}})}\Big)
  + \frac{L\eta^2\mu_4^2 T}{2}\cdot\frac{2\beta_1}{(1-\beta_1)^2} + \frac{\eta^2\mu_4^2 T}{2}
  + \frac{L\eta^2\beta_1^2}{(1-\beta_1)^2}\,\mathbb{E}\sum_{j=1}^{d}\Big(\frac{1}{\mathrm{softplus}(\sqrt{v_{0,j}})^2}-\frac{1}{\mathrm{softplus}(\sqrt{v_{T,j}})^2}\Big)
  + L\eta^2\mu_4^2 T\Bigg](\sigma^2+G^2)
\le \theta^{T}\Phi_1 + \Bigg[\frac{2\beta_1\eta\mu_4 T}{1-\beta_1} + \frac{L\eta^2\mu_4^2 T}{2}\Big(\frac{\beta_1}{1-\beta_1}\Big)^2
  + \frac{\eta\beta_1 d}{1-\beta_1}\big(\mu_4-\mu_3\big) + \frac{L\eta^2\mu_4^2 T}{2}\cdot\frac{2\beta_1}{(1-\beta_1)^2}
  + \frac{\eta^2\mu_4^2 T}{2} + \frac{L\eta^2\beta_1^2 d}{(1-\beta_1)^2}\big(\mu_4^2-\mu_3^2\big) + L\eta^2\mu_4^2 T\Bigg](\sigma^2+G^2)
= \theta^{T}\Phi_1 + O(\eta T) + O(\eta^2 T) + O(\eta) + O(\eta^2)

By setting an appropriate \eta, we can also prove that Sadam converges under the P-L condition (and in the strongly convex case).

Set \eta = O\!\left(\frac{1}{T^2}\right); then

\mathbb{E}\big[f(x_{T+1})\big]-f^* \le \left(1-\frac{2\lambda\mu_3}{T^2}\right)^{T}\Big(\mathbb{E}\big[f(x_1)\big]-f^*\Big) + O\!\left(\frac{1}{T}\right).

Overall, we have proved the convergence of the Adam and Sadam algorithms under all commonly used conditions. Our designed algorithms always enjoy the same convergence rate as Adam, and can achieve better constants with an appropriate choice of the \beta parameter of the softplus function. The proof procedure can easily be extended to other adaptive gradient algorithms, and these theoretical results support the discussion and experiments in our main paper.
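For reference, here is a minimal NumPy sketch of the single update step analyzed above. It is our own illustration rather than the paper's official implementation: bias correction is omitted, the default hyper-parameter values are assumptions, and the softplus calibration is taken in the form \mathrm{softplus}_\beta(x) = \frac{1}{\beta}\log(1+e^{\beta x}).

import numpy as np

def adam_like_step(x, m, v, grad, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, softplus_beta=None):
    # One Adam-style step; if softplus_beta is given, the A-LR denominator is
    # calibrated with the softplus function as in Sadam (bias correction omitted).
    m = beta1 * m + (1 - beta1) * grad          # first-order momentum
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-order momentum
    if softplus_beta is None:
        denom = np.sqrt(v) + eps                # Adam: step ~ 1/(sqrt(v) + eps)
    else:
        b = softplus_beta                       # Sadam: step ~ 1/softplus_b(sqrt(v))
        denom = np.logaddexp(0.0, b * np.sqrt(v)) / b   # stable log(1 + exp(.))
    x = x - lr * m / denom
    return x, m, v

# Usage: x, m, v = adam_like_step(x, m, v, grad)                    # Adam
#        x, m, v = adam_like_step(x, m, v, grad, softplus_beta=50)  # Sadam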

Footnotes

1

The PAdam in [19] actually used AMSGrad, and for clear comparison, we named it PAMSGrad. In our experiments, we also compared with the Adam that used the A-LR formula with p, which we named PAdam.
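For concreteness, a hedged sketch of the partially adaptive A-LR referred to in this footnote (our reading of [19]; the default value of p and the placement of \epsilon are assumptions, and PAMSGrad would additionally keep the AMSGrad-style running maximum of v):

def partially_adaptive_denominator(v, p=0.125, eps=1e-8):
    # A-LR with exponent p: divide the update by v**p instead of sqrt(v);
    # p = 0.5 recovers the usual Adam/AMSGrad scaling.
    return v ** p + eps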


References

  • [1].Ghadimi S, Lan G, Stochastic first-and zeroth-order methods for nonconvex stochastic programming, SIAM Journal on Optimization 23 (4) (2013) 2341–2368.
  • [2].Wright SJ, Nocedal J, Numerical optimization, Springer Science 35 (6768) (1999) 7.
  • [3].Wilson AC, Recht B, Jordan MI, A lyapunov analysis of momentum methods in optimization, arXiv preprint arXiv:1611.02635
  • [4].Yang T, Lin Q, Li Z, Unified convergence analysis of stochastic momentum methods for convex and non-convex optimization, arXiv preprint arXiv:1604.03257
  • [5].Duchi J, Hazan E, Singer Y, Adaptive subgradient methods for online learning and stochastic optimization, Journal of Machine Learning Research 12 (Jul) (2011) 2121–2159.
  • [6].Kingma DP, Ba J, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980
  • [7].Zeiler MD, Adadelta: an adaptive learning rate method, arXiv preprint arXiv:1212.5701
  • [8].Tieleman T, Hinton G, Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude, COURSERA: Neural networks for machine learning 4 (2) (2012) 26–31.
  • [9].He K, Zhang X, Ren S, Sun J, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
  • [10].Zagoruyko S, Komodakis N, Wide residual networks, arXiv preprint arXiv:1605.07146
  • [11].Huang G, Liu Z, Van Der Maaten L, Weinberger KQ, Densely connected convolutional networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
  • [12].Reddi SJ, Kale S, Kumar S, On the convergence of adam and beyond
  • [13].Luo L, Xiong Y, Liu Y, Sun X, Adaptive gradient methods with dynamic bound of learning rate, arXiv preprint arXiv:1902.09843
  • [14].Zaheer M, Reddi S, Sachan D, Kale S, Kumar S, Adaptive methods for nonconvex optimization, in: Advances in Neural Information Processing Systems, 2018, pp. 9815–9825.
  • [15].Zhou D, Tang Y, Yang Z, Cao Y, Gu Q, On the convergence of adaptive gradient methods for nonconvex optimization, arXiv preprint arXiv:1808.05671
  • [16].Chen X, Liu S, Sun R, Hong M, On the convergence of a class of adam-type algorithms for non-convex optimization, arXiv preprint arXiv:1808.02941
  • [17].De S, Mukherjee A, Ullah E, Convergence guarantees for rmsprop and adam in non-convex optimization and an empirical comparison to nesterov acceleration
  • [18].Staib M, Reddi SJ, Kale S, Kumar S, Sra S, Escaping saddle points with adaptive gradient methods, arXiv preprint arXiv:1901.09149
  • [19].Chen J, Gu Q, Closing the generalization gap of adaptive gradient methods in training deep neural networks, arXiv preprint arXiv:1806.06763
  • [20].Reddi SJ, Hefny A, Sra S, Poczos B, Smola AJ, On variance reduction in stochastic gradient descent and its asynchronous variants, in: Advances in Neural Information Processing Systems, 2015, pp. 2647–2655.
  • [21].Kleinberg R, Li Y, Yuan Y, An alternative view: When does sgd escape local minima?, in: International Conference on Machine Learning, 2018, pp. 2703–2712.
  • [22].Keskar NS, Mudigere D, Nocedal J, Smelyanskiy M, Tang PTP, On large-batch training for deep learning: Generalization gap and sharp minima, arXiv preprint arXiv:1609.04836
  • [23].Chaudhari P, Choromanska A, Soatto S, LeCun Y, Baldassi C, Borgs C, Chayes J, Sagun L, Zecchina R, Entropy-sgd: Biasing gradient descent into wide valleys, arXiv preprint arXiv:1611.01838
  • [24].Li H, Xu Z, Taylor G, Studer C, Goldstein T, Visualizing the loss landscape of neural nets, in: Advances in Neural Information Processing Systems, 2018, pp. 6389–6399.
  • [25].Dozat T, Incorporating nesterov momentum into adam
  • [26].Simonyan K, Zisserman A, Very deep convolutional networks for largescale image recognition, arXiv preprint arXiv:1409.1556
  • [27].Merity S, Keskar NS, Socher R, Regularizing and Optimizing LSTM Language Models, arXiv preprint arXiv:1708.02182
  • [28].Keskar NS, Socher R, Improving generalization performance by switching from adam to sgd, arXiv preprint arXiv:1712.07628
  • [29].Mikolov T, Karafiát M, Burget L, Černockỳ J, Khudanpur S, Recurrent neural network based language model, in: Eleventh annual conference of the international speech communication association, 2010.
  • [30].Bradbury J, Merity S, Xiong C, Socher R, Quasi-recurrent neural networks, arXiv preprint arXiv:1611.01576
