
An Adaptive Empirical Bayesian Method for Sparse Deep Learning

Wei Deng 1, Xiao Zhang 2, Faming Liang 3, Guang Lin 4

Abstract

We propose a novel adaptive empirical Bayesian (AEB) method for sparse deep learning, where sparsity is ensured via a class of self-adaptive spike-and-slab priors. The proposed method works by alternately sampling from an adaptive hierarchical posterior distribution using stochastic gradient Markov chain Monte Carlo (MCMC) and smoothly optimizing the hyperparameters using stochastic approximation (SA). We further prove the convergence of the proposed method to the asymptotically correct distribution under mild conditions. Empirical applications of the proposed method lead to state-of-the-art performance on MNIST and Fashion MNIST with shallow convolutional neural networks (CNNs) and state-of-the-art compression performance on CIFAR10 with Residual Networks. The proposed method also improves resistance to adversarial attacks.

1. Introduction

MCMC, known for its asymptotic properties, has not been fully investigated in deep neural networks (DNNs) due to its poor scalability to big data. Stochastic gradient Langevin dynamics (SGLD) [Welling and Teh, 2011], the first stochastic gradient MCMC (SG-MCMC) algorithm, tackled this issue by adding noise to the stochastic gradient, smoothing the transition between optimization and sampling and making MCMC scalable. Chen et al. [2014] proposed stochastic gradient Hamiltonian Monte Carlo (SGHMC), a second-order SG-MCMC method, which was shown to converge faster. In addition to modeling uncertainty, SG-MCMC also has remarkable non-convex optimization abilities. Raginsky et al. [2017] and Xu et al. [2018] proved that SGLD, the first-order SG-MCMC, is guaranteed to converge to an approximate global minimum of the empirical risk in finite time. Zhang et al. [2017] showed that SGLD hits an approximate local minimum of the population risk in polynomial time. Mangoubi and Vishnoi [2018] further demonstrated that SGLD with simulated annealing has a higher chance of obtaining the global minimum for a wider class of non-convex functions. However, all these analyses fail when the DNN has too many parameters, and the over-specified model tends to have a large prediction variance, resulting in poor generalization and over-fitting. Therefore, proper model selection is needed in this setting.

A standard way to deal with model selection is variable selection. Notably, best-subset variable selection based on the L0 penalty is conceptually ideal for sparsity detection but is computationally slow. Two alternatives emerged to approximate it. On the one hand, penalized likelihood approaches, such as Lasso [Tibshirani, 1994], induce sparsity through the geometry underlying the L1 penalty; to better handle highly correlated variables, Elastic Net [Zou and Hastie, 2005] was proposed as a compromise between the L1 and L2 penalties. On the other hand, spike-and-slab approaches to Bayesian variable selection originate from probabilistic considerations. George and McCulloch [1993] proposed a continuous approximation of the spike-and-slab prior and sampled from the resulting hierarchical Bayesian model using Gibbs sampling. This continuous relaxation inspired the efficient EM variable selection (EMVS) algorithm in linear models [Ročková and George, 2014, 2018].

Despite the advances of model selection in linear systems, model selection in DNNs has received less attention. Ghosh et al. [2018] proposed to use variational inference (VI) based on regularized horseshoe priors to obtain a compact model. Liang et al. [2018] presented the theory of posterior consistency for Bayesian neural networks (BNNs) with Gaussian priors, and Ye and Sun [2018] applied a greedy elimination algorithm to conduct group model selection with the group Lasso penalty. Although these works only show the performance of shallow BNNs, the experimental methodologies imply the potential of model selection in DNNs. Louizos et al. [2017] studied scale mixtures of Gaussian priors and half-Cauchy scale priors for the hidden units of VGG models [Simonyan and Zisserman, 2014] and achieved good model compression performance on CIFAR10 [Krizhevsky, 2009] using VI. However, due to the limitation of VI in non-convex optimization, the compression is still not sparse enough and can be further optimized.

Over-parameterized DNNs often demand tremendous memory and heavy computational resources, which is impractical for smart devices. More critically, over-parameterization frequently overfits the data and results in worse performance [Lin et al., 2017]. To ensure the efficiency of the sparse sampling algorithm without over-shrinkage in DNN models, we propose an AEB method that adaptively samples from a hierarchical Bayesian DNN model with spike-and-slab Gaussian-Laplace (SSGL) priors, where the priors are learned through optimization instead of sampling. The AEB method differs from the full Bayesian method in that the priors are inferred from the empirical data and the uncertainty of the priors is no longer considered, which speeds up the inference. To optimize the latent variables without affecting the convergence to the asymptotically correct distribution, stochastic approximation (SA) [Benveniste et al., 1990], a standard method for adaptive sampling [Andrieu et al., 2005, Liang, 2010], is a natural fit for training the adaptive hierarchical Bayesian model.

In this paper, we propose a sparse Bayesian deep learning algorithm, SG-MCMC-SA, to adaptively learn the hierarchical Bayes mixture models in DNNs. This algorithm has four main contributions:

  • We propose a novel AEB method to efficiently train hierarchical Bayesian mixture DNN models, where the parameters are learned through sampling while the priors are learned through optimization.

  • We prove the convergence of this approach to the asymptotically correct distribution, and it can be further generalized to a class of adaptive sampling algorithms for estimating state-space models in deep learning.

  • We are the first to apply this adaptive sampling algorithm to DNN compression, with potential extensions to a variety of model compression problems.

  • The method achieves state-of-the-art compression: 91.68% accuracy on CIFAR10 with Resnet20 [He et al., 2016] using only 27K parameters (90% sparsity).

2. Stochastic Gradient MCMC

We denote the set of model parameters by $\beta$, the learning rate at time $k$ by $\epsilon^{(k)}$, the entire data by $\mathcal{D} = \{d_i\}_{i=1}^{N}$, where $d_i = (x_i, y_i)$, and the log-posterior by $L(\beta)$. The mini-batch of data $\mathcal{B}$ is of size $n$ with indices $\mathcal{S} = \{s_1, s_2, \ldots, s_n\}$, where $s_i \in \{1, 2, \ldots, N\}$. The stochastic gradient $\nabla_\beta \tilde{L}(\beta)$ from a mini-batch $\mathcal{B}$ randomly sampled from $\mathcal{D}$ is used to approximate $\nabla_\beta L(\beta)$:

\nabla_\beta \tilde{L}(\beta) = \nabla_\beta \log P(\beta) + \frac{N}{n}\sum_{i\in\mathcal{S}} \nabla_\beta \log P(d_i \mid \beta). (1)

SGLD (no momentum) is formulated as follows:

\beta^{(k+1)} = \beta^{(k)} + \epsilon^{(k)} \nabla_\beta \tilde{L}(\beta^{(k)}) + \mathcal{N}(0, 2\epsilon^{(k)}\tau^{-1}), (2)

where τ > 0 denotes the inverse temperature. It has been shown that SGLD asymptotically converges to the stationary distribution $\pi(\beta \mid \mathcal{D}) \propto e^{\tau L(\beta)}$ [Teh et al., 2016, Zhang et al., 2017]. As τ increases and ϵ decreases gradually, the solution approaches the global optimum with higher probability. Another SG-MCMC variant, SGHMC [Chen et al., 2014, Ma et al., 2015], generates samples as follows:

\begin{cases} d\beta = r\,dt, \\ dr = \nabla_\beta \tilde{L}(\beta)\,dt - C r\,dt + \mathcal{N}(0, 2B\tau^{-1}dt) + \mathcal{N}(0, 2(C - \hat{B})\tau^{-1}dt), \end{cases} (3)

where $r$ is the momentum term, $\hat{B}$ is an estimate of the stochastic-gradient noise variance, and $C$ is a user-specified friction term. For the discretization of (3), we follow the numerical method proposed by Saatci and Wilson [2017] because it conveniently imports parameter settings from SGD.
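For concreteness, the following is a minimal NumPy sketch of one SGLD step (Eq. (2)) and one discretized SGHMC step (Eq. (3)); grad_log_post stands for the mini-batch gradient estimate of Eq. (1), and the default values of C, B_hat and tau are illustrative rather than the settings used in our experiments.

import numpy as np

def sgld_step(beta, grad_log_post, eps, tau=1.0, rng=None):
    """One SGLD update, Eq. (2): a noisy gradient step on the log-posterior."""
    rng = rng or np.random.default_rng()
    noise = rng.normal(0.0, np.sqrt(2.0 * eps / tau), size=beta.shape)
    return beta + eps * grad_log_post(beta) + noise

def sghmc_step(beta, r, grad_log_post, eps, C=0.1, B_hat=0.0, tau=1.0, rng=None):
    """One discretized SGHMC update, Eq. (3). Only the 2(C - B_hat) noise is injected,
    since the 2B term in Eq. (3) models the noise already carried by the stochastic gradient."""
    rng = rng or np.random.default_rng()
    noise = rng.normal(0.0, np.sqrt(2.0 * (C - B_hat) * eps / tau), size=beta.shape)
    r = r + eps * grad_log_post(beta) - eps * C * r + noise
    return beta + eps * r, r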

3. Empirical Bayesian via Stochastic Approximation

3.1. A hierarchical formulation with deep SSGL priors

Inspired by the hierarchical Bayesian formulation for sparse inference [George and McCulloch, 1993], we assume that the weight $\beta_{lj}$, with index $j$ in sparse layer $l$, follows the SSGL prior

\beta_{lj} \mid \sigma^2, \gamma_{lj} \sim (1 - \gamma_{lj})\,\mathcal{L}(0, \sigma v_0) + \gamma_{lj}\,\mathcal{N}(0, \sigma^2 v_1), (4)

where $\gamma_{lj} \in \{0, 1\}$, $\beta_l \in \mathbb{R}^{p_l}$, $\sigma^2 \in \mathbb{R}^{+}$, $\mathcal{L}(0, \sigma v_0)$ denotes a Laplace distribution with mean 0 and scale $\sigma v_0$, and $\mathcal{N}(0, \sigma^2 v_1)$ denotes a normal distribution with mean 0 and variance $\sigma^2 v_1$. The sparse layers can be the fully connected (FC) layers in a shallow CNN or the convolutional layers in a ResNet. When $\gamma_{lj} = 0$, the prior behaves like Lasso, which leads to a shrinkage effect; when $\gamma_{lj} = 1$, the L2 penalty dominates. The likelihood follows

\pi(\mathcal{B} \mid \beta, \sigma^2) =
\begin{cases}
\dfrac{\exp\{-\sum_{i\in\mathcal{S}} (y_i - \psi(x_i;\beta))^2 / (2\sigma^2)\}}{(2\pi\sigma^2)^{n/2}} & \text{(regression)}, \\
\prod_{i\in\mathcal{S}} \dfrac{\exp\{\psi_{y_i}(x_i;\beta)\}}{\sum_{t=1}^{K} \exp\{\psi_t(x_i;\beta)\}} & \text{(classification)},
\end{cases} (5)

where $\psi(x_i; \beta)$ is a linear or non-linear mapping, and $y_i \in \{1, 2, \ldots, K\}$ is the response value of the $i$-th example. In addition, the variance $\sigma^2$ follows an inverse gamma prior $\pi(\sigma^2) = \mathrm{IG}(\nu/2, \nu\lambda/2)$. An i.i.d. Bernoulli prior is used for $\gamma$, namely $\pi(\gamma_l \mid \delta_l) = \delta_l^{|\gamma_l|}(1 - \delta_l)^{p_l - |\gamma_l|}$, where $\delta_l$ follows a Beta distribution, $\pi(\delta_l) \propto \delta_l^{a-1}(1 - \delta_l)^{b-1}$. The self-adaptive penalty enables the model to learn the level of sparsity automatically. Finally, our posterior follows

\pi(\beta, \sigma^2, \delta, \gamma \mid \mathcal{B}) \propto \pi(\mathcal{B} \mid \beta, \sigma^2)^{N/n}\,\pi(\beta \mid \sigma^2, \gamma)\,\pi(\sigma^2 \mid \gamma)\,\pi(\gamma \mid \delta)\,\pi(\delta). (6)
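To make the prior concrete, the sketch below evaluates the log of the SSGL mixture in Eq. (4) for the weights of one sparse layer, with the binary indicators marginalized out under P(γ_lj = 1) = δ_l; this marginal form is shown only for intuition, since the algorithm below works with the conditional prior and expectations over γ.

import numpy as np

def ssgl_log_prior(beta, delta, sigma, v0=0.1, v1=10.0):
    """Log marginal SSGL prior: Laplace spike L(0, sigma*v0) mixed with Gaussian slab N(0, sigma^2*v1)."""
    log_spike = -np.log(2.0 * sigma * v0) - np.abs(beta) / (sigma * v0)
    log_slab = -0.5 * np.log(2.0 * np.pi * sigma**2 * v1) - beta**2 / (2.0 * sigma**2 * v1)
    # stable log-sum-exp of the two components, weighted by (1 - delta) and delta
    m = np.maximum(log_spike, log_slab)
    mix = m + np.log((1.0 - delta) * np.exp(log_spike - m) + delta * np.exp(log_slab - m))
    return np.sum(mix)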

3.2. Empirical Bayesian with approximate priors

To speed up the inference, we propose the AEB method: we sample β and optimize σ², δ, γ, ignoring the uncertainty of the hyperparameters. Because the binary variable γ is hard to optimize directly, we consider optimizing the adaptive posterior $E_{\gamma\mid\cdot,\mathcal{D}}[\pi(\beta, \sigma^2, \delta, \gamma \mid \mathcal{D})]$* instead. Because limited memory restricts us from sampling directly from $\mathcal{D}$, we choose to sample β from $E_{\gamma\mid\cdot,\mathcal{D}}[E_{\mathcal{B}}[\pi(\beta, \sigma^2, \delta, \gamma \mid \mathcal{B})]]$. By Fubini’s theorem and Jensen’s inequality, we have

\log E_{\gamma\mid\cdot,\mathcal{D}}\big[E_{\mathcal{B}}[\pi(\beta,\sigma^2,\delta,\gamma \mid \mathcal{B})]\big] = \log E_{\mathcal{B}}\big[E_{\gamma\mid\cdot,\mathcal{D}}[\pi(\beta,\sigma^2,\delta,\gamma \mid \mathcal{B})]\big] \ge E_{\mathcal{B}}\big[\log E_{\gamma\mid\cdot,\mathcal{D}}[\pi(\beta,\sigma^2,\delta,\gamma \mid \mathcal{B})]\big] \ge E_{\mathcal{B}}\big[E_{\gamma\mid\cdot,\mathcal{D}}[\log \pi(\beta,\sigma^2,\delta,\gamma \mid \mathcal{B})]\big]. (7)

Instead of tackling $\pi(\beta, \sigma^2, \delta, \gamma \mid \mathcal{D})$ directly, we propose to iteratively update the lower bound $Q$

Q(\beta, \sigma, \delta \mid \beta^{(k)}, \sigma^{(k)}, \delta^{(k)}) = E_{\mathcal{B}}\big[E_{\gamma\mid\cdot,\mathcal{D}}[\log \pi(\beta, \sigma^2, \delta, \gamma \mid \mathcal{B})]\big]. (8)

Given $(\beta^{(k)}, \sigma^{(k)}, \delta^{(k)})$ at the $k$-th iteration, we first sample $\beta^{(k+1)}$ from $Q$, and then optimize $Q$ with respect to σ, δ and $E_{\gamma_l\mid\cdot,\mathcal{D}}$ via SA, where the expectation $E_{\gamma_l\mid\cdot,\mathcal{D}}$ appears because γ is treated as an unobserved variable. To make the computation easier, we decompose $Q$ as follows:

Q(\beta, \sigma, \delta \mid \beta^{(k)}, \sigma^{(k)}, \delta^{(k)}) = Q_1(\beta, \sigma \mid \beta^{(k)}, \sigma^{(k)}, \delta^{(k)}) + Q_2(\delta \mid \beta^{(k)}, \sigma^{(k)}, \delta^{(k)}) + C, (9)

where $C$ is a constant. Denote by $\mathcal{X}$ and $\mathcal{C}$ the sets of indices of the sparse and non-sparse layers, respectively. We have:

Q_1(\beta, \sigma \mid \beta^{(k)}, \sigma^{(k)}, \delta^{(k)}) = \underbrace{\frac{N}{n}\log \pi(\mathcal{B} \mid \beta, \sigma^2)}_{\text{log likelihood}} - \underbrace{\sum_{l\in\mathcal{C}}\sum_{j}^{p_l} \frac{\beta_{lj}^2}{2\sigma_0^2}}_{\text{non-sparse layers } \mathcal{C}} - \frac{p+\nu+2}{2}\log(\sigma^2) - \underbrace{\sum_{l\in\mathcal{X}}\sum_{j}^{p_l}\Bigg[\frac{|\beta_{lj}|\overbrace{E_{\gamma_l\mid\cdot,\mathcal{D}}\big[\frac{1-\gamma_{lj}}{v_0}\big]}^{\kappa_{lj0}}}{\sigma} + \frac{\beta_{lj}^2\overbrace{E_{\gamma_l\mid\cdot,\mathcal{D}}\big[\frac{\gamma_{lj}}{v_1}\big]}^{\kappa_{lj1}}}{2\sigma^2}\Bigg]}_{\text{deep SSGL priors in sparse layers } \mathcal{X}} - \frac{\nu\lambda}{2\sigma^2}, (10)

Q_2(\delta \mid \beta^{(k)}, \sigma^{(k)}, \delta^{(k)}) = \sum_{l\in\mathcal{X}}\sum_{j}^{p_l} \log\Big(\frac{\delta_l}{1-\delta_l}\Big)\underbrace{E_{\gamma_l\mid\cdot,\mathcal{D}}[\gamma_{lj}]}_{\rho_{lj}} + (a-1)\log(\delta_l) + (p_l+b-1)\log(1-\delta_l), (11)

where ρ, κ, σ and δ are to be estimated in the next section.

3.3. Empirical Bayesian via stochastic approximation

To simplify the notation, we denote the vector (ρ, κ, σ, δ) by θ. Our interest is to obtain the optimal θ* based on the asymptotically correct distribution π(β, θ*). This implies that we need an estimate θ* that solves the fixed-point equation $\int g_{\theta^*}(\beta)\,\pi(\beta, \theta^*)\,d\beta = \theta^*$, where $g_\theta(\beta)$, inspired by EMVS, returns the optimal θ given the current β. Define the random output $g_\theta(\beta) - \theta$ as $H(\beta, \theta)$ and the mean-field function $h(\theta) = \mathbb{E}[H(\beta, \theta)]$. The stochastic approximation algorithm can be used to solve the fixed-point iterations:

  1. Sample $\beta^{(k+1)}$ from a transition kernel $\Pi_{\theta^{(k)}}(\beta)$, which yields the distribution $\pi(\beta, \theta^{(k)})$;

  2. Update $\theta^{(k+1)} = \theta^{(k)} + \omega^{(k+1)} H(\beta^{(k+1)}, \theta^{(k)}) = \theta^{(k)} + \omega^{(k+1)} \big(h(\theta^{(k)}) + \Omega^{(k)}\big)$,

where $\omega^{(k+1)}$ is the step size. The equilibrium point θ* is obtained when the distribution of β converges to the invariant distribution π(β, θ*). This stochastic approximation scheme [Benveniste et al., 1990] differs from the Robbins–Monro algorithm in that sampling β from a transition kernel, rather than directly from a distribution, introduces a Markov state-dependent noise $\Omega^{(k)}$ [Andrieu et al., 2005]. In addition, since the variational technique is only used to approximate the priors and the exact likelihood is unchanged, the algorithm falls into a class of adaptive SG-MCMC methods rather than variational inference.
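Written as Python pseudocode, the two-step scheme reads as follows; sample_beta stands for one SG-MCMC transition (Step 1), g_theta for the EMVS-inspired mapping whose components are derived below, step_size for ω^(k), and θ is treated as a flat array purely for convenience.

import numpy as np

def adaptive_sampler(theta0, sample_beta, g_theta, step_size, num_iter, beta0=None):
    """Generic SA loop solving E[g_theta(beta)] = theta while sampling beta."""
    beta, theta = beta0, np.asarray(theta0, dtype=float)
    for k in range(1, num_iter + 1):
        beta = sample_beta(beta, theta)                                 # Step 1: transition kernel
        theta = theta + step_size(k) * (g_theta(beta, theta) - theta)   # Step 2: SA update
    return beta, theta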

Regarding the update of $g_\theta(\beta)$ with respect to ρ, we denote the optimal ρ based on the current β and δ by $\tilde{\rho}$. Then $\tilde{\rho}_{lj}^{(k+1)}$, the probability of $\beta_{lj}$ being dominated by the L2 penalty, is

\tilde{\rho}_{lj}^{(k+1)} = E_{\gamma_l\mid\cdot,\mathcal{B}}[\gamma_{lj}] = P(\gamma_{lj} = 1 \mid \beta_l^{(k)}, \delta_l^{(k)}) = \frac{a_{lj}}{a_{lj} + b_{lj}}, (12)

where $a_{lj} = \pi(\beta_{lj}^{(k)} \mid \gamma_{lj} = 1)\,P(\gamma_{lj} = 1 \mid \delta_l^{(k)})$ and $b_{lj} = \pi(\beta_{lj}^{(k)} \mid \gamma_{lj} = 0)\,P(\gamma_{lj} = 0 \mid \delta_l^{(k)})$. The choice of Bernoulli prior enables us to use $P(\gamma_{lj} = 1 \mid \delta_l^{(k)}) = \delta_l^{(k)}$.

Similarly, for $g_\theta(\beta)$ with respect to κ, the optimal $\tilde{\kappa}_{lj0}$ and $\tilde{\kappa}_{lj1}$ based on the current $\rho_{lj}$ are given by:

\tilde{\kappa}_{lj0} = E_{\gamma_l\mid\cdot,\mathcal{B}}\Big[\frac{1-\gamma_{lj}}{v_0}\Big] = \frac{1-\rho_{lj}}{v_0}; \qquad \tilde{\kappa}_{lj1} = E_{\gamma_l\mid\cdot,\mathcal{B}}\Big[\frac{\gamma_{lj}}{v_1}\Big] = \frac{\rho_{lj}}{v_1}. (13)
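A direct sketch of updates (12)-(13), with the spike and slab densities of Eq. (4) written out explicitly and applied element-wise to the weights of sparse layer l:

import numpy as np

def update_rho_kappa(beta_l, delta_l, sigma, v0, v1):
    """Eqs. (12)-(13): inclusion probabilities rho and expected penalties kappa0, kappa1."""
    slab = np.exp(-beta_l**2 / (2.0 * sigma**2 * v1)) / np.sqrt(2.0 * np.pi * sigma**2 * v1)
    spike = np.exp(-np.abs(beta_l) / (sigma * v0)) / (2.0 * sigma * v0)
    a = slab * delta_l            # a_lj = pi(beta_lj | gamma_lj = 1) P(gamma_lj = 1 | delta_l)
    b = spike * (1.0 - delta_l)   # b_lj = pi(beta_lj | gamma_lj = 0) P(gamma_lj = 0 | delta_l)
    rho = a / (a + b)             # Eq. (12)
    kappa0 = (1.0 - rho) / v0     # Eq. (13)
    kappa1 = rho / v1
    return rho, kappa0, kappa1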

To optimize $Q_1$ with respect to σ, denoting $\mathrm{diag}\{\kappa_{li0}\}_{i=1}^{p_l}$ by $V_{0l}$ and $\mathrm{diag}\{\kappa_{li1}\}_{i=1}^{p_l}$ by $V_{1l}$, we have:

\tilde{\sigma}^{(k+1)} =
\begin{cases}
\dfrac{R_b + \sqrt{R_b^2 + 4 R_a R_c}}{2 R_a} & \text{(regression)}, \\
\dfrac{C_b + \sqrt{C_b^2 + 4 C_a C_c}}{2 C_a} & \text{(classification)},
\end{cases} (14)

where $R_a = N + \sum_{l\in\mathcal{X}} p_l + \nu$, $C_a = \sum_{l\in\mathcal{X}} p_l + \nu + 2$, $R_b = C_b = \sum_{l\in\mathcal{X}} \|V_{0l}\beta_l^{(k+1)}\|_1$, $R_c = I + J + \nu\lambda$, $C_c = J + \nu\lambda$, $I = \frac{N}{n}\sum_{i\in\mathcal{S}}(y_i - \psi(x_i;\beta^{(k+1)}))^2$, and $J = \sum_{l\in\mathcal{X}}\|V_{1l}^{1/2}\beta_l^{(k+1)}\|^2$.

To optimize Q2, a closed-form update can be derived from Eq.(11) and Eq.(12) given batch data B:

\tilde{\delta}_l^{(k+1)} = \arg\max_{\delta_l} Q_2(\delta_l \mid \beta_l^{(k)}, \delta_l^{(k)}) = \frac{\sum_{j=1}^{p_l}\rho_{lj} + a - 1}{a + b + p_l - 2}. (15)
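Once ρ and κ are available, the closed-form updates (14)-(15) are immediate; the sketch below covers the regression branch of Eq. (14), with the νλ constant written explicitly under our reconstruction of R_c, and the per-layer update of Eq. (15).

import numpy as np

def update_sigma_regression(N, p_sparse, nu, lam, weighted_l1, I, J):
    """Eq. (14), regression case: positive root of a quadratic in sigma.
    p_sparse = sum of p_l over sparse layers, weighted_l1 = sum_l ||V_{0l} beta_l||_1,
    I = (N/n) sum_i (y_i - psi(x_i; beta))^2, J = sum_l ||V_{1l}^{1/2} beta_l||^2."""
    Ra = N + p_sparse + nu
    Rb = weighted_l1
    Rc = I + J + nu * lam
    return (Rb + np.sqrt(Rb**2 + 4.0 * Ra * Rc)) / (2.0 * Ra)

def update_delta(rho_l, a, b):
    """Eq. (15): closed-form update of the inclusion rate of sparse layer l."""
    p_l = rho_l.size
    return (np.sum(rho_l) + a - 1.0) / (a + b + p_l - 2.0)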

3.4. Pruning strategy

There are quite a few methods for pruning neural networks, including oracle pruning and the easy-to-use magnitude-based pruning [Molchanov et al., 2017]. Although magnitude-based unit pruning shows more computational savings [Gomez et al., 2018], it does not demonstrate robustness under coarser pruning [Han et al., 2016, Gomez et al., 2018]. Pruning based on the probability ρ is also popular in the Bayesian community, but achieving the target sparsity in sophisticated networks requires extra fine-tuning. We instead apply magnitude-based weight pruning in our ResNet compression experiments; the resulting algorithm, detailed in Algorithm 1, is referred to as SGLD-SA, and the corresponding SGHMC variant with SA is referred to as SGHMC-SA.
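As a rough PyTorch sketch of magnitude-based weight pruning (not the authors' exact implementation), the helper below zeroes the bottom fraction of a layer's weights by absolute value and returns the binary mask, which would be re-applied after every parameter update to keep the pruned weights at zero.

import torch

def magnitude_prune_(weight: torch.Tensor, sparse_rate: float) -> torch.Tensor:
    """Zero out the `sparse_rate` fraction of lowest-magnitude entries in place; return the mask."""
    k = int(sparse_rate * weight.numel())
    if k == 0:
        return torch.ones_like(weight)
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).to(weight.dtype)
    weight.data.mul_(mask)
    return mask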

4. Convergence Analysis

The key to guaranteeing the convergence of the adaptive SGLD algorithm is to use Poisson’s equation to analyze additive functionals. By decomposing the Markov state-dependent noise Ω into martingale difference sequences and perturbations, where the latter can be controlled by the regularity of the solution of Poisson’s equation, we can guarantee the consistency of the latent variable estimators.

Theorem 1 (L2 convergence rate). For any α ∈ (0, 1], under assumptions in Appendix B.1, the algorithm satisfies: there exists a large enough constant λ and an equilibrium θ* such that

\mathbb{E}\big[\|\theta^{(k)} - \theta^*\|^2\big] \le \lambda k^{-\alpha}.

SGLD with adaptive latent variables forms a sequence of inhomogeneous Markov chains, and the weak convergence of β to the target posterior is equivalent to the weak convergence of SGLD with biased estimates of the gradients. Inspired by Chen et al. [2015], we have:

Corollary 1. Under the assumptions in Appendix B.2, the random vector $\beta^{(k)}$ from the adaptive transition kernel $\Pi_{\theta^{(k-1)}}$ converges weakly to the invariant distribution $e^{\tau L(\beta, \theta^*)}$ as ϵ → 0 and k → ∞.

The smooth optimization of the priors makes the algorithm robust to bad initialization and avoids entrapment in poor local optima. In addition, the convergence to the asymptotically correct distribution enables us to combine simulated annealing [Kirkpatrick et al., 1983], simulated tempering [Marinari and Parisi, 1992], parallel tempering [Swendsen and Wang, 1986] and/or dynamic weighting [Wong and Liang, 1997] to obtain better point estimates in non-convex optimization and more robust posterior averages in multi-modal sampling.

5. Experiments

5.1. Simulation of Large-p-Small-n Regression

We conduct linear regression experiments with a dataset containing n = 100 observations and p = 1000 predictors. The predictor values X (training set) are simulated from $\mathcal{N}_p(0, \Sigma)$,

Algorithm 1 SGLD-SA with SSGL priors
Initialize: $\beta^{(1)}$, $\rho^{(1)}$, $\kappa^{(1)}$, $\sigma^{(1)}$ and $\delta^{(1)}$ from scratch; set the target sparse rate S and the schedule constants D and ℧
for k ← 1 : k_max do
  Sampling: $\beta^{(k+1)} \leftarrow \beta^{(k)} + \epsilon^{(k)}\nabla_\beta Q(\beta^{(k)}) + \mathcal{N}(0, 2\epsilon^{(k)}\tau^{-1})$
  Stochastic approximation for the latent variables:
  SA: $\rho^{(k+1)} \leftarrow (1-\omega^{(k+1)})\rho^{(k)} + \omega^{(k+1)}\tilde{\rho}^{(k+1)}$ following Eq. (12)
  SA: $\kappa^{(k+1)} \leftarrow (1-\omega^{(k+1)})\kappa^{(k)} + \omega^{(k+1)}\tilde{\kappa}^{(k+1)}$ following Eq. (13)
  SA: $\sigma^{(k+1)} \leftarrow (1-\omega^{(k+1)})\sigma^{(k)} + \omega^{(k+1)}\tilde{\sigma}^{(k+1)}$ following Eq. (14)
  SA: $\delta^{(k+1)} \leftarrow (1-\omega^{(k+1)})\delta^{(k)} + \omega^{(k+1)}\tilde{\delta}^{(k+1)}$ following Eq. (15)
  if Pruning then
    Prune the bottom-s% lowest-magnitude weights
    Increase the sparse rate s ← S(1 − D^{k/℧})
  end if
end for

where $\Sigma = (\Sigma_{i,j})_{i,j=1}^{p}$ with $\Sigma_{i,j} = 0.6^{|i-j|}$. Response values y are generated from $X\beta + \eta$, where $\beta = (\beta_1, \beta_2, \beta_3, 0, 0, \ldots, 0)'$ and $\eta \sim \mathcal{N}_n(0, 3 I_n)$. We assume $\beta_1 \sim \mathcal{N}(3, \sigma_c^2)$, $\beta_2 \sim \mathcal{N}(2, \sigma_c^2)$, $\beta_3 \sim \mathcal{N}(1, \sigma_c^2)$ with $\sigma_c = 0.2$. We introduce some hyperparameters, but most of them are uninformative. We fix τ = 1, λ = 1, ν = 1, $v_1$ = 10, δ = 0.5, b = p and set a = 1. The learning rate follows $\epsilon^{(k)} = 0.001 \times k^{-1/3}$, and the step size is given by $\omega^{(k)} = 10 \times (k + 1000)^{-0.7}$. We vary $v_0$ and σ to show the robustness of SGLD-SA to different initializations. In addition, to show the superiority of the adaptive update, we compare SGLD-SA with a direct combination of EMVS and SGLD, referred to as SGLD-EM, which is equivalent to setting $\omega^{(k)} := 1$ in SGLD-SA. To obtain the stochastic gradient, we randomly select 50 observations and compute the numerical gradient. SGLD is sampled from the same hierarchical model without updating the latent variables.
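For reference, a NumPy sketch of this data-generating process (the seed and the routine for drawing the correlated predictors are illustrative choices of ours):

import numpy as np

rng = np.random.default_rng(0)
n, p, sigma_c = 100, 1000, 0.2

idx = np.arange(p)
Sigma = 0.6 ** np.abs(idx[:, None] - idx[None, :])          # Sigma_{i,j} = 0.6^{|i-j|}
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)     # training predictors

beta_true = np.zeros(p)
beta_true[:3] = rng.normal(loc=[3.0, 2.0, 1.0], scale=sigma_c)   # beta_1, beta_2, beta_3
y = X @ beta_true + rng.normal(scale=np.sqrt(3.0), size=n)       # eta ~ N_n(0, 3 I_n)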

We simulate 500,000 samples from the posterior distribution and also simulate a test set of 50 observations to evaluate prediction. As shown in Fig. 1(d), all three algorithms fit the training set very well; however, SGLD fails completely on the test set (Fig. 1(e)), indicating the over-fitting problem of SGLD without proper regularization when the latent variables are not updated. Fig. 1(f) shows that although SGLD-EM successfully identifies the right variables, the estimates are biased downward. The reason is that SGLD-EM fails to regularize the right variables with the L2 penalty, so the L1 penalty leads to a greater amount of shrinkage for β1, β2 and β3 (Fig. 1(a–c)), implying the importance of the adaptive update via SA in the stochastic optimization of the latent variables. In addition, from Fig. 1(a–c) we see that SGLD-SA is the only algorithm among the three that quantifies the uncertainties of β1, β2 and β3, and it always gives the best prediction, as shown in Table 1. We also notice that SGLD-SA is fairly robust to various hyperparameters.

Figure 1: Linear regression simulation when v0 = 0.1 and σ = 1.

Table 1:

Predictive errors in linear regression based on a test set considering different v0 and σ

MAE / MSE  v0=0.01, σ=2  v0=0.1, σ=2  v0=0.01, σ=1  v0=0.1, σ=1
SGLD-SA 1.89 / 5.56 1.72 / 5.64 1.48 / 3.51 1.54 / 4.42
SGLD-EM 3.49 / 19.31 2.23 / 8.22 2.23 / 19.28 2.07 / 6.94
SGLD 15.85 / 416.39 15.85 / 416.39 11.86 / 229.38 7.72 / 88.90

For the simulation of SGLD-SA in logistic regression and the evaluation of SGLD-SA on UCI datasets, the results are given in Appendix C and D.

5.2. Classification with Auto-tuning Hyperparameters

The following experiments are based on the non-pruning SG-MCMC-SA; the goal is to show that auto-tuned sparse priors help avoid over-fitting. The posterior average is applied to each Bayesian model. We implement all the algorithms in PyTorch [Paszke et al., 2017]. The first DNN is a standard 2-Conv-2-FC CNN model with 670K parameters (see details in Appendix D.1).

The first set of experiments compares methods on the same model without data augmentation (DA) or batch normalization (BN) [Ioffe and Szegedy, 2015]. We refer to the plain CNN without dropout as Vanilla, and to the model with a 50% dropout rate applied to the hidden units next to FC1 as Dropout.

Vanilla and Dropout models are trained with Adam [Kingma and Ba, 2014] using the PyTorch default parameters (learning rate 0.001). We use SGHMC as a benchmark method because it is also sampling-based and closely related to the popular momentum-based optimization approaches in DNNs. SGHMC-SA differs from SGHMC in that SGHMC-SA keeps updating the SSGL priors for the first FC layer, whereas they are fixed in SGHMC. We set the training batch size n = 1000, a, b = p and ν, λ = 1000. The hyperparameters for SGHMC-SA are set to $v_0$ = 1, $v_1$ = 0.1 and σ = 1 to regularize the over-fitted space. The learning rate is set to 5 × 10^{-7}, and the step size is $\omega^{(k)} = (k + 1000)^{-3/4}$. We use a thinning factor of 500 to avoid a cumbersome system. A fixed temperature can also be powerful in escaping “shallow” local traps [Zhang et al., 2017]; our inverse temperatures are set to τ = 1000 for MNIST and τ = 2500 for FMNIST.
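For readability, the settings above can be collected as follows (the dictionary and function names are ours and purely illustrative; the ambiguous prior constants a and b are omitted):

# Illustrative summary of the SGHMC-SA settings described above; names are ours.
sghmc_sa_config = {
    "batch_size": 1000,
    "learning_rate": 5e-7,
    "v0": 1.0, "v1": 0.1, "sigma": 1.0,       # SSGL prior hyperparameters
    "nu": 1000, "lambda": 1000,               # inverse gamma prior
    "thinning": 500,                          # keep one posterior sample every 500 iterations
    "tau": {"MNIST": 1000, "FMNIST": 2500},   # fixed inverse temperatures
}

def step_size(k):
    """SA step size omega^(k) = (k + 1000)^(-3/4)."""
    return (k + 1000) ** -0.75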

The four CNN models are tested on the MNIST and Fashion MNIST (FMNIST) [Xiao et al., 2017] datasets, with performance shown in Table 2. Our SGHMC-SA outperforms SGHMC on both datasets, and the posterior averages from SGHMC-SA and SGHMC perform much better than Vanilla and Dropout. Without using either DA or BN, SGHMC-SA achieves 99.59%, outperforming some state-of-the-art models such as Maxout Network (99.55%) [Goodfellow et al., 2013] and pSGLD (99.55%) [Li et al., 2016]. On FMNIST, SGHMC-SA obtains 93.01% accuracy, outperforming all other competing models.

Table 2:

Classification accuracy using shallow networks

Dataset MNIST DA-MNIST FMNIST DA-FMNIST
Vanilla 99.31 99.54 92.73 93.14
Dropout 99.38 99.56 92.81 93.35
SGHMC 99.47 99.63 92.88 94.29
SGHMC-SA 99.59 99.75 93.01 94.38

To further test the performance, we apply DA and BN in the following experiments (see details in Appendix D.2) and refer to the augmented datasets as DA-MNIST and DA-FMNIST. All these experiments are conducted with a 2-Conv-BN-3-FC CNN of 490K parameters. Using this model, we obtain the state-of-the-art 99.75% on DA-MNIST (200 epochs) and 94.38% on DA-FMNIST (1000 epochs), as shown in Table 2. The results are notable because the posterior average is computed over a single shallow CNN.

5.3. Defenses against Adversarial Attacks

Continuing with the setup in Sec. 5.2, the third set of experiments focuses on evaluating model robustness. We apply the Fast Gradient Sign method [Goodfellow et al., 2014] to generate the adversarial examples with one single gradient step:

x_{\text{adv}} \leftarrow x - \zeta\,\text{sign}\{\nabla_x \max_y \log P(y \mid x)\},

where ζ ranges over 0.1, 0.2, …, 0.5 to control the level of the adversarial attack.
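A minimal PyTorch sketch of this attack (the model, inputs and preprocessing are placeholders rather than the exact experimental code):

import torch
import torch.nn.functional as F

def fgsm_attack(model, x, zeta):
    """One-step FGSM: move x against the gradient of max_y log P(y | x)."""
    x = x.clone().detach().requires_grad_(True)
    log_probs = F.log_softmax(model(x), dim=1)
    log_probs.max(dim=1).values.sum().backward()   # sum over the batch before backpropagating
    x_adv = x - zeta * x.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()          # clip to the valid pixel range [0, 1]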

Similar to the setup in Li and Gal [2017], we normalize the adversarial images by clipping to the range [0, 1]. In Fig. 2(b) and Fig. 2(d), we see no significant difference among the four models in the early phase. As the degree of adversarial attack increases, the images become harder to recognize, as shown in Fig. 2(a) and Fig. 2(c). The performance of Vanilla decreases rapidly, reflecting its poor defense against adversarial attacks, while Dropout performs better than Vanilla but is still significantly worse than the sampling-based methods. The advantage of SGHMC-SA over SGHMC becomes more significant when ζ > 0.25. In the case of ζ = 0.5 on MNIST, where the images are hardly recognizable, both Vanilla and Dropout fail to identify the right images and their predictions are no better than random guesses, whereas SGHMC-SA achieves roughly 11% higher accuracy than these two models and 1% higher than SGHMC, demonstrating the robustness of SGHMC-SA.

Figure 2: Adversarial test accuracies based on adversarial images of different levels.

5.4. Residual Network Compression

Our compression experiments are conducted on the CIFAR-10 dataset [Krizhevsky, 2009] with DA. SGHMC and the non-adaptive SGHMC-EM are chosen as baselines. Simulated annealing is used to enhance the non-convex optimization and the methods with simulated annealing are referred to as A-SGHMC, A-SGHMC-EM and A-SGHMC-SA, respectively. We report the best point estimate.

We first use SGHMC to train a Resnet20 model and apply the magnitude-based criterion to prune the weights of all convolutional layers (except the very first one). All the following methods are evaluated with the same setup, differing only in the step sizes used to learn the latent variables. The sparse training takes 1000 epochs with mini-batch size 1000. The learning rate starts at 2e-9 and is divided by 10 at the 700th and 900th epochs. We set the inverse temperature τ to 1000 and multiply it by 1.005 every epoch. We fix ν = 1000 and λ = 1000 for the inverse gamma prior; v0 and v1 are tuned for each sparsity level to maximize performance. The smooth increase of the sparse rate follows the pruning rule in Algorithm 1, with D and ℧ set to 0.99 and 50, respectively; the sparse rate s increases quickly in the beginning and slowly in the later phase to avoid destroying the network structure. Weight decay in the non-sparse layers C is set to 25.
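The two schedules described here can be sketched as follows; the closed form of the sparse-rate schedule is our reading of Algorithm 1 with D = 0.99 and ℧ = 50 (written U below), so it should be taken as an approximation rather than the authors' exact rule.

def sparse_rate(k, S=0.9, D=0.99, U=50):
    """Pruning schedule of Algorithm 1: increases quickly at first, then levels off at S."""
    return S * (1.0 - D ** (k / U))

def inverse_temperature(epoch, tau0=1000.0, growth=1.005):
    """Simulated annealing schedule: tau starts at 1000 and is multiplied by 1.005 each epoch."""
    return tau0 * growth ** epoch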

As shown in Table 3, A-SGHMC-SA does not distinguish itself from A-SGHMC-EM and A-SGHMC when the sparse rate S is small, but it outperforms the baselines at larger sparse rates. The pretrained model has 93.90% accuracy, yet the prediction performance improves to the state-of-the-art 94.27% at 50% sparsity. Most notably, we obtain 91.68% accuracy with only 27K parameters (90% sparsity) in Resnet20. By contrast, targeted dropout obtained 91.48% accuracy with 47K parameters (90% sparsity) of Resnet32 [Gomez et al., 2018], and BC-GHS achieves 91.0% accuracy with 8M parameters (94.5% sparsity) of VGG models [Louizos et al., 2017]. We also notice that when simulated annealing is not used, as in SGHMC-SA, the performance decreases by 0.2% to 0.3%. With batch size 2000 and the inverse temperature schedule τ^(k) = 20 × 1.01^k, A-SGHMC-SA still achieves roughly the same level, while the prediction of SGHMC-SA can be 1% lower than that of A-SGHMC-SA.

Table 3:

Resnet20 Compression on CIFAR10. When S=0.9, we fix v0 = 0.005, v1 =1e-5; When S=0.7, we fix v0 = 0.1, v1 =5e-5; When S=0.5, we fix v0 = 0.1, v1 =5e-4; When S=0.3, we fix v0 = 0.5, v1 =1e-3.

Methods \ S 30% 50% 70% 90%
A-SGHMC 94.07 94.16 93.16 90.59
A-SGHMC-EM 94.18 94.19 93.41 91.26
SGHMC-SA 94.13 94.11 93.52 91.45
A-SGHMC-SA 94.23 94.27 93.74 91.68

6. Conclusion

We propose a novel AEB method to adaptively sample from hierarchical Bayesian DNNs and optimize the spike-and-slab priors, which yields a class of scalable adaptive sampling algorithms in DNNs. We prove the convergence of this approach to the asymptotically correct distribution. By adaptively searching and penalizing the over-fitted parameters, the proposed method achieves higher prediction accuracy over the traditional SG-MCMC methods in both simulated examples and real applications and shows more robustness towards adversarial attacks. Together with the magnitude-based weight pruning strategy and simulated annealing, the AEB-based method, A-SGHMC-SA, obtains the state-of-the-art performance in model compression.


Acknowledgments

We would like to thank Prof. Vinayak Rao, Dr. Yunfan Li and the reviewers for their insightful comments. We acknowledge the support from the National Science Foundation (DMS-1555072, DMS-1736364, DMS-1821233 and DMS-1818674) and the GPU grant program from NVIDIA.

Footnotes

* $E_{\gamma\mid\cdot,\mathcal{D}}[\cdot]$ is short for $E_{\gamma\mid\beta^{(k)},\sigma^{(k)},\delta^{(k)},\mathcal{D}}[\cdot]$.

$E_{\mathcal{B}}[\pi(\beta, \sigma^2, \delta, \gamma \mid \mathcal{B})]$ denotes $\int_{\mathcal{D}} \pi(\beta, \sigma^2, \delta, \gamma \mid \mathcal{B})\, d\mathcal{B}$.

The quadratic equation in (14) has only one positive root. $\|\cdot\|$ refers to the L2 norm and $\|\cdot\|_1$ to the L1 norm.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Contributor Information

Wei Deng, Department of Mathematics, Purdue University, West Lafayette, IN 47907.

Xiao Zhang, Department of Computer Science, Purdue University, West Lafayette, IN 47907.

Faming Liang, Department of Statistics, Purdue University, West Lafayette, IN 47907.

Guang Lin, Departments of Mathematics, Statistics and School of Mechanical Engineering, Purdue University, West Lafayette, IN 47907.

References

  1. Andrieu Christophe, Moulines Éric, and Priouret Pierre. Stability of Stochastic Approximation under Verifiable Conditions. SIAM J. Control Optim, 44(1):283–312, 2005.
  2. Benveniste Albert, Métivier Michel, and Priouret Pierre. Adaptive Algorithms and Stochastic Approximations. Berlin: Springer, 1990.
  3. Chen Changyou, Ding Nan, and Carin Lawrence. On the Convergence of Stochastic Gradient MCMC Algorithms with High-order Integrators. In Proc. of the Conference on Advances in Neural Information Processing Systems (NeurIPS), pages 2278–2286, 2015.
  4. Chen Tianqi, Fox Emily B., and Guestrin Carlos. Stochastic Gradient Hamiltonian Monte Carlo. In Proc. of the International Conference on Machine Learning (ICML), 2014.
  5. Dalalyan Arnak S. and Karagulyan Avetik G.. User-friendly Guarantees for the Langevin Monte Carlo with Inaccurate Gradient. ArXiv e-prints, September 2018.
  6. George Edward I. and McCulloch Robert E.. Variable Selection via Gibbs Sampling. Journal of the American Statistical Association, 88(423):881–889, 1993.
  7. Ghosh Soumya, Yao Jiayu, and Doshi-Velez Finale. Structured Variational Learning of Bayesian Neural Networks with Horseshoe Priors. In Proc. of the International Conference on Machine Learning (ICML), 2018.
  8. Gomez Aidan N., Zhang Ivan, Swersky Kevin, Gal Yarin, and Hinton Geoffrey E.. Targeted Dropout. In NeurIPS 2018 Workshop on Compact Deep Neural Networks with Industrial Applications, 2018.
  9. Goodfellow Ian J., Warde-Farley David, Mirza Mehdi, Courville Aaron, and Bengio Yoshua. Maxout Networks. In Proc. of the International Conference on Machine Learning (ICML), pages III-1319–III-1327, 2013.
  10. Goodfellow Ian J., Shlens Jonathon, and Szegedy Christian. Explaining and Harnessing Adversarial Examples. ArXiv e-prints, December 2014.
  11. Han Song, Mao Huizi, and Dally William J. Deep Compression: Compressing Deep Neural Network with Pruning, Trained Quantization and Huffman Coding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  12. He Kaiming, Zhang Xiangyu, Ren Shaoqing, and Sun Jian. Deep Residual Learning for Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  13. Hernandez-Lobato Jose Miguel and Adams Ryan. Probabilistic Backpropagation for Scalable Learning of Bayesian Neural Networks. In Proc. of the International Conference on Machine Learning (ICML), volume 37, pages 1861–1869, 2015.
  14. Ioffe Sergey and Szegedy Christian. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proc. of the International Conference on Machine Learning (ICML), pages 448–456, 2015.
  15. Jarrett K, Kavukcuoglu K, Ranzato M, and LeCun Y. What is the best multi-stage architecture for object recognition? In Proc. of the International Conference on Computer Vision (ICCV), pages 2146–2153, September 2009.
  16. Kingma Diederik P. and Ba Jimmy. Adam: A Method for Stochastic Optimization. In Proc. of the International Conference on Learning Representation (ICLR), 2014.
  17. Kirkpatrick Scott, Gelatt C. D. Jr., and Vecchi Mario P.. Optimization by Simulated Annealing. Science, 220(4598):671–680, 1983.
  18. Krizhevsky Alex. Learning Multiple Layers of Features from Tiny Images. Tech Report, 2009.
  19. Li Chunyuan, Chen Changyou, Carlson David, and Carin Lawrence. Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks. In Proc. of the National Conference on Artificial Intelligence (AAAI), pages 1788–1794, 2016.
  20. Li Yingzhen and Gal Yarin. Dropout Inference in Bayesian Neural Networks with Alpha-divergences. In Proc. of the International Conference on Machine Learning (ICML), 2017.
  21. Liang Faming. Trajectory Averaging for Stochastic Approximation MCMC Algorithms. The Annals of Statistics, 38:2823–2856, 2010.
  22. Liang Faming, Jia Bochao, Xue Jingnan, Li Qizhai, and Luo Ye. Bayesian Neural Networks for Selection of Drug Sensitive Genes. Journal of the American Statistical Association, 113(523):955–972, 2018.
  23. Lin Ji, Rao Yongming, Lu Jiwen, and Zhou Jie. Runtime Neural Pruning. In Proc. of the Conference on Advances in Neural Information Processing Systems (NeurIPS), 2017.
  24. Louizos Christos, Ullrich Karen, and Welling Max. Bayesian Compression for Deep Learning. In Proc. of the Conference on Advances in Neural Information Processing Systems (NeurIPS), 2017.
  25. Ma Yi-An, Chen Tianqi, and Fox Emily B.. A Complete Recipe for Stochastic Gradient MCMC. In Proc. of the Conference on Advances in Neural Information Processing Systems (NeurIPS), 2015.
  26. Mangoubi Oren and Vishnoi Nisheeth K.. Convex Optimization with Unbounded Nonconvex Oracles using Simulated Annealing. In Proc. of Conference on Learning Theory (COLT), 2018.
  27. Marinari E and Parisi G. Simulated Tempering: A New Monte Carlo Scheme. Europhysics Letters (EPL), 19(6):451–458, 1992.
  28. Mattingly Jonathan C., Stuart Andrew M., and Tretyakov MV. Convergence of Numerical Time-Averaging and Stationary Measures via Poisson Equations. SIAM Journal on Numerical Analysis, 48:552–577, 2010.
  29. Molchanov Pavlo, Tyree Stephen, Karras Tero, Aila Timo, and Kautz Jan. Pruning Convolutional Neural Networks for Resource Efficient Inference. In Proc. of the International Conference on Learning Representation (ICLR), 2017.
  30. Paszke Adam, Gross Sam, Chintala Soumith, Chanan Gregory, Yang Edward, DeVito Zachary, Lin Zeming, Desmaison Alban, Antiga Luca, and Lerer Adam. Automatic Differentiation in PyTorch. In NeurIPS Autodiff Workshop, 2017.
  31. Raginsky Maxim, Rakhlin Alexander, and Telgarsky Matus. Non-convex Learning via Stochastic Gradient Langevin Dynamics: a Nonasymptotic Analysis. In Proc. of Conference on Learning Theory (COLT), June 2017.
  32. Ročková Veronika and George Edward I.. EMVS: The EM Approach to Bayesian Variable Selection. Journal of the American Statistical Association, 109(506):828–846, 2014.
  33. Ročková Veronika and George Edward I.. The Spike-and-Slab Lasso. Journal of the American Statistical Association, 113:431–444, 2018.
  34. Saatci Yunus and Wilson Andrew G. Bayesian GAN. In Proc. of the Conference on Advances in Neural Information Processing Systems (NeurIPS), pages 3622–3631, 2017.
  35. Simonyan Karen and Zisserman Andrew. Very Deep Convolutional Networks for Large-scale Image Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  36. Swendsen Robert H. and Wang Jian-Sheng. Replica Monte Carlo Simulation of Spin-Glasses. Phys. Rev. Lett, 57:2607–2609, 1986.
  37. Teh Yee Whye, Thiéry Alexandre, and Vollmer Sebastian. Consistency and Fluctuations for Stochastic Gradient Langevin Dynamics. Journal of Machine Learning Research, 17:1–33, 2016.
  38. Tibshirani Robert. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1994.
  39. Welling Max and Teh Yee Whye. Bayesian Learning via Stochastic Gradient Langevin Dynamics. In Proc. of the International Conference on Machine Learning (ICML), pages 681–688, 2011.
  40. Wong Wing Hung and Liang Faming. Dynamic Weighting in Monte Carlo and Optimization. Proc. Natl. Acad. Sci, 94:14220–14224, 1997.
  41. Xiao Han, Rasul Kashif, and Vollgraf Roland. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. ArXiv e-prints, August 2017.
  42. Xu Pan, Chen Jinghui, Zou Difan, and Gu Quanquan. Global Convergence of Langevin Dynamics Based Algorithms for Nonconvex Optimization. In Proc. of the Conference on Advances in Neural Information Processing Systems (NeurIPS), December 2018.
  43. Ye Mao and Sun Yan. Variable Selection via Penalized Neural Network: a Drop-Out-One Loss Approach. In Proc. of the International Conference on Machine Learning (ICML), volume 80, pages 5620–5629, 2018.
  44. Zhang Yuchen, Liang Percy, and Charikar Moses. A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics. In Proc. of Conference on Learning Theory (COLT), pages 1980–2022, 2017.
  45. Zhong Zhun, Zheng Liang, Kang Guoliang, Li Shaozi, and Yang Yi. Random Erasing Data Augmentation. In Proc. of the National Conference on Artificial Intelligence (AAAI), 2020.
  46. Zou Hui and Hastie Trevor. Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society, Series B, 67(2):301–320, 2005.
