Author manuscript; available in PMC 2023 Jan 1.
Published in final edited form as: Stat. Probab. Lett. 180 (2021) 109246. doi: 10.1016/j.spl.2021.109246

Learning Sparse Deep Neural Networks with a Spike-and-Slab Prior

Yan Sun 1, Qifan Song 1, Faming Liang 1
PMCID: PMC8570537  NIHMSID: NIHMS1745833  PMID: 34744226

Abstract

Deep learning has achieved great successes in many machine learning tasks. However, deep neural networks (DNNs) are often severely over-parameterized, making them computationally expensive, memory intensive, less interpretable and mis-calibrated. We study sparse DNNs under the Bayesian framework: we establish posterior consistency and structure selection consistency for Bayesian DNNs with a spike-and-slab prior, and illustrate their performance using examples on high-dimensional nonlinear variable selection, large network compression and model calibration. Our numerical results indicate that sparsity is essential for improving the prediction accuracy and calibration of the DNN.

Keywords: Bayesian neural network, sparse deep learning, posterior consistency, high-dimensional nonlinear variable selection, model calibration

1. INTRODUCTION

During the past decade, deep neural networks (DNNs) have achieved great successes in solving many complex machine learning tasks such as pattern recognition and natural language processing. A key factor behind these successes is their universal approximation power, which, however, relies on a large number of parameters. The DNNs used in these tasks may consist of hundreds of layers and billions of parameters, see e.g. [1] on image classification. Many DNNs are severely over-parameterized; for some networks, only 5% of the parameters are enough to achieve acceptable model performance [2]. Moreover, these over-parameterized networks are no longer well calibrated [3].

To reduce the complexity of the DNN and improve its calibration, sparse deep learning has been considered by some researchers, see e.g. [4, 5, 6, 7]. The approximation power of the sparse DNN has been studied from both frequentist and Bayesian perspectives. From the frequentist perspective, [4] and [5] quantified the approximation error of the sparse DNN in approximating Hölder smooth functions. From the Bayesian perspective, [6] established posterior consistency for the Bayesian DNN with a spike-and-slab prior, and [7] established posterior consistency for the Bayesian DNN with a mixture Gaussian prior. Quite recently, the theory for sparse deep learning has also been developed under the framework of variational inference, see e.g. [8, 9, 10]. However, due to the approximate nature of the variational posterior distribution, the resulting statistical inference is often sub-optimal.

This paper studies theoretical properties of the Bayesian sparse DNN with a spike-and-slab prior. It is closely related to [6] in that both employ a spike-and-slab prior, but our theory is more general. In particular, our theory is developed under the assumption that the input dimension and the upper bound of the connection weights can increase with the training sample size, and that the activation function is Lipschitz continuous, which covers many popular activation functions such as ReLU, tanh, and sigmoid. In contrast, [6] assumes that the input dimension is of the order O(1), the connection weights are bounded, and the activation function is ReLU. This paper also provides a scalable algorithm for simulating from the posterior distribution of the Bayesian DNN. The numerical results indicate that sparsity is essential for improving the prediction accuracy and model calibration of the DNN.

The remaining part of this paper is organized as follows. Section 2 presents the theoretical results. Section 3 describes a stochastic gradient Markov chain Monte Carlo (SGMCMC) algorithm for Bayesian DNN simulation. Section 4 presents a network compression example. Section 5 concludes the paper with a brief discussion.

2. THEORETICAL RESULTS

Let $D_n = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ denote a training dataset of $n$ i.i.d. observations, where $x^{(i)} \in \mathbb{R}^{p_n}$, $y^{(i)} \in \mathbb{R}$, and $p_n$ denotes the dimension of the input variables, which is assumed to grow with the training sample size $n$. We assume that the data are generated from a generalized linear model with the pdf/pmf given by

$f(y \mid \mu^*(x)) = \exp\{A(\mu^*(x))\, y + B(\mu^*(x)) + C(y)\},$

where $\mu^*(x)$ denotes an unknown function of $x$, and $A(\cdot)$, $B(\cdot)$ and $C(\cdot)$ are appropriately defined functions. For example, for logistic regression, we have $A(\mu^*) = \mu^*$, $B(\mu^*) = -\log(1 + e^{\mu^*})$ and $C(y) = 0$; and for normal regression, by introducing an extra dispersion parameter $\sigma^2$, we have $A(\mu^*) = \mu^*/\sigma^2$, $B(\mu^*) = -\mu^{*2}/(2\sigma^2)$ and $C(y) = -y^2/(2\sigma^2) - \log(2\pi\sigma^2)/2$. For simplicity of analysis, $\sigma^2$ is assumed to be known. To model the data, we propose to approximate $\mu^*(x)$ by a DNN with $H_n - 1$ hidden layers. Let $L_h$ denote the number of hidden units at layer $h$, with $L_{H_n} = 1$ for the output layer and $L_0 = p_n$ for the input layer. Let $w^h \in \mathbb{R}^{L_h \times L_{h-1}}$ and $b^h \in \mathbb{R}^{L_h \times 1}$, $h \in \{1, 2, \ldots, H_n\}$, denote the weights and biases of layer $h$, respectively. Let $\psi^h: \mathbb{R}^{L_h \times 1} \to \mathbb{R}^{L_h \times 1}$ denote the mapping that applies a piecewise differentiable activation function $\psi$ to each entry of its input. In summary, the DNN forms the nonlinear mapping

$\mu(\beta, x) = w^{H_n} \psi^{H_n-1}\big[\cdots \psi^{1}\big[w^{1} x + b^{1}\big] \cdots\big] + b^{H_n},$  (1)

where $\beta = \{w^h_{ij}, b^h_k : 1 \le h \le H_n, \, 1 \le i, k \le L_h, \, 1 \le j \le L_{h-1}\}$ denotes the collection of weights and biases of the DNN. A fully connected DNN consists of $K_n = |\beta| = \sum_{h=1}^{H_n} (L_{h-1} \times L_h + L_h)$ tuning parameters. To facilitate the representation of sparse DNNs, we re-parameterize $\beta$ by introducing an indicator variable for each element of $\beta$, which indicates the existence of the corresponding connection. Here the bias is treated as a special connection with a constant input of 1. Let $w^h = \tilde{w}^h \circ \gamma^{w^h}$ and $b^h = \tilde{b}^h \circ \gamma^{b^h}$, where $\gamma^{w^h}$ and $\gamma^{b^h}$ are the matrices or vectors of indicators, and $\circ$ denotes the element-wise product. Correspondingly, we define $\tilde{\beta} = \{\tilde{w}^h_{ij}, \tilde{b}^h_k\}$ and $\gamma = \{\gamma^{w^h}_{ij}, \gamma^{b^h}_k\}$, which represent the connection weights and the network structure, respectively. For notational simplicity, we rewrite $\tilde{\beta}$ and $\gamma$ as $\tilde{\beta} = (\tilde{\beta}_1, \tilde{\beta}_2, \ldots, \tilde{\beta}_{K_n}) \in \mathbb{R}^{K_n}$ and $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_{K_n}) \in \{0, 1\}^{K_n}$. Correspondingly, we have $\beta = \tilde{\beta} \circ \gamma$.
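To make the masked re-parameterization concrete, the following sketch (a minimal NumPy illustration, not the implementation used in this paper; the ReLU activation and the toy layer sizes are assumptions made purely for the example) evaluates $\mu(\beta, x)$ in (1) with each weight matrix and bias vector multiplied element-wise by its indicator array.

```python
import numpy as np

def masked_forward(x, weights, biases, w_masks, b_masks):
    """Evaluate the sparse DNN mu(beta, x) of Eq. (1), where each layer's
    effective parameters are the element-wise products w-tilde o gamma."""
    relu = lambda z: np.maximum(z, 0.0)          # a Lipschitz activation (cf. A.3)
    h = x
    n_layers = len(weights)
    for l in range(n_layers):
        w = weights[l] * w_masks[l]              # w^h = w~^h o gamma^{w^h}
        b = biases[l] * b_masks[l]               # b^h = b~^h o gamma^{b^h}
        h = w @ h + b
        if l < n_layers - 1:                     # no activation on the output layer
            h = relu(h)
    return h

# Toy example: p_n = 4 inputs, one hidden layer of 3 units, scalar output.
rng = np.random.default_rng(0)
sizes = [4, 3, 1]
weights = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(2)]
biases  = [rng.normal(size=(sizes[l + 1], 1)) for l in range(2)]
w_masks = [rng.integers(0, 2, size=w.shape).astype(float) for w in weights]
b_masks = [np.ones_like(b) for b in biases]
print(masked_forward(rng.normal(size=(4, 1)), weights, biases, w_masks, b_masks))
```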

2.1. Regularity conditions on sparse DNNs

With slight abuse of notation, we will rewrite μ(β, x) in (1) for the sparse DNN as μ(β, γ, x) by including the network structure information. In what follows, we characterize how well μ∗(x) can be approximated by μ(β, γ, x) with a sparse DNN from the Bayesian perspective. For this purpose, we assume the following regularity conditions:

A.1 The input $x$ is bounded entry-wise by 1, i.e., $x \in \Omega = [-1, 1]^{p_n}$, and the density of $x$ is bounded on its support $\Omega$ uniformly with respect to $n$.

A.2 The unknown regression mean function μ∗(x) can be well approximated by a sparse DNN model such that μ(β∗, γ∗, x) satisfies

  • A.2.1 $\|\mu^*(x) - \mu(\beta^*, \gamma^*, x)\|_{L^2(\Omega)} \le \varpi_n$, where the approximation error $\varpi_n \to 0$ as the sample size $n \to \infty$.

  • A.2.2 $H_n \log K_n \le C \log n$ for some constant $C > 0$, $1 \le r_n^* H_n \log K_n \prec n$, and $\|\beta^*\| < E_n$, where $E_n$ is a sequence satisfying $\log E_n = O(\log K_n)$, and $r_n^*$ is the network size, i.e., $r_n^* = \sum_{i=1}^{K_n} \gamma_i^*$.

A.3 The activation function ψ is Lipschitz continuous with a Lipschitz constant of 1.

Refer to the supplementary material for more discussions on the conditions, where we show how our Bayesian DNN theory is compatible with the existing sparse DNN approximation theory.

2.2. The Spike-and-Slab Prior

To conduct Bayesian inference for the DNN, we consider a hierarchical prior given by

$\tilde{\beta}_i \mid \gamma_i = 1 \sim N(0, \sigma_{1,n}^2), \quad i = 1, 2, \ldots, K_n; \qquad \pi(\gamma) \propto \lambda_n^{|\gamma|} (1 - \lambda_n)^{K_n - |\gamma|}\, 1\{|\gamma| \le \bar{r}_n, \, \gamma \in G\},$  (2)

where $\sigma_{1,n}^2$ is the variance of the normal distribution, $\bar{r}_n$ is the maximum size of the candidate sparse DNNs, $\lambda_n$ can be interpreted as the approximate prior probability for each connection to be included in the DNN, and $G$ is the set of all valid DNNs. With this hierarchical prior, $\beta_i$ follows a discrete spike-and-slab prior, i.e., $\beta_i = \tilde{\beta}_i \gamma_i \sim \lambda_n N(0, \sigma_{1,n}^2) + (1 - \lambda_n) \delta_0$ for $i = 1, 2, \ldots, K_n$, where $\delta_0$ denotes the Dirac delta function at 0.
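As a concrete illustration of the prior, the sketch below (hypothetical NumPy code, not the authors' implementation) draws a parameter vector from the marginal spike-and-slab distribution $\lambda_n N(0, \sigma_{1,n}^2) + (1 - \lambda_n)\delta_0$, ignoring the truncation to $\{|\gamma| \le \bar{r}_n, \gamma \in G\}$ for simplicity.

```python
import numpy as np

def sample_spike_and_slab(K_n, lambda_n, sigma1_sq, rng=None):
    """Draw beta_i = beta_tilde_i * gamma_i: each connection is included with
    probability lambda_n and, if included, its weight comes from the Gaussian
    slab N(0, sigma1_sq); otherwise the weight is exactly zero (the spike)."""
    rng = rng or np.random.default_rng()
    gamma = rng.binomial(1, lambda_n, size=K_n)                 # inclusion indicators
    beta_tilde = rng.normal(0.0, np.sqrt(sigma1_sq), size=K_n)  # slab draws
    return beta_tilde * gamma, gamma

beta, gamma = sample_spike_and_slab(K_n=10_000, lambda_n=0.1, sigma1_sq=0.04)
print(gamma.mean())   # empirical inclusion rate, close to lambda_n
```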

2.3. Posterior consistency

Posterior consistency measures the speed of posterior concentration around the true density function, which plays a major role in validating Bayesian methods. For DNNs, since the total number of parameters $K_n$ is often much larger than the sample size $n$, posterior consistency provides a general guideline for prior setting; otherwise, the prior information may dominate the data information, leading to biased inference for the underlying true model. For learning sparse DNNs, we generally set $\lambda_n \to 0$ as $K_n \to \infty$, which provides an automatic control of the multiplicity involved in structure selection [11]. Denote by $P^*$ and $E^*$ the probability measure and expectation, respectively, with respect to the data $D_n$; denote by $d(p_1, p_2)$ the Hellinger distance between two densities $p_1(x, y)$ and $p_2(x, y)$; denote by $p_\beta$ the density function induced by a DNN with parameters $\beta$; and denote by $p_{\mu^*}$ the true density function induced by the unknown function $\mu^*(x)$. Let $\pi(A \mid D_n)$ denote the posterior probability of an event $A$. The following theorem establishes posterior consistency for sparse DNNs under the spike-and-slab prior (2).
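Recall that one common convention for the Hellinger distance between two joint densities is

$d(p_1, p_2) = \Big\{ \tfrac{1}{2} \int \big( \sqrt{p_1(x, y)} - \sqrt{p_2(x, y)} \big)^2 \, dx \, dy \Big\}^{1/2};$

the choice of normalizing constant is immaterial here, as it only rescales $\epsilon_n$.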

Theorem 2.1.

Consider a DNN with $H_n$ layers and at most $K_n$ possible connections, where both $H_n$ and $K_n$ increase with $n$. Suppose that assumptions A.1-A.3 hold, and that $\tilde{\beta}$ and $\gamma$ are subject to the prior (2) with hyperparameters satisfying $\log \sigma_{1,n}^2 + E_n^2 / \sigma_{1,n}^2 = O(H_n \log K_n)$ and $1 \le r_n \le \bar{r}_n \le K_n$, where $r_n = \lambda_n K_n$. Then there exists a sequence $\epsilon_n \precsim \max\{\varpi_n, \sqrt{\bar{r}_n H_n \log K_n / n}\}$ such that, for sufficiently large $n$, the posterior distribution satisfies

$P^*\big\{ \pi[d(p_\beta, p_{\mu^*}) > 4\epsilon_n \mid D_n] \ge 2 e^{-n\epsilon_n^2/4} \big\} \le 2 e^{-n\epsilon_n^2/4}, \qquad E^*\, \pi[d(p_\beta, p_{\mu^*}) > 4\epsilon_n \mid D_n] \le 4 e^{-n\epsilon_n^2/2}.$  (3)

Theorem 2.1 establishes a Bayesian contraction rate for sparse DNNs under the Hellinger metric. Therefore, to obtain consistent Bayesian inference, it suffices to require $\varpi_n \to 0$ and $\bar{r}_n H_n \log(K_n) \prec n$. If $\mu^*$ can be optimally represented by a sparse DNN model (refer to Remark S1 in the supplementary material), then $\varpi_n$ decays polynomially to 0 as $r_n$ increases to infinity. On the other hand, $\bar{r}_n H_n \log(K_n) \prec n$ implies $\bar{r}_n = o(n/\log(n))$ by noting that $K_n \ge \bar{r}_n$. Refer to the supplementary material for more discussion of this theorem. In conclusion, we only need $\bar{r}_n$, the upper bound on the total connectivity, to increase slowly; in other words, a sparse DNN with connectivity of order $O(n/\log(n))$ is already large enough to provide a consistent approximation to the true model.

2.4. Consistency of DNN structure selection

It is known that the DNN model is in general nonidentifiable due to the symmetry of the network structure. However, by introducing appropriate constraints [12, 13], we can define a set of neural networks such that each network in the set is unique up to simple node permutations, sign changes, etc. Let Θ denote such a set of DNNs, where each element of Θ can be viewed as an equivalence class of networks. Let ν(·) be the transformation such that ν(β) ∈ Θ. Since $p_\beta = p_{\nu(\beta)}$, the posterior consistency (3) holds for ν(β) as well. In what follows, we rewrite ν(β) as ν(γ, β) to include the network structure information. To serve the purpose of structure selection for the DNN, we consider the marginal inclusion posterior probability (MIPP) approach proposed in [14] for high-dimensional variable selection. For each entry of γ, we define its MIPP by $q_i = \sum_{\gamma} \int e_{i|\nu(\gamma, \beta)} \, \pi(\gamma \mid \beta, D_n)\, \pi(\beta \mid D_n) \, d\beta$, where $e_{i|\nu(\gamma, \beta)}$ indicates the existence of connection $i$ in the network $\nu(\gamma, \beta)$. Similarly, we define $e_{i|\nu(\gamma^*, \beta^*)}$ as the indicator for connection $i$ in the true DNN $\nu(\gamma^*, \beta^*)$. The MIPP approach chooses the connections whose marginal inclusion posterior probabilities are greater than a threshold value $\hat{q}$; that is, it sets $\hat{\gamma}_{\hat{q}} = \{j : q_j > \hat{q}, \, j = 1, 2, \ldots, K_n\}$ as a Bayes estimator of the true model. To establish the consistency of $\hat{\gamma}_{\hat{q}}$, the following identifiability condition on the true network $\gamma^*$ is needed. Let $A(\epsilon_n) = \{\beta : d(p_\beta, p_{\mu^*}) \ge \epsilon_n\}$, and define $\rho(\epsilon_n) = \max_{1 \le i \le K_n} \int_{A(\epsilon_n)^c} \sum_{\gamma} |e_{i|\nu(\gamma, \beta)} - e_{i|\nu(\gamma^*, \beta^*)}| \, \pi(\gamma \mid \beta, D_n)\, \pi(\beta \mid D_n) \, d\beta$, which measures the distance between the true DNN model and the models drawn from the posterior within the $\epsilon_n$-neighborhood $A(\epsilon_n)^c$. The identifiability condition can be stated as follows:

B.1 $\rho(\epsilon_n) \to 0$ as $n \to \infty$ and $\epsilon_n \to 0$.

That is, when $n$ is sufficiently large, if a DNN produces approximately the same probability distribution as the true data-generating model, then its structure, after being mapped into the network space Θ, must coincide with the true DNN structure. Note that this identifiability is different from the one mentioned at the beginning of this section, which concerns only structure and parameter rearrangements. Theorem 2.2 establishes the consistency of $\hat{\gamma}_{\hat{q}}$ and its sure screening property.

Theorem 2.2.

If the conditions of Theorem 2.1 and condition B.1 hold, then (i) $\max_{1 \le i \le K_n} \{|q_i - e_{i|\nu(\gamma^*, \beta^*)}|\} \to 0$ in probability; and (ii) for any pre-specified $\hat{q} \in (0, 1)$, $\lim_{n \to \infty} P(\gamma^* \subset \hat{\gamma}_{\hat{q}}) = 1$.

In practice, the value of $\hat{q}$ can be determined by a multiple hypothesis test, see e.g. [15]. For simplicity, we often set $\hat{q} = 0.5$. Recall that $\gamma^{w^h} \in \{0, 1\}^{L_h \times L_{h-1}}$ denotes the connection indicator matrix of layer $h$. Let

$\gamma^x = \gamma^{w^{H_n}} \gamma^{w^{H_n - 1}} \cdots \gamma^{w^{1}} \in \mathbb{R}^{1 \times p_n},$  (4)

and let $\gamma^x_i$ denote the $i$-th element of $\gamma^x$. If $\gamma^x_i > 0$, then the input variable $x_i$ is effective in the network $\gamma$; otherwise, it is not. As with network connections, we can define an MIPP for each input variable and determine a threshold value for it based on a multiple hypothesis test. As implied by (4), consistency of structure selection implies consistency of variable selection with respect to the true network $\gamma^*$ defined in Assumption A.2.1.
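As an illustration of the screening rule in (4), the following sketch (hypothetical NumPy code, assuming the layer-wise connection indicator matrices are available as 0/1 arrays) computes $\gamma^x$ and returns the indices of the effective input variables.

```python
import numpy as np

def effective_inputs(w_masks):
    """Compute gamma^x = gamma^{w^{H_n}} ... gamma^{w^1} as in Eq. (4) and return
    the indices of input variables still connected to the output."""
    gamma_x = w_masks[-1].astype(float)          # start from the output layer
    for mask in reversed(w_masks[:-1]):          # multiply down to the input layer
        gamma_x = gamma_x @ mask.astype(float)
    return np.flatnonzero(gamma_x.ravel() > 0)   # x_i is effective iff gamma^x_i > 0

# Toy example: 5 inputs -> 3 hidden units -> 1 output.
w1 = np.array([[1, 0, 0, 0, 0],
               [0, 1, 0, 0, 0],
               [0, 0, 0, 0, 1]])
w2 = np.array([[1, 1, 0]])                       # hidden unit 3 is disconnected
print(effective_inputs([w1, w2]))                # -> [0 1]: only x_1 and x_2 are effective
```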

3. COMPUTATIONAL ALGORITHM

Although the spike-and-slab prior leads to nice theoretical properties for Bayesian sparse DNNs, the resulting posterior is a mixture of distributions of varying dimensions. Conventional Bayesian sampling algorithms adopt reversible jump moves [16] to explore different network architectures, see e.g. [13], and lack the scalability required for big data problems. [17] proposed an extended stochastic gradient Langevin dynamics (SGLD) algorithm for Bayesian variable selection under the big data scenario, which allows mini-batch data to be used in updating the model and regression coefficients at each iteration and is thus scalable to big data problems. In this paper, we extend it further by combining it with stochastic gradient Hamiltonian Monte Carlo (SGHMC) [18]; see Algorithm 1 for the procedure and the supplementary material for further explanation.

The main parameters of the extended SGHMC algorithm are the prior hyperparameters $\lambda_n$ and $\sigma_{1,n}$ defined in (2). Our theory allows $\sigma_{1,n}$ to grow with $n$ from the perspective of data fitting, but in our experience, large weight magnitudes tend to adversely affect the generalization ability of the network. For this reason, we usually set $\sigma_{1,n}$ to a small number such as 0.01 or 0.02, and then tune the value of $\lambda_n$ for network sparsity such that the sparsity constraint given in Assumption A.2.2 is satisfied.
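Since the prior expected network size is $r_n = \lambda_n K_n$ (see Theorem 2.1), a crude starting value for $\lambda_n$ can be obtained by matching $r_n$ to the desired number of retained connections, as in the heuristic below (the ResNet20 parameter count is approximate, and this matching rule is an illustration rather than a tuning recipe stated in the paper); the value is then fine-tuned as described above.

```python
def prior_inclusion_prob(K_n, target_nonzero):
    """Heuristic starting point: the prior expected network size is
    r_n = lambda_n * K_n, so matching it to the desired number of
    retained connections gives lambda_n = target_nonzero / K_n."""
    return target_nonzero / K_n

K_n = 270_000                                    # roughly the size of ResNet20
lam = prior_inclusion_prob(K_n, target_nonzero=0.10 * K_n)
print(lam)                                       # 0.1, i.e. ~10% of connections kept a priori
```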

4. EXPERIMENTS

This section presents an example on network compression for convolutional neural networks (CNNs). Refer to the supplementary material for examples on high-dimensional nonlinear variable selection and DNN model calibration. In terms of network compression, the extended SGHMC algorithm can be viewed as an unstructured dynamic pruning algorithm, where the spike-and-slab prior is imposed on each parameter and the sparsity mask γ can be adjusted during training [19]. We test the proposed method on the CIFAR-10 data [20], pruning ResNet20 and ResNet32 networks [1] to different sparsity levels. We train the networks using the standard setup of [19]: a mini-batch size of 128, 300 training epochs, and an initial learning rate ϵ0 = 0.1 with ϵt divided by 10 at epochs 150 and 225. We set $\sigma_{1,n}^2 = 0.04$ and sample from the posterior using the extended SGHMC algorithm (Algorithm 1) with $t_0 = 1$, $\tau = 0.0001$, $\alpha = 0.1$, and a magnitude-based proposal for γ.

Algorithm 1.

Extended Stochastic Gradient Hamiltonian Monte Carlo

Input: T ; m; τ; t0; t1
Initialization: Randomly initialize $\tilde{\beta}^{(1)}$. Set $\gamma_i^{(0, t_0)} = 1$ for all $i \in \{1, \ldots, K_n\}$;
for t = 1, 2, ..., T do
 (i) Draw a subsample of size $m$ from the full dataset $D_n$, and duplicate it to form a dataset $D_{m,n}^{(t)}$.
 (ii) Simulate models $\gamma^{(t,1)}, \gamma^{(t,2)}, \ldots, \gamma^{(t, t_0)}$ from the conditional posterior $\pi(\gamma \mid \tilde{\beta}^{(t)}, D_{m,n}^{(t)})$ by a Metropolis-Hastings (MH) algorithm with a magnitude-based proposal.
 (iii) Update $\tilde{\beta}^{(t)}$ via an SGHMC move,
  
$v^{(t+1)} = (1 - \alpha)\, v^{(t)} + \frac{\epsilon_{t+1}}{2 t_0} \sum_{k=1}^{t_0} \nabla_{\tilde{\beta}} \log \pi\big(\tilde{\beta}^{(t)} \mid \gamma^{(t,k)}, D_{m,n}^{(t)}\big) + \sqrt{\alpha\, \epsilon_{t+1}\, \tau}\; \eta_{t+1},$
$\tilde{\beta}^{(t+1)} = \tilde{\beta}^{(t)} + v^{(t+1)},$
 where $\eta_{t+1} \sim N(0, I_d)$, $\epsilon_{t+1}$ is the learning rate, $\tau$ is the temperature, and $1 - \alpha$ is the momentum parameter.
end for

We tune $\lambda_n$ and $B$ (a parameter of the magnitude-based proposal, described in the supplementary material) to achieve the target sparsity levels for the different ResNets. Posterior samples are collected at the end of each epoch, and the first 225 samples are discarded as burn-in.
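For concreteness, the sketch below restates step (iii) of Algorithm 1 in NumPy under assumed inputs (the user supplies the stochastic gradient of the log-posterior averaged over the $t_0$ sampled structures); it is an illustration, not the code used for the experiments.

```python
import numpy as np

def sghmc_step(beta_tilde, v, grad_log_post_avg, eps, alpha, tau, rng=None):
    """One momentum update of step (iii) in Algorithm 1:
       v     <- (1 - alpha) * v + (eps / 2) * avg_gradient + sqrt(alpha * eps * tau) * N(0, I)
       beta~ <- beta~ + v
    where avg_gradient = (1/t0) * sum_k grad log pi(beta~ | gamma^{(t,k)}, D^{(t)}_{m,n})."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(beta_tilde.shape)
    v = (1.0 - alpha) * v + 0.5 * eps * grad_log_post_avg + np.sqrt(alpha * eps * tau) * noise
    return beta_tilde + v, v

# Toy usage with a standard normal log-posterior (gradient = -beta).
beta, v = np.zeros(5), np.zeros(5)
for _ in range(100):
    beta, v = sghmc_step(beta, v, grad_log_post_avg=-beta, eps=0.1, alpha=0.1, tau=1e-4)
print(beta)
```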

We compare the proposed method with some recent unstructured compression methods, including dynamic pruning with feedback (DPF) [19], dynamic sparse reparameterization (DSR) [21], sparse momentum (SM) [22], consistent sparse deep learning (BNNcs) [7], and variational inference for sparse deep learning (SVI) [9]. The numerical results are summarized in Table 1. The comparison shows that the proposed method achieves higher prediction accuracy than the existing methods at similar sparsity levels.

Table 1:

Network pruning for CIFAR-10 data, where each network is pruned to two target sparsity levels (the two Pruning Ratio/Test Accuracy column pairs) and the result of each method is calculated by averaging over 3 independent runs with the standard deviation reported in parentheses. BNNbma and BNNsingle denote the proposed method with the results calculated by Bayesian model averaging and based on the single model collected at the last epoch, respectively. The other methods are defined in the text.

Model Method Pruning Ratio(%) Test Accuracy(%) Pruning Ratio(%) Test Accuracy(%)

ResNet20 BNNbma 9.65(0.05) 91.60(0.06) 19.76(0.02) 92.65(0.02)
BNNsingle 9.88(0.08) 91.26(0.02) 19.83(0.02) 92.32(0.04)
BNNcs 9.55(0.03) 91.27(0.05) 19.67(0.05) 92.27(0.03)
SVI 10 89.43(0.05) 20 91.78(0.06)
SM 10 89.76(0.40) 20 91.54(0.16)
DSR 10 87.88(0.04) 20 91.78(0.28)
DPF 10 90.88(0.07) 20 92.17(0.21)

ResNet32 BNNbma 4.89(0.09) 91.84(0.09) 9.65(0.05) 92.99(0.08)
BNNsingle 4.99(0.06) 91.39(0.10) 8.77(0.12) 92.74(0.10)
BNNcs 4.78(0.01) 91.21(0.01) 9.53(0.04) 92.74(0.07)
SVI 5 86.31(0.23) 10 91.61(0.10)
SM 5 88.68(0.22) 10 91.54(0.18)
DSR 5 84.12(0.32) 10 91.41(0.23)
DPF 5 90.94(0.35) 10 92.42(0.18)

5. CONCLUSION

In this paper, we study sparse deep learning from the Bayesian perspective. Theoretically, we establish posterior consistency and structure selection consistency for Bayesian DNNs with a spike-and-slab prior. We give the posterior contraction rate, and propose to use the marginal inclusion posterior probability approach to determine the structure of the sparse DNN. We employ a scalable SGMCMC algorithm for simulating from the posterior of the sparse DNN, which has about the same order of computational complexity as popular optimization algorithms such as SGD. When irrelevant features exist in the dataset, our numerical results show that learning a sparse DNN is always rewarded in prediction compared to blindly learning a large, fully connected DNN. The proposed method can be applied to large-scale network pruning tasks as an unstructured dynamic pruning method, where the sparsity mask is adjusted systematically during training. Our numerical results show that the proposed method can achieve state-of-the-art performance in network structure selection, large-scale network pruning and model calibration.

As mentioned previously, sparsity is essential for improving the calibration of the DNN [23, 24]. This work provides a practical and efficient method for learning sparse DNNs with a theoretical guarantee for identification of an appropriate sparse structure, and thus it will naturally enhance the development of trustworthy artificial intelligence.

Supplementary Material


Acknowledgments

Liang's research was supported in part by the grants DMS-2015498, R01-GM117597 and R01-GM126089. Song's research was supported in part by the grant DMS-1811812. The authors thank the editor, associate editor, and referees for their constructive comments, which have led to significant improvement of this paper.

Footnotes

1. $a_n \prec b_n$ means $a_n / b_n \to 0$ as $n \to \infty$.

2. $a_n \precsim b_n$ means that there exists a constant $C$ such that $a_n / b_n \le C$ as $n \to \infty$.


References

  • [1] He K, Zhang X, Ren S, Sun J, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [2] Denil M, Shakibi B, Dinh L, Ranzato M, de Freitas N, Predicting parameters in deep learning, in: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, Curran Associates Inc., Red Hook, NY, USA, 2013, pp. 2148–2156.
  • [3] Guo C, Pleiss G, Sun Y, Weinberger KQ, On calibration of modern neural networks, in: Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, JMLR.org, 2017, pp. 1321–1330.
  • [4] Schmidt-Hieber J, Nonparametric regression using deep neural networks with ReLU activation function, Ann. Statist. 48 (4) (2020) 1875–1897.
  • [5] Bauer B, Kohler M, On deep learning as a remedy for the curse of dimensionality in nonparametric regression, Ann. Statist. 47 (4) (2019) 2261–2285.
  • [6] Polson N, Ročková V, Posterior concentration for sparse deep learning, in: Advances in Neural Information Processing Systems, 2018, pp. 930–941.
  • [7] Sun Y, Song Q, Liang F, Consistent sparse deep learning: Theory and computation, J. Amer. Statist. Assoc. (2021), in press.
  • [8] Chérief-Abdellatif B-E, Convergence rates of variational inference in sparse deep learning, in: Daumé III H, Singh A (Eds.), Proceedings of the 37th International Conference on Machine Learning, Vol. 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 1831–1842.
  • [9] Bai J, Song Q, Cheng G, Efficient variational inference for sparse deep learning with theoretical guarantee, in: Advances in Neural Information Processing Systems, 2020, pp. 466–476.
  • [10] Bai J, Song Q, Cheng G, Adaptive variational Bayesian inference for sparse deep neural network (2020). arXiv:1910.04355.
  • [11] Scott JG, Berger JO, Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem, Ann. Statist. 38 (5) (2010) 2587–2619.
  • [12] Pourzanjani AA, Jiang RM, Petzold LR, Improving the identifiability of neural networks for Bayesian inference, in: NIPS Workshop on Bayesian Deep Learning, 2017.
  • [13] Liang F, Li Q, Zhou L, Bayesian neural networks for selection of drug sensitive genes, J. Amer. Statist. Assoc. 113 (523) (2018) 955–972.
  • [14] Liang F, Song Q, Yu K, Bayesian subset modeling for high dimensional generalized linear models, J. Amer. Statist. Assoc. 108 (2013) 589–606.
  • [15] Liang F, Zhang J, Estimating false discovery rate using stochastic approximation, Biometrika 95 (2008) 961–977.
  • [16] Green PJ, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika 82 (4) (1995) 711–732.
  • [17] Song Q, Sun Y, Ye M, Liang F, Extended stochastic gradient MCMC for large-scale Bayesian variable selection, Biometrika 107 (4) (2020) 997–1004.
  • [18] Chen T, Fox E, Guestrin C, Stochastic gradient Hamiltonian Monte Carlo, in: International Conference on Machine Learning, 2014, pp. 1683–1691.
  • [19] Lin T, Stich SU, Barba L, Dmitriev D, Jaggi M, Dynamic model pruning with feedback, in: International Conference on Learning Representations, 2020.
  • [20] Krizhevsky A, Hinton G, Learning multiple layers of features from tiny images, Tech. rep., Citeseer, 2009.
  • [21] Mostafa H, Wang X, Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization, in: International Conference on Machine Learning, 2019, pp. 4646–4655.
  • [22] Dettmers T, Zettlemoyer L, Sparse networks from scratch: Faster training without losing performance, arXiv preprint arXiv:1907.04840.
  • [23] Wang Y, Ročková V, Uncertainty quantification for sparse deep learning, in: AISTATS, 2020.
  • [24] Lee J, Humt M, Feng J, Triebel R, Estimating model uncertainty of neural networks in sparse information form, in: Daumé III H, Singh A (Eds.), Proceedings of the 37th International Conference on Machine Learning, Vol. 119, PMLR, 2020, pp. 5702–5713.
