Author manuscript; available in PMC 2023 Jan 1.
Published in final edited form as: Stat. Probab. Lett. 180 (2021) 109246. doi: 10.1016/j.spl.2021.109246

Learning Sparse Deep Neural Networks with a Spike-and-Slab Prior

Yan Sun 1, Qifan Song 1, Faming Liang 1
PMCID: PMC8570537  NIHMSID: NIHMS1745833  PMID: 34744226

Abstract

Deep learning has achieved great successes in many machine learning tasks. However, deep neural networks (DNNs) are often severely over-parameterized, making them computationally expensive, memory intensive, less interpretable and mis-calibrated. We study sparse DNNs under the Bayesian framework: we establish posterior consistency and structure selection consistency for Bayesian DNNs with a spike-and-slab prior, and illustrate their performance using examples on high-dimensional nonlinear variable selection, large network compression and model calibration. Our numerical results indicate that sparsity is essential for improving the prediction accuracy and calibration of the DNN.

Keywords: Bayesian neural network, sparse deep learning, posterior consistency, high-dimensional nonlinear variable selection, model calibration

1. INTRODUCTION

During the past decade, deep neural networks (DNNs) have achieved great successes in solving many complex machine learning tasks such as pattern recognition and natural language processing. A key factor behind these successes is their universal approximation power, which, however, relies on a large number of parameters. The DNNs used in these tasks may consist of hundreds of layers and billions of parameters, see e.g. [1] on image classification. Many DNNs are severely over-parameterized; for some networks, only 5% of the parameters are enough to achieve acceptable model performance [2]. Moreover, these over-parameterized networks are no longer well calibrated [3].

To reduce the complexity of the DNN and improve its calibration, sparse deep learning has been considered by some researchers, see e.g. [4, 5, 6, 7]. The approximation power of the sparse DNN has been studied from both frequentist and Bayesian perspectives. From the frequentist perspective, [4] and [5] quantified the approximation error of the sparse DNN in approximating Hölder smooth functions. From the Bayesian perspective, [6] established posterior consistency for the Bayesian DNN with a spike-and-slab prior, and [7] established posterior consistency for the Bayesian DNN with a mixture Gaussian prior. Quite recently, the theory for sparse deep learning has also been developed under the framework of variational inference, see e.g. [8, 9, 10]. However, due to the approximate nature of the variational posterior distribution, the resulting statistical inference is often sub-optimal.

This paper studies theoretical properties of the Bayesian sparse DNN with a spike-and-slab prior. It is closely related to [6] in that both employ a spike-and-slab prior, but our theory is more general. In particular, our theory is developed under the assumption that the input dimension and the upper bound of the connection weights can increase with the training sample size, and that the activation function is Lipschitz continuous, which covers many popular activation functions such as ReLU, tanh, and sigmoid. In contrast, [6] assumes that the input dimension is of the order O(1), the connection weights are bounded, and the activation function is ReLU. This paper also provides a scalable algorithm for simulating from the posterior distribution of the Bayesian DNN. The numerical results indicate that sparsity is essential for improving the prediction accuracy and model calibration of the DNN.

The remaining part of this paper is organized as follows. Section 2 presents the theoretical results. Section 3 describes a stochastic gradient Markov chain Monte Carlo (SGMCMC) algorithm for Bayesian DNN simulation. Section 4 presents a network compression example. Section 5 concludes the paper with a brief discussion.

2. THEORETICAL RESULTS

Let $D_n = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}$ denote a training dataset of $n$ i.i.d. observations, where $x^{(i)} \in \mathbb{R}^{p_n}$, $y^{(i)} \in \mathbb{R}$, and $p_n$ denotes the dimension of the input variables, which is assumed to grow with the training sample size $n$. We assume that the data are generated from a generalized linear model with the pdf/pmf given by

$f(y \mid \mu^*(x)) = \exp\{A(\mu^*(x))\, y + B(\mu^*(x)) + C(y)\},$

where $\mu^*(x)$ denotes an unknown function of $x$, and $A(\cdot)$, $B(\cdot)$ and $C(\cdot)$ are appropriately defined functions. For example, for logistic regression, we have $A(\mu^*) = \mu^*$, $B(\mu^*) = -\log(1 + e^{\mu^*})$ and $C(y) = 0$; and for normal regression, by introducing an extra dispersion parameter $\sigma^2$, we have $A(\mu^*) = \mu^*/\sigma^2$, $B(\mu^*) = -\mu^{*2}/(2\sigma^2)$ and $C(y) = -y^2/(2\sigma^2) - \log(2\pi\sigma^2)/2$. For simplicity of analysis, $\sigma^2$ is assumed to be known. To model the data, we propose to approximate $\mu^*(x)$ by a DNN with $H_n - 1$ hidden layers. Let $L_h$ denote the number of hidden units at layer $h$, with $L_{H_n} = 1$ for the output layer and $L_0 = p_n$ for the input layer. Let $w^h \in \mathbb{R}^{L_h \times L_{h-1}}$ and $b^h \in \mathbb{R}^{L_h \times 1}$, $h \in \{1, 2, \ldots, H_n\}$, denote the weights and biases of layer $h$, respectively. Let $\psi^h: \mathbb{R}^{L_h \times 1} \to \mathbb{R}^{L_h \times 1}$ denote the mapping that applies a piecewise differentiable activation function $\psi$ to each entry of its input. In summary, the DNN forms the nonlinear mapping

$\mu(\beta, x) = w^{H_n} \psi^{H_n-1}\big[\cdots \psi^{1}\big[w^{1} x + b^{1}\big] \cdots\big] + b^{H_n},$  (1)

where $\beta = \{w^h_{ij}, b^h_k : 1 \le h \le H_n, \, 1 \le i, k \le L_h, \, 1 \le j \le L_{h-1}\}$ denotes the collection of weights and biases of the DNN. A fully connected DNN consists of $K_n = |\beta| = \sum_{h=1}^{H_n} (L_{h-1} \times L_h + L_h)$ tuning parameters. To facilitate the representation of sparse DNNs, we re-parameterize $\beta$ by introducing an indicator variable for each element of $\beta$, which indicates the existence of the corresponding connection. Here the bias is treated as a special connection with a constant input of 1. Let $w^h = \tilde{w}^h \circ \gamma^{w^h}$ and $b^h = \tilde{b}^h \circ \gamma^{b^h}$, where $\gamma^{w^h}$ and $\gamma^{b^h}$ are the matrices or vectors of indicators, and $\circ$ denotes the element-wise product. Correspondingly, we define $\tilde{\beta} = \{\tilde{w}^h_{ij}, \tilde{b}^h_k\}$ and $\gamma = \{\gamma^{w^h}_{ij}, \gamma^{b^h}_k\}$, which represent the connection weights and the network structure, respectively. For notational simplicity, we rewrite $\tilde{\beta}$ and $\gamma$ as $\tilde{\beta} = (\tilde{\beta}_1, \tilde{\beta}_2, \ldots, \tilde{\beta}_{K_n}) \in \mathbb{R}^{K_n}$ and $\gamma = (\gamma_1, \gamma_2, \ldots, \gamma_{K_n}) \in \{0, 1\}^{K_n}$. Correspondingly, we have $\beta = \tilde{\beta} \circ \gamma$.
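To make the masked re-parameterization concrete, the following sketch (a minimal NumPy illustration, not the implementation used in this paper; the ReLU activation and the toy layer sizes are assumptions made purely for the example) evaluates $\mu(\beta, x)$ in (1) with each weight matrix and bias vector multiplied element-wise by its indicator array.

```python
import numpy as np

def masked_forward(x, weights, biases, w_masks, b_masks):
    """Evaluate the sparse DNN mu(beta, x) of Eq. (1), where each layer's
    effective parameters are the element-wise products w-tilde o gamma."""
    relu = lambda z: np.maximum(z, 0.0)          # a Lipschitz activation (cf. A.3)
    h = x
    n_layers = len(weights)
    for l in range(n_layers):
        w = weights[l] * w_masks[l]              # w^h = w~^h o gamma^{w^h}
        b = biases[l] * b_masks[l]               # b^h = b~^h o gamma^{b^h}
        h = w @ h + b
        if l < n_layers - 1:                     # no activation on the output layer
            h = relu(h)
    return h

# Toy example: p_n = 4 inputs, one hidden layer of 3 units, scalar output.
rng = np.random.default_rng(0)
sizes = [4, 3, 1]
weights = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(2)]
biases  = [rng.normal(size=(sizes[l + 1], 1)) for l in range(2)]
w_masks = [rng.integers(0, 2, size=w.shape).astype(float) for w in weights]
b_masks = [np.ones_like(b) for b in biases]
print(masked_forward(rng.normal(size=(4, 1)), weights, biases, w_masks, b_masks))
```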

2.1. Regularity conditions on sparse DNNs

With slight abuse of notation, we will rewrite μ(β, x) in (1) for the sparse DNN as μ(β, γ, x) by including the network structure information. In what follows, we characterize how well μ∗(x) can be approximated by μ(β, γ, x) with a sparse DNN from the Bayesian perspective. For this purpose, we assume the following regularity conditions:

A.1 The input $x$ is bounded entry-wise by 1, i.e., $x \in \Omega = [-1, 1]^{p_n}$, and the density of $x$ is bounded on its support $\Omega$ uniformly with respect to $n$.

A.2 The unknown regression mean function μ∗(x) can be well approximated by a sparse DNN model such that μ(β∗, γ∗, x) satisfies

  • A.2.1 $\|\mu^*(x) - \mu(\beta^*, \gamma^*, x)\|_{L^2(\Omega)} \le \varpi_n$, where the approximation error $\varpi_n \to 0$ as the sample size $n \to \infty$.

  • A.2.2 $H_n \log K_n \le C \log n$ for some constant $C > 0$, $1 \le r_n^* H_n \log K_n \prec n$, and $\|\beta^*\| < E_n$, where $E_n$ is a sequence satisfying $\log E_n = O(\log K_n)$, and $r_n^*$ is the network size, i.e., $r_n^* = \sum_{i=1}^{K_n} \gamma_i^*$.

A.3 The activation function ψ is Lipschitz continuous with a Lipschitz constant of 1.

Refer to the supplementary material for more discussions on the conditions, where we show how our Bayesian DNN theory is compatible with the existing sparse DNN approximation theory.

2.2. The Spike-and-Slab Prior

To conduct Bayesian inference for the DNN, we consider a hierarchical prior given by

$\tilde{\beta}_i \mid \gamma_i = 1 \sim N(0, \sigma_{1,n}^2), \quad i = 1, 2, \ldots, K_n; \qquad \pi(\gamma) \propto \lambda_n^{|\gamma|} (1 - \lambda_n)^{K_n - |\gamma|}\, 1\{|\gamma| \le \bar{r}_n, \, \gamma \in G\},$  (2)

where $\sigma_{1,n}^2$ is the variance of the normal distribution, $\bar{r}_n$ is the maximum size of the candidate sparse DNNs, $\lambda_n$ can be interpreted as the approximate prior probability for each connection to be included in the DNN, and $G$ is the set of all valid DNNs. With this hierarchical prior, $\beta_i$ follows a discrete spike-and-slab prior, i.e., $\beta_i = \tilde{\beta}_i \gamma_i \sim \lambda_n N(0, \sigma_{1,n}^2) + (1 - \lambda_n) \delta_0$ for $i = 1, 2, \ldots, K_n$, where $\delta_0$ denotes the Dirac delta function at 0.
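As a concrete illustration of the prior, the sketch below (hypothetical NumPy code, not the authors' implementation) draws a parameter vector from the marginal spike-and-slab distribution $\lambda_n N(0, \sigma_{1,n}^2) + (1 - \lambda_n)\delta_0$, ignoring the truncation to $\{|\gamma| \le \bar{r}_n, \gamma \in G\}$ for simplicity.

```python
import numpy as np

def sample_spike_and_slab(K_n, lambda_n, sigma1_sq, rng=None):
    """Draw beta_i = beta_tilde_i * gamma_i: each connection is included with
    probability lambda_n and, if included, its weight comes from the Gaussian
    slab N(0, sigma1_sq); otherwise the weight is exactly zero (the spike)."""
    rng = rng or np.random.default_rng()
    gamma = rng.binomial(1, lambda_n, size=K_n)                 # inclusion indicators
    beta_tilde = rng.normal(0.0, np.sqrt(sigma1_sq), size=K_n)  # slab draws
    return beta_tilde * gamma, gamma

beta, gamma = sample_spike_and_slab(K_n=10_000, lambda_n=0.1, sigma1_sq=0.04)
print(gamma.mean())   # empirical inclusion rate, close to lambda_n
```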

2.3. Posterior consistency

Posterior consistency measures the speed of posterior concentration around the true density function, which plays a major role in validating Bayesian methods. For DNNs, since the total number of parameters $K_n$ is often much larger than the sample size $n$, posterior consistency provides a general guideline for prior setting; otherwise, the prior information may dominate the data information, leading to biased inference for the underlying true model. For learning sparse DNNs, we generally set $\lambda_n \to 0$ as $K_n \to \infty$, which provides an automatic control of the multiplicity involved in structure selection [11]. Denote by $P^*$ and $E^*$ the probability measure and expectation, respectively, with respect to the data $D_n$; denote by $d(p_1, p_2)$ the Hellinger distance between two densities $p_1(x, y)$ and $p_2(x, y)$; denote by $p_\beta$ the density function induced by a DNN with parameters $\beta$; and denote by $p_{\mu^*}$ the true density function induced by the unknown function $\mu^*(x)$. Let $\pi(A \mid D_n)$ denote the posterior probability of an event $A$. The following theorem establishes posterior consistency for sparse DNNs under the spike-and-slab prior (2).
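Recall that one common convention for the Hellinger distance between two joint densities is

$d(p_1, p_2) = \Big\{ \tfrac{1}{2} \int \big( \sqrt{p_1(x, y)} - \sqrt{p_2(x, y)} \big)^2 \, dx \, dy \Big\}^{1/2};$

the choice of normalizing constant is immaterial here, as it only rescales $\epsilon_n$.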

Theorem 2.1.

Consider a DNN with $H_n$ layers and at most $K_n$ possible connections, where both $H_n$ and $K_n$ increase with $n$. Suppose that assumptions A.1-A.3 hold, and that $\tilde{\beta}$ and $\gamma$ are subject to the prior (2) with hyperparameters satisfying $\log \sigma_{1,n}^2 + E_n^2 / \sigma_{1,n}^2 = O(H_n \log K_n)$ and $1 \le r_n \le \bar{r}_n \le K_n$, where $r_n = \lambda_n K_n$. Then there exists a sequence $\epsilon_n \precsim \max\{\varpi_n, \sqrt{\bar{r}_n H_n \log K_n / n}\}$ such that, for sufficiently large $n$, the posterior distribution satisfies

$P^*\big\{ \pi[d(p_\beta, p_{\mu^*}) > 4\epsilon_n \mid D_n] \ge 2 e^{-n\epsilon_n^2/4} \big\} \le 2 e^{-n\epsilon_n^2/4}, \qquad E^*\, \pi[d(p_\beta, p_{\mu^*}) > 4\epsilon_n \mid D_n] \le 4 e^{-n\epsilon_n^2/2}.$  (3)

Theorem 2.1 establishes a Bayesian contraction rate for sparse DNNs under the Hellinger metric. Therefore, to obtain consistent Bayesian inference, it suffices to require $\varpi_n \to 0$ and $\bar{r}_n H_n \log(K_n) \prec n$. If $\mu^*$ can be optimally represented by a sparse DNN model (refer to Remark S1 in the supplementary material), then $\varpi_n$ decays polynomially to 0 as $r_n$ increases to infinity. On the other hand, $\bar{r}_n H_n \log(K_n) \prec n$ implies $\bar{r}_n = o(n/\log(n))$ by noting that $K_n \ge \bar{r}_n$. Refer to the supplementary material for more discussion of this theorem. In conclusion, we only need $\bar{r}_n$, the upper bound on the total connectivity, to increase slowly; in other words, a sparse DNN with connectivity of order $O(n/\log(n))$ is already large enough to provide a consistent approximation to the true model.

2.4. Consistency of DNN structure selection

It is known that the DNN model is in general nonidentifiable due to the symmetry of the network structure. However, by introducing appropriate constraints [12, 13], we can define a set of neural networks such that each network in the set is unique up to simple node permutations, sign changes, etc. Let Θ denote such a set of DNNs, where each element of Θ can be viewed as an equivalence class of networks. Let ν(·) be the transformation such that ν(β) ∈ Θ. Since $p_\beta = p_{\nu(\beta)}$, the posterior consistency (3) holds for ν(β) as well. In what follows, we rewrite ν(β) as ν(γ, β) to include the network structure information. To serve the purpose of structure selection for the DNN, we consider the marginal inclusion posterior probability (MIPP) approach proposed in [14] for high-dimensional variable selection. For each entry of γ, we define its MIPP by $q_i = \sum_{\gamma} \int e_{i|\nu(\gamma, \beta)} \, \pi(\gamma \mid \beta, D_n)\, \pi(\beta \mid D_n) \, d\beta$, where $e_{i|\nu(\gamma, \beta)}$ indicates the existence of connection $i$ in the network $\nu(\gamma, \beta)$. Similarly, we define $e_{i|\nu(\gamma^*, \beta^*)}$ as the indicator for connection $i$ in the true DNN $\nu(\gamma^*, \beta^*)$. The MIPP approach chooses the connections whose marginal inclusion posterior probabilities are greater than a threshold value $\hat{q}$; that is, it sets $\hat{\gamma}_{\hat{q}} = \{j : q_j > \hat{q}, \, j = 1, 2, \ldots, K_n\}$ as a Bayes estimator of the true model. To establish the consistency of $\hat{\gamma}_{\hat{q}}$, the following identifiability condition on the true network $\gamma^*$ is needed. Let $A(\epsilon_n) = \{\beta : d(p_\beta, p_{\mu^*}) \ge \epsilon_n\}$, and define $\rho(\epsilon_n) = \max_{1 \le i \le K_n} \int_{A(\epsilon_n)^c} \sum_{\gamma} |e_{i|\nu(\gamma, \beta)} - e_{i|\nu(\gamma^*, \beta^*)}| \, \pi(\gamma \mid \beta, D_n)\, \pi(\beta \mid D_n) \, d\beta$, which measures the distance between the true DNN model and the models drawn from the posterior within the $\epsilon_n$-neighborhood $A(\epsilon_n)^c$. The identifiability condition can be stated as follows:

B.1 $\rho(\epsilon_n) \to 0$ as $n \to \infty$ and $\epsilon_n \to 0$.

That is, when $n$ is sufficiently large, if a DNN produces approximately the same probability distribution as the true data-generating model, then its structure, after being mapped into the network space Θ, must coincide with the true DNN structure. Note that this identifiability is different from the one mentioned at the beginning of this section, which concerns only structure and parameter rearrangements. Theorem 2.2 establishes the consistency of $\hat{\gamma}_{\hat{q}}$ and its sure screening property.

Theorem 2.2.

If the conditions of Theorem 2.1 and condition B.1 hold, then (i) $\max_{1 \le i \le K_n} \{|q_i - e_{i|\nu(\gamma^*, \beta^*)}|\} \to 0$ in probability; and (ii) for any pre-specified $\hat{q} \in (0, 1)$, $\lim_{n \to \infty} P(\gamma^* \subset \hat{\gamma}_{\hat{q}}) = 1$.

In practice, the value of $\hat{q}$ can be determined by a multiple hypothesis test, see e.g. [15]. For simplicity, we often set $\hat{q} = 0.5$. Recall that $\gamma^{w^h} \in \{0, 1\}^{L_h \times L_{h-1}}$ denotes the connection indicator matrix of layer $h$. Let

$\gamma^x = \gamma^{w^{H_n}} \gamma^{w^{H_n - 1}} \cdots \gamma^{w^{1}} \in \mathbb{R}^{1 \times p_n},$  (4)

and let $\gamma^x_i$ denote the $i$-th element of $\gamma^x$. If $\gamma^x_i > 0$, then the input variable $x_i$ is effective in the network $\gamma$; otherwise, it is not. As with network connections, we can define an MIPP for each input variable and determine a threshold value for it based on a multiple hypothesis test. As implied by (4), consistency of structure selection implies consistency of variable selection with respect to the true network $\gamma^*$ defined in Assumption A.2.1.
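As an illustration of the screening rule in (4), the following sketch (hypothetical NumPy code, assuming the layer-wise connection indicator matrices are available as 0/1 arrays) computes $\gamma^x$ and returns the indices of the effective input variables.

```python
import numpy as np

def effective_inputs(w_masks):
    """Compute gamma^x = gamma^{w^{H_n}} ... gamma^{w^1} as in Eq. (4) and return
    the indices of input variables still connected to the output."""
    gamma_x = w_masks[-1].astype(float)          # start from the output layer
    for mask in reversed(w_masks[:-1]):          # multiply down to the input layer
        gamma_x = gamma_x @ mask.astype(float)
    return np.flatnonzero(gamma_x.ravel() > 0)   # x_i is effective iff gamma^x_i > 0

# Toy example: 5 inputs -> 3 hidden units -> 1 output.
w1 = np.array([[1, 0, 0, 0, 0],
               [0, 1, 0, 0, 0],
               [0, 0, 0, 0, 1]])
w2 = np.array([[1, 1, 0]])                       # hidden unit 3 is disconnected
print(effective_inputs([w1, w2]))                # -> [0 1]: only x_1 and x_2 are effective
```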

3. COMPUTATIONAL ALGORITHM

Although the spike-and-slab prior leads to nice theoretical properties for Bayesian sparse DNNs, the resulting posterior is a mixture of distributions of varying dimensions. Conventional Bayesian sampling algorithms adopt reversible jump moves [16] to explore different network architectures, see e.g. [13], and lack the scalability required for big data problems. [17] proposed an extended stochastic gradient Langevin dynamics (SGLD) algorithm for Bayesian variable selection under the big data scenario, which allows mini-batch data to be used in updating the model and regression coefficients at each iteration and is thus scalable to big data problems. In this paper, we extend it further by combining it with stochastic gradient Hamiltonian Monte Carlo (SGHMC) [18]; see Algorithm 1 for the procedure and the supplementary material for further explanation.

The main parameters of the extended SGHMC algorithm are the prior hyperparameters $\lambda_n$ and $\sigma_{1,n}$ defined in (2). Our theory allows $\sigma_{1,n}$ to grow with $n$ from the perspective of data fitting, but in our experience, large weight magnitudes tend to adversely affect the generalization ability of the network. For this reason, we usually set $\sigma_{1,n}$ to a small number such as 0.01 or 0.02, and then tune the value of $\lambda_n$ for network sparsity such that the sparsity constraint given in Assumption A.2.2 is satisfied.
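Since the prior expected network size is $r_n = \lambda_n K_n$ (see Theorem 2.1), a crude starting value for $\lambda_n$ can be obtained by matching $r_n$ to the desired number of retained connections, as in the heuristic below (the ResNet20 parameter count is approximate, and this matching rule is an illustration rather than a tuning recipe stated in the paper); the value is then fine-tuned as described above.

```python
def prior_inclusion_prob(K_n, target_nonzero):
    """Heuristic starting point: the prior expected network size is
    r_n = lambda_n * K_n, so matching it to the desired number of
    retained connections gives lambda_n = target_nonzero / K_n."""
    return target_nonzero / K_n

K_n = 270_000                                    # roughly the size of ResNet20
lam = prior_inclusion_prob(K_n, target_nonzero=0.10 * K_n)
print(lam)                                       # 0.1, i.e. ~10% of connections kept a priori
```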

4. EXPERIMENTS

This section presents an example on network compression for convolutional neural networks (CNNs). Refer to the supplementary material for examples on high-dimensional nonlinear variable selection and DNN model calibration. In terms of network compression, the extended SGHMC algorithm can be viewed as an unstructured dynamic pruning algorithm, where the spike-and-slab prior is imposed on each parameter and the sparsity mask γ can be adjusted during training [19]. We test the proposed method on the CIFAR-10 data [20], pruning ResNet20 and ResNet32 networks [1] to different sparsity levels. We train the networks using the standard setup of [19]: a mini-batch size of 128, 300 training epochs, and an initial learning rate ϵ0 = 0.1 with ϵt divided by 10 at epochs 150 and 225. We set $\sigma_{1,n}^2 = 0.04$ and sample from the posterior using the extended SGHMC algorithm (Algorithm 1) with $t_0 = 1$, $\tau = 0.0001$, $\alpha = 0.1$, and a magnitude-based proposal for γ.

Algorithm 1.

Extended Stochastic Gradient Hamiltonian Monte Carlo

Input: T ; m; τ; t0; t1
Initialization: Randomly initialize $\tilde{\beta}^{(1)}$. Set $\gamma_i^{(0, t_0)} = 1$ for all $i \in \{1, \ldots, K_n\}$;
for t = 1, 2, ..., T do
 (i) Draw a subsample of size $m$ from the full dataset $D_n$, and duplicate it to form a dataset $D_{m,n}^{(t)}$.
 (ii) Simulate models $\gamma^{(t,1)}, \gamma^{(t,2)}, \ldots, \gamma^{(t, t_0)}$ from the conditional posterior $\pi(\gamma \mid \tilde{\beta}^{(t)}, D_{m,n}^{(t)})$ by a Metropolis-Hastings (MH) algorithm with a magnitude-based proposal.
 (iii) Update $\tilde{\beta}^{(t)}$ via an SGHMC move,
  
$v^{(t+1)} = (1 - \alpha)\, v^{(t)} + \frac{\epsilon_{t+1}}{2 t_0} \sum_{k=1}^{t_0} \nabla_{\tilde{\beta}} \log \pi\big(\tilde{\beta}^{(t)} \mid \gamma^{(t,k)}, D_{m,n}^{(t)}\big) + \sqrt{\alpha\, \epsilon_{t+1}\, \tau}\; \eta_{t+1},$
$\tilde{\beta}^{(t+1)} = \tilde{\beta}^{(t)} + v^{(t+1)},$
 where $\eta_{t+1} \sim N(0, I_d)$, $\epsilon_{t+1}$ is the learning rate, $\tau$ is the temperature, and $1 - \alpha$ is the momentum parameter.
end for

We tune $\lambda_n$ and $B$ (a parameter of the magnitude-based proposal, described in the supplementary material) to achieve the target sparsity levels for the different ResNets. Posterior samples are collected at the end of each epoch, and the first 225 samples are discarded as burn-in.
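For concreteness, the sketch below restates step (iii) of Algorithm 1 in NumPy under assumed inputs (the user supplies the stochastic gradient of the log-posterior averaged over the $t_0$ sampled structures); it is an illustration, not the code used for the experiments.

```python
import numpy as np

def sghmc_step(beta_tilde, v, grad_log_post_avg, eps, alpha, tau, rng=None):
    """One momentum update of step (iii) in Algorithm 1:
       v     <- (1 - alpha) * v + (eps / 2) * avg_gradient + sqrt(alpha * eps * tau) * N(0, I)
       beta~ <- beta~ + v
    where avg_gradient = (1/t0) * sum_k grad log pi(beta~ | gamma^{(t,k)}, D^{(t)}_{m,n})."""
    rng = rng or np.random.default_rng()
    noise = rng.standard_normal(beta_tilde.shape)
    v = (1.0 - alpha) * v + 0.5 * eps * grad_log_post_avg + np.sqrt(alpha * eps * tau) * noise
    return beta_tilde + v, v

# Toy usage with a standard normal log-posterior (gradient = -beta).
beta, v = np.zeros(5), np.zeros(5)
for _ in range(100):
    beta, v = sghmc_step(beta, v, grad_log_post_avg=-beta, eps=0.1, alpha=0.1, tau=1e-4)
print(beta)
```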

We compare the proposed method with some recent unstructured compression methods, including dynamic pruning with feedback (DPF) [19], dynamic sparse reparameterization (DSR) [21], sparse momentum (SM) [22], consistent sparse deep learning (BNNcs) [7], and variational inference for sparse deep learning (SVI) [9]. The numerical results are summarized in Table 1. The comparison shows that the proposed method achieves higher prediction accuracy than the existing methods at similar sparsity levels.

Table 1:

Network pruning for CIFAR-10 data, where each network is pruned to two target sparsity levels (the two Pruning Ratio/Test Accuracy column pairs) and the result of each method is calculated by averaging over 3 independent runs with the standard deviation reported in parentheses. BNNbma and BNNsingle denote the proposed method with the results calculated by Bayesian model averaging and based on the single model collected at the last epoch, respectively. The other methods are defined in the text.

Model Method Pruning Ratio(%) Test Accuracy(%) Pruning Ratio(%) Test Accuracy(%)

ResNet20 BNNbma 9.65(0.05) 91.60(0.06) 19.76(0.02) 92.65(0.02)
BNNsingle 9.88(0.08) 91.26(0.02) 19.83(0.02) 92.32(0.04)
BNNcs 9.55(0.03) 91.27(0.05) 19.67(0.05) 92.27(0.03)
SVI 10 89.43(0.05) 20 91.78(0.06)
SM 10 89.76(0.40) 20 91.54(0.16)
DSR 10 87.88(0.04) 20 91.78(0.28)
DPF 10 90.88(0.07) 20 92.17(0.21)

ResNet32 BNNbma 4.89(0.09) 91.84(0.09) 9.65(0.05) 92.99(0.08)
BNNsingle 4.99(0.06) 91.39(0.10) 8.77(0.12) 92.74(0.10)
BNNcs 4.78(0.01) 91.21(0.01) 9.53(0.04) 92.74(0.07)
SVI 5 86.31(0.23) 10 91.61(0.10)
SM 5 88.68(0.22) 10 91.54(0.18)
DSR 5 84.12(0.32) 10 91.41(0.23)
DPF 5 90.94(0.35) 10 92.42(0.18)

5. CONCLUSION

In this paper, we study sparse deep learning from the Bayesian perspective. Theoretically, we establish posterior consistency and structure selection consistency for Bayesian DNNs with a spike-and-slab prior. We give the posterior contraction rate, and propose to use the marginal inclusion posterior probability approach to determine the structure of the sparse DNN. We employ a scalable SGMCMC algorithm for simulating from the posterior of the sparse DNN, which has about the same order of computational complexity as popular optimization algorithms such as SGD. When irrelevant features exist in the dataset, our numerical results show that learning a sparse DNN is always rewarded in prediction compared to blindly learning a large, fully connected DNN. The proposed method can be applied to large-scale network pruning tasks as an unstructured dynamic pruning method, where the sparsity mask is adjusted systematically during training. Our numerical results show that the proposed method can achieve state-of-the-art performance in network structure selection, large-scale network pruning and model calibration.

As mentioned previously, sparsity is essential for improving the calibration of the DNN [23, 24]. This work provides a practical and efficient method for learning sparse DNNs with a theoretical guarantee for identification of an appropriate sparse structure, and thus it will naturally enhance the development of trustworthy artificial intelligence.

Supplementary Material


Acknowledgments

Liang's research was supported in part by the grants DMS-2015498, R01-GM117597 and R01-GM126089. Song's research was supported in part by the grant DMS-1811812. The authors thank the editor, associate editor, and referees for their constructive comments, which have led to significant improvement of this paper.

Footnotes

1. $a_n \prec b_n$ means $a_n / b_n \to 0$ as $n \to \infty$.

2. $a_n \precsim b_n$ means that there exists a constant $C$ such that $a_n / b_n \le C$ as $n \to \infty$.


References

  • [1] He K, Zhang X, Ren S, Sun J, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [2] Denil M, Shakibi B, Dinh L, Ranzato M, de Freitas N, Predicting parameters in deep learning, in: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2, Curran Associates Inc., Red Hook, NY, USA, 2013, pp. 2148–2156.
  • [3] Guo C, Pleiss G, Sun Y, Weinberger KQ, On calibration of modern neural networks, in: Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, JMLR.org, 2017, pp. 1321–1330.
  • [4] Schmidt-Hieber J, Nonparametric regression using deep neural networks with ReLU activation function, Ann. Statist. 48 (4) (2020) 1875–1897.
  • [5] Bauer B, Kohler M, On deep learning as a remedy for the curse of dimensionality in nonparametric regression, Ann. Statist. 47 (4) (2019) 2261–2285.
  • [6] Polson N, Ročková V, Posterior concentration for sparse deep learning, in: Advances in Neural Information Processing Systems, 2018, pp. 930–941.
  • [7] Sun Y, Song Q, Liang F, Consistent sparse deep learning: Theory and computation, J. Amer. Statist. Assoc. (2021), in press.
  • [8] Chérief-Abdellatif B-E, Convergence rates of variational inference in sparse deep learning, in: Daumé III H, Singh A (Eds.), Proceedings of the 37th International Conference on Machine Learning, Vol. 119 of Proceedings of Machine Learning Research, PMLR, 2020, pp. 1831–1842.
  • [9] Bai J, Song Q, Cheng G, Efficient variational inference for sparse deep learning with theoretical guarantee, in: Advances in Neural Information Processing Systems, 2020, pp. 466–476.
  • [10] Bai J, Song Q, Cheng G, Adaptive variational Bayesian inference for sparse deep neural network (2020). arXiv:1910.04355.
  • [11] Scott JG, Berger JO, Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem, Ann. Statist. 38 (5) (2010) 2587–2619.
  • [12] Pourzanjani AA, Jiang RM, Petzold LR, Improving the identifiability of neural networks for Bayesian inference, in: NIPS Workshop on Bayesian Deep Learning, 2017.
  • [13] Liang F, Li Q, Zhou L, Bayesian neural networks for selection of drug sensitive genes, J. Amer. Statist. Assoc. 113 (523) (2018) 955–972.
  • [14] Liang F, Song Q, Yu K, Bayesian subset modeling for high dimensional generalized linear models, J. Amer. Statist. Assoc. 108 (2013) 589–606.
  • [15] Liang F, Zhang J, Estimating false discovery rate using stochastic approximation, Biometrika 95 (2008) 961–977.
  • [16] Green PJ, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika 82 (4) (1995) 711–732.
  • [17] Song Q, Sun Y, Ye M, Liang F, Extended stochastic gradient MCMC for large-scale Bayesian variable selection, Biometrika 107 (4) (2020) 997–1004.
  • [18] Chen T, Fox E, Guestrin C, Stochastic gradient Hamiltonian Monte Carlo, in: International Conference on Machine Learning, 2014, pp. 1683–1691.
  • [19] Lin T, Stich SU, Barba L, Dmitriev D, Jaggi M, Dynamic model pruning with feedback, in: International Conference on Learning Representations, 2020.
  • [20] Krizhevsky A, Hinton G, Learning multiple layers of features from tiny images, Tech. rep., Citeseer, 2009.
  • [21] Mostafa H, Wang X, Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization, in: International Conference on Machine Learning, 2019, pp. 4646–4655.
  • [22] Dettmers T, Zettlemoyer L, Sparse networks from scratch: Faster training without losing performance, arXiv preprint arXiv:1907.04840.
  • [23] Wang Y, Ročková V, Uncertainty quantification for sparse deep learning, in: AISTATS, 2020.
  • [24] Lee J, Humt M, Feng J, Triebel R, Estimating model uncertainty of neural networks in sparse information form, in: Daumé III H, Singh A (Eds.), Proceedings of the 37th International Conference on Machine Learning, Vol. 119, PMLR, 2020, pp. 5702–5713.
