PAC Bayesian Performance Guarantees for Deep (Stochastic) Networks in Medical Imaging

Anthony Sicilia; Xingchen Zhao; Anastasia Sosnovskikh; Seong Jae Hwang

doi:10.1007/978-3-030-87199-4_53

Author manuscript; available in PMC: 2021 Dec 23.

Published in final edited form as: Med Image Comput Comput Assist Interv. 2021 Sep 21;12903:560–570. doi: 10.1007/978-3-030-87199-4_53


PMCID: PMC8702021 NIHMSID: NIHMS1760019 PMID: 34957473

Abstract

Application of deep neural networks to medical imaging tasks has in some sense become commonplace. Still, a “thorn in the side” of the deep learning movement is the argument that deep networks are prone to overfitting and are thus unable to generalize well when datasets are small (as is common in medical imaging tasks). One way to bolster confidence is to provide mathematical guarantees, or bounds, on network performance after training which explicitly quantify the possibility of overfitting. In this work, we explore recent advances using the PAC-Bayesian framework to provide bounds on generalization error for large (stochastic) networks. While previous efforts focus on classification in larger natural image datasets (e.g., MNIST and CIFAR-10), we apply these techniques to both classification and segmentation in a smaller medical imaging dataset: the ISIC 2018 challenge set. We observe the resultant bounds are competitive compared to a simpler baseline, while also being more explainable and alleviating the need for holdout sets.

1. Introduction

Understanding the generalization of learning algorithms is a classical problem. Practically speaking, verifying whether a fixed method of inference will generalize may not seem to be a challenging task. Holdout sets are the tool of choice for most practitioners – when sample sizes are large, we can be confident the measured performance is representative. In medical imaging, however, sample sizes are often small and stakes are often high. Thus, mathematical guarantees¹ on the performance of our algorithms are of paramount importance. Yet, it is not abundantly common to provide such guarantees in medical imaging research on deep neural networks; we are interested in supplementing this shortage.

A simple (but effective) guarantee on performance is achieved by applying a Hoeffding Bound to the error of an inference algorithm reported on a holdout set. In classification tasks, Langford provides a useful tutorial on these types of high probability bounds among others [32]. The bounds are easily extended to any bounded performance metrics in general, and we use this methodology as a baseline in our own experimentation (Sect. 3). While effective, Hoeffding’s Inequality falls short in two regards: (1) use of a holdout set requires that the model does not see all available data and (2) the practitioner gains no insight into why the model generalized well. Clearly, both short-comings can be undesirable in a medical imaging context: (1) access to the entire dataset can be especially useful for datasets with rare presence of a disease and (2) understanding why can both improve algorithm design and give confidence when deploying models in a clinical setting. Thus, we desire practically applicable bounds – i.e., competitive with Hoeffding’s Inequality – which avoid the aforementioned caveats.
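As a rough illustration of this baseline, the sketch below computes a one-sided Hoeffding upper bound on the risk of a fixed model from its error on an independent holdout set. The function name and the example numbers are hypothetical, and the one-sided form is an assumption; the exact variant used in the experiments is described in the Appendix and may differ.

```python
import math


def hoeffding_upper_bound(holdout_risk, n, delta=0.05):
    """Upper bound on the true risk of a fixed model from n i.i.d. holdout points.

    Holds with probability at least 1 - delta for any [0, 1]-bounded loss.
    """
    if not 0.0 <= holdout_risk <= 1.0:
        raise ValueError("holdout_risk must lie in [0, 1]")
    if n <= 0 or not 0.0 < delta < 1.0:
        raise ValueError("need n > 0 and delta in (0, 1)")
    # Hoeffding slack term: sqrt(ln(1/delta) / (2n)).
    slack = math.sqrt(math.log(1.0 / delta) / (2.0 * n))
    # Risks never exceed 1, so the bound is clipped there.
    return min(1.0, holdout_risk + slack)


# Hypothetical numbers: 15% holdout error measured on 800 examples.
bound = hoeffding_upper_bound(0.15, 800)
print(f"true risk <= {bound:.3f} with probability >= 0.95")
```

The slack shrinks only as the square root of the holdout size, which is why this baseline forces a trade-off between data used for training and data reserved for the guarantee.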

Unfortunately, for deep neural networks, practically applicable guarantees of this nature can be challenging to produce. Traditional PAC bounds based on the Vapnik-Chervonenkis (VC) dimension [8,46,48,49] accomplish our goals to some extent, but require (much) more samples than parameters in our network to produce a good guarantee.² When our networks are large – e.g., a ResNet-18 [23] with more than 10M parameters – our datasets are thus required to be unreasonably sized to ensure generalization, especially, in medical imaging contexts. In effect, these bounds are vacuous; they are logically meaningless for the sample sizes we observe in practice. Specifically, vacuous describes any bound on error which is larger than 1, and therefore, gives us no new insight on a model’s generalization ability because error rates lie in the range [0, 1]. The term was coined by Dziugaite & Roy [16] who observed that most PAC bounds on error for deep networks were in fact vacuous (when computed). The authors demonstrate a technique to compute non-vacuous bounds for some deep stochastic networks trained on standard datasets (e.g., MNIST) using the PAC-Bayesian framework.³ As discussed later, the motivation for using PAC-Bayesian bounds is the hypothesis that stochastic gradient descent (SGD) trained networks generalize well when their solutions lie in large, flat minima.⁴ Thus, these PAC-Bayes bounds give us exactly what we desire. They provide practically applicable guarantees on the performance of deep networks, while allowing the model access to all data and also giving insight on what network properties lead to good generalization in practice. To borrow an analogy from Arora [2], this approach is “prescriptive”. It is similar to a doctor’s orders to resolve high blood pressure: cut the salt, or in our case, look for flat minima.

In this context, the end goal of this paper is to validate whether the “flat minima prescription” – with observed evidence in traditional computer vision – may find similar success in explaining generalization for the small-data, non-traditional tasks that are common to medical imaging. Our contributions, in this sense, are primarily experimental. We demonstrate non-vacuous PAC-Bayesian performance guarantees for deep stochastic networks applied to the classification and segmentation tasks within the ISIC 2018 Challenge [13]. Importantly, our results show that PAC-Bayesian bounds are competitive against Hoeffding’s Inequality, offering a practical alternative which avoids the aforementioned caveats. We employ much the same strategies used by Dziugaite et al. [15,16] as well as those used by Pérez-Ortiz et al. [43]. With that said, our different setting yields novel experimental results and poses some novel challenges. Specifically, in segmentation, we compute non-vacuous bounds for a fully-sized U-Net using a medical imaging dataset with about 2.3K training samples. To our knowledge, for deep stochastic networks, we are the first to compute non-vacuous bounds in segmentation on such small datasets. Along the way, we offer some practical insights for the medical imaging practitioner including a (mathematically sound) trick to handle batch normalization layers in PAC-Bayes bounds and an experimental technique to “probe” parameter space to learn about the generalization properties of a particular model and dataset. We hope these results promote continued research on the important topic of guarantees in medical imaging.

2. PAC-Bayesian Theory and Generalization

2.1. Formal Setup

In the PAC-Bayes setting, we consider a hypothesis space $H$ and a distribution Q over this space. Specific to our context, $H = ℝ^{d}$ will represent the space of deep networks with some fixed architecture, and Q will be a distribution over $H$ . Typically, we will set $Q = N (μ, Σ)$ , a multivariate normal. For some fixed space $Z = X \times Y$ , the hypothesis $h \in H$ defines a mapping $x \mapsto h (x)$ with $x \in X$ , and $h (x) \in Y$ . Given a [0, 1]-bounded loss function $ℓ : H \times Z \to [0, 1]$ and a distribution D over $Z$ , the risk of h is defined $R_{ℓ} (h, D) = E_{(x, y) ~ D} ℓ (h, (x, y))$ . Given instead a sample S ~ D^m over $Z$ , the empirical risk is denoted ${\hat{R}}_{ℓ} (h, S)$ and is computed as usual by averaging. In all of our discussions, the data distribution D or sample S is usually easily inferred from context. Therefore, we typically write $R_{ℓ} (h) = R_{ℓ} (h, D)$ and ${\hat{R}}_{ℓ} (h) = {\hat{R}}_{ℓ} (h, S)$ . With these definitions, we are interested in quantifying the risk of a stochastic model.⁵ In the context of neural networks, one can imagine sampling the distribution Q over $ℝ^{d}$ and setting the weights before performing inference on some data-point x. Often, we will refer to Q itself as the stochastic predictor. The associated risk for such a model is defined as $R_{ℓ} (Q) = E_{h ~ Q} R_{ℓ} (h)$ with ${\hat{R}}_{ℓ} (Q)$ similarly defined. Typically, we cannot exactly compute expectations over Q. For this reason, we also define ${\hat{R}}_{ℓ} (\hat{Q})$ for a sample $\hat{Q} ~ Q^{n}$ as an empirical variant, computed by averaging. The last components for any PAC-Bayesian bound come from two notions of the Kullback-Leibler (KL) divergence between two distributions Q and P written KL(Q||P) and defined as usual. For numbers q, p ∈ [0, 1], we write kl(q||p) as shorthand for the KL divergence between Bernoulli distributions parameterized by q, p. This is typically used to quantify the difference between the risk on a sample and the true risk. In the next section, we put the discussed pieces in play.
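To make the empirical variant concrete, the following toy sketch estimates the risk of a stochastic predictor by sampling n hypotheses from Q and averaging their empirical risks. Everything here is illustrative rather than taken from the experiments: the hypotheses are linear classifiers instead of deep networks, Q is an isotropic normal, and all names and data points are hypothetical.

```python
import random


def zero_one_loss(w, x, y):
    """0-1 loss of the linear classifier sign(<w, x>) on one example, y in {-1, +1}."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    prediction = 1 if score >= 0 else -1
    return 0.0 if prediction == y else 1.0


def empirical_risk(w, sample):
    """Average loss of a single hypothesis w over the sample S."""
    return sum(zero_one_loss(w, x, y) for x, y in sample) / len(sample)


def mc_stochastic_risk(mu, sigma, sample, n_models, rng):
    """Monte Carlo estimate of the empirical risk of Q = N(mu, sigma^2 I)."""
    total = 0.0
    for _ in range(n_models):
        # Draw one hypothesis h ~ Q, then score it on the whole sample.
        w = [rng.gauss(m, sigma) for m in mu]
        total += empirical_risk(w, sample)
    return total / n_models


# Toy data and a toy posterior mean (both hypothetical).
sample = [((1.0, 0.5), 1), ((-1.0, 0.2), -1), ((0.8, -0.3), 1), ((-0.6, -0.9), -1)]
risk = mc_stochastic_risk([1.0, 0.0], 0.1, sample, n_models=100, rng=random.Random(0))
print(f"estimated empirical risk of Q: {risk:.3f}")
```

Because the estimate uses finitely many draws from Q, a bound built on it must also account for the sampling error over models.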

2.2. The PAC-Bayesian Theorem

The PAC-Bayesian theory begins primarily with the work of McAllester [37] with similar conceptualizations given by Shawe-Taylor & Williamson [47]. Besides what is discussed in this section, for completeness, readers are also directed to the work of Catoni [11], McAllester [36], Germain et al. [19,20], and the primer by Guedj [21]. We start by stating the main PAC-Bayesian Theorem as given by Maurer [35]. See also Langford & Seeger [33] for the case where ℓ is a 01-loss.

Theorem 1 (Maurer).

Let ℓ be a [0, 1]-bounded loss function, D be a distribution over $Z$ , and P be a probability distribution over $H$ . Then, for δ ∈ (0, 1),

$$\underset{S \sim D^{m}}{\Pr} \left( \forall Q : kl \left( {\hat{R}}_{ℓ} (Q) ‖ R_{ℓ} (Q) \right) \leq \frac{KL (Q ‖ P) + \ln \frac{1}{δ} + \ln \sqrt{4 m}}{m} \right) \geq 1 - δ . \tag{1}$$

By way of Pinsker’s inequality, the above may be loosened for the purpose of interpretation [19]:

$$R_{ℓ} (Q) \leq {\hat{R}}_{ℓ} (Q) + \sqrt{\frac{KL (Q ‖ P) + \ln \frac{1}{δ} + \ln \sqrt{4 m}}{2 m}} . \tag{2}$$
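The loosened bound is simple enough to evaluate directly. The sketch below plugs hypothetical values for the empirical risk, the KL term, and the sample size into the right-hand side of the Pinsker-relaxed bound; the function name and the numbers are illustrative only.

```python
import math


def pinsker_pac_bayes_bound(emp_risk, kl_qp, m, delta=0.05):
    """Right-hand side of the Pinsker-relaxed PAC-Bayes bound.

    emp_risk: empirical risk of the stochastic predictor Q on the sample S
    kl_qp:    KL(Q || P) between posterior and prior
    m:        number of examples in S
    """
    complexity = kl_qp + math.log(1.0 / delta) + math.log(math.sqrt(4.0 * m))
    return emp_risk + math.sqrt(complexity / (2.0 * m))


# Hypothetical values: 10% empirical risk, KL of 50 nats, 1000 examples.
bound = pinsker_pac_bayes_bound(0.10, 50.0, 1000)
print(f"risk of Q <= {bound:.3f} with probability >= 0.95")
```

A returned value above 1 corresponds to a vacuous bound in the sense used in Sect. 1; with m fixed, this happens once the KL term grows large relative to the sample size.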

In Sect. 3, we compute a much tighter formulation of the bound given in Eq. (2) which handles the term $kl ({\hat{R}}_{ℓ} (Q) ‖ R_{ℓ} (Q))$ directly. We provide the derivation of this bound in the Appendix. Various insights and results used to build the final bound are of course due to Langford & Caruana [31]; Dziugaite et al. [15,16]; and Pérez-Ortiz et al. [43] who have all computed similar (or identical) bounds on stochastic neural networks before us. In Sect. 3, we compute this bound for classification and segmentation tasks. For classification, we take ℓ to be the 01-loss defined ℓ₀₁(h, (x, y)) = 1[h(x) ≠ y] where 1 is the indicator function. Precisely, $R_{ℓ_{01}} (h)$ is equal to 1 minus the accuracy. For segmentation, we pick ℓ to be ℓ_DSC(h, (x, y)) = 1 - DSC(h, (x, y)) where DSC is the [0, 1]-valued Dice similarity coefficient. These upperbounds trivially yield corresponding lowerbounds for both the accuracy and the Dice similarity coefficient, respectively.
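One generic way to handle the kl term directly is to invert it numerically: given an empirical risk q and a budget c for the right-hand side of Eq. (1), find the largest p with kl(q||p) ≤ c. The sketch below does this by bisection. It illustrates the general idea only and is not the exact bound derived in the Appendix; the function names and example values are hypothetical.

```python
import math


def binary_kl(q, p):
    """kl(q || p): KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)  # keep the logarithms finite
    total = 0.0
    if q > 0.0:
        total += q * math.log(q / p)
    if q < 1.0:
        total += (1.0 - q) * math.log((1.0 - q) / (1.0 - p))
    return total


def kl_inverse_upper(q, budget, tol=1e-9):
    """Approximate the largest p in [q, 1] with kl(q || p) <= budget.

    kl(q || p) is increasing in p on [q, 1], so bisection applies. The upper
    end of the final bracket is returned so the result errs on the safe side.
    """
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(q, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical values: empirical risk 0.10 and a budget of about 0.057.
q, budget = 0.10, 0.0571
tight = kl_inverse_upper(q, budget)
loose = q + math.sqrt(budget / 2.0)  # Pinsker relaxation of the same constraint
print(f"kl-inverse bound: {tight:.3f}, Pinsker bound: {loose:.3f}")
```

Pinsker's inequality gives kl(q||p) ≥ 2(p - q)², so the inverted bound is never looser than the relaxed one, and the gap is largest when the empirical risk is close to 0.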

Selecting the Prior.

Often P is referred to as the prior and Q as the posterior. Still, it is important to note that no restriction on the form of P and Q is required (e.g., as it is in Bayesian parameter estimation). What is required is that P be fixed before observing the sample S that is used to compute the bound. Albeit, P can depend on D and samples independent of S. In fact, it is not uncommon for the prior P to be data-dependent.⁶ That is, P may be trained on a sample which is disjoint from that which is used to compute the bound; i.e., disjoint from S. On the other hand, the bound holds for all posteriors Q regardless of how Q is selected. So, the datasets used to train Q and P may actually intersect. All in all, we can train Q with all available data without violating assumptions. We must only ensure the datasets used to train P and compute the bound do not intersect. In effect, we avoid the first caveat of Hoeffding’s Inequality.

Interpretation.

The bound also offers insight into why the model generalizes. Intuitively, we quantify the complexity of the stochastic predictor Q in so much as it deviates from some prior knowledge we have on the solution space (e.g., from the data-dependent prior). This is captured in the term KL(Q||P). Dziugaite & Roy [16] also relate PAC-Bayesian bounds to the flat-minima hypothesis. To understand their observation, we consider the case where Q is a normal distribution $N (μ, σ^{2} I)$ with σ a constant and I the identity matrix. The model Q on which we bound the error is stochastic: each time we do inference, we sample from $N (μ, σ^{2} I)$ . Because the distribution has some variance (dictated by σ), we sample network weights in a region around the mean. Thus, when performance of the stochastic model Q is good, there must be a non-negligible area around the mean where most networks perform well, i.e., a flat minimum around the mean. We know the variance is non-negligible in the posterior network Q because a small upperbound implies small KL divergence with the prior P which itself has non-negligible variance (we pick this value). So, to reiterate, small KL divergence and small empirical risk imply a flat-minimum of appropriate size around the mean of Q. In this sense the bound is explainable: Q generalizes well because it does not deviate much from prior knowledge and it lies in a flat minimum.
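For reference, the KL term in this isotropic case has a standard closed form. Writing the prior as $P = N (μ_{p}, σ_{p}^{2} I)$ and the posterior as $Q = N (μ_{q}, σ_{q}^{2} I)$ over $ℝ^{d}$ (the subscripted posterior symbols are notation introduced here for illustration only),

$$KL (Q ‖ P) = \frac{1}{2} \left[ \frac{d \, σ_{q}^{2}}{σ_{p}^{2}} + \frac{‖ μ_{q} - μ_{p} ‖^{2}}{σ_{p}^{2}} - d + d \ln \frac{σ_{p}^{2}}{σ_{q}^{2}} \right] .$$

The expression vanishes only when the means and variances coincide. It grows when the posterior variance shrinks far below the prior variance (through the logarithmic term) and when the means move apart relative to σ_p (through the squared distance). This is the sense in which a small KL term forces the posterior to retain non-negligible variance and to stay near the prior mean.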

Additional Context.

Dziugaite & Roy [16] provide a nice synopsis of the history behind the flat-minima hypothesis including the work of Hochreiter & Schmidhuber [25], Hinton & Van Camp [24], Baldassi et al. [3,4], Chaudhari et al. [12], and Keskar et al. [27]. Since then, large scale empirical studies – e.g., Jiang et al. [26]; Dziugaite, Drouin, et al. [14] – have continued to indicate that measures of sharpness of the minimum may be good indicators of neural network generalization in practice. For completeness, we also point out some other theoretically plausible indicators of deep network generalization. These include small weight norms – e.g., Bartlett [5,6]; Neyshabur et al. [40,41] – and the notion of algorithmic stability proposed by Bousquet & Elisseeff [10] which focuses instead on the SGD algorithm – e.g., Hardt et al. [22]; Mou et al. [39]; Kuzborskij & Lampert [29].

3. Experiments

In this section, we first evaluate PAC-Bayesian bounds within a self-bounded learning setting. Specifically, a self-bounded learner must both learn and provide a guarantee using the same dataset.⁷ As noted, providing guarantees with our trained networks can bolster confidence in small data regimes. We compare the PAC-Bayesian bounds discussed in Sect. 2 to a simple baseline for producing performance guarantees: application of Hoeffding’s Inequality to a holdout set.⁸ We show PAC-Bayesian bounds are competitive with Hoeffding’s Inequality, while also alleviating some caveats discussed in the previous sections. This result (in medical imaging) complements those previously shown on natural image datasets by Pérez-Ortiz et al. [43]. Specifically, we compute bounds on the Lesion Segmentation and the Lesion Classification Tasks in the ISIC 2018 Skin Lesion Challenge Dataset [13]. The data in these sets used for training (and bound computation) totals 2.3K and 9K labeled examples, respectively, which is much smaller than previous computation [15,16,43] using MNIST [34] or CIFAR-10 [28].⁹ Our second contribution in this section comes from tricks and tools which we hope prove useful for the medical imaging practitioner. We demonstrate an experiment to probe the loss landscape using PAC-Bayesian bounds and also devise a strategy to handle batch normalization layers when computing PAC-Bayesian bounds. Our code is available at: https://github.com/anthonysicilia/PAC-Bayes-In-Medical-Imaging.

3.1. Setup

Models.

For segmentation, we use U-Net (UN) [45] and a light-weight version of U-Net (LW) with 3% of the parameters and no skip-connections. For classification, we use ResNet-18 (RN) [23]. Probabilistic models use the same architecture but define a multivariate normal distribution over network weights $N (μ, Σ)$ with a diagonal covariance matrix Σ. The distribution is sampled to do inference.

Losses.

For segmentation, we train using the common Dice Loss [38] which is a smooth surrogate for 1 - DSC. For classification, we use the negative log-likelihood. Probabilistic models used modified losses we describe next.

Training Probabilistic Models.

Recall, in the PAC-Bayes setting we define both the prior P and the posterior Q as both are needed to compute bounds and Q is needed for inference. The prior P is a probabilistic network defined by $N (μ_{p}, σ_{p}^{2} I)$ where I is the identity matrix and σ_p is a constant. In this text, we use a data-dependent prior unless otherwise noted (see Sect. 2). To pick the prior, μ_p is learned by performing traditional optimization on a dataset disjoint from that which is used to compute the bound (see Data Splits). The parameter σ_p = 0.01 unless otherwise noted. The posterior Q is initialized identically to P before it is trained using PAC-Bayes with Backprop (PBB) as proposed by Rivasplata et al. [43,44]. This training technique may be viewed as (mechanically) similar to Bayes-by-Backprop (BBB) [9]. In particular, it uses a re-parameterization trick to optimize a probabilistic network through SGD. Where PBB and BBB differ is the motivation, and subsequently, the use of PAC-Bayes upperbounds as the objective to optimize.¹⁰ Note, PAC-Bayes bounds are valid for all [0, 1]-bounded losses, and thus, are valid for the Dice Loss or normalized negative log-likelihood. The upperbound used for our PBB objective is the Variational bound of Dziugaite et al. [15].
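The re-parameterization trick mentioned above can be sketched in a few lines. Each weight is written as a deterministic function of its mean, its scale, and independent standard normal noise, so that gradients can flow to the mean and scale. The softplus parameterization of the scale and all names below are assumptions made for illustration; the text does not specify them, and a real implementation would use an automatic differentiation library rather than plain Python lists.

```python
import math
import random


def softplus(rho):
    """Map an unconstrained parameter rho to a positive standard deviation."""
    return math.log1p(math.exp(rho))


def init_posterior_from_prior(mu_p, sigma_p):
    """Initialize the posterior parameters so that Q equals the prior P."""
    rho = math.log(math.expm1(sigma_p))  # inverse of softplus
    return list(mu_p), [rho] * len(mu_p)


def sample_weights(mu, rho, rng):
    """Re-parameterized draw: w = mu + softplus(rho) * eps with eps ~ N(0, 1)."""
    return [m + softplus(r) * rng.gauss(0.0, 1.0) for m, r in zip(mu, rho)]


# Hypothetical two-weight "network" with prior scale sigma_p = 0.01.
mu, rho = init_posterior_from_prior([0.5, -0.2], 0.01)
weights = sample_weights(mu, rho, random.Random(0))
print(weights)
```

Since the noise enters only through `eps`, the sampled weights are differentiable in `mu` and `rho`, which is what allows an upper bound to serve as an SGD objective.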

Probabilistic Batch Norm Layers.

While generally each weight in the probabilistic networks we consider is sampled independently according to a normal distribution, batch norm layers¹¹ must be handled with special care. We treat the parameters of batch norm layers as point mass distributions. Specifically, the parameter value has probability 1 and 0 is assigned to all other values. The posterior distribution for these parameters is made identical to the prior by “freezing” the posterior batch norm layers during training and inference. In effect, we avoid sampling the means and variances in our batch norm layers, and importantly, the batch norm layers do not contribute to the KL-divergence computation. In the Appendix, we provide a derivation to show this strategy is (mathematically) correct; it relies primarily on the independence of the weight distributions.
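The effect on the KL computation can be sketched as follows: Gaussian layers contribute the usual per-weight KL, while frozen batch norm layers, being identical point masses in prior and posterior, contribute exactly zero and are skipped. The layer representation (a list of dictionaries with a `kind` field) is hypothetical and chosen only to keep the example self-contained.

```python
import math


def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """KL(Q || P) for independent (diagonal) Gaussians, summed over weights."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


def network_kl(posterior_layers, prior_layers):
    """Sum the KL over layers, skipping frozen batch norm layers."""
    total = 0.0
    for q, p in zip(posterior_layers, prior_layers):
        if q["kind"] == "batch_norm":
            # Identical point masses in prior and posterior: KL is exactly 0.
            if q["values"] != p["values"]:
                raise ValueError("frozen batch norm layers must match the prior")
            continue
        total += gaussian_kl(q["mu"], q["sigma"], p["mu"], p["sigma"])
    return total


# Hypothetical two-layer example: one Gaussian layer and one batch norm layer.
prior = [
    {"kind": "gaussian", "mu": [0.0, 0.0], "sigma": [1.0, 1.0]},
    {"kind": "batch_norm", "values": [1.0, 0.0]},
]
posterior = [
    {"kind": "gaussian", "mu": [1.0, 0.0], "sigma": [1.0, 1.0]},
    {"kind": "batch_norm", "values": [1.0, 0.0]},
]
print(network_kl(posterior, prior))
```

The decomposition into a per-layer sum relies on the weight distributions being independent, which is the same property the derivation in the Appendix depends on.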

Optimization Parameters.

Optimization is done using SGD with momentum set to 0.95. For classification, the batch size is 64, and the initial learning rate is 0.5. For segmentation, the batch size is 8, and the initial learning rate is 0.1 for LW and 0.01 for U-Net. All models are initialized at the same random location and are trained for 120 epochs with the learning rate decayed by a factor of 10 every 30 epochs. In the PAC-Bayesian setting, the data-dependent prior mean μ_p is randomly initialized (as other models) and trained for 30 epochs. The posterior Q is initialized at the prior P and trained for the remaining 90 epochs. The learning rate decay schedule is not reset for posterior training.
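The schedule described above can be written out explicitly. The sketch assumes 0-indexed epochs and decay applied at the start of epochs 30, 60, and 90; both are assumptions about details the text leaves open, and the names are hypothetical.

```python
PRIOR_EPOCHS = 30   # epochs spent training the prior mean
TOTAL_EPOCHS = 120  # prior epochs plus the 90 posterior epochs


def learning_rate(epoch, initial_lr, decay_every=30, factor=10.0):
    """Step decay: divide the learning rate by `factor` every `decay_every` epochs."""
    return initial_lr / (factor ** (epoch // decay_every))


def phase(epoch):
    """Which distribution is being trained at a given (0-indexed) epoch."""
    return "prior" if epoch < PRIOR_EPOCHS else "posterior"


# Classification setting from the text: initial learning rate 0.5.
for epoch in (0, 29, 30, 60, 90, TOTAL_EPOCHS - 1):
    print(epoch, phase(epoch), learning_rate(epoch, 0.5))
```

Because the schedule is not reset, posterior training begins at a learning rate already decayed once relative to the initial value.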

Bound Details.

Note, in all cases, PAC-Bayes bounds are computed using a data-dependent prior. Bounds are computed with 95% confidence (δ = 0.05) with data sample size given in the next section. For PAC-Bayes bounds, the number of models sampled is either 1000 (in Fig. 1a) or 100 (in Fig. 1b, c, d).

Fig. 1. — **(a)** DSC/ACC (red) and lowerbounds (blue). **(b, c, d)** Modulation of prior variance for U-Net **(b)** and LW **(c)** and ResNet-18 **(d)**.

Data Splits for Self-bounded Learning.

Each method is given access to a base training set (90% of the data) and is expected to both learn a model and provide a performance guarantee for this model when applied to unseen data. To evaluate both the model and performance guarantee in a more traditional fashion, each method is also tested on a final holdout set (10% of the data) which no model sees. Splits are random but identically fixed for all models. For probabilistic networks trained using PBB, we split the base training data into a 50%-prefix set¹² used to train the prior and a disjoint 50%-bound set. Both the prefix-set and bound-set are used to train the posterior, but recall, the PAC-Bayes bound can only be computed on the bound-set (Sect. 2). For the baseline non-probabilistic networks, we instead train the model using a traditional training set (composed of 90% of the base training set) and then compute a guarantee on the model performance (i.e., a Hoeffding bound) using an independent holdout set (the remaining 10% of the base training set). In this sense, all models are on an equal footing with respect to the task of a self-bounded learner. All models have access only to the base training set for both training and computation of valid performance guarantees. In relation to Fig. 1, performance metrics such as DSC are computed using the final holdout set. Lowerbound computation and training is done using the base training set.
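The splitting scheme can be summarized in code. The sketch below produces index sets for the final holdout, the prefix and bound sets used by the PBB models, and the training and holdout sets used by the Hoeffding baseline. The function name, the seed, and the rounding of split sizes are assumptions made for illustration.

```python
import random


def make_splits(n, seed=0):
    """Return index lists for every split used in the self-bounded setting."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)  # same seed gives the same splits for all models

    n_final = n // 10
    final_holdout, base = idx[:n_final], idx[n_final:]  # 10% / 90%

    # PAC-Bayes: the prior sees only the prefix; the bound uses only the bound set.
    half = len(base) // 2
    prefix, bound = base[:half], base[half:]

    # Baseline: 10% of the base set is held out for the Hoeffding bound.
    n_hoeffding = len(base) // 10
    hoeffding_holdout, baseline_train = base[:n_hoeffding], base[n_hoeffding:]

    return {
        "final_holdout": final_holdout,
        "prefix": prefix,
        "bound": bound,
        "hoeffding_holdout": hoeffding_holdout,
        "baseline_train": baseline_train,
    }


splits = make_splits(1000)
print({name: len(indices) for name, indices in splits.items()})
```

Under this scheme the posterior is trained on `prefix + bound`, which is the whole base training set, whereas the baseline model never sees its `hoeffding_holdout` examples.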

3.2. Results

Comparison to Hoeffding's Inequality.

As promised, we observe in Fig. 1a that the performance guarantees given by both bounds are comparable and performance on the final holdout set is also comparable. Hoeffding’s Inequality does have a slight advantage with regard to these metrics, but as mentioned, PAC-Bayes bounds possess some desirable qualitative properties which make them an appealing alternative. For PAC-Bayes bounds the posterior Q sees all training data, while for the Hoeffding Bound, one must maintain an unseen holdout set to compute a guarantee. Further, we may explain the generalization of the PBB trained model through our interpretation of the PAC-Bayes bound (see Sect. 2). These qualities make PAC-Bayes appealing in medical imaging contexts where explainability is a priority and we often need to maximize the utility of the training data.

Flat Minima and Their Size.

As discussed, the application of PAC-Bayesian bounds may be motivated by the flat minima hypothesis (see Sect. 2). We explore this idea in Fig. 1b, c, d by modulating the prior variance σ_p across runs. Informally, these plots can give us insight into our loss landscape. The reasonably tight lowerbounds – which are slightly looser than in Fig. 1a only due to fewer model samples – imply small KL-Divergence and indicate the prior and posterior variances are of a similar magnitude. Likewise, the difference between the prior and posterior means should not be too large, relative to the variance. A fixed random-seed ensures priors are identical, so each data-point within a plot should correspond to roughly the same location in parameter space; i.e., we will assume we are analyzing the location of a single minimum.¹³ For U-Net, we see stable performance and a sudden drop as the prior variance grows. Before the drop at σ_p = 0.04, consistently high DSC and a high lowerbound indicate the network solution lies in a flat minimum (as discussed in Sect. 2). So, we may conclude a flat minimum proportional in size to σ_p = 0.03. For LW and ResNet-18, we instead see consistent performance degradation as the prior variance grows. For these networks, the minima may not be as flat. Informally, such sensitivity analysis can tell us “how flat” the minima are for a particular network and dataset as well as “how large”. Practically, information like this can be useful to practitioners interested in understanding the generalization ability of their models. Namely, it is hypothesized “larger” flat minima lead to better generalization because less precision (fewer bits) is required to specify the weights [25].

4. Conclusion

As a whole, our results show how PAC-Bayes bounds can be practically applied in medical imaging contexts – where theoretical guarantees (for deep networks) would appear useful but are not commonly discussed. With this said, we hope for this paper to act primarily as a conversation starter. At 2.3K examples, the segmentation dataset we consider is still larger than what is commonly available in some application domains (e.g., neuroimaging applications). It remains to be considered how effective these bounds can be in ultra-low resource settings.

Supplementary Material

Appendix

NIHMS1760019-supplement-Appendix.pdf (212.4KB, pdf)

Acknowledgment.

This work is supported by the University of Pittsburgh Alzheimer Disease Research Center Grant (P30 AG066468).

Footnotes

¹ Guarantees in this paper are probabilistic. Similar to confidence intervals, one should interpret with care: the guarantees hold with high probability prior to observing data.

Electronic supplementary material: The online version of this chapter (https://doi.org/10.1007/978-3-030-87199-4_53) contains supplementary material, which is available to authorized users.

² We (very roughly) estimate this by Thm. 6.11 of Shalev-Shwartz & Ben-David [46]. Bartlett et al. [7] provide tight bounds on VC dimension of ReLU networks. Based on these, the sample size must be magnitudes larger than the parameter count for a small generalization gap. See Appendix for additional details and a plot.

³ PAC-Bayes is attributed to McAllester [37]; also, Shawe-Taylor & Williamson [47].

⁴ Early formulations of this hypothesis are due to Hochreiter & Schmidhuber [25].

⁵ Sometimes, in classification, this may be called the Gibbs classifier. Not to be confused with the “deterministic”, majority vote classifier. An insightful discussion on the relationship between risk in these distinct cases is provided by Germain et al. [19].

⁶ For example, see Ambroladze et al. [1], Parrado-Hernández et al. [42], Pérez-Ortiz et al. [43], and Dziugaite et al. [15,17].

⁷ See Freund [18] or Langford & Blum [30].

⁸ We provide additional details on this procedure in the Appendix.

⁹ These datasets have 60K and 50K labeled examples, respectively.

¹⁰ See Pérez-Ortiz et al. [43] for more detailed discussion.

¹¹ We refer here to both the running statistics and any learned weights.

¹² See Dziugaite et al. [15] who coin the term “prefix”.

¹³ Notice, another approach might be to fix the posterior mean at the result of, say, the run with σ_p = 0.01 and then modulate the variance from this fixed location. We are not guaranteed this run will be near the center of a minimum, and so, may underestimate the minimum’s size by this procedure. Our approach, instead, allows the center of the posterior to change (slightly) when the variance grows.

References

1. Ambroladze A, Parrado-Hernández E, Shawe-Taylor J: Tighter PAC-Bayes Bounds (2007)
2. Arora S: Generalization Theory and Deep Nets, An introduction (2017). https://www.offconvex.org/2017/12/08/generalization1/
3. Baldassi C, et al.: Unreasonable effectiveness of learning neural networks: from accessible states and robust ensembles to basic algorithmic schemes. PNAS 113, E7655–E7662 (2016)
4. Baldassi C, Ingrosso A, Lucibello C, Saglietti L, Zecchina R: Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Phys. Rev. Lett. 115, 128101 (2015)
5. Bartlett PL: For valid generalization, the size of the weights is more important than the size of the network (1997)
6. Bartlett PL: The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Trans. Inform. Theory 44, 525–536 (1998)
7. Bartlett PL, Harvey N, Liaw C, Mehrabian A: Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. JMLR 20, 2285–2301 (2019)
8. Blumer A, Ehrenfeucht A, Haussler D, Warmuth MK: Learnability and the Vapnik-Chervonenkis dimension. J. ACM 36, 929–965 (1989)
9. Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D: Weight uncertainty in neural network. In: ICML (2015)
10. Bousquet O, Elisseeff A: Stability and generalization. JMLR 2, 499–526 (2002)
11. Catoni O: PAC-Bayesian supervised classification: the thermodynamics of statistical learning. arXiv:0712.0248v1 (2007)
12. Chaudhari P, et al.: Entropy-SGD: biasing gradient descent into wide valleys. arXiv:1611.01838v5 (2016)
13. Codella N, et al.: Skin lesion analysis toward melanoma detection 2018: a challenge hosted by the international skin imaging collaboration (ISIC). arXiv:1902.03368v2 (2019)
14. Dziugaite GK, et al.: In search of robust measures of generalization. arXiv:2010.11924v2 (2020)
15. Dziugaite GK, Hsu K, Gharbieh W, Roy DM: On the role of data in PAC-Bayes bounds. arXiv:2006.10929v2 (2020)
16. Dziugaite GK, Roy DM: Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv:1703.11008v2 (2017)
17. Dziugaite GK, Roy DM: Data-dependent PAC-Bayes priors via differential privacy. In: NeurIPS (2018)
18. Freund Y: Self bounding learning algorithms. In: COLT (1998)
19. Germain P, Lacasse A, Laviolette F, Marchand M, Roy JF: Risk bounds for the majority vote: from a PAC-Bayesian analysis to a learning algorithm. JMLR 16, 787–860 (2015)
20. Germain P, Lacasse A, Laviolette F, Marchand M: PAC-Bayesian learning of linear classifiers. In: ICML (2009)
21. Guedj B: A primer on PAC-Bayesian learning. arXiv:1901.05353v3 (2019)
22. Hardt M, Recht B, Singer Y: Train faster, generalize better: stability of stochastic gradient descent. In: ICML (2016)
23. He K, Zhang X, Ren S, Sun J: Deep residual learning for image recognition. In: CVPR (2016)
24. Hinton GE, Van Camp D: Keeping the neural networks simple by minimizing the description length of the weights. In: COLT (1993)
25. Hochreiter S, Schmidhuber J: Flat minima. Neural Comput. 9, 1–42 (1997)
26. Jiang Y, Neyshabur B, Mobahi H, Krishnan D, Bengio S: Fantastic generalization measures and where to find them. arXiv:1912.02178v1 (2019)
27. Keskar NS, Mudigere D, Nocedal J, Smelyanskiy M, Tang PTP: On large-batch training for deep learning: generalization gap and sharp minima. arXiv:1609.04836v2 (2016)
28. Krizhevsky A, Hinton G, et al.: Learning multiple layers of features from tiny images (2009)
29. Kuzborskij I, Lampert C: Data-dependent stability of stochastic gradient descent. In: ICML (2018)
30. Langford J, Blum A: Microchoice bounds and self bounding learning algorithms. Mach. Learn. 51, 165–179 (2003)
31. Langford J, Caruana R: (Not) bounding the true error. In: NeurIPS (2002)
32. Langford J, Schapire R: Tutorial on practical prediction theory for classification. JMLR 6, 273–306 (2005)
33. Langford J, Seeger M: Bounds for averaging classifiers (2001)
34. LeCun Y, Cortes C: MNIST handwritten digit database (2010). http://yann.lecun.com/exdb/mnist/
35. Maurer A: A note on the PAC Bayesian theorem. arXiv:cs/0411099v1 (2004)
36. McAllester D: A PAC-Bayesian tutorial with a dropout bound. arXiv:1307.2118v1 (2013)
37. McAllester DA: Some PAC-Bayesian theorems. Mach. Learn. 37, 355–363 (1999)
38. Milletari F, Navab N, Ahmadi SA: V-net: fully convolutional neural networks for volumetric medical image segmentation. In: 3DV (2016)
39. Mou W, Wang L, Zhai X, Zheng K: Generalization bounds of SGLD for non-convex learning: two theoretical viewpoints. In: COLT (2018)
40. Neyshabur B, Bhojanapalli S, McAllester D, Srebro N: Exploring generalization in deep learning. arXiv:1706.08947v2 (2017)
41. Neyshabur B, Tomioka R, Srebro N: In search of the real inductive bias: on the role of implicit regularization in deep learning. arXiv:1412.6614v4 (2014)
42. Parrado-Hernández E, Ambroladze A, Shawe-Taylor J, Sun S: PAC-Bayes bounds with data dependent priors. JMLR 13, 3507–3531 (2012)
43. Pérez-Ortiz M, Rivasplata O, Shawe-Taylor J, Szepesvári C: Tighter risk certificates for neural networks. arXiv:2007.12911v2 (2020)
44. Rivasplata O, Tankasali VM, Szepesvari C: PAC-Bayes with backprop. arXiv:1908.07380v5 (2019)
45. Ronneberger O, Fischer P, Brox T: U-net: convolutional networks for biomedical image segmentation. In: Navab N, Hornegger J, Wells WM, Frangi AF (eds.) MICCAI 2015. LNCS, vol. 9351, pp. 234–241. Springer, Cham (2015). doi:10.1007/978-3-319-24574-4_28
46. Shalev-Shwartz S, Ben-David S: Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, Cambridge (2014)
47. Shawe-Taylor J, Williamson RC: A PAC analysis of a Bayesian estimator. In: COLT (1997)
48. Valiant LG: A theory of the learnable. Commun. ACM 27, 1134–1142 (1984)
49. Vapnik VN, Chervonenkis AY: On uniform convergence of the frequencies of events to their probabilities. Teoriya Veroyatnostei Primeneniya 16 (1971)

Associated Data

Supplementary Materials

Appendix

NIHMS1760019-supplement-Appendix.pdf (212.4 KB, pdf)

