PAC-Bayes Unleashed: Generalisation Bounds with Unbounded Losses

Maxime Haddouche; Benjamin Guedj; Omar Rivasplata; John Shawe-Taylor

2021 Oct 12;23(10):1330. doi: 10.3390/e23101330

Editor: Boris Ryabko

PMCID: PMC8534909 PMID: 34682054

Abstract

We present new PAC-Bayesian generalisation bounds for learning problems with unbounded loss functions. This extends the relevance and applicability of the PAC-Bayes learning framework, where most of the existing literature focuses on supervised learning problems with a bounded loss function (typically assumed to take values in the interval [0;1]). In order to relax this classical assumption, we propose to allow the range of the loss to depend on each predictor. This relaxation is captured by our new notion of HYPothesis-dependent rangE (HYPE). Based on this, we derive a novel PAC-Bayesian generalisation bound for unbounded loss functions, and we instantiate it on a linear regression problem. To make our theory usable by the largest audience possible, we include discussions on actual computation, practicality and limitations of our assumptions.

Keywords: statistical learning theory, PAC-Bayes, generalisation bounds

1. Introduction

Since its emergence in the late 1990s, the PAC-Bayes theory (see the seminal works of [1,2,3], the recent survey by [4] and work by [5]) has been a powerful tool to obtain generalisation bounds and to derive efficient learning algorithms. Generalisation bounds are helpful for understanding how a learning algorithm may perform on future similar batches of data. While the classical generalization bounds typically address the performance of individual predictors from a given hypothesis class, PAC-Bayes bounds typically address a randomized predictor defined by a distribution over the hypothesis class.

PAC-Bayes bounds were originally meant for binary classification problems [6,7,8], but the literature now includes many contributions involving any bounded loss function (without loss of generality, with values in $[0; 1]$ ), not just the binary loss. Our goal is to provide new PAC-Bayes bounds that are valid for unbounded loss functions, and thus extend the usability of PAC-Bayes to a much larger class of learning problems. To do so, we reformulate the general PAC-Bayes theorem of [9] and use it as basic building block to derive our new PAC-Bayes bound.

Some ways to circumvent the bounded range assumption on the losses have been explored in the recent literature. For instance, one approach consists of assuming a tail decay rate on the loss, such as sub-gaussian or sub-exponential tails [10,11]; however, this approach requires the knowledge of additional parameters. Some other works have also looked into the analysis for heavy-tailed losses, e.g., ref. [12] proposed a polynomial moment-dependent bound with f-divergences, while [13] devised an exponential bound that assumes the second (uncentered) moment of the loss is bounded by a constant (with a truncated risk estimator, as recalled in Section 4 below). A somewhat related approach was explored by [14], who do not assume boundedness of the loss, but instead control higher-order moments of the generalization gap through the Efron-Stein variance proxy. See also [5].

We investigate a different route here. We introduce the HYPothesis-dependent rangE (HYPE) condition, which means that the loss is upper-bounded by a term that depends on the chosen predictor (but does not depend on the data). Thus, effectively, the loss may have an arbitrarily large range. The HYPE condition allows us to derive an upper bound on the exponential moment of a suitably chosen functional, which, combined with the general PAC-Bayes theorem, leads to our new PAC-Bayes bound. To illustrate it, we instantiate the new bound on a linear regression problem, which additionally serves the purpose of illustrating that our HYPE condition is easy to verify in practice, given an explicit formulation of the loss function. In particular, we shall see in the linear regression setting that a mere use of the triangle inequality is enough to check the HYPE condition. The technical assumptions on which our results are based are comparable to those of the classical PAC-Bayes bounds; we state them in full detail, with discussions, for the sake of clarity and to make our work accessible.

Our contributions are twofold. (i) We propose PAC-Bayesian bounds holding with unbounded loss functions, therefore overcoming a limitation of the mainstream PAC-Bayesian literature for which a bounded loss is usually assumed. (ii) We analyse the bound, its implications, limitations of our assumptions, and their usability by practitioners. We hope this will extend the PAC-Bayes framework into a widely usable tool for a significantly wider range of problems, such as unbounded regression or reinforcement learning problems with unbounded rewards.

Outline. Section 2 introduces our notation and definition of the HYPE condition and provides a general PAC-Bayesian bound, which is valid for any learning problem complying with a mild assumption. For the sake of completeness, we present how our approach (designed for the unbounded case) behaves in the bounded case (Section 3). This section is not the core of our work, but rather serves as a safety check and particularises our bound to more classical PAC-Bayesian assumptions. We also provide numerical experiments. Section 4 introduces the notion of softening functions and particularises Section 2’s PAC-Bayesian bound. In particular, we make explicit all terms in the right-hand side. Section 5.1 extends our results to linear regression (which has been studied from the perspective of PAC-Bayes in the literature, most recently by [15]). We also experimentally illustrate the behaviour of our bound. Finally, Section 6 presents related works in detail, and Section 7 contains all proofs of the original claims we make in the paper.

2. Framework and Preliminary Results

The learning problem is specified by three variables $(H, Z, ℓ)$ consisting of a set $H$ of predictors, the data space $Z$ , and a loss function $ℓ : H \times Z \to R^{+}$ .

For a given positive integer m, we consider size-m datasets. The space of all possible datasets of this fixed size is $S = Z^{m}$ ; an arbitrary element of this space is $s = (z_{1}, \dots, z_{m})$ . We denote by S a random dataset: $S = (Z_{1}, \dots, Z_{m})$ where the random data points $Z_{i}$ are independent and sampled from the same distribution $μ$ over $Z$ . We call $μ$ the data-generating distribution. The assumption that the $Z_{i}$ ’s are independent and identically distributed is typically called the i.i.d. data assumption. It means that the random sample S (of size m) has distribution $μ^{\otimes m}$ , which is the product of m copies of $μ$ .

For any predictor $h \in H$ , we define the empirical risk of h over a sample s, denoted $R_{s} (h)$ , and the theoretical risk of h, denoted $R (h)$ , as:

$R_{s} (h) = \frac{1}{m} \sum_{i = 1}^{m} ℓ (h, z_{i}) \quad \text{and} \quad R (h) = E_{μ} [ℓ (h, Z)]$

respectively, where $E_{μ} [ℓ (h, Z)]$ denotes the expectation with respect to $Z \sim μ$ . Finally, we define the risk gap $Δ_{s} (h) = R (h) - R_{s} (h)$ for any $h \in H$ and $s \in S$ . Often, $Δ_{s} (h)$ is referred to as the generalisation gap.
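The definitions above translate directly into code. The following is a minimal sketch, assuming a loss given as a Python callable; the names `empirical_risk`, `risk_gap` and the toy scalar predictor are hypothetical and only serve as an illustration.

```python
def empirical_risk(loss, h, sample):
    """Average loss of predictor h over the finite sample s = (z_1, ..., z_m)."""
    return sum(loss(h, z) for z in sample) / len(sample)


def risk_gap(true_risk, loss, h, sample):
    """Risk gap Delta_s(h) = R(h) - R_s(h); R(h) must be supplied by the caller."""
    return true_risk - empirical_risk(loss, h, sample)


def abs_loss(h, z):
    """Toy loss (illustration only): h is a scalar slope and z = (x, y)."""
    x, y = z
    return abs(h * x - y)


sample = [(1.0, 2.0), (2.0, 3.0), (3.0, 7.0)]
print(empirical_risk(abs_loss, 2.0, sample))
```

In practice $R (h)$ is unknown, which is precisely why bounds on the risk gap are needed; the `true_risk` argument is only available in synthetic settings.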

Notice that for a random dataset S, the empirical risk $R_{S} (h)$ is random, with expected value $E_{μ^{\otimes m}} [R_{S} (h)] = R (h)$ , where $E_{μ^{\otimes m}}$ denotes the expectation under the distribution of the random sample S.

In general, $E_{μ} [\cdot]$ denotes an expectation under the distribution $μ$ . When we want to emphasize the role of the random variable $Z \sim μ$ we write $E_{Z} [\cdot]$ or $E_{Z \sim μ} [\cdot]$ instead of $E_{μ} [\cdot]$ . We use a similar convention for expectations related to any other distributions and random quantities. We now introduce the key concept to our analysis.

Definition 1.

(HYPE). A loss function $ℓ : H \times Z \to R^{+}$ is said to satisfy the hypothesis-dependent range (HYPE) condition if there exists a function $K : H \to R^{+} \setminus \{0\}$ such that $\sup_{z \in Z} ℓ (h, z) \leq K (h)$ for every predictor h. We then say that ℓ is HYPE(K) compliant.

Let $M_{1}^{+} (H)$ be the set of probability distributions on $H$ . We assume that all considered probability measures on $H$ are defined on a fixed $σ$ -algebra over $H$ , while the notation $M_{1}^{+} (H)$ hides the $σ$ -algebra, for simplicity. For $P, P^{'} \in M_{1}^{+} (H)$ , the notation $P^{'} ≪ P$ indicates that $P^{'}$ is absolutely continuous with respect to P (i.e., $P^{'} (A) = 0$ if $P (A) = 0$ for measurable $A \subset H$ ). We write $P^{'} \sim P$ to indicate that $P^{'} ≪ P$ and $P ≪ P^{'}$ , i.e., these two distributions are absolutely continuous with respect to each other.

We now recall a result from Germain et al. [9]. Note that while implicit in many PAC-Bayes works (including theirs), we make it explicit that both the prior P and the posterior Q must be absolutely continuous with respect to each other. We discuss this restriction below.

Theorem 1.

(Adapted from [9], Theorem 2.1.) For any $P \in M_{1}^{+} (H)$ with no dependency on data, for any function $F : R^{+} \times R^{+} \to R$ , define the exponential moment:

$χ : = E_{S} E_{h \sim P} [e^{F (R_{S} (h), R (h))}] .$

If F is convex, then for any $δ \in [0; 1]$ , with probability of at least $1 - δ$ over random samples S, simultaneously for all $Q \in M_{1}^{+} (H)$ such that $Q \sim P$ we have:

$F (E_{h \sim Q} [R_{S} (h)], E_{h \sim Q} [R (h)]) \leq \mathrm{KL} (Q \| P) + \log (\frac{χ}{δ}) .$

The proof is deferred to Section 7.1. Note that the proof in [9] requires that $P ≪ Q$ , although it is not explicitly stated; we highlight this in our own proof. While $Q ≪ P$ is classical and necessary for the $KL (Q | | P)$ to be meaningful, $P ≪ Q$ appears to be more restrictive. In particular, we have to choose Q such that it has the exact same support as P (e.g., choosing a Gaussian and a truncated Gaussian is not possible). However, we can still apply our theorem when P and Q belong to the same parametric family of distributions, e.g., both ‘full-support’ Gaussian or Laplace distributions, but these are just two examples and there are many others.

Note that Alquier et al. [10] (Theorem 4.1) adapted a result from Catoni [8], which only requires $Q ≪ P$ . This comes at the expense of what Alquier et al. [10] (Definition 2.3) called a Hoeffding’s assumption, which means that the exponential moment $χ$ is assumed to be bounded by a function depending only on the hyperparameters (such as the dataset size m or parameters given by Hoeffding’s assumption). Our analysis does not require this assumption, which might prove restrictive in practice.

Theorem 1 may be seen as a basis to recover many classical PAC-Bayesian bounds. For instance, $F (x, y) = 2 m {(x - y)}^{2}$ recovers McAllester’s bound as recalled in [4] (Theorem 1). To get a usable bound, the outstanding task is to bound the exponential moment $χ$ . Note that a previous attempt has been made in [11], as described in Section 6.1 below. Furthermore, under the assumption that the distribution P has no dependency on the data, we may swap the order of integration in the exponential moment thanks to the Fubini-Tonelli theorem and the positivity of the exponential:

$χ = E_{h \sim P} E_{S} [e^{F (R_{S} (h), R (h))}] .$

This is the starting point for the way that the exponential moment was handled in several works in the PAC-Bayes literature. Essentially, for a fixed h, one may upper-bound the innermost expectation (with respect to S) using standard exponential moment inequalities.

In this work, we will use Theorem 1 with $F (x, y) = m^{α} D (x, y)$ , where $α > 0$ , and $D : R^{+} \times R^{+} \to R$ is a convex function. In this case, the high-probability inequality of the theorem takes the form:

$D (E_{h \sim Q} [R_{S} (h)], E_{h \sim Q} [R (h)]) \leq \frac{1}{m^{α}} (\mathrm{KL} (Q \| P) + \log (\frac{1}{δ} E_{h \sim P} E_{S} e^{m^{α} D (R_{S} (h), R (h))})) .$

(1)

Our goal is to control $E_{S} e^{m^{α} D (R_{S} (h), R (h))}$ for a fixed h, when $D (x, y) = y - x$ . This will readily give us control on the exponential moment $χ$ . To do so, we propose the following theorem:

Theorem 2.

Let $h \in H$ be a fixed predictor and $α \in R$ . If the loss function ℓ is HYPE(K) compliant, then for $Δ_{S} (h) = R (h) - R_{S} (h)$ we have:

$E_{S} [e^{m^{α} Δ_{S} (h)}] \leq exp (\frac{K {(h)}^{2}}{2 m^{1 - 2 α}}) .$

Proof.

Let $h \in H$ . Then:

$\begin{aligned} E_{S} [e^{m^{α} Δ_{S} (h)}] & = E [\exp (m^{α - 1} \sum_{i = 1}^{m} (R (h) - ℓ (h, Z_{i})))] \\ & = E [\prod_{i = 1}^{m} \exp (m^{α - 1} (R (h) - ℓ (h, Z_{i})))] \\ & = \prod_{i = 1}^{m} E [\exp (m^{α - 1} (R (h) - ℓ (h, Z_{i})))] , \end{aligned}$

where the last equality uses the independence of the $Z_{i}$ .

We now apply Hoeffding’s lemma: for any $i \in \{1, \dots, m\}$ , the random (in $Z_{i}$ ) variable $R (h) - ℓ (h, Z_{i})$ is centered and takes values in $[- K (h); K (h)]$ , so that:

$E [\exp (m^{α - 1} (R (h) - ℓ (h, Z_{i})))] \leq \exp (m^{2 α - 2} \frac{4 K {(h)}^{2}}{8})$

and finally:

$E_{S} [e^{m^{α} Δ_{S} (h)}] \leq \prod_{i = 1}^{m} \exp (m^{2 α - 2} \frac{4 K {(h)}^{2}}{8}) = \exp (\frac{K {(h)}^{2}}{2 m^{1 - 2 α}}) .$

□
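Theorem 2 can be sanity-checked numerically. The sketch below is only an illustration under stated assumptions: it uses a hypothetical toy loss equal to $K$ with probability $p$ and 0 otherwise (so that $R (h) = K p$), and a plain Monte Carlo average with a fixed seed; the function names are not from the paper.

```python
import math
import random


def exp_moment_estimate(K, p, m, alpha, n_rep=2000, seed=0):
    """Monte Carlo estimate of E_S[exp(m^alpha * Delta_S(h))] for a toy loss
    equal to K with probability p and 0 otherwise, so that R(h) = K * p."""
    rng = random.Random(seed)
    risk = K * p
    total = 0.0
    for _ in range(n_rep):
        # Empirical risk R_S(h) on one simulated size-m sample.
        emp = sum(K for _ in range(m) if rng.random() < p) / m
        total += math.exp(m ** alpha * (risk - emp))
    return total / n_rep


def theorem2_bound(K, m, alpha):
    """Right-hand side of Theorem 2: exp(K^2 / (2 m^(1 - 2 alpha)))."""
    return math.exp(K ** 2 / (2.0 * m ** (1.0 - 2.0 * alpha)))


estimate = exp_moment_estimate(K=1.0, p=0.3, m=50, alpha=0.5)
print(estimate, "<=", theorem2_bound(K=1.0, m=50, alpha=0.5))
```

The estimate is itself random, so such a check can only illustrate the inequality, not prove it.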

The strength of this result lies in the fact that $\frac{K {(h)}^{2}}{m^{1 - 2 α}}$ decreases with m when $α < 1 / 2$ (and is constant in m when $α = 1 / 2$ ); more generally, the choice of the hyperparameter $α$ controls how fast the exponential moment may grow with m.

For convenient cross-referencing, we state the following rewriting of Theorem 1.

Theorem 3.

Let the loss ℓ be HYPE(K) compliant. For any $P \in M_{1}^{+} (H)$ with no data dependency, for any $α \in R$ and for any $δ \in [0; 1]$ , with probability of at least $1 - δ$ over size-m random samples S, simultaneously for all Q such that $Q \sim P$ we have:

$E_{h \sim Q} [R (h)] \leq E_{h \sim Q} [R_{S} (h)] + \frac{1}{m^{α}} (\mathrm{KL} (Q \| P) + \log \frac{E_{h \sim P} [\exp (\frac{K {(h)}^{2}}{2 m^{1 - 2 α}})]}{δ}) .$

Proof.

We first apply Theorem 1 with $F (x, y) = m^{α} (y - x)$ . More precisely, we use Equation (1) with $D (x, y) = y - x$ . We then conclude with Theorem 2. □

3. Safety Check: The Bounded Loss Case

3.1. Theoretical Results

At this stage, the reader might wonder whether this new approach allows for the recovery of known results in the bounded case: the answer is yes.

In this section, we study the case where ℓ is bounded by some constant $C \in R^{+} \setminus \{0\}$ . In other words, we consider the case where $\sup_{h} \sup_{z} ℓ (h, z) \leq C$ . We provide a bound, valid for any choice of “priors” P and “posteriors” Q such that $P \sim Q$ , which is an immediate corollary of Theorem 3.

Proposition 1.

Let ℓ be HYPE(K) compliant, with $K (h) = C$ constant, and let $α \in R$ . Let $P \in M_{1}^{+} (H)$ be a distribution with no data dependency. Then, for any $δ \in [0; 1]$ , with probability of at least $1 - δ$ over random m-samples S, simultaneously for all $Q \in M_{1}^{+} (H)$ such that $Q \sim P$ we have:

$E_{h \sim Q} [R (h)] \leq E_{h \sim Q} [R_{S} (h)] + \frac{\mathrm{KL} (Q \| P) + \log (1 / δ)}{m^{α}} + \frac{C^{2}}{2 m^{1 - α}} .$
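Once the empirical risk under Q and the KL divergence are available, the right-hand side of Proposition 1 is a closed-form expression. A minimal sketch, assuming these two quantities are computed elsewhere (the function and argument names are hypothetical):

```python
import math


def prop1_bound(emp_risk_q, kl, m, C, alpha=0.5, delta=0.05):
    """Right-hand side of Proposition 1 for a loss bounded by C.

    emp_risk_q: E_{h~Q}[R_S(h)], e.g. a Monte Carlo estimate
    kl:         KL(Q || P)
    """
    complexity = (kl + math.log(1.0 / delta)) / m ** alpha
    range_term = C ** 2 / (2.0 * m ** (1.0 - alpha))
    return emp_risk_q + complexity + range_term


print(prop1_bound(emp_risk_q=0.1, kl=2.0, m=1000, C=1.0))
```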

Remark 1.

We provide Proposition 1 to evaluate the robustness of our approach, for instance by comparing it with the PAC-Bayesian bound found in Germain et al. [11]. This discussion can be found in Section 6.1, where the bound from Germain et al. [11] is presented in detail.

Remark 2.

At first glance, if one wants all the terms of the bound in Proposition 1 to converge at the same rate (as is often the case in classical PAC-Bayesian bounds), the only case of interest is $α = \frac{1}{2}$ . However, the factor $C^{2}$ cannot be optimised, while the KL term can. Hence, if $C^{2}$ turns out to be too large in practice, one wants to attenuate its influence as much as possible, which may lead us to consider $α < 1 / 2$ . The following lemma makes this trade-off precise.

Lemma 1.

For any given $K_{1} > 0$ , the function $f_{K_{1}} (α) := \frac{K_{1}}{m^{α}} + \frac{C^{2}}{2 m^{1 - α}}$ reaches its minimum at

$α_{0} = \frac{1}{2} + \frac{1}{2 \log (m)} \log (\frac{2 K_{1}}{C^{2}}) .$

Proof.

Computing the derivative $f_{K_{1}}^{'}$ explicitly and solving $f_{K_{1}}^{'} (α) = 0$ gives the result. □
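The closed form of Lemma 1 can be compared with a brute-force search. The sketch below assumes the α-dependent part of Proposition 1, namely $\frac{K_{1}}{m^{α}} + \frac{C^{2}}{2 m^{1 - α}}$ , as the function to minimise; the numerical values of $K_{1}$ , $C$ and $m$ are arbitrary illustrations.

```python
import math


def f(alpha, K1, C, m):
    """Alpha-dependent part of Proposition 1: K1 / m^alpha + C^2 / (2 m^(1 - alpha))."""
    return K1 / m ** alpha + C ** 2 / (2.0 * m ** (1.0 - alpha))


def alpha_star(K1, C, m):
    """Closed-form minimiser stated in Lemma 1."""
    return 0.5 + math.log(2.0 * K1 / C ** 2) / (2.0 * math.log(m))


K1, C, m = 5.0, 1.0, 1000
a0 = alpha_star(K1, C, m)
grid = [i / 1000 for i in range(-500, 1501)]  # alpha in [-0.5, 1.5], step 0.001
best = min(grid, key=lambda a: f(a, K1, C, m))
print(a0, best, f(a0, K1, C, m), f(0.5, K1, C, m))
```

Here $K_{1}$ is larger than $C^{2} / 2$ , so the minimiser lies above $1 / 2$ ; when $C^{2}$ dominates, it moves below $1 / 2$ , as discussed in Remark 2.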

Remark 3.

Lemma 1 indicates that with a fixed “prior” P and “posterior” Q, taking $K_{1} = KL (Q | | P) + log (1 / δ)$ , gives the optimised value of the bound in Proposition 1. We numerically show in Section 3.2 (first experiment there) that optimising α leads to significantly better results.

Now the only remaining question is how to optimise the KL divergence. To do so, we may need to fix an “informed prior” to minimise the KL divergence with an interesting posterior. This idea has been studied by [16,17] and, more recently, by Mhammedi et al. [18], Rivasplata et al. [5], among others. We will adapt it to our problem in the simplest way.

We now introduce some additional notation. For a sample $s = (z_{1}, \dots, z_{m})$ and $k \in \{1, \dots, m\}$ , we define $s_{\leq k} := \{z_{1}, \dots, z_{k}\}$ and $s_{> k} := \{z_{k + 1}, \dots, z_{m}\}$ . Then, similarly, for a random sample S, we have the splits $S_{\leq k}$ and $S_{> k}$ .

Proposition 2.

Let ℓ be HYPE(K) compliant, with constant $K (h) = C$ , and $α_{1}, α_{2} \in R$ . Consider any “priors” $P_{1} \in M_{1}^{+} (H)$ (possibly dependent on $S_{> m / 2}$ ) and $P_{2} \in M_{1}^{+} (H)$ (possibly dependent on $S_{\leq m / 2}$ ). Then, for any $δ \in [0; 1]$ , with probability of at least $1 - δ$ over random size-m samples S, simultaneously for all $Q \in M_{1}^{+} (H)$ such that $Q \sim P_{1}$ and $Q \sim P_{2}$ we have:

$\begin{aligned} E_{h \sim Q} [R (h)] \leq E_{h \sim Q} [R_{S} (h)] & + \frac{1}{2} (\frac{\mathrm{KL} (Q \| P_{1}) + \log (2 / δ)}{{(m / 2)}^{α_{1}}} + \frac{C^{2}}{2 {(m / 2)}^{1 - α_{1}}}) \\ & + \frac{1}{2} (\frac{\mathrm{KL} (Q \| P_{2}) + \log (2 / δ)}{{(m / 2)}^{α_{2}}} + \frac{C^{2}}{2 {(m / 2)}^{1 - α_{2}}}) . \end{aligned}$

Proof.

Let $P_{1}, P_{2}, Q$ be as stated in Proposition 2. We first notice that by using Proposition 1 on the two halves of the sample, we obtain, with a probability of at least $1 - δ / 2$ :

$E_{h \sim Q} [R (h)] \leq E_{h \sim Q} [\frac{1}{m / 2} \sum_{i = 1}^{m / 2} ℓ (h, Z_{i})] + \frac{\mathrm{KL} (Q \| P_{1}) + \log (2 / δ)}{{(m / 2)}^{α_{1}}} + \frac{C^{2}}{2 {(m / 2)}^{1 - α_{1}}}$

and also with probability at least $1 - δ / 2$ :

$E_{h \sim Q} [R (h)] \leq E_{h \sim Q} [\frac{1}{m / 2} \sum_{i = 1}^{m / 2} ℓ (h, Z_{m / 2 + i})] + \frac{\mathrm{KL} (Q \| P_{2}) + \log (2 / δ)}{{(m / 2)}^{α_{2}}} + \frac{C^{2}}{2 {(m / 2)}^{1 - α_{2}}} .$

Hence, with a probability of at least $1 - δ$ , both inequalities hold, and the result follows by adding them and dividing by 2. □
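The bound of Proposition 2 can be evaluated in the same way as that of Proposition 1, each prior contributing a term computed with $m / 2$ points and confidence $δ / 2$ . A minimal sketch with hypothetical function names, assuming the two KL divergences and the empirical risk are given:

```python
import math


def half_sample_term(kl, m, C, alpha, delta):
    """Complexity and range terms for one prior, using m / 2 points and delta / 2."""
    half = m / 2.0
    return (kl + math.log(2.0 / delta)) / half ** alpha + C ** 2 / (2.0 * half ** (1.0 - alpha))


def prop2_bound(emp_risk_q, kl1, kl2, m, C, alpha1=0.5, alpha2=0.5, delta=0.05):
    """Right-hand side of Proposition 2 (average of the two half-sample terms)."""
    return (emp_risk_q
            + 0.5 * half_sample_term(kl1, m, C, alpha1, delta)
            + 0.5 * half_sample_term(kl2, m, C, alpha2, delta))


print(prop2_bound(emp_risk_q=0.1, kl1=0.5, kl2=0.7, m=1000, C=1.0))
```

Compared with Proposition 1, every term pays for the halved sample size, so the construction only helps when the informed priors make the KL divergences substantially smaller.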

Remark 4.

One can notice that the main difference between Proposition 2 and Proposition 1 lies in the implicit PAC-Bayesian paradigm that our priors must not depend on the data. With this last proposition, we implicitly allow $P_{1}$ to depend on $S_{> m / 2}$ and $P_{2}$ on $S_{\leq m / 2}$ , which can in practice lead to far more accurate priors. We numerically show this fact in Section 3.2’s second experiment. Note that this idea is not new and has been studied, for instance, in [19] for the specific case of SVMs.

3.2. Numerical Experiments

Our experimental framework has been inspired by the work of [18].

Settings. We generate synthetic data for classification and use the 0–1 loss. The data space is $Z = X \times Y = R^{d} \times \{0, 1\}$ with $d \in N$ . The set of predictors $H$ is parameterised by d-dimensional ‘weight’ vectors: $H = \{h_{w} : X \to Y \mid w \in R^{d}\}$ . For simplicity, we identify $h_{w}$ with w and the space $H$ with the weight space $W = R^{d}$ . For $z = (x, y) \in Z$ and $w \in W$ , we define the loss as $ℓ (w, z) := | 𝟙 \{ϕ (w^{⊤} x) > 1 / 2\} - y |$ , where $ϕ (r) = \frac{1}{1 + e^{- r}}$ . We want to learn an optimised predictor given a dataset $S = {(Z_{i})}_{i = 1, \dots, m}$ where $Z_{i} = (X_{i}, Y_{i})$ . To do so, we use regularised logistic regression and compute:

$\hat{w} (S) := \arg\min_{w \in W} λ \frac{{\| w \|}^{2}}{2} - \frac{1}{m} \sum_{i = 1}^{m} [y_{i} \log (ϕ (w^{⊤} x_{i})) + (1 - y_{i}) \log (1 - ϕ (w^{⊤} x_{i}))]$

(2)

where $λ$ is a fixed regularisation parameter.
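For concreteness, the objective of Equation (2) can be written as follows. This is a sketch under two assumptions: the sum in Equation (2) is read as covering both logarithmic terms (the usual negative average log-likelihood), and the probabilities are clipped by a small `eps` to keep the logarithms finite, which is an implementation detail not taken from the paper.

```python
import math


def sigmoid(r):
    """Logistic function phi(r) = 1 / (1 + exp(-r)), written to avoid overflow."""
    if r >= 0:
        return 1.0 / (1.0 + math.exp(-r))
    e = math.exp(r)
    return e / (1.0 + e)


def objective(w, xs, ys, lam=0.01, eps=1e-12):
    """Regularised logistic objective of Equation (2), to be minimised over w."""
    reg = lam * sum(wj * wj for wj in w) / 2.0
    log_lik = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
        p = min(max(p, eps), 1.0 - eps)  # clip to keep the logarithms finite
        log_lik += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return reg - log_lik / len(xs)


print(objective([0.0, 0.0], [[1.0, 2.0]], [1]))  # equals log(2) at w = 0
```

Any derivative-free or gradient-based minimiser can then be applied to `objective`.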

We also restrict the probability distributions (over $W = R^{d}$ ), considered for this learning problem. We consider the Gaussian distribution $N (w, σ^{2} I_{d})$ with centre $w \in R^{d}$ and diagonal covariance $σ^{2} I_{d} \in R^{d \times d}$ with $σ^{2} > 0$ .
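With isotropic Gaussian priors and posteriors, the KL divergence appearing in the bounds has the standard closed form $\mathrm{KL} (N (w_{Q}, σ_{Q}^{2} I_{d}) \| N (w_{P}, σ_{P}^{2} I_{d})) = \frac{1}{2} (d \frac{σ_{Q}^{2}}{σ_{P}^{2}} + \frac{{\| w_{Q} - w_{P} \|}^{2}}{σ_{P}^{2}} - d + d \log \frac{σ_{P}^{2}}{σ_{Q}^{2}})$ . A minimal sketch (the function name and the symbols $w_{Q}, w_{P}, σ_{Q}, σ_{P}$ are introduced here for illustration):

```python
import math


def kl_isotropic_gaussians(w_q, var_q, w_p, var_p):
    """KL( N(w_q, var_q * I) || N(w_p, var_p * I) ) in dimension d = len(w_q)."""
    d = len(w_q)
    sq_dist = sum((a - b) ** 2 for a, b in zip(w_q, w_p))
    return 0.5 * (d * var_q / var_p + sq_dist / var_p - d + d * math.log(var_p / var_q))


print(kl_isotropic_gaussians([1.0, 0.0], 0.5, [0.0, 0.0], 0.5))
```

The squared distance between the two centres enters linearly, which is why the choice of prior centre matters so much for the bounds.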

Parameters. We set $δ = 0.05$ and $λ = 0.01$ . We approximately solve Equation (2) by using the minimize function of the optimisation module in Python, with the Powell method. To approximate Gaussian expectations, we use Monte Carlo sampling.

Synthetic data. We generate synthetic data for $d = 10$ according to the following process: for a fixed sample size m, we draw $X_{1}, \dots, X_{m}$ from the multivariate Gaussian distribution $N (0, I_{d})$ and, for each i, we compute the label of $X_{i}$ as $Y_{i} = 𝟙 \{ϕ (w^{* ⊤} X_{i}) > 1 / 2\}$ , where $w^{*}$ is the vector formed by the first d digits of the number $π$ .
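The data-generating process can be sketched as below. Two points are assumptions of this illustration: $w^{*}$ is read as the vector of decimal digits (3, 1, 4, 1, 5, 9, 2, 6, 5, 3), and the random number generator and seed are arbitrary choices.

```python
import math
import random

# Assumed reading of "the first d digits of pi" for d = 10.
W_STAR = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]


def make_dataset(m, w_star=W_STAR, seed=0):
    """Draw X_i ~ N(0, I_d) and label Y_i = 1{phi(<w*, X_i>) > 1/2}."""
    rng = random.Random(seed)
    d = len(w_star)
    xs = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    ys = []
    for x in xs:
        r = sum(wj * xj for wj, xj in zip(w_star, x))
        ys.append(1 if 1.0 / (1.0 + math.exp(-r)) > 0.5 else 0)
    return xs, ys


xs, ys = make_dataset(m=5)
print(ys)
```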

Normalisation trick. Given the form of the predictors, we notice that for any $w \in W$ :

$𝟙 \{ϕ (w^{⊤} x) > 1 / 2\} = 1 \Leftrightarrow \frac{1}{1 + \exp (- w^{⊤} x)} > \frac{1}{2} \Leftrightarrow w^{⊤} x > 0 .$

Thus, the value of the prediction is exclusively determined by the sign of the inner product, and this sign is not influenced by the norm of the vector w. Then, for any sample S, we call the normalisation trick the fact of considering $\hat{w} (S) / \| \hat{w} (S) \|$ instead of $\hat{w} (S)$ in our calculations. This process does not deteriorate the quality of the prediction and considerably reduces the value of the KL divergence.
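The invariance behind the normalisation trick is easy to check on examples. A small sketch with hypothetical helper names and arbitrary test points:

```python
import math


def predict(w, x):
    """Prediction 1{phi(<w, x>) > 1/2} with phi the logistic function."""
    r = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if 1.0 / (1.0 + math.exp(-r)) > 0.5 else 0


def normalise(w):
    """Rescale w to unit Euclidean norm (w must be non-zero)."""
    norm = math.sqrt(sum(wj * wj for wj in w))
    return [wj / norm for wj in w]


w = [3.0, -4.0]
points = [[1.0, 0.5], [-1.0, 2.0], [0.2, -0.1]]
print([predict(w, x) for x in points])
print([predict(normalise(w), x) for x in points])  # same predictions
```

The predictions coincide, while the squared norm of the centre, which enters the KL divergence with respect to a prior centred at 0, drops from 25 to 1 in this example.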

3.2.1. First Experiment

Our goal here is to highlight the point discussed in Remark 2, i.e., the influence of the parameter $α$ in Proposition 1. We arbitrarily fix $σ_{0}^{2} = 1 / 2$ , and define our naive prior as $P_{0} = N (0, σ_{0}^{2} I_{d})$ . For a fixed dataset S, we define our posterior as $P (S) := N (\hat{w} (S), σ^{2} I_{d})$ , with $σ^{2} \in \{1 / 2, \dots, 1 / 2^{J}\}$ (for $J = \log_{2} (m)$ ) chosen to minimise the bound among these candidates. We computed two curves: first, Proposition 1 with $α = 1 / 2$ ; second, Proposition 1 again with $α$ equal to the value proposed in Lemma 1. Notice that to compute this last bound, we first optimised our choice of posterior with $α = 1 / 2$ and then optimised $α$ , to be consistent with Lemma 1. Indeed, we proved this lemma by assuming that the KL divergence was already fixed, hence our optimisation process is in two steps. Note that we chose to apply the normalisation trick here; we then obtained the left curve of Figure 1.

Figure 1. Above: result of the first experiment, which highlights the importance of optimising $α$ . Below: result of the second experiment, which shows how effective an informed prior is.

Discussion. From this curve, we formulate several remarks. First, in this specific case, our theorem provides a tight result in practice (with an error rate lower than 10% for the bound with optimised $α$ ). Second, we can now confirm that choosing an optimised $α$ leads to a tighter bound. In further studies, it will be relevant to adjust $α$ with regard to the different terms of our bound instead of looking for an identical convergence rate for all terms.

3.2.2. Second Experiment

We now study Proposition 2 to see if an informed prior effectively provides a tighter bound than a naive one. We will use the notation introduced in Proposition 2. For a dataset S, we define $w_{1} (S) := \hat{w} (S_{> m / 2})$ as the vector resulting from the optimisation of Equation (2) on $S_{> m / 2}$ . Similarly, we define $w_{2} (S) := \hat{w} (S_{\leq m / 2})$ . We arbitrarily fix $σ_{0}^{2} = 1 / 2$ , and define our informed priors as $P_{1} = N (w_{1} (S), σ_{0}^{2} I_{d})$ and $P_{2} = N (w_{2} (S), σ_{0}^{2} I_{d})$ . Finally, we define our posterior as $P (S) := N (\hat{w} (S), σ^{2} I_{d})$ , with $σ^{2} \in \{1 / 2, \dots, 1 / 2^{J}\}$ (for $J = \log_{2} (m)$ ) optimising the bound among the same candidates as in the first experiment. We computed two curves: first, Proposition 1 with $α$ optimised according to Lemma 1; second, Proposition 2 with $α_{1}, α_{2}$ optimised as well, and informed priors as defined above. We chose not to apply the normalisation trick here; we then obtained the right curve of Figure 1.

Discussion. It is clear that, within this framework, having an informed prior is a powerful tool to enhance the quality of our bound. Notice that we deliberately chose not to apply the normalisation trick here. The reason is that this trick appears to be too powerful in practice, and applying it leads to counterproductive results: the bound without informed prior would be tighter than the one with informed prior. Furthermore, this trick is linked to the specific structure of our problem and is not valid for every classification problem. Thus, the idea of providing informed priors remains an interesting tool for most cases.

4. PAC-Bayesian Bounds with Smoothed Estimator

We now move on to control the right-hand side term in Theorem 3 when K is not constant. A first step is to consider a transformed estimate of the risk, inspired by the truncated estimator from [20], also used in [21], and more recently in [13]. The following is inspired by the results of [13], which we summarise in Section 6.

The idea is to modify the estimator $R_{S} (h)$ for any h by introducing a threshold t and a function $ψ$ which will attenuate the influence of the empirical losses ${(ℓ (h, Z_{i}))}_{i = 1, \dots, m}$ that exceed t.

Definition 2.

( $ψ$ -risks). For every $t > 0$ , $ψ : R^{+} \to R^{+}$ and $h \in H$ , we define the empirical ψ-risk $R_{S, ψ, t}$ and the theoretical ψ-risk $R_{ψ, t}$ as follows:

$R_{S, ψ, t} (h) := \frac{t}{m} \sum_{i = 1}^{m} ψ (\frac{ℓ (h, Z_{i})}{t}) \quad \text{and} \quad R_{ψ, t} (h) = E_{μ} [t ψ (\frac{ℓ (h, Z)}{t})]$

where $Z \sim μ$ . Notice that $E_{S} [R_{S, ψ, t} (h)] = R_{ψ, t} (h)$ .
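The empirical ψ-risk is a rescaled average and can be computed directly from the observed losses. In the sketch below, the capping function `clip` (identity on [0, 1], equal to 1 afterwards) is only an example choice of ψ, and the loss values are made up for illustration.

```python
def empirical_psi_risk(losses, psi, t):
    """R_{S, psi, t}(h) = (t / m) * sum_i psi(loss_i / t) for losses loss_i = l(h, Z_i)."""
    return t * sum(psi(value / t) for value in losses) / len(losses)


def clip(x):
    """Example psi: identity on [0, 1], constant equal to 1 beyond."""
    return min(x, 1.0)


losses = [0.2, 0.5, 10.0]
print(empirical_psi_risk(losses, clip, t=1.0))    # the outlier 10.0 is capped at t
print(empirical_psi_risk(losses, clip, t=100.0))  # no loss exceeds t: plain average
```

When the threshold t exceeds every observed loss, the ψ-risk coincides with the usual empirical risk.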

We now focus on what we call softening functions, i.e., functions that will temper high values of the loss function ℓ.

Definition 3.

(Softening function). We say that $ψ : R^{+} \to R^{+}$ is a softening function if:

$\forall x \in [0; 1], ψ (x) = x$ ,

ψ is non-decreasing,

$\forall x \geq 1, ψ (x) \leq x$ .

We let $F$ denote the set of all softening functions.

Remark 5.

Notice that those three assumptions ensure that ψ is continuous at 1. For instance, the functions $f : x \mapsto x 𝟙 {x \leq 1} + 𝟙 {x > 1}$ and $g : x \mapsto x 𝟙 {x \leq 1} + (2 \sqrt{x} - 1) 𝟙 {x > 1}$ are in $F$ . In Section 6 we compare these softening functions and those used by Holland [13].
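The two examples of Remark 5 can be written down and checked against Definition 3 numerically. The check below only inspects a finite grid, so it is a sanity test rather than a proof; the function names are hypothetical.

```python
import math


def psi_clip(x):
    """f(x) = x for x <= 1 and 1 for x > 1."""
    return x if x <= 1.0 else 1.0


def psi_sqrt(x):
    """g(x) = x for x <= 1 and 2 sqrt(x) - 1 for x > 1."""
    return x if x <= 1.0 else 2.0 * math.sqrt(x) - 1.0


def satisfies_definition_3(psi, grid, tol=1e-12):
    """Check the three conditions of Definition 3 on a finite grid only."""
    pts = sorted(grid)
    identity_on_unit = all(abs(psi(x) - x) <= tol for x in pts if x <= 1.0)
    non_decreasing = all(psi(a) <= psi(b) + tol for a, b in zip(pts, pts[1:]))
    below_identity = all(psi(x) <= x + tol for x in pts if x >= 1.0)
    return identity_on_unit and non_decreasing and below_identity


grid = [i / 100 for i in range(0, 1001)]  # points in [0, 10]
print(satisfies_definition_3(psi_clip, grid), satisfies_definition_3(psi_sqrt, grid))
```

For g, the third condition follows from $x - (2 \sqrt{x} - 1) = {(\sqrt{x} - 1)}^{2} \geq 0$ .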

Using $ψ \in F$ , for a fixed threshold $t > 0$ , the softened loss function $t ψ (\frac{ℓ (h, z)}{t})$ verifies for any $h \in H$ , $z \in Z$ :

$t ψ (\frac{ℓ (h, z)}{t}) \leq t ψ (\frac{K (h)}{t})$

because $ψ$ is non-decreasing. In this way, the exponential moment in Theorem 3 can be far more controllable. The trade-off lies in the fact that softening ℓ (instead of taking ℓ directly) will deteriorate our ability to distinguish between two bad predictions when both of them are greater than t. For instance, if we choose $ψ \in F$ such that $ψ = 1$ on $[1; + \infty)$ and $t > 0$ , and if $ψ (ℓ (h, z) / t) = 1$ for a certain pair $(h, z)$ , then we cannot tell how far $ℓ (h, z)$ is from t: we can only assert that $ℓ (h, z) \geq t$ .

We now move on to the following lemma, which controls the shortfall between $E_{h \sim Q} [R (h)]$ and $E_{h \sim Q} [R_{ψ, t} (h)]$ for all $Q \in M_{1}^{+} (H)$ , for a given $ψ$ and $t > 0$ . To do that, we assume that K admits a finite moment under any posterior distribution:

$\forall Q \in M_{1}^{+} (H), \quad E_{h \sim Q} [K (h)] < + \infty .$

(3)

For instance, in the case of $H$ identified with a weight space $W = R^{N}$ , and if K is polynomial in $| | w | |$ (where $| | . | |$ denotes the Euclidean norm), then this assumption holds if we consider Gaussian priors and posteriors.

Lemma 2.

Assume that Equation (3) holds, and let $ψ \in F$ , $Q \in M_{1}^{+} (H), t > 0$ . We have:

$E_{h \sim Q} [R (h)] \leq E_{h \sim Q} [R_{ψ, t} (h)] + E_{h \sim Q} [K (h) 𝟙 \{K (h) \geq t\}] .$

Proof.

Let $ψ \in F$ , $Q \in M_{1}^{+} (H), t > 0$ . We have, for $h \in H$ :

$R (h) - R_{ψ, t} (h) = E_{Z \sim μ} [ℓ (h, Z) - t ψ (\frac{ℓ (h, Z)}{t})]$

and using that $\forall x \in [0, 1], ψ (x) = x$ ,

$= E_{Z \sim μ} [(ℓ (h, Z) - t ψ (\frac{ℓ (h, Z)}{t})) 𝟙 \{ℓ (h, Z) \geq t\}]$

while using that $ℓ (h, z) \leq K (h)$ ,

$= E_{Z \sim μ} [(ℓ (h, Z) - t ψ (\frac{ℓ (h, Z)}{t})) 𝟙 \{ℓ (h, Z) \geq t\} 𝟙 \{K (h) \geq t\}]$

and continuing:

$\leq E_{Z \sim μ} [ℓ (h, Z) 𝟙 \{ℓ (h, Z) \geq t\}] 𝟙 \{K (h) \geq t\} \quad (ψ \geq 0)$

$\leq K (h) P_{Z \sim μ} \{ℓ (h, Z) \geq t\} 𝟙 \{K (h) \geq t\} \quad (ℓ (h, Z) \leq K (h))$

Finally, by crudely bounding the probability by 1, we get:

$R (h) \leq R_{ψ, t} (h) + K (h) 𝟙 \{K (h) \geq t\} .$

Hence the result by integrating over $H$ with respect to Q. □

Finally we present the following theorem, which provides a PAC-Bayesian inequality bounding the theoretical risk by the empirical $ψ$ -risk for $ψ \in F$ .

Theorem 4.

Let ℓ be HYPE(K) compliant, and assume K satisfies Equation (3). Then for any $P \in M_{1}^{+} (H)$ with no data dependency, for any $α \in R$ , for any $ψ \in F$ and for any $δ \in [0; 1]$ , with probability of at least $1 - δ$ over size-m random samples S, simultaneously for all Q such that $Q \sim P$ we have:

$\begin{aligned} E_{h \sim Q} [R (h)] \leq {} & E_{h \sim Q} [R_{S, ψ, t} (h)] + E_{h \sim Q} [K (h) 𝟙 \{K (h) \geq t\}] \\ & + \frac{\mathrm{KL} (Q \| P) + \log (\frac{1}{δ})}{m^{α}} \\ & + \frac{1}{m^{α}} \log (E_{h \sim P} [\exp (\frac{t^{2}}{2 m^{1 - 2 α}} ψ {(\frac{K (h)}{t})}^{2})]) . \end{aligned}$

Proof.

Let $ψ \in F$ . We define the $ψ$ -loss:

$ℓ_{2} (h, z) = t ψ (\frac{ℓ (h, z)}{t}) .$

Since $ψ$ is non-decreasing, we have for all $(h, z) \in H \times Z$ :

$ℓ_{2} (h, z) \leq t ψ (\frac{K (h)}{t}) := K_{2} (h) .$

Thus, we apply Theorem 3 to the learning problem defined with $ℓ_{2}$ : for any $α$ and $δ \in (0, 1)$ , with probability at least $1 - δ$ over size-m random samples S, simultaneously for all Q such that $Q \sim P$ we have:

$\begin{aligned} E_{h \sim Q} [R_{ψ, t} (h)] \leq {} & E_{h \sim Q} [R_{S, ψ, t} (h)] + \frac{\mathrm{KL} (Q \| P) + \log (\frac{1}{δ})}{m^{α}} \\ & + \frac{1}{m^{α}} \log (E_{h \sim P} [\exp (\frac{K_{2} {(h)}^{2}}{2 m^{1 - 2 α}})]) . \end{aligned}$

We then add $E_{h \sim Q} [K (h) 𝟙 \{K (h) \geq t\}]$ on both sides of the latter inequality and apply Lemma 2. □

Remark 6.

Notice that the function $ψ : x \mapsto x 𝟙 {x \leq 1} + 𝟙 {x > 1}$ is such that for any given prior P we have $E_{h \sim P} [exp (\frac{t^{2}}{2 m^{1 - 2 α}} ψ {(\frac{K (h)}{t})}^{2})] < + \infty$ . So the exponential moment can be controlled with a good choice of ψ. Thus the strength of Theorem 4 is to provide a PAC-Bayesian bound valid for any set of posterior measures verifying Equation (3). The choice of ψ minimising the bound is still an open problem.
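As a worked instance of the first claim of Remark 6, consider the capping function $ψ (x) = x 𝟙 \{x \leq 1\} + 𝟙 \{x > 1\}$ . Since $ψ (K (h) / t) \leq 1$ for every h, the exponential-moment term of Theorem 4 can be bounded explicitly; the following short derivation is a sketch added for illustration, not a statement taken from the original results.

```latex
% Capping function: psi(K(h)/t) <= 1 for every h, hence for any prior P
\begin{aligned}
E_{h \sim P}\left[\exp\left(\frac{t^{2}}{2 m^{1 - 2\alpha}}\,
    \psi\left(\frac{K(h)}{t}\right)^{2}\right)\right]
  &\leq \exp\left(\frac{t^{2}}{2 m^{1 - 2\alpha}}\right), \\
\frac{1}{m^{\alpha}} \log E_{h \sim P}\left[\exp\left(\frac{t^{2}}{2 m^{1 - 2\alpha}}\,
    \psi\left(\frac{K(h)}{t}\right)^{2}\right)\right]
  &\leq \frac{t^{2}}{2 m^{1 - \alpha}} .
\end{aligned}
```

The last term has the same form as the range term $\frac{C^{2}}{2 m^{1 - α}}$ of Proposition 1, with the threshold t playing the role of the constant C.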

5. The Linear Regression Problem

5.1. Theoretical Result

We now focus on the celebrated linear regression problem and see how our theory translates to that particular learning problem. We assume that the data is a size-m random sample $S = {(Z_{i})}_{i = 1 . . m}$ where the $Z_{i}$ are i.i.d. drawn from the distribution $μ$ , and $Z_{i} = (X_{i}, Y_{i})$ with $X_{i} \in R^{N}$ , $Y_{i} \in R$ .

Our goal here is to find the most accurate predictor $h_{w}$ (with $w \in R^{N}$ ), with respect to the loss function $ℓ (h_{w}, z) = | 〈 w, x 〉 - y |$ , where $z = (x, y)$ . We will make the following mild assumption: there exist $B, C \in R^{+} \setminus \{0\}$ such that for all $z = (x, y)$ drawn under $μ$ :

$| | x | | \leq B \quad \text{and} \quad | y | \leq C$

where $| | . | |$ is the norm associated to the classical inner product of $R^{N}$ . Under this assumption we note that for all $z = (x, y)$ drawn according to $μ$ , we have:

$\begin{matrix} ℓ (h_{w}, z) = | 〈 w, x 〉 - y | \leq | 〈 w, x 〉 | + | y | \leq | | w | | . | | x | | + | y | \leq B | | w | | + C . \end{matrix}$

Thus we define $K (h_{w}) = B | | w | | + C$ for $w \in R^{N}$ . If we first restrict ourselves to the framework of Section 2, we want to use Theorem 3 and doing so, our goal is to bound $ξ : = E_{w \sim P} [exp (\frac{K {(w)}^{2}}{2 m^{1 - 2 α}})]$ . The shape of K invites us to consider a Gaussian prior. Indeed, we notice that if $P = N (0, σ^{2} I_{N})$ with $0 < σ^{2} < \frac{m^{1 - 2 α}}{B^{2}}$ , then $ξ < + \infty$ . Notice that we cannot take just any Gaussian prior, however with a small $α$ , the condition $0 < σ^{2} < \frac{m^{1 - 2 α}}{B^{2}}$ may become quite loose. Thus, we have the following:

Theorem 5.

Let $α \in R$ and $N \geq 6$ . Assume that the loss ℓ is HYPE(K) compliant with $K (h) = B | | h | | + C$ , with $B > 0, C \geq 0$ . For a prior distribution, consider any Gaussian $P = N (0, σ^{2} I_{N})$ with $σ^{2} = t \frac{m^{1 - 2 α}}{B^{2}}$ , $0 < t < 1$ . Then, for any $δ \in [0; 1]$ , with probability of at least $1 - δ$ over size-m random samples S, simultaneously for all $Q \in M_{1}^{+} (H)$ such that $P \sim Q$ we have:

$\begin{matrix} E_{h \sim Q} [R (h)] & \leq E_{h \sim Q} [R_{S} (h)] + \frac{KL (Q | | P) + log (2 / δ)}{m^{α}} + \frac{C^{2}}{2 m^{1 - α}} (1 + f {(t)}^{- 1}) \\ + \frac{N}{m^{α}} (log (1 + (\frac{C}{\sqrt{2 f (t) m^{1 - 2 α}}})) + log (\frac{1}{\sqrt{1 - t}})) \end{matrix}$

where $f (t) = \frac{1 - t}{t}$ .

The proof is deferred to Section 7.2. To compare our result with those found in the literature, we can fix $α = 1 / 2$ . Doing so, we lose the dependency in m for the choice of the variance of the prior (which now only depends on B), but we recover the classic decreasing factor $1 / \sqrt{m}$ .
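To make the bound of Theorem 5 concrete, the sketch below evaluates its right-hand side term by term. The function name and the numerical inputs (empirical risk, KL value, sample size) are hypothetical and only serve as an example. Note that B enters only through the prior variance $σ^{2} = t m^{1 - 2 α} / B^{2}$ and therefore does not appear as an argument.

```python
import math

def theorem5_rhs(emp_risk, kl, m, alpha, delta, C, N, t):
    """Right-hand side of Theorem 5 with f(t) = (1 - t) / t."""
    f = (1.0 - t) / t
    scale = m ** (1.0 - 2.0 * alpha)                      # m^(1 - 2 alpha)
    kl_term = (kl + math.log(2.0 / delta)) / m ** alpha
    c_term = C ** 2 / (2.0 * m ** (1.0 - alpha)) * (1.0 + 1.0 / f)
    dim_term = (N / m ** alpha) * (
        math.log(1.0 + C / math.sqrt(2.0 * f * scale))
        + math.log(1.0 / math.sqrt(1.0 - t))
    )
    return emp_risk + kl_term + c_term + dim_term

# Hypothetical inputs, with alpha = 1/2 to recover the 1 / sqrt(m) rate.
value = theorem5_rhs(emp_risk=0.2, kl=5.0, m=10_000, alpha=0.5,
                     delta=0.05, C=1.0, N=10, t=0.5)
```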

Remark 7.

Notice that for now we did not use Section 4, even if we could (because K is polynomial in $| | w | |$ and we consider Gaussian priors and posteriors, so Equation (3) is satisfied). Doing so, we obtained a bound which appears to depend linearly on the dimension N. In practice, N may be too big, and in this case, introducing an adapted softening function ψ (one can think for instance of $ψ (x) = x 𝟙 {x \leq 1} + 𝟙 {x > 1}$ ) is a powerful tool to attenuate the weight of the exponential moment. This also extends the class of authorised Gaussian priors, since we no longer need to stick with a variance $σ^{2} = t \frac{m^{1 - 2 α}}{B^{2}}$ , $0 < t < 1$ .

5.2. Numerical Experiment

5.2.1. Setting

In this section we apply Theorem 5 on a concrete linear regression problem. The situation is as follows: we want to approximate the function $f (x) = \sqrt{〈 w^{*}, x 〉}$ , where $w^{*} \in R^{d}$ . We assume that $W = {[- c, c]}^{d}$ so that $w^{*}$ lies in a hypercube centred at 0 of half-side $c > 0$ , i.e., the set $\{{(w_{i})}_{i = 1, \dots, d} ∣ \forall i, | w_{i} | \leq c\}$ . Doing so we have $| | w^{*} | | \leq c \sqrt{d}$ .

Furthermore, we assume that input data are drawn inside a hypercube of half-side $e > 0$ , i.e., $X = {[- e, e]}^{d}$ . Doing so we have for any data $x, | | x | | \leq e \sqrt{d}$ .

For any data $x \in R^{d}$ , we define $y = f (x)$ . As before, we identify the hypothesis set $H$ with the weight space $W = R^{d}$ . As described in Section 5.1, we set $ℓ (h_{w}, x, y) = | 〈 w, x 〉 - y |$ . We then remark that for any $(w, x, y)$ :

$\begin{matrix} ℓ (h_{w}, x, y) & \leq | 〈 w, x 〉 | + | y | \leq | | w | | | | x | | + | \sqrt{〈 w^{*}, x 〉} | \\ \leq e \sqrt{d} | | w | | + \sqrt{| | w^{*} | | . | | x | |} \leq e \sqrt{d} | | w | | + \sqrt{c \sqrt{d} . e \sqrt{d}} \\ \leq e \sqrt{d} | | w | | + \sqrt{c d e} . \end{matrix}$

Then we can define $B = e \sqrt{d}$ and $C = \sqrt{c d e}$ to apply Theorem 5. We restrict (as before) the class of distributions over $W$ to be d-dimensional Gaussians:

$\{N (w, σ^{2} I_{d}) ∣ w \in H, σ^{2} \in R^{+}\},$

which is the set of candidate distributions for this learning problem. Recall that in practice, given a fixed $α \in R$ , we are only allowed to consider priors such that their variance $σ^{2} \in]0; \frac{m^{1 - 2 α}}{B^{2}}[$ . We want to learn an optimised predictor (posterior) given a random dataset $S = {((X_{i}, Y_{i}))}_{i = 1, \dots, m}$ . To do so, we consider synthetic data.

5.2.2. Synthetic Data

We draw $w^{*}$ under a Gaussian (with mean 0 and standard deviation equal to 5) truncated to the hypercube centered at 0 of the half-side $c > 0$ . We generate synthetic data according to the following process: for a fixed sample size m, we draw $X_{1}, \dots, X_{m}$ under a Gaussian (with mean 0 and standard deviation equal to 5) truncated to the hypercube centered at 0 of the half-side $e > 0$ .
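The sampling step can be sketched as follows. The text does not say how the truncation is implemented, so the coordinate-wise rejection sampler below is our assumption: it is valid because an isotropic Gaussian conditioned on a hypercube has independent truncated coordinates. The seed, the sample size and the function name are arbitrary choices for illustration.

```python
import math
import random

def truncated_gaussian_vector(dim, std, half_side, rng):
    """Draw from N(0, std^2 I) conditioned on [-half_side, half_side]^dim.

    Coordinates are independent, so each one is rejected and redrawn on its own.
    """
    vec = []
    while len(vec) < dim:
        v = rng.gauss(0.0, std)
        if abs(v) <= half_side:
            vec.append(v)
    return vec

rng = random.Random(0)                       # fixed seed, arbitrary
d, c, e, m = 10, 10.0, 10.0, 100
w_star = truncated_gaussian_vector(d, 5.0, c, rng)
X = [truncated_gaussian_vector(d, 5.0, e, rng) for _ in range(m)]

B = e * math.sqrt(d)                         # bound on ||x|| from Section 5.2.1
max_norm = max(math.sqrt(sum(v * v for v in x)) for x in X)
```

By construction every sampled input satisfies $| | x | | \leq e \sqrt{d} = B$ , which is the quantity needed to apply Theorem 5.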

5.2.3. Experiment

First, we fix $c = e = 10$ . Our goal here is to obtain a generalisation bound on our problem. For a fixed $α \in R$ , we arbitrarily fix $t_{0} = 1 / 2$ and $σ_{0}^{2} = t_{0} \frac{m^{1 - 2 α}}{B^{2}}$ , and we define our naive prior as $P_{0} = N (0, σ_{0}^{2} I_{d})$ . For a given dataset S, we define our posterior as $Q (S) : = N (\hat{w} (S), σ^{2} I_{d})$ , with $σ^{2} \in \{σ_{0}^{2} / 2, \dots, σ_{0}^{2} / 2^{J}\}$ ( $J = {log}_{2} (m)$ ) chosen to minimise the bound among the candidates. Note that all the previously defined parameters depend on $α$ , which is why we choose $α \in \{i / step ∣ 0 \leq i \leq step\}$ for step a fixed integer (in practice step = 8 or 16) and we take the value of $α$ minimising the bound among the candidates as well. Figure 2 contains two panels, one with $d = 10$ , the other with $d = 50$ . Each panel shows the right-hand side of Theorem 5 with an optimised $α$ for each step.

Figure 2. Evaluation of the right-hand side in Theorem 5 with $d = 10$ and $d = 50$ .
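The selection of $α$ over the grid can be sketched as follows. This is only an illustration under simplifying assumptions that are ours: the empirical risk term is left out, the posterior variance is fixed to $σ_{0}^{2} / 2$ (the parameter `j = 1`), and the squared norm of the posterior mean is set to the hypothetical value 1. The KL divergence between isotropic Gaussians is the standard closed form.

```python
import math

def kl_isotropic(w_norm_sq, s2, s02, d):
    """KL( N(w, s2 I_d) || N(0, s02 I_d) ) in closed form."""
    return 0.5 * (d * s2 / s02 + w_norm_sq / s02 - d + d * math.log(s02 / s2))

def complexity(alpha, m, d, B, C, w_norm_sq, j=1, delta=0.05, t0=0.5):
    """Right-hand side of Theorem 5 without the empirical risk term."""
    scale = m ** (1.0 - 2.0 * alpha)
    s02 = t0 * scale / B ** 2                 # prior variance sigma_0^2
    kl = kl_isotropic(w_norm_sq, s02 / 2 ** j, s02, d)
    f = (1.0 - t0) / t0
    return (
        (kl + math.log(2.0 / delta)) / m ** alpha
        + C ** 2 / (2.0 * m ** (1.0 - alpha)) * (1.0 + 1.0 / f)
        + (d / m ** alpha) * (math.log(1.0 + C / math.sqrt(2.0 * f * scale))
                              + math.log(1.0 / math.sqrt(1.0 - t0)))
    )

d, c, e, m, step = 10, 10.0, 10.0, 1000, 8
B, C = e * math.sqrt(d), math.sqrt(c * d * e)
grid = [i / step for i in range(step + 1)]
best_alpha = min(grid, key=lambda a: complexity(a, m, d, B, C, w_norm_sq=1.0))
```

In a full experiment the expected empirical risk under each candidate posterior would be added before taking the minimum, and the loop would also run over the candidate variances.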

5.2.4. Discussion

To the best of our knowledge, this is the first attempt to numerically compute PAC-Bayes bounds for unbounded problems, making it impossible to compare to other results. We stress, however, that obtaining numerical values for the bound without assuming a bounded loss is a significant first step. Furthermore, we consider a rather hard problem: f is not linear, so we cannot rely on a linear approximation fitting the data perfectly, and the larger the dimension, the larger the error, as illustrated by Figure 2. Thus, for any posterior Q, the quantity $E_{h \sim Q} [R (h)]$ is potentially large in practice and our bound might not be tight. Finally, notice that optimising $α$ (instead of taking $α = 1 / 2$ to recover a classic convergence rate) leads to a significantly better bound. A numerical example of this assertion is presented in Section 3.2. We aim to conduct further studies to consider the convergence rate as a hyperparameter to optimise, rather than selecting the same rate for all terms in the bound.

6. Existing Work

6.1. Germain et al., 2016

In Germain et al. [11] (Section 4), a PAC-Bayesian bound has been provided for all sub-gamma losses with a variance $t^{2}$ and scale parameter $c > 0$ , under a data distribution $μ$ and a prior P, i.e., losses such that for every $λ \in (0, \frac{1}{c})$ the following is satisfied:

$log (\frac{1}{δ} E_{h \sim P} E_{S} e^{λ (R (h) - R_{S} (h))}) \leq \frac{t^{2}}{c^{2}} (- log (1 - c λ) - λ c) \leq \frac{λ^{2} t^{2}}{2 (1 - c λ)} .$

Note that a sub-gamma loss (with regards to $μ$ and P) is potentially unbounded. Germain et al. then propose the following PAC-Bayesian bound:

Theorem 6.

Ref. [11]. If the loss ℓ is sub-gamma with a variance $t^{2}$ and scale parameter c, under the data distribution μ and a fixed prior $P \in M_{1}^{+} (H)$ , then for any $δ \in [0; 1]$ , with probability at least $1 - δ$ over size-m random samples, simultaneously for all $Q ≪ P$ we have:

$E_{h \sim Q} [R (h)] \leq E_{h \sim Q} [R_{S} (h)] + \frac{KL (Q | | P) + log (1 / δ)}{m} + \frac{t^{2}}{2 (1 - c)} .$

Theorem 6 will be quoted several times in this paper, since it is a concrete PAC-Bayesian bound designed to overcome the constraint of a bounded loss. It is also one of the few such bounds found in the literature.

Can we apply this theorem to the bounded case? The answer is yes: if ℓ is bounded by $C > 0$ , then for any $h \in H$ it holds that $R_{S} (h) - R (h) \in [- C, C]$ almost surely, so Hoeffding’s lemma gives, $\forall λ \in R,$ $log E_{z \sim μ} [e^{λ (R (h) - R_{S} (h))}] \leq \frac{λ^{2} C^{2}}{2}$ . Therefore, for any prior P, we have:

$log E_{h \sim P} E_{z \sim μ} [e^{λ (R (h) - R_{S} (h))}] \leq \frac{λ^{2} C^{2}}{2} .$

Thus, ℓ is sub-gamma with variance $C^{2}$ and scale parameter 0. Then, Theorem 6 can be applied with $t^{2} = C^{2}$ , $c = 0$ .

Comparison with Proposition 1. We remark that by taking $K = C$ and $α = 1$ in Proposition 1, we are recovering Theorem 6. However, our approach allows us to say that if we can obtain a more precise form of K such that $\forall h \in H$ , $K (h) \leq C$ and K is non-constant, Theorem 3 will ensure that:

$\frac{1}{m^{α}} log (E_{h \sim P} [exp (\frac{K {(h)}^{2}}{2 m^{1 - 2 α}})]) \leq \frac{C^{2}}{2 m^{1 - α}} .$

Thus, having precise information on the behavior of the loss function ℓ, with regards to the predictor h, allows us to obtain a tighter control of the exponential moment, and hence a tighter bound.
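The gain from a hypothesis-dependent range can be checked on a toy example. The sketch below uses a finite hypothesis set with a uniform prior and hypothetical values $K (h) \leq C$ ; none of these numbers come from the experiments of this paper. It compares the exponential moment term with the constant $C^{2} / (2 m^{1 - α})$ obtained when $K = C$ .

```python
import math

def moment_term(K_values, prior, m, alpha):
    """(1 / m^alpha) * log E_{h~P}[exp(K(h)^2 / (2 m^(1 - 2 alpha)))], finite prior."""
    scale = m ** (1.0 - 2.0 * alpha)
    expectation = sum(p * math.exp(k ** 2 / (2.0 * scale))
                      for k, p in zip(K_values, prior))
    return math.log(expectation) / m ** alpha

m, alpha, C = 100, 0.5, 2.0
K_values = [0.5, 1.0, 1.5, 2.0]       # hypothetical ranges, all at most C
prior = [0.25, 0.25, 0.25, 0.25]      # uniform prior over four hypotheses

refined = moment_term(K_values, prior, m, alpha)
constant = C ** 2 / (2.0 * m ** (1.0 - alpha))   # value obtained with K = C
```

Here `refined` is strictly smaller than `constant`, and the two coincide when every hypothesis has range exactly C.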

Remark 8.

We can see that Theorem 6 cannot control the factor $C^{2} / 2$ . However, Ref. [11] remarked on this apparent weakness and partially corrected this issue [11] (Section 4, Equations (13) and (14)). Indeed, they proposed to balance the influence of m between the different terms of the PAC-Bayes bound by providing the same convergence rate in $1 / \sqrt{m}$ to all terms.

We can then see Proposition 1 as a proper generalisation of Germain et al. [11] (Section 4, Equations (13) and (14)). Indeed, our bound exhibits properly the influence of the parameter α. Thus, we understand (and Lemma 1 proves it) that the choice of α deserves a study in itself in the way it is now a parameter of our optimisation problem. This fact has already been highlighted in Alquier et al. [10] (Theorem 4.1) (where $λ : = m^{α}$ ).

6.2. Holland, 2019

In [13], Holland proposed a PAC Bayesian inequality with unbounded loss. For that, he introduced a function $ψ$ verifying a few specific conditions, different to those used in Section 4 to define our set of softening functions. Indeed, he considered a function $ψ$ such that:

$ψ$ is bounded,
$ψ$ is non decreasing,
there exists $b > 0$ such that for all $u \in R$ :
$\begin{matrix} - log (1 - u + \frac{u^{2}}{b}) \leq ψ (u) \leq log (1 + u + \frac{u^{2}}{b}) . \end{matrix}$ (4)

We remark that, as Holland did, we supposed that our softening functions are non-decreasing. We chose softening functions to be equal to the identity function ( $x \mapsto x$ ) on $[0, 1]$ , which is quite restrictive. However, we only require softening functions to be at most the identity on $[1, + \infty)$ , whereas Holland supposed $ψ$ to be bounded and to satisfy Equation (4). A concrete example of such a function $ψ$ is the piecewise polynomial function of Catoni and Giulini [21], defined by:

$ψ (u) = \begin{cases} - 2 \sqrt{2} / 3 & \text{if } u \leq - \sqrt{2} \\ u - u^{3} / 6 & \text{if } u \in [- \sqrt{2}, \sqrt{2}] \\ 2 \sqrt{2} / 3 & \text{otherwise} . \end{cases}$
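The following sketch implements this function with breakpoints at $\pm \sqrt{2}$ , where $u - u^{3} / 6$ reaches $\pm 2 \sqrt{2} / 3$ so that $ψ$ is continuous, and checks the envelope of Equation (4) on a grid. The value $b = 2$ is our assumption (the text only requires that some $b > 0$ exists), and the grid and tolerance are arbitrary.

```python
import math

SQRT2 = math.sqrt(2.0)

def psi_cg(u: float) -> float:
    """Piecewise polynomial function of Catoni and Giulini."""
    if u <= -SQRT2:
        return -2.0 * SQRT2 / 3.0
    if u >= SQRT2:
        return 2.0 * SQRT2 / 3.0
    return u - u ** 3 / 6.0

def within_envelope(u: float, b: float = 2.0, tol: float = 1e-12) -> bool:
    """Check -log(1 - u + u^2/b) <= psi(u) <= log(1 + u + u^2/b)."""
    lower = -math.log(1.0 - u + u ** 2 / b)
    upper = math.log(1.0 + u + u ** 2 / b)
    return lower - tol <= psi_cg(u) <= upper + tol

grid = [-5.0 + 0.005 * i for i in range(2001)]   # points in [-5, 5]
all_ok = all(within_envelope(u) for u in grid)
```

A numerical check on a grid is of course not a proof; it only confirms that the three conditions are compatible for this choice of $ψ$ and $b$ .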

As in Section 4, we are considering the $ψ$ -empirical risk $R_{S, ψ, t}$ for any $t > 0$ . Holland’s theorem holds under the following assumptions:

Bounds on lower-order moments. For all $h \in H$ , we have $E_{Z \sim μ} [ℓ {(h, Z)}^{2}] \leq M_{2} < + \infty$ and $E_{Z \sim μ} [ℓ {(h, Z)}^{3}] \leq M_{3} < + \infty$ .
Bounds on the risk. For all $h \in H$ , we suppose $R (h) \leq \sqrt{m M_{2} / (4 log (δ^{- 1}))}$ .
Large enough confidence. We require $δ \leq e^{- 1 / 9}$ .

Now we can state Holland’s theorem.

Theorem 7.

Ref. [13]. Let P be a prior distribution on model $H$ . Let the three assumptions listed above hold. Setting $t^{2} = m M_{2} / (2 log (δ^{- 1}))$ , then for any $δ \in [0; 1]$ , with probability of at least $1 - δ$ over the random draw of the size-m sample S, simultaneously for all Q it holds that:

$\begin{matrix} E_{h \sim Q} [R (h)] & \leq E_{h \sim Q} [R_{S, ψ, t} (h)] + \frac{1}{\sqrt{m}} (KL (Q | | P) + \frac{1}{2} log (\frac{8 π M_{2}}{δ^{2}}) - 1) \\ + \frac{1}{\sqrt{m}} ν^{*} (H) + O (\frac{1}{m}) \end{matrix}$

where:

$ν^{*} (H) : = \frac{E_{h \sim P} [exp (\sqrt{m} (R (h) - R_{S, ψ, t} (h)))]}{E_{h \sim P} [exp (R (h) - R_{S, ψ, t} (h))]} .$

7. Proofs

7.1. Proof of Theorem 1

Proof.

Let $F : R^{+} \times R^{+} \mapsto R$ be a convex function, P a fixed prior, and $δ \in [0, 1]$ . Since $E_{h \sim P} [e^{F (R_{S} (h), R (h))}]$ is a nonnegative random variable, we know that, by Markov’s inequality, for any $h \in H$ :

$P (E_{h \sim P} [e^{F (R_{S} (h), R (h))}] > \frac{1}{δ} E_{S} E_{h \sim P} [e^{F (R_{S} (h), R (h))}]) \leq δ .$

So with probability of at least $1 - δ$ , we have:

$E_{h \sim P} [e^{F (R_{S} (h), R (h))}] \leq \frac{1}{δ} E_{S} E_{h \sim P} [e^{F (R_{S} (h), R (h))}] = \frac{χ}{δ} .$

Applying the log function on each side of this inequality gives us with probability of at least $1 - δ$ over samples S:

$log (E_{h \sim P} [e^{F (R_{S} (h), R (h))}]) \leq log (\frac{χ}{δ}) .$

We now rename $A : = log (E_{h \sim P} [e^{F (R_{S} (h), R (h))}])$ .

Furthermore, if we denote by $\frac{d Q}{d P}$ the Radon-Nikodym derivative of Q with respect to P when $Q ≪ P$ , we then have, for all Q such that $Q \sim P$ :

$\begin{matrix} A & = log (E_{h \sim Q} [\frac{d P}{d Q} e^{F (R_{S} (h), R (h))}]) \\ = log (E_{h \sim Q} [{(\frac{d Q}{d P})}^{- 1} e^{F (R_{S} (h), R (h))}]) & (\frac{d P}{d Q} = {(\frac{d Q}{d P})}^{- 1}) \end{matrix}$

and by concavity of log and Jensen’s inequality,

$\begin{matrix} \geq - E_{h \sim Q} [log (\frac{d Q}{d P})] + E_{h \sim Q} [F (R_{S} (h), R (h))] \\ = - KL (Q | | P) + E_{h \sim Q} [F (R_{S} (h), R (h))] \end{matrix}$

while by convexity of F with Jensen’s inequality,

$\begin{matrix} \geq - KL (Q | | P) + F (E_{h \sim Q} [R_{S} (h)], E_{h \sim Q} [R (h)]) . \end{matrix}$

Hence, for Q such that $Q \sim P$ ,

$F (E_{h \sim Q} [R_{S} (h)], E_{h \sim Q} [R (h)]) \leq KL (Q | | P) + A .$

So with probability $1 - δ$ , for Q such that $Q \sim P$ ,

$\begin{matrix} F (E_{h \sim Q} [R_{S} (h)], E_{h \sim Q} [R (h)]) \leq KL (Q | | P) + log (\frac{χ}{δ}) . \end{matrix}$

This completes the proof of Theorem 1. □
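The change-of-measure step at the heart of this proof, namely $A \geq E_{h \sim Q} [F] - KL (Q | | P)$ , can be checked numerically on a finite hypothesis set. In the sketch below the values of F and the distributions P and Q are hypothetical and only serve as an example.

```python
import math

def kl(q, p):
    """KL(Q || P) for finite distributions with the same support."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0.0)

def log_prior_moment(F_values, p):
    """A = log E_{h~P}[exp(F(h))]."""
    return math.log(sum(pi * math.exp(fv) for fv, pi in zip(F_values, p)))

F_values = [0.3, -1.2, 2.0, 0.7]      # hypothetical values of F(R_S(h), R(h))
P = [0.25, 0.25, 0.25, 0.25]
Q = [0.1, 0.2, 0.6, 0.1]

A = log_prior_moment(F_values, P)
lower = sum(qi * fv for qi, fv in zip(Q, F_values)) - kl(Q, P)
```

The inequality `lower <= A` holds for every Q, with equality when Q is proportional to $P e^{F}$ .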

7.2. Proof of Theorem 5

We first provide a technical property. Recall that:

$ξ = E_{h \sim P} [exp (\frac{K {(h)}^{2}}{2 m^{1 - 2 α}})] .$

Proposition 3.

Let $α \in R$ . Suppose the loss ℓ is HYPE(K) compliant with $K (h) = B | | h | | + C$ , with $B > 0$ , $C \geq 0$ . Then, for any Gaussian prior $P = N (0, σ^{2} I_{N})$ with $σ^{2} = t \frac{m^{1 - 2 α}}{B^{2}}$ , $0 < t < 1$ and $N \geq 6$ we have:

$ξ \leq 2 exp (\frac{C^{2}}{2 m^{1 - 2 α} f (t)} (1 + f (t))) \frac{1}{{(\sqrt{1 - t})}^{N}} {(1 + (\frac{C}{\sqrt{2 f (t) m^{1 - 2 α}}}))}^{N - 1}$

with $f (t) = \frac{1 - t}{t}$ .

Proof.

We recall that $σ^{2} = t \frac{m^{1 - 2 α}}{B^{2}}$ . Writing out the expectation and $K (h)$ explicitly, we obtain:

$\begin{matrix} ξ & = {(\frac{1}{\sqrt{2 π σ^{2}}})}^{N} \int_{h \in R^{N}} exp (\frac{{(B | | h | | + C)}^{2}}{2 m^{1 - 2 α}} - \frac{{| | h | |}^{2} B^{2}}{2 t m^{1 - 2 α}}) d h \\ = {(\frac{1}{\sqrt{2 π σ^{2}}})}^{N} \int_{h \in R^{N}} exp (- \frac{1}{2 m^{1 - 2 α}} (f (t) B^{2} {| | h | |}^{2} - 2 B C | | h | | - C^{2})) d h \\ = {(\frac{1}{\sqrt{2 π σ^{2}}})}^{N} \int_{h \in R^{N}} exp (- \frac{B^{2} f (t)}{2 m^{1 - 2 α}} ({| | h | |}^{2} - \frac{2 C | | h | |}{B f (t)} - \frac{C^{2}}{B^{2} f (t)})) d h \\ = exp (\frac{C^{2}}{2 m^{1 - 2 α} f (t)} (1 + f (t))) \frac{1}{{(\sqrt{2 π σ^{2}})}^{N}} \int_{h \in R^{N}} exp (- \frac{B^{2} f (t)}{2 m^{1 - 2 α}} {(| | h | | - \frac{C}{B f (t)})}^{2}) d h . \end{matrix}$

We will use the spherical coordinates in N-dimensional Euclidean space given in [22]:

$φ : (h_{1}, \dots, h_{N}) \to (r, φ_{1}, \dots, φ_{N - 1})$

where in particular $r = | | h | |$ , and the Jacobian of $φ$ is given by:

$d^{N} V = r^{N - 1} \prod_{k = 1}^{N - 2} {sin}^{k} (φ_{N - 1 - k}) = r^{N - 1} d_{S^{N - 1}} V .$

Let us also recall that, as given in Blumenson [22] (page 66), the surface area of the sphere of radius 1 in N-dimensional space is:

$\int_{φ_{1}, \dots, φ_{N - 1}} d_{S^{N - 1}} V d φ_{1} \dots d φ_{N - 1} = \frac{2 {\sqrt{π}}^{N}}{Γ (\frac{N}{2})}$

where $Γ$ is the Gamma function defined as:

$Γ (x) = \int_{0}^{+ \infty} t^{x - 1} e^{- t} d t \quad \text{for } x > 0 .$
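As a quick sanity check of the surface area formula, the sketch below evaluates $2 {\sqrt{π}}^{N} / Γ (N / 2)$ with the standard library and compares it with the familiar values in dimensions 2 and 3. The function name is ours.

```python
import math

def unit_sphere_area(N: int) -> float:
    """Surface area 2 * pi^(N/2) / Gamma(N/2) of the unit sphere in R^N."""
    return 2.0 * math.pi ** (N / 2.0) / math.gamma(N / 2.0)

circle = unit_sphere_area(2)   # circumference of the unit circle, 2 * pi
sphere = unit_sphere_area(3)   # area of the unit sphere in R^3, 4 * pi
```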

Then, if we set:

$A : = \int_{h \in R^{N}} exp (- \frac{B^{2} f (t)}{2 m^{1 - 2 α}} {(| | h | | - \frac{C}{B f (t)})}^{2}) d h$

we obtain by a change of variable:

$\begin{matrix} A & = \int_{r, φ_{1}, \dots, φ_{N - 1}} exp (- \frac{B^{2} f (t)}{2 m^{1 - 2 α}} {(r - \frac{C}{B f (t)})}^{2}) d^{N} V d r d φ_{1} \dots d φ_{N - 1} \\ = (\frac{2 {\sqrt{π}}^{N}}{Γ (\frac{N}{2})}) \int_{r = 0}^{+ \infty} exp (- \frac{B^{2} f (t)}{2 m^{1 - 2 α}} {(r - \frac{C}{B f (t)})}^{2}) r^{N - 1} d r \\ = (\frac{2 {\sqrt{π}}^{N}}{Γ (\frac{N}{2})}) \int_{r = - \frac{C}{B f (t)}}^{+ \infty} {(r + \frac{C}{B f (t)})}^{N - 1} exp (- \frac{B^{2} f (t)}{2 m^{1 - 2 α}} r^{2}) d r \\ = (\frac{2 {\sqrt{π}}^{N}}{Γ (\frac{N}{2})}) \sum_{k = 0}^{N - 1} (\binom{N - 1}{k}) {(\frac{C}{B f (t)})}^{N - k - 1} \int_{r = - \frac{C}{B f (t)}}^{+ \infty} r^{k} exp (- \frac{B^{2} f (t)}{2 m^{1 - 2 α}} r^{2}) d r . \end{matrix}$

We fix a random variable X such that:

$X \sim N (0, \frac{m^{1 - 2 α}}{B^{2} f (t)}) .$

We then have for any nonnegative integer k, if k is even:

$\begin{matrix} \int_{r = - \frac{C}{B f (t)}}^{+ \infty} r^{k} exp (- \frac{B^{2} f (t)}{2 m^{1 - 2 α}} r^{2}) d r & \leq \int_{r = - \infty}^{+ \infty} r^{k} exp (- \frac{B^{2} f (t)}{2 m^{1 - 2 α}} r^{2}) d r \\ \leq \sqrt{2 π \frac{m^{1 - 2 α}}{B^{2} f (t)}} E [{| X |}^{k}] . \end{matrix}$

And if k is odd:

$\begin{matrix} \int_{r = - \frac{C}{B f (t)}}^{+ \infty} r^{k} exp (- \frac{B^{2} f (t)}{2 m^{1 - 2 α}} r^{2}) d r & \leq \int_{r = 0}^{+ \infty} r^{k} exp (- \frac{B^{2} f (t)}{2 m^{1 - 2 α}} r^{2}) d r \\ \leq \sqrt{2 π \frac{m^{1 - 2 α}}{B^{2} f (t)}} E [{| X |}^{k} 𝟙 (X \geq 0)] \\ \leq \sqrt{2 π \frac{m^{1 - 2 α}}{B^{2} f (t)}} E [{| X |}^{k}] . \end{matrix}$

So we have:

$\begin{matrix} A & \leq (\frac{2 {\sqrt{π}}^{N}}{Γ (\frac{N}{2})}) \sum_{k = 0}^{N - 1} (\binom{N - 1}{k}) {(\frac{C}{B f (t)})}^{N - k - 1} \sqrt{2 π \frac{m^{1 - 2 α}}{B^{2} f (t)}} E [{| X |}^{k}] . \end{matrix}$

As stated in [23], we have for any k:

$E [{| X |}^{k}] = {(\sqrt{\frac{m^{1 - 2 α}}{B^{2} f (t)}})}^{k} 2^{k / 2} \frac{Γ (\frac{k + 1}{2})}{\sqrt{π}} .$
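The absolute moment formula can be checked against the classical values for $k = 1, 2, 4$ . In the sketch below, `sigma` stands for the standard deviation of X (here $\sqrt{m^{1 - 2 α} / (B^{2} f (t))}$ ) and is set to an arbitrary value.

```python
import math

def abs_moment(k: int, sigma: float) -> float:
    """E|X|^k for X ~ N(0, sigma^2), using the Gamma-function formula."""
    return (sigma ** k * 2.0 ** (k / 2.0)
            * math.gamma((k + 1) / 2.0) / math.sqrt(math.pi))

sigma = 1.5
first = abs_moment(1, sigma)    # compare with sigma * sqrt(2 / pi)
second = abs_moment(2, sigma)   # compare with sigma^2
fourth = abs_moment(4, sigma)   # compare with 3 * sigma^4
```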

So finally:

$\begin{matrix} A & \leq 2 {\sqrt{π}}^{N} \sum_{k = 0}^{N - 1} (\binom{N - 1}{k}) {(\frac{C}{B f (t)})}^{N - k - 1} {(\sqrt{\frac{2 m^{1 - 2 α}}{B^{2} f (t)}})}^{k + 1} \frac{Γ (\frac{k + 1}{2})}{Γ (\frac{N}{2})} . \end{matrix}$

Lemma 3.

If $N \geq 6$ , then:

$max_{k = 0 . . N - 1} \frac{Γ (\frac{k + 1}{2})}{Γ (\frac{N}{2})} = 1 .$

Proof.

As noted in the introduction of Srinivasan and Zvengrowski [24], Gauss [25] (page 147) proved that on the interval $[x_{0}, + \infty)$ where $x_{0} \in [1.46, 1.47]$ , $Γ$ is a monotonic increasing function. So, for $N - 1 \geq k \geq 2, Γ (\frac{k + 1}{2}) \leq Γ (\frac{N}{2})$ . And because $Γ (1 / 2) = \sqrt{π}, Γ (1) = 1$ , we have:

$max_{k = 0 . . N - 1} \frac{Γ (\frac{k + 1}{2})}{Γ (\frac{N}{2})} = max (\frac{\sqrt{π}}{Γ (\frac{N}{2})}, \frac{Γ (\frac{N - 1 + 1}{2})}{Γ (\frac{N}{2})}) = max (\frac{\sqrt{π}}{Γ (\frac{N}{2})}, 1)$

Because $N \geq 6$ , and $Γ$ is monotone and increasing on $[3; + \infty)$ , we have $Γ (N / 2) \geq Γ (3) = 2 \geq \sqrt{π}$ . Hence the result. □
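Lemma 3 is easy to verify numerically, and the same check shows why the condition $N \geq 6$ is needed: for $N = 5$ the term $k = 0$ gives $\sqrt{π} / Γ (5 / 2) = 4 / 3 > 1$ . The range of dimensions tested below is an arbitrary choice.

```python
import math

def max_gamma_ratio(N: int) -> float:
    """max over k = 0..N-1 of Gamma((k + 1) / 2) / Gamma(N / 2)."""
    denom = math.gamma(N / 2.0)
    return max(math.gamma((k + 1) / 2.0) / denom for k in range(N))

holds = all(max_gamma_ratio(N) == 1.0 for N in range(6, 61))
ratio_at_5 = max_gamma_ratio(5)   # larger than 1, so the lemma fails for N = 5
```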

Using Lemma 3 allows us to write:

$\begin{matrix} A & \leq 2 {\sqrt{π}}^{N} \sum_{k = 0}^{N - 1} (\binom{N - 1}{k}) {(\frac{C}{B f (t)})}^{N - k - 1} {(\sqrt{\frac{2 m^{1 - 2 α}}{B^{2} f (t)}})}^{k + 1} . \end{matrix}$

We recall that $σ^{2} = t \frac{m^{1 - 2 α}}{B^{2}}$ and $f (t) = \frac{1 - t}{t}$ . Then we can write:

$A \leq 2 {\sqrt{π}}^{N} \sum_{k = 0}^{N - 1} (\binom{N - 1}{k}) {(\frac{C}{B f (t)})}^{N - k - 1} {(\sqrt{\frac{2 σ^{2}}{1 - t}})}^{k + 1} .$

We now conclude with the final bound on $ξ$ :

$\begin{matrix} ξ & \leq exp (\frac{C^{2}}{2 m^{1 - 2 α} f (t)} (1 + f (t))) \frac{1}{{(\sqrt{2 π σ^{2}})}^{N}} A \\ \leq exp (\frac{C^{2}}{2 m^{1 - 2 α} f (t)} (1 + f (t))) \frac{1}{{(\sqrt{2 π σ^{2}})}^{N}} 2 {\sqrt{π}}^{N} \sum_{k = 0}^{N - 1} (\binom{N - 1}{k}) {(\frac{C}{B f (t)})}^{N - k - 1} {(\sqrt{\frac{2 σ^{2}}{1 - t}})}^{k + 1} \\ \leq 2 exp (\frac{C^{2}}{2 m^{1 - 2 α} f (t)} (1 + f (t))) \sum_{k = 0}^{N - 1} (\binom{N - 1}{k}) {(\frac{C}{B f (t)})}^{N - k - 1} {(\sqrt{\frac{1}{1 - t}})}^{k + 1} {(\sqrt{\frac{B^{2}}{2 t m^{1 - 2 α}}})}^{N - k - 1} \\ \leq 2 exp (\frac{C^{2}}{2 m^{1 - 2 α} f (t)} (1 + f (t))) \sum_{k = 0}^{N - 1} (\binom{N - 1}{k}) {(\frac{C \sqrt{t}}{(1 - t) \sqrt{2 m^{1 - 2 α}}})}^{N - k - 1} {(\sqrt{\frac{1}{1 - t}})}^{k + 1} \\ \leq 2 \frac{exp (\frac{C^{2}}{2 m^{1 - 2 α} f (t)} (1 + f (t)))}{{(\sqrt{1 - t})}^{N}} \sum_{k = 0}^{N - 1} (\binom{N - 1}{k}) {(\frac{C}{\sqrt{2 f (t) m^{1 - 2 α}}})}^{N - k - 1} \\ \leq 2 \frac{exp (\frac{C^{2}}{2 m^{1 - 2 α} f (t)} (1 + f (t)))}{{(\sqrt{1 - t})}^{N}} {(1 + (\frac{C}{\sqrt{2 f (t) m^{1 - 2 α}}}))}^{N - 1} . \end{matrix}$

This completes the proof of Proposition 3. □
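Proposition 3 can be illustrated numerically by computing $ξ$ through its radial form and comparing it with the upper bound. The sketch below is ours: `scale` stands for $m^{1 - 2 α}$ , the integral is approximated by a trapezoidal rule truncated 15 standard deviations beyond the mode of the Gaussian factor, and the parameter values are arbitrary.

```python
import math

def xi_numeric(B, C, scale, t, N, steps=20_000):
    """xi = E_{h ~ N(0, sigma^2 I_N)}[exp((B||h|| + C)^2 / (2 scale))],
    with sigma^2 = t * scale / B^2, via a radial trapezoidal rule."""
    sigma2 = t * scale / B ** 2
    f = (1.0 - t) / t
    r_max = C / (B * f) + 15.0 * math.sqrt(scale / (B ** 2 * f))
    # log of (unit sphere area) / (2 pi sigma^2)^(N/2)
    log_const = (math.log(2.0) + 0.5 * N * math.log(math.pi)
                 - math.lgamma(N / 2.0)
                 - 0.5 * N * math.log(2.0 * math.pi * sigma2))

    def integrand(r):
        if r == 0.0:
            return 0.0                       # r^(N-1) vanishes at the origin
        return math.exp((B * r + C) ** 2 / (2.0 * scale)
                        - r ** 2 / (2.0 * sigma2)
                        + (N - 1) * math.log(r) + log_const)

    h = r_max / steps
    inner = sum(integrand(i * h) for i in range(1, steps))
    return h * (0.5 * (integrand(0.0) + integrand(r_max)) + inner)

def xi_bound(C, scale, t, N):
    """Upper bound on xi given by Proposition 3."""
    f = (1.0 - t) / t
    a = C / math.sqrt(2.0 * f * scale)
    return (2.0 * math.exp(C ** 2 * (1.0 + f) / (2.0 * scale * f))
            * (1.0 - t) ** (-N / 2.0) * (1.0 + a) ** (N - 1))

numeric = xi_numeric(B=1.0, C=1.0, scale=1.0, t=0.5, N=6)
bound = xi_bound(C=1.0, scale=1.0, t=0.5, N=6)
```

For these values the bound holds with some slack, which is expected since the proof bounds every ratio of Gamma functions by 1.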

Proof of Theorem 5.

We combine Theorem 3 with Proposition 3. We also upper-bound $N - 1$ by N. □

Author Contributions

Conceptualization, M.H., B.G. and J.S.-T.; Formal analysis, M.H., B.G. and O.R.; Project administration, B.G.; Supervision, B.G.; Writing—original draft, M.H., B.G. and O.R.; Writing—review and editing, M.H., B.G., O.R. and J.S.-T. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported in part by the U.S. Army Research Laboratory and the U. S. Army Research Office, and by the U.K. Ministry of Defence and the U.K. Engineering and Physical Sciences Research Council (EPSRC) under grant number EP/R013616/1. BG acknowledges partial support from the French National Agency for Research, grants ANR-18-CE40-0016-01 and ANR-18-CE23-0015-02.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Shawe-Taylor J., Williamson R.C. A PAC analysis of a Bayes estimator; Proceedings of the 10th Annual Conference on Computational Learning Theory; Nashville, TN, USA. 6–9 July 1997; New York, NY, USA: ACM; 1997. pp. 2–9.
2. McAllester D.A. Some PAC-Bayesian theorems; Proceedings of the Eleventh Annual Conference on Computational Learning Theory; Madison, WI, USA. 24–26 July 1998; New York, NY, USA: ACM; 1998. pp. 230–234.
3. McAllester D.A. PAC-Bayesian model averaging; Proceedings of the Twelfth Annual Conference on Computational Learning Theory; Santa Cruz, CA, USA. 7–9 July 1999; New York, NY, USA: ACM; 1999. pp. 164–170.
4. Guedj B. A Primer on PAC-Bayesian Learning. arXiv. 2019; stat.ML/1901.05353.
5. Rivasplata O., Kuzborskij I., Szepesvári C., Shawe-Taylor J. PAC-Bayes Analysis Beyond the Usual Bounds; Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020; Online. 6–12 December 2020.
6. Seeger M. PAC-Bayesian Generalization Error Bounds for Gaussian Process Classification. J. Mach. Learn. Res. 2002;3:233–269.
7. Langford J. Tutorial on practical prediction theory for classification. J. Mach. Learn. Res. 2005;6:273–306.
8. Catoni O. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics; Waite Hill, OH, USA: 2007.
9. Germain P., Lacasse A., Laviolette F., Marchand M. PAC-Bayesian Learning of Linear Classifiers; Proceedings of the 26th Annual International Conference on Machine Learning; Montreal, QC, Canada. 14–18 June 2009; New York, NY, USA: Association for Computing Machinery; 2009. pp. 353–360.
10. Alquier P., Ridgway J., Chopin N. On the properties of variational approximations of Gibbs posteriors. J. Mach. Learn. Res. 2016;17:1–41.
11. Germain P., Bach F., Lacoste A., Lacoste-Julien S. PAC-Bayesian Theory Meets Bayesian Inference. In: Lee D.D., Sugiyama M., Luxburg U.V., Guyon I., Garnett R., editors. Advances in Neural Information Processing Systems 29. Curran Associates, Inc.; New York, NY, USA: 2016. pp. 1884–1892.
12. Alquier P., Guedj B. Simpler PAC-Bayesian bounds for hostile data. Mach. Learn. 2018;107:887–902. doi: 10.1007/s10994-017-5690-0.
13. Holland M. PAC-Bayes under potentially heavy tails. In: Wallach H., Larochelle H., Beygelzimer A., d’Alché-Buc F., Fox E., Garnett R., editors. Advances in Neural Information Processing Systems 32. Curran Associates, Inc.; New York, NY, USA: 2019. pp. 2715–2724.
14. Kuzborskij I., Szepesvári C. Efron-Stein PAC-Bayesian Inequalities. arXiv. 2019; 1909.01931.
15. Shalaeva V., Fakhrizadeh Esfahani A., Germain P., Petreczky M. Improved PAC-Bayesian Bounds for Linear Regression; Proceedings of the AAAI 2020—Thirty-Fourth AAAI Conference on Artificial Intelligence; New York, NY, USA. 7–12 February 2020.
16. Lever G., Laviolette F., Shawe-Taylor J. Distribution-Dependent PAC-Bayes Priors. In: Hutter M., Stephan F., Vovk V., Zeugmann T., editors. Algorithmic Learning Theory. Springer; Berlin/Heidelberg, Germany: 2010. pp. 119–133.
17. Lever G., Laviolette F., Shawe-Taylor J. Tighter PAC-Bayes Bounds through Distribution-Dependent Priors. Theor. Comput. Sci. 2013;473:4–28. doi: 10.1016/j.tcs.2012.10.013.
18. Mhammedi Z., Grünwald P., Guedj B. PAC-Bayes Un-Expected Bernstein Inequality. In: Wallach H., Larochelle H., Beygelzimer A., d’Alché-Buc F., Fox E., Garnett R., editors. Advances in Neural Information Processing Systems 32. Curran Associates, Inc.; New York, NY, USA: 2019. pp. 12202–12213.
19. Parrado-Hernández E., Ambroladze A., Shawe-Taylor J., Sun S. PAC-Bayes bounds with data dependent priors. J. Mach. Learn. Res. 2012;13:3507–3531.
20. Catoni O. Challenging the empirical mean and empirical variance: A deviation study. Ann. Inst. H. Poincaré Probab. Statist. 2012;48:1148–1185. doi: 10.1214/11-AIHP454.
21. Catoni O., Giulini I. Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression. arXiv. 2017; math.ST/1712.02747.
22. Blumenson L.E. A Derivation of n-Dimensional Spherical Coordinates. Am. Math. Mon. 1960;67:63–66. doi: 10.2307/2308932.
23. Winkelbauer A. Moments and Absolute Moments of the Normal Distribution. arXiv. 2012; math.ST/1209.4340.
24. Srinivasan G.K., Zvengrowski P. On the Horizontal Monotonicity of |Γ(s)|. Can. Math. Bull. 2011;54:538–543. doi: 10.4153/CMB-2010-107-8.
25. Gauss C.F. Werke. Volume 3. Cambridge University Press; Cambridge, UK: 2011. Disquisitiones Generales Circa Seriem Infinitam (reprint).

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

Not applicable.

[B1-entropy-23-01330] 1.Shawe-Taylor J., Williamson R.C. A PAC analysis of a Bayes estimator; Proceedings of the 10th Annual Conference on Computational Learning Theory; Nashville, TN, USA. 6–9 July 1997; New York, NY, USA: ACM; 1997. pp. 2–9. [Google Scholar]

[B2-entropy-23-01330] 2.McAllester D.A. Some PAC-Bayesian theorems; Proceedings of the Eleventh Annual Conference on Computational Learning Theory; Madison, WI, USA. 24–26 July 1998; New York, NY, USA: ACM; 1998. pp. 230–234. [Google Scholar]

[B3-entropy-23-01330] 3.McAllester D.A. PAC-Bayesian model averaging; Proceedings of the Twelfth Annual Conference on Computational Learning Theory; Santa Cruz, CA, USA. 7–9 July 1999; New York, NY, USA: ACM; 1999. pp. 164–170. [Google Scholar]

[B4-entropy-23-01330] 4.Guedj B. A Primer on PAC-Bayesian Learning. arXiv. 2019stat.ML/1901.05353 [Google Scholar]

[B5-entropy-23-01330] 5.Rivasplata O., Kuzborskij I., Szepesvári C., Shawe-Taylor J. PAC-Bayes Analysis Beyond the Usual Bounds; Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020; Online. 6–12 December 2020. [Google Scholar]

[B6-entropy-23-01330] 6.Seeger M. PAC-Bayesian Generalization Error Bounds for Gaussian Process Classification. J. Mach. Learn. Res. 2002;3:233–269. [Google Scholar]

[B7-entropy-23-01330] 7.Langford J. Tutorial on practical prediction theory for classification. J. Mach. Learn. Res. 2005;6:273–306. [Google Scholar]

[B8-entropy-23-01330] 8.Catoni O. PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics; Waite Hill, OH, USA: 2007. [Google Scholar]

9. Germain P., Lacasse A., Laviolette F., Marchand M. PAC-Bayesian Learning of Linear Classifiers; Proceedings of the 26th Annual International Conference on Machine Learning; Montreal, QC, Canada. 14–18 June 2009; New York, NY, USA: Association for Computing Machinery; 2009. pp. 353–360.

10. Alquier P., Ridgway J., Chopin N. On the properties of variational approximations of Gibbs posteriors. J. Mach. Learn. Res. 2016;17:1–41.

11. Germain P., Bach F., Lacoste A., Lacoste-Julien S. PAC-Bayesian Theory Meets Bayesian Inference. In: Lee D.D., Sugiyama M., Luxburg U.V., Guyon I., Garnett R., editors. Advances in Neural Information Processing Systems 29. Curran Associates, Inc.; New York, NY, USA: 2016. pp. 1884–1892.

12. Alquier P., Guedj B. Simpler PAC-Bayesian bounds for hostile data. Mach. Learn. 2018;107:887–902. doi: 10.1007/s10994-017-5690-0.

13. Holland M. PAC-Bayes under potentially heavy tails. In: Wallach H., Larochelle H., Beygelzimer A., d'Alché-Buc F., Fox E., Garnett R., editors. Advances in Neural Information Processing Systems 32. Curran Associates, Inc.; New York, NY, USA: 2019. pp. 2715–2724.

14. Kuzborskij I., Szepesvári C. Efron-Stein PAC-Bayesian Inequalities. arXiv. 2019. arXiv:1909.01931.

15. Shalaeva V., Fakhrizadeh Esfahani A., Germain P., Petreczky M. Improved PAC-Bayesian Bounds for Linear Regression; Proceedings of the AAAI 2020—Thirty-Fourth AAAI Conference on Artificial Intelligence; New York, NY, USA. 7–12 February 2020.

16. Lever G., Laviolette F., Shawe-Taylor J. Distribution-Dependent PAC-Bayes Priors. In: Hutter M., Stephan F., Vovk V., Zeugmann T., editors. Algorithmic Learning Theory. Springer; Berlin/Heidelberg, Germany: 2010. pp. 119–133.

17. Lever G., Laviolette F., Shawe-Taylor J. Tighter PAC-Bayes Bounds through Distribution-Dependent Priors. Theor. Comput. Sci. 2013;473:4–28. doi: 10.1016/j.tcs.2012.10.013.

18. Mhammedi Z., Grünwald P., Guedj B. PAC-Bayes Un-Expected Bernstein Inequality. In: Wallach H., Larochelle H., Beygelzimer A., d'Alché-Buc F., Fox E., Garnett R., editors. Advances in Neural Information Processing Systems 32. Curran Associates, Inc.; New York, NY, USA: 2019. pp. 12202–12213.

19. Parrado-Hernández E., Ambroladze A., Shawe-Taylor J., Sun S. PAC-Bayes bounds with data dependent priors. J. Mach. Learn. Res. 2012;13:3507–3531.

20. Catoni O. Challenging the empirical mean and empirical variance: A deviation study. Ann. Inst. H. Poincaré Probab. Statist. 2012;48:1148–1185. doi: 10.1214/11-AIHP454.

21. Catoni O., Giulini I. Dimension-free PAC-Bayesian bounds for matrices, vectors, and linear least squares regression. arXiv. 2017. arXiv:1712.02747 [math.ST].

22. Blumenson L.E. A Derivation of n-Dimensional Spherical Coordinates. Am. Math. Mon. 1960;67:63–66. doi: 10.2307/2308932.

23. Winkelbauer A. Moments and Absolute Moments of the Normal Distribution. arXiv. 2012. arXiv:1209.4340 [math.ST].

24. Srinivasan G.K., Zvengrowski P. On the Horizontal Monotonicity of |Γ(s)|. Can. Math. Bull. 2011;54:538–543. doi: 10.4153/CMB-2010-107-8.

25. Gauss C.F. Disquisitiones Generales Circa Seriem Infinitam (reprint). In: Werke. Volume 3. Cambridge University Press; Cambridge, UK: 2011.
