Entropy. 2020 Oct 29;22(11):1229. doi: 10.3390/e22111229

A Comparison of Variational Bounds for the Information Bottleneck Functional

Bernhard C Geiger 1,*, Ian S Fischer 2
PMCID: PMC7712881  PMID: 33286997

Abstract

In this short note, we relate the variational bounds proposed in Alemi et al. (2017) and Fischer (2020) for the information bottleneck (IB) and the conditional entropy bottleneck (CEB) functional, respectively. Although the two functionals were shown to be equivalent, it was empirically observed that optimizing bounds on the CEB functional achieves better generalization performance and adversarial robustness than optimizing those on the IB functional. This work tries to shed light on this issue by showing that, in the most general setting, no ordering can be established between these variational bounds, while such an ordering can be enforced by restricting the feasible sets over which the optimizations take place. The absence of such an ordering in the general setup suggests that the variational bound on the CEB functional is either more amenable to optimization or a relevant cost function for optimization in its own right, i.e., without justification from the IB or CEB functionals.

Keywords: information bottleneck, deep learning, neural networks

1. Introduction

The celebrated information bottleneck (IB) functional [1] is a cost function for supervised lossy compression. More specifically, if X is an observation and Y a stochastically related random variable (RV) that we associate with relevance, then the IB problem aims to find an encoder eZ|X, i.e., a conditional distribution of Z given X, that minimizes

\mathcal{L}_{\mathrm{IB}} := I(X;Z) - \beta\, I(Y;Z). \qquad (1)

In (1), I(X;Z) and I(Y;Z) denote the mutual information between observation X and representation Z and between relevant variable Y and representation Z, respectively, and β is a Lagrangian parameter. The aim is to obtain a representation Z that is simultaneously compressed (small I(X;Z)) and informative about the relevant variable Y (large I(Y;Z)), and the parameter β trades between these two goals.

Recently, Fischer proposed an equivalent formulation, termed the conditional entropy bottleneck (CEB) [2]. While the IB functional inherently assumes the Markov condition Y − X − Z, the CEB is motivated from the principle of Minimum Necessary Information, which lacks this Markov condition and which aims to find a representation Z that compresses a bi-variate dataset (X,Y) while still being useful for a given task. Instantiating the principle of Minimum Necessary Information then induces a Markov condition. For example, the task of finding a representation Z that makes X and Y conditionally independent induces the Markov condition X − Z − Y, and the representation optimal w.r.t. the principle of Minimum Necessary Information turns out to be arg inf_{X−Z−Y} I(X,Y;Z), i.e., it is related to Wyner’s common information [3]. The task relevant in this work—estimating Y from a representation Z that is obtained exclusively from X—induces the Markov condition Y − X − Z and the constraint I(Y;Z) ≥ I(X;Y). A Lagrangian formulation of the constrained optimization problem inf_{I(Y;Z) ≥ I(X;Y)} I(X;Z), where the infimum is taken over all encoders eZ|X that take only X as input, yields the CEB functional (see Section 2.3 of [2])

\mathcal{L}_{\mathrm{CEB}} := I(X;Z|Y) - \gamma\, I(Y;Z). \qquad (2)

Due to the chain rule of mutual information [4] (Theorem 2.5.2), (2) is equivalent to (1) for γ = β − 1. Nevertheless, (2) has additional appeal. To this end, note that I(X;Z|Y) captures the information about X contained in the representation Z that is redundant for the task of predicting the class variable Y. In the language of [5], which essentially also proposed (2), I(X;Z|Y) thus quantifies class-conditional compression. Minimizing this class-conditional compression term I(X;Z|Y) is not in conflict with maximizing I(Y;Z), whereas minimizing I(X;Z) is (see Figure 2 in [2] and Section 2 in [5]). At the same time, as stated in [2] (p. 6), I(X;Z|Y) makes it possible to “measure in absolute terms how much more we could compress our representation at the same predictive performance”, i.e., by how much I(X;Z|Y) could potentially be further reduced without simultaneously reducing I(Y;Z).
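
The following minimal numerical sketch (a toy discrete example with arbitrarily chosen alphabet sizes and random distributions, included purely for illustration and not taken from [1] or [2]) verifies the equivalence of (1) and (2) for γ = β − 1:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary toy sizes: |X| = 4, |Y| = 3, |Z| = 5.
nx, ny, nz = 4, 3, 5
p_xy = rng.random((nx, ny)); p_xy /= p_xy.sum()                    # joint p(x, y)
enc = rng.random((nx, nz)); enc /= enc.sum(axis=1, keepdims=True)  # encoder e(z|x)

# Markov chain Y - X - Z: p(x, y, z) = p(x, y) e(z|x).
p_xyz = p_xy[:, :, None] * enc[:, None, :]
p_xz, p_yz = p_xyz.sum(axis=1), p_xyz.sum(axis=0)
p_x, p_y, p_z = p_xy.sum(axis=1), p_xy.sum(axis=0), p_xz.sum(axis=0)

I_XZ = np.sum(p_xz * np.log(p_xz / np.outer(p_x, p_z)))
I_YZ = np.sum(p_yz * np.log(p_yz / np.outer(p_y, p_z)))
# Chain rule: I(X;Z|Y) = I(X,Y;Z) - I(Y;Z).
I_XYZ = np.sum(p_xyz * np.log(p_xyz / (p_xy[:, :, None] * p_z[None, None, :])))
I_XZ_given_Y = I_XYZ - I_YZ

beta = 2.5                                     # arbitrary trade-off parameter
L_IB = I_XZ - beta * I_YZ                      # Equation (1)
L_CEB = I_XZ_given_Y - (beta - 1) * I_YZ       # Equation (2) with gamma = beta - 1
print(np.isclose(L_IB, L_CEB))                 # True
```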

Aside from these theoretical considerations that make the CEB functional preferable over the equivalent IB functional, it has been shown that minimizing variational bounds on the former achieves better performance than minimizing variational bounds on the latter [2,6]. More specifically, it was shown that the variational CEB (VCEB) achieves higher classification accuracy and better robustness against adversarial attacks than the variational IB (VIB) proposed in [7].

The exact underlying reason why VCEB outperforms VIB is currently still being investigated. Comparing these two bounds at β − 1 = γ = 1, Fischer suggests that “we may expect VIB to converge to a looser approximation of I(X;Z)=I(Y;Z)=I(X;Y)”, where the latter equation corresponds to the Minimum Necessary Information point (see Section 2.5.1 of [2]). Furthermore, Fischer and Alemi claim that VCEB “can be thought of as a tighter variational approximation to the IB objective than VIB” (see Section 2.1 of [6]). Nevertheless, the following question remains: Does VCEB outperform VIB because the variational bound of VCEB is tighter, or because VCEB is more amenable to optimization than VIB?

To partly answer this question, we compare the optimization problems corresponding to VCEB and VIB. Rather than focusing on actual (commonly neural network-based) implementations of these problems, we keep an entirely mathematical perspective and discuss the problem of finding minimizers within well-defined feasible sets (see Section 3). Our main result in Section 4 shows that the optimization problems corresponding to VCEB and VIB are indeed ordered if additional constraints are added: If VCEB is constrained to use a consistent classifier-backward encoder pair (see Definition 1 below), then (unconstrained) VIB yields a tighter approximation of the IB functional. In contrast, if VIB is constrained to use a consistent classifier-marginal pair, then (constrained and unconstrained) VCEB yields a tighter approximation. If neither VCEB nor VIB are constrained, then no ordering can be shown between the resulting optimal variational bounds. Taken together, these results indicate that the superiority of VCEB over VIB observed in [2,6] cannot be due to VCEB better approximating the IB functional. Rather, we conclude in Section 5 that the variational bound provided in [2] is either more amenable to optimization, at least when the variational terms in VCEB and VIB are implemented using neural networks (NNs), or a successful cost function for optimization in its own right, i.e., without justification from the IB or Minimum Necessary Information principles.

Related Work and Scope. Many variational bounds for mutual information have been proposed [8], and many of these bounds can be applied to the IB functional. Both the VIB and VCEB variational bounds belong to the class of Barber & Agakov bounds, cf. Section 2.1 of [8]. As an alternative example, the authors of [9] bounded the IB functional using the Donsker–Varadhan representation of mutual information. Aside from that, the IB functional has been used for NN training also without resorting to purely variational approaches. For example, the authors of [10] applied the Barber & Agakov bound to replace I(Y;Z) by the standard cross-entropy loss of a trained classifier, but used a non-parametric estimator for I(X;Z). Rather than comparing multiple variational bounds with each other, in this work we focus exclusively on the VIB [7] and VCEB [2] bounds. The structural similarity of these bounds allows a direct comparison and still yields interesting insights that can potentially carry over to other variational approaches.

We finally want to mention two works that draw conclusions similar to ours. First, Achille and Soatto [11] pointed to the fact that their choice of injecting multiplicative noise into neuron activations is not only a restriction of the feasible set over which the optimization is performed, but it can also be interpreted as a means of regularization or as an approach to perform optimization. Thus, the authors claim, there is an intricate connection between regularization (i.e., the cost function), the feasible set, and the method of optimization (see Section 9 of [11]); this claim resonates with our Section 5. Second, Wieczorek and Roth [12] investigate the difference between IB and VIB: While IB implicitly assumes the Markov condition Y − X − Z, the variational approach taken in VIB assumes that an estimate of Y is obtained from the representation Z, i.e., X − Z − Y. Dropping the former assumption makes it possible to express the difference between the VIB bound and the IB functional via mutual and lautum information, which, taken together, measure the violation of the condition Y − X − Z. The authors thus argue that dropping this condition enables VIB and similar variants to optimize over larger sets of joint distributions of X, Y, and Z. In this work, we take a slightly different approach and argue that the posterior distribution of Y given Z is approximated by a classifier with input Z that responds with a class estimate Y^. Thus, we stick to the Markov condition inherent to IB and extend it by an additional variable, resulting in Y − X − Z − Y^. As a consequence, our variational approach does not assume that X − Z − Y holds, which also leads to a larger set of joint distributions of X, Y, and Z. Finally, while [12] compares the IB functional with the VIB bound, in our work we compare two variational bounds on the IB functional with each other.

Notation. We consider a classification task with a feature RV X on ℝ^m and a class RV Y on the finite set 𝒴 of classes. The joint distribution of X and Y is denoted by pXY. In this work we are interested in representations Z of the feature RV X. This (typically real-valued) representation Z is obtained by feeding X to a stochastic encoder eZ|X, and the representation Z can be used to infer the class label by feeding it to a classifier cY^|Z. Note that this classifier yields a class estimate Y^ that need not coincide with the class RV Y. Thus, the setup of encoder, representation, and classifier yields the following Markov condition: Y − X − Z − Y^. We abuse notation and abbreviate the conditional probability (density) pW|V=v(·) of a RV W, given that another RV V assumes a certain value v, as pW|V(·|v). For example, the probability density of the representation Z for an input X=x is induced by the encoder eZ|X and is given as eZ|X(·|x).

We obtain encoder, classifier, and eventual variational distributions via solving a constrained optimization problem. For example, min_{eZ|X ∈ E} J minimizes the objective J over all encoders eZ|X from a given family E. In practice, encoder, classifier, and variational distributions are parameterized by (stochastic) feed-forward NNs. The chosen architecture has a certain influence on the feasible set; e.g., E may denote the set of encoders that can be parameterized by a NN of a given architecture.

We assume that the reader is familiar with information-theoretic quantities. More specifically, we let I(·;·) and D(·‖·) denote mutual information and Kullback–Leibler divergence, respectively. The expectation w.r.t. a RV W drawn from a distribution pW is denoted as E_{W∼pW}[·].

2. Variational Bounds on the Information Bottleneck Functional

We consider the IB principle for NN training. Specifically, we are interested in a (real-valued) representation Z, obtained directly from X, that minimizes the following functional:

\mathcal{L}_{\mathrm{IB}}(\beta) := I(X;Z) - \beta\, I(Y;Z) = I(X;Z|Y) - (\beta-1)\, I(Y;Z) =: \mathcal{L}_{\mathrm{CEB}}(\beta-1) \qquad (3)

Rather than optimizing (3) directly (which was shown to be ill-advised at least for deterministic NNs in [13]), we rely on minimizing variational upper bounds. More specifically, the authors of [7] introduced the following variational bound on LIB:

\mathcal{L}_{\mathrm{VIB}}(\beta) := \mathbb{E}_{X\sim p_X}\!\left[D\!\left(e_{Z|X}(\cdot|X)\,\middle\|\,q_Z\right)\right] - \beta\, H(Y) - \beta\, \mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log c_{\hat Y|Z}(Y|Z)\right] \qquad (4)

where eZ|X, cY^|Z, and qZ are called the encoder, classifier, and marginal. The classifier is used as a variational approximation to the distribution pY|Z. The marginal qZ is a learned distribution that aims to marginalize out the encoder eZ|X. As such, this distribution is conceptually different from a fixed (unlearned) prior distribution in a Bayesian framework as in, e.g., the variational auto-encoder [14].

As an alternative and motivated by the principle of Minimum Necessary Information, the author of [2] proposed the variational bound on the CEB functional:

\mathcal{L}_{\mathrm{VCEB}}(\beta) := \mathbb{E}_{XY\sim p_{XY}}\!\left[D\!\left(e_{Z|X}(\cdot|X)\,\middle\|\,b_{Z|Y}(\cdot|Y)\right)\right] - \beta\, H(Y) - \beta\, \mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log c_{\hat Y|Z}(Y|Z)\right] \qquad (5)

where bZ|Y is called the backward encoder, which is a variational approximation to the distribution pZ|Y.
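
Both (4) and (5) upper-bound (3) for any admissible choice of the variational distributions. The following sketch (again a toy discrete example with randomly chosen, hence non-optimal, marginal qZ, backward encoder bZ|Y, and classifier cY^|Z; all sizes and distributions are illustrative assumptions) evaluates both bounds and confirms that each dominates the IB functional:

```python
import numpy as np

rng = np.random.default_rng(1)
nx, ny, nz = 4, 3, 5
p_xy = rng.random((nx, ny)); p_xy /= p_xy.sum()                    # p(x, y)
enc = rng.random((nx, nz)); enc /= enc.sum(axis=1, keepdims=True)  # encoder e(z|x)

p_xyz = p_xy[:, :, None] * enc[:, None, :]                         # Y - X - Z chain
p_xz, p_yz = p_xyz.sum(axis=1), p_xyz.sum(axis=0)
p_x, p_y, p_z = p_xy.sum(axis=1), p_xy.sum(axis=0), p_xz.sum(axis=0)
H_Y = -np.sum(p_y * np.log(p_y))

# Arbitrary (non-optimal) variational distributions.
q = rng.random(nz); q /= q.sum()                                   # marginal q(z)
b = rng.random((ny, nz)); b /= b.sum(axis=1, keepdims=True)        # backward encoder b(z|y)
c = rng.random((nz, ny)); c /= c.sum(axis=1, keepdims=True)        # classifier c(y|z)

ce = -np.sum(p_yz.T * np.log(c))                                   # E[-log c(Y|Z)]
kl_vib = np.sum(p_x[:, None] * enc * np.log(enc / q))              # E_X[D(e || q)]
kl_vceb = np.sum(p_xyz * np.log(enc[:, None, :] / b[None, :, :]))  # E_XY[D(e || b)]

beta = 2.5
I_XZ = np.sum(p_xz * np.log(p_xz / np.outer(p_x, p_z)))
I_YZ = np.sum(p_yz * np.log(p_yz / np.outer(p_y, p_z)))
L_IB = I_XZ - beta * I_YZ                                          # Equation (3)

L_VIB = kl_vib - beta * H_Y + beta * ce                            # Equation (4)
L_VCEB = kl_vceb - (beta - 1) * H_Y + (beta - 1) * ce              # Equation (5) at beta - 1
print(L_IB <= L_VIB, L_IB <= L_VCEB)                               # True True
```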

3. Variational IB and Variational CEB as Optimization Problems

While it is known that LIB(β) ≤ LVIB(β) and LIB(β) ≤ LVCEB(β−1) for all possible pXY and all choices of eZ|X, bZ|Y, cY^|Z, and qZ, it is not obvious how LVIB(β) and LVCEB(β−1) compare during optimization. In other words, we are interested in determining whether there is an ordering between

\min_{e_{Z|X},\, c_{\hat Y|Z},\, q_Z} \mathcal{L}_{\mathrm{VIB}}(\beta) \qquad (6a)

and

\min_{e_{Z|X},\, c_{\hat Y|Z},\, b_{Z|Y}} \mathcal{L}_{\mathrm{VCEB}}(\beta-1). \qquad (6b)

Since we will always compare variational bounds for equivalent parameterizations, i.e., compare LVIB(β) with LVCEB(β−1), we will drop the arguments β and β−1 for the sake of readability.

For a fair comparison, we need to ensure that both cost functions are optimized over comparable feasible sets E, C, B, and Q for the encoder, classifier, the backward encoder, and the marginal. We make this explicit in the following assumption.

Assumption 1.

The optimizations of VCEB and VIB are performed over equivalent feasible sets. Specifically, the families E and C from which VCEB and VIB can choose encoder eZ|X and classifier cY^|Z shall be the same. Depending on the scenario, we may require that the optimization over the marginal qZ is able to choose from the same mixture models as are induced by VCEB, i.e., if bZ|Y(·|y) is a feasible solution of LVCEB, then qZ(·) = Σ_y bZ|Y(·|y) pY(y) shall also be a feasible solution for LVIB; we thus require that Q ⊇ {qZ : qZ(z) = Σ_y bZ|Y(z|y) pY(y), bZ|Y ∈ B}. Depending on the scenario, we may require that every feasible solution for the marginal qZ shall be achievable by selecting feasible backward encoders; we thus require that B ⊇ {bZ|Y : qZ(z) = Σ_y bZ|Y(z|y) pY(y), qZ ∈ Q}. If both conditions are fulfilled, then we write B ≅ Q.

We furthermore need the following definition:

Definition 1.

In the optimization of LVCEB, we say that backward encoder bZ|Y and classifier cY^|Z are a consistent pair if

c_{\hat Y|Z}(y|z) = \frac{p_Y(y)\, b_{Z|Y}(z|y)}{\sum_{y'} p_Y(y')\, b_{Z|Y}(z|y')} = \frac{p_Y(y)\, b_{Z|Y}(z|y)}{q_Z(z)} \qquad (7)

holds. In the optimization of LVIB, we say that marginal qZ and classifier cY^|Z are a consistent pair if

p_Y(y) = \sum_{z} c_{\hat Y|Z}(y|z)\, q_Z(z) \qquad (8)

holds.
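
For illustration (assuming a finite representation alphabet purely for the sake of the example), a classifier consistent with a given backward encoder in the sense of (7) is obtained by Bayes' rule through the induced mixture marginal, and the resulting classifier–marginal pair then automatically satisfies (8). The following sketch with arbitrary toy distributions demonstrates this construction; a similar construction appears in step (d) of the proof of Theorem 1 below.

```python
import numpy as np

rng = np.random.default_rng(2)
ny, nz = 3, 5                                                  # arbitrary toy sizes
p_y = rng.random(ny); p_y /= p_y.sum()                         # class prior p(y)
b = rng.random((ny, nz)); b /= b.sum(axis=1, keepdims=True)    # backward encoder b(z|y)

# Mixture marginal induced by b, and the classifier consistent with b per (7).
q = p_y @ b                                                    # q(z) = sum_y p(y) b(z|y)
c = (p_y[:, None] * b) / q                                     # c[y, z] = p(y) b(z|y) / q(z)

# The pair (c, q) is then also consistent in the sense of (8).
print(np.allclose((c * q).sum(axis=1), p_y))                   # True
```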

The restriction to consistent pairs restricts the feasible sets. For example, for VCEB, if C is large enough to contain all classifiers consistent with backward encoders in B, i.e., if C ⊇ {cY^|Z : cY^|Z(y|z) ∝ pY(y) bZ|Y(z|y), bZ|Y(·|y) ∈ B}, then the triple minimization

\min_{\substack{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ b_{Z|Y}\in\mathcal{B} \\ (c_{\hat Y|Z},\, b_{Z|Y})\ \text{consistent}}} \mathcal{L}_{\mathrm{VCEB}} \qquad (9)

is reduced to the double minimization

\min_{e_{Z|X}\in\mathcal{E},\ b_{Z|Y}\in\mathcal{B}} \mathcal{L}_{\mathrm{VCEB}}. \qquad (10)

Equivalently, one can write the joint triple minimization as a consecutive double minimization and a single minimization, where the inner minimization runs over all backward encoders consistent with the classifier chosen in the outer minimization (where the minimization over an empty set returns infinity):

\min_{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C}}\ \min_{\substack{b_{Z|Y}\in\mathcal{B}:\\ \frac{p_Y(y)\, b_{Z|Y}(z|y)}{\sum_{y'} p_Y(y')\, b_{Z|Y}(z|y')} = c_{\hat Y|Z}(y|z)}} \mathcal{L}_{\mathrm{VCEB}}. \qquad (11)

Similar considerations hold for VIB.

4. Main Results

Our first main result is negative in the sense that it shows LVIB and LVCEB cannot be ordered in general. To this end, consider the following two examples.

Example 1

(VIB < VCEB). In this example, let B ≅ Q, where B and Q are constrained, and let C be unconstrained, thus min_{cY^|Z} −E_{XYZ∼pXY eZ|X}[log cY^|Z(Y|Z)] = H(Y|Z). Suppose further that we have selected a fixed encoder eZ|X that induces the marginal and conditional distributions pZ and pZ|Y, respectively. With this, we can write

\mathbb{E}_{XY\sim p_{XY}}\!\left[D\!\left(e_{Z|X}(\cdot|X)\,\middle\|\,b_{Z|Y}(\cdot|Y)\right)\right] = I(X;Z|Y) + \mathbb{E}_{Y\sim p_Y}\!\left[D\!\left(p_{Z|Y}(\cdot|Y)\,\middle\|\,b_{Z|Y}(\cdot|Y)\right)\right] \qquad (12a)

and

\mathbb{E}_{X\sim p_X}\!\left[D\!\left(e_{Z|X}(\cdot|X)\,\middle\|\,q_Z\right)\right] = I(X;Z) + D\!\left(p_Z\,\middle\|\,q_Z\right). \qquad (12b)

Suppose that bZ|YVCEB is a minimizer of (12a) over B and that qZVCEB(z) = Σ_y pY(y) bZ|YVCEB(z|y). By the chain rule of Kullback–Leibler divergence [4] (Th. 2.5.3) and with bY|ZVCEB(y|z) = pY(y) bZ|YVCEB(z|y)/qZVCEB(z), we can expand

\begin{aligned}
D\!\left(p_{ZY}\,\middle\|\, b^{\mathrm{VCEB}}_{Z|Y}\, p_Y\right) &= D\!\left(p_Z\,\middle\|\, q_Z^{\mathrm{VCEB}}\right) + \underbrace{\mathbb{E}_{Z\sim p_Z}\!\left[D\!\left(p_{Y|Z}(\cdot|Z)\,\middle\|\, b^{\mathrm{VCEB}}_{Y|Z}(\cdot|Z)\right)\right]}_{\ge 0} \\
&= \underbrace{D\!\left(p_Y\,\middle\|\, p_Y\right)}_{=0} + \mathbb{E}_{Y\sim p_Y}\!\left[D\!\left(p_{Z|Y}(\cdot|Y)\,\middle\|\, b^{\mathrm{VCEB}}_{Z|Y}(\cdot|Y)\right)\right]
\end{aligned}

thus

\mathbb{E}_{Y\sim p_Y}\!\left[D\!\left(p_{Z|Y}(\cdot|Y)\,\middle\|\, b^{\mathrm{VCEB}}_{Z|Y}(\cdot|Y)\right)\right] \ge D\!\left(p_Z\,\middle\|\, q_Z^{\mathrm{VCEB}}\right).

Suppose that eZ|X is such that the inequality above is strict. Then,

\begin{aligned}
\min_{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z},\ b_{Z|Y}\in\mathcal{B}} \mathcal{L}_{\mathrm{VCEB}}(\beta-1) &= I(X;Z|Y) + \mathbb{E}_{Y\sim p_Y}\!\left[D\!\left(p_{Z|Y}(\cdot|Y)\,\middle\|\, b^{\mathrm{VCEB}}_{Z|Y}(\cdot|Y)\right)\right] - (\beta-1)\, I(Y;Z) \\
&> I(X;Z|Y) + D\!\left(p_Z\,\middle\|\, q_Z^{\mathrm{VCEB}}\right) - (\beta-1)\, I(Y;Z) \\
&= I(X;Z) + D\!\left(p_Z\,\middle\|\, q_Z^{\mathrm{VCEB}}\right) - \beta\, I(Y;Z) \\
&\ge \min_{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z},\ q_Z\in\mathcal{Q}} \mathcal{L}_{\mathrm{VIB}}(\beta)
\end{aligned}

where the last inequality follows because qZVCEB may not be optimal for the VIB cost function.
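
The inequality used in this example, i.e., that the class-conditional divergence dominates the marginal divergence whenever qZ is the mixture induced by bZ|Y, can be checked numerically. The following sketch uses arbitrary finite-alphabet distributions and is only an illustration of the chain-rule argument above:

```python
import numpy as np

rng = np.random.default_rng(3)
ny, nz = 3, 6                                                         # arbitrary toy sizes
p_y = rng.random(ny); p_y /= p_y.sum()                                # p(y)
p_zy = rng.random((ny, nz)); p_zy /= p_zy.sum(axis=1, keepdims=True)  # true p(z|y)
b = rng.random((ny, nz)); b /= b.sum(axis=1, keepdims=True)           # backward encoder b(z|y)

p_z = p_y @ p_zy                         # true marginal p(z)
q = p_y @ b                              # mixture marginal induced by b

lhs = np.sum(p_y[:, None] * p_zy * np.log(p_zy / b))   # E_Y[D(p_{Z|Y} || b)]
rhs = np.sum(p_z * np.log(p_z / q))                    # D(p_Z || q)
print(lhs >= rhs)                                      # True
```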

Example 2

(VIB > VCEB). Let B ≅ Q, where Q and B are unconstrained; thus, with (12) we have

\min_{b_{Z|Y}\in\mathcal{B}} \mathbb{E}_{XY\sim p_{XY}}\!\left[D\!\left(e_{Z|X}(\cdot|X)\,\middle\|\,b_{Z|Y}(\cdot|Y)\right)\right] = I(X;Z|Y) \quad\text{and}\quad \min_{q_Z\in\mathcal{Q}} \mathbb{E}_{X\sim p_X}\!\left[D\!\left(e_{Z|X}(\cdot|X)\,\middle\|\,q_Z\right)\right] = I(X;Z).

Suppose further that C is such that min_{cY^|Z ∈ C} −E_{XYZ∼pXY eZ|X}[log cY^|Z(Y|Z)] = H(Y|Z) + ε, where ε > 0. It then follows that

\begin{aligned}
\min_{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ q_Z\in\mathcal{Q}} \mathcal{L}_{\mathrm{VIB}}(\beta) &= I(X;Z) - \beta\, I(Y;Z) + \beta\varepsilon \\
&> I(X;Z|Y) - (\beta-1)\, I(Y;Z) + (\beta-1)\varepsilon = \min_{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ b_{Z|Y}\in\mathcal{B}} \mathcal{L}_{\mathrm{VCEB}}(\beta-1).
\end{aligned}

In both of these examples we have ensured that the comparison is fair in the sense of Assumption 1. Aside from showing that VIB and VCEB in general allow no ordering, additional interesting insights can be gleaned from Examples 1 and 2. First, whether VIB or VCEB yields a tighter approximation of the IB and CEB functionals for a fixed encoder depends largely on the feasible sets C and B: Constraints on C cause disadvantages for VIB, while constraints on B lead to the VCEB bound becoming looser. Second, for fixed encoders, the tightness of the respective bounds and the question of which bound is tighter do not depend on how well the IB and CEB objectives are met: These objectives are functions only of the encoder eZ|X, whereas the tightness of the variational bounds depends on C, B, and Q. (Of course, the tightness of the respective bounds after the triple optimization in (6) depends also on E, as the optimization over B and Q in Example 1 and over C in Example 2 interacts with the optimization over E in a non-trivial manner.)

Our second main result, in contrast, shows that the variational bounds can indeed be ordered if additional constraints are introduced. More specifically, if the variational bounds are restricted to consistent pairs as in Definition 1, then the following ordering can be shown. The proof of Theorem 1 is deferred to Section 6.

Theorem 1.

If VCEB is constrained to a consistent classifier-backward encoder pair, and if Q ⊇ {qZ : qZ(z) = Σ_y bZ|Y(z|y) pY(y), bZ|Y ∈ B}, then

\min_{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ q_Z\in\mathcal{Q}} \mathcal{L}_{\mathrm{VIB}} \ \le \ \min_{\substack{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ b_{Z|Y}\in\mathcal{B} \\ (c_{\hat Y|Z},\, b_{Z|Y})\ \text{consistent}}} \mathcal{L}_{\mathrm{VCEB}}. \qquad (13a)

If VIB and VCEB are constrained to a consistent classifier–marginal and classifier-backward encoder pair, respectively, and if B ⊇ {bZ|Y : bZ|Y(z|y) = cY^|Z(y|z) qZ(z)/pY(y), qZ ∈ Q, cY^|Z ∈ C}, then

\min_{\substack{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ q_Z\in\mathcal{Q} \\ (c_{\hat Y|Z},\, q_Z)\ \text{consistent}}} \mathcal{L}_{\mathrm{VIB}} \ \ge \ \min_{\substack{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ b_{Z|Y}\in\mathcal{B} \\ (c_{\hat Y|Z},\, b_{Z|Y})\ \text{consistent}}} \mathcal{L}_{\mathrm{VCEB}}. \qquad (13b)

A fortiori, (13b) continues to hold if VCEB is not constrained to a consistent classifier-backward encoder pair.

Theorem 1 thus relates the cost functions of VIB and VCEB in certain well-defined scenarios, contingent on the size of the feasible sets B and Q. If the variational approximations are implemented using NNs, then these orderings are contingent on the capacity of the NNs trained to represent the backward encoder in the case of VCEB and the marginal in the case of VIB. A few clarifying statements are now in order.

First, it is easy to imagine scenarios in which the inequalities are strict. Trivially, this is the case for (13a) if C and B do not contain a consistent pair, and for (13b) if C and Q do not. Furthermore, if the set relations in the respective conditions do not hold with equality, the optimization over the strictly larger set of, e.g., marginals in (13a), may yield strictly smaller values for the cost function LVIB.

Second, the condition that B ⊇ {bZ|Y : bZ|Y(z|y) = cY^|Z(y|z) qZ(z)/pY(y), qZ ∈ Q, cY^|Z ∈ C} is less restrictive than the condition stated in Assumption 1. This is because every backward encoder that is written as bZ|Y(z|y) = cY^|Z(y|z) qZ(z)/pY(y) for qZ ∈ Q and cY^|Z ∈ C trivially satisfies Σ_y bZ|Y(z|y) pY(y) = qZ(z). Thus, if one accepts Assumption 1 as reasonable for a fair comparison between VCEB and VIB, then one must also accept that the ordering provided in the theorem is mainly a consequence of the restriction to consistent pairs, and not of one of the optimization problems having access to a significantly larger feasible set.

Finally, if C, B, and Q are sufficiently large, i.e., if the NNs implementing the classifier, backward encoder, and marginal are sufficiently powerful, then both VCEB and VIB can be assumed to yield equally good approximations of the IB functional. To see this, let pZ, pZ|Y, and pY|Z denote the marginal and conditional distributions induced by eZ|X and note that with (12) we get

\mathcal{L}_{\mathrm{VIB}}(\beta) = \mathcal{L}_{\mathrm{IB}}(\beta) + D\!\left(p_Z\,\middle\|\,q_Z\right) + \beta\, \mathbb{E}_{Z\sim p_Z}\!\left[D\!\left(p_{Y|Z}(\cdot|Z)\,\middle\|\,c_{\hat Y|Z}(\cdot|Z)\right)\right] \qquad (14a)

and

\mathcal{L}_{\mathrm{VCEB}}(\beta-1) = \mathcal{L}_{\mathrm{IB}}(\beta) + \mathbb{E}_{Y\sim p_Y}\!\left[D\!\left(p_{Z|Y}(\cdot|Y)\,\middle\|\,b_{Z|Y}(\cdot|Y)\right)\right] + (\beta-1)\, \mathbb{E}_{Z\sim p_Z}\!\left[D\!\left(p_{Y|Z}(\cdot|Z)\,\middle\|\,c_{\hat Y|Z}(\cdot|Z)\right)\right]. \qquad (14b)

Large B and Q render the second terms in both equations close to zero for all choices of eZ|X (see Example 2), while large C renders the last terms close to zero (see Example 1). Thus, in this case not only do we have LVIB(β) ≈ LVCEB(β−1) ≈ LIB(β), but we also have that VCEB employs a consistent classifier-backward encoder pair by the fact that bZ|Y ≈ pZ|Y and cY^|Z ≈ pY|Z. Thus, one may argue that if the feasible sets are sufficiently large, the restriction to consistent pairs may not lead to significantly looser bounds.
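
The decompositions (14a) and (14b) can also be verified numerically. The following sketch evaluates both sides for a toy discrete setting with arbitrarily chosen alphabet sizes and random (non-optimal) variational distributions:

```python
import numpy as np

rng = np.random.default_rng(4)
nx, ny, nz = 4, 3, 5
p_xy = rng.random((nx, ny)); p_xy /= p_xy.sum()
enc = rng.random((nx, nz)); enc /= enc.sum(axis=1, keepdims=True)    # e(z|x)

p_xyz = p_xy[:, :, None] * enc[:, None, :]
p_xz, p_yz = p_xyz.sum(axis=1), p_xyz.sum(axis=0)
p_x, p_y, p_z = p_xy.sum(axis=1), p_xy.sum(axis=0), p_xz.sum(axis=0)
p_z_given_y = p_yz / p_y[:, None]                                    # p(z|y)
p_y_given_z = (p_yz / p_z[None, :]).T                                # p(y|z), shape (nz, ny)
H_Y = -np.sum(p_y * np.log(p_y))

# Arbitrary variational distributions.
q = rng.random(nz); q /= q.sum()
b = rng.random((ny, nz)); b /= b.sum(axis=1, keepdims=True)
c = rng.random((nz, ny)); c /= c.sum(axis=1, keepdims=True)

beta = 2.5
I_XZ = np.sum(p_xz * np.log(p_xz / np.outer(p_x, p_z)))
I_YZ = np.sum(p_yz * np.log(p_yz / np.outer(p_y, p_z)))
L_IB = I_XZ - beta * I_YZ

ce = -np.sum(p_yz.T * np.log(c))                                     # E[-log c(Y|Z)]
L_VIB = np.sum(p_x[:, None] * enc * np.log(enc / q)) - beta * H_Y + beta * ce
L_VCEB = (np.sum(p_xyz * np.log(enc[:, None, :] / b[None, :, :]))
          - (beta - 1) * H_Y + (beta - 1) * ce)

gap_q = np.sum(p_z * np.log(p_z / q))                                # D(p_Z || q_Z)
gap_c = np.sum(p_z[:, None] * p_y_given_z * np.log(p_y_given_z / c)) # E_Z[D(p_{Y|Z} || c)]
gap_b = np.sum(p_y[:, None] * p_z_given_y * np.log(p_z_given_y / b)) # E_Y[D(p_{Z|Y} || b)]

print(np.isclose(L_VIB, L_IB + gap_q + beta * gap_c))                # (14a): True
print(np.isclose(L_VCEB, L_IB + gap_b + (beta - 1) * gap_c))         # (14b): True
```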

5. Discussion

In this note we have compared the IB and CEB functionals and their respective variational approximations. While IB and CEB are shown to be equivalent, the variational approximations VIB and VCEB yield different results after optimization. Specifically, it was observed that using VCEB as a training objective for stochastic NNs outperforms VIB in terms of accuracy, adversarial robustness, and out-of-distribution detection (see Section 3.1 of [2]). In our analysis we have observed that, although in general there is no ordering between VIB and VCEB (Examples 1 and 2), the optimal values of the cost functions can be ordered if additional restrictions are imposed (Theorem 1). Specifically, if VCEB is constrained to a consistent classifier-backward encoder pair, then its optimal value cannot fall below the optimal value of VIB. If, in contrast, VIB is constrained, then the optimal value of VIB cannot fall below the optimal value of VCEB (constrained or unconstrained). Thus, as expected, adding restrictions weakens the optimization problem w.r.t. the unconstrained counterpart.

These results imply that the superiority of VCEB is not caused by enabling a tighter bound on the IB functional than VIB does. Furthermore, it was shown in Table 1 of [6] that VCEB, constrained to a consistent classifier-backward encoder pair, yields better classification accuracy and robustness against corruptions than the unconstrained VCEB objective. Since obviously

\min_{\substack{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ b_{Z|Y}\in\mathcal{B} \\ (c_{\hat Y|Z},\, b_{Z|Y})\ \text{consistent}}} \mathcal{L}_{\mathrm{VCEB}} \ \ge \ \min_{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ b_{Z|Y}\in\mathcal{B}} \mathcal{L}_{\mathrm{VCEB}} \qquad (15)

the achievable tightness of a variational bound on the IB functional appears to be even negatively correlated with generalization performance in this set of experiments. (We note that [6] only reports constrained VCEB results for the largest NN models, and the constrained models perform slightly worse on robustness to adversarial examples than the unconstrained VCEB models of the same size.)

One may hypothesize, though, that VCEB is more amenable to optimization, in the sense that it achieves a tighter bound on the IB functional when encoder, classifier, and variational distributions are implemented and optimized using NNs. However, optimizing VCEB and VIB was shown to yield very similar results in terms of a lower bound on I(X;Z) for several values of β, cf. Figure 4 of [2], which seems not to support the above hypothesis.

We therefore conclude that the superiority of (constrained) VCEB is not due to it better approximating the IB functional. While the hypothesis that the optimized VCEB functional approximates the optimized IB functional better cannot be ruled out, we will now formulate an alternative hypothesis: namely, that the VCEB cost function itself instills desirable properties in the encoder that would otherwise not be instilled when relying exclusively on the IB functional, cf. Section 5.4 of [13]. For example, neither IB nor the Minimum Necessary Information principle include a classifier cY^|Z in their formulations. Thus, by the invariance of mutual information under bijections, there may be many encoders eZ|X in the feasible set E that lead to representations Z equivalent in terms of (1) and (2). Only a few of these representations are useful in the sense that the information about the class Y can be extracted “easily”. The variational approach of using a classifier to approximate I(Y;Z), however, ensures that, among all encoders eZ|X that are equivalent under the IB principle, one is chosen such that there exists a classifier cY^|Z in C that allows inferring the class variable Y from Z with low entropy: While the IB and Minimum Necessary Information principles ensure that Z is informative about Y, the variational approaches of VIB and VCEB ensure that this information can be accessed in practice. Regarding the observed superiority of VCEB over VIB, one may argue that a variational bound relying on a backward encoder instills properties in the latent representation Z that are preferable over those that are achieved by optimizing a variational bound relying on a marginal only.

In other words, VCEB and VIB are justified as cost functions for NN training even without recourse to the IB and Minimum Necessary Information principles. This is not to say that the concept of compression, inherent in both of these principles, is not useful guidance—whether compression and generalization are causally related is the topic of an ongoing debate to which we do not want to contribute in this work. Rather, we claim that variational approaches may yield desirable properties that go beyond compression and that may be overlooked when too much focus is put on the functionals that are approximated with these variational bounds.

In combination with the variational approach, the selection of feasible sets can also have a profound impact on the properties of the representation Z. A representation Z is called disentangled if its distribution pZ factorizes. Disentanglement can thus be measured by total correlation, i.e., the Kullback–Leibler divergence between pZ and the product of its marginals (see Section 5 of [11]). Achille and Soatto have shown that selecting Q in the optimization of VIB as a family of factorized marginals is equivalent to adding a total correlation term to the IB functional, effectively encouraging disentanglement, cf. Proposition 1 in [11]. Similarly, Amjad and Geiger note that selecting B in the optimization of VCEB as a family of factorized backward encoders encourages class-conditional disentanglement; i.e., it enforces a Naive Bayes structure on the representation Z, cf. Corollary 1 & Section 3.1 of [5]. To understand the implications of these observations, it is important to note that neither disentanglement nor class-conditional disentanglement is encouraged by the IB or CEB functionals. However, by appropriately selecting the feasible sets of VIB or VCEB, disentanglement and class-conditional disentanglement can be achieved. While we leave it to the discretion of the reader to decide whether disentanglement is desirable or not, we believe that it is vital to understand that disentanglement is an achievement of optimizing a variational bound over an appropriately selected feasible set, and not one of the principles based on which these variational approaches are motivated.
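
For a discrete, low-dimensional representation, total correlation is straightforward to compute; the following sketch (with a hypothetical two-dimensional representation and arbitrary distributions, included purely for illustration) shows the quantity that the factorized feasible sets discussed above implicitly penalize:

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical two-dimensional discrete representation Z = (Z1, Z2).
p_z = rng.random((4, 4)); p_z /= p_z.sum()          # joint p(z1, z2)
p_z1, p_z2 = p_z.sum(axis=1), p_z.sum(axis=0)       # marginals

# Total correlation: KL divergence between p_Z and the product of its marginals.
tc = np.sum(p_z * np.log(p_z / np.outer(p_z1, p_z2)))
print(tc > 0)                                       # True for a generic (entangled) p_Z

# A factorized (disentangled) representation has zero total correlation.
p_fact = np.outer(p_z1, p_z2)
tc_fact = np.sum(p_fact * np.log(p_fact / np.outer(p_z1, p_z2)))
print(np.isclose(tc_fact, 0.0))                     # True
```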

6. Proof of Theorem 1

We start with the first assertion. Assume that eZ|XVCEB, bZ|YVCEB, and cY^|ZVCEB are the optimal encoder, backward encoder, and classifier in terms of the VCEB cost function under the assumption of consistency, i.e.,

\min_{\substack{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ b_{Z|Y}\in\mathcal{B} \\ (c_{\hat Y|Z},\, b_{Z|Y})\ \text{consistent}}} \mathcal{L}_{\mathrm{VCEB}} = \mathbb{E}_{XY\sim p_{XY}}\!\left[D\!\left(e^{\mathrm{VCEB}}_{Z|X}(\cdot|X)\,\middle\|\,b^{\mathrm{VCEB}}_{Z|Y}(\cdot|Y)\right)\right] - (\beta-1)\, H(Y) - (\beta-1)\, \mathbb{E}_{XYZ\sim p_{XY}e^{\mathrm{VCEB}}_{Z|X}}\!\left[\log c^{\mathrm{VCEB}}_{\hat Y|Z}(Y|Z)\right] \qquad (16)

where

c^{\mathrm{VCEB}}_{\hat Y|Z}(y|z) = \frac{p_Y(y)\, b^{\mathrm{VCEB}}_{Z|Y}(z|y)}{\sum_{y'} p_Y(y')\, b^{\mathrm{VCEB}}_{Z|Y}(z|y')} = \frac{p_Y(y)\, b^{\mathrm{VCEB}}_{Z|Y}(z|y)}{q^{*}_Z(z)}. \qquad (17)

Certainly, if C and B are such that they do not admit a consistent pair, then this minimum is infinity and the inequality holds trivially.

For the VIB optimization problem, we obtain

\begin{aligned}
&\min_{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ q_Z\in\mathcal{Q}} \mathcal{L}_{\mathrm{VIB}} \\
&\quad= \min_{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ q_Z\in\mathcal{Q}} \mathbb{E}_{X\sim p_X}\!\left[D\!\left(e_{Z|X}(\cdot|X)\,\middle\|\,q_Z\right)\right] - \beta H(Y) - \beta\,\mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log c_{\hat Y|Z}(Y|Z)\right] \\
&\quad\overset{(a)}{=} \min_{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ q_Z\in\mathcal{Q}} \mathbb{E}_{XZ\sim p_X e_{Z|X}}\!\left[\log\frac{e_{Z|X}(Z|X)}{q_Z(Z)}\right] - \beta H(Y) - \beta\,\mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log c_{\hat Y|Z}(Y|Z)\right] \\
&\quad\overset{(b)}{=} \min_{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ q_Z\in\mathcal{Q}} \mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log\frac{e_{Z|X}(Z|X)\, c^{\mathrm{VCEB}}_{\hat Y|Z}(Y|Z)}{q_Z(Z)\, c^{\mathrm{VCEB}}_{\hat Y|Z}(Y|Z)}\right] - \beta H(Y) - \beta\,\mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log c_{\hat Y|Z}(Y|Z)\right] \\
&\quad\overset{(c)}{\le} \min_{e_{Z|X}\in\mathcal{E},\ q_Z\in\mathcal{Q}} \mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log\frac{e_{Z|X}(Z|X)\, c^{\mathrm{VCEB}}_{\hat Y|Z}(Y|Z)}{q_Z(Z)\, c^{\mathrm{VCEB}}_{\hat Y|Z}(Y|Z)}\right] - \beta H(Y) - \beta\,\mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log c^{\mathrm{VCEB}}_{\hat Y|Z}(Y|Z)\right] \\
&\quad\overset{(d)}{\le} \min_{e_{Z|X}\in\mathcal{E}} \mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log\frac{e_{Z|X}(Z|X)\, c^{\mathrm{VCEB}}_{\hat Y|Z}(Y|Z)}{p_Y(Y)\, b^{\mathrm{VCEB}}_{Z|Y}(Z|Y)}\right] - \beta H(Y) - \beta\,\mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log c^{\mathrm{VCEB}}_{\hat Y|Z}(Y|Z)\right] \\
&\quad\overset{(e)}{=} \min_{e_{Z|X}\in\mathcal{E}} \mathbb{E}_{XY\sim p_{XY}}\!\left[D\!\left(e_{Z|X}(\cdot|X)\,\middle\|\,b^{\mathrm{VCEB}}_{Z|Y}(\cdot|Y)\right)\right] + \mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log\frac{c^{\mathrm{VCEB}}_{\hat Y|Z}(Y|Z)}{p_Y(Y)}\right] - \beta H(Y) - \beta\,\mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log c^{\mathrm{VCEB}}_{\hat Y|Z}(Y|Z)\right] \\
&\quad\overset{(f)}{=} \min_{e_{Z|X}\in\mathcal{E}} \mathbb{E}_{XY\sim p_{XY}}\!\left[D\!\left(e_{Z|X}(\cdot|X)\,\middle\|\,b^{\mathrm{VCEB}}_{Z|Y}(\cdot|Y)\right)\right] - (\beta-1) H(Y) - (\beta-1)\,\mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log c^{\mathrm{VCEB}}_{\hat Y|Z}(Y|Z)\right] \\
&\quad\overset{(g)}{\le} \mathbb{E}_{XY\sim p_{XY}}\!\left[D\!\left(e^{\mathrm{VCEB}}_{Z|X}(\cdot|X)\,\middle\|\,b^{\mathrm{VCEB}}_{Z|Y}(\cdot|Y)\right)\right] - (\beta-1) H(Y) - (\beta-1)\,\mathbb{E}_{XYZ\sim p_{XY}e^{\mathrm{VCEB}}_{Z|X}}\!\left[\log c^{\mathrm{VCEB}}_{\hat Y|Z}(Y|Z)\right] \\
&\quad= \min_{\substack{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ b_{Z|Y}\in\mathcal{B} \\ (c_{\hat Y|Z},\, b_{Z|Y})\ \text{consistent}}} \mathcal{L}_{\mathrm{VCEB}}
\end{aligned}

where

  • (a) follows by writing the KL divergence as an expectation of the logarithm of a ratio;

  • (b) follows by multiplying both numerator and denominator in the first term with cY^|ZVCEB;

  • (c) is because of the (potential) suboptimality of cY^|ZVCEB for the VIB cost function;

  • (d) is because Q ⊇ {qZ : qZ(z) = Σ_y bZ|Y(z|y) pY(y), bZ|Y ∈ B}, thus we may choose qZ = q*Z, where q*Z is defined in (17); and because this particular choice may be suboptimal for the VIB cost function;

  • (e) follows by splitting the logarithm

  • (f) follows by noticing that E_{XYZ∼pXY eZ|X}[log pY(Y)] = −H(Y);

  • (g) follows because eZ|XVCEB may be suboptimal for the VIB cost function.

Comparing the last line with (16) completes the proof of the first assertion.
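
The mechanism behind steps (c) through (f) can be illustrated numerically: for a fixed encoder, an arbitrary backward encoder bZ|Y, the classifier consistent with it via (17), and the induced mixture marginal q*Z, the cost LVIB(β) evaluated at this triple coincides with LVCEB(β−1). The following sketch (a toy finite-alphabet example, not part of the proof) checks this identity:

```python
import numpy as np

rng = np.random.default_rng(6)
nx, ny, nz = 4, 3, 5                                               # arbitrary toy sizes
p_xy = rng.random((nx, ny)); p_xy /= p_xy.sum()
enc = rng.random((nx, nz)); enc /= enc.sum(axis=1, keepdims=True)  # any encoder e(z|x)
b = rng.random((ny, nz)); b /= b.sum(axis=1, keepdims=True)        # any backward encoder b(z|y)

p_xyz = p_xy[:, :, None] * enc[:, None, :]
p_yz = p_xyz.sum(axis=0)
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
H_Y = -np.sum(p_y * np.log(p_y))

# Consistent pair as in (17): mixture marginal q* and Bayes classifier c.
q = p_y @ b                                  # q*(z) = sum_y p(y) b(z|y)
c = (p_y[:, None] * b) / q                   # c[y, z] = p(y) b(z|y) / q*(z)

beta = 2.5
ce = -np.sum(p_yz * np.log(c))               # E[-log c(Y|Z)]
L_VIB = np.sum(p_x[:, None] * enc * np.log(enc / q)) - beta * H_Y + beta * ce
L_VCEB = (np.sum(p_xyz * np.log(enc[:, None, :] / b[None, :, :]))
          - (beta - 1) * H_Y + (beta - 1) * ce)
print(np.isclose(L_VIB, L_VCEB))             # True: VIB at (e, c, q*) equals VCEB at (e, c, b)
```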

We next turn to the second assertion. Assume that eZ|XVIB, cY^|ZVIB, and qZVIB are the optimal encoder, classifier, and marginal in terms of the VIB cost function under the assumption of consistency, i.e.,

\min_{\substack{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ q_Z\in\mathcal{Q} \\ (c_{\hat Y|Z},\, q_Z)\ \text{consistent}}} \mathcal{L}_{\mathrm{VIB}} = \mathbb{E}_{X\sim p_X}\!\left[D\!\left(e^{\mathrm{VIB}}_{Z|X}(\cdot|X)\,\middle\|\,q^{\mathrm{VIB}}_Z\right)\right] - \beta H(Y) - \beta\,\mathbb{E}_{XYZ\sim p_{XY}e^{\mathrm{VIB}}_{Z|X}}\!\left[\log c^{\mathrm{VIB}}_{\hat Y|Z}(Y|Z)\right] \qquad (18)

where

p_Y(y) = \sum_z c^{\mathrm{VIB}}_{\hat Y|Z}(y|z)\, q^{\mathrm{VIB}}_Z(z). \qquad (19)

Again, if C and Q are such that they do not admit a consistent pair, then this minimum is infinity and the inequality holds trivially.

For the VCEB optimization problem, we obtain

\begin{aligned}
&\min_{\substack{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ b_{Z|Y}\in\mathcal{B} \\ (c_{\hat Y|Z},\, b_{Z|Y})\ \text{consistent}}} \mathcal{L}_{\mathrm{VCEB}} + (\beta-1)H(Y) \\
&\quad= \min_{\substack{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ b_{Z|Y}\in\mathcal{B} \\ (c_{\hat Y|Z},\, b_{Z|Y})\ \text{consistent}}} \mathbb{E}_{XY\sim p_{XY}}\!\left[D\!\left(e_{Z|X}(\cdot|X)\,\middle\|\,b_{Z|Y}(\cdot|Y)\right)\right] - (\beta-1)\,\mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log c_{\hat Y|Z}(Y|Z)\right] \\
&\quad\overset{(a)}{=} \min_{\substack{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ b_{Z|Y}\in\mathcal{B} \\ (c_{\hat Y|Z},\, b_{Z|Y})\ \text{consistent}}} \mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log\frac{e_{Z|X}(Z|X)}{b_{Z|Y}(Z|Y)}\right] - (\beta-1)\,\mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log c_{\hat Y|Z}(Y|Z)\right] \\
&\quad\overset{(b)}{=} \min_{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C}}\ \min_{\substack{b_{Z|Y}\in\mathcal{B}:\\ \frac{p_Y(y)\, b_{Z|Y}(z|y)}{\sum_{y'} p_Y(y')\, b_{Z|Y}(z|y')} = c_{\hat Y|Z}(y|z)}} \mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log\frac{e_{Z|X}(Z|X)}{b_{Z|Y}(Z|Y)}\right] - (\beta-1)\,\mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log c_{\hat Y|Z}(Y|Z)\right] \\
&\quad\overset{(c)}{\le} \min_{e_{Z|X}\in\mathcal{E}}\ \min_{\substack{b_{Z|Y}\in\mathcal{B}:\\ \frac{p_Y(y)\, b_{Z|Y}(z|y)}{\sum_{y'} p_Y(y')\, b_{Z|Y}(z|y')} = c^{\mathrm{VIB}}_{\hat Y|Z}(y|z)}} \mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log\frac{e_{Z|X}(Z|X)}{b_{Z|Y}(Z|Y)}\right] - (\beta-1)\,\mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log c^{\mathrm{VIB}}_{\hat Y|Z}(Y|Z)\right] \\
&\quad\overset{(d)}{\le} \min_{e_{Z|X}\in\mathcal{E}} \mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log\frac{e_{Z|X}(Z|X)}{c^{\mathrm{VIB}}_{\hat Y|Z}(Y|Z)\, q^{\mathrm{VIB}}_Z(Z)}\right] - H(Y) - (\beta-1)\,\mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log c^{\mathrm{VIB}}_{\hat Y|Z}(Y|Z)\right] \\
&\quad= \min_{e_{Z|X}\in\mathcal{E}} \mathbb{E}_{XZ\sim p_X e_{Z|X}}\!\left[\log\frac{e_{Z|X}(Z|X)}{q^{\mathrm{VIB}}_Z(Z)}\right] - H(Y) - \beta\,\mathbb{E}_{XYZ\sim p_{XY}e_{Z|X}}\!\left[\log c^{\mathrm{VIB}}_{\hat Y|Z}(Y|Z)\right] \\
&\quad\overset{(e)}{\le} \mathbb{E}_{XZ\sim p_X e^{\mathrm{VIB}}_{Z|X}}\!\left[\log\frac{e^{\mathrm{VIB}}_{Z|X}(Z|X)}{q^{\mathrm{VIB}}_Z(Z)}\right] - H(Y) - \beta\,\mathbb{E}_{XYZ\sim p_{XY}e^{\mathrm{VIB}}_{Z|X}}\!\left[\log c^{\mathrm{VIB}}_{\hat Y|Z}(Y|Z)\right] \\
&\quad= \min_{\substack{e_{Z|X}\in\mathcal{E},\ c_{\hat Y|Z}\in\mathcal{C},\ q_Z\in\mathcal{Q} \\ (c_{\hat Y|Z},\, q_Z)\ \text{consistent}}} \mathcal{L}_{\mathrm{VIB}} + (\beta-1)H(Y)
\end{aligned}

where

  • (a) follows by writing the KL divergence as an expectation of the logarithm of a ratio;

  • (b) follows by the assumption that the VCEB problem is constrained to a consistent classifier-backward encoder pair, and from (11);

  • (c) is because of the (potential) suboptimality of cY^|ZVIB for the VCEB cost function;

  • (d) follows by adding and subtracting H(Y); by choosing bZ|YVIB(z|y) = cY^|ZVIB(y|z) qZVIB(z)/pY(y), which is possible because B ⊇ {bZ|Y : bZ|Y(z|y) = cY^|Z(y|z) qZ(z)/pY(y), qZ ∈ Q, cY^|Z ∈ C}; and by the fact that this particular choice may be suboptimal for the VCEB cost function;

  • (e) follows because eZ|XVIB may be suboptimal for the VCEB cost function.

This completes the proof.□

Author Contributions

Conceptualization, formal analysis, validation, writing: B.C.G. and I.S.F.; Proof of Theorem 1: B.C.G. Both authors have read and agreed to the published version of the manuscript.

Funding

The work of Bernhard C. Geiger was supported by the iDev40 project. The iDev40 project has received funding from the ECSEL Joint Undertaking (JU) under grant agreement No 783163. The JU receives support from the European Union’s Horizon 2020 research and innovation programme. It is co-funded by the consortium members, grants from Austria, Germany, Belgium, Italy, Spain and Romania. The information and results set out in this publication are those of the authors and do not necessarily reflect the opinion of the ECSEL Joint Undertaking. The Know-Center is funded within the Austrian COMET Program-Competence Centers for Excellent Technologies - under the auspices of the Austrian Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology, the Austrian Federal Ministry of Digital and Economic Affairs, and by the State of Styria. COMET is managed by the Austrian Research Promotion Agency FFG.

Conflicts of Interest

The authors declare no conflict of interest.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1.Tishby N., Pereira F.C., Bialek W. The Information Bottleneck Method; Proceedings of the Allerton Conference on Communication, Control, and Computing; Monticello, IL, USA. 22–24 September 1999; pp. 368–377.
  • 2.Fischer I. The Conditional Entropy Bottleneck. Entropy. 2020;22:999. doi: 10.3390/e22090999.
  • 3.Wyner A. The common information of two dependent random variables. IEEE Trans. Inf. Theory. 1975;21:163–179. doi: 10.1109/TIT.1975.1055346.
  • 4.Cover T.M., Thomas J.A. Elements of Information Theory. 1st ed. John Wiley & Sons, Inc.; New York, NY, USA: 1991.
  • 5.Amjad R.A., Geiger B.C. Class-Conditional Compression and Disentanglement: Bridging the Gap between Neural Networks and Naive Bayes Classifiers. arXiv. 2019. arXiv:1906.02576.
  • 6.Fischer I., Alemi A.A. CEB Improves Model Robustness. Entropy. 2020;22:1081. doi: 10.3390/e22101081.
  • 7.Alemi A.A., Fischer I., Dillon J.V., Murphy K. Deep Variational Information Bottleneck; Proceedings of the International Conference on Learning Representations (ICLR); Toulon, France. 24–26 April 2017.
  • 8.Poole B., Ozair S., van den Oord A., Alemi A.A., Tucker G. On Variational Bounds of Mutual Information; Proceedings of the International Conference on Machine Learning (ICML); Long Beach, CA, USA. 10–15 June 2019; pp. 5171–5180.
  • 9.Belghazi M.I., Baratin A., Rajeshwar S., Ozair S., Bengio Y., Courville A., Hjelm D. Mutual Information Neural Estimation; Proceedings of the International Conference on Machine Learning (ICML); Stockholm, Sweden. 10–15 July 2018; pp. 531–540.
  • 10.Kolchinsky A., Tracey B.D., Wolpert D.H. Nonlinear Information Bottleneck. Entropy. 2019;21:1181. doi: 10.3390/e21121181.
  • 11.Achille A., Soatto S. Information Dropout: Learning Optimal Representations Through Noisy Computation. IEEE Trans. Pattern Anal. Mach. Intell. 2018;40:2897–2905. doi: 10.1109/TPAMI.2017.2784440.
  • 12.Wieczorek A., Roth V. On the difference between the Information Bottleneck and the Deep Information Bottleneck. Entropy. 2020;22:131. doi: 10.3390/e22020131.
  • 13.Amjad R.A., Geiger B.C. Learning Representations for Neural Network-Based Classification Using the Information Bottleneck Principle. IEEE Trans. Pattern Anal. Mach. Intell. 2020;42:2225–2239. doi: 10.1109/TPAMI.2019.2909031.
  • 14.Kingma D.P., Welling M. Auto-encoding variational Bayes; Proceedings of the International Conference on Learning Representations (ICLR); Banff, AB, Canada. 14–16 April 2014.
