Entropy. 2020 Jan 14;22(1):98. doi: 10.3390/e22010098

The Convex Information Bottleneck Lagrangian

Borja Rodríguez Gálvez 1,*, Ragnar Thobaben 1,*, Mikael Skoglund 1,*
PMCID: PMC7516537  PMID: 33285873

Abstract

The information bottleneck (IB) problem tackles the issue of obtaining relevant compressed representations T of some random variable X for the task of predicting Y. It is defined as a constrained optimization problem that maximizes the information the representation has about the task, I(T;Y), while ensuring that a certain level of compression r is achieved (i.e., I(X;T) ≤ r). For practical reasons, the problem is usually solved by maximizing the IB Lagrangian (i.e., L_{IB}(T;β) = I(T;Y) - βI(X;T)) for many values of β ∈ [0,1]. Then, the curve of maximal I(T;Y) for a given I(X;T) is drawn, and a representation with the desired predictability and compression is selected. It is known that, when Y is a deterministic function of X, the IB curve cannot be explored in this way, and another Lagrangian has been proposed to tackle this problem: the squared IB Lagrangian, L_{sq-IB}(T;β_{sq}) = I(T;Y) - β_{sq}I(X;T)^2. In this paper, we (i) present a general family of Lagrangians which allow for the exploration of the IB curve in all scenarios; (ii) provide the exact one-to-one mapping between the Lagrange multiplier and the desired compression rate r for known IB curve shapes; and (iii) show we can approximately obtain a specific compression level with the convex IB Lagrangian for both known and unknown IB curve shapes. This eliminates the burden of solving the optimization problem for many values of the Lagrange multiplier. That is, we prove that we can solve the original constrained problem with a single optimization.

Keywords: information bottleneck, representation learning, mutual information, optimization

1. Introduction

Let X ∈ 𝒳 and Y ∈ 𝒴 be two statistically dependent random variables with joint distribution p(X,Y). The information bottleneck (IB) [1] investigates the problem of extracting the relevant information from X for the task of predicting Y.

For this purpose, the IB defines a bottleneck variable T ∈ 𝒯 obeying the Markov chain Y → X → T, so that T acts as a representation of X. Tishby et al. [1] define the relevant information as the information the representation keeps from Y after the compression of X (i.e., I(T;Y)), provided a certain level of compression is attained (i.e., I(X;T) ≤ r). Therefore, we select the representation which yields the value of the IB curve that best fits our requirements.

Definition 1 (IB Functional).

Let X and Y be statistically dependent variables. Let Δ be the set of random variables T obeying the Markov condition Y → X → T. Then the IB functional is

F_{IB,max}(r) = max_{T ∈ Δ} I(T;Y)   s.t.   I(X;T) ≤ r,   r ∈ [0,∞).   (1)

Definition 2 (IB Curve).

The IB curve is the set of points defined by the solutions of F_{IB,max}(r) for varying values of r ∈ [0,∞).

Definition 3 (Information Plane).

The information plane is the plane defined by the axes I(T;Y) and I(X;T).

This method has been successfully applied to solve different problems from a variety of domains. For example:

  • Supervised learning. In supervised learning, we are presented with a set of n pairs of input feature and task output instances. We seek an approximation of the conditional probability distribution between the task outputs Y and the input features X. In classification tasks (i.e., when Y is a discrete random variable), the introduction of a variable T learned through the information bottleneck principle maintained the performance of standard algorithms based on the cross-entropy loss while providing more robustness to adversarial attacks and invariance to nuisances [2,3,4]. Moreover, by the nature of its definition, the information bottleneck appears to be closely related to a trade-off between accuracy on the observable set and generalization to new, unseen instances (see Section 2).

  • Clustering. In clustering, we are presented with a set of n pairs of instances of a random variable X and their attributes of interest Y. We seek groups of instances (or clusters T) such that the attributes of interest within the instances of each cluster are similar and the attributes of interest of the instances of different clusters are dissimilar. Therefore, the information bottleneck can be employed since it allows us to aim for attribute-representative clusters (maximizing the similarity between instances within the clusters) and to enforce a certain compression of the random variable X (ensuring a certain difference between instances of different clusters). This has been successfully implemented, for instance, for gene expression analysis and for word, document, stock pricing, or movie rating clustering [5,6,7].

  • Image segmentation. In image segmentation, we want to partition an image into segments such that each pixel in a region shares some attributes. If we divide the image into very small regions X (e.g., each region is a pixel or a set of pixels defined by a grid), we can consider the problem of segmentation as that of clustering the regions X based on the region attributes Y. Hence, we can use the information bottleneck so that we seek region clusters T that are maximally informative about the attributes Y (e.g., the intensity histogram bins) and maintain a level of compression of the original regions X [8].

  • Quantization. In quantization, we consider a random variable X ∈ 𝒳 such that 𝒳 is a large or continuous set. Our objective is to map X into a variable T ∈ 𝒯 such that 𝒯 is a smaller, countable set. If we fix the quantization set size to |𝒯| = r, aim at maximizing the information of the quantized variable with another random variable Y, and restrict the mapping to be deterministic, then the problem is equivalent to the information bottleneck [9,10].

  • Source coding. In source coding, we consider a data source S which generates a signal Y ∈ 𝒴, which is later perturbed by a channel C : 𝒴 → 𝒳 that outputs X. We seek a coding scheme that generates a code T ∈ 𝒯 from the output of the channel X which is as informative as possible about the original source signal Y and can be transmitted at a small rate I(X;T) ≤ r. Therefore, this problem is equivalent to the formulation of the information bottleneck [11].

Furthermore, it has been employed as a tool for development or explanation in other disciplines like reinforcement learning [12,13,14], attribution methods [15], natural language processing [16], linguistics [17] or neuroscience [18]. Moreover, it has connections with other problems such as source coding with side information (or the Wyner-Ahlswede-Körner (WAK) problem), the rate-distortion problem or the cost-capacity problem (see Sections 3, 6 and 7 from [19]).

In practice, solving a constrained optimization problem such as the IB functional is challenging. Thus, in order to avoid the non-linear constraints from the IB functional, the IB Lagrangian is defined.

Definition 4 (IB Lagrangian).

Let X and Y be statistically dependent variables. Let Δ be the set of random variables T obeying the Markov condition Y → X → T. Then we define the IB Lagrangian as

L_{IB}(T;β) = I(T;Y) - β I(X;T).   (2)

Here β ∈ [0,1] is the Lagrange multiplier which controls the trade-off between the information of Y retained and the compression of X. Note we consider β ∈ [0,1] because (i) for β ≤ 0 many uncompressed solutions such as T = X maximize L_{IB}(T;β), and (ii) for β ≥ 1 the IB Lagrangian is non-positive due to the data processing inequality (DPI) (Theorem 2.8.1 from Cover and Thomas [20]) and trivial solutions like T = const are maximizers with L_{IB}(T;β) = 0 [21].

We know the solutions of the IB Lagrangian optimization (if existent) are solutions of the IB functional by Lagrange's sufficiency theorem (Theorem 5 in Appendix A of Courcoubetis [22]). Moreover, since the IB functional is concave (Lemma 5 of Gilad-Bachrach et al. [19]), we know they exist (Theorem 6 in Appendix A of Courcoubetis [22]).

Therefore, the problem is usually solved by maximizing the IB Lagrangian with adaptations of the Blahut–Arimoto algorithm [1], deterministic annealing approaches [23], or a bottom-up greedy agglomerative clustering [6] or its improved sequential counterpart [24]. However, when provided with high-dimensional random variables X such as images, these algorithms do not scale well, and deep learning-based techniques, where the IB Lagrangian is used as the objective function, have prevailed [2,25,26].
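For discrete variables with a small, known joint distribution, these classical solutions reduce to a set of self-consistent update equations. The following is a minimal sketch (ours, not the reference implementation) of such an iterative scheme, written in the original dual convention of [1], which minimizes I(X;T) - β_{dual} I(T;Y) with β_{dual} = β^{-1} (see Appendix G):

```python
import numpy as np

def iterative_ib(pxy, beta_dual, n_t=10, n_iter=200, seed=0):
    # Self-consistent iterative IB updates for a discrete joint
    # distribution pxy[x, y]. Dual convention: minimize
    # I(X;T) - beta_dual * I(T;Y), with beta_dual = 1/beta (Appendix G).
    rng = np.random.default_rng(seed)
    px = pxy.sum(axis=1)                               # p(x)
    py_x = pxy / px[:, None]                           # p(y|x)
    pt_x = rng.dirichlet(np.ones(n_t), size=px.size)   # init p(t|x)
    for _ in range(n_iter):
        pt = px @ pt_x                                 # p(t)
        # p(y|t) = sum_x p(y|x) p(t|x) p(x) / p(t)
        py_t = (pt_x * px[:, None]).T @ py_x / (pt[:, None] + 1e-12)
        # d(x,t) = KL(p(y|x) || p(y|t))
        log_ratio = np.log(py_x[:, None, :] + 1e-12) - np.log(py_t[None, :, :] + 1e-12)
        d = (py_x[:, None, :] * log_ratio).sum(axis=-1)
        pt_x = pt[None, :] * np.exp(-beta_dual * d)    # p(t|x) update
        pt_x /= pt_x.sum(axis=1, keepdims=True)
    return pt_x
```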

Note that the IB Lagrangian optimization yields a representation T with a given performance (I(X;T), I(T;Y)) for a given β. However, there is no one-to-one mapping between β and I(X;T). Hence, we cannot directly optimize for a desired compression level r; instead, we need to perform several optimizations for different values of β and select the representation with the desired performance; e.g., [2]. The Lagrange multiplier selection is important since (i) sometimes even choices of β < 1 lead to trivial representations such that p_{T|X} = p_T, and (ii) there exist some discontinuities in the performance level w.r.t. the values of β [27].

Moreover, Kolchinsky et al. [21] recently showed how in deterministic scenarios (such as many classification problems, where an input x_i belongs to a single particular class y_i) the IB Lagrangian cannot explore the IB curve. Particularly, they showed that multiple values of β yield the same performance level and that a single value of β can result in different performance levels. To solve this issue, they introduced the squared IB Lagrangian, L_{sq-IB}(T;β_{sq}) = I(T;Y) - β_{sq} I(X;T)^2, which is able to explore the IB curve in any scenario by optimizing for different values of β_{sq}. However, even though they realized that a one-to-one mapping between β_{sq} and the compression level exists, they did not find such a mapping. Hence, multiple optimizations of the Lagrangian were still required to find the best trade-off solution.

The main contributions of this article are:

  1. We introduce a general family of Lagrangians (the convex IB Lagrangians) which are able to explore the IB curve in any scenario, and of which the squared IB Lagrangian [21] is a particular case. More importantly, the analysis made for deriving this family of Lagrangians can serve as inspiration for obtaining new Lagrangian families that solve other objective functions with intrinsic trade-offs such as the IB Lagrangian.

  2. We show that in deterministic scenarios (and other scenarios where the IB curve shape is known) one can use the convex IB Lagrangian to obtain a desired level of performance with a single optimization. That is, there is a one-to-one mapping between the Lagrange multiplier used for the optimization and the level of compression and informativeness obtained, and we provide the exact mapping. This eliminates the need for multiple optimizations to select a suitable representation.

  3. We introduce a particular case of the convex IB Lagrangians: the shifted exponential IB Lagrangian, which allows us to approximately obtain a specific compression level in any scenario. This way, we can approximately solve the initial constrained optimization problem from Equation (1) with a single optimization.

Furthermore, we provide some insight for explaining why there are discontinuities in the performance levels w.r.t. the values of the Lagrange multipliers. In a classification setting, we connect those discontinuities with the intrinsic clusterization of the representations when optimizing the IB objective.

The structure of the article is the following: In Section 2 we motivate the usage of the IB in supervised learning settings. Then, in Section 3 we outline the important results used about the IB curve in deterministic scenarios. Later, in Section 4 we introduce the convex IB Lagrangian and explain some of its properties like the bijective mapping between Lagrange multipliers and the compression level and the range of such multipliers. After that, we support our (proved) claims with some empirical evidence on the MNIST [28] and TREC-6 [29] datasets in Section 5. Finally, in Section 6 we discuss our claims and empirical results. A PyTorch [30] implementation of the article can be found at https://github.com/burklight/convex-IB-Lagrangian-PyTorch.

In Appendix A, Appendix B, Appendix C, Appendix D, Appendix E and Appendix F we provide the proofs of the theoretical results. Then, in Appendix G we show some alternative families of Lagrangians with similar properties. Later, in Appendix H we provide the precise experimental setup details to reproduce the results from the paper, along with further experimentation with different datasets and neural network architectures. To conclude, in Appendix I we show some guidelines on how to set the convex information bottleneck Lagrangians for practical problems.

2. The IB in Supervised Learning

In this section, we will first give an overview of supervised learning in order to later motivate the usage of the information bottleneck in this setting.

2.1. Supervised Learning Overview

In supervised learning we are given a dataset D_n = {(x_i, y_i)}_{i=1}^n of n pairs of input features and task outputs. In this case, X and Y are the random variables of the input features and the task outputs. We assume x_i and y_i are sampled i.i.d. from the true distribution p(X,Y) = p_{Y|X} p_X. The usual aim of supervised learning is to use the dataset D_n to learn a particular conditional distribution q_{Ŷ|X} of the task outputs given the input features, parametrized by θ, which is a good approximation of p_{Y|X}. We use Ŷ and ŷ to indicate the predicted task output random variable and its outcome. We call a supervised learning task regression when Y is continuous-valued and classification when it is discrete.

Usually, supervised learning methods employ intermediate representations of the inputs before making predictions about the outputs; e.g., hidden layers in neural networks (Chapter 5 from Bishop [31]) or transformations in a feature space through the kernel trick in kernel machines like SVMs or RVMs (Sections 7.1 and 7.2 from Bishop [31]). Let T be a possibly stochastic function of the input features X with a parametrized conditional distribution q_{T|X}; then, T obeys the Markov condition Y → X → T. The mapping from the representation to the predicted task outputs is defined by the parametrized conditional distribution q_{Ŷ|T}. Therefore, in representation-based machine learning methods, the full Markov chain is Y → X → T → Ŷ. Hence, the overall estimation of the conditional probability p_{Y|X} is given by the marginalization over the representations; i.e., q_{Ŷ|X} = E_{t∼q_{T|X}}[q_{Ŷ|T=t}] (the notation q_{Ŷ|T=t} represents the probability distribution q_{Ŷ|T}(·|t;θ); for the rest of the text, we will use the same notation to represent conditional probability distributions where the conditioning argument is given).

In order to achieve the goal of having a good estimation of the conditional probability distribution p_{Y|X}, we usually define an instantaneous cost function 𝒿 : 𝒳 × 𝒴 → ℝ. The value of this function, 𝒿(x,y;θ), serves as a heuristic to measure the loss our algorithm, parametrized by θ, incurs when trying to predict the realization of the task output y from the input realization x.

Naturally, we are interested in minimizing the expectation of the instantaneous cost function over all the possible input features and task outputs, which we call the cost function. However, since we only have a finite dataset D_n, we instead have to minimize the empirical cost function.

Definition 5 (Cost Function and Empirical Cost Function).

Let X and Y be the input features and task output random variables and x ∈ 𝒳 and y ∈ 𝒴 their realizations. Let also 𝒿 be the instantaneous cost function, θ the parametrization of our learning algorithm, and D_n = {(x_i, y_i)}_{i=1}^n the given dataset. Then, we define:

1. The cost function: J(p(X,Y); θ) = E_{(x,y)∼p(X,Y)}[𝒿(x,y;θ)]   (3)
2. The empirical cost function: Ĵ(D_n; θ) = (1/n) Σ_{i=1}^{n} 𝒿(x_i, y_i; θ)   (4)

The discrepancy between the cost and empirical cost functions is called the generalization gap or generalization error (see Section 1 of Xu and Raginsky [32], for instance). Intuitively, the smaller this gap is, the better our model generalizes; i.e., the better it will perform on new, unseen samples in terms of our cost function.

Definition 6 (Generalization Gap).

Let J(p(X,Y);θ) and J^(Dn;θ) be the cost and the empirical cost functions as defined in Definition 5. Then, the generalization gap is defined as

gen(D_n; θ) = J(p(X,Y); θ) - Ĵ(D_n; θ),   (5)

and it represents the error we incur, when the selected distribution is the one parametrized by θ, by using Ĵ(D_n;θ) instead of J(p(X,Y);θ) as the function to minimize.

Ideally, we would want to minimize the cost function. Hence, we usually try to minimize the empirical cost function and the generalization gap simultaneously. The modifications to our learning algorithm which intend to reduce the generalization gap without hurting the performance on the empirical cost function are known as regularization.

2.2. Why Do We Use the IB?

Definition 7 (Representation cross-entropy cost function).

Let X and Y be two statistically dependent variables with joint distribution p(X,Y) = p_{Y|X} p_X. Let also T be a random variable obeying the Markov condition Y → X → T, and let q_{T|X} and q_{Ŷ|T} be the encoding and decoding distributions of our model, parametrized by θ. Finally, let C(p_Z || q_Z) = E_{z∼p_Z}[-log(q_Z(z))] be the cross entropy between two probability distributions p_Z and q_Z. Then, the cross-entropy cost function is

J_{CE}(p(X,Y); θ) = E_{(x,t)∼q_{T|X}p_X}[C(q_{Y|T=t} || q_{Ŷ|T=t})] = E_{(x,y)∼p(X,Y)}[𝒿_{CE}(x,y;θ)],   (6)

where 𝒿_{CE}(x,y;θ) = E_{t∼q_{T|X=x}}[-log(q_{Ŷ|T=t}(y|t;θ))] is the instantaneous representation cross-entropy cost function, q_{Y|T} = E_{x∼p_X}[p_{Y|X=x} q_{T|X=x}/q_T], and q_T = E_{x∼p_X}[q_{T|X=x}].
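In practice, this cost is estimated with samples: we draw t ∼ q_{T|X=x} and evaluate -log(q_{Ŷ|T=t}(y|t;θ)). A minimal PyTorch sketch, assuming the Gaussian stochastic encoder used later in Section 5 (the module names are ours):

```python
import torch
import torch.nn.functional as F

def instantaneous_ce(f_enc, f_dec, x, y):
    # Single-sample Monte-Carlo estimate of j_CE(x, y; theta):
    # t ~ q(t|x) with q(t|x) = N(f_enc(x), I), then -log q(y_hat = y | t).
    mu = f_enc(x)                        # deterministic part of the encoder
    t = mu + torch.randn_like(mu)        # reparametrized sample of T
    logits = f_dec(t)                    # parametrizes q(y_hat|t)
    return F.cross_entropy(logits, y)    # batch mean of -log q(y|t)
```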

The cross-entropy is a widely used cost function in classification tasks (e.g., Teahan [8], Krizhevsky et al. [33], Shore and Gray [34]) which has many interesting properties [35]. Moreover, it is known that minimizing J_{CE}(p(X,Y);θ) maximizes the mutual information I(T;Y). That is:

Proposition 1 (Minimizing the Cross Entropy Maximizes the Mutual Information).

Let J_{CE}(p(X,Y);θ) be the representation cross-entropy cost function as defined in Definition 7. Let also I(T;Y) be the mutual information between random variables T and Y in the setting from Definition 7. Then, minimizing J_{CE}(p(X,Y);θ) implies maximizing I(T;Y).

The proof of this proposition can be found in Appendix A.

Definition 8 (Nuisance).

A nuisance is any random variable that affects the observed data X but is not informative to the task we are trying to solve. That is, Ξ is a nuisance for Y if Y ⊥ Ξ, or I(Ξ;Y) = 0.

Similarly, we know that minimizing I(X;T) minimizes the generalization gap for restricted classes when using the cross-entropy cost function (Theorem 1 of Vera et al. [36]), and when using I(T;Y) directly as an objective to maximize (Theorem 4 of Shamir et al. [37]). Furthermore, Achille and Soatto [38], in their Proposition 3.1, upper-bound the information between the representations T and the nuisances Ξ that affect the observed data with I(X;T). Therefore, minimizing I(X;T) helps generalization by not keeping useless information about Ξ in our representations.

Thus, jointly maximizing I(T;Y) and minimizing I(X;T) is a good choice both in terms of performance in the available dataset and in new, unseen data, which motivates studies on the IB.

3. The Information Bottleneck in Deterministic Scenarios

Kolchinsky et al. [21] showed that when Y is a deterministic function of X (i.e., Y=f(X)), the IB curve is piecewise linear. More precisely, it is shaped as stated in Proposition 2.

Proposition 2 (The IB Curve is Piecewise Linear in Deterministic Scenarios).

Let X be a random variable and Y=f(X) be a deterministic function of X. Let also T be the bottleneck variable that solves the IB functional. Then the IB curve in the information plane is defined by the following equation:

I(T;Y) = I(X;T),  if I(X;T) ∈ [0, I(X;Y))
I(T;Y) = I(X;Y),  if I(X;T) ≥ I(X;Y)   (7)

Furthermore, they showed that the IB curve could not be explored by optimizing the IB Lagrangian for multiple β because the curve was not strictly concave. That is, there was not a one-to-one relationship between β and the performance level.

Theorem 1 (In Deterministic Scenarios, the IB Curve cannot be Explored Using the IB Lagrangian).

Let X be a random variable and Y = f(X) be a deterministic function of X. Let also Δ be the set of random variables T obeying the Markov condition Y → X → T. Then:

  • 1. 

    Any solution T ∈ Δ such that I(X;T) ∈ [0, I(X;Y)) and I(T;Y) = I(X;T) solves arg max_{T∈Δ}{L_{IB}(T;β)} for β = 1. That is, many different compression and performance levels can be achieved for β = 1.

  • 2. 

    Any solution T ∈ Δ such that I(X;T) > I(X;Y) and I(T;Y) = I(X;Y) solves arg sup_{T∈Δ}{L_{IB}(T;β)} for β = 0. That is, many compression levels can be achieved with the same performance for β = 0.

    Note we use the supremum in this case since for β = 0 we have that I(X;T) could be infinite, and then the search set from Equation (1), i.e., {T : Y → X → T} ∩ {T : I(X;T) < ∞}, is not compact anymore.

  • 3. 

    Any solution T ∈ Δ such that I(X;T) = I(T;Y) = I(X;Y) solves arg max_{T∈Δ}{L_{IB}(T;β)} for all β ∈ (0,1). That is, many different β achieve the same compression and performance level.

An alternative proof for this theorem can be found in Appendix B.

4. The Convex IB Lagrangian

4.1. Exploring the IB Curve

Clearly, a situation like the one depicted in Theorem 1 is not desirable, since we cannot aim for different levels of compression or performance. For this reason, we generalize the effort from Kolchinsky et al. [21] and look for families of Lagrangians which are able to explore the IB curve. Inspired by the squared IB Lagrangian, L_{sq-IB}(T;β_{sq}) = I(T;Y) - β_{sq} I(X;T)^2, we look at the conditions a function of I(X;T) requires in order to be able to explore the IB curve. In this way, we realize that any monotonically increasing and strictly convex function will be able to do so, and we call the family of Lagrangians with these characteristics the convex IB Lagrangians, due to the nature of the introduced function.

Theorem 2 (Convex IB Lagrangians).

Let Δ be the set of r.v. T obeying the Markov condition Y → X → T. Then, if u is a monotonically increasing and strictly convex function, the IB curve can always be recovered by the solutions of arg max_{T∈Δ}{L_{IB,u}(T;β_u)}, with

L_{IB,u}(T;β_u) = I(T;Y) - β_u u(I(X;T)).   (8)

That is, for each point (I(X;T), I(T;Y)) s.t. dI(T;Y)/dI(X;T) > 0 there is a unique β_u for which maximizing L_{IB,u}(T;β_u) achieves this solution. Furthermore, β_u is strictly decreasing w.r.t. I(X;T). We call L_{IB,u}(T;β_u) the convex IB Lagrangian.

The proof of this theorem can be found in Appendix C. Furthermore, by exploiting the IB curve duality (Lemma 10 of Gilad-Bachrach et al. [19]) we were able to derive other families of Lagrangians which allow for the exploration of the IB curve (Appendix G).
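In gradient-based implementations, Theorem 2 translates directly into a training objective: the negated Lagrangian is minimized, with I(T;Y) handled through the cross-entropy bound of Proposition 1 and I(X;T) through a differentiable estimate (see Section 5). A minimal sketch under these assumptions (the function names are ours):

```python
import torch

def convex_ib_loss(i_ty_bound, i_xt_estimate, beta_u, u):
    # Negated convex IB Lagrangian from Equation (8), to be minimized:
    # -(I(T;Y) - beta_u * u(I(X;T))). `i_ty_bound` is a differentiable
    # lower bound on I(T;Y) (e.g., H(Y) minus the cross-entropy,
    # Proposition 1) and `i_xt_estimate` a differentiable estimate of
    # I(X;T); both tensors are assumed to come from the surrounding model.
    return -(i_ty_bound - beta_u * u(i_xt_estimate))
```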

Remark 1.

Clearly, we can see how if u is the identity function (i.e., u(I(X;T))=I(X;T)) then we end up with the normal IB Lagrangian. However, since the identity function is not strictly convex, it cannot ensure the exploration of the IB curve.

During the proof of this theorem we observed a relationship between the Lagrange multipliers and the solutions obtained from the normal IB Lagrangian L_{IB}(T;β) and the convex IB Lagrangian L_{IB,u}(T;β_u). This relationship is formalized in the following corollary.

Corollary 1 (IB Lagrangian and IB convex Lagrangian connection).

Let L_{IB}(T;β) be the IB Lagrangian and L_{IB,u}(T;β_u) the convex IB Lagrangian. Then, maximizing L_{IB}(T;β) and L_{IB,u}(T;β_u) can obtain the same point in the IB curve if β_u = β/u′(I(X;T)), where u′ is the derivative of u.

This corollary allows us to better understand why the addition of u allows for the exploration of the IB curve in deterministic scenarios. If we note that for β = 1 we can obtain any point in the increasing region of the curve, then we clearly see how evaluating u′ at different values of I(X;T) defines different values of β_u that obtain such points. Moreover, it lets us see how, if for β = 0 maximizing the IB Lagrangian can obtain any point (I(X;T), I(X;Y)) with I(X;T) > I(X;Y), then the same happens for the convex IB Lagrangian.

4.2. Aiming for a Specific Compression Level

Let B_u denote the domain of Lagrange multipliers β_u for which we can find solutions in the IB curve with the convex IB Lagrangian. Then, the convex IB Lagrangians do not only allow us to explore the IB curve with different β_u. They also allow us to identify the specific β_u that obtains a given point (I(X;T), I(T;Y)), provided we know the IB curve in the information plane. Conversely, the convex IB Lagrangian allows finding the specific point (I(X;T), I(T;Y)) that is obtained by a given β_u.

Proposition 3 (Bijective Mapping between IB Curve Point and Convex IB Lagrange multiplier).

Let the IB curve in the information plane be known; i.e., I(T;Y) = f_{IB}(I(X;T)) is known. Then there is a bijective mapping from Lagrange multipliers β_u ∈ B_u \ {0} from the convex IB Lagrangian to points in the IB curve (I(X;T), f_{IB}(I(X;T))). Furthermore, these mappings are:

β_u = (df_{IB}(I(X;T))/dI(X;T)) (1/u′(I(X;T)))   and   I(X;T) = (u′)^{-1}((df_{IB}(I(X;T))/dI(X;T)) (1/β_u)),   (9)

where u′ is the derivative of u and (u′)^{-1} is the inverse of u′.

This is especially interesting since in deterministic scenarios we know the shape of the IB curve (Proposition 2) and since the convex IB Lagrangians allow for the exploration of the IB curve (Theorem 2). A proof for Proposition 3 can be found in Appendix D.
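For instance, in deterministic scenarios the increasing region of the curve has df_{IB}/dI(X;T) = 1 (Proposition 2), so Equation (9) reduces to β_u = 1/u′(r). A small sketch of this mapping for the two u functions used later in Section 5 (the forms of u′ below are ours, derived from those definitions):

```python
import math

def beta_for_compression(r, u_prime):
    # Equation (9) with f'_IB = 1 (deterministic scenarios, Prop. 2):
    # the unique multiplier targeting compression level I(X;T) = r.
    return 1.0 / u_prime(r)

# Derivatives of the u's used in Section 5:
u_prime_pow = lambda r, alpha=1.0: (1.0 + alpha) * r ** alpha  # u(i) = i^(1+alpha)
u_prime_exp = lambda r, eta=1.0: eta * math.exp(eta * r)       # u(i) = exp(eta*i)

# e.g., aiming at r = 2 (in the same units as I(X;T), here bits) with the
# squared IB Lagrangian (alpha = 1) gives beta_sq = 1/4:
beta_sq = beta_for_compression(2.0, u_prime_pow)
```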

Remark 2.

Note that the definition from Tishby et al. [1], β = df_{IB}(I(X;T))/dI(X;T), only allows for a bijection between β and I(X;T) if f_{IB} is a strictly concave and known function, and we have seen this is not the case in deterministic scenarios (Theorem 1).

A direct result derived from this proposition is that we know the domain of Lagrange multipliers, B_u, which allows for the exploration of the IB curve if the shape of the IB curve is known. Furthermore, if the shape is not known, we can at least bound that range.

Corollary 2 (Domain of Convex IB Lagrange Multiplier with Known IB Curve Shape).

Let the IB curve in the information plane be I(T;Y) = f_{IB}(I(X;T)) and let I_{max} = I(X;Y). Let also I(X;T) = r_{max} be the minimum mutual information s.t. f_{IB}(r_{max}) = I_{max}; i.e., r_{max} = arg inf_r {f_{IB}(r)} s.t. f_{IB}(r) = I_{max}. Then, the range of Lagrange multipliers that allow the exploration of the IB curve with the convex IB Lagrangian is B_u = [β_{u,min}, β_{u,max}], with

β_{u,min} = lim_{r→r_{max}} f′_{IB}(r)/u′(r)   and   β_{u,max} = lim_{r→0^+} f′_{IB}(r)/u′(r),   (10)

where f′_{IB}(r) and u′(r) are the derivatives of f_{IB}(I(X;T)) and u(I(X;T)) w.r.t. I(X;T) evaluated at r, respectively. Also, note that there are some scenarios where r_{max} → ∞ (see, e.g., [39]); in these scenarios, β_{u,min} = lim_{r→∞} f′_{IB}(r)/u′(r) ≥ 0.

Corollary 3 (Domain of Convex IB Lagrange Multiplier Bound).

The range of the Lagrange multipliers that allow the exploration of the IB curve is contained in [0, β_{u,top}], which is in turn contained in [0, β^+_{u,top}], where

β_{u,top} = (inf_{Ω_x⊆𝒳}{β_0(Ω_x)})^{-1} / lim_{r→0^+}{u′(r)}   and   β^+_{u,top} = 1 / lim_{r→0^+}{u′(r)},   (11)

where u′(r) is the derivative of u(I(X;T)) w.r.t. I(X;T) evaluated at r, 𝒳 is the set of possible realizations of X, and β_0 and Ω_x are defined as in [27] (note that in [27] they consider the dual problem (see Appendix G), so when they refer to β^{-1} it translates to β in this article). That is, B_u ⊆ [0, β_{u,top}] ⊆ [0, β^+_{u,top}].

Corollaries 2 and 3 allow us to reduce the range search for β_u when we want to explore the IB curve. Practically, inf_{Ω_x⊆𝒳}{β_0(Ω_x)} might be difficult to calculate, so Wu et al. [27] derived an algorithm to approximate it. However, we still recommend setting the numerator to 1 for simplicity. The proofs for both corollaries are found in Appendix E and Appendix F.
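As a worked example, for the exponential IB Lagrangian in a deterministic setting (f′_{IB} = 1 on [0, r_{max}) with r_{max} = I_{max}), Equation (10) gives B_u = [(η exp(η I_{max}))^{-1}, η^{-1}]. The following sketch (ours) reproduces the range B_exp later used in Figure 3:

```python
import math

def exp_lagrangian_domain(eta, i_max):
    # Corollary 2 for u(i) = exp(eta * i) with f'_IB = 1 on [0, r_max)
    # and r_max = I_max: B_u = [1/(eta*exp(eta*I_max)), 1/eta].
    return 1.0 / (eta * math.exp(eta * i_max)), 1.0 / eta

# MNIST: I_max = I(X;Y) = H(Y) = log2(10) bits; with eta = 3 this gives
# B_exp ~ [1.56e-5, 1/3], the range shown in Figure 3.
lo, hi = exp_lagrangian_domain(eta=3.0, i_max=math.log2(10))
```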

5. Experimental Support

In order to showcase our claims, we use the MNIST [28] and the TREC-6 [29] datasets. We modify the nonlinear-IB method [26], which is a neural network that minimizes the cross-entropy while also minimizing a differentiable kernel-based estimate of I(X;T) [40]. We then use this technique to maximize a lower bound on the convex IB Lagrangians by applying the functions u to the I(X;T) estimate.

The network structure is the following: first, a stochastic encoder T = f_enc(X;θ) + W with p_W = 𝒩(0, I_d) such that T ∈ ℝ^d, where d is the dimension of the bottleneck variable (note that the encoder needs to be stochastic to (i) ensure a finite and well-defined mutual information [21,41] and (ii) make gradient-based optimization methods over the IB Lagrangian useful [41]); second, a deterministic decoder q_{Ŷ|T} = f_dec(T;θ). For the MNIST dataset, both the encoder and the decoder are fully-connected networks, for a fair comparison with [26]. For the TREC-6 dataset, the encoder is a set of convolutions of word embeddings followed by a fully-connected network, and the decoder is also a fully-connected network. For further details about the experiment setup, additional results for different values of α and η, and supplementary experimental results for different datasets and network architectures, please refer to Appendix H.
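A condensed sketch of this setup for MNIST follows (layer sizes as described in Appendix H; the mutual information estimator is abstracted away and sketched there as well; all names are ours, not the reference implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonlinearConvexIB(nn.Module):
    # Stochastic encoder T = f_enc(X) + W, W ~ N(0, I_d), and a
    # deterministic decoder, trained with the convex IB Lagrangian.
    def __init__(self, d=2):
        super().__init__()
        self.f_enc = nn.Sequential(nn.Linear(784, 800), nn.ReLU(),
                                   nn.Linear(800, 800), nn.ReLU(),
                                   nn.Linear(800, d))
        self.f_dec = nn.Sequential(nn.Linear(d, 800), nn.ReLU(),
                                   nn.Linear(800, 10))

    def loss(self, x, y, beta_u, u, mi_estimate):
        mu = self.f_enc(x)
        t = mu + torch.randn_like(mu)            # T = f_enc(X) + W
        ce = F.cross_entropy(self.f_dec(t), y)   # cross-entropy term (Prop. 1)
        i_xt = mi_estimate(mu)                   # kernel-based estimate of I(X;T)
        return ce + beta_u * u(i_xt)             # minimizing this maximizes L_IB,u
```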

In Figure 1 we show our results for two particularizations of the convex IB Lagrangians:

  1. the power IB Lagrangians: L_{IB,pow}(T;β_{pow},α) = I(T;Y) - β_{pow} I(X;T)^{(1+α)}, with α > 0 (note that when α = 1 we recover the squared IB functional from Kolchinsky et al. [21]);

  2. the exponential IB Lagrangians: L_{IB,exp}(T;β_{exp},η) = I(T;Y) - β_{exp} exp(η I(X;T)), with η > 0. (Both u functions are sketched in code right after this list.)
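The two particularizations only differ in the choice of u; as a sketch (assuming the I(X;T) estimate is a PyTorch scalar):

```python
import torch

# u for the power IB Lagrangian: u(i) = i^(1+alpha), alpha > 0
def u_pow(i_xt, alpha=1.0):
    return i_xt ** (1.0 + alpha)

# u for the exponential IB Lagrangian: u(i) = exp(eta * i), eta > 0
def u_exp(i_xt, eta=1.0):
    return torch.exp(eta * i_xt)
```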

Figure 1.


The top row shows the results for the power information bottleneck (IB) Lagrangian with α = 1, and the bottom row for the exponential IB Lagrangian with η = 1, both on the MNIST dataset. In each row, from left to right, it is shown (i) the information plane, where the region of possible solutions of the IB problem is shadowed in light orange and the information-theoretic limits are the dashed orange line; (ii) I(T;Y) as a function of β_u; and (iii) the compression I(X;T) as a function of β_u. In all plots, the red crosses joined by a dotted line represent the values computed with the training set, the blue dots the values computed with the validation set, and the green stars the theoretical values computed as dictated by Proposition 3. Moreover, in all plots, I(X;Y) = H(Y) = log2(10) is indicated with a dashed orange line. All values are shown in bits.

We can clearly see how both Lagrangians are able to explore the IB curve (first column of Figure 1) and how the theoretical performance trend of the Lagrangians matches the experimental results (second and third columns of Figure 1). There are small mismatches between the theoretical and experimental performance. This is because using the nonlinear-IB, as stated by Kolchinsky et al. [21], does not guarantee that we find optimal representations, due to factors like (i) inaccurate estimation of I(X;T), (ii) restrictions on the structure of T, (iii) use of an estimation of the decoder instead of the real one, and (iv) the typical non-convex optimization issues that arise with gradient-based methods. The main difference comes from the discontinuities in performance for increasing β, whose cause is still unknown (cf. Wu et al. [27]). It has been observed, however, that the bottleneck variable performs an intrinsic clusterization in classification tasks (see, for instance, [21,26,42] or Figure 2b). We observed how this clusterization matches the quantized performance levels observed (e.g., compare Figure 2a with the top center graph in Figure 1): maximum performance is achieved when the number of clusters is equal to the cardinality of Y, and performance is reduced as the number of clusters is reduced, which is in line with the concurrent work from Wu and Fischer [43]. We do not have a mathematical proof of the exact relationship between these two phenomena; however, we agree with Wu et al. [27] that it is an interesting matter and hope this observation serves as motivation to derive new theory.

Figure 2.


Depiction of the clusterization behavior of the bottleneck variable for the power IB Lagrangian in the MNIST dataset with α=1. The clusters were obtained using the DBSCAN algorithm [44,45].

In practice, there are different criteria for choosing the function u. For instance, the exponential IB Lagrangian could be more desirable than the power IB Lagrangian when we want to draw the IB curve, since it has a finite range of β_u: B_u = [(η exp(η I_{max}))^{-1}, η^{-1}] for the exponential IB Lagrangian vs. B_u = [((1+α) I_{max}^α)^{-1}, ∞) for the power IB Lagrangian. Furthermore, there is a trade-off between (i) how much the selected u function resembles a linear function in our region of interest (e.g., with α or η close to zero), since it will suffer from similar problems as the original IB Lagrangian, and (ii) how fast it grows in our region of interest (e.g., higher values of α or η), since it will suffer from value convergence; i.e., optimizing for separate values of β_u will achieve similar levels of performance (Figure 3). Please refer to Appendix I for a more thorough explanation of these two phenomena.

Figure 3.


Example of value convergence with the exponential IB Lagrangian with η = 3. We show the intersection of the isolines of L_{IB,exp}(T;β_{exp}) for different β_{exp} ∈ B_{exp} ≈ [1.56×10^{-5}, 3^{-1}], using Corollary 2.

Particularly, the value convergence phenomenon can be exploited in order to approximately obtain a particular level of compression r, both for known and unknown IB curves (see Appendix I or the example in Figure 4). For known IB curves we also know the achieved predictability I(T;Y); in deterministic scenarios, for instance, it is the same as the level of compression I(X;T). For this exploitation, we can employ the shifted version of the exponential IB Lagrangian (which is also a particular case of the convex IB Lagrangian):

  • the shifted exponential IB Lagrangians:
    L_{IB,sh-exp}(T;β_{sh-exp},η,r) = I(T;Y) - β_{sh-exp} exp(η(I(X;T) - r)), with η > 0 and r ∈ [0,∞).

Figure 4.


Example of value convergence exploitation with the shifted exponential IB Lagrangian with η = 200. The top row shows the MNIST dataset, aiming for a compression level of r = 2, and the bottom row the TREC-6 dataset, aiming for a compression level of r = 16. In each row, from left to right, it is shown (i) the information plane, where the region of possible solutions of the IB problem is shadowed in light orange and the information-theoretic limits are the dashed orange line; (ii) I(T;Y) as a function of β_u; and (iii) the compression I(X;T) as a function of β_u. In all plots, the red crosses joined by a dotted line represent the values computed with the training set, the blue dots the values computed with the validation set, and the green stars the theoretical values computed as dictated by Proposition 3. Moreover, in all plots, H(Y) is indicated with a dashed orange line. All values are shown in bits.

For this Lagrangian, the optimization procedure converges to representations with approximately the desired compression level r if the hyperparameter η is set to a large value.
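That is, with a large η the penalty term behaves like a barrier at I(X;T) = r: it is negligible below r and grows steeply above it. A sketch of the corresponding u (default values taken from the MNIST experiment in Figure 4; in practice the exponent may need clamping for numerical stability):

```python
import torch

def u_shifted_exp(i_xt, eta=200.0, r=2.0):
    # u for the shifted exponential IB Lagrangian: ~0 for I(X;T) < r and
    # steeply increasing above it, so training settles near I(X;T) = r.
    return torch.exp(eta * (i_xt - r))
```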

In Figure 4 we show the results of aiming for a compression level of r = 2 bits in the MNIST dataset and of r = 16 bits in the TREC-6 dataset, both with η = 200. We can see how for different values of β_{sh-exp} we obtain the same desired compression level, which makes this method robust to variations in the Lagrange multiplier selection.

To sum up, in order to achieve a desired level of performance with the convex IB Lagrangian as an objective one should:

  1. In a deterministic or close to deterministic setting (see the ϵ-deterministic definition in Kolchinsky et al. [21]): use the adequate β_u for that performance using Proposition 3. Then, if the performance is lower than desired (i.e., we are placed in the wrong performance plateau), gradually reduce the value of β_u until reaching the previous performance plateau. Alternatively, exploit the value convergence phenomenon with, for instance, the shifted exponential IB Lagrangian.

  2. In a stochastic setting: exploit the value convergence phenomenon with, for instance, the shifted exponential IB Lagrangian. Alternatively, draw the IB curve with multiple values of β_u on the range defined by Corollary 3 and select the representation that best fits our interests.

6. Conclusions

The information bottleneck is a widely used and studied technique. However, it is known that the IB Lagrangian cannot be used to achieve varying levels of performance in deterministic scenarios. Moreover, in order to achieve a particular level of performance, multiple optimizations with different Lagrange multipliers must be done to draw the IB curve and select the best traded-off representation.

In this article we introduced a general family of Lagrangians which allow us to (i) achieve varying levels of performance in any scenario, and (ii) pinpoint the specific Lagrange multiplier β_u to optimize for a specific performance level in known IB curve scenarios, e.g., deterministic ones. Furthermore, we showed the β_u domain when the IB curve is known and a bound on that domain for exploring the IB curve when it is unknown. This way we can reduce and/or avoid multiple optimizations and, hence, reduce the computational effort for finding well traded-off representations. Moreover, (iii) we saw how we can exploit the value convergence issue of the convex IB Lagrangian to approximately obtain a specific compression level, for both known and unknown IB curve shapes. Finally, (iv) we provided some insight into the discontinuities in the performance levels w.r.t. the Lagrange multipliers by connecting them with the intrinsic clusterization of the bottleneck variable.

Acknowledgments

We want to thank the anonymous reviewers for their insightful comments.

Appendix A. Proof of Proposition 1

Proof. 

We can easily prove this statement by finding that I(T;Y) is lower-bounded by γ J_{CE}(p(X,Y);θ) + C, where γ < 0 and C does not depend on T. This way, maximizing such a lower bound is equivalent to minimizing J_{CE}(p(X,Y);θ) and, moreover, it implies maximizing I(T;Y).

We can find such an expression as follows:

I(T;Y) = E_{(y,t)∼q_{Y|T}q_T}[log(q_{Y|T=t}(y|t;θ)/p_Y(y))] = H(Y) + E_{(y,t)∼q_{Y|T}q_T}[log(q_{Y|T=t}(y|t;θ))]   (A1)
= H(Y) + E_{t∼q_T}[D_{KL}(q_{Y|T=t} || q_{Ŷ|T=t})] + E_{(y,t)∼q_{Y|T}q_T}[log(q_{Ŷ|T=t}(y|t;θ))]   (A2)
≥ H(Y) + E_{(x,y,t)∼q_{Y|T}q_{T|X}p_X}[log(q_{Ŷ|T=t}(y|t;θ))] = H(Y) - E_{(x,t)∼q_{T|X}p_X}[C(q_{Y|T=t} || q_{Ŷ|T=t})]   (A3)
= H(Y) - J_{CE}(p(X,Y);θ).   (A4)

Here, in Equation (A1) we just used the definition of the mutual information between two random variables, and then we decoupled it using the definition of the entropy of a variable (note we used H(·), which is usually employed for discrete variables; however, in this setting H(·) could also refer to the differential entropy h(·) of a continuous random variable, since we employed the general definition using the expectation). Then, in Equation (A2) we only multiplied and divided by q_{Ŷ|T} inside the logarithm and employed the definition of the Kullback–Leibler divergence. Finally, in Equation (A3) we first used the fact that the Kullback–Leibler divergence is always non-negative (Theorem 2.6.3 from Cover and Thomas [20]) and then the properties of the Markov chain T → X → Y.

Therefore, since H(Y) does not depend on T and we have a negative multiplicative term on J_{CE}(p(X,Y);θ), the proposition is proved. □

Appendix B. Alternative Proof of Theorem 1

Proof. 

We will prove all the enumerated statements sequentially, since the third one requires the first two to be proved.

  1. Proposition 2 states that the IB curve in the information plane follows the equation I(T;Y) = I(X;T) if I(X;T) ∈ [0, I(X;Y)). Then, since β = dI(T;Y)/dI(X;T) [1], we know β = 1 in all these points. Therefore, for β = 1, all points (I(X;T), I(X;T)) such that I(X;T) ∈ [0, I(X;Y)) are solutions of optimizing the IB Lagrangian.

  2. Similarly, Proposition 2 states that the IB curve follows the equation I(T;Y) = I(X;Y) if I(X;T) ≥ I(X;Y). Then, since β = dI(T;Y)/dI(X;T) [1], we know β = 0 in all points such that I(X;T) > I(X;Y). We cannot ensure it at I(X;T) = I(X;Y) since β = 1 for I(X;T) = lim_{ϵ→0^+}{I(X;Y) - ϵ}.

  3. Finally, in order to prove the last statement, we will first prove that if β ∈ (0,1) achieves a solution, it is (I(X;Y), I(X;Y)). Then, we will prove that if the solution (I(X;Y), I(X;Y)) exists, it can be yielded by any β ∈ (0,1). Hence, the solution (I(X;Y), I(X;Y)) is achieved for all β ∈ (0,1), and it is the only solution achievable.
    • (a)
      Since the IB curve is concave, we know β is non-increasing in I(X;T) ∈ ℝ_+. We also know β = 1 at the points in the IB curve where I(X;T) ≤ lim_{ϵ→0^+}{I(X;Y) - ϵ} and β = 0 at the points in the IB curve where I(X;T) ≥ lim_{ϵ→0^+}{I(X;Y) + ϵ}. Hence, if we achieve a solution with β ∈ (0,1), this solution is I(X;T) = I(T;Y) = I(X;Y).
    • (b)
      We can upper bound the IB Lagrangian by
      L_{IB}(T;β) = I(T;Y) - βI(X;T) ≤ (1-β)I(T;Y) ≤ (1-β)I(X;Y),   (A5)
      where the first and second inequalities use the DPI (Theorem 2.8.1 from Cover and Thomas [20]).
      Then, we can consider the point of the IB curve (I(X;Y), I(X;Y)). Since the function is concave, a tangent line to (I(X;Y), I(X;Y)) exists such that all other points in the curve lie below this line. Let β be the slope of this line (which we know it is from Tishby et al. [1]). Then,
      I(X;Y) - βI(X;Y) = (1-β)I(X;Y) ≥ F_{IB,max}(r) - βr,  ∀r ∈ [0,∞).   (A6)
      As we see, by the upper bound on the IB Lagrangian from Equation (A5), if the point (I(X;Y), I(X;Y)) exists, any β ∈ (0,1) can be the slope of the tangent line to (I(X;Y), I(X;Y)) that ensures concavity. □

Appendix C. Proof of Theorem 2

Proof. 

We start the proof by remembering the optimization problem at hand (Definition 1):

F_{IB,max}(r) = max_{T∈Δ}{I(T;Y)}  s.t.  I(X;T) ≤ r   (A7)

We can modify the optimization problem by

max_{T∈Δ}{I(T;Y)}  s.t.  u(I(X;T)) ≤ u(r)   (A8)

iff u is a monotonically non-decreasing function, since otherwise u(I(X;T)) ≤ u(r) would not necessarily hold. Now, let us assume there exist T* ∈ Δ and β_u s.t. T* maximizes L_{IB,u}(T;β_u) over all T ∈ Δ and I(X;T*) ≤ r. Then, we can operate as follows:

max_{T∈Δ : u(I(X;T))≤u(r)}{I(T;Y)} = max_{T∈Δ : u(I(X;T))≤u(r)}{I(T;Y) - β_u(u(I(X;T)) - u(r) + ξ)}   (A9)
≤ max_{T∈Δ}{I(T;Y) - β_u(u(I(X;T)) - u(r) + ξ)}   (A10)
= I(T*;Y) - β_u(u(I(X;T*)) - u(r) + ξ) = I(T*;Y).   (A11)

Here, the equality from Equation (A9) comes from the fact that, since I(X;T*) ≤ r, there exists ξ ≥ 0 s.t. u(I(X;T*)) - u(r) + ξ = 0. Then, the inequality from Equation (A10) holds since we have expanded the optimization search space. Finally, in Equation (A11) we use that T* maximizes L_{IB,u}(T;β_u) and that I(X;T*) ≤ r.

Now, we can exploit the fact that u(r) and ξ do not depend on T and drop them in the maximization in Equation (A10). We can then realize we are maximizing L_{IB,u}(T;β_u); i.e.,

max_{T∈Δ : u(I(X;T))≤u(r)}{I(T;Y)} ≤ max_{T∈Δ}{I(T;Y) - β_u(u(I(X;T)) - u(r) + ξ)}   (A12)
= max_{T∈Δ}{I(T;Y) - β_u u(I(X;T))} = max_{T∈Δ}{L_{IB,u}(T;β_u)}.   (A13)

Therefore, since I(T*;Y) satisfies both the maximization over T ∈ Δ and the constraint I(X;T*) ≤ r, maximizing L_{IB,u}(T;β_u) obtains F_{IB,max}(r).

Now, we know that if such a β_u exists, then the solution of the Lagrangian will be a solution for F_{IB,max}(r). Then, if we consider Theorem 6 from the Appendix of Courcoubetis [22] and consider the maximization problem instead of the minimization problem, we know that if both I(T;Y) and -u(I(X;T)) are concave functions, then a set of Lagrange multipliers S_u exists with these conditions. We can make this consideration because f is concave if -f is convex and max{f} = -min{-f}. We know I(T;Y) is a concave function of T for T ∈ Δ (Lemma 5 of Gilad-Bachrach et al. [19]) and I(X;T) is convex w.r.t. T given that p_X is fixed (Theorem 2.7.4 of Cover and Thomas [20]). Thus, if we want -u(I(X;T)) to be concave, we need u to be a convex function.

Finally, we will look at the conditions on u such that, for every point (I(X;T), I(T;Y)) in the IB curve, there exists a unique β_u s.t. L_{IB,u}(T;β_u) is maximized; that is, the conditions on u s.t. |S_u| = 1. For this purpose we will look at the solutions of the Lagrangian optimization:

dL_{IB,u}(T;β_u)/dT = d(I(T;Y) - β_u u(I(X;T)))/dT = dI(T;Y)/dT - β_u (du(I(X;T))/dI(X;T)) (dI(X;T)/dT) = 0   (A14)

Now, if we integrate both sides of Equation (A14) over all T ∈ Δ, we obtain

β_u = (dI(T;Y)/dI(X;T)) (du(I(X;T))/dI(X;T))^{-1} = β/u′(I(X;T)),   (A15)

where β is the Lagrange multiplier from the IB Lagrangian [1] and u′(I(X;T)) denotes du(I(X;T))/dI(X;T). Also, if we want to avoid indeterminations of β_u, we need u′(I(X;T)) not to be 0. Since we already imposed u to be monotonically non-decreasing, we can solve this issue by strengthening this condition. That is, we will require u to be monotonically increasing.

We would like β_u to be continuous so that there is a unique β_u for each value of I(X;T). We know β is a non-increasing function of I(X;T) (Lemma 6 of Gilad-Bachrach et al. [19]). Hence, if we want β_u to be a strictly decreasing function of I(X;T), we will require u′ to be a strictly increasing function of I(X;T). Therefore, we will require u to be a strictly convex function.

Thus, if u is a strictly convex and monotonically increasing function, for each point (I(X;T), I(T;Y)) in the IB curve s.t. dI(T;Y)/dI(X;T) > 0 there is a unique β_u for which maximizing L_{IB,u}(T;β_u) achieves this solution. □

Appendix D. Proof of Proposition 3

Proof. 

In Theorem 2 we showed how each point of the IB curve (I(X;T), I(T;Y)) can be found with a unique β_u maximizing L_{IB,u}(T;β_u). Therefore, since we also proved L_{IB,u}(T;β_u) is strictly concave w.r.t. T, we can find the values of β_u that maximize the Lagrangian for fixed I(X;T).

First, we look at the solutions of the Lagrangian maximization:

dL_{IB,u}(T;β_u)/dT = d(f_{IB}(I(X;T)) - β_u u(I(X;T)))/dT = df_{IB}(I(X;T))/dT - β_u (du(I(X;T))/dI(X;T)) (dI(X;T)/dT) = 0.   (A16)

Then, as before, we can integrate both sides over all T ∈ Δ and solve for β_u:

β_u = (df_{IB}(I(X;T))/dI(X;T)) (1/u′(I(X;T))).   (A17)

Moreover, since u is a strictly convex function, its derivative u′ is strictly increasing. Hence, u′ is an invertible function (since a strictly increasing function is bijective onto its image and a function is invertible iff it is bijective). Now, if we consider β_u > 0 to be known and I(X;T) to be the unknown, we can solve for I(X;T) and get:

I(X;T) = (u′)^{-1}((df_{IB}(I(X;T))/dI(X;T)) (1/β_u)).   (A18)

Note we require β_u not to be 0 so that the mapping is defined. □

Appendix E. Proof of Corollary 2

Proof. 

We will start the proof by proving the following useful Lemma.

Lemma A1.

Let L_{IB,u}(T;β_u) be a convex IB Lagrangian; then sup_{T∈Δ}{L_{IB,u}(T;0)} = I(X;Y).

Proof. 

Since L_{IB,u}(T;0) = I(T;Y), maximizing this Lagrangian is directly maximizing I(T;Y). We know I(T;Y) is a concave function of T for T ∈ Δ (Theorem 2.7.4 from Cover and Thomas [20]); hence it has a supremum. We also know I(T;Y) ≤ I(X;Y). Moreover, we know I(X;Y) can be achieved if, for example, Y is a deterministic function of T (since then the Markov chain X → T → Y is formed). Thus, sup_{T∈Δ}{L_{IB,u}(T;0)} = I(X;Y). □

For β_u = 0, we know that by maximizing L_{IB,u}(T;0) we can obtain the point (r_{max}, I_{max}) in the IB curve (Lemma A1). Moreover, we know that for every point (I(X;T), f_{IB}(I(X;T))) such that df_{IB}(I(X;T))/dI(X;T) > 0 there exists a unique β_u s.t. max{L_{IB,u}(T;β_u)} achieves that point (Theorem 2). Thus, there exists a unique β_{u,min} s.t. lim_{r→r_{max}}(r, f_{IB}(r)) is achieved. From Proposition 3 we know this β_{u,min} is given by

β_{u,min} = lim_{r→r_{max}} f′_{IB}(r)/u′(r).   (A19)

Since we know f_{IB}(I(X;T)) is a concave, non-decreasing function in (0, r_{max}) (Lemma 5 of Gilad-Bachrach et al. [19]), we know it is continuous in this interval. In addition, we know β_u is strictly decreasing w.r.t. I(X;T) (Theorem 2). Furthermore, by definition of r_{max} and knowing I(T;Y) ≤ I(X;Y), we know f′_{IB}(r) = 0 for all r > r_{max}. Therefore, we cannot ensure the exploration of the IB curve for β_u s.t. 0 < β_u < β_{u,min}.

Then, since u′ is a strictly increasing function in (0, r_{max}), u′ is positive in that interval. Hence, taking into account that β_u is strictly decreasing, we can find a maximum β_u when I(X;T) approaches 0. That is,

β_{u,max} = lim_{r→0^+} f′_{IB}(r)/u′(r).   (A20)

 □

Appendix F. Proof of Corollary 3

Proof. 

If we use Corollary 2, it is straightforward to see that β_u ∈ [L^-, L^+] if β_{u,min} ≥ L^- and β_{u,max} ≤ L^+ for all IB curves f_{IB} and functions u. Therefore, we look at a domain bound dependent on the function choice. That is, if we can find β_{min} ≤ f′_{IB}(r) and β_{max} ≥ f′_{IB}(r) for all IB curves and all values of r, then

B_u ⊆ [β_{min}/lim_{r→r_{max}}{u′(r)}, β_{max}/lim_{r→0^+}{u′(r)}].   (A21)

The region for all possible IB curves, regardless of the relationship between X and Y, is depicted in Figure A1. The hard limits are imposed by the DPI (Theorem 2.8.1 from Cover and Thomas [20]) and the fact that the mutual information is non-negative (corollary of Equation 2.90 for discrete and first corollary of Theorem 8.6.1 for continuous random variables from Cover and Thomas [20]). Hence, minimum and maximum values of f′_{IB} are given by the minimum and maximum values of the slope of the Pareto frontier; i.e., 0 and 1, respectively. This means

B_u ⊆ [0, 1/lim_{r→0^+}{u′(r)}].   (A22)

Note 0/(lim_{r→r_{max}}{u′(r)}) = 0, since u is monotonically increasing and, thus, u′ will never be 0.

Then, we can tighten the bound using the results from Wu et al. [27], where, in their Theorem 2, they showed the slope of the Pareto frontier could be bounded at the origin by f′_{IB} ≤ (inf_{Ω_x⊆𝒳}{β_0(Ω_x)})^{-1}. Finally, we know that in deterministic classification tasks inf_{Ω_x⊆𝒳}{β_0(Ω_x)} = 1, which aligns with Kolchinsky et al. [21] and what we can observe from Figure A1. Therefore,

B_u ⊆ [0, (inf_{Ω_x⊆𝒳}{β_0(Ω_x)})^{-1}/lim_{r→0^+}{u′(r)}] ⊆ [0, 1/lim_{r→0^+}{u′(r)}].   (A23)

 □

Figure A1.


Graphical representation of the IB curve in the information plane. Dashed lines in orange represent tight bounds confining the region (in light orange) of possible IB curves (delimited by the red line, also known as the Pareto frontier). Black dotted lines are informative values. In blue we show an example of a possible IB curve confining a region (in darker orange) of an IB curve that does not achieve the Pareto frontier. Finally, the yellow star represents the point where the representation keeps the same information about the input and the output.

Appendix G. Other Lagrangian Families

We can use the same ideas we used for the convex IB Lagrangian to formulate new families of Lagrangians that allow the exploration of the IB curve. For that, we will use the duality of the IB curve (Lemma 10 of [19]). That is:

Definition A1 (IB Dual Functional).

Let X and Y be statistically dependent variables. Let also Δ be the set of random variables T obeying the Markov condition Y → X → T. Then the IB dual functional is

F_{IB,min}(i) = min_{T∈Δ} I(X;T)  s.t.  I(T;Y) ≥ i,  i ∈ [0, I(X;Y)).   (A24)

Theorem A1 (IB Curve Duality).

Let the IB curve be defined by the solutions of F_{IB,max}(r) for varying r ∈ [0,∞). Then,

∀r ∃i s.t. (r, F_{IB,max}(r)) = (F_{IB,min}(i), i)   (A25)

and

∀i ∃r s.t. (F_{IB,min}(i), i) = (r, F_{IB,max}(r)).   (A26)

From this definition, it follows that minimizing the dual IB Lagrangian, L_{IB,dual}(T;β_{dual}) = I(X;T) - β_{dual} I(T;Y), for β_{dual} = β^{-1} is equivalent to maximizing the IB Lagrangian. In fact, the original Lagrangian for solving the problem was defined this way [1]. We decided to use the maximization version because the domain of useful β is bounded, while it is not for β_{dual}.

Following the same reasoning as we did in the proof of Theorem 2, we can ensure the IB curve can be explored if:

  1. We minimize the concave IB Lagrangian L_{IB,v}(T;β_v) = I(X;T) - β_v v(I(T;Y)).

  2. We maximize the dual concave IB Lagrangian L_{IB,v,dual}(T;β_{v,dual}) = v(I(T;Y)) - β_{v,dual} I(X;T).

  3. We minimize the dual convex IB Lagrangian L_{IB,u,dual}(T;β_{u,dual}) = u(I(X;T)) - β_{u,dual} I(T;Y).

Here, u is a monotonically increasing, strictly convex function; v is a monotonically increasing, strictly concave function; and β_v, β_{v,dual}, β_{u,dual} are the Lagrange multipliers of the families of Lagrangians defined above.

In a similar manner, one could obtain relationships between the Lagrange multipliers of the IB Lagrangian and the convex IB Lagrangian with these Lagrangian families. For instance, the convex IB Lagrangian L_{IB,u}(T;β_u) is related to the concave IB Lagrangian L_{IB,v}(T;β_v) as defined by Proposition A1.

Proposition A1 (Relationship between the convex and concave IB Lagrangians).

Consider the convex and concave IB Lagrangians L_{IB,u}(T;β_u) and L_{IB,v}(T;β_v). Let the IB curve, defined as in Definition 2, be f_{IB}. Then, if we fix the functions u and v, we can obtain the same point in the IB curve (r, f_{IB}(r)) with both Lagrangians when

β_v^{-1} = f′_{IB}(r) v′(f_{IB}((u′)^{-1}(f′_{IB}(r)/β_u))),   (A27)

or equivalently,

β_u^{-1} = (1/f′_{IB}(r)) u′(f_{IB}^{-1}((v′)^{-1}(β_v^{-1}/f′_{IB}(r)))).   (A28)

Proof. 

If we proceed like we did in the proof of Proposition 3, we can find the mapping between I(X;T) and β_u and between I(T;Y) and β_v. That is,

I(X;T) = (u′)^{-1}((df_{IB}(I(X;T))/dI(X;T)) (1/β_u))   and   I(T;Y) = (v′)^{-1}((df_{IB}(I(X;T))/dI(X;T))^{-1} (1/β_v)).   (A29)

Then, if we recall that I(T;Y) = f_{IB}(I(X;T)), we can directly obtain that

f_{IB}((u′)^{-1}((df_{IB}(I(X;T))/dI(X;T)) (1/β_u))) = (v′)^{-1}((df_{IB}(I(X;T))/dI(X;T))^{-1} (1/β_v)).   (A30)

Then, if we solve Equation (A30) with a fixed point (I(X;T) = r, I(T;Y) = f_{IB}(r)) for β_v, we obtain Equation (A27), and if we solve it for β_u, we obtain Equation (A28). □

Also, one could find a range of values for these Lagrangians that allows for the IB curve exploration and define a bijective mapping between their Lagrange multipliers and the IB curve. However, (i) as mentioned in Section 2.2, I(T;Y) is particularly interesting to maximize without transformations because of its meaning. Moreover, (ii) like β_{dual}, the domain of useful β_v and β_{u,dual} is not upper bounded. These two reasons make these other Lagrangians less preferable. We only include them here for completeness. Nonetheless, we encourage the curious reader to explore these families of Lagrangians too. For example, a possible interesting line of research would be investigating whether some particularization of the concave IB Lagrangian suffers from an issue like value convergence that can be exploited for approximately obtaining any predictability level I(T;Y) = i for many values of β_v.

Appendix H. Experimental Setup Details and Further Experiments

In order to generate empirical support for our claims, we performed several experiments on different datasets with different neural network architectures and different ways of calculating the information bottleneck.

Appendix H.1. Information Bottleneck Calculations

The information bottleneck is calculated by modifying the nonlinear-IB [26]. This method of calculating the information bottleneck is a neural network that minimizes the cross-entropy while also minimizing an upper-bound estimate I_θ(X;T) of the mutual information I(X;T). The nonlinear-IB relies on a kernel-based estimate of this mutual information [40]. We modify this calculation method by applying the function u to the I(X;T) estimate.

For the nonlinear-IB calculations, we estimated the gradients of both I_θ(X;T) and the cross-entropy with the same mini-batch. Moreover, we did not learn the covariance of the mixture of Gaussians used for the kernel density estimation of I_θ(X;T), and we set it to (exp(1))².
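For reference, a sketch of a pairwise-distance estimate in the spirit of [40] (ours, not the exact implementation): with T = f_enc(X) + W and W ∼ 𝒩(0, σ²I_d), q_T is a mixture of Gaussians over the mini-batch, and the KL-based pairwise bound on the mixture entropy yields an upper bound on I(X;T):

```python
import math
import torch

def kde_mi_upper_bound(mu, var=1.0):
    # Pairwise-distance bound (in nats; divide by ln 2 for bits):
    # I(X;T) <= -(1/N) sum_i log (1/N) sum_j exp(-||mu_i - mu_j||^2 / (2*var)),
    # where mu_i = f_enc(x_i) are the mini-batch encoder means.
    n = mu.shape[0]
    sq_dists = torch.cdist(mu, mu) ** 2
    log_inner = torch.logsumexp(-sq_dists / (2.0 * var), dim=1) - math.log(n)
    return -log_inner.mean()
```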

In all the experiments, we assumed a Gaussian stochastic encoder T = f_enc(X;θ) + W with p_W = 𝒩(0, I_d), where d is the number of dimensions of the representations. We trained the neural networks with the Adam optimization algorithm [46] with a learning rate of 10^{-4} and a 0.6 decay rate every 10 epochs. We used a batch size of 128 samples, and all the weights were initialized according to the method described by Glorot and Bengio [47] using a Gaussian distribution.

Then, we used the DBSCAN algorithm [44,45] for clustering. Particularly, we used the scikit-learn [48] implementation with ϵ=0.3 and min_samples = 50.
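For instance, with a hypothetical array t_means holding the bottleneck means of the validation samples, the clusters in Figure 2 can be obtained as:

```python
from sklearn.cluster import DBSCAN

# `t_means` is a hypothetical (n_samples, d) array of f_enc outputs.
labels = DBSCAN(eps=0.3, min_samples=50).fit_predict(t_means)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
```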

The reader can find the PyTorch [30] implementation in the following link: https://github.com/burklight/convex-IB-Lagrangian-PyTorch.

Appendix H.2. The Experiments

We performed experiments in four different datasets:

  • A Classification Task on the MNIST Dataset [28] (Figure 1, Figure 2, Figure A2, Figure A3 and Figure A4, and the top row of Figure 4). This dataset contains 60,000 training samples and 10,000 testing samples of hand-written digits. The samples are 28×28 pixels and are labeled from 0 to 9; i.e., 𝒳 = ℝ^{784} and 𝒴 = {0,1,…,9}. The data is pre-processed so that the input has zero mean and unit variance. This is a deterministic setting; hence, the experiment is designed to showcase how the convex IB Lagrangians allow us to explore the IB curve in a setting where the normal IB Lagrangian cannot, as well as the relationship between the performance plateaus and the clusterization phenomena. Furthermore, it intends to showcase the behavior of the power and exponential Lagrangians with different parameters α and η. Finally, it aims to demonstrate how value convergence can be employed to approximately obtain a specific compression value. In this experiment, the encoder f_enc is a three fully-connected-layer encoder with 800 ReLU units on the first two layers and two linear units on the last layer (T ∈ ℝ²), and the decoder f_dec is a fully-connected 800-ReLU-unit layer followed by an output layer with 10 softmax units. The convex IB Lagrangian was calculated using the nonlinear-IB.

    In Figure A2 we show how the IB curve can be explored with different values of α for the power IB Lagrangian, and in Figure A3 with different values of η for the exponential IB Lagrangian.

    Finally, in Figure A4 we show the clusterization for the same values of α and η as in Figure A2 and Figure A3. In this way, the connection between the performance discontinuities and the clusterization is more evident. Furthermore, we can also observe how the exponential IB Lagrangian maintains the theoretical performance better than the power IB Lagrangian (see Appendix I for an explanation of why).

  • A Classification Task on the Fashion-MNIST Dataset [49] (Figure A5). As MNIST, this dataset contains 60,000 training and 10,000 testing samples of 28×28 pixel images labeled from 0 to 9, and it constitutes a deterministic setting. The difference is that this dataset contains fashion products instead of hand-written digits, and it represents a harder classification task [49]. The data is also pre-processed so that the input has zero mean and unit variance. For this experiment, the encoder fenc is composed of a two-layer convolutional neural network (CNN) with 32 filters in the first layer and 128 filters in the second, with kernels of size 5 and stride 2. The CNN is followed by two fully-connected layers of 128 linear units (T∈R^128). After the first convolution and the first fully-connected layer, a ReLU activation is employed. The decoder fdec is a fully-connected layer of 128 ReLU units followed by an output layer with 10 softmax units. The convex IB Lagrangian was calculated using the nonlinear-IB. Therefore, this experiment intends to showcase how the convex IB Lagrangian can explore the IB curve for different neural network architectures and harder datasets.

  • A Regression Task on the California Housing Dataset [50] (Figure A6). This dataset contains 20,640 samples of 8 real-valued input variables, such as the longitude and latitude of the house (i.e., X∈R^8), and a real-valued task output representing the price of the house (i.e., Y∈R). We used the log-transformed house price as the target variable and dropped the 992 samples in which the house price was equal to or greater than $500,000, so that the output distribution was closer to a Gaussian, as was done in [26]. The input variables were processed so that they had zero mean and unit variance, and we randomly split the samples into a 70% training and a 30% test dataset. As in [40], for regression tasks we approximate H(Y) with the entropy of a Gaussian with variance Var(Y) and H(Y|T) with the entropy of a Gaussian with variance equal to the mean-squared error (MSE); since the differential entropy of a Gaussian with variance σ^2 is 0.5 log(2πeσ^2), the constant terms cancel and we obtain the estimate I(T;Y) = H(Y) − H(Y|T) ≈ 0.5 log(Var(Y)/MSE). The encoder fenc consists of three fully-connected layers, with 128 ReLU units in each of the first two layers and 2 linear units in the last layer (T∈R^2), and the decoder fdec is a fully-connected layer of 128 ReLU units followed by an output layer with 1 linear unit. The convex IB Lagrangian was calculated using the nonlinear-IB. Hence, this experiment was designed to showcase that the convex IB Lagrangian can explore the IB curve in stochastic scenarios for regression tasks.

  • A Classification Task on the TREC-6 Dataset [29] (Figure A7 and bottom row from Figure 3). This dataset is the six-class version of the TREC [51] dataset. It contains 5452 training and 500 test samples of text questions. Each question is labeled with one of six semantic categories based on what the answer is, namely: abbreviations, descriptions and abstract concepts, entities, human beings, locations, and numeric values. This dataset does not constitute a deterministic setting, since there are examples that could belong to more than one class and examples that are wrongly labeled (e.g., “What is a fear of parasites?” could belong both to the description and abstract concepts category and to the entity category, yet it is labeled as an entity), and hence H(Y|X)>0. Following Ben Trevett’s tutorial on sentiment analysis [52], the encoder fenc is composed of a 100-dimensional GloVe word embedding [53] pre-trained on 6 billion tokens, followed by a concatenation of three convolutions with kernel sizes 2–4, respectively, and finalized with a fully-connected layer of 128 linear units (T∈R^128). The decoder fdec is a single fully-connected layer of 6 softmax units. The convex IB Lagrangian was calculated using the nonlinear-IB. Thus, this experiment intends to show an example where the classification task does not constitute a deterministic scenario, that the convex IB Lagrangian can recover the IB curve in complex stochastic tasks with complex neural network architectures, and that value convergence can be employed to obtain a specific compression value even in stochastic settings where the IB curve is unknown.
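As referenced in the first item above, a minimal PyTorch sketch of the MNIST encoder/decoder pair could look as follows (layer names are ours; the softmax is realized implicitly through the cross-entropy loss):

```python
import torch.nn as nn

# Encoder f_enc: two 800-unit ReLU layers and a 2-dimensional linear bottleneck (T in R^2).
f_enc = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 800), nn.ReLU(),
    nn.Linear(800, 800), nn.ReLU(),
    nn.Linear(800, 2),
)
# Decoder f_dec: one 800-unit ReLU layer followed by 10 output logits.
f_dec = nn.Sequential(
    nn.Linear(2, 800), nn.ReLU(),
    nn.Linear(800, 10),
)
```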

Figure A2.

Results for the power IB Lagrangian in the MNIST dataset with α={0.5,1,2}, from top to bottom. In each row, from left to right, we show (i) the information plane, where the region of possible solutions of the IB problem is shaded in light orange and the information-theoretic limits are shown as a dashed orange line; (ii) I(T;Y) as a function of βu; and (iii) the compression I(X;T) as a function of βu. In all plots, the red crosses joined by a dotted line represent the values computed on the training set, the blue dots the values computed on the validation set, and the green stars the theoretical values given by Proposition 3. Moreover, in all plots, I(X;Y)=H(Y)=log2(10) is indicated with a dashed orange line. All values are shown in bits.

Figure A3.

Results for the exponential IB Lagrangian in the MNIST dataset with η={log(2),1,1.5}, from top to bottom. In each row, from left to right, we show (i) the information plane, where the region of possible solutions of the IB problem is shaded in light orange and the information-theoretic limits are shown as a dashed orange line; (ii) I(T;Y) as a function of βu; and (iii) the compression I(X;T) as a function of βu. In all plots, the red crosses joined by a dotted line represent the values computed on the training set, the blue dots the values computed on the validation set, and the green stars the theoretical values given by Proposition 3. Moreover, in all plots, I(X;Y)=H(Y)=log2(10) is indicated with a dashed orange line. All values are shown in bits.

Figure A4.

Depiction of the clusterization behavior of the bottleneck variable in the MNIST dataset. In the first row, from left to right, the power IB Lagrangian with different values of α={0.5,1,2}. In the second row, from left to right, the exponential IB Lagrangian with different values of η={log(2),1,1.5}.

Figure A5.

Results for the exponential IB Lagrangian in the Fashion-MNIST dataset with η=1. From left to right, we show (i) the information plane, where the region of possible solutions of the IB problem is shaded in light orange and the information-theoretic limits are shown as a dashed orange line; (ii) I(T;Y) as a function of βu; and (iii) the compression I(X;T) as a function of βu. In all plots, the red crosses joined by a dotted line represent the values computed on the training set and the blue dots the values computed on the validation set. Moreover, in all plots, I(X;Y)=H(Y)=log2(10) is indicated. All values are shown in bits.

Figure A6.

The top row shows the results for the normal IB Lagrangian, and the bottom row for the exponential IB Lagrangian with η=1, both in the California housing dataset. In each row, from left to right, we show (i) the information plane, where the region of possible solutions of the IB problem is shaded in light orange and the information-theoretic limits are shown as a dashed orange line; (ii) I(T;Y) as a function of βu; and (iii) the compression I(X;T) as a function of βu. In all plots, the red crosses joined by a dotted line represent the values computed on the training set and the blue dots the values computed on the validation set. Moreover, in all plots, I(X;Y) is indicated as the empirical value obtained by maximizing I(T;Y) without compression constraints, as in [26]. All values are shown in bits.

Figure A7.

The top row shows the results for the normal IB Lagrangian, and the bottom row for the power IB Lagrangian with α=0.1, both in the TREC-6 dataset. In each row, from left to right, we show (i) the information plane, where the region of possible solutions of the IB problem is shaded in light orange and the information-theoretic limits are shown as a dashed orange line; (ii) I(T;Y) as a function of βu; and (iii) the compression I(X;T) as a function of βu. In all plots, the red crosses joined by a dotted line represent the values computed on the training set and the blue dots the values computed on the validation set. Moreover, in all plots, H(Y)=log2(6) is indicated. All values are shown in bits.

Appendix I. Guidelines for Selecting A Proper Function in the Convex IB Lagrangian

When choosing the function u, it is important to strike a balance between avoiding value convergence and aiming for strong convexity. In practice, this balance is found by examining how much faster u grows than the identity function.

When the aim is not to draw the IB curve but to find a specific level of performance, we can exploit the value-convergence phenomenon to design a stable, performance-targeted u function.

Appendix I.1. Avoiding Value Convergence

In order to explain this issue, we again use the power and exponential IB Lagrangians in the example of classification on MNIST [28], where I(X;Y)=H(Y)=log2(10).

If we use Proposition 3 on both Lagrangians, we obtain the bijective mappings between their Lagrange multipliers and a given level of compression in this classification setting:

  1. Power IB Lagrangian: β_pow = ((1+α)I(X;T)^α)^(−1) and I(X;T) = ((1+α)β_pow)^(−1/α).

  2. Exponential IB Lagrangian: β_exp = (η exp(ηI(X;T)))^(−1) and I(X;T) = −log(ηβ_exp)/η.
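As a sanity check, a minimal NumPy sketch of these mappings (the function names and parameter values are ours):

```python
import numpy as np

def r_power(beta, alpha):
    """Compression I(X;T) reached by the power IB Lagrangian for multiplier beta."""
    return ((1 + alpha) * beta) ** (-1.0 / alpha)

def r_exp(beta, eta):
    """Compression I(X;T) reached by the exponential IB Lagrangian (requires beta < 1/eta)."""
    return -np.log(eta * beta) / eta

# Sweep the multiplier and inspect the attained compression (cf. Figure A8).
betas = np.linspace(0.05, 1.0, 20)
print(r_power(betas, alpha=2.0))
print(r_exp(betas, eta=0.5))
```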

Hence, we can simply plot the curves of I(X;T) vs. βu for different hyperparameters α and η (see Figure A8). In this way, we can observe how increasing the growth of the function too much (e.g., by increasing α or η in this case) causes many different values of βu to map to very similar values of I(X;T). This is an issue both for drawing the curve (for obvious reasons) and for aiming at a specific performance level. Due to the nature of the estimation of the IB Lagrangian, the theoretical and practical values of βu that yield a specific I(X;T) may differ slightly (see Figure 1). Hence, if we select a function that grows too fast, a small change in βu can result in a large change in the obtained performance.

Figure A8.

Theoretical bijection between I(X;T) and βu, shown for different values of α with βu ranging from βu,min to 1.5 in the power IB Lagrangian (top), and for different values of η with βu in the domain Bu in the exponential IB Lagrangian (bottom).

Appendix I.2. Aiming for Strong Convexity

Definition A2 (μ-Strong Convexity).

If a function f(r) is twice continuously differentiable and its domain is a subset of the real line, then it is μ-strongly convex if f″(r) ≥ μ ≥ 0 for all r.

Experimentally, we observed that when the growth of our function u(r) is small in the domain of interest (r>0), the convex IB Lagrangian does not perform well (see the first rows of Figure A2 and Figure A3). We later realized that this is closely related to the strength of the convexity of the function.

In Theorem 2, we required the function u to be strictly convex in order to enforce a unique βu for each value of I(X;T). However, since in practice we do not compute the Lagrangian exactly but only an estimate of it (e.g., with the nonlinear-IB [26]), we require strong convexity in order to be able to explore the IB curve.

We now look at the second derivatives of the power and exponential functions: u″(r) = α(1+α)r^(α−1) and u″(r) = η^2 exp(ηr), respectively. Here we see that both functions are inherently 0-strong convex for r>0 and α,η>0. However, values of α<1 and η<1 can lead to low μ-strong convexity in certain regions of r. In particular, the case α<1 is dangerous because the function approaches 0-strong convexity as r increases, so the power IB Lagrangian performs poorly when low values of α are used to target high performance levels.
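A quick numerical check of this behavior (a sketch; the values of α, η, and the grid are arbitrary choices of ours):

```python
import numpy as np

r = np.linspace(0.5, 10.0, 5)
alpha, eta = 0.5, 0.5
u2_pow = alpha * (1 + alpha) * r ** (alpha - 1)  # decays towards 0 as r grows
u2_exp = eta ** 2 * np.exp(eta * r)              # grows without bound in r
print(u2_pow)
print(u2_exp)
```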

Appendix I.3. Exploiting Value Convergence

When the aim is not to draw or explore the IB curve but to obtain a specific level of performance, the aforementioned power and exponential IB Lagrangians might not be the best choice, due to their problems with value convergence or non-strong convexity. However, we can exploit the former in order to design a performance-targeted u function.

For instance, if we look at Figure A8, we can see how a modification of the exponential IB Lagrangian could result in such a function. More precisely, a shifted exponential u(r)=exp(η(r−r∗)), with η>0 sufficiently large, converges to the compression level r∗. We can see this more clearly if we consider the shifted exponential IB Lagrangian L_IB,shexp(T;β_shexp,η,r∗) = I(T;Y) − β_shexp exp(η(I(X;T)−r∗)), since the application of Proposition 3 then results in I(X;T) = r∗ − log(ηβ_shexp/f′_IB(I(X;T)))/η, where f′_IB(I(X;T)) is the derivative of f_IB evaluated at I(X;T). We know f′_IB = 1 in deterministic scenarios (Theorem 2) and f′_IB < 1 otherwise (see, e.g., [27]). Then, for large enough η, I(X;T) ≈ r∗ regardless of the value of f′_IB.

For instance, if we consider a deterministic scenario like the MNIST dataset [28] with I(X;Y)=H(Y)=log2(10), for η=200 and r∗=2 the range of Lagrange multipliers that allows the exploration of the IB curve, according to Corollary 2, is β_shexp ∈ [7.54×10^(−178), 2.61×10^(171)]. Furthermore, I(X;T) is close to 2 for many values of β_shexp; for instance, I(X;T)=1.974 for β_shexp=1 and I(X;T)=1.963 for β_shexp=8. This ensures stability of the obtained performance level, so that small changes in the choice of β_shexp do not result in significant changes in performance (e.g., see the top row of Figure 4).

If we now consider a stochastic scenario like the TREC-6 dataset [29] with H(Y)=log2(6), for η=200 and r∗=16 the range of Lagrange multipliers that allows the exploration of the IB curve, according to Corollary 3, is β_shexp ∈ [0, 2.76(inf_{Ωx⊂X}{β0(Ωx)})^(−1)×10^(1287)], where β0 and Ωx are defined as in [27]. Then, unless (inf_{Ωx⊂X}{β0(Ωx)})^(−1) is of the order of 10^(−1287), the range of possible betas is wide. Moreover, I(X;T) is close to 16 for many values of β_shexp. For example, for β_shexp=1, I(X;T)=15.939 if f′_IB=0.001 at that point and I(X;T)=15.973 if f′_IB=0.9; and for β_shexp=8, I(X;T)=15.929 if f′_IB=0.001 and I(X;T)=15.963 if f′_IB=0.9. Hence, as in the deterministic scenario, the obtained performance level is stable to changes in the choice of β_shexp (e.g., see the bottom row of Figure 4).
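A small numerical sketch of this mapping (the function name is ours) reproduces the values quoted above:

```python
import numpy as np

def r_shexp(beta, eta, r_star, f_prime=1.0):
    """I(X;T) attained by the shifted exponential IB Lagrangian (via Proposition 3)."""
    return r_star - np.log(eta * beta / f_prime) / eta

# Deterministic MNIST-like setting (f'_IB = 1), eta = 200, target r* = 2:
print(r_shexp(1.0, 200, 2))                  # ~1.974
print(r_shexp(8.0, 200, 2))                  # ~1.963
# Stochastic TREC-6-like setting, target r* = 16:
print(r_shexp(1.0, 200, 16, f_prime=0.001))  # ~15.939
```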

Author Contributions

Conceptualization, B.R.G. and R.T.; formal analysis, B.R.G.; funding acquisition, M.S.; methodology, B.R.G. and R.T.; resources, M.S.; software, B.R.G.; supervision, R.T. and M.S.; visualization, B.R.G.; writing—original draft, B.R.G.; writing—review and editing, B.R.G., R.T. and M.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Swedish Research Council.

Conflicts of Interest

The authors declare no conflict of interest.

References

  • 1. Tishby N., Pereira F.C., Bialek W. The information bottleneck method. arXiv 2000, physics/0004057.
  • 2. Alemi A.A., Fischer I., Dillon J.V., Murphy K. Deep variational information bottleneck. arXiv 2016, 1612.00410.
  • 3. Peng X.B., Kanazawa A., Toyer S., Abbeel P., Levine S. Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow; Proceedings of the International Conference on Learning Representations (ICLR); New Orleans, LA, USA, 6–9 May 2019.
  • 4. Achille A., Soatto S. Information dropout: Learning optimal representations through noisy computation. IEEE Trans. Pattern Anal. Mach. Intell. 2018;40:2897–2905. doi: 10.1109/TPAMI.2017.2784440.
  • 5. Slonim N., Tishby N. Document clustering using word clusters via the information bottleneck method; Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; Athens, Greece, 24–28 July 2000.
  • 6. Slonim N., Tishby N. Agglomerative information bottleneck. In: Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2000.
  • 7. Slonim N., Atwal G.S., Tkačik G., Bialek W. Information-based clustering. Proc. Natl. Acad. Sci. USA. 2005;102:18297–18302. doi: 10.1073/pnas.0507432102.
  • 8. Teahan W.J. Text classification and segmentation using minimum cross-entropy. In: Content-Based Multimedia Information Access; Le Centre de Hautes Études Internationales d'Informatique Documentaire: Paris, France, 2000; pp. 943–961.
  • 9. Strouse D., Schwab D.J. The deterministic information bottleneck. Neural Comput. 2017;29:1611–1630. doi: 10.1162/NECO_a_00961.
  • 10. Nazer B., Ordentlich O., Polyanskiy Y. Information-distilling quantizers; Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT); Aachen, Germany, 25–30 June 2017; pp. 96–100.
  • 11. Hassanpour S., Wübben D., Dekorsy A. On the equivalence of double maxima and KL-means for information bottleneck-based source coding; Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC); Barcelona, Spain, 15–18 April 2018; pp. 1–6.
  • 12. Goyal A., Islam R., Strouse D., Ahmed Z., Botvinick M., Larochelle H., Levine S., Bengio Y. InfoBot: Transfer and exploration via the information bottleneck. arXiv 2019, 1901.10902.
  • 13. Yingjun P., Xinwen H. Learning Representations in Reinforcement Learning: An Information Bottleneck Approach. arXiv 2019, cs.LG/1911.05695.
  • 14. Sharma A., Gu S., Levine S., Kumar V., Hausman K. Dynamics-Aware Unsupervised Skill Discovery; Proceedings of the International Conference on Learning Representations (ICLR); Addis Ababa, Ethiopia, 26–30 April 2020.
  • 15. Schulz K., Sixt L., Tombari F., Landgraf T. Restricting the Flow: Information Bottlenecks for Attribution; Proceedings of the International Conference on Learning Representations (ICLR); Addis Ababa, Ethiopia, 26–30 April 2020.
  • 16. Li X.L., Eisner J. Specializing Word Embeddings (for Parsing) by Information Bottleneck; Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Hong Kong, China, 3–7 November 2019; pp. 2744–2754.
  • 17. Zaslavsky N., Kemp C., Regier T., Tishby N. Efficient compression in color naming and its evolution. Proc. Natl. Acad. Sci. USA. 2018;115:7937–7942. doi: 10.1073/pnas.1800521115.
  • 18. Chalk M., Marre O., Tkačik G. Toward a unified theory of efficient, predictive, and sparse coding. Proc. Natl. Acad. Sci. USA. 2018;115:186–191. doi: 10.1073/pnas.1711114115.
  • 19. Gilad-Bachrach R., Navot A., Tishby N. An information theoretic tradeoff between complexity and accuracy. In: Learning Theory and Kernel Machines; Springer: Berlin, Germany, 2003; pp. 595–609.
  • 20. Cover T.M., Thomas J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2012.
  • 21. Kolchinsky A., Tracey B.D., Van Kuyk S. Caveats for information bottleneck in deterministic scenarios; Proceedings of the International Conference on Learning Representations (ICLR); New Orleans, LA, USA, 6–9 May 2019.
  • 22. Courcoubetis C. Pricing Communication Networks: Economics, Technology and Modelling; Wiley Online Library: Hoboken, NJ, USA, 2003.
  • 23. Tishby N., Slonim N. Data clustering by Markovian relaxation and the information bottleneck method. In: Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2001; pp. 640–646.
  • 24. Slonim N., Friedman N., Tishby N. Unsupervised document classification using sequential information maximization; Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; Tampere, Finland, 11–15 August 2002.
  • 25. Chalk M., Marre O., Tkacik G. Relevant sparse codes with variational information bottleneck. In: Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2016; pp. 1957–1965.
  • 26. Kolchinsky A., Tracey B.D., Wolpert D.H. Nonlinear information bottleneck. Entropy. 2019;21:1181. doi: 10.3390/e21121181.
  • 27. Wu T., Fischer I., Chuang I., Tegmark M. Learnability for the Information Bottleneck; Proceedings of the International Conference on Learning Representations (ICLR); New Orleans, LA, USA, 6–9 May 2019.
  • 28. LeCun Y., Bottou L., Bengio Y., Haffner P. Gradient-based learning applied to document recognition. Proc. IEEE. 1998;86:2278–2324.
  • 29. Li X., Roth D. Learning question classifiers; Proceedings of the 19th International Conference on Computational Linguistics—Volume 1; Association for Computational Linguistics: Stroudsburg, PA, USA, 2002; pp. 1–7.
  • 30. Paszke A., Gross S., Chintala S., Chanan G., Yang E., DeVito Z., Lin Z., Desmaison A., Antiga L., Lerer A. Automatic differentiation in PyTorch; Proceedings of the NIPS Autodiff Workshop; Long Beach, CA, USA, 9 December 2017.
  • 31. Bishop C.M. Pattern Recognition and Machine Learning; Springer Science+Business Media: Berlin, Germany, 2006.
  • 32. Xu A., Raginsky M. Information-theoretic analysis of generalization capability of learning algorithms. In: Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2017; pp. 2524–2533.
  • 33. Krizhevsky A., Sutskever I., Hinton G.E. ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2012; pp. 1097–1105.
  • 34. Shore J.E., Gray R.M. Minimum cross-entropy pattern classification and cluster analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1982;1:11–17. doi: 10.1109/TPAMI.1982.4767189.
  • 35. Shore J., Johnson R. Properties of cross-entropy minimization. IEEE Trans. Inf. Theory. 1981;27:472–482. doi: 10.1109/TIT.1981.1056373.
  • 36. Vera M., Piantanida P., Vega L.R. The role of the information bottleneck in representation learning; Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT); Vail, CO, USA, 17–22 June 2018; pp. 1580–1584.
  • 37. Shamir O., Sabato S., Tishby N. Learning and generalization with the information bottleneck. Theor. Comput. Sci. 2010;411:2696–2711. doi: 10.1016/j.tcs.2010.04.006.
  • 38. Achille A., Soatto S. Emergence of invariance and disentanglement in deep representations. J. Mach. Learn. Res. 2018;19:1947–1980.
  • 39. Du Pin Calmon F., Polyanskiy Y., Wu Y. Strong data processing inequalities for input constrained additive noise channels. IEEE Trans. Inf. Theory. 2017;64:1879–1892. doi: 10.1109/TIT.2017.2782359.
  • 40. Kolchinsky A., Tracey B. Estimating mixture entropy with pairwise distances. Entropy. 2017;19:361. doi: 10.3390/e19070361.
  • 41. Amjad R.A., Geiger B.C. Learning representations for neural network-based classification using the information bottleneck principle. IEEE Trans. Pattern Anal. Mach. Intell. 2019. doi: 10.1109/TPAMI.2019.2909031.
  • 42. Alemi A.A., Fischer I., Dillon J.V. Uncertainty in the variational information bottleneck. arXiv 2018, 1807.00906.
  • 43. Wu T., Fischer I. Phase Transitions for the Information Bottleneck in Representation Learning; Proceedings of the International Conference on Learning Representations (ICLR); Addis Ababa, Ethiopia, 26–30 April 2020.
  • 44. Ester M., Kriegel H.P., Sander J., Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise; Proceedings of the Second International Conference on Knowledge Discovery and Data Mining; Menlo Park, CA, USA, 2–4 August 1996; pp. 226–231.
  • 45. Schubert E., Sander J., Ester M., Kriegel H.P., Xu X. DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Trans. Database Syst. 2017;42:19. doi: 10.1145/3068335.
  • 46. Kingma D.P., Ba J. Adam: A method for stochastic optimization. arXiv 2014, 1412.6980.
  • 47. Glorot X., Bengio Y. Understanding the difficulty of training deep feedforward neural networks; Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics; Sardinia, Italy, 13–15 May 2010; pp. 249–256.
  • 48. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011;12:2825–2830.
  • 49. Xiao H., Rasul K., Vollgraf R. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv 2017, 1708.07747.
  • 50. Pace R.K., Barry R. Sparse spatial autoregressions. Stat. Probab. Lett. 1997;33:291–297. doi: 10.1016/S0167-7152(96)00140-X.
  • 51. Voorhees E.M., Tice D.M. Building a question answering test collection; Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; Athens, Greece, 24–28 July 2000.
  • 52. Trevett B. Tutorial on Sentiment Analysis: 5—Multi-class Sentiment Analysis. April 2019. Available online: https://github.com/bentrevett/pytorch-sentiment-analysis (accessed on 14 January 2020).
  • 53. Pennington J., Socher R., Manning C. GloVe: Global vectors for word representation; Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
