Abstract
The information bottleneck (IB) problem tackles the issue of obtaining relevant compressed representations T of some random variable X for the task of predicting Y. It is defined as a constrained optimization problem that maximizes the information the representation has about the task, $I(T;Y)$, while ensuring that a certain level of compression r is achieved (i.e., $I(X;T) \leq r$). For practical reasons, the problem is usually solved by maximizing the IB Lagrangian (i.e., $\mathcal{L}_{\mathrm{IB}}(T;\beta) = I(T;Y) - \beta I(X;T)$) for many values of $\beta \in [0,1]$. Then, the curve of maximal $I(T;Y)$ for a given $I(X;T)$ is drawn and a representation with the desired predictability and compression is selected. It is known that when Y is a deterministic function of X, the IB curve cannot be explored, and another Lagrangian has been proposed to tackle this problem: the squared IB Lagrangian, $\mathcal{L}_{\mathrm{sq\text{-}IB}}(T;\beta_{\mathrm{sq}}) = I(T;Y) - \beta_{\mathrm{sq}} I(X;T)^2$. In this paper, we (i) present a general family of Lagrangians which allow for the exploration of the IB curve in all scenarios; (ii) provide the exact one-to-one mapping between the Lagrange multiplier and the desired compression rate r for known IB curve shapes; and (iii) show we can approximately obtain a specific compression level with the convex IB Lagrangian for both known and unknown IB curve shapes. This eliminates the burden of solving the optimization problem for many values of the Lagrange multiplier. That is, we prove that we can solve the original constrained problem with a single optimization.
Keywords: information bottleneck, representation learning, mutual information, optimization
1. Introduction
Let $X \in \mathcal{X}$ and $Y \in \mathcal{Y}$ be two statistically dependent random variables with joint distribution $p_{(X,Y)}$. The information bottleneck (IB) [1] investigates the problem of extracting the relevant information from X for the task of predicting Y.
For this purpose, the IB defines a bottleneck variable T obeying the Markov chain $Y \leftrightarrow X \leftrightarrow T$ so that T acts as a representation of X. Tishby et al. [1] define the relevant information as the information the representation keeps from Y after the compression of X (i.e., $I(T;Y)$), provided a certain level of compression (i.e., $I(X;T) \leq r$). Therefore, we select the representation which yields the value of the IB curve that best fits our requirements.
Definition 1 (IB Functional).
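Let X and Y be statistically dependent variables. Let Δ be the set of random variables T obeying the Markov condition $Y \leftrightarrow X \leftrightarrow T$. Then the IB functional is
$$F_{\mathrm{IB,max}}(r) = \max_{T \in \Delta} \{ I(T;Y) \} \quad \text{s.t.} \quad I(X;T) \leq r, \quad \forall r \in [0, \infty). \tag{1}$$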
Definition 2 (IB Curve).
The IB curve is the set of points defined by the solutions of $F_{\mathrm{IB,max}}(r)$ for varying values of $r \in [0, \infty)$.
Definition 3 (Information Plane).
The plane is defined by the axes $I(T;Y)$ and $I(X;T)$.
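To make the information plane concrete, the following minimal sketch computes the point $(I(X;T), I(T;Y))$ reached by a given encoder for discrete variables. The joint distribution `p_xy` and the three encoders are hypothetical toy values chosen for illustration, and the function names are not taken from the paper's code.

```python
import math

def mutual_information(joint):
    """Mutual information (in bits) of a joint distribution given as a 2D list."""
    row = [sum(r) for r in joint]
    col = [sum(c) for c in zip(*joint)]
    mi = 0.0
    for i, r in enumerate(joint):
        for j, p in enumerate(r):
            if p > 0.0:
                mi += p * math.log2(p / (row[i] * col[j]))
    return mi

def information_plane_point(p_xy, p_t_given_x):
    """Return (I(X;T), I(T;Y)) for an encoder p(t|x) under the chain Y <-> X <-> T."""
    n_x, n_y, n_t = len(p_xy), len(p_xy[0]), len(p_t_given_x[0])
    p_x = [sum(r) for r in p_xy]
    # p(x, t) = p(x) p(t|x)
    p_xt = [[p_x[x] * p_t_given_x[x][t] for t in range(n_t)] for x in range(n_x)]
    # p(t, y) = sum_x p(x, y) p(t|x), since T depends on Y only through X
    p_ty = [[sum(p_xy[x][y] * p_t_given_x[x][t] for x in range(n_x))
             for y in range(n_y)] for t in range(n_t)]
    return mutual_information(p_xt), mutual_information(p_ty)

# Toy joint distribution over 4 inputs and 2 labels (hypothetical numbers).
p_xy = [[0.20, 0.05], [0.20, 0.05], [0.05, 0.20], [0.05, 0.20]]
identity = [[1.0 if t == x else 0.0 for t in range(4)] for x in range(4)]  # T = X
constant = [[1.0] for _ in range(4)]                                       # T = const
merged = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]                  # two clusters

point_identity = information_plane_point(p_xy, identity)
point_constant = information_plane_point(p_xy, constant)
point_merged = information_plane_point(p_xy, merged)
```

In this toy case the two-cluster encoder keeps the same $I(T;Y)$ as the uncompressed representation while halving $I(X;T)$, which is the kind of trade-off the IB curve describes.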
This method has been successfully applied to solve different problems from a variety of domains. For example:
Supervised learning. In supervised learning, we are presented with a set of n pairs of input features and task outputs instances. We seek an approximation of the conditional probability distribution between the task outputs Y and the input features X. In classification tasks (i.e., when Y is a discrete random variable), the introduction of the variable T learned through the information bottleneck principle maintained the performance of standard algorithms based on the cross-entropy loss while providing more robustness to adversarial attacks and invariance to nuisances [2,3,4]. Moreover, by the nature of its definition, the information bottleneck appears to be closely related to a trade-off between accuracy on the observable set and generalization to new, unseen instances (see Section 2).
Clustering. In clustering, we are presented with a set of n pairs of instances of a random variable X and their attributes of interest Y. We seek groups of instances (or clusters T) such that the attributes of interest within the instances of each cluster are similar and the attributes of interest of the instances of different clusters are dissimilar. Therefore, the information bottleneck can be employed since it allows us to aim for attribute representative clusters (maximizing the similarity between instances within the clusters) and enforce a certain compression of the random variable X (ensuring a certain difference between instances of the different clusters). This has been successfully implemented, for instance, for gene expression analysis and word, document, stock pricing, or movie rating clustering [5,6,7].
Image segmentation. In image segmentation, we want to partition an image into segments such that each pixel in a region shares some attributes. If we divide the image into very small regions X (e.g., each region is a pixel or a set of pixels defined by a grid), we can consider the problem of segmentation as that of clustering the regions X based on the region attributes Y. Hence, we can use the information bottleneck so that we seek region clusters T that are maximally informative about the attributes Y (e.g., the intensity histogram bins) and maintain a level of compression of the original regions X [8].
Quantization. In quantization, we consider a random variable $X \in \mathcal{X}$ such that $\mathcal{X}$ is a large or continuous set. Our objective is to map X into a variable $T \in \mathcal{T}$ such that $\mathcal{T}$ is a smaller, countable set. If we fix the quantization set size $|\mathcal{T}|$ and aim at maximizing the information of the quantized variable with another random variable Y and restrict the mapping to be deterministic, then the problem is equivalent to the information bottleneck [9,10].
Source coding. In source coding, we consider a data source which generates a signal Y, which is later perturbed by a channel that outputs X. We seek a coding scheme that generates a code T from the output of the channel X which is as informative as possible about the original source signal Y and can be transmitted at a small rate $I(X;T)$. Therefore, this problem is equivalent to the formulation of the information bottleneck [11].
Furthermore, it has been employed as a tool for development or explanation in other disciplines like reinforcement learning [12,13,14], attribution methods [15], natural language processing [16], linguistics [17] or neuroscience [18]. Moreover, it has connections with other problems such as source coding with side information (or the Wyner-Ahlswede-Körner (WAK) problem), the rate-distortion problem or the cost-capacity problem (see Sections 3, 6 and 7 from [19]).
In practice, solving a constrained optimization problem such as the IB functional is challenging. Thus, in order to avoid the non-linear constraints from the IB functional, the IB Lagrangian is defined.
Definition 4 (IB Lagrangian).
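Let X and Y be statistically dependent variables. Let Δ be the set of random variables T obeying the Markov condition $Y \leftrightarrow X \leftrightarrow T$. Then we define the IB Lagrangian as
$$\mathcal{L}_{\mathrm{IB}}(T;\beta) = I(T;Y) - \beta I(X;T). \tag{2}$$
Here $\beta \in [0,1]$ is the Lagrange multiplier which controls the trade-off between the information of Y retained and the compression of X. Note we consider $\beta \in [0,1]$ because (i) for $\beta \leq 0$ many uncompressed solutions such as $T = X$ maximize $\mathcal{L}_{\mathrm{IB}}(T;\beta)$, and (ii) for $\beta \geq 1$ the IB Lagrangian is non-positive due to the data processing inequality (DPI) (Theorem 2.8.1 from Cover and Thomas [20]) and trivial solutions like $T = \mathrm{const}$ are maximizers with $\mathcal{L}_{\mathrm{IB}}(T;\beta) = 0$ [21].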
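Reason (ii) follows from a one-line bound. As a sketch, writing the IB Lagrangian of Definition 4 as $I(T;Y) - \beta I(X;T)$ and using the DPI $I(T;Y) \leq I(X;T)$ for the chain $Y \leftrightarrow X \leftrightarrow T$:

```latex
\mathcal{L}_{\mathrm{IB}}(T;\beta)
  = I(T;Y) - \beta I(X;T)
  \leq (1 - \beta)\, I(X;T)
  \leq 0
  \qquad \text{for } \beta \geq 1 ,
```

with equality whenever $I(X;T) = 0$, which is the case for any representation that is independent of X.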
We know the solutions of the IB Lagrangian optimization (if they exist) are solutions of the IB functional by Lagrange's sufficiency theorem (Theorem 5 in Appendix A of Courcoubetis [22]). Moreover, since the IB functional is concave (Lemma 5 of Gilad-Bachrach et al. [19]) we know they exist (Theorem 6 in Appendix A of Courcoubetis [22]).
Therefore, the problem is usually solved by maximizing the IB Lagrangian with adaptations of the Blahut-Arimoto algorithm [1], deterministic annealing approaches [23] or a bottom-up greedy agglomerative clustering [6] or its improved sequential counterpart [24]. However, when provided with high-dimensional random variables X such as images, these algorithms do not scale well and deep learning-based techniques, where the IB Lagrangian is used as the objective function, prevailed [2,25,26].
Note the IB Lagrangian optimization yields a representation T with a given performance ($I(X;T), I(T;Y)$) for a given $\beta$. However, there is no one-to-one mapping between $\beta$ and $I(X;T)$. Hence, we cannot directly optimize for the desired compression level r but we need to perform several optimizations for different values of $\beta$ and select the representation with the desired performance; e.g., [2]. The Lagrange multiplier selection is important since (i) sometimes even choices of $\beta < 1$ lead to trivial representations such that $p_{T|X} = p_T$, and (ii) there exist some discontinuities on the performance level w.r.t. the values of $\beta$ [27].
Moreover, recently Kolchinsky et al. [21] showed how in deterministic scenarios (such as many classification problems where an input $x_i$ belongs to a single particular class $y_i$) the IB Lagrangian could not explore the IB curve. Particularly, they showed that multiple $\beta$ yielded the same performance level and that a single value of $\beta$ could result in different performance levels. To solve this issue, they introduced the squared IB Lagrangian, $\mathcal{L}_{\mathrm{sq\text{-}IB}}(T;\beta_{\mathrm{sq}}) = I(T;Y) - \beta_{\mathrm{sq}} I(X;T)^2$, which is able to explore the IB curve in any scenario by optimizing for different values of $\beta_{\mathrm{sq}}$. However, even though they realized a one-to-one mapping between $\beta_{\mathrm{sq}}$ and the compression level existed, they did not find such mapping. Hence, multiple optimizations of the Lagrangian were still required to find the best trade-off solution.
The main contributions of this article are:
We introduce a general family of Lagrangians (the convex IB Lagrangians) which are able to explore the IB curve in any scenario and of which the squared IB Lagrangian [21] is a particular case. More importantly, the analysis made for deriving this family of Lagrangians can serve as inspiration for obtaining new Lagrangian families that solve other objective functions with intrinsic trade-offs such as the IB Lagrangian.
We show that in deterministic scenarios (and other scenarios where the IB curve shape is known) one can use the convex IB Lagrangian to obtain a desired level of performance with a single optimization. That is, there is a one-to-one mapping between the Lagrange multiplier used for the optimization and the level of compression and informativeness obtained, and we provide the exact mapping. This eliminates the need for multiple optimizations to select a suitable representation.
We introduce a particular case of the convex IB Lagrangians: the shifted exponential IB Lagrangian, which allows us to approximately obtain a specific compression level in any scenario. This way, we can approximately solve the initial constrained optimization problem from Equation (1) with a single optimization.
Furthermore, we provide some insight for explaining why there are discontinuities in the performance levels w.r.t. the values of the Lagrange multipliers. In a classification setting, we connect those discontinuities with the intrinsic clusterization of the representations when optimizing the IB objective.
The structure of the article is the following: In Section 2 we motivate the usage of the IB in supervised learning settings. Then, in Section 3 we outline the important results used about the IB curve in deterministic scenarios. Later, in Section 4 we introduce the convex IB Lagrangian and explain some of its properties like the bijective mapping between Lagrange multipliers and the compression level and the range of such multipliers. After that, we support our (proved) claims with some empirical evidence on the MNIST [28] and TREC-6 [29] datasets in Section 5. Finally, in Section 6 we discuss our claims and empirical results. A PyTorch [30] implementation of the article can be found at https://github.com/burklight/convex-IB-Lagrangian-PyTorch.
In Appendix A, Appendix B, Appendix C, Appendix D, Appendix E and Appendix F we provide the proofs of the theoretical results. Then, in Appendix G we show some alternative families of Lagrangians with similar properties. Later, in Appendix H we provide the precise experimental setup details to reproduce the results from the paper, and further experimentation with different datasets and neural network architectures. To conclude, in Appendix I we show some guidelines on how to set the convex information bottleneck Lagrangians for practical problems.
2. The IB in Supervised Learning
In this section, we will first give an overview of supervised learning in order to later motivate the usage of the information bottleneck in this setting.
2.1. Supervised Learning Overview
In supervised learning we are given a dataset $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^{n}$ of n pairs of input features and task outputs. In this case, X and Y are the random variables of the input features and the task outputs. We assume $x_i$ and $y_i$ are sampled i.i.d. from the true distribution $p_{(X,Y)}$. The usual aim of supervised learning is to use the dataset $\mathcal{D}_n$ to learn a particular conditional distribution $q_{\hat{Y}|X,\theta}$ of the task outputs given the input features, parametrized by $\theta$, which is a good approximation of $p_{Y|X}$. We use $\hat{Y}$ and $\hat{y}$ to indicate the predicted task output random variable and its outcome. We call a supervised learning task regression when Y is continuous-valued and classification when it is discrete.
Usually, supervised learning methods employ intermediate representations of the inputs before making predictions about the outputs; e.g., hidden layers in neural networks (Chapter 5 from Bishop [31]) or transformations in a feature space through the kernel trick in kernel machines like SVMs or RVMs (Sections 7.1 and 7.2 from Bishop [31]). Let T be a possibly stochastic function of the input features X with a parametrized conditional distribution $p_{T|X,\theta}$; then, T obeys the Markov condition $Y \leftrightarrow X \leftrightarrow T$. The mapping from the representation to the predicted task outputs is defined by the parametrized conditional distribution $q_{\hat{Y}|T,\theta}$. Therefore, in representation-based machine learning methods, the full Markov chain is $Y \leftrightarrow X \rightarrow T \rightarrow \hat{Y}$. Hence, the overall estimation of the conditional probability $p_{Y|X}$ is given by the marginalization of the representations; i.e., $q_{\hat{Y}|X=x,\theta} = \mathbb{E}_{t \sim p_{T|X=x,\theta}}[q_{\hat{Y}|T=t,\theta}]$ (The notation $q_{\hat{Y}|X=x,\theta}$ represents the probability distribution $q_{\hat{Y}|X,\theta}(\cdot|x)$. For the rest of the text, we will use the same notation to represent conditional probability distributions where the conditioning argument is given).
In order to achieve the goal of having a good estimation of the conditional probability distribution $p_{Y|X}$, we usually define an instantaneous cost function $j_\theta(x,y): \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$. The value of this function serves as a heuristic to measure the loss that our algorithm, parametrized by $\theta$, incurs when trying to predict the realization of the task output y with the input realization x.
Clearly, we can be interested in minimizing the expectation of the instantaneous cost function over all the possible input features and task outputs, which we call the cost function. However, since we only have a finite dataset we have instead to minimize the empirical cost function.
Definition 5 (Cost Function and Empirical Cost Function).
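Let X and Y be the input features and task output random variables and $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ their realizations. Let also $j_\theta$ be the instantaneous cost function, θ the parametrization of our learning algorithm, and $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^{n}$ the given dataset. Then, we define the cost function and the empirical cost function, respectively, as:
$$J(\theta) = \mathbb{E}_{(x,y) \sim p_{(X,Y)}}\left[ j_\theta(x,y) \right], \tag{3}$$
$$\hat{J}(\theta, \mathcal{D}_n) = \frac{1}{n} \sum_{i=1}^{n} j_\theta(x_i, y_i). \tag{4}$$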
The discrepancy between the normal and empirical cost functions is called the generalization gap or generalization error (see Section 1 of Xu and Raginsky [32], for instance) and intuitively, the smaller this gap is, the better our model generalizes; i.e., the better it will perform on new, unseen samples in terms of our cost function.
Definition 6 (Generalization Gap).
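Let $J(\theta)$ and $\hat{J}(\theta, \mathcal{D}_n)$ be the cost and the empirical cost functions as defined in Definition 5. Then, the generalization gap is defined as
$$\mathrm{gen}(\theta, \mathcal{D}_n) = J(\theta) - \hat{J}(\theta, \mathcal{D}_n), \tag{5}$$
and it represents the error incurred when the selected distribution is the one parametrized by θ when the rule $\hat{J}(\theta, \mathcal{D}_n)$ is used instead of $J(\theta)$ as the function to minimize.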
Ideally, we would want to minimize the cost function. Hence, we usually try to minimize the empirical cost function and the generalization gap simultaneously. The modifications to our learning algorithm which intend to reduce the generalization gap but not hurt the performance on the empirical cost function are known as regularization.
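The quantities in Definitions 5 and 6 can be computed exactly on a toy problem where the true distribution is known. The sketch below is only illustrative: the distribution `P_XY`, the one-parameter model, and the choice of the negative log-likelihood as instantaneous cost are assumptions, and the gap is taken as the cost minus the empirical cost.

```python
import math
import random

# Hypothetical toy task: X in {0, 1} uniform, and Y = X with probability 0.9.
P_XY = {(0, 0): 0.45, (0, 1): 0.05, (1, 0): 0.05, (1, 1): 0.45}

def model_prob(y, x, theta):
    """Toy model q_theta(y|x): predicts y = x with probability theta."""
    return theta if y == x else 1.0 - theta

def instantaneous_cost(x, y, theta):
    """Negative log-likelihood (in nats) of the pair (x, y) under the model."""
    return -math.log(model_prob(y, x, theta))

def cost(theta):
    """Expected instantaneous cost under the true distribution."""
    return sum(p * instantaneous_cost(x, y, theta) for (x, y), p in P_XY.items())

def empirical_cost(theta, dataset):
    """Average instantaneous cost over a finite dataset."""
    return sum(instantaneous_cost(x, y, theta) for x, y in dataset) / len(dataset)

def generalization_gap(theta, dataset):
    """Cost minus empirical cost (sign convention assumed here)."""
    return cost(theta) - empirical_cost(theta, dataset)

rng = random.Random(0)  # fixed seed so the sampled dataset is reproducible
pairs, weights = zip(*P_XY.items())
dataset = rng.choices(pairs, weights=weights, k=20)
theta = 0.8
gap = generalization_gap(theta, dataset)
```

With only 20 samples the empirical cost can sit noticeably above or below the true cost; increasing `k` shrinks the gap on average, which is the behaviour regularization tries to encourage without hurting the empirical cost.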
2.2. Why Do We Use the IB?
Definition 7 (Representation cross-entropy cost function).
Let X and Y be two statistically dependent variables with joint distribution . Let also T be a random variable obeying the Markov condition and and be the encoding and decoding distributions of our model, parametrized by θ. Finally, let be the cross entropy between two probability distributions and . Then, the cross-entropy cost function is
(6) where is the instantaneous representation cross-entropy cost function and and .
The cross-entropy is a widely used cost function in classification tasks (e.g., Teahan [8], Krizhevsky et al. [33], Shore and Gray [34]) which has many interesting properties [35]. Moreover, it is known that minimizing the representation cross-entropy cost function $J_{\mathrm{CE}}$ maximizes the mutual information $I(T;Y)$. That is:
Proposition 1 (Minimizing the Cross Entropy Maximizes the Mutual Information).
Let $J_{\mathrm{CE}}$ be the representation cross-entropy cost function as defined in Definition 7. Let also $I(T;Y)$ be the mutual information between random variables T and Y in the setting from Definition 7. Then, minimizing $J_{\mathrm{CE}}$ implies maximizing $I(T;Y)$.
The proof of this proposition can be found in Appendix A.
Definition 8 (Nuisance).
A nuisance is any random variable that affects the observed data X but is not informative to the task we are trying to solve. That is, Ξ is a nuisance for Y if $Y \perp \Xi$ or $I(\Xi;Y) = 0$.
Similarly, we know that minimizing $I(X;T)$ minimizes the generalization gap for restricted classes when using the cross-entropy cost function (Theorem 1 of Vera et al. [36]), and when using $I(T;Y)$ directly as an objective to maximize (Theorem 4 of Shamir et al. [37]). Furthermore, Achille and Soatto [38] in Proposition 3.1 upper bound the information of the input representations, T, with nuisances that affect the observed data, Ξ, with $I(X;T)$. Therefore, minimizing $I(X;T)$ helps generalization by not keeping useless information of Ξ in our representations.
Thus, jointly maximizing $I(T;Y)$ and minimizing $I(X;T)$ is a good choice both in terms of performance in the available dataset and in new, unseen data, which motivates studies on the IB.
3. The Information Bottleneck in Deterministic Scenarios
Kolchinsky et al. [21] showed that when Y is a deterministic function of X (i.e., $Y = f(X)$), the IB curve is piecewise linear. More precisely, it is shaped as stated in Proposition 2.
Proposition 2 (The IB Curve is Piecewise Linear in Deterministic Scenarios).
Let X be a random variable and $Y = f(X)$ be a deterministic function of X. Let also T be the bottleneck variable that solves the IB functional. Then the IB curve in the information plane is defined by the following equation:
$$I(T;Y) = \begin{cases} I(X;T) & \text{if } I(X;T) \in [0, H(Y)) \\ H(Y) & \text{if } I(X;T) \geq H(Y) \end{cases} \tag{7}$$
Furthermore, they showed that the IB curve could not be explored by optimizing the IB Lagrangian for multiple $\beta$ because the curve was not strictly concave. That is, there was not a one-to-one relationship between $\beta$ and the performance level.
Theorem 1 (In Deterministic Scenarios, the IB Curve cannot be Explored Using the IB Lagrangian).
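Let X be a random variable and $Y = f(X)$ be a deterministic function of X. Let also Δ be the set of random variables T obeying the Markov condition $Y \leftrightarrow X \leftrightarrow T$. Then:
1. Any solution $T \in \Delta$ such that $I(X;T) \in [0, H(Y))$ and $I(T;Y) = I(X;T)$ solves $\arg\max_{T \in \Delta} \{\mathcal{L}_{\mathrm{IB}}(T;\beta)\}$ for $\beta = 1$. That is, many different compression and performance levels can be achieved for $\beta = 1$.
2. Any solution $T \in \Delta$ such that $I(X;T) > H(Y)$ and $I(T;Y) = H(Y)$ solves $\arg\sup_{T \in \Delta} \{\mathcal{L}_{\mathrm{IB}}(T;\beta)\}$ for $\beta = 0$. That is, many compression levels can be achieved with the same performance for $\beta = 0$. Note we use the supremum in this case since for $\beta = 0$ we have that $I(X;T)$ could be infinite and then the search set from Equation (1) is not compact anymore.
3. Any solution $T \in \Delta$ such that $I(X;T) = I(T;Y) = H(Y)$ solves $\arg\max_{T \in \Delta} \{\mathcal{L}_{\mathrm{IB}}(T;\beta)\}$ for all $\beta \in (0,1)$. That is, many different $\beta$ achieve the same compression and performance level.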
An alternative proof for this theorem can be found in Appendix B.
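The three statements can be checked numerically by evaluating the IB Lagrangian along the piecewise linear curve of Proposition 2. The following sketch is illustrative only: it assumes $H(Y) = \log_2 10$ bits (ten equiprobable classes), searches a coarse grid of compression levels rather than actual representations, and uses hypothetical function names.

```python
import math

def deterministic_ib_curve(r, h_y):
    """IB curve in deterministic scenarios: I(T;Y) = min(I(X;T), H(Y))."""
    return min(r, h_y)

def lagrangian_maximizers(beta, h_y, r_grid, tol=1e-12):
    """All grid compression levels r maximizing f_IB(r) - beta * r along the curve."""
    values = [deterministic_ib_curve(r, h_y) - beta * r for r in r_grid]
    best = max(values)
    return [r for r, v in zip(r_grid, values) if best - v <= tol]

h_y = math.log2(10)                        # assumed entropy of Y, in bits
r_grid = [i * h_y / 8 for i in range(17)]  # 0, H(Y)/8, ..., 2 H(Y)

# beta = 1: every point on the increasing segment is a maximizer.
at_one = lagrangian_maximizers(1.0, h_y, r_grid)
# beta = 0: every point on the flat segment is a maximizer.
at_zero = lagrangian_maximizers(0.0, h_y, r_grid)
# 0 < beta < 1: only the corner (H(Y), H(Y)) is a maximizer.
interior = {beta: lagrangian_maximizers(beta, h_y, r_grid) for beta in (0.1, 0.5, 0.9)}
```

On this grid, `at_one` and `at_zero` each contain nine compression levels, while every entry of `interior` contains only $r = H(Y)$, so no choice of $\beta$ singles out an intermediate point of the curve.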
4. The Convex IB Lagrangian
4.1. Exploring the IB Curve
Clearly, a situation like the one depicted in Theorem 1 is not desirable, since we cannot aim for different levels of compression or performance. For this reason, we generalize the effort from Kolchinsky et al. [21] and look for families of Lagrangians which are able to explore the IB curve. Inspired by the squared IB Lagrangian, $\mathcal{L}_{\mathrm{sq\text{-}IB}}(T;\beta_{\mathrm{sq}}) = I(T;Y) - \beta_{\mathrm{sq}} I(X;T)^2$, we look at the conditions a function of $I(X;T)$ requires in order to be able to explore the IB curve. In this way, we realize that any monotonically increasing and strictly convex function will be able to do so, and we call the family of Lagrangians with these characteristics the convex IB Lagrangians, due to the nature of the introduced function.
Theorem 2 (Convex IB Lagrangians).
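Let Δ be the set of r.v. T obeying the Markov condition $Y \leftrightarrow X \leftrightarrow T$. Then, if u is a monotonically increasing and strictly convex function, the IB curve can always be recovered by the solutions of $\arg\max_{T \in \Delta} \{\mathcal{L}_{\mathrm{IB},u}(T;\beta_u)\}$, with
$$\mathcal{L}_{\mathrm{IB},u}(T;\beta_u) = I(T;Y) - \beta_u u(I(X;T)). \tag{8}$$
That is, for each point $(I(X;T), I(T;Y))$ s.t. $dI(T;Y)/dI(X;T) > 0$ there is a unique $\beta_u \geq 0$ for which maximizing $\mathcal{L}_{\mathrm{IB},u}(T;\beta_u)$ achieves this solution. Furthermore, $\beta_u$ is strictly decreasing w.r.t. $I(X;T)$. We call $\mathcal{L}_{\mathrm{IB},u}$ the convex IB Lagrangian.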
The proof of this theorem can be found in Appendix C. Furthermore, by exploiting the IB curve duality (Lemma 10 of Gilad-Bachrach et al. [19]) we were able to derive other families of Lagrangians which allow for the exploration of the IB curve (Appendix G).
Remark 1.
Clearly, we can see how if u is the identity function (i.e., $u(I(X;T)) = I(X;T)$) then we end up with the normal IB Lagrangian. However, since the identity function is not strictly convex, it cannot ensure the exploration of the IB curve.
During the proof of this theorem we observed a relationship between the Lagrange multipliers and the solutions obtained of the normal IB Lagrangian $\mathcal{L}_{\mathrm{IB}}(T;\beta)$ and the convex IB Lagrangian $\mathcal{L}_{\mathrm{IB},u}(T;\beta_u)$. This relationship is formalized in the following corollary.
Corollary 1 (IB Lagrangian and IB convex Lagrangian connection).
Let $\mathcal{L}_{\mathrm{IB}}(T;\beta)$ be the IB Lagrangian and $\mathcal{L}_{\mathrm{IB},u}(T;\beta_u)$ the convex IB Lagrangian. Then, maximizing $\mathcal{L}_{\mathrm{IB}}(T;\beta)$ and $\mathcal{L}_{\mathrm{IB},u}(T;\beta_u)$ can obtain the same point in the IB curve if $\beta_u = \beta / u'(I(X;T))$, where $u'$ is the derivative of u.
This corollary allows us to better understand why the addition of u allows for the exploration of the IB curve in deterministic scenarios. If we note that for $\beta = 1$ we can obtain any point in the increasing region of the curve, then we clearly see how evaluating $u'$ for different values of $I(X;T)$ defines different values of $\beta_u$ that obtain such points. Moreover, it lets us see how if for $\beta = 0$ maximizing the IB Lagrangian could obtain any point $(I(X;T), H(Y))$ with $I(X;T) > H(Y)$, then the same happens for the convex IB Lagrangian.
4.2. Aiming for a Specific Compression Level
Let $\mathcal{B}_u$ denote the domain of Lagrange multipliers $\beta_u$ for which we can find solutions in the IB curve with the convex IB Lagrangian. Then, the convex IB Lagrangians do not only allow us to explore the IB curve with different $\beta_u$. They also allow us to identify the specific $\beta_u$ that obtains a given point $(I(X;T), I(T;Y))$, provided we know the IB curve in the information plane. Conversely, the convex IB Lagrangian allows finding the specific point $(I(X;T), I(T;Y))$ that is obtained by a given $\beta_u$.
Proposition 3 (Bijective Mapping between IB Curve Point and Convex IB Lagrange multiplier).
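Let the IB curve in the information plane be known; i.e., $I(T;Y) = f_{\mathrm{IB}}(I(X;T))$ is known. Then there is a bijective mapping from Lagrange multipliers $\beta_u \in \mathcal{B}_u \setminus \{0\}$ from the convex IB Lagrangian to points in the IB curve $(I(X;T), f_{\mathrm{IB}}(I(X;T)))$. Furthermore, these mappings are:
$$\beta_u = \frac{df_{\mathrm{IB}}(I(X;T))}{dI(X;T)} \frac{1}{u'(I(X;T))} \quad \text{and} \quad I(X;T) = (u')^{-1}\left( \frac{df_{\mathrm{IB}}(I(X;T))}{dI(X;T)} \frac{1}{\beta_u} \right), \tag{9}$$
where $u'$ is the derivative of u and $(u')^{-1}$ is the inverse of $u'$.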
This is especially interesting since in deterministic scenarios we know the shape of the IB curve (Proposition 2) and since the convex IB Lagrangians allow for the exploration of the IB curve (Theorem 2). A proof for Proposition 3 can be found in Appendix D.
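As an illustration of Proposition 3, consider the increasing segment of the deterministic IB curve, where the slope of the curve is 1 and the multiplier therefore reduces to $\beta_u = 1/u'(I(X;T))$. The sketch below evaluates this mapping and its inverse for $u(r) = r^{1+\alpha}$ and $u(r) = \exp(\eta r)$, and compares them with a brute-force maximization of $r - \beta_u u(r)$ over a grid. The values of $H(Y)$, the target compression level, and the function names are assumptions made for the example.

```python
import math

def beta_from_r_power(r, alpha):
    """beta_u = f'(r) / u'(r) with f'(r) = 1 and u(r) = r^(1 + alpha)."""
    return 1.0 / ((1.0 + alpha) * r ** alpha)

def r_from_beta_power(beta, alpha):
    """Inverse map r = (u')^{-1}(1 / beta) for u(r) = r^(1 + alpha)."""
    return (1.0 / ((1.0 + alpha) * beta)) ** (1.0 / alpha)

def beta_from_r_exp(r, eta):
    """Same map for u(r) = exp(eta * r), whose derivative is eta * exp(eta * r)."""
    return 1.0 / (eta * math.exp(eta * r))

def r_from_beta_exp(beta, eta):
    """Inverse map for u(r) = exp(eta * r)."""
    return -math.log(beta * eta) / eta

def grid_argmax(beta, u, h_y, steps=20000):
    """Brute-force maximizer of r - beta * u(r) on the increasing part of the curve."""
    grid = [h_y * i / steps for i in range(1, steps)]
    return max(grid, key=lambda r: r - beta * u(r))

h_y = math.log2(10)  # assumed H(Y), in bits
target = 2.0         # desired compression level, with 0 < target < H(Y)

beta_sq = beta_from_r_power(target, alpha=1.0)  # alpha = 1 is the squared case
beta_exp = beta_from_r_exp(target, eta=1.0)
r_sq = grid_argmax(beta_sq, lambda r: r ** 2, h_y)
r_exp = grid_argmax(beta_exp, lambda r: math.exp(r), h_y)
```

Both grid searches return a compression level within one grid step of the 2-bit target, so in this setting a single multiplier computed in closed form replaces a sweep over many values.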
Remark 2.
Note that the definition from Tishby et al. [1] only allows for a bijection between β and $I(X;T)$ if $f_{\mathrm{IB}}$ is a strictly concave and known function, and we have seen this is not the case in deterministic scenarios (Theorem 1).
A direct result derived from this proposition is that we know the domain of Lagrange multipliers, $\mathcal{B}_u$, which allows for the exploration of the IB curve if the shape of the IB curve is known. Furthermore, if the shape is not known we can at least bound that range.
Corollary 2 (Domain of Convex IB Lagrange Multiplier with Known IB Curve Shape).
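Let the IB curve in the information plane be $I(T;Y) = f_{\mathrm{IB}}(I(X;T))$ and let $I_{\max} = I(X;Y)$. Let also $I(X;T) = r_{\max}$ be the minimum mutual information s.t. $f_{\mathrm{IB}}(r_{\max}) = I_{\max}$; i.e., $r_{\max} = \min \{ r : f_{\mathrm{IB}}(r) = I_{\max} \}$. Then, the range of Lagrange multipliers that allow the exploration of the IB curve with the convex IB Lagrangian is $\mathcal{B}_u = [\beta_{u,\min}, \beta_{u,\max}]$, with
$$\beta_{u,\min} = \lim_{r \to r_{\max}^{-}} \frac{f'_{\mathrm{IB}}(r)}{u'(r)} \quad \text{and} \quad \beta_{u,\max} = \lim_{r \to 0^{+}} \frac{f'_{\mathrm{IB}}(r)}{u'(r)}, \tag{10}$$
where $f'_{\mathrm{IB}}(r)$ and $u'(r)$ are the derivatives of $f_{\mathrm{IB}}(I(X;T))$ and $u(I(X;T))$ w.r.t. $I(X;T)$ evaluated at r respectively. Also, note that there are some scenarios where $r_{\max} \to \infty$ (see, e.g., [39]); in these scenarios $\beta_{u,\min} = \lim_{r \to \infty} f'_{\mathrm{IB}}(r) / u'(r)$.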
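As a worked example, consider a deterministic scenario, where the IB curve has slope $f'_{\mathrm{IB}}(r) = 1$ for $r < H(Y)$ and $r_{\max} = H(Y)$. Taking the limits of $f'_{\mathrm{IB}}(r)/u'(r)$ at $r \to H(Y)^{-}$ (smallest multiplier) and $r \to 0^{+}$ (largest multiplier) for two illustrative choices of u, $u(r) = \exp(\eta r)$ with $\eta > 0$ and $u(r) = r^{1+\alpha}$ with $\alpha > 0$, gives:

```latex
% u(r) = exp(eta r), so u'(r) = eta exp(eta r):
\beta_{u,\min} = \frac{1}{\eta \exp(\eta H(Y))},
\qquad
\beta_{u,\max} = \frac{1}{\eta}.

% u(r) = r^{1+alpha}, so u'(r) = (1+alpha) r^{alpha}:
\beta_{u,\min} = \frac{1}{(1+\alpha) H(Y)^{\alpha}},
\qquad
\beta_{u,\max} = \lim_{r \to 0^{+}} \frac{1}{(1+\alpha) r^{\alpha}} = \infty.
```

The exponential choice therefore yields a bounded range of multipliers, whereas the power choice does not.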
Corollary 3 (Domain of Convex IB Lagrange Multiplier Bound).
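The range of the Lagrange multipliers that allow the exploration of the IB curve is contained by $[0, \beta_{u,\mathrm{top}}]$ which is also contained by $[0, \beta_{u,\mathrm{top}}^{+}]$, where
$$\beta_{u,\mathrm{top}} = \frac{\left( \inf_{\Omega_x \subset \mathcal{X}} \{ \beta_0(\Omega_x) \} \right)^{-1}}{\lim_{r \to 0^{+}} u'(r)} \quad \text{and} \quad \beta_{u,\mathrm{top}}^{+} = \frac{1}{\lim_{r \to 0^{+}} u'(r)}, \tag{11}$$
where $u'(r)$ is the derivative of $u(I(X;T))$ w.r.t. $I(X;T)$ evaluated at r, $\mathcal{X}$ is the set of possible realizations of X and $\beta_0$ and $\Omega_x$ are defined as in [27] (Note in [27] they consider the dual problem (see Appendix G), so when they refer to $\beta^{-1}$ it translates to β in this article). That is, $\mathcal{B}_u \subseteq [0, \beta_{u,\mathrm{top}}] \subseteq [0, \beta_{u,\mathrm{top}}^{+}]$.
Corollaries 2 and 3 allow us to reduce the range search for $\beta_u$ when we want to explore the IB curve. Practically, $\inf_{\Omega_x \subset \mathcal{X}} \{ \beta_0(\Omega_x) \}$ might be difficult to calculate so Wu et al. [27] derived an algorithm to approximate it. However, we still recommend setting the numerator to 1 for simplicity. The proofs for both corollaries are found in Appendix E and Appendix F.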
5. Experimental Support
In order to showcase our claims we use the MNIST [28] and the TREC-6 [29] datasets. We modify the nonlinear-IB method [26], which is a neural network that minimizes the cross-entropy while also minimizing a differentiable kernel-based estimate of $I(X;T)$ [40]. Then, we used this technique to maximize a lower bound on the convex IB Lagrangians by applying the functions u to the $I(X;T)$ estimate.
The network structure is the following: first, a stochastic encoder $T = f_{\mathrm{enc}}(X;\theta) + W$ with $W \sim \mathcal{N}(0, I_d)$ such that $T \in \mathbb{R}^d$, where d is the dimension of the bottleneck variable (Note that the encoder needs to be stochastic to (i) ensure a finite and well-defined mutual information [21,41] and (ii) make gradient-based optimization methods over the IB Lagrangian useful [41]). Second, a deterministic decoder $q_{\hat{Y}|T,\theta} = f_{\mathrm{dec}}(T;\theta)$. For the MNIST dataset both the encoder and the decoder are fully-connected networks, for a fair comparison with [26]. For the TREC-6 dataset, the encoder is a set of convolutions of word embeddings followed by a fully-connected network and the decoder is also a fully-connected network. For further details about the experiment setup, additional results for different values of $\alpha$ and $\eta$, and supplementary experimental results for different datasets and network architectures, please refer to Appendix H.
In Figure 1 we show our results for two particularizations of the convex IB Lagrangians:
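the power IB Lagrangians: $\mathcal{L}_{\mathrm{IB,pow}}(T;\beta_{\mathrm{pow}},\alpha) = I(T;Y) - \beta_{\mathrm{pow}} I(X;T)^{(1+\alpha)}$, $\alpha > 0$ (Note when $\alpha = 1$ we have the squared IB functional from Kolchinsky et al. [21]).
the exponential IB Lagrangians: $\mathcal{L}_{\mathrm{IB,exp}}(T;\beta_{\mathrm{exp}},\eta) = I(T;Y) - \beta_{\mathrm{exp}} \exp(\eta I(X;T))$, $\eta > 0$.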
We can clearly see how both Lagrangians are able to explore the IB curve (first column from Figure 1) and how the theoretical performance trend of the Lagrangians matches the experimental results (second and third columns from Figure 1). There are small mismatches between the theoretical and experimental performance. This is because using the nonlinear-IB, as stated by Kolchinsky et al. [21], does not guarantee that we find optimal representations due to factors like (i) inaccurate estimation of $I(X;T)$, (ii) restrictions on the structure of T, (iii) use of an estimation of the decoder instead of the real one and (iv) the typical non-convex optimization issues that arise with gradient-based methods. The main difference comes from the discontinuities in performance for increasing $\beta_u$, whose cause is still unknown (cf. Wu et al. [27]). It has been observed, however, that the bottleneck variable performs an intrinsic clusterization in classification tasks (see, for instance, [21,26,42] or Figure 2b). We observed how this clusterization matches with the quantized performance levels observed (e.g., compare Figure 2a with the top center graph in Figure 1), with maximum performance when the number of clusters is equal to the cardinality of Y and reducing performance with a reduction of the number of clusters, which is in line with the concurrent work from Wu and Fischer [43]. We do not have a mathematical proof for the exact relationship between these two phenomena; however, we agree with Wu et al. [27] that it is an interesting matter and hope this observation serves as motivation to derive new theory.
In practice, there are different criteria for choosing the function u. For instance, the exponential IB Lagrangian could be more desirable than the power IB Lagrangian when we want to draw the IB curve since it has a finite range of $\beta_u$. This is $\mathcal{B}_u \subseteq [0, \eta^{-1}]$ for the exponential IB Lagrangian vs. $\mathcal{B}_u \subseteq [0, \infty)$ for the power IB Lagrangian. Furthermore, there is a trade-off between (i) how much the selected u function resembles a linear function in our region of interest; e.g., with $\alpha$ or $\eta$ close to zero, since it will suffer from similar problems as the original IB Lagrangian; and (ii) how fast it grows in our region of interest; e.g., higher values of $\alpha$ or $\eta$, since it will suffer from value convergence; i.e., optimizing for separate values of $\beta_u$ will achieve similar levels of performance (Figure 3). Please refer to Appendix I for a more thorough explanation of these two phenomena.
Particularly, the value convergence phenomenon can be exploited in order to approximately obtain a particular level of compression $r^*$, both for known and unknown IB curves (see Appendix I or the example in Figure 4). For known IB curves, we also know the achieved predictability $f_{\mathrm{IB}}(r^*)$, which in deterministic scenarios is the same as the level of compression $r^*$. For this exploitation, we can employ the shifted version of the exponential IB Lagrangian (which is also a particular case of the convex IB Lagrangian):
- the shifted exponential IB Lagrangians: $\mathcal{L}_{\mathrm{IB,sh\text{-}exp}}(T;\beta_{\mathrm{sh\text{-}exp}},\eta,r^*) = I(T;Y) - \beta_{\mathrm{sh\text{-}exp}} \exp(\eta (I(X;T) - r^*))$, $\eta > 0$, $r^* \in [0, \infty)$.
For this Lagrangian, the optimization procedure converges to representations with approximately the desired compression level $r^*$ if the hyperparameter $\eta$ is set to a large value.
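The effect of a large $\eta$ can be seen in closed form on the deterministic IB curve. The sketch below assumes the shifted exponential function $u(r) = \exp(\eta (r - r^*))$ and the increasing segment $I(T;Y) = I(X;T)$, so that the objective along the curve is $r - \beta \exp(\eta (r - r^*))$; the values of $H(Y)$, $r^*$, $\eta$ and the multipliers are illustrative choices, not the settings used in the experiments.

```python
import math

def shifted_exp_maximizer(beta, eta, r_star, h_y):
    """Maximizer of r - beta * exp(eta * (r - r_star)) on the deterministic curve.

    Setting the derivative 1 - beta * eta * exp(eta * (r - r_star)) to zero
    gives r = r_star - log(beta * eta) / eta, which is clipped to [0, H(Y)]
    because the objective is concave there and decreasing beyond H(Y).
    """
    r = r_star - math.log(beta * eta) / eta
    return min(max(r, 0.0), h_y)

h_y = math.log2(10)             # assumed H(Y), in bits
r_star = 2.0                    # desired compression level (illustrative)
betas = [0.01, 0.1, 1.0, 10.0]  # multipliers spanning three orders of magnitude

spread_large_eta = [shifted_exp_maximizer(b, 200.0, r_star, h_y) for b in betas]
spread_small_eta = [shifted_exp_maximizer(b, 1.0, r_star, h_y) for b in betas]
```

With $\eta = 200$ every multiplier in the list lands within 0.05 bits of $r^*$, whereas with $\eta = 1$ the same multipliers spread over the whole range $[0, H(Y)]$. The offset $\log(\beta \eta)/\eta$ shrinks as $\eta$ grows, which is the value convergence being exploited.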
In Figure 4 we show the results of aiming for a compression level of bits in the MNIST dataset and of bits in the TREC-6 dataset, both with . We can see how for different values of we can obtain the same desired compression level, which makes this method stable to variations in the Lagrange multiplier selection.
To sum up, in order to achieve a desired level of performance with the convex IB Lagrangian as an objective one should:
In a deterministic or close to a deterministic setting (see the $\epsilon$-deterministic definition in Kolchinsky et al. [21]): Use the adequate $\beta_u$ for that performance using Proposition 3. Then if the performance is lower than desired, i.e., we are placed in the wrong performance plateau, gradually reduce the value of $\beta_u$ until reaching the previous performance plateau. Alternatively, exploit the value convergence phenomenon with, for instance, the shifted exponential IB Lagrangian.
In a stochastic setting: exploit the value convergence phenomenon with, for instance, the shifted exponential IB Lagrangian. Alternatively, draw the IB curve with multiple values of $\beta_u$ on the range defined by Corollary 3 and select the representations that best fit one's interests.
6. Conclusions
The information bottleneck is a widely used and studied technique. However, it is known that the IB Lagrangian cannot be used to achieve varying levels of performance in deterministic scenarios. Moreover, in order to achieve a particular level of performance, multiple optimizations with different Lagrange multipliers must be done to draw the IB curve and select the best traded-off representation.
In this article we introduced a general family of Lagrangians which allows us to (i) achieve varying levels of performance in any scenario, and (ii) pinpoint a specific Lagrange multiplier $\beta_u$ to optimize for a specific performance level in known IB curve scenarios; e.g., deterministic. Furthermore, we showed the $\beta_u$ domain when the IB curve is known and a $\beta_u$ domain bound for exploring the IB curve when it is unknown. This way we can reduce and/or avoid multiple optimizations and, hence, reduce the computational effort for finding well traded-off representations. Moreover, (iii) we saw how we can exploit the value convergence issue of the convex IB Lagrangian to approximately obtain a specific compression level for both known and unknown IB curve shapes. Finally, (iv) we provided some insight into the discontinuities on the performance levels w.r.t. the Lagrange multipliers by connecting those with the intrinsic clusterization of the bottleneck variable.
Acknowledgments
We want to thank the anonymous reviewers for their insightful comments.
Appendix A. Proof of Proposition 1
Proof.
We can easily prove this statement by finding is lower bounded by the where and C does not depend on T. This way maximizing such lower bound would be equivalent to minimizing and, moreover, it would imply maximizing .
We can find such an expression as follows:
(A1)
(A2)
(A3)
(A4)
Here, in Equation (A1) we just used the definition of the mutual information between two random variables, and then we decoupled it using the definition of the entropy of a variable (note that we used the entropy notation usually employed for discrete variables; in this setting, however, it could also refer to the differential entropy of a continuous random variable, since we employed the general definition using the expectation). Then, in Equation (A2) we only multiplied and divided by inside the logarithm and employed the definition of the Kullback–Leibler divergence. Finally, in Equation (A3) we first used the fact that the Kullback–Leibler divergence is always non-negative (Theorem 2.6.3 from Cover and Thomas [20]) and then the properties of the Markov chain .
Therefore, since does not depend on T and we have a negative multiplicative term on the proposition is proved. □
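The chain of steps described in this proof can be written out as follows. This is only a sketch under assumed notation: $q(y|t)$ stands for the decoder's estimate of $p(y|t)$, the last expectation is read as the negative cross-entropy, and the grouping of lines is not claimed to match Equations (A1) to (A4) exactly.

```latex
\begin{align}
I(T;Y) &= H(Y) - H(Y|T) = H(Y) + \mathbb{E}_{p(t,y)}\left[\log p(y|t)\right] \\
&= H(Y) + \mathbb{E}_{p(t,y)}\left[\log q(y|t)\right]
   + \mathbb{E}_{p(t)}\left[D_{\mathrm{KL}}\left(p(y|t)\,\|\,q(y|t)\right)\right] \\
&\geq H(Y) + \mathbb{E}_{p(t,y)}\left[\log q(y|t)\right] \\
&= H(Y) + \mathbb{E}_{p(x,y)}\left[\mathbb{E}_{p(t|x)}\left[\log q(y|t)\right]\right].
\end{align}
```

The inequality drops the non-negative Kullback–Leibler term, and the last equality uses $p(t,y) = \sum_x p(x,y)\,p(t|x)$, which holds because T depends on Y only through X.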
Appendix B. Alternative Proof of Theorem 1
Proof.
We will prove the enumerated statements in order, since the proof of the third one relies on the first two.
Proposition 2 states that the IB curve in the information plane follows the equation if . Then, since [1], we know in all these points. Therefore, for all points () such that are solutions of optimizing the IB Lagrangian.
Similarly, Proposition 2 states that the IB curve follows the equation if . Then, since [1], we know in all points such that . We cannot ensure it at since for .
Finally, in order to prove the last statement we will first prove that if achieves a solution, it is . Then, we will prove that if the solution exists, it can be yielded by any . Hence, the solution is achieved and it is the only solution achievable.
(a) Since the IB curve is concave we know is non-increasing in . We also know at the points in the IB curve where and at the points in the IB curve where . Hence, if we achieve a solution with , this solution is .
(b) We can upper bound the IB Lagrangian by
(A5)
where the first and second inequalities use the DPI (Theorem 2.8.1 from Cover and Thomas [20]).
Then, we can consider the point of the IB curve . Since the function is concave a tangent line to exists such that all other points in the curve lie below this line. Let be the slope of this curve (which we know it is from Tishby et al. [1]). Then,
(A6)
As we see, by the upper bound on the IB Lagrangian from Equation (A5), if the point exists, any can be the slope of the tangent line to that ensures concavity. □
Appendix C. Proof of Theorem 2
Proof.
We start the proof by remembering the optimization problem at hand (Definition 1):
(A7)
We can modify the optimization problem by
(A8)
iff u is a monotonically non-decreasing function since otherwise would not hold necessarily. Now, let us assume and s.t. maximizes over all , and . Then, we can operate as follows:
(A9)
(A10)
(A11)
Here, the equality from Equation (A9) comes from the fact that since , then s.t. . Then, the inequality from Equation (A10) holds since we have expanded the optimization search space. Finally, in Equation (A11) we use that maximizes and that .
Now, we can exploit that and do not depend on T and drop them in the maximization in Equation (A10). We can then realize we are maximizing over ; i.e.,
(A12)
(A13)
Therefore, since satisfies both the maximization with and the constraint , maximizing obtains .
Now, we know if such exists, then the solution of the Lagrangian will be a solution for . Then, if we consider Theorem 6 from the Appendix of Courcoubetis [22] and consider the maximization problem instead of the minimization problem, we know if both and are concave functions, then a set of Lagrange multipliers exists with these conditions. We can make this consideration because f is concave if is convex and . We know is a concave function of T for (Lemma 5 of Gilad-Bachrach et al. [19]) and is convex w.r.t. T given is fixed (Theorem 2.7.4 of Cover and Thomas [20]). Thus, if we want to be concave we need u to be a convex function.
Finally, we will look at the conditions of u so that for every point in the IB curve, there exists a unique s.t. is maximized. That is, the conditions of u s.t. . For this purpose we will look at the solutions of the Lagrangian optimization:
(A14)
Now, if we integrate both sides of Equation (A14) over all we obtain
(A15)
where is the Lagrange multiplier from the IB Lagrangian [1] and is . Also, if we want to avoid indeterminations of we need not to be 0. Since we already imposed u to be monotonically non-decreasing, we can solve this issue by strengthening this condition. That is, we will require u to be monotonically increasing.
We would like to be continuous, this way there would be a unique for each value of . We know is a non-increasing function of (Lemma 6 of Gilad-Bachrach et al. [19]). Hence, if we want to be a strictly decreasing function of , we will require to be a strictly increasing function of . Therefore, we will require u to be a strictly convex function.
Thus, if u is a strictly convex and monotonically increasing function, for each point in the IB curve s.t. there is a unique for which maximizing achieves this solution. □
Appendix D. Proof of Proposition 3
Proof.
In Theorem 2 we showed how each point of the IB curve can be found with a unique maximizing . Therefore, since we also proved is strictly concave w.r.t. T we can find the values of that maximize the Lagrangian for fixed .
First, we look at the solutions of the Lagrangian maximization:
(A16)
Then, as before, we can integrate at both sides for all and solve for :
(A17)
Moreover, since u is a strictly convex function its derivative is strictly increasing. Hence, is an invertible function (since a strictly increasing function is bijective onto its image and a function is invertible iff it is bijective by definition). Now, if we consider to be known and to be the unknown we can solve for and get:
(A18)
Note we require not to be 0 so the mapping is defined. □
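The mapping established in this proof can be illustrated numerically. The sketch below assumes the convex IB Lagrangian has the form I(T;Y) − β_u u(I(X;T)), so that the mapping reads β_u = β / u'(r) and r = (u')^{-1}(β / β_u), with β the slope of the IB curve at compression r. The function names and the choice u(r) = r² are illustrative assumptions, not part of the original text.

```python
from typing import Callable


def beta_u_from_rate(r: float, slope: float, du: Callable[[float], float]) -> float:
    """Multiplier targeting compression r, assuming beta_u = slope / u'(r)."""
    derivative = du(r)
    if derivative == 0:
        raise ValueError("u'(r) must be non-zero for the mapping to be defined")
    return slope / derivative


def rate_from_beta_u(beta_u: float, slope: float,
                     du_inv: Callable[[float], float]) -> float:
    """Compression reached by beta_u, assuming r = (u')^{-1}(slope / beta_u)."""
    return du_inv(slope / beta_u)


# Illustrative choice (an assumption): u(r) = r**2, so u'(r) = 2r.
def du(r: float) -> float:
    return 2.0 * r


# Inverse of u' for the choice above: (u')^{-1}(z) = z / 2.
def du_inv(z: float) -> float:
    return z / 2.0


# Deterministic setting: the IB curve has slope 1 below its plateau.
beta_u = beta_u_from_rate(r=1.5, slope=1.0, du=du)
recovered = rate_from_beta_u(beta_u, slope=1.0, du_inv=du_inv)
print(beta_u, recovered)
```

The round trip recovers the targeted compression level up to floating-point error, and the explicit check on u'(r) mirrors the requirement that the derivative must not vanish.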
Appendix E. Proof of Corollary 2
Proof.
We will start the proof by proving the following useful Lemma.
Lemma A1.
Let be a convex IB Lagrangian, then .
Proof.
Since , maximizing this Lagrangian is directly maximizing . We know is a concave function of T for (Theorem 2.7.4 from Cover and Thomas [20]); hence it has a supremum. We also know . Moreover, we know can be achieved if, for example, Y is a deterministic function of T (since then the Markov Chain is formed). Thus, . □
For we know maximizing we can obtain the point in the IB curve (Lemma A1). Moreover, we know that for every point such that , s.t. achieves that point (Theorem 2). Thus, s.t. is achieved. From Proposition 3 we know this is given by
(A19)
Since we know is a concave non-decreasing function in (Lemma 5 of Gilad-Bachrach et al. [19]) we know it is continuous in this interval. In addition we know is strictly decreasing w.r.t. (Theorem 2). Furthermore, by definition of and knowing we know , . Therefore, we cannot ensure the exploration of the IB curve for s.t. .
Then, since u is a strictly increasing function in , is positive in that interval. Hence, taking into account is strictly decreasing we can find a maximum when approaches 0. That is,
(A20)
□
Appendix F. Proof of Corollary 3
Proof.
If we use Corollary 2, it is straightforward to see that if and for all IB curves and functions u. Therefore, we look at a domain bound dependent on the function choice. That is, if we can find and for all IB curves and all values of r, then
(A21)
The region for all possible IB curves regardless of the relationship between X and Y is depicted in Figure A1. The hard limits are imposed by the DPI (Theorem 2.8.1 from Cover and Thomas [20]) and the fact that the mutual information is non-negative (Corollary with Equation 2.90 for discrete and first Corollary of Theorem 8.6.1 for continuous random variables from Cover and Thomas [20]). Hence, the minimum and maximum values of are given by the minimum and maximum values of the slope of the Pareto frontier, which means
(A22)
Note since u is monotonically increasing and, thus, will never be 0.
Then, we can tighten the bound using the results from Wu et al. [27], where, in Theorem 2, they showed the slope of the Pareto frontier could be bounded at the origin by . Finally, we know that in deterministic classification tasks , which aligns with Kolchinsky et al. [21] and what we can observe from Figure A1. Therefore,
(A23)
□
Appendix G. Other Lagrangian Families
We can use the same ideas we used for the convex IB Lagrangian to formulate new families of Lagrangians that allow the exploration of the IB curve. For that, we will use the duality of the IB curve (Lemma 10 of [19]). That is:
Definition A1 (IB Dual Functional).
Let X and Y be statistically dependent variables. Let also Δ be the set of random variables T obeying the Markov condition . Then the IB dual functional is
(A24)
Theorem A1 (IB Curve Duality).
Let the IB curve be defined by the solutions of for varying . Then,
(A25) and
(A26)
From this definition, it follows that minimizing the dual IB Lagrangian, , for is equivalent to maximizing the IB Lagrangian. In fact, the original Lagrangian for solving the problem was defined this way [1]. We decided to use the maximization version because the domain of useful is bounded while it is not for .
Following the same reasoning as we did in the proof of Theorem 2, we can ensure the IB curve can be explored if:
We minimize the concave IB Lagrangian .
We maximize the dual concave IB Lagrangian .
We minimize the dual convex IB Lagrangian .
Here, u is a monotonically increasing strictly convex function, v is a monotonically increasing strictly concave function, and are the Lagrange multipliers of the families of Lagrangians defined above.
In a similar manner, one could obtain relationships between the Lagrange multipliers of the IB Lagrangian and the convex IB Lagrangian with these Lagrangian families. For instance, the convex IB Lagrangian is related to the concave IB Lagrangian as defined by Proposition A1.
Proposition A1 (Relationship between the convex and concave IB Lagrangians).
Consider the convex and concave IB Lagrangians , . Let the IB curve defined as in Definition 2 be . Then, if we fix the functions u and v we can obtain the same point in the IB curve with both Lagrangians when
(A27) or equivalently,
(A28)
Proof.
If we proceed like we did in the proof of Proposition 3 we can find the mapping between and and between and . That is,
(A29)
Then, if we recall that , we can directly obtain that
(A30)
Then, if we solve Equation (A30) with a fixed point for we obtain Equation (A27), and if we solve it for we obtain Equation (A28). □
Also, one could find a range of values for these Lagrangians to allow for the IB curve exploration and define a bijective mapping between their Lagrange multipliers and the IB curve. However, (i) as mentioned in Section 2.2, is particularly interesting to maximize without transformations because of its meaning. Moreover, (ii) like , the domain of useful and is not upper bounded. These two reasons make these other Lagrangians less preferable. We only include them here for completeness. Nonetheless, we encourage the curious reader to explore these families of Lagrangians too. For example, an interesting line of research would be to investigate whether some particularization of the concave IB Lagrangian suffers from an issue like value convergence that can be exploited to approximately obtain any predictability level for many values of .
Appendix H. Experimental Setup Details and Further Experiments
In order to generate empirical support for our claims, we performed several experiments on different datasets with different neural network architectures and different ways of calculating the information bottleneck.
Appendix H.1. Information Bottleneck Calculations
The information bottleneck is calculated by modifying the nonlinear-IB [26]. This method trains a neural network that minimizes the cross-entropy while also minimizing an upper bound estimate of the mutual information I(X;T). The nonlinear-IB relies on a kernel-based estimate of this mutual information [40]. We modify this calculation method by applying the function u to the estimate.
For the nonlinear-IB calculations, we estimated the gradients of both and the cross-entropy with the same mini-batch. Moreover, we did not learn the covariance of the mixture of Gaussians used for the kernel density estimation of and we set it to .
For all the experiments, we assumed a Gaussian stochastic encoder with , where d is the number of dimensions of the representations. We trained the neural networks with the Adam optimization algorithm [46] with a learning rate of and a decay rate every 10 epochs. We used a batch size of 128 samples and all the weights were initialized according to the method described by Glorot and Bengio [47] using a Gaussian distribution.
Then, we used the DBSCAN algorithm [44,45] for clustering. Particularly, we used the scikit-learn [48] implementation with and min_samples = 50.
The reader can find the PyTorch [30] implementation in the following link: https://github.com/burklight/convex-IB-Lagrangian-PyTorch.
Appendix H.2. The Experiments
We performed experiments on four different datasets:
A Classification Task on the MNIST Dataset [28] (Figure 1, Figure 2, Figure A2, Figure A3 and Figure A4 and top row from Figure 3). This dataset contains 60,000 training samples and 10,000 testing samples of hand-written digits. The samples are 28x28 pixels and are labeled from 0 to 9; i.e., and . The data is pre-processed so that the input has zero mean and unit variance. This is a deterministic setting, hence the experiment is designed to showcase how the convex IB Lagrangians allow us to explore the IB curve in a setting where the normal IB Lagrangian cannot, as well as the relationship between the performance plateaus and the clusterization phenomena. Furthermore, it intends to showcase the behavior of the power and exponential Lagrangians with different parameters of and . Finally, it aims to demonstrate how the value convergence can be employed to approximately obtain a specific compression value. In this experiment, the encoder has three fully-connected layers, with 800 ReLU units on the first two layers and two linear units on the last layer (), and the decoder is a fully-connected layer of 800 ReLU units followed by an output layer with 10 softmax units. The convex IB Lagrangian was calculated using the nonlinear-IB.
In Figure A2 we show how the IB curve can be explored with different values of for the power IB Lagrangian and in Figure A3 for different values of and the exponential IB Lagrangian.
Finally, in Figure A4 we show the clusterization for the same values of and as in Figure A2 and Figure A3. In this way the connection between the performance discontinuities and the clusterization is more evident. Furthermore, we can also observe how the exponential IB Lagrangian follows the theoretical performance more closely than the power IB Lagrangian (see Appendix I for an explanation of why).
A Classification Task on the Fashion-MNIST Dataset [49] (Figure A5). Like MNIST, this dataset contains 60,000 training and 10,000 testing samples of 28x28 pixel images labeled from 0 to 9 and constitutes a deterministic setting. The difference is that this dataset contains fashion products instead of hand-written digits and it represents a harder classification task [49]. The data is also pre-processed so that the input has zero mean and unit variance. For this experiment, the encoder is composed of a two-layer convolutional neural network (CNN) with 32 filters on the first layer and 128 filters on the second with kernels of size 5 and stride 2. This CNN is followed by two fully-connected layers of 128 linear units (). After the first convolution and the first fully-connected layer, a ReLU activation is employed. The decoder is a fully-connected layer of 128 ReLU units followed by an output layer with 10 softmax units. The convex IB Lagrangian was calculated using the nonlinear-IB. Therefore, this experiment intends to showcase how the convex IB Lagrangian can explore the IB curve for different neural network architectures and harder datasets.
A Regression Task on the California Housing Dataset [50] (Figure A6). This dataset contains 20,640 samples of 8 real number input variables like the longitude and latitude of the house (i.e., ) and a task output real variable representing the price of the house (i.e., ). We used the log-transformed house price as the target variable and dropped the 992 samples in which the house price was equal to or greater than $ so that the output distribution was closer to a Gaussian, as done in [26]. The input variables were processed so that they had zero mean and unit variance and we randomly split the samples into a 70% training and 30% test dataset. As in [40], for regression tasks we approximate H(Y) with the entropy of a Gaussian with variance Var(Y) and H(Y|T) with the entropy of a Gaussian with variance equal to the mean-squared error (MSE). This leads to the estimate I(T;Y) ≈ (1/2) log(Var(Y)/MSE). The encoder has three fully-connected layers, with 128 ReLU units on the first two layers and 2 linear units on the last layer (), and the decoder is a fully-connected layer of 128 ReLU units followed by an output layer with 1 linear unit. The convex IB Lagrangian was calculated using the nonlinear-IB. Hence, this experiment was designed to showcase that the convex IB Lagrangian can explore the IB curve in stochastic scenarios for regression tasks.
A Classification Task on the TREC-6 Dataset [29] (Figure A7 and bottom row from Figure 3). This dataset is the six-class version of the TREC [51] dataset. It contains 5452 training and 500 test samples of text questions. Each question is labeled within six different semantic categories based on what the answer is, namely: abbreviation, description and abstract concepts, entities, human beings, locations, and numeric values. This dataset does not constitute a deterministic setting since there are examples that could belong to more than one class and there are examples which are wrongly labeled (e.g., “What is a fear of parasites?” could belong to the description and abstract concept category; however, it is labeled into the entity category), and hence . Following Ben Trevett’s tutorial on Sentiment Analysis [52], the encoder is composed of a 100-dimensional GloVe word embedding pre-trained on 6 billion tokens [53], followed by a concatenation of three convolutions with kernel sizes 2–4 respectively, and finalized with a fully-connected layer of 128 linear units (). The decoder is a single fully-connected layer of 6 softmax units. The convex IB Lagrangian was calculated using the nonlinear-IB. Thus, this experiment intends to show an example where the classification task does not constitute a deterministic scenario, that the convex IB Lagrangian can recover the IB curve in complex stochastic tasks with complex neural network architectures, and that the value convergence can be employed to obtain a specific compression value even in stochastic settings where the IB curve is unknown.
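The Gaussian approximation used for the California Housing regression task can be sketched in a few lines. The difference of two Gaussian entropies, (1/2) log(2πe Var(Y)) − (1/2) log(2πe MSE), reduces to (1/2) log(Var(Y)/MSE). The function name, the use of the natural logarithm (nats), and the population variance are assumptions made for illustration; the released implementation may differ.

```python
import math
from typing import Sequence


def gaussian_mi_estimate(targets: Sequence[float],
                         predictions: Sequence[float]) -> float:
    """Estimate I(T;Y) in nats as 0.5 * log(Var(Y) / MSE)."""
    n = len(targets)
    mean_y = sum(targets) / n
    # Population variance of the targets, standing in for the variance behind H(Y).
    var_y = sum((y - mean_y) ** 2 for y in targets) / n
    # Mean-squared error of the decoder, standing in for the variance behind H(Y|T).
    mse = sum((y - p) ** 2 for y, p in zip(targets, predictions)) / n
    if var_y <= 0 or mse <= 0:
        raise ValueError("both the target variance and the MSE must be positive")
    return 0.5 * math.log(var_y / mse)


estimate = gaussian_mi_estimate([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5])
print(estimate)
```

A decoder that always predicts the mean of the targets has an MSE equal to Var(Y), so the estimate is zero, which matches the intuition that such a representation carries no information about Y.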
Appendix I. Guidelines for Selecting A Proper Function in the Convex IB Lagrangian
When choosing the right u function, it is important to find the right balance between avoiding value convergence and aiming for strong convexity. Practically, this balance is found by looking at how much faster u grows w.r.t. the identity function.
When the aim is not to draw the IB curve but to find a specific level of performance, we can exploit the value convergence phenomenon in order to design a stable performance targeted u function.
Appendix I.1. Avoiding Value Convergence
In order to explain this issue we are going to use the example of classification on MNIST [28], where , and again the power and exponential IB Lagrangians.
If we use Proposition 3 on both Lagrangians we obtain the bijective mapping between their Lagrange multipliers and a certain level of compression in the classification setting:
Power IB Lagrangian: and .
Exponential IB Lagrangian: and .
Hence, we can simply plot the curves of vs. for different hyperparameters and (see Figure A8). In this way, we can observe how increasing the growth of the function (e.g., increasing or in this case) too much causes many different values of to converge to very similar values of . This is an issue both for drawing the curve (for obvious reasons) and for aiming for a specific performance level. Due to the nature of the estimation of the IB Lagrangian, the theoretical and practical value of that yields a specific may vary slightly (see Figure 1). Then, if we select a function with too high growth, a small change in can result in a big change in the performance obtained.
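The value convergence effect can be reproduced numerically. The sketch below assumes the power and exponential functions are u(r) = r^(1+α) and u(r) = exp(ηr), and that the slope of the IB curve is 1 (the deterministic case below the plateau), so that Proposition 3 gives β = 1/u'(r). These forms and all names are assumptions for illustration, and the computed values are not capped at the plateau of the IB curve.

```python
import math
from typing import Callable, Sequence


def rate_power(beta: float, alpha: float) -> float:
    """r solving beta = 1 / ((1 + alpha) * r**alpha), for u(r) = r**(1 + alpha)."""
    return ((1.0 + alpha) * beta) ** (-1.0 / alpha)


def rate_exponential(beta: float, eta: float) -> float:
    """r solving beta = 1 / (eta * exp(eta * r)), for u(r) = exp(eta * r)."""
    return -math.log(beta * eta) / eta


def spread(rate_fn: Callable[[float, float], float], param: float,
           betas: Sequence[float]) -> float:
    """Width of the range of compression levels reached by the given multipliers."""
    rates = [rate_fn(b, param) for b in betas]
    return max(rates) - min(rates)


# Multipliers spanning two orders of magnitude (illustrative values).
betas = [1e-4, 1e-3, 1e-2]
for eta in (1.0, 10.0):
    print(f"eta={eta}: spread of r = {spread(rate_exponential, eta, betas):.3f}")
for alpha in (1.0, 10.0):
    print(f"alpha={alpha}: spread of r = {spread(rate_power, alpha, betas):.3f}")
```

For the exponential case the spread equals log(100)/η, so a ten times larger η squeezes the same multipliers into a ten times narrower band of compression levels.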
Appendix I.2. Aiming for Strong Convexity
Definition A2 (μ-Strong Convexity).
If a function f(r) is twice continuously differentiable and its domain is confined to the real line, then it is μ-strongly convex if f''(r) ≥ μ for all r in its domain.
Experimentally, we observed when the growth of our function is small in the domain of interest the convex IB Lagrangian does not perform well (see first row of Figure A2 and Figure A3). Later we realized that this was closely related to the strength of the convexity of our function.
In Theorem 2 we imposed the function u to be strictly convex to enforce having a unique for each value of . Hence, since in practice we are not exactly computing the Lagrangian but an estimation of it (e.g., with the nonlinear IB [26]) we require strong convexity in order to be able to explore the IB curve.
We now look at the second derivative of the power and exponential function: and respectively. Here we see how both functions are inherently 0-strong convex for and . However, values of and could lead to low μ-strong convexity in certain domains of r. Particularly, the case of is dangerous because the function approaches 0-strong convexity as r increases, so the power IB Lagrangian performs poorly when low are used to find high performances.
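The strength of the convexity over a domain of interest can be checked with a small grid search. The sketch assumes u(r) = r^(1+α) and u(r) = exp(ηr), whose second derivatives are α(1+α) r^(α−1) and η² exp(ηr). The helper name and the grid-based estimate of μ are illustrative choices, not part of the original text.

```python
import math
from typing import Callable


def d2_power(r: float, alpha: float) -> float:
    """Second derivative of u(r) = r**(1 + alpha)."""
    return alpha * (1.0 + alpha) * r ** (alpha - 1.0)


def d2_exponential(r: float, eta: float) -> float:
    """Second derivative of u(r) = exp(eta * r)."""
    return eta ** 2 * math.exp(eta * r)


def strong_convexity_constant(d2: Callable[[float, float], float], param: float,
                              r_min: float, r_max: float, steps: int = 1000) -> float:
    """Smallest second derivative on a uniform grid over [r_min, r_max]."""
    grid = [r_min + (r_max - r_min) * i / steps for i in range(steps + 1)]
    return min(d2(r, param) for r in grid)


# Illustrative domain of compression levels.
for alpha in (0.5, 2.0):
    mu = strong_convexity_constant(d2_power, alpha, r_min=0.1, r_max=3.0)
    print(f"power, alpha={alpha}: mu is about {mu:.3f}")
mu_exp = strong_convexity_constant(d2_exponential, 1.0, r_min=0.1, r_max=3.0)
print(f"exponential, eta=1.0: mu is about {mu_exp:.3f}")
```

With α below 1 the minimum sits at the largest r in the domain, which is the behavior described above: the convexity weakens as the compression level grows.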
Appendix I.3. Exploiting Value Convergence
When the aim is not to draw or explore the IB curve, but to obtain a specific level of performance, the aforementioned power or exponential IB Lagrangians might not be the best choice due to the problems with value convergence or non-strong convexity. However, we can exploit the former in order to design a performance-targeted u function.
For instance, if we look at Figure A8 we can see how a modification of the exponential IB Lagrangian could result in such a function. More precisely, a shifted exponential , with sufficiently large, converges to the compression level . We can see this more clearly if we consider the shifted exponential IB Lagrangian , since then the application of Proposition 3 results in , where is the derivative of evaluated at . We know in deterministic scenarios (Theorem 2) and that otherwise (see, e.g., [27]). Then, for large enough , regardless of the value of .
For instance, if we consider a deterministic scenario like the MNIST dataset [28] with , for and the range of the Lagrange multipliers that allow the exploration of the IB curve, according to Corollary 2, is . Furthermore, is close to 2 for many values of . For instance, for and for . This ensures a stability in the performance level obtained so that small changes in the choice of do not result in significant changes on the performance (e.g., see top row from Figure 4).
If we now consider a stochastic scenario like the TREC-6 dataset [29] with , for and the range of the Lagrange multipliers that allow the exploration of the IB curve, according to Corollary 3, is , where and are defined as in [27]. Then, unless is of the order of , the range of possible betas is wide. Moreover, is close to 16 for many values of . For example, if at that point and if for ; and if at that point and if for . Hence, as in the deterministic scenario, the performance level obtained is stable with changes in the choice of (e.g., see bottom row from Figure 4).
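The stability of the shifted exponential can be checked with a short computation. The sketch assumes the shifted function is u(r) = exp(η(r − r*)), with r* the targeted compression level, so that Proposition 3 gives r = r* − log(β_u η / β)/η, where β is the slope of the IB curve. The function name, the default slope of 1 (the deterministic case), and the multiplier values are illustrative assumptions.

```python
import math


def rate_shifted_exponential(beta_u: float, eta: float, r_target: float,
                             slope: float = 1.0) -> float:
    """r solving beta_u = slope / u'(r), for u(r) = exp(eta * (r - r_target))."""
    return r_target - math.log(beta_u * eta / slope) / eta


# Illustrative values: target compression 2 and eta = 200, with
# multipliers spanning thirteen orders of magnitude.
multipliers = [1e-3, 1.0, 1e3, 1e10]
rates = [rate_shifted_exponential(b, eta=200.0, r_target=2.0) for b in multipliers]
for b, r in zip(multipliers, rates):
    print(f"beta_u={b:g} -> r={r:.3f}")
```

Because the multiplier only enters through log(β_u η / β)/η, a large η keeps the reached compression level within a narrow band around the target even when the multiplier, or the unknown slope β, changes by many orders of magnitude.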
Author Contributions
Conceptualization, B.R.G. and R.T.; formal analysis, B.R.G.; funding acquisition, M.S.; methodology, B.R.G. and R.T.; resources, M.S.; software, B.R.G.; supervision, R.T. and M.S.; visualization, B.R.G.; writing—original draft, B.R.G.; writing—review and editing, B.R.G., R.T. and M.S. All authors have read and agreed to the published version of the manuscript.
Funding
This work was supported in part by the Swedish Research Council.
Conflicts of Interest
The authors declare no conflict of interest.
References
- 1. Tishby N., Pereira F.C., Bialek W. The information bottleneck method. arXiv. 2000. arXiv:physics/0004057
- 2. Alemi A.A., Fischer I., Dillon J.V., Murphy K. Deep variational information bottleneck. arXiv. 2016. arXiv:1612.00410
- 3. Peng X.B., Kanazawa A., Toyer S., Abbeel P., Levine S. Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow; Proceedings of the International Conference on Learning Representations (ICLR); New Orleans, LA, USA. 6–9 May 2019.
- 4. Achille A., Soatto S. Information dropout: Learning optimal representations through noisy computation. IEEE Trans. Pattern Anal. Mach. Intell. 2018;40:2897–2905. doi: 10.1109/TPAMI.2017.2784440.
- 5. Slonim N., Tishby N. Document clustering using word clusters via the information bottleneck method; Proceedings of the 23rd annual international ACM SIGIR Conference on Research and Development in Information Retrieval; Athens, Greece. 24–28 July 2000.
- 6. Slonim N., Tishby N. Advances in Neural Information Processing Systems. MIT Press; Cambridge, MA, USA: 2000. Agglomerative information bottleneck.
- 7. Slonim N., Atwal G.S., Tkačik G., Bialek W. Information-based clustering. Proc. Natl. Acad. Sci. USA. 2005;102:18297–18302. doi: 10.1073/pnas.0507432102.
- 8. Teahan W.J. Content-Based Multimedia Information Access. LE CENTRE DE HAUTES ETUDES INTERNATIONALES D’INFORMATIQUE DOCUMENTAIRE; Paris, France: 2000. Text classification and segmentation using minimum cross-entropy; pp. 943–961.
- 9. Strouse D., Schwab D.J. The deterministic information bottleneck. Neur. Comput. 2017;29:1611–1630. doi: 10.1162/NECO_a_00961.
- 10. Nazer B., Ordentlich O., Polyanskiy Y. Information-distilling quantizers; Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT); Aachen, Germany. 25–30 June 2017; pp. 96–100.
- 11. Hassanpour S., Wübben D., Dekorsy A. On the equivalence of double maxima and KL-means for information bottleneck-based source coding; Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC); Barcelona, Spain. 15–18 April 2018; pp. 1–6.
- 12. Goyal A., Islam R., Strouse D., Ahmed Z., Botvinick M., Larochelle H., Levine S., Bengio Y. Infobot: Transfer and exploration via the information bottleneck. arXiv. 2019. arXiv:1901.10902
- 13. Yingjun P., Xinwen H. Learning Representations in Reinforcement Learning: An Information Bottleneck Approach. arXiv. 2019. arXiv:1911.05695 [cs.LG]
- 14. Sharma A., Gu S., Levine S., Kumar V., Hausman K. Dynamics-Aware Unsupervised Skill Discovery; Proceedings of the International Conference on Learning Representations (ICLR); Addis Ababa, Ethiopia. 26–30 April 2020.
- 15. Schulz K., Sixt L., Tombari F., Landgraf T. Restricting the Flow: Information Bottlenecks for Attribution; Proceedings of the International Conference on Learning Representations (ICLR); Addis Ababa, Ethiopia. 26–30 April 2020.
- 16. Li X.L., Eisner J. Specializing Word Embeddings (for Parsing) by Information Bottleneck; Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP); Hong Kong, China. 3–7 November 2019; pp. 2744–2754.
- 17. Zaslavsky N., Kemp C., Regier T., Tishby N. Efficient compression in color naming and its evolution. Proc. Natl. Acad. Sci. USA. 2018;115:7937–7942. doi: 10.1073/pnas.1800521115.
- 18. Chalk M., Marre O., Tkačik G. Toward a unified theory of efficient, predictive, and sparse coding. Proc. Natl. Acad. Sci. USA. 2018;115:186–191. doi: 10.1073/pnas.1711114115.
- 19. Gilad-Bachrach R., Navot A., Tishby N. Learning Theory and Kernel Machines. Springer; Berlin, Germany: 2003. An information theoretic tradeoff between complexity and accuracy; pp. 595–609.
- 20. Cover T.M., Thomas J.A. Elements of Information Theory. John Wiley & Sons; Hoboken, NJ, USA: 2012.
- 21. Kolchinsky A., Tracey B.D., Van Kuyk S. Caveats for information bottleneck in deterministic scenarios; Proceedings of the International Conference on Learning Representations (ICLR); New Orleans, LA, USA. 6–9 May 2019.
- 22. Courcoubetis C. Pricing Communication Networks Economics, Technology and Modelling. Wiley Online Library; Hoboken, NJ, USA: 2003.
- 23. Tishby N., Slonim N. Advances in Neural Information Processing Systems. MIT Press; Cambridge, MA, USA: 2001. Data clustering by markovian relaxation and the information bottleneck method; pp. 640–646.
- 24. Slonim N., Friedman N., Tishby N. Unsupervised document classification using sequential information maximization; Proceedings of the 25th annual international ACM SIGIR Conference on Research and Development in Information Retrieval; Tampere, Finland. 11–15 August 2002.
- 25. Chalk M., Marre O., Tkacik G. Advances in Neural Information Processing Systems. MIT Press; Cambridge, MA, USA: 2016. Relevant sparse codes with variational information bottleneck; pp. 1957–1965.
- 26. Kolchinsky A., Tracey B.D., Wolpert D.H. Nonlinear information bottleneck. Entropy. 2019;21:1181. doi: 10.3390/e21121181.
- 27. Wu T., Fischer I., Chuang I., Tegmark M. Learnability for the Information Bottleneck; Proceedings of the International Conference on Learning Representations (ICLR); New Orleans, LA, USA. 6–9 May 2019.
- 28. LeCun Y., Bottou L., Bengio Y., Haffner P. Gradient-based learning applied to document recognition. Proc. IEEE. 1998;86:2278–2324.
- 29.Li X., Roth D. Proceedings of the 19th international conference on Computational linguistics—Volume 1. Association for Computational Linguistics; Stroudsburg, PA, USA: 2002. Learning question classifiers; pp. 1–7. [Google Scholar]
- 30.Paszke A., Gross S., Chintala S., Chanan G., Yang E., DeVito Z., Lin Z., Desmaison A., Antiga L., Lerer A. Automatic differentiation in pytorch; Proceedings of the NIPS Autodiff Workshop; Long Beach, CA, USA. 9 December 2017. [Google Scholar]
- 31.Bishop C.M. Pattern Recognition and Machine Learning. Springer Science+ Business Media; Berlin, Germany: 2006. [Google Scholar]
- 32.Xu A., Raginsky M. Advances in Neural Information Processing Systems. MIT Press; Cambridge, MA, USA: 2017. Information-theoretic analysis of generalization capability of learning algorithms; pp. 2524–2533. [Google Scholar]
- 33.Krizhevsky A., Sutskever I., Hinton G.E. Advances in Neural Information Processing Systems. MIT Press; Cambridge, MA, USA: 2012. Imagenet classification with deep convolutional neural networks; pp. 1097–1105. [Google Scholar]
- 34. Shore J.E., Gray R.M. Minimum cross-entropy pattern classification and cluster analysis. IEEE Trans. Pattern Anal. Mach. Intell. 1982;1:11–17. doi: 10.1109/TPAMI.1982.4767189.
- 35. Shore J., Johnson R. Properties of cross-entropy minimization. IEEE Trans. Inf. Theory. 1981;27:472–482. doi: 10.1109/TIT.1981.1056373.
- 36. Vera M., Piantanida P., Vega L.R. The role of the information bottleneck in representation learning; Proceedings of the 2018 IEEE International Symposium on Information Theory (ISIT); Vail, CO, USA. 17–22 June 2018; pp. 1580–1584.
- 37. Shamir O., Sabato S., Tishby N. Learning and generalization with the information bottleneck. Theor. Comput. Sci. 2010;411:2696–2711. doi: 10.1016/j.tcs.2010.04.006.
- 38. Achille A., Soatto S. Emergence of invariance and disentanglement in deep representations. J. Mach. Learn. Res. 2018;19:1947–1980.
- 39. Du Pin Calmon F., Polyanskiy Y., Wu Y. Strong data processing inequalities for input constrained additive noise channels. IEEE Trans. Inf. Theory. 2017;64:1879–1892. doi: 10.1109/TIT.2017.2782359.
- 40. Kolchinsky A., Tracey B. Estimating mixture entropy with pairwise distances. Entropy. 2017;19:361. doi: 10.3390/e19070361.
- 41. Amjad R.A., Geiger B.C. Learning representations for neural network-based classification using the information bottleneck principle. IEEE Trans. Pattern Anal. Mach. Intell. 2019:1. doi: 10.1109/TPAMI.2019.2909031.
- 42. Alemi A.A., Fischer I., Dillon J.V. Uncertainty in the variational information bottleneck. arXiv. 2018. arXiv:1807.00906.
- 43. Wu T., Fischer I. Phase Transitions for the Information Bottleneck in Representation Learning; Proceedings of the International Conference on Learning Representations (ICLR); Addis Ababa, Ethiopia. 26–30 April 2020.
- 44. Ester M., Kriegel H.P., Sander J., Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise; Proceedings of the Second International Conference on Knowledge Discovery and Data Mining; Menlo Park, CA, USA. 2–4 August 1996; pp. 226–231.
- 45. Schubert E., Sander J., Ester M., Kriegel H.P., Xu X. DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Trans. Database Syst. TODS. 2017;42:19. doi: 10.1145/3068335.
- 46. Kingma D.P., Ba J. Adam: A method for stochastic optimization. arXiv. 2014. arXiv:1412.6980.
- 47. Glorot X., Bengio Y. Understanding the difficulty of training deep feedforward neural networks; Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics; Sardinia, Italy. 13–15 May 2010; pp. 249–256.
- 48. Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V., et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011;12:2825–2830.
- 49. Xiao H., Rasul K., Vollgraf R. Fashion-MNIST: A Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv. 2017. arXiv:1708.07747.
- 50. Pace R.K., Barry R. Sparse spatial autoregressions. Stat. Probab. Lett. 1997;33:291–297. doi: 10.1016/S0167-7152(96)00140-X.
- 51. Voorhees E.M., Tice D.M. Building a question answering test collection; Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; Athens, Greece. 24–28 July 2000.
- 52. Trevett B. Tutorial on Sentiment Analysis: 5—Multi-class Sentiment Analysis. April 2019. Available online: https://github.com/bentrevett/pytorch-sentiment-analysis (accessed on 14 January 2020).
- 53. Pennington J., Socher R., Manning C. GloVe: Global vectors for word representation; Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP); Doha, Qatar. 25–29 October 2014; pp. 1532–1543.