Entropy. 2020 Jan 27;22(2):151. doi: 10.3390/e22020151

On the Information Bottleneck Problems: Models, Connections, Applications and Information Theoretic Views

Abdellatif Zaidi 1,2,*, Iñaki Estella-Aguerri 2, Shlomo Shamai (Shitz) 3
PMCID: PMC7516564  PMID: 33285926

Abstract

This tutorial paper focuses on the variants of the bottleneck problem taking an information theoretic perspective and discusses practical methods to solve it, as well as its connection to coding and learning aspects. The intimate connections of this setting to remote source-coding under logarithmic loss distortion measure, information combining, common reconstruction, the Wyner–Ahlswede–Korner problem, the efficiency of investment information, as well as generalization, variational inference, representation learning, autoencoders, and others are highlighted. We discuss its extension to the distributed information bottleneck problem with emphasis on the Gaussian model and highlight the basic connections to the uplink Cloud Radio Access Networks (CRAN) with oblivious processing. For this model, the optimal trade-offs between relevance (i.e., information) and complexity (i.e., rates) in the discrete and vector Gaussian frameworks are determined. In the concluding outlook, some interesting problems are mentioned, such as the characterization of the optimal input (“features”) distributions under power limitations maximizing the “relevance” for the Gaussian information bottleneck under “complexity” constraints.

Keywords: information bottleneck, rate distortion theory, logarithmic loss, representation learning

1. Introduction

A growing body of work focuses on developing learning rules and algorithms using information theoretic approaches (e.g., see [1,2,3,4,5,6] and references therein). Most relevant to this paper is the Information Bottleneck (IB) method of Tishby et al. [1], which seeks the right balance between data fit and generalization by using the mutual information as both a cost function and a regularizer. Specifically, IB formulates the problem of extracting the relevant information that some signal X ∈ 𝒳 provides about another one Y ∈ 𝒴 that is of interest as that of finding a representation U that is maximally informative about Y (i.e., large mutual information I(U;Y)) while being minimally informative about X (i.e., small mutual information I(U;X)). In the IB framework, I(U;Y) is referred to as the relevance of U and I(U;X) is referred to as the complexity of U, where complexity here is measured by the minimum description length (or rate) at which the observation is compressed. Accordingly, the performance of learning with the IB method and the optimal mapping of the data are found by solving the Lagrangian formulation

$\mathcal{L}_{\beta}^{\mathrm{IB},*} := \max_{P_{U|X}} \; I(U;Y) - \beta\, I(U;X),$  (1)

where PU|X is a stochastic map that assigns the observation X to a representation U from which Y is inferred and β is the Lagrange multiplier. Several methods, which we detail below, have been proposed to obtain solutions PU|X to the IB problem in Equation (4) in several scenarios, e.g., when the distribution of the sources (X,Y) is perfectly known or only samples from it are available.

The IB approach, as a method both to characterize performance limits and to design mappings, has found remarkable applications in supervised and unsupervised learning problems such as classification, clustering, and prediction. Perhaps key to the analysis and theoretical development of the IB method is its elegant connection with information-theoretic rate-distortion problems, as it is now well known that the IB problem is essentially a remote source coding problem [7,8,9] in which the distortion is measured under logarithmic loss. Recent works show that this connection turns out to be useful for a better understanding of the fundamental limits of learning problems, including the performance of deep neural networks (DNN) [10], the emergence of invariance and disentanglement in DNN [11], the minimization of PAC-Bayesian bounds on the test error [11,12], prediction [13,14], or as a generalization of the evidence lower bound (ELBO) used to train variational auto-encoders [15,16], geometric clustering [17], or extracting the Gaussian “part” of a signal [18], among others. Other connections that are more intriguing exist also with seemingly unrelated problems such as privacy and hypothesis testing [19,20,21] or multiterminal networks with oblivious relays [22,23] and non-binary LDPC code design [24]. More connections with other coding problems such as the problems of information combining and common reconstruction, the Wyner–Ahlswede–Korner problem, and the efficiency of investment information are unveiled and discussed in this tutorial paper, together with extensions to the distributed setting.

The abstract viewpoint of IB also seems instrumental to a better understanding of the so-called representation learning [25], which is an active research area in machine learning that focuses on identifying and disentangling the underlying explanatory factors that are hidden in the observed data in an attempt to render learning algorithms less dependent on feature engineering. More specifically, one important question, which is often controversial in statistical learning theory, is the choice of a “good” loss function that measures discrepancies between the true values and their estimated fits. There is however numerical evidence that models that are trained to maximize mutual information, or equivalently minimize the error’s entropy, often outperform ones that are trained using other criteria such as mean-square error (MSE) and higher-order statistics [26,27]. On this aspect, we also mention Fisher’s dissertation [28], which contains investigation of the application of information theoretic metrics to blind source separation and subspace projection using Renyi’s entropy as well as what appears to be the first usage of the now popular Parzen windowing estimator of information densities in the context of learning. Although a complete and rigorous justification of the usage of mutual information as cost function in learning is still awaited, recently, a partial explanation appeared in [29], where the authors showed that under some natural data processing property Shannon’s mutual information uniquely quantifies the reduction of prediction risk due to side information. Along the same line of work, Painsky and Wornell [30] showed that, for binary classification problems, by minimizing the logarithmic-loss (log-loss), one actually minimizes an upper bound to any choice of loss function that is smooth, proper (i.e., unbiased and Fisher consistent), and convex. Perhaps, this justifies partially why mutual information (or, equivalently, the corresponding loss function, which is the log-loss fidelity measure) is widely used in learning theory and has already been adopted in many algorithms in practice such as the infomax criterion [31], the tree-based algorithm of Quinlan [32], or the well known Chow–Liu algorithm [33] for learning tree graphical models, with various applications in genetics [34], image processing [35], computer vision [36], etc. The logarithmic loss measure also plays a central role in the theory of prediction [37] (Ch. 09) where it is often referred to as the self-information loss function, as well as in Bayesian modeling [38] where priors are usually designed to maximize the mutual information between the parameter to be estimated and the observations. The goal of learning, however, is not merely to learn model parameters accurately for previously seen data. Rather, in essence, it is the ability to successfully apply rules that are extracted from previously seen data to characterize new unseen data. This is often captured through the notion of “generalization error”. The generalization capability of a learning algorithm hinges on how sensitive the output of the algorithm is to modifications of the input dataset, i.e., its stability [39,40]. In the context of deep learning, it can be seen as a measure of how much the algorithm overfits the model parameters to the seen data. In fact, efficient algorithms should strike a good balance between their ability to fit training dataset and that to generalize well to unseen data. 
In statistical learning theory [37], such a dilemma is reflected in the fact that the minimization of the “population risk” (or “test error” in the deep learning literature) amounts to the minimization of the sum of two terms that are generally difficult to minimize simultaneously: the “empirical risk” on the training data and the generalization error. To prevent over-fitting, regularization methods can be employed, which include parameter penalization, noise injection, and averaging over multiple models trained with distinct sample sets. Although it is not yet very well understood how to optimally control model complexity, recent works [41,42] show that the generalization error can be upper-bounded using the mutual information between the input dataset and the output of the algorithm. This result actually formalizes the intuition that the less information a learning algorithm extracts from the input dataset, the less it is likely to overfit, and justifies, partly, the use of mutual information also as a regularizer term. The interested reader may refer to [43] where it is shown that regularizing with mutual information alone does not always capture all desirable properties of a latent representation. We also point out that there exists an extensive literature on building optimal estimators of information quantities (e.g., entropy, mutual information), as well as their Matlab/Python implementations, including in the high-dimensional regime (see, e.g., [44,45,46,47,48,49] and references therein).

This paper provides a review of the information bottleneck method, its classical solutions, and recent advances. In addition, in the paper, we unveil some useful connections with coding problems such as remote source-coding, information combining, common reconstruction, the Wyner–Ahlswede–Korner problem, the efficiency of investment information, CEO source coding under logarithmic-loss distortion measure, and learning problems such as inference, generalization, and representation learning. Leveraging these connections, we discuss its extension to the distributed information bottleneck problem with emphasis on its solutions and the Gaussian model and highlight the basic connections to the uplink Cloud Radio Access Networks (CRAN) with oblivious processing. For this model, the optimal trade-offs between relevance and complexity in the discrete and vector Gaussian frameworks are determined. In the concluding outlook, some interesting problems are mentioned, such as the characterization of the optimal input distributions under power limitations maximizing the “relevance” for the Gaussian information bottleneck under “complexity” constraints.

Notation

Throughout, uppercase letters denote random variables, e.g., X; lowercase letters denote realizations of random variables, e.g., x; and calligraphic letters denote sets, e.g., 𝒳. The cardinality of a set 𝒳 is denoted by |𝒳|. For a random variable X with probability mass function (pmf) PX, we use PX(x) = p(x), x ∈ 𝒳, for short. Boldface uppercase letters denote vectors or matrices, e.g., X, where context should make the distinction clear. For random variables (X1, X2, …) and a set of integers 𝒦 ⊆ ℕ, X𝒦 denotes the set of random variables with indices in the set 𝒦, i.e., X𝒦 = {Xk : k ∈ 𝒦}. If 𝒦 = ∅, then X𝒦 = ∅. For k ∈ 𝒦, we let X𝒦/k = (X1, …, Xk−1, Xk+1, …, XK), and assume that X0 = XK+1 = ∅. In addition, for zero-mean random vectors X and Y, the quantities Σx, Σx,y and Σx|y denote, respectively, the covariance matrix of the vector X, the cross-covariance matrix of the pair (X, Y), and the conditional covariance matrix of X given Y, i.e., Σx = E[XX^H], Σx,y := E[XY^H], and Σx|y = Σx − Σx,y Σy^{-1} Σy,x. Finally, for two probability measures PX and QX on the random variable X ∈ 𝒳, the relative entropy or Kullback–Leibler divergence is denoted as DKL(PX‖QX). That is, if PX is absolutely continuous with respect to QX, PX ≪ QX (i.e., for every x ∈ 𝒳, if PX(x) > 0, then QX(x) > 0), then DKL(PX‖QX) = EPX[log(PX(X)/QX(X))]; otherwise, DKL(PX‖QX) = +∞.

2. The Information Bottleneck Problem

The Information Bottleneck (IB) method was introduced by Tishby et al. [1] as a method for extracting the information that some variable X ∈ 𝒳 provides about another one Y ∈ 𝒴 that is of interest, as shown in Figure 1.

Figure 1. Information bottleneck problem.

Specifically, the IB method consists of finding the stochastic mapping PU|X : 𝒳 → 𝒰 that from an observation X outputs a representation U ∈ 𝒰 that is maximally informative about Y, i.e., large mutual information I(U;Y), while being minimally informative about X, i.e., small mutual information I(U;X) (As such, the usage of Shannon’s mutual information seems to be motivated by the intuition that such a measure provides a natural quantitative approach to the questions of meaning, relevance, and common-information, rather than the solution of a well-posed information-theoretic problem; a connection with source coding under logarithmic loss measure appeared later on in [50].) The auxiliary random variable U satisfies that U − X − Y is a Markov chain in this order; that is, that the joint distribution of (X,U,Y) satisfies

p(x,u,y)=p(x)p(y|x)p(u|x), (2)

and the mapping PU|X is chosen such that U strikes a suitable balance between the degree of relevance of the representation as measured by the mutual information I(U;Y) and its degree of complexity as measured by the mutual information I(U;X). In particular, such U, or effectively the mapping PU|X, can be determined to maximize the IB-Lagrangian defined as

$\mathcal{L}_{\beta}^{\mathrm{IB}}(P_{U|X}) := I(U;Y) - \beta\, I(U;X)$  (3)

over all mappings PU|X that satisfy the Markov chain U − X − Y, where the trade-off parameter β is a positive Lagrange multiplier associated with the constraint on I(U;Y).

Accordingly, for a given β and source distribution PX,Y, the optimal mapping of the data, denoted by PU|X*,β, is found by solving the IB problem, defined as

$\mathcal{L}_{\beta}^{\mathrm{IB},*} := \max_{P_{U|X}} \; I(U;Y) - \beta\, I(U;X),$  (4)

over all mappings PU|X that satisfy U − X − Y. It follows from the classical application of Carathéodory’s theorem [51] that, without loss of optimality, U can be restricted to satisfy |𝒰| ≤ |𝒳| + 1.

In Section 3 we discuss several methods to obtain solutions PU|X*,β to the IB problem in Equation (4) in several scenarios, e.g., when the distribution of (X,Y) is perfectly known or only samples from it are available.

2.1. The IB Relevance–Complexity Region

The optimization of the IB-Lagrangian Lβ in Equation (4) for a given β ≥ 0 and PX,Y results in an optimal mapping PU|X*,β and a relevance–complexity pair (Δβ, Rβ), where Δβ = I(Uβ;Y) and Rβ = I(Uβ;X) are, respectively, the relevance and the complexity resulting from generating Uβ with the solution PU|X*,β. By optimizing over all β ≥ 0, the resulting relevance–complexity pairs (Δβ, Rβ) characterize the boundary of the region of simultaneously achievable relevance–complexity pairs for a distribution PX,Y (see Figure 2). In particular, for a fixed PX,Y, we define this region as the union of relevance–complexity pairs (Δ, R) that satisfy

$\Delta \le I(U;Y), \qquad R \ge I(X;U)$  (5)

where the union is over all PU|X such that U − X − Y forms a Markov chain in this order. Any pair (Δ, R) outside of this region is not simultaneously achievable by any mapping PU|X.

Figure 2. Information bottleneck relevance–complexity region. For a given β, the solution PU|X*,β to the minimization of the IB-Lagrangian in Equation (3) results in a pair (Δβ, Rβ) on the boundary of the IB relevance–complexity region (colored in grey).

3. Solutions to the Information Bottleneck Problem

As shown in the previous section, the IB problem provides a methodology to design mappings PU|X performing at different relevance–complexity points within the region of feasible (Δ, R) pairs, characterized by the IB relevance–complexity region, by optimizing the IB-Lagrangian in Equation (3) for different values of β. However, in general, this optimization is challenging as it requires computation of mutual information terms.

In this section, we describe how, for a fixed parameter β, the optimal solution PU|Xβ,*, or an efficient approximation of it, can be obtained under: (i) particular distributions, e.g., Gaussian and binary symmetric sources; (ii) known general discrete memoryless distributions; and (iii) unknown distributions for which only samples are available.

3.1. Solution for Particular Distributions: Gaussian and Binary Symmetric Sources

In certain cases, when the joint distribution PX,Y is known, e.g., it is binary symmetric or Gaussian, information theoretic inequalities can be used to optimize the IB-Lagrangian in (4) in closed form.

3.1.1. Binary IB

Let X and Y be a pair of doubly symmetric binary sources (DSBS), i.e., (X,Y) ∼ DSBS(p) for some 0 ≤ p ≤ 1/2. (A DSBS is a pair (X,Y) of binary random variables X ∼ Bern(1/2), Y ∼ Bern(1/2), and X ⊕ Y ∼ Bern(p), where ⊕ is the sum modulo 2. That is, Y is the output of a binary symmetric channel with crossover probability p corresponding to the input X, and X is the output of the same channel with input Y.) Then, it can be shown that the optimal U in (4) is such that (X,U) ∼ DSBS(q) for some 0 ≤ q ≤ 1. Such a U can be obtained with the mapping PU|X such that

$U = X \oplus Q, \quad \text{with } Q \sim \mathrm{Bern}(q).$  (6)

In this case, straightforward algebra shows that the complexity level is given by

$I(U;X) = 1 - h_2(q),$  (7)

where, for 0 ≤ x ≤ 1, h_2(x) is the entropy of a Bernoulli(x) source, i.e., $h_2(x) = -x\log_2(x) - (1-x)\log_2(1-x)$, and the relevance level is given by

$I(U;Y) = 1 - h_2(p * q)$  (8)

where p * q = p(1 − q) + q(1 − p). The result extends easily to discrete symmetric mappings Y → X with binary X (one bit output quantization) and discrete non-binary Y.
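To make Equations (6)–(8) concrete, the short sketch below (our own illustration; the helper names are not from the paper) traces the binary IB relevance–complexity curve by sweeping q for a fixed crossover probability p.

```python
import numpy as np

def h2(x):
    """Binary entropy in bits, with h2(0) = h2(1) = 0."""
    x = np.clip(x, 1e-12, 1 - 1e-12)
    return -x * np.log2(x) - (1 - x) * np.log2(1 - x)

def binary_ib_point(p, q):
    """Complexity I(U;X) and relevance I(U;Y) for U = X xor Q, Q ~ Bern(q),
    when (X, Y) ~ DSBS(p); see Equations (7) and (8)."""
    complexity = 1 - h2(q)
    relevance = 1 - h2(p * (1 - q) + q * (1 - p))   # p * q convolution
    return complexity, relevance

p = 0.1
for q in np.linspace(0.0, 0.5, 6):
    R, delta = binary_ib_point(p, q)
    print(f"q={q:.1f}  I(U;X)={R:.3f}  I(U;Y)={delta:.3f}")
```

As q increases from 0 to 1/2, the representation is progressively compressed: the complexity drops from 1 bit to 0 and the relevance drops from 1 − h_2(p) to 0.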

3.1.2. Vector Gaussian IB

Let (X,Y) ∈ ℂ^{N_x} × ℂ^{N_y} be a pair of jointly Gaussian, zero-mean, complex-valued random vectors, of dimension N_x > 0 and N_y > 0, respectively. In this case, the optimal solution of the IB-Lagrangian in Equation (3) (i.e., the test channel PU|X) is a noisy linear projection to a subspace whose dimensionality is determined by the trade-off parameter β. The subspaces are spanned by basis vectors in a manner similar to the well known canonical correlation analysis [52]. For small β, only the vector associated with the dimension with the most energy, i.e., corresponding to the largest eigenvalue of a particular Hermitian matrix, will be considered in U. As β increases, additional dimensions are added to U through a series of critical points that are similar to structural phase transitions. This process continues until U becomes rich enough to capture all the relevant information about Y that is contained in X. In particular, the boundary of the optimal relevance–complexity region was shown in [53] to be achievable using a test channel PU|X, which is such that (U,X) is Gaussian. Without loss of generality, let

U=AX+ξ (9)

where A ∈ M_{N_u,N_x}(ℂ) is an N_u × N_x complex-valued matrix and ξ ∈ ℂ^{N_u} is a Gaussian noise that is independent of (X,Y) with zero mean and covariance matrix I_{N_u}. For a given non-negative trade-off parameter β, the matrix A has a number of rows that depends on β and is given by [54] (Theorem 3.1)

$A = \begin{cases} \big[\,\mathbf{0}^T; \ldots; \mathbf{0}^T\,\big], & 0 \le \beta < \beta_1^c \\ \big[\,\alpha_1 \mathbf{v}_1^T; \mathbf{0}^T; \ldots; \mathbf{0}^T\,\big], & \beta_1^c \le \beta < \beta_2^c \\ \big[\,\alpha_1 \mathbf{v}_1^T; \alpha_2 \mathbf{v}_2^T; \mathbf{0}^T; \ldots; \mathbf{0}^T\,\big], & \beta_2^c \le \beta < \beta_3^c \\ \quad \vdots & \end{cases}$  (10)

where {v_1^T, v_2^T, …, v_{N_x}^T} are the left eigenvectors of Σ_{x|y} Σ_x^{-1}, sorted by their corresponding ascending eigenvalues λ_1 ≤ λ_2 ≤ … ≤ λ_{N_x}. Furthermore, for i = 1, …, N_x, $\beta_i^c = \frac{1}{1-\lambda_i}$ are critical β-values, $\alpha_i = \sqrt{\frac{\beta(1-\lambda_i)-1}{\lambda_i r_i}}$ with $r_i = \mathbf{v}_i^T \Sigma_x \mathbf{v}_i$, 0^T denotes the N_x-dimensional zero vector, and semicolons separate the rows of the matrix. It is interesting to observe that the optimal projection consists of eigenvectors of Σ_{x|y} Σ_x^{-1}, combined in a judicious manner: for values of β that are smaller than β_1^c, reducing complexity is of prime importance, yielding extreme compression U = ξ, i.e., independent noise and no information preservation at all about Y. As β increases, the solution undergoes a series of critical points {β_i^c}, at each of which a new eigenvector is added to the matrix A, yielding a more complex but richer representation; the rank of A increases accordingly.
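The quantities entering Equation (10) are straightforward to compute numerically. The following NumPy sketch is our own construction and assumes real-valued covariance matrices for simplicity; it builds a toy jointly Gaussian pair, extracts the left eigenvectors of Σ_{x|y} Σ_x^{-1}, and assembles the active rows of A for a given β.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy jointly Gaussian (X, Y): X = H Y + N, all zero mean.
ny, nx = 3, 3
H = rng.standard_normal((nx, ny))
Sigma_y = np.eye(ny)
Sigma_n = 0.5 * np.eye(nx)
Sigma_x = H @ Sigma_y @ H.T + Sigma_n
Sigma_xy = H @ Sigma_y
Sigma_x_given_y = Sigma_x - Sigma_xy @ np.linalg.inv(Sigma_y) @ Sigma_xy.T

# Left eigenvectors of M = Sigma_{x|y} Sigma_x^{-1} are eigenvectors of M^T.
M = Sigma_x_given_y @ np.linalg.inv(Sigma_x)
eigvals, V = np.linalg.eig(M.T)
order = np.argsort(eigvals.real)                  # ascending lambda_1 <= ... <= lambda_nx
lam, V = eigvals.real[order], V.real[:, order]

beta_c = 1.0 / (1.0 - lam)                        # critical beta values
beta = 2.0 * beta_c[0]                            # an arbitrary operating point

rows = []
for i in range(nx):
    if beta <= beta_c[i]:
        break                                     # dimensions beyond the current critical point stay zero
    r_i = V[:, i] @ Sigma_x @ V[:, i]
    alpha_i = np.sqrt((beta * (1 - lam[i]) - 1) / (lam[i] * r_i))
    rows.append(alpha_i * V[:, i])
A = np.vstack(rows) if rows else np.zeros((0, nx))

print("eigenvalues:", np.round(lam, 3))
print("critical betas:", np.round(beta_c, 3))
print("number of active rows of A:", A.shape[0])
```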

For the specific case of scalar Gaussian sources, that is N_x = N_y = 1, e.g., X = √snr · Y + N, where N is standard Gaussian with zero mean and unit variance, the above result simplifies considerably. In this case, let, without loss of generality, the mapping PU|X be given by

$U = aX + Q$  (11)

where Q is a zero-mean Gaussian noise with variance σ_q², independent of X. In this case, for I(U;X) = R, we get

$I(U;Y) = \frac{1}{2}\log(1+\mathrm{snr}) - \frac{1}{2}\log\!\big(1+\mathrm{snr}\,\exp(-2R)\big).$  (12)
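A minimal numerical check of Equation (12) follows (our own sketch; the snr value is arbitrary and natural logarithms, i.e., nats, are assumed).

```python
import numpy as np

def scalar_gaussian_ib(snr, R):
    """Relevance I(U;Y) achievable at complexity I(U;X) = R nats, Equation (12)."""
    return 0.5 * np.log(1 + snr) - 0.5 * np.log(1 + snr * np.exp(-2 * R))

snr = 10.0
for R in [0.0, 0.5, 1.0, 2.0, np.inf]:
    print(f"R={R}: I(U;Y)={scalar_gaussian_ib(snr, R):.4f} nats")
```

As expected, the relevance is zero at R = 0 and saturates at I(X;Y) = ½ log(1 + snr) as R grows.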

3.2. Approximations for Generic Distributions

Next, we present an approach to obtain solutions to the information bottleneck problem for generic distributions, both when this distribution is known and when it is unknown. The method consists in defining a variational (lower) bound on the IB-Lagrangian, which can be optimized more easily than the IB-Lagrangian itself.

3.2.1. A Variational Bound

Recall the IB goal of finding a representation U of X that is maximally informative about Y while being concise enough (i.e., bounded I(U;X)). This corresponds to optimizing the IB-Lagrangian

$\mathcal{L}_{\beta}^{\mathrm{IB}}(P_{U|X}) := I(U;Y) - \beta\, I(U;X)$  (13)

where the maximization is over all stochastic mappings PU|X such that U − X − Y forms a Markov chain and |𝒰| ≤ |𝒳| + 1. In this section, we show that the optimization of Equation (13) is equivalent to the optimization of the variational cost

$\mathcal{L}_{\beta}^{\mathrm{VIB}}(P_{U|X}, Q_{Y|U}, S_U) := \mathbb{E}_{P_{U|X}}\big[\log Q_{Y|U}(Y|U)\big] - \beta\, D_{\mathrm{KL}}\big(P_{U|X} \,\|\, S_U\big),$  (14)

where QY|U(y|u) is a given stochastic map (also referred to as the variational approximation of PY|U, or decoder), SU(u) is a given stochastic map on 𝒰 (also referred to as the variational approximation of PU), and DKL(PU|X‖SU) is the relative entropy between PU|X and SU.

Then, we have the following bound for any valid PU|X, i.e., any mapping satisfying the Markov chain in Equation (2):

$\mathcal{L}_{\beta}^{\mathrm{IB}}(P_{U|X}) \ge \mathcal{L}_{\beta}^{\mathrm{VIB}}(P_{U|X}, Q_{Y|U}, S_U),$  (15)

where the equality holds when QY|U = PY|U and SU = PU, i.e., when the variational approximations correspond to the true distributions.

In the following, we derive the variational bound. Fix PU|X (an encoder) and the variational decoder approximation QY|U. The relevance I(U;Y) can be lower-bounded as

$I(U;Y) = \int_{u\in\mathcal{U},\, y\in\mathcal{Y}} P_{U,Y}(u,y) \log \frac{P_{Y|U}(y|u)}{P_Y(y)} \,dy\,du$  (16)
$\overset{(a)}{=} \int_{u\in\mathcal{U},\, y\in\mathcal{Y}} P_{U,Y}(u,y) \log \frac{Q_{Y|U}(y|u)}{P_Y(y)} \,dy\,du + D\big(P_{Y|U}\,\|\,Q_{Y|U}\big)$  (17)
$\overset{(b)}{\ge} \int_{u\in\mathcal{U},\, y\in\mathcal{Y}} P_{U,Y}(u,y) \log \frac{Q_{Y|U}(y|u)}{P_Y(y)} \,dy\,du$  (18)
$= H(Y) + \int_{u\in\mathcal{U},\, y\in\mathcal{Y}} P_{U,Y}(u,y) \log Q_{Y|U}(y|u) \,dy\,du$  (19)
$\overset{(c)}{\ge} \int_{u\in\mathcal{U},\, y\in\mathcal{Y}} P_{U,Y}(u,y) \log Q_{Y|U}(y|u) \,dy\,du$  (20)
$\overset{(d)}{=} \int_{u\in\mathcal{U},\, x\in\mathcal{X},\, y\in\mathcal{Y}} P_X(x) P_{Y|X}(y|x) P_{U|X}(u|x) \log Q_{Y|U}(y|u) \,dx\,dy\,du,$  (21)

where in (a) the term $D\big(P_{Y|U}\,\|\,Q_{Y|U}\big)$ is the conditional relative entropy between PY|U and QY|U, given PU; (b) holds by the non-negativity of relative entropy; (c) holds by the non-negativity of entropy; and (d) follows using the Markov chain U − X − Y.

Similarly, let SU be a given variational approximation of PU. Then, we get

$I(U;X) = \int_{u\in\mathcal{U},\, x\in\mathcal{X}} P_{U,X}(u,x) \log \frac{P_{U|X}(u|x)}{P_U(u)} \,dx\,du$  (22)
$= \int_{u\in\mathcal{U},\, x\in\mathcal{X}} P_{U,X}(u,x) \log \frac{P_{U|X}(u|x)}{S_U(u)} \,dx\,du - D\big(P_U\,\|\,S_U\big)$  (23)
$\le \int_{u\in\mathcal{U},\, x\in\mathcal{X}} P_{U,X}(u,x) \log \frac{P_{U|X}(u|x)}{S_U(u)} \,dx\,du$  (24)

where the inequality follows since the relative entropy is non-negative.

Combining Equations (21) and (24), we get

$I(U;Y) - \beta\, I(U;X) \ge \int_{u\in\mathcal{U},\, x\in\mathcal{X},\, y\in\mathcal{Y}} P_X(x) P_{Y|X}(y|x) P_{U|X}(u|x) \log Q_{Y|U}(y|u) \,dx\,dy\,du - \beta \int_{u\in\mathcal{U},\, x\in\mathcal{X}} P_{U,X}(u,x) \log \frac{P_{U|X}(u|x)}{S_U(u)} \,dx\,du.$  (25)

The use of the variational bound in Equation (14) in place of the IB-Lagrangian in Equation (13) has several advantages. First, it allows the derivation of alternating algorithms that obtain a solution by optimizing over the encoder and the decoders in turn. Second, it is easier to obtain an empirical estimate of Equation (14) by sampling from: (i) the joint distribution PX,Y; (ii) the encoder PU|X; and (iii) the prior SU. Additionally, as noted in Equation (15), when evaluated for the optimal decoder QY|U and prior SU, the variational bound becomes tight. All this allows the design of algorithms that obtain good approximate solutions to the IB problem, as shown next. Further theoretical implications of this variational bound are discussed in [55].
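As a sanity check, the inequality in Equation (15) can be evaluated exactly on a small discrete model. The sketch below is our own construction: it computes both sides for a randomly chosen encoder and arbitrary variational approximations QY|U and SU.

```python
import numpy as np

rng = np.random.default_rng(1)

def mi(pxy):
    """Mutual information (nats) of a joint pmf given as a 2-D array."""
    px = pxy.sum(1, keepdims=True); py = pxy.sum(0, keepdims=True)
    mask = pxy > 0
    return float((pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])).sum())

# Small discrete model and a random encoder P(u|x).
nx, ny, nu, beta = 4, 3, 3, 0.5
pxy = rng.random((nx, ny)); pxy /= pxy.sum()
pux = rng.random((nu, nx)); pux /= pux.sum(0, keepdims=True)     # columns are P(.|x)

px = pxy.sum(1)
puy = pux @ pxy                                                  # P(u, y), valid since U - X - Y
ib = mi(puy) - beta * mi((pux * px).T)                           # I(U;Y) - beta I(U;X), Equation (13)

# Arbitrary variational decoder Q(y|u) and prior S(u).
q = rng.random((nu, ny)); q /= q.sum(1, keepdims=True)
s = rng.random(nu); s /= s.sum()
e_logq = (puy * np.log(q)).sum()                                 # E[log Q_{Y|U}(Y|U)]
kl = (px * (pux * np.log(pux / s[:, None])).sum(0)).sum()        # E_X[D_KL(P_{U|X} || S_U)]
vib = e_logq - beta * kl                                         # Equation (14)

print(f"L_IB = {ib:.4f} >= L_VIB = {vib:.4f}: {ib >= vib}")
```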

3.2.2. Known Distributions

Using the variational formulation in Equation (14), when the data model is discrete and the joint distribution PX,Y is known, the IB problem can be solved by using an iterative method that optimizes the variational IB cost function in Equation (14), alternating over the distributions PU|X, QY|U, and SU. In this case, the maximizing distributions PU|X, QY|U, and SU can be efficiently found by an alternating optimization procedure similar to the expectation-maximization (EM) algorithm [56] and the standard Blahut–Arimoto (BA) method [57]. In particular, a solution PU|X to the constrained optimization problem is determined by the following self-consistent equations, for all (u,x,y) ∈ 𝒰 × 𝒳 × 𝒴 [1]:

$P_{U|X}(u|x) = \frac{P_U(u)}{Z(\beta,x)} \exp\!\Big(-\beta\, D_{\mathrm{KL}}\big(P_{Y|X}(\cdot|x) \,\|\, P_{Y|U}(\cdot|u)\big)\Big)$  (26a)
$P_U(u) = \sum_{x\in\mathcal{X}} P_X(x) P_{U|X}(u|x)$  (26b)
$P_{Y|U}(y|u) = \sum_{x\in\mathcal{X}} P_{Y|X}(y|x) P_{X|U}(x|u)$  (26c)

where PX|U(x|u) = PU|X(u|x) PX(x)/PU(u) and Z(β,x) is a normalization term. It is shown in [1] that alternating iterations of these equations converge to a solution of the problem for any initial PU|X. However, in contrast to the standard Blahut–Arimoto algorithm [57,58], which is classically used in the computation of rate-distortion functions of discrete memoryless sources and for which convergence to the optimal solution is guaranteed, convergence here may be to a local optimum only. If β = 0, the optimization is unconstrained and one can set U = ∅, which yields minimal relevance and complexity levels. Increasing the value of β steers towards more accurate and more complex representations, until U = X in the limit of very large (infinite) values of β, for which the relevance reaches its maximal value I(X;Y).
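For a small discrete model with known PX,Y, the self-consistent Equations (26a)–(26c) can be iterated directly. The following NumPy sketch is a minimal illustration of such an alternating scheme (our own code, not the authors'; it follows the sign convention of Equation (26a) as written above).

```python
import numpy as np

def iterative_ib(pxy, nu, beta, iters=200, seed=0):
    """Alternate the updates in Equations (26a)-(26c) for a known joint pmf pxy."""
    rng = np.random.default_rng(seed)
    nx, ny = pxy.shape
    px = pxy.sum(1)                         # P(x)
    py_x = pxy / px[:, None]                # P(y|x)
    pu_x = rng.random((nu, nx))
    pu_x /= pu_x.sum(0, keepdims=True)      # random initial encoder P(u|x)
    for _ in range(iters):
        pu = pu_x @ px                                      # (26b)
        pxu = (pu_x * px).T                                 # joint P(x, u)
        py_u = (pxu / pu).T @ py_x                          # (26c): P(y|u) = sum_x P(y|x) P(x|u)
        # (26a): P(u|x) proportional to P(u) exp(-beta * D_KL(P(y|x) || P(y|u)))
        kl = (py_x[None, :, :] * np.log(py_x[None, :, :] / py_u[:, None, :])).sum(-1)
        pu_x = pu[:, None] * np.exp(-beta * kl)
        pu_x /= pu_x.sum(0, keepdims=True)                  # normalization Z(beta, x)
    return pu_x

rng = np.random.default_rng(3)
pxy = rng.random((4, 3)); pxy /= pxy.sum()
enc = iterative_ib(pxy, nu=4, beta=5.0)
print(np.round(enc, 3))
```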

For discrete sources with (small) alphabets, the update equations in Equation (26) are relatively easy to compute. However, if the variables X and Y lie in a continuum, solving the equations in Equation (26) is very challenging. In the case in which X and Y are jointly multivariate Gaussian, the problem of finding the optimal representation U was shown to be analytically tractable in [53] (see also the related works [54,59]), as discussed in Section 3.1.2. Leveraging the optimality of Gaussian mappings PU|X, the optimization of PU|X can be restricted to Gaussian distributions as in Equation (9), which reduces the search for update rules to that of the associated parameters, namely covariance matrices. When Y is a deterministic function of X, the IB curve cannot be explored, and other Lagrangians have been proposed to tackle this problem [60].

3.3. Unknown Distributions

The main drawback of the solutions presented thus far for the IB principle is that, with the exception of small-sized discrete (X,Y), for which iterating Equation (26) converges to an (at least local) solution, and jointly Gaussian (X,Y), for which an explicit analytic solution was found, solving Equation (3) is generally computationally costly, especially for high dimensionality. Another important barrier in solving Equation (3) directly is that IB necessitates knowledge of the joint distribution PX,Y. In this section, we describe a method to provide an approximate solution to the IB problem in the case in which the joint distribution is unknown and only a given training set of N samples {(xi,yi)}_{i=1}^N is available.

A major step ahead, which widened the range of applications of IB inference for various learning problems, appeared in [48], where the authors used neural networks to parameterize the variational inference lower bound in Equation (14) and showed that its optimization can be carried out through the classic and widely used stochastic gradient descent (SGD). This method, denoted Variational IB in [48] and detailed below, has allowed handling high-dimensional, possibly continuous, data, even in the case in which the distributions are unknown.

3.3.1. Variational IB

The goal of the variational IB when only samples {(xi,yi)}_{i=1}^N are available is to solve the IB problem by optimizing an approximation of the cost function. For instance, for a given training set {(xi,yi)}_{i=1}^N, the right hand side of Equation (14) can be approximated as

$\mathcal{L}^{\mathrm{low}} \simeq \frac{1}{N} \sum_{i=1}^{N} \left[ \int_{u\in\mathcal{U}} P_{U|X}(u|x_i) \log Q_{Y|U}(y_i|u)\, du - \beta \int_{u\in\mathcal{U}} P_{U|X}(u|x_i) \log \frac{P_{U|X}(u|x_i)}{S_U(u)}\, du \right].$  (27)

However, in general, the direct optimization of this cost is challenging. In the variational IB method, this optimization is done by parameterizing the encoding and decoding distributions PU|X, QY|U, and SU that are to be optimized using a family of distributions whose parameters are determined by DNNs. This allows us to formulate Equation (14) in terms of the DNN parameters, i.e., their weights, and to optimize it by using the reparameterization trick [15], Monte Carlo sampling, and stochastic gradient descent (SGD)-type algorithms.

Let Pθ(u|x) denote the family of encoding probability distributions PU|X over 𝒰 for each element of 𝒳, parameterized by the output of a DNN fθ with parameters θ. A common example is the family of multivariate Gaussian distributions [15], which are parameterized by the mean μθ and covariance matrix Σθ, i.e., γ := (μθ, Σθ). Given an observation X, the values of (μθ(x), Σθ(x)) are determined by the output of the DNN fθ, whose input is X, and the corresponding family member is given by Pθ(u|x) = N(u; μθ(x), Σθ(x)). For discrete distributions, a common example is Concrete variables [61] (or Gumbel-Softmax [62]). Some details are given below.

Similarly, for the decoder QY|U over 𝒴 for each element of 𝒰, let Qϕ(y|u) denote the family of distributions parameterized by the output of a DNN fϕ. Finally, for the prior distribution SU(u) over 𝒰, we define the family of distributions Sφ(u), which do not depend on a DNN.

By restricting the optimization of the variational IB cost in Equation (14) to the encoder, decoder, and prior within the families of distributions Pθ(u|x), Qϕ(y|u), and Sφ(u), we get

$\max_{P_{U|X}} \max_{Q_{Y|U}, S_U} \mathcal{L}_{\beta}^{\mathrm{VIB}}(P_{U|X}, Q_{Y|U}, S_U) \ge \max_{\theta, \phi, \varphi} \mathcal{L}_{\beta}^{\mathrm{NN}}(\theta, \phi, \varphi),$  (28)

where θ, ϕ, and φ denote the DNN parameters, e.g., their weights, and the cost in Equation (29) is given by

$\mathcal{L}_{\beta}^{\mathrm{NN}}(\theta, \phi, \varphi) := \mathbb{E}_{P_{X,Y}}\Big[ \mathbb{E}_{P_{\theta}(U|X)}\big[\log Q_{\phi}(Y|U)\big] - \beta\, D_{\mathrm{KL}}\big(P_{\theta}(U|X) \,\|\, S_{\varphi}(U)\big) \Big].$  (29)

Next, using the training samples {(xi,yi)}_{i=1}^N, the DNNs are trained to maximize a Monte Carlo approximation of Equation (29) over θ, ϕ, φ using optimization methods such as SGD or ADAM [63] with backpropagation. However, in general, the direct computation of the gradients of Equation (29) is challenging due to the dependence of the averaging on the encoding distribution Pθ, which makes it hard to approximate the cost by sampling. To circumvent this problem, the reparameterization trick [15] is used to sample from Pθ(U|X). In particular, consider Pθ(U|X) to belong to a parametric family of distributions that can be sampled by first sampling a random variable Z with distribution PZ(z), z ∈ 𝒵, and then transforming the samples using some function gθ : 𝒳 × 𝒵 → 𝒰 parameterized by θ, such that U = gθ(x, Z) ∼ Pθ(U|x). Various parametric families of distributions fall within this class for both discrete and continuous latent spaces, e.g., the Gumbel-Softmax distributions and the Gaussian distributions. Next, we detail how to sample from both examples (a short sampling sketch in code is given after the list):

  1. Sampling from Gaussian Latent Spaces: When the latent space is a continuous vector space of dimension D, e.g., 𝒰 = ℝ^D, we can consider multivariate Gaussian parametric encoders of mean μθ and covariance Σθ, i.e., Pθ(u|x) = N(u; μθ, Σθ). To sample U ∼ N(u; μθ, Σθ), where μθ(x) = f^μ_{e,θ}(x) and Σθ(x) = f^Σ_{e,θ}(x) are determined as the output of a NN, sample a random variable Z ∼ N(z; 0, I) i.i.d. and, given a data sample x ∈ 𝒳, generate the jth sample as
    $u_j = f^{\mu}_{e,\theta}(x) + f^{\Sigma}_{e,\theta}(x)\, z_j$  (30)
    where z_j is a sample of Z ∼ N(0, I), which is an independent Gaussian noise, and f^μ_{e,θ}(x) and f^Σ_{e,θ}(x) are the output values of the NN with weights θ for the given input sample x.

    An example of the resulting variational IB architecture to optimize, with an encoder, a latent space, and a decoder parameterized by Gaussian distributions, is shown in Figure 3.

  2. Sampling from a discrete latent space with the Gumbel-Softmax:

    If U is a categorical random variable on the finite set 𝒰 of size D with probabilities π := (π_1, …, π_D), we can encode it as D-dimensional one-hot vectors lying on the corners of the (D−1)-dimensional simplex, Δ^{D−1}. In general, cost functions involving sampling from categorical distributions are non-differentiable. Instead, we consider Concrete variables [62] (or Gumbel-Softmax [61]), which are continuous differentiable relaxations of categorical variables on the interior of the simplex and are easy to sample. To sample from a Concrete random variable U ∈ Δ^{D−1} at temperature λ ∈ (0,∞), with probabilities π ∈ (0,1)^D, sample G_d ∼ Gumbel(0,1) i.i.d. (the Gumbel(0,1) distribution can be sampled by drawing u ∼ Uniform(0,1) and calculating g = −log(−log(u))), and set, for each of the components of U = (U_1, …, U_D),
    $U_d = \frac{\exp\big((\log \pi_d + G_d)/\lambda\big)}{\sum_{j=1}^{D} \exp\big((\log \pi_j + G_j)/\lambda\big)}, \quad d = 1, \ldots, D.$  (31)

    We denote by Q_{π,λ}(u, x) the Concrete distribution with parameters (π(x), λ). When the temperature λ approaches 0, the samples from the Concrete distribution become one-hot and Pr{lim_{λ→0} U_d = 1} = π_d [61]. Note that, for discrete data models, a standard application of Caratheodory’s theorem [64] shows that the latent variables U that appear in Equation (3) can be restricted to have a bounded alphabet size.
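The two sampling rules in Equations (30) and (31) amount to only a few lines of code. The sketch below is illustrative: the encoder outputs are stubbed with fixed arrays, since only the reparameterized sampling step is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian(mu, sigma_sqrt):
    """Reparameterized Gaussian sample, Equation (30): u = mu(x) + Sigma^{1/2}(x) z."""
    z = rng.standard_normal(mu.shape)
    return mu + sigma_sqrt * z          # elementwise for a diagonal covariance

def sample_concrete(log_pi, lam):
    """Concrete / Gumbel-Softmax sample, Equation (31), at temperature lam."""
    g = -np.log(-np.log(rng.uniform(size=log_pi.shape)))   # Gumbel(0,1) samples
    logits = (log_pi + g) / lam
    e = np.exp(logits - logits.max())                      # numerically stable softmax
    return e / e.sum()

# Stand-ins for encoder outputs f_theta(x): a 2-D Gaussian and a 4-class categorical.
print(sample_gaussian(mu=np.array([0.5, -1.0]), sigma_sqrt=np.array([0.1, 0.3])))
print(sample_concrete(log_pi=np.log(np.array([0.7, 0.1, 0.1, 0.1])), lam=0.5))
```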

Figure 3. Example parametrization of Variational Information Bottleneck using neural networks.

The reparameterization trick transforms the cost function in Equation (29) into one which can be approximated by sampling M independent samples {u_m}_{m=1}^M ∼ Pθ(u|x_i) for each training sample (x_i, y_i), i = 1, …, N, and allows computing estimates of the gradient using backpropagation [15]. Sampling is performed by using u_{i,m} = gθ(x_i, z_m) with {z_m}_{m=1}^M i.i.d. sampled from PZ. Altogether, we have the following empirical variational IB cost for the ith sample in the training dataset:

$\mathcal{L}_{\beta,i}^{\mathrm{emp}}(\theta, \phi, \varphi) := \frac{1}{M} \sum_{m=1}^{M} \log Q_{\phi}(y_i|u_{i,m}) - \beta\, D_{\mathrm{KL}}\big(P_{\theta}(U_i|x_i) \,\|\, Q_{\varphi}(U_i)\big).$  (32)

Note that, for many distributions, e.g., multivariate Gaussian, the divergence DKL(Pθ(Ui|xi)Qφ(Ui)) can be evaluated in closed form. Alternatively, an empirical approximation can be considered.

Finally, we maximize the empirical-IB cost over the DNN parameters θ,ϕ,φ as,

$\max_{\theta, \phi, \varphi} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_{\beta,i}^{\mathrm{emp}}(\theta, \phi, \varphi).$  (33)

By the law of large numbers, for large N and M, we have $\frac{1}{N}\sum_{i=1}^{N} \mathcal{L}_{\beta,i}^{\mathrm{emp}}(\theta,\phi,\varphi) \to \mathcal{L}_{\beta}^{\mathrm{NN}}(\theta,\phi,\varphi)$ almost surely. After convergence of the DNN parameters to θ*, ϕ*, φ*, for a new observation X, the representation U can be obtained by sampling from the encoder Pθ*(U|X). In addition, note that a soft estimate of the remote source Y can be inferred by sampling from the decoder Qϕ*(Y|U). The notion of encoder and decoder in the IB problem will become clear from its relationship with lossy source coding in Section 4.1.
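Putting the pieces together, a minimal training sketch of the variational IB objective in Equations (32) and (33) might look as follows. This is our own PyTorch-style illustration under simplifying assumptions (a classification task, a single Monte Carlo sample M = 1, a standard normal prior SU, and illustrative layer sizes); it is not the implementation of [48].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIB(nn.Module):
    """Minimal variational IB model: Gaussian encoder, standard normal prior S_U,
    categorical decoder Q_phi(Y|U); trained by minimizing the negative of Equation (32)."""
    def __init__(self, x_dim, u_dim, n_classes):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 2 * u_dim))
        self.dec = nn.Linear(u_dim, n_classes)

    def forward(self, x):
        mu, log_var = self.enc(x).chunk(2, dim=-1)
        u = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)   # reparameterization, Eq. (30)
        return self.dec(u), mu, log_var

def vib_loss(logits, y, mu, log_var, beta):
    ce = F.cross_entropy(logits, y)                                # -E[log Q_phi(y|u)], M = 1
    kl = 0.5 * (mu.pow(2) + log_var.exp() - 1 - log_var).sum(-1).mean()  # closed-form KL to N(0, I)
    return ce + beta * kl                                          # negative of Equation (32)

# Illustrative training step on random data (shapes only).
model = VIB(x_dim=20, u_dim=8, n_classes=5)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
x, y = torch.randn(32, 20), torch.randint(0, 5, (32,))
logits, mu, log_var = model(x)
loss = vib_loss(logits, y, mu, log_var, beta=1e-3)
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```

In practice, β is swept over a range of values to obtain estimators operating at different points of the relevance–complexity plane.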

4. Connections to Coding Problems

The IB problem is a one-shot coding problem, in the sense that the operations are performed letter-wise. In this section, we consider the relationship between the IB problem and (asymptotic) coding problems in which the coding operations are performed over blocks of size n, with n assumed to be large and the joint distribution of the data PX,Y in general assumed to be known a priori. The connections between these problems allow results to be extended from one setup to another, and lead to generalizations of the classical IB problem to other setups, e.g., as shown in Section 6.

4.1. Indirect Source Coding under Logarithmic Loss

Let us consider the (asymptotic) indirect source coding problem shown in Figure 4, in which Y designates a memoryless remote source and X a noisy version of it that is observed at the encoder.

Figure 4. A remote source coding problem.

A sequence of n samples X^n = (X_1, …, X_n) is mapped by an encoder ϕ^{(n)} : 𝒳^n → {1, …, 2^{nR}}, which outputs a message from a set {1, …, 2^{nR}}; that is, the encoder uses at most R bits per sample to describe its observation, and the range of the encoder map is allowed to grow with the size of the input sequence as

$\|\phi^{(n)}\| \le nR.$  (34)

This message is mapped with a decoder ψ^{(n)} : {1, …, 2^{nR}} → 𝒴̂^n to generate a reconstruction Ŷ^n of the source sequence Y^n. As already observed in [50], the IB problem in Equation (3) is essentially equivalent to a remote point-to-point source coding problem in which the distortion between Y^n and its reconstruction Ŷ^n is measured under the logarithmic loss (log-loss) fidelity criterion [65]. That is, rather than just assigning a deterministic value to each sample of the source, the decoder gives an assessment of the degree of confidence or reliability on each estimate. Specifically, given the output description m = ϕ^{(n)}(x^n) of the encoder, the decoder generates a soft-estimate ŷ^n of y^n in the form of a probability distribution over 𝒴^n, i.e., ŷ^n = P̂_{Y^n|M}(·). The incurred discrepancy between y^n and the estimation ŷ^n under log-loss for the observation x^n is then given by the per-letter logarithmic loss distortion, which is defined as

$\ell_{\log}(y, \hat{y}) := \log \frac{1}{\hat{y}(y)}$  (35)

for y ∈ 𝒴, where ŷ ∈ 𝒫(𝒴) designates here a probability distribution on 𝒴 and ŷ(y) is the value of that distribution evaluated at the outcome y ∈ 𝒴.

That is, the encoder uses at most R bits per sample to describe its observation to a decoder which is interested in reconstructing the remote source Yn to within an average distortion level D, using a per-letter distortion metric, i.e.,

$\mathbb{E}\big[\ell_{\log}^{(n)}(Y^n, \hat{Y}^n)\big] \le D$  (36)

where the incurred distortion between two sequences Yn and Y^n is measured as

$\ell_{\log}^{(n)}(Y^n, \hat{Y}^n) = \frac{1}{n} \sum_{i=1}^{n} \ell_{\log}(y_i, \hat{y}_i)$  (37)

and the per-letter distortion is given by the logarithmic loss in Equation (35).

The rate distortion region of this model is given by the union of all pairs (R,D) that satisfy [7,9]

$R \ge I(U;X)$  (38a)
$D \ge H(Y|U)$  (38b)

where the union is over all auxiliary random variables U that satisfy that U − X − Y forms a Markov chain in this order. Invoking the support lemma [66] (p. 310), it is easy to see that this region is not altered if one restricts U to satisfy |𝒰| ≤ |𝒳| + 1. In addition, using the substitution Δ := H(Y) − D, the region can be written equivalently as the union of all pairs (R, H(Y) − Δ) that satisfy

$R \ge I(U;X)$  (39a)
$\Delta \le I(U;Y)$  (39b)

where the union is over all U with pmf PU|X that satisfy U − X − Y, with |𝒰| ≤ |𝒳| + 1.

The boundary of this region is equivalent to the one described by the IB principle in Equation (3) if solved for all β, and therefore the IB problem is essentially a remote source coding problem in which the distortion is measured under the logarithmic loss measure. Note that, operationally, the IB problem is equivalent to that of finding an encoder PU|X which maps the observation X to a representation U that satisfies the bit rate constraint R and such that U captures enough relevance of Y so that the posterior probability of Y given U satisfies an average distortion constraint.

4.2. Common Reconstruction

Consider the problem of source coding with side information at the decoder, i.e., the well known Wyner–Ziv setting [67], with the distortion measured under logarithmic-loss. Specifically, a memoryless source X is to be conveyed lossily to a decoder that observes a statistically correlated side information Y. The encoder uses R bits per sample to describe its observation to the decoder which wants to reconstruct an estimate of X to within an average distortion level D, where the distortion is evaluated under the log-loss distortion measure. The rate distortion region of this problem is given by the set of all pairs (R,D) that satisfy

$R + D \ge H(X|Y).$  (40)

The optimal coding scheme utilizes standard Wyner–Ziv compression [67] at the encoder, and the decoder map ψ : 𝒰 × 𝒴 → 𝒳̂ is given by

ψ(U,Y)=Pr[X=x|U,Y] (41)

for which it is easy to see that

$\mathbb{E}\big[\ell_{\log}(X, \psi(U,Y))\big] = H(X|U,Y).$  (42)

Now, assume that we constrain the coding in a manner such that the encoder is able to produce an exact copy of the compressed source constructed by the decoder. This requirement, termed the common reconstruction constraint (CR), was introduced and studied by Steinberg [68] for various source coding models, including the Wyner–Ziv setup, in the context of a “general distortion measure”. For the Wyner–Ziv problem under the log-loss measure that is considered in this section, such a CR constraint causes some rate loss because the reproduction rule in Equation (41) is no longer possible. In fact, it is not difficult to see that under the CR constraint the above region reduces to the set of pairs (R,D) that satisfy

$R \ge I(U;X|Y)$  (43a)
$D \ge H(X|U)$  (43b)

for some auxiliary random variable U for which U − X − Y holds. Observe that Equation (43b) is equivalent to I(U;X) ≥ H(X) − D and that, for a given prescribed fidelity level D, the minimum rate is obtained for a description U that achieves the inequality in Equation (43b) with equality, i.e.,

$R(D) = \min_{P_{U|X} \,:\, I(U;X) = H(X) - D} I(U;X|Y).$  (44)

Because U − X − Y is a Markov chain, we have

$I(U;Y) = I(U;X) - I(U;X|Y).$  (45)

Under the constraint I(U;X) = H(X) − D, it is easy to see that minimizing I(U;X|Y) amounts to maximizing I(U;Y), an aspect which bridges the problem at hand with the IB problem.

In the above, the side information Y is used for binning but not for the estimation at the decoder. If the encoder ignores whether Y is present or not at the decoder side, the benefit of binning is reduced—see the Heegard–Berger model with common reconstruction studied in [69,70].

4.3. Information Combining

Consider again the IB problem. Assume one wishes to find the representation U that maximizes the relevance I(U;Y) for a given prescribed complexity level, e.g., I(U;X)=R. For this setup, we have

$I(X; U, Y) = I(U;X) + I(Y;X) - I(U;Y)$  (46)
$= R + I(Y;X) - I(U;Y)$  (47)

where the first equality holds since U − X − Y is a Markov chain. Maximizing I(U;Y) is then equivalent to minimizing I(X;U,Y). This is reminiscent of the problem of information combining [71,72], where X can be interpreted as source information that is conveyed through two channels: the channel PY|X and the channel PU|X. The outputs of these two channels are conditionally independent given X, and they should be processed in a manner such that, when combined, they preserve as much information as possible about X.

4.4. Wyner–Ahlswede–Korner Problem

Here, the two memoryless sources X and Y are encoded separately at rates RX and RY, respectively. A decoder gets the two compressed streams and aims at recovering Y losslessly. This problem was studied and solved separately by Wyner [73] and Ahlswede and Körner [74]. For a given RX = R, the minimum rate RY that is needed to recover Y losslessly is

$R_Y(R) = \min_{P_{U|X} \,:\, I(U;X) \le R} H(Y|U).$  (48)

Thus, we get

$\max_{P_{U|X} \,:\, I(U;X) \le R} I(U;Y) = H(Y) - R_Y(R),$

and therefore, solving the IB problem is equivalent to solving the Wyner–Ahlswede–Korner Problem.

4.5. The Privacy Funnel

Consider again the setting of Figure 4, and let us assume that the pair (Y,X) models data that a user possesses and which have the following properties: the data Y are some sensitive (private) data that are not meant to be revealed at all, or else not beyond some level Δ; and the data X are non-private and are meant to be shared with another user (analyst). Because X and Y are correlated, sharing the non-private data X with the analyst possibly reveals information about Y. For this reason, there is a trade off between the amount of information that the user shares about X and the information that he keeps private about Y. The data X are passed through a randomized mapping ϕ whose purpose is to make U=ϕ(X) maximally informative about X while being minimally informative about Y.

The analyst performs an inference attack on the private data Y based on the disclosed information U. Let ℓ : 𝒴 × 𝒴̂ → ℝ̄ be an arbitrary loss function with reconstruction alphabet 𝒴̂ that measures the cost of inferring Y after observing U. Given (X,Y) ∼ PX,Y and under the given loss function ℓ, it is natural to quantify the difference between the prediction losses in predicting Y ∈ 𝒴 prior to and after observing U = ϕ(X). Let

$C(\ell, P) = \inf_{\hat{y} \in \hat{\mathcal{Y}}} \mathbb{E}_P\big[\ell(Y, \hat{y})\big] - \inf_{\hat{Y}(\phi(X))} \mathbb{E}_P\big[\ell(Y, \hat{Y})\big]$  (49)

where ŷ ∈ 𝒴̂ is deterministic and Ŷ(ϕ(X)) is any measurable function of U = ϕ(X). The quantity C(ℓ, P) quantifies the reduction in the prediction loss under the loss function ℓ that is due to observing U = ϕ(X), i.e., the inference cost gain. In [75] (see also [76]), it is shown that, under some mild conditions, the inference cost gain C(ℓ, P) as defined by Equation (49) is upper-bounded as

$C(\ell, P) \le 2\sqrt{2}\, L\, \sqrt{I(U;Y)}$  (50)

where L is a constant. The inequality in Equation (50) holds irrespective of the choice of the loss function ℓ, and this justifies the usage of the logarithmic loss function as given by Equation (53) in the context of finding a suitable trade-off between utility and privacy, since

$I(U;Y) = H(Y) - \inf_{\hat{Y}(U)} \mathbb{E}_P\big[\ell_{\log}(Y, \hat{Y})\big].$  (51)

Under the logarithmic loss function, the design of the mapping U = ϕ(X) should strike the right balance between the utility for inferring the non-private data X, as measured by the mutual information I(U;X), and the privacy metric about the private data Y, as measured by the mutual information I(U;Y).

4.6. Efficiency of Investment Information

Let Y model stock market data and X some correlated information. In [77], Erkip and Cover investigated how the description of the correlated information X improves investment in the stock market Y. Specifically, let Δ(C) denote the maximum increase in growth rate when X is described to the investor at rate C. Erkip and Cover found a single-letter characterization of the incremental growth rate Δ(C). When specialized to the horse race market, this problem is related to the aforementioned source coding with side information problems of Wyner [73] and Ahlswede–Körner [74], and thus also to the IB problem. The work in [77] provides explicit analytic solutions for two horse race examples, jointly binary and jointly Gaussian horse races.

5. Connections to Inference and Representation Learning

In this section, we consider the connections of the IB problem with learning, inference and generalization, for which, typically, the joint distribution PX,Y of the data is not known and only a set of samples is available.

5.1. Inference Model

Let a measurable variable X ∈ 𝒳 and a target variable Y ∈ 𝒴 with unknown joint distribution PX,Y be given. In the classic problem of statistical learning, one wishes to infer an accurate predictor of the target variable Y ∈ 𝒴 based on observed realizations of X ∈ 𝒳. That is, for a given class ℱ of admissible predictors ψ : 𝒳 → 𝒴̂ and a loss function ℓ : 𝒴 × 𝒴̂ → ℝ that measures discrepancies between true values and their estimated fits, one aims at finding the mapping ψ ∈ ℱ that minimizes the expected (population) risk

$\mathcal{C}_{P_{X,Y}}(\psi, \ell) = \mathbb{E}_{P_{X,Y}}\big[\ell(Y, \psi(X))\big].$  (52)

An abstract inference model is shown in Figure 5.

Figure 5. An abstract inference model for learning.

The choice of a “good” loss function (·) is often controversial in statistical learning theory. There is however numerical evidence that models that are trained to minimize the error’s entropy often outperform ones that are trained using other criteria such as mean-square error (MSE) and higher-order statistics [26,27]. This corresponds to choosing the loss function given by the logarithmic loss, which is defined as

$\ell_{\log}(y, \hat{y}) := \log \frac{1}{\hat{y}(y)}$  (53)

for yY, where y^P(Y) designates here a probability distribution on Y and y^(y) is the value of that distribution evaluated at the outcome yY. Although a complete and rigorous justification of the usage of the logarithmic loss as distortion measure in learning is still awaited, recently a partial explanation appeared in [30] where Painsky and Wornell showed that, for binary classification problems, by minimizing the logarithmic-loss one actually minimizes an upper bound to any choice of loss function that is smooth, proper (i.e., unbiased and Fisher consistent), and convex. Along the same line of work, the authors of [29] showed that under some natural data processing property Shannon’s mutual information uniquely quantifies the reduction of prediction risk due to side information. Perhaps, this justifies partially why the logarithmic-loss fidelity measure is widely used in learning theory and has already been adopted in many algorithms in practice such as the infomax criterion [31], the tree-based algorithm of Quinlan [32], or the well known Chow–Liu algorithm [33] for learning tree graphical models, with various applications in genetics [34], image processing [35], computer vision [36], and others. The logarithmic loss measure also plays a central role in the theory of prediction [37] (Ch. 09), where it is often referred to as the self-information loss function, as well as in Bayesian modeling [38] where priors are usually designed to maximize the mutual information between the parameter to be estimated and the observations.

When the joint distribution PX,Y is known, the optimal predictor and the minimum expected (population) risk can be characterized. Let, for every x ∈ 𝒳, ψ(x) = Q(·|x) ∈ 𝒫(𝒴). It is easy to see that

$\mathbb{E}_{P_{X,Y}}\big[\ell_{\log}(Y, Q)\big] = \sum_{x\in\mathcal{X},\, y\in\mathcal{Y}} P_{X,Y}(x,y) \log \frac{1}{Q(y|x)}$  (54a)
$= \sum_{x\in\mathcal{X},\, y\in\mathcal{Y}} P_{X,Y}(x,y) \log \frac{1}{P_{Y|X}(y|x)} + \sum_{x\in\mathcal{X},\, y\in\mathcal{Y}} P_{X,Y}(x,y) \log \frac{P_{Y|X}(y|x)}{Q(y|x)}$  (54b)
$= H(Y|X) + D\big(P_{Y|X} \,\|\, Q\big)$  (54c)
$\ge H(Y|X)$  (54d)

with equality iff the predictor is given by the conditional posterior ψ(x) = P_{Y|X}(·|x). That is, the minimum expected (population) risk is given by

$\min_{\psi} \mathcal{C}_{P_{X,Y}}(\psi, \ell_{\log}) = H(Y|X).$  (55)
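Equations (54) and (55) are easy to verify numerically on a toy joint pmf; the sketch below (our own construction) compares the expected log-loss of the true posterior with that of a perturbed predictor.

```python
import numpy as np

rng = np.random.default_rng(7)
pxy = rng.random((4, 3)); pxy /= pxy.sum()          # toy joint P(x, y)
px = pxy.sum(1, keepdims=True)
py_x = pxy / px                                      # true posterior P(y|x)

def expected_log_loss(pxy, q_y_x):
    """E_{P_{X,Y}}[ log 1/q(Y|X) ] in nats, Equation (54a)."""
    return float(-(pxy * np.log(q_y_x)).sum())

h_y_x = expected_log_loss(pxy, py_x)                 # equals H(Y|X), Equation (55)

# Any mismatched predictor Q(y|x) incurs an extra D(P_{Y|X} || Q) >= 0, Equation (54c).
q = py_x + 0.1 * rng.random(py_x.shape); q /= q.sum(1, keepdims=True)
print(f"H(Y|X) = {h_y_x:.4f} <= mismatched risk = {expected_log_loss(pxy, q):.4f}")
```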

If the joint distribution PX,Y is unknown, which is most often the case in practice, the population risk as given by Equation (52) cannot be computed directly; in the standard approach, one usually resorts to choosing the predictor with minimal risk on a training dataset consisting of n labeled samples {(xi,yi)}_{i=1}^n that are drawn independently from the unknown joint distribution PX,Y. In this case, one is interested in optimizing the empirical risk, which, for a set of n i.i.d. samples from PX,Y, 𝒟_n := {(xi,yi)}_{i=1}^n, is defined as

$\hat{\mathcal{C}}_{P_{X,Y}}(\psi, \ell, \mathcal{D}_n) = \frac{1}{n} \sum_{i=1}^{n} \ell(y_i, \psi(x_i)).$  (56)

The difference between the empirical and population risks is normally measured in terms of the generalization gap, defined as

$\mathrm{gen}_{P_{X,Y}}(\psi, \ell, \mathcal{D}_n) := \mathcal{C}_{P_{X,Y}}(\psi, \ell_{\log}) - \hat{\mathcal{C}}_{P_{X,Y}}(\psi, \ell, \mathcal{D}_n).$  (57)

5.2. Minimum Description Length

One popular approach to reducing the generalization gap is to restrict the set ℱ of admissible predictors to a low-complexity (or constrained-complexity) class to prevent over-fitting. One way to limit the model’s complexity is to restrict the range of the prediction function, as shown in Figure 6. This is the so-called minimum description length complexity measure, often used in the learning literature to limit the description length of the weights of neural networks [78]. A connection between the use of minimum description complexity for limiting the description length of the input encoding and accuracy is studied in [79]; the corresponding connection with respect to weight complexity and accuracy is given in [11]. Here, the stochastic mapping ϕ : 𝒳 → 𝒰 is a compressor with

$\|\phi\| \le R$  (58)

for some prescribed “input-complexity” value R, or equivalently prescribed average description-length.

Figure 6. Inference problem with constrained model’s complexity.

Minimizing the constrained description length population risk is now equivalent to solving

$\mathcal{C}^{\mathrm{DL}}_{P_{X,Y}}(R) = \min_{\phi} \mathbb{E}_{P_{X,Y}}\big[\ell_{\log}(Y^n, \psi(U^n))\big]$  (59)
$\text{s.t.} \quad \|\phi(X^n)\| \le nR.$  (60)

It can be shown that this problem takes its minimum value with the choice of ψ(U)=PY|U and

$\mathcal{C}^{\mathrm{DL}}_{P_{X,Y}}(R) = \min_{P_{U|X}} H(Y|U) \quad \text{s.t.} \quad R \ge I(U;X).$  (61)

The solution to Equation (61) for different values of R is effectively equivalent to the IB-problem in Equation (4). Observe that the right-hand side of Equation (61) is larger for small values of R; it is clear that a good predictor ϕ should strike a right balance between reducing the model’s complexity and reducing the error’s entropy, or, equivalently, maximizing the mutual information I(U;Y) about the target variable Y.

5.3. Generalization and Performance Bounds

The IB problem arises as a relevant problem in the study of the fundamental performance limits of learning. In particular, when PX,Y is unknown, and instead n i.i.d. samples from PX,Y are available, the optimization of the empirical risk in Equation (56) leads to a mismatch between the true loss, given by the population risk, and the empirical risk. This gap is measured by the generalization gap in Equation (57). Interestingly, the relationship between the true loss and the empirical loss can be bounded (with high probability) in terms of the IB problem as [80]

$\mathcal{C}_{P_{X,Y}}(\psi, \ell_{\log}) \le \hat{\mathcal{C}}_{P_{X,Y}}(\psi, \ell, \mathcal{D}_n) + \mathrm{gen}_{P_{X,Y}}(\psi, \ell, \mathcal{D}_n) = \underbrace{H_{\hat{P}^{(n)}_{X,Y}}(Y|U)}_{\hat{\mathcal{C}}_{P_{X,Y}}(\psi, \ell, \mathcal{D}_n)} + \underbrace{A\, \frac{I\big(\hat{P}^{(n)}_{X}; P_{U|X}\big) \cdot \log n}{n} + B\, \frac{\Lambda\big(P_{U|X}, \hat{P}_{Y|U}, P_{\hat{Y}|U}\big)}{n} + O\!\Big(\frac{\log n}{n}\Big)}_{\text{Bound on } \mathrm{gen}_{P_{X,Y}}(\psi, \ell, \mathcal{D}_n)}$

where $\hat{P}_{U|X}$ and $\hat{P}_{Y|U}$ are the empirical encoder and decoder, $P_{\hat{Y}|U}$ is the optimal decoder, $H_{\hat{P}^{(n)}_{X,Y}}(Y|U)$ and $I\big(\hat{P}^{(n)}_{X}; P_{U|X}\big)$ are the empirical loss and the mutual information resulting from the dataset $\mathcal{D}_n$, and $\Lambda\big(P_{U|X}, \hat{P}_{Y|U}, P_{\hat{Y}|U}\big)$ is a function that measures the mismatch between the optimal decoder and the empirical one.

This bound shows explicitly the trade-off between the empirical relevance and the empirical complexity. The pairs of relevance and complexity that are simultaneously achievable are precisely characterized by the IB problem. Therefore, by designing estimators based on the IB problem, as described in Section 3, one can operate at different regimes of performance, complexity, and generalization.

Another interesting connection between learning and the IB method is the relationship of the logarithmic loss, as a metric, to common performance metrics in learning (a quick numerical check of the first item follows the list):

  • The logarithmic loss gives an upper bound on the probability of misclassification (accuracy):
    $\epsilon_{Y|X}(Q_{\hat{Y}|X}) := 1 - \mathbb{E}_{P_{X,Y}}\big[Q_{\hat{Y}|X}(Y|X)\big] \le 1 - \exp\!\Big(-\mathbb{E}_{P_{X,Y}}\big[\ell_{\log}(Y, Q_{\hat{Y}|X})\big]\Big)$
  • The logarithmic loss is equivalent to maximum likelihood for large n:
    $\frac{1}{n} \log P_{Y^n|X^n}(y^n|x^n) = \frac{1}{n} \sum_{i=1}^{n} \log P_{Y|X}(y_i|x_i) \xrightarrow{\,n\to\infty\,} \mathbb{E}_{X,Y}\big[\log P_{Y|X}(Y|X)\big]$
  • The true distribution P minimizes the expected logarithmic loss:
    $P_{Y|X} = \arg\min_{Q_{\hat{Y}|X}} \mathbb{E}_P\Big[\log \frac{1}{Q_{\hat{Y}|X}}\Big] \quad\text{and}\quad \min_{Q_{\hat{Y}|X}} \mathbb{E}\big[\ell_{\log}(Y, Q_{\hat{Y}|X})\big] = H(Y|X)$
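As a quick numerical illustration of the first item, the following sketch (our own toy construction) checks the misclassification bound for an arbitrary soft classifier.

```python
import numpy as np

rng = np.random.default_rng(11)
pxy = rng.random((5, 4)); pxy /= pxy.sum()           # toy joint P(x, y)

# Arbitrary soft classifier Q(yhat|x), rows normalized over yhat.
q = rng.random(pxy.shape); q /= q.sum(1, keepdims=True)

eps = 1.0 - float((pxy * q).sum())                   # 1 - E[Q_{Yhat|X}(Y|X)]
log_loss = float(-(pxy * np.log(q)).sum())           # E[l_log(Y, Q_{Yhat|X})]
print(f"error prob {eps:.4f} <= 1 - exp(-log-loss) = {1 - np.exp(-log_loss):.4f}")
```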

Since, for n → ∞, the joint distribution PX,Y can be perfectly learned, the link between these common criteria allows the use of the IB problem to derive asymptotic performance bounds, as well as design criteria, in most of the learning scenarios of classification, regression, and inference.

5.4. Representation Learning, ELBO and Autoencoders

The performance of machine learning algorithms depends strongly on the choice of data representation (or features) on which they are applied. For that reason, feature engineering, i.e., the set of all pre-processing operations and transformations applied to data with the aim of making them support effective machine learning, is important. However, because it is both data- and task-dependent, such feature engineering is labor intensive and highlights one of the major weaknesses of current learning algorithms: their inability to extract discriminative information from the data themselves instead of from hand-crafted transformations of them. In fact, although it may sometimes appear useful to deploy feature engineering in order to take advantage of human know-how and prior domain knowledge, it is highly desirable to make learning algorithms less dependent on feature engineering in order to make progress towards true artificial intelligence.

Representation learning is a sub-field of learning theory that aims at learning representations of the data that make it easier to extract useful information, possibly without recourse to any feature engineering. That is, the goal is to identify and disentangle the underlying explanatory factors that are hidden in the observed data. In the case of probabilistic models, a good representation is one that captures the posterior distribution of the underlying explanatory factors for the observed input. For related works, the reader may refer, e.g., to the proceedings of the International Conference on Learning Representations (ICLR), see https://iclr.cc/.

The use of the Shannon’s mutual information as a measure of similarity is particularly suitable for the purpose of learning a good representation of data [81]. In particular, a popular approach to representation learning are autoencoders, in which neural networks are designed for the task of representation learning. Specifically, we design a neural network architecture such that we impose a bottleneck in the network that forces a compressed knowledge representation of the original input, by optimizing the Evidence Lower Bound (ELBO), given as

$\mathcal{L}^{\mathrm{ELBO}}(\theta, \phi, \varphi) := \frac{1}{N} \sum_{i=1}^{N} \Big[ \log Q_{\phi}(x_i|u_i) - D_{\mathrm{KL}}\big(P_{\theta}(U_i|x_i) \,\|\, Q_{\varphi}(U_i)\big) \Big]$  (62)

over the neural network parameters θ, ϕ, φ. Note that this is precisely the variational IB cost in Equation (32) for β = 1 and Y = X, i.e., the IB variational bound when particularized to distributions whose parameters are determined by neural networks. In addition, note that the architecture shown in Figure 3 is the classical neural network architecture for autoencoders, and that it coincides with the variational IB solution resulting from the optimization of the IB problem in Section 3.3.1. In addition, note that Equation (32) provides an operational meaning to the β-VAE cost [82], as a criterion to design estimators on the relevance–complexity plane for different β values, since the β-VAE cost is given as

$\mathcal{L}^{\mathrm{VAE}}_{\beta}(\theta, \phi, \varphi) := \frac{1}{N} \sum_{i=1}^{N} \Big[ \log Q_{\phi}(x_i|u_i) - \beta\, D_{\mathrm{KL}}\big(P_{\theta}(U_i|x_i) \,\|\, Q_{\varphi}(U_i)\big) \Big],$  (63)

which coincides with the empirical version of the variational bound found in Equation (32).

5.5. Robustness to Adversarial Attacks

Recent advances in deep learning have enabled the design of high-accuracy neural networks. However, it has been observed that the high accuracy of trained neural networks may be compromised under nearly imperceptible changes in the inputs [83,84,85]. The information bottleneck has also found applications in providing methods to improve robustness to adversarial attacks when training models. In particular, the variational IB method of Alemi et al. [48] was shown to yield neural networks for classification that are more robust to adversarial attacks. Recently, alternative strategies for extracting features in supervised learning were proposed in [86] to construct classifiers robust to small perturbations in the input space. Robustness is measured in terms of the (statistical) Fisher information, given for two random variables (Y,Z) as

$\Phi(Z|Y) = E_{Y,Z}\left[\left\|\nabla_y \log p(Z|Y)\right\|^2\right]. \quad (64)$

The method in [86] builds upon the idea of the information bottleneck by introducing an additional penalty term that encourages the Fisher information in Equation (64) of the extracted features, parametrized by the inputs, to be small. For this problem, under jointly Gaussian vector sources (X,Y), the optimal representation is also shown to be Gaussian, in line with the results in Section 6.2.1 for the IB without robustness penalty. For general source distributions, a variational method is proposed, similar to the variational IB method in Section 3.3.1. The problem shows connections with the I-MMSE [87], the de Bruijn identity [88,89], the Cramér–Rao inequality [90], and Fano’s inequality [90].
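For intuition about the penalty in Equation (64), consider the simple case of a linear Gaussian feature map $Z = \mathbf{A}Y + N$ with $N\sim\mathcal{N}(\mathbf{0},\boldsymbol{\Sigma})$, for which $\nabla_y \log p(z|y) = \mathbf{A}^{\top}\boldsymbol{\Sigma}^{-1}(z-\mathbf{A}y)$ and hence $\Phi(Z|Y) = \mathrm{tr}(\mathbf{A}^{\top}\boldsymbol{\Sigma}^{-1}\mathbf{A})$. The following sketch (a hypothetical toy setup, not the estimator used in [86]) checks this identity by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
dy, dz = 3, 4                      # hypothetical dimensions of Y and Z
A = rng.normal(size=(dz, dy))      # linear feature map Z = A Y + N
L = rng.normal(size=(dz, dz))
Sigma = L @ L.T + dz * np.eye(dz)  # noise covariance (positive definite)
Sigma_inv = np.linalg.inv(Sigma)

# Closed form: Phi(Z|Y) = trace(A^T Sigma^{-1} A) for this Gaussian channel.
phi_closed_form = np.trace(A.T @ Sigma_inv @ A)

# Monte Carlo estimate of E_{Y,Z} || grad_y log p(Z|Y) ||^2.
n = 200000
Y = rng.normal(size=(n, dy))
noise = rng.multivariate_normal(np.zeros(dz), Sigma, size=n)
Z = Y @ A.T + noise
grads = (Z - Y @ A.T) @ Sigma_inv @ A          # rows: A^T Sigma^{-1} (z - A y)
phi_mc = np.mean(np.sum(grads ** 2, axis=1))

print(phi_closed_form, phi_mc)  # the two values agree up to Monte Carlo error
```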

6. Extensions: Distributed Information Bottleneck

Consider now a generalization of the IB problem in which the prediction is to be performed in a distributed manner. The model is shown in Figure 7. Here, the prediction of the target variable $Y$ is to be performed on the basis of samples of statistically correlated random variables $(X_1,\ldots,X_K)$ that are observed each at a distinct predictor. Throughout, we assume that the following Markov Chain holds for all $k \in \mathcal{K} := \{1,\ldots,K\}$,

$X_k \leftrightarrow Y \leftrightarrow X_{\mathcal{K}/k}. \quad (65)$

Figure 7. A model for distributed, e.g., multi-view, learning.

The variable Y is a target variable and we seek to characterize how accurately it can be predicted from a measurable random vector (X1,,XK) when the components of this vector are processed separately, each by a distinct encoder.

6.1. The Relevance–Complexity Region

The distributed IB problem of Figure 7 is studied in [91,92] on information-theoretic grounds. For both discrete memoryless (DM) and memoryless vector Gaussian models, the authors established fundamental limits of learning in terms of optimal trade-offs between relevance and complexity, leveraging the connection between the IB-problem and source coding. The following theorem states the result for the case of discrete memoryless sources.

Theorem 1

([91,92]). The relevance–complexity region $\mathcal{RI}_{\mathrm{DIB}}$ of the distributed learning problem is given by the union of all non-negative tuples $(\Delta, R_1,\ldots,R_K)\in\mathbb{R}_+^{K+1}$ that satisfy

$\Delta \leq \sum_{k\in\mathcal{S}}\left[R_k - I(X_k;U_k|Y,T)\right] + I(Y;U_{\mathcal{S}^c}|T), \quad \forall\,\mathcal{S}\subseteq\mathcal{K}, \quad (66)$

for some joint distribution of the form $P_T\, P_Y \prod_{k=1}^K P_{X_k|Y} \prod_{k=1}^K P_{U_k|X_k,T}$.

Proof. 

The proof of Theorem 1 can be found in Section 7.1 of [92] and is reproduced in Section 8.1 for completeness. □

For a given joint data distribution $P_{X_{\mathcal{K}},Y}$, Theorem 1 extends the single-encoder IB principle of Tishby in Equation (3) to the distributed learning model with $K$ encoders, which we denote by Distributed Information Bottleneck (DIB) problem. The result characterizes the optimal relevance–complexity trade-off as a region of achievable tuples $(\Delta, R_1,\ldots,R_K)$ in terms of a distributed representation learning problem involving the optimization over $K$ conditional pmfs $P_{U_k|X_k,T}$ and a pmf $P_T$. The pmfs $P_{U_k|X_k,T}$ correspond to stochastic encodings of the observation $X_k$ into a latent variable, or representation, $U_k$, which captures the information in the observation $X_k$ that is relevant to $Y$. The variable $T$ corresponds to a time-sharing among different encoding mappings (see, e.g., [51]). For such encoders, the optimal decoder is implicitly given by the conditional pmf of $Y$ given $U_1,\ldots,U_K$, i.e., $P_{Y|U_{\mathcal{K}},T}$.

The characterization of the relevance–complexity region can be used to derive a cost function for the D-IB, similarly to the IB-Lagrangian in Equation (3). For simplicity, let us consider the problem of maximizing the relevance under a sum-complexity constraint. Let $R_{\mathrm{sum}} = \sum_{k=1}^K R_k$ and

$\mathcal{RI}_{\mathrm{DIB}}^{\mathrm{sum}} := \left\{(\Delta, R_{\mathrm{sum}})\in\mathbb{R}_+^2 \,:\, \exists\,(R_1,\ldots,R_K)\in\mathbb{R}_+^K \text{ s.t. } \sum_{k=1}^K R_k = R_{\mathrm{sum}} \text{ and } (\Delta,R_1,\ldots,R_K)\in\mathcal{RI}_{\mathrm{DIB}}\right\}.$

We define the DIB-Lagrangian (under sum-rate) as

$\mathcal{L}_s(\mathbf{P}) := -H(Y|U_{\mathcal{K}}) - s\sum_{k=1}^K\left[H(Y|U_k) + I(X_k;U_k)\right]. \quad (67)$

The optimization of Equation (67) over the encoders $\mathbf{P} = \{P_{U_k|X_k}\}_{k=1}^K$ yields mappings that perform on the boundary of the relevance–sum-complexity region $\mathcal{RI}_{\mathrm{DIB}}^{\mathrm{sum}}$. To see this, note that $\mathcal{RI}_{\mathrm{DIB}}^{\mathrm{sum}}$ is composed of all pairs $(\Delta, R_{\mathrm{sum}})\in\mathbb{R}_+^2$ for which $\Delta \leq \Delta(R_{\mathrm{sum}}, P_{X_{\mathcal{K}},Y})$, with

$\Delta(R_{\mathrm{sum}}, P_{X_{\mathcal{K}},Y}) = \max_{\mathbf{P}}\, \min\left\{\, I(Y;U_{\mathcal{K}}),\; R_{\mathrm{sum}} - \sum_{k=1}^K I(X_k;U_k|Y) \,\right\}, \quad (68)$

where the maximization is over joint distributions that factorize as $P_Y \prod_{k=1}^K P_{X_k|Y} \prod_{k=1}^K P_{U_k|X_k}$. The pairs $(\Delta, R_{\mathrm{sum}})$ that lie on the boundary of $\mathcal{RI}_{\mathrm{DIB}}^{\mathrm{sum}}$ can be characterized as given in the following proposition.

Proposition 1.

For every pair $(\Delta, R_{\mathrm{sum}})\in\mathbb{R}_+^2$ that lies on the boundary of the region $\mathcal{RI}_{\mathrm{DIB}}^{\mathrm{sum}}$, there exists a parameter $s \geq 0$ such that $(\Delta, R_{\mathrm{sum}}) = (\Delta_s, R_s)$, with

$\Delta_s = \frac{1}{1+s}\left[(1+sK)H(Y) + sR_s + \max_{\mathbf{P}} \mathcal{L}_s(\mathbf{P})\right], \quad (69)$
$R_s = I(Y;U_{\mathcal{K}}^*) + \sum_{k=1}^K\left[I(X_k;U_k^*) - I(Y;U_k^*)\right], \quad (70)$

where $\mathbf{P}^*$ is the set of conditional pmfs $\mathbf{P} = \{P_{U_1|X_1},\ldots,P_{U_K|X_K}\}$ that maximizes the cost function in Equation (67).

Proof. 

The proof of Proposition 1 can be found in Section 7.3 of [92] and is reproduced here in Section 8.2 for completeness. □

The optimization of the distributed IB cost function in Equation (67) generalizes Tishby’s centralized information bottleneck formulation in Equation (3) to the distributed learning setting. Note that for $K=1$ the optimization in Equation (69) reduces to the single-encoder cost in Equation (3) with a multiplier $s/(1+s)$, as the short derivation below shows.
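The reduction can be verified directly from Equation (67) (a short sketch using only the identity $I(Y;U) = H(Y) - H(Y|U)$):

```latex
% For K = 1, the DIB-Lagrangian in Equation (67) reads
\mathcal{L}_s(P_{U|X}) = -H(Y|U) - s\big[H(Y|U) + I(X;U)\big]
                       = -(1+s)\,H(Y|U) - s\, I(X;U).
% Adding the constant (1+s)H(Y) and dividing by (1+s) leaves the maximizer unchanged:
\arg\max_{P_{U|X}} \mathcal{L}_s(P_{U|X})
   = \arg\max_{P_{U|X}} \Big\{ I(Y;U) - \tfrac{s}{1+s}\, I(X;U) \Big\},
% i.e., the single-encoder IB-Lagrangian of Equation (3) with multiplier \beta = s/(1+s).
```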

6.2. Solutions to the Distributed Information Bottleneck

The methods described in Section 3 can be extended to the distributed information bottleneck case in order to find the mappings $P_{U_1|X_1,T},\ldots,P_{U_K|X_K,T}$ in different scenarios.

6.2.1. Vector Gaussian Model

In this section, we show that, for the jointly Gaussian vector data model, it is enough to restrict to Gaussian auxiliaries $(\mathbf{U}_1,\ldots,\mathbf{U}_K)$ in order to exhaust the entire relevance–complexity region. In addition, we provide an explicit analytical expression of this region. Let $(\mathbf{X}_1,\ldots,\mathbf{X}_K,\mathbf{Y})$ be jointly Gaussian random vectors that satisfy the Markov Chain in Equation (65). Without loss of generality, let the target variable be a complex-valued, zero-mean multivariate Gaussian $\mathbf{Y}\in\mathbb{C}^{n_y}$ with covariance matrix $\boldsymbol{\Sigma}_{\mathbf{y}}$, i.e., $\mathbf{Y}\sim\mathcal{CN}(\mathbf{y};\mathbf{0},\boldsymbol{\Sigma}_{\mathbf{y}})$, and $\mathbf{X}_k\in\mathbb{C}^{n_k}$ given by

$\mathbf{X}_k = \mathbf{H}_k\mathbf{Y} + \mathbf{N}_k, \quad (71)$

where $\mathbf{H}_k\in\mathbb{C}^{n_k\times n_y}$ models the linear channel connecting $\mathbf{Y}$ to the observation at encoder $k$, and $\mathbf{N}_k\in\mathbb{C}^{n_k}$ is the noise vector at encoder $k$, assumed to be Gaussian with zero mean and covariance matrix $\boldsymbol{\Sigma}_k$, and independent of all other noises and of $\mathbf{Y}$.

For the vector Gaussian model in Equation (71), the result of Theorem 1, which can be extended to continuous sources using standard techniques, applies. The following theorem gives an explicit characterization of the relevance–complexity region, which we denote hereafter as $\mathcal{RI}_{\mathrm{DIB}}^{\mathrm{G}}$, and shows that, in order to exhaust this region, it is enough to restrict to no time sharing, i.e., $T=\emptyset$, and multivariate Gaussian test channels

$\mathbf{U}_k = \mathbf{A}_k\mathbf{X}_k + \mathbf{Z}_k \sim \mathcal{CN}(\mathbf{u}_k; \mathbf{A}_k\mathbf{x}_k, \boldsymbol{\Sigma}_{z,k}), \quad (72)$

where $\mathbf{A}_k\in\mathbb{C}^{n_k\times n_k}$ projects $\mathbf{X}_k$ and $\mathbf{Z}_k$ is a zero-mean Gaussian noise vector with covariance matrix $\boldsymbol{\Sigma}_{z,k}$.

Theorem 2.

For the vector Gaussian data model, the relevance–complexity region $\mathcal{RI}_{\mathrm{DIB}}^{\mathrm{G}}$ is given by the union of all tuples $(\Delta, R_1,\ldots,R_K)$ that satisfy

$\Delta \leq \sum_{k\in\mathcal{S}}\left[R_k + \log\left|\mathbf{I} - \boldsymbol{\Sigma}_k^{1/2}\boldsymbol{\Omega}_k\boldsymbol{\Sigma}_k^{1/2}\right|\right] + \log\left|\mathbf{I} + \sum_{k\in\mathcal{S}^c}\boldsymbol{\Sigma}_{\mathbf{y}}^{1/2}\mathbf{H}_k^{\dagger}\boldsymbol{\Omega}_k\mathbf{H}_k\boldsymbol{\Sigma}_{\mathbf{y}}^{1/2}\right|, \quad \forall\,\mathcal{S}\subseteq\mathcal{K},$

for some matrices $\mathbf{0}\preceq\boldsymbol{\Omega}_k\preceq\boldsymbol{\Sigma}_k^{-1}$.

Proof. 

The proof of Theorem 2 can be found in Section 7.5 of [92] and is reproduced here in Section 8.4 for completeness. □

Theorem 2 extends the result of [54,93] on the relevance–complexity trade-off characterization of the single-encoder IB problem for jointly Gaussian sources to $K$ encoders. The theorem also shows that the optimal test channels $P_{\mathbf{U}_k|\mathbf{X}_k}$ are multivariate Gaussian, as given by Equation (72).
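As an illustration of how the region of Theorem 2 can be evaluated, the sketch below (a hypothetical two-encoder example in Python with real-valued matrices for simplicity; in the complex case the transposes become conjugate transposes) computes, for fixed rates and a fixed admissible choice of the matrices $\boldsymbol{\Omega}_k$, the resulting bound on the relevance $\Delta$ by taking the minimum over all subsets $\mathcal{S}\subseteq\mathcal{K}$:

```python
import itertools
import numpy as np

def psd_sqrt(M):
    """Symmetric square root of a positive semi-definite matrix."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def logdet(M):
    return np.linalg.slogdet(M)[1]

rng = np.random.default_rng(2)
K, ny = 2, 3                                   # number of encoders, dimension of Y
nk = [3, 4]                                    # observation dimensions (hypothetical)
Sigma_y = np.eye(ny)                           # covariance of Y
H = [rng.normal(size=(n, ny)) for n in nk]     # channel matrices H_k
Sigma = [np.eye(n) for n in nk]                # noise covariances Sigma_k
R = [1.0, 1.5]                                 # complexities (rates) R_k, in nats

# Admissible choice 0 <= Omega_k <= Sigma_k^{-1}: here Omega_k = a_k * Sigma_k^{-1}.
a = [0.6, 0.4]
Omega = [a[k] * np.linalg.inv(Sigma[k]) for k in range(K)]

Sy_half = psd_sqrt(Sigma_y)
bounds = []
for r in range(K + 1):
    for S in itertools.combinations(range(K), r):
        Sc = [k for k in range(K) if k not in S]
        term_S = sum(R[k] + logdet(np.eye(nk[k])
                                   - psd_sqrt(Sigma[k]) @ Omega[k] @ psd_sqrt(Sigma[k]))
                     for k in S)
        term_Sc = logdet(np.eye(ny)
                         + sum(Sy_half @ H[k].T @ Omega[k] @ H[k] @ Sy_half for k in Sc))
        bounds.append(term_S + term_Sc)

print("achievable relevance for this (R, Omega):", min(bounds))
```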

Consider the following symmetric distributed scalar Gaussian setting, in which $Y\sim\mathcal{N}(0,1)$ and

$X_1 = \sqrt{\mathrm{snr}}\, Y + N_1, \quad (73a)$
$X_2 = \sqrt{\mathrm{snr}}\, Y + N_2, \quad (73b)$

where $N_1$ and $N_2$ are standard Gaussian random variables with zero mean and unit variance, both independent of $Y$. In this case, for $I(U_1;X_1)=R$ and $I(U_2;X_2)=R$, the optimal relevance is

$\Delta(R,\mathrm{snr}) = \frac{1}{2}\log\left(1 + 2\,\mathrm{snr}\,\exp(-4R)\left[\exp(4R) + \mathrm{snr} - \sqrt{\mathrm{snr}^2 + (1+2\,\mathrm{snr})\exp(4R)}\right]\right). \quad (74)$

An easy upper bound on the relevance can be obtained by assuming that X1 and X2 are encoded jointly at rate 2R, to get

$\Delta_{\mathrm{ub}}(R,\mathrm{snr}) = \frac{1}{2}\log(1+2\,\mathrm{snr}) - \frac{1}{2}\log\left(1 + 2\,\mathrm{snr}\exp(-4R)\right). \quad (75)$

The reader may notice that, if X1 and X2 are encoded independently, an achievable relevance level is given by

$\Delta_{\mathrm{lb}}(R,\mathrm{snr}) = \frac{1}{2}\log\left(1 + 2\,\mathrm{snr} - \mathrm{snr}\exp(-2R)\right) - \frac{1}{2}\log\left(1 + \mathrm{snr}\exp(-2R)\right). \quad (76)$
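To make the comparison concrete, the short sketch below (assuming natural logarithms, so that all quantities are in nats) evaluates Equations (74)–(76) and checks the expected ordering $\Delta_{\mathrm{lb}} \leq \Delta \leq \Delta_{\mathrm{ub}}$ for a few values of $R$ and $\mathrm{snr}$:

```python
import numpy as np

def delta_opt(R, snr):
    """Optimal relevance in Equation (74), in nats."""
    root = np.sqrt(snr**2 + (1 + 2*snr) * np.exp(4*R))
    return 0.5 * np.log(1 + 2*snr*np.exp(-4*R) * (np.exp(4*R) + snr - root))

def delta_ub(R, snr):
    """Upper bound of Equation (75): joint encoding of (X1, X2) at rate 2R."""
    return 0.5 * np.log(1 + 2*snr) - 0.5 * np.log(1 + 2*snr*np.exp(-4*R))

def delta_lb(R, snr):
    """Achievable relevance of Equation (76): independent encoding of X1 and X2."""
    return 0.5 * np.log(1 + 2*snr - snr*np.exp(-2*R)) - 0.5 * np.log(1 + snr*np.exp(-2*R))

for snr in (0.5, 1.0, 10.0):
    for R in (0.25, 1.0, 3.0):
        lb, opt, ub = delta_lb(R, snr), delta_opt(R, snr), delta_ub(R, snr)
        assert lb - 1e-12 <= opt <= ub + 1e-12
        print(f"snr={snr:5.1f}  R={R:4.2f}  lb={lb:.4f}  opt={opt:.4f}  ub={ub:.4f}")
```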

6.3. Solutions for Generic Distributions

Next, we present how the distributed information bottleneck problem can be solved for generic distributions. Similar to the case of the single-encoder IB problem, the solutions are based on a variational bound on the DIB-Lagrangian. For simplicity, we look at the D-IB under a sum-rate constraint [92].

6.4. A Variational Bound

The optimization of Equation (67) generally requires computing marginal distributions that involve the descriptions $U_1,\ldots,U_K$, which might not be possible in practice. In what follows, we derive a variational lower bound $\mathcal{L}_s^{\mathrm{VB}}(\mathbf{P},\mathbf{Q})$ on the DIB cost function $\mathcal{L}_s(\mathbf{P})$ in terms of families of stochastic mappings $Q_{Y|U_1,\ldots,U_K}$ (a decoder), $\{Q_{Y|U_k}\}_{k=1}^K$, and priors $\{Q_{U_k}\}_{k=1}^K$. For simplicity of notation, we let

$\mathbf{Q} := \{Q_{Y|U_1,\ldots,U_K},\, Q_{Y|U_1},\ldots,Q_{Y|U_K},\, Q_{U_1},\ldots,Q_{U_K}\}. \quad (77)$

The variational D-IB cost for the DIB-problem is given by

$\mathcal{L}_s^{\mathrm{VB}}(\mathbf{P},\mathbf{Q}) := \underbrace{E\left[\log Q_{Y|U_{\mathcal{K}}}(Y|U_{\mathcal{K}})\right]}_{\text{av. logarithmic loss}} + s\sum_{k=1}^K \underbrace{\left(E\left[\log Q_{Y|U_k}(Y|U_k)\right] - D_{\mathrm{KL}}\left(P_{U_k|X_k}\,\|\,Q_{U_k}\right)\right)}_{\text{regularizer}}. \quad (78)$

Lemma 1.

For fixed P, we have

$\mathcal{L}_s(\mathbf{P}) \geq \mathcal{L}_s^{\mathrm{VB}}(\mathbf{P},\mathbf{Q}), \quad \text{for all pmfs } \mathbf{Q}. \quad (79)$

In addition, there exists a unique $\mathbf{Q}$ that achieves the maximum $\max_{\mathbf{Q}}\mathcal{L}_s^{\mathrm{VB}}(\mathbf{P},\mathbf{Q}) = \mathcal{L}_s(\mathbf{P})$, and it is given by, for all $k\in\mathcal{K}$,

$Q_{U_k}^* = P_{U_k}, \quad (80a)$
$Q_{Y|U_k}^* = P_{Y|U_k}, \quad (80b)$
$Q_{Y|U_1,\ldots,U_K}^* = P_{Y|U_1,\ldots,U_K}, \quad (80c)$

where the marginals $P_{U_k}$ and the conditional marginals $P_{Y|U_k}$ and $P_{Y|U_1,\ldots,U_K}$ are computed from $\mathbf{P}$.

Proof. 

The proof of Lemma 1 can be found in Section 7.4 of [92] and is reproduced here in Section 8.3 for completeness. □

Then, the optimization in Equation (69) can be written in terms of the variational DIB cost function as follows,

$\max_{\mathbf{P}} \mathcal{L}_s(\mathbf{P}) = \max_{\mathbf{P}}\max_{\mathbf{Q}} \mathcal{L}_s^{\mathrm{VB}}(\mathbf{P},\mathbf{Q}). \quad (81)$

The variational DIB cost in Equation (78) is a generalization to distributed learning with $K$ encoders of the evidence lower bound (ELBO) of the target variable $Y$ given the representations $U_1,\ldots,U_K$ [15]. If $Y=(X_1,\ldots,X_K)$, the bound generalizes the ELBO used for VAEs to the setting of $K\geq 2$ encoders. In addition, note that Equation (78) also generalizes and provides an operational meaning to the β-VAE cost [82] with $\beta = s/(1+s)$, as a criterion to design estimators on the relevance–complexity plane for different β values.

6.5. Known Memoryless Distributions

When the data model is discrete and the joint distribution $P_{X_{\mathcal{K}},Y}$ is known, the DIB problem can be solved by using an iterative method that optimizes the variational DIB cost function in Equation (81), alternating over the distributions $\mathbf{P}$ and $\mathbf{Q}$. The optimal encoders and decoders of the D-IB under a sum-rate constraint satisfy the following self-consistent equations,

$p(u_k|x_k) = \frac{p(u_k)}{Z(s,x_k)}\exp\left(-\psi_s(u_k,x_k)\right), \qquad p(y|u_k) = \sum_{x_k\in\mathcal{X}_k} p(x_k|u_k)\, p(y|x_k), \qquad p(y|u_1,\ldots,u_K) = \sum_{x_{\mathcal{K}}\in\mathcal{X}_{\mathcal{K}}} p(x_{\mathcal{K}})\, p(u_{\mathcal{K}}|x_{\mathcal{K}})\, p(y|x_{\mathcal{K}})\,/\,p(u_{\mathcal{K}}),$

where $\psi_s(u_k,x_k) := D_{\mathrm{KL}}\left(P_{Y|x_k}\,\|\,Q_{Y|u_k}\right) + \frac{1}{s}\, E_{U_{\mathcal{K}/k}|x_k}\left[D_{\mathrm{KL}}\left(P_{Y|U_{\mathcal{K}/k},x_k}\,\|\,Q_{Y|U_{\mathcal{K}/k},u_k}\right)\right]$ and $Z(s,x_k)$ is a normalization term.

Alternating iterations of these equations converge to a solution for any initialization of $p(u_k|x_k)$, similarly to the Blahut–Arimoto and expectation–maximization (EM) algorithms.
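As an illustration of one half of such an alternating procedure, the sketch below (a hypothetical discrete example with K = 2 and small alphabets; it is not the full iterative algorithm) computes, from a given joint model and current encoders, the marginals and decoders $P_{U_k}$, $P_{Y|U_k}$ and $P_{Y|U_1,U_2}$ that, by Lemma 1, constitute the optimal $\mathbf{Q}$ for the current $\mathbf{P}$:

```python
import numpy as np

rng = np.random.default_rng(3)

def random_kernel(rows, cols):
    """Random conditional pmf with `rows` conditioning values and `cols` outcomes."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=1, keepdims=True)

# Hypothetical discrete model: p(y), p(x1|y), p(x2|y) and current encoders p(uk|xk).
Y, X1, X2, U1, U2 = 3, 4, 4, 2, 2          # alphabet sizes
p_y = rng.random(Y); p_y /= p_y.sum()
p_x1_y, p_x2_y = random_kernel(Y, X1), random_kernel(Y, X2)
p_u1_x1, p_u2_x2 = random_kernel(X1, U1), random_kernel(X2, U2)

# Joint p(y, x1, x2, u1, u2) = p(y) p(x1|y) p(x2|y) p(u1|x1) p(u2|x2).
joint = np.einsum('y,ya,yb,ai,bj->yabij', p_y, p_x1_y, p_x2_y, p_u1_x1, p_u2_x2)

# Q-step (Lemma 1): marginals and conditional marginals computed from P.
p_u1 = joint.sum(axis=(0, 1, 2, 4))                    # P_{U_1}
p_u2 = joint.sum(axis=(0, 1, 2, 3))                    # P_{U_2}
p_y_u1 = joint.sum(axis=(1, 2, 4)).T / p_u1[:, None]   # P_{Y|U_1}, shape (U1, Y)
p_y_u2 = joint.sum(axis=(1, 2, 3)).T / p_u2[:, None]   # P_{Y|U_2}, shape (U2, Y)
p_yu12 = joint.sum(axis=(1, 2))                        # p(y, u1, u2)
p_y_u12 = p_yu12 / p_yu12.sum(axis=0, keepdims=True)   # P_{Y|U_1,U_2}, shape (Y, U1, U2)

# Sanity check: all conditionals are normalized.
assert np.allclose(p_y_u1.sum(axis=1), 1) and np.allclose(p_y_u2.sum(axis=1), 1)
assert np.allclose(p_y_u12.sum(axis=0), 1)
```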

6.5.1. Distributed Variational IB

When the data distribution is unknown and only data samples are available, the variational DIB cost in Equation (81) can be optimized following similar steps as for the variational IB in Section 3.3.1 by parameterizing the encoding and decoding distributions P,Q using a family of distributions whose parameters are determined by DNNs. This allows us to formulate Equation (81) in terms of the DNN parameters, i.e., its weights, and optimize it by using the reparameterization trick [15], Monte Carlo sampling, and stochastic gradient descent (SGD)-type algorithms.

Considering encoders and decoders P,Q parameterized by DNN parameters θ,ϕ,φ, the DIB cost in Equation (81) can be optimized by considering the following empirical Monte Carlo approximation:

$\max_{\theta,\phi,\varphi}\; \frac{1}{n}\sum_{i=1}^n \left[\log Q_{\phi_{\mathcal{K}}}(y_i|u_{1,i,j},\ldots,u_{K,i,j}) + s\sum_{k=1}^K\left(\log Q_{\phi_k}(y_i|u_{k,i,j}) - D_{\mathrm{KL}}\left(P_{\theta_k}(U_{k,i}|x_{k,i})\,\|\,Q_{\varphi_k}(U_{k,i})\right)\right)\right], \quad (82)$

where $u_{k,i,j} = g_{\phi_k}(x_{k,i}, z_{k,j})$ are samples obtained via the reparameterization trick, with $z_{k,j}$ sampled from the $K$ random variables $Z_k \sim P_{Z_k}$. The details of the method can be found in [92]. The resulting architecture is shown in Figure 8 and generalizes that of autoencoders to the distributed setup with $K$ encoders; a minimal sketch is given below.
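A minimal sketch of such a parameterization for $K=2$ encoders (assuming PyTorch; layer sizes and names are hypothetical, the priors $Q_{\varphi_k}$ are fixed standard Gaussians rather than learned, and a single Monte Carlo sample is used per data point) is given below; the returned quantity is the empirical objective of Equation (82), to be maximized by an SGD-type optimizer over the network parameters:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DistributedVIB(nn.Module):
    """Sketch of the distributed variational IB objective in Equation (82), K = 2."""
    def __init__(self, x_dims=(20, 20), u_dim=8, n_classes=5):
        super().__init__()
        # Encoders P_{theta_k}(U_k | x_k): Gaussian with learned mean and log-variance.
        self.enc = nn.ModuleList([nn.Linear(d, 2 * u_dim) for d in x_dims])
        # Per-encoder decoders Q_{phi_k}(y | u_k) and joint decoder Q_{phi_K}(y | u_1, u_2).
        self.dec_k = nn.ModuleList([nn.Linear(u_dim, n_classes) for _ in x_dims])
        self.dec_joint = nn.Linear(2 * u_dim, n_classes)

    def forward(self, xs, y, s=0.1):
        us, kls = [], []
        for x, enc in zip(xs, self.enc):
            mu, logvar = enc(x).chunk(2, dim=-1)
            # Reparameterization trick: u_k = mu + sigma * z, z ~ N(0, I).
            us.append(mu + torch.exp(0.5 * logvar) * torch.randn_like(mu))
            # KL(P_{theta_k}(U_k|x_k) || Q_{varphi_k}(U_k)) with a standard Gaussian prior.
            kls.append(0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1))
        # log Q_{phi_K}(y | u_1, u_2): joint decoder term (average logarithmic loss).
        joint_ll = -F.cross_entropy(self.dec_joint(torch.cat(us, dim=-1)), y, reduction='none')
        # Regularizer: per-encoder decoders minus the KL terms, weighted by s.
        reg = sum(-F.cross_entropy(dec(u), y, reduction='none') - kl
                  for dec, u, kl in zip(self.dec_k, us, kls))
        return (joint_ll + s * reg).mean()   # empirical objective of Equation (82)

# Usage: ascend the objective with any SGD-type optimizer.
model = DistributedVIB()
x1, x2 = torch.randn(32, 20), torch.randn(32, 20)
y = torch.randint(0, 5, (32,))
objective = model((x1, x2), y)
(-objective).backward()
```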

Figure 8. Example parameterization of the Distributed Variational Information Bottleneck method using neural networks.

6.6. Connections to Coding Problems and Learning

Similar to the point-to-point IB-problem, the distributed IB problem also has abundant connections with (asymptotic) coding and learning problems.

6.6.1. Distributed Source Coding under Logarithmic Loss

A key element in the proof of the converse part of Theorem 3 is the connection with the Chief Executive Officer (CEO) source coding problem. For the case of $K\geq 2$ encoders, the characterization of the optimal rate-distortion region of this problem for general distortion measures has eluded information theorists for more than four decades; however, a characterization of the optimal region in the case of the logarithmic loss distortion measure has been provided recently in [65]. A key step in [65] is that the log-loss distortion measure admits a lower bound in the form of the entropy of the source conditioned on the decoder’s inputs. Leveraging this result, in our converse proof of Theorem 3, we derive a single-letter upper bound on the entropy of the channel inputs conditioned on the indices $J_{\mathcal{K}}$ that are sent by the relays, in the absence of knowledge of the codebook indices $F_{\mathcal{L}}$. In addition, the rate region of the vector Gaussian CEO problem under the logarithmic loss distortion measure has been found recently in [94,95].

6.6.2. Cloud RAN

Consider the discrete memoryless (DM) CRAN model shown in Figure 9. In this model, $L$ users communicate with a common destination or central processor (CP) through $K$ relay nodes, where $L\geq 1$ and $K\geq 1$. Relay node $k$, $1\leq k\leq K$, is connected to the CP via an error-free finite-rate fronthaul link of capacity $C_k$. In what follows, we let $\mathcal{L}:=[1:L]$ and $\mathcal{K}:=[1:K]$ indicate the set of users and relays, respectively. Similar to Simeone et al. [96], the relay nodes are constrained to operate without knowledge of the users’ codebooks and only know a time-sharing sequence $Q^n$, i.e., a set of time instants at which users switch among different codebooks. The obliviousness of the relay nodes to the actual codebooks of the users is modeled via the notion of randomized encoding [97,98]. That is, users or transmitters select their codebooks at random and the relay nodes are not informed about the currently selected codebooks, while the CP is given such information.

Figure 9. CRAN model with oblivious relaying and time-sharing.

Consider the following class of DM CRANs in which the channel outputs at the relay nodes are independent conditionally on the users’ inputs. That is, for all $k\in\mathcal{K}$ and all $i\in[1:n]$,

$Y_{k,i} \leftrightarrow X_{\mathcal{L},i} \leftrightarrow Y_{\mathcal{K}/k,i} \quad (83)$

forms a Markov Chain in this order.

The following theorem provides a characterization of the capacity region of this class of DM CRAN problem under oblivious relaying.

Theorem 3

([22,23]). For the class of DM CRANs with oblivious relay processing and enabled time-sharing for which Equation (83) holds, the capacity region $\mathcal{C}(C_{\mathcal{K}})$ is given by the union of all rate tuples $(R_1,\ldots,R_L)$ which satisfy

$\sum_{t\in\mathcal{T}} R_t \leq \sum_{s\in\mathcal{S}}\left[C_s - I(Y_s;U_s|X_{\mathcal{L}},Q)\right] + I(X_{\mathcal{T}};U_{\mathcal{S}^c}|X_{\mathcal{T}^c},Q),$

for all non-empty subsets $\mathcal{T}\subseteq\mathcal{L}$ and all $\mathcal{S}\subseteq\mathcal{K}$, for some joint measure of the form

$p(q)\prod_{l=1}^L p(x_l|q)\prod_{k=1}^K p(y_k|x_{\mathcal{L}})\prod_{k=1}^K p(u_k|y_k,q). \quad (84)$

The direct part of Theorem 3 can be obtained by a coding scheme in which each relay node compresses its channel output by using Wyner–Ziv binning to exploit the correlation with the channel outputs at the other relays, and forwards the bin index to the CP over its rate-limited link. The CP jointly decodes the compression indices (within the corresponding bins) and the transmitted messages, i.e., Cover-El Gamal compress-and-forward [99] (Theorem 3) with joint decompression and decoding (CF-JD). Alternatively, the rate region of Theorem 3 can also be obtained by a direct application of the noisy network coding (NNC) scheme of [64] (Theorem 1).

The connection between this problem, source coding, and the distributed information bottleneck is discussed in [22,23], particularly in the derivation of the converse part of the theorem. Note also the similarity between the resulting capacity region in Theorem 3 and the relevance–complexity region of the distributed information bottleneck in Theorem 1, despite the significant differences between the setups.

6.6.3. Distributed Inference, ELBO and Multi-View Learning

In many data analytics problems, data are collected from various sources of information or feature extractors and are intrinsically heterogeneous. For example, an image can be identified by its color or texture features, and a document may contain text and images. Conventional machine learning approaches concatenate all available data into one big row vector (or matrix) on which a suitable algorithm is then applied. Treating different observations as a single source might cause overfitting and is not physically meaningful because each group of data may have different statistical properties. Alternatively, one may partition the data into groups according to sample homogeneity, and each group of data can be regarded as a separate view. This paradigm, termed multi-view learning [100], has received growing interest, and various algorithms exist, sometimes under names such as co-training [101,102,103,104], multiple kernel learning [104], and subspace learning [105]. By using distinct encoder mappings to represent distinct groups of data, and jointly optimizing over all mappings to remove redundancy, multi-view learning offers a degree of flexibility that is not only desirable in practice but is also likely to result in better learning capability. Actually, as shown in [106], local learning algorithms produce fewer errors than global ones. Viewing the problem as that of function approximation, the intuition is that it is usually not easy to find a unique function that holds good predictability properties in the entire data space.

Besides, the distributed learning of Figure 7 clearly finds application in all those scenarios in which learning is performed collaboratively but distinct learners either only access subsets of the entire dataset (e.g., due to physical constraints) or access independent noisy versions of the entire dataset.

In addition, similar to the single-encoder case, the distributed IB also finds applications in establishing fundamental performance limits and in the formulation of cost functions from an operational point of view. One such example is the generalization of the commonly used ELBO, given in Equation (62), to the setup with K views or observations, as formulated in Equation (78). Similarly, from the formulation of the DIB problem, a natural generalization of classical autoencoders emerges, as given in Figure 8.

7. Outlook

A variant of the bottleneck problem in which the encoder’s output is constrained in terms of its entropy, rather than its mutual information with the encoder’s input as done originally in [1], was considered in [107]. The solution of this problem turns out to be a deterministic encoder map as opposed to the stochastic encoder map that is optimal under the IB framework of Tishby et al. [1], which results in a reduction of the algorithm’s complexity. This idea was then used and extended to the case of available resource (or time) sharing in [108].

In the context of privacy against inference attacks [109], the authors of [75,76] considered a dual of the information bottleneck problem in which $X\in\mathcal{X}$ represents some non-private data that are correlated with private data $Y\in\mathcal{Y}$. A legitimate receiver (analyst) wishes to infer as much information as possible about the non-private data $X$ but should not learn the private data $Y$. Because $X$ and $Y$ are correlated, sharing the non-private data $X$ with the analyst possibly reveals information about $Y$. For this reason, there is a trade-off between the amount of information that the user shares about $X$, as measured by the mutual information $I(U;X)$, and the information that he keeps private about $Y$, as measured by the mutual information $I(U;Y)$, where $U=\phi(X)$.

Among the interesting problems that are left unaddressed in this paper is that of characterizing optimal input distributions under rate-constrained compression at the relays, where, e.g., discrete signaling is already known to sometimes outperform Gaussian signaling for the single-user Gaussian CRAN [97]. It is conjectured that the optimal input distribution is discrete. Other issues relate to extensions to continuous-time filtered Gaussian channels, in parallel to the regular bottleneck problem [108], or to settings in which fronthaul links may not be available at some radio units, and this is unknown to the system; that is, the more radio units are connected to the central unit, the higher is the rate that can be conveyed over the CRAN uplink [110]. Alternatively, one may consider finding the worst-case noise under given input distributions, e.g., Gaussian, and rate-constrained compression at the relays. Furthermore, there are interesting aspects that address processing constraints on continuous waveforms, e.g., sampling at a given rate [111,112] with a focus on remote logarithmic distortion [65], which in turn boils down to the distributed bottleneck problem [91,92]. We also mention finite-sample-size analysis (i.e., finite block length n, which relates to the literature on finite-block-length coding in information theory). Finally, it is interesting to observe that the bottleneck problem leads to interesting problems when R does not necessarily scale with the block length n.

8. Proofs

8.1. Proof of Theorem 1

The proof relies on the equivalence of the studied distributed learning problem with the Chief Executive Officer (CEO) problem under the logarithmic-loss distortion measure, which was studied in [65] (Theorem 10). For the K-encoder CEO problem, let us consider $K$ encoding functions $\phi_k:\mathcal{X}_k^n\to M_k^{(n)}$ satisfying $nR_k \geq \log\left|\phi_k(\mathcal{X}_k^n)\right|$ and a decoding function $\tilde{\psi}: M_1^{(n)}\times\cdots\times M_K^{(n)}\to \hat{\mathcal{Y}}^n$, which produces a probabilistic estimate of $Y$ from the outputs of the encoders, i.e., $\hat{\mathcal{Y}}^n$ is the set of distributions on $\mathcal{Y}^n$. The quality of the estimate is measured in terms of the average log-loss.

Definition 1.

A tuple $(D, R_1,\ldots,R_K)$ is said to be achievable in the K-encoder CEO problem for $P_{X_{\mathcal{K}},Y}$ for which the Markov Chain in Equation (65) holds, if there exist a blocklength $n$, encoders $\phi_k$ for $k\in\mathcal{K}$, and a decoder $\tilde{\psi}$, such that

$D \geq E\left[\frac{1}{n}\log\frac{1}{\hat{P}_{Y^n|J_{\mathcal{K}}}\left(Y^n|\phi_1(X_1^n),\ldots,\phi_K(X_K^n)\right)}\right], \quad (85)$
$R_k \geq \frac{1}{n}\log\left|\phi_k(\mathcal{X}_k^n)\right| \quad \text{for all } k\in\mathcal{K}. \quad (86)$

The rate-distortion region $\mathcal{RD}_{\mathrm{CEO}}$ is given by the closure of all achievable tuples $(D, R_1,\ldots,R_K)$.

The following lemma shows that the minimum average logarithmic loss is the conditional entropy of Y given the descriptions. The result is essentially equivalent to [65] (Lemma 1) and it is provided for completeness.

Lemma 2.

Let us consider $P_{X_{\mathcal{K}},Y}$, the encoders $J_k = \phi_k(X_k^n)$, $k\in\mathcal{K}$, and the decoder $\hat{Y}^n = \tilde{\psi}(J_{\mathcal{K}})$. Then,

$E\left[\ell_{\log}(Y^n, \hat{Y}^n)\right] \geq H(Y^n|J_{\mathcal{K}}), \quad (87)$

with equality if and only if $\tilde{\psi}(J_{\mathcal{K}}) = \left\{P_{Y^n|J_{\mathcal{K}}}(y^n|J_{\mathcal{K}})\right\}_{y^n\in\mathcal{Y}^n}$.

Proof. 

Let $Z:=(J_1,\ldots,J_K)$ be the argument of $\tilde{\psi}$ and $\hat{P}(y^n|z)$ be a distribution on $\mathcal{Y}^n$. We have, for $Z=z$:

$E\left[\ell_{\log}(Y^n,\hat{Y}^n)\,|\,Z=z\right] = \sum_{y^n\in\mathcal{Y}^n} P(y^n|z)\log\frac{1}{\hat{P}(y^n|z)} \quad (88)$
$= \sum_{y^n\in\mathcal{Y}^n} P(y^n|z)\log\frac{P(y^n|z)}{\hat{P}(y^n|z)} + H(Y^n|Z=z) \quad (89)$
$= D_{\mathrm{KL}}\left(P(y^n|z)\,\|\,\hat{P}(y^n|z)\right) + H(Y^n|Z=z) \quad (90)$
$\geq H(Y^n|Z=z), \quad (91)$

where Equation (91) is due to the non-negativity of the KL divergence, and the equality holds if and only if $\hat{P}(y^n|z) = P(y^n|z)$, where $P(y^n|z) = \Pr\{Y^n=y^n|Z=z\}$, for all $z$ and $y^n\in\mathcal{Y}^n$. Averaging over $Z$ completes the proof. □

Essentially, Lemma 2 states that minimizing the average log-loss is equivalent to maximizing the relevance, as given by the mutual information $I\left(Y^n; \tilde{\psi}\left(\phi_1(X_1^n),\ldots,\phi_K(X_K^n)\right)\right)$. Formally, the connection between the distributed learning problem under study and the K-encoder CEO problem studied in [65] can be formulated as stated next.

Proposition 2.

A tuple $(\Delta, R_1,\ldots,R_K)\in\mathcal{RI}_{\mathrm{DIB}}$ if and only if $(H(Y)-\Delta, R_1,\ldots,R_K)\in\mathcal{RD}_{\mathrm{CEO}}$.

Proof. 

Let the tuple $(\Delta, R_1,\ldots,R_K)\in\mathcal{RI}_{\mathrm{DIB}}$ be achievable for some encoders $\phi_k$. It follows by Lemma 2 that, by letting the decoding function be $\tilde{\psi}(J_{\mathcal{K}}) = \{P_{Y^n|J_{\mathcal{K}}}(y^n|J_{\mathcal{K}})\}$, we have $E\left[\ell_{\log}(Y^n,\hat{Y}^n)\,|\,J_{\mathcal{K}}\right] = H(Y^n|J_{\mathcal{K}})$, and hence $(H(Y)-\Delta, R_1,\ldots,R_K)\in\mathcal{RD}_{\mathrm{CEO}}$.

Conversely, assume the tuple $(D, R_1,\ldots,R_K)\in\mathcal{RD}_{\mathrm{CEO}}$ is achievable. It follows by Lemma 2 that $H(Y)-D \leq \frac{1}{n}\left(H(Y^n) - H(Y^n|J_{\mathcal{K}})\right) = \frac{1}{n}\, I(Y^n;J_{\mathcal{K}})$, which implies $(\Delta, R_1,\ldots,R_K)\in\mathcal{RI}_{\mathrm{DIB}}$ with $\Delta = H(Y)-D$. □

The characterization of the rate-distortion region $\mathcal{RD}_{\mathrm{CEO}}$ has been established recently in [65] (Theorem 10). The proof of the theorem is completed by noting that Proposition 2 implies that the result in [65] (Theorem 10) can be applied to characterize the region $\mathcal{RI}_{\mathrm{DIB}}$, as given in Theorem 1.

8.2. Proof of Proposition 1

Let $\mathbf{P}^*$ be the maximizer in Equation (69). Then,

$(1+s)\Delta_s = (1+sK)H(Y) + sR_s + \mathcal{L}_s(\mathbf{P}^*) \quad (92)$
$= (1+sK)H(Y) + sR_s - H(Y|U_{\mathcal{K}}^*) - s\sum_{k=1}^K\left[H(Y|U_k^*) + I(X_k;U_k^*)\right] \quad (93)$
$= (1+sK)H(Y) + sR_s - H(Y|U_{\mathcal{K}}^*) - s\left(R_s - I(Y;U_{\mathcal{K}}^*) + KH(Y)\right) \quad (94)$
$= (1+s)\, I(Y;U_{\mathcal{K}}^*) \quad (95)$
$\leq (1+s)\,\Delta(R_s, P_{X_{\mathcal{K}},Y}), \quad (96)$

where Equation (93) is due to the definition of $\mathcal{L}_s(\mathbf{P})$ in Equation (67); Equation (94) holds since $\sum_{k=1}^K\left[I(X_k;U_k^*) + H(Y|U_k^*)\right] = R_s - I(Y;U_{\mathcal{K}}^*) + KH(Y)$ by Equation (70); and Equation (96) follows by using Equation (68).

Conversely, if $\mathbf{P}^*$ is the solution to the maximization in the function $\Delta(R_{\mathrm{sum}}, P_{X_{\mathcal{K}},Y})$ in Equation (68) such that $\Delta(R_{\mathrm{sum}}, P_{X_{\mathcal{K}},Y}) = \Delta_s$, then $\Delta_s \leq I(Y;U_{\mathcal{K}}^*)$ and $\Delta_s \leq R_{\mathrm{sum}} - \sum_{k=1}^K I(X_k;U_k^*|Y)$, and we have, for any $s\geq 0$, that

$\Delta(R_{\mathrm{sum}}, P_{X_{\mathcal{K}},Y}) = \Delta_s \quad (97)$
$\leq \Delta_s - \left(\Delta_s - I(Y;U_{\mathcal{K}}^*)\right) - s\left(\Delta_s - R_{\mathrm{sum}} + \sum_{k=1}^K I(X_k;U_k^*|Y)\right) \quad (98)$
$= I(Y;U_{\mathcal{K}}^*) - s\Delta_s + sR_{\mathrm{sum}} - s\sum_{k=1}^K I(X_k;U_k^*|Y) \quad (99)$
$= H(Y) - s\Delta_s + sR_{\mathrm{sum}} - H(Y|U_{\mathcal{K}}^*) - s\sum_{k=1}^K\left[I(X_k;U_k^*) + H(Y|U_k^*)\right] + sKH(Y) \quad (100)$
$\leq H(Y) - s\Delta_s + sR_{\mathrm{sum}} + \mathcal{L}_s^* + sKH(Y) \quad (101)$
$= H(Y) - s\Delta_s + sR_{\mathrm{sum}} + sKH(Y) - \left((1+sK)H(Y) + sR_s - (1+s)\Delta_s\right) \quad (102)$
$= \Delta_s + s\left(R_{\mathrm{sum}} - R_s\right), \quad (103)$

where in Equation (100) we use that $\sum_{k=1}^K I(X_k;U_k^*|Y) = -KH(Y) + \sum_{k=1}^K\left[I(X_k;U_k^*) + H(Y|U_k^*)\right]$, which follows by using the Markov Chain $U_k \leftrightarrow X_k \leftrightarrow Y \leftrightarrow (X_{\mathcal{K}/k}, U_{\mathcal{K}/k})$; Equation (101) follows since $\mathcal{L}_s^*$ is the maximum over all possible distributions $\mathbf{P}$ (possibly distinct from the $\mathbf{P}^*$ that maximizes $\Delta(R_{\mathrm{sum}}, P_{X_{\mathcal{K}},Y})$); and Equation (102) is due to Equation (69). Finally, Equation (103) is valid for any $R_{\mathrm{sum}}\geq 0$ and $s\geq 0$. Given $s$, and hence $(\Delta_s, R_s)$, letting $R_{\mathrm{sum}} = R_s$ yields $\Delta(R_s, P_{X_{\mathcal{K}},Y}) \leq \Delta_s$. Together with Equation (96), this completes the proof of Proposition 1.

8.3. Proof of Lemma 1

Let, for a given random variable $Z$ and $z\in\mathcal{Z}$, a stochastic mapping $Q_{Y|Z}(\cdot|z)$ be given. It is easy to see that

$H(Y|Z) = -E\left[\log Q_{Y|Z}(Y|Z)\right] - D_{\mathrm{KL}}\left(P_{Y|Z}\,\|\,Q_{Y|Z}\right). \quad (104)$

In addition, we have

$I(X_k;U_k) = H(U_k) - H(U_k|X_k) \quad (105)$
$= D_{\mathrm{KL}}\left(P_{U_k|X_k}\,\|\,Q_{U_k}\right) - D_{\mathrm{KL}}\left(P_{U_k}\,\|\,Q_{U_k}\right). \quad (106)$

Substituting Equations (104) and (106) into Equation (67), we get

$\mathcal{L}_s(\mathbf{P}) = \mathcal{L}_s^{\mathrm{VB}}(\mathbf{P},\mathbf{Q}) + D_{\mathrm{KL}}\left(P_{Y|U_{\mathcal{K}}}\,\|\,Q_{Y|U_{\mathcal{K}}}\right) + s\sum_{k=1}^K\left(D_{\mathrm{KL}}\left(P_{Y|U_k}\,\|\,Q_{Y|U_k}\right) + D_{\mathrm{KL}}\left(P_{U_k}\,\|\,Q_{U_k}\right)\right) \quad (107)$
$\geq \mathcal{L}_s^{\mathrm{VB}}(\mathbf{P},\mathbf{Q}), \quad (108)$

where Equation (108) follows by the non-negativity of relative entropy. In addition, note that the inequality in Equation (108) holds with equality iff Q* is given by Equation (80).

8.4. Proof of Theorem 2

The proof of Theorem 2 relies on deriving an outer bound on the relevance–complexity region, as given by Equation (66), and showing that it is achievable with Gaussian pmfs and without time-sharing. In doing so, we use the technique of [89] (Theorem 8), which relies on the de Bruijn identity and the properties of Fisher information and MMSE.

Lemma 3

([88,89]). Let $(\mathbf{X},\mathbf{Y})$ be a pair of random vectors with pmf $p(\mathbf{x},\mathbf{y})$. We have

$\log\left|(\pi e)\,\mathbf{J}^{-1}(\mathbf{X}|\mathbf{Y})\right| \leq h(\mathbf{X}|\mathbf{Y}) \leq \log\left|(\pi e)\,\mathrm{mmse}(\mathbf{X}|\mathbf{Y})\right|, \quad (109)$

where the conditional Fisher information matrix is defined as

$\mathbf{J}(\mathbf{X}|\mathbf{Y}) := E\left[\nabla\log p(\mathbf{X}|\mathbf{Y})\,\nabla\log p(\mathbf{X}|\mathbf{Y})^{\dagger}\right], \quad (110)$

and the minimum mean square error (MMSE) matrix is

$\mathrm{mmse}(\mathbf{X}|\mathbf{Y}) := E\left[\left(\mathbf{X} - E[\mathbf{X}|\mathbf{Y}]\right)\left(\mathbf{X} - E[\mathbf{X}|\mathbf{Y}]\right)^{\dagger}\right]. \quad (111)$

For $t\in\mathcal{T}$ and fixed $\prod_{k=1}^K p(\mathbf{u}_k|\mathbf{x}_k,t)$, choose $\boldsymbol{\Omega}_{k,t}$, $k=1,\ldots,K$, satisfying $\mathbf{0}\preceq\boldsymbol{\Omega}_{k,t}\preceq\boldsymbol{\Sigma}_k^{-1}$, such that

$\mathrm{mmse}\left(\mathbf{X}_k|\mathbf{Y},\mathbf{U}_{k,t},t\right) = \boldsymbol{\Sigma}_k - \boldsymbol{\Sigma}_k\boldsymbol{\Omega}_{k,t}\boldsymbol{\Sigma}_k. \quad (112)$

Note that such $\boldsymbol{\Omega}_{k,t}$ always exists since $\mathbf{0}\preceq\mathrm{mmse}\left(\mathbf{X}_k|\mathbf{Y},\mathbf{U}_{k,t},t\right)\preceq\boldsymbol{\Sigma}_k$, for all $t\in\mathcal{T}$ and $k\in\mathcal{K}$.

Using Equation (66), we get

$I(\mathbf{X}_k;\mathbf{U}_k|\mathbf{Y},t) \geq \log\left|\boldsymbol{\Sigma}_k\right| - \log\left|\mathrm{mmse}\left(\mathbf{X}_k|\mathbf{Y},\mathbf{U}_{k,t},t\right)\right| = -\log\left|\mathbf{I} - \boldsymbol{\Sigma}_k^{1/2}\boldsymbol{\Omega}_{k,t}\boldsymbol{\Sigma}_k^{1/2}\right|, \quad (113)$

where the inequality is due to Lemma 3, and Equation (113) is due to Equation (112).

In addition, we have

$I(\mathbf{Y};\mathbf{U}_{\mathcal{S}^c,t}|t) \leq \log\left|\boldsymbol{\Sigma}_{\mathbf{y}}\right| - \log\left|\mathbf{J}^{-1}(\mathbf{Y}|\mathbf{U}_{\mathcal{S}^c,t},t)\right| \quad (114)$
$= \log\left|\sum_{k\in\mathcal{S}^c}\boldsymbol{\Sigma}_{\mathbf{y}}^{1/2}\mathbf{H}_k^{\dagger}\boldsymbol{\Omega}_{k,t}\mathbf{H}_k\boldsymbol{\Sigma}_{\mathbf{y}}^{1/2} + \mathbf{I}\right|, \quad (115)$

where Equation (114) is due to Lemma 3 and Equation (115) is due to the following equality, which relates the MMSE matrix in Equation (112) and the Fisher information, and whose proof follows,

$\mathbf{J}(\mathbf{Y}|\mathbf{U}_{\mathcal{S}^c,t},t) = \sum_{k\in\mathcal{S}^c}\mathbf{H}_k^{\dagger}\boldsymbol{\Omega}_{k,t}\mathbf{H}_k + \boldsymbol{\Sigma}_{\mathbf{y}}^{-1}. \quad (116)$

To show Equation (116), we use the de Bruijn identity to relate the Fisher information with the MMSE, as given in the following lemma, the proof of which can be found in [89].

Lemma 4.

Let $(\mathbf{V}_1,\mathbf{V}_2)$ be a random vector with finite second moments and $\mathbf{N}\sim\mathcal{CN}(\mathbf{0},\boldsymbol{\Sigma}_N)$ independent of $(\mathbf{V}_1,\mathbf{V}_2)$. Then,

$\mathrm{mmse}\left(\mathbf{V}_2|\mathbf{V}_1,\mathbf{V}_2+\mathbf{N}\right) = \boldsymbol{\Sigma}_N - \boldsymbol{\Sigma}_N\,\mathbf{J}\left(\mathbf{V}_2+\mathbf{N}|\mathbf{V}_1\right)\boldsymbol{\Sigma}_N. \quad (117)$

From the MMSE of Gaussian random vectors [51],

$\mathbf{Y} = E\left[\mathbf{Y}|\mathbf{X}_{\mathcal{S}^c}\right] + \mathbf{Z}_{\mathcal{S}^c} = \sum_{k\in\mathcal{S}^c}\mathbf{G}_k\mathbf{X}_k + \mathbf{Z}_{\mathcal{S}^c}, \quad (118)$

where $\mathbf{G}_k = \boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}}\mathbf{H}_k^{\dagger}\boldsymbol{\Sigma}_k^{-1}$ and $\mathbf{Z}_{\mathcal{S}^c}\sim\mathcal{CN}\left(\mathbf{0},\boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}}\right)$, and

$\boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}}^{-1} = \boldsymbol{\Sigma}_{\mathbf{y}}^{-1} + \sum_{k\in\mathcal{S}^c}\mathbf{H}_k^{\dagger}\boldsymbol{\Sigma}_k^{-1}\mathbf{H}_k. \quad (119)$

Note that $\mathbf{Z}_{\mathcal{S}^c}$ is independent of $\mathbf{X}_{\mathcal{S}^c}$ due to the orthogonality principle of the MMSE estimator and its Gaussian distribution. Hence, it is also independent of $\mathbf{U}_{\mathcal{S}^c,t}$.

Thus, we have

$\mathrm{mmse}\left(\sum_{k\in\mathcal{S}^c}\mathbf{G}_k\mathbf{X}_k \,\Big|\, \mathbf{Y}, \mathbf{U}_{\mathcal{S}^c,t}, t\right) = \sum_{k\in\mathcal{S}^c}\mathbf{G}_k\,\mathrm{mmse}\left(\mathbf{X}_k|\mathbf{Y},\mathbf{U}_{\mathcal{S}^c,t},t\right)\mathbf{G}_k^{\dagger} \quad (120)$
$= \boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}}\left(\sum_{k\in\mathcal{S}^c}\mathbf{H}_k^{\dagger}\left(\boldsymbol{\Sigma}_k^{-1} - \boldsymbol{\Omega}_{k,t}\right)\mathbf{H}_k\right)\boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}}, \quad (121)$

where Equation (120) follows since the cross terms are zero due to the Markov Chain $(\mathbf{U}_{k,t},\mathbf{X}_k) \leftrightarrow \mathbf{Y} \leftrightarrow (\mathbf{U}_{\mathcal{K}/k,t},\mathbf{X}_{\mathcal{K}/k})$ (see Appendix V of [89]); and Equation (121) follows due to Equation (112) and the definition of $\mathbf{G}_k$.

Finally, we have

$\mathbf{J}(\mathbf{Y}|\mathbf{U}_{\mathcal{S}^c,t},t) = \boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}}^{-1} - \boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}}^{-1}\,\mathrm{mmse}\left(\sum_{k\in\mathcal{S}^c}\mathbf{G}_k\mathbf{X}_k\,\Big|\,\mathbf{Y},\mathbf{U}_{\mathcal{S}^c,t},t\right)\boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}}^{-1} \quad (122)$
$= \boldsymbol{\Sigma}_{\mathbf{y}|\mathbf{x}_{\mathcal{S}^c}}^{-1} - \sum_{k\in\mathcal{S}^c}\mathbf{H}_k^{\dagger}\left(\boldsymbol{\Sigma}_k^{-1} - \boldsymbol{\Omega}_{k,t}\right)\mathbf{H}_k \quad (123)$
$= \boldsymbol{\Sigma}_{\mathbf{y}}^{-1} + \sum_{k\in\mathcal{S}^c}\mathbf{H}_k^{\dagger}\boldsymbol{\Omega}_{k,t}\mathbf{H}_k, \quad (124)$

where Equation (122) is due to Lemma 4; Equation (123) is due to Equation (121); and Equation (124) follows due to Equation (119).

Then, averaging over the time-sharing random variable $T$ and letting $\bar{\boldsymbol{\Omega}}_k := \sum_{t\in\mathcal{T}} p(t)\,\boldsymbol{\Omega}_{k,t}$, we get, using Equation (113),

$I(\mathbf{X}_k;\mathbf{U}_k|\mathbf{Y},T) \geq -\sum_{t\in\mathcal{T}} p(t)\log\left|\mathbf{I} - \boldsymbol{\Sigma}_k^{1/2}\boldsymbol{\Omega}_{k,t}\boldsymbol{\Sigma}_k^{1/2}\right| \geq -\log\left|\mathbf{I} - \boldsymbol{\Sigma}_k^{1/2}\bar{\boldsymbol{\Omega}}_k\boldsymbol{\Sigma}_k^{1/2}\right|, \quad (125)$

where Equation (125) follows from the concavity of the log-det function and Jensen’s inequality.

Similarly, using Equation (115) and Jensen’s Inequality, we have

$I(\mathbf{Y};\mathbf{U}_{\mathcal{S}^c}|T) \leq \log\left|\sum_{k\in\mathcal{S}^c}\boldsymbol{\Sigma}_{\mathbf{y}}^{1/2}\mathbf{H}_k^{\dagger}\bar{\boldsymbol{\Omega}}_k\mathbf{H}_k\boldsymbol{\Sigma}_{\mathbf{y}}^{1/2} + \mathbf{I}\right|. \quad (126)$

The outer bound on $\mathcal{RI}_{\mathrm{DIB}}^{\mathrm{G}}$ is obtained by substituting Equations (125) and (126) into Equation (66), noting that $\bar{\boldsymbol{\Omega}}_k = \sum_{t\in\mathcal{T}}p(t)\,\boldsymbol{\Omega}_{k,t}\preceq\boldsymbol{\Sigma}_k^{-1}$ since $\mathbf{0}\preceq\boldsymbol{\Omega}_{k,t}\preceq\boldsymbol{\Sigma}_k^{-1}$, and taking the union over $\bar{\boldsymbol{\Omega}}_k$ satisfying $\mathbf{0}\preceq\bar{\boldsymbol{\Omega}}_k\preceq\boldsymbol{\Sigma}_k^{-1}$.

Finally, the proof is completed by noting that the outer bound is achieved with $T=\emptyset$ and multivariate Gaussian conditional distributions $p^*(\mathbf{u}_k|\mathbf{x}_k,t) = \mathcal{CN}\left(\mathbf{x}_k,\, \boldsymbol{\Sigma}_k^{1/2}\left(\boldsymbol{\Omega}_k^{-1} - \mathbf{I}\right)\boldsymbol{\Sigma}_k^{1/2}\right)$.

Acknowledgments

The authors would like to thank the anonymous reviewers for the constructive comments and suggestions, which helped us improve this manuscript.

Author Contributions

A.Z., I.E.-A. and S.S.(S.) equally contributed to the published work. All authors have read and agreed to the published version of the manuscript.

Funding

The work of S. Shamai was supported by the European Union’s Horizon 2020 Research And Innovation Programme, grant agreement No. 694630, and by the WIN consortium via the Israel minister of economy and science.

Conflicts of Interest

The authors declare no conflict of interest.

References

  • 1.Tishby N., Pereira F., Bialek W. The information bottleneck method; Proceedings of the Thirty-Seventh Annual Allerton Conference on Communication, Control, and Computing, Allerton House; Monticello, IL, USA. 22–24 September 1999; pp. 368–377. [Google Scholar]
  • 2.Pratt W.K. Digital Image Processing. John Willey & Sons Inc.; New York, NY, USA: 1991. [Google Scholar]
  • 3.Yu S., Principe J.C. Understanding Autoencoders with Information Theoretic Concepts. arXiv. 2018 doi: 10.1016/j.neunet.2019.05.003.1804.00057 [DOI] [PubMed] [Google Scholar]
  • 4.Yu S., Jenssen R., Principe J.C. Understanding Convolutional Neural Network Training with Information Theory. arXiv. 2018 doi: 10.1109/TNNLS.2020.2968509.1804.06537 [DOI] [PubMed] [Google Scholar]
  • 5.Kong Y., Schoenebeck G. Water from Two Rocks: Maximizing the Mutual Information. arXiv. 20181802.08887 [Google Scholar]
  • 6.Ugur Y., Aguerri I.E., Zaidi A. A generalization of Blahut-Arimoto algorithm to computing rate-distortion regions of multiterminal source coding under logarithmic loss; Proceedings of the IEEE Information Theory Workshop, ITW; Kaohsiung, Taiwan. 6–10 November 2017. [Google Scholar]
  • 7.Dobrushin R.L., Tsybakov B.S. Information transmission with additional noise. IRE Trans. Inf. Theory. 1962;85:293–304. doi: 10.1109/TIT.1962.1057738. [DOI] [Google Scholar]
  • 8.Witsenhausen H.S., Wyner A.D. A conditional Entropy Bound for a Pair of Discrete Random Variables. IEEE Trans. Inf. Theory. 1975;IT-21:493–501. doi: 10.1109/TIT.1975.1055437. [DOI] [Google Scholar]
  • 9.Witsenhausen H.S. Indirect Rate Distortion Problems. IEEE Trans. Inf. Theory. 1980;IT-26:518–521. doi: 10.1109/TIT.1980.1056251. [DOI] [Google Scholar]
  • 10.Shwartz-Ziv R., Tishby N. Opening the Black Box of Deep Neural Networks via Information. arXiv. 20171703.00810 [Google Scholar]
  • 11.Achille A., Soatto S. Emergence of Invariance and Disentangling in Deep Representations. arXiv. 20171706.01350 [Google Scholar]
  • 12.McAllester D.A. A PAC-Bayesian Tutorial with a Dropout Bound. arXiv. 20131307.2118 [Google Scholar]
  • 13.Alemi A.A. Variational Predictive Information Bottleneck. arXiv. 20191910.10831 [Google Scholar]
  • 14.Mukherjee S. Machine Learning using the Variational Predictive Information Bottleneck with a Validation Set. arXiv. 20191911.02210 [Google Scholar]
  • 15.Kingma D.P., Welling M. Auto-Encoding Variational Bayes. arXiv. 20131312.6114 [Google Scholar]
  • 16.Mukherjee S. General Information Bottleneck Objectives and their Applications to Machine Learning. arXiv. 20191912.06248 [Google Scholar]
  • 17.Strouse D., Schwab D.J. The information bottleneck and geometric clustering. Neural Comput. 2019;31:596–612. doi: 10.1162/neco_a_01136. [DOI] [PubMed] [Google Scholar]
  • 18.Painsky A., Tishby N. Gaussian Lower Bound for the Information Bottleneck Limit. J. Mach. Learn. Res. (JMLR) 2018;18:7908–7936. [Google Scholar]
  • 19.Kittichokechai K., Caire G. Privacy-constrained remote source coding; Proceedings of the 2016 IEEE International Symposium on Information Theory (ISIT); Barcelona, Spain. 10–15 July 2016; pp. 1078–1082. [Google Scholar]
  • 20.Tian C., Chen J. Successive Refinement for Hypothesis Testing and Lossless One-Helper Problem. IEEE Trans. Inf. Theory. 2008;54:4666–4681. doi: 10.1109/TIT.2008.928951. [DOI] [Google Scholar]
  • 21.Sreekumar S., Gündüz D., Cohen A. Distributed Hypothesis Testing Under Privacy Constraints. arXiv. 20181807.02764 [Google Scholar]
  • 22.Aguerri I.E., Zaidi A., Caire G., Shamai (Shitz) S. On the Capacity of Cloud Radio Access Networks with Oblivious Relaying; Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT); Aachen, Germany. 25–30 June 2017; pp. 2068–2072. [Google Scholar]
  • 23.Aguerri I.E., Zaidi A., Caire G., Shamai (Shitz) S. On the capacity of uplink cloud radio access networks with oblivious relaying. IEEE Trans. Inf. Theory. 2019;65:4575–4596. doi: 10.1109/TIT.2019.2897564. [DOI] [Google Scholar]
  • 24.Stark M., Bauch G., Lewandowsky J., Saha S. Decoding of Non-Binary LDPC Codes Using the Information Bottleneck Method; Proceedings of the ICC 2019-2019 IEEE International Conference on Communications (ICC); Shanghai, China. 20–24 May 2019; pp. 1–6. [Google Scholar]
  • 25.Bengio Y., Courville A., Vincent P. Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell. 2013;35:1798–1828. doi: 10.1109/TPAMI.2013.50. [DOI] [PubMed] [Google Scholar]
  • 26.Erdogmus D. Ph.D. Thesis. University of Florida Gainesville; Florida, FL, USA: 2002. Information Theoretic Learning: Renyi’s Entropy and Its Applications to Adaptive System Training. [Google Scholar]
  • 27.Principe J.C., Euliano N.R., Lefebvre W.C. Neural and Adaptive Systems: Fundamentals Through Simulations. Volume 672 Wiley; New York, NY, USA: 2000. [Google Scholar]
  • 28.Fisher J.W. Nonlinear Extensions to the Minumum Average Correlation Energy Filter. University of Florida; Gainesville, FL, USA: 1997. [Google Scholar]
  • 29.Jiao J., Courtade T.A., Venkat K., Weissman T. Justification of logarithmic loss via the benefit of side information. IEEE Trans. Inf. Theory. 2015;61:5357–5365. doi: 10.1109/TIT.2015.2462848. [DOI] [Google Scholar]
  • 30.Painsky A., Wornell G.W. On the Universality of the Logistic Loss Function. arXiv. 20181805.03804 [Google Scholar]
  • 31.Linsker R. Self-organization in a perceptual network. Computer. 1988;21:105–117. doi: 10.1109/2.36. [DOI] [Google Scholar]
  • 32.Quinlan J.R. C4. 5: Programs for Machine Learning. Elsevier; Amsterdam, The Netherlands: 2014. [Google Scholar]
  • 33.Chow C., Liu C. Approximating discrete probability distributions with dependence trees. IEEE Trans. Inf. Theory. 1968;14:462–467. doi: 10.1109/TIT.1968.1054142. [DOI] [Google Scholar]
  • 34.Olsen C., Meyer P.E., Bontempi G. On the impact of entropy estimation on transcriptional regulatory network inference based on mutual information. EURASIP J. Bioinf. Syst. Biol. 2008;2009:308959. doi: 10.1155/2009/308959. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Pluim J.P., Maintz J.A., Viergever M.A. Mutual-information-based registration of medical images: A survey. IEEE Trans. Med. Imaging. 2003;22:986–1004. doi: 10.1109/TMI.2003.815867. [DOI] [PubMed] [Google Scholar]
  • 36.Viola P., Wells W.M., III Alignment by maximization of mutual information. Int. J. Comput. Vis. 1997;24:137–154. doi: 10.1023/A:1007958904918. [DOI] [Google Scholar]
  • 37.Cesa-Bianchi N., Lugosi G. Prediction, Learning and Games. Cambridge University Press; New York, NY, USA: 2006. [Google Scholar]
  • 38.Lehmann E.L., Casella G. Theory of Point Estimation. Springer Science & Business Media; Berlin, Germany: 2006. [Google Scholar]
  • 39.Bousquet O., Elisseeff A. Stability and generalization. J. Mach. Learn. Res. 2002;2:499–526. [Google Scholar]
  • 40.Shalev-Shwartz S., Shamir O., Srebro N., Sridharan K. Learnability, stability and uniform convergence. J. Mach. Learn. Res. 2010;11:2635–2670. [Google Scholar]
  • 41.Xu A., Raginsky M. Information-theoretic analysis of generalization capability of learning algorithms; Proceedings of the 31st International Conference on Neural Information Processing Systems; Long Beach, CA, USA. 4–9 December 2017; pp. 2521–2530. [Google Scholar]
  • 42.Russo D., Zou J. How much does your data exploration overfit? Controlling bias via information usage. arXiv. 2015 doi: 10.1109/TIT.2019.2945779.1511.05219 [DOI] [Google Scholar]
  • 43.Amjad R.A., Geiger B.C. Learning Representations for Neural Network-Based Classification Using the Information Bottleneck Principle. IEEE Trans. Pattern Anal. Mach. Intell. 2019 doi: 10.1109/TPAMI.2019.2909031. [DOI] [PubMed] [Google Scholar]
  • 44.Paninski L. Estimation of entropy and mutual information. Neural Comput. 2003;15:1191–1253. doi: 10.1162/089976603321780272. [DOI] [Google Scholar]
  • 45.Jiao J., Venkat K., Han Y., Weissman T. Minimax estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory. 2015;61:2835–2885. doi: 10.1109/TIT.2015.2412945. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Valiant P., Valiant G. Estimating the unseen: improved estimators for entropy and other properties; Proceedings of the Advances in Neural Information Processing Systems 26 (NIPS 2013); Lake Tahoe, NV, USA. 5–10 December 2013; pp. 2157–2165. [Google Scholar]
  • 47.Chalk M., Marre O., Tkacik G. Relevant sparse codes with variational information bottleneck. arXiv. 20161605.07332 [Google Scholar]
  • 48.Alemi A., Fischer I., Dillon J., Murphy K. Deep Variational Information Bottleneck; Proceedings of the International Conference on Learning Representations, ICLR 2017; Toulon, France. 24–26 April 2017. [Google Scholar]
  • 49.Achille A., Soatto S. Information Dropout: Learning Optimal Representations Through Noisy Computation. IEEE Trans. Pattern Anal. Mach. Intell. 2018;40:2897–2905. doi: 10.1109/TPAMI.2017.2784440. [DOI] [PubMed] [Google Scholar]
  • 50.Harremoes P., Tishby N. The information bottleneck revisited or how to choose a good distortion measure; Proceedings of the 2007 IEEE International Symposium on Information Theory; Nice, France. 24–29 June 2007; pp. 566–570. [Google Scholar]
  • 51.Gamal A.E., Kim Y.H. Network Information Theory. Cambridge University Press; Cambridge, UK: 2011. [Google Scholar]
  • 52.Hotelling H. The most predictable criterion. J. Educ. Psycol. 1935;26:139–142. doi: 10.1037/h0058165. [DOI] [Google Scholar]
  • 53.Globerson A., Tishby N. On the Optimality of the Gaussian Information Bottleneck Curve. Hebrew University; Jerusalem, Israel: 2004. Technical Report. [Google Scholar]
  • 54.Chechik G., Globerson A., Tishby N., Weiss Y. Information Bottleneck for Gaussian Variables. J. Mach. Learn. Res. 2005;6:165–188. [Google Scholar]
  • 55.Wieczorek A., Roth V. On the Difference Between the Information Bottleneck and the Deep Information Bottleneck. arXiv. 2019 doi: 10.3390/e22020131.1912.13480 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Dempster A.P., Laird N.M., Rubin D.B. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B. 1977;39:1–38. [Google Scholar]
  • 57.Blahut R. Computation of channel capacity and rate-distortion functions. IEEE Trans. Inf. Theory. 1972;18:460–473. doi: 10.1109/TIT.1972.1054855. [DOI] [Google Scholar]
  • 58.Arimoto S. An algorithm for computing the capacity of arbitrary discrete memoryless channels. IEEE Trans. Inf. Theory. 1972;IT-18:12–20. doi: 10.1109/TIT.1972.1054753. [DOI] [Google Scholar]
  • 59.Winkelbauer A., Matz G. Rate-information-optimal gaussian channel output compression; Proceedings of the 48th Annual Conference on Information Sciences and Systems (CISS); Princeton, NJ, USA. 19–21 March 2014; pp. 1–5. [Google Scholar]
  • 60.Gálvez B.R., Thobaben R., Skoglund M. The Convex Information Bottleneck Lagrangian. Entropy. 2020;20:98. doi: 10.3390/e22010098. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Jang E., Gu S., Poole B. Categorical Reparameterization with Gumbel-Softmax. arXiv. 20171611.01144 [Google Scholar]
  • 62.Maddison C.J., Mnih A., Teh Y.W. The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. arXiv. 20161611.00712 [Google Scholar]
  • 63.Kingma D.P., Ba J. Adam: A Method for Stochastic Optimization. arXiv. 20141412.6980 [Google Scholar]
  • 64.Lim S.H., Kim Y.H., Gamal A.E., Chung S.Y. Noisy Network Coding. IEEE Trans. Inf. Theory. 2011;57:3132–3152. doi: 10.1109/TIT.2011.2119930. [DOI] [Google Scholar]
  • 65.Courtade T.A., Weissman T. Multiterminal source coding under logarithmic loss. IEEE Trans. Inf. Theory. 2014;60:740–761. doi: 10.1109/TIT.2013.2288257. [DOI] [Google Scholar]
  • 66.Csiszár I., Körner J. Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic Press; London, UK: 1981. [Google Scholar]
  • 67.Wyner A.D., Ziv J. The rate-distortion function for source coding with side information at the decoder. IEEE Trans. Inf. Theory. 1976;22:1–10. doi: 10.1109/TIT.1976.1055508. [DOI] [Google Scholar]
  • 68.Steinberg Y. Coding and Common Reconstruction. IEEE Trans. Inf. Theory. 2009;IT-11:4995–5010. doi: 10.1109/TIT.2009.2030487. [DOI] [Google Scholar]
  • 69.Benammar M., Zaidi A. Rate-Distortion of a Heegard-Berger Problem with Common Reconstruction Constraint; Proceedings of the International Zurich Seminar on Information and Communication; Cambridge, MA, USA. 1–6 July 2016. [Google Scholar]
  • 70.Benammar M., Zaidi A. Rate-distortion function for a heegard-berger problem with two sources and degraded reconstruction sets. IEEE Trans. Inf. Theory. 2016;62:5080–5092. doi: 10.1109/TIT.2016.2586919. [DOI] [Google Scholar]
  • 71.Sutskover I., Shamai S., Ziv J. Extremes of Information Combining. IEEE Trans. Inf. Theory. 2005;51:1313–1325. doi: 10.1109/TIT.2005.844077. [DOI] [Google Scholar]
  • 72.Land I., Huber J. Information Combining. Found. Trends Commun. Inf. Theory. 2006;3:227–230. doi: 10.1561/0100000013. [DOI] [Google Scholar]
  • 73.Wyner A.D. On source coding with side information at the decoder. IEEE Trans. Inf. Theory. 1975;21:294–300. doi: 10.1109/TIT.1975.1055374. [DOI] [Google Scholar]
  • 74.Ahlswede R., Korner J. Source coding with side information and a converse for degraded broadcast channels. IEEE Trans. Inf. Theory. 1975;21:629–637. doi: 10.1109/TIT.1975.1055469. [DOI] [Google Scholar]
  • 75.Makhdoumi A., Salamatian S., Fawaz N., Médard M. From the information bottleneck to the privacy funnel; Proceedings of the IEEE Information Theory Workshop, ITW; Hobart, Tasamania, Australia. 2–5 November 2014; pp. 501–505. [Google Scholar]
  • 76.Asoodeh S., Diaz M., Alajaji F., Linder T. Information Extraction Under Privacy Constraints. IEEE Trans. Inf. Theory. 2019;65:1512–1534. doi: 10.1109/TIT.2018.2865558. [DOI] [Google Scholar]
  • 77.Erkip E., Cover T.M. The efficiency of investment information. IEEE Trans. Inf. Theory. 1998;44:1026–1040. doi: 10.1109/18.669153. [DOI] [Google Scholar]
  • 78.Hinton G.E., van Camp D. Proceedings of the Sixth Annual Conference on Computational Learning Theory. ACM; New York, NY, USA: 1993. Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights; pp. 5–13. [Google Scholar]
  • 79.Gilad-Bachrach R., Navot A., Tishby N. Learning Theory and Kernel Machines. Springer; Berlin/Heidelberg, Germany: 2003. An Information Theoretic Tradeoff between Complexity and Accuracy; pp. 595–609. Lecture Notes in Computer Science. [Google Scholar]
  • 80.Vera M., Piantanida P., Vega L.R. The Role of Information Complexity and Randomization in Representation Learning. arXiv. 20181802.05355 [Google Scholar]
  • 81.Huang S.L., Makur A., Wornell G.W., Zheng L. On Universal Features for High-Dimensional Learning and Inference. arXiv. 20191911.09105 [Google Scholar]
  • 82.Higgins I., Matthey L., Pal A., Burgess C., Glorot X., Botvinick M., Mohamed S., Lerchner A. β-VAE: Learning basic visual concepts with a constrained variational framework; Proceedings of the International Conference on Learning Representations; Toulon, France. 24–26 April 2017. [Google Scholar]
  • 83.Kurakin A., Goodfellow I., Bengio S. Adversarial Machine Learning at Scale. arXiv. 20161611.01236 [Google Scholar]
  • 84.Madry A., Makelov A., Schmidt L., Tsipras D., Vladu A. Towards Deep Learning Models Resistant to Adversarial Attacks. arXiv. 20171706.06083 [Google Scholar]
  • 85.Engstrom L., Tran B., Tsipras D., Schmidt L., Madry A. A rotation and a translation suffice: Fooling cnns with simple transformations. arXiv. 20171712.02779 [Google Scholar]
  • 86.Pensia A., Jog V., Loh P.L. Extracting robust and accurate features via a robust information bottleneck. arXiv. 20191910.06893 [Google Scholar]
  • 87.Guo D., Shamai S., Verdu S. Mutual Information and Minimum Mean-Square Error in Gaussian Channels. IEEE Trans. Inf. Theory. 2005;51:1261–1282. doi: 10.1109/TIT.2005.844072. [DOI] [Google Scholar]
  • 88.Dembo A., Cover T.M., Thomas J.A. Information theoretic inequalities. IEEE Trans. Inf. Theory. 1991;37:1501–1518. doi: 10.1109/18.104312. [DOI] [Google Scholar]
  • 89.Ekrem E., Ulukus S. An Outer Bound for the Vector Gaussian CEO Problem. IEEE Trans. Inf. Theory. 2014;60:6870–6887. doi: 10.1109/TIT.2014.2358692. [DOI] [Google Scholar]
  • 90.Cover T.M., Thomas J.A. Elements of Information Theory. John Willey & Sons Inc.; New York, NY, USA: 1991. [Google Scholar]
  • 91.Aguerri I.E., Zaidi A. Distributed information bottleneck method for discrete and Gaussian sources; Proceedings of the International Zurich Seminar on Information and Communication, IZS; Zurich, Switzerland. 21–23 February 2018. [Google Scholar]
  • 92.Aguerri I.E., Zaidi A. Distributed Variational Representation Learning. arXiv. 2018 doi: 10.1109/TPAMI.2019.2928806.1807.04193 [DOI] [PubMed] [Google Scholar]
  • 93.Winkelbauer A., Farthofer S., Matz G. The rate-information trade-off for Gaussian vector channels; Proceedings of the The 2014 IEEE International Symposium on Information Theory; Honolulu, HI, USA. 29 June–4 July 2014; pp. 2849–2853. [Google Scholar]
  • 94.Ugur Y., Aguerri I.E., Zaidi A. Rate region of the vector Gaussian CEO problem under logarithmic loss; Proceedings of the 2018 IEEE Information Theory Workshop (ITW); Guangzhou, China. 25–29 November 2018. [Google Scholar]
  • 95.Ugur Y., Aguerri I.E., Zaidi A. Vector Gaussian CEO Problem under Logarithmic Loss. arXiv. 20181811.03933 [Google Scholar]
  • 96.Simeone O., Erkip E., Shamai S. On Codebook Information for Interference Relay Channels With Out-of-Band Relaying. IEEE Trans. Inf. Theory. 2011;57:2880–2888. doi: 10.1109/TIT.2011.2120030. [DOI] [Google Scholar]
  • 97.Sanderovich A., Shamai S., Steinberg Y., Kramer G. Communication Via Decentralized Processing. IEEE Tran. Inf. Theory. 2008;54:3008–3023. doi: 10.1109/TIT.2008.924659. [DOI] [Google Scholar]
  • 98.Lapidoth A., Narayan P. Reliable communication under channel uncertainty. IEEE Trans. Inf. Theory. 1998;44:2148–2177. doi: 10.1109/18.720535. [DOI] [Google Scholar]
  • 99.Cover T.M., El Gamal A. Capacity Theorems for the Relay Channel. IEEE Trans. Inf. Theory. 1979;25:572–584. doi: 10.1109/TIT.1979.1056084. [DOI] [Google Scholar]
  • 100.Xu C., Tao D., Xu C. A survey on multi-view learning. arXiv. 20131304.5634 [Google Scholar]
  • 101.Blum A., Mitchell T. Combining labeled and unlabeled data with co-training; Proceedings of the Eleventh Annual Conference on Computational Learning Theory; Madison, WI, USA. 24–26 July 1998; pp. 92–100. [Google Scholar]
  • 102.Dhillon P., Foster D.P., Ungar L.H. Multi-view learning of word embeddings via CCA; Proceedings of the 2011 Advances in Neural Information Processing Systems; Granada, Spain. 12–17 December 2011; pp. 199–207. [Google Scholar]
  • 103.Kumar A., Daumé H. A co-training approach for multi-view spectral clustering; Proceedings of the 28th International Conference on Machine Learning (ICML-11); Bellevue, WA, USA. 28 June–2 July 2011; pp. 393–400. [Google Scholar]
  • 104.Gönen M., Alpaydın E. Multiple kernel learning algorithms. J. Mach. Learn. Res. 2011;12:2211–2268. [Google Scholar]
  • 105.Jia Y., Salzmann M., Darrell T. Factorized latent spaces with structured sparsity; Proceedings of the Advances in Neural Information Processing Systems; Brno, Czech. 24–25 June 2010; pp. 982–990. [Google Scholar]
  • 106.Vapnik V. The Nature of Statistical Learning Theory. Springer Science & Business Media; Berlin, Germany: 2013. [Google Scholar]
  • 107.Strouse D.J., Schwab D.J. The deterministic information bottleneck. Mass. Inst. Tech. Neural Comput. 2017;26:1611–1630. doi: 10.1162/NECO_a_00961. [DOI] [PubMed] [Google Scholar]
  • 108.Homri A., Peleg M., Shitz S.S. Oblivious Fronthaul-Constrained Relay for a Gaussian Channel. IEEE Trans. Commun. 2018;66:5112–5123. doi: 10.1109/TCOMM.2018.2846248. [DOI] [Google Scholar]
  • 109.du Pin Calmon F., Fawaz N. Privacy against statistical inference; Proceedings of the 2012 50th Annual Allerton Conference on Communication, Control, and Computing (Allerton); Monticello, IL, USA. 1–5 October 2012. [Google Scholar]
  • 110.Karasik R., Simeone O., Shamai S. Robust Uplink Communications over Fading Channels with Variable Backhaul Connectivity. IEEE Trans. Commun. 2013;12:5788–5799. [Google Scholar]
  • 111.Chen Y., Goldsmith A.J., Eldar Y.C. Channel capacity under sub-Nyquist nonuniform sampling. IEEE Trans. Inf. Theory. 2014;60:4739–4756. doi: 10.1109/TIT.2014.2323406. [DOI] [Google Scholar]
  • 112.Kipnis A., Eldar Y.C., Goldsmith A.J. Analog-to-Digital Compression: A New Paradigm for Converting Signals to Bits. IEEE Signal Process. Mag. 2018;35:16–39. doi: 10.1109/MSP.2017.2774249. [DOI] [Google Scholar]
