. Author manuscript; available in PMC: 2023 Jan 2.
Published in final edited form as: Adv Neural Inf Process Syst. 2021 Dec;34:21008–21018.

Perturbation Theory for the Information Bottleneck

Vudtiwat Ngampruetikorn 1,*, David J Schwab 1
PMCID: PMC9806839  NIHMSID: NIHMS1853884  PMID: 36597463

Abstract

Extracting relevant information from data is crucial for all forms of learning. The information bottleneck (IB) method formalizes this, offering a mathematically precise and conceptually appealing framework for understanding learning phenomena. However the nonlinearity of the IB problem makes it computationally expensive and analytically intractable in general. Here we derive a perturbation theory for the IB method and report the first complete characterization of the learning onset, the limit of maximum relevant information per bit extracted from data. We test our results on synthetic probability distributions, finding good agreement with the exact numerical solution near the onset of learning. We explore the differences and subtleties between our derivation and previous attempts at deriving a perturbation theory for the learning onset, and attribute the discrepancy to a flawed assumption. Our work also provides a fresh perspective on the intimate relationship between the IB method and the strong data processing inequality.

1. Information Bottleneck

Extracting relevant information from data is crucial for all forms of learning. Animals are very adept at isolating biologically useful information from complicated real-world sensory stimuli: for example, we instinctively ignore pixel-level noise when looking for a face in a photo. A failure to disregard irrelevant bits could lead to suboptimal generalization performance especially when the data contains spurious correlations. For instance, an image classifier that relies on background texture to identify objects is likely to fail when presented with a new image showing an object in an ‘unusual’ background (see, e.g., Refs [7, 30]). Understanding the principles behind the identification and extraction of relevant bits is therefore of fundamental and practical importance.

Formalizing this aspect of learning, the information bottleneck (IB) method provides a precise notion of relevance with respect to a prediction target: the relevant information in a source (X) is the bits that carry information about the target (Y) [26]. The relevant bits in X are summarized in a representation (Z) via a stochastic map defined by an encoder $q(z|x)$, obeying the Markov constraint $Z \leftrightarrow X \leftrightarrow Y$.1 In general a trade-off exists between the amount of discarded information (compression) and the remaining relevant information in Z (prediction), thus motivating the IB cost function,2

$$\mathcal{L}[q(z|x)] = I(Z;X) - \beta\, I(Z;Y), \qquad (1)$$

where $\beta > 0$ denotes the trade-off parameter and $I(A;B)$ the mutual information. The first term favors succinct representations whereas the second encourages predictive ones. The IB loss is minimized by the representations that are most predictive of Y at fixed compression, parametrized by the Lagrange multiplier $\beta$ (see Fig 1a).

Figure 1: Information bottleneck & Learning onset.


a. The IB frontier (solid) is parametrized by the trade-off parameter $\beta$, whose inverse is the slope of this curve. The relevant information is bounded from above by the data processing inequality (DPI) [dotted line] and its tight version, the strong data processing inequality (SDPI) [Eq (3), dashed line], which touches the IB curve at the origin. The slope at the origin is equal to the inverse critical trade-off parameter $\beta_c^{-1}$, which marks the learning onset (circles in b-d). b-d. Our controlled expansions (dashed) vs the exact solution (solid) for the joint distribution $P_{X,Y}$ shown in (e). The red curves in (e) depict the perturbative IB encoder defined in Eq (14). We obtain the SDPI from Eqs (16) & (17) and the perturbative expansions in (b-d) from Eqs (26) & (27); see Appendix for relevant algorithms. Information is in bits.

The IB method offers a highly versatile framework with wide-ranging applications, including neural coding [16], evolutionary population dynamics [22], statistical physics [9], clustering [25], deep learning [1-3] and reinforcement learning [10]. However the nonlinearity of the IB problem makes it computationally expensive and difficult to analyze, barring a few special cases [5]. This necessitates an investigation of tractable methods for solving the IB problem. The use of variational approximations to reduce the computational cost has paved the way for a massive scale-up of the IB method [3]. Complementing this approach, we report a new analytical result for the IB problem in the tractable limiting case of learning onset.

2. Learning Onset

Although the IB loss in Eq (1) favors a representation that encodes every relevant bit in X as $\beta \to \infty$,3 the optimal representation need not contain any relevant information at finite $\beta$. To see this, we note that the loss vanishes for any uninformative representation, $I(Z;X) = I(Z;Y) = 0$, and thus an informative representation yields a lower loss only when the relevant information in Z is adequately large: a negative IB loss requires $I(Z;Y) > \beta^{-1} I(Z;X)$. But the relevant information is also bounded from above by the data processing inequality (DPI), $I(Z;Y) \le I(Z;X)$, resulting from the Markov constraint $Z \leftrightarrow X \leftrightarrow Y$ [6] (see Fig 1a). Combining these inequalities yields

$$\beta^{-1} I(Z;X) < I(Z;Y) \le I(Z;X), \qquad (2)$$

which cannot be met when $\beta^{-1} > 1$. Hence the existence of an informative IB minimizer requires $\beta^{-1} \le 1$. Indeed for any $P_{X,Y}$ with $I(X;Y) > 0$, there exists a critical trade-off parameter $\beta_c(X \to Y) \ge 1$ that marks the learning onset, separating two qualitatively distinct regimes: an uninformative regime at $\beta < \beta_c$ and an informative regime at $\beta > \beta_c$. The learning onset is the first in a series of transitions that emerges from the hierarchy of relevant information in the data [26].

Galvanized in part by the recent applications of the IB principle in deep learning, several works have attempted to characterize the IB transitions [8, 17, 28, 29]. However the IB problem remains intractable even in limiting cases, and a complete characterization of the IB transitions remains elusive. In fact the only exception is the special case of Gaussian variables, for which an exact solution exists [5]. In this work we derive a perturbation theory for the IB problem and offer the first complete description of the learning onset. We elaborate on the subtle differences between our theory and the previous works in Sec 6.

The learning onset is not only a special limit in the IB problem but also physically and practically relevant. It corresponds to the region where the relevant information per encoded bit is greatest and thus places a tight bound on the thermodynamic efficiency of predictive systems [23, 24]. An analysis of the IB learning onset has recently found applications in statistical physics [9]. The (inverse) critical trade-off parameter is also a useful measure of correlation between two random variables [13]; indeed its square root satisfies all but the symmetry property of Rényi's axioms for statistical dependence measures [21]. Finally, estimating the upper bound of $\beta_c$ might help weed out non-viable values of hyperparameters in deep learning techniques such as the variational information bottleneck [28, 29].

2.1. Strong data processing inequality

We can improve the bound on $\beta_c$ with the tight version of the DPI, the strong data processing inequality (SDPI) [4, 18, 20] (see Fig 1a)

$$I(Z;Y) \le \eta_{\mathrm{KL}}(X \to Y)\, I(Z;X), \qquad (3)$$

where $\eta_{\mathrm{KL}}(X \to Y)$ denotes the contraction coefficient for the Kullback-Leibler divergence, defined via

$$\eta_{\mathrm{KL}}(X \to Y) \equiv \sup_{R_X \ne P_X} \frac{D_{\mathrm{KL}}(R_Y \| P_Y)}{D_{\mathrm{KL}}(R_X \| P_X)}. \qquad (4)$$

Here $P_X$ and $P_Y$ denote the probability distributions of X and Y. The supremum is over all allowed distributions given the space of X, and $R_Y$ is related to $R_X$ via the channel $P_{Y|X}$. Replacing the DPI with the SDPI in Eq (2), we obtain

$$\beta_c(X \to Y) \ge \eta_{\mathrm{KL}}(X \to Y)^{-1}. \qquad (5)$$

In the following section we show that the equality holds, as expected (since the SDPI is tight). Note that $\eta_{\mathrm{KL}}(X \to Y)$ and $\beta_c(X \to Y)$ are generally asymmetric under $X \leftrightarrow Y$.

3. Perturbation Theory

We investigate the learning onset through the lens of perturbation theory. This method constructs the solution of a problem as a power series in a small parameter $\varepsilon$, when the solution for the limiting case $\varepsilon = 0$, the unperturbed solution, is accessible. For small $\varepsilon$, the higher order terms in this series represent ever smaller corrections to the unperturbed solution. To obtain these corrections, we insert the series solution into the initial problem and expand the resulting expressions as power series in $\varepsilon$, truncated at appropriate order. For example, the first-order theory drops all quadratic and higher terms (those proportional to $\varepsilon^2, \varepsilon^3, \ldots$), resulting in a consistency condition for the linear correction (i.e., the term proportional to $\varepsilon$). Requiring consistency up to $\varepsilon^n$ leads to the $n$th-order perturbation theory. In practice the first few corrections suffice for a characterization of the problem in the vicinity of $\varepsilon = 0$.
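As a toy illustration of this procedure (our own example, not from the paper), consider the root of $x^2 + x - \varepsilon = 0$ near the unperturbed solution $x_0 = 0$:

```latex
% Perturbative solution of x^2 + x - \varepsilon = 0 around x_0 = 0.
x = x_0 + \varepsilon x_1 + \varepsilon^2 x_2 + \cdots, \qquad x_0 = 0 .
% Insert the series and collect powers of \varepsilon:
\mathcal{O}(\varepsilon):\quad x_1 - 1 = 0 \;\Rightarrow\; x_1 = 1 ,
\qquad
\mathcal{O}(\varepsilon^2):\quad x_1^2 + x_2 = 0 \;\Rightarrow\; x_2 = -1 ,
% which reproduces the expansion of the exact root:
x = \tfrac{1}{2}\left(\sqrt{1 + 4\varepsilon} - 1\right)
  = \varepsilon - \varepsilon^2 + \mathcal{O}(\varepsilon^3).
```

Each order supplies a consistency condition that fixes one further coefficient, exactly as in the IB expansion below where the first- and second-order conditions together determine the onset.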

Our theory is based on a controlled expansion around the critical trade-off parameter $\beta_c$ and some uninformative encoder $q_0(z|x) = q_0(z)$,

$$q(z|x) = q_0(z|x) + \varepsilon\, q_1(z|x) + \varepsilon^2 q_2(z|x) + \cdots \qquad (6)$$
$$I(Z;X) = \varepsilon\, I^{(1)}_{Z;X}[q_1] + \varepsilon^2 I^{(2)}_{Z;X}[q_1, q_2] + \cdots, \qquad (7)$$

where $\varepsilon \equiv \beta - \beta_c \to 0^+$ and $\sum_z q_n(z|x) = 0$ for $n \ge 1$ to ensure normalization. Note that $I^{(0)}_{Z;X}$ vanishes for uninformative $q_0$. The first and second-order informations capture the first and second-order growths of information as $\beta$ rises above $\beta_c$ and are given by (see Appendix for derivation)

$$I^{(1)}_{Z;X}[q_1] = \sum_x p(x) \sum_{z \in \mathcal{Z}_1} q_1(z|x) \ln\frac{q_1(z|x)}{q_1(z)} \qquad (8)$$
$$I^{(2)}_{Z;X}[q_1,q_2] = \sum_x p(x) \left( \sum_{z \in \mathcal{Z}_0} \frac{q_1(z|x)^2 - q_1(z)^2}{2 q_0(z)} + \sum_{z \in \mathcal{Z}_1} q_2(z|x) \ln\frac{q_1(z|x)}{q_1(z)} + \sum_{z \in \mathcal{Z}_2} q_2(z|x) \ln\frac{q_2(z|x)}{q_2(z)} \right), \qquad (9)$$

where $\mathcal{Z}_0 = \mathrm{supp}(q_0)$ and $\mathcal{Z}_n = \mathrm{supp}(q_n) \setminus \bigcup_{i=0}^{n-1} \mathcal{Z}_i$ (i.e., $\mathcal{Z}_n$ contains the representation classes that first appear in the support of the $n$th-order encoder).4 The expansions for $q(z)$ and $q(z|y)$ take the same form as Eq (6), and the expressions for $I(Z;Y)$ are identical to Eqs (7)-(9) but with Y replacing X everywhere. Finally we write down the loss function as a power series in $\varepsilon$,

$$\mathcal{L}[q(z|x)] = \varepsilon\, \mathcal{L}^{(1)}[q_1] + \varepsilon^2 \mathcal{L}^{(2)}[q_1,q_2] + \cdots, \qquad (10)$$

where

$$\mathcal{L}^{(1)}[q_1] = I^{(1)}_{Z;X}[q_1] - \beta_c\, I^{(1)}_{Z;Y}[q_1] \qquad (11)$$
$$\mathcal{L}^{(2)}[q_1,q_2] = I^{(2)}_{Z;X}[q_1,q_2] - \beta_c\, I^{(2)}_{Z;Y}[q_1,q_2] - I^{(1)}_{Z;Y}[q_1]. \qquad (12)$$

3.1. First-order theory

Minimizing the first-order loss yields5

$$\min \mathcal{L}^{(1)} = \mathcal{L}^{(1)}[q_1] = 0 \quad\text{with}\quad \frac{q_1(z|x)}{q_1(z)} = \exp\!\left( \beta_c \sum_y p(y|x) \ln\frac{q_1(z|y)}{q_1(z)} \right) \text{ for } z \in \mathcal{Z}_1. \qquad (13)$$

As the ratio $q_1(z|x)/q_1(z)$ does not depend on z, we eliminate the superfluous dependence on z by defining

$$r(x) \equiv \frac{q_1(z|x)\, p(x)}{q_1(z)} \text{ for } z \in \mathcal{Z}_1, \quad\text{and}\quad r(y) \equiv \sum_x p(y|x)\, r(x). \qquad (14)$$

Note that both r(x) and r(y) are non-negative and normalized: $\sum_x r(x) = \sum_y r(y) = 1$. Substituting Eq (14) in Eqs (8) & (13), we obtain

$$I^{(1)}_{Z;X} = D_{\mathrm{KL}}[r(x)\|p(x)] \sum_{z \in \mathcal{Z}_1} q_1(z), \qquad I^{(1)}_{Z;Y} = D_{\mathrm{KL}}[r(y)\|p(y)] \sum_{z \in \mathcal{Z}_1} q_1(z), \qquad (15)$$

where

$$r(x) = p(x)\, e^{-\beta_c \left( D_{\mathrm{KL}}[p(y|x)\|r(y)] - D_{\mathrm{KL}}[p(y|x)\|p(y)] \right)}. \qquad (16)$$

Since the first-order loss vanishes [see Eq (13)], we have $I^{(1)}_{Z;X}[q_1] - \beta_c I^{(1)}_{Z;Y}[q_1] = 0$ and thus

$$\beta_c = \frac{I^{(1)}_{Z;X}[q_1]}{I^{(1)}_{Z;Y}[q_1]} = \frac{D_{\mathrm{KL}}[r(x)\|p(x)]}{D_{\mathrm{KL}}[r(y)\|p(y)]}. \qquad (17)$$

Note that the uninformative solution $r(x) = p(x)$ always satisfies Eq (16); we must therefore seek a nontrivial solution $r(x) \ne p(x)$.

We now show that the critical trade-off parameter is equivalent to the inverse contraction coefficient. First we note that r(x) in Eq (16) is a solution to a different optimization, described by the loss function $\mathcal{L}[f] = D_{\mathrm{KL}}[f(x)\|p(x)] - \beta_c D_{\mathrm{KL}}[f(y)\|p(y)]$. That is, $\delta \mathcal{L}/\delta f|_{f \to r} = 0$ and $\min \mathcal{L} = \mathcal{L}[r] = 0$. It follows immediately that $\delta\big( D_{\mathrm{KL}}[f(y)\|p(y)] / D_{\mathrm{KL}}[f(x)\|p(x)] \big)/\delta f\,\big|_{f \to r} = 0$ for $D_{\mathrm{KL}}[r(x)\|p(x)] > 0$; therefore

$$\beta_c^{-1} = \frac{D_{\mathrm{KL}}[r(y)\|p(y)]}{D_{\mathrm{KL}}[r(x)\|p(x)]} = \sup_{f \ne p} \frac{D_{\mathrm{KL}}[f(y)\|p(y)]}{D_{\mathrm{KL}}[f(x)\|p(x)]} = \eta_{\mathrm{KL}}(X \to Y), \qquad (18)$$

where the first and last equalities come from Eqs (17) & (4), respectively. The above analysis provides an alternative derivation of the equivalence between the contraction coefficients of mutual information and KL divergence [4, 18].

While our first-order theory provides a method for identifying the critical trade-off parameter by solving Eqs (16) & (17), it is incomplete. The optimal encoder in Eq (13) is determined only up to a multiplicative factor. Consequently the informations in Eq (15) still depend on $q_1(z)$, which can take any positive value for $z \in \mathcal{Z}_1$. This unphysical scale invariance is broken in the second-order theory.
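Eqs (16) & (17) suggest a simple self-consistent scheme for locating the learning onset of a discrete joint distribution. The following is a sketch of our own construction, not the algorithm in the paper's Appendix: it alternates the update (16) for r(x), renormalized at each step, with the ratio (17) for $\beta_c$.

```python
import numpy as np

def learning_onset(pxy, n_iter=5000, seed=1):
    """Fixed-point iteration for Eqs (16)-(17) on a discrete joint pxy[x, y].

    Returns (beta_c, r_x): the critical trade-off parameter and the
    perturbative profile r(x) at the onset of learning.
    """
    rng = np.random.default_rng(seed)
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    py_x = pxy / px[:, None]                    # channel p(y|x)
    r = px * (1.0 + 0.1 * rng.standard_normal(len(px)))
    r = np.abs(r); r /= r.sum()                 # random perturbation of p(x)
    beta = 1.5                                  # initial guess for beta_c
    for _ in range(n_iter):
        ry = r @ py_x                           # r(y) = sum_x p(y|x) r(x)
        # Eq (16): the exponent equals beta * sum_y p(y|x) ln[r(y)/p(y)]
        r = px * np.exp(beta * (py_x @ np.log(ry / py)))
        r /= r.sum()                            # keep r(x) normalized
        ry = r @ py_x
        dx = float(np.sum(r * np.log(r / px)))
        dy = float(np.sum(ry * np.log(ry / py)))
        if dy > 0:
            beta = dx / dy                      # Eq (17)
    return beta, r
```

Because KL divergence contracts through the channel $p(y|x)$, the ratio in Eq (17) satisfies $\beta \ge 1$ at every step; convergence to the nontrivial fixed point $r(x) \ne p(x)$ is not guaranteed in general and should be checked, e.g., by restarting from several random initializations.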

3.2. Second-order theory

From Eqs (9) & (12), we write down the second-order loss

$$\begin{aligned}
\mathcal{L}^{(2)}[q_1,q_2] &= \sum_{z \in \mathcal{Z}_0} \sum_{x,x'} \frac{q_1(z|x)\, K(x,x')\, q_1(z|x')}{2 q_0(z)} - I^{(1)}_{Z;Y}[q_1] && (19a)\\
&\quad + \sum_x p(x) \sum_{z \in \mathcal{Z}_1} q_2(z|x) \left( \ln\frac{q_1(z|x)}{q_1(z)} - \beta_c \sum_y p(y|x) \ln\frac{q_1(z|y)}{q_1(z)} \right) && (19b)\\
&\quad + \sum_x p(x) \sum_{z \in \mathcal{Z}_2} q_2(z|x) \left( \ln\frac{q_2(z|x)}{q_2(z)} - \beta_c \sum_y p(y|x) \ln\frac{q_2(z|y)}{q_2(z)} \right), && (19c)
\end{aligned}$$

where we define

$$K(x,x') \equiv \delta(x,x')\, p(x) + (\beta_c - 1)\, p(x)\, p(x') - \beta_c \sum_y p(y)\, p(x|y)\, p(x'|y). \qquad (20)$$

Optimizing $\mathcal{L}^{(2)}$ with respect to $q_2$ (for $\mathcal{Z}_1$ and $\mathcal{Z}_2$ separately) results in stationary conditions, which equate the terms in the parentheses of Eqs (19b) & (19c) to zero.6 Eliminating $I^{(1)}_{Z;Y}$ in Eq (19a) with Eq (15), we have

$$\mathcal{L}^{(2)}[q_1] = -D_{\mathrm{KL}}[r(y)\|p(y)] \sum_{z \in \mathcal{Z}_1} q_1(z) + \sum_{z \in \mathcal{Z}_0} \sum_{x,x'} \frac{q_1(z|x)\, K(x,x')\, q_1(z|x')}{2 q_0(z)}. \qquad (21)$$

Minimizing this loss function with respect to $q_1$, subject to the normalization $\sum_z q_1(z|x) = 0$, gives

$$\sum_{x'} \frac{K(x,x')\, q_1(z|x')}{q_0(z)} = -\left( \sum_{z' \in \mathcal{Z}_1} q_1(z') \right) \sum_{x'} K(x,x')\, \frac{r(x')}{p(x')} \quad \text{for } z \in \mathcal{Z}_0. \qquad (22)$$

Substituting the above in Eq (21) leads to

$$\mathcal{L}^{(2)}[q_1] = -D_{\mathrm{KL}}[r(y)\|p(y)] \sum_{z \in \mathcal{Z}_1} q_1(z) + \frac{\kappa}{2} \left( \sum_{z \in \mathcal{Z}_1} q_1(z) \right)^2, \qquad (23)$$

where we define

$$\kappa \equiv \sum_{x,x'} \frac{r(x)\, K(x,x')\, r(x')}{p(x)\, p(x')}. \qquad (24)$$

Assuming $\kappa > 0$,7 the final minimization with respect to $\sum_{z \in \mathcal{Z}_1} q_1(z)$ yields

$$\sum_{z \in \mathcal{Z}_1} q_1(z) = \frac{1}{\kappa}\, D_{\mathrm{KL}}[r(y)\|p(y)], \qquad (25)$$
$$\mathcal{L}^{(2)}[q] = -\frac{1}{2\kappa}\, D_{\mathrm{KL}}[r(y)\|p(y)]^2. \qquad (26)$$

Finally we eliminate the remaining dependence on q1 in Eq (15) and write down the first-order information

$$I^{(1)}_{Z;X} = \frac{1}{\kappa}\, D_{\mathrm{KL}}[r(x)\|p(x)]\, D_{\mathrm{KL}}[r(y)\|p(y)], \qquad I^{(1)}_{Z;Y} = \frac{1}{\kappa}\, D_{\mathrm{KL}}[r(y)\|p(y)]^2. \qquad (27)$$

We see that the second-order perturbation theory fixes the scales of the leading corrections to mutual information, thus completing our analysis of the learning onset. Furthermore these leading corrections are related via $I^{(1)}_{Z;X} = \beta_c I^{(1)}_{Z;Y}$ and $\mathcal{L}^{(2)} = -I^{(1)}_{Z;Y}/2$.
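The second-order quantities above are straightforward to evaluate for a discrete joint distribution. The sketch below (ours; the joint distribution, trade-off parameter, and candidate profile r(x) are illustrative assumptions) assembles the kernel $K(x,x')$ of Eq (20) and the curvature $\kappa$ of Eq (24), and checks the exact identity $\sum_x K(x,x') = 0$ noted in footnote 12.

```python
import numpy as np

def ib_kernel(pxy, beta_c):
    """Kernel K(x,x') of Eq (20) for a discrete joint pxy[x, y]."""
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    px_y = pxy / py[None, :]                   # p(x|y)
    return (np.diag(px)
            + (beta_c - 1.0) * np.outer(px, px)
            - beta_c * (px_y * py[None, :]) @ px_y.T)

def curvature_kappa(pxy, beta_c, r):
    """kappa of Eq (24) for a candidate perturbation profile r(x)."""
    px = pxy.sum(axis=1)
    u = r / px
    return float(u @ ib_kernel(pxy, beta_c) @ u)
```

Note that the trivial choice $r(x) = p(x)$ gives $\kappa = 0$ exactly; this is the zero mode of K associated with uninformative perturbations.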

4. Numerical Results

We now turn to comparing our theory to numerical results. In Fig 1, we compare the results from our perturbation theory [Eqs (16), (17), (26) & (27)] to the numerically exact solution of the IB problem for a synthetic joint distribution (shown in Fig 1e). Our theory correctly identifies the critical trade-off parameter and captures the leading corrections to the mutual information and IB loss in the vicinity of the learning onset (see Fig 1b-d). The inverse critical trade-off parameter $\beta_c^{-1}$ coincides with the slope of the strong data processing inequality (SDPI), which provides a tight upper bound for the IB frontier (Fig 1a). Note that the SDPI is tight at the origin [$I(Z;X) = I(Z;Y) = 0$] and is therefore fully characterized by our analysis of the learning onset.

Binary classification

In Fig 2, we consider the onset of learning for binary classification, in which the target Y is a binary random variable with equal probability for each class and the source variable X is drawn from a distribution that depends on the realization of Y. In other words, provided with some data x, we ask whether it was drawn from the blue or the red distribution in the top panel of Fig 2. In all cases we see a general trend that the inverse critical trade-off parameter $\beta_c^{-1}$ and the relevant information response $I^{(1)}_{Z;Y}$ increase with the available information $I(X;Y)$. Indeed for the Gaussian case (Fig 2a), the information response diverges in the high information limit (equivalent to a large difference in the means of the Gaussian distributions), which is also the limit where binary classification becomes deterministic, $I(X;Y) \to 1$ bit.

Figure 2: Learning onset in binary classification.


We illustrate the results of our theory for the case of a binary target variable with equal probability assigned to each class, i.e., $Y \in \{y_1, y_2\}$ and $p(Y{=}y_1) = p(Y{=}y_2) = 1/2$, for three different sets of conditional distributions $p(x|y)$ (a-c, top row). a. The source data X are drawn from a Gaussian distribution whose mean and variance depend on Y (top panel). We set the mean to zero and variance to one for $Y = y_1$ and solve the IB learning onset for various values of mean $\mu$ and standard deviation $\sigma$ for $Y = y_2$. The middle panel depicts the critical trade-off parameter, predicted by our theory in Sec 3 (filled circles) and by the methods from previous works described in Sec 6 (empty circles). The bottom panel shows the information response to a small perturbation in trade-off parameter [for definition see Eq (7)]. The theory predictions are plotted against the data mutual information, parametrized by the mean $\mu$ of $p(x|y_2)$ for four different values of standard deviation (see legend). The dotted lines display the power dependence and serve only as a guide to the eye to aid comparisons. b. Same as (a) but for exponential distributions; the curves are parametrized by the rate parameter $\lambda$ of the exponential distributions (see top panel). c. Same as (a) but for Poisson distributions; the curves are parametrized by $\lambda_2$ [mean of $p(x|y_2)$] for four values of $\lambda_1$ [mean of $p(x|y_1)$] (see legend). Information is in bits.

Noise dependence

In Fig 3, we depict the critical trade-off parameter and information response for joint distributions generated from $X \sim \mathrm{Unif}(-1,1)$ and $Y \sim f(X) + \mathcal{N}(0, \sigma^2)$ for various functional associations (panel a). We see that the critical trade-off parameter tends to one in the low noise limit, as expected for a deterministic functional relationship [14]. At higher noise levels, $\beta_c$ increases with $\sigma$ as it becomes harder to extract relevant bits from the data. This fact is also reflected in the first-order information $I^{(1)}_{Z;Y}$, which measures the change in relevant information as the trade-off parameter $\beta$ exceeds the critical value. For all functions considered, $I^{(1)}_{Z;Y}$ decreases with increasing noise standard deviation. Interestingly we see that the information response diverges in the deterministic limit, similar to the binary classification example shown in Fig 2a. Note that $I^{(1)}_{Z;X} = \beta_c I^{(1)}_{Z;Y}$ and $\mathcal{L}^{(2)} = -I^{(1)}_{Z;Y}/2$ [see Eqs (26) & (27)].

Figure 3: Learning onset for noisy functional relationships.


a. Functions used in data generation: $X \sim \mathrm{Unif}(-1,1)$ and $Y \sim f(X) + \mathcal{N}(0, \sigma^2)$. b. The inverse critical trade-off parameter $\beta_c^{-1}$ vs noise level, parametrized by the noise standard deviation $\sigma$ (left) and by the available information $I(X;Y)$ (right). c. The first-order growth of information vs noise level, parametrized by $\sigma$ (left) and by $I(X;Y)$ (right). Both the maximum relevant information per extracted bit $\beta_c^{-1}$ and the first-order relevant information $I^{(1)}_{Z;Y}$ decrease with noise level as it becomes increasingly difficult to extract relevant information. The dashed lines display the power dependence and serve only as a guide to the eye to aid comparisons. Information is in bits.

5. Learning Onset for Gaussian Variables

At first sight it seems that our theory, which is agnostic about the discrete or continuous nature of the representation, is at odds with the exact solution for Gaussian variables, which is based on a continuous representation [5]. In this section we show that our theory captures the learning onset for jointly Gaussian variables. Importantly we demonstrate that a discrete representation of continuous variables can describe the learning onset just as well as a continuous one.

Consider joint Gaussian variables

$$\begin{bmatrix} X \\ Y \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \begin{bmatrix} \Sigma_X & \Sigma_{XY} \\ \Sigma_{YX} & \Sigma_Y \end{bmatrix} \right). \qquad (28)$$

A convenient ansatz for r(x) and r(y) [for definitions, see Eq (14)] is a Gaussian distribution,

$$R_X = \mathcal{N}(v_X, \Lambda_X) \quad\text{and}\quad R_Y = \mathcal{N}(v_Y, \Lambda_Y), \qquad (29)$$

where $(v_X, \Lambda_X)$ denote the mean vector and covariance matrix for $R_X$, and $(v_Y, \Lambda_Y)$ those for $R_Y$. Using this ansatz, we write down the KL divergences in the exponent of Eq (16),

$$D_{\mathrm{KL}}[p(y|x)\|r(y)] = \tfrac{1}{2}\left( (\mu_{Y|x} - v_Y)^\top \Lambda_Y^{-1} (\mu_{Y|x} - v_Y) + \mathrm{tr}[\Lambda_Y^{-1} \Sigma_{Y|X}] - d_Y + \ln\tfrac{|\Lambda_Y|}{|\Sigma_{Y|X}|} \right) \qquad (30)$$
$$D_{\mathrm{KL}}[p(y|x)\|p(y)] = \tfrac{1}{2}\left( \mu_{Y|x}^\top \Sigma_Y^{-1} \mu_{Y|x} + \mathrm{tr}[\Sigma_Y^{-1} \Sigma_{Y|X}] - d_Y + \ln\tfrac{|\Sigma_Y|}{|\Sigma_{Y|X}|} \right), \qquad (31)$$

where $\Sigma_{Y|X} = \Sigma_Y - \Sigma_{YX} \Sigma_X^{-1} \Sigma_{XY}$, $d_Y$ denotes the dimensionality of Y, and we define $\mu_{Y|x} \equiv \Sigma_{YX} \Sigma_X^{-1} x$. The ratio between r(x) and p(x) is given by

$$\ln\frac{r(x)}{p(x)} = \tfrac{1}{2}\left( -(x - v_X)^\top \Lambda_X^{-1} (x - v_X) + x^\top \Sigma_X^{-1} x + \ln\tfrac{|\Sigma_X|}{|\Lambda_X|} \right). \qquad (32)$$

Since Eqs (30)-(32) are related via Eq (16), which holds for all values of x, we take the logarithm of Eq (16) and equate the terms quadratic in x, linear in x, and constant separately, yielding

$$\Lambda_X^{-1} - \Sigma_X^{-1} = \beta_c\, \Sigma_X^{-1} \Sigma_{XY} \left( \Lambda_Y^{-1} - \Sigma_Y^{-1} \right) \Sigma_{YX} \Sigma_X^{-1} \qquad (33)$$
$$\Lambda_X^{-1} v_X = \beta_c\, \Sigma_X^{-1} \Sigma_{XY} \Lambda_Y^{-1} v_Y \qquad (34)$$
$$v_X^\top \Lambda_X^{-1} v_X = \ln\tfrac{|\Sigma_X|}{|\Lambda_X|} + \beta_c \left( v_Y^\top \Lambda_Y^{-1} v_Y + \mathrm{tr}[(\Lambda_Y^{-1} - \Sigma_Y^{-1}) \Sigma_{Y|X}] - \ln\tfrac{|\Sigma_Y|}{|\Lambda_Y|} \right). \qquad (35)$$

We can find a solution to this set of equations by letting $\Lambda_X = \Sigma_X$ (which also leads to $\Lambda_Y = \Sigma_Y$). For this choice of covariance matrix, both sides of Eq (33) vanish and Eqs (34) & (35) reduce to8

$$\left( \mathbb{1} - \beta_c \left( \mathbb{1} - \Sigma_{X|Y} \Sigma_X^{-1} \right) \right) v_X = 0. \qquad (36)$$

Solving the above eigenproblem for the smallest possible critical trade-off parameter, we find $\beta_c = (1 - \lambda_{\min})^{-1}$ and $v_X \propto \phi_{\min}$, where $\lambda_{\min}$ denotes the smallest eigenvalue of $\Sigma_{X|Y} \Sigma_X^{-1}$ and $\phi_{\min}$ the corresponding eigenvector. While both Ref [5] and our work identify the same critical trade-off parameter and reveal the importance of the spectrum of $\Sigma_{X|Y} \Sigma_X^{-1}$, the analyses are distinct in that the representation is continuous in Ref [5] but can be discrete in our theory.9
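The eigenproblem (36) is easy to check numerically. A minimal sketch follows; the covariance blocks below are our own arbitrary assumptions, chosen so that the joint covariance is positive definite.

```python
import numpy as np

# Numerical illustration of Sec 5: for jointly Gaussian (X, Y),
# beta_c = 1/(1 - lambda_min), with lambda_min the smallest eigenvalue
# of Sigma_{X|Y} Sigma_X^{-1}.  Example covariance blocks (assumed):
Sx  = np.array([[1.0, 0.3], [0.3, 1.0]])
Sy  = np.eye(2)
Sxy = np.array([[0.6, 0.1], [0.0, 0.4]])

Sx_given_y = Sx - Sxy @ np.linalg.inv(Sy) @ Sxy.T   # Sigma_{X|Y}
lam = np.linalg.eigvals(Sx_given_y @ np.linalg.inv(Sx))
lam_min = lam.real.min()
beta_c = 1.0 / (1.0 - lam_min)                       # critical trade-off
```

For these blocks the eigenvalues lie in $(0, 1)$, so $\beta_c > 1$, consistent with the bound of Sec 2.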

6. Comparisons to Previous Works

The recent applications of the IB principle in machine learning [1-3, 7] have sparked much interest in characterizing the structure of the IB problem [28, 29]. Several works underscore the learning onset and IB transitions as important limiting cases, not least because they are a direct manifestation of the hierarchical structure of the relevant information in the data [5, 8, 17, 28, 29]. However the attempts to derive a perturbation theory for the learning onset are plagued by a flawed assumption that the representation space does not expand beyond the support of the unperturbed, uninformative encoder [8, 28, 29]. Equivalent to setting $\mathcal{Z}_1 = \mathcal{Z}_2 = \emptyset$ in Eqs (8) & (9) of our theory, this assumption significantly simplifies the analysis, but the resulting theory generally fails to identify the critical trade-off parameter.10 This raises serious questions about the insights gleaned from such expansions around a seemingly arbitrary point. In the following we explore the differences between our full treatment and the perturbation theory derived in previous works. In particular we argue that the theory in previous works describes the learning onset of a non-standard IB problem, defined with $\chi^2$-information (instead of Shannon information).

Setting $\mathcal{Z}_1 = \mathcal{Z}_2 = \emptyset$, the leading correction to the IB loss is of second order and is given by the first term of Eq (19a),11

$$\mathcal{L}^{(2)}[q_1] = \sum_{z \in \mathcal{Z}_0} \sum_{x,x'} \frac{q_1(z|x)\, K(x,x')\, q_1(z|x')}{2 q_0(z)}, \qquad (37)$$

where the dependence on $\beta_c$ is implicit [see Eq (20) for the definition of $K(x,x')$]. We see that $K(x,x')$ is the Hessian of the loss function, and its eigenvalues determine the curvatures of the loss landscape in the vicinity of the unperturbed encoder. In this theory the learning onset corresponds to the emergence of a direction along which the loss decreases quadratically, i.e., when the smallest eigenvalue first becomes negative. Note that $K(x,x')$ always has a vanishing eigenvalue, resulting from the fact that all uninformative perturbations $q_1(z|x) = q_1(z)$ lead to the same loss.12 In practice we may identify the learning onset with the point where the second smallest eigenvalue becomes zero, but a more efficient method exists (see below). Similarly to our first-order theory (Sec 3.1), this eigenvalue problem yields only the direction of the first-order encoder, and a higher order theory is required to fix the scale.

It is worth pointing out that if we define the IB problem [Eq (1)] with $\chi^2$-information instead of the standard Shannon information,13 Eq (37) is identical (up to a multiplicative factor) to the first-order loss in our full treatment (i.e., with $\mathcal{Z}_1 \ne \emptyset$). Indeed the resulting learning onset coincides with the SDPI for $\chi^2$-information. The contraction coefficient for $\chi^2$-information, $\eta_{\chi^2}$, is exactly the squared maximal correlation (for a review, see, e.g., Ref [15]); it is therefore symmetric under $X \leftrightarrow Y$ and equal to the square of the second largest singular value of the divergence transition matrix [12, 21],

$$B(x,y) \equiv \frac{p(x,y)}{\sqrt{p(x)\, p(y)}} \text{ for } p(x)p(y) > 0, \quad\text{and}\quad B(x,y) \equiv 0 \text{ otherwise}. \qquad (38)$$

Finally we note that $\eta_{\chi^2} \le \eta_{\mathrm{KL}}$ [19, 20]; hence the perturbation theory based on a fixed representation space gives an upper bound on the critical trade-off parameter of the standard IB problem.

Figure 2 demonstrates that even for simple binary classification, the theory with fixed representation space, which predicts $\hat\beta_c = \eta_{\chi^2}^{-1}$ (empty circles), does not correctly identify the learning onset (filled circles). For the set of examples shown, we see that the discrepancy between $\beta_c$ and $\hat\beta_c$ is greatest for the Gaussian case and at lower available information. Note that in the deterministic limit [$I(X;Y) = 1$ bit for binary classification] all contraction coefficients tend to one, and we do not expect any discrepancy there.
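The fixed-support prediction $\hat\beta_c = \eta_{\chi^2}^{-1}$ is itself easy to compute: $\eta_{\chi^2}$ is the squared second-largest singular value of the divergence transition matrix of Eq (38). A short sketch (the 2x2 joint below is an assumed example):

```python
import numpy as np

pxy = np.array([[0.25, 0.10],
                [0.05, 0.60]])                  # assumed example joint
px, py = pxy.sum(axis=1), pxy.sum(axis=0)
B = pxy / np.sqrt(np.outer(px, py))             # divergence transition matrix, Eq (38)
s = np.linalg.svd(B, compute_uv=False)          # singular values, descending
eta_chi2 = s[1] ** 2                            # squared maximal correlation
beta_hat = 1.0 / eta_chi2                       # fixed-support estimate of the onset
```

The largest singular value of B is always one (with singular vectors $\sqrt{p(x)}$ and $\sqrt{p(y)}$), so the second value `s[1]` is the relevant one; since $\eta_{\chi^2} \le \eta_{\mathrm{KL}}$, the resulting $\hat\beta_c$ only upper-bounds the true $\beta_c$.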

7. Discussion & Outlook

We derive a perturbation theory for the IB problem and offer a glimpse of the intimate connections between the learning onset and the strong data processing inequality. In future works we aim to build on our results to develop an algorithm for estimating the contraction coefficient from samples and explore novel methods for solving the IB problem in this limit. It would be interesting to further leverage the wealth of rigorous results from the literature on hypercontractivity and strong data processing inequalities to better understand the learning onset in the IB problem. In addition, various numerical techniques developed for the IB problem could significantly extend the range of applicability of contraction coefficients.

In Sec 5, we show that a discrete representation can also capture the learning onset for Gaussian variables. Our approach contrasts with the exact solution of Ref [5], which uses a continuous representation. This highlights the degeneracy of the global minimum in the IB problem and implies that discrete representations of continuous variables need not be suboptimal.

While the IB problem formulated with Shannon information is somewhat unique [11], our work reveals that analyses of the learning onset would be much simplified if one were to define the IB loss with $\chi^2$-information instead of Shannon information. The IB principle based on other f-informations could provide a more tractable formulation for certain problems and offer insights not readily available otherwise.

Supplementary Material

Supplementary Information

Acknowledgments and Disclosure of Funding

We thank Shervin Parsi and Sarang Gopalakrishnan for useful discussions during the early stages of the project. This work was supported in part by the National Institutes of Health BRAIN initiative (R01EB026943), the National Science Foundation, through the Center for the Physics of Biological Function (PHY-1734030), the Simons Foundation and the Sloan Foundation.

Footnotes

1

This Markov chain implies $P_{Y|X,Z} = P_{Y|X}$ and $P_{Z|X,Y} = P_{Z|X}$ (see, e.g., Ref [6]).

2

The optimization problem involving Eq (1) first appeared in a different context (see, e.g., Ref [27]).

3

The compression term, while infinitesimally small in this limit, still penalizes irrelevant information and prefers a representation Z that is a minimal sufficient statistic of X for Y.

4

Our theory generalizes the expansions in Refs [28, 29], which considered the case $\mathcal{Z}_1 = \mathcal{Z}_2 = \emptyset$.

5

Unlike in the original IB problem, here the optimization is unconstrained since the normalization $\sum_z q_1(z|x) = 0$ sums over both $\mathcal{Z}_0$ and $\mathcal{Z}_1$, and only the latter enters our first-order theory.

6

This optimization is unconstrained since the second-order loss does not depend on $q_2$ with $z \in \mathcal{Z}_0$ (see footnote 5). The resulting stationary conditions are identical to Eq (13), for $q_1$ with $z \in \mathcal{Z}_1$ and for $q_2$ with $z \in \mathcal{Z}_2$.

7

For $\kappa \le 0$, the loss function in Eq (23) is unbounded from below and a higher order perturbation theory is required to fix the scale of $q_1$.

8

Equation (35) becomes the same as Eq (34) but with $v_X^\top \Sigma_X^{-1}$ multiplied from the left.

9

We can always choose the unperturbed encoder to be an all-to-one map ($q_0(z_0|x) = 1$) and let the linear correction have access to one additional alphabet symbol ($q_1(z_1|x) > 0$).

10

Our set-up differs slightly from Refs [28, 29] in that we ask how optimal encoders respond to a small change in β as opposed to how the loss function changes in response to a small perturbation to an encoder. However this difference is not the reason why our theory produces a tight bound on the learning onset. Allowing the representation to take values outside the support of the unperturbed encoder is key to capturing the learning onset regardless of how a perturbation theory is constructed.

11

Note that this loss depends only on the first-order encoder q1. The second-order encoder q2 appears in higher order theories.

12

It is easy to verify that $\sum_x K(x,x') = 0$.

13

The $\chi^2$-information is defined as $I_{\chi^2}(X;Y) \equiv \sum_{x,y} p(x)\, p(y) \left( \frac{p(x,y)}{p(x)\, p(y)} - 1 \right)^2$.

References

  • [1].Achille A. and Soatto S. Information dropout: Learning optimal representations through noisy computation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2897–2905, 2018. doi: 10.1109/TPAMI.2017.2784440. [DOI] [PubMed] [Google Scholar]
  • [2].Achille A. and Soatto S. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19(50):1–34, 2018. http://jmlr.org/papers/v19/17-646.html. [Google Scholar]
  • [3].Alemi AA, Fischer I, Dillon JV, and Murphy K. Deep variational information bottleneck. In International Conference on Learning Representations, 2017. https://openreview.net/forum?id=HyxQzBceg. [Google Scholar]
  • [4].Anantharam V, Gohari AA, Kamath S, and Nair C. On maximal correlation, hypercontractivity, and the data processing inequality studied by Erkip and Cover, 2013. https://arxiv.org/abs/1304.6133.
  • [5].Chechik G, Globerson A, Tishby N, and Weiss Y. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, 6:165–188, 2005. https://www.jmlr.org/papers/v6/chechik05a.html. [Google Scholar]
  • [6].Cover TM and Thomas JA Elements of Information Theory. Wiley-Interscience, 2 ed., 2006. [Google Scholar]
  • [7].Dubois Y, Kiela D, Schwab DJ, and Vedantam R. Learning optimal representations with the decodable information bottleneck. In Larochelle H, Ranzato M, Hadsell R, Balcan MF, and Lin H, eds., Advances in Neural Information Processing Systems, vol. 33, pp. 18674–18690. Curran Associates, Inc., 2020. https://proceedings.neurips.cc/paper/2020/file/d8ea5f53c1b1eb087ac2e356253395d8-Paper.pdf. [Google Scholar]
  • [8].Gedeon T, Parker AE, and Dimitrov AG The mathematical structure of information bottleneck methods. Entropy, 14(3):456–479, 2012. doi: 10.3390/e14030456. [DOI] [Google Scholar]
  • [9].Gordon A, Banerjee A, Koch-Janusz M, and Ringel Z. Relevance in the renormalization group and in information theory, 2020. https://arxiv.org/abs/2012.01447. [DOI] [PubMed]
  • [10].Goyal A, Islam R, Strouse DJ, Ahmed Z, Larochelle H, Botvinick M, Levine S, and Bengio Y. Transfer and exploration via the information bottleneck. In International Conference on Learning Representations, 2019. https://openreview.net/forum?id=rJg8yhAqKm. [Google Scholar]
  • [11].Harremoes P. and Tishby N. The information bottleneck revisited or how to choose a good distortion measure. In 2007 IEEE International Symposium on Information Theory, pp. 566–570, 2007. doi: 10.1109/ISIT.2007.4557285. [DOI] [Google Scholar]
  • [12].Hirschfeld HO A connection between correlation and contingency. Mathematical Proceedings of the Cambridge Philosophical Society, 31(4):520–524, 1935. doi: 10.1017/S0305004100013517. [DOI] [Google Scholar]
  • [13].Kim H, Gao W, Kannan S, Oh S, and Viswanath P. Discovering potential correlations via hypercontractivity. In Guyon I, Luxburg UV, Bengio S, Wallach H, Fergus R, Vishwanathan S, and Garnett R, eds., Advances in Neural Information Processing Systems 30 (NIPS 2017), pp. 4577–4587. Curran Associates, Inc., 2017. http://papers.nips.cc/paper/7044-discovering-potential-correlations-via-hypercontractivity.pdf. [Google Scholar]
  • [14].Kolchinsky A, Tracey BD, and Kuyk SV Caveats for information bottleneck in deterministic scenarios. In International Conference on Learning Representations, 2019. https://openreview.net/forum?id=rke4HiAcY7. [Google Scholar]
  • [15].Makur A. Information contraction and decomposition. PhD thesis, Massachusetts Institute of Technology, 2019. [Google Scholar]
  • [16].Palmer SE, Marre O, Berry II MJ, and Bialek W. Predictive information in a sensory population. Proceedings of the National Academy of Sciences, 112(22):6908–6913, 2015. doi: 10.1073/pnas.1506855112. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [17].Parker A, Gedeon T, and Dimitrov A. Annealing and the rate distortion problem. In Becker S, Thrun S, and Obermayer K, eds., Advances in Neural Information Processing Systems 15 (NIPS 2002), vol. 15, pp. 993–976, 2003. https://proceedings.neurips.cc/paper/2002/file/ccbd8ca962b80445df1f7f38c57759f0-Paper.pdf. [Google Scholar]
  • [18].Polyanskiy Y. and Wu Y. Dissipation of information in channels with input constraints. IEEE Transactions on Information Theory, 62(1):35–55, 2016. doi: 10.1109/TIT.2015.2482978. [DOI] [Google Scholar]
  • [19].Polyanskiy Y. and Wu Y. Strong data-processing inequalities for channels and Bayesian networks. In Carlen E, Madiman M, and Werner EM, eds., Convexity and Concentration, pp. 211–249, New York, NY, 2017. Springer New York. [Google Scholar]
  • [20].Raginsky M. Strong data processing inequalities and Φ-Sobolev inequalities for discrete channels. IEEE Transactions on Information Theory, 62(6):3355–3389, 2016. doi: 10.1109/TIT.2016.2549542. [DOI] [Google Scholar]
  • [21].Rényi A. On measures of dependence. Acta Mathematica Academiae Scientiarum Hungarica, 10(3):441–451, 1959. doi: 10.1007/BF02024507. [DOI] [Google Scholar]
  • [22].Sachdeva V, Mora T, Walczak AM, and Palmer SE Optimal prediction with resource constraints using the information bottleneck. PLOS Computational Biology, 17(3):e1008743, 2021. doi: 10.1371/journal.pcbi.1008743. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [23].Still S. Thermodynamic cost and benefit of memory. Physical Review Letters, 124:050601, Feb 2020. doi: 10.1103/PhysRevLett.124.050601. [DOI] [PubMed] [Google Scholar]
  • [24].Still S, Sivak DA, Bell AJ, and Crooks GE Thermodynamics of prediction. Physical Review Letters, 109:120604, 2012. doi: 10.1103/PhysRevLett.109.120604. [DOI] [PubMed] [Google Scholar]
  • [25].Strouse DJ and Schwab DJ The information bottleneck and geometric clustering. Neural Computation, 31(3):596–612, 2019. doi: 10.1162/neco_a_01136. [DOI] [PubMed] [Google Scholar]
  • [26].Tishby N, Pereira FCN, and Bialek W. The information bottleneck method. In Hajek B. and Sreenivas RS, eds., 37th Allerton Conference on Communication, Control and Computing, pp. 368–377. University of Illinois, 1999. http://arxiv.org/abs/physics/0004057. [Google Scholar]
  • [27].Witsenhausen H. and Wyner A. A conditional entropy bound for a pair of discrete random variables. IEEE Transactions on Information Theory, 21(5):493–501, 1975. doi: 10.1109/TIT.1975.1055437. [DOI] [Google Scholar]
  • [28].Wu T. and Fischer I. Phase transitions for the information bottleneck in representation learning. In International Conference on Learning Representations, 2020. https://openreview.net/forum?id=HJloElBYvB. [Google Scholar]
  • [29].Wu T, Fischer I, Chuang IL, and Tegmark M. Learnability for the information bottleneck. Entropy, 21(10):924, 2019. doi: 10.3390/e21100924. [DOI] [Google Scholar]
  • [30].Xiao K, Engstrom L, Ilyas A, and Madry A. Noise or signal: The role of image backgrounds in object recognition, 2020. https://arxiv.org/abs/2006.09994.
