Author manuscript; available in PMC: 2019 Jan 23.
Published in final edited form as: Neural Comput. 2018 Jan 17;30(4):885–944. doi: 10.1162/neco_a_01056

Information-Theoretic Bounds and Approximations in Neural Population Coding

Wentao Huang 1, Kechen Zhang 2
PMCID: PMC6343676  NIHMSID: NIHMS1003919  PMID: 29342399

Abstract

While Shannon’s mutual information has widespread applications in many disciplines, for practical applications it is often difficult to calculate its value accurately for high-dimensional variables because of the curse of dimensionality. This article focuses on effective approximation methods for evaluating mutual information in the context of neural population coding. For large but finite neural populations, we derive several information-theoretic asymptotic bounds and approximation formulas that remain valid in high-dimensional spaces. We prove that optimizing the population density distribution based on these approximation formulas is a convex optimization problem that allows efficient numerical solutions. Numerical simulation results confirmed that our asymptotic formulas were highly accurate for approximating mutual information for large neural populations. In special cases, the approximation formulas are exactly equal to the true mutual information. We also discuss techniques of variable transformation and dimensionality reduction to facilitate computation of the approximations.

1. Introduction

Shannon’s mutual information (MI) provides a quantitative characterization of the association between two random variables by measuring how much knowing one of the variables reduces uncertainty about the other (Shannon, 1948). Information theory has become a useful tool for neuroscience research (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Borst & Theunissen, 1999; Pouget, Dayan, & Zemel, 2000; Laughlin & Sejnowski, 2003; Brown, Kass, & Mitra, 2004; Quiroga & Panzeri, 2009), with applications to problems such as sensory coding in the visual system (Eckhorn & Pöpel, 1975; Optican & Richmond, 1987; Atick & Redlich, 1990; McClurkin, Gawne, Optican, & Richmond, 1991; Atick, Li, & Redlich, 1992; Becker & Hinton, 1992; Van Hateren, 1992; Gawne & Richmond, 1993; Tovee, Rolls, Treves, & Bellis, 1993; Bell & Sejnowski, 1997; Lewis & Zhaoping, 2006) and the auditory system (Chechik et al., 2006; Gourévitch & Eggermont, 2007; Chase & Young, 2005).

One major problem encountered in practical applications of information theory is that the exact value of mutual information is often hard to compute in high-dimensional spaces. For example, suppose we want to calculate the mutual information between a random stimulus variable that requires many parameters to specify and the elicited noisy responses of a large population of neurons. In order to accurately evaluate the mutual information between the stimuli and the responses, one has to average over all possible stimulus patterns and over all possible response patterns of the whole population. This averaging quickly leads to a combinatorial explosion as either the stimulus dimension or the population size increases. This problem occurs not only when one computes MI numerically for a given theoretical model but also when one estimates MI empirically from experimental data.

Even when the input and output dimensions are not that high, an MI estimate from experimental data tends to have a positive bias due to limited sample size (Miller, 1955; Treves & Panzeri, 1995). For example, a perfectly flat joint probability distribution implies zero MI, but an empirical joint distribution with fluctuations due to finite data size appears to suggest a positive MI. The error may get much worse as the input and output dimensions increase because a reliable estimate of MI may require exponentially more data points to fill the space of the joint distribution. Various asymptotic expansion methods have been proposed to reduce the bias in an MI estimate (Miller, 1955; Carlton, 1969; Treves & Panzeri, 1995; Victor, 2000; Paninski, 2003). Other estimators of MI have also been studied, such as those based on k-nearest neighbor (Kraskov, Stögbauer, & Grassberger, 2004) and minimal spanning trees (Khan et al., 2007). However, it is not easy for these methods to handle the general situation with high-dimensional inputs and high-dimensional outputs.
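The positive bias described above can be illustrated with a minimal sketch. It draws samples from a flat joint distribution, for which the true MI is zero, and evaluates the plug-in estimate. The function names, the 4 × 4 alphabet, and the sample sizes are hypothetical choices made for illustration only.

```python
import math
import random

def plugin_mi(pairs, n_x, n_r):
    """Plug-in MI estimate (in nats) from a list of (x, r) samples."""
    n = len(pairs)
    joint = [[0] * n_r for _ in range(n_x)]
    for x, r in pairs:
        joint[x][r] += 1
    px = [sum(row) / n for row in joint]
    pr = [sum(joint[x][r] for x in range(n_x)) / n for r in range(n_r)]
    mi = 0.0
    for x in range(n_x):
        for r in range(n_r):
            pxr = joint[x][r] / n
            if pxr > 0.0:
                mi += pxr * math.log(pxr / (px[x] * pr[r]))
    return mi

def sample_flat(n, n_x, n_r, rng):
    """Draw n pairs from a flat joint distribution (true MI is exactly 0)."""
    return [(rng.randrange(n_x), rng.randrange(n_r)) for _ in range(n)]

rng = random.Random(0)
mi_small = plugin_mi(sample_flat(50, 4, 4, rng), 4, 4)
mi_large = plugin_mi(sample_flat(5000, 4, 4, rng), 4, 4)
print(f"plug-in MI, 50 samples:   {mi_small:.4f} nats")
print(f"plug-in MI, 5000 samples: {mi_large:.4f} nats")
```

The plug-in estimate is a Kullback-Leibler divergence between empirical distributions and therefore cannot be negative, so finite-sample fluctuations can only push it above the true value of zero.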

For numerical computation of MI for a given theoretical model, one useful approach is Monte Carlo sampling, a convergent method that may potentially reach arbitrary accuracy (Yarrow, Challis, & Series, 2012). However, its stochastic and inefficient computational scheme makes it unsuitable for many applications. For instance, to optimize the distribution of a neural population for a given set of stimuli, one may want to slightly alter the population parameters and see how the perturbation affects the MI, but a tiny change of MI can be easily drowned out by the inherent noise in the Monte Carlo method.

An alternative approach is to use information-theoretic bounds and approximations to simplify calculations. For example, the Cramér-Rao lower bound (Rao, 1945) tells us that the inverse of Fisher information (FI) is a lower bound to the mean square decoding error of any unbiased decoder. Fisher information is useful for many applications partly because it is often much easier to calculate than MI (see e.g., Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Zhang & Sejnowski, 1999; Abbott & Dayan, 1999; Bethge, Rotermund, & Pawelzik, 2002; Harper & McAlpine, 2004; Toyoizumi, Aihara, & Amari, 2006).

A link between MI and FI has been studied by several researchers (Clarke & Barron, 1990; Rissanen, 1996; Brunel & Nadal, 1998; Sompolinsky, Yoon, Kang, & Shamir, 2001). Clarke and Barron (1990) first derived an asymptotic formula between the relative entropy and FI for parameter estimation from independent and identically distributed (i.i.d.) observations with suitable smoothness conditions. Rissanen (1996) generalized it in the framework of stochastic complexity for model selection. Brunel and Nadal (1998) presented an asymptotic relationship between the MI and FI in the limit of a large number of neurons. The method was extended to discrete inputs by Kang and Sompolinsky (2001). More general discussions about this also appeared in other papers (e.g., Ganguli & Simoncelli, 2014; Wei & Stocker, 2015). However, for finite population size, the asymptotic formula may lead to large errors, especially for high-dimensional inputs, as detailed in sections 2.2 and 4.1.

In this article, our main goal is to improve FI approximations to MI for finite neural populations especially for high-dimensional inputs. Another goal is to discuss how to use these approximations to optimize neural population coding. We will present several information-theoretic bounds and approximation formulas and discuss the conditions under which they are established in section 2, with detailed proofs given in the appendix. We also discuss how our approximation formulas are related to other statistical estimators and information-theoretic bounds, such as Cramér-Rao bound and van Trees’ Bayesian Cramér-Rao bound (see section 3). In order to better apply the approximation formulas in high-dimensional input space, we propose some useful techniques in section 4, including variable transformation and dimensionality reduction, which may greatly reduce the computational complexity for practical applications. Finally, in section 5, we discuss how to use the approximation formulas for optimizing information transfer for neural population coding.

2. Bounds and Approximations for Mutual Information in Neural Population Coding

2.1. Mutual Information and Notations.

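Suppose the input $x$ is a $K$-dimensional vector, $x = (x_1, x_2, \ldots, x_K)^T$, and the outputs of $N$ neurons are denoted by a vector, $r = (r_1, r_2, \ldots, r_N)^T$. In this article, we denote random variables by uppercase letters (e.g., random variables $X$ and $R$) in contrast to their vector values $x$ and $r$. The MI $I(X; R)$ (denoted as $I$ below) between $X$ and $R$ is defined as (Cover & Thomas, 2006)

$$I = \int_{\mathcal{X}} \int_{\mathcal{R}} p(r|x)\, p(x) \ln \frac{p(r|x)}{p(r)} \, dr \, dx, \tag{2.1}$$

where $x \in \mathcal{X} \subseteq \mathbb{R}^K$, $r \in \mathcal{R} \subseteq \mathbb{R}^N$, $dx = \prod_{k=1}^{K} dx_k$, $dr = \prod_{n=1}^{N} dr_n$, and the integration symbol $\int$ is for continuous variables and can be replaced by the summation symbol $\sum$ for discrete variables. The probability density function (p.d.f.) of $r$, $p(r)$, satisfies

$$p(r) = \int_{\mathcal{X}} p(r|x)\, p(x) \, dx. \tag{2.2}$$

The MI $I$ in equation 2.1 may also be expressed equivalently as

$$I = H(X) - \left\langle \ln \frac{p(r)}{p(r|x)\, p(x)} \right\rangle_{r,x} = H(X) - H(X|R), \tag{2.3}$$

where $H(X)$ is the entropy of random variable $X$,

$$H(X) = -\langle \ln p(x) \rangle_x, \qquad H(X|R) = -\langle \ln p(x|r) \rangle_{r,x}, \tag{2.4}$$

and $\langle \cdot \rangle$ denotes expectation:

$$\langle \cdot \rangle_x = \int_{\mathcal{X}} p(x) (\cdot) \, dx, \tag{2.5}$$
$$\langle \cdot \rangle_{r|x} = \int_{\mathcal{R}} p(r|x) (\cdot) \, dr, \tag{2.6}$$
$$\langle \cdot \rangle_{r,x} = \int_{\mathcal{X}} \int_{\mathcal{R}} p(r,x) (\cdot) \, dr \, dx. \tag{2.7}$$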
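As a small sanity check of these definitions, the following sketch evaluates the MI of a discrete toy example in two ways: directly from equation 2.1 with the integrals replaced by sums, and as the entropy difference $H(X) - H(X|R)$ of equations 2.3 and 2.4. The 2 × 3 joint table is a hypothetical example, not data from this article.

```python
import math

# Hypothetical 2 x 3 joint table p(x, r); rows index x, columns index r.
joint = [[0.30, 0.15, 0.05],
         [0.05, 0.15, 0.30]]

px = [sum(row) for row in joint]
pr = [sum(joint[x][r] for x in range(2)) for r in range(3)]

# Equation 2.1 with sums: I = sum_{x,r} p(r|x) p(x) ln(p(r|x) / p(r)).
mi_direct = sum(
    joint[x][r] * math.log(joint[x][r] / (px[x] * pr[r]))
    for x in range(2) for r in range(3)
)

# Equations 2.3 and 2.4: I = H(X) - H(X|R), with p(x|r) = p(x, r) / p(r).
h_x = -sum(p * math.log(p) for p in px)
h_x_given_r = -sum(
    joint[x][r] * math.log(joint[x][r] / pr[r])
    for x in range(2) for r in range(3)
)
mi_entropy = h_x - h_x_given_r
print(mi_direct, mi_entropy)
```

The two values agree up to floating-point rounding, as equation 2.3 requires.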

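Next, we introduce the following notations,

$$l(r|x) = \ln p(r|x), \tag{2.8}$$
$$L(r|x) = \ln\left(p(r|x)\, p(x)\right), \tag{2.9}$$
$$q(x) = \ln p(x), \tag{2.10}$$

and

$$I_F = \frac{1}{2}\left\langle \ln\left(\det\left(\frac{J(x)}{2\pi e}\right)\right)\right\rangle_x + H(X), \tag{2.11}$$
$$I_G = \frac{1}{2}\left\langle \ln\left(\det\left(\frac{G(x)}{2\pi e}\right)\right)\right\rangle_x + H(X), \tag{2.12}$$

where $\det(\cdot)$ denotes the matrix determinant, and

$$J(x) = \left\langle l'(r|x)\, l'(r|x)^T \right\rangle_{r|x}, \tag{2.13}$$
$$G(x) = J(x) + P(x), \tag{2.14}$$
$$P(x) = -q''(x). \tag{2.15}$$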

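Here $J(x)$ is the FI matrix, which is symmetric and positive-semidefinite, and $'$ and $''$ denote the first and second derivatives with respect to $x$, respectively; that is, $l'(r|x) = \partial l(r|x)/\partial x$ and $q''(x) = \partial^2 \ln p(x)/\partial x \partial x^T$. If $p(r|x)$ is twice differentiable for $x$, then

$$J(x) = \left\langle l'(r|x)\, l'(r|x)^T \right\rangle_{r|x} = -\left\langle l''(r|x) \right\rangle_{r|x}. \tag{2.16}$$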

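We denote the Kullback-Leibler (KL) divergence as

$$D(x \,\|\, \hat{x}) = \int_{\mathcal{R}} p(r|x) \ln \frac{p(r|x)}{p(r|\hat{x})} \, dr, \tag{2.17}$$

and define

$$\mathcal{X}_\omega(x) = \left\{ \breve{x} \in \mathbb{R}^K : (\breve{x} - x)^T G(x) (\breve{x} - x) < N\omega^2 \right\}, \tag{2.18}$$

as the $\omega$ neighborhood of $x$, and its complementary set as

$$\bar{\mathcal{X}}_\omega(x) = \mathcal{X} - \mathcal{X}_\omega(x), \tag{2.19}$$

where $\omega$ is a positive number.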

2.2. Information-Theoretic Asymptotic Bounds and Approximations.

In the large-$N$ limit, Brunel and Nadal (1998) proposed an asymptotic relationship $I \sim I_F$ between MI and FI and gave a proof in the case of one-dimensional input. Another proof is given by Sompolinsky et al. (2001), although there appears to be an error in their proof when a replica trick is used (see equation B1 in their paper; their equation B5 does not follow directly from the replica trick). For large but finite $N$, $I \approx I_F$ is usually a good approximation as long as the inputs are low dimensional. For high-dimensional inputs, the approximation may no longer be valid. For example, suppose $p(r|x)$ is a normal distribution with mean $A^T x$ and covariance matrix $I_N$, and $p(x)$ is a normal distribution with mean $\mu$ and covariance matrix $\Sigma$,

$$p(r|x) = \mathcal{N}(A^T x, I_N), \qquad p(x) = \mathcal{N}(\mu, \Sigma), \tag{2.20}$$

where $A = [a_1, a_2, \ldots, a_N]$ is a deterministic $K \times N$ matrix and $I_N$ is the $N \times N$ identity matrix. The MI $I$ is given by (see Verdu, 1986; Guo, Shamai, & Verdu, 2005, for details)

$$I = \frac{1}{2} \ln\left(\det\left(\Sigma^{1/2} A A^T \Sigma^{1/2} + I_K\right)\right). \tag{2.21}$$

If $\mathrm{rank}(J(x)) < K$, then $I_F = -\infty$. Notice that here, $J(x) = A A^T$. When $a = a_1 = \cdots = a_N$ and $\Sigma = I_K$, then by equation 2.21 and a matrix determinant lemma, we have

$$I = \frac{1}{2} \ln\left(\det\left(N a a^T + I_K\right)\right) = \frac{1}{2} \ln\left(N a^T a + 1\right) \geq 0, \tag{2.22}$$

and by equation 2.11,

$$I_F = \frac{1}{2} \ln\left(\det\left(N a a^T\right)\right) = -\infty \tag{2.23}$$

for $K > 1$, which is obviously incorrect as an approximation to $I$. For high-dimensional inputs, the determinant $\det(J(x))$ may become close to zero in practical applications. When the FI matrix $J(x)$ becomes degenerate, the regularity condition ensuring the Cramér-Rao paradigm of statistics is violated (Amari & Nakahara, 2005), in which case using $I_F$ as a proxy for $I$ incurs large errors.

In the following, we will show that $I_G$ is a better approximation of $I$ for high-dimensional inputs. For instance, for the above example, we can verify that

$$I_G = \frac{1}{2} \ln\left(\det\left(\frac{1}{2\pi e}\left(A A^T + \Sigma^{-1}\right)\right)\right) + \frac{1}{2} \ln\left(\det\left(2\pi e \Sigma\right)\right) = \frac{1}{2} \ln\left(\det\left(\Sigma^{1/2} A A^T \Sigma^{1/2} + I_K\right)\right) = I, \tag{2.24}$$

which is exactly equal to the MI I given in equation 2.21.
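The degenerate gaussian example can be checked numerically. The sketch below assumes $K = 2$, a standard normal prior ($\mu = 0$, $\Sigma = I_K$), and identical columns $a$ in $A$; the function name and the numerical values are illustrative assumptions. It evaluates the exact MI of equation 2.22 together with $I_F$ and $I_G$ from equations 2.11 and 2.12.

```python
import math

def det2(m):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def gaussian_example(a, n_neurons):
    """Return (I, I_F, I_G) for p(r|x) = N(A^T x, I_N), p(x) = N(0, I_2), all columns of A equal to a."""
    k = 2
    # J(x) = A A^T = N a a^T, which has rank one, so det J = 0.
    j = [[n_neurons * a[p] * a[q] for q in range(k)] for p in range(k)]
    # G(x) = J(x) + P(x), with P(x) = Sigma^{-1} = I_K for the standard normal prior.
    g = [[j[p][q] + (1.0 if p == q else 0.0) for q in range(k)] for p in range(k)]
    log_norm = k * math.log(2 * math.pi * math.e)   # ln det(2 pi e I_K)
    h_x = 0.5 * log_norm                            # entropy of N(0, I_K)
    mi_exact = 0.5 * math.log(n_neurons * (a[0] ** 2 + a[1] ** 2) + 1.0)   # equation 2.22
    det_j = det2(j)
    # A small tolerance guards against rounding when det J is zero in exact arithmetic.
    i_f = 0.5 * (math.log(det_j) - log_norm) + h_x if det_j > 1e-12 else -math.inf
    i_g = 0.5 * (math.log(det2(g)) - log_norm) + h_x
    return mi_exact, i_f, i_g

mi_exact, i_f, i_g = gaussian_example((1.0, 0.5), 10)
print(f"I = {mi_exact:.6f}, I_F = {i_f}, I_G = {i_g:.6f}")
```

In this setting $I_G$ coincides with the exact MI, whereas $I_F$ diverges because the FI matrix is singular.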

2.2.1. Regularity Conditions.

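First, we consider the following regularity conditions for $p(x)$ and $p(r|x)$:

C1: $p(x)$ and $p(r|x)$ are twice continuously differentiable for almost every $x \in \mathcal{X}$, where $\mathcal{X}$ is a convex set; $G(x)$ is positive definite, and $\|G^{-1}(x)\| = O(N^{-1})$, where $\|\cdot\|$ denotes the Frobenius norm of a matrix. The following conditions hold:

$$\|q'(x)\| < \infty, \tag{2.25a}$$
$$\|q''(x)\| < \infty, \tag{2.25b}$$
$$\left\langle \left(N^{-1} l'(r|x)^T l'(r|x)\right)^2 \right\rangle_{r|x} = O(1), \tag{2.25c}$$
$$\left\langle \left\| N^{-1}\left(l''(r|x) - \langle l''(r|x) \rangle_{r|x}\right) \right\|^2 \right\rangle_{r|x} = O(N^{-1}), \tag{2.25d}$$

and there exists an $\omega = \omega(x) > 0$ for $\breve{x} \in \mathcal{X}_\omega(x)$ such that

$$N^{-1} \left\| l''(r|\breve{x}) - l''(r|x) \right\| = O(1), \tag{2.25e}$$

where $O$ indicates the big-O notation.

C2: The following condition is satisfied,

$$\left\langle \left\| N^{-1}\left(l''(r|x) - \langle l''(r|x) \rangle_{r|x}\right) \right\|^{2(m+1)} \right\rangle_{r|x} = O(N^{-1}), \tag{2.26a}$$

for $m \in \mathbb{N}$, and there exists $\eta > 1$ such that

$$P_{r|x}\left\{ \det(G(x))^{1/2} \int_{\bar{\mathcal{X}}_{\hat{\omega}}(x)} p(\hat{x}|r) \, d\hat{x} > \epsilon\, p(x|r) \right\} = O(N^{-\eta}) \tag{2.26b}$$

for all $\epsilon \in (0, 1/2)$, $\hat{\omega} \in (0, \omega)$, and $x \in \mathcal{X}$ with $p(x) > 0$, where $P_{r|x}\{\cdot\}$ denotes the probability of $r$ given $x$.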

The regularity conditions C1 and C2 are needed to prove theorems in later sections. They are expressed in mathematical forms that are convenient for our proofs, although their meanings may seem opaque at first glance. In the following, we will examine these conditions more closely. We will use specific examples to make interpretations of these conditions more transparent.

Remark 1. In this article, we assume that the probability distributions $p(x)$ and $p(r|x)$ are piecewise twice continuously differentiable. This is because we need to use Fisher information to approximate mutual information, and Fisher information requires derivatives that make sense only for continuous variables. Therefore, the methods developed in this article apply only to continuous input variables or stimulus variables. For discrete input variables, we need alternative methods for approximating MI, which we will address in a separate publication.

Conditions 2.25a and 2.25b state that the first and the second derivatives of $q(x) = \ln p(x)$ have finite values for any given $x \in \mathcal{X}$. These two conditions are easily satisfied by commonly encountered probability distributions because they only require the derivatives to be finite at each point of $\mathcal{X}$, the set of allowable inputs; the derivatives do not need to be uniformly bounded.

Remark 2. Conditions 2.25c to 2.26a constrain how the first and the second derivatives of $l(r|x) = \ln p(r|x)$ scale with $N$, the number of neurons. These conditions are easily met when $p(r|x)$ is conditionally independent or when the noises of different neurons are independent, that is, $p(r|x) = \prod_{n=1}^{N} p(r_n|x)$.

We emphasize that it is possible to satisfy these conditions even when $p(r|x)$ is not independent or when the noises are correlated, as we show later. Here we first examine these conditions closely, assuming independence. For simplicity, our demonstration that follows is based on a one-dimensional input variable ($K = 1$). The conclusions are readily generalizable to higher-dimensional inputs ($K > 1$) because $K$ is fixed and does not affect the scaling with $N$.

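Assuming independence, we have $l(r|x) = \sum_{n=1}^{N} l(r_n|x)$ with $l(r_n|x) = \ln p(r_n|x)$, and the left-hand side of equation 2.25c becomes

$$\begin{aligned} N^{-2} \left\langle l'(r|x)^4 \right\rangle_{r|x} &= N^{-2} \sum_{n_1, \ldots, n_4 = 1}^{N} \left\langle l'(r_{n_1}|x)\, l'(r_{n_2}|x)\, l'(r_{n_3}|x)\, l'(r_{n_4}|x) \right\rangle_{r_{n_1}, \ldots, r_{n_4}|x} \\ &= N^{-2} \left( 3 \sum_{n \neq m} \left\langle l'(r_n|x)^2 \right\rangle_{r_n|x} \left\langle l'(r_m|x)^2 \right\rangle_{r_m|x} + \sum_{n=1}^{N} \left\langle l'(r_n|x)^4 \right\rangle_{r_n|x} \right), \end{aligned} \tag{2.27}$$

where the final result contains only two kinds of terms, those whose indices match in pairs (the factor 3 counts the three ways of pairing four indices) and those whose four indices all coincide, while all other terms in the expansion vanish because any unmatched or lone index $k$ (from $n_1, n_2, n_3, n_4$) should yield a vanishing average:

$$\left\langle l'(r_k|x) \right\rangle_{r_k|x} = \int_{\mathcal{R}} p(r_k|x)\, l'(r_k|x) \, dr_k = \frac{\partial}{\partial x} \left( \int_{\mathcal{R}} p(r_k|x) \, dr_k \right) = 0. \tag{2.28}$$

Thus, condition 2.25c is satisfied as long as $\langle l'(r_n|x)^2 \rangle_{r_n|x}$ and $\langle l'(r_n|x)^4 \rangle_{r_n|x}$ are bounded by some finite numbers, say, $a$ and $b$, respectively, because now equation 2.27 is bounded by $N^{-2}(3a^2 N(N-1) + bN) = O(1)$. For instance, a gaussian distribution always meets this requirement because the averages of the second and fourth powers are proportional to the second and fourth moments, which are both finite. Note that the argument above works even if $\langle l'(r_n|x)^4 \rangle_{r_n|x}$ is not finitely bounded but scales as $O(N)$.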

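Similarly, under the assumption of independence, the left-hand side of equation 2.25d becomes

$$\begin{aligned} N^{-2} \left\langle \left( l''(r|x) - \langle l''(r|x) \rangle_{r|x} \right)^2 \right\rangle_{r|x} &= N^{-2} \sum_{n,m=1}^{N} \left\langle \left( l''(r_n|x) - \langle l''(r_n|x) \rangle_{r_n|x} \right) \left( l''(r_m|x) - \langle l''(r_m|x) \rangle_{r_m|x} \right) \right\rangle_{r_n, r_m|x} \\ &= N^{-2} \sum_{n=1}^{N} \left\langle \left( l''(r_n|x) - \langle l''(r_n|x) \rangle_{r_n|x} \right)^2 \right\rangle_{r_n|x} \\ &= N^{-2} \sum_{n=1}^{N} \left( \left\langle l''(r_n|x)^2 \right\rangle_{r_n|x} - \left\langle l''(r_n|x) \right\rangle_{r_n|x}^2 \right), \end{aligned} \tag{2.29}$$

where, in the second step, the only remaining terms are the squares, while all other terms in the expansion with $n \neq m$ have vanished because $\langle l''(r_n|x) - \langle l''(r_n|x) \rangle_{r_n|x} \rangle_{r_n|x} = 0$. Thus, condition 2.25d is satisfied as long as $\langle l''(r_n|x) \rangle_{r_n|x}$ and $\langle l''(r_n|x)^2 \rangle_{r_n|x}$ are bounded so that equation 2.29 scales as $N^{-2} N = N^{-1}$.

Condition 2.25e is easily satisfied under the assumption of independence. It is easy to show that this condition holds when $l''(r_n|x)$ is bounded.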

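Condition 2.26a can be examined with arguments similar to those used for equations 2.27 and 2.29. Assuming independence, we rewrite the left-hand side of equation 2.26a as

$$\begin{aligned} N^{-z} \left\langle \left( l''(r|x) - \langle l''(r|x) \rangle_{r|x} \right)^z \right\rangle_{r|x} &= N^{-z} \sum_{n_1, \ldots, n_z = 1}^{N} \left\langle \left( l''(r_{n_1}|x) - \langle l''(r_{n_1}|x) \rangle_{r_{n_1}|x} \right) \cdots \left( l''(r_{n_z}|x) - \langle l''(r_{n_z}|x) \rangle_{r_{n_z}|x} \right) \right\rangle_{r_{n_1}, \ldots, r_{n_z}|x} \\ &= N^{-z} \sum_{n_1, \ldots, n_{m+1} = 1}^{N} \prod_{i=1}^{m+1} \left\langle \left( l''(r_{n_i}|x) - \langle l''(r_{n_i}|x) \rangle_{r_{n_i}|x} \right)^2 \right\rangle_{r_{n_i}|x} + \cdots, \end{aligned} \tag{2.30}$$

where $z = 2(m+1) \geq 4$ is an even number. Any term in the expansion with an unmatched index $n_k$ should vanish, as in the cases of equations 2.27 and 2.29. When $\langle l''(r_n|x) \rangle_{r_n|x}$ and $\langle l''(r_n|x)^2 \rangle_{r_n|x}$ are bounded, the leading term with respect to scaling with $N$ is the product of squares (up to a combinatorial factor independent of $N$), as shown at the end of equation 2.30, because all the other nonvanishing terms increase more slowly with $N$. Thus equation 2.30 should scale as $N^{-z} N^{m+1} = N^{-m-1}$, which trivially satisfies condition 2.26a.

In summary, conditions 2.25c to 2.26a are easy to meet when $p(r|x)$ is independent. It is sufficient to satisfy these conditions when the averages of the first and second derivatives of $l(r|x) = \ln p(r|x)$, as well as the averages of their powers, are bounded by finite numbers for all the neurons.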
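The $O(1)$ scaling required by condition 2.25c can be illustrated by simulation. The sketch below assumes independent unit-variance gaussian neurons with $\partial g_n/\partial x = 1$ (a hypothetical choice), so that $l'(r|x)$ is a sum of $N$ standard normal deviates and $N^{-2}\langle l'(r|x)^4 \rangle_{r|x}$ equals 3 for every $N$.

```python
import random

def scaled_fourth_moment(n_neurons, n_trials, rng):
    """Monte Carlo estimate of N^-2 <l'(r|x)^4> for unit-variance gaussian neurons with dg_n/dx = 1."""
    total = 0.0
    for _ in range(n_trials):
        # l'(r|x) = sum_n (r_n - g_n) * dg_n/dx; with dg_n/dx = 1 this is a sum of standard normals.
        score = sum(rng.gauss(0.0, 1.0) for _ in range(n_neurons))
        total += score ** 4
    return total / (n_trials * n_neurons ** 2)

rng = random.Random(1)
for n in (10, 100, 400):
    print(n, round(scaled_fourth_moment(n, 2000, rng), 2))
```

The printed estimates fluctuate around the same constant as $N$ grows, rather than increasing with $N$, which is the behavior condition 2.25c asks for.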

Remark 3. For neurons with correlated noises, if there exists an invertible transformation that maps $r$ to $\tilde{r}$ such that $p(\tilde{r}|x)$ becomes conditionally independent, then conditions C1 and C2 are easily met in the space of the new variables by the discussion in remark 2. This situation is best illustrated by the familiar example of a population of neurons with correlated noises that obey a multivariate gaussian distribution:

$$p(r|x) = \frac{1}{\sqrt{\det(2\pi\Sigma)}} \exp\left( -\frac{1}{2} (r - g)^T \Sigma^{-1} (r - g) \right), \tag{2.31}$$

where $\Sigma$ is an $N \times N$ invertible covariance matrix, and $g = (g_1(x; \theta_1), \ldots, g_N(x; \theta_N))^T$ describes the mean responses, with $\theta_n$ being the parameter vector. Using the following transformation,

$$\tilde{r} = \Sigma^{-1/2} r = (\tilde{r}_1, \tilde{r}_2, \ldots, \tilde{r}_N)^T, \tag{2.32}$$
$$\tilde{g} = \Sigma^{-1/2} g = (\tilde{g}_1, \tilde{g}_2, \ldots, \tilde{g}_N)^T, \tag{2.33}$$

we obtain the independent distribution:

$$p(\tilde{r}|x) = \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} (\tilde{r}_n - \tilde{g}_n)^2 \right). \tag{2.34}$$

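In the special case when the correlation coefficient between any pair of neurons is a constant $c$, $-1 < c < 1$, the noise covariance can be written as

$$\Sigma = a\left( (1 - c) I_N + c\, u u^T \right), \tag{2.35}$$

where $a > 0$ is a constant, $I_N$ is the $N \times N$ identity matrix, and $u = (1, 1, \ldots, 1)^T \in \mathbb{R}^{N \times 1}$. The desired transformation in equations 2.32 and 2.33 is given explicitly by

$$\Sigma^{-1/2} = b_0 \left( I_N - b_1 u u^T \right), \tag{2.36}$$

where

$$b_0 = \frac{1}{\sqrt{a(1 - c)}}, \qquad b_1 = \frac{1}{N} \left( 1 \pm \sqrt{\frac{1 - c}{(N - 1)c + 1}} \right). \tag{2.37}$$

The new response variables defined in equations 2.32 and 2.33 now read:

$$\tilde{r}_n = b_0 \left( r_n - b_1 \sum_{m=1}^{N} r_m \right), \tag{2.38}$$
$$\tilde{g}_n = b_0 \left( g_n - b_1 \sum_{m=1}^{N} g_m \right). \tag{2.39}$$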

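Now we have the derivatives:

$$l'(\tilde{r}_n|x) = (\tilde{r}_n - \tilde{g}_n) \frac{\partial \tilde{g}_n}{\partial x}, \tag{2.40}$$
$$l''(\tilde{r}_n|x) - \left\langle l''(\tilde{r}_n|x) \right\rangle_{\tilde{r}_n|x} = (\tilde{r}_n - \tilde{g}_n) \frac{\partial^2 \tilde{g}_n}{\partial x^2}, \tag{2.41}$$

where $\partial \tilde{g}_n/\partial x$ and $\partial^2 \tilde{g}_n/\partial x^2$ are finite as long as $\partial g_n/\partial x$ and $\partial^2 g_n/\partial x^2$ are finite. Conditions C1 and C2 are satisfied when the derivatives and their powers are finitely bounded as shown before.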
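The explicit whitening transformation for constant pairwise correlation can be verified numerically. The sketch below builds $\Sigma$ from equation 2.35 and $\Sigma^{-1/2}$ from equations 2.36 and 2.37 (taking the minus sign of the $\pm$ choice) and checks that $\Sigma^{-1/2} \Sigma\, \Sigma^{-1/2}$ is the identity. The values of $a$, $c$, and $N$ and the helper names are arbitrary choices for illustration.

```python
import math

def whitening_coefficients(a, c, n):
    """Coefficients b0 and b1 of equation 2.37, using the minus sign of the +/- choice."""
    b0 = 1.0 / math.sqrt(a * (1.0 - c))
    b1 = (1.0 - math.sqrt((1.0 - c) / ((n - 1) * c + 1.0))) / n
    return b0, b1

def matmul(x, y):
    """Product of two square matrices stored as nested lists."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

a, c, n = 2.0, 0.3, 6
b0, b1 = whitening_coefficients(a, c, n)
# Sigma = a((1 - c) I + c u u^T), equation 2.35.
sigma = [[a * ((1.0 - c) * (i == j) + c) for j in range(n)] for i in range(n)]
# W = Sigma^{-1/2} = b0 (I - b1 u u^T), equation 2.36.
w = [[b0 * ((i == j) - b1) for j in range(n)] for i in range(n)]
# W Sigma W should be the identity, i.e., the transformed noise is uncorrelated.
whitened = matmul(matmul(w, sigma), w)
max_err = max(abs(whitened[i][j] - (i == j)) for i in range(n) for j in range(n))
print(f"max |W Sigma W - I| = {max_err:.2e}")
```

Either sign in equation 2.37 gives a valid matrix square root of $\Sigma^{-1}$; the minus sign yields the symmetric positive-definite one.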

The example above shows explicitly that it is possible to meet conditions C1 and C2 even when the noises of different neurons are correlated. More generally, if a nonlinear transformation exists that maps correlated random variables into independent variables, then by similar argument, conditions C1 and C2 are satisfied when the derivatives of the log likelihood functions and their powers in the new variables are finitely bounded. Even when the desired transformation does not exist or is unknown, it does not necessarily imply that conditions C1 and C2 must be violated.

While the exact mathematical conditions for the existence of the desired transformation are unclear, let us consider a specific example. If a joint probability density function can be morphed smoothly and reversibly into a flat or constant density in a cube (hypercube), which is a special case of an independent distribution, then this morphing is the desired transformation. Here we may replace the flat distribution by any known independent distribution and the argument above should still work. So the desired transformation may exist under rather general conditions.

For correlated random variables, one may use algorithms such as independent component analysis to find an invertible linear mapping that makes the new random variables as independent as possible (Bell & Sejnowski, 1997) or use neural networks to find related nonlinear mappings (Huang & Zhang, 2017). These methods do not directly apply to the problem of testing conditions C1 and C2 because they work for a given network size N and further development is needed to address the scaling behavior in the large network limit N → ∞.

Finally, we note that the value of the MI of the transformed independent variables is the same as the MI of the original correlated variables because of the invariance of MI under invertible transformation of marginal variables. A related discussion is in theorem 3, which involves a transformation of the input variables rather than a transformation of the output variables as needed here.

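Remark 4. Condition 2.26b is satisfied if a positive number $\delta$ and a positive integer $m$ exist such that

$$\det(G(x))^{1/2} \int_{\bar{\mathcal{X}}_{\hat{\omega}}(x)} \int_{\mathcal{B}_{m,\delta}(x)} p(r|\hat{x})\, p(\hat{x}) \, dr \, d\hat{x} = O(N^{-\eta}) \tag{2.42}$$

for all $\hat{x} \in \bar{\mathcal{X}}_{\hat{\omega}}(x)$, where

$$\mathcal{B}_{m,\delta}(x) = \left\{ r \in \mathcal{R} : -\delta N^{\frac{\eta - 1}{2m}} G(x) < l''(r|x) - \langle l''(r|x) \rangle_{r|x} < \delta N^{\frac{\eta - 1}{2m}} G(x) \right\}, \tag{2.43}$$

and $A < B$ means that the matrix $A - B$ is negative definite. A proof is as follows.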

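First note that in equation 2.43, if $\eta \to 1$ or $m \to \infty$, then $N^{\frac{\eta - 1}{2m}} \to 1$. Following Markov’s inequality, condition C2 and equation A.19 in the appendix, for the complementary set of $\mathcal{B}_{m,\delta}(x)$, $\bar{\mathcal{B}}_{m,\delta}(x)$, we have

$$P_{r|x}\left\{ \bar{\mathcal{B}}_{m,\delta}(x) \right\} \leq P_{r|x}\left\{ \|B_0\|^2 \geq \delta^2 N^{\frac{\eta - 1}{m}} \right\} \leq \delta^{-2m} N^{-(\eta - 1)} \left\langle \|B_0\|^{2m} \right\rangle_{r|x} = O(N^{-\eta}), \tag{2.44}$$

where

$$B_0 = G^{-1/2}(x) \left( l''(r|x) - \langle l''(r|x) \rangle_{r|x} \right) G^{-1/2}(x). \tag{2.45}$$

Define the set

$$\mathcal{A}_{\hat{\omega}}(x) = \left\{ r \in \mathcal{R} : \int_{\bar{\mathcal{X}}_{\hat{\omega}}(x)} \frac{p(\hat{x}|r)}{p(x|r)} \, d\hat{x} > \det(G(x))^{-1/2} \epsilon \right\}. \tag{2.46}$$

Then it follows from Markov’s inequality and equation 2.42 that

$$P_{r|x}\left\{ \mathcal{A}_{\hat{\omega}}(x) \cap \mathcal{B}_{m,\delta}(x) \right\} \leq \epsilon^{-1} \det(G(x))^{1/2} \int_{\mathcal{B}_{m,\delta}(x)} \int_{\bar{\mathcal{X}}_{\hat{\omega}}(x)} \frac{p(r|\hat{x})\, p(\hat{x})}{p(x)} \, d\hat{x} \, dr = O(N^{-\eta}). \tag{2.47}$$

Hence, we get

$$P_{r|x}\left\{ \mathcal{A}_{\hat{\omega}}(x) \right\} \leq P_{r|x}\left\{ \mathcal{A}_{\hat{\omega}}(x) \cap \mathcal{B}_{m,\delta}(x) \right\} + P_{r|x}\left\{ \bar{\mathcal{B}}_{m,\delta}(x) \right\} = O(N^{-\eta}),$$

which yields condition 2.26b.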

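Condition 2.42 is satisfied if there exists a positive number $\varsigma$ such that

$$\ln \frac{p(r|x)}{p(r|\hat{x})} \geq N\varsigma \tag{2.48}$$

for all $\hat{x} \in \bar{\mathcal{X}}_{\hat{\omega}}(x)$ and $r \in \mathcal{B}_{m,\delta}(x)$. This is because

$$\begin{aligned} \det(G(x))^{1/2} \int_{\bar{\mathcal{X}}_{\hat{\omega}}(x)} \int_{\mathcal{B}_{m,\delta}(x)} p(r|\hat{x})\, p(\hat{x}) \, dr \, d\hat{x} &= \det(G(x))^{1/2} \int_{\bar{\mathcal{X}}_{\hat{\omega}}(x)} p(\hat{x}) \int_{\mathcal{B}_{m,\delta}(x)} p(r|x) \exp\left( -\ln \frac{p(r|x)}{p(r|\hat{x})} \right) dr \, d\hat{x} \\ &\leq \det(G(x))^{1/2} \exp(-N\varsigma) = O\left( N^{K/2} e^{-N\varsigma} \right). \end{aligned} \tag{2.49}$$

Here notice that $\det(G(x))^{1/2} = O(N^{K/2})$ (see equation A.23).

Inequality 2.48 holds if $p(r|x)$ is conditionally independent, namely, $p(r|x) = \prod_{n=1}^{N} p(r_n|x)$, with

$$\ln \frac{p(r_n|x)}{p(r_n|\hat{x})} \geq \varsigma, \qquad n = 1, 2, \ldots, N, \tag{2.50}$$

for all $\hat{x} \in \bar{\mathcal{X}}_{\hat{\omega}}(x)$ and $r \in \mathcal{B}_{m,\delta}(x)$. Consider the inequality $\left\langle \ln \frac{p(r_n|x)}{p(r_n|\hat{x})} \right\rangle_{r_n|x} \geq 0$, where the equality holds when $x = \hat{x}$. If there is only one extreme point at $\hat{x} = x$ for $\hat{x} \in \mathcal{X}_\omega(x)$, then generally it is easy to find a set $\mathcal{B}_{m,\delta}(x)$ that satisfies equation 2.50, so that equation 2.26b holds.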

2.2.2. Asymptotic Bounds and Approximations for Mutual Information.

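Let

$$\xi = N^{-1} \left\langle \left\| \left( l''(r|x) - \langle l''(r|x) \rangle_{r|x} \right) G^{-1}(x)\, l'(r|x) \right\|^2 \right\rangle_{r|x}, \tag{2.51}$$

and it follows from conditions C1 and C2 that

$$\xi \leq \left\| N G^{-1}(x) \right\|^2 \left\langle \left\| N^{-1} \left( l''(r|x) - \langle l''(r|x) \rangle_{r|x} \right) \right\|^4 \right\rangle_{r|x}^{1/2} \times \left\langle \left( N^{-1} l'(r|x)^T l'(r|x) \right)^2 \right\rangle_{r|x}^{1/2} = O\left( N^{-1/2} \right). \tag{2.52}$$

Moreover, if $p(r|x)$ is conditionally independent, then by an argument similar to the discussion in remark 2, we can verify that the condition $\xi = O(N^{-1})$ is easily met.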

In the following we state several conclusions about the MI; their proofs are given in the appendix.

Lemma 1. If condition C1 holds, then the MI $I$ has an asymptotic upper bound for integer $N$,

$$I \leq I_G + O(N^{-1}). \tag{2.53}$$

Moreover, if equations 2.25c and 2.25d are replaced by

$$\left\langle \left| N^{-1} l'(r|x)^T l'(r|x) \right|^{1+\tau} \right\rangle_{r|x} = O(1), \tag{2.54a}$$
$$\left\langle \left\| N^{-1} \left( l''(r|x) - \langle l''(r|x) \rangle_{r|x} \right) \right\|^2 \right\rangle_{r|x} = o(1), \tag{2.54b}$$

for some $\tau \in (0, 1)$, where $o$ indicates the little-o notation, then the MI has the following asymptotic upper bound for integer $N$:

$$I \leq I_G + o(1). \tag{2.55}$$

Lemma 2. If conditions C1 and C2 hold and $\xi = O(N^{-1})$, then the MI has an asymptotic lower bound for integer $N$,

$$I \geq I_G + O(N^{-1}). \tag{2.56}$$

Moreover, if condition C1 holds but equations 2.25c and 2.25d are replaced by 2.54a and 2.54b, and inequality 2.26b in C2 also holds for $\eta > 0$, then the MI has the following asymptotic lower bound for integer $N$:

$$I \geq I_G + o(1). \tag{2.57}$$

Theorem 1. If conditions C1 and C2 hold and $\xi = O(N^{-1})$, then the MI has the following asymptotic equality for integer $N$:

$$I = I_G + O(N^{-1}). \tag{2.58}$$

For more relaxed conditions, suppose condition C1 holds but equations 2.25c and 2.25d are replaced by 2.54a and 2.54b, and inequality 2.26b in C2 also holds for $\eta > 0$; then the MI has an asymptotic equality for integer $N$:

$$I = I_G + o(1). \tag{2.59}$$

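Theorem 2. Suppose $J(x)$ and $G(x)$ are symmetric and positive-definite. Let

$$\varsigma = \left\langle \mathrm{Tr}(\Psi(x)) \right\rangle_x, \tag{2.60}$$
$$\Psi(x) = J^{-1/2}(x)\, P(x)\, J^{-1/2}(x). \tag{2.61}$$

Then

$$I_G \leq I_F + \frac{\varsigma}{2}, \tag{2.62}$$

where $\mathrm{Tr}(\cdot)$ indicates the matrix trace; moreover, if $P(x)$ is positive-semidefinite, then

$$0 \leq I_G - I_F \leq \frac{\varsigma}{2}. \tag{2.63}$$

Furthermore, if

$$\varsigma_1 = \left\langle \|\Psi(x)\| \right\rangle_x = O(N^{-\beta}) \tag{2.64}$$

for some $\beta > 0$, then

$$I_G = I_F + O(N^{-\beta}). \tag{2.65}$$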
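For a scalar input ($K = 1$), the bounds of theorem 2 reduce to the elementary inequality $0 \leq \ln(1 + u) \leq u$ for $u \geq 0$, because $I_G - I_F = \frac{1}{2} \langle \ln(1 + P(x)/J(x)) \rangle_x$ and $\varsigma = \langle P(x)/J(x) \rangle_x$. The sketch below checks this on a grid. The choices $J(x) = N(1 + x^2)$ and $P(x) = 1$ (a standard normal prior) are hypothetical and serve only as an illustration.

```python
import math

def theorem2_gap(j_values, p_values, weights):
    """Return (I_G - I_F, sigma / 2) for scalar J(x) and P(x) on a grid with probability weights."""
    gap = 0.5 * sum(w * math.log(1.0 + p / j) for j, p, w in zip(j_values, p_values, weights))
    half_sigma = 0.5 * sum(w * p / j for j, p, w in zip(j_values, p_values, weights))
    return gap, half_sigma

# Gaussian weights on a uniform grid stand in for p(x) dx.
xs = [-4.0 + 8.0 * i / 400 for i in range(401)]
raw = [math.exp(-0.5 * x * x) for x in xs]
total = sum(raw)
weights = [v / total for v in raw]

results = {}
for n in (10, 100, 1000):
    j_values = [n * (1.0 + x * x) for x in xs]      # hypothetical Fisher information
    results[n] = theorem2_gap(j_values, [1.0] * len(xs), weights)
    print(n, results[n])
```

Both quantities shrink roughly as $1/N$ here, which corresponds to $\beta = 1$ in equation 2.64.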

Remark 5. In general, we need only to assume that $p(x)$ and $p(r|x)$ are piecewise twice continuously differentiable for $x \in \mathcal{X}$. In this case, lemmas 1 and 2 and theorem 1 can still be established. For more general cases, such as discrete or continuous inputs, we have also derived a general approximation formula for MI from which we can easily derive the formula for $I_G$ (this will be discussed in a separate paper).

2.3. Approximations of Mutual Information in Neural Populations with Finite Size.

In the preceding section, we provided several bounds, including both lower and upper bounds, and asymptotic relationships for the true MI in the large N (network size) limit. Now, we discuss effective approximations to the true MI in the case of finite N. Here we consider only the case of continuous inputs (we will discuss the case of discrete inputs in another paper).

Theorem 1 tells us that under suitable conditions, we can use $I_G$ to approximate $I$ for a large but finite $N$ (e.g., $N \gg K$), that is,

$$I \approx I_G. \tag{2.66}$$

Moreover, by theorem 2, we know that if $\varsigma \approx 0$ with positive-semidefinite $P(x)$ or $\varsigma_1 \approx 0$ holds (see equations 2.60 and 2.64), then by equations 2.63, 2.65, and 2.66, we have

$$I \approx I_G \approx I_F. \tag{2.67}$$

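Define

$$\tilde{G}(x) = J(x) + P(x) + Q(x), \tag{2.68}$$
$$\tilde{I}_G = \frac{1}{2} \left\langle \ln\left( \det\left( \frac{\tilde{G}(x)}{2\pi e} \right) \right) \right\rangle_x + H(X), \tag{2.69}$$

where $\tilde{G}(x)$ is positive-definite and $Q(x)$ is a symmetric matrix depending on $x$ with $\|Q(x)\| = O(1)$. Suppose $\|\tilde{G}^{-1}(x)\| = O(N^{-1})$. If we replace $I_G$ by $\tilde{I}_G$ in theorem 1, then we can prove equations 2.58 and 2.59 in a manner similar to the proof of that theorem. Consider the special case where $\|P(x)\| \to 0$, $\det(J(x)) = O(1)$ (e.g., $\mathrm{rank}(J(x)) < K$), and $\|G^{-1}(x)\| \neq O(N^{-1})$; then we can no longer use the asymptotic formulas in theorem 1. However, if we substitute $\tilde{G}(x)$ for $G(x)$ by choosing an appropriate $Q(x)$ such that $\tilde{G}(x)$ is positive-definite and $\|\tilde{G}^{-1}(x)\| = O(N^{-1})$, then we can use equations 2.58 and 2.59 as the asymptotic formulas.

If we assume $G(x)$ and $\tilde{G}(x)$ are positive-definite and

$$\zeta = \left\langle \left\| Q(x)\, \tilde{G}^{-1}(x) \right\| \right\rangle_x = O(N^{-\beta}), \qquad \beta > 0, \tag{2.70}$$

then similar to the proof of theorem 2, we have

$$\left\langle \ln(\det(G(x))) \right\rangle_x = \left\langle \ln(\det(\tilde{G}(x))) \right\rangle_x + \left\langle \ln\left( \det\left( I_K - Q(x)\, \tilde{G}^{-1}(x) \right) \right) \right\rangle_x = \left\langle \ln(\det(\tilde{G}(x))) \right\rangle_x + O(N^{-\beta}) \tag{2.71}$$

and

$$\tilde{I}_G = I_G + O(N^{-\beta}).$$

For large $N$, we usually have $\tilde{I}_G \approx I_G$.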

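It is more convenient to redefine the following quantities:

$$Q(x) = P_+ - P(x), \tag{2.72}$$
$$P_+ = \left\langle \frac{\partial \ln p(x)}{\partial x} \frac{\partial \ln p(x)}{\partial x^T} \right\rangle_x, \tag{2.73}$$
$$G_+(x) = \tilde{G}(x) = J(x) + P_+, \tag{2.74}$$

and

$$I_{G_+} = \tilde{I}_G = \frac{1}{2} \left\langle \ln\left( \det\left( \frac{G_+(x)}{2\pi e} \right) \right) \right\rangle_x + H(X). \tag{2.75}$$

Notice that if $p(x)$ is twice differentiable for $x$ and

$$\int_{\mathcal{X}} \frac{\partial^2 p(x)}{\partial x \partial x^T} \, dx = 0, \tag{2.76}$$

then

$$P_+ = \left\langle P(x) \right\rangle_x = \left\langle \frac{1}{p(x)} \frac{\partial^2 p(x)}{\partial x \partial x^T} \right\rangle_x - \left\langle \frac{\partial^2 \ln p(x)}{\partial x \partial x^T} \right\rangle_x. \tag{2.77}$$

For example, if $p(x)$ is a normal distribution, $p(x) = \mathcal{N}(\mu, \Sigma)$, then

$$P(x) = P_+ = \Sigma^{-1}. \tag{2.78}$$

Similar to the proof of theorem 2, we can prove that

$$0 \leq I_{G_+} - I_F \leq \frac{\varsigma_+}{2}, \tag{2.79}$$

where

$$\varsigma_+ = \left\langle \mathrm{Tr}\left( P_+ J^{-1}(x) \right) \right\rangle_x. \tag{2.80}$$

We find that $I_G$ is often a good approximation of MI $I$ even for relatively small $N$. However, we cannot guarantee that $P(x)$ is always positive-semidefinite in equation 2.14, and as a consequence, it may happen that $\det(G(x))$ is very small for small $N$, $G(x)$ is not positive-definite, and $\ln(\det(G(x)))$ is not a real number. In this case, $I_G$ is not a good approximation to $I$ but $I_{G_+}$ is still a good approximation. Generally, if $P(x)$ is always positive-semidefinite, then $I_G$ or $I_{G_+}$ is a better approximation than $I_F$, especially when $p(x)$ is close to a normal distribution.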

In the following, we give an example of 1D inputs. High-dimensional inputs are discussed in section 4.1.

2.3.1. A Numerical Comparison for 1D Stimuli.

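Consider the Poisson neuron model (see equation 5.7 in section 5.1 for details), in which the tuning curve of the $n$th neuron, $f(x; \theta_n)$, takes the form of a circular normal or von Mises distribution,

$$f(x; \theta_n) = A \exp\left( -\left( \frac{T}{2\pi\sigma_f} \right)^2 \left( 1 - \cos\left( \frac{2\pi}{T} (x - \theta_n) \right) \right) \right), \tag{2.81}$$

where $x \in [-T/2, T/2)$, $\theta_n \in [-T_\theta/2, T_\theta/2]$, $n \in \{1, 2, \ldots, N\}$, with $T = \pi$, $T_\theta = 1$, $\sigma_f = 0.5$, and $A = 20$, and the centers $\theta_1, \theta_2, \ldots, \theta_N$ of the $N$ neurons are uniformly spaced on the interval $[-T_\theta/2, T_\theta/2]$, that is, $\theta_n = (n - 1) d_\theta - T_\theta/2$, with $d_\theta = T_\theta/(N - 1)$ and $N \geq 2$. Suppose the distribution $p(x)$ of the 1D continuous input $x$ ($K = 1$) has the form

$$p(x) = Z^{-1} \exp\left( -\left( \frac{T}{2\pi\sigma_p} \right)^2 \left( 1 - \cos\left( \frac{2\pi}{T} x \right) \right) \right), \tag{2.82}$$

where $\sigma_p$ is a constant set to $\pi/4$ and $Z$ is the normalization constant. Figure 1A shows graphs of the input distribution $p(x)$ and the tuning curves $f(x; \theta)$ with different centers $\theta = -\pi/4, 0, \pi/4$.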
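The three approximations can be evaluated for this one-dimensional model by numerical quadrature. The sketch below is a minimal illustration under stated assumptions, not the code used for Figure 1. It assumes independent Poisson spike counts with means $f(x; \theta_n)$, for which $J(x) = \sum_n f'(x; \theta_n)^2 / f(x; \theta_n)$; equation 5.7 is not reproduced in this section, so this form is an assumption. The function names and the midpoint grid are likewise illustrative choices.

```python
import math

T, T_THETA, SIGMA_F, AMP, SIGMA_P = math.pi, 1.0, 0.5, 20.0, math.pi / 4
KAPPA_F = (T / (2 * math.pi * SIGMA_F)) ** 2
KAPPA_P = (T / (2 * math.pi * SIGMA_P)) ** 2
W = 2 * math.pi / T

def tuning(x, theta):
    """Tuning curve of equation 2.81."""
    return AMP * math.exp(-KAPPA_F * (1.0 - math.cos(W * (x - theta))))

def fisher(x, thetas):
    """J(x) = sum_n f_n'(x)^2 / f_n(x), assuming independent Poisson counts with mean f_n(x)."""
    return sum(tuning(x, th) * (KAPPA_F * W * math.sin(W * (x - th))) ** 2 for th in thetas)

def approximations(n_neurons, m_grid=2000):
    """Return (I_F, I_G, I_G+) in nats; assumes G(x) > 0 on the whole grid."""
    thetas = [n * T_THETA / (n_neurons - 1) - T_THETA / 2 for n in range(n_neurons)]
    xs = [(m + 0.5) * T / m_grid - T / 2 for m in range(m_grid)]   # midpoint rule
    dx = T / m_grid
    unnorm = [math.exp(-KAPPA_P * (1.0 - math.cos(W * x))) for x in xs]
    z = sum(unnorm) * dx
    dens = [u / z for u in unnorm]                                 # p(x), equation 2.82
    h_x = -sum(d * math.log(d) for d in dens) * dx
    # P(x) = -q''(x) and P_+ = <q'(x)^2>_x for this prior.
    curv = [KAPPA_P * W * W * math.cos(W * x) for x in xs]
    p_plus = sum(d * (KAPPA_P * W * math.sin(W * x)) ** 2 for d, x in zip(dens, xs)) * dx
    c = math.log(2 * math.pi * math.e)
    i_f = i_g = i_gp = 0.0
    for x, d, p_x in zip(xs, dens, curv):
        j = fisher(x, thetas)
        i_f += 0.5 * d * (math.log(j) - c) * dx
        i_g += 0.5 * d * (math.log(j + p_x) - c) * dx
        i_gp += 0.5 * d * (math.log(j + p_plus) - c) * dx
    return i_f + h_x, i_g + h_x, i_gp + h_x

for n in (10, 50):
    print(n, approximations(n))
```

Because $P(x)$ is negative on part of the input range for this prior, $I_G$ is not guaranteed to exceed $I_F$, whereas $I_{G_+} \geq I_F$ always holds since $P_+ \geq 0$.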

Figure 1:

A comparison of approximations $I_{MC}$, $I_G$, $I_{G_+}$, and $I_F$ for one-dimensional input stimuli. All of them were almost equally good, even for small population size $N$. (A) The stimulus distribution $p(x)$ and tuning curves $f(x; \theta)$ with different centers $\theta = -\pi/4, 0, \pi/4$. (B) The values of $I_{MC}$, $I_G$, $I_{G_+}$, and $I_F$ all increase with neuron number $N$. (C) The relative errors $D_{I_G}$, $D_{I_{G_+}}$, and $D_{I_F}$ for the results in panel B. (D) The absolute values of the relative errors $|D_{I_G}|$, $|D_{I_{G_+}}|$, and $|D_{I_F}|$, with error bars showing standard deviations of repeated trials.

To evaluate the precision of the approximation formulas, we use Monte Carlo (MC) simulation to approximate MI $I$. For MC simulation, we first sample an input $x_j$ from the distribution $p(x)$, then generate the neural response $r_j$ from the conditional distribution $p(r_j|x_j)$, where $j = 1, 2, \ldots, j_{\max}$. The value of MI by MC simulation is calculated by

$$I_{MC} = \frac{1}{j_{\max}} \sum_{j=1}^{j_{\max}} \ln\left( \frac{p(r_j|x_j)}{p(r_j)} \right), \tag{2.83}$$

where $p(r_j)$ is given by

$$p(r_j) = \sum_{m=1}^{M} p(r_j|x_m)\, p(x_m) \tag{2.84}$$

and $x_m = (m - 1) T/M - T/2$ for $m \in \{1, 2, \ldots, M\}$.
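The Monte Carlo estimator of equations 2.83 and 2.84 can be sketched as follows. This is a scaled-down illustration under several assumptions that are not taken from the article: spike counts are independent Poisson variables with the tuning curves of equation 2.81, each stimulus $x_j$ is drawn from the gridded prior rather than from the continuous density, the grid weights $p(x_m)$ are normalized to sum to one, and the sample sizes are far smaller than those used for Figure 1. All function names are hypothetical.

```python
import math
import random

T, AMP = math.pi, 20.0

def rates(x, thetas):
    """Mean spike counts from equation 2.81 with T = pi and sigma_f = 0.5 (exponent scale 1)."""
    return [AMP * math.exp(-(1.0 - math.cos(2.0 * (x - th)))) for th in thetas]

def poisson_sample(lam, rng):
    """Knuth's multiplication method; adequate for the small rates used here."""
    limit, k, prod = math.exp(-lam), 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k

def log_lik(r, f):
    """ln p(r|x) for independent Poisson counts r with means f."""
    return sum(rn * math.log(fn) - fn - math.lgamma(rn + 1) for rn, fn in zip(r, f))

def mi_monte_carlo(n_neurons, j_max, m_grid, rng):
    """Monte Carlo MI estimate in nats, following equations 2.83 and 2.84."""
    thetas = [n / (n_neurons - 1) - 0.5 for n in range(n_neurons)]
    xs = [m * T / m_grid - T / 2 for m in range(m_grid)]            # x_m, with m counted from 0
    kappa_p = (T / (2 * math.pi * (math.pi / 4))) ** 2
    w = [math.exp(-kappa_p * (1.0 - math.cos(2.0 * x))) for x in xs]
    total_w = sum(w)
    log_w = [math.log(v / total_w) for v in w]                      # ln p(x_m), weights sum to 1
    grid_rates = [rates(x, thetas) for x in xs]
    total = 0.0
    for _ in range(j_max):
        m_true = rng.choices(range(m_grid), weights=w)[0]           # x_j drawn from the gridded prior
        r = [poisson_sample(f, rng) for f in grid_rates[m_true]]
        logs = [log_lik(r, grid_rates[m]) + log_w[m] for m in range(m_grid)]
        top = max(logs)
        log_p_r = top + math.log(sum(math.exp(v - top) for v in logs))   # ln p(r_j), equation 2.84
        total += log_lik(r, grid_rates[m_true]) - log_p_r
    return total / j_max

print(mi_monte_carlo(n_neurons=5, j_max=300, m_grid=100, rng=random.Random(0)))
```

The log-sum-exp step avoids numerical underflow when the likelihoods $p(r_j|x_m)$ are very small, which becomes important as $N$ grows.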

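To evaluate the accuracy of MC simulation, we compute the standard deviation,

$$I_{std} = \sqrt{ \frac{1}{i_{\max}} \sum_{i=1}^{i_{\max}} \left( I_{MC}^{i} - I_{MC} \right)^2 }, \tag{2.85}$$

where

$$I_{MC}^{i} = \frac{1}{j_{\max}} \sum_{j=1}^{j_{\max}} \ln\left( \frac{p(r_{\Gamma_{j,i}}|x_{\Gamma_{j,i}})}{p(r_{\Gamma_{j,i}})} \right), \tag{2.86}$$
$$I_{MC} = \frac{1}{i_{\max}} \sum_{i=1}^{i_{\max}} I_{MC}^{i}, \tag{2.87}$$

and $\Gamma_{j,i} \in \{1, 2, \ldots, j_{\max}\}$ is the $(j, i)$th entry of the matrix $\Gamma \in \mathbb{N}^{j_{\max} \times i_{\max}}$ with samples taken randomly from the integer set $\{1, 2, \ldots, j_{\max}\}$ by a uniform distribution. Here we set $j_{\max} = 5 \times 10^5$, $i_{\max} = 100$, and $M = 10^3$.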
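The resampling scheme of equations 2.85 to 2.87 amounts to a bootstrap over the per-sample terms $\ln(p(r_j|x_j)/p(r_j))$. A minimal sketch follows; the synthetic gaussian terms used to exercise it are a hypothetical stand-in for the output of an actual Monte Carlo run, and the function name is an illustrative choice.

```python
import math
import random

def bootstrap_std(log_ratios, i_max, rng):
    """Standard deviation of resampled means of the terms ln(p(r_j|x_j)/p(r_j))."""
    j_max = len(log_ratios)
    means = []
    for _ in range(i_max):
        # One column of the index matrix Gamma: j_max indices drawn uniformly with replacement.
        idx = [rng.randrange(j_max) for _ in range(j_max)]
        means.append(sum(log_ratios[j] for j in idx) / j_max)       # equation 2.86
    grand = sum(means) / i_max                                       # equation 2.87
    return math.sqrt(sum((m - grand) ** 2 for m in means) / i_max)   # equation 2.85

# Hypothetical per-sample terms; in practice they come from the Monte Carlo run.
rng = random.Random(0)
terms = [rng.gauss(1.5, 0.6) for _ in range(2000)]
print(bootstrap_std(terms, i_max=100, rng=rng))
```

For roughly independent terms the result should be close to the sample standard deviation of the terms divided by $\sqrt{j_{\max}}$.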

For different $N \in \{2, 3, 4, 6, 10, 14, 20, 30, 50, 100, 200, 400, 700, 1000\}$, we compare $I_{MC}$ with $I_G$, $I_{G_+}$, and $I_F$, which are illustrated in Figures 1B to 1D. Here we define the relative error of approximation, for example, for $I_G$, as

$$D_{I_G} = \frac{I_G - I_{MC}}{I_{MC}}, \tag{2.88}$$

and the relative standard deviation

$$D_{I_{std}} = \frac{I_{std}}{I_{MC}}. \tag{2.89}$$

Figure 1B shows how the values of $I_{MC}$, $I_G$, $I_{G_+}$, and $I_F$ change with neuron number $N$, and Figures 1C and 1D show their relative errors and the absolute values of the relative errors with respect to $I_{MC}$. From Figures 1B to 1D, we can see that the values of $I_G$, $I_{G_+}$, and $I_F$ are all very close to one another and the absolute values of their relative errors are all very small. The absolute values are less than 1% when $N \geq 10$ and less than 0.1% when $N \geq 100$. However, for high-dimensional inputs, there will be a big difference between $I_G$, $I_{G_+}$, and $I_F$ in many cases (see section 4.1 for more details).

3. Statistical Estimators and Neural Population Decoding

Given the neural response $r$ elicited by the input $x$, we may infer or estimate the input $x$ from the response. This procedure is sometimes referred to as decoding from the response. We need to choose an efficient estimator or a function $\hat{x} = \hat{x}(r)$ that maps the response $r$ to an estimate $\hat{x}$ of the true stimulus $x$. The maximum likelihood (ML) estimator defined by

$$\hat{x}(r) = \arg\max_x p(r|x) = \arg\max_x l(r|x) \tag{3.1}$$

is known to be efficient in the large-$N$ limit. According to the Cramér-Rao lower bound (Rao, 1945), we have the following relationship between the covariance matrix $\Sigma_{\hat{x}}$ of any unbiased estimator and the FI matrix $J(x)$,

$$\Sigma_{\hat{x}} = \left\langle (\hat{x}(r) - x)(\hat{x}(r) - x)^T \right\rangle_{r|x} \geq J^{-1}(x), \tag{3.2}$$

where $\hat{x}(r)$ is an unbiased estimate of $x$ from the response $r$, and $A \geq B$ means that the matrix $A - B$ is positive-semidefinite. Thus,

$$I_F = \frac{1}{2} \left\langle \ln\left( \det\left( \frac{J(x)}{2\pi e} \right) \right) \right\rangle_x + H(X) \geq \frac{1}{2} \left\langle \ln\left( \det\left( \frac{\Sigma_{\hat{x}}^{-1}}{2\pi e} \right) \right) \right\rangle_x + H(X) = I_{var}. \tag{3.3}$$

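The MI between $X$ and $\hat{X}$ is given by

$$\hat{I} = H(\hat{X}) - \left\langle H(\hat{X}|X) \right\rangle_{\hat{x},x}, \tag{3.4}$$

where $H(\hat{X})$ is the entropy of the random variable $\hat{X}$ and $H(\hat{X}|X)$ is the conditional entropy of $\hat{X}$ given $X$. Since the gaussian distribution has the maximum entropy for a given covariance, $H(\hat{X}|X)$ satisfies

$$H(\hat{X}|X) \leq \frac{1}{2}\ln\left(\det\left(2\pi e \Sigma_{\hat{x}}\right)\right). \tag{3.5}$$

Therefore, from equations 3.4 and 3.5, we get

$$\hat{I} \geq \frac{1}{2}\left\langle \ln\left(\det\left(\frac{\Sigma_{\hat{x}}^{-1}}{2\pi e}\right)\right) \right\rangle_x + H(\hat{X}) = \hat{I}_{var}. \tag{3.6}$$

The data processing inequality (Cover & Thomas, 2006) states that postprocessing cannot increase information, so that we have

$$I \geq \hat{I} \geq \hat{I}_{var}. \tag{3.7}$$

Here we cannot directly obtain $I \geq I_F$ as in Brunel and Nadal (1998) when $H(\hat{X}) = H(X)$ and $I_{var} = \hat{I}_{var}$. The simulation results in Figure 1 also show that $I_F$ is not a lower bound of $I$.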

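For biased estimators, the van Trees Bayesian Cramér-Rao bound (Van Trees & Bell, 2007) provides a lower bound:

$$\left\langle \Sigma_{\hat{x}} \right\rangle_x = \left\langle \left\langle (\hat{x}(r) - x)(\hat{x}(r) - x)^T \right\rangle_{r|x} \right\rangle_x \geq \left( \langle J(x) \rangle_x + P_+ \right)^{-1} = \langle G_+(x) \rangle_x^{-1}. \tag{3.8}$$

It follows from equations 2.75, 3.6, and 3.8 that

$$I_{G_+} \leq \frac{1}{2}\ln\left(\det\left(\frac{\langle G_+(x) \rangle_x}{2\pi e}\right)\right) + H(X) = I_{VT}, \tag{3.9}$$
$$I_{VT} \geq \frac{1}{2}\ln\left(\det\left(\frac{\langle \Sigma_{\hat{x}} \rangle_x^{-1}}{2\pi e}\right)\right) + H(X) = \tilde{I}_{var}, \tag{3.10}$$
$$I_{var} \geq \tilde{I}_{var}. \tag{3.11}$$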

We may also regard decoding as Bayesian inference. By Bayes’ rule,

$$p(x|r) = \frac{p(r|x)\,p(x)}{p(r)}. \tag{3.12}$$

According to Bayesian decision theory, if we know the response $r$, then from the prior $p(x)$ and the likelihood $p(r|x)$ we can infer an estimate $\hat{x}(r)$ of the true stimulus $x$, for example,

$$\hat{x}(r) = \arg\max_{x} p(x|r) = \arg\max_{x} L(r|x), \tag{3.13}$$

which is also called maximum a posteriori (MAP) estimation.

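Consider a loss function $\varphi(\hat{x}(r)|x)$ for estimation,

$$\varphi(\hat{x}(r)|x) = -\ln p(x|r), \tag{3.14}$$

which is minimized when $p(x|r)$ reaches its maximum. Now the conditional risk is

$$R(\hat{x}(r)|r) = \left\langle \varphi(\hat{x}(r)|x) \right\rangle_{x|r}, \tag{3.15}$$

and the overall risk is

$$R_o = \left\langle R(\hat{x}(r)|r) \right\rangle_r = \left\langle \left\langle \varphi(\hat{x}(r)|x) \right\rangle_{x|r} \right\rangle_r = -\left\langle \ln p(x|r) \right\rangle_{x,r}. \tag{3.16}$$

Then it follows from equations 2.3 and 3.16 that

$$I = \left\langle \ln p(x|r) \right\rangle_{r,x} + H(X) = -R_o + H(X). \tag{3.17}$$

Comparing equations 2.12, 2.66, and 3.17, we find

$$R_o \simeq -\frac{1}{2}\left\langle \ln\left(\det\left(\frac{G(x)}{2\pi e}\right)\right) \right\rangle_x. \tag{3.18}$$

Hence, maximizing the MI $I$ (or $I_G$) means minimizing the overall risk $R_o$ for a fixed $H(X)$. Therefore, we can get the optimal Bayesian inference via optimizing the MI $I$ (or $I_G$).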

By the Cramér-Rao lower bound, we know that the inverse of the FI matrix, $J^{-1}(x)$, reflects the accuracy of decoding (see equation 3.2). $P(x)$ provides some knowledge about the prior distribution $p(x)$; for example, $P^{-1}(x)$ is the covariance matrix of the input $x$ when $p(x)$ is a normal distribution. $\|P(x)\|$ is small for a flat prior (poor prior) and large for a sharp prior (good prior). Hence, if the prior $p(x)$ is flat or poor and the knowledge about the model is rich, then the MI $I$ is governed by the knowledge of the model, which results in a small $\varsigma_1$ (see equation 2.64) and $I \simeq I_G \simeq I_F$. Otherwise, the prior knowledge has a great influence on the MI $I$, which results in a large $\varsigma_1$ and $I \simeq I_G \not\simeq I_F$.

4. Variable Transformation and Dimensionality Reduction in Neural Population Coding

For low-dimensional input $x$ and large $N$, both $I_G$ and $I_F$ are good approximations of the MI $I$, but for high-dimensional input $x$, a large value of $\varsigma_1$ may lead to a large error of $I_F$, in which case $I_G$ (or $I_{G_+}$) is a better approximation. It is difficult to directly apply the approximation formula $I \simeq I_G$ when we do not have an explicit expression of $p(x)$ or $P(x)$. For many applications, we do not need to know the exact value of $I_G$ and care only about the value of $\langle \ln(\det(G(x))) \rangle_x$ (see section 5). From equations 2.12, 2.22, and 2.78, we know that if $p(x)$ is close to a normal distribution, we can easily approximate $P(x)$ and $H(X)$ to obtain $\langle \ln(\det(G(x))) \rangle_x$ and $I_G$. When $p(x)$ is not a normal distribution, we can employ a technique of variable transformation to make it closer to a normal distribution, as discussed below.

4.1. Variable Transformation.

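Suppose $T: \mathcal{X} \to \tilde{\mathcal{X}}$ is an invertible and differentiable mapping:

$$\tilde{x} = T(x) = \left(T_1(x), T_2(x), \ldots, T_K(x)\right)^T, \tag{4.1}$$

$x = T^{-1}(\tilde{x})$, and $\tilde{x} \in \tilde{\mathcal{X}} \subseteq \mathbb{R}^K$. Let $p(\tilde{x})$ denote the p.d.f. of the random variable $\tilde{X}$ and

$$p(r|\tilde{x}) = p(r|x)\big|_{x = T^{-1}(\tilde{x})}. \tag{4.2}$$

Then we have the following conclusions, the proofs of which are given in the appendix.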

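Theorem 3. The MI is invariant under invertible transformations. More specifically, for the above invertible transformation $T$, the MI $I(X; R)$ in equation 2.1 is equal to

$$I(\tilde{X}; R) = \left\langle \ln \frac{p(r|\tilde{x})}{p(r)} \right\rangle_{r,\tilde{x}}. \tag{4.3}$$

Furthermore, suppose $p(\tilde{x})$ and $p(r|\tilde{x})$ fulfill the conditions C1, C2 and $\xi = O(N^{-1})$. Then we have

$$I(\tilde{X}; R) = \tilde{I}_G + O(N^{-1}), \tag{4.4}$$
$$\tilde{I}_G = \frac{1}{2}\left\langle \ln\left(\det\left(\frac{G(\tilde{x})}{2\pi e}\right)\right) \right\rangle_{\tilde{x}} + H(\tilde{X}) = \frac{1}{2}\left\langle \ln\left(\det\left(\frac{G(x)}{2\pi e}\right)\right) \right\rangle_x + H(X) = I_G, \tag{4.5}$$

where $H(\tilde{X})$ is the entropy of the random variable $\tilde{X}$ and satisfies

$$H(\tilde{X}) = -\left\langle \ln p(\tilde{x}) \right\rangle_{\tilde{x}} = H(X) + \left\langle \ln\left|\det(DT(x))\right| \right\rangle_x, \tag{4.6}$$

and $DT(x)$ denotes the Jacobian matrix of $T(x)$,

$$(DT(x))_{i,j} = \frac{\partial T_i(x)}{\partial x_j}, \quad i, j = 1, 2, \ldots, K. \tag{4.7}$$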

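Corollary 1. Suppose $p(r|x)$ is a normal distribution,

$$p(r|x) = \mathcal{N}(A^T y, I_N), \tag{4.8}$$

where $y = f(B^T x) = (y_1, y_2, \ldots, y_K)^T$, $y_k = f_k(b_k^T x)$ for $k = 1, 2, \ldots, K$, $A$ is a deterministic $K \times N$ matrix, $B = [b_1, b_2, \ldots, b_K]$ is a deterministic invertible matrix, and $f_k$ is an invertible and differentiable function. If $Y$ also has a normal distribution, $p(y) = \mathcal{N}(\mu_f, \Sigma_f)$, then

$$I_G = I_{G_+} = I(X; R) = I(Y; R) = \frac{1}{2}\ln\left(\det\left(\frac{1}{2\pi e}\left(AA^T + \Sigma_f^{-1}\right)\right)\right) + H(Y) = \frac{1}{2}\left\langle \ln\left(\det\left(\frac{1}{2\pi e}\left(J(x) + P(x)\right)\right)\right) \right\rangle_x + H(X), \tag{4.9}$$

where

$$H(Y) = \frac{1}{2}\ln\left(\det\left(2\pi e \Sigma_f\right)\right) = H(X) + \left\langle \ln\left|\det(D(x))\right| \right\rangle_x, \tag{4.10}$$
$$D(x) = \left(f_1'(b_1^T x)\, b_1, f_2'(b_2^T x)\, b_2, \ldots, f_K'(b_K^T x)\, b_K\right)^T, \tag{4.11}$$
$$f_k'(b_k^T x) = \left.\frac{\partial f_k(y_k)}{\partial y_k}\right|_{y_k = b_k^T x}, \quad k = 1, 2, \ldots, K. \tag{4.12}$$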
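The equality $I_G = I(Y; R)$ in equation 4.9 can be checked numerically for the linear-gaussian case. The following is a minimal sketch, not the authors' code: it assumes $r = A^T y + \text{noise}$ with unit-variance gaussian noise and $y \sim \mathcal{N}(\mu_f, \Sigma_f)$, and compares the expression in equation 4.9 with the textbook gaussian-channel formula $\frac{1}{2}\ln\det(I_N + A^T \Sigma_f A)$. The function name `gaussian_channel_information` is hypothetical.

```python
import numpy as np

def gaussian_channel_information(A, Sigma_f):
    """Return (I_G, I_exact) in nats for r = A^T y + noise, noise ~ N(0, I_N), y ~ N(mu, Sigma_f)."""
    K, N = A.shape
    const = K * np.log(2.0 * np.pi * np.e)

    # H(Y) = 0.5 * ln det(2 pi e Sigma_f), as in equation 4.10
    H_Y = 0.5 * (const + np.linalg.slogdet(Sigma_f)[1])

    # I_G = 0.5 * ln det((A A^T + Sigma_f^{-1}) / (2 pi e)) + H(Y), as in equation 4.9
    G = A @ A.T + np.linalg.inv(Sigma_f)
    I_G = 0.5 * (np.linalg.slogdet(G)[1] - const) + H_Y

    # Exact MI of a linear-gaussian channel: H(R) - H(R|Y) = 0.5 * ln det(I_N + A^T Sigma_f A)
    I_exact = 0.5 * np.linalg.slogdet(np.eye(N) + A.T @ Sigma_f @ A)[1]
    return I_G, I_exact

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((3, 20))
    L = rng.standard_normal((3, 3))
    print(gaussian_channel_information(A, L @ L.T + np.eye(3)))
```

The two values agree up to floating-point error because $\det(I_N + A^T \Sigma_f A) = \det(\Sigma_f)\det(AA^T + \Sigma_f^{-1})$.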

Remark 6. From corollary 1 and equation 2.78, we know that the accuracy of the approximation $I_G \simeq I(X; R)$ is improved when we employ an invertible transformation on the input random variable $X$ to make the new random variable $Y$ closer to a normal distribution (see section 4.3).

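Consider the eigendecompositions of $AA^T$ and $\Sigma_f$ as given by

$$AA^T = U_A \hat{\Sigma} U_A^T, \tag{4.13}$$
$$\Sigma_f = U_f \tilde{\Sigma} U_f^T, \tag{4.14}$$

where $U_A$ and $U_f$ are $K \times K$ orthogonal matrices; $\hat{\Sigma} = \mathrm{diag}(\hat{\sigma}_1^2, \hat{\sigma}_2^2, \ldots, \hat{\sigma}_K^2)$ and $\tilde{\Sigma} = \mathrm{diag}(\tilde{\sigma}_1^2, \tilde{\sigma}_2^2, \ldots, \tilde{\sigma}_K^2)$ are $K \times K$ eigenvalue matrices; and $\hat{\sigma}_1 \geq \hat{\sigma}_2 \geq \cdots \geq \hat{\sigma}_K > 0$ and $\tilde{\sigma}_1 \geq \tilde{\sigma}_2 \geq \cdots \geq \tilde{\sigma}_K > 0$. Then by equations 2.11 and 4.9, we have

$$I_G = I_{G_+} = I(X; R) = I(Y; R) = \frac{1}{2}\ln\left(\det\left(\frac{1}{2\pi e}\left(U_A \hat{\Sigma} U_A^T + U_f \tilde{\Sigma}^{-1} U_f^T\right)\right)\right) + H(Y), \tag{4.15}$$
$$I_F = \frac{1}{2}\ln\left(\det\left(\frac{\hat{\Sigma}}{2\pi e}\right)\right) + H(Y), \tag{4.16}$$

and

$$I_F - I_G = -\frac{1}{2}\ln\left(\det\left(I_K + \hat{\Sigma}^{-\frac{1}{2}} U_A^T U_f \tilde{\Sigma}^{-1} U_f^T U_A \hat{\Sigma}^{-\frac{1}{2}}\right)\right). \tag{4.17}$$

Now consider two special cases. If $\tilde{\Sigma} = I_K$, then by equation 4.17, we get

$$I_F - I_G = -\frac{1}{2}\sum_{k=1}^{K} \ln\left(1 + \hat{\sigma}_k^{-2}\right). \tag{4.18}$$

If $U_A = U_f$, then

$$I_F - I_G = -\frac{1}{2}\sum_{k=1}^{K} \ln\left(1 + \hat{\sigma}_k^{-2} \tilde{\sigma}_k^{-2}\right). \tag{4.19}$$

Here $J(x) = U_A \hat{\Sigma} U_A^T$ and $P^{-1}(x) = U_f \tilde{\Sigma} U_f^T$. The matrices $J(x)$ and $P^{-1}(x)$ become degenerate when $\hat{\sigma}_K^2 \to 0$ and $\tilde{\sigma}_K^2 \to 0$, respectively.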

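From equations 4.18 and 4.19, we see that if either $J(x)$ or $P^{-1}(x)$ becomes degenerate, then $(I_F - I_G) \to -\infty$. This may happen for high-dimensional stimuli. For a specific example, consider a random matrix $A$ defined as follows. We first generate $K \times N$ elements $A_{k,n}$ ($k = 1, 2, \ldots, K$; $n = 1, 2, \ldots, N$) from a normal distribution $\mathcal{N}(0, 1)$. Then each column of the matrix $A$ is normalized by $A_{k,n} \leftarrow A_{k,n} / \sqrt{\sum_{k=1}^{K} A_{k,n}^2}$. We randomly sample $M$ (set to $2 \times 10^4$) image patches of size $\omega \times \omega$ from Olshausen’s natural image data set (Olshausen & Field, 1996) as the inputs. Each input image patch is centered by subtracting its mean: $x_m \leftarrow x_m - \frac{1}{K}\sum_{k=1}^{K} x_{k,m}$. Then let $x_m \leftarrow x_m - \frac{1}{M}\sum_{m=1}^{M} x_m$ for all $m \in \{1, 2, \ldots, M\}$. Define the matrix $X = [x_1, x_2, \ldots, x_M]$ and compute the eigendecomposition

$$\frac{1}{M} X X^T = U_x \check{\Sigma} U_x^T, \tag{4.20}$$

where $U_x$ is a $K \times K$ orthogonal matrix and $\check{\Sigma} = \mathrm{diag}(\check{\sigma}_1^2, \check{\sigma}_2^2, \ldots, \check{\sigma}_K^2)$ is a $K \times K$ eigenvalue matrix with $\check{\sigma}_1 \geq \check{\sigma}_2 \geq \cdots \geq \check{\sigma}_K > 0$. Define

$$y = U_x^T x. \tag{4.21}$$

Then

$$\frac{1}{M}\sum_{m=1}^{M} y_m y_m^T = \check{\Sigma}. \tag{4.22}$$

The distribution of the random variable $Y$ can be approximated by a normal distribution (see section 4.3 for more details). When $p(y) = \mathcal{N}(\check{\mu}, \check{\Sigma})$, we have

$$I_G = I_{G_+} = I(X; R) = I(Y; R), \tag{4.23}$$
$$I_G = \frac{1}{2}\ln\left(\det\left(\frac{1}{2\pi e}\left(AA^T + \check{\Sigma}^{-1}\right)\right)\right) + H(Y) = \frac{1}{2}\ln\left(\det\left(\check{\Sigma}^{\frac{1}{2}} AA^T \check{\Sigma}^{\frac{1}{2}} + I_K\right)\right), \tag{4.24}$$
$$I_F = \frac{1}{2}\ln\left(\det\left(\frac{AA^T}{2\pi e}\right)\right) + H(Y). \tag{4.25}$$

The error of the approximation $I_F$ is given by

$$d_{I_F} = I_F - I(X; R) = I_F - I_G = -\frac{1}{2}\ln\left(\det\left(I_K + (AA^T)^{-1} \check{\Sigma}^{-1}\right)\right), \tag{4.26}$$

and the relative error for $I_F$ is

$$D_{I_F} = \frac{d_{I_F}}{I_G}. \tag{4.27}$$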

Figure 2A shows how the values of $I_G$ and $I_F$ vary with the input dimension $K = \omega \times \omega$ and the number of neurons $N$ (with $\omega = 2, 4, 6, \ldots, 30$ and $N = 10^4, 2 \times 10^4, 5 \times 10^4, 10^5$). The relative error $D_{I_F}$ is shown in Figure 2B. The absolute value of the relative error tends to decrease with $N$ but may grow quite large as $K$ increases. In Figure 2B, the largest absolute value of the relative error $|D_{I_F}|$ is greater than 5000%, which occurs when $K = 900$ and $N = 10^4$. Even the smallest $|D_{I_F}|$ is still greater than 80%, which occurs when $K = 100$ and $N = 10^5$. In this example, $I_F$ is a bad approximation of the MI $I$, whereas $I_G$ and $I_{G_+}$ are strictly equal to the true MI $I$ across all parameters.
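The quantities in equations 4.24 to 4.27 are straightforward to evaluate once $AA^T$ and the eigenvalue spectrum $\check{\Sigma}$ are available. The sketch below is illustrative only: instead of natural image patches it takes a user-supplied spectrum (an assumption made here so the example is self-contained), and the function name `compare_IF_IG` is hypothetical.

```python
import numpy as np

def compare_IF_IG(K, N, spectrum, seed=0):
    """Return (I_G, I_F, d_IF) in nats for a random K x N matrix A with unit-norm columns.

    `spectrum` holds the K eigenvalues of the input covariance (the diagonal of
    Sigma-check); here it is a synthetic stand-in for the natural-image spectrum.
    """
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((K, N))
    A /= np.linalg.norm(A, axis=0, keepdims=True)   # normalize each column of A
    AAT = A @ A.T
    Sigma_inv = np.diag(1.0 / spectrum)

    const = K * np.log(2.0 * np.pi * np.e)
    H_Y = 0.5 * (const + np.sum(np.log(spectrum)))  # entropy of N(mu, Sigma-check)

    I_G = 0.5 * (np.linalg.slogdet(AAT + Sigma_inv)[1] - const) + H_Y   # equation 4.24
    I_F = 0.5 * (np.linalg.slogdet(AAT)[1] - const) + H_Y               # equation 4.25
    # equation 4.26: d_IF = -0.5 * ln det(I_K + (A A^T)^{-1} Sigma-check^{-1})
    d_IF = -0.5 * np.linalg.slogdet(np.eye(K) + np.linalg.solve(AAT, Sigma_inv))[1]
    return I_G, I_F, d_IF

if __name__ == "__main__":
    for K in (4, 16, 64):
        spectrum = 1.0 / np.arange(1, K + 1) ** 2   # hypothetical power-law spectrum
        I_G, I_F, d_IF = compare_IF_IG(K, 1000, spectrum)
        print(K, I_G, I_F, d_IF / I_G)              # last column is D_IF of equation 4.27
```

With a decaying spectrum, the small eigenvalues make $\check{\Sigma}^{-1}$ large, which is the mechanism by which $I_F$ departs from $I_G$ as $K$ grows.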

Figure 2:


A comparison of approximations $I_G$ and $I_F$ for different input dimensions. Here $I_G$ is always equal to the true MI with $I_G = I_{G_+} = I(X; R)$, whereas $I_F$ always has nonzero errors. (A) The values of $I_G$ and $I_F$ vary with the input dimension $K = \omega^2$ with $\omega = 2, 4, 6, \ldots, 30$, and the number of neurons $N = N_i$ with $N_1 = 10^4$, $N_2 = 2 \times 10^4$, $N_3 = 5 \times 10^4$, $N_4 = 10^5$. (B) The relative error $D_{I_F}$ changes with the input dimension $K$ for different $N$.

4.2. Dimensionality Reduction for Asymptotic Approximations.

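Suppose $x = (x_1, \ldots, x_K)^T$ is partitioned into two sets of components, $x = (\mathbf{x}_1^T, \mathbf{x}_2^T)^T$ with

$$\mathbf{x}_1 = (x_1, x_2, \ldots, x_{K_1})^T, \tag{4.28}$$
$$\mathbf{x}_2 = (x_{K_1+1}, x_{K_1+2}, \ldots, x_K)^T, \tag{4.29}$$

where $\mathbf{x}_1 \in \mathcal{X}_1 \subseteq \mathbb{R}^{K_1}$, $\mathbf{x}_2 \in \mathcal{X}_2 \subseteq \mathbb{R}^{K_2}$, $K_1 + K_2 = K$, $K \geq 2$, $K_1 \geq 1$ and $K_2 \geq 1$.

Then by Fubini’s theorem, the MI $I$ in equation 2.1 can be written as

$$I = \int_{\mathcal{X}_2}\int_{\mathcal{X}_1}\int_{\mathcal{R}} p(r|\mathbf{x}_1, \mathbf{x}_2)\, p(\mathbf{x}_1, \mathbf{x}_2) \ln\frac{p(r|\mathbf{x}_1, \mathbf{x}_2)}{p(r)}\, dr\, d\mathbf{x}_1\, d\mathbf{x}_2, \tag{4.30}$$

where $p(\mathbf{x}_1, \mathbf{x}_2) = p(x)$ and $p(r|\mathbf{x}_1, \mathbf{x}_2) = p(r|x)$.

First define

$$G(x) = \begin{pmatrix} G_{1,1}(x) & G_{1,2}(x) \\ G_{2,1}(x) & G_{2,2}(x) \end{pmatrix}, \tag{4.31a}$$
$$G_{i,j}(x) = J_{i,j}(x) + P_{i,j}(x), \tag{4.31b}$$

where $i, j \in \{1, 2\}$, and

$$J_{i,j}(x) = \left\langle \frac{\partial \ln p(r|x)}{\partial \mathbf{x}_i} \frac{\partial \ln p(r|x)}{\partial \mathbf{x}_j^T} \right\rangle_{r|x}, \tag{4.32a}$$
$$P_{i,j}(x) = -\frac{\partial^2 \ln p(x)}{\partial \mathbf{x}_i\, \partial \mathbf{x}_j^T}. \tag{4.32b}$$

Then we have the following results, whose proofs are given in the appendix.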

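Theorem 4. Suppose matrices $G(x)$, $G_{1,1}(x)$, and $G_{2,2}(x)$ are positive-definite. If the matrix $A_x \in \mathbb{R}^{K_2 \times K_2}$ satisfies

$$\mathrm{Tr}\left(\langle A_x \rangle_x\right) \ll 1 \tag{4.33}$$

with

$$A_x = G_{2,2}^{-\frac{1}{2}}(x)\, G_{2,1}(x)\, G_{1,1}^{-1}(x)\, G_{1,2}(x)\, G_{2,2}^{-\frac{1}{2}}(x), \tag{4.34}$$

then we have

$$I_G \simeq I_{G_1} \tag{4.35}$$

with strict equality if and only if

$$G_{2,1}(x)\, G_{1,1}^{-1}(x)\, G_{1,2}(x) = 0, \tag{4.36}$$

where

$$I_{G_1} = \frac{1}{2}\left\langle \ln\left(\det\left(\frac{G_{1,1}(x)}{2\pi e}\right)\right) \right\rangle_x + \frac{1}{2}\left\langle \ln\left(\det\left(\frac{G_{2,2}(x)}{2\pi e}\right)\right) \right\rangle_x + H(X). \tag{4.37}$$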
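The role of $A_x$ in theorem 4 can be seen from the block-determinant (Schur complement) identity $\ln\det G = \ln\det G_{1,1} + \ln\det G_{2,2} + \ln\det(I - A_x)$, where the last term is approximately $-\mathrm{Tr}(A_x)$ when $A_x$ is small. The sketch below checks this identity numerically for a single positive-definite matrix; it is an illustration under that reading of the theorem, and the function name `block_logdet_terms` is hypothetical.

```python
import numpy as np

def block_logdet_terms(G, K1):
    """Split ln det(G) into block terms for a symmetric positive-definite G.

    Returns (full, reduced, correction, trace_A) with
      full       = ln det(G)
      reduced    = ln det(G11) + ln det(G22)
      correction = ln det(I - A), so that full = reduced + correction
      trace_A    = Tr(A), where A follows equation 4.34
    """
    G11, G12 = G[:K1, :K1], G[:K1, K1:]
    G21, G22 = G[K1:, :K1], G[K1:, K1:]

    # Symmetric inverse square root of G22 from its eigendecomposition
    w, V = np.linalg.eigh(G22)
    G22_inv_sqrt = (V / np.sqrt(w)) @ V.T

    A = G22_inv_sqrt @ G21 @ np.linalg.solve(G11, G12) @ G22_inv_sqrt

    full = np.linalg.slogdet(G)[1]
    reduced = np.linalg.slogdet(G11)[1] + np.linalg.slogdet(G22)[1]
    correction = np.linalg.slogdet(np.eye(G22.shape[0]) - A)[1]
    return full, reduced, correction, np.trace(A)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M = rng.standard_normal((6, 6))
    G = M @ M.T + 20.0 * np.eye(6)   # strongly diagonal, so the coupling term is small
    print(block_logdet_terms(G, K1=4))
```

Because $A_x$ is positive-semidefinite, the correction term is never positive, so dropping it can only overestimate $\ln\det G$.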

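Theorem 5. Suppose matrices $G(x)$, $G_{1,1}(x)$ and $P_{2,2}(x)$ are positive-definite. If the matrix $B_x \in \mathbb{R}^{K_2 \times K_2}$ is positive-semidefinite and satisfies

$$0 \leq \mathrm{Tr}\left(\langle B_x \rangle_x\right) \ll 1 \tag{4.38}$$

with

$$B_x = P_{2,2}^{-\frac{1}{2}}(x)\, C_x\, P_{2,2}^{-\frac{1}{2}}(x), \tag{4.39}$$
$$C_x = J_{2,2}(x) - G_{2,1}(x)\, G_{1,1}^{-1}(x)\, G_{1,2}(x), \tag{4.40}$$

then we have

$$I_G \simeq I_{G_2}, \tag{4.41}$$

with strict equality if and only if

$$C_x = 0, \tag{4.42}$$

where

$$I_{G_2} = \frac{1}{2}\left\langle \ln\left(\det\left(\frac{G_{1,1}(x)}{2\pi e}\right)\right) \right\rangle_x + \frac{1}{2}\left\langle \ln\left(\det\left(\frac{P_{2,2}(x)}{2\pi e}\right)\right) \right\rangle_x + H(X). \tag{4.43}$$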

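Corollary 2. If the random variables $X_1$ and $X_2$ are independent so that $p(x) = p(\mathbf{x}_1)\, p(\mathbf{x}_2)$, $p(\mathbf{x}_2) = \mathcal{N}(\mu_2, \Sigma_{x_2})$ is a normal distribution, and $G(x)$, $G_{1,1}(x)$, $P_{1,1}(x)$ and $P_{2,2}(x)$ are all positive-definite and satisfy equation 4.38, then we have

$$I_G \simeq I_{G_1}, \tag{4.44}$$
$$I_{G_1} = \frac{1}{2}\left\langle \ln\left(\det\left(\frac{G_{1,1}(x)}{2\pi e}\right)\right) \right\rangle_x + H(X_1), \tag{4.45}$$

with strict equality if and only if

$$C_x = J_{2,2}(x) - J_{2,1}(x)\, G_{1,1}^{-1}(x)\, J_{1,2}(x) = 0, \tag{4.46}$$

where

$$H(X_1) = -\left\langle \ln p(\mathbf{x}_1) \right\rangle_{\mathbf{x}_1}, \tag{4.47a}$$
$$G_{1,1}(x) = J_{1,1}(x) + P_{1,1}(x), \tag{4.47b}$$
$$P_{1,1}(x) = -\frac{\partial^2 \ln p(\mathbf{x}_1)}{\partial \mathbf{x}_1\, \partial \mathbf{x}_1^T}. \tag{4.47c}$$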

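Remark 7. Sometimes we are concerned only with calculating the determinant of the matrix $G(x)$ for a given $p(x)$. Theorems 4 and 5 provide a dimensionality reduction method for computing $G(x)$ or $\det(G(x))$, by which we need only to compute $G_{1,1}(x)$ and $G_{2,2}(x)$ separately. To apply the approximation 4.35, we do not need to strictly require $|\mathrm{Tr}(\langle A_x \rangle_x)| \ll 1$. Instead we need to require only

$$\left|\mathrm{Tr}\left(\langle A_x \rangle_x\right)\right| \ll \left|\left\langle \ln\left(\det(G_{1,1}(x)) \det(G_{2,2}(x))\right) \right\rangle_x\right|. \tag{4.48}$$

Similarly, the inequality $|\mathrm{Tr}(\langle B_x \rangle_x)| \ll 1$ can be substituted by

$$\left|\mathrm{Tr}\left(\langle B_x \rangle_x\right)\right| \ll \left|\left\langle \ln\left(\det(G_{1,1}(x)) \det(P_{2,2}(x))\right) \right\rangle_x\right|. \tag{4.49}$$

By equation 4.44 and the second mean value theorem for integrals, we get

$$I_{G_1} = \frac{1}{2}\left\langle \ln\left(\det\left(\frac{G_{1,1}(\mathbf{x}_1, \ddot{\mathbf{x}}_2)}{2\pi e}\right)\right) \right\rangle_{\mathbf{x}_1} + H(X_1) \tag{4.50}$$

for some fixed $\ddot{\mathbf{x}}_2 \in \mathcal{X}_2$. When $\|\Sigma_{x_2}\|$ is small, $\ddot{\mathbf{x}}_2$ should be close to the mean: $\ddot{\mathbf{x}}_2 \approx \mu_2$. It follows from theorem 1 and corollary 2 that the approximate relationship $I \simeq I_{G_1}$ holds. However, equation 4.50 implies that $I_{G_1}$ is determined only by the first component $\mathbf{x}_1$. Hence, the minor component (i.e., $\mathbf{x}_2$) has little impact on information transfer for the high-dimensional input $x$. In other words, the information transfer is mainly determined by the first component $\mathbf{x}_1$, and we can omit the minor component $\mathbf{x}_2$.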

4.3. Further Discussion.

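Suppose $x$ is a zero-mean vector; if it is not, then let $x \leftarrow x - \langle x \rangle_x$. The covariance matrix of $x$ is given by

$$\Sigma_x = \left\langle x x^T \right\rangle_x = U \Sigma U^T, \tag{4.51}$$

where $U$ is a $K \times K$ orthogonal matrix whose $k$th column is the eigenvector $u_k$ of $\Sigma_x$, and $\Sigma$ is a diagonal matrix whose diagonal elements are the corresponding eigenvalues, $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_K^2)$ with $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_K > 0$. With the whitening transformation,

$$\tilde{x} = \Sigma^{-\frac{1}{2}} U^T x, \tag{4.52}$$

the covariance matrix of $\tilde{x}$ becomes an identity matrix:

$$\Sigma_{\tilde{x}} = \left\langle \tilde{x} \tilde{x}^T \right\rangle_{\tilde{x}} = \Sigma^{-\frac{1}{2}} U^T \left\langle x x^T \right\rangle_x U \Sigma^{-\frac{1}{2}} = I_K. \tag{4.53}$$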
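The whitening transformation of equations 4.51 to 4.53 can be sketched in a few lines when the covariance is estimated from samples. This is a minimal illustration, assuming the samples are stored as columns of a $K \times M$ array and that the sample covariance has full rank; the function name `whiten` is hypothetical.

```python
import numpy as np

def whiten(X):
    """Whiten samples stored as columns of a K x M array.

    Returns (X_tilde, W), where W = Sigma^{-1/2} U^T as in equation 4.52 and
    X_tilde = W @ (X - mean) has identity sample covariance (equation 4.53).
    """
    X = X - X.mean(axis=1, keepdims=True)        # center: x <- x - <x>
    Sigma_x = X @ X.T / X.shape[1]               # sample covariance, equation 4.51
    eigvals, U = np.linalg.eigh(Sigma_x)         # eigenvalues returned in ascending order
    W = (U / np.sqrt(eigvals)).T                 # diag(eigvals^{-1/2}) @ U.T
    return W @ X, W

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mixing = rng.standard_normal((4, 4))
    X = mixing @ rng.standard_normal((4, 2000))  # correlated synthetic inputs
    X_tilde, W = whiten(X)
    print(np.round(X_tilde @ X_tilde.T / X_tilde.shape[1], 6))
```

Note that `numpy.linalg.eigh` orders eigenvalues from smallest to largest, the reverse of the ordering used in the text; this does not affect the whitened covariance.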

By the central limit theorem, the distribution of the random variable $\tilde{X}$ should be closer to a normal distribution than the distribution of the original random variable $X$; that is, $p(\tilde{x}) \approx \mathcal{N}(0, I_K)$. Using the asymptotic expansion of Laplace’s method (MacKay, 2003), we get

$$P(\tilde{x}) = -\frac{\partial^2 \ln p(\tilde{x})}{\partial \tilde{x}\, \partial \tilde{x}^T} \approx \Sigma_{\tilde{x}}^{-1} = I_K, \tag{4.54}$$
$$P_+ = \left\langle P(\tilde{x}) \right\rangle_{\tilde{x}} \approx \Sigma_{\tilde{x}}^{-1} = I_K. \tag{4.55}$$

In principal component analysis (PCA), the data set is modeled by a multivariate gaussian. With the PCA-like whitening transformation of equation 4.52, we can use the approximation 4.55 based on Laplace’s method, which requires only that the peak be close to the mean; the random variable $\tilde{X}$ does not need to have an exactly gaussian distribution.

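By theorem 3, we have

$$I(\tilde{X}; R) \simeq I_G = \frac{1}{2}\left\langle \ln\left(\det\left(\frac{G(\tilde{x})}{2\pi e}\right)\right) \right\rangle_{\tilde{x}} + H(\tilde{X}), \tag{4.56}$$

where

$$G(\tilde{x}) = J(\tilde{x}) + I_K, \tag{4.57}$$
$$J(\tilde{x}) = \left\langle \frac{\partial \ln p(r|\tilde{x})}{\partial \tilde{x}} \frac{\partial \ln p(r|\tilde{x})}{\partial \tilde{x}^T} \right\rangle_{r|\tilde{x}} \tag{4.58}$$
$$= \Sigma^{\frac{1}{2}} U^T \left\langle \frac{\partial \ln p(r|x)}{\partial x} \frac{\partial \ln p(r|x)}{\partial x^T} \right\rangle_{r|x} U \Sigma^{\frac{1}{2}} \tag{4.59}$$
$$= \Sigma^{\frac{1}{2}} U^T J(x)\, U \Sigma^{\frac{1}{2}}, \tag{4.60}$$
$$H(\tilde{X}) = -\left\langle \ln p(\tilde{x}) \right\rangle_{\tilde{x}} = H(X) - \frac{1}{2}\ln\left(\det(\Sigma)\right). \tag{4.61}$$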

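Given a $K \times K$ orthogonal matrix $B \in \mathbb{R}^{K \times K}$, we define

$$y = B^T \tilde{x}. \tag{4.62}$$

Then it follows from equations 4.56 to 4.62 that

$$I(Y; R) \simeq I_G = \frac{1}{2}\left\langle \ln\left(\det\left(\frac{G(y)}{2\pi e}\right)\right) \right\rangle_y + H(Y), \tag{4.63}$$

where

$$G(y) = J(y) + I_K, \tag{4.64}$$
$$J(y) = B^T J(\tilde{x}) B, \tag{4.65}$$
$$H(Y) = H(\tilde{X}). \tag{4.66}$$

Suppose $y$ is partitioned into two sets of components, $y = (\mathbf{y}_1^T, \mathbf{y}_2^T)^T$ and

$$\mathbf{y}_1 = (y_1, y_2, \ldots, y_{K_1})^T, \tag{4.67}$$
$$\mathbf{y}_2 = (y_{K_1+1}, y_{K_1+2}, \ldots, y_K)^T, \tag{4.68}$$

where $K_1 + K_2 = K$, $K \geq 2$, $K_1 \geq 1$ and $K_2 \geq 1$. Let

$$G(y) = \begin{pmatrix} J_{1,1}(y) + I_{K_1} & J_{1,2}(y) \\ J_{2,1}(y) & J_{2,2}(y) + I_{K_2} \end{pmatrix}, \tag{4.69}$$

where

$$J_{i,j}(y) = \left\langle \frac{\partial \ln p(r|y)}{\partial \mathbf{y}_i} \frac{\partial \ln p(r|y)}{\partial \mathbf{y}_j^T} \right\rangle_{r|y}, \quad i, j = 1, 2. \tag{4.70}$$

When $K \gg 1$, suppose we can find an orthogonal matrix $B$ and $K_1$ that satisfy condition 4.38 in theorem 5 or condition 4.49, that is,

$$0 \leq \left\langle \mathrm{Tr}(B_y) \right\rangle_y \ll \gamma, \tag{4.71}$$
$$B_y = J_{2,2}(y) - J_{2,1}(y) \left(J_{1,1}(y) + I_{K_1}\right)^{-1} J_{1,2}(y), \tag{4.72}$$
$$\gamma = \left\langle \ln\left(\det\left(J_{1,1}(y) + I_{K_1}\right)\right) \right\rangle_y. \tag{4.73}$$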

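Here the matrix $B_y$ is positive-semidefinite because

$$J_{2,2}(y) - J_{2,1}(y) \left(J_{1,1}(y) + I_{K_1}\right)^{-1} J_{1,2}(y) = \left\langle \rho(r|y)\, \rho(r|y)^T \right\rangle_{r|y}, \tag{4.74}$$

where

$$\rho(r|y) = \frac{\partial \ln p(r|y)}{\partial \mathbf{y}_2} - J_{2,1}(y) \left(J_{1,1}(y) + I_{K_1}\right)^{-1} \left(\frac{\partial \ln p(r|y)}{\partial \mathbf{y}_1} + a(r)\right) \tag{4.75}$$

and $a(r)$ is a $K_1$-dimensional random vector that satisfies

$$\left\langle \frac{\partial \ln p(r|y)}{\partial \mathbf{y}_2}\, a(r)^T \right\rangle_{r|y} = \left\langle \frac{\partial \ln p(r|y)}{\partial \mathbf{y}_2} \right\rangle_{r|y} \left\langle a(r)^T \right\rangle_{r|y} = 0, \tag{4.76}$$
$$\left\langle a(r)\, a(r)^T \right\rangle_{r|y} = I_{K_1}. \tag{4.77}$$

Assuming that $J_{1,1}(y)$ is positive-definite, $\|J_{1,1}^{-1}(y)\| = O(N^{-1})$ and $\|J_{1,2}(y)\| = \|J_{2,1}(y)\| = O(N)$, we have

$$\left(J_{1,1}(y) + I_{K_1}\right)^{-1} = J_{1,1}^{-1}(y) - J_{1,1}^{-2}(y) + O\left(J_{1,1}^{-3}(y)\right) \tag{4.78}$$

and

$$\mathrm{Tr}(C_x) = \mathrm{Tr}\left(J_{2,2}(y) - J_{2,1}(y) J_{1,1}^{-1}(y) J_{1,2}(y)\right) + \mathrm{Tr}\left(J_{2,1}(y) J_{1,1}^{-2}(y) J_{1,2}(y)\right) + O(N^{-1}). \tag{4.79}$$

Hence, if

$$\mathrm{Tr}\left(J_{2,2}(y) - J_{2,1}(y) J_{1,1}^{-1}(y) J_{1,2}(y)\right) \ll \gamma, \tag{4.80}$$
$$\mathrm{Tr}\left(J_{2,1}(y) J_{1,1}^{-2}(y) J_{1,2}(y)\right) \ll \gamma, \tag{4.81}$$

then equation 4.71 holds. Notice that the matrix $\left(J_{2,2}(y) - J_{2,1}(y) J_{1,1}^{-1}(y) J_{1,2}(y)\right)$ is positive-semidefinite, which is similar to equation 4.74, and $0 \leq \mathrm{Tr}\left(J_{2,1}(y) J_{1,1}^{-1}(y) J_{1,2}(y)\right) \leq \mathrm{Tr}\left(J_{2,2}(y)\right)$. Hence, if

$$\mathrm{Tr}\left(J_{2,2}(y)\right) \ll \gamma, \tag{4.82}$$

then equations 4.80 and 4.81 hold, and so does equation 4.71.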

5. Optimization of Information Transfer in Neural Population Coding

5.1. Population Density Distribution of Parameters in Neural Populations.

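If $p(r|x)$ is conditionally independent, we can write

$$p(r|x) = \prod_{n=1}^{N} p(r_n|x; \theta_n), \tag{5.1}$$

where $\theta_n \in \mathbb{R}^{\tilde{K}}$ denotes a $\tilde{K}$-dimensional vector of parameters for the $n$th neuron, and $p(r_n|x; \theta_n)$ is the conditional p.d.f. of the output $r_n$ given $x$. With the definition in equation 2.13, we have the following proposition.

Proposition 1. If $p(r|x)$ is conditionally independent as in equation 5.1, we have

$$J(x) = N \int_{\Theta} p(\theta)\, S(x; \theta)\, d\theta, \tag{5.2}$$

where

$$S(x; \theta) = \int_{\mathcal{R}} p(r|x; \theta)\, \frac{\partial \ln p(r|x; \theta)}{\partial x} \frac{\partial \ln p(r|x; \theta)}{\partial x^T}\, dr, \tag{5.3}$$

$r \in \mathcal{R}$, $\theta \in \Theta \subseteq \mathbb{R}^{\tilde{K}}$, and $p(\theta)$ is the population density function of the parameter vector $\theta$:

$$p(\theta) = \frac{1}{N} \sum_{n=1}^{N} \delta(\theta - \theta_n), \tag{5.4}$$

with $\delta(\cdot)$ being the Dirac delta function.

Proof.

$$J(x) = \int_{\mathcal{R}} p(r|x)\, \frac{\partial \ln p(r|x)}{\partial x} \frac{\partial \ln p(r|x)}{\partial x^T}\, dr = \sum_{n=1}^{N} \int p(r_n|x; \theta_n)\, \frac{\partial \ln p(r_n|x; \theta_n)}{\partial x} \frac{\partial \ln p(r_n|x; \theta_n)}{\partial x^T}\, dr_n = \int_{\Theta} \sum_{n=1}^{N} \delta(\theta - \theta_n) \left( \int p(r|x; \theta)\, \frac{\partial \ln p(r|x; \theta)}{\partial x} \frac{\partial \ln p(r|x; \theta)}{\partial x^T}\, dr \right) d\theta = N \int_{\Theta} p(\theta)\, S(x; \theta)\, d\theta. \tag{5.5}$$

Remark 8. Proposition 1 shows that $J(x)$ can be regarded as a function of the population density of parameters, $p(\theta)$. If the p.d.f. of the input $p(x)$ is given, we can find an appropriate $p(\theta)$ to maximize the MI $I$.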

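For a neuron model with Poisson spikes, we have

$$p(r|x) = \prod_{n=1}^{N} p(r_n|x; \theta_n), \tag{5.6}$$
$$p(r_n|x; \theta_n) = \frac{f(x; \theta_n)^{r_n}}{r_n!} \exp\left(-f(x; \theta_n)\right), \tag{5.7}$$

where $f(x; \theta_n)$ is the tuning curve of the $n$th neuron, $n = 1, 2, \ldots, N$. Now we have

$$S(x; \theta) = \int p(r|x; \theta)\, \frac{\partial \ln p(r|x; \theta)}{\partial x} \frac{\partial \ln p(r|x; \theta)}{\partial x^T}\, dr = \frac{1}{f(x; \theta)} \frac{\partial f(x; \theta)}{\partial x} \frac{\partial f(x; \theta)}{\partial x^T} = \frac{\partial g(x; \theta)}{\partial x} \frac{\partial g(x; \theta)}{\partial x^T}, \tag{5.8}$$
$$g(x; \theta) = 2\sqrt{f(x; \theta)}. \tag{5.9}$$

Similarly, for a neuron response model with gaussian noise, we have

$$p(r|x) = \prod_{n=1}^{N} p(r_n|x; \theta_n), \tag{5.10}$$
$$p(r_n|x; \theta_n) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(r_n - f(x; \theta_n))^2}{2\sigma^2}\right), \tag{5.11}$$

where $\sigma$ is a constant standard deviation. Now we get

$$S(x; \theta) = \frac{1}{\sigma^2} \frac{\partial f(x; \theta)}{\partial x} \frac{\partial f(x; \theta)}{\partial x^T}. \tag{5.12}$$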
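The closed form in equation 5.8 can be verified numerically for a one-dimensional input by summing the squared score function over spike counts. The sketch below assumes a gaussian-shaped tuning curve with a baseline rate; that tuning curve, its parameter values, and the function names are hypothetical choices made for illustration.

```python
import numpy as np

def tuning_curve(x, theta, baseline=5.0, amplitude=20.0, width=1.0):
    """Hypothetical tuning curve f(x; theta) and its derivative df/dx."""
    bump = amplitude * np.exp(-0.5 * ((x - theta) / width) ** 2)
    f = baseline + bump
    df = -(x - theta) / width**2 * bump
    return f, df

def poisson_S_numeric(f, df, r_max=400):
    """Evaluate S = sum_r p(r|x) (d ln p(r|x) / dx)^2 for a Poisson count with mean f."""
    r = np.arange(r_max)
    # ln r! for r = 0, ..., r_max - 1
    log_factorial = np.concatenate(([0.0], np.cumsum(np.log(np.arange(1, r_max)))))
    log_p = r * np.log(f) - f - log_factorial    # ln p(r|x), equation 5.7
    score = (r / f - 1.0) * df                   # d ln p(r|x) / dx
    return np.sum(np.exp(log_p) * score**2)

if __name__ == "__main__":
    f, df = tuning_curve(0.7, theta=0.0)
    print(poisson_S_numeric(f, df), df**2 / f)   # the two values should agree closely
```

The truncation `r_max` only needs to be large compared with the mean rate, since the Poisson tail beyond it is negligible.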

5.2. Optimal Population Distribution for Neural Population Coding.

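Suppose $p(x)$ and $p(r|x)$ fulfill conditions C1 and C2 and equation 5.1. Following the discussion in section 2.2, we define the following objective for maximizing the MI $I$,

$$\text{maximize} \quad I_G[p(\theta)] = \frac{1}{2}\left\langle \ln\left(\det\left(\frac{G(x)}{2\pi e}\right)\right) \right\rangle_x + H(X), \tag{5.13}$$

or, equivalently,

$$\text{minimize} \quad Q_G[p(\theta)] = -\frac{1}{2}\left\langle \ln\left(\det(G(x))\right) \right\rangle_x, \tag{5.14}$$

where

$$G(x) = J(x) + P(x), \tag{5.15}$$
$$J(x) = N \int_{\Theta} p(\theta)\, S(x; \theta)\, d\theta, \tag{5.16}$$
$$S(x; \theta) = \left\langle \frac{\partial \ln p(r|x; \theta)}{\partial x} \frac{\partial \ln p(r|x; \theta)}{\partial x^T} \right\rangle_{r|x; \theta}. \tag{5.17}$$

Here $P(x)$ is given in equation 2.15, and it generally can be substituted by $P_+$ (see equation 2.78).

When $\varsigma_1 \approx 0$ (see equation 2.64), the objective function, equation 5.13, can be reduced to

$$\text{maximize} \quad I_F[p(\theta)] = \frac{1}{2}\left\langle \ln\left(\det\left(\frac{J(x)}{2\pi e}\right)\right) \right\rangle_x + H(X), \tag{5.18}$$

or, equivalently,

$$\text{minimize} \quad Q_F[p(\theta)] = -\frac{1}{2}\left\langle \ln\left(\det(J(x))\right) \right\rangle_x. \tag{5.19}$$

The constraint condition for $p(\theta)$ is given by

$$\text{subject to} \quad \int_{\Theta} p(\theta)\, d\theta = 1, \quad p(\theta) \geq 0. \tag{5.20}$$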

However, without further constraints on the neural populations, especially a limit on the peak firing rate, the capacity of the system may grow indefinitely: $I(X; R) \to \infty$. The most common limitation on neural populations is the energy or power constraint. For neuron models with Poisson noise or gaussian noise, a useful constraint is a limitation on the peak power,

$$|f(x; \theta_n)| \leq E_{max}, \quad \forall x \in \mathcal{X} \text{ and } \forall n = 1, 2, \ldots, N, \tag{5.21}$$

where $E_{max} > 0$ is the peak power. Under this constraint, maximizing $I_G[p(\theta)]$ or $I_F[p(\theta)]$ for independent neurons will result in $\max_x |f(x; \theta_n)| = E_{max}$ for all $n = 1, 2, \ldots, N$.

Another constraint is a limitation on average power. For Poisson neurons given in equation 5.7,

$$\frac{1}{N} \sum_{n=1}^{N} \left\langle \left\langle r_n \right\rangle_{r_n|x} \right\rangle_x \leq E_{avg}, \tag{5.22}$$

which can also be written as

$$\left\langle \left\langle f(x; \theta) \right\rangle_x \right\rangle_\theta \leq E_{avg}, \tag{5.23}$$

and for gaussian noise neurons given in equation 5.11,

$$\left\langle \left\langle f(x; \theta)^2 \right\rangle_x \right\rangle_\theta \leq E_{avg}, \tag{5.24}$$

where $E_{avg} > 0$ is the maximum average energy cost.

In equation 5.16, we can approximate the continuous integral by a discrete summation for numerical computation,

$$J(x) = N \sum_{k=1}^{K_1} \alpha_k\, S(x; \theta_k), \tag{5.25}$$

where the positive integer $K_1 \in \mathbb{N}$ denotes the number of subclasses in the neural population and

$$\sum_{k=1}^{K_1} \alpha_k = 1, \quad \alpha_k > 0, \quad \forall k = 1, 2, \ldots, K_1. \tag{5.26}$$

If we do not know the specific form of $p(x)$ but have $M$ samples, $x_1, x_2, \ldots, x_M$, which are i.i.d. samples drawn from the distribution $p(x)$, then we can approximate the integral in equation 5.13 by the sample average:

$$\left\langle \ln\left(\det(G(x))\right) \right\rangle_x \approx \frac{1}{M} \sum_{m=1}^{M} \ln\left(\det(G(x_m))\right). \tag{5.27}$$

Optimizing the objective 5.13 or 5.18 is a convex optimization problem (see the appendix for a proof).

Proposition 2. The functionals $I_G[p(\theta)]$ and $I_F[p(\theta)]$ are concave in $p(\theta)$.
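Because the objective is concave in the weights $\alpha_k$ of equation 5.25, even a simple first-order method on the probability simplex can be used for small problems. The sketch below combines the discretization 5.25 with the sample average 5.27 and runs exponentiated-gradient ascent with a step-halving safeguard. This particular optimizer is a choice made here for illustration, not the method of the article, and the function names and the gaussian tuning curves are hypothetical.

```python
import numpy as np

def objective_and_grad(alpha, S, P, N):
    """Sample-average objective 0.5 * <ln det G(x)>_x and its gradient in alpha.

    S has shape (K1, M, K, K) with S[k, m] = S(x_m; theta_k);
    P has shape (K, K) and stands in for P(x) (or P_+).
    """
    G = N * np.einsum("k,kmij->mij", alpha, S) + P       # G(x_m), shape (M, K, K)
    _, logdet = np.linalg.slogdet(G)
    G_inv = np.linalg.inv(G)
    # d/d alpha_k = 0.5 * mean_m Tr(N G(x_m)^{-1} S(x_m; theta_k)), cf. equation 5.30
    grad = 0.5 * N * np.einsum("mij,kmji->k", G_inv, S) / S.shape[1]
    return 0.5 * logdet.mean(), grad

def optimize_alpha(S, P, N, steps=300, lr=0.5):
    """Exponentiated-gradient ascent over the simplex {alpha >= 0, sum(alpha) = 1}."""
    alpha = np.full(S.shape[0], 1.0 / S.shape[0])
    value, grad = objective_and_grad(alpha, S, P, N)
    for _ in range(steps):
        trial = alpha * np.exp(lr * (grad - grad.max()))  # multiplicative update, stays nonnegative
        trial /= trial.sum()                               # renormalize onto the simplex
        new_value, new_grad = objective_and_grad(trial, S, P, N)
        if new_value >= value:
            alpha, value, grad = trial, new_value, new_grad
        else:
            lr *= 0.5                                      # reject the step and shrink it
    return alpha, value

def gaussian_tuning_S(x, centers, width=1.0, sigma=1.0):
    """S(x; theta_k) of equation 5.12 for 1-D gaussian tuning curves (hypothetical choice)."""
    diff = x[None, :] - centers[:, None]
    f = np.exp(-0.5 * (diff / width) ** 2)
    df = -diff / width**2 * f
    return (df**2 / sigma**2)[:, :, None, None]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(500)                          # samples from p(x) = N(0, 1)
    S = gaussian_tuning_S(x, centers=np.linspace(-3, 3, 7))
    alpha, value = optimize_alpha(S, P=np.eye(1), N=100)
    print(np.round(alpha, 3), value)
```

Accepting a step only when the objective does not decrease guarantees that the returned value is at least as good as the uniform starting point; concavity means there are no spurious local maxima to worry about.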

Remark 9. For a low-dimensional input x, we may use equation 5.18 or 5.19 as the objective. Since IG[p(θ)] and IF[p(θ)] are concave functions of p(θ), we can directly use efficient numerical methods to get the optimal solution for small K. However, for high-dimensional input x, we need to use other methods (e.g., Huang & Zhang, 2017).

5.3. Necessary and Sufficient Conditions for Optimal Population Distribution.

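Applying the method of Lagrange multipliers for the optimization problems 5.13 and 5.20 yields

$$L[p(\theta)] = I_G[p(\theta)] - \lambda_1 \left( \int_{\Theta} p(\theta)\, d\theta - 1 \right) + \int_{\Theta} \lambda_2(\theta)\, p(\theta)\, d\theta, \tag{5.28}$$

where $\lambda_1$ is a constant and $\lambda_2(\theta)$ is a function of $\theta$. According to the Karush-Kuhn-Tucker (KKT) conditions (Boyd & Vandenberghe, 2004), we have

$$\lambda_2(\theta)\, p(\theta) = 0, \quad \lambda_2(\theta) \geq 0, \tag{5.29}$$

and the necessary condition for the optimal population density,

$$\frac{\partial L[p(\theta)]}{\partial p(\theta)} = \frac{1}{2}\left\langle \mathrm{Tr}\left(N G(x)^{-1} S(x; \theta)\right) \right\rangle_x - \lambda_1 + \lambda_2(\theta) = 0. \tag{5.30}$$

It follows from equations 5.29 and 5.30 that

$$\frac{1}{2}\left\langle \mathrm{Tr}\left(N G(x)^{-1} S(x; \theta)\right) \right\rangle_x = \lambda_1, \quad p(\theta) \neq 0, \tag{5.31}$$
$$\frac{1}{2}\left\langle \mathrm{Tr}\left(N G(x)^{-1} S(x; \theta)\right) \right\rangle_x = \lambda_1 - \lambda_2(\theta), \quad p(\theta) = 0. \tag{5.32}$$

Since $I_G[p(\theta)]$ is a concave function of $p(\theta)$, equations 5.31 and 5.32 are the necessary and sufficient conditions for the optimization problems 5.13 and 5.20.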

5.4. Channel Capacity for Neural Population Coding.

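If $p(x)$ is unknown, then by Jensen’s inequality, we have

$$I \simeq I_G[p(x)] = \int_{\mathcal{X}} p(x) \ln\left( p(x)^{-1} \det\left(\frac{G(x)}{2\pi e}\right)^{\frac{1}{2}} \right) dx \leq \ln \int_{\mathcal{X}} \det\left(\frac{G(x)}{2\pi e}\right)^{\frac{1}{2}} dx, \tag{5.33}$$

and the equality holds if and only if $p(x)^{-1} \det(G(x))^{1/2}$ is a constant. Thus,

$$I_G[p^*(x)] = \max_{p(x)} \left( I_G[p(x)] \right) = \ln \int_{\mathcal{X}} \det\left(\frac{G(x)}{2\pi e}\right)^{\frac{1}{2}} dx, \tag{5.34}$$
$$p^*(x) = \frac{\det(G(x))^{\frac{1}{2}}}{\int_{\mathcal{X}} \det(G(\hat{x}))^{\frac{1}{2}}\, d\hat{x}}, \tag{5.35}$$

assuming $\int_{\mathcal{X}} \det(G(\hat{x}))^{\frac{1}{2}}\, d\hat{x} < \infty$.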

Let us consider a specific example. Suppose $J(x) = J_0$ is a constant matrix; then it follows from equation 2.12 that

$$I_G = \frac{1}{2}\left\langle \ln\left(\det\left(\frac{J_0 + P(x)}{2\pi e}\right)\right) \right\rangle_x + H(X). \tag{5.36}$$

According to the maximum entropy probability distribution, we know that maximizing $H(X)$ results in a uniformly distributed $p(x)$. Hence we have $G(x) = J_0$, and $p^*(x)$ coincides with the uniform distribution (see equation 5.35). In this case, the maximum $I_G[p^*(x)]$ can be regarded as the channel capacity for this neural population.

If we consider a constraint on the random variable $X$ and assume that the covariance matrix of $X$ is $\Sigma_0$ and satisfies

$$\Sigma_0^{-1} = P(x), \tag{5.37}$$

then it follows from the maximum entropy probability distribution that

$$H(X) \leq \frac{1}{2}\ln\left(\det\left(2\pi e \Sigma_0\right)\right), \tag{5.38}$$

and the equality holds if and only if the p.d.f. of the input is a normal distribution: $p(x) = \mathcal{N}(\mu, \Sigma_0)$. Hence,

$$I_G = \frac{1}{2}\ln\left(\det\left(\frac{J_0 + \Sigma_0^{-1}}{2\pi e}\right)\right) + H(X) \leq \frac{1}{2}\ln\left(\det\left(\Sigma_0 J_0 + I_K\right)\right) = I_G[p^*(x)], \tag{5.39}$$

where $I_G[p^*(x)]$ is the channel capacity of the neural population. Here the equality holds if and only if $p(x) = \mathcal{N}(\mu, \Sigma_0)$, which is consistent with equation 5.37.

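Furthermore, if $\varsigma_1 \approx 0$ (see equation 2.64), we have

$$I \simeq I_G[p(x)] \simeq I_F[p(x)] = \int_{\mathcal{X}} p(x) \ln\left( p(x)^{-1} \det\left(\frac{J(x)}{2\pi e}\right)^{\frac{1}{2}} \right) dx. \tag{5.40}$$

Similarly, we also get

$$I_F[p^*(x)] = \max_{p(x)} \left( I_F[p(x)] \right) = \ln \int_{\mathcal{X}} \det\left(\frac{J(x)}{2\pi e}\right)^{\frac{1}{2}} dx, \tag{5.41}$$
$$p^*(x) = \frac{\det(J(x))^{\frac{1}{2}}}{\int_{\mathcal{X}} \det(J(\hat{x}))^{\frac{1}{2}}\, d\hat{x}}, \tag{5.42}$$

assuming $\int_{\mathcal{X}} \det(J(\hat{x}))^{\frac{1}{2}}\, d\hat{x} < \infty$. Here $I_F[p^*(x)]$ is the channel capacity of the neural population. The distribution $p^*(x)$ coincides with the Jeffreys prior in Bayesian probability (Jeffreys, 1961). In this case, if we suppose the covariance matrix of $X$ is $\Sigma_0$, then similar to equations 5.38 and 5.39, we can get the channel capacity

$$I_F[p^*(x)] = \frac{1}{2}\ln\left(\det\left(\Sigma_0 J_0\right)\right) \tag{5.43}$$

with $p(x) = \mathcal{N}(\mu, \Sigma_0)$.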

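For another example, consider the Poisson neuron model given in equation 5.7 and suppose the input $x$ is one-dimensional, $K = 1$. It follows from equations 5.8 and 5.42 that

$$p^*(x) = \frac{\left( \int_{\Theta} p(\theta) \left( \frac{\partial g(x; \theta)}{\partial x} \right)^2 d\theta \right)^{\frac{1}{2}}}{\int_{\mathcal{X}} \left( \int_{\Theta} p(\theta) \left( \frac{\partial g(\hat{x}; \theta)}{\partial \hat{x}} \right)^2 d\theta \right)^{\frac{1}{2}} d\hat{x}}. \tag{5.44}$$

If $p(\theta) = \delta(\theta - \theta_0)$, equation 5.44 becomes

$$p^*(x) = \frac{\left| \frac{\partial g(x; \theta_0)}{\partial x} \right|}{\int_{\mathcal{X}} \left| \frac{\partial g(\hat{x}; \theta_0)}{\partial \hat{x}} \right| d\hat{x}}. \tag{5.45}$$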
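Equation 5.45 says that, for a single Poisson tuning curve, the capacity-achieving input density is proportional to the slope of $g = 2\sqrt{f}$. A minimal numerical sketch on a grid follows; the sigmoidal tuning curve, the grid, and the function name `optimal_input_density` are assumptions made for illustration, and the derivative is taken by finite differences.

```python
import numpy as np

def optimal_input_density(x_grid, f_values):
    """Approximate p*(x) of equation 5.45 on a grid, given tuning-curve values f(x; theta_0)."""
    g = 2.0 * np.sqrt(f_values)                      # g = 2 sqrt(f), equation 5.9
    slope = np.abs(np.gradient(g, x_grid))           # |dg/dx| by finite differences
    # trapezoidal rule for the normalizing integral over the grid
    area = np.sum(0.5 * (slope[1:] + slope[:-1]) * np.diff(x_grid))
    return slope / area

if __name__ == "__main__":
    x = np.linspace(-6.0, 6.0, 601)
    f = 1.0 + 30.0 / (1.0 + np.exp(-x))              # hypothetical sigmoidal tuning curve
    p_star = optimal_input_density(x, f)
    print(x[np.argmax(p_star)])                       # location of the density peak
```

For a monotonic tuning curve this is a form of histogram equalization in the variable $\sqrt{f}$: inputs are concentrated where the square root of the firing rate changes fastest, which lies below the steepest point of $f$ itself.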

Atick and Redlich (1990) presented a redundancy measure to approximate Barlow’s optimality principle:

$$\mathcal{R} = 1 - \frac{I(X; R)}{C(R)}, \tag{5.46}$$

where $C(R)$ is the channel capacity. Here for neural population coding, we have $C(R) \simeq I_G[p^*(x)]$ and $I(X; R) \simeq I_G$ (or $C(R) \simeq I_F[p^*(x)]$ and $I(X; R) \simeq I_F$). Hence, we can minimize $\mathcal{R}$ by choosing an appropriate $J(x)$ to maximize $I_G$ (or $I_F$) and simultaneously satisfy equation 5.35 (or 5.42) (see Huang & Zhang, 2017, for further details).

6. Discussion

In this article, we have derived several information-theoretic bounds and approximations for effective approximation of MI in the context of neural population coding for large but finite population size. We have found some regularity conditions under which the asymptotic bounds and approximations hold. Generally these regularity conditions are easy to meet. Special examples that satisfy these conditions include the cases when the likelihood function $p(r|x)$ for the neural population responses is conditionally independent or has correlated noises with a multivariate gaussian distribution. Under the general regularity conditions, we have derived several asymptotic bounds and approximations of MI for a neural population and found some relationships among different approximations.

How should one choose among these different asymptotic approximations of MI in a neural population with finite size $N$? For a flat prior distribution $p(x)$, we have $I_G \simeq I_F$; that is, the two approximations $I_G$ and $I_F$ are about equally valid. For a sharply peaked prior distribution $p(x)$, $I_G$ is generally a better approximation to the MI $I$ than $I_F$. Under suitable conditions (e.g., cases C1 and C2) for low-dimensional inputs, $I_G$ and $I_F$ are good approximations of the MI $I$ not only for large $N$ but also for small $N$. For high-dimensional inputs, the FI matrix $J(x)$ (see equation 2.11) or the matrix $P^{-1}(x)$ (see equation 2.15) often becomes degenerate, which causes a large error between $I_F$ and the MI $I$. Hence, in this situation, $I_G$ is a better approximation to the MI $I$ than $I_F$. For more convenient computation of the approximation, we have also introduced the approximation formula $I_{G_+}$, which may substitute for $I_G$ as a proxy of the MI $I$. For some special cases (see corollary 1), $I_G$ and $I_{G_+}$ are strictly equal to the true MI $I$. Our simulation results for the one-dimensional case show that the approximations $I_G$, $I_{G_+}$, and $I_F$ are all highly precise compared with the true MI $I$, even for small $N$ (see Figure 1).

These approximation formulas satisfy additional constraints. By the Cramér-Rao lower bound, we know that IF is related to the covariance matrix of an unbiased estimator (see equation 3.3). By van Trees’ Bayesian Cramér-Rao bound, we get a link between IG+ and the covariance matrix of a biased estimator (see equation 3.9). From the point of view of neural population decoding and Bayesian inference, there is a connection between MI (or IG) and MAP (see equation 3.17).

For more efficient calculation of the approximation IG (or IG+ ) for high-dimensional inputs, we propose to apply an invertible transformation on the input variable so as to make the new variable closer to a normal distribution (see section 4.1). Another useful technique is dimensionality reduction, which effectively approximates MI by further reducing the computational complexity for high-dimensional inputs. We found that IF could lead to huge errors as a proxy of the true MI I for high-dimensional inputs even when IG and IG+ are strictly equal to the true MI I.

These approximation formulas are potentially useful for optimization problems of information transfer in neural population coding. We have proven that optimizing the population density distribution of parameters p(θ) is a convex optimization problem and have found a set of necessary and sufficient conditions. The approximation formulas are also useful for discussion of the channel capacity of neural population coding (see section 5.4).

Information theory is a powerful tool for neuroscience and other disciplines, including diverse fields such as physics, information and communication technology, machine learning, computer vision, and bioinformatics. Finding effective approximation methods for computing MI is a key for many practical applications of information theory. Generally the FI matrix is easier to evaluate or approximate than MI. This is because calculation of MI involves averaging over both the input variable $x$ and the output variable $r$ (see equation 2.1), and typically $p(r)$ also needs to be calculated from $p(r|x)$ by another average over $x$ (see equation 2.2). By contrast, the FI matrix $J(x)$ involves averaging over $r$ only (see equation 2.13). Furthermore, it is often easier to find analytical forms of FI for specific models such as a population of tuning curves with Poisson spike statistics. Taking into account the computational efficiency, for practical applications we suggest using $I_G$ or $I_{G_+}$ as a proxy of the true MI $I$ for most cases. These approximations could be very useful even when we do not need to know the exact value of MI. For example, for some optimization and learning problems, we only need to know how MI is affected by the conditional p.d.f. or likelihood function $p(r|x)$. In such situations, we may easily solve for the optimal parameters using the approximation formulas (Huang & Zhang, 2017; Huang, Huang, & Zhang, 2017). Further discussions of the applications will be given in separate publications.

Acknowledgments

This work was supported by an NIH grant R01 DC013698.

Appendix: The Proofs

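We consider a Taylor expansion of $L(r|\hat{x})$ around $x$. If $L(r|\hat{x})$ is twice differentiable for $\hat{x} \in \mathcal{X}_\omega(x)$, then by condition C1 we get

$$L(r|\hat{x}) - L(r|x) = (\hat{x} - x)^T L'(r|x) + \frac{1}{2}(\hat{x} - x)^T L''(r|\breve{x})(\hat{x} - x) = y^T \tilde{v} - \frac{1}{2} y^T y + \frac{1}{2} y^T B y, \tag{A.1}$$

where

$$y = G^{\frac{1}{2}}(x)(\hat{x} - x), \tag{A.2}$$
$$\tilde{v} = v + v_1, \quad v = G^{-\frac{1}{2}}(x)\, l'(r|x), \quad v_1 = G^{-\frac{1}{2}}(x)\, q'(x), \tag{A.3}$$
$$\breve{x} = x + t(\hat{x} - x) \in \mathcal{X}_\omega(x), \quad t \in (0, 1), \tag{A.4}$$
$$\begin{cases} B = G^{-\frac{1}{2}}(x)\, C\, G^{-\frac{1}{2}}(x) = B_0 + B_1 + B_2, \\ C = C_0 + C_1 + C_2, \end{cases} \tag{A.5}$$

and

$$\begin{cases} B_0 = G^{-\frac{1}{2}}(x)\, C_0\, G^{-\frac{1}{2}}(x), \\ B_1 = G^{-\frac{1}{2}}(x)\, C_1\, G^{-\frac{1}{2}}(x), \\ B_2 = G^{-\frac{1}{2}}(x)\, C_2\, G^{-\frac{1}{2}}(x), \\ C_0 = l''(r|x) - \left\langle l''(r|x) \right\rangle_{r|x}, \\ C_1 = l''(r|\breve{x}) - l''(r|x), \\ C_2 = q''(\breve{x}) - q''(x). \end{cases} \tag{A.6}$$

By condition C1, we know that the matrix $B_1 + B_2$ is continuous and symmetric for $\breve{x} \in \mathcal{X}_\omega(x)$ and $\|B_1 + B_2\| = O(1)$. By the definition of continuous functions, we can prove the following: for any $\epsilon \in (0, 1)$, there is an $\varepsilon \in (0, \omega)$ such that for all $y \in \mathcal{Y}_\varepsilon$,

$$-\epsilon I_K \leq B_1 + B_2 \leq \epsilon I_K, \tag{A.7}$$

where

$$\mathcal{Y}_\varepsilon = \left\{ y \in \mathbb{R}^K : \|y\| < \varepsilon\sqrt{N} \right\}. \tag{A.8}$$

Hence,

$$\left| y^T (B_1 + B_2)\, y \right| < \epsilon \|y\|^2. \tag{A.9}$$

Here $\breve{x} = x + t\, G^{-\frac{1}{2}}(x)\, y$, $\varepsilon$ is a function of $r$, $\varepsilon = \varepsilon(r) = O(1)$, and

$$\mathcal{Y}_\varepsilon \subseteq \mathcal{Y}_\omega = \left\{ y \in \mathbb{R}^K : \|y\| < \omega\sqrt{N} \right\}. \tag{A.10}$$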

We define the sets

{Yε={yRK:yεN},Zε^={zRK:zk<ε^NK,k=1,2,,K},Zε^={zRK:zkε^NK,k=1,2,,K},Z~ε={zRK:z+v~1Rε^<εN},} (A.11)

where

ε^=ε2, (A.12)

1(·) denotes an indicator random variable,

1Rε^={1,rRε^(x)0,rRε^(x)},1Rε^={1,rRε^(x)0,rRε^(x)}, (A.13)

and

{Rε^(x)={rR:v~<ε^N},Rε^(x)={rR:v~ε^N}.} (A.14)

For all zZε^, we have z+v~1Rε^2z2+v1Rε^2<εN;; then

Zε^Z~ε. (A.15)

It follows from equations A.3 and A.6 that

vrx=0,B0rx=0 (A.16)

and

v~Tv~rxx=L(rx)TG1(x)L(rx)rxx=Tr(L(rx)L(rx)TrxG1(x))x=K+ζ=K+O(N1), (A.17)

and it follows from condition C1 that

ζ=Tr(1p(x)2p(x)xxTG1(x))x=Tr((q(x)Tq(x)+q(x))G1(x))x=N1(q(x)Tq(x)+q(x))NG1(x)x=O(N1). (A.18)

Combining conditions C1 and C2 and equations A.3, A.4, and A.6, we find

{B02mrxN1C02mNG1(x)2mrxx=O(N1),B02m+1rxNG1(x)2m+1N1C02rx12N1C04mrx12x=O(N1),v2m0rxN1l(rx)Tl(rx)m0rxNG1(x)m0=O(1),v12m0N1q(x)Tq(x)m0NG1(x)m0=O(Nm0),} (A.19)

together with the power mean inequality,

(v~Tv~)m0rx(v+v1)2m0rx22m01v2m0+v12m0rx=O(1), (A.20)

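where $m \in \mathbb{N}$, $m_0 \in \{1, 2\}$. Notice that $\|G^{-1}(x)\| = O(N^{-1})$. Here we note that for all conformable matrices $A$ and $B$,

$$\begin{cases} |\mathrm{Tr}(AB)| \leq \|A\| \|B\|, \\ \|AB\| \leq \|A\| \|B\|. \end{cases} \tag{A.21}$$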

By equation 2.25c, we have

Tr(N1J(x))2=N1l(rx)Tl(rx)rx2(N1l(rx)Tl(rx))2rx=O(1). (A.22)

Then it follows from equations 2.25b and A.22 that

$$\det(G(x)) = O(N^K). \tag{A.23}$$

A.1 Proof of Lemma 1. It follows from equation A.1 that

Γω=lnXω(x)exp(L(rx^)L(rx))dx^rxx=12ln(det(G(x)))x+ln(Yωexp(yTv~12yTy+12yTBy)dy)rxxΓ^ω. (A.24)

For yYε, according to the definitions in equations A.13 and A.14, we have

yTv~1Rε^yv~1Rε^(Nε2)12v~1Rε^2v~Tv~1Rε^. (A.25)

Then by condition C1, we get

v~Tv~1Rε^rxv~4(ε^N)2rxN1(ε^0)2v~4rx=O(N1), (A.26)

where ε^0 is a positive constant and ε^0[minε^(r),maxε^(r)]. By equations A.9, A.17, and A.24, we get

Γ^ωln(Yεexp(yTv~12(1+ϵ)yTy+12yTB0y)dy)rxxln(Zε^exp(12(z+v~1Rε^1+ϵ)TB0(z+v~1Rε^1+ϵ))ϕε^(z)dz)rxx+ln(Ψε^)+v~Tv~2(1+ϵ)25v~Tv~1Rε^2(1+ϵ)2rxx12(Zε^(z+v~1Rε^1+ϵ)TB0(z+v~1Rε^1+ϵ)ϕε^(z)dz)rxx+ln(Ψε^)rxx+K+ζ2(1+ϵ)2+O(N1), (A.27)

where z=yv~1Rε^(x), the last step in equation A.27 follows from Jensen’s inequality, and

{ϕε^(z)=Ψε^1exp(1+ϵ2zTz),Ψε^=Zε^exp(1+ϵ2zTz)dz.} (A.28)

Integrating by parts yields

1Zε^zZε^(1+ϵ2π)K2exp(1+ϵ2zTz)dz=O(NK2eNδ) (A.29)

and

(2π1+ϵ)K2Ψε^(2π1+ϵ)K2(1O(NK2eNδ)) (A.30)

for some δ > 0.

Then from equation A.27, we get

(Zε^(z+v~1Rε^1+ϵ)TB0(z+v~1Rε^1+ϵ)ϕε^(z)dz)rxx=(2π1+ϵ)K2Ψε^1zTB0z1Zε^z+v~TB02v~1Zε^1Rε^(1+ϵ)2rxx(2π1+ϵ)K2Ψε^1zTB0z1Zε^zrxxO(N1), (A.31)

where

{z=RK()ϕ0(z)dz,ϕ0(z)=(1+ϵ2π)K2exp(1+ϵ2zTz).} (A.32)

Here, notice that

(2π1+ϵ)K2Ψε^1=1+O(NK2eNα) (A.33)

and

zTB0z1Zε^zrxx=zTB0z1Zε^zrxxB02rx12z41Zε^zrx12x=O(N1). (A.34)

Hence, from the consideration above, we find

Γ^ωK2ln(2π1+ϵ)+K2(1+ϵ)2+O(N1). (A.35)

Since ϵ is arbitrary, let it go to zero. Thus, combining equations A.24 and A.35 yields

Γω=12ln(det(G(x)2πe))x+O(N1). (A.36)

Considering

lnp(r)p(rx)p(x)rxxΓω, (A.37)

and combining equations 2.3 and A.36, we immediately get equation 2.53.

On the other hand, by conditions 2.54a and 2.54b, we have

{v~Tv~1Rε^rxv~22+2τ(ε^N)2τNτ(ε^0)2τv~2+2τrx=o(1),zTB0z1Zε^zrxxB02rx12z41Zε^zrx12x=o(1).} (A.38)

Similarly we can get equation 2.55. This completes the proof of lemma 1. □

A.2 Proof of Lemma 2. Define the sets

Ωϵ(x)={rR:yTB0y<ϵy2,yRK} (A.39)

and

Θϵ(x)={rR:Xε(x)p(rx^)p(x^)p(rx)p(x)dx<ϵdet(G(x))12}, (A.40)

where Xε(x)=XXε(x), assuming ϵ ∈ (0, 1/2) and p(x) > 0.

Then by Markov’s inequality, we have

1ΩϵrxPrx{B02ϵ2}ϵ2B02rx=O(N1), (A.41)

and by equation 2.26b,

1Θϵrx=Prx{Xε(x)p(rx^)p(x^)p(rx)p(x)dx^ϵdet(G(x))12}=Prx{det(G(x))12Xω^(x)p(x^r)dx^>ϵp(rx)}=O(Nη). (A.42)
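The Markov step used for equation A.41 can be illustrated numerically. The sketch below is only an illustration: it does not compute the matrix $B_0$ of the proof but uses a hypothetical nonnegative stand-in whose mean scales as $K/N$, and the names `markov_check`, `eps`, and `trials` are assumptions introduced here.

```python
import numpy as np

rng = np.random.default_rng(7)
K, eps, trials = 3, 0.5, 20_000

def markov_check(N):
    """Empirical tail probability and Markov bound for a nonnegative variable with mean K/N."""
    # Hypothetical stand-in for ||B_0||^2: squared norm of a K-vector with
    # independent N(0, 1/N) entries, so its expectation is exactly K/N.
    sq_norm = np.sum((rng.standard_normal((trials, K)) / np.sqrt(N)) ** 2, axis=1)
    tail = np.mean(sq_norm >= eps ** 2)   # estimate of P{X >= eps^2}
    bound = np.mean(sq_norm) / eps ** 2   # eps^{-2} <X>, the Markov bound
    return tail, bound

results = {N: markov_check(N) for N in (10, 100, 1000)}
for N, (tail, bound) in results.items():
    print(f"N={N:5d}  tail={tail:.4f}  Markov bound={bound:.4f}")
```

For any nonnegative sample the empirical tail frequency can never exceed the empirical Markov bound, and here the bound shrinks roughly like $1/N$, which is the scaling the proof relies on.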

Consider the following equality:

lnp(r)p(rx)p(x)rx=1Θϵlnp(r)p(rx)p(x)rx+1Θϵlnp(r)p(rx)p(x)rx. (A.43)

For the last term in equation A.43, Jensen’s inequality implies that

1Θϵlnp(r)p(rx)p(x)rxx1Θϵrxxln11Θϵrxx=o(N1). (A.44)

For the first term in equation A.43, it follows from equations A.40 and A.9 that

1Θϵlnp(rp(rx)p(xrx1Θϵln(Xε(x)exp(L(rx^)L(rx))dx^+ϵdet(G(x))12)rxK2ln(det(G(x)))+1Θϵln(Yεexp(yTv~12(1+ϵ)yTy+12yTB0y)dy+ϵ)rx. (A.45)

The last term, equation A.45, is upper-bounded by

1ΘϵΩϵln(RKexp(yTv~12(12ϵ)yTy)dy+ϵ)rx (A.46)
+1ΘϵΩϵln(RKexp(yTv~12(1ϵ)yTy+12yTB0y)dy+ϵ)rx. (A.47)

Equation A.46 is equal to

1ΘϵΩϵln((2π12ϵ)K2exp(v~Tv~2(12ϵ))+ϵ)rx1ΘϵΩϵ(v~Tv~2(12ϵ)+ln((2π12ϵ)K2+ϵ))rx. (A.48)

Equation A.47 is equal to

1ΘϵΩϵln((2π1ϵ)K2exp(12(z+v~1ϵ)T))((B0(z+v~1ϵ)+v~Tv~2(1ϵ))z+ϵ)rx1ΘϵΩϵ(K2ln(2π1ϵ)+v~Tv~2(1ϵ)+v~TB02v~2(1ϵ)2+v~TB02v~(1ϵ)3)rx (A.49a)
+1ΘϵΩϵln(exp(12zTB0z+zTB0v~1ϵv~TB02v~(1ϵ)3)z)(+ϵ(1ϵ2π)K2)rx, (A.49b)

where

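$$\begin{cases}\langle\cdot\rangle_z = \displaystyle\int_{\mathbb{R}^K}(\cdot)\,\phi_1(z)\,dz,\\[4pt] \phi_1(z) = \left(\dfrac{1-\epsilon}{2\pi}\right)^{K/2}\exp\left(-\dfrac{1-\epsilon}{2}z^Tz\right).\end{cases} \tag{A.50}$$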

Notice that

1ΘϵΩϵrx1Ωϵrx=O(N1) (A.51)

and

1ΘϵΩϵrx=11ΘϵΩϵrx=1+O(N1). (A.52)

Then by equation A.19, we get

1ΘϵΩϵ(exp(zTB0z)z121)rx1ΘϵΩϵm=01m!(zTB0z)mzrx121ΘϵΩϵrx=O(N1), (A.53)
v~Tv~1Θϵrxv~4rx121Θϵrx12=O(N1), (A.54)

and by equation 2.51,

0v~TB02v~1ΘϵΩΩrxvTB02vrx+O(N1)ξNG1(x)+O(N1)=O(N1). (A.55)

Hence, we have

1Θϵ(K2ln(2π1ϵ)+v~Tv~2(1ϵ))rx=(K2ln(2π1ϵ)+K+ζ2(1ϵ))+O(N1), (A.56)

and by Cauchy-Schwarz inequality and equation A.53, the term A.49b is upper-bounded by

1ΘϵΩϵln(exp(zTB0z)z12exp(2zTB0v~1ϵ2v~TB02v~(1ϵ)3)z12)(+ϵ(1ϵ2π)K2)rx=1ΘϵΩϵln(exp(zTB0z)z12+ϵ(1ϵ2π)K2)rx1ΘϵΩϵ(exp(zTB0z)z12+ϵ(1ϵ2π)K21)rx=O(N1). (A.57)

Since ϵ is arbitrary, we can let it go to zero. Then, taking everything together, we get

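$$\left\langle\left\langle \ln\frac{p(r)}{p(r|x)\,p(x)}\right\rangle_{r|x}\right\rangle_x \le -\frac{1}{2}\left\langle \ln\left(\det\left(\frac{G(x)}{2\pi e}\right)\right)\right\rangle_x + O(N^{-1}). \tag{A.58}$$

Substituting equation A.58 into equation 2.3 yields equation 2.56.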

On the other hand, we have

lnp(r)p(rx)p(x)rxx=1ΘϵΩϵlnp(r)p(rx)p(x)rxx (A.59)
+1ΘϵΩϵlnp(r)p(rx)p(x)rxx+1Θϵlnp(r)p(rx)p(x)rxx. (A.60)

For equation A.60, it follows from Jensen’s inequality that

1Θϵlnp(r)p(rx)p(x)rxx1Θϵrxxln11Θϵrxx=o(1) (A.61)

and

1ΘϵΩϵlnp(r)p(rx)p(x)rxx1ΘϵΩϵrxxln11ΘϵΩϵrxx=o(1), (A.62)

where

{1ΩϵrxP(B02ϵ2)ϵ2B02rx=o(1),1ΘϵΩϵrx1Ωϵrx=o(1).} (A.63)

Similarly we can get equation 2.57. □

A.3 Proof of Theorem 1. By lemmas 1 and 2, we immediately get equation 2.58. The proof of equation 2.59 is similar. □

A.4 Proof of Theorem 2. First, we have

$$G(x) = J^{1/2}(x)\left(I_K + \Psi(x)\right)J^{1/2}(x). \tag{A.64}$$

Since $J(x)$ and $G(x)$ are symmetric and positive-definite, $I_K + \Psi(x)$ is also symmetric and positive-definite. The eigendecomposition of $\Psi(x)$ is given by

$$\Psi(x) = U_x\Lambda_xU_x^T, \tag{A.65}$$

where $U_x \in \mathbb{R}^{K\times K}$ is an orthogonal matrix and $\Lambda_x \in \mathbb{R}^{K\times K}$ is a diagonal matrix whose diagonal holds the $K$ real eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_K > -1$. Then we have

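$$\langle \mathrm{Tr}(\Lambda_x)\rangle_x = \langle \mathrm{Tr}(\Psi(x))\rangle_x = \left\langle \mathrm{Tr}\left(P(x)J^{-1}(x)\right)\right\rangle_x = \varsigma \tag{A.66}$$

and

$$\langle \ln(\det(I_K + \Psi(x)))\rangle_x = \langle \mathrm{Tr}(\ln(I_K + \Lambda_x))\rangle_x \le \langle \mathrm{Tr}(\Lambda_x)\rangle_x = \varsigma. \tag{A.67}$$

Notice that $\ln(1 + x) \le x$ for all $x \in (-1, \infty)$. It follows from equations A.64 and A.67 that

$$\langle \ln(\det(G(x)))\rangle_x - \langle \ln(\det(J(x)))\rangle_x = \langle \ln(\det(I_K + \Psi(x)))\rangle_x \le \varsigma. \tag{A.68}$$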

From equations 2.12, 2.11, and A.68, we obtain equation 2.62.
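The chain A.64 to A.68 can be sanity-checked numerically. The sketch below assumes $G = J + P$ and $\Psi = J^{-1/2}PJ^{-1/2}$ (as implied by equations A.64 and A.66), uses a single fixed pair of matrices instead of an average over $x$, and introduces the helper name `logdet_gap` purely for illustration.

```python
import numpy as np

def logdet_gap(J, P):
    """Return (ln det G - ln det J, Tr(P J^{-1})) for G = J + P."""
    # Symmetric inverse square root of J from its eigendecomposition.
    w, V = np.linalg.eigh(J)
    J_inv_half = V @ np.diag(w ** -0.5) @ V.T
    Psi = J_inv_half @ P @ J_inv_half      # assumed definition of Psi
    lam = np.linalg.eigvalsh(Psi)          # all > -1 when G is positive-definite
    gap = np.sum(np.log1p(lam))            # ln det(I + Psi)
    # Cross-check against the direct difference of log-determinants.
    direct = np.linalg.slogdet(J + P)[1] - np.linalg.slogdet(J)[1]
    assert np.isclose(gap, direct)
    return gap, np.trace(P @ np.linalg.inv(J))

rng = np.random.default_rng(0)
K = 4
A = rng.standard_normal((K, K))
J = A @ A.T + K * np.eye(K)     # positive-definite J
S = rng.standard_normal((K, K))
P = 0.1 * (S + S.T)             # small symmetric P, possibly indefinite
gap, trace_bound = logdet_gap(J, P)
print(gap, trace_bound)
```

The first returned value plays the role of $\ln\det G - \ln\det J$ and the second the role of $\varsigma$ for one stimulus; the inequality $\ln(1+\lambda) \le \lambda$ guarantees the first never exceeds the second.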

If $P(x)$ is positive-semidefinite, then $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_K \ge 0$, $\varsigma \ge 0$, and $\langle \ln(\det(I_K + \Psi(x)))\rangle_x \ge 0$. Hence we can get equation 2.63.

On the other hand, it follows from equations 2.64 and A.67 and the power mean inequality that

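$$|\varsigma| \le \left\langle \sum_{k=1}^{K}|\lambda_k|\right\rangle_x \le \left\langle \sqrt{K}\left(\sum_{k=1}^{K}\lambda_k^2\right)^{1/2}\right\rangle_x = \sqrt{K}\,\langle\|\Psi(x)\|\rangle_x = \sqrt{K}\,\varsigma_1 = O(N^{-\beta}). \tag{A.69}$$

Let $\lambda_k^- = \min(0, \lambda_k)$ for all $k \in \{1, 2, \ldots, K\}$. Then

$$\left\langle \sum_{k=1}^{K}\ln(1 + \lambda_k^-)\right\rangle_x \le \langle \ln(\det(I_K + \Psi(x)))\rangle_x. \tag{A.70}$$

Notice that $-1 < \lambda_k^- \le 0$. Then by equation A.69, we have

$$\left\langle \sum_{k=1}^{K}\ln(1 + \lambda_k^-)\right\rangle_x = -\left\langle \sum_{m=1}^{\infty}\frac{1}{m}\sum_{k=1}^{K}(-\lambda_k^-)^m\right\rangle_x = O(N^{-\beta}). \tag{A.71}$$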

From equations 2.12, 2.11, A.68, A.70, and A.71, we immediately get equation 2.65. □
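The two-sided control of $\ln\det(I_K + \Psi)$ used in equations A.69 to A.71 can be checked for one matrix. This is a minimal sketch under stated assumptions: $\Psi$ is a hypothetical small symmetric matrix, the norm is taken to be the Frobenius norm, and no averaging over $x$ is performed.

```python
import numpy as np

rng = np.random.default_rng(5)
K = 5
S = rng.standard_normal((K, K))
Psi = 0.05 * (S + S.T)                   # hypothetical symmetric Psi with small eigenvalues
lam = np.linalg.eigvalsh(Psi)
lam_minus = np.minimum(0.0, lam)         # lambda_k^- = min(0, lambda_k)

lower = np.sum(np.log1p(lam_minus))      # sum_k ln(1 + lambda_k^-)
middle = np.sum(np.log1p(lam))           # ln det(I + Psi)
upper = np.sum(lam)                      # Tr(Psi)
frob_bound = np.sqrt(K) * np.linalg.norm(Psi, "fro")   # sqrt(K) * ||Psi||_F

# Truncated power series -sum_m (1/m) sum_k (-lambda_k^-)^m for the lower bound.
m = np.arange(1, 200)
series = -np.sum((-lam_minus[:, None]) ** m / m)

print(lower, middle, upper, frob_bound, series)
```

The ordering `lower <= middle <= upper` mirrors equations A.70 and A.67, `abs(upper) <= frob_bound` mirrors the power mean step, and `series` reproduces `lower` to within floating-point accuracy because every $|\lambda_k^-|$ is well below 1.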

A.5 Proof of Theorem 3. Considering the change of variables theorem, for any real-valued function f and invertible transformation T, we have

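$$\int_{\tilde{\mathcal{X}}} f(\tilde{x})\,d\tilde{x} = \int_{\mathcal{X}} f(T(x))\,\left|\det(DT(x))\right|\,dx, \tag{A.72}$$

and for $p(x)$ and $p(\tilde{x})$,

$$p(\tilde{x})\big|_{\tilde{x}=T(x)} = \left|\det(DT(x))\right|^{-1}p(x). \tag{A.73}$$

Then it follows from equations 4.2, A.72, and A.73 that

$$\begin{cases}p(r) = \displaystyle\int_{\mathcal{X}} p(r|x)\,p(x)\,dx = \int_{\tilde{\mathcal{X}}} p(r|\tilde{x})\,p(\tilde{x})\,d\tilde{x},\\[6pt] H(\tilde{X}) = -\displaystyle\int_{\tilde{\mathcal{X}}} p(\tilde{x})\ln p(\tilde{x})\,d\tilde{x} = -\int_{\mathcal{X}} p(x)\ln\left(p(x)\left|\det(DT(x))\right|^{-1}\right)dx = H(X) + \int_{\mathcal{X}} p(x)\ln\left|\det(DT(x))\right|\,dx,\\[6pt] G(x) = DT(x)^T\,G(\tilde{x})\,DT(x).\end{cases} \tag{A.74}$$

Substituting equations A.73 and A.74 into equation 2.1, we can directly obtain equation 4.3. Moreover, if $p(\tilde{x})$ and $p(r|\tilde{x})$ fulfill conditions C1 and C2 and $\xi = O(N^{-1})$, then by theorem 1, we immediately obtain equation 4.4. □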
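The entropy shift and the transformation rule for $G$ in equation A.74 can be verified in a case where everything is available in closed form. The sketch assumes a gaussian $x$, a hypothetical linear map $\tilde{x} = Tx$ (so $DT = T$), a stimulus-independent $G$, and the form $I_G = \frac{1}{2}\ln\det(G/(2\pi e)) + H$; the function names are introduced here for illustration only.

```python
import numpy as np

def logdet(M):
    """Natural log of |det M|."""
    return np.linalg.slogdet(M)[1]

def gaussian_entropy(cov):
    """Differential entropy (nats) of a gaussian: 0.5 * ln det(2*pi*e*cov)."""
    K = cov.shape[0]
    return 0.5 * (K * np.log(2.0 * np.pi * np.e) + logdet(cov))

def info_G(G, entropy):
    """Assumed form of I_G for constant G: 0.5 * ln det(G / (2*pi*e)) + H."""
    K = G.shape[0]
    return 0.5 * (logdet(G) - K * np.log(2.0 * np.pi * np.e)) + entropy

rng = np.random.default_rng(1)
K = 3
L = rng.standard_normal((K, K))
Sigma = L @ L.T + np.eye(K)                        # covariance of x
T = rng.standard_normal((K, K)) + 2.0 * np.eye(K)  # hypothetical linear map, DT = T
M = rng.standard_normal((K, K))
G_tilde = M @ M.T + np.eye(K)                      # hypothetical G in the x_tilde coordinates

H_x = gaussian_entropy(Sigma)
H_tilde = gaussian_entropy(T @ Sigma @ T.T)        # x_tilde ~ N(0, T Sigma T^T)
G_x = T.T @ G_tilde @ T                            # last line of A.74

entropy_shift = logdet(T)                          # ln|det DT|
I_x, I_tilde = info_G(G_x, H_x), info_G(G_tilde, H_tilde)
print(H_tilde - H_x, entropy_shift, I_x, I_tilde)
```

The term $\ln|\det DT|$ gained by the entropy is exactly cancelled by the factor $\det(DT)^2$ picked up by $\det G$, so the two values of $I_G$ agree in this example.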

A.6 Proof of Corollary 1. It follows from equation 2.21 and theorem 3 that

IG=IG+=I(X;R)=I(Y;R)=12ln(det(12πe(AAT+Σf1)))+H(Y) (A.75)

and

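$$H(Y) = \frac{1}{2}\ln\left(\det\left(2\pi e\,\Sigma_f\right)\right) = H(X) + \langle \ln\left|\det(D(x))\right|\rangle_x. \tag{A.76}$$

Here notice that

$$J(x) = \left\langle \frac{\partial \ln p(r|x)}{\partial x}\frac{\partial \ln p(r|x)}{\partial x^T}\right\rangle_{r|x} = \left\langle \frac{\partial y^T}{\partial x}\frac{\partial \ln p(r|y)}{\partial y}\frac{\partial \ln p(r|y)}{\partial y^T}\frac{\partial y}{\partial x^T}\right\rangle_{r|y} = D(x)^TAA^TD(x) \tag{A.77}$$

and

$$P(x) = -\frac{\partial^2 \ln p(x)}{\partial x\,\partial x^T} = -\frac{\partial y^T}{\partial x}\frac{\partial^2 \ln p(y)}{\partial y\,\partial y^T}\frac{\partial y}{\partial x^T} = D(x)^T\Sigma_f^{-1}D(x). \tag{A.78}$$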

Hence, combining equations A.75 to A.78, we can immediately obtain equation 4.9. □
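The chain-rule identity behind equation A.77 can be illustrated by Monte Carlo. The model below is an assumption chosen only so that $J(y) = AA^T$ holds: a linear-gaussian response $r = A^Ty + n$ with $n \sim \mathcal{N}(0, I_N)$ and a hypothetical linear map $y = Dx$; it is not claimed to be the model of equation 2.21.

```python
import numpy as np

rng = np.random.default_rng(6)
K, N, n_samples = 2, 6, 200_000
A = rng.standard_normal((K, N))                     # hypothetical K x N weight matrix
D = rng.standard_normal((K, K)) + 2.0 * np.eye(K)   # Jacobian of the assumed map y = D x

# For r = A^T y + n, the score is d ln p(r|y)/dy = A (r - A^T y) = A n.
noise = rng.standard_normal((n_samples, N))
score_y = noise @ A.T          # each row is (A n)^T
score_x = score_y @ D          # chain rule: each row is (D^T A n)^T

J_y_mc = score_y.T @ score_y / n_samples   # Monte Carlo estimate of J(y)
J_x_mc = score_x.T @ score_x / n_samples   # Monte Carlo estimate of J(x)
J_x = D.T @ A @ A.T @ D                    # closed form D^T A A^T D

print(np.round(J_x_mc, 2))
print(np.round(J_x, 2))
```

The sample estimate satisfies $J(x) = D^TJ(y)D$ exactly by construction and approaches $D^TAA^TD$ as the number of samples grows.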

A.7 Proof of Theorem 4. First, we have

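$$\begin{aligned}\left\langle \ln\left(\det\left(\frac{G(x)}{2\pi e}\right)\right)\right\rangle_x &= \left\langle \ln\left(\det\left(\frac{G_{1,1}(x)}{2\pi e}\right)\det\left(\frac{1}{2\pi e}\left(G_{2,2}(x) - G_{2,1}(x)G_{1,1}^{-1}(x)G_{1,2}(x)\right)\right)\right)\right\rangle_x\\ &= \left\langle \ln\left(\det\left(\frac{G_{1,1}(x)}{2\pi e}\right)\right) + \ln\left(\det\left(\frac{G_{2,2}(x)}{2\pi e}\right)\right) + \ln\left(\det\left(I_{K_2} - A_x\right)\right)\right\rangle_x.\end{aligned} \tag{A.79}$$

Then by the eigendecomposition of $A_x$, we have

$$A_x = U_x\Lambda_xU_x^T, \tag{A.80}$$

where $U_x$ and $\Lambda_x$ are the $K_2 \times K_2$ eigenvector and eigenvalue matrices, respectively. Since $G(x)$, $G_{1,1}(x)$, and $G_{2,2}(x)$ are positive-definite, $I_{K_2} - A_x$ is also positive-definite and $A_x$ is positive-semidefinite, with $0 \le (\Lambda_x)_{k,k} = \lambda_k < 1$ for all $k \in \{1, 2, \ldots, K_2\}$. Moreover, it follows from equation 4.33 that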

{0Tr(Λx)x=Tr(Ax)x1,0Tr(Λxm)x=k=1K2λkmxTr(Λx)x1.} (A.81)

Then by equation A.81, we have

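$$\left\langle \ln\left(\det\left(I_{K_2} - A_x\right)\right)\right\rangle_x = \left\langle \mathrm{Tr}\left(\ln\left(I_{K_2} - \Lambda_x\right)\right)\right\rangle_x = -\sum_{m=1}^{\infty}\frac{1}{m}\left\langle \mathrm{Tr}\left(\Lambda_x^m\right)\right\rangle_x \le 0. \tag{A.82}$$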

Substituting equation A.82 into A.79 and then combining with equation 2.12, we get equation 4.35.

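If equation 4.36 holds, then $A_x = 0$ and $I_G = I_{G_1}$. Conversely, if $I_G = I_{G_1}$, then

$$0 = \left\langle \ln\left(\det\left(I_{K_2} - A_x\right)\right)\right\rangle_x \le -\langle \mathrm{Tr}(A_x)\rangle_x \le 0, \tag{A.83}$$

$A_x = 0$, and equation 4.36 holds. □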
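The block factorization in equation A.79 is the Schur complement identity for determinants, and it can be checked on a random positive-definite matrix. The sketch assumes $A_x = G_{2,2}^{-1/2}G_{2,1}G_{1,1}^{-1}G_{1,2}G_{2,2}^{-1/2}$, which is the form that makes A.79 hold; the block sizes and helper names are arbitrary choices for illustration.

```python
import numpy as np

def inv_sqrt(M):
    """Symmetric inverse square root of a positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

def logdet(M):
    return np.linalg.slogdet(M)[1]

rng = np.random.default_rng(2)
K1, K2 = 3, 2
R = rng.standard_normal((K1 + K2, K1 + K2))
G = R @ R.T + np.eye(K1 + K2)                    # random positive-definite G
G11, G12 = G[:K1, :K1], G[:K1, K1:]
G21, G22 = G[K1:, :K1], G[K1:, K1:]

G22_ih = inv_sqrt(G22)
A_x = G22_ih @ G21 @ np.linalg.inv(G11) @ G12 @ G22_ih   # assumed definition of A_x
lam = np.linalg.eigvalsh(A_x)                    # eigenvalues lie in [0, 1)

lhs = logdet(G)
coupling = np.sum(np.log1p(-lam))                # ln det(I - A_x), never positive
rhs = logdet(G11) + logdet(G22) + coupling
print(lhs, rhs, coupling, -np.trace(A_x))
```

The coupling term is nonpositive and bounded above by $-\mathrm{Tr}(A_x)$, which is the inequality used in equation A.83; it vanishes only when the off-diagonal blocks do.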

A.8 Proof of Theorem 5. Similar to equation A.79, we have

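$$\left\langle \ln\left(\det\left(\frac{G(x)}{2\pi e}\right)\right)\right\rangle_x = \left\langle \ln\left(\det\left(\frac{G_{1,1}(x)}{2\pi e}\right)\right) + \ln\left(\det\left(\frac{P_{2,2}(x)}{2\pi e}\right)\right) + \ln\left(\det\left(I_{K_2} + B_x\right)\right)\right\rangle_x. \tag{A.84}$$

Similar to equation A.65, the eigendecomposition of $B_x$ is given by

$$B_x = U_x\Lambda_xU_x^T, \tag{A.85}$$

where $U_x$ and $\Lambda_x$ are the $K_2 \times K_2$ eigenvector and eigenvalue matrices, respectively. If the matrix $B_x$ is positive-semidefinite and satisfies equation 4.38, then $(\Lambda_x)_{k,k} = \lambda_k \ge 0$ for all $k \in \{1, 2, \ldots, K_2\}$ and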

0ln(det(IK2+Bx))x=k=1K2ln(1+λk)xTr(Λx)x=Tr(Bxx)1. (A.86)

Substituting equation A.86 into A.84, we immediately get equation 4.41. If $C_x = 0$, then $\ln(\det(I_{K_2} + B_x)) = 0$ and $I_G = I_{G_2}$. Conversely, if $I_G = I_{G_2}$, then $\ln(\det(I_{K_2} + B_x)) = 0$, $B_x = 0$, and $C_x = 0$. □

A.9 Proof of Corollary 2. Notice that

$$\begin{cases}H(X) = H(X_1) + H(X_2),\\ H(X_2) = \dfrac{1}{2}\ln\left(\det\left(2\pi e\,\Sigma_{x_2}\right)\right),\\ P_{2,1}(x) = P_{1,2}(x) = 0,\\ P_{2,2}(x) = \Sigma_{x_2}^{-1},\end{cases} \tag{A.87}$$

and the matrices

$$C_x = J_{2,2}(x) - J_{2,1}(x)\,G_{1,1}^{-1}(x)\,J_{1,2}(x), \tag{A.88}$$

$$B_x = P_{2,2}^{-1/2}(x)\,C_x\,P_{2,2}^{-1/2}(x) \tag{A.89}$$

are positive-semidefinite, and the proof is similar to equation 4.74. Then by theorem 5, we immediately get equation 4.41. Substituting equation A.87 into 4.41 yields equation 4.44 with strict equality if and only if $C_x = 0$. □
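The role of $C_x$ and $B_x$ from equations A.88 and A.89 can be checked numerically for one stimulus. The sketch assumes $G = J + P$ with a block-diagonal $P$ (so $P_{1,2} = P_{2,1} = 0$, as in equation A.87) and random positive-semidefinite test matrices; the sizes and helper names are illustrative assumptions.

```python
import numpy as np

def inv_sqrt(M):
    """Symmetric inverse square root of a positive-definite matrix."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(w ** -0.5) @ V.T

def logdet(M):
    return np.linalg.slogdet(M)[1]

rng = np.random.default_rng(3)
K1, K2 = 3, 2
R = rng.standard_normal((K1 + K2, 8))
J = R @ R.T                                        # Fisher-like positive-semidefinite matrix
Q = rng.standard_normal((K1, K1)); P11 = Q @ Q.T + np.eye(K1)
S = rng.standard_normal((K2, K2)); P22 = S @ S.T + np.eye(K2)
P = np.block([[P11, np.zeros((K1, K2))],
              [np.zeros((K2, K1)), P22]])          # block-diagonal prior term
G = J + P

G11 = G[:K1, :K1]
J12, J21, J22 = J[:K1, K1:], J[K1:, :K1], J[K1:, K1:]
C_x = J22 - J21 @ np.linalg.inv(G11) @ J12         # equation A.88
P22_ih = inv_sqrt(P22)
B_x = P22_ih @ C_x @ P22_ih                        # equation A.89
lam = np.linalg.eigvalsh(B_x)

lhs = logdet(G)
extra = np.sum(np.log1p(lam))                      # ln det(I + B_x)
rhs = logdet(G11) + logdet(P22) + extra
print(lhs, rhs, extra, np.trace(B_x))
```

Here `extra` is the correction that separates the full determinant from the reduced one; it is nonnegative, bounded by $\mathrm{Tr}(B_x)$, and zero exactly when $C_x = 0$.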

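A.10 Proof of Proposition 2. Writing $p(\theta)$ as a convex combination of two density functions $p_1(\theta)$ and $p_2(\theta)$,

$$p(\theta) = \alpha p_1(\theta) + (1-\alpha)p_2(\theta), \tag{A.90}$$

we have

$$G(x) = N\int_{\Theta}p(\theta)S(x;\theta)\,d\theta + P(x) = \alpha G_1(x) + (1-\alpha)G_2(x), \tag{A.91}$$

where $0 \le \alpha \le 1$ and

$$G_1(x) = N\int_{\Theta}p_1(\theta)S(x;\theta)\,d\theta + P(x), \tag{A.92}$$

$$G_2(x) = N\int_{\Theta}p_2(\theta)S(x;\theta)\,d\theta + P(x). \tag{A.93}$$

Using the Minkowski determinant inequality and the inequality of weighted arithmetic and geometric means, we find

$$\begin{aligned}\det(G(x))^{1/K} &= \det\left(\alpha G_1(x) + (1-\alpha)G_2(x)\right)^{1/K}\\ &\ge \alpha\det(G_1(x))^{1/K} + (1-\alpha)\det(G_2(x))^{1/K}\\ &\ge \left(\det(G_1(x))^{\alpha}\det(G_2(x))^{1-\alpha}\right)^{1/K}.\end{aligned} \tag{A.94}$$

It follows from equations A.91 and A.94 that

$$\ln\left(\det\left(\alpha G_1(x) + (1-\alpha)G_2(x)\right)\right) \ge \alpha\ln(\det(G_1(x))) + (1-\alpha)\ln(\det(G_2(x))), \tag{A.95}$$

where, for $0 < \alpha < 1$, equality holds if and only if $G_1(x) = G_2(x)$. Thus $\ln(\det(G(x)))$ is concave in $p(\theta)$. Therefore $I_G[p(\theta)]$ is a concave function of $p(\theta)$. Similarly, we can prove that $I_F[p(\theta)]$ is also a concave function of $p(\theta)$. □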
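The two inequalities combined in equation A.94 can be checked on random positive-definite matrices. This is a minimal sketch: `G1` and `G2` are arbitrary test matrices standing in for $G_1(x)$ and $G_2(x)$ at a single stimulus, and the helper names are introduced here for illustration.

```python
import numpy as np

def random_spd(rng, K):
    """Random symmetric positive-definite K x K matrix."""
    M = rng.standard_normal((K, K))
    return M @ M.T + 0.1 * np.eye(K)

def logdet(M):
    return np.linalg.slogdet(M)[1]

rng = np.random.default_rng(4)
K = 4
G1, G2 = random_spd(rng, K), random_spd(rng, K)

checks = []
for alpha in np.linspace(0.0, 1.0, 11):
    mix = alpha * G1 + (1.0 - alpha) * G2
    root_mix = np.exp(logdet(mix) / K)     # det(mix)^(1/K)
    arith = alpha * np.exp(logdet(G1) / K) + (1.0 - alpha) * np.exp(logdet(G2) / K)
    geom = np.exp((alpha * logdet(G1) + (1.0 - alpha) * logdet(G2)) / K)
    # Minkowski determinant inequality, then weighted AM-GM.
    checks.append(bool(root_mix >= arith - 1e-9 and arith >= geom - 1e-9))

all_hold = all(checks)
print(all_hold)
```

Taking logarithms of the outer terms gives equation A.95, that is, concavity of $\ln\det$ along the segment between `G1` and `G2`.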

Contributor Information

Wentao Huang, Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, MD 21205, U.S.A., and Cognitive and Intelligent Lab and Information Science Academy of China Electronics Technology Group Corporation, Beijing 100846, China.

Kechen Zhang, Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, MD 21205, U.S.A.

References

  1. Abbott LF, & Dayan P (1999). The effect of correlated variability on the accuracy of a population code. Neural Comput., 11(1), 91–101.
  2. Amari S, & Nakahara H (2005). Difficulty of singularity in population coding. Neural Comput., 17(4), 839–858.
  3. Atick JJ, Li ZP, & Redlich AN (1992). Understanding retinal color coding from first principles. Neural Comput., 4(4), 559–572.
  4. Atick JJ, & Redlich AN (1990). Towards a theory of early visual processing. Neural Comput., 2(3), 308–320.
  5. Becker S, & Hinton GE (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356), 161–163.
  6. Bell AJ, & Sejnowski TJ (1997). The “independent components” of natural scenes are edge filters. Vision Res., 37(23), 3327–3338.
  7. Bethge M, Rotermund D, & Pawelzik K (2002). Optimal short-term population coding: When Fisher information fails. Neural Comput., 14(10), 2317–2351.
  8. Borst A, & Theunissen FE (1999). Information theory and neural coding. Nat. Neurosci., 2(11), 947–957.
  9. Boyd S, & Vandenberghe L (2004). Convex optimization. Cambridge: Cambridge University Press.
  10. Brown EN, Kass RE, & Mitra PP (2004). Multiple neural spike train data analysis: State-of-the-art and future challenges. Nat. Neurosci., 7(5), 456–461.
  11. Brunel N, & Nadal JP (1998). Mutual information, Fisher information, and population coding. Neural Comput., 10(7), 1731–1757.
  12. Carlton A (1969). On the bias of information estimates. Psychological Bulletin, 71(2), 108.
  13. Chase SM, & Young ED (2005). Limited segregation of different types of sound localization information among classes of units in the inferior colliculus. Journal of Neuroscience, 25(33), 7575–7585.
  14. Chechik G, Anderson MJ, Bar-Yosef O, Young ED, Tishby N, & Nelken I (2006). Reduction of information redundancy in the ascending auditory pathway. Neuron, 51(3), 359–368.
  15. Clarke BS, & Barron AR (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory, 36(3), 453–471.
  16. Cover TM, & Thomas JA (2006). Elements of information theory (2nd ed.). New York: Wiley-Interscience.
  17. Eckhorn R, & Pöpel B (1975). Rigorous and extended application of information theory to the afferent visual system of the cat. II. Experimental results. Biological Cybernetics, 17(1), 7–17.
  18. Ganguli D, & Simoncelli EP (2014). Efficient sensory encoding and Bayesian inference with heterogeneous neural populations. Neural Comput., 26(10), 2103–2134.
  19. Gawne TJ, & Richmond BJ (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? Journal of Neuroscience, 13(7), 2758–2771.
  20. Gourévitch B, & Eggermont JJ (2007). Evaluating information transfer between auditory cortical neurons. Journal of Neurophysiology, 97(3), 2533–2543.
  21. Guo DN, Shamai S, & Verdu S (2005). Mutual information and minimum mean-square error in Gaussian channels. IEEE Trans. Inform. Theory, 51(4), 1261–1282.
  22. Harper NS, & McAlpine D (2004). Optimal neural population coding of an auditory spatial cue. Nature, 430(7000), 682–686.
  23. Huang W, Huang X, & Zhang K (2017). Information-theoretic interpretation of tuning curves for multiple motion directions. In Proceedings of the 51st Annual Conference on Information Sciences and Systems (pp. 1–4). Piscataway, NJ: IEEE.
  24. Huang W, & Zhang K (2017). An information-theoretic framework for fast and robust unsupervised learning via neural population infomax. In Proceedings of the 5th International Conference on Learning Representations (ICLR). arXiv:1611.01886.
  25. Jeffreys H (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
  26. Kang K, & Sompolinsky H (2001). Mutual information of population codes and distance measures in probability space. Phys. Rev. Lett., 86(21), 4958–4961.
  27. Khan S, Bandyopadhyay S, Ganguly AR, Saigal S, Erickson DJ III, Protopopescu V, & Ostrouchov G (2007). Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Physical Review E, 76(2), 026209.
  28. Kraskov A, Stögbauer H, & Grassberger P (2004). Estimating mutual information. Physical Review E, 69(6), 066138.
  29. Laughlin SB, & Sejnowski TJ (2003). Communication in neuronal networks. Science, 301(5641), 1870–1874.
  30. Lewis A, & Zhaoping L (2006). Are cone sensitivities determined by natural color statistics? J. Vis., 6(3), 285–302.
  31. MacKay DJC (2003). Information theory, inference and learning algorithms. Cambridge: Cambridge University Press.
  32. McClurkin JW, Gawne TJ, Optican LM, & Richmond BJ (1991). Lateral geniculate neurons in behaving primates. II. Encoding of visual information in the temporal shape of the response. Journal of Neurophysiology, 66(3), 794–808.
  33. Miller GA (1955). Note on the bias of information estimates. In Quastler H (Ed.), Information theory in psychology: Problems and methods II-B (pp. 95–100). Glencoe, IL: Free Press.
  34. Olshausen BA, & Field DJ (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.
  35. Optican LM, & Richmond BJ (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. III. Information theoretic analysis. Journal of Neurophysiology, 57(1), 162–178.
  36. Paninski L (2003). Estimation of entropy and mutual information. Neural Comput., 15(6), 1191–1253.
  37. Pouget A, Dayan P, & Zemel R (2000). Information processing with population codes. Nat. Rev. Neurosci., 1(2), 125–132.
  38. Quiroga R, & Panzeri S (2009). Extracting information from neuronal populations: Information theory and decoding approaches. Nat. Rev. Neurosci., 10(3), 173–185.
  39. Rao CR (1945). Information and accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37(3), 81–91.
  40. Rieke F, Warland D, de Ruyter van Steveninck R, & Bialek W (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
  41. Rissanen JJ (1996). Fisher information and stochastic complexity. IEEE Trans. Inform. Theory, 42(1), 40–47.
  42. Shannon C (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
  43. Sompolinsky H, Yoon H, Kang KJ, & Shamir M (2001). Population coding in neuronal systems with correlated noise. Phys. Rev. E, 64(5), 051904.
  44. Tovee MJ, Rolls ET, Treves A, & Bellis RP (1993). Information encoding and the responses of single neurons in the primate temporal visual cortex. Journal of Neurophysiology, 70(2), 640–654.
  45. Toyoizumi T, Aihara K, & Amari S (2006). Fisher information for spike-based population decoding. Phys. Rev. Lett., 97(9), 098102.
  46. Treves A, & Panzeri S (1995). The upward bias in measures of information derived from limited data samples. Neural Comput., 7(2), 399–407.
  47. Van Hateren JH (1992). Real and optimal neural images in early vision. Nature, 360(6399), 68–70.
  48. Van Trees HL, & Bell KL (2007). Bayesian bounds for parameter estimation and nonlinear filtering/tracking. Hoboken, NJ: Wiley.
  49. Verdu S (1986). Capacity region of Gaussian CDMA channels: The symbol synchronous case. In Proc. 24th Allerton Conf. Communication, Control and Computing (pp. 1025–1034). Piscataway, NJ: IEEE.
  50. Victor JD (2000). Asymptotic bias in information estimates and the exponential (bell) polynomials. Neural Comput., 12(12), 2797–2804.
  51. Wei X-X, & Stocker AA (2015). Mutual information, Fisher information, and efficient coding. Neural Comput., 28, 305–326.
  52. Yarrow S, Challis E, & Series P (2012). Fisher and Shannon information in finite neural populations. Neural Comput., 24(7), 1740–1780.
  53. Zhang K, Ginzburg I, McNaughton BL, & Sejnowski TJ (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. J. Neurophysiol., 79(2), 1017–1044.
  54. Zhang K, & Sejnowski TJ (1999). Neuronal tuning: To sharpen or broaden? Neural Comput., 11(1), 75–84.
