Author manuscript; available in PMC: 2021 Oct 15.
Published in final edited form as: Neural Comput. 2019 Apr 12;31(6):1015–1047. doi: 10.1162/neco_a_01193

Quantifying information conveyed by large neuronal populations

John A Berkowitz 1, Tatyana O Sharpee 2,3
PMCID: PMC8519183  NIHMSID: NIHMS1740426  PMID: 30979352

Abstract

Quantifying mutual information between inputs and outputs of a large neural circuit is an important open problem in both machine learning and neuroscience. However, evaluation of the mutual information is known to be generally intractable for large systems due to the exponential growth in the number of terms that need to be evaluated. Here we show how information contained in the responses of large neural populations can be effectively computed provided the input-output functions of individual neurons can be measured and approximated by a logistic function applied to a potentially nonlinear function of the stimulus. Neural responses in this model can remain sensitive to multiple stimulus components. We show that the mutual information in this model can be effectively approximated as a sum of lower-dimensional conditional mutual information terms. The approximations become exact in the limit of large neural populations and for certain conditions on the distribution of receptive fields across the neural population. We empirically find that these approximations continue to work well even when the conditions on the receptive field distributions are not fulfilled. The computing cost for the proposed methods grows linearly in the dimension of the input, and compares favorably with other approximations.

1. Introduction

Information theory has the potential of answering many important questions about how neurons communicate within the brain. In particular, it can help determine whether neural responses provide sufficient amounts of information about certain stimulus features, and in this way determine whether these features could possibly affect the animal’s behavior (Rieke et al., 1997; Bialek, 2012). In addition, a number of previous studies have shown that one can understand many aspects of the neural circuit organization as those that provide maximal amounts of information under metabolic constraints (Laughlin et al., 1998; Bialek, 2012). Key to all of these analyses is the ability to compute the Shannon mutual information (Cover and Thomas, 2012). When estimating the information transmitted by neural populations from experimental recordings, all empirical methods produce biased estimates (Paninski, 2003). There are several approaches to trying to reduce or account for this bias (Nemenman et al., 2004; Strong et al., 1998; Brenner et al., 2000; Treves and Panzeri, 1995), but these approaches do not have finite-sample guarantees and are generally ineffective when the population response is high dimensional. In order to make progress on this problem, we consider the case where the response functions of individual neurons can be measured and where the stimulus-conditional (“noise”) correlations between neural responses can be described by pairwise statistics (Schneidman et al., 2006). Historically, even with these assumptions the mutual information is notoriously difficult to compute in part due to the large number of possible responses that a set of neurons can jointly produce (Nemenman et al., 2004; Strong et al., 1998). The number of patterns grows exponentially with both the number of time points (Strong et al., 1998; Dettner et al., 2016) and the number of neurons.

In this paper we will describe a set of approaches for computing information conveyed by responses of large neural populations. These methods build on recent advances for computing information based on linear combinations of neural responses across time (Dettner et al., 2016; Yu et al., 2010) and/or neurons (Berkowitz and Sharpee, 2018). We will show that when each individual neuron’s firing probability depends monotonically on a (potentially nonlinear) function of the stimulus, the information contained in the full population response can be completely preserved by a linear transformation of the population output. This calculation still involves computing information between high dimensional vector variables. Therefore, we further show how the full information can be effectively approximated using a sum of conditional mutual information values between pairs of low-dimensional variables. The resulting approach makes it possible to avoid the “curse of dimensionality” with respect to the number of neurons when computing the mutual information from large neural populations.

2. Framework setup

Our analysis will target neural responses considered over sufficiently small time windows such that no more than one spike can be produced by any given neuron. We model the neural population as a set of binary neurons with sigmoidal tuning curves with response probability described by:

P(r_n = 1 | s) = \frac{1}{1 + e^{-2 f_n(s)}},   (1)

where s \in \mathbb{R}^D is the input, r_n \in \{-1, 1\} is the activity of the nth neuron, and f_n(s) is a scalar function of s representing the activation function of the nth neuron. The population consists of N such neurons, and the population response is denoted as r = (r_1, \ldots, r_N). For clarity of the derivation, we will initially assume that neural responses are independent conditioned on s:

P(r | s) = \prod_n P(r_n | s),   (2)

and later discuss under what conditions our results generalize to the case where neural responses are correlated for a given stimulus s. A few lines of algebra suffice to show that Eq. (2) can be expressed in the following form:

P(r | s) = \exp\left( \sum_n \left[ r_n f_n(s) - A_n(s) \right] \right),
A_n(s) = \log 2\cosh f_n(s).   (3)

This formulation will assist all of the approaches described below for computing the mutual information.
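To make the model concrete, here is a minimal numerical sketch of Eqs. (1)–(3): it samples one population response for a single stimulus and evaluates the corresponding log-likelihood. The toy sizes, the random linear choice of f_n, and all variable names are illustrative assumptions, not part of the original text.

    import numpy as np

    rng = np.random.default_rng(0)
    N, D = 5, 2
    w = rng.standard_normal((N, D))          # per-neuron parameters of f_n (illustrative)
    f = lambda s: w @ s                      # f_n(s); any scalar function of s would do

    s = rng.standard_normal(D)
    fn = f(s)
    p_spike = 1.0 / (1.0 + np.exp(-2.0 * fn))        # Eq. (1): P(r_n = +1 | s)
    r = np.where(rng.random(N) < p_spike, 1, -1)     # one sample of the population response

    # Eq. (3): log P(r|s) = sum_n [ r_n f_n(s) - A_n(s) ], with A_n(s) = log 2 cosh f_n(s)
    A = np.log(2.0 * np.cosh(fn))
    log_p = np.sum(r * fn - A)
    print(p_spike, r, log_p)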

3. An unbiased estimator of information for large neural populations

In order to test the approaches described in subsequent sections, we first developed a Monte-Carlo method for computing the “ground-truth” mutual information that works for large neural populations. The approach relies on the knowledge of neural response parameters fn(s) to produce unbiased estimates of mutual information between R and S for different choices of fn(s) or P(s). Here and in what follows, upper case letters (e.g. S) represent random variables, while lower case letters (e.g. s) represent specific values of the associated random variables. The input distribution P(s) is defined by drawing Nstim samples; we denote this set of samples as sμ. Because of this approximation however, I(R,S) will be bounded above by log(Nstim) (as will be any unbiased estimator of mutual information).

Although there are several formulations of the mutual information in terms of the entropies of R and S it serves to examine just one:

I(R,S) = H(R) - H(R | S).   (4)

Here, H(R) is the Shannon entropy of the marginal distribution of R and H(R | S) is the conditional entropy of R given S. Because we intend to use this estimator as a way to test the quality of other approximations, we will only consider here the case of conditionally independent neural responses R. In this case, the noise entropy H(R | S = s) decomposes into a sum over neurons:

H(R | S = s) = \sum_n \left[ A_n(s) - \bar{r}_n(s) f_n(s) \right],   (5)

where r¯n(s) is the expected value of Rn given s:

\bar{r}_n(s) = \tanh f_n(s).   (6)

We denote by \hat{H}(R | S) the finite-sample approximation to H(R | S):

\hat{H}(R | S) = \frac{1}{N_{stim}} \sum_\mu \sum_n \left[ A_n(s^\mu) - \bar{r}_n(s^\mu) f_n(s^\mu) \right].   (7)

The conditional entropy \hat{H}(R | S) can be evaluated in O(N \cdot N_{stim}) time, not including the cost of evaluating f_n(s). However, the marginal distribution of r will in general not factor. Thus evaluating H(R) requires computing the marginal P(r) for all r \in \{-1,1\}^N. This computation grows like O(N \cdot N_{stim} \cdot 2^N). Thus, evaluation of Eq. (4) is known to become intractable for realistic population sizes. To derive our estimator, we begin by rewriting H(R):

H(R) = -\sum_r P(r) \log P(r) = \langle F(r) \rangle_r,
F(r) = -\log \langle P(r | s) \rangle_s.   (8)

We approximate the log-marginal F(r) with an empirical average:

\hat{F}(r) = -\log \frac{1}{N_{stim}} \sum_\mu P(r | s^\mu) = -\log \frac{1}{N_{stim}} \sum_\mu \exp\left( \sum_n \left[ r_n f_n(s^\mu) - A_n(s^\mu) \right] \right).   (9)

In terms of numerical implementation, \hat{F}(r) can be efficiently and stably evaluated in O(N_{stim}) time using the logsumexp function that is implemented in many numerical libraries. To approximate the averaging with respect to P(r) we draw B samples of r for every s^\mu, which is easily done with (1) and (2), and denote these samples as r^\nu. We can thus produce an unbiased estimate of \langle \hat{F}(r) \rangle_r:

\hat{H}(R) = -\frac{1}{B N_{stim}} \sum_\nu \log \frac{1}{N_{stim}} \sum_\mu \exp\left( \sum_n \left[ r_n^\nu f_n(s^\mu) - A_n(s^\mu) \right] \right).   (10)

Importantly, the response entropy \hat{H}(R) requires O(N \cdot B \cdot N_{stim}^2) operations, a substantial improvement over exact evaluation of H(R) when B \cdot N_{stim} \ll 2^N. We note that even though we are able to produce unbiased estimates of \hat{H}(R), this estimator systematically underestimates the "infinite sample" entropy computed with respect to P(s) explicitly, i.e. not defined by input samples (see Appendix A). Our Monte-Carlo estimator of I(R,S) is the straightforward combination of \hat{H}(R) and \hat{H}(R | S):

\hat{I}(R,S) = \hat{H}(R) - \hat{H}(R | S).   (11)

Although \hat{I}(R,S) is an unbiased estimator of the mutual information (after accounting for the approximation of P(s) by the samples {s^\mu}), the variance of \hat{H}(R), and thus of \hat{I}(R,S), can be difficult to quantify. However, F(r) is a bounded function because r has finite support (or, more generally, F(r) can be treated as a continuous function on the compact set [-1, 1]^N). Thus, standard concentration bounds show that \hat{H}(R) is a consistent estimator of H(R).
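As an illustration, the following sketch implements the Monte-Carlo estimator of Eqs. (7)–(11) for a toy population with linear activation functions, using scipy's logsumexp for the stable evaluation of \hat{F}(r) mentioned above. The population parameters and sample sizes are illustrative assumptions, not the authors' settings.

    import numpy as np
    from scipy.special import logsumexp

    rng = np.random.default_rng(1)
    N, D, n_stim, B = 10, 2, 2000, 3
    phi = rng.standard_normal((N, D))
    phi /= np.linalg.norm(phi, axis=1, keepdims=True)   # unit-norm receptive fields

    S = rng.standard_normal((n_stim, D))                # samples defining P(s)
    F = S @ phi.T                                       # f_n(s^mu), shape (n_stim, N)
    A = np.log(2.0 * np.cosh(F))                        # A_n(s^mu)
    rbar = np.tanh(F)                                   # Eq. (6)

    # Eq. (7): conditional ("noise") entropy
    H_cond = np.mean(np.sum(A - rbar * F, axis=1))

    # Draw B responses per stimulus and evaluate Eq. (10) with logsumexp over stimuli
    H_marg = 0.0
    for _ in range(B):
        R = np.where(rng.random((n_stim, N)) < 1.0 / (1.0 + np.exp(-2.0 * F)), 1.0, -1.0)
        # log P(r^nu | s^mu) for every (nu, mu) pair, shape (n_stim, n_stim)
        logP = R @ F.T - A.sum(axis=1)[None, :]
        H_marg += -np.mean(logsumexp(logP, axis=1) - np.log(n_stim))
    H_marg /= B

    I_hat = H_marg - H_cond                             # Eq. (11), in nats
    print(I_hat)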

In order to test our derivation that \hat{I}(R,S) is an unbiased estimator of I(R,S), we analyzed the statistics of \hat{I}(R,S) on a tractable neural population where I(R,S) can be computed exactly. We let N = 10 and f_n(s) = \phi_n \cdot s with the \phi_n uniformly distributed along the unit circle. P(s) is a spherical, two-dimensional Gaussian distribution and N_{stim} = 8,000. We evaluate I(R,S) exactly and obtain I(R,S) = 1.3384 nats. This is well below the upper bound of log(8,000) ≈ 8.987 nats. We computed \hat{I}(R,S) 100 times for B = 1 and B = 3, with {s^\mu} fixed. For each repetition we record the residual \hat{I}(R,S) - I(R,S). Distribution plots of the residuals are shown in Figure 1. For both distributions the sample mean is not significantly different from zero, with P = 0.848 (B = 1) and P = 0.851 (B = 3) in a two-sided t-test. The simulation results therefore support the derivation of zero bias in the proposed model-based Monte-Carlo estimator.

Figure 1:

Distribution of residuals between the exact calculation and the Monte Carlo results for the test neural population described in Section 3. Dashed black line indicates zero, while red marker and error bar are the sample mean and standard deviation.

4. Simplifying the mutual information with sufficient statistics

4.1. A vector-valued sufficient statistic

The method introduced in Section 3 can be applied for very general formulations and parametrizations of the activation functions. However, when we constrain the activation functions to be affine we can show that P(rs) has especially useful properties. Specifically, we assume the following parametrization of fn(s):

f_n(s) = w_n \cdot s - \alpha_n, \quad \forall n.   (12)

While Eq. (12) implies a strong restriction on how stimuli drive the neural responses, some results of this section can be generalized to other activation functions. The reason for this is that even the general formulation of P(rs) given in Eq. (3) can be viewed as an exponential family, with sufficient statistic r and natural parameter f(s). In particular, the framework can be extended to quadratic activation functions, which are an important model for describing neurons that are sensitive to multiple stimulus features. See Appendix E for further discussion.

If Eq. (12) holds, then Eq. (2) can be rewritten as follows:

P(r | s) = h(r) \exp\left( s \cdot t(r) - A(s) \right),   (13)

where,

t(r) = W r, \quad W \equiv (w_1^T, \ldots, w_N^T),   (14)
h(r) = e^{-\sum_n r_n \alpha_n},   (15)
A(s) = \sum_n \log 2\cosh\left( w_n \cdot s - \alpha_n \right).   (16)

Equation (13) is an exponential family with sufficient statistic t \in \mathbb{R}^D, natural parameter s, base measure h(r), and log-partition function A(s) (Wainwright and Jordan, 2008).

The stimulus-conditional probability distribution P(ts) can be defined by marginalizing over all r that map to the same t:

P(t | s) = \sum_r \delta(t, t(r))\, h(r) \exp\left( s \cdot t(r) - A(s) \right) = \exp\left( s \cdot t - A(s) \right) \sum_r \delta(t, t(r))\, h(r) = \exp\left( s \cdot t - A(s) \right) h(t).   (17)

Note that h(t) = 0 if there does not exist an r such that t = t(r). An important property of sufficient statistics is the conservation of information (Cover and Thomas, 2012):

I(S,R)=Ivector(S,T) (18)

with T defined by Eq. (14). Although T does not lose information relative to r, it is worth making a few comments on T and Eq. (18). Because R is a discrete variable (with cardinality of at most 2N) and T is a deterministic function of R, then T is also a discrete variable with finite cardinality. Indeed, outside of cases of degeneracy between the columns of W, there will generally be a one-to-one mapping between values of R and T. Thus, even in cases where DN, computing H(T) can be just as difficult as computing H(R). Furthermore, unlike R, the components of T will generally not be conditionally independent. H(Ts) will thus be similarly intractable. While it may seem that we have not gained any computational advantage by transforming from R to T we will now show that Eq. (18) can be expressed in a convenient form that facilitates several useful approximations.

4.2. Decomposition of mutual information based on sufficient statistics

We start by noting that the ordering of the components of S and T is arbitrary, because applying any matching permutation to the components of S and T does not affect I(S,R). We will use the following notations for components of the vector s = (s_1, \ldots, s_D): s_{\neg d} = (s_1, \ldots, s_{d-1}, s_{d+1}, \ldots, s_D), s_{<d} = (s_1, \ldots, s_{d-1}), and similarly for s_{>d}, s_{\leq d}, and s_{\geq d}. Note that S_{\neg d} = (S_{<d}, S_{>d}). Additionally, we will at times consider information theoretic quantities involving variables that are the concatenation of two other variables, such as X and Y. Such compound variables will be denoted as {X, Y}. Using these notations and applying the chain rule for mutual information to Eq. (18) yields (Cover and Thomas, 2012):

I_{vector}(S,T) = \sum_{d=1}^{D} I(S_d, T | S_{<d}).   (19)

In Eq. (19), I(S_d, T | S_{<d}) is the mutual information between T and S_d conditioned on S_{<d}. It can be written in several equivalent forms:

I(S_d, T | S_{<d}) = H(\{S_d, S_{<d}\}) + H(\{T, S_{<d}\}) - H(\{S_d, T, S_{<d}\}) - H(S_{<d})
= TC(S_{<d}, S_d, T) - I(S_{<d}, S_d) - I(S_{<d}, T)
= I(T, \{S_{<d}, S_d\}) - I(T, S_{<d}) = I(T, S_{\leq d}) - I(T, S_{<d})
= \langle I(S_d, T | s_{<d}) \rangle_{s_{<d}},   (20)

where TC(S_{<d}, S_d, T) is the total correlation between S_{<d}, S_d, and T. All four formulations of I(S_d, T | S_{<d}) are equivalent provided they all exist. A notable exception occurs when there is a functional dependence between S_d and S_{<d}, such as when the support of S_{\leq d} lies on a manifold of intrinsic dimension < d. In this case I(S_{<d}, S_d) diverges and the second line of Eq. (20) is ill-defined. However, it is easy to show that I(S_d, T | S_{<d}) = 0 if such a functional dependency exists, using the fourth line of Eq. (20). Formally, let s_d \equiv g(s_{<d}), where g: \mathbb{R}^{d-1} \to \mathbb{R} is a function defined explicitly or implicitly. Then we note that the mutual information between two variables is zero if at least one variable is constant:

I(S_d, T | s_{<d}) = I(g(s_{<d}), T | s_{<d}) = 0.   (21)

Thus, I(S_d, T | S_{<d}) will also be zero:

I(S_d, T | S_{<d}) = \langle I(S_d, T | s_{<d}) \rangle_{s_{<d}} = \langle I(g(s_{<d}), T | s_{<d}) \rangle_{s_{<d}} = 0.   (22)

Computing just I(S_d, T | s_{<d}) remains challenging for the same reasons as computing I_{vector}(S,T). However, we can achieve a further reduction by taking advantage of the fact that P(t | s) is an exponential family. To see this we first express two important marginalized forms of Eq. (17):

P(t | s_d, s_{<d}) = h(t) \exp\left( s_{<d} \cdot t_{<d} + s_d t_d \right) \langle \exp\left( s_{>d} \cdot t_{>d} - A(s) \right) \rangle_{s_{>d} | s_{\leq d}},   (23)
P(t | s_{<d}) = h(t) \exp\left( s_{<d} \cdot t_{<d} \right) \langle \exp\left( s_{\geq d} \cdot t_{\geq d} - A(s) \right) \rangle_{s_{\geq d} | s_{<d}}.   (24)

The notation \langle f(s) \rangle_{s_{>d} | s_{\leq d}} denotes the expectation of f(s) with respect to P(s_{>d} | s_{\leq d}), with analogous meanings for \langle f(s) \rangle_{s_{\geq d} | s_{<d}} and so forth. Marginalization over conditioned variables is expressed implicitly: P(t | s_{<d}) = \langle P(t | s) \rangle_{s_{\geq d} | s_{<d}}. The important consequence of (23) and (24) is that the log-likelihood ratio of P(t | s_d, s_{<d}) and P(t | s_{<d}) is independent of t_{<d}. From this we can show that I(S_d, T | s_{<d}) = I(S_d, T_{\geq d} | s_{<d}):

I(S_d, T | s_{<d}) = \left\langle \sum_t P(t | s_d, s_{<d}) \log \frac{P(t | s_d, s_{<d})}{P(t | s_{<d})} \right\rangle_{s_d} = \left\langle \sum_{t_{\geq d}} P(t_{\geq d} | s_d, s_{<d}) \log \frac{P(t_{\geq d} | s_d, s_{<d})}{P(t_{\geq d} | s_{<d})} \right\rangle_{s_d} = I(S_d, T_{\geq d} | s_{<d}).   (25)

This leads to our next reduction:

I_{vector}(S,T) = \sum_{d=1}^{D} I(S_d, T_{\geq d} | S_{<d}).   (26)

We note that the dth term in (19) involves the (D + d)-dimensional collection of variables {S_d, T, S_{<d}}, whereas the corresponding term in (26) involves only the (D + 1)-dimensional collection {S_d, T_{\geq d}, S_{<d}}. This effective dimension reduction has important algorithmic implications for the nonparametric estimators we use to compute the individual terms of (26) (cf. Section 5). In Sections 5.1 and 5.2, we use Eq. (26) to evaluate I_{vector}(S,T).

4.3. Lower bounds for the mutual information

While Eq. (26) represents a significant improvement in complexity over naive evaluation of I(S,T), individual terms of I_{vector}(S,T) may still be too high-dimensional to reliably evaluate. In this section, we will present a series of lower bounds on I_{vector}(S,T) that are more easily estimated. In particular, we consider bounds that arise by replacing T_{\geq d} in the dth term of Eq. (26) by a lower-dimensional, deterministic transformation of T_{\geq d}, denoted Z_d. Applying the Data Processing Inequality (DPI) to each term in Eq. (26) then yields a lower bound for the mutual information. There are many possible lower bounds to I_{vector}(S,T) of this form. We focus on the variable Z_d = {T_d, |T_{>d}|}, where |T_{>d}| is the L2-norm of T_{>d}. This leads to the following lower-bound approximation to I_{vector}(S,T):

I_{iso}(S,T) = \sum_d I(S_d, \{T_d, |T_{>d}|\} | S_{<d}),   (27)

which we term isotropic. In Appendix B, we show that this approximation becomes exact in the asymptotic limit of large neural populations, meaning that I_{iso}(S,T) = I_{vector}(S,T), when the stimulus distribution is isotropic and the distribution of receptive fields (RFs) w across the population is such that A(s) = A(|s|). Notably, this is achieved when RFs uniformly cover the stimulus space, for example when P(w) is described by an uncorrelated Gaussian distribution. For a finite number of neurons, A(s) will never be perfectly isotropic. However, for large populations (N ≫ 1) where the receptive fields w are drawn from an isotropic distribution and the distribution of \alpha is independent of w, A(s) becomes isotropic asymptotically as N → ∞; cf. Appendix B.1 for further details. The analogue of the approximation in Eq. (27) for the case where the RFs and s are described by matching correlated Gaussian distributions is described in Appendix B.2.

The next reduction we consider is to drop |T>d| from each term of Eq. (27):

I_{comp-cond}(S,T) = \sum_{d=1}^{D} I(S_d, T_d | S_{<d}).   (28)

By the data-processing inequality, it again follows that I_{iso}(S,T) \geq I_{comp-cond}(S,T). Overall, one obtains a series of bounds:

I_{vector}(S,T) \geq I_{iso}(S,T) \geq I_{comp-cond}(S,T).   (29)

Our final, simplest approximation is to drop the conditioning on S_{<d} in each term of Eq. (28):

I_{comp-ind}(S,T) = \sum_{d=1}^{D} I(S_d, T_d).   (30)

We show in Appendix C that this last approximation becomes exact in the case where the neural population splits into independent sub-populations with orthogonal RFs between sub-populations. Mathematically, this corresponds to the case where both the stimulus distribution P(s) and the function A(s) factor in the same basis:

P(s) = \prod_{k=1}^{D} P(s_k), \qquad A(s) = \sum_{k=1}^{D} A(s_k).   (31)

In general, I_{comp-ind}(S,T) may be greater than or less than I_{comp-cond}(S,T) (or I_{vector}(S,T)) (Renner and Maurer, 2002). However, when P(s) = \prod_d P(s_d), the following additional inequality holds:

I_{comp-cond}(S,T) \geq I_{comp-ind}(S,T).   (32)

To derive (32) we first note that we can decompose I(Sd, {Td, S<d}) (for d > 1) in two different ways:

I(S_d, \{T_d, S_{<d}\}) = I(S_d, T_d | S_{<d}) + I(S_d, S_{<d}) = I(S_d, S_{<d} | T_d) + I(S_d, T_d).   (33)

Equating the first and second lines of (33) we can rewrite the residual I(Sd,Td|S<d) − I(Sd,Td):

I(S_d, T_d | S_{<d}) - I(S_d, T_d) = I(S_d, S_{<d} | T_d) - I(S_d, S_{<d}).   (34)

Though either side of (34) may be positive or negative in general, when we make the assumption that P(s) factors across dimensions, then I(S_d, S_{<d}) = 0. Thus the right-hand side of (34) is nonnegative and I(S_d, T_d | S_{<d}) ≥ I(S_d, T_d), implying (32).

In the opposite extreme case where the value of S_d is a deterministic function of S_{<d}, Eq. (22) can be generalized to show that I(S_d, T_d | S_{<d}) = 0. Thus, in this case I(S_d, T_d) ≥ I(S_d, T_d | S_{<d}), which in turn indicates that I_{comp-ind}(S,T) ≥ I_{comp-cond}(S,T). For example, when the support of P(s) lies on a one-dimensional curve, e.g. when S represents position along a one-dimensional nonlinear track, S_d is fully determined by the values of the other variables S_{\neg d} regardless of component ordering. In this case, I_{comp-ind}(S,T) ≥ I_{comp-cond}(S,T).

In intermediate cases with some statistical dependencies between stimulus components, I_{comp-ind}(S,T) is not generally guaranteed to be a lower bound on either I_{comp-cond}(S,T) or I_{vector}(S,T). Nevertheless, we observed that even for some correlated P(s), I_{comp-ind}(S,T) < I_{comp-cond}(S,T); cf. Section 5.2.

4.4. Alternative Approximations of I(R,S)

Previous authors have proposed other approximations to the mutual information. There exists a non-parametric upper bound to the mutual information computed in terms of pairwise relative entropies between P(r | s) and P(r | s') (Haussler et al., 1997; Kolchinsky et al., 2017):

I_{kw}(R,S) = -\int ds\, P(s) \log \int ds'\, P(s') \exp\left( -D_{KL}\left( P(r|s) \,\|\, P(r|s') \right) \right).   (35)

The model we consider for P(rs) is an exponential family and thus has a tractable relative entropy (Banerjee et al., 2005):

D_{KL}\left( P(r|s) \,\|\, P(r|s') \right) = \sum_n \left[ \tanh(f_n(s)) f_n(s) - A_n(s) - \tanh(f_n(s)) f_n(s') + A_n(s') \right].   (36)

In Eq. (35) we have used the generalized definitions of Section 3. The evaluation of the upper bound (35) is quadratic in the sample size N_{stim}, as opposed to O(N_{stim} log_2 N_{stim}) for the estimator in Section 5. In the limit where N ≫ D, another popular approximation exists based on Fisher information (Brunel and Nadal, 1998); it can be computed with O(N_{stim}) operations. Recent work has shown that this approximation is valid only for certain classes of input distributions (Huang and Zhang, 2018). In Appendix D we discuss the relationship between this Fisher-information approximation, I_{Fisher}(R,S), and I_{kw}(R,S). We include numerical comparisons between I_{kw}(R,S) and the methods proposed in this paper in Sections 5.1 and 5.2. However, we found that the Fisher information approximation drastically overestimated the true mutual information. Therefore, to avoid obscuring differences between the other results, the approximation based on Fisher information is not included in Figures 2–3. Full plots including this approximation can be found in Appendix D.
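For reference, here is a short sketch of how the pairwise-KL bound of Eqs. (35)–(36) can be evaluated on stimulus samples for the binary-neuron model; its O(N_{stim}^2) cost is visible in the pairwise KL matrix. The toy population and parameter values are assumptions made only for illustration.

    import numpy as np
    from scipy.special import logsumexp

    rng = np.random.default_rng(3)
    N, D, n_stim = 50, 2, 1000
    w = rng.standard_normal((N, D))
    S = rng.standard_normal((n_stim, D))

    F = S @ w.T                               # f_n(s^mu)
    A = np.log(2.0 * np.cosh(F))              # A_n(s^mu)
    rbar = np.tanh(F)

    # Eq. (36): D_KL(P(r|s_mu) || P(r|s_nu)) for every pair (mu, nu)
    kl = (np.sum(rbar * F - A, axis=1)[:, None]
          - rbar @ F.T + A.sum(axis=1)[None, :])

    # Eq. (35) with both integrals replaced by averages over the stimulus samples
    I_kw = -np.mean(logsumexp(-kl, axis=1) - np.log(n_stim))
    print("I_kw upper bound (nats):", I_kw)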

Figure 2:

Information curves for neural populations with uncorrelated RF and stimulus distributions. Lines and error bars are mean and standard deviation over ten repeats of the estimator. Insets show RF distribution for several population sizes.

Figure 3:

Information curves for neural populations with correlated RF distributions, cf. Section 5.2. Lines and errorbars are mean and standard deviation over ten repeats of the estimator. In all panels, circles represent information computed in the original basis, while squares and triangles are computations performed in the decorrelated basis. I_{vector} (A) and I_{iso} (B) recover the full information. These two computations do not benefit from working in the decorrelated stimulus basis. Stimulus decorrelation improves the performance of I_{comp-cond} (C) and I_{comp-ind} (D). In (D), computations are invariant to the ordering of components.

We note that there are other variational approximations to mutual information (Belghazi et al., 2018; Barber and Agakov, 2003). However, because comparing the information for different choices of the parameters of P(rs) and P(s) requires training a different variational approximation each time, direct comparison requires substantial computational resources and we leave them for future work.

5. Numerical Simulations

We now test the performance of the above described bounds under several representative situations that include correlated and uncorrelated stimulus distributions, and isotropic and anisotropic receptive field distributions, including experimentally recorded receptive fields from the primary visual cortex, as well as the case where intrinsic “noise” correlations are present.

To empirically estimate the bounds on mutual information [I(S_d, T_d), I(S_d, T_d | S_{<d}), I(S_d, \{T_d, |T_{>d}|\} | S_{<d}), and I(S_d, T_{\geq d} | S_{<d})], we use the KSG estimator (Kraskov et al., 2004), a non-parametric method based on distributions of K nearest-neighbor distances. We chose the KSG estimator because, even though we have reduced the mutual information between two high-dimensional variables to a sum over pairs of low-dimensional variables, computing even just I(S_d, T_d) can still be a daunting task, even more so for the terms involving conditional informations. T_d may still have exponentially large cardinality, and complicated interdependencies between the components of T present difficulties in forming explicit expressions for P(t_d | s_d), so exact evaluation of H(T_d) and H(T_d | S_d) is not feasible at present. The KSG estimator requires only that we can draw N_{stim} samples of S and T from P(s,t), discarding the unused components. Sampling from P(s,t) is easily done given samples from P(s). Given a sample s, we draw r from P(r | s), which is easily done because of Eq. (2), and transform r into t using (14). This estimator has complexity O(N_{stim} log_2 N_{stim}) when implemented with KD-trees. For the case of two scalar variables the \ell_2 error of the estimate decreases like 1/\sqrt{N_{stim}} (Gao et al., 2018), though if the true value of the mutual information is very high then the error may still be large (Gao et al., 2015). In order to partially alleviate this error, we use the PCA-based Local Nonuniformity Correction of (Gao et al., 2015) (KSG-LNC). We extend this estimator to compute the conditional mutual information terms using a decomposition analogous to the second line of Eq. (20), and set the nonuniformity threshold hyperparameter according to the heuristics suggested in (Gao et al., 2015). Additionally, we assume that the distribution of S (and thus of S_d, S_{<d}, and S_{\leq d}) is non-atomic. Thus, because T is discrete but real valued, the KSG estimator is applicable, as neither T nor S is a mixed continuous-discrete variable (Gao et al., 2017).
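The sketch below illustrates this sampling pipeline (stimuli → responses via Eq. (1) → sufficient statistics via Eq. (14)) together with a plain KSG estimate of the I(S_d, T_d) terms in Eq. (30). It uses a basic KSG implementation built on scipy's KD-tree rather than the KSG-LNC variant used in the paper, and the population parameters are illustrative assumptions. Conditional terms such as I(S_d, T_d | S_{<d}) can then be formed as differences of joint estimates, e.g. I(T_d, S_{\leq d}) - I(T_d, S_{<d}).

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.special import digamma

    def ksg_mi(x, y, k=3):
        """Plain KSG estimate of I(X;Y) in nats; x, y have shape (n, dx), (n, dy)."""
        x = x.reshape(len(x), -1).astype(float)
        y = y.reshape(len(y), -1).astype(float)
        n = len(x)
        z = np.hstack([x, y])
        # distance to the k-th neighbor in the joint space (Chebyshev metric)
        eps = cKDTree(z).query(z, k=k + 1, p=np.inf)[0][:, -1]
        tx, ty = cKDTree(x), cKDTree(y)
        nx = np.array([len(tx.query_ball_point(x[i], eps[i] - 1e-12, p=np.inf)) - 1 for i in range(n)])
        ny = np.array([len(ty.query_ball_point(y[i], eps[i] - 1e-12, p=np.inf)) - 1 for i in range(n)])
        return digamma(k) + digamma(n) - np.mean(digamma(nx + 1) + digamma(ny + 1))

    rng = np.random.default_rng(0)
    D, N, n_stim = 3, 200, 2000
    W = rng.standard_normal((D, N))          # columns are receptive fields w_n (illustrative)
    W /= np.linalg.norm(W, axis=0)
    alpha = np.zeros(N)
    S = rng.standard_normal((n_stim, D))     # samples from an isotropic P(s)

    p = 1.0 / (1.0 + np.exp(-2.0 * (S @ W - alpha)))      # Eq. (1): P(r_n = 1 | s)
    R = np.where(rng.random((n_stim, N)) < p, 1.0, -1.0)  # conditionally independent responses
    T = R @ W.T                                           # sufficient statistic t = W r, Eq. (14)

    I_comp_ind = sum(ksg_mi(S[:, [d]], T[:, [d]]) for d in range(D))
    print("I_comp-ind estimate (nats):", I_comp_ind)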

5.1. Large populations responding to uncorrelated stimuli

We evaluated the performance of the bounds on information developed in Section 4.3 for large populations ranging from N ≈ 100 to 1,000. Specifically, to test the performance of I_{iso}(S,T) we chose a highly isotropic population and stimulus distribution. We set D = 3, N_{stim} = 8,000, and let P(s) be a zero-mean Gaussian with unit covariance matrix. For each value of N, the w_n were placed uniformly on the surface of the unit sphere, using the regular placement algorithm of (Deserno, 2004). Because N is too large for exact evaluation of H(R), ground truth values were estimated using the Monte Carlo estimator \hat{I}(R,S) of Section 3 with B = 3. Results are plotted in Figure 2. We find that for large N, I_{iso}(S,T) tightly approximates I_{vector}(S,T) and both are accurate approximations to \hat{I}(R,S), strongly outperforming I_{kw}(R,S). We note that for this case the upper bound of log(N_{stim}) = log(8,000) ≈ 9 nats is well above all of the curves other than I_{kw}(R,S), which is already known to be an upper bound to I(R,S), demonstrating that we are in the well-sampled regime. Once again we see that inequalities (29) and (32) hold.

5.2. Correlated stimulus distributions

We now consider the case of correlated Gaussian stimuli and model P(s) as a zero-mean Gaussian with a full-rank, non-diagonal covariance matrix C. To better understand the effects of stimulus correlations we also perform computations in stimulus bases where components are independent. For this, we decompose C as C = V\Lambda V^T, where V is an orthogonal matrix whose columns are the eigenvectors of C, and define

\hat{S} = V^T S, \qquad \hat{T} = V^T T.   (37)

Note that we have \hat{t} \cdot \hat{s} = t \cdot s. It is easy to see that P(\hat{t} | \hat{s}) is also an exponential family. Additionally, because the mappings from s to \hat{s} and from t to \hat{t} are diffeomorphisms, the information is preserved:

Ivector(S,T)=Ivector(S^,T^). (38)

We note that while Eq. (38) holds in principle, in practice we may see variation because the KSG family of estimators is not invariant under diffeomorphisms. Importantly, we also note that Eq. (25) holds for (\hat{S}, \hat{T}). Given samples from P(s,t), we automatically have samples from P(\hat{s},\hat{t}). We can straightforwardly generalize I_{vector}(S,T), I_{comp-cond}(S,T), I_{comp-ind}(S,T), and I_{iso}(S,T) to I_{vector}(\hat{S},\hat{T}), I_{comp-cond}(\hat{S},\hat{T}), I_{comp-ind}(\hat{S},\hat{T}), and I_{iso}(\hat{S},\hat{T}), respectively. Eq. (38) does not generalize to I_{iso}, I_{comp-cond}, or I_{comp-ind}, as they are not expressible as mutual information quantities between two variables. We note that I(R,S) = I(R,\hat{S}), so we do not modify \hat{I}(R,S).
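A short sketch of the decorrelating transformation in Eq. (37), under the assumption that samples of S and T are available as arrays; the covariance matrix values and the stand-in T samples are illustrative only. The assertion checks that the inner product s \cdot t, and hence I_{vector}, is unchanged.

    import numpy as np

    rng = np.random.default_rng(4)
    D, n = 3, 5000
    C = np.array([[1.747, 1.310, 0.874],
                  [1.310, 1.747, 1.310],
                  [0.874, 1.310, 1.747]])     # illustrative values, cf. Eq. (39)
    S = rng.multivariate_normal(np.zeros(D), C, size=n)
    T = rng.standard_normal((n, D))           # stand-in for samples of the sufficient statistic

    lam, V = np.linalg.eigh(C)                # C = V diag(lam) V^T
    S_hat, T_hat = S @ V, T @ V               # row-wise V^T s and V^T t, Eq. (37)

    # V is orthogonal, so s.t is preserved and I_vector(S,T) = I_vector(S_hat,T_hat), Eq. (38)
    assert np.allclose(np.sum(S * T, axis=1), np.sum(S_hat * T_hat, axis=1))
    print(np.round(np.cov(S_hat.T), 3))       # approximately diagonal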

Simulations in Figure 3 were done using the following stimulus covariance matrix

C = \begin{pmatrix} 1.74716093 & 1.3103707 & 0.87358046 \\ 1.3103707 & 1.74716093 & 1.3103707 \\ 0.87358046 & 1.3103707 & 1.74716093 \end{pmatrix}.   (39)

For this choice of C ρ1,2 = ρ2,3 = 0.75, ρ1,3 = 0.5, and |C| = 1. The covariance matrix of S^ is diagonal:

\hat{C} = \begin{pmatrix} 0.283 & 0 & 0 \\ 0 & 0.868 & 0 \\ 0 & 0 & 4.032 \end{pmatrix}.   (40)

The receptive field configurations are the same as in Section 5.1. We note that in the s coordinates all components have the same variance, whereas this symmetry is broken in the decorrelated components \hat{s}. We compared sorting the components of \hat{s} in increasing and decreasing order of variance (triangles and squares, respectively) in Figure 3A–C. Component order does not matter for I_{comp-ind}(\hat{S},\hat{T}), Figure 3D. We find that for both I_{vector} and I_{iso} it is optimal to perform the computation in the original basis, with both quantities accurately matching \hat{I}(R,S). For I_{comp-cond} and I_{comp-ind}, accuracy is increased by using decorrelated components and, for I_{comp-cond}, by sorting components in order of decreasing variance.

5.3. Highly asymmetric receptive field distributions

Next, we consider a small population (N = 9) with a highly asymmetric distribution of redundant receptive fields in two stimulus dimensions. In particular we are interested in a population where many different configurations of R map to the same configuration of T, demonstrating the utility of using T as a non-trivial sufficient statistic of R. With this in mind, we chose a heavily redundant configuration of w_n: w_n = (0,1) for n = 1, 2, 3; w_n = (1,0) for n = 4, 5, 6; and w_n = (1,1) for n = 7, 8, 9. The cardinalities of R, T, T_1, and T_2 are 512, 37, 7, and 7, respectively. Because N is small, ground truth values of I(R,S) were computed by exactly evaluating P(r | s) for all r \in \{-1,1\}^N, for every sample of s. Given P(r | s) we average across s to get P(r) explicitly, and calculate H(R), H(R | S), and I(R,S) from these distributions. P(s) is Gaussian with a diagonal covariance matrix, in accordance with the assumption behind (32). We set N_{stim} = 10,000. To test how the relative values of I(S_1,T_1) and I(S_2,T_2) impact the optimal ordering of components in I_{comp-cond}(S,T), we fixed \sigma_2 = 1 and varied \sigma_1 between 0.5 and 2.5. Results are plotted in Figure 4. As predicted, the hierarchy of bounds (29) and (32) holds for all \sigma_1. It is also notable that in this case, just like in the case of large neural populations, for computing I_{comp-cond}(S,T) it always seems best to start the information computation with the stimulus component that has the largest variance. As expected, the ordering of components does not strongly impact I_{vector}(S,T).

Figure 4:

Information curves for the example population with highly redundant RFs from Sec. 5.3. Lines and errorbars are mean and standard deviation over ten repeats of the estimator. Although neither the component-conditional nor the component information is guaranteed to tightly approximate the information, both provide good approximations to the full information (as estimated via the unbiased Monte Carlo method), reaching ≥ 80% of the maximum. For the vector sufficient statistic, both component orderings accurately reproduced the full information. The next best approximation to the full information is provided by the component-conditional computation with components added in order from largest to smallest variance. This approximation reaches accuracy within 95% of the full value over the range of neural population sizes.

5.4. Experimental stimuli and receptive fields

In the previous three experimental sections we considered synthetic distributions of low-dimensional stimuli and artificial configurations of receptive fields. In this section we analyze a population of model neurons with receptive fields and α values that were fit to responses of primary visual cortex neurons (V1) elicited by natural stimuli (Sharpee et al., 2006). We use 147 pairs of wn,αn values that were fit using the Maximally Informative Dimension (MID) algorithm as in (Sharpee et al., 2006). Stimuli are 10 pixel by 10 pixel patches extracted from the same set of images used to fit the model parameters. Receptive fields are normalized and centered on the patch, and we chose a 10 × 10 sub-patch of the original 32 × 32 shaped receptive fields so that all dimensions are well sampled by receptive fields. That is, for all pixels of the 10 × 10 patch, at least 115 of the 147 neurons have a nonzero value in the corresponding component of their receptive field. Additionally, the stimuli are z-scored by subtracting the mean and dividing by the standard deviation, with both quantities computed across all samples and pixels collectively.

Because of the high stimulus dimensionality we could only compute the I_{comp-ind}(S,T) bound (via the KSG estimator) and the ground truth information (via the Monte Carlo method). Because the pixels of natural image patches are clearly not independent, we also computed I_{comp-ind} in two additional coordinate systems. The first coordinate system is simply the linearly decorrelated components \hat{S} and \hat{T} defined in Eq. (37). The second coordinate system uses independent components derived by applying independent component analysis to S:

\tilde{S} = U S, \qquad \tilde{T} = (U^{-1})^T T.   (41)

Here, U is an unmixing matrix computed using Infomax ICA (Bell and Sejnowski, 1997) on the samples of S. As is done in (Bell and Sejnowski, 1997), U includes a linear whitening matrix. We note that in Eq. (41), T is multiplied by (U^{-1})^T and not U because the ICA unmixing matrix is generally not orthogonal and we require that t \cdot s = \tilde{t} \cdot \tilde{s}. As before, I_{vector}(S,T) = I_{vector}(\hat{S},\hat{T}) = I_{vector}(\tilde{S},\tilde{T}). However, the same cannot be said for I_{comp-ind} expressed in different coordinate systems.

To evaluate the effect of using different coordinate systems to evaluate I_{comp-ind} for different population sizes, we first ranked the 147 neurons in descending order by the information each neuron carried about the stimulus. We computed I(R_n, S) for n \in \{1, \ldots, 147\}, which is easily done exactly since R_n is a binary variable, and then sorted neurons so that I(R_n, S) \geq I(R_{n+1}, S) for all n. We considered populations of size N = 60, 70, \ldots, 140, where each population contained the first N neurons under the aforementioned ordering. For each value of N we computed \hat{I}(R,S) (B = 3), I_{comp-ind}(S,T), I_{comp-ind}(\hat{S},\hat{T}), and I_{comp-ind}(\tilde{S},\tilde{T}). We note that log(N_{stim}) = log(49,152) ≈ 10.8 nats. Results are plotted in Figure 5.
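The single-neuron informations used for this ranking have a simple closed form for binary neurons, I(R_n, S) = H(R_n) - \langle H(R_n | s) \rangle_s, with the stimulus average taken over the sample set. A sketch with illustrative (not experimentally fitted) parameters:

    import numpy as np

    def binary_entropy(p):
        p = np.clip(p, 1e-12, 1 - 1e-12)
        return -(p * np.log(p) + (1 - p) * np.log1p(-p))

    rng = np.random.default_rng(5)
    N, D, n_stim = 20, 4, 5000
    w = rng.standard_normal((N, D))
    alpha = rng.standard_normal(N)
    S = rng.standard_normal((n_stim, D))

    p = 1.0 / (1.0 + np.exp(-2.0 * (S @ w.T - alpha)))   # P(r_n = 1 | s^mu), Eq. (1)
    I_n = binary_entropy(p.mean(axis=0)) - binary_entropy(p).mean(axis=0)  # nats
    order = np.argsort(I_n)[::-1]             # neurons sorted by information, descending
    print(I_n[order][:5])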

Figure 5:

Information curves for populations based on experimentally recorded RFs and probed with D = 100 natural visual stimuli. The full information (solid black line) is computed via the Monte Carlo method and is compared to I_{comp-ind} approximations computed in two different bases: the PCA basis (blue dashed line) and the ICA basis (blue dotted line). The calculation in the original pixel basis, I_{comp-ind}(S,T), is omitted because it yielded values of ~25 nats across the range of population sizes and obscured the other curves. Because of the non-Gaussian statistics of natural stimuli, PCA components remain correlated, and as a result the approximation is no longer guaranteed to be a lower bound of the true information. Computation in the ICA basis respects the lower bound requirements, and achieves ≥ 75% of the full information across the range of population sizes.

We observe that both Icomp-ind (S,T) and Icomp-ind (S^,T^) overestimate the true information, especially Icomp-ind (S,T). This overestimation occurs because stimulus components are not independent. By comparison, computation performed in the ICA basis, Icomp-ind (S˜,T˜), lower bounds the mutual information for all N, achieving ≥ 75% of the full information across the range of population sizes.

5.5. Handling intrinsically correlated neurons

In order to simplify derivations, we assumed that the neural responses were independent after conditioning on s. However, all of the analytic results in Section 4.2 can be extended to specific forms of intrinsic interneuronal correlation to allow for the presence of correlations in neural responses for a given stimulus s. Formally, we modify the base measure h(r) to include a pairwise coupling term:

h(r, J) = \exp\left( \sum_{m \neq n} J_{mn} r_m r_n - \sum_n r_n \alpha_n \right).   (42)

In Eq. 42, J is a symmetric N × N matrix where Jmn describes the intrinsic coupling between the mth and nth neurons. In this case P(rs,J) can still be written as an exponential family in a canonical form:

P(r | s) = h(r, J) \exp\left( s \cdot t(r) - A(s, J) \right),
A(s, J) = \log \sum_r h(r, J) \exp\left( s \cdot t(r) \right).   (43)

The form of the sufficient statistic remains unchanged, though A(s,J) generally lacks a closed form. Nevertheless, all of the decompositions, equalities, and inequalities in Section 4.2 require only that the exponential family be in canonical form, and so they remain valid.

We tested the accuracy of our approximations on a small (N = 10) population of intrinsically correlated neurons. Receptive fields are uniformly distributed on the unit circle, \alpha_n = 0 for all n, and P(s) is a standard two-dimensional Gaussian (N_{stim} = 20,000). Intrinsic coupling is set proportional to the overlap between receptive fields with a coupling strength J_0, the sign of which determines whether the intrinsic couplings perform stimulus decorrelation or error correction (Tkačik et al., 2010):

J_{mn} = J_0\, w_m \cdot w_n.   (44)

The algorithms of Sections 3 and 5 all depend on being able to sample easily from P(r | s). For large N and general J this is usually difficult, particularly for configurations of J that exhibit glassy dynamics. Additionally, evaluating Eq. (9) requires explicit knowledge of A(s, J), though methods such as mean-field theory or the TAP approximation may be used to approximate it (Opper et al., 2001). Since this population is small, we evaluate the ground truth information exactly as in Section 5.3. Likewise, we sample r from P(r | s) exactly by computing all 2^N (1,024) probabilities for every sample of s. Analyzing the error introduced by using approximate sampling strategies such as Markov Chain Monte Carlo is left to future investigation. Results are plotted in Figure 6. As predicted, I_{vector}(S,T) matches the ground truth values of the information. Similarly, the hierarchy of bounds (29) and (32) is preserved, though for strongly negative couplings I_{comp-cond}(S,T) ≈ I_{comp-ind}(S,T). In sum, the presence of noise correlations does not invalidate the approximations and bounds that are derived above. However, numerical computation can become more difficult in the presence of noise correlations.
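The following sketch shows the exact-enumeration strategy used for a small coupled population (Eqs. (42)–(44)): all 2^N states are listed, which yields A(s, J), the normalized P(r | s, J), and exact samples of r. The parameter values (including J_0) are illustrative assumptions.

    import numpy as np
    from itertools import product
    from scipy.special import logsumexp

    rng = np.random.default_rng(6)
    N, D, J0 = 10, 2, -0.1
    theta = 2 * np.pi * np.arange(N) / N
    w = np.column_stack([np.cos(theta), np.sin(theta)])   # RFs on the unit circle
    J = J0 * (w @ w.T)                                    # Eq. (44)
    np.fill_diagonal(J, 0.0)
    alpha = np.zeros(N)

    all_r = np.array(list(product([-1, 1], repeat=N)), dtype=float)   # 2^N states
    log_h = np.einsum('km,mn,kn->k', all_r, J, all_r) - all_r @ alpha  # log h(r,J), Eq. (42)

    def conditional_distribution(s):
        """Return P(r | s, J) over all states together with A(s, J), Eq. (43)."""
        log_unnorm = log_h + all_r @ (w @ s)       # log[ h(r,J) exp(s . t(r)) ]
        A = logsumexp(log_unnorm)                  # log-partition A(s, J)
        return np.exp(log_unnorm - A), A

    s = rng.standard_normal(D)
    P_r, A_sJ = conditional_distribution(s)
    r_sample = all_r[rng.choice(len(all_r), p=P_r)]    # exact sample from P(r|s,J)
    print(A_sJ, r_sample)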

Figure 6:

Information curves for populations with intrinsic correlations. Lines and errorbars are mean and standard deviation over ten repeats of the estimator.

6. Conclusions and Future Work

We have presented three approximations that can be used to estimate the information transmitted by large neural populations. Each of these three approximations represents a different trade-off between accuracy and computational feasibility. The best performance in terms of accuracy was provided by the isotropic approximation I_{iso}. This approximation worked well even in cases where it is not guaranteed to become asymptotically exact with increasing population size. For example, the isotropic approximation was derived assuming a matching covariance matrix for both the stimulus and RF distributions, cf. Appendix B. Yet this approximation matched the full ground-truth information values even for correlated stimuli with RFs remaining uncorrelated across the neural population (Fig. 3). This approximation provided the best overall performance among the bounds tested, consistent with the theoretically expected inequalities between these bounds, cf. Eq. (29).

The component-conditional information I_{comp-cond} offered the second-best performance. This approximation performed especially well when computed in a stimulus basis where stimulus components were not correlated. It is less computationally demanding than I_{iso}, because each conditional information is evaluated between just the two quantities S_d and T_d, compared to S_d and the conjunction of T_d and |T_{>d}| as in I_{iso}. For this reason, the finite-sample bias of I_{comp-cond} can also be smaller than that of I_{iso}, because the bias in the evaluation of the mutual information is usually larger for higher-dimensional calculations.

The last approximation, I_{comp-ind}, is the least accurate of the three approximations but is computationally the easiest. It is the only approximation among the three considered here that we were able to implement in conjunction with high-dimensional stimuli. This approximation becomes most accurate in stimulus bases where the stimulus components are independent. There is strong evidence that neural receptive fields are organized along the ICA components of natural stimuli (Bell and Sejnowski, 1997; Olshausen and Field, 2004; Smith and Lewicki, 2006). This raises the possibility that the approaches proposed here will fare well when applied to recorded neural responses. Indeed, we found that I_{comp-ind} reached ≥ 75% of the full information value for large neural populations constructed using experimentally recorded RFs and probed with natural stimuli.

At present, the main limitation for computing the conditional approximations I_{comp-cond} and I_{iso} is not the number of neurons but rather the stimulus dimensionality. For stimulus distributions where P(s_{\geq d} | s_{<d}) can be easily sampled from, such as Gaussian distributions, we can take advantage of the fourth line of Eq. (20) to compute unbiased estimates of I_{comp-cond} and I_{iso}, albeit with possibly high variance. Developing methods that can efficiently approximate these conditional computations represents an important opportunity for future research.

7. Acknowledgements

This research was supported by NSF grant IIS-1724421.

A. Bias of H^(R)

In this section we give a self-contained proof that \hat{H}(R) systematically underestimates the "true" entropy H(R). We first assume that P(\{s^\mu\}) = \prod_\mu P(s^\mu): the s^\mu are drawn independently from P(s), whether P(s) is a smooth density on \mathbb{R}^D or some larger set of samples. We define an empirical version of the marginal distribution P(r):

\hat{P}(r; \{s^\mu\}) = \frac{1}{N_{stim}} \sum_\mu P(r | s^\mu).   (45)

The true marginal distribution is the expected value of \hat{P}(r; \{s^\mu\}): \langle \hat{P}(r; \{s^\mu\}) \rangle_{P(\{s^\mu\})} = P(r). The Shannon entropy is a concave function of \hat{P}(r; \{s^\mu\}), which can be considered a random vector in the 2^N-dimensional probability simplex. Thus, by Jensen's inequality we have the following:

\langle \hat{H}(R) \rangle_{P(\{s^\mu\})} \equiv \langle H(R; P = \hat{P}) \rangle_{P(\{s^\mu\})} \leq H(R).   (46)

This bias holds even in the case of evaluating \hat{H}(R) through exact enumeration. We note that we are able to produce unbiased estimates of \hat{H}(R) because we have full access to P(r | s^\mu): we can evaluate P(r | s^\mu) explicitly and deterministically, and thus \hat{F}(r) as well (up to factors of numerical precision). If we were instead constrained to drawing samples from P(r | s^\mu), then we would once again be limited to making biased estimates of \hat{H}(R) (Paninski, 2003).

B. On the asymptotic tightness of Iiso(S,T)

Consider a large population (N ≫ 1) where the distribution of w and \alpha is such that A(s) = A(|s|) (in a sense to be made more precise later). Consider the likelihood ratio in the definition of (25):

\frac{P(t_{\geq d} | s_d, s_{<d})}{P(t_{\geq d} | s_{<d})} = \frac{\langle \exp\left( s_d t_d + s_{>d} \cdot t_{>d} - A(s) \right) \rangle_{s_{>d} | s_{\leq d}}}{\langle \exp\left( s_d t_d + s_{>d} \cdot t_{>d} - A(s) \right) \rangle_{s_{\geq d} | s_{<d}}} = \frac{\langle \exp\left( s_d t_d + s_{>d} \cdot t_{>d} - A(|s|) \right) \rangle_{s_{>d} | s_{\leq d}}}{\langle \exp\left( s_d t_d + s_{>d} \cdot t_{>d} - A(|s|) \right) \rangle_{s_{\geq d} | s_{<d}}}.   (47)

Additionally we consider a stimulus distribution that is similarly isotropic so that the conditional distribution P(s>d|sd) can be written in a convenient factored form:

P(s) = P(|s|) = P\left( \sqrt{ |s_{<d}|^2 + s_d^2 + |s_{>d}|^2 } \right),
P(s_{>d} | s_{\leq d}) = \frac{P(s)}{P(s_{\leq d})} = \frac{P\left( \sqrt{ |s_{<d}|^2 + s_d^2 + |s_{>d}|^2 } \right)}{P(s_{\leq d})}.

We will show that in this situation, we can replace T_{\geq d} in (25) with the variable that is the concatenation of T_d and |T_{>d}| without loss of information:

I(S_d, T_{\geq d} | s_{<d}) = I(S_d, \{T_d, |T_{>d}|\} | s_{<d}).   (48)

To show this using the Fisher-Neyman factorization theorem, it suffices to show that the numerator and denominator in (47) can be factored as follows:

\langle \exp\left( s_d t_d + s_{>d} \cdot t_{>d} - A(|s|) \right) \rangle_{s_{>d} | s_{\leq d}} = g_1(s_{<d}, s_d, t_d, |t_{>d}|)\, g_2(t_{>d}),
\langle \exp\left( s_d t_d + s_{>d} \cdot t_{>d} - A(|s|) \right) \rangle_{s_{\geq d} | s_{<d}} = f_1(s_{<d}, t_d, |t_{>d}|)\, f_2(t_{>d}),   (49)

With the requirement that f2(t>d) = g2(t>d), so that dependence on t>d cancels out in (47). We note that the first line of (49) implies the second so we examine that term in more detail.

\langle \exp\left( s_d t_d + s_{>d} \cdot t_{>d} - A(|s|) \right) \rangle_{s_{>d} | s_{\leq d}} = \exp(s_d t_d) \int \exp\left( s_{>d} \cdot t_{>d} - A\left( \sqrt{|s_{<d}|^2 + s_d^2 + |s_{>d}|^2} \right) \right) P(s_{>d} | s_{\leq d})\, ds_{>d} = \frac{\exp(s_d t_d)}{P(s_{\leq d})} \int \exp\left( s_{>d} \cdot t_{>d} - A\left( \sqrt{|s_{<d}|^2 + s_d^2 + |s_{>d}|^2} \right) \right) P\left( \sqrt{|s_{<d}|^2 + s_d^2 + |s_{>d}|^2} \right) ds_{>d}.   (50)

We note that s_{>d} is a K = (D − d)-dimensional vector. We assume that d < D − 1, so that K > 1; otherwise no further reduction of (50) is possible. We convert the integral over \mathbb{R}^K in (50) into spherical coordinates and break it into three parts: integration over \rho \in [0, \infty), where |s_{>d}| = \rho; integration over \theta \in [0, \pi], where s_{>d} \cdot t_{>d} = \rho |t_{>d}| \cos(\theta); and integration over \varphi \in \Omega_{K-1}, the set of all directions in \mathbb{R}^K with constant \theta. The integrand of (50) does not depend on \varphi, so we can integrate over it immediately, yielding a constant B_K that is a function only of K. We can now restate (50) in these coordinates:

= \frac{B_K \exp(s_d t_d)}{P(s_{\leq d})} \int_0^\infty \int_0^\pi d\rho\, d\theta\, \rho^{K-1} \sin^{K-2}(\theta)\, P\left( \sqrt{|s_{<d}|^2 + s_d^2 + \rho^2} \right) \exp\left( \rho |t_{>d}| \cos(\theta) - A\left( \sqrt{|s_{<d}|^2 + s_d^2 + \rho^2} \right) \right)
= \frac{B_K \exp(s_d t_d)}{P(s_{\leq d})} \int_0^\infty d\rho\, \rho^{K-1} P\left( \sqrt{|s_{<d}|^2 + s_d^2 + \rho^2} \right) \exp\left( -A\left( \sqrt{|s_{<d}|^2 + s_d^2 + \rho^2} \right) \right) \int_0^\pi d\theta\, \sin^{K-2}(\theta) \exp\left( \rho |t_{>d}| \cos(\theta) \right).   (51)

We next evaluate the integral over θ in (51).

\int_0^\pi d\theta\, \sin^{K-2}(\theta) \exp\left( \rho |t_{>d}| \cos(\theta) \right) = \frac{\sqrt{\pi}\, \Gamma\!\left( \frac{K-1}{2} \right)}{\Gamma\!\left( \frac{K}{2} \right)}\, {}_0F_1\!\left( \frac{K}{2}; \frac{\rho^2 |t_{>d}|^2}{4} \right) \equiv F_K(\rho |t_{>d}|),   (52)

Where Γ(x) is the Gamma function, and 0F1(a, z) is the confluent hypergeometric limit function. We have our final expression for the first term in (49):

= \frac{B_K \exp(s_d t_d)}{P(s_{\leq d})} \int_0^\infty d\rho\, \rho^{K-1} P\left( \sqrt{|s_{<d}|^2 + s_d^2 + \rho^2} \right) \exp\left( -A\left( \sqrt{|s_{<d}|^2 + s_d^2 + \rho^2} \right) \right) F_K(\rho |t_{>d}|) = \frac{B_K \exp(s_d t_d)}{P(s_{\leq d})}\, g(s_{<d}, s_d, |t_{>d}|).   (53)

By setting g1 in (49) equal (53), and letting g2 = f2 = 1, we have established (48).

B.1. Approximating A(s) for Gaussian P(w)

In the previous section we assumed that the distributions P(w) and P(\alpha) are such that A(s) = A(|s|). In the special case when N ≫ 1, P(\alpha) = \delta(\alpha), and w is Gaussianly distributed, we can approximate A(s) in a semi-closed form. Let P(w) be a zero-mean Gaussian with positive-definite covariance matrix C:

A(s) = N \int dw\, \frac{\exp\left( -\frac{1}{2} w^T C^{-1} w \right)}{\sqrt{\det(2\pi C)}} \log\left( 2\cosh(w \cdot s) \right) = N \int dx\, \frac{\exp\left( -\frac{x^2}{2\sigma_x^2} \right)}{\sqrt{2\pi \sigma_x^2}} \log\left( 2\cosh(x) \right),   (54)
\sigma_x = \sqrt{ s^T C s },   (55)

where we have taken advantage of the fact that w \cdot s is a scalar Gaussian variable with a standard deviation that depends on s and C. We next take an infinite series expansion of \log(2\cosh(x)):

\log\left( 2\cosh(x) \right) = |x| + \log\left( 1 + \exp(-2|x|) \right) = |x| + \sum_{m=1}^{\infty} \frac{(-1)^{m+1}}{m} \exp(-2m|x|).   (56)

As an aside, the first equality in (56) is a useful and numerically stable expression for A_n(s). The "softplus" function l(y) = \log(1 + \exp(y)) is implemented in many scientific computing packages, and using this alternative form sidesteps computing the hyperbolic cosine directly. We next take the appropriate Gaussian average of each term in (56):

\frac{1}{\sqrt{2\pi\sigma_x^2}} \int dx\, \exp\left( -\frac{x^2}{2\sigma_x^2} \right) |x| = \sqrt{\frac{2}{\pi}}\, \sigma_x,   (57)
\frac{1}{\sqrt{2\pi\sigma_x^2}} \int dx\, \exp\left( -\frac{x^2}{2\sigma_x^2} \right) e^{-2m|x|} = \mathrm{erfcx}\left( \sqrt{2}\, m\, \sigma_x \right),   (58)

Where erfcx(y) is the scaled complementary error function. Thus we have our final form for A(s):

A(s) = N \sqrt{\frac{2}{\pi}}\, \sigma_x + N \sum_{m=1}^{\infty} \frac{(-1)^{m+1}}{m}\, \mathrm{erfcx}\left( \sqrt{2}\, m\, \sigma_x \right).   (59)

We note that erfcx(y) monotonically decreases to zero, so for large values of s^T C s, A(s) is well approximated by the first term in (59). Regardless, we see that A(s) depends on s only through s^T C s:

A(s) = A\left( \sqrt{ s^T C s } \right) = A(|U s|),   (60)

where U = C^{1/2} is the Cholesky factor of C (so that U^T U = C).
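As a numerical check of this appendix's result, the sketch below compares the series of Eq. (59) (per neuron, i.e. A(s)/N) against a direct Monte-Carlo average over Gaussian receptive fields, using scipy's erfcx. The covariance matrix and sample sizes are illustrative assumptions.

    import numpy as np
    from scipy.special import erfcx

    def A_over_N_series(sigma_x, m_max=200):
        m = np.arange(1, m_max + 1)
        return (np.sqrt(2.0 / np.pi) * sigma_x
                + np.sum(((-1.0) ** (m + 1) / m) * erfcx(np.sqrt(2.0) * m * sigma_x)))

    rng = np.random.default_rng(7)
    D = 3
    C = np.diag([1.0, 2.0, 0.5])
    s = rng.standard_normal(D)
    sigma_x = np.sqrt(s @ C @ s)               # Eq. (55)

    w = rng.multivariate_normal(np.zeros(D), C, size=200000)
    mc = np.mean(np.log(2.0 * np.cosh(w @ s))) # direct average of log 2 cosh(w . s)
    print(A_over_N_series(sigma_x), mc)        # the two values should closely agree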

B.2. Generalizing Iiso(S,T) for matched anisotropy

In this section we will show that a generalized form of I_{iso}(S,T) also asymptotically converges to I_{vector}(S,T) when both P(s) and A(s) obey a certain form of "matched" anisotropy. Specifically, we assume that P(s) and A(s) depend on s through a quadratic function of s with positive-definite kernel C:

P(s) = P\left( \sqrt{ s^T C s } \right) = P(|U s|),   (61)
A(s) = A\left( \sqrt{ s^T C s } \right) = A(|U s|),   (62)

where, as in Section B.1, U = C^{1/2} is the Cholesky factor of C. Let us define transformed versions of S and T:

\tilde{S} = U S,   (63)
\tilde{T} = (U^{-1})^T T,   (64)

where (U^{-1})^T is the transpose of the inverse of U, which is well defined since C is positive definite. We note that \tilde{t} \cdot \tilde{s} = t \cdot s and that P(\tilde{t} | \tilde{s}) can once again be written as an exponential family in canonical form:

P(\tilde{t} | \tilde{s}) = \exp\left( \tilde{s} \cdot \tilde{t} - A(\tilde{s}) \right) h(\tilde{t}).   (65)

As the mappings from S to S˜ and T to T˜ are diffeomorphisms, we have that Ivector(S,T)=Ivector(S˜,T˜). Furthermore, equation (65) implies that an analogous form of equation (26) holds for Ivector(S˜,T˜):

I(\tilde{S}_d, \tilde{T} | \tilde{S}_{<d}) = I(\tilde{S}_d, \tilde{T}_{\geq d} | \tilde{S}_{<d}).   (66)

Additionally, A(\tilde{s}) = A(|\tilde{s}|) and P(\tilde{s}) = P(|\tilde{s}|). Thus, we may reuse the derivation leading to Eq. (48) to obtain the following analogue of Eq. (48):

I(\tilde{S}_d, \tilde{T}_{\geq d} | \tilde{s}_{<d}) = I(\tilde{S}_d, \{\tilde{T}_d, |\tilde{T}_{>d}|\} | \tilde{s}_{<d}).   (67)

Therefore Iiso(S˜,T˜) is asymptotically equal to Ivector(S,T):

I_{iso}(\tilde{S}, \tilde{T}) = \sum_d I(\tilde{S}_d, \{\tilde{T}_d, |\tilde{T}_{>d}|\} | \tilde{S}_{<d}) \to I_{vector}(S,T).   (68)

We note that an example of such a matched-anisotropy situation would be where both the stimuli and the receptive fields (for a large population) are distributed according to a Gaussian distribution with covariance matrix C (cf. Section B.1).

C. Independent Sub-populations

In this section we present an example where one of our proposed approximations, I_{comp-ind}(S,T) in this case, is equal to I_{vector}(S,T). Let \hat{e}_1, \ldots, \hat{e}_D be an orthonormal basis for \mathbb{R}^D. Suppose that the distributions of s and w are such that both P(s) and A(s) factor as in Eq. (31) when expressed in this basis, with components s'_k = s \cdot \hat{e}_k.

Similarly defining t'_k = t \cdot \hat{e}_k, we have that t' \cdot s' = t \cdot s. Because mutual information is invariant under bijective transformations of the variables (e.g. a change of basis) (Cover and Thomas, 2012), we have that I_{vector}(S,T) = I_{vector}(S',T'). It is easy to show that P(t' | s') can be written as follows:

P(t' | s') = h(t') \prod_k \exp\left( s'_k t'_k - A(s'_k) \right).   (69)

Eq. (69) implies that the log-likelihood ratio of P(t' | s') to P(t') decomposes across components:

\log \frac{P(t' | s')}{P(t')} = \sum_k \log \frac{P(t'_k | s'_k)}{P(t'_k)}.   (70)

Thus we have the following reduction of Ivector(S,T):

I_{vector}(S,T) = I_{vector}(S',T') = \sum_k I(S'_k, T'_k) = I_{comp-ind}(S',T').   (71)

We note that (31) includes the case where, for some k, w_n \cdot \hat{e}_k = 0 for all n. In such a case A(s'_k) = A(0) = log(2), t'_k = 0 with probability one, and I(S'_k, T'_k) = 0. Thus, in the case of independent subpopulations, I_{vector}(S,T) can be reduced to computing I_{comp-ind}(S,T) following a change of basis.

D. Relationship between Ikw(R,S) and IFisher(R,S)

In this appendix we relate Ikw(R,S) to the Fisher Information based approximation of (Brunel and Nadal, 1998):

I_{Fisher}(R,S) = H(S) + \frac{1}{2} \int ds\, P(s) \log \frac{|J(s)|}{(2\pi e)^D},   (72)

where J(s) is the Fisher information matrix of P(r | s):

J_{ab}(s) = -\left\langle \frac{\partial^2}{\partial s_a \partial s_b} \log P(r | s) \right\rangle_{P(r|s)}.   (73)

We begin by considering the inner integral over s' in I_{kw}(R,S):

L(s) = \int ds'\, P(s') \exp\left( -D_{KL}\left( P(r|s) \,\|\, P(r|s') \right) \right).   (74)

We next assume that the activation function f_n(s) is affine (e.g. Eq. (13)), and thus in canonical form. We also assume that P(r | s) is identifiable:

D_{KL}\left( P(r|s) \,\|\, P(r|s') \right) = 0 \iff s = s'.   (75)

For (13), a necessary and sufficient condition for identifiability is that the matrix W has full rank, a reasonable assumption when N ≫ D. We utilize the following properties of exponential families in canonical form:

  1. \frac{\partial^2}{\partial s'_a \partial s'_b} D_{KL}\left( P(r|s) \,\|\, P(r|s') \right) = \frac{\partial^2}{\partial s'_a \partial s'_b} A(s') = J_{ab}(s').

  2. A(s') is convex and J(s') is positive semi-definite. When P(r|s) is identifiable, A(s') becomes strictly convex, J(s') becomes positive definite, and D_{KL}(P(r|s) \| P(r|s')) has a global minimum with respect to s' of 0 at s' = s.

In the limit N ≫ D we approximate L(s) using Laplace's method (Bender and Orszag, 1999), expanding around s' = s:

L(s) \approx P(s) \sqrt{ \frac{(2\pi)^D}{|J(s)|} }.   (76)

Plugging (76) into the definition of Ikw(R,S) we have the following asymptotic expression for Ikw(R,S):

I_{kw}(R,S) \approx H(S) + \frac{1}{2} \int ds\, P(s) \log \frac{|J(s)|}{(2\pi)^D} = I_{Fisher}(R,S) + \frac{D}{2}.   (77)

For stimulus distributions where the entropy H(S) is known a priori, such as the Gaussian distributions in Sections 5.1 and 5.2, I_{Fisher}(R,S) can be computed in O(N_{stim}) time. If not, then H(S) must be estimated, a challenging task in high dimensions. In Figure 7, we replot the results of Section 5.1 with the inclusion of I_{Fisher}(R,S). We see that I_{Fisher}(R,S) is a very loose upper bound of I(R,S) and of I_{kw}(R,S), indicating that the convergence of Laplace's method may be very slow in this situation.

Figure 7:

Information curves for the population of Section 5.1 compared to the Fisher approximation. Lines and errorbars are mean and standard deviation over ten repeats of the estimator. \hat{I}(R,S) (solid black line), I_{kw}(R,S) (dashed black line), I_{vector}(S,T) (solid yellow line), I_{Fisher}(R,S) (dotted black line), I_{comp-cond}(S,T) (solid red line), I_{comp-ind}(S,T) (solid blue line), I_{iso}(S,T) (solid green line).

E. Extension to polynomial activation functions

In section 4.1 we assumed that the activation functions were affine functions of the stimulus vector s. In this appendix we will show how to generalize some of the results of section 4.1 to polynomial activation functions. For clarity of exposition we will demonstrate this generalization for quadratic functions as the procedure for higher order polynomials follows quickly. To begin, we add a quadratic term to Eq (12):

f_n(s) = s^T \gamma_n s + w_n \cdot s - \alpha_n.   (78)

Here \gamma_n \in \mathbb{R}^{D \times D} is a symmetric D × D matrix representing the quadratic kernel of the nth neuron's activation function. We note that s^T \gamma_n s is the sum of the entries of (s s^T) \circ \gamma_n, where A \circ B is the Hadamard product between equally shaped matrices A and B, and s s^T \in \mathbb{R}^{D \times D} is the outer product of s with itself. We define a vector embedding \psi(s) of s into \mathbb{R}^{D + D^2}:

\psi(s)_d = \begin{cases} s_d & \text{if } d \leq D, \\ s_a \cdot s_b & \text{if } d > D, \end{cases}

where a = d mod D and b = ⌊d/D⌋ are index mappings that flatten a D × D matrix into a vector of length D^2. We define a similar vector embedding \tau_n(w_n, \gamma_n) of w_n and \gamma_n:

\tau_n(w_n, \gamma_n)_d = \begin{cases} w_{n,d} & \text{if } d \leq D, \\ \gamma_{n,ab} & \text{if } d > D. \end{cases}

For the sake of brevity we henceforth omit the dependence of τn on wn and γn. By construction we have the following equivalence:

s^T \gamma_n s + w_n \cdot s = \tau_n \cdot \psi(s).   (79)

If all neurons have activation functions of the form in (78), then P(r | s) may once again be written as an exponential family:

P(r | s) = h(r) \exp\left( t^{quad}(r) \cdot \psi(s) - A(s) \right),
A(s) = \sum_n \log 2\cosh\left( \tau_n \cdot \psi(s) - \alpha_n \right),
t^{quad}(r) = \sum_n \tau_n r_n.   (80)

Thus t^{quad}(r) \in \mathbb{R}^{D + D^2} is the sufficient statistic for this family, and \psi(s) is the natural parameter. As before, I(R,S) = I(T^{quad}, S). However, we note that P(r | s) can be written entirely in terms of \psi. Additionally, we note that the support of \psi lies on a D-dimensional manifold in \mathbb{R}^{D + D^2}, and s maps injectively onto this manifold. Thus I(T^{quad}, S) = I(T^{quad}, \Psi).

We note several properties of I(Tquad ,Ψ). First, we can in principle expand I(Tquad ,Ψ) like Eq. (19):

I(T^{quad}, \Psi) = \sum_{d=1}^{D + D^2} I(T^{quad}, \Psi_d | \Psi_{<d}).   (81)

Secondly, the same reduction as in Eq. (26) holds for I(T^{quad}, \Psi_d | \Psi_{<d}):

I(T^{quad}, \Psi_d | \Psi_{<d}) = I(T^{quad}_{\geq d}, \Psi_d | \Psi_{<d}).   (82)

Most notable, however, is that I(T^{quad}, \Psi_d | \Psi_{<d}) = I(T^{quad}_{\geq d}, \Psi_d | \Psi_{<d}) = 0 for d > D. This holds because \psi_d = g(\psi_{\leq D}) for d > D, where g(\psi_{\leq D}) is just the product of the two relevant components of \psi_{\leq D}. Because of this functional dependence we can simply apply the generalization of Eq. (22). Therefore the expansion of I(T^{quad}, \Psi) can be truncated after D terms:

I(T^{quad}, \Psi) = \sum_{d=1}^{D} I(T^{quad}_{\geq d}, \Psi_d | \Psi_{<d}) = \sum_{d=1}^{D} I(T^{quad}_{\geq d}, S_d | S_{<d}).   (83)

In fact, we can make an even stronger reduction by noting that conditioning on components of S = \Psi_{\leq D} effectively also conditions on elements of \Psi_{>D}. For clarity of exposition we break T^{quad} into vector- and matrix-valued components:

T^{quad} \equiv \{ T^{D}, T^{DD} \},
T^{quad}_{\leq D} \equiv T^{D} \in \mathbb{R}^D,
T^{quad}_{>D} \equiv T^{DD} \in \mathbb{R}^{D \times D}.

We note that conditioning on S_d conditions on the components of \psi corresponding to T^{D}_d and T^{DD}_{dd}. Additionally, conditioning on S_{<d} conditions on the components of \psi corresponding to T^{D}_{<d} and to T^{DD}_{d_1 d_2} for all indices d_1 and d_2 such that 1 ≤ d_1, d_2 < d. Thus Eq. (83) can be further reduced:

I(T^{quad}, S) = \sum_{d=1}^{D} I\left( \{ T^{D}_{\geq d},\ T^{DD}_{d_1 d_2} \text{ with } d_1 \geq d \text{ or } d_2 \geq d \},\ S_d \,\middle|\, S_{<d} \right).   (84)

The dth term in Eq. (83) has D^2 + D + 1 degrees of freedom, while the dth term in Eq. (84) has D^2 + D + 1 − (d−1)^2 degrees of freedom. The above procedure can be generalized to polynomial activation functions of arbitrarily high but finite order, though the dimensionality of the sufficient statistic and natural parameter grows exponentially with the order. However, Eq. (84) holds for any order of polynomial, so that one needs only to compute the first D terms of the expansion of the mutual information between the sufficient statistic and natural parameter.
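A small sketch of the quadratic embedding of Eqs. (78)–(79), with hypothetical helper names psi and tau introduced only for illustration: it verifies numerically that \tau_n \cdot \psi(s) reproduces s^T \gamma_n s + w_n \cdot s.

    import numpy as np

    def psi(s):
        # stack s with the flattened outer product s s^T, giving a vector in R^(D + D^2)
        return np.concatenate([s, np.outer(s, s).ravel()])

    def tau(w_n, gamma_n):
        # stack w_n with the flattened quadratic kernel gamma_n
        return np.concatenate([w_n, gamma_n.ravel()])

    rng = np.random.default_rng(8)
    D = 4
    s = rng.standard_normal(D)
    w_n = rng.standard_normal(D)
    G = rng.standard_normal((D, D))
    gamma_n = 0.5 * (G + G.T)                  # symmetric quadratic kernel

    lhs = s @ gamma_n @ s + w_n @ s
    rhs = tau(w_n, gamma_n) @ psi(s)
    assert np.isclose(lhs, rhs)                # Eq. (79)
    print(lhs, rhs, psi(s).shape)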

Contributor Information

John A. Berkowitz, Department of Physics, University of California San Diego, San Diego, CA 92093

Tatyana O. Sharpee, Computational Neurobiology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037 Department of Physics, University of California San Diego, San Diego, CA 92093.

References

  1. Banerjee A, Merugu S, Dhillon IS, and Ghosh J (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749.
  2. Barber D and Agakov F (2003). The IM algorithm: a variational approach to information maximization. In Proceedings of the 16th International Conference on Neural Information Processing Systems, pages 201–208. MIT Press.
  3. Belghazi I, Rajeswar S, Baratin A, Hjelm RD, and Courville A (2018). MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062.
  4. Bell AJ and Sejnowski TJ (1997). The "independent components" of natural scenes are edge filters. Vision Res, 23:3327–3338.
  5. Bender C and Orszag S (1999). Advanced Mathematical Methods for Scientists and Engineers I: Asymptotic Methods and Perturbation Theory. Springer.
  6. Berkowitz J and Sharpee T (2018). Decoding neural responses with minimal information loss. bioRxiv.
  7. Bialek W (2012). Biophysics: Searching for Principles. Princeton University Press, Princeton and Oxford.
  8. Brenner N, Strong SP, Koberle R, Bialek W, and de Ruyter van Steveninck RR (2000). Synergy in a neural code. Neural Comput, 12:1531–1552.
  9. Brunel N and Nadal JP (1998). Mutual information, Fisher information, and population coding. Neural Comput, 10(7):1731–1757.
  10. Cover TM and Thomas JA (2012). Elements of Information Theory. John Wiley and Sons.
  11. Deserno M (2004). How to generate equidistributed points on the surface of a sphere. P.-If Polymerforshung (Ed.), page 99.
  12. Dettner A, Münzberg S, and Tchumatchenko T (2016). Temporal pairwise spike correlations fully capture single-neuron information. Nature Communications, 7:13805.
  13. Gao S, Ver Steeg G, and Galstyan A (2015). Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics, pages 277–286.
  14. Gao W, Kannan S, Oh S, and Viswanath P (2017). Estimating mutual information for discrete-continuous mixtures. In Advances in Neural Information Processing Systems, pages 5986–5997.
  15. Gao W, Oh S, and Viswanath P (2018). Demystifying fixed k-nearest neighbor information estimators. IEEE Transactions on Information Theory.
  16. Haussler D, Opper M, et al. (1997). Mutual information, metric entropy and cumulative relative entropy risk. The Annals of Statistics, 25(6):2451–2492.
  17. Huang W and Zhang K (2018). Information-theoretic bounds and approximations in neural population coding. Neural Computation, (Early Access):1–60.
  18. Kolchinsky A, Tracey BD, and Wolpert DH (2017). Nonlinear information bottleneck. arXiv preprint arXiv:1705.02436.
  19. Kraskov A, Stögbauer H, and Grassberger P (2004). Estimating mutual information. Physical Review E, 69(6):066138.
  20. Laughlin SB, de Ruyter van Steveninck RR, and Anderson JC (1998). The metabolic cost of neural computation. Nat. Neurosci, 41:36–41.
  21. Nemenman I, Bialek W, and van Steveninck R. d. R. (2004). Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69(5):056111.
  22. Olshausen BA and Field DJ (2004). Sparse coding of sensory inputs. Curr Opin Neurobiol, 14(4):481–487.
  23. Opper M, Winther O, et al. (2001). From naive mean field theory to the TAP equations. Advanced Mean Field Methods: Theory and Practice, pages 7–20.
  24. Paninski L (2003). Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253.
  25. Renner R and Maurer U (2002). About the mutual (conditional) information. In Proc. IEEE ISIT.
  26. Rieke F, Warland D, de Ruyter van Steveninck RR, and Bialek W (1997). Spikes: Exploring the Neural Code. MIT Press, Cambridge.
  27. Schneidman E, Berry II MJ, Segev R, and Bialek W (2006). Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440:1007–1012.
  28. Sharpee TO, Sugihara H, Kurgansky AV, Rebrik SP, Stryker MP, and Miller KD (2006). Adaptive filtering enhances information transmission in visual cortex. Nature, 439(7079):936–942.
  29. Smith E and Lewicki MS (2006). Efficient auditory coding. Nature, 439:978–982.
  30. Strong SP, Koberle R, van Steveninck R. R. d. R., and Bialek W (1998). Entropy and information in neural spike trains. Physical Review Letters, 80(1):197.
  31. Tkačik G, Prentice JS, Balasubramanian V, and Schneidman E (2010). Optimal population coding by noisy spiking neurons. Proceedings of the National Academy of Sciences, 107(32):14419–14424.
  32. Treves A and Panzeri S (1995). The upward bias in measures of information derived from limited data samples. Neural Comp, 7:399–407.
  33. Wainwright MJ and Jordan MI (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.
  34. Yu Y, Crumiller M, Knight B, and Kaplan E (2010). Estimating the amount of information carried by a neuronal population. Frontiers in Computational Neuroscience, 4:10.
