Abstract
Quantifying mutual information between inputs and outputs of a large neural circuit is an important open problem in both machine learning and neuroscience. However, evaluation of the mutual information is known to be generally intractable for large systems due to the exponential growth in the number of terms that need to be evaluated. Here we show how information contained in the responses of large neural populations can be effectively computed provided the input-output functions of individual neurons can be measured and approximated by a logistic function applied to a potentially nonlinear function of the stimulus. Neural responses in this model can remain sensitive to multiple stimulus components. We show that the mutual information in this model can be effectively approximated as a sum of lower-dimensional conditional mutual information terms. The approximations become exact in the limit of large neural populations and for certain conditions on the distribution of receptive fields across the neural population. We empirically find that these approximations continue to work well even when the conditions on the receptive field distributions are not fulfilled. The computing cost for the proposed methods grows linearly in the dimension of the input, and compares favorably with other approximations.
1. Introduction
Information theory has the potential of answering many important questions about how neurons communicate within the brain. In particular, it can help determine whether neural responses provide sufficient amounts of information about certain stimulus features, and in this way determine whether these features could possibly affect the animal’s behavior (Rieke et al., 1997; Bialek, 2012). In addition, a number of previous studies have shown that one can understand many aspects of the neural circuit organization as those that provide maximal amounts of information under metabolic constraints (Laughlin et al., 1998; Bialek, 2012). Key to all of these analyses is the ability to compute the Shannon mutual information (Cover and Thomas, 2012). When estimating the information transmitted by neural populations from experimental recordings, all empirical methods produce biased estimates (Paninski, 2003). There are several approaches to trying to reduce or account for this bias (Nemenman et al., 2004; Strong et al., 1998; Brenner et al., 2000; Treves and Panzeri, 1995), but these approaches do not have finite-sample guarantees and are generally ineffective when the population response is high dimensional. In order to make progress on this problem, we consider the case where the response functions of individual neurons can be measured and where the stimulus-conditional (“noise”) correlations between neural responses can be described by pairwise statistics (Schneidman et al., 2006). Historically, even with these assumptions the mutual information is notoriously difficult to compute in part due to the large number of possible responses that a set of neurons can jointly produce (Nemenman et al., 2004; Strong et al., 1998). The number of patterns grows exponentially with both the number of time points (Strong et al., 1998; Dettner et al., 2016) and the number of neurons.
In this paper we will describe a set of approaches for computing information conveyed by responses of large neural populations. These methods build on recent advances for computing information based on linear combinations of neural responses across time (Dettner et al., 2016; Yu et al., 2010) and/or neurons (Berkowitz and Sharpee, 2018). We will show that when each individual neuron’s firing probability depends monotonically on a (potentially nonlinear) function of the stimulus, the information contained in the full population response can be completely preserved by a linear transformation of the population output. This calculation still involves computing information between high dimensional vector variables. Therefore, we further show how the full information can be effectively approximated using a sum of conditional mutual information values between pairs of low-dimensional variables. The resulting approach makes it possible to avoid the “curse of dimensionality” with respect to the number of neurons when computing the mutual information from large neural populations.
2. Framework setup
Our analysis will target neural responses considered over sufficiently small time windows such that no more than one spike can be produced by any given neuron. We model the neural population as a set of binary neurons with sigmoidal tuning curves, whose response probability is described by:
P(r_n \mid \vec{s}) = \frac{\exp\left(r_n\, h_n(\vec{s})\right)}{2\cosh\left(h_n(\vec{s})\right)} \qquad (1)
where \vec{s} is the input, r_n ∈ {−1, 1} is the activity of the nth neuron, and h_n(\vec{s}) is a scalar function of \vec{s} representing the activation function of the nth neuron. The population consists of N such neurons, and the population response is denoted as \vec{r} = (r_1, …, r_N). For clarity of the derivation, we will initially assume that neural responses are independent conditioned on \vec{s}:
P(\vec{r} \mid \vec{s}) = \prod_{n=1}^{N} P(r_n \mid \vec{s}) \qquad (2)
and later discuss under what conditions our results generalize to the case where neural responses are correlated for a given stimulus \vec{s}. A few lines of algebra suffice to show that Eq. (2) can be expressed in the following form:
P(\vec{r} \mid \vec{s}) = \exp\left(\sum_{n=1}^{N} r_n h_n(\vec{s}) - \sum_{n=1}^{N} \log 2\cosh\left(h_n(\vec{s})\right)\right) \qquad (3)
This formulation will assist all of the approaches described below for computing the mutual information.
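As a concrete illustration, responses from this model can be sampled directly. The sketch below is a minimal implementation assuming the logistic reading of Eq. (1), P(r_n = +1 | s) = e^{h_n}/(2 cosh h_n), and the affine activations introduced later in Eq. (12); the function name and array shapes are illustrative choices, not part of the original paper.

```python
import numpy as np

def sample_responses(W, alpha, stimuli, rng=None):
    """Draw conditionally independent binary responses r_n in {-1, +1}.

    Assumes P(r_n = +1 | s) = e^{h_n(s)} / (2 cosh h_n(s)) with affine
    activations h_n(s) = w_n . s + alpha_n (a reading of Eqs. 1, 2, 12).
    W: (N, D) receptive fields, alpha: (N,) offsets, stimuli: (M, D).
    Returns an (M, N) array of responses.
    """
    rng = np.random.default_rng() if rng is None else rng
    h = stimuli @ W.T + alpha                   # (M, N) activations h_n(s^mu)
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * h))     # sigmoid(2h) = e^h / (2 cosh h)
    return np.where(rng.random(h.shape) < p_plus, 1, -1)
```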
3. An unbiased estimator of information for large neural populations
In order to test the approaches described in subsequent sections, we first developed a Monte-Carlo method for computing the "ground-truth" mutual information that works for large neural populations. The approach relies on the knowledge of neural response parameters to produce unbiased estimates of mutual information between \vec{R} and \vec{S} for different choices of the stimulus distribution or the activation functions. Here and in what follows, upper case letters (e.g. \vec{R}) represent random variables, while lower case letters (e.g. \vec{r}) represent specific values of the associated random variables. The input distribution is defined by drawing Nstim samples; we denote this set of samples as {\vec{s}^μ}. Because of this approximation, however, the estimated information will be bounded above by log(Nstim) (as will be any unbiased estimator of mutual information).
Although there are several formulations of the mutual information in terms of the entropies of \vec{R} and \vec{S}, it serves to examine just one:
I(\vec{R}, \vec{S}) = H(\vec{R}) - H(\vec{R} \mid \vec{S}) \qquad (4)
Here, H(\vec{R}) is the Shannon entropy of the marginal distribution of \vec{R} and H(\vec{R} | \vec{S}) is the conditional entropy of \vec{R} given \vec{S}. Because we intend to use this estimator as a way to test the quality of other approximations, we will only consider here the case of conditionally independent neural responses, Eq. (2). In this case, the noise entropy decomposes into a sum over neurons:
H(\vec{R} \mid \vec{s}) = -\sum_{n=1}^{N}\left[\frac{1+\langle R_n\rangle_{\vec{s}}}{2}\log\frac{1+\langle R_n\rangle_{\vec{s}}}{2} + \frac{1-\langle R_n\rangle_{\vec{s}}}{2}\log\frac{1-\langle R_n\rangle_{\vec{s}}}{2}\right] \qquad (5)
where \langle R_n\rangle_{\vec{s}} is the expected value of R_n given \vec{s}:
\langle R_n\rangle_{\vec{s}} = \tanh\left(h_n(\vec{s})\right) \qquad (6)
We denote by \tilde{H}(\vec{R} \mid \vec{S}) the finite-sample approximation to H(\vec{R} \mid \vec{S}):
\tilde{H}(\vec{R} \mid \vec{S}) = \frac{1}{N_{\rm stim}}\sum_{\mu=1}^{N_{\rm stim}} H(\vec{R} \mid \vec{s}^{\mu}) \qquad (7)
The conditional entropy can be evaluated in O(N · Nstim) time, not including the cost of evaluating the activation functions. However, the marginal distribution of \vec{R} will in general not factor. Thus, evaluating H(\vec{R}) requires computing the marginal P(\vec{r}) for all 2^N possible responses. This computation grows like O(N · Nstim · 2^N). Thus, evaluation of Eq. (4) is known to become intractable for realistic population sizes. To derive our estimator, we begin by rewriting H(\vec{R}):
H(\vec{R}) = -\left\langle \log P(\vec{r}) \right\rangle_{P(\vec{r})} \qquad (8)
We approximate the log-marginal with an empirical average:
\log \tilde{P}(\vec{r}) = \log\left[\frac{1}{N_{\rm stim}}\sum_{\mu=1}^{N_{\rm stim}} P(\vec{r} \mid \vec{s}^{\mu})\right] \qquad (9)
In terms of numerical implementation, \log\tilde{P}(\vec{r}) can be efficiently and stably evaluated in O(Nstim) time using the logsumexp function that is implemented in many numerical libraries. To approximate the averaging with respect to P(\vec{r}), we draw B samples of \vec{r} for every \vec{s}^μ, which is easily done with Eqs. (1) and (2), and denote these samples as {\vec{r}^{μ,b}}. We can thus produce an unbiased estimate of the response entropy:
\tilde{H}(\vec{R}) = -\frac{1}{B\, N_{\rm stim}}\sum_{\mu=1}^{N_{\rm stim}}\sum_{b=1}^{B} \log \tilde{P}(\vec{r}^{\mu,b}) \qquad (10)
Importantly, the response entropy estimate requires O(B · Nstim² · N) operations, a substantial improvement over exact evaluation of H(\vec{R}) when B · Nstim ≪ 2^N. We note that even though we are able to produce unbiased estimates of \tilde{H}(\vec{R}), this estimator systematically underestimates the "infinite sample" entropy computed with respect to P(\vec{s}) explicitly, i.e. not defined by input samples (see Appendix A). Our Monte-Carlo estimator of the mutual information is the straightforward combination of \tilde{H}(\vec{R}) and \tilde{H}(\vec{R} \mid \vec{S}):
\tilde{I}(\vec{R}, \vec{S}) = \tilde{H}(\vec{R}) - \tilde{H}(\vec{R} \mid \vec{S}) \qquad (11)
Although \tilde{I}(\vec{R}, \vec{S}) is an unbiased estimator of the mutual information (after accounting for the approximation of P(\vec{s}) by the samples {\vec{s}^μ}), the variance of \tilde{H}(\vec{R}), and thus of \tilde{I}(\vec{R}, \vec{S}), can be difficult to quantify. However, \log\tilde{P}(\vec{r}) is a bounded function because \vec{r} has finite support (or, more generally, can be treated as a continuous function on the compact set [−1, 1]^N). Thus, standard concentration bounds show that \tilde{I}(\vec{R}, \vec{S}) is a consistent estimator of I(\vec{R}, \vec{S}).
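A compact sketch of the full Monte-Carlo estimator (Eqs. 5-11) is given below for the conditionally independent, affine-activation model. It is a sketch under the same modeling assumptions as above, and it evaluates P(\vec{r}^μ | \vec{s}^ν) densely for all stimulus pairs for clarity rather than memory efficiency; function and variable names are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def mc_information(W, alpha, stimuli, B=3, rng=None):
    """Monte-Carlo estimate of I(R, S) in nats (Section 3).

    A sketch assuming the logistic model with affine activations;
    W: (N, D), alpha: (N,), stimuli: (M, D) samples from P(s).
    """
    rng = np.random.default_rng() if rng is None else rng
    M = stimuli.shape[0]
    h = stimuli @ W.T + alpha                     # (M, N) activations
    log_p_plus = -np.logaddexp(0.0, -2.0 * h)     # log P(r_n = +1 | s^mu)
    log_p_minus = -np.logaddexp(0.0, 2.0 * h)     # log P(r_n = -1 | s^mu)
    A = np.logaddexp(h, -h).sum(axis=1)           # log-partition per stimulus

    # Conditional ("noise") entropy: sum of binary neuron entropies (Eqs. 5-7).
    p_plus = np.exp(log_p_plus)
    H_cond = -(p_plus * log_p_plus + (1 - p_plus) * log_p_minus).sum(axis=1).mean()

    # Response entropy: B response samples per stimulus, with log P(r)
    # approximated by an average over the stimulus samples (Eqs. 8-10).
    H_resp = 0.0
    for _ in range(B):
        r = np.where(rng.random(h.shape) < p_plus, 1.0, -1.0)   # (M, N)
        log_cond = r @ h.T - A                    # log P(r^mu | s^nu), (M, M)
        log_marg = logsumexp(log_cond, axis=1) - np.log(M)      # Eq. (9)
        H_resp -= log_marg.mean()
    H_resp /= B
    return H_resp - H_cond                        # Eq. (11)
```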
In order to test our derivation that \tilde{I}(\vec{R}, \vec{S}) is an unbiased estimator of I(\vec{R}, \vec{S}), we analyzed the statistics of \tilde{I}(\vec{R}, \vec{S}) on a tractable neural population where I(\vec{R}, \vec{S}) can be computed exactly. We let N = 10, with receptive fields uniformly distributed along the unit circle. P(\vec{s}) is a spherical, two-dimensional Gaussian distribution and Nstim = 8,000. We evaluate I(\vec{R}, \vec{S}) exactly and find that it lies well below the upper bound of log(8,000) ≈ 8.987 nats. We computed \tilde{I}(\vec{R}, \vec{S}) 100 times for B = 1 and B = 3, with {\vec{s}^μ} fixed. For each repetition we record the residual \tilde{I}(\vec{R}, \vec{S}) − I(\vec{R}, \vec{S}). Distribution plots of the residuals are shown in Figure 1. For both distributions the sample mean is not significantly different from zero with P = 0.848 (B = 1) and P = 0.851 (B = 3) in a two-sided t-test. The simulation results therefore support the derivation of zero bias in the proposed model-based Monte-Carlo estimator.
Figure 1: Distribution of residuals between the exact calculation and the Monte Carlo results for the test neural population described in Section 3. Dashed black line indicates zero, while the red marker and error bar are the sample mean and standard deviation.
4. Simplifying the mutual information with sufficient statistics
4.1. A vector-valued sufficient statistic
The method introduced in Section 3 can be applied to very general formulations and parametrizations of the activation functions. However, when we constrain the activation functions to be affine, we can show that P(\vec{r} | \vec{s}) has especially useful properties. Specifically, we assume the following parametrization of h_n(\vec{s}):
h_n(\vec{s}) = \vec{w}_n \cdot \vec{s} + \alpha_n \qquad (12)
While Eq. (12) implies a strong restriction on how stimuli drive the neural responses, some results of this section can be generalized to other activation functions. The reason for this is that even the general formulation of P(\vec{r} | \vec{s}) given in Eq. (3) can be viewed as an exponential family, with sufficient statistic \vec{r} and natural parameter (h_1(\vec{s}), …, h_N(\vec{s})). In particular, the framework can be extended to quadratic activation functions, which are an important model for describing neurons that are sensitive to multiple stimulus features. See Appendix E for further discussion.
If Eq. (12) holds, then Eq. (2) can be rewritten as follows:
P(\vec{r} \mid \vec{s}) = q(\vec{r})\exp\left(\vec{s}\cdot\vec{t}(\vec{r}) - A(\vec{s})\right) \qquad (13)
where,
\vec{t}(\vec{r}) = \sum_{n=1}^{N} r_n \vec{w}_n \qquad (14)
q(\vec{r}) = \exp\left(\sum_{n=1}^{N} \alpha_n r_n\right) \qquad (15)
A(\vec{s}) = \sum_{n=1}^{N} \log 2\cosh\left(\vec{w}_n\cdot\vec{s} + \alpha_n\right) \qquad (16)
Equation (13) is an exponential family with sufficient statistic \vec{t}(\vec{r}), natural parameter \vec{s}, base measure q(\vec{r}), and log-partition function A(\vec{s}) (Wainwright and Jordan, 2008).
The stimulus-conditional probability distribution P(\vec{t} | \vec{s}) can be defined by marginalizing over all \vec{r} that map to the same \vec{t}:
P(\vec{t} \mid \vec{s}) = \sum_{\vec{r}:\;\vec{t}(\vec{r}) = \vec{t}} P(\vec{r} \mid \vec{s}) \qquad (17)
Note that P(\vec{t} | \vec{s}) = 0 if there does not exist an \vec{r} such that \vec{t}(\vec{r}) = \vec{t}. An important property of sufficient statistics is the conservation of information (Cover and Thomas, 2012):
I(\vec{R}, \vec{S}) = I(\vec{T}, \vec{S}) \qquad (18)
with \vec{T} defined by Eq. (14). Although \vec{T} does not lose information relative to \vec{R}, it is worth making a few comments on \vec{T} and Eq. (18). Because \vec{R} is a discrete variable (with cardinality of at most 2^N) and \vec{T} is a deterministic function of \vec{R}, \vec{T} is also a discrete variable with finite cardinality. Indeed, outside of cases of degeneracy between the columns of W, there will generally be a one-to-one mapping between values of \vec{r} and \vec{t}. Thus, even in cases where D ≪ N, computing I(\vec{T}, \vec{S}) can be just as difficult as computing I(\vec{R}, \vec{S}). Furthermore, unlike the components of \vec{R}, the components of \vec{T} will generally not be conditionally independent; H(\vec{T} | \vec{S}) will thus be similarly intractable. While it may seem that we have not gained any computational advantage by transforming from \vec{R} to \vec{T}, we will now show that Eq. (18) can be expressed in a convenient form that facilitates several useful approximations.
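In practice, joint samples of (\vec{s}, \vec{t}) are obtained by pushing sampled responses through Eq. (14). A minimal sketch, reusing the hypothetical sample_responses helper from Section 2:

```python
def sample_sufficient_statistics(W, alpha, stimuli, rng=None):
    """Draw joint samples (s, t) with t = sum_n r_n w_n (Eq. 14).

    A sketch; W: (N, D), stimuli: (M, D). Returns (stimuli, T) with T: (M, D).
    """
    R = sample_responses(W, alpha, stimuli, rng)   # (M, N) responses in {-1, +1}
    return stimuli, R @ W                          # row mu of R @ W is sum_n r_n w_n
```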
4.2. Decomposition of mutual information based on sufficient statistics
We start by noting that the ordering of the components of \vec{S} and \vec{T} is arbitrary, because applying any matching permutation to the components of \vec{S} and \vec{T} does not affect I(\vec{S}, \vec{T}). We will use the following notation for components of vectors: s_{<d} = (s_1, …, s_{d−1}), and similarly for s_{>d}, s_{≥d}, and s_{≤d}. Note that S_{¬d} = (S_{<d}, S_{>d}). Additionally, we will at times consider information-theoretic quantities involving variables that are the concatenation of two other variables, such as X and Y. Such compound variables will be denoted as {X, Y}. Using these notations and applying the chain rule for mutual information (Cover and Thomas, 2012) to Eq. (18) yields:
I(\vec{S}, \vec{T}) = \sum_{d=1}^{D} I(S_d, \vec{T} \mid \vec{S}_{<d}) \qquad (19)
In Eq. (19), I(S_d, \vec{T} | \vec{S}_{<d}) is the mutual information between \vec{T} and S_d conditioned on \vec{S}_{<d}, which can be expressed in several equivalent forms:
| (20) |
where the last term is the total correlation between S_{<d}, S_d, and \vec{T}. All four formulations of the conditional mutual information are equivalent provided they all exist. A notable situation is when there is a functional dependence between S_d and S_{<d}, such as when the support of S_{≤d} lies on a manifold of intrinsic dimension less than d. In this case I(S_{<d}, S_d) diverges and the second line of Eq. (20) is ill-defined. However, it is easy to show that the conditional mutual information vanishes if such a functional dependency exists, using the fourth line of Eq. (20). Formally, let s_d ≡ g(s_{<d}) where g is a function defined explicitly or implicitly. Then we note that the mutual information between two variables is zero if at least one variable is constant:
| (21) |
Thus, I(S_d, \vec{T} | \vec{S}_{<d}) will also be zero:
| (22) |
Computing I(S_d, \vec{T} | \vec{S}_{<d}) alone remains challenging for the same reasons as computing I(\vec{S}, \vec{T}). However, we can achieve a further reduction by taking advantage of the fact that P(\vec{t} | \vec{s}) is an exponential family. To see this, we first express two important marginalized forms of (14):
| (23) |
| (24) |
The notation \langle\cdot\rangle_{\vec{s}_{>d}|\vec{s}_{\le d}} denotes the expectation with respect to P(s_{>d}|s_{≤d}), with analogous meanings for the other subscripts. Marginalization over conditioned variables is expressed implicitly. The important consequence of (23) and (24) is that the resulting log-likelihood ratio is independent of t_{<d}. From this we can show that:
I(S_d, \vec{T} \mid \vec{S}_{<d}) = I(S_d, \vec{T}_{\geq d} \mid \vec{S}_{<d}) \qquad (25)
This leads to our next reduction:
I(\vec{S}, \vec{T}) = \sum_{d=1}^{D} I(S_d, \vec{T}_{\geq d} \mid \vec{S}_{<d}) \qquad (26)
We note that the dth term in (19) has D+d+1 degrees of freedom, whereas the corresponding term in (26) has D+1 degrees of freedom. This effective dimension reduction has important algorithmic implications for the nonparametric estimators we use to compute the individual terms of (26) (cf. Section 5). In Sections 5.1 and 5.2 we use Eq. (26) to evaluate the full information; we refer to this computation as Ivector.
4.3. Lower bounds for the mutual information
While Eq. (26) represents a significant improvement in complexity over naive evaluation of I(\vec{R}, \vec{S}), individual terms of Eq. (26) may still be too high dimensional to reliably evaluate. In this section, we present a series of lower bounds on I(\vec{S}, \vec{T}) that are more easily estimated. In particular, we consider bounds that arise by replacing T_{≥d} in the dth term of Eq. (26) by a lower-dimensional, deterministic transformation of T_{≥d} denoted Z_d. Applying the Data Processing Inequality (DPI) to each term in Eq. (26) then yields a lower bound for the mutual information. There are many possible lower bounds of this form. We focus on the variable Z_d = {T_d, |T_{>d}|}, where |T_{>d}| is the L2-norm of T_{>d}. This leads to the following lower-bound approximation to I(\vec{S}, \vec{T}),
I_{\rm iso} = \sum_{d=1}^{D} I\!\left(S_d, \{T_d, |\vec{T}_{>d}|\} \mid \vec{S}_{<d}\right) \qquad (27)
which we term isotropic. In Appendix B, we show that this approximation becomes exact in the asymptotic limit of large neural populations, meaning that Iiso approaches I(\vec{S}, \vec{T}), when the stimulus distribution is isotropic and the distribution of receptive fields (RFs) across the population is such that the log-partition function A(\vec{s}) depends on \vec{s} only through its norm. Notably, this is achieved when RFs uniformly cover the stimulus space, meaning that the RF distribution is described by an uncorrelated Gaussian distribution. For a finite number of neurons, A(\vec{s}) will never be perfectly isotropic. However, for large populations (N ≫ 1) where the receptive fields are drawn from an isotropic distribution and the distribution of α is independent of \vec{w}, A(\vec{s}) will become isotropic asymptotically as N → ∞; cf. Appendix B.1 for further details. The analogue of approximation Eq. (27) for the case where the RFs and P(\vec{s}) are described by a matching correlated Gaussian distribution is described in Appendix B.2.
The next reduction we consider is to drop |T>d| from each term of Eq. (27):
I_{\rm comp\text{-}cond} = \sum_{d=1}^{D} I(S_d, T_d \mid \vec{S}_{<d}) \qquad (28)
By the data-processing inequality, it again follows that Icomp-cond ≤ Iiso. Overall, one obtains a series of bounds:
I(\vec{R}, \vec{S}) = I(\vec{T}, \vec{S}) \;\geq\; I_{\rm iso} \;\geq\; I_{\rm comp\text{-}cond} \qquad (29)
Our final, simplest approximation is to drop the conditioning on S<d in each term of Eq. (28).
I_{\rm comp\text{-}ind} = \sum_{d=1}^{D} I(S_d, T_d) \qquad (30)
We show in Appendix C that this last approximation becomes exact in the case where the neural population splits into independent sub-populations with orthogonal RFs between sub-populations. Mathematically, this corresponds to the case where both the stimulus distribution and the log-partition function A(\vec{s}) factor in the same basis:
| (31) |
In general, Icomp-ind may be greater or less than Icomp-cond (or Iiso) (Renner and Maurer, 2002). However, when P(\vec{s}) factors across its components, the following additional inequality holds:
I_{\rm comp\text{-}cond} \;\geq\; I_{\rm comp\text{-}ind} \qquad (32)
To derive (32) we first note that we can decompose I(Sd, {Td, S<d}) (for d > 1) in two different ways:
I(S_d, \{T_d, \vec{S}_{<d}\}) = I(S_d, \vec{S}_{<d}) + I(S_d, T_d \mid \vec{S}_{<d}) = I(S_d, T_d) + I(S_d, \vec{S}_{<d} \mid T_d) \qquad (33)
Equating the first and second lines of (33) we can rewrite the residual I(Sd,Td|S<d) − I(Sd,Td):
I(S_d, T_d \mid \vec{S}_{<d}) - I(S_d, T_d) = I(S_d, \vec{S}_{<d} \mid T_d) - I(S_d, \vec{S}_{<d}) \qquad (34)
Though either side of (34) may be positive or negative in general, when we make the assumption that P(\vec{s}) factors across dimensions, then I(S_d, S_{<d}) = 0. Thus (34) is nonnegative and I(S_d, T_d|S_{<d}) ≥ I(S_d, T_d), implying (32).
In the opposite extreme case where the value of S_d is a deterministic function of S_{<d}, Eq. (22) can be generalized to show that I(S_d, T_d|S_{<d}) = 0. Thus, in this case I(S_d, T_d) ≥ I(S_d, T_d|S_{<d}), which in turn indicates that Icomp-ind ≥ Icomp-cond. For example, when the support of P(\vec{s}) lies on a one-dimensional curve, e.g. when \vec{s} represents position along a one-dimensional nonlinear track, S_d is fully determined by the values of the other variables S_{<d} for every d > 1, regardless of component ordering. In this case, all conditional terms with d > 1 vanish and Icomp-cond reduces to the single term I(S_1, T_1).
In intermediate cases with some statistical dependencies between stimulus components, Icomp-ind is not generally guaranteed to be a lower bound to either Icomp-cond or I(\vec{S}, \vec{T}). Nevertheless, we observed that Icomp-ind remained below the full information even for some correlated P(\vec{s}); cf. Section 5.2.
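Given joint samples of \vec{S} and \vec{T}, the three approximations of this section reduce to sums of scalar (conditional) mutual information estimates. The sketch below assembles them from user-supplied estimators mi(x, y) and cmi(x, y, z); these callables are hypothetical placeholders (e.g. KSG-based, cf. Section 5), and the loop structure is a reading of Eqs. (27), (28), and (30).

```python
import numpy as np

def information_approximations(S, T, mi, cmi):
    """Assemble the approximations of Section 4.3 from scalar MI estimates.

    A sketch: S, T are (M, D) arrays of stimuli and sufficient statistics;
    mi(x, y) and cmi(x, y, z) are user-supplied estimators of I(X, Y) and
    I(X, Y | Z). Ivector (Eq. 26) follows the same pattern with T[:, d:]
    in place of z_d. Returns (I_iso, I_comp_cond, I_comp_ind).
    """
    D = S.shape[1]
    I_iso = I_comp_cond = I_comp_ind = 0.0
    for d in range(D):
        s_d, t_d = S[:, d], T[:, d]
        s_lt = S[:, :d]                                  # S_{<d}, empty for d = 0
        if d < D - 1:
            t_norm = np.linalg.norm(T[:, d + 1:], axis=1)   # |T_{>d}|
            z_d = np.column_stack([t_d, t_norm])
        else:
            z_d = t_d                                    # no T_{>d} for the last term
        I_iso += cmi(s_d, z_d, s_lt) if d > 0 else mi(s_d, z_d)
        I_comp_cond += cmi(s_d, t_d, s_lt) if d > 0 else mi(s_d, t_d)
        I_comp_ind += mi(s_d, t_d)
    return I_iso, I_comp_cond, I_comp_ind
```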
4.4. Alternative approximations of the mutual information
Previous authors have proposed other approximations to the mutual information. There exists a non-parametric upper bound to the mutual information computed in terms of pairwise relative entropies between the conditional distributions P(\vec{r} | \vec{s}^μ) and P(\vec{r} | \vec{s}^ν) (Haussler et al., 1997; Kolchinsky et al., 2017):
I(\vec{R}, \vec{S}) \;\leq\; -\frac{1}{N_{\rm stim}}\sum_{\mu}\log\left[\frac{1}{N_{\rm stim}}\sum_{\nu}\exp\left(-D_{\rm KL}\!\left(P(\vec{r}\mid\vec{s}^{\mu})\,\|\,P(\vec{r}\mid\vec{s}^{\nu})\right)\right)\right] \qquad (35)
The model we consider for P(\vec{r} | \vec{s}) is an exponential family and thus has a tractable relative entropy (Banerjee et al., 2005):
D_{\rm KL}\!\left(P(\vec{r}\mid\vec{s})\,\|\,P(\vec{r}\mid\vec{s}')\right) = A(\vec{s}') - A(\vec{s}) - (\vec{s}' - \vec{s})\cdot\nabla A(\vec{s}) \qquad (36)
In Eq. (35) we have used the generalized definitions of Section 3. The evaluation of the upper bound (35) is quadratic in the sample size Nstim, as opposed to O(Nstim log2 Nstim) for the estimator in Section 5. In the limit where N ≫ D, another popular approximation exists based on Fisher information (Brunel and Nadal, 1998); it can be computed with O(Nstim) operations. Recent work has shown that this approximation is valid only for certain classes of input distributions (Huang and Zhang, 2018). In Appendix D we discuss the relationship between this approximation and the upper bound of Eq. (35). We include numerical comparisons between these alternative approximations and the methods proposed in this paper in Sections 5.1 and 5.2. However, we found that the Fisher information approximation drastically overestimated the true mutual information. Therefore, to avoid obscuring differences between the other results, the approximation based on Fisher information is not included in Figures 2-3. Full plots including this approximation can be found in Appendix D.
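For the affine model, the pairwise-relative-entropy bound can be evaluated directly from the log-partition function and its gradient. The sketch below assumes the pairwise-KL form given above for Eq. (35) and the exponential-family identity of Eq. (36) with grad A(s) = W^T tanh(W s + alpha); both are readings of the source rather than verbatim expressions.

```python
import numpy as np
from scipy.special import logsumexp

def pairwise_kl_upper_bound(W, alpha, stimuli):
    """Upper bound on I(R, S) from pairwise relative entropies (Eqs. 35-36).

    A sketch; W: (N, D), alpha: (N,), stimuli: (M, D). O(M^2) in the number
    of stimulus samples, as noted in the text.
    """
    h = stimuli @ W.T + alpha                        # (M, N) activations
    A = np.logaddexp(h, -h).sum(axis=1)              # (M,) log-partition A(s^mu)
    gradA = np.tanh(h) @ W                           # (M, D) gradient of A at s^mu
    # kl[mu, nu] = A(s^nu) - A(s^mu) - (s^nu - s^mu) . gradA(s^mu)
    cross = gradA @ stimuli.T                        # gradA(s^mu) . s^nu
    diag = np.einsum('md,md->m', gradA, stimuli)     # gradA(s^mu) . s^mu
    kl = A[None, :] - A[:, None] - (cross - diag[:, None])
    inner = logsumexp(-kl, axis=1) - np.log(len(stimuli))
    return -inner.mean()
```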
Figure 2: Information curves for neural populations with uncorrelated RF and stimulus distributions. Lines and error bars are mean and standard deviation over ten repeats of the estimator. Insets show RF distribution for several population sizes.
Figure 3: Information curves for neural populations with correlated RF distributions, cf. Section 5.2. Lines and error bars are mean and standard deviation over ten repeats of the estimator. In all panels, circles represent information computed in the original basis, while squares and triangles are computations performed in the decorrelated basis. Ivector (A) and Iiso (B) recover full information. These two computations do not benefit from working in the decorrelated stimulus basis. Stimulus decorrelation improves the performance of Icomp-cond (C) and Icomp-ind (D). In (D), computations are invariant to the ordering of components.
We note that there are other variational approximations to mutual information (Belghazi et al., 2018; Barber and Agakov, 2003). However, because comparing the information for different choices of the model and stimulus parameters requires training a different variational approximation each time, direct comparison requires substantial computational resources, and we leave it for future work.
5. Numerical Simulations
We now test the performance of the bounds described above under several representative situations that include correlated and uncorrelated stimulus distributions, and isotropic and anisotropic receptive field distributions, including experimentally recorded receptive fields from the primary visual cortex, as well as the case where intrinsic "noise" correlations are present.
To empirically estimate the bounds on mutual information [I(Sd, Td), I(Sd, Td|S<d), I(Sd, {Td, |T>d|}|S<d), and I(Sd, T≥d|S<d)], we use the KSG estimator (Kraskov et al., 2004), a non-parametric method based on distributions of K nearest-neighbor distances. We chose to use the KSG estimator because, even though we have reduced the mutual information between two high-dimensional variables into a sum over pairs of scalars, computing even just I(Sd, Td) can still be a daunting task, even more so for terms involving conditional informations. Td may still have exponentially large cardinality, and complicated interdependencies between the components of \vec{r} present difficulties in forming explicit expressions for P(td|sd), so exact evaluation of H(Td) and H(Td|Sd) is not feasible at present. The KSG estimator requires only that we can draw Nstim samples of \vec{s} and \vec{t} from their joint distribution, discarding the unused components. Sampling from the joint distribution is easily done given samples from P(\vec{s}). Given a sample \vec{s}^μ, we draw \vec{r} from P(\vec{r} | \vec{s}^μ), which is easily done because of Eq. (2), and transform \vec{r} into \vec{t} using (14). This estimator has complexity O(Nstim log2 Nstim) when implemented with KD-Trees. For the case of two scalar variables, the ℓ2 error of the estimate decreases with increasing sample size (Gao et al., 2018), though if the true value of the mutual information is very high then the error may still be large (Gao et al., 2015). In order to partially alleviate this error, we use the PCA-based Local Nonuniformity Correction of (Gao et al., 2015) (KSG-LNC). We extend this estimator to compute the conditional mutual information terms using a decomposition analogous to the second line of Eq. (20), and set the nonuniformity threshold hyperparameter according to the heuristics suggested in (Gao et al., 2015). Additionally, we assume that the distribution of \vec{S} (and thus S≤d, S<d, and Sd for all d) is non-atomic. Thus, because \vec{T} is discrete but real valued, the KSG estimator is applicable as neither variable is a mixed continuous-discrete variable (Gao et al., 2017).
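A basic version of the KSG estimator (algorithm 1 of Kraskov et al., 2004) can be written with standard scientific-Python tools; the sketch below omits the local nonuniformity correction used in the paper, and the conditional estimator follows the chain-rule difference mentioned above. Function names are illustrative.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import digamma

def ksg_mi(x, y, k=4):
    """KSG (algorithm 1) estimate of I(X, Y) in nats.

    A minimal sketch; x, y are (M,) or (M, d) sample arrays.
    """
    x = np.asarray(x, float).reshape(len(x), -1)
    y = np.asarray(y, float).reshape(len(y), -1)
    M = len(x)
    joint = np.hstack([x, y])
    # distance to the k-th neighbor in the joint space (Chebyshev norm), excluding self
    eps = cKDTree(joint).query(joint, k=k + 1, p=np.inf)[0][:, -1]
    tx, ty = cKDTree(x), cKDTree(y)
    # count marginal neighbors strictly within eps_i (excluding the point itself)
    nx = np.array([len(tx.query_ball_point(x[i], eps[i] * (1 - 1e-10), p=np.inf)) - 1
                   for i in range(M)])
    ny = np.array([len(ty.query_ball_point(y[i], eps[i] * (1 - 1e-10), p=np.inf)) - 1
                   for i in range(M)])
    return digamma(k) + digamma(M) - np.mean(digamma(nx + 1) + digamma(ny + 1))

def ksg_cmi(x, y, z, k=4):
    """I(X, Y | Z) via the chain-rule difference I(X, {Y, Z}) - I(X, Z)."""
    y = np.asarray(y, float).reshape(len(y), -1)
    z = np.asarray(z, float).reshape(len(z), -1)
    return ksg_mi(x, np.hstack([y, z]), k) - ksg_mi(x, z, k)
```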
5.1. Large populations responding to uncorrelated stimuli
We evaluated the performance of the bounds on information developed in Section 4.3 for large populations ranging from N ≈ 100 to 1,000. Specifically, to test the performance of Iiso we chose a highly isotropic population and stimulus distribution. We set D = 3, M = 8,000, and let P(\vec{s}) be a zero-mean Gaussian with unit covariance matrix. For each value of N, the receptive fields were placed uniformly on the surface of the unit sphere, using the regular placement algorithm of (Deserno, 2004). Because N is too large for exact evaluation of I(\vec{R}, \vec{S}), ground-truth values were estimated using the Monte Carlo estimator of Section 3 with B = 3. Results are plotted in Figure 2. We find that for large N, Iiso tightly approximates Ivector and both are accurate approximations to the ground-truth information, strongly outperforming the component-based approximations. We note that for this case the upper bound of log(Nstim) = log(8,000) ≈ 9 nats is well above all of the curves other than the bound of Eq. (35), which is already known to be an upper bound to I(\vec{R}, \vec{S}), demonstrating that we are in the well-sampled regime. Once again we see that inequalities (29) and (32) hold.
5.2. Correlated stimulus distributions
We now consider the case of correlated Gaussian stimuli and model P(\vec{s}) as a zero-mean Gaussian with a full-rank, non-diagonal covariance matrix C. To better understand the effects of stimulus correlations, we also perform computations in stimulus bases where components are independent. For this, we decompose C as C = VΛV^T, where V is an orthogonal matrix whose columns are the eigenvectors of C and Λ is the diagonal matrix of eigenvalues.
\vec{s}' = V^{T}\vec{s}, \qquad \vec{t}' = V^{T}\vec{t} \qquad (37)
Note that we have \vec{s}\cdot\vec{t} = \vec{s}'\cdot\vec{t}'. It is easy to see that the transformed model is also an exponential family. Additionally, because the mappings from \vec{s} to \vec{s}' and from \vec{t} to \vec{t}' are diffeomorphisms, the information is preserved:
I(\vec{S}, \vec{T}) = I(\vec{S}', \vec{T}') \qquad (38)
We note that while Eq. (38) holds in principle, in practice we may see variation because the KSG family of estimators is not invariant under diffeomorphisms. Importantly, we also note that Eq. (25) holds for (\vec{S}', \vec{T}'). Given samples from P(\vec{s}, \vec{t}), we automatically have samples from P(\vec{s}', \vec{t}'). We can straightforwardly generalize Ivector, Iiso, Icomp-cond, and Icomp-ind to the decorrelated variables. Eq. (38) does not generalize to Iiso, Icomp-cond, or Icomp-ind, as they are not expressible as mutual information quantities between two variables. We note that Ivector is such a quantity, so we do not modify it.
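A minimal sketch of the decorrelating transformation, assuming Eq. (37) is the pure rotation s' = V^T s (so that t' = V^T t preserves the inner product s . t); the helper name and the sorting convention are illustrative.

```python
import numpy as np

def decorrelate(S, T):
    """Rotate stimuli and sufficient statistics into the eigenbasis of C.

    A sketch assuming s' = V^T s with C = V Lambda V^T; because V is
    orthogonal, t' = V^T t preserves s . t. S, T: (M, D) sample arrays.
    Components are returned sorted by decreasing variance.
    """
    C = np.cov(S, rowvar=False)
    lam, V = np.linalg.eigh(C)
    V = V[:, np.argsort(lam)[::-1]]     # eigenvectors ordered by decreasing eigenvalue
    return S @ V, T @ V
```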
Simulations in Figure 3 were done using the following stimulus covariance matrix
| (39) |
For this choice of C, ρ1,2 = ρ2,3 = 0.75, ρ1,3 = 0.5, and |C| = 1. The covariance matrix of the decorrelated components \vec{s}' is diagonal:
| (40) |
The receptive field configurations are the same as in Section 5.1. We note that in the original coordinates \vec{s}, all components have the same variance, whereas this symmetry is broken in the decorrelated components \vec{s}'. We compared sorting the components of \vec{s}' in increasing and decreasing order of variance (triangles and squares, respectively) in Figure 3A-C. Component order does not matter for Icomp-ind (Figure 3D). We find that for both Ivector and Iiso it is optimal to perform the computation in the original basis, with both quantities accurately matching the ground-truth information. For Icomp-cond and Icomp-ind, accuracy is increased by using decorrelated components and, for Icomp-cond, by sorting components in order of decreasing variance.
5.3. Highly asymmetric receptive field distributions
Next, we consider a small population (N = 9) with a highly asymmetric distribution of redundant receptive fields in two stimulus dimensions. In particular, we are interested in a population where many different configurations of \vec{r} map to the same configuration of \vec{t}, demonstrating the utility of using \vec{t} as a non-trivial sufficient statistic of \vec{r}. With this in mind, we chose a heavily redundant configuration of receptive fields and offsets. The cardinalities of \vec{R}, \vec{T}, T1, and T2 are 512, 37, 7, and 7, respectively. Because N is small, ground-truth values of the information were computed by exactly evaluating P(\vec{r} | \vec{s}^μ) for every sample of \vec{s}. Given these conditionals, we average across the stimulus samples to get P(\vec{r}) explicitly, and calculate H(\vec{R}), H(\vec{R} | \vec{S}), and I(\vec{R}, \vec{S}) from these distributions. P(\vec{s}) is Gaussian with a diagonal covariance matrix, in accordance with (32). We set Nstim = 10,000. To test how the relative values of I(S1, T1) and I(S2, T2) impact the optimal ordering of components in Icomp-cond, we fixed σ2 = 1 and varied σ1 between 0.5 and 2.5. Results are plotted in Figure 4. As predicted, the hierarchy of bounds (29) and (32) holds for all σ1. It is also notable that in this case, just like in the case of large neural populations, for computing Icomp-cond it always seems best to start the information computation with the stimulus component that has the largest variance. As expected, the ordering of components does not strongly impact Icomp-ind.
Figure 4: Information curves for the example population with highly redundant RFs from Sec. 5.3. Lines and error bars are mean and standard deviation over ten repeats of the estimator. Although neither the component-conditional nor the component information is guaranteed to tightly approximate the information, both provide a good approximation to the full information (as estimated via the unbiased Monte Carlo method), reaching ≥ 80% of the maximum. For the vector sufficient statistic, both component orderings accurately reproduced the full information. The next best approximation to the full information is provided by the component-conditional computation with components added in order from largest to smallest variance. This approximation reaches accuracy within 95% of the full value over the tested range.
5.4. Experimental stimuli and receptive fields
In the previous three sections we considered synthetic distributions of low-dimensional stimuli and artificial configurations of receptive fields. In this section we analyze a population of model neurons with receptive fields and α values that were fit to responses of primary visual cortex (V1) neurons elicited by natural stimuli (Sharpee et al., 2006). We use 147 pairs of receptive-field and offset values that were fit using the Maximally Informative Dimensions (MID) algorithm as in (Sharpee et al., 2006). Stimuli are 10 pixel by 10 pixel patches extracted from the same set of images used to fit the model parameters. Receptive fields are normalized and centered on the patch, and we chose a 10 × 10 sub-patch of the original 32 × 32 shaped receptive fields so that all dimensions are well sampled by receptive fields. That is, for all pixels of the 10 × 10 patch, at least 115 of the 147 neurons have a nonzero value in the corresponding component of their receptive field. Additionally, the stimuli are z-scored by subtracting the mean and dividing by the standard deviation, with both quantities computed across all samples and pixels collectively.
Because of the high stimulus dimensionality, we could only compute the Icomp-ind bound (via the KSG estimator) and the ground-truth information (via the Monte Carlo method). Because the pixels of natural image patches are clearly not independent, we also computed Icomp-ind in two additional coordinate systems. The first coordinate system is simply the linearly decorrelated components \vec{s}' and \vec{t}' defined in Eq. (37). The second coordinate system uses independent components derived using independent component analysis on \vec{s}:
\vec{s}'' = U\vec{s}, \qquad \vec{t}'' = \left(U^{-1}\right)^{T}\vec{t} \qquad (41)
Here, U is an unmixing matrix computed using Infomax ICA (Bell and Sejnowski, 1997) on the samples of \vec{s}. As is done in (Bell and Sejnowski, 1997), U includes a linear whitening matrix. We note that in Eq. (41), \vec{t} is multiplied by (U^{-1})^T and not U because the ICA unmixing matrix is generally not orthogonal and we require that \vec{s}\cdot\vec{t} be preserved. As before, the full information is unchanged by this transformation. However, the same cannot be said for Icomp-ind expressed in different coordinate systems.
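A sketch of this change of coordinates, using scikit-learn's FastICA as a stand-in for the Infomax ICA used in the paper; the key point it illustrates is the (U^{-1})^T transformation of \vec{t} required by Eq. (41) to preserve the inner product.

```python
import numpy as np
from sklearn.decomposition import FastICA

def ica_basis(S, T):
    """Transform (S, T) into an independent-component basis.

    A sketch: with unmixing matrix U (s'' = U s), Eq. (41) uses
    t'' = (U^{-1})^T t so that s . t = s'' . t''. S, T: (M, D) arrays.
    """
    ica = FastICA(whiten="unit-variance", random_state=0)
    S_ica = ica.fit_transform(S)          # rows are approximately U (s - mean)
    U = ica.components_                   # (D, D) unmixing matrix (includes whitening)
    T_ica = T @ np.linalg.inv(U)          # row form of t'' = (U^{-1})^T t
    return S_ica, T_ica
```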
To evaluate the effect of using different coordinate systems to evaluate Icomp-ind for different population sizes, we first ranked the 147 neurons in descending order by the information each neuron carried about the stimulus. We computed I(Rn, \vec{S}), which is easily done exactly since Rn is a binary variable, and then sorted neurons in order of decreasing single-neuron information. We considered populations of size N = 60, 70, …, 140, where each population contained the first N neurons under the aforementioned ordering. For each value of N we computed the ground-truth information and Icomp-ind in the pixel, PCA, and ICA bases. We note that log(Nstim) = log(49,152) ≈ 10.8 nats. Results are plotted in Figure 5.
Figure 5: Information curves for populations based on experimentally recorded RFs and probed with D = 100 natural visual stimuli. The full information (solid black line) is computed via the Monte Carlo method and is compared to Icomp-ind approximations computed in two different bases: the PCA basis (blue dashed line) and the ICA basis (blue dotted line). The calculation in the original pixel basis is omitted because it yielded values of ~25 nats across the range of population sizes and obscured the other curves. Because of the non-Gaussian statistics of natural stimuli, PCA components remain dependent, and as a result the approximation is no longer guaranteed to be a lower bound of the true information. Computation in the ICA basis respects the lower-bound requirements, and achieves ≥ 75% of the full information across the range of population sizes.
We observe that both the pixel-basis and PCA-basis computations of Icomp-ind overestimate the true information, especially the pixel-basis computation. This overestimation occurs because the stimulus components are not independent in those bases. By comparison, the computation performed in the ICA basis lower-bounds the mutual information for all N, achieving ≥ 75% of the full information across the range of population sizes.
5.5. Handling intrinsically correlated neurons
In order to simplify derivations, we assumed that the neural responses were independent after conditioning on \vec{s}. However, all of the analytic results in Section 4.2 can be extended to specific forms of intrinsic interneuronal correlation to allow for the presence of correlations in neural responses for a given stimulus \vec{s}. Formally, we modify the base measure q(\vec{r}) to include a pairwise coupling term:
q(\vec{r}) = \exp\left(\sum_{n=1}^{N} \alpha_n r_n + \frac{1}{2}\sum_{m\neq n} J_{mn}\, r_m r_n\right) \qquad (42)
In Eq. (42), J is a symmetric N × N matrix where Jmn describes the intrinsic coupling between the mth and nth neurons. In this case, P(\vec{r} | \vec{s}) can still be written as an exponential family in a canonical form:
P(\vec{r} \mid \vec{s}) = q(\vec{r})\exp\left(\vec{s}\cdot\vec{t}(\vec{r}) - A(\vec{s})\right) \qquad (43)
The form of the sufficient statistic remains unchanged, though the log-partition function A(\vec{s}) generally lacks a closed form. Nevertheless, all of the decompositions, equalities, and inequalities in Section 4.2 require only that the exponential family be in canonical form, and they remain valid.
We tested the accuracy of our approximations on a small (N = 10) population of intrinsically correlated neurons. Receptive fields are uniformly distributed on the unit circle, αn = 0 for all n, and P(\vec{s}) is a standard two-dimensional Gaussian (Nstim = 20,000). Intrinsic coupling is set proportional to the overlap between receptive fields with a coupling strength J0, the sign of which determines whether the intrinsic couplings perform stimulus decorrelation or error correction (Tkačik et al., 2010):
| (44) |
The algorithms of Sections 3 and 5 all depend on being able to sample easily from P(\vec{r} | \vec{s}). For large N and general J this is usually difficult, particularly for configurations of J that exhibit glassy dynamics. Additionally, evaluating Eq. (9) requires explicit knowledge of A(\vec{s}), though methods such as mean-field theory or the TAP approximation may be used to approximate it (Opper et al., 2001). Since this population is small, we evaluate the ground-truth information exactly as in Section 5.3. Likewise, we sample from P(\vec{r} | \vec{s}) exactly by computing all 2^N (1,024) probabilities for every sample of \vec{s}. Analyzing the error introduced by using approximate sampling strategies such as Markov Chain Monte Carlo is left to future investigation. Results are plotted in Figure 6. As predicted, Ivector matches the ground-truth values of the information. Similarly, the hierarchy of bounds (29) and (32) is preserved, though for strongly negative couplings the lower bounds become less tight. In sum, the presence of noise correlations does not invalidate the approximations and bounds that are derived above. However, numerical computation can become more difficult in the presence of noise correlations.
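For small coupled populations, the exact sampling strategy described above can be implemented by enumerating all 2^N response patterns. The sketch below assumes the pairwise term enters the log-probability as (1/2) r^T J r (a reading of Eqs. 42-43); names and shapes are illustrative.

```python
import numpy as np
from itertools import product

def coupled_population_sampler(W, alpha, J, s, rng=None):
    """Sample r from P(r | s) with pairwise couplings by exact enumeration.

    A sketch for small N: log P(r | s) is proportional to
    sum_n r_n h_n(s) + (1/2) r^T J r, enumerated over all 2^N states.
    W: (N, D), alpha: (N,), J: symmetric (N, N) with zero diagonal, s: (D,).
    """
    rng = np.random.default_rng() if rng is None else rng
    N = len(alpha)
    states = np.array(list(product([-1.0, 1.0], repeat=N)))        # all 2^N responses
    h = W @ s + alpha                                              # (N,) activations
    log_w = states @ h + 0.5 * np.einsum('km,mn,kn->k', states, J, states)
    p = np.exp(log_w - log_w.max())
    p /= p.sum()
    return states[rng.choice(len(states), p=p)]
```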
Figure 6: Information curves for populations with intrinsic correlations. Lines and error bars are mean and standard deviation over ten repeats of the estimator.
6. Conclusions and Future Work
We have presented three approximations that can be used to estimate the information transmitted by large neural populations. Each of these three approximations represents a different trade-off between accuracy and computational feasibility. The best performance in terms of accuracy was provided by the isotropic approximation Iiso. This approximation worked well even in cases where it is not guaranteed to become asymptotically exact with increasing population size. For example, the isotropic approximation was derived assuming a matching covariance matrix for both the stimulus and RF distributions, cf. Appendix B. Yet, this approximation matched the full ground-truth information values even for correlated stimuli with RFs remaining uncorrelated across the neural population (Fig. 3). This approximation provided the best overall performance among the bounds tested, consistent with the theoretically expected inequalities between these bounds, cf. Eq. (29).
The component-conditional information Icomp-cond offered the second-best performance. This approximation performed especially well when computed in the stimulus basis where stimulus components were not correlated. This approximation is less computationally difficult compared to Iiso, because each conditional information is evaluated between just two quantities Sd and Td, compared to Sd and a conjunction of Td and |T>d| as in Iiso. For this reason, the finite-sample bias of Icomp-cond can also be less than that of Iiso, because bias in the evaluation of the mutual information is usually larger for higher-dimensional calculations.
The last approximation, Icomp-ind, is the least accurate of the three approximations but is computationally the easiest. It is the only approximation among the three we considered here that we were able to implement in conjunction with high-dimensional stimuli. This approximation becomes most accurate in stimulus bases where stimulus components are independent. There is strong evidence that neural receptive fields are organized along the ICA components of natural stimuli (Bell and Sejnowski, 1997; Olshausen and Field, 2004; Smith and Lewicki, 2006). This raises the possibility that the approaches proposed here will fare well when applied to recorded neural responses. Indeed, we found that Icomp-ind reached ≥ 75% of the full information value for large neural populations constructed using experimentally recorded RFs and probed with natural stimuli.
At present, the main limitation for computing the conditional approximations Icomp-cond and Iiso is not the number of neurons but rather the stimulus dimensionality. For stimulus distributions where P(s≥d|s<d) can be easily sampled from, such as Gaussian distributions, we can take advantage of the fourth line of Eq. (20) to compute unbiased estimates of Icomp-cond and Iiso, albeit with possibly high variance. Developing methods that can efficiently approximate these conditional computations represents an important opportunity for future research.
7. Acknowledgements
This research was supported by NSF grant IIS-1724421.
A. Bias of the response entropy estimator
In this section we give a self-contained proof that \tilde{H}(\vec{R}) systematically underestimates the "true" entropy H(\vec{R}). We first assume that the \vec{s}^μ are drawn independently from P(\vec{s}), whether P(\vec{s}) is a smooth density or some larger set of samples. We define an empirical version of the marginal distribution of \vec{r}:
\hat{P}(\vec{r}) = \frac{1}{N_{\rm stim}}\sum_{\mu=1}^{N_{\rm stim}} P(\vec{r} \mid \vec{s}^{\mu}) \qquad (45)
The true marginal distribution P(\vec{r}) is the expected value of \hat{P}(\vec{r}). The Shannon entropy is a concave function of \hat{P}, which can be considered a random vector in the 2^N-dimensional probability simplex. Thus, by Jensen's inequality we have the following:
\left\langle H(\hat{P}) \right\rangle \;\leq\; H\!\left(\left\langle \hat{P} \right\rangle\right) = H(\vec{R}) \qquad (46)
This bias holds even in the case of evaluating H(\hat{P}) through exact enumeration. We note that we are able to produce unbiased estimates of H(\hat{P}) because we have full access to the model: we can evaluate P(\vec{r} | \vec{s}^μ) explicitly and deterministically, and thus \hat{P}(\vec{r}) as well (up to factors of numerical precision). If we were always constrained to drawing samples from P(\vec{r} | \vec{s}), then we would once again be limited to making biased estimates of the entropy (Paninski, 2003).
B. On the asymptotic tightness of Iiso
Consider a large population (N ≫ 1) where the distribution of receptive fields and α is such that the log-partition function A(\vec{s}) depends on \vec{s} only through its norm (in some sense to be made more precise later). Consider the likelihood ratio in the definition of (25):
| (47) |
Additionally we consider a stimulus distribution that is similarly isotropic so that the conditional distribution P(s>d|s≤d) can be written in a convenient factored form:
We will show that in this situation, we can replace T≥d in (25) with the variable that is the concatenation of Td and |T>d| without loss of information:
| (48) |
To show this using the Fisher-Neyman factorization theorem, it suffices to show that the numerator and denominator in (47) can be factored as follows:
| (49) |
With the requirement that f2(t>d) = g2(t>d), so that dependence on t>d cancels out in (47). We note that the first line of (49) implies the second so we examine that term in more detail.
| (50) |
We note that s>d is a K = D − d dimensional vector. We assume that d < D − 1, so that K > 1; otherwise no further reduction of (50) is possible. We convert the integral over s>d in (50) into spherical coordinates and break it into three parts: integration over ρ ∈ [0, ∞) where |s>d| = ρ; integration over θ ∈ [0, π] where s>d · t>d = ρ|t>d|cos(θ); and integration over φ ∈ ΩK−1, the set of all directions with constant θ. The integrand of (50) does not depend on φ, so we can integrate over it automatically, yielding a constant BK that is a function only of K. We can now restate (50) in these coordinates:
| (51) |
We next evaluate the integral over θ in (51).
| (52) |
Where Γ(x) is the Gamma function, and 0F1(a, z) is the confluent hypergeometric limit function. We have our final expression for the first term in (49):
| (53) |
By setting g1 in (49) equal to (53), and letting g2 = f2 = 1, we have established (48).
B.1. Approximating A(\vec{s}) for Gaussian receptive fields
In the previous section we assumed that the distributions of receptive fields and of α are such that A(\vec{s}) is isotropic. In the special case when N ≫ 1, P(α) = δ(α), and the receptive fields are Gaussian distributed, we can approximate A(\vec{s}) in a semi-closed form. Let the receptive field distribution be a zero-mean Gaussian with positive-definite covariance matrix C:
| (54) |
| (55) |
where we have taken advantage of the fact that \vec{w}\cdot\vec{s} is a scalar Gaussian variable with a standard deviation that depends on \vec{s} and C. We next take an infinite series expansion of log(2 cosh(x)).
| (56) |
As an aside, the first equality in (56) is a useful and numerically stable expression for A(x). The "softplus" function l(y) = log(1 + exp(y)) is implemented in many scientific computing packages, and using this alternate form for A(x) sidesteps computing the hyperbolic cosine. We next take the appropriate Gaussian average of each term in (56):
| (57) |
| (58) |
where erfcx(y) is the scaled complementary error function. Thus we have our final form for the Gaussian average:
| (59) |
We note that erfcx(y) monotonically decreases to zero, so for large values of its argument the average is well approximated by the first term in (59). Regardless, we see that A(\vec{s}) depends on \vec{s} only through:
| (60) |
where U is the Cholesky decomposition of C.
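The softplus-based form of A(x) noted after Eq. (56) is simple to implement, and the Gaussian average it enters can be checked numerically; a minimal sketch (the closed form of Eq. (59) is not reproduced here, and the sampling check is an added illustration):

```python
import numpy as np

def log_2cosh(x):
    """Numerically stable log(2 cosh x) = |x| + log(1 + exp(-2|x|))."""
    ax = np.abs(x)
    return ax + np.log1p(np.exp(-2.0 * ax))

def gaussian_avg_log_2cosh(sigma, n_samples=200000, seed=0):
    """Monte-Carlo check of <log 2 cosh(x)> for x ~ N(0, sigma^2) (cf. Eq. 59)."""
    x = sigma * np.random.default_rng(seed).standard_normal(n_samples)
    return log_2cosh(x).mean()
```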
B.2. Generalizing Iiso for matched anisotropy
In this section we will show that a generalized form of Iiso will also asymptotically converge to I(\vec{S}, \vec{T}) when both the stimulus distribution and A(\vec{s}) obey a certain form of "matched" anisotropy. Specifically, we assume that both depend on \vec{s} through a quadratic function of \vec{s} with positive-definite kernel C.
| (61) |
| (62) |
where, as in Section B.1, U is the Cholesky decomposition of C. Let us define transformed versions of \vec{s} and \vec{t}:
| (63) |
| (64) |
where U^{−T} is the transpose of the inverse of U, which is well defined since C is positive definite. We note that the transformed model can once again be written as an exponential family in canonical form:
| (65) |
As the mappings from \vec{s} to its transform and from \vec{t} to its transform are diffeomorphisms, we have that the information is preserved. Furthermore, equation (65) implies that an analogous form of equation (26) holds for the transformed variables:
| (66) |
Additionally, the transformed stimulus distribution and log-partition function are isotropic in the new coordinates. Thus, we may reuse the derivation above to obtain the following analogue of equation (48):
| (67) |
Therefore the generalized Iiso, computed in the transformed coordinates, is asymptotically equal to I(\vec{S}, \vec{T}):
| (68) |
We note that an example of such a matched anisotropy situation would be one where both the stimuli and the receptive fields (for a large population) are distributed according to a Gaussian distribution with covariance matrix C (cf. Section B.1).
C. Independent Sub-populations
In this section we present an example where one of our proposed approximations, in this case Icomp-ind, is equal to I(\vec{S}, \vec{T}). Let {\vec{v}_k} be an orthonormal basis for the stimulus space. Suppose that the distribution of \vec{s} and the receptive fields are such that both P(\vec{s}) and A(\vec{s}) factor when expressed in this basis:
Similarly defining the transformed sufficient statistics in this basis, the inner product \vec{s}\cdot\vec{t} is preserved. Because mutual information is invariant under bijective transformations of the variables (e.g. a change of basis) (Cover and Thomas, 2012), we have that the information is unchanged. It is easy to show that the conditional distribution of the transformed sufficient statistic can be written as follows:
| (69) |
Eq (69) implies that the log-likelihood ratio of to decomposes across :
| (70) |
Thus we have the following reduction of I(\vec{S}, \vec{T}):
| (71) |
We note that (31) includes the case where, for some k, no neuron is sensitive to the kth basis direction. In such a case the corresponding component of the sufficient statistic is zero with probability one and contributes no information. Thus, in the case of independent subpopulations, I(\vec{S}, \vec{T}) can be reduced to computing Icomp-ind following a change of basis.
D. Relationship between the upper bound of Eq. (35) and the Fisher information approximation
In this appendix we relate the upper bound of Eq. (35) to the Fisher-information-based approximation of (Brunel and Nadal, 1998):
I_{\rm Fisher} = H(\vec{S}) + \frac{1}{2}\left\langle \log\frac{\det \mathcal{I}_F(\vec{s})}{(2\pi e)^D}\right\rangle_{P(\vec{s})} \qquad (72)
where \mathcal{I}_F(\vec{s}) is the Fisher information matrix of P(\vec{r} | \vec{s}):
\mathcal{I}_F(\vec{s}) = \left\langle \nabla_{\vec{s}}\log P(\vec{r}\mid\vec{s})\;\nabla_{\vec{s}}\log P(\vec{r}\mid\vec{s})^{T}\right\rangle_{P(\vec{r}\mid\vec{s})} \qquad (73)
We begin by considering the inner expectation (the average over stimulus samples) in Eq. (35):
| (74) |
We next assume that the activation functions are affine (e.g. Eq. (13)), and thus that the family is in canonical form. We also assume that the natural parameter is identifiable:
| (75) |
For (13) a necessary and sufficient condition for identifiability is that the matrix W has full rank, a reasonable assumption when N ≫ D. We utilize the following properties of exponential families in canonical form:
- The gradient of the log-partition function gives the expected sufficient statistic, and its Hessian equals the Fisher information matrix \mathcal{I}_F(\vec{s}).
- A(\vec{s}) is convex and \mathcal{I}_F(\vec{s}) is positive semi-definite. When the family is identifiable, the relative entropy becomes strictly convex, \mathcal{I}_F positive definite, and the relative entropy has a global minimum with respect to \vec{s}' of 0 at \vec{s}' = \vec{s}.
In the limit N ≫ D we approximate the inner expectation using Laplace's Method (Bender and Orszag, 1999), expanding around \vec{s}' = \vec{s}:
| (76) |
Plugging (76) into the definition of the upper bound (35), we have the following asymptotic expression for it:
| (77) |
For stimulus distributions where the entropy H(\vec{S}) is known a priori, such as the Gaussian distributions in Sections 5.1 and 5.2, the Fisher approximation can be computed in O(M) time. If not, then H(\vec{S}) must be estimated, a challenging task in high dimensions. In Figure 7, we replot the results of Section 5.1 with the inclusion of the Fisher approximation. We see that it is a very loose upper bound on the true information and on the bound of Eq. (35), indicating that the convergence of Laplace's Method may be very slow in this situation.
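For completeness, the Fisher-information approximation can be evaluated directly for the affine model when H(S) is known. The sketch below assumes the standard Brunel-Nadal form I ≈ H(S) + (1/2)⟨log det(I_F(s)/(2πe))⟩ for Eq. (72) (a reading of the source, not a verbatim expression) and uses the Fisher matrix I_F(s) = Σ_n w_n w_n^T / cosh²(h_n(s)), which follows from the affine logistic model.

```python
import numpy as np

def fisher_information_approx(W, alpha, stimuli, H_S):
    """Fisher-information approximation to I(R, S) (cf. Eqs. 72-73).

    A sketch; W: (N, D), alpha: (N,), stimuli: (M, D), and H_S is the
    (known) stimulus entropy in nats, e.g. of a Gaussian.
    """
    M, D = stimuli.shape
    h = stimuli @ W.T + alpha                        # (M, N) activations
    gain = 1.0 / np.cosh(h) ** 2                     # per-neuron Fisher weights
    avg_logdet = 0.0
    for mu in range(M):
        I_F = (W * gain[mu][:, None]).T @ W          # (D, D) Fisher matrix at s^mu
        avg_logdet += np.linalg.slogdet(I_F / (2 * np.pi * np.e))[1]
    return H_S + 0.5 * avg_logdet / M
```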
Figure 7: Information curves for the population of Section 5.1 compared to the Fisher approximation. Lines and error bars are mean and standard deviation over ten repeats of the estimator.
E. Extension to polynomial activation functions
In Section 4.1 we assumed that the activation functions were affine functions of the stimulus vector \vec{s}. In this appendix we will show how to generalize some of the results of Section 4.1 to polynomial activation functions. For clarity of exposition we will demonstrate this generalization for quadratic functions, as the procedure for higher-order polynomials follows analogously. To begin, we add a quadratic term to Eq. (12):
h_n(\vec{s}) = \vec{w}_n\cdot\vec{s} + \alpha_n + \vec{s}^{\,T}\gamma_n\vec{s} \qquad (78)
Here γn is a symmetric D × D matrix representing the quadratic kernel of the nth neuron's activation function. We note that the quadratic term equals the sum of the entries of γn ◦ (the outer product of \vec{s} with itself), where A ◦ B is the Hadamard product between equally shaped matrices A and B. We define a vector embedding of this outer product:
where a = d mod D and b = ⌊d/D⌋ are index mappings that map a D × D matrix into a vector of length D². We define a similar vector embedding of the parameters \vec{w}_n and γn:
For the sake of brevity we henceforth omit the dependence of on and γn. By construction we have the following equivalence:
| (79) |
If all neurons have activation functions of the form in (78) then may once again be written as an exponential family
| (80) |
Here the embedded response statistic plays the role of the sufficient statistic for this family, and the embedded stimulus vector ψ is the natural parameter. As before, the sufficient statistic preserves the information in the population response. However, we note that ψ can be written entirely in terms of \vec{s}. Additionally, we note that the support of ψ lies on a D-dimensional manifold in the embedding space, and \vec{s} maps injectively into this manifold. Thus the information between the natural parameter and the sufficient statistic equals the information between \vec{S} and the sufficient statistic.
We note several properties of . First, we can in principle expand like Eq. (19):
| (81) |
Secondly, the same reduction as Eq. (26) holds for :
| (82) |
Most notable, however, is that the terms with d > D vanish. This holds because ψd = g(ψ≤D) for d > D, where g(ψ≤D) is just the product of the two relevant components of ψ≤D. Because of this functional dependence we can apply the generalization of Eq. (22). Therefore the expansion of the mutual information can be truncated after D terms.
| (83) |
In fact, we can make an even stronger reduction by noting that conditioning on components of effectively also conditions on elements of ψ>D. For clarity of exposition we break down into vector and matrix valued components:
We note that conditioning on Sd conditions on the components of corresponding to Td and . Additionally conditioning on S<d conditions on the components of T<d and on the components of for all indices d1 and d2 such that 1 ≤ d1,d2 < d. Thus Eq. (83) can be further generalized:
| (84) |
The dth term in Eq. (83) has D²+D+1 degrees of freedom, while the dth term in Eq. (84) has D²+D+1−(d−1)² degrees of freedom. The above procedure can be generalized to polynomial activation functions of arbitrarily high but finite order, though the dimensionality of the sufficient statistic and natural parameter grows exponentially with the order. However, Eq. (84) holds for any order of polynomial, so that one needs only compute the first D terms of the expansion of the mutual information between the sufficient statistic and natural parameter.
Contributor Information
John A. Berkowitz, Department of Physics, University of California San Diego, San Diego, CA 92093
Tatyana O. Sharpee, Computational Neurobiology Laboratory, Salk Institute for Biological Studies, La Jolla, CA 92037 Department of Physics, University of California San Diego, San Diego, CA 92093.
References
- Banerjee A, Merugu S, Dhillon IS, and Ghosh J (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6(Oct):1705–1749.
- Barber D and Agakov F (2003). The IM algorithm: a variational approach to information maximization. In Proceedings of the 16th International Conference on Neural Information Processing Systems, pages 201–208. MIT Press.
- Belghazi I, Rajeswar S, Baratin A, Hjelm RD, and Courville A (2018). MINE: Mutual information neural estimation. arXiv preprint arXiv:1801.04062.
- Bell AJ and Sejnowski TJ (1997). The "independent components" of natural scenes are edge filters. Vision Res, 23:3327–3338.
- Bender C and Orszag S (1999). Advanced Mathematical Methods for Scientists and Engineers I: Asymptotic Methods and Perturbation Theory. Springer.
- Berkowitz J and Sharpee T (2018). Decoding neural responses with minimal information loss. bioRxiv.
- Bialek W (2012). Biophysics: Searching for Principles. Princeton University Press, Princeton and Oxford.
- Brenner N, Strong SP, Koberle R, Bialek W, and de Ruyter van Steveninck RR (2000). Synergy in a neural code. Neural Comput, 12:1531–1552.
- Brunel N and Nadal JP (1998). Mutual information, Fisher information, and population coding. Neural Comput, 10(7):1731–1757.
- Cover TM and Thomas JA (2012). Elements of Information Theory. John Wiley and Sons.
- Deserno M (2004). How to generate equidistributed points on the surface of a sphere. P.-If Polymerforshung (Ed.), page 99.
- Dettner A, Münzberg S, and Tchumatchenko T (2016). Temporal pairwise spike correlations fully capture single-neuron information. Nature Communications, 7:13805.
- Gao S, Ver Steeg G, and Galstyan A (2015). Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics, pages 277–286.
- Gao W, Kannan S, Oh S, and Viswanath P (2017). Estimating mutual information for discrete-continuous mixtures. In Advances in Neural Information Processing Systems, pages 5986–5997.
- Gao W, Oh S, and Viswanath P (2018). Demystifying fixed k-nearest neighbor information estimators. IEEE Transactions on Information Theory.
- Haussler D, Opper M, et al. (1997). Mutual information, metric entropy and cumulative relative entropy risk. The Annals of Statistics, 25(6):2451–2492.
- Huang W and Zhang K (2018). Information-theoretic bounds and approximations in neural population coding. Neural Computation, (Early Access):1–60.
- Kolchinsky A, Tracey BD, and Wolpert DH (2017). Nonlinear information bottleneck. arXiv preprint arXiv:1705.02436.
- Kraskov A, Stögbauer H, and Grassberger P (2004). Estimating mutual information. Physical Review E, 69(6):066138.
- Laughlin SB, de Ruyter van Steveninck RR, and Anderson JC (1998). The metabolic cost of neural computation. Nat. Neurosci, 1:36–41.
- Nemenman I, Bialek W, and van Steveninck R. d. R. (2004). Entropy and information in neural spike trains: Progress on the sampling problem. Physical Review E, 69(5):056111.
- Olshausen BA and Field DJ (2004). Sparse coding of sensory inputs. Curr Opin Neurobiol, 14(4):481–487.
- Opper M, Winther O, et al. (2001). From naive mean field theory to the TAP equations. Advanced Mean Field Methods: Theory and Practice, pages 7–20.
- Paninski L (2003). Estimation of entropy and mutual information. Neural Computation, 15(6):1191–1253.
- Renner R and Maurer U (2002). About the mutual (conditional) information. In Proc. IEEE ISIT.
- Rieke F, Warland D, de Ruyter van Steveninck RR, and Bialek W (1997). Spikes: Exploring the Neural Code. MIT Press, Cambridge.
- Schneidman E, Berry II MJ, Segev R, and Bialek W (2006). Weak pairwise correlations imply strongly correlated network states in a neural population. Nature, 440:1007–1012.
- Sharpee TO, Sugihara H, Kurgansky AV, Rebrik SP, Stryker MP, and Miller KD (2006). Adaptive filtering enhances information transmission in visual cortex. Nature, 439(7079):936–942.
- Smith E and Lewicki MS (2006). Efficient auditory coding. Nature, 439:978–982.
- Strong SP, Koberle R, van Steveninck R. R. d. R., and Bialek W (1998). Entropy and information in neural spike trains. Physical Review Letters, 80(1):197.
- Tkačik G, Prentice JS, Balasubramanian V, and Schneidman E (2010). Optimal population coding by noisy spiking neurons. Proceedings of the National Academy of Sciences, 107(32):14419–14424.
- Treves A and Panzeri S (1995). The upward bias in measures of information derived from limited data samples. Neural Comp, 7:399–407.
- Wainwright MJ and Jordan MI (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305.
- Yu Y, Crumiller M, Knight B, and Kaplan E (2010). Estimating the amount of information carried by a neuronal population. Frontiers in Computational Neuroscience, 4:10.