Abstract
While Shannon’s mutual information has widespread applications in many disciplines, for practical applications it is often difficult to calculate its value accurately for high-dimensional variables because of the curse of dimensionality. This article focuses on effective approximation methods for evaluating mutual information in the context of neural population coding. For large but finite neural populations, we derive several information-theoretic asymptotic bounds and approximation formulas that remain valid in high-dimensional spaces. We prove that optimizing the population density distribution based on these approximation formulas is a convex optimization problem that allows efficient numerical solutions. Numerical simulation results confirmed that our asymptotic formulas were highly accurate for approximating mutual information for large neural populations. In special cases, the approximation formulas are exactly equal to the true mutual information. We also discuss techniques of variable transformation and dimensionality reduction to facilitate computation of the approximations.
1. Introduction
Shannon’s mutual information (MI) provides a quantitative characterization of the association between two random variables by measuring how much knowing one of the variables reduces uncertainty about the other (Shannon, 1948). Information theory has become a useful tool for neuroscience research (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Borst & Theunissen, 1999; Pouget, Dayan, & Zemel, 2000; Laughlin & Sejnowski, 2003; Brown, Kass, & Mitra, 2004; Quiroga & Panzeri, 2009), with applications to various problems such as sensory coding problems in the visual systems (Eckhorn & Pöpel, 1975; Optican & Richmond, 1987; Atick & Redlich, 1990; McClurkin, Gawne, Optican, & Richmond, 1991; Atick, Li, & Redlich, 1992; Becker & Hinton, 1992; Van Hateren, 1992; Gawne & Richmond, 1993; Tovee, Rolls, Treves, & Bellis, 1993; Bell & Sejnowski, 1997; Lewis & Zhaoping, 2006) and the auditory systems (Chechik et al., 2006; Gourévitch and Eggermont, 2007; Chase & Young, 2005).
One major problem encountered in practical applications of information theory is that the exact value of mutual information is often hard to compute in high-dimensional spaces. For example, suppose we want to calculate the mutual information between a random stimulus variable that requires many parameters to specify and the elicited noisy responses of a large population of neurons. In order to accurately evaluate the mutual information between the stimuli and the responses, one has to average over all possible stimulus patterns and over all possible response patterns of the whole population. This averaging quickly leads to a combinatorial explosion as either the stimulus dimension or the population size increases. This problem occurs not only when one computes MI numerically for a given theoretical model but also when one estimates MI empirically from experimental data.
Even when the input and output dimensions are not that high, an MI estimate from experimental data tends to have a positive bias due to limited sample size (Miller, 1955; Treves & Panzeri, 1995). For example, a perfectly flat joint probability distribution implies zero MI, but an empirical joint distribution with fluctuations due to finite data size appears to suggest a positive MI. The error may get much worse as the input and output dimensions increase because a reliable estimate of MI may require exponentially more data points to fill the space of the joint distribution. Various asymptotic expansion methods have been proposed to reduce the bias in an MI estimate (Miller, 1955; Carlton, 1969; Treves & Panzeri, 1995; Victor, 2000; Paninski, 2003). Other estimators of MI have also been studied, such as those based on k-nearest neighbor (Kraskov, Stögbauer, & Grassberger, 2004) and minimal spanning trees (Khan et al., 2007). However, it is not easy for these methods to handle the general situation with high-dimensional inputs and high-dimensional outputs.
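This small-sample bias is easy to reproduce numerically. The following minimal Python sketch (the alphabet sizes, sample sizes, and variable names are illustrative choices, not taken from the text) draws two independent discrete variables, so the true MI is zero, and shows that the plug-in estimate from the empirical joint histogram is systematically positive, close to the classical first-order bias term of Miller (1955):

```python
import numpy as np

rng = np.random.default_rng(0)

def plugin_mi(counts):
    """Plug-in (maximum likelihood) MI estimate, in nats, from a joint count table."""
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz]))

# Two independent, uniform variables over 10 states each: the true MI is zero.
n_states, n_samples, n_trials = 10, 200, 500
estimates = []
for _ in range(n_trials):
    x = rng.integers(n_states, size=n_samples)
    y = rng.integers(n_states, size=n_samples)
    counts = np.zeros((n_states, n_states))
    np.add.at(counts, (x, y), 1)
    estimates.append(plugin_mi(counts))

print(f"mean plug-in MI: {np.mean(estimates):.4f} nats (true value: 0)")
# First-order bias term (Miller, 1955): (|X||Y| - |X| - |Y| + 1) / (2 * n_samples).
print(f"predicted bias:  {(n_states**2 - 2*n_states + 1) / (2*n_samples):.4f} nats")
```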
For numerical computation of MI for a given theoretical model, one useful approach is Monte Carlo sampling, a convergent method that may potentially reach arbitrary accuracy (Yarrow, Challis, & Series, 2012). However, its stochastic and inefficient computational scheme makes it unsuitable for many applications. For instance, to optimize the distribution of a neural population for a given set of stimuli, one may want to slightly alter the population parameters and see how the perturbation affects the MI, but a tiny change of MI can be easily drowned out by the inherent noise in the Monte Carlo method.
An alternative approach is to use information-theoretic bounds and approximations to simplify calculations. For example, the Cramér-Rao lower bound (Rao, 1945) tells us that the inverse of Fisher information (FI) is a lower bound to the mean square decoding error of any unbiased decoder. Fisher information is useful for many applications partly because it is often much easier to calculate than MI (see e.g., Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Zhang & Sejnowski, 1999; Abbott & Dayan, 1999; Bethge, Rotermund, & Pawelzik, 2002; Harper & McAlpine, 2004; Toyoizumi, Aihara, & Amari, 2006).
A link between MI and FI has been studied by several researchers (Clarke & Barron, 1990; Rissanen, 1996; Brunel & Nadal, 1998; Sompolinsky, Yoon, Kang, & Shamir, 2001). Clarke and Barron (1990) first derived an asymptotic formula between the relative entropy and FI for parameter estimation from independent and identically distributed (i.i.d.) observations with suitable smoothness conditions. Rissanen (1996) generalized it in the framework of stochastic complexity for model selection. Brunel and Nadal (1998) presented an asymptotic relationship between the MI and FI in the limit of a large number of neurons. The method was extended to discrete inputs by Kang and Sompolinsky (2001). More general discussions about this also appeared in other papers (e.g., Ganguli & Simoncelli, 2014; Wei & Stocker, 2015). However, for finite population size, the asymptotic formula may lead to large errors, especially for high-dimensional inputs, as detailed in sections 2.2 and 4.1.
In this article, our main goal is to improve FI approximations to MI for finite neural populations especially for high-dimensional inputs. Another goal is to discuss how to use these approximations to optimize neural population coding. We will present several information-theoretic bounds and approximation formulas and discuss the conditions under which they are established in section 2, with detailed proofs given in the appendix. We also discuss how our approximation formulas are related to other statistical estimators and information-theoretic bounds, such as Cramér-Rao bound and van Trees’ Bayesian Cramér-Rao bound (see section 3). In order to better apply the approximation formulas in high-dimensional input space, we propose some useful techniques in section 4, including variable transformation and dimensionality reduction, which may greatly reduce the computational complexity for practical applications. Finally, in section 5, we discuss how to use the approximation formulas for optimizing information transfer for neural population coding.
2. Bounds and Approximations for Mutual Information in Neural Population Coding
2.1. Mutual Information and Notations.
Suppose the input x is a K-dimensional vector, x = (x1, x2, …, xK)T, and the outputs of N neurons are denoted by a vector, r = (r1, r2, …, rN)T. In this article, we denote random variables by uppercase letters (e.g., random variables X and R) in contrast to their vector values x and r. The MI I(X; R) (denoted as I below) between X and R is defined by Cover and Thomas (2006):
I = ⟨ln [p(r∣x)/p(r)]⟩r,x = ∫ p(x) ∫ p(r∣x) ln [p(r∣x)/p(r)] dr dx,    (2.1)
where , , , , and the integration symbol ∫ is for the continuous variables and can be replaced by the summation symbol ∑ for discrete variables. The probability density function (p.d.f.) of r, p(r), satisfies
p(r) = ∫ p(r∣x) p(x) dx.    (2.2)
The MI I in equation 2.1 may also be expressed equivalently as
I = H(X) + ⟨ln p(x∣r)⟩r,x,    (2.3)
where H(X) is the entropy of random variable X,
H(X) = −⟨ln p(x)⟩x = −∫ p(x) ln p(x) dx,    (2.4)
and ⟨·⟩ denotes expectation:
(2.5) |
(2.6) |
(2.7) |
Next, we introduce the following notations,
(2.8) |
(2.9) |
(2.10) |
and
IF = ⟨ln [det(J(x))^(1/2) / ((2πe)^(K/2) p(x))]⟩x,    (2.11)
IG = ⟨ln [det(G(x))^(1/2) / ((2πe)^(K/2) p(x))]⟩x,    (2.12)
where det (·) denotes the matrix determinant, and
J(x) = ⟨l′(r∣x) l′(r∣x)^T⟩r∣x,    (2.13)
G(x) = J(x) + P(x),    (2.14)
P(x) = −q″(x) = −∂² ln p(x)/∂x∂x^T.    (2.15)
Here J(x) is the FI matrix, which is symmetric and positive-semidefinite, and ′ and ″ denote the first and second derivatives with respect to x, respectively; that is, l′(r∣x) = ∂l(r∣x)/∂x and q″(x) = ∂² ln p(x)/∂x∂x^T. If p(r∣x) is twice differentiable for x, then
J(x) = −⟨l″(r∣x)⟩r∣x.    (2.16)
We denote the Kullback-Leibler (KL) divergence as
(2.17) |
and define
(2.18) |
as the ω neighborhoods of x and its complementary set as
(2.19) |
where ω is a positive number.
2.2. Information-Theoretic Asymptotic Bounds and Approximations.
In the large N limit, Brunel and Nadal (1998) proposed an asymptotic relationship I ~ IF between MI and FI and gave a proof in the case of one-dimensional input. Another proof is given by Sompolinsky et al. (2001), although there appears to be an error in their proof when a replica trick is used (see equation B1 in their paper; their equation B5 does not follow directly from the replica trick). For large but finite N, I ≃ IF is usually a good approximation as long as the inputs are low dimensional. For high-dimensional inputs, the approximation may no longer be valid. For example, suppose p(r∣x) is a normal distribution with mean ATx and covariance matrix IN and p(x) is a normal distribution with mean μ and covariance matrix ∑,
p(r∣x) = 𝒩(r; A^T x, IN),  p(x) = 𝒩(x; μ, ∑),    (2.20)
where A = [a1, a2, …, aN] is a deterministic K × N matrix and IN is the N × N identity matrix. The MI I is given by (see Verdu, 1986; Guo, Shamai, & Verdu, 2005, for details)
I = (1/2) ln det(IK + ∑AA^T).    (2.21)
If rank (J(x)) < K, then IF = −∞. Notice that here, J(x) = AAT. When a = a1 = … = aN and ∑ = IK, then by equation 2.21 and the matrix determinant lemma, we have
I = (1/2) ln det(IK + Naa^T) = (1/2) ln(1 + Na^Ta),    (2.22)
and by equation 2.11,
IF = −∞,    (2.23)
which is obviously incorrect as an approximation to I. For high-dimensional inputs, the determinant det (J(x)) may become close to zero in practical applications. When the FI matrix J(x) becomes degenerate, the regularity condition ensuring the Cramér-Rao paradigm of statistics is violated (Amari & Nakahara, 2005), in which case using IF as a proxy for I incurs large errors.
In the following, we will show that IG is a better approximation of I for high-dimensional inputs. For instance, for the above example, we can verify that
IG = (1/2) ln det(IK + ∑AA^T),    (2.24)
which is exactly equal to the MI I given in equation 2.21.
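The gap between IF and IG in this example can also be checked numerically. The Python sketch below uses the Brunel-Nadal-type definitions of IF and IG written out above (equations 2.11 and 2.12), with G(x) = J(x) + P(x) and P(x) = ∑^(−1) for the gaussian prior; these expressions, and the scaling choices, should be read as assumptions for the sketch rather than a definitive implementation. For the linear gaussian model, IG reproduces the exact MI of equation 2.21, while IF collapses once J(x) = AAT becomes rank deficient:

```python
import numpy as np

rng = np.random.default_rng(1)
K, N = 5, 100

Sigma = np.eye(K)                            # prior covariance, p(x) = N(mu, Sigma)
A = rng.normal(size=(K, N))                  # full-rank mixing matrix
J = A @ A.T                                  # Fisher information J(x) = A A^T (constant in x)
P = np.linalg.inv(Sigma)                     # P(x) = Sigma^{-1} for the gaussian prior

def entropy_gauss(S):
    # Differential entropy of a gaussian with covariance S.
    return 0.5 * np.linalg.slogdet(2 * np.pi * np.e * S)[1]

def I_exact(J):                              # eq. 2.21
    return 0.5 * np.linalg.slogdet(np.eye(K) + Sigma @ J)[1]

def I_F(J):                                  # assumed Brunel-Nadal form of eq. 2.11
    return entropy_gauss(Sigma) + 0.5 * np.linalg.slogdet(J / (2 * np.pi * np.e))[1]

def I_G(J):                                  # assumed form of eq. 2.12 with G = J + P
    return entropy_gauss(Sigma) + 0.5 * np.linalg.slogdet((J + P) / (2 * np.pi * np.e))[1]

print(I_exact(J), I_G(J), I_F(J))            # I_G equals the exact MI; I_F is close here since J is large

# Degenerate case of eq. 2.22: all columns of A identical, so rank(J) = 1 < K.
A_deg = np.tile(A[:, :1], (1, N))
J_deg = A_deg @ A_deg.T
print(I_exact(J_deg), I_G(J_deg))            # still identical
print(np.linalg.slogdet(J_deg))              # det(J) = 0, so I_F diverges to -infinity (eq. 2.23)
```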
2.2.1. Regularity Conditions.
First, we consider the following regularity conditions for p(x) and p(r∣x):
C1: p(x) and p(r∣x) are twice continuously differentiable for almost every , where is a convex set; G(x) is positive definite, and ∥ G−1 (x)∥ = O(N−1), where ∥·∥ denotes the Frobenius norm of a matrix. The following conditions hold:
(2.25a) |
(2.25b) |
(2.25c) |
(2.25d) |
and there exists an ω = ω (x) > 0 for such that
(2.25e) |
where O indicates the big-O notation.
C2: The following condition is satisfied,
(2.26a) |
for , and there exists η > 1 such that
(2.26b) |
for all ϵ ∈ (0,1/2), and with p(x) > 0, where denotes the probability of r given x.
The regularity conditions C1 and C2 are needed to prove theorems in later sections. They are expressed in mathematical forms that are convenient for our proofs, although their meanings may seem opaque at first glance. In the following, we will examine these conditions more closely. We will use specific examples to make interpretations of these conditions more transparent.
Remark 1. In this article, we assume that the probability distributions p(x) and p(r∣x) are piecewise twice continuously differentiable. This is because we need to use Fisher information to approximate mutual information, and Fisher information requires derivatives that make sense only for continuous variables. Therefore, the methods developed in this article apply only to continuous input variables or stimulus variables. For discrete input variables, we need alternative methods for approximating MI, which we will address in a separate publication.
Conditions 2.25a and 2.25b state that the first and the second derivatives of q(x) = ln p(x) have finite values for any given . These two conditions are easily satisfied by commonly encountered probability distributions because they only require finite derivatives within , the set of allowable inputs, and derivatives do not need to be finitely bounded.
Remark 2. Conditions 2.25c to 2.26a constrain how the first and the second derivatives of l(r∣x) = ln p(r∣x) scale with N, the number of neurons. These conditions are easily met when p(r∣x) is conditionally independent or when the noises of different neurons are independent, that is, .
We emphasize that it is possible to satisfy these conditions even when p(r∣x) is not independent or when the noises are correlated, as we show later. Here we first examine these conditions closely, assuming independence. For simplicity, our demonstration that follows is based on a one-dimensional input variable (K = 1). The conclusions are readily generalizable to higher-dimensional inputs (K > 1) because K is fixed and does not affect the scaling with N.
Assuming independence, we have l(r∣x) = ∑n l(rn∣x) with l(rn∣x) = ln p(rn∣x), and the left-hand side of equation 2.25c becomes
(2.27) |
where the final result contains only two terms with even numbers of duplicated indices, while all other terms in the expansion vanish because any unmatched or lone index k (from n1, n2, n3, n4) should yield a vanishing average:
(2.28) |
Thus, condition 2.25c is satisfied as long as ⟨l′(rn∣x)2⟩rn∣x and ⟨l′(rn∣x)4⟩rn∣x are bounded by some finite numbers, say, a and b, respectively, because now equation 2.27 should scale as N−2 (aN(N – 1) + bN) = O(1). For instance, a gaussian distribution always meets this requirement because the averages of the second and fourth powers are proportional to the second and fourth moments, which are both finite. Note that the argument above works even if ⟨l′(rn∣x)4⟩rn∣x is not finitely bounded but scales as O(N).
Similarly, under the assumption of independence, the left-hand side of equation 2.25d becomes
(2.29) |
where, in the second step, the only remaining terms are the squares, while all other terms in the expansion with n ≠ m have vanished because ⟨l″(rn∣x) – ⟨l″(rn∣x)⟩rn∣x⟩rn∣x = 0. Thus, condition 2.25d is satisfied as long as ⟨l″(rn∣x)⟩rn∣x and ⟨l″(rn∣x)2⟩rn∣x are bounded so that equation 2.29 scales as N−2N = N−1.
Condition 2.25e is easily satisfied under the assumption of independence. It is easy to show that this condition holds when l″(rn∣x) is bounded.
Condition 2.26a can be examined with similar arguments used for equations 2.27 and 2.29. Assuming independence, we rewrite the left-hand side of equation 2.26a as
(2.30) |
where z = 2(m + 1) ≥ 4 is an even number. Any term in the expansion with an unmatched index nk should vanish, as in the cases of equations 2.27 and 2.29. When ⟨l″ (rn∣x)⟩rn∣x and ⟨l″(rn∣x)2⟩rn∣x are bounded, the leading term with respect to scaling with N is the product of squares, as shown at the end of equation 2.30, because all the other nonvanishing terms increase more slowly with N. Thus equation 2.30 should scale as N−zNm+1 = N−m−1, which trivially satisfies condition 2.26a.
In summary, conditions 2.25c to 2.26a are easy to meet when p(r∣x) is independent. It is sufficient to satisfy these conditions when the averages of the first and second derivatives of l(r∣x) = ln p(r∣x), as well as the averages of their powers, are bounded by finite numbers for all the neurons.
Remark 3. For neurons with correlated noises, if there exists an invertible transformation that maps r to such that becomes conditionally independent, then conditions C1 and C2 are easily met in the space of the new variables by the discussion in remark 2. This situation is best illustrated by the familiar example of a population of neurons with correlated noises that obey a multivariate gaussian distribution:
(2.31) |
where ∑ is an N × N invertible covariance matrix, and g = (g1(x; θ1),…, gN(x; θN)) describes the mean responses with θn being the parameter vector. Using the following transformation,
(2.32) |
(2.33) |
we obtain the independent distribution:
(2.34) |
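One common way to realize the transformation in equations 2.32 and 2.33 is to multiply both the response vector and the mean-response vector by ∑^(−1/2); the short Python sketch below illustrates this. The specific transformation shown is an assumption (the displayed forms of equations 2.32 and 2.33 are not reproduced here), and the tuning functions and parameter values are placeholders. After the transformation, the empirical noise covariance is close to the identity, so the new responses behave as independent gaussian variables with unit variance.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 50

# Example noise covariance with constant pairwise correlation c and variance a.
a, c = 1.0, 0.3
Sigma = a * ((1 - c) * np.eye(N) + c * np.ones((N, N)))

# Whitening matrix Sigma^{-1/2} from the eigendecomposition of Sigma.
w, U = np.linalg.eigh(Sigma)
Sigma_inv_half = U @ np.diag(w ** -0.5) @ U.T

# Placeholder mean responses g(x) for a scalar stimulus x (illustrative gaussian tuning curves).
theta = np.linspace(-0.5, 0.5, N)
def g(x):
    return 10.0 * np.exp(-(x - theta) ** 2 / (2 * 0.25 ** 2))

x = 0.1
r = rng.multivariate_normal(g(x), Sigma, size=5000)    # correlated responses, eq. 2.31
r_tilde = r @ Sigma_inv_half.T                         # transformed responses (cf. eq. 2.32)
g_tilde = Sigma_inv_half @ g(x)                        # transformed mean responses (cf. eq. 2.33)

# The transformed noise is approximately independent with unit variance (cf. eq. 2.34).
noise_cov = np.cov((r_tilde - g_tilde).T)
print(np.abs(noise_cov - np.eye(N)).max())
```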
In the special case when the correlation coefficient between any pair of neurons is a constant c, −1 < c < 1, the noise covariance can be written as
(2.35) |
where a > 0 is a constant, IN is the N × N identity matrix, and . The desired transformation in equations 2.32 and 2.33 is given explicitly by
(2.36) |
where
(2.37) |
The new response variables defined in equations 2.32 and 2.33 now read:
(2.38) |
(2.39) |
Now we have the derivatives:
(2.40) |
(2.41) |
where and are finite as long as ∂gn/∂x and ∂2gn/∂x2 are finite. Conditions C1 and C2 are satisfied when the derivatives and their powers are finitely bounded as shown before.
The example above shows explicitly that it is possible to meet conditions C1 and C2 even when the noises of different neurons are correlated. More generally, if a nonlinear transformation exists that maps correlated random variables into independent variables, then by similar argument, conditions C1 and C2 are satisfied when the derivatives of the log likelihood functions and their powers in the new variables are finitely bounded. Even when the desired transformation does not exist or is unknown, it does not necessarily imply that conditions C1 and C2 must be violated.
While the exact mathematical conditions for the existence of the desired transformation are unclear, let us consider a specific example. If a joint probability density function can be morphed smoothly and reversibly into a flat or constant density in a cube (hypercube), which is a special case of an independent distribution, then this morphing is the desired transformation. Here we may replace the flat distribution by any known independent distribution and the argument above should still work. So the desired transformation may exist under rather general conditions.
For correlated random variables, one may use algorithms such as independent component analysis to find an invertible linear mapping that makes the new random variables as independent as possible (Bell & Sejnowski, 1997) or use neural networks to find related nonlinear mappings (Huang & Zhang, 2017). These methods do not directly apply to the problem of testing conditions C1 and C2 because they work for a given network size N and further development is needed to address the scaling behavior in the large network limit N → ∞.
Finally, we note that the value of the MI of the transformed independent variables is the same as the MI of the original correlated variables because of the invariance of MI under invertible transformation of marginal variables. A related discussion is in theorem 3, which involves a transformation of the input variables rather than a transformation of the output variables as needed here.
Remark 4. Condition 2.26b is satisfied if a positive number δ and a positive integer m exist such that
(2.42) |
for all , where
(2.43) |
and A < B means that the matrix A – B is negative definite. A proof is as follows.
First note that in equation 2.43, if η → 1 or m → ∞, then . Following Markov’s inequality, condition C2 and equation A.19 in the appendix, for the complementary set of , , we have
(2.44) |
where
(2.45) |
Define the set
(2.46) |
Then it follows from Markov’s inequality and equation 2.42 that
(2.47) |
Hence, we get
which yields condition 2.26b.
Condition 2.42 is satisfied if there exists a positive number ς such that
(2.48) |
for all and . This is because
(2.49) |
Here notice that det (G (x))1/2 = O (NK/2) (see equation A.23).
Inequality 2.48 holds if p(r∣x) is conditionally independent, namely, , with
(2.50) |
for all and . Consider the inequality where the equality holds when . If there is only one extreme point at for , then generally it is easy to find a set that satisfies equation 2.50, so that equation 2.26b holds.
2.2.2. Asymptotic Bounds and Approximations for Mutual Information.
Let
(2.51) |
and it follows from conditions C1 and C2 that
(2.52) |
Moreover, if p(r∣x) is conditionally independent, then by an argument similar to the discussion in remark 2, we can verify that the condition ξ = O (N−1) is easily met.
In the following we state several conclusions about the MI; their proofs are given in the appendix.
Lemma 1. If condition C1 holds, then the MI I has an asymptotic upper bound for integer N,
(2.53) |
Moreover, if equations 2.25c and 2.25d are replaced by
(2.54a) |
(2.54b) |
for some τ ∈ (0,1), where o indicates the Little-O notation, then the MI has the following asymptotic upper bound for integer N:
(2.55) |
Lemma 2. If conditions C1 and C2 hold, ξ = O (N−1), then the MI has an asymptotic lower bound for integer N,
(2.56) |
Moreover, if condition C1 holds but equations 2.25c and 2.25d are replaced by 2.54a and 2.54b, and inequality 2.26b in C2 also holds for η > 0, then the MI has the following asymptotic lower bound for integer N:
(2.57) |
Theorem 1. If conditions C1 and C2 hold, ξ = O (N−1), then the MI has the following asymptotic equality for integer N:
(2.58) |
For more relaxed conditions, suppose condition C1 holds but equations 2.25c and 2.25d are replaced by 2.54a and 2.54b, and inequality 2.26b in C2 also holds for η > 0, then the MI has an asymptotic equality for integer N:
(2.59) |
Theorem 2. Suppose J(x) and G(x) are symmetric and positive-definite. Let
(2.60) |
(2.61) |
Then
(2.62) |
where Tr(·) denotes the matrix trace; moreover, if P(x) is positive-semidefinite, then
(2.63) |
But if
(2.64) |
for some β > 0, then
(2.65) |
Remark 5. In general, we need only to assume that p(x) and p(r∣x) are piecewise twice continuously differentiable for . In this case, lemmas 1 and 2 and theorem 1 can still be established. For more general cases, such as discrete or continuous inputs, we have also derived a general approximation formula for MI from which we can easily derive the formula for IG (this will be discussed in a separate paper).
2.3. Approximations of Mutual Information in Neural Populations with Finite Size.
In the preceding section, we provided several bounds, including both lower and upper bounds, and asymptotic relationships for the true MI in the large N (network size) limit. Now, we discuss effective approximations to the true MI in the case of finite N. Here we consider only the case of continuous inputs (we will discuss the case of discrete inputs in another paper).
Theorem 1 tells us that under suitable conditions, we can use IG to approximate I for a large but finite N (e.g., N ⪢ K), that is,
I ≃ IG.    (2.66)
Moreover, by theorem 2, we know that if ς ≈ 0 with positive-semidefinite P(x) or ς1 ≈ 0 holds (see equations 2.60 and 2.64), then by equations 2.63, 2.65, and 2.66, we have
I ≃ IG ≃ IF.    (2.67)
Define
(2.68) |
(2.69) |
where is positive-definite and Q (x) is a symmetric matrix depending on x and ∥Q (x) ∥ = O(1). Suppose . If we replace IG by in theorem 1, then we can prove equations 2.58 and 2.59 in a manner similar to the proof of that theorem. Considering a special case where ∥P(x)∥ → 0, det (J(x)) = O(1) (e.g., rank (J(x)) < K) and ∥G−1 (x) ∥ ≠ O(N−1), then we can no longer use the asymptotic formulas in theorem 1. However, if we substitute for G(x) by choosing an appropriate Q (x) such that is positive-definite and , then we can use equation 2.58 and 2.59 as the asymptotic formula.
If we assume G(x) and are positive-definite and
(2.70) |
then similar to the proof of theorem 2, we have
(2.71) |
and
For large N, we usually have .
It is more convenient to redefine the following quantities:
(2.72) |
(2.73) |
(2.74) |
and
(2.75) |
Notice that if p(x) is twice differentiable for x and
(2.76) |
then
(2.77) |
For example, if p(x) is a normal distribution, , then
(2.78) |
Similar to the proof of theorem 2, we can prove that
(2.79) |
where
(2.80) |
We find that IG is often a good approximation of MI I even for relatively small N. However, we cannot guarantee that P(x) is always positive-semidefinite in equation 2.14, and as a consequence, it may happen that det (G(x)) is very small for small N, G(x) is not positive-definite, and ln (det (G(x))) is not a real number. In this case, IG is not a good approximation to I but IG+ is still a good approximation. Generally, if P(x) is always positive-semidefinite, then IG or IG+ is a better approximation than IF, especially when p(x) is close to a normal distribution.
In the following, we give an example of 1D inputs. High-dimensional inputs are discussed in section 4.1.
2.3.1. A Numerical Comparison for 1D Stimuli.
Considering the Poisson neuron model (see equation 5.7 in section 5.1 for details), the tuning curve of the nth neuron, f (x; θn), takes the form of circular normal or von Mises distribution
(2.81) |
where x ∈ [−T/2, T/2), θn ∈ [−Tθ/2, Tθ/2], n ∈ {1, 2, …, N}, with T = π, Tθ = 1, σf = 0.5, and A = 20, and the centers θ1, θ2, …, θN of the N neurons are uniformly distributed on the interval [−Tθ/2, Tθ/2], that is, θn = (n – 1)dθ – Tθ/2, with dθ = Tθ/(N – 1) and N ≥ 2. Suppose the distribution p(x) of the 1D continuous input x (K = 1) has the form
(2.82) |
where σp is a constant set to π/4 and Z is the normalization constant. Figure 1A shows graphs of the input distribution p(x) and the tuning curves f (x; θ) with different centers θ = −π/4, 0, π/4.
To evaluate the precision of the approximation formulas, we use Monte Carlo (MC) simulation to approximate MI I. For MC simulation, we first sample an input xj by the distribution p(x), then generate the neural response rj by the conditional distribution p(rj∣xj), where j = 1, 2, …, jmax. The value of MI by MC simulation is calculated by
(2.83) |
where p(rj) is given by
(2.84) |
and xm = (m – 1) T/M – T/2 for m ϵ {1,2, …, M}.
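A minimal Python sketch of this Monte Carlo procedure is given below. Because equations 2.81 to 2.84 are not reproduced here in full, the circular-normal tuning curve is written as f(x; θ) = A exp{[cos(2π(x − θ)/T) − 1]/σf²} and the prior as a truncated gaussian of width σp; both forms, as well as the discrete-sum approximation of p(rj) over the grid xm, should be read as plausible instantiations rather than the exact expressions of the original. The number of samples is also kept much smaller than the jmax = 5 × 10⁵ used in the text so that the sketch runs quickly.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

rng = np.random.default_rng(3)

T, T_theta, sigma_f, amp = np.pi, 1.0, 0.5, 20.0     # constants from the text (A = 20)
sigma_p = np.pi / 4
N, M, j_max = 50, 1000, 2000                         # j_max reduced from 5e5 for a quick run

theta = np.arange(N) * T_theta / (N - 1) - T_theta / 2     # tuning-curve centers theta_n

def f(x, th):
    # Assumed circular-normal tuning curve (one plausible reading of eq. 2.81).
    return amp * np.exp((np.cos(2 * np.pi * (x - th) / T) - 1) / sigma_f ** 2)

def log_lik(r, x):
    # Poisson log likelihood ln p(r|x) = sum_n [r_n ln f_n - f_n - ln r_n!].
    rate = f(np.atleast_1d(x)[:, None], theta[None, :])
    return np.sum(r * np.log(rate) - rate - gammaln(r + 1), axis=-1)

x_grid = np.arange(M) * T / M - T / 2                # grid points x_m
prob = np.exp(-x_grid ** 2 / (2 * sigma_p ** 2))     # assumed shape of the prior, eq. 2.82
prob /= prob.sum()                                   # discretized p(x_m), normalized to sum to 1

I_mc = 0.0
for _ in range(j_max):
    x_j = rng.choice(x_grid, p=prob)                 # sample x_j from p(x)
    r_j = rng.poisson(f(x_j, theta))                 # sample r_j from p(r|x_j)
    log_p_r = logsumexp(log_lik(r_j, x_grid) + np.log(prob))   # discretized p(r_j), cf. eq. 2.84
    I_mc += log_lik(r_j, x_j)[0] - log_p_r           # summand of eq. 2.83
print("I_MC estimate:", I_mc / j_max, "nats")
```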
To evaluate the accuracy of MC simulation, we compute the standard deviation,
(2.85) |
where
(2.86) |
(2.87) |
and Γj,i ∈ {1, 2, …, jmax} is the (j, i)th entry of a matrix whose entries are sampled uniformly at random from the integer set {1, 2, …, jmax}. Here we set jmax = 5 × 10⁵, imax = 100, and M = 10³.
For different N ∈ {2, 3, 4, 6, 10, 14, 20, 30, 50, 100, 200, 400, 700, 1000}, we compare IMC with IG, IG+, and IF, as illustrated in Figures 1B to 1D. Here we define the relative error of approximation, for example, for IG, as
(2.88) |
and the relative standard deviation
(2.89) |
Figure 1B shows how the values of IMC, IG, IG+, and IF change with neuron number N, and Figures 1C and 1D show their relative errors and the absolute values of the relative errors with respect to IMC. From Figures 1B to 1D, we can see that the values of IG, IG+, and IF are all very close to one another and the absolute values of their relative errors are all very small. The absolute values are less than 1% when N ≥ 10 and less than 0.1% when N ≥ 100. However, for the high-dimensional inputs, there will be a big difference between IG, IG+, and IF in many cases (see section 4.1 for more details).
3. Statistical Estimators and Neural Population Decoding
Given the neural response r elicited by the input x, we may infer or estimate the input x from the response. This procedure is sometimes referred to as decoding from the response. We need to choose an efficient estimator or a function that maps the response r to an estimate of the true stimulus x. The maximum likelihood (ML) estimator defined by
x̂ML(r) = arg maxx p(r∣x)    (3.1)
is known to be efficient in the large N limit. According to the Cramér-Rao lower bound (Rao, 1945), we have the following relationship between the covariance matrix of any unbiased estimator and the FI matrix J(x),
⟨(x̂(r) − x)(x̂(r) − x)^T⟩r∣x ≥ J−1(x),    (3.2)
where is an unbiased estimation of x from the response r, and A ≥ B means that matrix A – B is positive-semidefinite. Thus,
(3.3) |
The MI between X and is given by
(3.4) |
where is the entropy of random variable and is its conditional entropy of random variable given X. Since the maximum entropy probability distribution is gaussian, satisfies
(3.5) |
Therefore, from equations 3.4 and 3.5, we get
(3.6) |
The data processing inequality (Cover & Thomas, 2006) states that postprocessing cannot increase information, so that we have
(3.7) |
Here we cannot directly obtain I ≥ IF as in Brunel and Nadal (1998) when and . The simulation results in Figure 1 also show that IF is not a lower bound of I.
For biased estimators, the van Trees’ Bayesian Cramér-Rao bound (Van Trees & Bell, 2007) provides a lower bound:
(3.8) |
It follows from equations 2.75, 3.6, and 3.8 that
(3.9) |
(3.10) |
(3.11) |
We may also regard decoding as Bayesian inference. By Bayes’ rule,
p(x∣r) = p(r∣x)p(x)/p(r).    (3.12)
According to Bayesian decision theory, if we know the response r, from the prior p(x) and the likelihood p(r∣x), we can infer an estimate of the true stimulus x, for example,
x̂MAP(r) = arg maxx p(x∣r),    (3.13)
which is also called maximum a posteriori (MAP) estimation.
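For a concrete picture, the Python sketch below performs ML and MAP decoding by a grid search over x, reusing the assumed Poisson population with circular-normal tuning curves from the sketch in section 2.3.1 (so the tuning-curve form and the prior are again illustrative assumptions rather than expressions taken from the text):

```python
import numpy as np

rng = np.random.default_rng(4)
amp, sigma_f, T, N, M = 20.0, 0.5, np.pi, 50, 1000
sigma_p = np.pi / 4
theta = np.linspace(-0.5, 0.5, N)                       # tuning-curve centers
x_grid = np.linspace(-T / 2, T / 2, M, endpoint=False)
log_prior = -x_grid ** 2 / (2 * sigma_p ** 2)           # assumed ln p(x), up to a constant

def rates(x):
    # Assumed circular-normal tuning curves, as in the sketch for section 2.3.1.
    x = np.atleast_1d(x)[:, None]
    return amp * np.exp((np.cos(2 * np.pi * (x - theta) / T) - 1) / sigma_f ** 2)

x_true = 0.3
r = rng.poisson(rates(x_true)[0])                       # observed Poisson spike counts

lam = rates(x_grid)                                     # (M, N) firing rates on the grid
log_lik = np.sum(r * np.log(lam) - lam, axis=1)         # ln p(r|x); the ln r_n! term is constant in x
x_map = x_grid[np.argmax(log_lik + log_prior)]          # MAP estimate, eq. 3.13
x_ml = x_grid[np.argmax(log_lik)]                       # ML estimate, eq. 3.1
print(x_true, x_map, x_ml)
```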
Consider a loss function for estimation,
(3.14) |
which is minimized when p(x∣r) reaches its maximum. Now the conditional risk is
(3.15) |
and the overall risk is
(3.16) |
Then it follows from equations 2.3 and 3.16 that
I = H(X) − Ro.    (3.17)
Comparing equations 2.12, 2.66, and 3.17, we find
(3.18) |
Hence, maximizing the MI I (or IG) means minimizing the overall risk Ro for a fixed H(X). Therefore, we can obtain the optimal Bayesian inference by optimizing the MI I (or IG).
By the Cramér-Rao lower bound, we know that the inverse of the FI matrix, J−1(x), reflects the accuracy of decoding (see equation 3.2). P(x) provides some knowledge about the prior distribution p(x); for example, P−1(x) is the covariance matrix of the input x when p(x) is a normal distribution. ∥P(x)∥ is small for a flat prior (poor prior) and large for a sharp prior (good prior). Hence, if the prior p(x) is flat or poor and the knowledge about the model is rich, then the MI I is governed by the knowledge of the model, which results in a small ς1 (see equation 2.64) and I ≃ IG ≃ IF. Otherwise, the prior knowledge has a great influence on the MI I, which results in a large ς1 and I ≃ IG ≄ IF.
4. Variable Transformation and Dimensionality Reduction in Neural Population Coding
For low-dimensional input x and large N, both IG and IF are good approximations of MI I, but for high-dimensional input x, a large value of ς1 may lead to a large error of IF, in which case IG (or IG+) is a better approximation. It is difficult to directly apply the approximation formula I ≃ IG when we do not have an explicit expression of p(x) or P(x). For many applications, we do not need to know the exact value of IG and care only about the value of ⟨ln (det (G(x)))⟩x (see section 5). From equations 2.12, 2.22, and 2.78, we know that if p(x) is close to a normal distribution, we can easily approximate P(x) and H(X) to obtain ⟨ln (det (G(x)))⟩x and IG. When p(x) is not a normal distribution, we can employ a technique of variable transformation to make it closer to a normal distribution, as discussed below.
4.1. Variable Transformation.
Suppose is an invertible and differentiable mapping:
(4.1) |
, and . Let denote the p.d.f. of random variable and
(4.2) |
Then we have the following conclusions, the proofs of which are given in the appendix.
Theorem 3. The MI is invariant under invertible transformations. More specifically, for the above invertible transformation T, the MI I(X; R) in equation 2.1 is equal to
(4.3) |
Furthermore, suppose and fulfill the conditions C1, C2 and ξ = O (N−1). Then we have
(4.4) |
(4.5) |
where is the entropy of random variable and satisfies
(4.6) |
and DT (x) denotes the Jacobian matrix of T (x),
(4.7) |
Corollary 1. Suppose p(r∣x) is a normal distribution,
(4.8) |
where y = f(BTx) = (y1, y2, …, yK)T with yk = fk(bkTx) for k = 1, 2, …, K, A is a deterministic K × N matrix, B = [b1, b2, …, bK] is a deterministic invertible matrix, and fk is an invertible and differentiable function. If Y also has a normal distribution with covariance matrix ∑f, then
(4.9) |
where
(4.10) |
(4.11) |
(4.12) |
Remark 6. From corollary 1 and equation 2.78, we know that the approximation accuracy for IG ≃ I(X; R) is improved when we employ an invertible transformation on the input random variable X to make the new random variable Y closer to a normal distribution (see section 4.3).
Consider the eigendecompositions of AAT and ∑f as given by
(4.13) |
(4.14) |
where UA and Uf are K × K orthogonal matrices; and are K × K eigenvalue matrices; and and . Then by equations 2.11 and 4.9, we have
(4.15) |
(4.16) |
and
(4.17) |
Now consider two special cases. If , then by equation 4.17, we get
(4.18) |
If UA = Uf, then
(4.19) |
Here , . The FI matrices J(x) and P−1(x) become degenerate when and .
From equations 4.18 and 4.19, we see that if either J(x) or P−1(x) becomes degenerate, then (IF – IG) → −∞. This may happen for high-dimensional stimuli. For a specific example, consider a random matrix A defined as follows. We first generate K × N elements Ak,n (k = 1, 2, …, K; n = 1, 2, …, N) from a normal distribution (0, 1). Then each column of matrix A is normalized by . We randomly sample M (set to 2 × 10⁴) image patches of size ω × ω from Olshausen's natural image data set (Olshausen & Field, 1996) as the inputs. Each input image patch was centered by subtracting its mean: . Then let for all m ∈ {1, 2, …, M}. Define the matrix X = [x1, x2, …, xM] and compute the eigendecomposition
(4.20) |
where Ux is a K × K orthogonal matrix and is a K × K eigenvalue matrix with . Define
(4.21) |
Then
(4.22) |
The distribution of random variable Y can be approximated by a normal distribution (see section 4.3 for more details). When , we have
(4.23) |
(4.24) |
(4.25) |
The error of approximation IF is given by
(4.26) |
and the relative error for IF is
(4.27) |
Figure 2A shows how the values of IG and IF vary with the input dimension K = ω × ω and the number of neurons N (with ω = 2, 4, 6, …, 30 and N = 10⁴, 2 × 10⁴, 5 × 10⁴, 10⁵). The relative error DIF is shown in Figure 2B. The absolute value of the relative error tends to decrease with N but may grow quite large as K increases. In Figure 2B, the largest absolute value of relative error ∣DIF∣ is greater than 5000%, which occurs when K = 900 and N = 10⁴. Even the smallest ∣DIF∣ is still greater than 80%, which occurs when K = 100 and N = 10⁵. In this example, IF is a bad approximation of MI I, whereas IG and IG+ are strictly equal to the true MI I across all parameters.
4.2. Dimensionality Reduction for Asymptotic Approximations.
Suppose x = (x1, … xK)T is partitioned into two sets of components, with
(4.28) |
(4.29) |
where , , K1 + K2 = K, K ≥ 2, K1 ≥ 1 and K2 ≥ 1.
Then by Fubini’s theorem, the MI I in equation 2.1 can be written as
(4.30) |
where p(x1, x2) = p(x) and p(r∣x1, x2) = p(r∣x).
First define
(4.31a) |
(4.31b) |
where i, j ϵ {1, 2}, and
(4.32a) |
(4.32b) |
Then we have the following results; their proofs are given in the appendix.
Theorem 4. Suppose matrices G (x), G1, 1 (x), and G2,2 (x) are positive-definite. If the matrix satisfies
(4.33) |
with
(4.34) |
then we have
(4.35) |
with strict equality if and only if
(4.36) |
where
(4.37) |
Theorem 5. Suppose matrices G (x), G1,1 (x) and P2,2 (x) are positive-definite. If the matrix is positive-semidefinite and satisfies
(4.38) |
with
(4.39) |
(4.40) |
then we have
(4.41) |
with strict equality if and only if
(4.42) |
where
(4.43) |
Corollary 2. If the random variables X1 and X2 are independent so that is a normal distribution, and G (x), G1,1 (x), P1,1 (x) and P2,2 (x) are all positive-definite and satisfy equation 4.38, then we have
(4.44) |
(4.45) |
with strict equality if and only if
(4.46) |
where
(4.47a) |
(4.47b) |
(4.47c) |
Remark 7. Sometimes we are concerned only with calculating the determinant of matrix G(x) with a given p(x). Theorems 4 and 5 provide a dimensionality reduction method for computing G (x) or det (G (x)), by which we need only to compute G1,1 (x) and G2,2 (x) separately. To apply the approximation 4.35, we do not need to strictly require ∣Tr (⟨Ax⟩x)∣ ⪡ 1. Instead we need to require only
(4.48) |
Similarly, the inequality ∣Tr (⟨Bx⟩x)∣ ⪡ 1 can be substituted by
(4.49) |
By equation 4.44 and the second mean value theorem for integrals, we get
(4.50) |
for some fixed . When ∥∑x2∥ is small, should be close to the mean: . It follows from theorem 1 and corollary 2 that the approximate relationship holds. However, equation 4.50 implies that is determined only by the first component x1. Hence, there is little impact on information transfer by the minor component (i.e., x2) for the high-dimensional input x. In other words, the information transfer is mainly determined by the first component x1, and we can omit the minor component x2.
4.3. Further Discussion.
Suppose x is a zero-mean vector; if it is not, then let x ← x – ⟨x⟩x. The covariance matrix of x is given by
(4.51) |
where U is a K × K orthogonal matrix whose kth column is the eigenvector uk of ∑x and ∑ is a diagonal matrix whose diagonal elements are the corresponding eigenvalues, with σ1 ≥ σ2 ≥ … ≥ σK > 0. With the whitening transformation,
(4.52) |
the covariance matrix of becomes an identity matrix:
(4.53) |
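The whitening step in equations 4.51 to 4.53 can be written compactly as follows, with the transformation taken as x̃ = Λ^(−1/2)Uᵀx (a Python sketch; the synthetic data stand in for the actual inputs, and only the transformation itself is taken from the text):

```python
import numpy as np

rng = np.random.default_rng(5)
K, M = 16, 5000

# Synthetic correlated, zero-mean inputs standing in for the data.
C = rng.normal(size=(K, K))
X = rng.normal(size=(M, K)) @ C.T
X -= X.mean(axis=0)                           # x <- x - <x>_x

Sigma_x = np.cov(X.T)                         # covariance of x, eq. 4.51
sigma, U = np.linalg.eigh(Sigma_x)            # eigenvalues (ascending) and eigenvectors
order = np.argsort(sigma)[::-1]               # reorder so that sigma_1 >= ... >= sigma_K
sigma, U = sigma[order], U[:, order]

# Whitening transformation, eq. 4.52: x_tilde = Lambda^{-1/2} U^T x.
X_white = X @ U @ np.diag(sigma ** -0.5)

# Covariance of the whitened variable is approximately the identity, eq. 4.53.
print(np.abs(np.cov(X_white.T) - np.eye(K)).max())
```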
By the central limit theorem, the distribution of random variable should be closer to a normal distribution than the distribution of the original random variable X; that is, . Using Laplace’s method asymptotic expansion (MacKay, 2003), we get
(4.54) |
(4.55) |
In principal component analysis (PCA), the data set is modeled by a multivariate gaussian. By a PCA-like whitening transformation (equation 4.52), we can use the approximation 4.55 with Laplace's method, which requires only that the peak be close to the mean; the random variable does not need to follow an exact gaussian distribution.
By theorem 3, we have
(4.56) |
where
(4.57) |
(4.58) |
(4.59) |
(4.60) |
(4.61) |
Given a K × K orthogonal matrix , we define
(4.62) |
Then it follows from equations 4.56 to 4.62 that
(4.63) |
where
(4.64) |
(4.65) |
(4.66) |
Suppose y is partitioned into two sets of components, and
(4.67) |
(4.68) |
where K1 + K2 = K, K ≥ 2, K1 ≥ 1 and K2 ≥ 1. Let
(4.69) |
where
(4.70) |
When K ⪢ 1, suppose we can find an orthogonal matrix B and K1 that satisfy condition 4.38 in theorem 5 or condition 4.49—that is,
(4.71) |
(4.72) |
(4.73) |
Here matrix By is positive-semidefinite because
(4.74) |
where
(4.75) |
and a (r) is a K1-dimensional random vector that satisfies
(4.76) |
(4.77) |
Assuming that J1,1 (y) is positive-definite, and ∥J1,2(y)∥ = ∥J2,1(y)∥ = O (N), we have
(4.78) |
and
(4.79) |
Hence, if
(4.80) |
(4.81) |
then equation 4.71 holds. Notice that the matrix is positive-semidefinite, which is similar to equation 4.74 and . Hence, if
(4.82) |
then equations 4.80 and 4.81 hold, and so does equation 4.71.
5. Optimization of Information Transfer in Neural Population Coding
5.1. Population Density Distribution of Parameters in Neural Populations.
If p(r∣x) is conditionally independent, we can write
(5.1) |
where θn denotes a vector of parameters for the nth neuron, and p(rn∣x; θn) is the conditional p.d.f. of the output rn given x. With the definition in equation 2.13, we have the following proposition.
Proposition 1. If p(r∣x) is conditionally independent as in equation 5.1, we have
(5.2) |
where
(5.3) |
, , and p(θ) is the population density function of parameter vector θ:
p(θ) = (1/N) ∑n δ(θ − θn),    (5.4)
with δ (·) being the Dirac delta function.
Proof.
(5.5) |
□
Remark 8. Proposition 1 shows that J(x) can be regarded as a function of the population density of parameters, p(θ). If the p.d.f. of the input p(x) is given, we can find an appropriate p(θ) to maximize MI I.
For a neuron model with Poisson spikes, we have
(5.6) |
(5.7) |
where f (x; θn) is the tuning curve of the nth neuron, n = 1, 2, …, N. Now we have
(5.8) |
(5.9) |
Similarly, for a neuron response model with gaussian noise, we have
(5.10) |
(5.11) |
where σ is a constant standard deviation. Now we get
(5.12) |
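As a concrete sketch, the FI matrix of a conditionally independent population can be assembled neuron by neuron. The Python code below uses the standard expressions J(x) = ∑n f′n f′nᵀ/fn for Poisson spiking and J(x) = σ^(−2) ∑n f′n f′nᵀ for gaussian noise, which are assumed to correspond to equations 5.8 to 5.12 (their displayed forms are not reproduced here); the gaussian tuning curves and parameter values are illustrative.

```python
import numpy as np

def fisher_info(x, centers, width=0.5, amp=20.0, model="poisson", sigma=1.0):
    """Fisher information matrix J(x) for a population of gaussian tuning curves.

    x: stimulus of dimension K; centers: (N, K) preferred stimuli of the N neurons.
    """
    x = np.asarray(x, dtype=float)
    diff = x - centers                                               # (N, K)
    f = amp * np.exp(-np.sum(diff ** 2, axis=1) / (2 * width ** 2))  # mean rates f_n(x), (N,)
    df = -(diff / width ** 2) * f[:, None]                           # gradients f'_n(x), (N, K)
    if model == "poisson":
        # Assumed standard form of eqs. 5.8-5.9: J(x) = sum_n f'_n f'_n^T / f_n.
        return (df.T / f) @ df
    # Gaussian noise, assumed form of eq. 5.12: J(x) = sigma^{-2} sum_n f'_n f'_n^T.
    return (df.T @ df) / sigma ** 2

# Example: K = 2 stimulus and N = 400 neurons with centers on a regular grid.
g = np.linspace(-1, 1, 20)
centers = np.array(np.meshgrid(g, g)).reshape(2, -1).T
J_poisson = fisher_info(np.array([0.1, -0.2]), centers)
J_gauss = fisher_info(np.array([0.1, -0.2]), centers, model="gauss")
print(J_poisson, J_gauss, sep="\n")
```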
5.2. Optimal Population Distribution for Neural Population Coding.
Suppose p(x) and p(r∣x) fulfill conditions C1 and C2 and equation 5.1. Following the discussion in section 2.2, we define the following objective for maximizing MI I,
(5.13) |
or, equivalently,
(5.14) |
where
(5.15) |
(5.16) |
(5.17) |
Here P (x) is given in equation 2.15, and it generally can be substituted by P+ (see equation 2.78).
When ς1 ≈ 0 (see equation 2.64), the objective function, equation 5.13, can be reduced to
(5.18) |
or, equivalently,
(5.19) |
The constraint condition for p(θ) is given by
(5.20) |
However, without further constraints on the neural populations, especially a limit on the peak firing rate, the capacity of the system may grow indefinitely: I(X; R) → ∞. The most common limitation on neural populations is the energy or power constraint. For neuron models with Poisson noise or gaussian noise, a useful constraint is a limitation on the peak power,
(5.21) |
where Emax > 0 is the peak power. Under this constraint, maximizing IG[p(θ)] or IF[p(θ)] for independent neurons will result in maxx ∣f(x; θn)∣ = Emax for all n = 1, 2, …, N.
Another constraint is a limitation on average power. For Poisson neurons given in equation 5.7,
(5.22) |
which can also be written as
(5.23) |
and for gaussian noise neurons given in equation 5.11,
(5.24) |
where Eavg > 0 is the maximum average energy cost.
In equation 5.15, we can approximate the continuous integral by a discrete summation for numerical computation,
(5.25) |
where the positive integer K1 ≤ N denotes the number of subclasses in the neural population and
(5.26) |
If we do not know the specific form of p(x) but have M i.i.d. samples x1, x2, …, xM drawn from the distribution p(x), then we can approximate the integral in equation 5.13 by the sample average:
(5.27) |
Optimizing the objective 5.13 or 5.18 is a convex optimization problem (see the appendix for a proof).
Proposition 2. The functions IG[p(θ)] and IF[p(θ)] are concave with respect to p(θ).
Remark 9. For a low-dimensional input x, we may use equation 5.18 or 5.19 as the objective. Since IG[p(θ)] and IF[p(θ)] are concave functions of p(θ), we can directly use efficient numerical methods to get the optimal solution for small K. However, for high-dimensional input x, we need to use other methods (e.g., Huang & Zhang, 2017).
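A minimal numerical illustration of this convex problem (Python): the population density is discretized into K1 subclasses as in equation 5.25, the expectation over x is replaced by a sample average as in equation 5.27, and the weights are updated by exponentiated (mirror) gradient ascent, which keeps them on the probability simplex of equation 5.20. The one-dimensional gaussian tuning curves, the gaussian prior, and the step-size rule are illustrative choices, not part of the original method.

```python
import numpy as np

rng = np.random.default_rng(6)
N, K1, M = 200, 21, 400                   # neurons, subclasses, stimulus samples
theta = np.linspace(-2, 2, K1)            # candidate tuning-curve centers theta_k
x_s = rng.normal(size=M)                  # i.i.d. samples from p(x) (standard normal prior)
P = 1.0                                   # P(x) = 1 for a standard normal prior (K = 1)

def J_single(x, th, width=0.5, amp=20.0):
    # Single-neuron Poisson Fisher information for a 1D gaussian tuning curve (illustrative).
    f = amp * np.exp(-(x - th) ** 2 / (2 * width ** 2))
    df = -(x - th) / width ** 2 * f
    return df ** 2 / f

Jk = J_single(x_s[:, None], theta[None, :])            # (M, K1) table of J(x_j; theta_k)

def objective(w):
    # Sample-average surrogate of <(1/2) ln det G(x)>_x with G(x) = N sum_k w_k J(x; theta_k) + P.
    return 0.5 * np.mean(np.log(N * Jk @ w + P))

w = np.full(K1, 1.0 / K1)                               # initial uniform population density
for _ in range(500):
    G = N * Jk @ w + P
    grad = 0.5 * N * np.mean(Jk / G[:, None], axis=0)   # gradient of the concave objective
    w = w * np.exp(0.1 * grad / grad.max())             # exponentiated-gradient ascent step
    w /= w.sum()                                        # normalization constraint, eq. 5.20

print(objective(w), np.round(w, 3))
```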
5.3. Necessary and Sufficient Conditions for Optimal Population Distribution.
Applying the method of Lagrange multipliers for the optimization problems 5.13 and 5.20 yields
(5.28) |
where λ1 is a constant and λ2(θ) is a function of θ. According to Karush-Kuhn-Tucker (KKT) conditions (Boyd & Vandenberghe, 2004), we have
(5.29) |
and the necessary condition for optimal population density,
(5.30) |
It follows from equations 5.29 and 5.30 that
(5.31) |
(5.32) |
Since IG[p(θ)] is a concave function of p(θ), equations 5.31 and 5.32 are the necessary and sufficient conditions for the optimization problems 5.13 and 5.20.
5.4. Channel Capacity for Neural Population Coding.
If p(x) is unknown, then by Jensen’s inequality, we have
(5.33) |
and the equality holds if and only if p(x)−1 det (G(x))1/2 is a constant. Thus,
IG[p*(x)] = ln ∫ det[G(x)/(2πe)]^(1/2) dx,    (5.34)
p*(x) = det(G(x))^(1/2) / ∫ det(G(x′))^(1/2) dx′,    (5.35)
assuming .
Let us consider a specific example. Suppose J(x) = J0 is a constant matrix; then it follows from equation 2.12 that
(5.36) |
According to the maximum entropy probability distribution, we know that maximizing H(X) results in a uniformly distributed p(x). Hence we have G(x) = J0, and p*(x) coincides with the uniform distribution (see equation 5.35). In this case, the maximum IG[p*(x)] can be regarded as the channel capacity for this neural population.
If we consider a constraint on random variables X and assume that the covariance matrix of X is ∑0 and satisfies
(5.37) |
then it follows from the maximum entropy probability distribution that
(5.38) |
and the equality holds if and only if the p.d.f. of the input is a normal distribution: . Hence,
(5.39) |
where IG[p*(x)] is the channel capacity of the neural population. Here the equality holds if and only if , which is consistent with equation 5.37.
Furthermore, if ς1 ≈ 0 (see equation 2.64), we have
(5.40) |
Similarly, we also get
IF[p*(x)] = ln ∫ det[J(x)/(2πe)]^(1/2) dx,    (5.41)
p*(x) = det(J(x))^(1/2) / ∫ det(J(x′))^(1/2) dx′,    (5.42)
assuming . Here IF[p*(x)] is the channel capacity of the neural population. The distribution p*(x) coincides with the Jeffreys prior in Bayesian probability (Jeffreys, 1961). In this case, if we suppose the covariance matrix of X is ∑0, then similar to equations 5.38 and 5.39, we can get the channel capacity
(5.43) |
with .
For another example, consider the Poisson neuron model given in equation 5.7 and suppose the input x is one-dimensional (K = 1). It follows from equations 5.8 and 5.42 that
(5.44) |
If p(θ) = δ(θ – θ0), equation 5.44 becomes
(5.45) |
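As a small worked sketch (Python) of the capacity-achieving input distribution: for a one-dimensional Poisson population, p*(x) in equation 5.42 is proportional to J(x)^(1/2), the Jeffreys prior. The gaussian tuning curve centered at θ0 below is an illustrative choice, and the closed forms of equations 5.44 and 5.45 are not reproduced here, so this is an assumed instantiation rather than the original expressions.

```python
import numpy as np

amp, sigma_f, theta0, N = 20.0, 0.5, 0.0, 100
x = np.linspace(-2, 2, 2001)
dx = x[1] - x[0]

# All N neurons share one gaussian tuning curve, i.e. p(theta) = delta(theta - theta0).
f = amp * np.exp(-(x - theta0) ** 2 / (2 * sigma_f ** 2))
df = -(x - theta0) / sigma_f ** 2 * f
J = N * df ** 2 / f                          # 1D Poisson Fisher information J(x)

# Capacity-achieving input distribution, eq. 5.42: p*(x) proportional to sqrt(J(x)).
p_star = np.sqrt(J)
p_star /= p_star.sum() * dx

# Corresponding capacity I_F[p*] = ln( integral of sqrt(J(x)/(2 pi e)) dx ), cf. eq. 5.41.
capacity = np.log(np.sum(np.sqrt(J / (2 * np.pi * np.e))) * dx)
print("capacity:", capacity, "nats")
print("p* peaks near |x| =", abs(x[np.argmax(p_star)]))
```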
Atick and Redlich (1990) presented a redundancy measure to approximate Barlow’s optimality principle:
(5.46) |
where C(R) is the channel capacity. Here for neural population coding, we have C(R) ≃ IG[p*(x)] and I(X; R) ≃ IG (or C(R) ≃ IF[p*(x)] and I(X; R) ≃ IF). Hence, we can minimize the redundancy by choosing an appropriate J(x) to maximize IG (or IF) and simultaneously satisfy equation 5.35 (or 5.42) (see Huang & Zhang, 2017, for further details).
6. Discussion
In this article, we have derived several information-theoretic bounds and approximations for effective approximation of MI in the context of neural population coding for large but finite population size. We have found some regularity conditions under which the asymptotic bounds and approximations hold. Generally these regularity conditions are easy to meet. Special examples that satisfy these conditions include the cases when the likelihood function p(r∣x) for the neural population responses is conditionally independent or has correlated noises with a multivariate gaussian distribution. Under the general regularity conditions, we have derived several asymptotic bounds and approximations of MI for a neural population and found some relationships among different approximations.
How to choose among these different asymptotic approximations of MI in a neural population with finite size N? For a flat prior distribution p(x), we have IG ≃ IF; that is, the two approximations IG and IF are about equally valid. For a sharply peaked prior distribution p(x), IG is generally a better approximation to MI I than IF. Under suitable conditions (e.g., cases C1 and C2) for low-dimensional inputs, IG and IF are good approximations of MI I not only for large N but also for small N. For high-dimensional inputs, the FI matrix J(x) (see equation 2.11) or matrix P−1(x) (see equation 2.15) often becomes degenerate, which causes a large error between IF and MI I. Hence, in this situation, IG is a better approximation to MI I than IF. For more convenient computation of the approximation, we have also introduced the approximation formula IG+ which may substitute for IG as a proxy of MI I. For some special cases (see corollary 1), IG and IG+ are strictly equal to the true MI I. Our simulation results for the one-dimensional case show that the approximations IG, IG+, and IF are all highly precise compared with the true MI I, even for small N (see Figure 1).
These approximation formulas are also connected to classical estimation-theoretic bounds. By the Cramér-Rao lower bound, we know that IF is related to the covariance matrix of an unbiased estimator (see equation 3.3). By van Trees' Bayesian Cramér-Rao bound, we get a link between IG+ and the covariance matrix of a biased estimator (see equation 3.9). From the point of view of neural population decoding and Bayesian inference, there is a connection between MI (or IG) and MAP estimation (see equation 3.17).
For more efficient calculation of the approximation IG (or IG+) for high-dimensional inputs, we propose to apply an invertible transformation to the input variable so as to make the new variable closer to a normal distribution (see section 4.1). Another useful technique is dimensionality reduction, which further reduces the computational complexity of approximating MI for high-dimensional inputs. We found that IF could lead to huge errors as a proxy for the true MI I for high-dimensional inputs even when IG and IG+ are strictly equal to the true MI I.
These approximation formulas are potentially useful for optimization problems of information transfer in neural population coding. We have proven that optimizing the population density distribution of parameters p(θ) is a convex optimization problem and have found a set of necessary and sufficient conditions. The approximation formulas are also useful for discussion of the channel capacity of neural population coding (see section 5.4).
Information theory is a powerful tool for neuroscience and other disciplines, including diverse fields such as physics, information and communication technology, machine learning, computer vision, and bioinformatics. Finding effective approximation methods for computing MI is a key for many practical applications of information theory. Generally the FI matrix is easier to evaluate or approximate than MI. This is because calculation of MI involves averaging over both the input variable x and the output variable r (see equation 2.1), and typically p(r) also needs to be calculated from p(r∣x) by another average over x (see equation 2.2). By contrast, the FI matrix J(x) involves averaging over r only (see equation 2.13). Furthermore, it is often easier to find analytical forms of FI for specific models such as a population of tuning curves with Poisson spike statistics. Taking into account the computational efficiency, for practical applications we suggest using IG or IG+ as a proxy of the true MI I for most cases. These approximations could be very useful even when we do not need to know the exact value of MI. For example, for some optimization and learning problems, we only need to know how MI is affected by the conditional p.d.f. or likelihood function p(r∣x). In such situations, we may easily solve for the optimal parameters using the approximation formulas (Huang & Zhang, 2017; Huang, Huang, & Zhang, 2017). Further discussions of the applications will be given in separate publications.
Acknowledgments
This work was supported by an NIH grant R01 DC013698.
Appendix: The Proofs
We consider a Taylor expansion of around x. If is twice differentiable for , then by condition C1 we get
(A.1) |
where
(A.2) |
(A.3) |
(A.4) |
(A.5) |
and
(A.6) |
By condition C1, we know that the matrix B1 + B2 is continuous and symmetric for and ∥B1 + B2∥ = O(1). By the definition of continuous functions, we can prove the following: for any ϵ ∈ (0, 1), there is an ε ∈ (0, ω) such that for all ,
(A.7) |
where
(A.8) |
Hence,
(A.9) |
Here , ε is a function of r, ε = ε (r) = O (1), and
(A.10) |
We define the sets
(A.11) |
where
(A.12) |
1(·) denotes an indicator random variable,
(A.13) |
and
(A.14) |
For all , we have ; then
(A.15) |
It follows from equations A.3 and A.6 that
(A.16) |
and
(A.17) |
and it follows from condition C1 that
(A.18) |
Combining conditions C1 and C2 and equations A.3, A.4, and A.6, we find
(A.19) |
together with the power mean inequality,
(A.20) |
where , m0 ∈ {1, 2}. Notice that ∥G−1 (x)∥ = O (N−1). Here we note that for all conformable matrices A and B,
(A.21) |
By equation 2.25c, we have
(A.22) |
Then it follows from equations 2.25b and A.22 that
(A.23) |
A.1 Proof of Lemma 1. It follows from equation A.1 that
(A.24) |
For , according to the definitions in equations A.13 and A.14, we have
(A.25) |
Then by condition C1, we get
(A.26) |
where is a positive constant and . By equations A.9, A.17, and A.24, we get
(A.27) |
where , the last step in equation A.27 follows from Jensen’s inequality, and
(A.28) |
Integrating by parts yields
(A.29) |
and
(A.30) |
for some δ > 0.
Then from equation A.27, we get
(A.31) |
where
(A.32) |
Here, notice that
(A.33) |
and
(A.34) |
Hence, from the consideration above, we find
(A.35) |
Since ϵ is arbitrary, let it go to zero. Thus, combining equations A.24 and A.35 yields
(A.36) |
Considering
(A.37) |
and combining equations 2.3 and A.36, we immediately get equation 2.53.
On the other hand, by conditions 2.54a and 2.54b, we have
(A.38) |
Similarly we can get equation 2.55. This completes the proof of lemma 1. □
A.2 Proof of Lemma 2. Define the sets
(A.39) |
and
(A.40) |
where , assuming ϵ ∈ (0, 1/2) and p(x) > 0.
Then by Markov’s inequality, we have
(A.41) |
and by equation 2.26b,
(A.42) |
Consider the following equality:
(A.43) |
For the last term in equation A.43, Jensen’s inequality implies that
(A.44) |
For the first term in equation A.43, it follows from equations A.40 and A.9 that
(A.45) |
The last term, equation A.45, is upper-bounded by
(A.46) |
(A.47) |
Equation A.46 is equal to
(A.48) |
Equation A.47 is equal to
(A.49a) |
(A.49b) |
where
(A.50) |
Notice that
(A.51) |
and
(A.52) |
Then by equation A.19, we get
(A.53) |
(A.54) |
and by equation 2.51,
(A.55) |
Hence, we have
(A.56) |
and by Cauchy-Schwarz inequality and equation A.53, the term A.49b is upper-bounded by
(A.57) |
Since ϵ is arbitrary, we can let it go to zero. Then, taking everything together, we get
(A.58) |
Putting equation A.58 into 2.3 yields 2.56.
On the other hand, we have
(A.59) |
(A.60) |
For equation A.60, it follows from Jensen's inequality that
(A.61) |
and
(A.62) |
where
(A.63) |
Similarly we can get equation 2.57. □
A.3 Proof of Theorem 1. By lemmas 1 and 2, we immediately get equation 2.58. The proof of equation 2.59 is similar. □
A.4 Proof of Theorem 2. First, we have
(A.64) |
Since J(x) and G(x) are symmetric and positive-definite, IK + Ψ(x) is also symmetric and positive-definite. The eigendecomposition of Ψ(x) is given by
(A.65) |
where is an orthogonal matrix and the matrix is a K × K diagonal matrix with K real numbers on the diagonal, λ1 ≥ λ2 ≥ … ≥ λK > −1. Then we have
(A.66) |
and
(A.67) |
Notice that ln(1 + x) ≤ x for ∀x ∈ (−1, ∞). It follows from equations A.64 and A.67 that
(A.68) |
From equations 2.12, 2.11, and A.68, we obtain equation 2.62.
If P(x) is positive-semidefinite, then λ1 ≥ λ2 ≥ … ≥ λK ≥ 0, ς ≥ 0, and ⟨ln (det (IK + Ψ(x)))⟩x ≥ 0. Hence we can get equation 2.63.
On the other hand, it follows from equations 2.64 and A.67 and the power mean inequality that
(A.69) |
Let for ∀k ∈ {1, 2, …, K}. Then
(A.70) |
Notice that . Then by equation A.69, we have
(A.71) |
From equations 2.12, 2.11, A.68, A.70, and A.71, we immediately get equation 2.65. □
A.5 Proof of Theorem 3. Considering the change of variables theorem, for any real-valued function f and invertible transformation T, we have
(A.72) |
and for p(x) and ,
(A.73) |
Then it follows from equations 4.2, A.72, and A.73 that
(A.74) |
Substituting equations A.73 and A.74 into 2.1, we can directly obtain equation 4.3. Moreover, if and fulfill conditions C1, C2 and ξ = O (N−1), then by theorem 1, we immediately obtain equation 4.4. □
A.6 Proof of Corollary 1. It follows from equation 2.21 and theorem 3 that
(A.75) |
and
(A.76) |
Here notice that
(A.77) |
and
(A.78) |
Hence, combining equations A.75 to A.78, we can immediately obtain equation 4.9. □
A.7 Proof of Theorem 4. First, we have
(A.79) |
Then by the eigendecomposition of Ax, we have
(A.80) |
where Ux and Λx are K2 × K2 eigenvector matrix and eigenvalue matrix, respectively. Since G (x), G1,1 (x), and G2,2 (x) are positive-definite, then IK – Ax is also positive-definite and Ax is positive-semidefinite, with 0 ≤ (Λx)k,k = λk < 1 for ∀k ∈ {1, 2, …, K2}. Moreover, it follows from equation 4.33 that
(A.81) |
Then by equation A.81, we have
(A.82) |
Substituting equation A.82 into A.79 and then combining with equation 2.12, we get equation 4.35.
If equation 4.36 holds, then Ax = 0 and IG = IG1. Conversely, if IG = IG1, then
(A.83) |
Ax = 0, and equation 4.36 holds. □
A.8 Proof of Theorem 5. Similar to equation A.79, we have
(A.84) |
Similar to equation A.65, the eigendecomposition of Bx is given by
(A.85) |
where Ux and Λx are K2 × K2 eigenvector matrix and eigenvalue matrix, respectively. If the matrix Bx is positive-semidefinite and satisfies equation 4.38, then (Λx)k,k = λk ≥ 0 for ∀k ∈ {1, 2, …, K2} and
(A.86) |
Substituting equation A.86 into A.84, we immediately get equation 4.41. If Cx = 0, then ln (det(IK2 + Bx)) = 0 and IG = IG2. And if IG = IG2, then ln (det(IK2 + Bx)) = 0, Bx = 0 and Cx = 0. □
A.9 Proof of Corollary 2. Notice that
(A.87) |
and the matrices
(A.88) |
(A.89) |
are positive-semidefinite, and the proof is similar to equation 4.74. Then by theorem 5, we immediately get equation 4.41. Substituting equation A.87 into 4.41 yields equation 4.44 with strict equality if and only if Cx = 0. □
A.10 Proof of Proposition 2. By writing p(θ) as a convex combination of two density functions p1(θ) and p2(θ),
(A.90)
we have
(A.91)
where 0 ≤ α ≤ 1 and
(A.92)
(A.93)
Using the Minkowski determinant inequality and the inequality of weighted arithmetic and geometric means, we find
(A.94)
It follows from equations A.91 and A.94 that
(A.95)
where the equality holds if and only if G1(x) = G2(x). Thus ln(det(G(x))) is concave with respect to p(θ), and therefore IG[p(θ)] is a concave function of p(θ). Similarly, we can prove that IF[p(θ)] is also a concave function of p(θ). □
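The two ingredients of this argument can be checked numerically: the Minkowski determinant inequality gives det(G1 + G2)^(1/K) ≥ det(G1)^(1/K) + det(G2)^(1/K) for positive-definite G1 and G2, and together with the weighted arithmetic-geometric mean inequality this yields ln(det(αG1 + (1 − α)G2)) ≥ α ln(det(G1)) + (1 − α) ln(det(G2)). The following sketch (the random matrices and the value of α are our own choices) verifies both inequalities:

import numpy as np

rng = np.random.default_rng(3)
K, alpha = 5, 0.3

def random_spd(k):
    """Random symmetric positive-definite k x k matrix."""
    M = rng.standard_normal((k, k))
    return M @ M.T + 1e-3 * np.eye(k)

G1, G2 = random_spd(K), random_spd(K)

# Minkowski determinant inequality.
assert np.linalg.det(G1 + G2) ** (1 / K) >= (
    np.linalg.det(G1) ** (1 / K) + np.linalg.det(G2) ** (1 / K) - 1e-9)

# Concavity of ln det along the convex combination alpha*G1 + (1 - alpha)*G2.
lhs = np.linalg.slogdet(alpha * G1 + (1 - alpha) * G2)[1]
rhs = alpha * np.linalg.slogdet(G1)[1] + (1 - alpha) * np.linalg.slogdet(G2)[1]
assert lhs >= rhs - 1e-9
print(lhs, rhs)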
Contributor Information
Wentao Huang, Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, MD 21205, U.S.A., and Cognitive and Intelligent Lab and Information Science Academy of China Electronics Technology Group Corporation, Beijing 100846, China.
Kechen Zhang, Department of Biomedical Engineering, Johns Hopkins University School of Medicine, Baltimore, MD 21205, U.S.A.
References
- Abbott LF, & Dayan P (1999). The effect of correlated variability on the accuracy of a population code. Neural Comput., 11(1), 91–101.
- Amari S, & Nakahara H (2005). Difficulty of singularity in population coding. Neural Comput., 17(4), 839–858.
- Atick JJ, Li ZP, & Redlich AN (1992). Understanding retinal color coding from first principles. Neural Comput., 4(4), 559–572.
- Atick JJ, & Redlich AN (1990). Towards a theory of early visual processing. Neural Comput., 2(3), 308–320.
- Becker S, & Hinton GE (1992). Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355(6356), 161–163.
- Bell AJ, & Sejnowski TJ (1997). The “independent components” of natural scenes are edge filters. Vision Res., 37(23), 3327–3338.
- Bethge M, Rotermund D, & Pawelzik K (2002). Optimal short-term population coding: When Fisher information fails. Neural Comput., 14(10), 2317–2351.
- Borst A, & Theunissen FE (1999). Information theory and neural coding. Nat. Neurosci., 2(11), 947–957.
- Boyd S, & Vandenberghe L (2004). Convex optimization. Cambridge: Cambridge University Press.
- Brown EN, Kass RE, & Mitra PP (2004). Multiple neural spike train data analysis: State-of-the-art and future challenges. Nat. Neurosci., 7(5), 456–461.
- Brunel N, & Nadal JP (1998). Mutual information, Fisher information, and population coding. Neural Comput., 10(7), 1731–1757.
- Carlton A (1969). On the bias of information estimates. Psychological Bulletin, 71(2), 108.
- Chase SM, & Young ED (2005). Limited segregation of different types of sound localization information among classes of units in the inferior colliculus. Journal of Neuroscience, 25(33), 7575–7585.
- Chechik G, Anderson MJ, Bar-Yosef O, Young ED, Tishby N, & Nelken I (2006). Reduction of information redundancy in the ascending auditory pathway. Neuron, 51(3), 359–368.
- Clarke BS, & Barron AR (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory, 36(3), 453–471.
- Cover TM, & Thomas JA (2006). Elements of information theory (2nd ed.). New York: Wiley-Interscience.
- Eckhorn R, & Pöpel B (1975). Rigorous and extended application of information theory to the afferent visual system of the cat. II. Experimental results. Biological Cybernetics, 17(1), 7–17.
- Ganguli D, & Simoncelli EP (2014). Efficient sensory encoding and Bayesian inference with heterogeneous neural populations. Neural Comput., 26(10), 2103–2134.
- Gawne TJ, & Richmond BJ (1993). How independent are the messages carried by adjacent inferior temporal cortical neurons? Journal of Neuroscience, 13(7), 2758–2771.
- Gourévitch B, & Eggermont JJ (2007). Evaluating information transfer between auditory cortical neurons. Journal of Neurophysiology, 97(3), 2533–2543.
- Guo DN, Shamai S, & Verdu S (2005). Mutual information and minimum mean-square error in gaussian channels. IEEE Trans. Inform. Theory, 51(4), 1261–1282.
- Harper NS, & McAlpine D (2004). Optimal neural population coding of an auditory spatial cue. Nature, 430(7000), 682–686.
- Huang W, Huang X, & Zhang K (2017). Information-theoretic interpretation of tuning curves for multiple motion directions. In Proceedings of the 51st Annual Conference on Information Sciences and Systems (pp. 1–4). Piscataway, NJ: IEEE.
- Huang W, & Zhang K (2017). An information-theoretic framework for fast and robust unsupervised learning via neural population infomax. In Proceedings of the 5th International Conference on Learning Representations (ICLR). arXiv:1611.01886.
- Jeffreys H (1961). Theory of probability (3rd ed.). New York: Oxford University Press.
- Kang K, & Sompolinsky H (2001). Mutual information of population codes and distance measures in probability space. Phys. Rev. Lett., 86(21), 4958–4961.
- Khan S, Bandyopadhyay S, Ganguly AR, Saigal S, Erickson DJ III, Protopopescu V, & Ostrouchov G (2007). Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Physical Review E, 76(2), 026209.
- Kraskov A, Stögbauer H, & Grassberger P (2004). Estimating mutual information. Physical Review E, 69(6), 066138.
- Laughlin SB, & Sejnowski TJ (2003). Communication in neuronal networks. Science, 301(5641), 1870–1874.
- Lewis A, & Zhaoping L (2006). Are cone sensitivities determined by natural color statistics? J. Vis., 6(3), 285–302.
- MacKay DJC (2003). Information theory, inference and learning algorithms. Cambridge: Cambridge University Press.
- McClurkin JW, Gawne TJ, Optican LM, & Richmond BJ (1991). Lateral geniculate neurons in behaving primates. II. Encoding of visual information in the temporal shape of the response. Journal of Neurophysiology, 66(3), 794–808.
- Miller GA (1955). Note on the bias of information estimates. In Quastler H (Ed.), Information theory in psychology: Problems and methods II-B (pp. 95–100). Glencoe, IL: Free Press.
- Olshausen BA, & Field DJ (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.
- Optican LM, & Richmond BJ (1987). Temporal encoding of two-dimensional patterns by single units in primate inferior temporal cortex. III. Information theoretic analysis. Journal of Neurophysiology, 57(1), 162–178.
- Paninski L (2003). Estimation of entropy and mutual information. Neural Comput., 15(6), 1191–1253.
- Pouget A, Dayan P, & Zemel R (2000). Information processing with population codes. Nat. Rev. Neurosci., 1(2), 125–132.
- Quiroga R, & Panzeri S (2009). Extracting information from neuronal populations: Information theory and decoding approaches. Nat. Rev. Neurosci., 10(3), 173–185.
- Rao CR (1945). Information and accuracy attainable in the estimation of statistical parameters. Bulletin of the Calcutta Mathematical Society, 37(3), 81–91.
- Rieke F, Warland D, de Ruyter van Steveninck R, & Bialek W (1997). Spikes: Exploring the neural code. Cambridge, MA: MIT Press.
- Rissanen JJ (1996). Fisher information and stochastic complexity. IEEE Trans. Inform. Theory, 42(1), 40–47.
- Shannon C (1948). A mathematical theory of communication. Bell System Technical Journal, 27, 379–423, 623–656.
- Sompolinsky H, Yoon H, Kang KJ, & Shamir M (2001). Population coding in neuronal systems with correlated noise. Phys. Rev. E, 64(5), 051904.
- Tovee MJ, Rolls ET, Treves A, & Bellis RP (1993). Information encoding and the responses of single neurons in the primate temporal visual cortex. Journal of Neurophysiology, 70(2), 640–654.
- Toyoizumi T, Aihara K, & Amari S (2006). Fisher information for spike-based population decoding. Phys. Rev. Lett., 97(9), 098102.
- Treves A, & Panzeri S (1995). The upward bias in measures of information derived from limited data samples. Neural Comput., 7(2), 399–407.
- Van Hateren JH (1992). Real and optimal neural images in early vision. Nature, 360(6399), 68–70.
- Van Trees HL, & Bell KL (2007). Bayesian bounds for parameter estimation and nonlinear filtering/tracking. Hoboken, NJ: Wiley.
- Verdu S (1986). Capacity region of gaussian CDMA channels: The symbol synchronous case. In Proc. 24th Allerton Conf. Communication, Control and Computing (pp. 1025–1034). Piscataway, NJ: IEEE.
- Victor JD (2000). Asymptotic bias in information estimates and the exponential (Bell) polynomials. Neural Comput., 12(12), 2797–2804.
- Wei X-X, & Stocker AA (2015). Mutual information, Fisher information, and efficient coding. Neural Comput., 28, 305–326.
- Yarrow S, Challis E, & Series P (2012). Fisher and Shannon information in finite neural populations. Neural Comput., 24(7), 1740–1780.
- Zhang K, Ginzburg I, McNaughton BL, & Sejnowski TJ (1998). Interpreting neuronal population activity by reconstruction: Unified framework with application to hippocampal place cells. J. Neurophysiol., 79(2), 1017–1044.
- Zhang K, & Sejnowski TJ (1999). Neuronal tuning: To sharpen or broaden? Neural Comput., 11(1), 75–84.