Achievable Rates for Pattern Recognition

M Brandon Westover; Joseph A O’Sullivan

doi:10.1109/tit.2007.911296

. Author manuscript; available in PMC: 2020 Mar 9.

Published in final edited form as: IEEE Trans Inf Theory. 2008 Jan 4;54(1):299–320. doi: 10.1109/tit.2007.911296

Achievable Rates for Pattern Recognition

M Brandon Westover ¹, Joseph A O’Sullivan ²

PMCID: PMC7062371 NIHMSID: NIHMS1053529 PMID: 32153303

Abstract

Biological and machine pattern recognition systems face a common challenge: Given sensory data about an unknown pattern, classify the pattern by searching for the best match within a library of representations stored in memory. In many cases, the number of patterns to be discriminated and the richness of the raw data force recognition systems to internally represent memory and sensory information in a compressed format. However, these representations must preserve enough information to accommodate the variability and complexity of the environment, otherwise recognition will be unreliable. Thus, there is an intrinsic tradeoff between the amount of resources devoted to data representation and the complexity of the environment in which a recognition system may reliably operate.

In this paper, we describe a mathematical model for pattern recognition systems subject to resource constraints, and show how the aforementioned resource–complexity tradeoff can be characterized in terms of three rates related to the number of bits available for representing memory and sensory data, and the number of patterns populating a given statistical environment. We prove single-letter information-theoretic bounds governing the achievable rates, and investigate in detail two illustrative cases where the pattern data is either binary or Gaussian.

Index Terms—: Distributed source coding, multiterminal information theory, pattern recognition

I. Introduction

PATTERN recognition is the problem of inferring the state of an environment from incoming and previously stored data. In real-world operating environments, the volume of raw data available often exceeds a recognition system’s resources for data storage and representation. Consequently, data stored in memory only partially summarizes the properties of patterns, and internal representations of incoming sensory data are likewise imperfect approximations. In other words, pattern recognition is frequently a problem of inference from compressed data. However, excessive compression precludes reliable recognition. This apparent tradeoff raises a fundamental question: In a given environment, what are the least amounts of memory data and sensory data consistent with reliable pattern recognition?

The paper is organized as follows. In Section II, we introduce the general problem informally. Relationships between the present work and other pattern recognition research is briefly described in Section III. In Section IV, we formalize our problem as that of determining which combinations of three key rates are achievable, that is, determining which rate combinations allow the theoretical possibility of reliable pattern recognition. These rates quantify the information available for representing memory and sensory data, and the number of distinct patterns which the recognition system can discriminate. Our main results are single-letter formulas providing inner and outer bounds on the set of achievable rates, presented in Section V. In Section VI, we consider some instructive special cases of the main results, and compare our results to those for the related problem of distributed source coding. In Section VII, we explore explicit formulas for the bounds in two special binary and Gaussian cases. Section VIII contains concluding remarks. Proofs for most of the results are placed in the Appendices. The entire discussion is organized around the block diagram in Fig. 1.

Fig. 1. — Pattern recognition subject to data compression.

II. Informal Problem Statement

In this section, we use an imagined example to motivate the mathematical model studied in the later technical sections. Suppose that our pattern recognition system consists of a homunculus living inside the head of some animal. The homunculus has access to a video monitor which displays data captured by the animal’s retinas, and a set of index cards for storing information about the patterns in the environment relevant to survival, constituting a “memory.” The homunculus must identify each pattern by comparing viewed images with information stored in memory. These identifications are then used to guide the animal’s behavior. Let us consider which factors govern the difficulty of our homunculus’ task.

A. Pattern Rate

First, the number of patterns that must be discriminated, M_c, obviously cannot exceed the number of images registerable on the animal’s retinas, which depends in turn on the number of retinal photoreceptors and the number of distinct signaling states of each photoreceptor. Denoting the state of the retinas as Y = (Y₁, Y₂, …, Y_n), where each Y_i takes values in a finite alphabet $Y$ , the number of possible retinal images is $| Y |^{n}$ . In a very simple animal with n = 8 photoreceptors, each able only to distinguish “bright” (Y = 1) from “dark” (Y = 0) (so $Y \in {0, 1}$ ), the absolute upper limit on M_c would be $| Y |^{n} = 2^{8} = 256$ .

With higher resolution eyes (larger n) an exponential explosion in the number of possible images rapidly overwhelms memory and computational resources. In humans, for whom $| Y | \approx 256$ , and n ≈ 2 × 10⁶ [1], [2], $| Y |^{n} ~ 10^{2, 408, 200}$ , far exceeding estimates of the number of particles in the universe [3]. Fortunately, two features of real-world pattern recognition intervene: First, sensory data exhibits strong statistical structure, p(y), so that the vast majority of the $| Y |^{n}$ possible images are never experienced.¹ Second, much of the animal’s visual experience can be filtered out as irrelevant to survival. Thus, we express the number of patterns our homunculus must discriminate as $M_{c} = | Y |^{n R_{c}^{'}}$ , where $R_{c}^{'}$ is called the pattern rate, $0 < R_{c}^{'} < 1$ , and generally $M_{c} ≪ | Y |^{n}$ . Equivalently, we can express M_c in binary units, as $M_{c} = 2^{n R_{c}}$ , in which case $R_{c} = R_{c}^{'} {log}_{2} | Y |$ .

B. Sensory Data Compression Rate

Our homunculus accesses sensory data indirectly through a video monitor that has limited display capacity. That is, whereas the retinas can be in up to $| Y |^{n}$ distinct signaling states, the homunculus’ internal monitor can display at most $M_{y} = 2^{n R_{y}} ≪ | Y |^{n}$ , where R_y is thus the sensory data compression rate. Analogous data reductions arise in real recognition problems for various reasons, both computational (e.g., dimensionality reduction, sparsification, regularization, or other “feature extraction” operations), and economic (e.g., energy constraints, processing time constraints, storage limitations). We represent the transformation from retinal data Y to video data J by the action of an encoder ϕ, resulting in the displayed data J = ϕ(Y).

This sensory data compression step places another restriction on the number of discriminable patterns, so that, in general, $M_{c} ≪ M_{y} ≪ | Y |^{n}$ ; or $R_{c} \leq R_{y} \leq {log}_{2} | Y |$ .

C. Memory Data Compression Rate

The job of our homunculus is to recognize patterns. More formally, the homunculus must assign to each viewed image Y one of M_c class labels, which we take to be integers $M_{c} = {1, 2, \dots, M_{c}}$ . As pre-job training, we imagine the homunculus studies a set of labeled class prototypes or “templates,” T(w) = (X(w), w), $w = {1, 2, \dots, M_{c}} = M_{c}$ , each drawn from a distribution p(x), where each template has dimensionality identical to that of the sensory data, X(w) = (X₁(w), …, X_n(w)).

The homunculus creates index cards, on which it writes the class labels and descriptive information about each class template. However, the number of cards and the amount of information per card are limited, allowing only a compressed summary of the available data. We represent the information memorized about a class M(w) as the output of an encoder f, i.e., M(w) = (I(w), w) = f(T(w)), where I(w) is the compressed description of X(w) and w is the memorized class label. The degree of compression is quantified by specifying either the number of index cards comprising the memory, M_x, or by a compression rate (given in bits) $R_{x} = \frac{1}{n} log M_{x}$ .

As above, memory data compression restricts the number of discriminable patterns M_c, so that $M_{c} ≪ M_{x} ≪ | X |^{n}$ ; or, in terms of rates, $R_{c} \leq R_{x} \leq {log}_{2} | X |$ .

D. Image Formation and Testing

The “testing” phase of our homunculus-driven pattern recognition system involves two processes:, image formation and recognition.

Image formation proceeds as follows. Nature selects a pattern class $W \in M_{c}$ at random, then generates an image Y which is registered on the animal’s retinas. (The class label W is not observable by the homunculus.) We model the image formation process as the transmission of the class template X(w) through a random channel p(y|x). The retinal image Y thus represents a “signature” of the underlying pattern X(w), and the channel p(y|x) represents two types of difficulties intrinsic to most real-world pattern recognition problems: signature variation (differences in the sensory data generated on repeated viewings of the same underlying pattern); and signature ambiguities (distinct patterns may produce similar signatures).²

The homunculus receives the compressed sensory data J = ϕ(Y), compares it with the memory data $C_{u} = {M (1), \dots, M (M_{c})}$ , and finally reports the class label of the best match, Ŵ. The inference procedure used to make these comparisons can reflect knowledge of the pattern source p(x) and image formation process p(y|x), but at the time of testing, it must be specified so as to depend only on the available data, i.e., g must be function only of $C_{u}$ and J, $\hat{W} = g (J, C_{u})$ . We judge the homunculus’ performance by the probability of error P_e. We will consider the system reliable if for some acceptable ϵ it achieves P_e ≤ ϵ.

E. Interpretations of the Problem Formulation

We have now introduced the basic elements of our problem, which is to determine the rate combinations (R_x, R_y, R_c) compatible with the possibility of reliable pattern recognition systems, where a “reliable” system is one for which the probability of recognition error can be made arbitrarily small. To summarize, these basic elements are 1) a model for the underlying patterns, consisting of the number of patterns $M_{c} = 2^{n R_{c}}$ , a set of class labels w ∈ {1, …, M_c}, and the class prototypes X(w) together with their generative model p(x); 2) a model of the channel connecting class prototypes to the sensory data p(y|x); and 3) budgets specifying the number of bits allowed for representing sensory data R_y and memory data R_x inside the system.

We pause here to consider a few different possible perspectives on the problem under study.

Optimization views.

From an optimization point of view, we can ask our central question in two different but equivalent ways: Given the pattern rate R_c, what are the least amounts of sensory and memory data R_y and R_x, needed for reliable pattern recognition? Alternatively, given fixed information budgets for memory and sensory data representation R_x and R_y, what is the maximum achievable pattern recognition rate R_c?

Regarding “n.”

Second, the problem has a different “feel” depending on whether one views the data dimensionality n either as a fixed or an increasing parameter. In the preceding discussion, we have primarily taken the static view, in which there are a fixed number $M_{c} = 2^{n R_{c}}$ of patterns, or “states of nature” of interest, and the problem is to investigate how many memory states M_x and sensory states M_y are needed to recognize them reliably. Alternatively, we may regard n as a dynamic, increasing parameter. Biologically, allowing n to increase might correspond to studying a series of animals with increasingly better eyes and memory organs. In engineering applications, the increase might correspond to building a sequence of machines with progressively higher camera resolution and data storage capacities [6]. Obviously, if while increasing n we hold the bit-budgets R_x and R_y fixed, then the number of memory and sensory states available for data representation grows exponentially, $M_{x} = 2^{n R_{x}}$ , $M_{y} = 2^{n R_{y}}$ . Less obviously, the maximum number of discriminable patterns also grows exponentially,³ with a constant rate R_c, i.e., $M_{c} = 2^{n R_{c}}$ . The “fixed n” and “increasing n” perspectives correspond to the familiar, complementary mathematical methods of proving a given inequality, respectively, either by the “adversarial” approach (given any ϵ, choose n large enough…); or the “asymptotic approach” (take the limit as n → ∞…).

An important final point regarding n is that, like many results in information theory, our results rely on asymptotic arguments. Thus, we only prove the results valid only for “sufficiently large n,” depending in turn on an ϵ corresponding to the tolerable error rate. The needed magnitude of n for a given ϵ (i.e., the issue of error exponents) will depend on the application, and is an important open problem.

III. Related Work

A. Machine Learning Approaches

Pattern recognition is a central topic in machine learning [7]–[10]. The machine learning approach to pattern recognition centers around the following problem: Given a set of labeled sensory data $D = {Y (i), w (i), i = 1 \dots N}$ , we wish to find a rule g that predicts the labels for future sensory data, i.e., if Y is in fact a signature of pattern class w ∈ {1, 2, …, M_c}, we want Pr(g(Y) ≠ W) ≤ ϵ for some acceptable ϵ > 0. Broadly speaking, two competing approaches dominate the literature. In the “generative modeling” approach, one attempts to estimate the distribution underlying the data p(w, y), and then to use the conditional distribution p(w, y) to infer w from Y, i.e., $\hat{w} = g (Y) = arg {max}_{w \in M_{c}} p (w | Y)$ . Alternatively, in the “discriminative” approach, one attempts to learn the optimal decision region boundaries directly, without estimating p(w, y).

Our problem formulation resonates with the “generative modeling” approach, in that we allow the homunculus access to p(w, y).⁴ Informally, such knowledge might come from allowing a very large volume of training data. Nevertheless, the distinction between generative and discriminative approaches then may become practically unimportant, as in many instances either approach can achieve asymptotically optimal performance.

In any case, in the present work we are not directly concerned with the problem of classifier learning. Rather, we investigate the conditions under which reliable classifiers can exist at all, regardless of how they are deigned or learned; we describe performance bounds to which all pattern recognition systems are subject.

It is also worth pointing out the distinction between the machine learning concept of “Vapnik–Chervonenkis (VC) dimension” and M_c in the present work. Informally, the VC dimension is the number of distinct patterns that can be shattered by a given family of classifiers (see [10], [12] for a detailed description). As such, VC dimension is a measure of the complexity of the decision boundaries that can be fit with a given family of classifiers. In contrast, M_c in our work is the number of patterns or pattern classes that can be distinguished, with no constraints on the family of classifiers.

B. Related Work in Combined Data Compression and Inference

Neuroscientist Horace Barlow has argued for more than four decades that data compression is an essential principle underlying learning and intelligent behavior in animal brains (see, e.g., [13]–[17]). Barlow and many others have amassed substantial experimental evidence showing efficient data coding mechanisms at work in the sensory systems of diverse animals, including monkeys, cats, frogs, crickets, and flies [18]. More recently, data compression is gaining appreciation as a mechanism for managing metabolic energy costs in neural systems [19].

In the engineering pattern recognition literature, data compression usually arises indirectly in the context of feature extraction, i.e., techniques for transforming raw data such that “irrelevant” data is discarded and the residual data is rendered into some advantageous format which facilitates storage and comparison, and is robust (“invariant”) with respect to signature variations [20], [21]. In the information theory literature, probably the first direct investigation of the interplay between data compression and statistical inference is due to Ahlswede and Csiszár [22]. In [23] Han and Amari reviewed work up through 1998 on rate-constrained inference problems, including hypothesis testing, pattern recognition, and parameter estimation. Recently, Ishwar et al. have studied the problem of joint classification and reconstruction of sensory data subject to a fidelity constraint, in the context of video coding [24], [25]. In contrast to the problem studied in this paper, in that work there is no data compression constraint on memory data. Work on practical algorithms for joint classification and data compression includes [26]–[28].

IV. Problem Statement

We now proceed to the formal presentation of the main results.

A. Notation

We adopt the following notational conventions. Random variables are denoted by capital letters (e.g., U), their values by lowercase letters (e.g., u), their alphabets by script capital letters (e.g., $U$ ). Sequences of symbols are denoted either by boldface letters or with a superscript, e.g., u = uⁿ = (u₁, u₂, …, u_n). The probability distribution for a random variable $U \in U$ is denoted by pU(u), or p(u) simply when the implied subscript is clear from the context. Entropy, mutual information, and conditional mutual information are denoted in the usual ways, e.g., for random variables U, V, W, we write H(U), I(U; V), and I(U; V|W), respectively. All logarithms are understood to be base two, i.e., log = log₂. Finally, to express statements such as “X and Z are conditionally independent given Y,” i.e., p(x, y, z) = p(y) p(x | y)p(z | y), we write “X – Y – Z form a Markov chain,” or simply X −.Y – Z.

B. Definitions and Assumptions

Definition 1: The environment $E$ for a pattern recognition system is a set of eight objects

E = {M_{c}, X, Y, p (x), p (y | x), p (w), C_{x}, Φ}

where

$M_{c}$ , $X$ , $Y$ are finite alphabets;
p(w), p(xy) = p(x)p(y|x), are probability distributions over $M_{c}$ , and $X^{n} \times Y^{n}$ , respectively;
$C_{x} = {T (1), \dots T (M_{c})}$ is a set of pairs T(w) = (X(w), w) of random vectors X(w) drawn independent and identically distributed (i.i.d.) ~p(x), labeled by $w \in {1, 2, \dots, M_{c}} = M_{c}$ ;
Φ is a mapping from labels $M_{c}$ to vectors in $C_{x}$ , $Φ : M_{c} \to C_{x}$ , Φ(w) = X(w).

We make the following simplifications:

the distribution over class labels is uniform, $p (w) = 1 / | M_{c} |$ for all $w \in M_{c}$ ;
the pattern components are i.i.d., $p (x) = \prod_{i = 1}^{n} p (x_{i})$ ;
the observation channel is memoryless, $p (y | x) = \prod_{i = 1}^{n} p (y_{i} | x_{i})$ .

Definition 2: An (M_c, M_x, M_y, n) pattern recognition code for an environment $E$ consists of three sets of integers

M_{c} = {1 \dots M_{c}}, M_{x} = {1 \dots M_{x}}, M_{y} = {1 \dots M_{y}}

and three mappings

f : X^{n} \times M_{c} \to M_{x} \times M_{c}, f (t (w)) = f (x, w) = (i, w) ≜ m (w) ϕ : Y^{n} \to M_{y}, ϕ (y) = j g : M_{y} \times {(M_{x})}^{M_{c}} \to M_{c}, g (j, C_{u}) = \hat{w}

where $C_{u}$ denotes the result of applying f to the entries of $C_{x}$

C_{u} = {f (T (1)), \dots, f (T (M_{c}))} = {m (1), \dots, m (M_{c})) .

We call $C_{x}$ the pattern templates; f, the memory encoder; $C_{u}$ , the memorized data; ϕ, the sensory encoder; and g, the recognition function or classifier.

Definition 3: The operation of a pattern recognition system (“agent”) implementing a given (M_c, M_x, M_y, n) pattern recognition code (f, ϕ, g) for an environment $E$ is defined in terms of the following events.

Memorization phase:

The agent observes $C_{x}$ , and uses f to compute the memory data $C_{u}$ .
Access to $C_{x}$ is taken away, and thereafter the agent knows of $C_{x}$ only what is retained in $C_{u}$ .

Testing phase

Nature selects an index W ~ p(w).
Nature encodes the pattern according to X(W) = Φ(W).
The pattern X(W)passes through the channel p(y|x), giving rise to an observable signal Y.
The agent computes J = ϕ(Y).
The agent infers W by computing $\hat{W} = g (J, C_{u})$ .

With respect to the events just described, the probability of error for a code (f, ϕ, g) in $E$ is

P_{e}^{n} (w) = Pr (\hat{W} \neq w | W = w)

and the average probability of error of the code is

P_{e}^{n} = \sum_{w \in M_{c}} p (w) Pr (\hat{W} \neq w | W = w) = \frac{1}{M_{c}} \sum_{w \in M_{c}} P_{e}^{n} (w) .

Definition 4: The rate R = (R_c, R_x, R_y) of an (M_c, M_x, M_y, n) code is

R_{c} = \frac{1}{n} {log}_{2} M_{c}, R_{x} = \frac{1}{n} {log}_{2} M_{x}, R_{y} = \frac{1}{n} {log}_{2} M_{y}

where the units are bits-per-symbol.

Definition 5: A rate R = (R_c, R_x, R_y) is achievable in a recognition environment $E$ if for any ϵ > 0 and for n sufficiently large, there exists an (M_c, M_x, M_y, n) code (f, ϕ, g) with rates

R_{c}^{'} = \frac{1}{n} log M_{c}, R_{x}^{'} = \frac{1}{n} log M_{x}, R_{y}^{'} = \frac{1}{n} log M_{y}

such that $R_{c}^{'} \geq R_{c}$ , $R_{x}^{'} \leq R_{x}$ , $R_{y}^{'} \leq R_{y}$ and $P_{e}^{n} < ϵ$ .

Definition 6: The achievable rate region $R$ for a recognition environment $E$ is the set of all achievable rates R = (R_c, R_x, R_y).

Our ultimate goal in an information-theoretic analysis of this problem is to characterize the achievable rate region $R$ in a way that does not involve the unbounded parameter n, that is, to exhibit a single-letter characterization of $R$ .

V. Main results

In this section, we present inner and outer bounds on the achievable rate region $R$ . The bounds are expressed in terms of sets of “auxiliary” random variable pairs UV, defined below. In these definitions, U and V are assumed to take values in finite alphabets $U$ and $V$ and have a well-defined joint distribution with the “given” random variables XY. To each such pair of auxiliary random variables UV we associate a set of rates

R_{U V} = {R : R_{x} \geq I (U; X) R_{y} \geq I (V; Y) R_{c} \leq R_{x} + R_{y} - I (X Y; U V)} .

Next, define two sets of random variable pairs

P_{in} = {U V : U - X - Y, X - Y - V, U - (X, Y) - V},

and

P_{out} = {U V : U - X - Y, X - Y - V} .

We will also sometimes summarize the independence constraints in $P_{in}$ as a single “long” Markov chain U – X − Y − V.

Next, define two additional sets of rates

R_{in} = {R : R \in R_{U V} for some U V \in P_{in}} R_{out} = {R : R \in R_{U V} for some U V \in P_{out}}

and denote the convex hull of $R_{in}$ by ${\bar{R}}_{in}$ .

Our main results are the following.

Theorem 1 (Inner Bound):

R_{in} \subseteq R .

That is, every rate $R \in R_{in}$ is achievable.

Theorem 2 (Better Inner Bound):

{\bar{R}}_{in} \subseteq R

That is, every rate $R \in {\bar{R}}_{in}$ is achievable.

Theorem 3 (Outer Bound):

R_{out} \supseteq R .

That is, no rate $R \notin R_{out}$ is achievable.

Finally, to ensure computability, we include a cardinality bound.

Theorem 4: Regions $R_{in}$ and $R_{out}$ are unchanged if we restrict the cardinality of $U V$ to

| U | | V | \leq | X | | Y | + 2.

Theorem 4 is a simple consequence of the Support Lemma [29 (p. 310)]: we must have $| X ‖ Y | - 1$ letters to ensure preservation of p(xy|uv), and three additional letters to satisfy the constraints on I(X; U), I(Y; V) and I(XY; UV).

Remark 1: If either X = U or Y = V, or both, then the outer bound collapses to the inner bound, since in this case the extra Markov condition U – (X, Y) – V in the definition of $P_{in}$ is extraneous. For example, if U = X, then the condition is equivalent to I(U; V | XY) = I(X; V | XY) = 0, which is obviously true. Similar comments apply if U and V are any deterministic functions of X and Y, e.g., if V = γ(Y), then I(U; V | XY) = I(U; γ(Y)|XY) = 0.

Remark 2: The bounds $R_{in}$ and $R_{out}$ can be expressed in various ways. For example, it is not difficult to show that the following replacements for $R_{U V}$ lead to the same sets of rates $R_{in}$ and $R_{out}$ :

R_{U V}^{'} = {R : R_{x} \geq I (U; X) R_{y} \geq I (V; Y) R_{c} \leq I (U; V) - I (U; V | X Y)}

(1)

R_{U V}^{″} = {R : R_{c} \leq I (U; V) - I (U; V | X Y) R_{x} \geq I (X Y; U | V) + R_{c} R_{y} \geq I (X Y; V | U) + R_{c} R_{x} + R_{y} \geq I (X Y; U V) + R_{c}} .

(2)

That is, if we define

R_{*}^{'} = {R : R \in R_{U V}^{'} for some U V \in P_{*}} R_{*}^{″} = {R : R \in R_{U V}^{″} for some U V \in P_{*}}

where * stands for either “in” or “out,” then $R_{in} = R_{in}^{'} = R_{in}^{″}$ , and $R_{out} = R_{out}^{'} = R_{out}^{″}$ . These equivalencies are proved in Appendix E, and are used in Sections VI-B and VII.

Remark 3: In general, $R_{in}$ is not a convex set, as evidenced by the examples studied in Section VIII. Thus, ${\bar{R}}_{in}$ is in fact an improvement on $R_{in}$ . $R_{out}$ is a convex set, as shown in Appendix C.

The proofs for Theorems 2 and 3 appear in Appendices A and B. Theorem 1 follows immediately from Theorem 2. In sketch-form, the method we use to prove achievability (the inner bound), based on $R^{'} U V$ (1), is as follows. We represent the memory and sensory data using codewords U(i), $i \in {1 \dots 2^{n R_{x}}}$ and V(i), $j \in {1 \dots 2^{n R_{y}}}$ that are typical according to p(u) and p(v), respectively, and the recognition system stores a list of these codewords. Making R_x ≥ I(X; U) and R_y ≥ I(Y; V) provides enough U’s and V’s to “cover” $X^{n}$ and $Y^{n}$ . During pretesting, the system matches each of the labeled template patterns (X(w), w), w = 1, …, M_c presented to it with a unique memory codeword, and attaches to this codeword the corresponding class label (with matching defined in the sense of joint typicality according to p(xu)). The resulting set of M_c “active,” labeled codewords constitutes the system’s memory. During subsequent testing, suppose Nature selects class w, generating sensory data Y ~ p(y|X(w)). The system receives the index J of the codeword for Y, and uses it to retrieve the sensory codeword V(J). The system can then narrow down the list of M_c active memory codewords by a factor of 2^{−nI(U; V)} using knowledge of p(uv).⁵ Thus, the correct memory vector U(w) can be uniquely identified so long as $M_{c} = 2^{n R_{c}} \leq 2^{n I (U; V)}$ , i.e., if R_c ≤ I(U; V).

It is also possible to prove the achievability result using a binning argument, which induces the set $R_{U V}^{″}$ (2): Generate 2^{nI(X; V)} U’s and 2^{nI(Y; V)}V’s, and divide these equally among roughly $2^{n R_{x}}$ and $2^{n R_{y}}$ bins each, respectively. A pattern X(w) is encoded in memory by searching for a bin containing a matching (jointly typical) codeword U(X), and the $M_{c} = 2^{n R_{c}}$ bins thus selected are each assigned the class label w of the pattern stored therein. Sensory data Y is encoded as the bin index of a matching codeword V(Y). This number of U’s and V’s is sufficient to ensure that any given pair X and Y will have a matching (jointly typical) U(X), and V(Y), and the Markov lemma ensures joint typicality of the quadruple (U(X), X, Y, V(Y)). Given encoded sensory data J = ϕ(Y), recognition is done by comparing the roughly $2^{n I (Y; V)} / 2^{n R_{y}}$ sensory codewords in bin J with the $2^{n I (X; U)} / 2^{n R_{x}}$ memory codewords in each of the $2^{n R_{c}}$ memory bins, then reporting the class label $\hat{w}$ assigned to the bin containing the matching memory codeword. No matches other than the correct one, (U(X), V(Y)), will be found provided the number of (U, V) comparisons grows exponentially with n at a rate less than I(U; V), that is, provided I(Y; V) − R_y + I(X; U) – R_x + R_c ≤ I(U; V), which simplifies to the rate–sum constraint R_x + R_y ≥ I(XY; UV) + R_c. The “side” constraints R_x ≥ I(XY; U|V) + R_c and R_y ≥ I(XY; V|U) + R_c then follow from requiring that each bin contain at least one codeword. The final inequality R_c ≤ I(U; V) follows from the first three.

VI. Discussion of the Main Results

A. The Gap Between Bounds

In general, there is a gap between $R_{in}$ and $R_{out}$ , so that $R_{in} ⊊ R ⊊ R_{out}$ . This gap is due to the different constraints in the definitions of $P_{in}$ and $P_{out}$ : Whereas distributions in satisfy three independence constraints U – X – Y, X – Y – V, and U – (X, Y) – V (equivalently, the single “long chain” constraint U – X – Y – V), distributions in $P_{out}$ only need satisfy the first two “short chain” constraints.

Further insight into the nature of the gap can be gained by attempting to construct $P_{out}$ by combining distributions from $P_{in}$ in various ways, and then considering whether the resulting distributions can be used to expand the achievable rate region.⁶ We consider two such constructions. In both, let $Q \in Q$ be a finite random variable, independent of X and Y. Holding p(xy) fixed, to describe a pair of auxiliary random variables $U \in U$ , $V \in V$ with joint distribution p(xyuv), we need only to specify the marginal distribution p(uv | xy). Consider the following two sets:

P_{mix} = {U V : p (u v | x y) = \sum_{q \in Q} p (q) p (u | x q) p (v | y q)}

(3)

P_{conv} = {U V : U = (U_{Q}, Q), V = (V_{Q}, Q), U_{q} V_{q} \in P_{in} \forall q \in Q} .

(4)

In words, $P_{mix}$ is the set of UV whose distributions p(uv | xy) can be constructed as “mixtures” of product marginals; and $P_{conv}$ is the set of “convexifying” random variables (this terminology is explained below).⁷ In both of these sets it is possible to have dependencies between U and V given XY; i.e., in general p(uv | xy) ≠ p(u | x) p(v | y), hence, in general, $P_{mix}$ , $P_{conv} ⊋ P_{in}$ .

There is a gap similar to the one under discussion between the best known bounds for the distributed source coding problem (DSC), established by Berger and Tung (see Section VI-B). In both problems, the bounds are given in terms of sets with independence (Markov) constraints identical to those in $P_{in}$ and $P_{out}$ .⁸ Berger has suggested that in the DSC problem the gap is due to the fact that $P_{out}$ admits convex mixtures of product marginal distributions, whereas $P_{in}$ does not; i.e., in our notation, $P_{in} ⊊ P_{mix} \subseteq P_{out}$ [31]. The inclusion $P_{mix} \subseteq P_{out}$ is verified by checking U – X – Y and X – Y – V: Write

p (u | x y) = \sum_{v} p (u v | x y) = \sum_{q} p (q) p (u | x q) \sum_{v} p (v | y q) = \sum_{q} p (q) p (u | x q) ≜ p (u | x)

hence, U – X – Y; and a symmetric calculation shows X – Y – V. While clearly $P_{mix}$ is a larger set than $P_{in}$ , it is unclear whether the admission of mixtures can account for all of the gap between $P_{in}$ and $P_{out}$ . That is, we know of no proof that $P_{mix} = P_{out}$ . Moreover, we know of no way to use auxiliary random variables from $P_{mix}$ in achievability arguments.

It is also straightforward to verify that the second set $P_{conv}$ is contained in $P_{out}$

I (U; Y | X) = I (U_{Q}, Q; Y | X) = I (Q; Y | X) + I (U_{Q}; Y | X, Q) \overset{(a)}{=} 0 + \sum_{q} p (q) I (U_{q}; Y | X, Q = q) \overset{(b)}{=} 0

(where the reasons are: (a) Q is independent of X and Y, and (b) $U_{q} V_{q} \in P_{in}$ ; hence, U – X – Y; and a symmetric calculation shows X − Y − V. $P_{conv}$ has a form sometimes introduced in time-sharing arguments, as a means to convexify a given rate region. For example, for $U V \in P_{conv}$ , we have

I (X Y; U V) = I (X Y; U_{Q}, V_{Q}, Q) = I (X Y; Q) + I (X Y; U_{Q}; V_{Q} | Q) \overset{(a)}{=} \sum_{q} p (q) I (X Y; U_{q} V_{q} | Q = q) ≜ \sum_{q} p (q) I (X Y; U_{q} V_{q})

(where (a) is because Q is independent of X and Y) and similarly I(X; U) = Σ_qp(q)I(X; U_q), and I(Y; V) = Σ_qp(q)I(Y; V_q). It follows that the convex hull of $R_{in}$ may be represented as

{\bar{R}}_{in} = {R : R \in R_{U V} for some U V \in P_{conv}} .

(5)

In contrast to $P_{mix}$ , auxiliary variables from $P_{conv}$ can be used as the basis for standard achievability arguments, as we have done in the proof of Theorem 2 (see Appendix A). From this, we have the following logical statement:

If P_{out} = P_{conv} then R_{out} = {\bar{R}}_{in} .

(6)

Unfortunately, we have no proof that $P_{out} = P_{conv}$ . Notwithstanding, in Subsection VII-A, we examine one case where it appears that $R_{out} = {\bar{R}}_{in}$ does hold, giving grounds to conjecture that this equality may hold at least under special conditions.

B. Relationship With Distributed Source Coding

There are interesting connections between the results of Tung and Berger [32], [33] for the DSC problem and our results in Theorems 1 and 3. Briefly, the situation treated in the DSC problem is as follows (see Fig. 2). Two correlated sequences, X and Y, are encoded separately as i = f(X), j = ϕ(Y) and the decoder g must reproduce the original sequences subject to a fidelity constraint $(E d_{x} (\hat{X}, X), E d_{y} (\hat{Y}, Y)) \leq D$ , where D = (D_x, D_y). The problem is to characterize, for any given distortion D, the set of achievable rates $\tilde{R} (D)$ .

Fig. 2. — The distributed source coding problem.

The best known inner and outer bounds for the DSC problem can be expressed as follows.⁹ Let $P_{in}$ and $P_{out}$ be defined as above, and define two new sets incorporating the distortion constraint

P_{in} (D) = P_{in} \cap P_{U V} (D) P_{out} (D) = P_{out} \cap P_{U V} (D)

where

P_{U V} (D) = {U V : \exists \hat{X} (U, V), \hat{Y} (U, V) s.t. (E d_{x} (\hat{X}, X), E d_{y} (\hat{Y}, Y)) \leq D} .

Parallelling (1), also define the sets of rates

{\tilde{R}}_{U V} = {R : R_{x} \geq I (X Y; U | V) R_{y} \geq I (X Y; V | U) R_{x} + R_{y} \geq I (X Y; U V)}

(7)

and

{\tilde{R}}_{in} (D) = {R : R \in {\tilde{R}}_{U V} for some U V \in P_{in} (D)} {\tilde{R}}_{out} (D) = {R : R \in {\tilde{R}}_{U V} for some U V \in P_{out} (D)} .

Then the Berger–Tung bounds for the DSC problem are ${\tilde{R}}_{in} (D) \subseteq \tilde{R} (D)$ and ${\tilde{R}}_{out} (D) \supseteq \tilde{R} (D)$ .

There are strong formal similarities between our bounds and the DSC bounds. Most importantly, the gap between bounds for both problems is due to the difference between the length-four constraint U − X − Y − V and the less stringent length-three constraints U − X – Y, X − Y − V. Further, note the formal similarity between the sets ${\tilde{R}}_{U V}$ (7) and $R_{U V}^{''}$ . To carry this comparison further, suppose in the problem under study that, in addition to recognizing patterns, we also wish to reproduce an estimate of the original signals subject to a fidelity constraint, as in the DSC problem.¹⁰ Denote the achievable rate region for this “joint recognition and recovery” problem by $R (D)$ . Making this addition in fact adds little technical difficulty, and the resulting bounds can be expressed, not surprisingly, as $R_{in} (D) \subseteq R (D)$ and $R_{out} (D) \supseteq R (D)$ , where

R_{in} (D) = {R : R \in R_{U V} for some U V \in P_{in} (D)} R_{out} (D) = {R : R \in R_{U V} for some U V \in P_{out} (D)} .

Apparently, the pattern recognition problem can be construed as a kind of generalization of the DSC problem, with the added complication that the “decoder” receives with Y not one sequence X but $M_{c} = 2^{n R_{c}}$ such sequences X(1),…,X(M_c) and must first determine which is the appropriate one with which to jointly decode Y. This extra discrimination evidently requires that extra information be included at the encoders. This “rate excess” is the difference between the minimum encoding rates required for the DSC and pattern recognition problems.¹¹ Comparing $R_{U V}^{''}$ (7) with ${\tilde{R}}_{U V}$ (7), this rate excess is the same at both decoders, and is equal to R_c. Thus, R_c can be interpreted as the number of extra bits needed at both encoders to decide which of the possible $M_{c} = 2^{n R_{c}}$ patterns X(w) the sensory data Y represents, beyond the information required to simply reproduce the pair Y, X(w) within the allowed distortion limits.

C. Degenerate Cases

We now briefly examine the degenerate cases where either X = U, or Y = V, or both. In these cases, I(U; V | XY) = 0. Hence, using (1), we see that both inner and outer bounds on $R$ both reduce to the three inequalities R_x ≥ I(U; X), R_y ≥ I(V; Y), R_c ≤ I(U; V)}. Clearly, in these cases the bounds are tight, in that the inner and outer bounds are equal; there is no gap (see Remark 1). These degenerate cases have simple interpretations and are thus useful for building intuition about Theorems 1–3.

Sharp memory, sharp eyesight.

First, consider a system in which the budgets for memory and sensory representations are unrestricted, i.e., no compression is required. In this case, we can effectively treat the memories and sensory representations as veridical; i.e., we can set U = X and V = Y. The theorem constraints then become R_x ≥ I(X; X) = H(X), R_y ≥ I(Y; Y) = H(Y), and

R_{c} \leq I (U; V) = I (X; Y) .

(8)

This result indicates that, in the absence of compression, the recognition problem is formally equivalent to the following classical communication problem: Transmit one of $M_{c} = 2^{n R_{c}}$ possible messages (patterns) to a receiver (the recognition module) [6]. In this case, the patterns can be thought of as random codewords stored without compression and available to the decoder; Shannon’s random coding for communication [30], [34] applies, yielding the mutual information I(X; Y) (see (8)) as the bound on R_c.

Sharp memory, poor eyesight.

Next, suppose that memory is effectively unlimited, so that we can put U = X, but sensory data may be compressed. In this case, we can readily rewrite the condition on R_c as

R_{c} \leq I (X; Y) - I (X; Y | V) = I (X; V) .

(9)

We check the extreme cases: If V is fully informative about Y, Y = ϕ⁻¹(V), then I(X; Y | V) = H(Y | V)−H(Y | X, V) = 0, and we recover the case discussed above, R_c ≤ I(X; Y). For intermediate cases, where V is partially informative, the effect of V is to degrade the achievable performance of the system below that possible with “perfect senses,” and the reduction incurred is I(X; Y | V). In the extreme case that V is utterly uninformative (e.g., a constant V = 0, or otherwise independent of Y), I(X; Y | V) = I(X; Y), and we get R_c = 0, or $M_{c} \leq 2^{n R_{c}} = 1$ ; hence, the system is useless.

Poor memory, sharp eyesight.

In the case of limited memory but unrestricted resources for sensory data representation (V = Y), we get an expression symmetric with the previous case

R_{c} \leq I (X; Y) - I (X; Y | U) = I (U; Y) .

(10)

As before, if the memory is perfect (U = X), we get I(X; Y | U) = I(X; Y | X) = 0, recovering the channel coding constraint R_c ≤ I(X; Y); assuming useless memories (U = 0) yields R_c ≤ I(X; Y) − I(X; Y) = 0; and intermediate cases place the system between these extremes.

VII. Examples

In this section, we investigate the achievable rate regions for binary and Gaussian versions of our problem. For this purpose, it will be convenient to characterize the sets $R$ , $R_{in}$ and $R_{out}$ by their surfaces in the positive orthant $ℝ_{+}^{3}$ . The surface of $R$ can be expressed as

r (r_{x}, r_{y}) = max_{R \in C (r_{x}, r_{y})} R_{c} C (r_{x}, r_{y}) = {R : R \in R, R_{x} = r_{x}, R_{y} = r_{y}} .

Similarly, by direct extension of Theorems 1 and 3, and using the representation of $R_{in}$ and $R_{out}$ based on $R_{U V}^{'}$ (1), the surfaces of $R_{in}$ and $R_{out}$ are

r_{in} (r_{x}, r_{y}) = max_{U V \in C_{in} (r_{x}, r_{y})} I (U; V) - I (U; V | X Y)

(11)

r_{out} (r_{x}, r_{y}) = max_{U V \in C_{out} (r_{x}, r_{y})} I (U; V) - I (U; V | X Y)

(12)

where

C_{in} (r_{x}, r_{y}) = {U V \in P_{in} : I (U; X) = r_{x}, I (V; Y) = r_{y}} C_{out} (r_{x}, r_{y}) = {U V \in P_{out} : I (U; X) = r_{x}, I (V; Y) = r_{y}}

The expression for the inner bound surface (11) reduces to

r_{in} = max_{U V \in C_{in} (r_{x}, r_{y})} I (U; V) .

An alternative expression for the outer bound surface which will be used in Subsection VII-B, based on R_UV (1), is

r_{in} = r_{x} + r_{y} - min_{U V \in C_{in} (r_{x}, r_{y})} I (X Y; U V) .

(13)

Finally, denote the convex hull of the inner and outer bound surfaces $\bar{r_{in}} (r_{x}, r_{y})$ , $\bar{r_{out}} (r_{x}, r_{y})$ .

In the specific cases studied in the following examples we seek to convert these implicit characterizations into explicit formulas not involving the optimization over $C_{in} (r_{x}, r_{y})$ and $C_{out} (r_{x}, r_{y})$ .

A. Binary Case

We first investigate the inner and outer bound surfaces for a case in which the template patterns and sensory data alphabets are binary, $X = Y = {0, 1}$ . Let the template patterns X = (X₁,…,X_n) consist of n independent drawings from a uniform Bernoulli distribution X_i ~ B(1/2), i = 1…n, and let the sensory data Y = (Y₁,…,Y_n) be the output of a binary-symmetric channel with crossover probability q, $p (y | x) = q^{\bar{δ} (x, y)} {\bar{q}}^{δ (x, y)}$ where $\bar{q} = 1 - q; \bar{δ} (x, y) = 1 - δ (x, y)$ ; and δ(x, y) = 1 if x = y, and otherwise δ(x, y) = 0. Equivalently, we can represent Y as Y = X ⊕ W, where W ~ B(q) and is independent of X.

1). Numerical Results:

We have taken two approaches to studying the surfaces of $R_{in}$ and $R_{out}$ for this binary case. First, we carried out the optimizations in (11) and (12) numerically. This calculation was via a Monte Carlo method which executed a dense random sampling of the set of probability distributions p(uv | xy) associated with $P_{in}$ and $P_{out}$ .¹² For each sample p(uv | xy), we calculated I(X; U), I(Y; V) and I(XY; UV); then, for each value of r_x, r_y ∈ [0,1] the numerical estimate of r_in(r_x, r_y) or r_out(r_x, r_y) was the largest sample value found by the Monte Carlo search for r_x + r_y – I(XY; UV). From here on, we denote the numerical surface estimates by $\hat{r_{in}} (r_{x}, r_{y})$ and $\hat{r_{out}} (r_{x}, r_{y})$ .

The cardinality bound in Theorem 4 is not necessarily tight. Therefore, to assess the alphabet sizes required of $U$ and $V$ for the binary case, we performed our numerical experiments for increasing values of $| U |$ and $| V |$ . For the inner bound surface, we found $| U | = | V | = 2$ was sufficient: no further increase in $\hat{r_{in}} (r_{x}, r_{y})$ was afforded by allowing $| U |$ , $| V | = 3$ , 4. For the outer bound surface, $| U | = | V | = 3$ was sufficient.

The surface plots from our numerical experiments are shown in Fig. 3. Fig. 4 shows representations of the distributions p(uv | xy) underlying 25 different points (r_x, r_y) for Fig. 3 (a) the inner and Fig. 3 (b) the outer bounds, in which probabilities are represented by the area of white squares.^{13, 14} The row–column format of the matrix p(uv | xy) is xy = 00, 01, 10, 11 moving down rows; moving across columns, for the inner bound with $U, V = {0, 1}$ the format is uv = 00, 01, 10, 11, whereas for the outer bound with $U, V = {0, 1, e}$ , the column format is uv = 00, 01, 1e, 10, 1e, e0, e1, ee. (The choice of for the third letter of $U$ and $V$ is explained below.)

2). Conjectured Formulas:

Second, we guessed formulas for the inner and outer bound surfaces, which turned out to fit the numerical results just described. We first present the formulas, then discuss the motivations behind them.

Our formulas involve the following two functions. First, define

s (r_{x}, r_{y}) = 1 - h (q_{x} * q * q_{y}),

where

q_{x} = h^{- 1} (1 - r_{x}), q_{y} = h^{- 1} (1 - r_{y});

h(·) is the binary entropy function

h (x) = - x log (x) - (1 - x) log (1 - x);

“*” denotes binary convolution

x * y = x (1 - y) + y (1 - x);

and q_x, q_y ∈ [0, 1/2] to ensure that h(·) is invertible. Next, let s*(r_x, r_y) denote the upper concave envelope of s(r_x, r_y)

s^{*} (r_{x}, r_{y}) = sup θ s (r_{x_{1}}, r_{y_{1}}) + \bar{θ} s (r_{x_{2}}, r_{y_{2}})

where $\bar{θ} = 1 - θ$ ; and the supremum is over all combinations (θ, r_x1, r_y1, r_x2, r_y2) such that

(r_{x}, r_{y}) = θ (r_{x_{1}}, r_{y_{1}}) + \bar{θ} (r_{x_{2}}, r_{y_{2}})

and each variable in the optimization is restricted to the unit interval [0, 1]. As explained in Appendix F, in both this case and for the corresponding Gaussian formulas in the next section, the expression for this convex hull simplifies to

s^{*} (r_{x}, r_{y}) = sup θ s (r_{x^{'}}^{'} r_{y}^{'})

with the supremum over all combinations $(θ, r_{x}^{'}, r_{y}^{'})$ such that

(r_{x}, r_{y}) = θ (r_{x}^{'}, r_{y}^{'}) .

Conjecture 1: For the binary case the surfaces of $R_{in}$ and $R_{out}$ are

r_{in} (r_{x}, r_{y}) = s (r_{x,} r_{y})

(14)

r_{out} (r_{x}, r_{y}) = s^{*} (r_{x}, r_{y})

(15)

and the surface of the achievable rate region $R$ is

r (r_{x}, r_{y}) = \bar{r_{in}} (r_{x}, r_{y}) = r_{out} (r_{x}, r_{y}) .

(16)

3). Rationale for the Inner Bound (14):

The surfaces r_in(r_x, r_y), r_out(r_x, r_y) are specified in terms of probability distributions p(xyuv) = p(xy)p(uv | xy) that maximize (11) and (12). For r_in(r_x, r_y), the distribution factorizes as p(uv | xy) = p(u | x)p(v | y), and a natural guess is that in the maximizing distribution both p(u | x) and p(v | y) are binary symmetric channels

p (u | x) = q_{x}^{\bar{δ} (x, u)} {\bar{q}}_{x}^{δ (x, u)}, p (v | y) = q_{y}^{\bar{δ} (y, v)} {\bar{q}}_{y}^{δ (y, v)};

or, equivalently, U = X ⊕ W_x, V = Y ⊕ W_y, where W_x ~ B(q_x), W_x ~ B(q_y), and q_x, $q_{y} \in [0, \frac{1}{2}]$ ; see Fig. 5(a). For this choice of U and V we calculate

r_{x} = I (X; U) = H (X) - H (X | U) = 1 - H (U \oplus W_{x} | U) = 1 - H (W_{x}) = 1 - h (q_{x})

and likewise r_y = 1 − h(q_y). Then

I (U; V) = H (V) - H (V | U) = 1 - H (U \oplus W_{x} \oplus W \oplus W_{y} | U) = 1 - h (q_{x} * q * q_{y}) = s (r_{x}, r_{y}) .

Clearly, s(r_x, r_y) is a lower bound on r_in(r_x, r_y), since: 1) U – Y − V, hence $U V \in P_{in}$ ; 2) $U V \in C (r_{x}, r_{y})$ ; and 3)

r_{in} (r_{x}, r_{y}) = max_{U V \in C (r_{x}, r_{y})} I (U; V) \geq 1 - h (q * q_{x} * q_{y}) = s (r_{x}, r_{y}) .

The converse, r_in(r_x, r_y), ≤ s(r_x, r_y) is unproven, so the identification of r_in(r_x, r_y) with s(r_x, r_y) remains a conjecture. Nevertheless, in our numerical optimization we found no points outside of this region for any choice of (r_x, r_y), and the distributions which emerge from our computer experiments (Fig. 4(a)) closely resemble the long binary-symmetric channel in the calculation of s(r_x, r_y). This provides strong experimental evidence supporting (14) in Conjecture 1.

Fig. 5. — Binary-symmetric channel models for the inner and outer bounds. (a) Model for r_in(r_x, r_y) with $| U | = | V | = 2$ . (b) Model for r_out(r_x, r_y) with $| U | = | V | = 3$ . See text for explanation.

4). Rationale for the Outer Bound (15):

Clearly, s*(r_x, r_y) is a lower bound on r_out, since, for all r_x, r_y ∈ [0, 1]

$r_{out} (r_{x}, r_{y}) \geq r_{in} (r_{x}, r_{y}) \Rightarrow \bar{r_{out}} (r_{x}, r_{y}) \geq \bar{r_{in}} (r_{x}, r_{y})$ ;
$R_{out}$ is convex, ⇒ $\bar{r_{out}} (r_{x}, r_{y}) = r_{out} (r_{x}, r_{y})$ ;
$r_{in} (r_{x}, r_{y}) \geq s (r_{x}, r_{y}) \Rightarrow \bar{r_{in}} (r_{x}, r_{y}) \geq \bar{s} (r_{x}, r_{y}) = s^{*} (r_{x}, r_{y})$ ;

and together these imply r_out(r_x, r_y) ≥ s*(r_x, r_y). Unfortunately, we do not have a proof of the converse, r_out(r_x, r_y) ≤ s*(r_x, r_y), so the identification of r_out(r_x, r_y) with s*(r_x, r_y) remains a conjecture. Nevertheless, empirically (i.e., according to our numerical experiments) the outer bound surface is identical to the convex hull of the inner bound surface. Moreover, empirically, the cardinalities required to construct the outer bound are $| U | = | V | = 3$ .

We can provide an explicit construction of the conjectured outer bound surface and the probability distributions that achieve it as follows. The distributions in this construction also agree with those found empirically, shown in Fig. 4(b). Let $U = V = {0, 1, e}$ . Consider the channel diagrammed in Fig. 5(b), which could be called a “synchronous erasure channel.” Here, U and V are generated by first passing X and Y through binary-symmetric channels, followed by an “erasure” E ∈ {0, 1} event in which both channel outputs are preserved with probability θ = Pr(E = 0), or both are erased (UV = ee) with probability $\bar{θ} = 1 - θ = Pr (E = 1)$ . An explicit formula for this channel is

p (u v | x y) = θ^{{\bar{δ}}_{e}} (u, v) {\bar{θ}}^{δ_{e} (u, v)} {(q_{x}^{\bar{δ} (x, u)} {\bar{q}}_{x}^{δ (x, u)} q_{y}^{\bar{δ} (y, v)} {\bar{q}}_{y}^{δ (y, v)})}^{{\bar{δ}}_{e} (u, v)} Δ_{e} (u, v)

where

δ (α, β) = {\begin{array}{l} 1, & if α = β \\ 0, & if α \neq β \end{array} δ_{e} (α, β) = {\begin{array}{l} 1, & if (α, β) = (e, e) \\ 0, & if (α, β) \neq (e, e) \end{array} Δ_{e} (α, β) = {\begin{array}{l} 0, & if α = e, β \neq e \\ 0, & if α \neq e, β = e \\ 1, & otherwise \end{array}

and

\bar{δ} (α, β) = 1 - δ (α, β), {\bar{δ}}_{e} (α, β) = 1 - δ_{e} (α, β) .

Equivalently, we can represent U and V as follows. Let W ~ B(q), W_x ~ B(q_x), W_y ~ B(q_y), E ~ B(θ) be Bernoulli random variables that are independent of each other and independent of X and Y, and define

Y = X \oplus W U = (X \oplus W_{x}) \otimes E V = (Y \oplus W_{y}) \otimes E

where the multiplication by E is defined by

α \otimes E = {\begin{array}{l} α, & if E = 0 \\ e, & if E = 1 . \end{array}

It is straightforward to verify $U V \in P_{out}$ . To check U – X − Y, write

I (U; Y | X) = I ((X \oplus W_{x}) \otimes E; X \oplus W | X) = I (W_{x} \otimes E; W | X) = 0

where the last line follows from the independence of W, W_x and E from each other and X. A similar calculation shows X – Y – V. Finally, calculating the rate region surface associated with this choice of UV we get, first

r_{x} = I (X; U) = I (X; (X \oplus W_{x}) \otimes E, E) = I (X; E) + I (X; (X \oplus W_{x}) \otimes E | E) = 0 + \bar{θ} I (X; E | E = 1) + θ I (X; X \oplus W_{x} | E = 0) = 0 + 0 + θ (1 - h (q_{x}))

where the last step follows from the previous calculations for the inner bound; and a similar calculation shows r_y = θ(1 – h(q_y)). Then, using U – X – Y, X – Y – V to write

I (U; V) - I (U; V | X Y) = I (X; U) + I (Y; V) - I (X Y; U V) = I (X; U) + I (Y; V) - [H (U) + H (U | V) - H (U V | X Y)]

we have (suppressing some detail)

H (U) = H (U, E) = H (E) + H (U | E) = h (θ) + θ H (U | E = 0) + \bar{θ} H (U | E = 1) = h (θ) + θ + 0

H (U | V) = H (U | V, E) = θ H (U | V, E = 0) + \bar{θ} H (U | V, E = 1) = θ h (q_{x} * q * q_{y})

H (U V | X Y) = H (E | X Y) + H (U V | X Y, E) = h (θ) + θ [H (U | X, E = 0) + H (V | Y, E = 1)] = h (θ) + θ [h (q_{x}) + h (q_{y})] .

Putting these together and canceling terms

I (U; V) - I (U; V | X Y) = θ (1 - h (q_{x} * q * q_{y})) = s^{*} (r_{x}, r_{y}) .

Thus, we have constructed an explicit example which achieves s*(r_x, r_y) with $| U | = | V | = 3$ .

B. Gaussian Case

We now consider a Gaussian version of our problem. Let X and Y be zero-mean Gaussian random variables with correlation coefficient ρ_xy. We propose explicit formulas for the surfaces of $R_{in}$ and $R_{out}$ for the Gaussian case, in terms of the following two functions. In both formulas, put

r_{x} = - \frac{1}{2} log (1 - ρ_{x u}^{2}) r_{y} = - \frac{1}{2} log (1 - ρ_{y v}^{2}) .

Note that these expressions determine the correlation coefficients ρ_xu and ρ_yv. Define

S (r_{x}, r_{y}) = - \frac{1}{2} log (1 - ρ_{x y}^{2} ρ_{y v}^{2} ρ_{x u}^{2})

(17)

and

S^{*} (r_{x}, r_{y}) = r_{x} + r_{y} + \frac{1}{2} log [1 + \frac{2 ρ γ - β}{1 - ρ^{2}}]

(18)

where

γ = ρ_{x y} ρ_{x u} ρ_{y v} β = ρ_{x u}^{2} + ρ_{y v}^{2} - (1 - ρ_{x y}^{2}) ρ_{x u}^{2} ρ_{y v}^{2}, ρ = \frac{β}{2 γ} - \sqrt{{(\frac{β}{2 γ})}^{2} - 1} .

(19)

Conjecture 2: In the Gaussian case, the surfaces of $R_{in}$ and $R_{out}$ are

r_{in} (r_{x}, r_{y}) = S (r_{x}, r_{y})

(20)

r_{out} (r_{x}, r_{y}) = S^{*} (r_{x}, r_{y}) .

(21)

Fig. 6 shows plots of the inner and outer bounds and their difference, as well as the difference between the outer bound and the convex hull of the inner bound. Interestingly, unlike the binary case, for the Gaussian case the outer bound is not equal to the convex hull of the inner bound.

The following proof relies on some basic properties of the mutual information between Gaussian random variables, given as lemmas in Appendix G.

In the analysis that follows, we assume that the maximizing distributions are Gaussian. Under this assumption, we solve the inner and outer bounds. Except for this unproved assumption, the proof of the conjecture is complete.

Proof: (Conjecture 2, eq. (20)) As noted in Appendix G, mutual informations between jointly Gaussian random variables are completely determined by their correlation coefficients. For a length-4 Markov chain U – X – Y – V of jointly Gaussian random variables I(U; V | XY) = 0 and, applying Lemma 9 from Appendix G we have ρ_uv = ρ_xuρ_xyρ_yv, hence

I (U; V) - I (U; V | X Y) = - \frac{1}{2} log (1 - ρ_{x u}^{2} ρ_{x y}^{2} ρ_{y v}^{2}) .

This mutual information is maximized when the constraints I(X; U) ≤ r_x, I(Y; V) ≤ r_y are satisfied with equality, hence when ρ_xu and ρ_yv satisfy $r_{x} = - \frac{1}{2} log (1 - ρ_{x u}^{2})$ and $r_{y} = - \frac{1}{2} log (1 - ρ_{y v}^{2})$ .

The following proof for the surface of the outer bound region uses the form of r_out(r_x, r_y) given by (13). In this case, the optimization problem reduces to minimizing I(XY; UV) subject to the length-3 Markov constraints U – X – Y, X – Y – V.

Proof: (Conjecture 2, eq. (21)) Using Lemma 10 from Appendix G, we have

C_{x y, u v} = [\begin{matrix} ρ_{x u} & ρ_{x v} \\ ρ_{y u} & ρ_{y v} \end{matrix}] = [\begin{matrix} 1 & ρ_{x y} \\ ρ_{x y} & 1 \end{matrix}] [\begin{matrix} ρ_{x u} & 0 \\ 0 & ρ_{y v} \end{matrix}] .

The left-hand matrix in this decomposition is C_xy,_xy, denoted hereafter simply as C, and we denote the right-hand matrix by D. Then applying Lemma 8 from Appendix G yields

I (X Y; U V) = \frac{1}{2} log | C | - \frac{1}{2} log | C - C_{x y, u v} C_{u v, u v}^{- 1} C_{u v, y x} | = \frac{1}{2} log | C | - \frac{1}{2} log | C - C D C_{u v, u v}^{- 1} D C | = - \frac{1}{2} log | C | - \frac{1}{2} log | C^{- 1} - D C_{u v, u v}^{- 1} D | .

Substituting for the 2 × 2 matrices in this last expression and rearranging terms yields

I (X Y; U V) = - \frac{1}{2} log [1 + \frac{2 ρ_{u v} γ - β}{1 - ρ_{u v}^{2}}]

where γ and β are defined in (19).

By assumption, ρ_xu and ρ_yv are fixed, so we optimize I(XY; UV) only with respect to ρ_uv. Setting ∂I(XY; UV)/∂ρ_uv = 0 and solving, we obtain that, if β > 2γ > 0, then the maximum is achieved at $ρ_{u v}^{*} = ρ$ , where ρ is defined in (19).

To complete the proof we must show that β > 2γ > 0. Noting that β, γ > 0 and substituting, the desired inequality becomes

ρ_{x u}^{2} + ρ_{y v}^{2} - ρ_{x u}^{2} ρ_{y v}^{2} > 2 ρ_{x y} ρ_{x u} ρ_{y v} - ρ_{x y}^{2} ρ_{x u}^{2} ρ_{y v}^{2} .

Subtracting from each side and factoring yields the equivalent inequality

- (1 - ρ_{x u}^{2}) (1 - ρ_{y v}^{2}) > - {(1 - ρ_{x y} ρ_{x u} ρ_{y v})}^{2} .

To show that this holds for all ρ_xy, note that the maximum of the right-hand side is achieved by ρ_xy = 1, so that the inequality becomes

(1 - ρ_{x u}^{2}) (1 - ρ_{y v}^{2}) - {(1 - ρ_{x u} ρ_{y v})}^{2} < 0.

This inequality holds, since

(1 - ρ_{x u}^{2}) (1 - ρ_{y v}^{2}) - {(1 - ρ_{x u} ρ_{y v})}^{2} = - ρ_{x u}^{2} - ρ_{y v}^{2} + 2 ρ_{x u} ρ_{y v} = (ρ_{x u} - ρ_{y v}) (ρ_{y v} - ρ_{x u}) = - {(ρ_{x u} - ρ_{y v})}^{2} < 0 .

VIII. Conclusion

We have presented an information-theoretic analysis of pattern recognition systems subject to data compression constraints. Our main results consist of fundamental bounds characterizing the minimum sensory and memory information budgets required for reliable pattern recognition, or, equivalently, the maximum number of patterns that can be discriminated on given sensory and memory data budgets.

As a starting point, we have focused on the case of unstructured data, in which patterns are representable as vectors with i.i.d. components, and the sensory data observation channel is memoryless. In recent years, there has been much theoretical and experimental work aimed at developing methods to render data into a format with independent (or approximately independent) components (see, e.g., [36]–[39]). Such methods have been especially successful in the study of “natural” signals, e.g., sounds and imagery in naturally occurring environments. Nevertheless, a decomposition into independent components is often impossible or only approximate, and it will be important in future work to extend our results to cover the case of correlated components and channels with memory.

We have focused on “reliable” pattern recognition systems, in the sense that the recognition error rate is able to be made arbitrarily close to zero. Nevertheless, in some applications it is of interest (or unavoidable) to allow less-than-perfect accuracy. This can be partly addressed by recasting the recognition problem as a “coarse-to-fine” search, where the system is given information in several successive stages, and at each stage is required only to partially recognize the pattern, i.e., to identify the pattern as belonging to a particular subclass, postponing definitive identification for the final stage. Extending our results to this successive refinement setting is relatively straightforward; see [40]. The more direct approach of explicitly allowing a strictly positive error rate is an open problem.

Much work remains to be done in designing practical pattern recognition systems that achieve the bounds described herein. One of the most challenging problems in this regard is the design of adequate statistical models of real-world signals. For examples of progress on this exciting front, see [11], [36], [41]–[48]. Another significant challenge is that of learning optimal classifiers from training data. In this connection, it will likely prove fruitful to explore connections between the present results and those established in machine learning theory; see, e.g., [7]–[10]. Another practical challenge is to build systems that make optimal use of time. Donald Geman and colleagues have been developing the theory of systems that reach their pattern recognition decisions with a minimum amount of computation [49]. It will be interesting to explore the relationship of this concept with our results concerning recognition using the minimum amount of information.

Open theoretical problems include the calculation of error exponents and, most importantly, the closing of the gap between our inner and outer bounds. As discussed in Section VI-B, the gap in our problem bears close resemblance to that in the distributed source coding problem. A solution to the distributed source coding problem would likely lead to a solution to ours, and vice versa.¹⁵

Acknowledgment

The authors gratefully acknowledge stimulating discussions with Michael DeVore, Naveen Singla, and Po-Hsiang Lai.

This work was supported by the Mathers Foundation and by the Office of Naval Research. The material in this paper was presented in part at an ONR PI meeting, Minneapolis, MN, May 2003; the Neural Information Processing Systems Workshop on Information Theory and Learning, Whistler, BC, Canada, December 2003; the IEEE International Symposium on Information Theory, Chicago, IL, June/July 2004; and the IEEE Information Theory Workshop, Punta del Este, Uruguay, March 2006.

Appendix A. Proof of the Inner Bound

In this section we prove the inner bound ${\bar{R}}_{in} \subseteq R$ , Theorem 2. The proof relies on standard random coding arguments and properties of strongly jointly typical sets [30]. Given a joint distribution p(xyuv), the strongly jointly δ-typical set is defined by

T_{UVXY}^{δ} = {x y u v : | N (xyuv | x y u v) / n - p (xyuv) | \leq δ \forall xyuv \in X Y U V}

where N(xyuv | xyuv) is the number of times the symbol combination xyuv occurs in xyuv. Likewise, we write. e.g., $T_{X}^{δ}$ , $T_{X Y}^{δ}$ , $T_{X Y U}^{δ}$ for singles, pairs, and triples. We will also use conditionally strongly jointly δ-typical sets, for example

T_{x U}^{δ} = {u : (x u) \in T_{X U}^{δ}} .

The subscripts are omitted when context allows. We will also need the fact that for any positive numbers δ, ϵ > 0, fixed vector x, and large enough n

2^{- n [I (X; Y) + ϵ]} \leq Pr (x Y \in T_{x Y}^{δ}) \leq 2^{- n [I (X; Y) - ϵ]} .

(22)

Theorem 1: Suppose R is a point in the convex hull of $R_{in}$ , $R \in {\bar{R}}_{in}$ . That is, $R = \sum_{q \in Q} p (q) R_{q}$ , where p(q) is a probability distribution over some finite alphabet $Q$ , and for each $q \in Q$ , $R_{q} = (R_{x q}, R_{y q}, R_{c q}) \in R_{in}$ .

We wish to show that for any ϵ > 0 and large enough n, there exists an (M_c, M_x, M_y, n) code (f, ϕ, g) with rates $R_{c}^{'} = \frac{1}{n} log M_{c}$ , $R_{x}^{'} = \frac{1}{n} log M_{x}$ , $R_{y}^{'} = \frac{1}{n} log M_{y}$ such that $R_{c}^{'} \geq R_{c}$ , $R_{x}^{'} \leq R_{x}$ , $R_{y}^{'} \leq R_{y}$ , and $P_{e}^{n} \leq ϵ$ .

By definition, $R_{q} \in R_{in}$ implies that for each $q \in Q$ there exist random variables U_q, V_q such that

p_{q} (xyuv) = p (x y) p_{q} (u | x) p_{q} (v | y)

and

R_{x q} = I (X; U_{q}) + α_{x q} R_{y q} = I (Y; V_{q}) + α_{y q} R_{c q} = R_{x q} + R_{y q} - I (X Y; U_{q} V_{q}) - γ_{q}

for some values α_xq, α_yq, γ_q > 0 such that γ_q ≥ α_xq + α_yq.¹⁶ Now let

R_{x q}^{'} = I (X; U_{q}) + α_{x q} / 4, R_{x}^{'} = \sum_{q \in Q} p (q) R_{x q}^{'} R_{y q}^{'} = I (Y; V_{q}) + α_{y q} / 4, R_{y}^{'} = \sum_{q \in Q} p (q) R_{y q}^{'}

and

R_{c q}^{'} = R_{x q}^{'} + R_{y q}^{'} - I (X Y; U_{q} V_{q}) - γ_{q} / 4 R_{c}^{'} = \sum_{q \in Q} p (q) R_{c q}^{'} .

With these choices, we have $R_{x}^{'} \leq R_{x}$ , $R_{y}^{'} \leq R_{y}$ , $R_{c}^{'} \geq R_{c}$ .

Given

X^{n} ~ \prod_{i = 1}^{n} p (x_{i}), Y^{n} ~ \prod_{i = 1}^{n} p (y_{i})

divide the sequences into $| Q |$ segments with lengths n_q = np(q), denoted $X^{n_{q}}$ , $Y^{n_{q}}$ , i.e.,

X^{n} = [X^{n_{1}} X^{n_{2}} \dots X^{n_{| Q |}}], Y^{n} = [Y^{n_{1}} Y^{n_{2}} \dots Y^{n_{| Q |}}] .

Finally, we will use the additional notation: $M_{x q} = 2^{n_{q} R_{x q}^{'}}$ , $M_{y q} = 2^{n_{q} R_{y q}^{'}}$ and $M_{x q} = {1, \dots, M_{x q}}$ , $M_{y q} = {1, \dots, M_{y q}}$ .

We will construct the desired overall code (f, ϕ, g) with rate $(R_{c}^{'}, R_{x}^{'}, R_{y}^{'})$ by first constructing encoders f_q, ϕ_q with rates $R_{x q}^{'}$ , $R_{y q}^{'}$ for the $| Q |$ component sequences $X^{n_{q}}$ , $Y^{n_{q}}$ , then constructing a classifier g which acts on the combined outputs of the encoders.

Please refer to Fig. 7 for a summary of the notation introduced below.

1
Codebooks: For each $q \in Q$ , from p_q(xyuv) compute the marginal distributions p_q(u),p_q(v). To serve as memory codewords, select M_xq length-n_q vectors by sampling with replacement from a uniform distribution over the set $T_{U}^{δ}$ . Assign each codeword a unique index $i_{q} \in M_{x q} = {1, 2, \dots, M_{x q}}$ . To serve as sensory codewords, similarly select M_yq length-n_q vectors by from $T_{V}^{δ}$ , and assign each an index $j_{q} \in M_{y q} = {1, 2, \dots, M_{y q}}$ . Denote the codebooks
$B_{u} (q) = {u^{n_{q}} (1), \dots, u^{n_{q}} (M_{x q})} B_{v} (q) = {v^{n_{q}} (1), \dots, v^{n_{q}} (M_{y q})} .$
2
Encoders: We define encoders f_q and ϕ_q in terms of maps
$φ_{x q} : X^{n_{q}} \to U^{n_{q}} b_{x q} : U^{n_{q}} \to M_{x q} φ_{y q} : Y^{n_{q}} \to V^{n_{q}} b_{y q} : V^{n_{q}} \to M_{y q}$
as follows. Given any $x^{n_{q}} \in X^{n_{q}}$ , search the codebook $B_{u} (q)$ for a codeword $u^{n_{q}}$ such that $(x^{n_{q}}, u^{n_{q}}) \in T_{X U}^{δ}$ . If this search is successful, set $φ_{x q} (x^{n_{q}}) = u^{n_{q}}$ , $b_{x q} (u^{n_{q}}) = i_{q}$ , $b_{x q}^{- 1} (i_{q}) = u^{n_{q}}$ , where i_q is the index of $u^{n_{q}}$ in $B_{u} (q)$ . If the search fails, (arbitrarily) set $u^{n_{q}} = u^{n_{q}} (1)$ so that φ_xq and b_xq are defined for all of $X^{n_{q}}$ . In the same way, given any $y^{n_{q}} \in Y^{n_{q}}$ , search $B_{v} (q)$ for a codeword $v^{n_{q}}$ such that $(y^{n_{q}}, v^{n_{q}}) \in T_{Y V}^{δ}$ . If successful, denote the index of the found codeword j_q, and set $φ_{y q} (y^{n_{q}}) = v^{n_{q}}$ , $b_{y q} (v^{n_{q}}) = j_{q}$ , $b_{y q}^{- 1} (j_{q}) = v^{n_{q}}$ . For search failure, set j_q =1, $v^{n_{q}} = v^{n_{q}} (1)$ so that φ_yq and b_yq are defined for all of $Y^{n_{q}}$ .

Finally, define

f_{q} : X^{n_{q}} \to M_{x q}, f_{q} (x^{n_{q}}) = i_{q} ≜ b_{x q} (φ_{x q} (x^{n_{q}})) ϕ_{q} : Y^{n_{q}} \to M_{y q}, ϕ_{q} (y^{n_{q}}) = j_{q} ≜ b_{y q} (φ_{y q} (y^{n_{q}})) .

Now, given vectors $x^{n} = (x^{n_{1}} \dots x^{n_{| Q |}})$ and $y^{n} = (y^{n_{1}} \dots y^{n_{| Q |}})$ , the encoders above each produce $| Q |$ vectors $φ_{x q} (x^{n_{q}}) = u^{n_{q}}$ , $φ_{y q} (y^{n_{q}}) = v^{n_{q}}$ , and $| Q |$ indices $b_{x q} (u^{n_{q}}) = i_{q}$ , $b_{y q} (y^{n_{q}}) = j_{q}$ . Denote the concatenations of these

Note that the vector of integers i ranges over $M_{x} = \prod_{q \in Q} M_{x q}$ different values i(i),i = 1…M_x. Let the map between the vectors i and the corresponding integers i be ℓ_x, i.e., if i = i(i), let ℓ_x(i) = i, and $l_{x}^{- 1} (i) = i$ . Similarly, j ranges over $M_{y} = \prod_{q \in Q} M_{y q}$ values, j(j),j = 1…M_y and we define ℓ_y such that if j = j(j), then ℓ_y(j) = j and $l_{y}^{- 1} (j) = j$ . Then we can specify encoders $f^{'} : X^{n} \to M_{x}$ and $ϕ : Y^{n} \to M_{y}$ for full length-n vectors xⁿ and yⁿ by

f^{'} (x^{n}) = i ≜ l_{x} (b_{x} (φ_{x} (x^{n})) ϕ (y^{n}) = j ≜ l_{y} (b_{y} (φ_{y} (y^{n})) .

To finalize the construction of the memory encoder, for any given labeled template pattern t(w) = (xⁿ, w), let $f : X^{n} \times M_{c} \to M_{x} \times M_{c}$ be defined by

f (t (w)) = m (w) = (i, w) ≜ (f^{'} (x^{n}), w) .

The rates of the encoders are $(R_{x}^{'}, R_{y}^{'})$ , as verified by calculating

\frac{1}{n} log M_{x} = \frac{1}{n} log \prod_{q \in Q} M_{x q} = \frac{1}{n} \sum_{q \in Q} log 2^{n_{q} R_{x q}^{'}} = \sum_{q \in Q} p (q) R_{x q}^{'} = R_{x}^{'} \frac{1}{n} log M_{y} = \frac{1}{n} log \prod_{q \in Q} M_{y q} = \frac{1}{n} \sum_{q \in Q} log 2^{n_{q} R_{y q}^{'}} = \sum_{q \in Q} p (q) R_{y q}^{'} = R_{y}^{'} .

3
Memorization: Given a realization of the template patterns $C_{x} = (t (1) \dots t (M_{c})), t (w) = (x^{n} (w), w)$ and the encoders defined above f and ϕ, compute the memory data $C_{u} = f (C_{x}) ≜ {f (t (1)), \dots, f (t (M_{c}))} = {m (1), \dots, m (M_{c})}$ .
4
Recognition Function: Given the stored memory data $C_{u}$ , we proceed to construct the classifier g as follows.

For each $q \in Q$ , given any pair of length n_q vectors $u^{n_{q}}$ , $v^{n_{q}}$ , define a function that tests the pair for strong joint typicality

ρ_{q} (u^{n_{q}}, v^{n_{q}}) = T [(u^{n_{q}}, v^{n_{q}}) \in T_{U V}^{δ}]

where $T [\cdot]$ is the truth-indicator function $T [A] = 1$ if A is true, and $T [A] = 0$ if A is false.

Now, given the sensory data j, compute its vector representation $j = [j_{1} \dots j_{| Q |}] = l_{y}^{- 1} (j)$ . For each $q \in Q$ , retrieve from $B_{v} (q)$ the corresponding codeword $v^{n_{q}} = b_{y q}^{- 1} (j_{q})$ . Similarly, for each memory $m (w) = (i, w) \in C_{u}$ , compute the corresponding vector $i = [i_{q} \dots i_{| Q |}] = l_{x}^{- 1} (i)$ , and from the memory codebook $B_{u} (q)$ retrieve $u^{n_{q}} = b_{x q}^{- 1} (i_{q})$ . Next, define a function $r_{w} : M_{y} \to {0, 1}$ that tests each $v^{n_{q}}$ in $v^{n} = [v^{n_{1}} \dots v^{n_{| Q |}}]$ against the corresponding $u^{n_{q}}$ in $u^{n} = [u^{n_{1}} \dots u^{n_{| Q |}}]$ , reporting a 1 if all compared pairs are jointly typical and zero otherwise

r_{w} (j) = T [ρ_{1} (u^{n_{1}}, v^{n_{1}}) \dots ρ_{| Q |} (u^{n_{| Q |}}, v^{n_{| Q |}}) = 1^{| Q |}]

where $1^{| Q |}$ denotes the length- $| Q |$ all-ones vector.

We can now specify the recognition function g as follows. Given the encoded sensory data j, the recognition module searches for a unique $w^{'} \in M_{c}$ such that r_w′ (j) = 1. If this search is successful, set $\hat{w} = w^{'}$ . Otherwise, if there is none or more than one such value, declare an error and (arbitrarily) set $\hat{w} = 1$ . Thus, we have defined $g : M_{y} \times {(M_{x})}^{M_{c}} \to M_{c}$ , $g (j, C_{u}) = \hat{w}$ , as desired.

A. Performance Analysis

1). Error Events:

We analyze the probability of error for a given W = w, T(w) = (Xⁿ(w), w), Yⁿ. Denote the results of processing these with the components of the code (f, ϕ, g) above by M(w) = (I(w), w) = f(T(w)), J(w) = ϕ(Yⁿ); $U^{n_{q}} (w) = φ_{x q} (X^{n_{q}})$ , $V^{n_{q}} (w) = φ_{y q} (Y^{n_{q}})$ for each $q \in Q$ , and R_w = r_w(J(w)). The following is an exhaustive list of possible errors.

First, in words, the possible errors are as follows:

the sensory data and pattern template are not jointly typical;
the pattern template is unencodable;
the sensory data is unencodable;
the codewords for the memory and sensory data are not jointly typical;
the sensory data is jointly typical with more than one memory codeword;
two different patterns are assigned the same memory codeword.

More formally, we express the error events thus: For events E_i, let ${\bar{E}}_{k} = {(\cup_{i = 1}^{k} E_{i})}^{c}$ , where A^c denotes the complement of A. For each $q \in Q$

$E_{1} (q) = {(X^{n_{q}}, Y^{n_{q}}) \notin T_{X Y}^{δ}}$ ;
$E_{2} (q) = {\bar{E}}_{1} (q) \cap {\forall U^{n_{q}} \in B_{u} (q) : (X^{n_{q}}, U^{n_{q}}) \notin T_{X U}^{δ}}$ ;
$E_{3} (q) = {\bar{E}}_{1} (q) \cap {\forall V^{n_{q}} \in B_{v} (q) : (Y^{n_{q}}, V^{n_{q}}) \notin T_{Y V}^{δ}}$ ;
$E_{4} (q) = {\bar{E}}_{3} (q) \cap {(U^{n_{q}} (w), V^{n_{q}} (w)) \notin T_{U V}^{δ}}$ ; and letting $E_{i} ≜ \cup_{q \in Q} E_{i} (q)$ , i = 1…4
$E_{5} = {\bar{E}}_{4} \cap {\exists w^{'} \in M_{c} : w^{'} \neq w, r_{w^{'}} (J (w)) = 1}$ ;
$E_{6} = {\bar{E}}_{5} \cap {\exists w^{'} \in M_{c} : w^{'} \neq w, I (w^{'}) = I (w)}$ .

Each of these vanishes as n → ∞, for the following reasons:

P(E₁(q)) → 0, by the strong asymptotic equipartition property (AEP);
P(E₂(q)) → 0, because $R_{x q}^{'} \geq I (X; U_{q})$ ;
P(E₃(q)) → 0, because $R_{y q}^{'} \geq I (Y; V_{q})$ ;
P(E₄(q)) → 0, because of the factorization p_q(xyuv) = p(xy)p_q(u | x)p_q(v | y) and the Markov lemma.

Regarding P(E₅) Rewrite E₅ as

E_{5} = {\bar{E}}_{4} \cap {\exists w^{'} \in M_{c} : w^{'} \neq w, r_{w^{'}} (J (w)) = 1} = {\bar{E}}_{4} \cap {\underset{w^{'} \in 〈 M_{c} \ w}}{\cup} {\forall q \in Q, (U^{n_{q}} (w^{'}), V^{n_{q}} (w)) \in T_{U V}^{δ}}} .

Then

P (E_{5}) \leq | C_{u} | \prod_{q \in Q} 2^{- n_{q} [I (U_{q}; V_{q}) - ϵ_{q}]} \leq M_{c} \prod_{q \in Q} 2^{- n_{q} [I (U_{q}; V_{q}) - ϵ_{q}]} = 2^{n R_{c}^{'}} 2^{- n \sum_{q \in Q} p (q) [I (U_{q}; V_{q}) + ϵ_{q}]} = 2^{n [R_{c}^{'} - \sum_{q \in Q} p (q) I (U_{q}; V_{q}) - ϵ]} .

So, P(E₅) → 0 as n → ∞ if.

R_{c}^{'} \leq \sum_{q \in Q} p (q) I (U_{q}; V_{q}) + ϵ

. This is indeed the case, since

R_{c}^{'} = \sum_{q \in Q} p (q) R_{c q}^{'} = \sum_{q \in Q} p (q) [R_{x q}^{'} + R_{y q}^{'} - I (X Y; U_{q} V_{q}) - γ_{q}] = \sum_{q \in Q} p (q) [I (X; U_{q}) + I (Y; V_{q}) - I (X Y; U_{q} V_{q}) + α_{x q} + α_{y q} - γ_{q}] \overset{(a)}{\leq} \sum_{q \in Q} p (q) [I (X; U_{q}) + I (Y; V_{q}) - I (X Y; U_{q} V_{q})] + ϵ \overset{(b)}{\leq} \sum_{q \in Q} p (q) I (U_{q}; V_{q}) + ϵ

where (a) is because (γ – α_xq + α_yq); and (b) follows from elementary properties of mutual information and from the factorization. p(q)p(xy)p(u | xq) p(v | yq).

Regarding P(E₆): Denoting the components of the codeword for each memory $M (w) \in C_{u}$ by $U^{n_{q}} (w)$ , rewrite event E₆ as
$E_{6} = {\bar{E}}_{5} \cap {\exists w^{'} \in M_{c} : w^{'} \neq w, I (w^{'}) = I (w)} = {\bar{E}}_{5} \cap {\exists w^{'} \in M_{c} : w^{'} \neq w, \forall q \in Q : (X^{n_{q}} (w^{'}), U^{n_{q}} (w)) \in T_{X U}^{δ}} .$
Then
$P (E_{6}) \leq M_{c} \prod_{q \in Q} 2^{- n_{q} [I (X; U_{q}) - ϵ_{q}]} = 2^{n [R_{c}^{'} - \sum_{q \in Q} p (q) I (X; U_{q}) - ϵ]} .$
So as if P(E₆) → 0 as n → ∞ if $R_{c}^{'} \leq \sum_{q \in Q} p (q) I (X; U_{q}) + ϵ]$ . This is indeed the case: From our preceding calculation for P(E₅), we have $R_{c}^{'} \leq \sum_{q \in Q} p (q) I (U_{q}; V_{q}) + ϵ$ ; and the assumed factorization of p_q(xyuv) implies that the following is a Markov chain: U_q – X – Y − V_q. Hence, by the data processing inequality
$R_{c} \leq \sum_{q \in Q} p (q) I (U_{q}; V_{q}) + ϵ \leq \sum_{q \in Q} p (q) I (U_{q}; X) + ϵ .$

This concludes the proof of the inner bound.

Appendix B. Proof of the Outer Bound

In this section we prove Theorem 1, which states the outer bound $R \subseteq R_{out}$ . In the proof let W be the test index, selected from a uniform distribution p(w) over the pattern indices $M_{c}$ ; let T = T(W) = (X, W)be the selected test pattern from the set of template patterns $C_{x}$ ; let M = M(W) = (I, W) = f(T) be the compressed, memorized form of T; let $C_{u} = f (C_{x})$ be the memorized data; let Y be the sensory data; let J = J(W) = ϕ(Y) be the encoded sensory data, and let $\hat{W} = g (J, C_{u})$ be the inferred value of W. Note that M, $\hat{W}$ are random variables through their dependence on X, W, Y, and $C_{x}$ . The mutual informations in the proof are calculated with respect to the joint distribution (and its marginals) over (W, $C_{x}$ , $C_{u}$ , X, Y, M, I, J, $\hat{W}$ ). We can verify that this distribution is well defined by writing it out explicitly. Let $T [\cdot]$ be the truth-indicator function $T [A] = 1$ if A is true, and $T [A] = 0$ if A is false. Then

p (w, C_{x}, C_{u}, x, y, m, j) = p (w) p (C_{x}) p (C_{u} | C_{x}) p (x | w, C_{x}) p (y | x) p (m | x, w) p (j | y) p (\hat{w} | j, C_{u})

where

p (w) = \frac{1}{M_{c}} T [w \in M_{c}] p (C_{x}) = \prod_{w = 1}^{M_{c}} \prod_{i = 1}^{n} p (x_{i} (w)) p (C_{u} | C_{x}) = T [C_{u} = f (C_{x})] p (x | w, C_{x}) = T [x = x (w) \in C_{x}] p (y | x) = \prod_{i = 1}^{n} p (y_{i} | x_{i}) p (m | x, w) = T [m = f (x, w)] p (j | y) = T [j = ϕ (y)] p (\hat{w} | j, C_{u}) = T [\hat{w} = g (j, C_{u})] .

The independence relationships underlying the structure of this distribution are evident from the block diagram of Fig. 1.

Proof: (Theorem 3) Assume $R = (R_{x}, R_{y}, R_{c}) \in R$ . Then there exists a sequence of (Mx, M_y, M_c, n) codes (f, ϕ, g), such that for any ϵ > 0

M_{c} \geq 2^{n R_{c}} M_{x} \leq 2^{n R_{x}} M_{y} \leq 2^{n R_{y}}

and $P_{e}^{n} = Pr (\hat{W} \neq W) \leq ϵ$ . To show that $R \in R_{out}$ , we must construct a pair of auxiliary random variables UV such that $U V \in P_{out}$ and $R \in R_{U V}$ .

We construct the desired pair UV in three steps: 1) We introduce a set of intermediate random variable pairs U_iV_i, i = 1,2,…,n, individually contained in $P_{out}$ ; 2) we derive mutual information inequalities for R_x, R_y, and R_c involving sums of the intermediate variables; 3) we convert the sum-inequalities into inequalities in the desired pair UV.

Step 1:

Let the intermediate auxiliary random variables be

U_{i} = (M, X^{i - 1}) V_{i} = (J, Y^{i - 1})

for i = 1,2,…,n. Each pair is in $P_{out}$ . This is verified for the U_i by calculating

I (U_{i}; Y_{i} | X_{i}) = H (Y_{i} | X_{i}) - H (Y_{i} | M, X^{i - 1}, X_{i}) = H (Y_{i} | X_{i}) - H (Y_{i} | M, X^{i}) \overset{(a)}{\leq} H (Y_{i} | X_{i}) - H (Y_{i} | M, X^{n}) \overset{(b)}{=} H (Y_{i} | X_{i}) - H (Y_{i} | X^{n}) \overset{(c)}{=} H (Y_{i} | X_{i}) - H (Y_{i} | X_{i}) = 0

where the reasons for the lettered steps are (a) conditioning does not increase entropy, (b) the Y_i are independent of all other variables given Xⁿ, and (c) the pairs X_iY_i are i.i.d. Hence, U_i − X_i − Y_i is a Markov chain. By a similar argument, X_i − Y_i − V_i is also a Markov chain. Hence, $U_{i} V_{i} \in P_{out}$ for each i = 1,2,…,n.

Step 2:

First, for the sensory encoder rate

n R_{y} \geq H (J) \overset{(a)}{=} H (J) - H (J | Y^{n}) = \sum_{i = 1}^{n} H (Y_{i}) - H (Y_{i} | Y^{i - 1} J) = \sum_{i = 1}^{n} H (X_{i}) - H (X_{i} | V_{i}) = \sum_{i = 1}^{n} I (X_{i}; V_{i})

where (a) follows from J = ϕ(Yⁿ).

Next, taking account of all M_c memorized patterns, for the memory encoder rate we have

M_{c} (n R_{x}) = M_{c} log M_{x} \geq H (C_{u}) = H (C_{u}) - H (C_{u} | C_{x}) = H (C_{x}) - H (C_{x} | C_{u}) = \sum_{w = 1}^{M_{c}} H (T (w)) - H (T (w) | M (w)) \overset{(a)}{=} \sum_{w = 1}^{M_{c}} H (X^{n} (W) | W = w) - H (X^{n} (W) | I, W = w) \overset{(b)}{=} \sum_{w = 1}^{M_{c}} H (X^{n}) - H (X^{n} | I, W = w) \overset{(c)}{=} M_{c} \sum_{w = 1}^{M_{c}} p (w) [H (X^{n}) - H (X^{n} | I, W = w)] \overset{(d)}{=} M_{c} [H (X^{n}) - H (X^{n} | I, W)] \overset{(e)}{=} M_{c} \sum_{i = 1}^{n} [H (X_{i}) - H (X_{i} | X^{i - 1}, I, W)] \overset{(f)}{=} M_{c} \sum_{i = 1}^{n} [H (X_{i}) - H (X_{i} | U_{i})] = M_{c} \sum_{i = 1}^{n} I (X_{i}; U_{i})

where (a) is simply a matter of variable definitions and notation, (b) follows from the assumption that the Xⁿ(W)’s all have the same distribution and are drawn independently of W, (c) follows from the definition of p(w), (d) follows from the definition of conditional entropy, (e) follows from $p (x^{n}) = \prod_{i = 1}^{n} p (x_{i})$ and the telescoping property, and (f) follows from the definition of U_i. Hence, $n R_{x} \geq \sum_{i = 1}^{n} I (X_{i}; U_{i})$ .

Finally n R_{c} \leq log M_{c} = H (W) = I (W; C_{u}, J) + H (W | C_{u}, J) \overset{(a)}{\leq} I (W; C_{u}, J) + n ϵ_{n} = I (W; C_{u}) + I (W; J | C_{u}) + n ϵ_{n} \overset{(b)}{=} 0 + I (W; J | C_{u}) + n ϵ_{n} = I (W, C_{u}; J) - I (J; C_{u}) + n ϵ_{n} \leq I (W, C_{u}; J) + n ϵ_{n} \overset{(c)}{=} I (M; J) + n ϵ_{n} \overset{(d)}{=} \sum_{i = 1}^{n} I (X_{i}; U_{i}) + I (Y_{i}; V_{i}) - I (X_{i} Y_{i}; U_{i} V_{i}) + n ϵ_{n} .

The lettered steps are justified as follows.

By assumption, $Pr (\hat{w} \neq W) = P_{e}^{n} \to 0$ , where $\hat{W} = g (J, C_{u})$ . Thus, applying Fano’s inequality yields
$H (W | C_{u}, J) \leq H (P_{e}^{n}) + P_{e}^{n} log (M_{c} - 1) \leq n ϵ_{n}$
where ϵ_n → 0.
The test index W and patterns $C_{x}$ are drawn independently, hence, $I (W; C_{u}) = I (W; f (C_{x})) = 0$ .
Writing $C_{u} = C_{u}^{*} \cup M$ , $C_{u}^{*} = C_{u} \ M$ , we have
$I (W, C_{u}; J) = I (W, M, C_{u}^{*}; J) = I (M; J) + I (W, C_{u}^{*}; J | M) = I (M; J) + I (W, C_{u}^{*}; J | I, W) = I (M; J) + I (C_{u}^{*}; J | I, W) = I (M; J) + 0,$
since the M(i) = (I(i), i) are independent of J for i ≠ W.
To justify this step, we invoke the following two results, proved in Appendix D. Let A, α, B, β and γ be arbitrary discrete random variables.

Then we get the following.

Lemma 6:

I (α; β) \geq I (A; α) + I (B; β) - I (A B; α β),

with equality if and only if I(Aα; Bβ) = I(A; B).

Lemma 7: Let Z_i = (γ; Aⁱ⁻¹), i =1,2,…,n where the A_i are i.i.d. Then

\sum_{i = 1}^{n} I (A_{i}; Z_{i}) = I (A^{n}; γ) .

To apply Lemma 6, make the substitution (α, β, A, B) → (M, J, Xⁿ, Yⁿ). Then the condition for equality is satisfied

I (X^{n}, M; Y^{n}, J) = I (X^{n}, I, W; Y^{n}, J) = I (X^{n}, W; Y^{n}, J) + I (I; Y^{n}, J | X^{n}, W) = I (X^{n}, W; Y^{n}, J) + I (I, W; Y^{n}, J | X^{n}, W) \overset{(a)}{=} I (X^{n}, W; Y^{n}, J) + 0 = I (X^{n}, W; Y^{n}) + I (X^{n}, W; J | Y^{n}) \overset{(b)}{=} I (X^{n}, W; Y^{n}) + 0 = I (X^{n}; Y^{n}) + I (W; Y^{n} | X^{n}) \overset{(c)}{=} I (X^{n}; Y^{n}) + 0

since (a) M = (I, W) = f(Xⁿ, W), (b) J = ϕ(Yⁿ), and (c) Yⁿ only depends on W through Xⁿ = Xⁿ(W), so that H(Yⁿ | Xⁿ, W) = H(Yⁿ | Xⁿ). Thus, we have

I (M; J) = I (X^{n}; M) + I (Y^{n}; J) - I (X^{n}, Y^{n}; M, J) .

(23)

Next, apply Lemma 7 three times with the substitutions

(Z_{i}, γ, A^{i - 1}) \to (U_{i}, M, X^{i - 1}), \to (V_{i}, J, Y^{i - 1}), \to (U_{i} V_{i}, M J, X^{i - 1} Y^{i - 1})

to obtain

\sum_{i = 1}^{n} I (X_{i}; U_{i}) = I (X^{n}; M) \sum_{i = 1}^{n} I (Y_{i}; V_{i}) = I (Y^{n}; J) \sum_{i = 1}^{n} I (X_{i} Y_{i}; U_{i} V_{i}) = I (X^{n} Y^{n}; M J) .

Adding the first two expressions and subtracting the third yields

\sum_{i = 1}^{n} [I (X_{i}, U_{i}) + I (Y_{i}; V_{i}) - I (X_{i}, Y_{i}; U_{i}, V_{i})] = [I (X^{n}; M) + I (Y^{n}; J) - I (X^{n} Y^{n}; M J)] .

(24)

Combining (23) and (24) yields

I (M; J) = \sum_{i = 1}^{n} I (X_{i}; U_{i}) + I (Y_{i}; V_{i}) - I (X_{i}, Y_{i}; U_{i}, V_{i})

as claimed.

Step 3:

For this step, we use the following lemma, proved in Appendix C as part of the demonstration that $R_{out}$ is convex.

Lemma 1: Suppose $U_{i} V_{i} \in P_{out}$ , i = 1,2,…,n. Then there exists $U V \in P_{out}$ such that

\frac{1}{n} \sum_{i = 1}^{n} I (X_{i}; U_{i}) = I (X; U) \frac{1}{n} \sum_{i = 1}^{n} I (Y_{i}; V_{i}) = I (Y; V) \frac{1}{n} \sum_{i = 1}^{n} I (X_{i} Y_{i}; U_{i} V_{i}) = I (X Y; U V) .

Applying Lemma 1 to the results of Steps 1 and 2, we obtain

R_{x} \geq \frac{1}{n} \sum_{i = 1}^{n} I (X_{i}; U_{i}) = I (X; U) R_{y} \geq \frac{1}{n} \sum_{i = 1}^{n} I (Y_{i}; V_{i}) = I (Y; V) R_{c} \leq R_{x} + R_{y} - \frac{1}{n} \sum_{i = 1}^{n} I (X_{i} Y_{i}; U_{i} V_{i}) = R_{x} + R_{y} - I (X Y; U V)

where $U V \in P_{out}$ . With respect to this UV, by definition we have $R \in R_{U V}$ . Hence, $R \in R_{out}$ , and the proof is complete. □

Appendix C. Convexity of the Outer Bound

In this appendix, we prove a slightly more general version of Lemma 1 from Appendix B, and use this result to show that the outer bound rate region $R_{out}$ is convex.

In the following, let $Q$ be any finite alphabet, and assume that we have pairs X_qY_q for all $q \in Q$ which are i.i.d. ~ p(xy).

Lemma 2: Suppose $U_{q} V_{q} \in P_{out}$ for all $q \in Q$ , and let $Q ~ p (q)$ , $q \in Q$ be any discrete random variable independent of the pairs {X_qY_q}. Then there exists a pair of discrete random variables $U V \in P_{out}$ such that

\sum_{q \in Q} p (q) I (X_{q}; U_{q}) = I (X; U) \sum_{q \in Q} p (q) I (Y_{q}; V_{q}) = I (Y; V) \sum_{q \in Q} p (q) [I (X_{q}; U_{q}) + I (Y_{q}; V_{q}) - I (X_{q} Y_{q}; U_{q} V_{q})] = I (X; U) + I (Y; V) - I (X Y; U V) .

Remark 4: Lemma 1 in Appendix B follows immediately from the above Lemma, by choosing $Q = {1, 2, \dots, n}$ and p(q) = 1/n for all $q \in Q$ .

Proof: As a candidate for the pair UV in the lemma, consider $U V \in P_{conv}$ for the given Q (see (4)), i.e., U = (U_Q,Q) and V = (V_Q,Q). To verify that $U V \in P_{out}$ , we proceed to check that U − X − Y and X − Y − V are Markov chains.

By the assumption $U_{q} V_{q} \in P_{out}$ for each $q \in Q$ , we have I(U_q; Y_q | X_q) = 0 and I(V_q; X_q | Y_q) = 0. Hence

0 = \sum_{q \in Q} p (q) I (U_{q}; Y_{q} | X_{q}) = \sum_{q \in Q} p (q) I (U_{q}; Y_{q} | X_{q}, Q = q) = I (U_{Q}; Y_{Q} | X_{Q} Q) \overset{(a)}{=} I (U_{Q}; Y | X, Q) = I (U_{Q} Q; Y | X) - I (Q; Y | X) \overset{(b)}{=} I (U_{Q} Q; Y | X) = I (U; Y | X)

where in (a) we are able to drop the subscript Q on X_Q and Y_Q because the X_q and Y_q are i.i.d. and independent of Q; and similarly (b) is because I(Q;Y | X) = 0, due to the independence of Q and Y. By an analogous calculation, we also find I(V; X | Y) = 0. Hence, U − X − Y and X − Y − V, and $U V \in P_{out}$ as desired.

It remains to demonstrate the three equalities in the lemma. For the first, write

I (X; U) = I (X; U_{Q} Q) = I (X; U_{Q} | Q) + I (X; Q) \overset{(\underline{a})}{=} I (X; U_{Q} | Q) \overset{(b)}{=} I (X_{Q}; U_{Q} | Q) = \sum_{q \in Q} p (q) I (X_{q}; U_{q})

where, as above, (a) and (b) follow from the fact that the X_q are i.i.d. and independent of Q. Similar calculations yield

I (Y; V) = \sum_{q \in Q} p (q) I (Y_{q}; V_{q}),

and

I (X Y; U V) = \sum_{q \in Q} p (q) I (X_{q} Y_{q}; U_{q} V_{q}) .

The convexity of $R_{out}$ follows readily from the preceding lemma.

Lemma 3: $R_{out}$ is convex. That is, let R_q be any set of rates such that $R_{q} \in R_{out}$ for all $q \in Q$ , where $Q$ is a finite alphabet, and let p(q) be any probability distribution over $Q$ . Then $R = \sum_{q \in Q} p (q) R_{q} \in R_{out}$ .

Proof: Fix an arbitrary distribution p(q) and rates $R_{q} \in R_{out}$ for all $q \in Q$ . By the definition of $R_{out}$ , for each rate R_q, there exists a pair $U_{q} V_{q} \in P_{out}$ such that $R_{q} \in R_{U_{q} V_{q}}$ . Consequently

R_{x} = \sum_{q \in Q} p (q) R_{x, q} \geq \sum_{q \in Q} p (q) I (X_{q}; U_{q}) R_{y} = \sum_{q \in Q} p (q) R_{y, q} \geq \sum_{q \in Q} p (q) I (Y_{q}; V_{q}) R_{c} = \sum_{q \in Q} p (q) R_{c, q} \leq \sum_{q \in Q} p (q) [I (X_{q}; U_{q}) + I (Y_{q}; V_{q}) - I (X_{q} Y_{q}; U_{q} V_{q})] .

As in the proof of Lemma 2, use these pairs to construct a new pair UV, by defining U = (UQ, Q), V = (VQ, Q). From the proof of Lemma 2, we know 1) that $U V \in P_{out}$ , and 2) the sums on the right-hand sides of the inequalities above can be replaced with expressions in U and V, yielding

R_{x} \geq I (X; U) R_{y} \geq I (Y; V) R_{c} \leq I (X; U) + I (Y; V) - I (X Y; U V) \leq R_{x} + R_{y} - I (X Y; U V)

which means that $R \in R_{U V}$ for the given UV. Hence, $R = \sum_{q \in Q} p (q) R_{q} \in R_{out}$ . Since p(q) and $R_{q} \in R_{out}$ were arbitrary, we conclude that $R_{out}$ is convex. □

Appendix D. Mixing Lemmas

In this appendix, we prove Lemmas 6 and 7, which are used in proving the outer bound.

Consider the elementary Shannon inequalities, stated in the following two lemmas. The variables A, B, α, β, γ, δ appearing in the lemmas denote arbitrary discrete random variables.

Lemma 4:

I (A; α) = I (A; α, γ) - I (A, α; γ) + I (α; γ) .

Proof:

I (A; γ | α) = I (A; α, γ) - I (A; α) = I (A, α; γ) - I (γ; α) .

Lemma 5:

I (A; α) + I (B; β) = I (A; B) + I (α; β) - I (A, α; B, β) + I (A, B; α, β) .

Proof:

I (A, α; B, β) - I (A, B; α, β) = H (A, α) + H (B, β) - H (A, B) - H (α, β) = - I (A; α) - I (B; β) + I (A; B) + I (α; β) .

Lemma 6 follows directly from the preceding lemmas.

Lemma 6:

I (α; β) \geq I (A; α) + I (B; β) - I (A, B; α, β)

with equality if and only if I(A, α; B, β) = I(A; B).

Proof: Rearrange Lemma 5 to get

I (α; β) = I (A; α) + I (B; β) - I (A, B; α, β) + [I (A, α; B, β) - I (A; B)] .

The lemma now follows readily from the preceding expression: We obtain equality in the lemma if (and only if) the term in brackets is zero. Otherwise, the bracketed term is nonnegative, since

I (A, α; B, β) - I (A; B) = H (α | A) + H (β | B) - H (α, β | A, B) = H (α | A) - H (α | A, B) + H (β | B) - H (β | A, B, α) \geq 0,

where the inequality is due to the fact that conditioning does not increase entropy. □

Lemma 7: If U_i = (γ, Aⁱ⁻¹), then

I (A^{n}; γ) = \sum_{i = 1}^{n} I (A_{i}; U_{i}) - \sum_{i = 2}^{n} I (A_{i}; A^{i - 1}),

Proof: In Lemma 4, put A = A_i, α = Aⁱ⁻¹ Note that U₁ = γ. Hence, substituting and summing from 2 to n yields

\sum_{i = 2}^{n} I (A_{i}; A^{i - 1}) = \sum_{i = 2}^{n} I (A_{i}; U_{i}) - I (A^{n}; γ) + I (A_{1}; γ) = \sum_{i = 2}^{n} I (A_{i}; U_{i}) - I (A^{n}; γ) + I (A_{1}; U_{1}) = \sum_{i = 1}^{n} I (A_{i}; U_{i}) - I (A^{n}; γ) .

Appendix E. Alternative Representations of $R_{in}$ and $R_{out}$

Here we show that the alternative representations of the inner and outer bound surfaces introduced in Remark 2 are in fact equivalent, i.e., $R_{in} = R_{in}^{'} = R_{in}^{''}$ , $R_{out} = R_{out}^{'} = R_{out}^{''}$ .

We show first that $R_{out}^{'} = R_{out}^{''}$ . Suppose $R \in R_{out}^{'}$ . Then, using U − X − Y, X − Y − V we have

R_{c} \leq I (U; V) - I (U; V | X Y) = I (X; U) + I (Y; V) - I (X Y; U V)

(25)

which implies

R_{x} \geq I (X; U) \geq I (X Y; U V) - I (Y; V) + R_{c} = I (X Y; U | V) + R_{c}

and similarly R_y ≥ I(XY; V | U) + R_c; and

R_{x} + R_{y} \geq I (X; U) + I (Y; V) \geq I (X Y; U V) + R_{c} .

We conclude that $R \in R_{out}^{''}$ , hence $R_{out}^{'} \subseteq R_{out}^{''}$ . A symmetrical argument shows $R_{out}^{''} \subseteq R_{out}^{'}$ , proving $R_{out}^{'} = R_{out}^{''}$ .

Next, we show that $R_{out}$ and $R_{out}^{'}$ are identical. To this end, note that these sets correspond to regions in the positive orthant $ℝ_{+}^{3}$ , and that two such regions are identical if they have the same surfaces. Following the presentation in Section VII, the surfaces of $R_{out}$ and $R_{out}^{'}$ are

r_{out} (r_{x}, r_{y}) = max_{U V \in C_{out} (r_{x}, r_{y})} r_{x} + r_{y} - I (X Y; U V) r_{out}^{'} (r_{x}, r_{y}) = max_{U V \in C_{out} (r_{x}, r_{y})} I (U; V) - I (U; V | X Y)

where

C_{out} (r_{x}, r_{y}) = {U V \in P_{out} : I (U; X) = r_{x}, I (V; Y)) = r_{y}} .

r_{out}^{'} (r_{x,} r_{y}) = max_{U V \in C_{out} (r_{x}, r_{y})} I (X; U) + I (Y; V) - I (X Y; U V) = max_{U V \in C_{in} (r_{x}, r_{y})} r_{x} + r_{y} - I (X Y; U V) = r_{out} (r_{x}, r_{y}) .

Thus, the desired equivalence $R_{out} = R_{out}^{'}$ follows simply from the fact that at the surfaces, the inequalities defining each region become equalities.

The same line of argument as above of course also shows $R_{in} = R_{in}^{'} = R_{in}^{''}$ .

Appendix F. Simplification of Convex Hulls

In this appendix, we argue geometrically that the expressions for the convex hulls of the inner bound regions simplify to just one term in both the binary and Gaussian cases. To discuss both cases simultaneously, let us represent the surface of either inner bound by a positive-valued function $f : D \to ℝ_{+}$ . Here, $D$ is a square region

D = {r = (x, y) \in ℝ^{2} : 0 \leq x \leq M, 0 \leq y \leq M}

and M is a positive constant. In the binary case, f(r) = s(r), and $D = [0, 1] \times [0, 1]$ ; in the Gaussian case, f(r) = S(r) and $D = [0, \infty) \times [0, \infty)$ . Some important properties shared by both cases are that for all $D = [0, \infty) \times [0, \infty)$

f (x, y) \geq 0, f (0, y) = f (x, 0) = 0 f_{x} (r), f_{y} (r) > 0, f_{x x} (r), f_{y y} (r) < 0

where the subscripts denote partial derivatives.

Denote the convex hull of f(r) by c(r). Generically, the boundary of the convex hull is

c (r) = max θ f (r_{1}) + \bar{θ} f (r_{2})

where the maximum is over all triples (θ, r₁, r₂) such that $r = θ r_{1} + \bar{θ} r_{2}, θ \in [0, 1]$ , and $r_{1}, r_{2} \in D$ . However, as argued next, for the cases under study this simplifies to

c (r) = max θ f (r^{'})

where r = θr′.

The convex hull of a surface can be characterized in terms of its tangent planes. Given any point $r^{'} = (x, y) \in D$ , if its tangent plane lies entirely above the surface, then (r′, f(r′)) is on the convex hull. If the tangent plane cuts through the surface at one or more other points, or if the tangent plane lies below the surface, then (r, f(r)) is not on the convex hull. If the tangent plane intersects the surface at exactly two points, then both points are on the convex hull.

The tangent plane at an arbitrary point $r^{'} = (x^{'}, y^{'}) \in D$ is the set of points satisfying

z (x, y) = f_{x} (x - x^{'}) + f_{y} (y - y^{'}) + z^{'}

where the partial derivatives are evaluated at r′, i.e., f_x = f_x(r′), f_y = f_y(r′), and z′ = f(r′). The tangent plane intersects the z = 0 plane in a line. Setting z(r) = 0 and solving

y = m x + b, where m = - (f_{x} / f_{y}) b = 1 / f_{y} [x^{'} f_{x} + y^{'} f_{y} - z^{'}] .

Since f_x, f_y > 0, the slope m = −(f_x/f_y) is negative. This line intersects the positive orthant whenever the intercept b ≥ 0, in which case the tangent plane cuts through the surface, since f ≥ 0. Thus, the only points on the original surface f(x,y) that can be on the convex hull are those for which b ≤ 0.

Next, consider any path through $D$ along a line segment y = αx, α > 0, starting from one of the “outer edges” of $D$ , where x = M or y = M, and consider what happens to the tangent plane’s line of intersection ℓ with the z = 0 plane as we move in along the path toward the origin (0, 0). Initially, the tangent planes lie entirely above the surface, and the intercept of ℓ is negative, b < 0. This intercept increases along the path until b = 0, at which point ℓ intersects (0, 0). Here, the tangent plane contains a line segment attached on one end to the point of tangency, and at the other end to the point (r, f(r)) = (0, 0, 0); everywhere else, the tangent plane is above the surface. Continuing toward the origin, all other points along the path have tangent planes such that ℓ has a positive intercept b > 0, hence, these points are excluded from the convex hull.

These considerations imply that the convex hull c(r) is composed entirely of two kinds of points. First, points which coincide with the original surface, c(r), = θf(r) with θ = 1. These points occur at values of r = (x, y) “up and to the right” of (0, 0). Second, points along line segments connecting surface points “up and to the right” (r′, f(r′)) with the point (r, f(r)) = (0, 0, 0), that is $c (r) = θ f (r^{'}) + \bar{θ} f (0, 0) = θ f (r^{'})$ , where r = θr′ and θ ∈ [0, 1]. Hence, for all $r \in D$ , c(r) has the desired form.

Two more examples of functions that behave in the same way just described are f(x, y) = (1 − (1 − x)²)(1 − (1 − y)²) and f(x, y) = xy, with $D = [0, 1] \times [0, 1]$ .

Appendix G. Properties of Gaussian Mutual Information

Our analysis of the Gaussian pattern recognition problem relies on the following well-known results, stated below without proof.

Lemma 8: The mutual information between two Gaussian random vectors X and Y depends only on the matrices of correlation coefficients. Specifically

I (X; Y) = \frac{1}{2} log (det C_{x, x}) - \frac{1}{2} log (det C_{x, x} | y)

where

C_{x, x | y} = C_{x, x} - C_{x, y} C_{y, y}^{- 1} C_{y, x} .

In the special case Y = X + W, where X and W are independent Gaussian random variables with variances P and N, respectively, we have

I (X; Y) = \frac{1}{2} log (1 + \frac{P}{N}) = - \frac{1}{2} log (1 - ρ_{x, y}^{2})

where the correlation coefficient $ρ_{x, y} = \sqrt{P / (P + N)}$ .

Lemma 9: If X, Y and Z are zero-mean Gaussian random vectors that form a Markov chain X − Y − Z, then

C_{x, z} = C_{x, y} C_{y, y}^{- 1} C_{y, z} .

Note that for dimension one, X → Y → Z ρ_x,z = ρ_x,yρ_y,z implies.

Lemma 10: Let X, Y, U, and V be jointly Gaussian random variables such that U − X − Y and X − Y − V are Markov chains. Then the matrix of correlation coefficients C_xy,uv decomposes as

C_{x y, u v} = [\begin{matrix} 1 & ρ_{x y} \\ ρ_{x y} & 1 \end{matrix}] [\begin{matrix} ρ_{x u} & 0 \\ 0 & ρ_{y v} \end{matrix}] .

This lemma follows immediately by using Lemma 9 to obtain the substitutions $C_{x, v} = C_{x, y} C_{y, y}^{- 1} C_{y, v} = ρ_{x y} ρ_{y v}$ and $C_{x, y} = C_{u, x} C_{x, x}^{- 1} C_{x, y} = ρ_{u x} ρ_{x y}$ .

Footnotes

Communicated by P. L. Bartlett, Associate Editor for Pattern Recognition, Statistical Learning and Inference.

If this were not the case, visual experience would be like watching television white noise.

Grenander [4] and Mumford [5] have argued that four “universal transformations” (noise and blur, superposition, domain warping, and interruptions) account for most of the ambiguity and variability in naturally occurring signals.

One should probably beware of the strange (and unnecessary) interpretation that, as we upgrade our camera (i.e., as we increase n), the number of patterns in the world consequently increases. More naturally, we may view the world as always presenting a practically unlimited number of patterns, while the number of patterns that can be taken advantage of by a system grows with increasing information processing resources.

⁴

In particular, our formulation is consistent with the “General Pattern Theory” framework of Grenander and colleagues, which has provided a basis for much of the generative modeling work in pattern recognition research [11].

⁵

This step relies on the long Markov chain U − X − Y − V and the Markov lemma.

⁶

Alternatively, one can search for ways to tighten the outer bound.

⁷

The random variables in these sets behave differently in mutual information computations. For example, compare I(XY; UV) for UV in either set, using the same set of distributions p(u | xq), p(v | yq) as ingredients. For

U V \in P_{mix}

I_{mix} = I (X Y; U V) = \sum_{xyuv} p (x y) (\sum_{q} p (q) p (u | x q) p (v | y q)) \times log \frac{(\sum_{q} p (q) p (u | x q) p (v | y q))}{(\sum_{q} p (q) p (u | q) p (v | q))}

whereas for

U V \in P_{conv}

I_{conv} = I (X Y; U V) = \sum_{q} p (q) \sum_{xyuv} p (x y) p (u | x q) p (v | y q) log \frac{p (u | x q) p (v | y q)}{p (u | q) p (v | q)} .

It follows from the log–sum inequality [30, p. 29] that I_conv ≥ I_mix

⁸

The notation is ours.

⁹

But see the footnote at the end Section VIII.

¹⁰

This is related to the problem addressed in [24], [25], except that in that work there is no requirement that the memory data be compressed.

¹¹

Similar comments are made in [24].

¹²

The optimization over distributions p(xyuv) reduces to a search over conditional distributions p(uv | xy) because p(xy) is fixed. Details of the optimization algorithm are given in [35].

¹³

These are called Hinton diagrams in the machine learning literature, after their inventor Geoffrey Hinton.

¹⁴

These distributions are not unique, as the mutual informations are unchanged under various reassignments of values of x, y, u, v, and consequent rearrangements of the entries of p(uv | xy); the distributions shown have been accordingly rearranged into a common format to facilitate comparison.

¹⁵

During the review process for this paper, Servetto indeed claimed a solution to the distributed source coding problem using a novel approach. Unfortunately, he suffered an untimely death on 7/24/2007, before finalizing his work. The most recent public draft of his paper on this topic is available on the arXiv (see [50]).

¹⁶

This last condition ensures R_cq ≤ R_xq + R_yq − I(XY; U_qV_q).

Contributor Information

M. Brandon Westover, Department of Neurology, Massachusetts General Hospital, Boston, MA 02114-2622 USA.

Joseph A. O’Sullivan, Department of Electrical engineering, Washington University, St. Louis, MO 63130 USA.

References

[1].Barlow HB, “Sensory mechanisms, the reduction of redundancy, and intelligence,” in Proc. Symp. Mechanization of Thought Processes, 1959, pp. 537–574. [Google Scholar]
[2].Van Essen DC and Anderson CH, “Information processing strategies and pathways in the primate retina and visual cortex,” in An Introduction to Neural and Electronic Networks, Lau C, Zornetzer SF, and Davis JL, Eds. San Diego, CA: Academic, 1995, pp. 43–72. [Google Scholar]
[3].Eddington AS, The Mathematical Theory of Relativity, 3rd ed Cambridge, U.K.: Cambridge Univ. Press, 1963. [Google Scholar]
[4].Grenander U, Elements of Pattern Theory, ser Johns Hopkins Series in the Mathematical Sciences. Baltimore, MD: Johns Hopkins Univ. Press, 1996. [Google Scholar]
[5].Mumford D, “Pattern theory: A unifying perspective,” in Perception as Bayesian Inference, Knill DC and Richards W, Eds. Cambridge, U.K.: Cambridge Univ. Press, 1996, ch. 1, pp. 25–62. [Google Scholar]
[6].Schmid JA and O’Sullivan NA, “Performance prediction methodology for biometric systems using a large deviations approach,” IEEE Trans. Signal Process, vol. 52, no. 10, pp. 3036–3045, Oct.. [Google Scholar]
[7].Bishop CM, Pattern Recognition and Machine Learning. New York: Springer, 2006. [Google Scholar]
[8].MacKay DJC, Information Theory, Inference, and Learning Algorithms. Cambridge, U.K.: Cambridge Univ. Press, 2003. [Online]. Available: http://www.inference.phy.cam.ac.uk/mackay/itila/ [Google Scholar]
[9].Kearns MJ and Vazirani UV, An Introduction to Computational Learning Theory. Cambridge, MA: MIT Press, 1994. [Google Scholar]
[10].Vapnik VN, Statistical Learning Theory. New York: Wiley, 1998. [Google Scholar]
[11].Grenander U and Miller M, Pattern Theory: From Representation to Inference. New York: Oxford Univ. Press, 2006. [Google Scholar]
[12].Burges CJC, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998. [Google Scholar]
[13].Barlow HB, “The coding of sensory messages,” in Current Problems in Animal Behavior, Thorpe WH and Zangwill OL, Eds. Cambridge, U.K.: Cambridge Univ. Press, 1961. [Google Scholar]
[14].Barlow HB, “Understanding natural vision,” in Physical and Biological Processing of Images, ser. Springer series in Information Sceinces, Braddick OJ and Sleigh AC, Eds. Berlin, Germany: Springer-Verlag, 1983, vol. 11, pp. 2–14. [Google Scholar]
[15].Barlow H, “What is the computational goal of the neocortex?,” in Large Scale Neuronal Theories of the Brain, Koch C and Davis JL, Eds. Cambridge, MA: MIT Press, 1994, ch. 1, pp. 1–22. [Google Scholar]
[16].Barlow H, “Redundancy reduction revisited,” Network-Computa. Neural Syst, vol. 12, no. 3, pp. 241–53, 2001. [PubMed] [Google Scholar]
[17].Gardner-Medwin AR and Barlow HB, “The limits of counting accuracy in distributed neural representations,” Neural Comput, vol. 13, no. 3, pp. 477–504, 2001. [DOI] [PubMed] [Google Scholar]
[18].Rieke F, Warland D, de Ruyter van Steveninck R, and Bialek W, Spikes: Exploring the Neural Code. Cambridge, MA: MIT Press, 1996. [Google Scholar]
[19].Lennie P, “The cost of cortical computation,” Curr. Biol, vol. 13, no. 6, pp. 493–497, 2003. [DOI] [PubMed] [Google Scholar]
[20].LeCun Y, Bottou L, Bengio Y, and Haffner P, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998. [Google Scholar]
[21].Jain AK, Duin RPW, and Mao J, “Statistical pattern recognition: A review,” IEEE Trans. Pattern Anal. Machine Intell, vol. 22, no. 1, pp. 4–37, Jan. 2000. [Google Scholar]
[22].Ahlswede RF and Csiszár I, “Hypothesis testing with communication constraints,” IEEE Trans. Inf. Theory, vol. IT-32, no. 4, pp. 533–542, Jul. 1986. [Google Scholar]
[23].Han TS and Amari SI, “Statistical inference under multiterminal data compression,” IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2300–2324, Oct. 1998. [Google Scholar]
[24].Ishwar P, Prabhakaran VM, and Ramchandran K, “Toward a theory for video coding using distributed compression principles,” in Proc. Int. Conf. Image Processing (ICIP 2), Barcelona, Spain, Sept. 2003, pp. 687–690. [Google Scholar]
[25].Ishwar P, Prabhakaran VM, and Ramchandran K, “On joint classification and compression in a distributed source coding framework,” in Proc. IEEE Workshop on Statistical Signal Processing, St. Louis, MO, 2003, pp. 25–28. [Google Scholar]
[26].Oehler KL and Gray RM, “Combining image compression and classification using vector quantization,” IEEE Trans. Pattern Anal. Machine Intell, vol. 17, no. 5, pp. 461–473, May 1995. [Google Scholar]
[27].Hinton GE and Salakhutdinov RR, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006. [DOI] [PubMed] [Google Scholar]
[28].Dong Y, Chang S, and Carin L, “Rate-distortion bound for joint compression and classification with application to multi-aspect sensing,” IEEE Sensors J, vol. 5, no. 3, pp. 481–492, Jun. 2005. [Google Scholar]
[29].Csiszár I and Körner J, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic, 1981. [Google Scholar]
[30].Cover TM and Thomas JA, Elements of Information Theory. New York: Wiley, 1990. [Google Scholar]
[31].Berger T, “Multiterminal source coding,” in The Information Theory Approach to Communications, Longo G, Ed. New York: Springer-Verlag, 1977. [Google Scholar]
[32].Tung SY, “Multiterminal Source Coding,” Ph.D. dissertation, Cornell Univ., Ithaca, NY, 1978. [Google Scholar]
[33].Berger T, The Information Theory Approach to Communications. New York: Springer-Verlag, 1977, ch. Multiterminal Source Coding. [Google Scholar]
[34].Shannon CE, “A mathematical theory of communication,” Bell Syst. Tech. J, vol. 27, pp. 379–423, 623–656, 1948. [Google Scholar]
[35].Westover MB, “Image Representation and Pattern Recognition in Brains and Machines,” Ph.D. dissertation, Washington Univ., St.Loouis, MO, 2004. [Google Scholar]
[36].Olshausen BA and Field DJ, “Sparse coding with and overcomplete basis set: A strategy employed by {V}1?,” Vision Res, vol. 37, pp. 3311–3325, 1997. [DOI] [PubMed] [Google Scholar]
[37].Sejnowski TJ and Bell AJ, “The “independent components” of natural scenes are edge filters,” Vision Res, vol. 37, pp. 3327–3338, 1997. [DOI] [PMC free article] [PubMed] [Google Scholar]
[38].Hyvärinen A, “Survey on independent component analysis,” Neural Comput. Sur, vol. 2, pp. 94–128, 1999. [Google Scholar]
[39].Schwartz O and Simoncelli EP, “Natural signal statistics and sensory gain control,” Nature Neurosc, vol. 4, no. 8, pp. 819–825, Aug. 2001. [DOI] [PubMed] [Google Scholar]
[40].O’Sullivan JA, Singla N, and Westover MB, “Successive refinement for pattern recognition,” in Proc. IEEE Information Theory Workshop, Punta del Este, Uruguay, 2006, pp. 141–145. [Google Scholar]
[41].Srivastava A, Lee AB, Simoncelli EP, and Zhu SC, “On advances in statistical modeling of natural images,” J. Math. Imaging and Vision, vol. 18, no. 1, pp. 17–33, Jan. 2003. [Google Scholar]
[42].Lee AB, Mumford D, and Huang J, “Occlusion models for natural images: A statistical study of a scale-invariant dead leaves model,” Int.J. Comp. Vision, vol. 41, pp. 35–59, 2001. [Google Scholar]
[43].Zhu SC, “Statistical modeling and conceptualization of visual patterns,” IEEE Trans. Pattern Anal. Machine Intell, vol. 25, no. 6, pp. 691–712, Jun. 2003. [Google Scholar]
[44].Heeger DJ and Bergen JR, “Pyramid-based texture analysis/synthesis,” in Proc. 22nd Annu. Conf. Computer Graphics and Interactive Techniques, 1995, pp. 229–238, ACM Press. [Google Scholar]
[45].Zhu SC, Wu YN, and Mumford D, “Filters, random fields and maximum entropy (frame): Toward a unified theory for texture modeling,” Int. J. Comp. Vision, vol. 27, no. 2, pp. 107–126, 1998. [Google Scholar]
[46].De Bonet JS and Viola PA, “A nonparametric multi-scale statistical model for natural images,” in Advances in Neural Information Processing Systems, Jordan MI, Kearns MJ, and Solla SA, Eds. Cambridge, MA: MIT Press, 1998, vol. 10. [Google Scholar]
[47].Portilla J and Simoncelli EP, “A parametric texture model based on joint statistics of complex wavelet coefficients,” Int. J. Comp. Vision, vol. 40, no. 1, pp. 49–71, 2000. [Google Scholar]
[48].Jelinek F, Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press, 1999. [Google Scholar]
[49].Blanchard G and Geman D, “Hierarchical testing designs for pattern recognition,” Ann. Statist, vol. 33, pp. 1155–1202, 2005. [Google Scholar]
[50].Servetto S, Multiterminal Source Coding With Two Encoders—I: A Computable Outer Bound Apr. 2006. [Online]. Available: http://www.arxiv.org/abs/cs.IT/0604005/

[R1] [1].Barlow HB, “Sensory mechanisms, the reduction of redundancy, and intelligence,” in Proc. Symp. Mechanization of Thought Processes, 1959, pp. 537–574. [Google Scholar]

[R2] [2].Van Essen DC and Anderson CH, “Information processing strategies and pathways in the primate retina and visual cortex,” in An Introduction to Neural and Electronic Networks, Lau C, Zornetzer SF, and Davis JL, Eds. San Diego, CA: Academic, 1995, pp. 43–72. [Google Scholar]

[R3] [3].Eddington AS, The Mathematical Theory of Relativity, 3rd ed Cambridge, U.K.: Cambridge Univ. Press, 1963. [Google Scholar]

[R4] [4].Grenander U, Elements of Pattern Theory, ser Johns Hopkins Series in the Mathematical Sciences. Baltimore, MD: Johns Hopkins Univ. Press, 1996. [Google Scholar]

[R5] [5].Mumford D, “Pattern theory: A unifying perspective,” in Perception as Bayesian Inference, Knill DC and Richards W, Eds. Cambridge, U.K.: Cambridge Univ. Press, 1996, ch. 1, pp. 25–62. [Google Scholar]

[R6] [6].Schmid JA and O’Sullivan NA, “Performance prediction methodology for biometric systems using a large deviations approach,” IEEE Trans. Signal Process, vol. 52, no. 10, pp. 3036–3045, Oct.. [Google Scholar]

[R7] [7].Bishop CM, Pattern Recognition and Machine Learning. New York: Springer, 2006. [Google Scholar]

[R8] [8].MacKay DJC, Information Theory, Inference, and Learning Algorithms. Cambridge, U.K.: Cambridge Univ. Press, 2003. [Online]. Available: http://www.inference.phy.cam.ac.uk/mackay/itila/ [Google Scholar]

[R9] [9].Kearns MJ and Vazirani UV, An Introduction to Computational Learning Theory. Cambridge, MA: MIT Press, 1994. [Google Scholar]

[R10] [10].Vapnik VN, Statistical Learning Theory. New York: Wiley, 1998. [Google Scholar]

[R11] [11].Grenander U and Miller M, Pattern Theory: From Representation to Inference. New York: Oxford Univ. Press, 2006. [Google Scholar]

[R12] [12].Burges CJC, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998. [Google Scholar]

[R13] [13].Barlow HB, “The coding of sensory messages,” in Current Problems in Animal Behavior, Thorpe WH and Zangwill OL, Eds. Cambridge, U.K.: Cambridge Univ. Press, 1961. [Google Scholar]

[R14] [14].Barlow HB, “Understanding natural vision,” in Physical and Biological Processing of Images, ser. Springer series in Information Sceinces, Braddick OJ and Sleigh AC, Eds. Berlin, Germany: Springer-Verlag, 1983, vol. 11, pp. 2–14. [Google Scholar]

[R15] [15].Barlow H, “What is the computational goal of the neocortex?,” in Large Scale Neuronal Theories of the Brain, Koch C and Davis JL, Eds. Cambridge, MA: MIT Press, 1994, ch. 1, pp. 1–22. [Google Scholar]

[R16] [16].Barlow H, “Redundancy reduction revisited,” Network-Computa. Neural Syst, vol. 12, no. 3, pp. 241–53, 2001. [PubMed] [Google Scholar]

[R17] [17].Gardner-Medwin AR and Barlow HB, “The limits of counting accuracy in distributed neural representations,” Neural Comput, vol. 13, no. 3, pp. 477–504, 2001. [DOI] [PubMed] [Google Scholar]

[R18] [18].Rieke F, Warland D, de Ruyter van Steveninck R, and Bialek W, Spikes: Exploring the Neural Code. Cambridge, MA: MIT Press, 1996. [Google Scholar]

[R19] [19].Lennie P, “The cost of cortical computation,” Curr. Biol, vol. 13, no. 6, pp. 493–497, 2003. [DOI] [PubMed] [Google Scholar]

[R20] [20].LeCun Y, Bottou L, Bengio Y, and Haffner P, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998. [Google Scholar]

[R21] [21].Jain AK, Duin RPW, and Mao J, “Statistical pattern recognition: A review,” IEEE Trans. Pattern Anal. Machine Intell, vol. 22, no. 1, pp. 4–37, Jan. 2000. [Google Scholar]

[R22] [22].Ahlswede RF and Csiszár I, “Hypothesis testing with communication constraints,” IEEE Trans. Inf. Theory, vol. IT-32, no. 4, pp. 533–542, Jul. 1986. [Google Scholar]

[R23] [23].Han TS and Amari SI, “Statistical inference under multiterminal data compression,” IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2300–2324, Oct. 1998. [Google Scholar]

[R24] [24].Ishwar P, Prabhakaran VM, and Ramchandran K, “Toward a theory for video coding using distributed compression principles,” in Proc. Int. Conf. Image Processing (ICIP 2), Barcelona, Spain, Sept. 2003, pp. 687–690. [Google Scholar]

[R25] [25].Ishwar P, Prabhakaran VM, and Ramchandran K, “On joint classification and compression in a distributed source coding framework,” in Proc. IEEE Workshop on Statistical Signal Processing, St. Louis, MO, 2003, pp. 25–28. [Google Scholar]

[R26] [26].Oehler KL and Gray RM, “Combining image compression and classification using vector quantization,” IEEE Trans. Pattern Anal. Machine Intell, vol. 17, no. 5, pp. 461–473, May 1995. [Google Scholar]

[R27] [27].Hinton GE and Salakhutdinov RR, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006. [DOI] [PubMed] [Google Scholar]

[R28] [28].Dong Y, Chang S, and Carin L, “Rate-distortion bound for joint compression and classification with application to multi-aspect sensing,” IEEE Sensors J, vol. 5, no. 3, pp. 481–492, Jun. 2005. [Google Scholar]

[R29] [29].Csiszár I and Körner J, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic, 1981. [Google Scholar]

[R30] [30].Cover TM and Thomas JA, Elements of Information Theory. New York: Wiley, 1990. [Google Scholar]

[R31] [31].Berger T, “Multiterminal source coding,” in The Information Theory Approach to Communications, Longo G, Ed. New York: Springer-Verlag, 1977. [Google Scholar]

[R32] [32].Tung SY, “Multiterminal Source Coding,” Ph.D. dissertation, Cornell Univ., Ithaca, NY, 1978. [Google Scholar]

[R33] [33].Berger T, The Information Theory Approach to Communications. New York: Springer-Verlag, 1977, ch. Multiterminal Source Coding. [Google Scholar]

[R34] [34].Shannon CE, “A mathematical theory of communication,” Bell Syst. Tech. J, vol. 27, pp. 379–423, 623–656, 1948. [Google Scholar]

[R35] [35].Westover MB, “Image Representation and Pattern Recognition in Brains and Machines,” Ph.D. dissertation, Washington Univ., St.Loouis, MO, 2004. [Google Scholar]

[R36] [36].Olshausen BA and Field DJ, “Sparse coding with and overcomplete basis set: A strategy employed by {V}1?,” Vision Res, vol. 37, pp. 3311–3325, 1997. [DOI] [PubMed] [Google Scholar]

[R37] [37].Sejnowski TJ and Bell AJ, “The “independent components” of natural scenes are edge filters,” Vision Res, vol. 37, pp. 3327–3338, 1997. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R38] [38].Hyvärinen A, “Survey on independent component analysis,” Neural Comput. Sur, vol. 2, pp. 94–128, 1999. [Google Scholar]

[R39] [39].Schwartz O and Simoncelli EP, “Natural signal statistics and sensory gain control,” Nature Neurosc, vol. 4, no. 8, pp. 819–825, Aug. 2001. [DOI] [PubMed] [Google Scholar]

[R40] [40].O’Sullivan JA, Singla N, and Westover MB, “Successive refinement for pattern recognition,” in Proc. IEEE Information Theory Workshop, Punta del Este, Uruguay, 2006, pp. 141–145. [Google Scholar]

[R41] [41].Srivastava A, Lee AB, Simoncelli EP, and Zhu SC, “On advances in statistical modeling of natural images,” J. Math. Imaging and Vision, vol. 18, no. 1, pp. 17–33, Jan. 2003. [Google Scholar]

[R42] [42].Lee AB, Mumford D, and Huang J, “Occlusion models for natural images: A statistical study of a scale-invariant dead leaves model,” Int.J. Comp. Vision, vol. 41, pp. 35–59, 2001. [Google Scholar]

[R43] [43].Zhu SC, “Statistical modeling and conceptualization of visual patterns,” IEEE Trans. Pattern Anal. Machine Intell, vol. 25, no. 6, pp. 691–712, Jun. 2003. [Google Scholar]

[R44] [44].Heeger DJ and Bergen JR, “Pyramid-based texture analysis/synthesis,” in Proc. 22nd Annu. Conf. Computer Graphics and Interactive Techniques, 1995, pp. 229–238, ACM Press. [Google Scholar]

[R45] [45].Zhu SC, Wu YN, and Mumford D, “Filters, random fields and maximum entropy (frame): Toward a unified theory for texture modeling,” Int. J. Comp. Vision, vol. 27, no. 2, pp. 107–126, 1998. [Google Scholar]

[R46] [46].De Bonet JS and Viola PA, “A nonparametric multi-scale statistical model for natural images,” in Advances in Neural Information Processing Systems, Jordan MI, Kearns MJ, and Solla SA, Eds. Cambridge, MA: MIT Press, 1998, vol. 10. [Google Scholar]

[R47] [47].Portilla J and Simoncelli EP, “A parametric texture model based on joint statistics of complex wavelet coefficients,” Int. J. Comp. Vision, vol. 40, no. 1, pp. 49–71, 2000. [Google Scholar]

[R48] [48].Jelinek F, Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press, 1999. [Google Scholar]

[R49] [49].Blanchard G and Geman D, “Hierarchical testing designs for pattern recognition,” Ann. Statist, vol. 33, pp. 1155–1202, 2005. [Google Scholar]

[R50] [50].Servetto S, Multiterminal Source Coding With Two Encoders—I: A Computable Outer Bound Apr. 2006. [Online]. Available: http://www.arxiv.org/abs/cs.IT/0604005/

PERMALINK

Achievable Rates for Pattern Recognition

M Brandon Westover

Joseph A O’Sullivan

Roles

Abstract

I. Introduction

Fig. 1.

II. Informal Problem Statement

A. Pattern Rate

B. Sensory Data Compression Rate

C. Memory Data Compression Rate

D. Image Formation and Testing

E. Interpretations of the Problem Formulation

Optimization views.

Regarding “n.”

III. Related Work

A. Machine Learning Approaches

B. Related Work in Combined Data Compression and Inference

IV. Problem Statement

A. Notation

B. Definitions and Assumptions

Memorization phase:

Testing phase

V. Main results

VI. Discussion of the Main Results

A. The Gap Between Bounds

B. Relationship With Distributed Source Coding

Fig. 2.

C. Degenerate Cases

Sharp memory, sharp eyesight.

Sharp memory, poor eyesight.

Poor memory, sharp eyesight.

VII. Examples

A. Binary Case

1). Numerical Results:

Fig. 3.

Fig. 4.

2). Conjectured Formulas:

3). Rationale for the Inner Bound (14):

Fig. 5.

4). Rationale for the Outer Bound (15):

B. Gaussian Case

Fig. 6.

VIII. Conclusion

Acknowledgment

Appendix A. Proof of the Inner Bound

Fig. 7.

A. Performance Analysis

1). Error Events:

Appendix B. Proof of the Outer Bound

Step 1:

Step 2:

Step 3:

Appendix C. Convexity of the Outer Bound

Appendix D. Mixing Lemmas

Appendix E. Alternative Representations of Rin and Rout

Appendix F. Simplification of Convex Hulls

Appendix G. Properties of Gaussian Mutual Information

Footnotes

Contributor Information

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

Appendix E. Alternative Representations of $R_{in}$ and $R_{out}$