Published in final edited form as: IEEE Trans Inf Theory. 2008 Jan 4;54(1):299–320. doi: 10.1109/tit.2007.911296

Achievable Rates for Pattern Recognition

M Brandon Westover 1, Joseph A O’Sullivan 2

Abstract

Biological and machine pattern recognition systems face a common challenge: Given sensory data about an unknown pattern, classify the pattern by searching for the best match within a library of representations stored in memory. In many cases, the number of patterns to be discriminated and the richness of the raw data force recognition systems to internally represent memory and sensory information in a compressed format. However, these representations must preserve enough information to accommodate the variability and complexity of the environment, otherwise recognition will be unreliable. Thus, there is an intrinsic tradeoff between the amount of resources devoted to data representation and the complexity of the environment in which a recognition system may reliably operate.

In this paper, we describe a mathematical model for pattern recognition systems subject to resource constraints, and show how the aforementioned resource–complexity tradeoff can be characterized in terms of three rates related to the number of bits available for representing memory and sensory data, and the number of patterns populating a given statistical environment. We prove single-letter information-theoretic bounds governing the achievable rates, and investigate in detail two illustrative cases where the pattern data is either binary or Gaussian.

Index Terms—Distributed source coding, multiterminal information theory, pattern recognition

I. Introduction

PATTERN recognition is the problem of inferring the state of an environment from incoming and previously stored data. In real-world operating environments, the volume of raw data available often exceeds a recognition system’s resources for data storage and representation. Consequently, data stored in memory only partially summarizes the properties of patterns, and internal representations of incoming sensory data are likewise imperfect approximations. In other words, pattern recognition is frequently a problem of inference from compressed data. However, excessive compression precludes reliable recognition. This apparent tradeoff raises a fundamental question: In a given environment, what are the least amounts of memory data and sensory data consistent with reliable pattern recognition?

The paper is organized as follows. In Section II, we introduce the general problem informally. Relationships between the present work and other pattern recognition research are briefly described in Section III. In Section IV, we formalize our problem as that of determining which combinations of three key rates are achievable, that is, determining which rate combinations allow the theoretical possibility of reliable pattern recognition. These rates quantify the information available for representing memory and sensory data, and the number of distinct patterns which the recognition system can discriminate. Our main results are single-letter formulas providing inner and outer bounds on the set of achievable rates, presented in Section V. In Section VI, we consider some instructive special cases of the main results, and compare our results to those for the related problem of distributed source coding. In Section VII, we explore explicit formulas for the bounds in two special binary and Gaussian cases. Section VIII contains concluding remarks. Proofs for most of the results are placed in the Appendices. The entire discussion is organized around the block diagram in Fig. 1.

Fig. 1. Pattern recognition subject to data compression.

II. Informal Problem Statement

In this section, we use an imagined example to motivate the mathematical model studied in the later technical sections. Suppose that our pattern recognition system consists of a homunculus living inside the head of some animal. The homunculus has access to a video monitor which displays data captured by the animal’s retinas, and a set of index cards for storing information about the patterns in the environment relevant to survival, constituting a “memory.” The homunculus must identify each pattern by comparing viewed images with information stored in memory. These identifications are then used to guide the animal’s behavior. Let us consider which factors govern the difficulty of our homunculus’ task.

A. Pattern Rate

First, the number of patterns that must be discriminated, $M_c$, obviously cannot exceed the number of images registerable on the animal's retinas, which depends in turn on the number of retinal photoreceptors and the number of distinct signaling states of each photoreceptor. Denoting the state of the retinas as $Y = (Y_1, Y_2, \ldots, Y_n)$, where each $Y_i$ takes values in a finite alphabet $\mathcal{Y}$, the number of possible retinal images is $|\mathcal{Y}|^n$. In a very simple animal with $n = 8$ photoreceptors, each able only to distinguish "bright" ($Y = 1$) from "dark" ($Y = 0$) (so $\mathcal{Y} = \{0,1\}$), the absolute upper limit on $M_c$ would be $|\mathcal{Y}|^n = 2^8 = 256$.

With higher resolution eyes (larger $n$), an exponential explosion in the number of possible images rapidly overwhelms memory and computational resources. In humans, for whom $|\mathcal{Y}| \approx 256$ and $n \approx 2 \times 10^6$ [1], [2], $|\mathcal{Y}|^n \sim 10^{2{,}408{,}200}$, far exceeding estimates of the number of particles in the universe [3]. Fortunately, two features of real-world pattern recognition intervene: First, sensory data exhibits strong statistical structure, $p(\mathbf{y})$, so that the vast majority of the $|\mathcal{Y}|^n$ possible images are never experienced.1 Second, much of the animal's visual experience can be filtered out as irrelevant to survival. Thus, we express the number of patterns our homunculus must discriminate as $M_c = |\mathcal{Y}|^{nR_c'}$, where $R_c'$ is called the pattern rate, $0 < R_c' < 1$, and generally $M_c \ll |\mathcal{Y}|^n$. Equivalently, we can express $M_c$ in binary units, as $M_c = 2^{nR_c}$, in which case $R_c = R_c'\log_2|\mathcal{Y}|$.
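To make the unit conventions concrete, the short Python snippet below converts between these quantities for the simple eight-photoreceptor animal described above. The pattern count $M_c = 16$ is a hypothetical value chosen only for illustration; the relation $R_c = R_c'\log_2|\mathcal{Y}|$ is checked numerically.

```python
import math

# Simple animal from the text: n = 8 binary photoreceptors.
n = 8
alphabet_size = 2                      # |Y| = 2 ("bright"/"dark")
max_images = alphabet_size ** n        # |Y|^n = 256

# Hypothetical pattern count (must satisfy M_c <= |Y|^n).
M_c = 16

# Pattern rate in |Y|-ary units: M_c = |Y|^(n R'_c)  =>  R'_c = log_|Y|(M_c) / n
R_c_prime = math.log(M_c, alphabet_size) / n   # = 0.5 here

# Pattern rate in bits: M_c = 2^(n R_c)  =>  R_c = log2(M_c) / n
R_c = math.log2(M_c) / n                        # = 0.5 bits/symbol

# Consistency with R_c = R'_c * log2|Y|
assert abs(R_c - R_c_prime * math.log2(alphabet_size)) < 1e-12
print(max_images, R_c_prime, R_c)
```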

B. Sensory Data Compression Rate

Our homunculus accesses sensory data indirectly through a video monitor that has limited display capacity. That is, whereas the retinas can be in up to $|\mathcal{Y}|^n$ distinct signaling states, the homunculus' internal monitor can display at most $M_y = 2^{nR_y} \le |\mathcal{Y}|^n$ images, where $R_y$ is thus the sensory data compression rate. Analogous data reductions arise in real recognition problems for various reasons, both computational (e.g., dimensionality reduction, sparsification, regularization, or other "feature extraction" operations), and economic (e.g., energy constraints, processing time constraints, storage limitations). We represent the transformation from retinal data $Y$ to video data $J$ by the action of an encoder $\phi$, resulting in the displayed data $J = \phi(Y)$.

This sensory data compression step places another restriction on the number of discriminable patterns, so that, in general, $M_c \le M_y \le |\mathcal{Y}|^n$; or $R_c \le R_y \le \log_2|\mathcal{Y}|$.

C. Memory Data Compression Rate

The job of our homunculus is to recognize patterns. More formally, the homunculus must assign to each viewed image $Y$ one of $M_c$ class labels, which we take to be integers $\mathcal{M}_c = \{1, 2, \ldots, M_c\}$. As pre-job training, we imagine the homunculus studies a set of labeled class prototypes or "templates," $T(w) = (X(w), w)$, $w \in \{1, 2, \ldots, M_c\} = \mathcal{M}_c$, each drawn from a distribution $p(x)$, where each template has dimensionality identical to that of the sensory data, $X(w) = (X_1(w), \ldots, X_n(w))$.

The homunculus creates index cards, on which it writes the class labels and descriptive information about each class template. However, the number of cards and the amount of information per card are limited, allowing only a compressed summary of the available data. We represent the information memorized about a class, $M(w)$, as the output of an encoder $f$, i.e., $M(w) = (I(w), w) = f(T(w))$, where $I(w)$ is the compressed description of $X(w)$ and $w$ is the memorized class label. The degree of compression is quantified by specifying either the number of index cards comprising the memory, $M_x$, or by a compression rate (given in bits) $R_x = \frac{1}{n}\log M_x$.

As above, memory data compression restricts the number of discriminable patterns $M_c$, so that $M_c \le M_x \le |\mathcal{X}|^n$; or, in terms of rates, $R_c \le R_x \le \log_2|\mathcal{X}|$.

D. Image Formation and Testing

The “testing” phase of our homunculus-driven pattern recognition system involves two processes: image formation and recognition.

Image formation proceeds as follows. Nature selects a pattern class $W \in \mathcal{M}_c$ at random, then generates an image $Y$ which is registered on the animal's retinas. (The class label $W$ is not observable by the homunculus.) We model the image formation process as the transmission of the class template $X(w)$ through a random channel $p(y|x)$. The retinal image $Y$ thus represents a "signature" of the underlying pattern $X(w)$, and the channel $p(y|x)$ represents two types of difficulties intrinsic to most real-world pattern recognition problems: signature variation (differences in the sensory data generated on repeated viewings of the same underlying pattern); and signature ambiguities (distinct patterns may produce similar signatures).2

The homunculus receives the compressed sensory data $J = \phi(Y)$, compares it with the memory data $\mathcal{C}_u = \{M(1), \ldots, M(M_c)\}$, and finally reports the class label of the best match, $\hat{W}$. The inference procedure used to make these comparisons can reflect knowledge of the pattern source $p(x)$ and image formation process $p(y|x)$, but at the time of testing, it must be specified so as to depend only on the available data, i.e., $g$ must be a function only of $\mathcal{C}_u$ and $J$: $\hat{W} = g(J, \mathcal{C}_u)$. We judge the homunculus' performance by the probability of error $P_e$. We will consider the system reliable if for some acceptable $\epsilon$ it achieves $P_e \le \epsilon$.

E. Interpretations of the Problem Formulation

We have now introduced the basic elements of our problem, which is to determine the rate combinations (Rx, Ry, Rc) compatible with the possibility of reliable pattern recognition systems, where a “reliable” system is one for which the probability of recognition error can be made arbitrarily small. To summarize, these basic elements are 1) a model for the underlying patterns, consisting of the number of patterns Mc=2nRc, a set of class labels w ∈ {1, …, Mc}, and the class prototypes X(w) together with their generative model p(x); 2) a model of the channel connecting class prototypes to the sensory data p(y|x); and 3) budgets specifying the number of bits allowed for representing sensory data Ry and memory data Rx inside the system.

We pause here to consider a few different possible perspectives on the problem under study.

Optimization views.

From an optimization point of view, we can ask our central question in two different but equivalent ways: Given the pattern rate Rc, what are the least amounts of sensory and memory data Ry and Rx, needed for reliable pattern recognition? Alternatively, given fixed information budgets for memory and sensory data representation Rx and Ry, what is the maximum achievable pattern recognition rate Rc?

Regarding “n.”

Second, the problem has a different “feel” depending on whether one views the data dimensionality n either as a fixed or an increasing parameter. In the preceding discussion, we have primarily taken the static view, in which there are a fixed number Mc=2nRc of patterns, or “states of nature” of interest, and the problem is to investigate how many memory states Mx and sensory states My are needed to recognize them reliably. Alternatively, we may regard n as a dynamic, increasing parameter. Biologically, allowing n to increase might correspond to studying a series of animals with increasingly better eyes and memory organs. In engineering applications, the increase might correspond to building a sequence of machines with progressively higher camera resolution and data storage capacities [6]. Obviously, if while increasing n we hold the bit-budgets Rx and Ry fixed, then the number of memory and sensory states available for data representation grows exponentially, Mx=2nRx, My=2nRy. Less obviously, the maximum number of discriminable patterns also grows exponentially,3 with a constant rate Rc, i.e., Mc=2nRc. The “fixed n” and “increasing n” perspectives correspond to the familiar, complementary mathematical methods of proving a given inequality, respectively, either by the “adversarial” approach (given any ϵ, choose n large enough…); or the “asymptotic approach” (take the limit as n → ∞…).

An important final point regarding $n$ is that, like many results in information theory, our results rely on asymptotic arguments. Thus, we prove the results valid only for "sufficiently large $n$," depending in turn on an $\epsilon$ corresponding to the tolerable error rate. The needed magnitude of $n$ for a given $\epsilon$ (i.e., the issue of error exponents) will depend on the application, and is an important open problem.

III. Related Work

A. Machine Learning Approaches

Pattern recognition is a central topic in machine learning [7]–[10]. The machine learning approach to pattern recognition centers around the following problem: Given a set of labeled sensory data $\mathcal{D} = \{Y(i), w(i),\ i = 1, \ldots, N\}$, we wish to find a rule $g$ that predicts the labels for future sensory data, i.e., if $Y$ is in fact a signature of pattern class $w \in \{1, 2, \ldots, M_c\}$, we want $\Pr(g(Y) \ne W) \le \epsilon$ for some acceptable $\epsilon > 0$. Broadly speaking, two competing approaches dominate the literature. In the "generative modeling" approach, one attempts to estimate the distribution underlying the data, $p(w, y)$, and then to use the conditional distribution $p(w|y)$ to infer $w$ from $Y$, i.e., $\hat{w} = g(Y) = \arg\max_{w \in \mathcal{M}_c} p(w|Y)$. Alternatively, in the "discriminative" approach, one attempts to learn the optimal decision region boundaries directly, without estimating $p(w, y)$.

Our problem formulation resonates with the "generative modeling" approach, in that we allow the homunculus access to $p(w, y)$.4 Informally, such knowledge might come from allowing a very large volume of training data. In that regime, the distinction between generative and discriminative approaches may become practically unimportant, as in many instances either approach can achieve asymptotically optimal performance.

In any case, in the present work we are not directly concerned with the problem of classifier learning. Rather, we investigate the conditions under which reliable classifiers can exist at all, regardless of how they are designed or learned; we describe performance bounds to which all pattern recognition systems are subject.

It is also worth pointing out the distinction between the machine learning concept of "Vapnik–Chervonenkis (VC) dimension" and $M_c$ in the present work. Informally, the VC dimension is the largest number of points that can be shattered by a given family of classifiers (see [10], [12] for a detailed description). As such, VC dimension is a measure of the complexity of the decision boundaries that can be fit with a given family of classifiers. In contrast, $M_c$ in our work is the number of patterns or pattern classes that can be distinguished, with no constraints on the family of classifiers.

B. Related Work in Combined Data Compression and Inference

Neuroscientist Horace Barlow has argued for more than four decades that data compression is an essential principle underlying learning and intelligent behavior in animal brains (see, e.g., [13]–[17]). Barlow and many others have amassed substantial experimental evidence showing efficient data coding mechanisms at work in the sensory systems of diverse animals, including monkeys, cats, frogs, crickets, and flies [18]. More recently, data compression is gaining appreciation as a mechanism for managing metabolic energy costs in neural systems [19].

In the engineering pattern recognition literature, data compression usually arises indirectly in the context of feature extraction, i.e., techniques for transforming raw data such that “irrelevant” data is discarded and the residual data is rendered into some advantageous format which facilitates storage and comparison, and is robust (“invariant”) with respect to signature variations [20], [21]. In the information theory literature, probably the first direct investigation of the interplay between data compression and statistical inference is due to Ahlswede and Csiszár [22]. In [23] Han and Amari reviewed work up through 1998 on rate-constrained inference problems, including hypothesis testing, pattern recognition, and parameter estimation. Recently, Ishwar et al. have studied the problem of joint classification and reconstruction of sensory data subject to a fidelity constraint, in the context of video coding [24], [25]. In contrast to the problem studied in this paper, in that work there is no data compression constraint on memory data. Work on practical algorithms for joint classification and data compression includes [26]–[28].

IV. Problem Statement

We now proceed to the formal presentation of the main results.

A. Notation

We adopt the following notational conventions. Random variables are denoted by capital letters (e.g., $U$), their values by lowercase letters (e.g., $u$), and their alphabets by script capital letters (e.g., $\mathcal{U}$). Sequences of symbols are denoted either by boldface letters or with a superscript, e.g., $\mathbf{u} = u^n = (u_1, u_2, \ldots, u_n)$. The probability distribution for a random variable $U \in \mathcal{U}$ is denoted by $p_U(u)$, or simply $p(u)$ when the implied subscript is clear from the context. Entropy, mutual information, and conditional mutual information are denoted in the usual ways, e.g., for random variables $U$, $V$, $W$, we write $H(U)$, $I(U;V)$, and $I(U;V|W)$, respectively. All logarithms are understood to be base two, i.e., $\log = \log_2$. Finally, to express statements such as "$X$ and $Z$ are conditionally independent given $Y$," i.e., $p(x, y, z) = p(y)\,p(x|y)\,p(z|y)$, we write "$X - Y - Z$ form a Markov chain," or simply $X - Y - Z$.

B. Definitions and Assumptions

Definition 1: The environment $\mathcal{E}$ for a pattern recognition system is a set of eight objects

$$\mathcal{E} = \{\mathcal{M}_c, \mathcal{X}, \mathcal{Y}, p(x), p(y|x), p(w), \mathcal{C}_x, \Phi\}$$

where

  • $\mathcal{M}_c$, $\mathcal{X}$, $\mathcal{Y}$ are finite alphabets;

  • $p(w)$ and $p(\mathbf{xy}) = p(\mathbf{x})\,p(\mathbf{y}|\mathbf{x})$ are probability distributions over $\mathcal{M}_c$ and $\mathcal{X}^n \times \mathcal{Y}^n$, respectively;

  • $\mathcal{C}_x = \{T(1), \ldots, T(M_c)\}$ is a set of pairs $T(w) = (X(w), w)$ of random vectors $X(w)$ drawn independent and identically distributed (i.i.d.) $\sim p(x)$, labeled by $w \in \{1, 2, \ldots, M_c\} = \mathcal{M}_c$;

  • $\Phi$ is a mapping from labels $\mathcal{M}_c$ to vectors in $\mathcal{C}_x$, $\Phi: \mathcal{M}_c \to \mathcal{C}_x$, $\Phi(w) = X(w)$.

We make the following simplifications:

  • the distribution over class labels is uniform, $p(w) = 1/|\mathcal{M}_c|$ for all $w \in \mathcal{M}_c$;

  • the pattern components are i.i.d., $p(\mathbf{x}) = \prod_{i=1}^n p(x_i)$;

  • the observation channel is memoryless, $p(\mathbf{y}|\mathbf{x}) = \prod_{i=1}^n p(y_i|x_i)$.

Definition 2: An (Mc, Mx, My, n) pattern recognition code for an environment E consists of three sets of integers

$$\mathcal{M}_c = \{1, \ldots, M_c\}, \qquad \mathcal{M}_x = \{1, \ldots, M_x\}, \qquad \mathcal{M}_y = \{1, \ldots, M_y\}$$

and three mappings

$$f: \mathcal{X}^n \times \mathcal{M}_c \to \mathcal{M}_x \times \mathcal{M}_c, \quad f(t(w)) = f(\mathbf{x}, w) = (i, w) \equiv m(w); \qquad \phi: \mathcal{Y}^n \to \mathcal{M}_y, \quad \phi(\mathbf{y}) = j; \qquad g: \mathcal{M}_y \times (\mathcal{M}_x)^{M_c} \to \mathcal{M}_c, \quad g(j, \mathcal{C}_u) = \hat{w}$$

where Cu denotes the result of applying f to the entries of Cx

$$\mathcal{C}_u = \{f(T(1)), \ldots, f(T(M_c))\} = \{m(1), \ldots, m(M_c)\}.$$

We call Cx the pattern templates; f, the memory encoder; Cu, the memorized data; ϕ, the sensory encoder; and g, the recognition function or classifier.

Definition 3: The operation of a pattern recognition system (“agent”) implementing a given (Mc, Mx, My, n) pattern recognition code (f, ϕ, g) for an environment E is defined in terms of the following events.

Memorization phase:

  • The agent observes Cx, and uses f to compute the memory data Cu.

  • Access to Cx is taken away, and thereafter the agent knows of Cx only what is retained in Cu.

Testing phase:

  • Nature selects an index W ~ p(w).

  • Nature encodes the pattern according to X(W) = Φ(W).

  • The pattern $X(W)$ passes through the channel $p(y|x)$, giving rise to an observable signal $Y$.

  • The agent computes $J = \phi(Y)$.

  • The agent infers $W$ by computing $\hat{W} = g(J, \mathcal{C}_u)$.

With respect to the events just described, the probability of error for a code (f, ϕ, g) in E is

$$P_e^n(w) = \Pr(\hat{W} \ne w \mid W = w)$$

and the average probability of error of the code is

$$P_e^n = \sum_{w \in \mathcal{M}_c} p(w)\Pr(\hat{W} \ne w \mid W = w) = \frac{1}{M_c}\sum_{w \in \mathcal{M}_c} P_e^n(w).$$
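As an illustration of the memorization/testing protocol just described, here is a minimal Monte Carlo sketch that estimates $P_e^n$ for a toy binary environment. It is not the coding scheme analyzed in this paper: the memory and sensory encoders are identity maps (no compression), the classifier is a minimum-Hamming-distance rule (maximum likelihood for a binary-symmetric channel with $q < 1/2$), and all parameter values are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_error_rate(n=32, R_c=0.25, q=0.1, trials=2000):
    """Monte Carlo estimate of P_e for a toy binary recognition setup.

    Sketch only: identity encoders (no compression) and a minimum
    Hamming distance classifier; parameters are illustrative.
    """
    M_c = int(2 ** (n * R_c))                    # number of pattern classes
    errors = 0
    for _ in range(trials):
        # Memorization phase: draw templates X(1..M_c) i.i.d. Bernoulli(1/2).
        templates = rng.integers(0, 2, size=(M_c, n))
        # Testing phase: Nature picks W uniformly, sends X(W) through BSC(q).
        w = rng.integers(M_c)
        noise = (rng.random(n) < q).astype(int)
        y = templates[w] ^ noise
        # Recognition: report the template closest to Y in Hamming distance.
        w_hat = np.argmin((templates ^ y).sum(axis=1))
        errors += int(w_hat != w)
    return errors / trials

print(simulate_error_rate())
```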

Definition 4: The rate R = (Rc, Rx, Ry) of an (Mc, Mx, My, n) code is

$$R_c = \frac{1}{n}\log_2 M_c, \qquad R_x = \frac{1}{n}\log_2 M_x, \qquad R_y = \frac{1}{n}\log_2 M_y$$

where the units are bits-per-symbol.

Definition 5: A rate $R = (R_c, R_x, R_y)$ is achievable in a recognition environment $\mathcal{E}$ if for any $\epsilon > 0$ and for $n$ sufficiently large, there exists an $(M_c, M_x, M_y, n)$ code $(f, \phi, g)$ with rates

$$R_c' = \frac{1}{n}\log M_c, \qquad R_x' = \frac{1}{n}\log M_x, \qquad R_y' = \frac{1}{n}\log M_y$$

such that $R_c' \ge R_c$, $R_x' \le R_x$, $R_y' \le R_y$, and $P_e^n < \epsilon$.

Definition 6: The achievable rate region R for a recognition environment E is the set of all achievable rates R = (Rc, Rx, Ry).

Our ultimate goal in an information-theoretic analysis of this problem is to characterize the achievable rate region R in a way that does not involve the unbounded parameter n, that is, to exhibit a single-letter characterization of R.

V. Main results

In this section, we present inner and outer bounds on the achievable rate region $\mathcal{R}$. The bounds are expressed in terms of sets of "auxiliary" random variable pairs $UV$, defined below. In these definitions, $U$ and $V$ are assumed to take values in finite alphabets $\mathcal{U}$ and $\mathcal{V}$ and to have a well-defined joint distribution with the "given" random variables $XY$. To each such pair of auxiliary random variables $UV$ we associate a set of rates

$$\mathcal{R}_{UV} = \big\{R :\ R_x \ge I(U;X),\ R_y \ge I(V;Y),\ R_c \le R_x + R_y - I(XY;UV)\big\}.$$

Next, define two sets of random variable pairs

$$\mathcal{P}_{\text{in}} = \{UV :\ U - X - Y,\ X - Y - V,\ U - (XY) - V\},$$

and

$$\mathcal{P}_{\text{out}} = \{UV :\ U - X - Y,\ X - Y - V\}.$$

We will also sometimes summarize the independence constraints in $\mathcal{P}_{\text{in}}$ as a single "long" Markov chain $U - X - Y - V$.

Next, define two additional sets of rates

$$\mathcal{R}_{\text{in}} = \{R : R \in \mathcal{R}_{UV} \text{ for some } UV \in \mathcal{P}_{\text{in}}\}, \qquad \mathcal{R}_{\text{out}} = \{R : R \in \mathcal{R}_{UV} \text{ for some } UV \in \mathcal{P}_{\text{out}}\}$$

and denote the convex hull of $\mathcal{R}_{\text{in}}$ by $\overline{\mathcal{R}}_{\text{in}}$.

Our main results are the following.

Theorem 1 (Inner Bound):

$$\mathcal{R}_{\text{in}} \subseteq \mathcal{R}.$$

That is, every rate $R \in \mathcal{R}_{\text{in}}$ is achievable.

Theorem 2 (Better Inner Bound):

$$\overline{\mathcal{R}}_{\text{in}} \subseteq \mathcal{R}.$$

That is, every rate $R \in \overline{\mathcal{R}}_{\text{in}}$ is achievable.

Theorem 3 (Outer Bound):

$$\mathcal{R} \subseteq \mathcal{R}_{\text{out}}.$$

That is, no rate $R \notin \mathcal{R}_{\text{out}}$ is achievable.

Finally, to ensure computability, we include a cardinality bound.

Theorem 4: Regions $\mathcal{R}_{\text{in}}$ and $\mathcal{R}_{\text{out}}$ are unchanged if we restrict the cardinalities of $\mathcal{U}$ and $\mathcal{V}$ to

$$|\mathcal{U}|,\ |\mathcal{V}| \le |\mathcal{X}|\,|\mathcal{Y}| + 2.$$

Theorem 4 is a simple consequence of the Support Lemma [29 (p. 310)]: we must have $|\mathcal{X}||\mathcal{Y}| - 1$ letters to ensure preservation of $p(xy|uv)$, and three additional letters to satisfy the constraints on $I(X;U)$, $I(Y;V)$, and $I(XY;UV)$.

Remark 1: If either X = U or Y = V, or both, then the outer bound collapses to the inner bound, since in this case the extra Markov condition U – (X, Y) – V in the definition of Pin is extraneous. For example, if U = X, then the condition is equivalent to I(U; V | XY) = I(X; V | XY) = 0, which is obviously true. Similar comments apply if U and V are any deterministic functions of X and Y, e.g., if V = γ(Y), then I(U; V | XY) = I(U; γ(Y)|XY) = 0.

Remark 2: The bounds Rin and Rout can be expressed in various ways. For example, it is not difficult to show that the following replacements for RUV lead to the same sets of rates Rin and Rout:

$$\mathcal{R}'_{UV} = \big\{R :\ R_x \ge I(U;X),\ R_y \ge I(V;Y),\ R_c \le I(U;V) - I(U;V|XY)\big\} \quad (1)$$
$$\mathcal{R}''_{UV} = \big\{R :\ R_c \le I(U;V) - I(U;V|XY),\ R_x \ge I(XY;U|V) + R_c,\ R_y \ge I(XY;V|U) + R_c,\ R_x + R_y \ge I(XY;UV) + R_c\big\}. \quad (2)$$

That is, if we define

$$\mathcal{R}'_* = \{R : R \in \mathcal{R}'_{UV} \text{ for some } UV \in \mathcal{P}_*\}, \qquad \mathcal{R}''_* = \{R : R \in \mathcal{R}''_{UV} \text{ for some } UV \in \mathcal{P}_*\}$$

where $*$ stands for either "in" or "out"; then $\mathcal{R}_{\text{in}} = \mathcal{R}'_{\text{in}} = \mathcal{R}''_{\text{in}}$ and $\mathcal{R}_{\text{out}} = \mathcal{R}'_{\text{out}} = \mathcal{R}''_{\text{out}}$. These equivalences are proved in Appendix E, and are used in Sections VI-B and VII.

Remark 3: In general, $\mathcal{R}_{\text{in}}$ is not a convex set, as evidenced by the examples studied in Section VII. Thus, $\overline{\mathcal{R}}_{\text{in}}$ is in fact an improvement on $\mathcal{R}_{\text{in}}$. $\mathcal{R}_{\text{out}}$ is a convex set, as shown in Appendix C.

The proofs for Theorems 2 and 3 appear in Appendices A and B. Theorem 1 follows immediately from Theorem 2. In sketch form, the method we use to prove achievability (the inner bound), based on $\mathcal{R}'_{UV}$ (1), is as follows. We represent the memory and sensory data using codewords $U(i)$, $i \in \{1, \ldots, 2^{nR_x}\}$, and $V(j)$, $j \in \{1, \ldots, 2^{nR_y}\}$, that are typical according to $p(u)$ and $p(v)$, respectively, and the recognition system stores a list of these codewords. Making $R_x \ge I(X;U)$ and $R_y \ge I(Y;V)$ provides enough $U$'s and $V$'s to "cover" $\mathcal{X}^n$ and $\mathcal{Y}^n$. During pretesting, the system matches each of the labeled template patterns $(X(w), w)$, $w = 1, \ldots, M_c$, presented to it with a unique memory codeword, and attaches to this codeword the corresponding class label (with matching defined in the sense of joint typicality according to $p(xu)$). The resulting set of $M_c$ "active," labeled codewords constitutes the system's memory. During subsequent testing, suppose Nature selects class $w$, generating sensory data $Y \sim p(y|X(w))$. The system receives the index $J$ of the codeword for $Y$, and uses it to retrieve the sensory codeword $V(J)$. The system can then narrow down the list of $M_c$ active memory codewords by a factor of $2^{nI(U;V)}$ using knowledge of $p(uv)$.5 Thus, the correct memory vector $U(w)$ can be uniquely identified so long as $M_c = 2^{nR_c} \le 2^{nI(U;V)}$, i.e., if $R_c \le I(U;V)$.

It is also possible to prove the achievability result using a binning argument, which induces the set $\mathcal{R}''_{UV}$ (2): Generate $2^{nI(X;U)}$ $U$'s and $2^{nI(Y;V)}$ $V$'s, and divide these equally among roughly $2^{nR_x}$ and $2^{nR_y}$ bins, respectively. A pattern $X(w)$ is encoded in memory by searching for a bin containing a matching (jointly typical) codeword $U(X)$, and the $M_c = 2^{nR_c}$ bins thus selected are each assigned the class label $w$ of the pattern stored therein. Sensory data $Y$ is encoded as the bin index of a matching codeword $V(Y)$. This number of $U$'s and $V$'s is sufficient to ensure that any given pair $X$ and $Y$ will have a matching (jointly typical) $U(X)$ and $V(Y)$, and the Markov lemma ensures joint typicality of the quadruple $(U(X), X, Y, V(Y))$. Given encoded sensory data $J = \phi(Y)$, recognition is done by comparing the roughly $2^{nI(Y;V)}/2^{nR_y}$ sensory codewords in bin $J$ with the $2^{nI(X;U)}/2^{nR_x}$ memory codewords in each of the $2^{nR_c}$ memory bins, then reporting the class label $\hat{w}$ assigned to the bin containing the matching memory codeword. No matches other than the correct one, $(U(X), V(Y))$, will be found provided the number of $(U, V)$ comparisons grows exponentially with $n$ at a rate less than $I(U;V)$, that is, provided $I(Y;V) - R_y + I(X;U) - R_x + R_c \le I(U;V)$, which simplifies to the rate-sum constraint $R_x + R_y \ge I(XY;UV) + R_c$. The "side" constraints $R_x \ge I(XY;U|V) + R_c$ and $R_y \ge I(XY;V|U) + R_c$ then follow from requiring that each bin contain at least one codeword. The final inequality $R_c \le I(U;V)$ follows from the first three.

VI. Discussion of the Main Results

A. The Gap Between Bounds

In general, there is a gap between $\mathcal{R}_{\text{in}}$ and $\mathcal{R}_{\text{out}}$, so that $\mathcal{R}_{\text{in}} \subseteq \mathcal{R} \subseteq \mathcal{R}_{\text{out}}$. This gap is due to the different constraints in the definitions of $\mathcal{P}_{\text{in}}$ and $\mathcal{P}_{\text{out}}$: Whereas distributions in $\mathcal{P}_{\text{in}}$ satisfy three independence constraints $U - X - Y$, $X - Y - V$, and $U - (X,Y) - V$ (equivalently, the single "long chain" constraint $U - X - Y - V$), distributions in $\mathcal{P}_{\text{out}}$ need satisfy only the first two "short chain" constraints.

Further insight into the nature of the gap can be gained by attempting to construct $\mathcal{P}_{\text{out}}$ by combining distributions from $\mathcal{P}_{\text{in}}$ in various ways, and then considering whether the resulting distributions can be used to expand the achievable rate region.6 We consider two such constructions. In both, let $Q \in \mathcal{Q}$ be a finite random variable, independent of $X$ and $Y$. Holding $p(xy)$ fixed, to describe a pair of auxiliary random variables $U \in \mathcal{U}$, $V \in \mathcal{V}$ with joint distribution $p(xyuv)$, we need only to specify the conditional distribution $p(uv|xy)$. Consider the following two sets:

$$\mathcal{P}_{\text{mix}} = \Big\{UV : p(uv|xy) = \sum_{q \in \mathcal{Q}} p(q)\,p(u|xq)\,p(v|yq)\Big\} \quad (3)$$
$$\mathcal{P}_{\text{conv}} = \big\{UV : U = (U_Q, Q),\ V = (V_Q, Q),\ U_qV_q \in \mathcal{P}_{\text{in}}\ \forall\, q \in \mathcal{Q}\big\}. \quad (4)$$

In words, $\mathcal{P}_{\text{mix}}$ is the set of $UV$ whose distributions $p(uv|xy)$ can be constructed as "mixtures" of product marginals; and $\mathcal{P}_{\text{conv}}$ is the set of "convexifying" random variables (this terminology is explained below).7 In both of these sets it is possible to have dependencies between $U$ and $V$ given $XY$; i.e., in general $p(uv|xy) \ne p(u|x)\,p(v|y)$, hence, in general, $\mathcal{P}_{\text{mix}}, \mathcal{P}_{\text{conv}} \not\subseteq \mathcal{P}_{\text{in}}$.

There is a gap similar to the one under discussion between the best known bounds for the distributed source coding problem (DSC), established by Berger and Tung (see Section VI-B). In both problems, the bounds are given in terms of sets with independence (Markov) constraints identical to those in $\mathcal{P}_{\text{in}}$ and $\mathcal{P}_{\text{out}}$.8 Berger has suggested that in the DSC problem the gap is due to the fact that $\mathcal{P}_{\text{out}}$ admits convex mixtures of product marginal distributions, whereas $\mathcal{P}_{\text{in}}$ does not; i.e., in our notation, $\mathcal{P}_{\text{in}} \subseteq \mathcal{P}_{\text{mix}} \subseteq \mathcal{P}_{\text{out}}$ [31]. The inclusion $\mathcal{P}_{\text{mix}} \subseteq \mathcal{P}_{\text{out}}$ is verified by checking $U - X - Y$ and $X - Y - V$: Write

$$p(u|xy) = \sum_v p(uv|xy) = \sum_q p(q)\,p(u|xq)\sum_v p(v|yq) = \sum_q p(q)\,p(u|xq) \equiv p(u|x)$$

hence, $U - X - Y$; and a symmetric calculation shows $X - Y - V$. While clearly $\mathcal{P}_{\text{mix}}$ is a larger set than $\mathcal{P}_{\text{in}}$, it is unclear whether the admission of mixtures can account for all of the gap between $\mathcal{P}_{\text{in}}$ and $\mathcal{P}_{\text{out}}$. That is, we know of no proof that $\mathcal{P}_{\text{mix}} = \mathcal{P}_{\text{out}}$. Moreover, we know of no way to use auxiliary random variables from $\mathcal{P}_{\text{mix}}$ in achievability arguments.

It is also straightforward to verify that the second set Pconv is contained in Pout

$$I(U;Y|X) = I(U_Q, Q; Y|X) = I(Q;Y|X) + I(U_Q;Y|X,Q) \overset{(a)}{=} 0 + \sum_q p(q)\,I(U_q;Y|X,Q=q) \overset{(b)}{=} 0$$

(where the reasons are: (a) $Q$ is independent of $X$ and $Y$, and (b) $U_qV_q \in \mathcal{P}_{\text{in}}$); hence, $U - X - Y$; and a symmetric calculation shows $X - Y - V$. $\mathcal{P}_{\text{conv}}$ has a form sometimes introduced in time-sharing arguments, as a means to convexify a given rate region. For example, for $UV \in \mathcal{P}_{\text{conv}}$, we have

$$I(XY;UV) = I(XY;U_Q,V_Q,Q) = I(XY;Q) + I(XY;U_QV_Q|Q) \overset{(a)}{=} \sum_q p(q)\,I(XY;U_qV_q|Q=q) = \sum_q p(q)\,I(XY;U_qV_q)$$

(where (a) is because Q is independent of X and Y) and similarly I(X; U) = Σqp(q)I(X; Uq), and I(Y; V) = Σqp(q)I(Y; Vq). It follows that the convex hull of Rin may be represented as

$$\overline{\mathcal{R}}_{\text{in}} = \{R : R \in \mathcal{R}_{UV} \text{ for some } UV \in \mathcal{P}_{\text{conv}}\}. \quad (5)$$

In contrast to Pmix, auxiliary variables from Pconv can be used as the basis for standard achievability arguments, as we have done in the proof of Theorem 2 (see Appendix A). From this, we have the following logical statement:

$$\text{If } \mathcal{P}_{\text{out}} = \mathcal{P}_{\text{conv}}, \text{ then } \mathcal{R}_{\text{out}} = \overline{\mathcal{R}}_{\text{in}}. \quad (6)$$

Unfortunately, we have no proof that $\mathcal{P}_{\text{out}} = \mathcal{P}_{\text{conv}}$. Notwithstanding, in Subsection VII-A, we examine one case where it appears that $\mathcal{R}_{\text{out}} = \overline{\mathcal{R}}_{\text{in}}$ does hold, giving grounds to conjecture that this equality may hold at least under special conditions.

B. Relationship With Distributed Source Coding

There are interesting connections between the results of Tung and Berger [32], [33] for the DSC problem and our results in Theorems 1 and 3. Briefly, the situation treated in the DSC problem is as follows (see Fig. 2). Two correlated sequences, $X$ and $Y$, are encoded separately as $i = f(X)$, $j = \phi(Y)$, and the decoder $g$ must reproduce the original sequences subject to a fidelity constraint $(E d_x(\hat{X}, X),\, E d_y(\hat{Y}, Y)) \le D$, where $D = (D_x, D_y)$. The problem is to characterize, for any given distortion $D$, the set of achievable rates $\tilde{\mathcal{R}}(D)$.

Fig. 2. The distributed source coding problem.

The best known inner and outer bounds for the DSC problem can be expressed as follows.9 Let Pin and Pout be defined as above, and define two new sets incorporating the distortion constraint

$$\mathcal{P}_{\text{in}}(D) = \mathcal{P}_{\text{in}} \cap \mathcal{P}_{UV}(D), \qquad \mathcal{P}_{\text{out}}(D) = \mathcal{P}_{\text{out}} \cap \mathcal{P}_{UV}(D)$$

where

$$\mathcal{P}_{UV}(D) = \big\{UV : \exists\ \hat{X}(U,V),\ \hat{Y}(U,V) \text{ s.t. } (E d_x(\hat{X},X),\,E d_y(\hat{Y},Y)) \le D\big\}.$$

Paralleling (2), also define the sets of rates

$$\tilde{\mathcal{R}}_{UV} = \big\{R :\ R_x \ge I(XY;U|V),\ R_y \ge I(XY;V|U),\ R_x + R_y \ge I(XY;UV)\big\} \quad (7)$$

and

$$\tilde{\mathcal{R}}_{\text{in}}(D) = \{R : R \in \tilde{\mathcal{R}}_{UV} \text{ for some } UV \in \mathcal{P}_{\text{in}}(D)\}, \qquad \tilde{\mathcal{R}}_{\text{out}}(D) = \{R : R \in \tilde{\mathcal{R}}_{UV} \text{ for some } UV \in \mathcal{P}_{\text{out}}(D)\}.$$

Then the Berger–Tung bounds for the DSC problem are $\tilde{\mathcal{R}}_{\text{in}}(D) \subseteq \tilde{\mathcal{R}}(D)$ and $\tilde{\mathcal{R}}(D) \subseteq \tilde{\mathcal{R}}_{\text{out}}(D)$.

There are strong formal similarities between our bounds and the DSC bounds. Most importantly, the gap between bounds for both problems is due to the difference between the length-four constraint $U - X - Y - V$ and the less stringent length-three constraints $U - X - Y$, $X - Y - V$. Further, note the formal similarity between the sets $\tilde{\mathcal{R}}_{UV}$ (7) and $\mathcal{R}''_{UV}$ (2). To carry this comparison further, suppose in the problem under study that, in addition to recognizing patterns, we also wish to reproduce an estimate of the original signals subject to a fidelity constraint, as in the DSC problem.10 Denote the achievable rate region for this "joint recognition and recovery" problem by $\mathcal{R}(D)$. Making this addition in fact adds little technical difficulty, and the resulting bounds can be expressed, not surprisingly, as $\mathcal{R}_{\text{in}}(D) \subseteq \mathcal{R}(D)$ and $\mathcal{R}(D) \subseteq \mathcal{R}_{\text{out}}(D)$, where

$$\mathcal{R}_{\text{in}}(D) = \{R : R \in \mathcal{R}_{UV} \text{ for some } UV \in \mathcal{P}_{\text{in}}(D)\}, \qquad \mathcal{R}_{\text{out}}(D) = \{R : R \in \mathcal{R}_{UV} \text{ for some } UV \in \mathcal{P}_{\text{out}}(D)\}.$$

Apparently, the pattern recognition problem can be construed as a kind of generalization of the DSC problem, with the added complication that the "decoder" receives with $Y$ not one sequence $X$ but $M_c = 2^{nR_c}$ such sequences $X(1), \ldots, X(M_c)$, and must first determine which is the appropriate one with which to jointly decode $Y$. This extra discrimination evidently requires that extra information be included at the encoders. This "rate excess" is the difference between the minimum encoding rates required for the DSC and pattern recognition problems.11 Comparing $\mathcal{R}''_{UV}$ (2) with $\tilde{\mathcal{R}}_{UV}$ (7), this rate excess is the same at both encoders, and is equal to $R_c$. Thus, $R_c$ can be interpreted as the number of extra bits needed at both encoders to decide which of the possible $M_c = 2^{nR_c}$ patterns $X(w)$ the sensory data $Y$ represents, beyond the information required to simply reproduce the pair $Y$, $X(w)$ within the allowed distortion limits.

C. Degenerate Cases

We now briefly examine the degenerate cases where either $X = U$, or $Y = V$, or both. In these cases, $I(U;V|XY) = 0$. Hence, using (1), we see that the inner and outer bounds on $\mathcal{R}$ both reduce to the three inequalities $R_x \ge I(U;X)$, $R_y \ge I(V;Y)$, $R_c \le I(U;V)$. Clearly, in these cases the bounds are tight, in that the inner and outer bounds are equal; there is no gap (see Remark 1). These degenerate cases have simple interpretations and are thus useful for building intuition about Theorems 1–3.

Sharp memory, sharp eyesight.

First, consider a system in which the budgets for memory and sensory representations are unrestricted, i.e., no compression is required. In this case, we can effectively treat the memories and sensory representations as veridical; i.e., we can set $U = X$ and $V = Y$. The theorem constraints then become $R_x \ge I(X;X) = H(X)$, $R_y \ge I(Y;Y) = H(Y)$, and

$$R_c \le I(U;V) = I(X;Y). \quad (8)$$

This result indicates that, in the absence of compression, the recognition problem is formally equivalent to the following classical communication problem: Transmit one of Mc=2nRc possible messages (patterns) to a receiver (the recognition module) [6]. In this case, the patterns can be thought of as random codewords stored without compression and available to the decoder; Shannon’s random coding for communication [30], [34] applies, yielding the mutual information I(X; Y) (see (8)) as the bound on Rc.

Sharp memory, poor eyesight.

Next, suppose that memory is effectively unlimited, so that we can put U = X, but sensory data may be compressed. In this case, we can readily rewrite the condition on Rc as

$$R_c \le I(X;Y) - I(X;Y|V) = I(X;V). \quad (9)$$

We check the extreme cases: If $V$ is fully informative about $Y$, $Y = \phi^{-1}(V)$, then $I(X;Y|V) = H(Y|V) - H(Y|X,V) = 0$, and we recover the case discussed above, $R_c \le I(X;Y)$. For intermediate cases, where $V$ is partially informative, the effect of $V$ is to degrade the achievable performance of the system below that possible with "perfect senses," and the reduction incurred is $I(X;Y|V)$. In the extreme case that $V$ is utterly uninformative (e.g., a constant $V = 0$, or otherwise independent of $Y$), $I(X;Y|V) = I(X;Y)$, and we get $R_c = 0$, or $M_c = 2^{nR_c} = 1$; hence, the system is useless.

Poor memory, sharp eyesight.

In the case of limited memory but unrestricted resources for sensory data representation (V = Y), we get an expression symmetric with the previous case

$$R_c \le I(X;Y) - I(X;Y|U) = I(U;Y). \quad (10)$$

As before, if the memory is perfect ($U = X$), we get $I(X;Y|U) = I(X;Y|X) = 0$, recovering the channel coding constraint $R_c \le I(X;Y)$; assuming useless memories ($U = 0$) yields $R_c \le I(X;Y) - I(X;Y) = 0$; and intermediate cases place the system between these extremes.
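These three degenerate bounds are easy to evaluate for the binary-symmetric setup used later in Section VII-A (uniform binary $X$, BSC($q$) observation channel). The sketch below, with arbitrary illustrative values of $q$, $q_x$, $q_y$, models imperfect senses as $V = Y \oplus W_y$ and imperfect memory as $U = X \oplus W_x$; these particular compressors are one illustrative choice, not the only one.

```python
import math

def h(p):
    """Binary entropy in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def conv(a, b):
    """Binary convolution a * b = a(1-b) + b(1-a)."""
    return a * (1 - b) + b * (1 - a)

# Illustrative numbers (not from the paper): uniform binary X, BSC(q) channel.
q, q_x, q_y = 0.1, 0.05, 0.05

# Sharp memory, sharp eyesight: Rc <= I(X;Y) = 1 - h(q).
print("perfect:    ", 1 - h(q))

# Sharp memory, poor eyesight (V = Y xor W_y): Rc <= I(X;V) = 1 - h(q * q_y).
print("poor eyes:  ", 1 - h(conv(q, q_y)))

# Poor memory, sharp eyesight (U = X xor W_x): Rc <= I(U;Y) = 1 - h(q * q_x).
print("poor memory:", 1 - h(conv(q, q_x)))
```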

VII. Examples

In this section, we investigate the achievable rate regions for binary and Gaussian versions of our problem. For this purpose, it will be convenient to characterize the sets $\mathcal{R}$, $\mathcal{R}_{\text{in}}$, and $\mathcal{R}_{\text{out}}$ by their surfaces in the positive orthant $\mathbb{R}_+^3$. The surface of $\mathcal{R}$ can be expressed as

$$r(r_x,r_y) = \max_{R \in \mathcal{C}(r_x,r_y)} R_c, \qquad \mathcal{C}(r_x,r_y) = \{R : R \in \mathcal{R},\ R_x = r_x,\ R_y = r_y\}.$$

Similarly, by direct extension of Theorems 1 and 3, and using the representation of Rin and Rout based on RUV (1), the surfaces of Rin and Rout are

$$r_{\text{in}}(r_x,r_y) = \max_{UV \in \mathcal{C}_{\text{in}}(r_x,r_y)} I(U;V) - I(U;V|XY) \quad (11)$$
$$r_{\text{out}}(r_x,r_y) = \max_{UV \in \mathcal{C}_{\text{out}}(r_x,r_y)} I(U;V) - I(U;V|XY) \quad (12)$$

where

$$\mathcal{C}_{\text{in}}(r_x,r_y) = \{UV \in \mathcal{P}_{\text{in}} : I(U;X) = r_x,\ I(V;Y) = r_y\}, \qquad \mathcal{C}_{\text{out}}(r_x,r_y) = \{UV \in \mathcal{P}_{\text{out}} : I(U;X) = r_x,\ I(V;Y) = r_y\}.$$

The expression for the inner bound surface (11) reduces to

$$r_{\text{in}}(r_x,r_y) = \max_{UV \in \mathcal{C}_{\text{in}}(r_x,r_y)} I(U;V).$$

An alternative expression for the outer bound surface, which will be used in Subsection VII-B and is based on $\mathcal{R}_{UV}$, is

$$r_{\text{out}}(r_x,r_y) = r_x + r_y - \min_{UV \in \mathcal{C}_{\text{out}}(r_x,r_y)} I(XY;UV). \quad (13)$$

Finally, denote the convex hulls of the inner and outer bound surfaces by $\overline{r_{\text{in}}}(r_x,r_y)$ and $\overline{r_{\text{out}}}(r_x,r_y)$.

In the specific cases studied in the following examples we seek to convert these implicit characterizations into explicit formulas not involving the optimization over Cin(rx,ry) and Cout(rx,ry).

A. Binary Case

We first investigate the inner and outer bound surfaces for a case in which the template patterns and sensory data alphabets are binary, $\mathcal{X} = \mathcal{Y} = \{0,1\}$. Let the template patterns $X = (X_1, \ldots, X_n)$ consist of $n$ independent drawings from a uniform Bernoulli distribution, $X_i \sim B(1/2)$, $i = 1, \ldots, n$, and let the sensory data $Y = (Y_1, \ldots, Y_n)$ be the output of a binary-symmetric channel with crossover probability $q$, $p(y|x) = q^{\bar\delta(x,y)}\,\bar{q}^{\,\delta(x,y)}$, where $\bar{q} = 1 - q$; $\bar\delta(x,y) = 1 - \delta(x,y)$; and $\delta(x,y) = 1$ if $x = y$, and otherwise $\delta(x,y) = 0$. Equivalently, we can represent $Y$ as $Y = X \oplus W$, where $W \sim B(q)$ is independent of $X$.

1). Numerical Results:

We have taken two approaches to studying the surfaces of $\mathcal{R}_{\text{in}}$ and $\mathcal{R}_{\text{out}}$ for this binary case. First, we carried out the optimizations in (11) and (12) numerically. This calculation was via a Monte Carlo method which executed a dense random sampling of the set of probability distributions $p(uv|xy)$ associated with $\mathcal{P}_{\text{in}}$ and $\mathcal{P}_{\text{out}}$.12 For each sample $p(uv|xy)$, we calculated $I(X;U)$, $I(Y;V)$, and $I(XY;UV)$; then, for each value of $r_x, r_y \in [0,1]$, the numerical estimate of $r_{\text{in}}(r_x,r_y)$ or $r_{\text{out}}(r_x,r_y)$ was the largest sample value found by the Monte Carlo search for $r_x + r_y - I(XY;UV)$. From here on, we denote the numerical surface estimates by $\hat{r}_{\text{in}}(r_x,r_y)$ and $\hat{r}_{\text{out}}(r_x,r_y)$.
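The following Python sketch reconstructs the flavor of this Monte Carlo procedure for the inner-bound surface only: it samples product conditionals $p(u|x)p(v|y)$ with $|\mathcal{U}| = |\mathcal{V}| = 2$ (which lie in $\mathcal{P}_{\text{in}}$ by construction, so the objective reduces to $I(U;V)$) and keeps, for each grid point $(r_x, r_y)$, the best value among samples whose encoding rates fit under $(r_x, r_y)$. It is an illustration of the method described above, not the authors' code, and the sample count and grid size are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def mi(p):
    """Mutual information (bits) from a 2-D joint pmf."""
    pa = p.sum(axis=1, keepdims=True)
    pb = p.sum(axis=0, keepdims=True)
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / (pa * pb)[m])))

def inner_surface_mc(q=0.2, samples=20000, grid=21):
    """Monte Carlo sketch of the binary inner-bound surface estimate."""
    p_xy = 0.5 * np.array([[1 - q, q], [q, 1 - q]])        # p(x) p(y|x)
    pts = []
    for _ in range(samples):
        pu_x = rng.random((2, 2))
        pu_x /= pu_x.sum(axis=1, keepdims=True)            # random p(u|x)
        pv_y = rng.random((2, 2))
        pv_y /= pv_y.sum(axis=1, keepdims=True)            # random p(v|y)
        # p(x, y, u, v) = p(x, y) p(u|x) p(v|y)
        p = p_xy[:, :, None, None] * pu_x[:, None, :, None] * pv_y[None, :, None, :]
        rx = mi(p.sum(axis=(1, 3)))                        # I(X;U)
        ry = mi(p.sum(axis=(0, 2)))                        # I(Y;V)
        ruv = mi(p.sum(axis=(0, 1)))                       # I(U;V)
        pts.append((rx, ry, ruv))
    pts = np.array(pts)
    axis = np.linspace(0, 1, grid)
    surf = np.zeros((grid, grid))
    for i, rx in enumerate(axis):
        for j, ry in enumerate(axis):
            ok = (pts[:, 0] <= rx) & (pts[:, 1] <= ry)
            surf[i, j] = pts[ok, 2].max() if ok.any() else 0.0
    return axis, surf

# Example usage: axis, surf = inner_surface_mc()
```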

The cardinality bound in Theorem 4 is not necessarily tight. Therefore, to assess the alphabet sizes required of $U$ and $V$ for the binary case, we performed our numerical experiments for increasing values of $|\mathcal{U}|$ and $|\mathcal{V}|$. For the inner bound surface, we found $|\mathcal{U}| = |\mathcal{V}| = 2$ was sufficient: no further increase in $\hat{r}_{\text{in}}(r_x,r_y)$ was afforded by allowing $|\mathcal{U}|, |\mathcal{V}| = 3, 4$. For the outer bound surface, $|\mathcal{U}| = |\mathcal{V}| = 3$ was sufficient.

The surface plots from our numerical experiments are shown in Fig. 3. Fig. 4 shows representations of the distributions $p(uv|xy)$ underlying 25 different points $(r_x, r_y)$ for (a) the inner and (b) the outer bounds, in which probabilities are represented by the area of white squares.13, 14 The row–column format of the matrix $p(uv|xy)$ is $xy = 00, 01, 10, 11$ moving down rows; moving across columns, for the inner bound with $\mathcal{U}, \mathcal{V} = \{0,1\}$ the format is $uv = 00, 01, 10, 11$, whereas for the outer bound with $\mathcal{U}, \mathcal{V} = \{0, 1, e\}$, the column format is $uv = 00, 01, 0e, 10, 11, 1e, e0, e1, ee$. (The choice of $e$ for the third letter of $\mathcal{U}$ and $\mathcal{V}$ is explained below.)

Fig. 3. Contour plots of the binary inner bound surface (a); outer bound surface (b); differences between the outer bound and inner bounds (c); and plot of inner and outer bound surfaces along a diagonal cut, $(r_x(\theta), r_y(\theta)) = \theta(1,1)$, $\theta \in [0,1]$. In these plots $q = 0.2$.

Fig. 4. Hinton diagrams of the maximizing probability distributions $p(uv|xy)$ for 25 values of $(r_x, r_y)$. (a) Distributions for $\mathcal{R}_{\text{in}}$. (b) Distributions for $\mathcal{R}_{\text{out}}$.

2). Conjectured Formulas:

Second, we guessed formulas for the inner and outer bound surfaces, which turned out to fit the numerical results just described. We first present the formulas, then discuss the motivations behind them.

Our formulas involve the following two functions. First, define

$$s(r_x,r_y) = 1 - h(q_x * q * q_y),$$

where

$$q_x = h^{-1}(1 - r_x), \qquad q_y = h^{-1}(1 - r_y);$$

h(·) is the binary entropy function

$$h(x) = -x\log(x) - (1-x)\log(1-x);$$

“*” denotes binary convolution

$$x * y = x(1-y) + y(1-x);$$

and qx, qy ∈ [0, 1/2] to ensure that h(·) is invertible. Next, let s*(rx, ry) denote the upper concave envelope of s(rx, ry)

$$s^*(r_x,r_y) = \sup\ \theta\,s(r_{x1},r_{y1}) + \bar\theta\,s(r_{x2},r_{y2})$$

where $\bar\theta = 1 - \theta$, and the supremum is over all combinations $(\theta, r_{x1}, r_{y1}, r_{x2}, r_{y2})$ such that

$$(r_x,r_y) = \theta\,(r_{x1},r_{y1}) + \bar\theta\,(r_{x2},r_{y2})$$

and each variable in the optimization is restricted to the unit interval [0, 1]. As explained in Appendix F, in both this case and for the corresponding Gaussian formulas in the next section, the expression for this convex hull simplifies to

$$s^*(r_x,r_y) = \sup_\theta\ \theta\,s(r_x', r_y')$$

with the supremum over all combinations $(\theta, r_x', r_y')$ such that

$$(r_x,r_y) = \theta\,(r_x', r_y').$$

Conjecture 1: For the binary case the surfaces of Rin and Rout are

$$r_{\text{in}}(r_x,r_y) = s(r_x,r_y) \quad (14)$$
$$r_{\text{out}}(r_x,r_y) = s^*(r_x,r_y) \quad (15)$$

and the surface of the achievable rate region R is

$$r(r_x,r_y) = \overline{r_{\text{in}}}(r_x,r_y) = r_{\text{out}}(r_x,r_y). \quad (16)$$
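A direct way to explore Conjecture 1 numerically is to evaluate $s(r_x,r_y)$ and its concave envelope from the definitions above. The sketch below uses a simple bisection for $h^{-1}$ and a one-dimensional grid search over $\theta$ for the envelope; both are illustrative numerical choices, not part of the paper.

```python
import math

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p*math.log2(p) - (1-p)*math.log2(1-p)

def h_inv(u):
    """Inverse of h on [0, 1/2], by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(60):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if h(mid) < u else (lo, mid)
    return (lo + hi) / 2

def conv(a, b):
    """Binary convolution a * b."""
    return a*(1 - b) + b*(1 - a)

def s(rx, ry, q=0.2):
    """Conjectured inner-bound surface s(rx, ry) = 1 - h(qx * q * qy)."""
    qx, qy = h_inv(1 - rx), h_inv(1 - ry)
    return 1 - h(conv(conv(qx, q), qy))

def s_star(rx, ry, q=0.2, steps=1000):
    """Upper concave envelope via the one-parameter form sup_theta theta*s(rx/theta, ry/theta)."""
    best = s(rx, ry, q)
    for k in range(1, steps + 1):
        theta = k / steps
        if rx / theta <= 1 and ry / theta <= 1:
            best = max(best, theta * s(rx / theta, ry / theta, q))
    return best

print(s(0.5, 0.5), s_star(0.5, 0.5))
```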

3). Rationale for the Inner Bound (14):

The surfaces rin(rx, ry), rout(rx, ry) are specified in terms of probability distributions p(xyuv) = p(xy)p(uv | xy) that maximize (11) and (12). For rin(rx, ry), the distribution factorizes as p(uv | xy) = p(u | x)p(v | y), and a natural guess is that in the maximizing distribution both p(u | x) and p(v | y) are binary symmetric channels

$$p(u|x) = q_x^{\bar\delta(x,u)}\,\bar{q}_x^{\,\delta(x,u)}, \qquad p(v|y) = q_y^{\bar\delta(y,v)}\,\bar{q}_y^{\,\delta(y,v)};$$

or, equivalently, $U = X \oplus W_x$, $V = Y \oplus W_y$, where $W_x \sim B(q_x)$, $W_y \sim B(q_y)$, and $q_x, q_y \in [0, \tfrac{1}{2}]$; see Fig. 5(a). For this choice of $U$ and $V$ we calculate

$$r_x = I(X;U) = H(X) - H(X|U) = 1 - H(U \oplus W_x \mid U) = 1 - H(W_x) = 1 - h(q_x)$$

and likewise ry = 1 − h(qy). Then

$$I(U;V) = H(V) - H(V|U) = 1 - H(U \oplus W_x \oplus W \oplus W_y \mid U) = 1 - h(q_x * q * q_y) = s(r_x,r_y).$$

Clearly, $s(r_x,r_y)$ is a lower bound on $r_{\text{in}}(r_x,r_y)$, since: 1) $U - X - Y - V$, hence $UV \in \mathcal{P}_{\text{in}}$; 2) $UV \in \mathcal{C}_{\text{in}}(r_x,r_y)$; and 3)

$$r_{\text{in}}(r_x,r_y) = \max_{UV \in \mathcal{C}_{\text{in}}(r_x,r_y)} I(U;V) \ge 1 - h(q_x * q * q_y) = s(r_x,r_y).$$

The converse, $r_{\text{in}}(r_x,r_y) \le s(r_x,r_y)$, is unproven, so the identification of $r_{\text{in}}(r_x,r_y)$ with $s(r_x,r_y)$ remains a conjecture. Nevertheless, in our numerical optimization we found no points outside of this region for any choice of $(r_x,r_y)$, and the distributions which emerge from our computer experiments (Fig. 4(a)) closely resemble the long binary-symmetric channel in the calculation of $s(r_x,r_y)$. This provides strong experimental evidence supporting (14) in Conjecture 1.

Fig. 5. Binary-symmetric channel models for the inner and outer bounds. (a) Model for $r_{\text{in}}(r_x,r_y)$ with $|\mathcal{U}| = |\mathcal{V}| = 2$. (b) Model for $r_{\text{out}}(r_x,r_y)$ with $|\mathcal{U}| = |\mathcal{V}| = 3$. See text for explanation.

4). Rationale for the Outer Bound (15):

Clearly, s*(rx, ry) is a lower bound on rout, since, for all rx, ry ∈ [0, 1]

  • $r_{\text{out}}(r_x,r_y) \ge r_{\text{in}}(r_x,r_y)\ \Rightarrow\ \overline{r_{\text{out}}}(r_x,r_y) \ge \overline{r_{\text{in}}}(r_x,r_y)$;

  • $\mathcal{R}_{\text{out}}$ is convex $\Rightarrow\ \overline{r_{\text{out}}}(r_x,r_y) = r_{\text{out}}(r_x,r_y)$;

  • $r_{\text{in}}(r_x,r_y) \ge s(r_x,r_y)\ \Rightarrow\ \overline{r_{\text{in}}}(r_x,r_y) \ge \bar{s}(r_x,r_y) = s^*(r_x,r_y)$;

and together these imply $r_{\text{out}}(r_x,r_y) \ge s^*(r_x,r_y)$. Unfortunately, we do not have a proof of the converse, $r_{\text{out}}(r_x,r_y) \le s^*(r_x,r_y)$, so the identification of $r_{\text{out}}(r_x,r_y)$ with $s^*(r_x,r_y)$ remains a conjecture. Nevertheless, empirically (i.e., according to our numerical experiments) the outer bound surface is identical to the convex hull of the inner bound surface. Moreover, empirically, the cardinalities required to construct the outer bound are $|\mathcal{U}| = |\mathcal{V}| = 3$.

We can provide an explicit construction of the conjectured outer bound surface and the probability distributions that achieve it as follows. The distributions in this construction also agree with those found empirically, shown in Fig. 4(b). Let $\mathcal{U} = \mathcal{V} = \{0, 1, e\}$. Consider the channel diagrammed in Fig. 5(b), which could be called a "synchronous erasure channel." Here, $U$ and $V$ are generated by first passing $X$ and $Y$ through binary-symmetric channels, followed by an "erasure" event $E \in \{0,1\}$ in which both channel outputs are preserved with probability $\theta = \Pr(E=0)$, or both are erased ($UV = ee$) with probability $\bar\theta = 1 - \theta = \Pr(E=1)$. An explicit formula for this channel is

$$p(uv|xy) = \theta^{\bar\delta_e(u,v)}\,\bar\theta^{\,\delta_e(u,v)}\left(q_x^{\bar\delta(x,u)}\,\bar{q}_x^{\,\delta(x,u)}\,q_y^{\bar\delta(y,v)}\,\bar{q}_y^{\,\delta(y,v)}\right)^{\bar\delta_e(u,v)}\Delta_e(u,v)$$

where

$$\delta(\alpha,\beta) = \begin{cases}1, & \text{if } \alpha = \beta\\ 0, & \text{if } \alpha \ne \beta\end{cases} \qquad
\delta_e(\alpha,\beta) = \begin{cases}1, & \text{if } (\alpha,\beta) = (e,e)\\ 0, & \text{if } (\alpha,\beta) \ne (e,e)\end{cases} \qquad
\Delta_e(\alpha,\beta) = \begin{cases}0, & \text{if } \alpha = e,\ \beta \ne e\\ 0, & \text{if } \alpha \ne e,\ \beta = e\\ 1, & \text{otherwise}\end{cases}$$

and

$$\bar\delta(\alpha,\beta) = 1 - \delta(\alpha,\beta), \qquad \bar\delta_e(\alpha,\beta) = 1 - \delta_e(\alpha,\beta).$$

Equivalently, we can represent $U$ and $V$ as follows. Let $W \sim B(q)$, $W_x \sim B(q_x)$, $W_y \sim B(q_y)$, $E \sim B(\bar\theta)$ be Bernoulli random variables that are independent of each other and of $X$, and define

$$Y = X \oplus W, \qquad U = (X \oplus W_x)\cdot E, \qquad V = (Y \oplus W_y)\cdot E$$

where the multiplication by $E$ is defined by

$$\alpha \cdot E = \begin{cases}\alpha, & \text{if } E = 0\\ e, & \text{if } E = 1.\end{cases}$$

It is straightforward to verify that $UV \in \mathcal{P}_{\text{out}}$. To check $U - X - Y$, write

$$I(U;Y|X) = I\big((X \oplus W_x)\cdot E;\ X \oplus W \mid X\big) = I\big(W_x \cdot E;\ W \mid X\big) = 0$$

where the last step follows from the independence of $W$, $W_x$, and $E$ from each other and from $X$. A similar calculation shows $X - Y - V$. Finally, calculating the rate region surface associated with this choice of $UV$ we get, first,

$$r_x = I(X;U) = I\big(X;\ (X \oplus W_x)\cdot E,\ E\big) = I(X;E) + I\big(X;\ (X \oplus W_x)\cdot E \mid E\big) = 0 + \bar\theta\,I(X;U \mid E=1) + \theta\,I(X;\ X \oplus W_x \mid E=0) = 0 + 0 + \theta(1 - h(q_x))$$

where the last step follows from the previous calculations for the inner bound; and a similar calculation shows $r_y = \theta(1 - h(q_y))$. Then, using $U - X - Y$ and $X - Y - V$ to write

$$I(U;V) - I(U;V|XY) = I(X;U) + I(Y;V) - I(XY;UV) = I(X;U) + I(Y;V) - \big[H(U) + H(V|U) - H(UV|XY)\big]$$

we have (suppressing some detail)

$$H(U) = H(U,E) = H(E) + H(U|E) = h(\theta) + \theta H(U|E=0) + \bar\theta H(U|E=1) = h(\theta) + \theta + 0$$
$$H(V|U) = H(V|U,E) = \theta H(V|U,E=0) + \bar\theta H(V|U,E=1) = \theta\,h(q_x * q * q_y)$$
$$H(UV|XY) = H(E|XY) + H(UV|XY,E) = h(\theta) + \theta\big[H(U|X,E=0) + H(V|Y,E=0)\big] = h(\theta) + \theta\big[h(q_x) + h(q_y)\big].$$

Putting these together and canceling terms

$$I(U;V) - I(U;V|XY) = \theta\big(1 - h(q_x * q * q_y)\big) = s^*(r_x,r_y).$$

Thus, we have constructed an explicit example which achieves $s^*(r_x,r_y)$ with $|\mathcal{U}| = |\mathcal{V}| = 3$.
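The construction above is easy to verify numerically: build $p(uv|xy)$ for the synchronous erasure channel, check a short-chain Markov condition, and compare $I(U;V) - I(U;V|XY)$ against $\theta(1 - h(q_x * q * q_y))$. The following sketch does this for arbitrary illustrative parameter values; it is a check of the construction as reconstructed here, not code from the paper.

```python
import numpy as np

def mi(p):
    """Mutual information (bits) of a 2-D joint pmf (normalized internally)."""
    p = p / p.sum()
    pa, pb = p.sum(1, keepdims=True), p.sum(0, keepdims=True)
    m = p > 0
    return float(np.sum(p[m] * np.log2(p[m] / (pa * pb)[m])))

def h(p):
    return 0.0 if p in (0.0, 1.0) else float(-p*np.log2(p) - (1-p)*np.log2(1-p))

def conv(a, b):
    return a*(1 - b) + b*(1 - a)

def synchronous_erasure(q=0.2, qx=0.1, qy=0.15, theta=0.6):
    """Numerical check of the erasure-channel construction (illustrative)."""
    bsc = lambda eps: np.array([[1-eps, eps], [eps, 1-eps]])
    p_xy = 0.5 * bsc(q)                                   # p(x, y), x, y in {0, 1}
    # p(u, v | x, y) on alphabets {0, 1, e}; index 2 stands for "e".
    p_uv_xy = np.zeros((2, 2, 3, 3))
    for x in range(2):
        for y in range(2):
            p_uv_xy[x, y, :2, :2] = theta * np.outer(bsc(qx)[x], bsc(qy)[y])
            p_uv_xy[x, y, 2, 2] = 1 - theta
    p = p_xy[:, :, None, None] * p_uv_xy                  # p(x, y, u, v)

    i_uv = mi(p.sum(axis=(0, 1)))                         # I(U;V)
    i_uv_given_xy = sum(p_xy[x, y] * mi(p_uv_xy[x, y])
                        for x in range(2) for y in range(2))
    lhs = i_uv - i_uv_given_xy                            # I(U;V) - I(U;V|XY)
    rhs = theta * (1 - h(conv(conv(qx, q), qy)))          # theta(1 - h(qx*q*qy))
    # Markov check: I(U;Y|X) should vanish (U - X - Y).
    i_uy_given_x = sum(p[x].sum() * mi(p[x].sum(axis=2)) for x in range(2))
    return lhs, rhs, i_uy_given_x

print(synchronous_erasure())
```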

B. Gaussian Case

We now consider a Gaussian version of our problem. Let X and Y be zero-mean Gaussian random variables with correlation coefficient ρxy. We propose explicit formulas for the surfaces of Rin and Rout for the Gaussian case, in terms of the following two functions. In both formulas, put

$$r_x = -\tfrac{1}{2}\log\big(1 - \rho_{xu}^2\big), \qquad r_y = -\tfrac{1}{2}\log\big(1 - \rho_{yv}^2\big).$$

Note that these expressions determine the correlation coefficients ρxu and ρyv. Define

$$S(r_x,r_y) = -\tfrac{1}{2}\log\big(1 - \rho_{xy}^2\rho_{yv}^2\rho_{xu}^2\big) \quad (17)$$

and

$$S^*(r_x,r_y) = r_x + r_y + \tfrac{1}{2}\log\left[1 + \frac{2\rho\gamma - \beta}{1 - \rho^2}\right] \quad (18)$$

where

$$\gamma = \rho_{xy}\rho_{xu}\rho_{yv}, \qquad \beta = \rho_{xu}^2 + \rho_{yv}^2 - (1 - \rho_{xy}^2)\rho_{xu}^2\rho_{yv}^2, \qquad \rho = \frac{\beta}{2\gamma} - \sqrt{\Big(\frac{\beta}{2\gamma}\Big)^2 - 1}. \quad (19)$$

Conjecture 2: In the Gaussian case, the surfaces of Rin and Rout are

$$r_{\text{in}}(r_x,r_y) = S(r_x,r_y) \quad (20)$$
$$r_{\text{out}}(r_x,r_y) = S^*(r_x,r_y). \quad (21)$$

Fig. 6 shows plots of the inner and outer bounds and their difference, as well as the difference between the outer bound and the convex hull of the inner bound. Interestingly, unlike the binary case, for the Gaussian case the outer bound is not equal to the convex hull of the inner bound.

Fig. 6. Contour plots of (a) the Gaussian inner bound surface; (b) outer bound surface; (c) difference between the outer bound and inner bounds; and (d) plot of surfaces for the inner bound, its convex hull, and the outer bound along a diagonal cut, $(r_x(\theta), r_y(\theta)) = \theta(1,1)$, $\theta \in [0,1]$. In these plots $\rho_{xy} = 0.8$.

The following proof relies on some basic properties of the mutual information between Gaussian random variables, given as lemmas in Appendix G.

In the analysis that follows, we assume that the maximizing distributions are Gaussian. Under this assumption, we solve the inner and outer bounds. Except for this unproved assumption, the proof of the conjecture is complete.

Proof: (Conjecture 2, eq. (20)) As noted in Appendix G, mutual informations between jointly Gaussian random variables are completely determined by their correlation coefficients. For a length-4 Markov chain $U - X - Y - V$ of jointly Gaussian random variables, $I(U;V|XY) = 0$ and, applying Lemma 9 from Appendix G, we have $\rho_{uv} = \rho_{xu}\rho_{xy}\rho_{yv}$; hence

$$I(U;V) - I(U;V|XY) = -\tfrac{1}{2}\log\big(1 - \rho_{xu}^2\rho_{xy}^2\rho_{yv}^2\big).$$

This mutual information is maximized when the constraints $I(X;U) \le r_x$, $I(Y;V) \le r_y$ are satisfied with equality, hence when $\rho_{xu}$ and $\rho_{yv}$ satisfy $r_x = -\tfrac{1}{2}\log(1 - \rho_{xu}^2)$ and $r_y = -\tfrac{1}{2}\log(1 - \rho_{yv}^2)$.

The following proof for the surface of the outer bound region uses the form of rout(rx, ry) given by (13). In this case, the optimization problem reduces to minimizing I(XY; UV) subject to the length-3 Markov constraints UXY, XYV.

Proof: (Conjecture 2, eq. (21)) Using Lemma 10 from Appendix G, we have

$$C_{xy,uv} = \begin{bmatrix}\rho_{xu} & \rho_{xv}\\ \rho_{yu} & \rho_{yv}\end{bmatrix} = \begin{bmatrix}1 & \rho_{xy}\\ \rho_{xy} & 1\end{bmatrix}\begin{bmatrix}\rho_{xu} & 0\\ 0 & \rho_{yv}\end{bmatrix}.$$

The left-hand matrix in this decomposition is Cxy,xy, denoted hereafter simply as C, and we denote the right-hand matrix by D. Then applying Lemma 8 from Appendix G yields

$$I(XY;UV) = \tfrac{1}{2}\log|C| - \tfrac{1}{2}\log\big|C - C_{xy,uv}C_{uv,uv}^{-1}C_{uv,xy}\big| = \tfrac{1}{2}\log|C| - \tfrac{1}{2}\log\big|C - CDC_{uv,uv}^{-1}DC\big| = -\tfrac{1}{2}\log\big|I - DC_{uv,uv}^{-1}DC\big|.$$

Substituting for the 2 × 2 matrices in this last expression and rearranging terms yields

$$I(XY;UV) = -\tfrac{1}{2}\log\left[1 + \frac{2\rho_{uv}\gamma - \beta}{1 - \rho_{uv}^2}\right]$$

where γ and β are defined in (19).

By assumption, $\rho_{xu}$ and $\rho_{yv}$ are fixed, so we optimize $I(XY;UV)$ only with respect to $\rho_{uv}$. Setting $\partial I(XY;UV)/\partial\rho_{uv} = 0$ and solving, we obtain that, if $\beta > 2\gamma > 0$, then the minimum of $I(XY;UV)$ (and hence the maximum in (13)) is achieved at $\rho_{uv}^* = \rho$, where $\rho$ is defined in (19).

To complete the proof we must show that β > 2γ > 0. Noting that β, γ > 0 and substituting, the desired inequality becomes

$$\rho_{xu}^2 + \rho_{yv}^2 - \rho_{xu}^2\rho_{yv}^2 > 2\rho_{xy}\rho_{xu}\rho_{yv} - \rho_{xy}^2\rho_{xu}^2\rho_{yv}^2.$$

Subtracting each side from 1 and factoring yields the equivalent inequality

$$(1 - \rho_{xy}\rho_{xu}\rho_{yv})^2 > (1 - \rho_{xu}^2)(1 - \rho_{yv}^2).$$

To show that this holds for all $\rho_{xy}$, note that the minimum of the left-hand side over $\rho_{xy}$ is achieved at $\rho_{xy} = 1$, so that it suffices to show

$$(1 - \rho_{xu}^2)(1 - \rho_{yv}^2) - (1 - \rho_{xu}\rho_{yv})^2 < 0.$$

This inequality holds (apart from the degenerate boundary case $\rho_{xu} = \rho_{yv}$ with $\rho_{xy} = 1$), since

$$(1 - \rho_{xu}^2)(1 - \rho_{yv}^2) - (1 - \rho_{xu}\rho_{yv})^2 = -\rho_{xu}^2 - \rho_{yv}^2 + 2\rho_{xu}\rho_{yv} = (\rho_{xu} - \rho_{yv})(\rho_{yv} - \rho_{xu}) = -(\rho_{xu} - \rho_{yv})^2 \le 0.$$

VIII. Conclusion

We have presented an information-theoretic analysis of pattern recognition systems subject to data compression constraints. Our main results consist of fundamental bounds characterizing the minimum sensory and memory information budgets required for reliable pattern recognition, or, equivalently, the maximum number of patterns that can be discriminated on given sensory and memory data budgets.

As a starting point, we have focused on the case of unstructured data, in which patterns are representable as vectors with i.i.d. components, and the sensory data observation channel is memoryless. In recent years, there has been much theoretical and experimental work aimed at developing methods to render data into a format with independent (or approximately independent) components (see, e.g., [36]–[39]). Such methods have been especially successful in the study of “natural” signals, e.g., sounds and imagery in naturally occurring environments. Nevertheless, a decomposition into independent components is often impossible or only approximate, and it will be important in future work to extend our results to cover the case of correlated components and channels with memory.

We have focused on “reliable” pattern recognition systems, in the sense that the recognition error rate is able to be made arbitrarily close to zero. Nevertheless, in some applications it is of interest (or unavoidable) to allow less-than-perfect accuracy. This can be partly addressed by recasting the recognition problem as a “coarse-to-fine” search, where the system is given information in several successive stages, and at each stage is required only to partially recognize the pattern, i.e., to identify the pattern as belonging to a particular subclass, postponing definitive identification for the final stage. Extending our results to this successive refinement setting is relatively straightforward; see [40]. The more direct approach of explicitly allowing a strictly positive error rate is an open problem.

Much work remains to be done in designing practical pattern recognition systems that achieve the bounds described herein. One of the most challenging problems in this regard is the design of adequate statistical models of real-world signals. For examples of progress on this exciting front, see [11], [36], [41]–[48]. Another significant challenge is that of learning optimal classifiers from training data. In this connection, it will likely prove fruitful to explore connections between the present results and those established in machine learning theory; see, e.g., [7]–[10]. Another practical challenge is to build systems that make optimal use of time. Donald Geman and colleagues have been developing the theory of systems that reach their pattern recognition decisions with a minimum amount of computation [49]. It will be interesting to explore the relationship of this concept with our results concerning recognition using the minimum amount of information.

Open theoretical problems include the calculation of error exponents and, most importantly, the closing of the gap between our inner and outer bounds. As discussed in Section VI-B, the gap in our problem bears close resemblance to that in the distributed source coding problem. A solution to the distributed source coding problem would likely lead to a solution to ours, and vice versa.15

Acknowledgment

The authors gratefully acknowledge stimulating discussions with Michael DeVore, Naveen Singla, and Po-Hsiang Lai.

This work was supported by the Mathers Foundation and by the Office of Naval Research. The material in this paper was presented in part at an ONR PI meeting, Minneapolis, MN, May 2003; the Neural Information Processing Systems Workshop on Information Theory and Learning, Whistler, BC, Canada, December 2003; the IEEE International Symposium on Information Theory, Chicago, IL, June/July 2004; and the IEEE Information Theory Workshop, Punta del Este, Uruguay, March 2006.

Appendix A. Proof of the Inner Bound

In this section we prove the inner bound $\overline{\mathcal{R}}_{\text{in}} \subseteq \mathcal{R}$, Theorem 2. The proof relies on standard random coding arguments and properties of strongly jointly typical sets [30]. Given a joint distribution $p(xyuv)$, the strongly jointly $\delta$-typical set is defined by

$$T_{UVXY}^{\delta} = \Big\{\mathbf{xyuv} : \big|N(xyuv \mid \mathbf{xyuv})/n - p(xyuv)\big| \le \delta \quad \forall\, xyuv \in \mathcal{XYUV}\Big\}$$

where $N(xyuv \mid \mathbf{xyuv})$ is the number of times the symbol combination $xyuv$ occurs in $\mathbf{xyuv}$. Likewise, we write, e.g., $T_X^{\delta}$, $T_{XY}^{\delta}$, $T_{XYU}^{\delta}$ for singles, pairs, and triples. We will also use conditionally strongly jointly $\delta$-typical sets, for example

$$T_{\mathbf{x}U}^{\delta} = \{\mathbf{u} : (\mathbf{xu}) \in T_{XU}^{\delta}\}.$$

The subscripts are omitted when context allows. We will also need the fact that for any positive numbers δ, ϵ > 0, fixed vector x, and large enough n

$$2^{-n[I(X;Y)+\epsilon]} \le \Pr\big(\mathbf{Y} \in T_{\mathbf{x}Y}^{\delta}\big) \le 2^{-n[I(X;Y)-\epsilon]}. \quad (22)$$

Proof of Theorem 2: Suppose $R$ is a point in the convex hull of $\mathcal{R}_{\text{in}}$, $R \in \overline{\mathcal{R}}_{\text{in}}$. That is, $R = \sum_{q\in\mathcal{Q}} p(q)R_q$, where $p(q)$ is a probability distribution over some finite alphabet $\mathcal{Q}$, and for each $q \in \mathcal{Q}$, $R_q = (R_{xq}, R_{yq}, R_{cq}) \in \mathcal{R}_{\text{in}}$.

We wish to show that for any $\epsilon > 0$ and large enough $n$, there exists an $(M_c, M_x, M_y, n)$ code $(f, \phi, g)$ with rates $R_c' = \frac{1}{n}\log M_c$, $R_x' = \frac{1}{n}\log M_x$, $R_y' = \frac{1}{n}\log M_y$ such that $R_c' \ge R_c$, $R_x' \le R_x$, $R_y' \le R_y$, and $P_e^n \le \epsilon$.

By definition, $R_q \in \mathcal{R}_{\text{in}}$ implies that for each $q \in \mathcal{Q}$ there exist random variables $U_q$, $V_q$ such that

$$p_q(xyuv) = p(xy)\,p_q(u|x)\,p_q(v|y)$$

and

$$R_{xq} = I(X;U_q) + \alpha_{xq}, \qquad R_{yq} = I(Y;V_q) + \alpha_{yq}, \qquad R_{cq} = R_{xq} + R_{yq} - I(XY;U_qV_q) - \gamma_q$$

for some values $\alpha_{xq}, \alpha_{yq}, \gamma_q > 0$ such that $\gamma_q \ge \alpha_{xq} + \alpha_{yq}$.16 Now let

$$R_{xq}' = I(X;U_q) + \alpha_{xq}/4, \quad R_x' = \sum_{q\in\mathcal{Q}} p(q)R_{xq}', \qquad R_{yq}' = I(Y;V_q) + \alpha_{yq}/4, \quad R_y' = \sum_{q\in\mathcal{Q}} p(q)R_{yq}'$$

and

$$R_{cq}' = R_{xq}' + R_{yq}' - I(XY;U_qV_q) - \gamma_q/4, \qquad R_c' = \sum_{q\in\mathcal{Q}} p(q)R_{cq}'.$$

With these choices, we have $R_x' \le R_x$, $R_y' \le R_y$, and $R_c' \ge R_c$.

Given

$$X^n \sim \prod_{i=1}^n p(x_i), \qquad Y^n \sim \prod_{i=1}^n p(y_i)$$

divide the sequences into $|\mathcal{Q}|$ segments with lengths $n_q = np(q)$, denoted $X^{n_q}$, $Y^{n_q}$, i.e.,

$$X^n = \big[X^{n_1}\,X^{n_2}\cdots X^{n_{|\mathcal{Q}|}}\big], \qquad Y^n = \big[Y^{n_1}\,Y^{n_2}\cdots Y^{n_{|\mathcal{Q}|}}\big].$$

Finally, we will use the additional notation $M_{xq} = 2^{n_q R_{xq}'}$, $M_{yq} = 2^{n_q R_{yq}'}$, and $\mathcal{M}_{xq} = \{1, \ldots, M_{xq}\}$, $\mathcal{M}_{yq} = \{1, \ldots, M_{yq}\}$.

We will construct the desired overall code $(f, \phi, g)$ with rate $(R_c', R_x', R_y')$ by first constructing encoders $f_q$, $\phi_q$ with rates $R_{xq}'$, $R_{yq}'$ for the $|\mathcal{Q}|$ component sequences $X^{n_q}$, $Y^{n_q}$, then constructing a classifier $g$ which acts on the combined outputs of the encoders.

Please refer to Fig. 7 for a summary of the notation introduced below.

Fig. 7. Mappings for sequences $(X^{n_q}, Y^{n_q})$, $q \in \mathcal{Q}$, and concatenations $(X^n, Y^n)$.

  • 1
    Codebooks: For each $q \in \mathcal{Q}$, from $p_q(xyuv)$ compute the marginal distributions $p_q(u)$, $p_q(v)$. To serve as memory codewords, select $M_{xq}$ length-$n_q$ vectors by sampling with replacement from a uniform distribution over the set $T_U^{\delta}$. Assign each codeword a unique index $i_q \in \mathcal{M}_{xq} = \{1, 2, \ldots, M_{xq}\}$. To serve as sensory codewords, similarly select $M_{yq}$ length-$n_q$ vectors from $T_V^{\delta}$, and assign each an index $j_q \in \mathcal{M}_{yq} = \{1, 2, \ldots, M_{yq}\}$. Denote the codebooks
    $$B_u(q) = \{u^{n_q}(1), \ldots, u^{n_q}(M_{xq})\}, \qquad B_v(q) = \{v^{n_q}(1), \ldots, v^{n_q}(M_{yq})\}.$$
  • 2
    Encoders: We define encoders fq and ϕq in terms of maps
    $$\varphi_{xq}: \mathcal{X}^{n_q} \to \mathcal{U}^{n_q}, \quad b_{xq}: \mathcal{U}^{n_q} \to \mathcal{M}_{xq}, \qquad \varphi_{yq}: \mathcal{Y}^{n_q} \to \mathcal{V}^{n_q}, \quad b_{yq}: \mathcal{V}^{n_q} \to \mathcal{M}_{yq}$$
    as follows. Given any $x^{n_q} \in \mathcal{X}^{n_q}$, search the codebook $B_u(q)$ for a codeword $u^{n_q}$ such that $(x^{n_q}, u^{n_q}) \in T_{XU}^{\delta}$. If this search is successful, set $\varphi_{xq}(x^{n_q}) = u^{n_q}$, $b_{xq}(u^{n_q}) = i_q$, $b_{xq}^{-1}(i_q) = u^{n_q}$, where $i_q$ is the index of $u^{n_q}$ in $B_u(q)$. If the search fails, (arbitrarily) set $u^{n_q} = u^{n_q}(1)$ so that $\varphi_{xq}$ and $b_{xq}$ are defined on all of $\mathcal{X}^{n_q}$. In the same way, given any $y^{n_q} \in \mathcal{Y}^{n_q}$, search $B_v(q)$ for a codeword $v^{n_q}$ such that $(y^{n_q}, v^{n_q}) \in T_{YV}^{\delta}$. If successful, denote the index of the found codeword by $j_q$, and set $\varphi_{yq}(y^{n_q}) = v^{n_q}$, $b_{yq}(v^{n_q}) = j_q$, $b_{yq}^{-1}(j_q) = v^{n_q}$. For search failure, set $j_q = 1$, $v^{n_q} = v^{n_q}(1)$ so that $\varphi_{yq}$ and $b_{yq}$ are defined on all of $\mathcal{Y}^{n_q}$.

Finally, define

$$f_q: \mathcal{X}^{n_q} \to \mathcal{M}_{xq}, \quad f_q(x^{n_q}) = i_q \triangleq b_{xq}(\varphi_{xq}(x^{n_q})), \qquad \phi_q: \mathcal{Y}^{n_q} \to \mathcal{M}_{yq}, \quad \phi_q(y^{n_q}) = j_q \triangleq b_{yq}(\varphi_{yq}(y^{n_q})).$$

Now, given vectors $x^n = (x^{n_1} \cdots x^{n_{|\mathcal{Q}|}})$ and $y^n = (y^{n_1} \cdots y^{n_{|\mathcal{Q}|}})$, the encoders above produce $|\mathcal{Q}|$ vectors $\varphi_{xq}(x^{n_q}) = u^{n_q}$, $\varphi_{yq}(y^{n_q}) = v^{n_q}$, and $|\mathcal{Q}|$ indices $b_{xq}(u^{n_q}) = i_q$, $b_{yq}(v^{n_q}) = j_q$. Denote the concatenations of these

$$u^n = \big[u^{n_1} \cdots u^{n_{|\mathcal{Q}|}}\big], \quad v^n = \big[v^{n_1} \cdots v^{n_{|\mathcal{Q}|}}\big], \quad \mathbf{i} = \big[i_1 \cdots i_{|\mathcal{Q}|}\big], \quad \mathbf{j} = \big[j_1 \cdots j_{|\mathcal{Q}|}\big].$$

Note that the vector of integers $\mathbf{i}$ ranges over $M_x = \prod_{q\in\mathcal{Q}} M_{xq}$ different values $\mathbf{i}(i)$, $i = 1,\ldots,M_x$. Let the map between the vectors $\mathbf{i}$ and the corresponding integers $i$ be $\ell_x$, i.e., if $\mathbf{i} = \mathbf{i}(i)$, let $\ell_x(\mathbf{i}) = i$ and $\ell_x^{-1}(i) = \mathbf{i}$. Similarly, $\mathbf{j}$ ranges over $M_y = \prod_{q\in\mathcal{Q}} M_{yq}$ values $\mathbf{j}(j)$, $j = 1,\ldots,M_y$, and we define $\ell_y$ such that if $\mathbf{j} = \mathbf{j}(j)$, then $\ell_y(\mathbf{j}) = j$ and $\ell_y^{-1}(j) = \mathbf{j}$. Then we can specify encoders $f: \mathcal{X}^n \to \mathcal{M}_x$ and $\phi: \mathcal{Y}^n \to \mathcal{M}_y$ for full length-$n$ vectors $x^n$ and $y^n$ by

$$f(x^n) = i \triangleq \ell_x\big(b_x(\varphi_x(x^n))\big), \qquad \phi(y^n) = j \triangleq \ell_y\big(b_y(\varphi_y(y^n))\big).$$

To finalize the construction of the memory encoder, for any given labeled template pattern $t(w) = (x^n, w)$, let $f: \mathcal{X}^n \times \mathcal{M}_c \to \mathcal{M}_x \times \mathcal{M}_c$ be defined by

$$f(t(w)) = m(w) = (i, w) \triangleq (f(x^n), w).$$

The rates of the encoders are $(R'_x, R'_y)$, as verified by calculating

$$\frac{1}{n}\log M_x = \frac{1}{n}\log \prod_{q\in\mathcal{Q}} M_{xq} = \frac{1}{n}\sum_{q\in\mathcal{Q}} \log 2^{n_q R'_{xq}} = \sum_{q\in\mathcal{Q}} p(q) R'_{xq} = R'_x, \qquad \frac{1}{n}\log M_y = \frac{1}{n}\log \prod_{q\in\mathcal{Q}} M_{yq} = \frac{1}{n}\sum_{q\in\mathcal{Q}} \log 2^{n_q R'_{yq}} = \sum_{q\in\mathcal{Q}} p(q) R'_{yq} = R'_y.$$
  • 3) Memorization: Given a realization of the template patterns $\mathcal{C}_x = (t(1), \ldots, t(M_c))$, $t(w) = (x^n(w), w)$, and the encoders $f$ and $\phi$ defined above, compute the memory data $\mathcal{C}_u = f(\mathcal{C}_x) \triangleq \{f(t(1)), \ldots, f(t(M_c))\} = \{m(1), \ldots, m(M_c)\}$.

  • 4) Recognition Function: Given the stored memory data $\mathcal{C}_u$, we proceed to construct the classifier $g$ as follows.

For each qQ, given any pair of length nq vectors unq, vnq, define a function that tests the pair for strong joint typicality

$$\rho_q(u^{n_q}, v^{n_q}) = \mathsf{T}\big[(u^{n_q}, v^{n_q}) \in T^{\delta}_{UV}\big]$$

where $\mathsf{T}[\cdot]$ is the truth-indicator function: $\mathsf{T}[A] = 1$ if $A$ is true, and $\mathsf{T}[A] = 0$ if $A$ is false.

Now, given the sensory data $j$, compute its vector representation $\mathbf{j} = [j_1 \cdots j_{|\mathcal{Q}|}] = \ell_y^{-1}(j)$. For each $q\in\mathcal{Q}$, retrieve from $\mathcal{B}_v(q)$ the corresponding codeword $v^{n_q} = b^{-1}_{yq}(j_q)$. Similarly, for each memory $m(w) = (i, w) \in \mathcal{C}_u$, compute the corresponding vector $\mathbf{i} = [i_1 \cdots i_{|\mathcal{Q}|}] = \ell_x^{-1}(i)$, and from the memory codebook $\mathcal{B}_u(q)$ retrieve $u^{n_q} = b^{-1}_{xq}(i_q)$. Next, define a function $r_w: \mathcal{M}_y \to \{0,1\}$ that tests each $v^{n_q}$ in $v^n = [v^{n_1} \cdots v^{n_{|\mathcal{Q}|}}]$ against the corresponding $u^{n_q}$ in $u^n = [u^{n_1} \cdots u^{n_{|\mathcal{Q}|}}]$, reporting a one if all compared pairs are jointly typical and a zero otherwise

$$r_w(j) = \mathsf{T}\Big[\big(\rho_1(u^{n_1}, v^{n_1}), \ldots, \rho_{|\mathcal{Q}|}(u^{n_{|\mathcal{Q}|}}, v^{n_{|\mathcal{Q}|}})\big) = \mathbf{1}_{|\mathcal{Q}|}\Big]$$

where $\mathbf{1}_{|\mathcal{Q}|}$ denotes the length-$|\mathcal{Q}|$ all-ones vector.

We can now specify the recognition function $g$ as follows. Given the encoded sensory data $j$, the recognition module searches for a unique $w \in \mathcal{M}_c$ such that $r_w(j) = 1$. If this search is successful, set $\hat{w} = w$. Otherwise, if there is no such value or more than one, declare an error and (arbitrarily) set $\hat{w} = 1$. Thus, we have defined $g: \mathcal{M}_y \times (\mathcal{M}_x)^{M_c} \to \mathcal{M}_c$, $g(j, \mathcal{C}_u) = \hat{w}$, as desired.
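To make the construction concrete, here is a minimal, self-contained Python sketch (ours, not the paper's) of the typicality-based classifier $g$ for the special case $|\mathcal{Q}| = 1$: the memory data is a list of (codeword index, label) pairs, and an error is declared (with $\hat{w} = 1$) unless exactly one stored pattern passes the joint-typicality test. The function names, toy codebooks, and parameter values are illustrative assumptions only.

```python
import numpy as np

def jointly_typical(u, v, p_uv, delta):
    """Strong joint typicality: all empirical pair frequencies within delta of p_uv."""
    n = len(u)
    counts = np.zeros_like(p_uv)
    for a, b in zip(u, v):
        counts[a, b] += 1
    return bool(np.all(np.abs(counts / n - p_uv) <= delta))

def recognize(j, memory, codebook_u, codebook_v, p_uv, delta):
    """Sketch of g: return the unique stored label whose memory codeword is jointly
    typical with the sensory codeword v(j); otherwise declare an error (w-hat = 1)."""
    v = codebook_v[j]
    matches = [w for (i, w) in memory if jointly_typical(codebook_u[i], v, p_uv, delta)]
    return matches[0] if len(matches) == 1 else 1

# Toy usage: binary U, V with joint p_uv; two stored patterns and one sensory index.
p_uv = np.array([[0.4, 0.1], [0.1, 0.4]])
codebook_u = {1: [0, 0, 1, 1, 0, 1, 0, 1], 2: [1, 1, 1, 1, 1, 1, 1, 1]}
codebook_v = {1: [0, 1, 1, 1, 0, 1, 0, 1]}
memory = [(1, 1), (2, 2)]            # (memory codeword index, pattern label)
print(recognize(1, memory, codebook_u, codebook_v, p_uv, delta=0.15))   # prints 1
```

In this toy run only the first stored codeword passes the typicality test, so the classifier returns its label; with several accidental matches it would fall back to the error value.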

A. Performance Analysis

1). Error Events:

We analyze the probability of error for a given $W = w$, $T(w) = (X^n(w), w)$, $Y^n$. Denote the results of processing these with the components of the code $(f, \phi, g)$ above by $M(w) = (I(w), w) = f(T(w))$, $J(w) = \phi(Y^n)$; $U^{n_q}(w) = \varphi_{xq}(X^{n_q})$, $V^{n_q}(w) = \varphi_{yq}(Y^{n_q})$ for each $q\in\mathcal{Q}$; and $R_w = r_w(J(w))$. The following is an exhaustive list of possible errors.

First, in words, the possible errors are as follows:

  • the sensory data and pattern template are not jointly typical;

  • the pattern template is unencodable;

  • the sensory data is unencodable;

  • the codewords for the memory and sensory data are not jointly typical;

  • the sensory data is jointly typical with more than one memory codeword;

  • two different patterns are assigned the same memory codeword.

More formally, we express the error events as follows. For events $E_i$, let $\bar{E}_k = \big(\bigcup_{i=1}^{k} E_i\big)^c$, where $A^c$ denotes the complement of $A$. For each $q\in\mathcal{Q}$

  • $E_1(q) = \{(X^{n_q}, Y^{n_q}) \notin T^{\delta}_{XY}\}$;

  • $E_2(q) = \bar{E}_1(q) \cap \{\nexists\, U^{n_q} \in \mathcal{B}_u(q) : (X^{n_q}, U^{n_q}) \in T^{\delta}_{XU}\}$;

  • $E_3(q) = \bar{E}_1(q) \cap \{\nexists\, V^{n_q} \in \mathcal{B}_v(q) : (Y^{n_q}, V^{n_q}) \in T^{\delta}_{YV}\}$;

  • $E_4(q) = \bar{E}_3(q) \cap \{(U^{n_q}(w), V^{n_q}(w)) \notin T^{\delta}_{UV}\}$; and letting $E_i \triangleq \bigcup_{q\in\mathcal{Q}} E_i(q)$, $i = 1,\ldots,4$

  • $E_5 = \bar{E}_4 \cap \{\exists\, w' \in \mathcal{M}_c : w' \ne w,\ r_{w'}(J(w)) = 1\}$;

  • $E_6 = \bar{E}_5 \cap \{\exists\, w' \in \mathcal{M}_c : w' \ne w,\ I(w') = I(w)\}$.

The probability of each of these events vanishes as $n \to \infty$, for the following reasons:

  • P(E1(q)) → 0, by the strong asymptotic equipartition property (AEP);

  • $P(E_2(q)) \to 0$, because $R'_{xq} > I(X;U_q)$;

  • $P(E_3(q)) \to 0$, because $R'_{yq} > I(Y;V_q)$;

  • P(E4(q)) → 0, because of the factorization pq(xyuv) = p(xy)pq(u | x)pq(v | y) and the Markov lemma.

  • Regarding $P(E_5)$: Rewrite $E_5$ as
    $$E_5 = \bar{E}_4 \cap \{\exists\, w' \in \mathcal{M}_c : w' \ne w,\ r_{w'}(J(w)) = 1\} = \bar{E}_4 \cap \{\exists\, w' \in \mathcal{M}_c \setminus w : \forall\, q\in\mathcal{Q},\ (U^{n_q}(w'), V^{n_q}(w)) \in T^{\delta}_{UV}\}.$$
    Then
    $$P(E_5) \le |\mathcal{C}_u| \prod_{q\in\mathcal{Q}} 2^{-n_q[I(U_q;V_q)-\epsilon_q]} \le M_c \prod_{q\in\mathcal{Q}} 2^{-n_q[I(U_q;V_q)-\epsilon_q]} = 2^{nR'_c}\, 2^{-n\sum_{q\in\mathcal{Q}} p(q)[I(U_q;V_q)-\epsilon_q]} = 2^{n\left[R'_c - \sum_{q\in\mathcal{Q}} p(q) I(U_q;V_q) + \epsilon\right]}.$$
    So, $P(E_5) \to 0$ as $n \to \infty$ if $R'_c \le \sum_{q\in\mathcal{Q}} p(q) I(U_q;V_q) - \epsilon$, where $\epsilon \triangleq \sum_{q\in\mathcal{Q}} p(q)\epsilon_q$ can be made arbitrarily small by choosing $\delta$ small. This is indeed the case, since
    $$R'_c = \sum_{q\in\mathcal{Q}} p(q) R'_{cq} = \sum_{q\in\mathcal{Q}} p(q)\big[R'_{xq} + R'_{yq} - I(XY;U_qV_q) - \gamma_q/4\big] = \sum_{q\in\mathcal{Q}} p(q)\big[I(X;U_q) + I(Y;V_q) - I(XY;U_qV_q) + (\alpha_{xq} + \alpha_{yq} - \gamma_q)/4\big] \overset{(a)}{\le} \sum_{q\in\mathcal{Q}} p(q)\big[I(X;U_q) + I(Y;V_q) - I(XY;U_qV_q)\big] \overset{(b)}{\le} \sum_{q\in\mathcal{Q}} p(q) I(U_q;V_q)$$
    where (a) is because $\gamma_q \ge \alpha_{xq} + \alpha_{yq}$, and (b) follows from elementary properties of mutual information and from the factorization $p(q)\,p(xy)\,p_q(u \mid x)\,p_q(v \mid y)$.
  • Regarding $P(E_6)$: Denoting the components of the codeword for each memory $M(w') \in \mathcal{C}_u$ by $U^{n_q}(w')$, rewrite event $E_6$ as
    $$E_6 = \bar{E}_5 \cap \{\exists\, w' \in \mathcal{M}_c : w' \ne w,\ I(w') = I(w)\} = \bar{E}_5 \cap \{\exists\, w' \in \mathcal{M}_c : w' \ne w,\ \forall\, q\in\mathcal{Q} : (X^{n_q}(w), U^{n_q}(w')) \in T^{\delta}_{XU}\}.$$
    Then
    $$P(E_6) \le M_c \prod_{q\in\mathcal{Q}} 2^{-n_q[I(X;U_q)-\epsilon_q]} = 2^{n\left[R'_c - \sum_{q\in\mathcal{Q}} p(q) I(X;U_q) + \epsilon\right]}.$$
    So $P(E_6) \to 0$ as $n \to \infty$ if $R'_c \le \sum_{q\in\mathcal{Q}} p(q) I(X;U_q) - \epsilon$. This is indeed the case: from our preceding calculation for $P(E_5)$, we have $R'_c \le \sum_{q\in\mathcal{Q}} p(q) I(U_q;V_q)$; and the assumed factorization of $p_q(xyuv)$ implies that $U_q - X - Y - V_q$ is a Markov chain. Hence, by the data processing inequality
    $$R'_c \le \sum_{q\in\mathcal{Q}} p(q) I(U_q;V_q) \le \sum_{q\in\mathcal{Q}} p(q) I(U_q;X).$$
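For completeness, the step tying the error events back to the block error probability is the standard union bound, which the argument above leaves implicit: since the list of events is exhaustive, for the fixed $w$ under analysis

$$P_e^n \le \sum_{i=1}^{4}\sum_{q\in\mathcal{Q}} P(E_i(q)) + P(E_5) + P(E_6) \longrightarrow 0 \quad \text{as } n \to \infty.$$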

This concludes the proof of the inner bound.

Appendix B. Proof of the Outer Bound

In this section we prove Theorem 1, which states the outer bound $\mathcal{R} \subseteq \mathcal{R}_{\text{out}}$. In the proof let $W$ be the test index, selected from a uniform distribution $p(w)$ over the pattern indices $\mathcal{M}_c$; let $T = T(W) = (X, W)$ be the selected test pattern from the set of template patterns $\mathcal{C}_x$; let $M = M(W) = (I, W) = f(T)$ be the compressed, memorized form of $T$; let $\mathcal{C}_u = f(\mathcal{C}_x)$ be the memorized data; let $Y$ be the sensory data; let $J = J(W) = \phi(Y)$ be the encoded sensory data; and let $\hat{W} = g(J, \mathcal{C}_u)$ be the inferred value of $W$. Note that $M$ and $\hat{W}$ are random variables through their dependence on $X$, $W$, $Y$, and $\mathcal{C}_x$. The mutual informations in the proof are calculated with respect to the joint distribution (and its marginals) over $(W, \mathcal{C}_x, \mathcal{C}_u, X, Y, M, I, J, \hat{W})$. We can verify that this distribution is well defined by writing it out explicitly. Let $\mathsf{T}[\cdot]$ be the truth-indicator function: $\mathsf{T}[A] = 1$ if $A$ is true, and $\mathsf{T}[A] = 0$ if $A$ is false. Then

$$p(w, \mathcal{C}_x, \mathcal{C}_u, x, y, m, j, \hat{w}) = p(w)\,p(\mathcal{C}_x)\,p(\mathcal{C}_u \mid \mathcal{C}_x)\,p(x \mid w, \mathcal{C}_x)\,p(y \mid x)\,p(m \mid x, w)\,p(j \mid y)\,p(\hat{w} \mid j, \mathcal{C}_u)$$

where

$$p(w) = \frac{1}{M_c}\,\mathsf{T}[w \in \mathcal{M}_c], \quad p(\mathcal{C}_x) = \prod_{w=1}^{M_c}\prod_{i=1}^{n} p(x_i(w)), \quad p(\mathcal{C}_u \mid \mathcal{C}_x) = \mathsf{T}[\mathcal{C}_u = f(\mathcal{C}_x)], \quad p(x \mid w, \mathcal{C}_x) = \mathsf{T}[x = x(w) \in \mathcal{C}_x],$$
$$p(y \mid x) = \prod_{i=1}^{n} p(y_i \mid x_i), \quad p(m \mid x, w) = \mathsf{T}[m = f(x, w)], \quad p(j \mid y) = \mathsf{T}[j = \phi(y)], \quad p(\hat{w} \mid j, \mathcal{C}_u) = \mathsf{T}[\hat{w} = g(j, \mathcal{C}_u)].$$

The independence relationships underlying the structure of this distribution are evident from the block diagram of Fig. 1.

Proof: (Theorem 1) Assume $R = (R_x, R_y, R_c) \in \mathcal{R}$. Then there exists a sequence of $(M_x, M_y, M_c, n)$ codes $(f, \phi, g)$ such that for any $\epsilon > 0$

$$M_c \ge 2^{nR_c}, \qquad M_x \le 2^{nR_x}, \qquad M_y \le 2^{nR_y}$$

and $P_e^n = \Pr(\hat{W} \ne W) \le \epsilon$. To show that $R \in \mathcal{R}_{\text{out}}$, we must construct a pair of auxiliary random variables $UV$ such that $UV \in \mathcal{P}_{\text{out}}$ and $R \in \mathcal{R}_{UV}$.

We construct the desired pair UV in three steps: 1) We introduce a set of intermediate random variable pairs UiVi, i = 1,2,…,n, individually contained in Pout ; 2) we derive mutual information inequalities for Rx, Ry, and Rc involving sums of the intermediate variables; 3) we convert the sum-inequalities into inequalities in the desired pair UV.

Step 1:

Let the intermediate auxiliary random variables be

$$U_i = (M, X^{i-1}), \qquad V_i = (J, Y^{i-1})$$

for i = 1,2,…,n. Each pair is in Pout . This is verified for the Ui by calculating

$$I(U_i;Y_i \mid X_i) = H(Y_i \mid X_i) - H(Y_i \mid M, X^{i-1}, X_i) = H(Y_i \mid X_i) - H(Y_i \mid M, X^i) \overset{(a)}{\le} H(Y_i \mid X_i) - H(Y_i \mid M, X^n) \overset{(b)}{=} H(Y_i \mid X_i) - H(Y_i \mid X^n) \overset{(c)}{=} H(Y_i \mid X_i) - H(Y_i \mid X_i) = 0$$

where the reasons for the lettered steps are: (a) conditioning does not increase entropy; (b) the $Y_i$ are independent of all other variables given $X^n$; and (c) the pairs $X_iY_i$ are i.i.d. Since mutual information is nonnegative, $I(U_i;Y_i \mid X_i) = 0$, and hence $U_i - X_i - Y_i$ is a Markov chain. By a similar argument, $X_i - Y_i - V_i$ is also a Markov chain. Hence, $U_iV_i \in \mathcal{P}_{\text{out}}$ for each $i = 1,2,\ldots,n$.

Step 2:

First, for the sensory encoder rate

$$nR_y \ge H(J) \overset{(a)}{=} H(J) - H(J \mid Y^n) = \sum_{i=1}^{n}\big[H(Y_i) - H(Y_i \mid Y^{i-1}, J)\big] = \sum_{i=1}^{n}\big[H(Y_i) - H(Y_i \mid V_i)\big] = \sum_{i=1}^{n} I(Y_i;V_i)$$

where (a) follows from J = ϕ(Yn).

Next, taking account of all Mc memorized patterns, for the memory encoder rate we have

$$\begin{aligned} M_c (nR_x) &\ge M_c \log M_x \ge H(\mathcal{C}_u) = H(\mathcal{C}_u) - H(\mathcal{C}_u \mid \mathcal{C}_x) = H(\mathcal{C}_x) - H(\mathcal{C}_x \mid \mathcal{C}_u) \ge \sum_{w=1}^{M_c}\big[H(T(w)) - H(T(w) \mid M(w))\big] \\ &\overset{(a)}{=} \sum_{w=1}^{M_c}\big[H(X^n(W) \mid W = w) - H(X^n(W) \mid I, W = w)\big] \overset{(b)}{=} \sum_{w=1}^{M_c}\big[H(X^n) - H(X^n \mid I, W = w)\big] \\ &\overset{(c)}{=} M_c \sum_{w=1}^{M_c} p(w)\big[H(X^n) - H(X^n \mid I, W = w)\big] \overset{(d)}{=} M_c\big[H(X^n) - H(X^n \mid I, W)\big] \\ &\overset{(e)}{=} M_c \sum_{i=1}^{n}\big[H(X_i) - H(X_i \mid X^{i-1}, I, W)\big] \overset{(f)}{=} M_c \sum_{i=1}^{n}\big[H(X_i) - H(X_i \mid U_i)\big] = M_c \sum_{i=1}^{n} I(X_i;U_i) \end{aligned}$$

where (a) is simply a matter of variable definitions and notation; (b) follows from the assumption that the $X^n(w)$'s all have the same distribution and are drawn independently of $W$; (c) follows from the definition of $p(w)$; (d) follows from the definition of conditional entropy; (e) follows from $p(x^n) = \prod_{i=1}^{n} p(x_i)$ and the telescoping property; and (f) follows from the definition of $U_i$. Hence, $nR_x \ge \sum_{i=1}^{n} I(X_i;U_i)$.

Finally,
$$\begin{aligned} nR_c &\le \log M_c = H(W) = I(W;\mathcal{C}_u, J) + H(W \mid \mathcal{C}_u, J) \overset{(a)}{\le} I(W;\mathcal{C}_u, J) + n\epsilon_n = I(W;\mathcal{C}_u) + I(W;J \mid \mathcal{C}_u) + n\epsilon_n \\ &\overset{(b)}{=} 0 + I(W;J \mid \mathcal{C}_u) + n\epsilon_n = I(W, \mathcal{C}_u;J) - I(J;\mathcal{C}_u) + n\epsilon_n \le I(W, \mathcal{C}_u;J) + n\epsilon_n \\ &\overset{(c)}{=} I(M;J) + n\epsilon_n \overset{(d)}{=} \sum_{i=1}^{n}\big[I(X_i;U_i) + I(Y_i;V_i) - I(X_iY_i;U_iV_i)\big] + n\epsilon_n. \end{aligned}$$

The lettered steps are justified as follows.

  1. By assumption, $\Pr(\hat{W} \ne W) = P_e^n \to 0$, where $\hat{W} = g(J, \mathcal{C}_u)$. Thus, applying Fano's inequality yields
    $$H(W \mid \mathcal{C}_u, J) \le H(P_e^n) + P_e^n \log(M_c - 1) \triangleq n\epsilon_n$$
    where $\epsilon_n \to 0$.
  2. The test index W and patterns Cx are drawn independently, hence, I(W;Cu)=I(W;f(Cx))=0.

  3. Writing $\mathcal{C}_u = \mathcal{C}_u^* \cup M$, $\mathcal{C}_u^* = \mathcal{C}_u \setminus M$, we have
    $$I(W, \mathcal{C}_u;J) = I(W, M, \mathcal{C}_u^*;J) = I(M;J) + I(W, \mathcal{C}_u^*;J \mid M) = I(M;J) + I(W, \mathcal{C}_u^*;J \mid I, W) = I(M;J) + I(\mathcal{C}_u^*;J \mid I, W) = I(M;J) + 0,$$
    since the $M(i) = (I(i), i)$ are independent of $J$ for $i \ne W$.
  4. To justify this step, we invoke the following two results, proved in Appendix D. Let A, α, B, β and γ be arbitrary discrete random variables.

Then we get the following.

Lemma 6:

$$I(\alpha;\beta) \ge I(A;\alpha) + I(B;\beta) - I(AB;\alpha\beta),$$

with equality if and only if $I(A,\alpha;B,\beta) = I(A;B)$.

Lemma 7: Let $Z_i = (\gamma, A^{i-1})$, $i = 1,2,\ldots,n$, where the $A_i$ are i.i.d. Then

$$\sum_{i=1}^{n} I(A_i;Z_i) = I(A^n;\gamma).$$

To apply Lemma 6, make the substitution (α, β, A, B) → (M, J, Xn, Yn). Then the condition for equality is satisfied

$$\begin{aligned} I(X^n, M;Y^n, J) &= I(X^n, I, W;Y^n, J) = I(X^n, W;Y^n, J) + I(I;Y^n, J \mid X^n, W) = I(X^n, W;Y^n, J) + I(I, W;Y^n, J \mid X^n, W) \\ &\overset{(a)}{=} I(X^n, W;Y^n, J) + 0 = I(X^n, W;Y^n) + I(X^n, W;J \mid Y^n) \overset{(b)}{=} I(X^n, W;Y^n) + 0 \\ &= I(X^n;Y^n) + I(W;Y^n \mid X^n) \overset{(c)}{=} I(X^n;Y^n) + 0 \end{aligned}$$

since (a) $M = (I, W) = f(X^n, W)$, (b) $J = \phi(Y^n)$, and (c) $Y^n$ depends on $W$ only through $X^n = X^n(W)$, so that $H(Y^n \mid X^n, W) = H(Y^n \mid X^n)$. Thus, we have

$$I(M;J) = I(X^n;M) + I(Y^n;J) - I(X^n,Y^n;M,J). \tag{23}$$

Next, apply Lemma 7 three times with the substitutions

$$(Z_i, \gamma, A^{i-1}) \to (U_i, M, X^{i-1}), \quad (V_i, J, Y^{i-1}), \quad (U_iV_i, MJ, X^{i-1}Y^{i-1})$$

to obtain

$$\sum_{i=1}^{n} I(X_i;U_i) = I(X^n;M), \qquad \sum_{i=1}^{n} I(Y_i;V_i) = I(Y^n;J), \qquad \sum_{i=1}^{n} I(X_iY_i;U_iV_i) = I(X^nY^n;MJ).$$

Adding the first two expressions and subtracting the third yields

$$\sum_{i=1}^{n}\big[I(X_i;U_i) + I(Y_i;V_i) - I(X_iY_i;U_iV_i)\big] = I(X^n;M) + I(Y^n;J) - I(X^nY^n;MJ). \tag{24}$$

Combining (23) and (24) yields

$$I(M;J) = \sum_{i=1}^{n}\big[I(X_i;U_i) + I(Y_i;V_i) - I(X_iY_i;U_iV_i)\big]$$

as claimed.

Step 3:

For this step, we use the following lemma, proved in Appendix C as part of the demonstration that Rout  is convex.

Lemma 1: Suppose $U_iV_i \in \mathcal{P}_{\text{out}}$, $i = 1,2,\ldots,n$. Then there exists $UV \in \mathcal{P}_{\text{out}}$ such that

$$\frac{1}{n}\sum_{i=1}^{n} I(X_i;U_i) = I(X;U), \qquad \frac{1}{n}\sum_{i=1}^{n} I(Y_i;V_i) = I(Y;V), \qquad \frac{1}{n}\sum_{i=1}^{n} I(X_iY_i;U_iV_i) = I(XY;UV).$$

Applying Lemma 1 to the results of Steps 1 and 2, we obtain

$$R_x \ge \frac{1}{n}\sum_{i=1}^{n} I(X_i;U_i) = I(X;U), \qquad R_y \ge \frac{1}{n}\sum_{i=1}^{n} I(Y_i;V_i) = I(Y;V), \qquad R_c \le R_x + R_y - \frac{1}{n}\sum_{i=1}^{n} I(X_iY_i;U_iV_i) = R_x + R_y - I(XY;UV)$$

where $UV \in \mathcal{P}_{\text{out}}$. With respect to this $UV$, by definition we have $R \in \mathcal{R}_{UV}$. Hence, $R \in \mathcal{R}_{\text{out}}$, and the proof is complete. □

Appendix C. Convexity of the Outer Bound

In this appendix, we prove a slightly more general version of Lemma 1 from Appendix B, and use this result to show that the outer bound rate region Rout  is convex.

In the following, let Q be any finite alphabet, and assume that we have pairs XqYq for all qQ which are i.i.d. ~ p(xy).

Lemma 2: Suppose UqVqPout for all qQ, and let Q~p(q), qQ be any discrete random variable independent of the pairs {XqYq}. Then there exists a pair of discrete random variables UVPout such that

$$\sum_{q\in\mathcal{Q}} p(q) I(X_q;U_q) = I(X;U), \qquad \sum_{q\in\mathcal{Q}} p(q) I(Y_q;V_q) = I(Y;V), \qquad \sum_{q\in\mathcal{Q}} p(q)\big[I(X_q;U_q) + I(Y_q;V_q) - I(X_qY_q;U_qV_q)\big] = I(X;U) + I(Y;V) - I(XY;UV).$$

Remark 4: Lemma 1 in Appendix B follows immediately from the above Lemma, by choosing Q={1,2,,n} and p(q) = 1/n for all qQ.

Proof: As a candidate for the pair $UV$ in the lemma, consider $UV \in \mathcal{P}_{\text{conv}}$ for the given $\mathcal{Q}$ (see (4)), i.e., $U = (U_Q, Q)$ and $V = (V_Q, Q)$. To verify that $UV \in \mathcal{P}_{\text{out}}$, we proceed to check that $U - X - Y$ and $X - Y - V$ are Markov chains.

By the assumption UqVqPout for each qQ, we have I(Uq; Yq | Xq) = 0 and I(Vq; Xq | Yq) = 0. Hence

$$0 = \sum_{q\in\mathcal{Q}} p(q) I(U_q;Y_q \mid X_q) = \sum_{q\in\mathcal{Q}} p(q) I(U_Q;Y_Q \mid X_Q, Q = q) = I(U_Q;Y_Q \mid X_Q, Q) \overset{(a)}{=} I(U_Q;Y \mid X, Q) = I(U_Q, Q;Y \mid X) - I(Q;Y \mid X) \overset{(b)}{=} I(U_Q, Q;Y \mid X) = I(U;Y \mid X)$$

where in (a) we are able to drop the subscript $Q$ on $X_Q$ and $Y_Q$ because the $X_q$ and $Y_q$ are i.i.d. and independent of $Q$; and similarly (b) is because $I(Q;Y \mid X) = 0$, due to the independence of $Q$ and $Y$. By an analogous calculation, we also find $I(V;X \mid Y) = 0$. Hence, $U - X - Y$ and $X - Y - V$ are Markov chains, and $UV \in \mathcal{P}_{\text{out}}$ as desired.

It remains to demonstrate the three equalities in the lemma. For the first, write

$$I(X;U) = I(X;U_Q, Q) = I(X;U_Q \mid Q) + I(X;Q) \overset{(a)}{=} I(X;U_Q \mid Q) \overset{(b)}{=} I(X_Q;U_Q \mid Q) = \sum_{q\in\mathcal{Q}} p(q) I(X_q;U_q)$$

where, as above, (a) and (b) follow from the fact that the Xq are i.i.d. and independent of Q. Similar calculations yield

I(Y;V)=qQp(q)I(Yq;Vq),

and

I(XY;UV)=qQp(q)I(XqYq;UqVq).
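The identities above are easy to check numerically. The following Python sketch (ours, purely illustrative) verifies the first of them for a randomly drawn example: it builds $U = (U_Q, Q)$ explicitly and compares $I(X;U)$ with $\sum_q p(q) I(X_q;U_q)$, under the assumption that $Q$ is independent of $X$; the alphabet sizes and distributions are arbitrary.

```python
import numpy as np

def mutual_information(joint):
    """I(A;B) in bits for a 2-D joint probability table."""
    joint = joint / joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])).sum())

rng = np.random.default_rng(0)
nx, nu, nq = 3, 4, 2
p_x = rng.dirichlet(np.ones(nx))                          # marginal of X
p_q = rng.dirichlet(np.ones(nq))                          # time-sharing variable Q, independent of X
p_u_given_xq = rng.dirichlet(np.ones(nu), size=(nx, nq))  # p(u | x, q)

# Left-hand side: sum_q p(q) I(X_q; U_q), each term computed from p(x) p(u | x, q).
lhs = sum(p_q[q] * mutual_information(p_x[:, None] * p_u_given_xq[:, q, :]) for q in range(nq))

# Right-hand side: I(X; U) with U = (U_Q, Q); joint over x and the pair (u, q).
joint = np.zeros((nx, nu * nq))
for q in range(nq):
    joint[:, q * nu:(q + 1) * nu] = p_q[q] * p_x[:, None] * p_u_given_xq[:, q, :]
rhs = mutual_information(joint)

print(lhs, rhs)   # the two values agree up to floating-point error
```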

The convexity of Rout follows readily from the preceding lemma.

Lemma 3: Rout is convex. That is, let Rq be any set of rates such that RqRout for all qQ, where Q is a finite alphabet, and let p(q) be any probability distribution over Q. Then R=qQp(q)RqRout.

Proof: Fix an arbitrary distribution p(q) and rates RqRout for all qQ. By the definition of Rout, for each rate Rq, there exists a pair UqVqPout such that RqRUqVq. Consequently

$$R_x = \sum_{q\in\mathcal{Q}} p(q) R_{x,q} \ge \sum_{q\in\mathcal{Q}} p(q) I(X_q;U_q), \qquad R_y = \sum_{q\in\mathcal{Q}} p(q) R_{y,q} \ge \sum_{q\in\mathcal{Q}} p(q) I(Y_q;V_q), \qquad R_c = \sum_{q\in\mathcal{Q}} p(q) R_{c,q} \le \sum_{q\in\mathcal{Q}} p(q)\big[I(X_q;U_q) + I(Y_q;V_q) - I(X_qY_q;U_qV_q)\big].$$

As in the proof of Lemma 2, use these pairs to construct a new pair UV, by defining U = (UQ, Q), V = (VQ, Q). From the proof of Lemma 2, we know 1) that UVPout, and 2) the sums on the right-hand sides of the inequalities above can be replaced with expressions in U and V, yielding

$$R_x \ge I(X;U), \qquad R_y \ge I(Y;V), \qquad R_c \le I(X;U) + I(Y;V) - I(XY;UV) \le R_x + R_y - I(XY;UV)$$

which means that $R \in \mathcal{R}_{UV}$ for the given $UV$. Hence, $R = \sum_{q\in\mathcal{Q}} p(q) R_q \in \mathcal{R}_{\text{out}}$. Since $p(q)$ and the $R_q \in \mathcal{R}_{\text{out}}$ were arbitrary, we conclude that $\mathcal{R}_{\text{out}}$ is convex. □

Appendix D. Mixing Lemmas

In this appendix, we prove Lemmas 6 and 7, which are used in proving the outer bound.

Consider the elementary Shannon inequalities, stated in the following two lemmas. The variables A, B, α, β, γ, δ appearing in the lemmas denote arbitrary discrete random variables.

Lemma 4:

$$I(A;\alpha) = I(A;\alpha,\gamma) - I(A,\alpha;\gamma) + I(\alpha;\gamma).$$

Proof:

$$I(A;\gamma \mid \alpha) = I(A;\alpha,\gamma) - I(A;\alpha) = I(A,\alpha;\gamma) - I(\gamma;\alpha).$$

Lemma 5:

$$I(A;\alpha) + I(B;\beta) = I(A;B) + I(\alpha;\beta) - I(A,\alpha;B,\beta) + I(A,B;\alpha,\beta).$$

Proof:

$$I(A,\alpha;B,\beta) - I(A,B;\alpha,\beta) = H(A,\alpha) + H(B,\beta) - H(A,B) - H(\alpha,\beta) = -I(A;\alpha) - I(B;\beta) + I(A;B) + I(\alpha;\beta).$$

Lemma 6 follows directly from the preceding lemmas.

Lemma 6:

$$I(\alpha;\beta) \ge I(A;\alpha) + I(B;\beta) - I(A,B;\alpha,\beta)$$

with equality if and only if I(A, α; B, β) = I(A; B).

Proof: Rearrange Lemma 5 to get

$$I(\alpha;\beta) = I(A;\alpha) + I(B;\beta) - I(A,B;\alpha,\beta) + \big[I(A,\alpha;B,\beta) - I(A;B)\big].$$

The lemma now follows readily from the preceding expression: we obtain equality in the lemma if (and only if) the bracketed term is zero, and in any case the bracketed term is nonnegative, since

$$I(A,\alpha;B,\beta) - I(A;B) = H(\alpha \mid A) + H(\beta \mid B) - H(\alpha,\beta \mid A,B) = H(\alpha \mid A) - H(\alpha \mid A,B) + H(\beta \mid B) - H(\beta \mid A,B,\alpha) \ge 0,$$

where the inequality is due to the fact that conditioning does not increase entropy. □

Lemma 7: If $U_i = (\gamma, A^{i-1})$, then

$$I(A^n;\gamma) = \sum_{i=1}^{n} I(A_i;U_i) - \sum_{i=2}^{n} I(A_i;A^{i-1}).$$

Proof: In Lemma 4, put $A = A_i$, $\alpha = A^{i-1}$, and keep $\gamma$ as is; note that $U_1 = \gamma$. Hence, substituting and summing from $i = 2$ to $n$ yields

$$\sum_{i=2}^{n} I(A_i;A^{i-1}) = \sum_{i=2}^{n} I(A_i;U_i) - I(A^n;\gamma) + I(A_1;\gamma) = \sum_{i=2}^{n} I(A_i;U_i) - I(A^n;\gamma) + I(A_1;U_1) = \sum_{i=1}^{n} I(A_i;U_i) - I(A^n;\gamma).$$
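Since Lemmas 4-7 are purely algebraic relations among mutual informations, they can be sanity-checked numerically. The sketch below (ours, illustrative only) verifies Lemma 6 on a randomly drawn four-variable joint distribution; the alphabet sizes are arbitrary.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def marginal(joint, keep):
    drop = tuple(ax for ax in range(joint.ndim) if ax not in keep)
    return joint.sum(axis=drop) if drop else joint

def mi(joint, axes_a, axes_b):
    """I(A;B) in bits, where A and B are the variable groups on axes_a and axes_b."""
    return (entropy(marginal(joint, axes_a)) + entropy(marginal(joint, axes_b))
            - entropy(marginal(joint, axes_a + axes_b)))

rng = np.random.default_rng(1)
p = rng.random((2, 3, 2, 3))      # random joint over (A, B, alpha, beta) on axes 0..3
p /= p.sum()

lhs = mi(p, (2,), (3,))                                            # I(alpha; beta)
rhs = mi(p, (0,), (2,)) + mi(p, (1,), (3,)) - mi(p, (0, 1), (2, 3))
print(lhs >= rhs - 1e-12, lhs, rhs)   # Lemma 6: the inequality holds
```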

Appendix E. Alternative Representations of Rin and Rout

Here we show that the alternative representations of the inner and outer bound surfaces introduced in Remark 2 are in fact equivalent, i.e., $\mathcal{R}_{\text{in}} = \mathcal{R}'_{\text{in}} = \mathcal{R}''_{\text{in}}$ and $\mathcal{R}_{\text{out}} = \mathcal{R}'_{\text{out}} = \mathcal{R}''_{\text{out}}$.

We show first that $\mathcal{R}_{\text{out}} = \mathcal{R}''_{\text{out}}$. Suppose $R \in \mathcal{R}_{\text{out}}$. Then, using $U - X - Y$ and $X - Y - V$, we have

$$R_c \le I(U;V) - I(U;V \mid XY) = I(X;U) + I(Y;V) - I(XY;UV) \tag{25}$$

which implies

$$R_x \ge I(X;U) \ge I(XY;UV) - I(Y;V) + R_c = I(XY;U \mid V) + R_c$$

and similarly $R_y \ge I(XY;V \mid U) + R_c$; and

$$R_x + R_y \ge I(X;U) + I(Y;V) \ge I(XY;UV) + R_c.$$

We conclude that $R \in \mathcal{R}''_{\text{out}}$, hence $\mathcal{R}_{\text{out}} \subseteq \mathcal{R}''_{\text{out}}$. A symmetrical argument shows $\mathcal{R}''_{\text{out}} \subseteq \mathcal{R}_{\text{out}}$, proving $\mathcal{R}_{\text{out}} = \mathcal{R}''_{\text{out}}$.

Next, we show that $\mathcal{R}_{\text{out}}$ and $\mathcal{R}'_{\text{out}}$ are identical. To this end, note that these sets correspond to regions in the positive orthant $\mathbb{R}^3_+$, and that two such regions are identical if they have the same surfaces. Following the presentation in Section VII, the surfaces of $\mathcal{R}_{\text{out}}$ and $\mathcal{R}'_{\text{out}}$ are

$$r_{\text{out}}(r_x,r_y) = \max_{UV \in \mathcal{C}_{\text{out}}(r_x,r_y)} \big[r_x + r_y - I(XY;UV)\big], \qquad r'_{\text{out}}(r_x,r_y) = \max_{UV \in \mathcal{C}_{\text{out}}(r_x,r_y)} \big[I(U;V) - I(U;V \mid XY)\big]$$

where

$$\mathcal{C}_{\text{out}}(r_x,r_y) = \{UV \in \mathcal{P}_{\text{out}} : I(U;X) = r_x,\ I(V;Y) = r_y\}.$$

For $UV \in \mathcal{C}_{\text{out}}(r_x,r_y)$ we have $I(U;X) = r_x$ and $I(V;Y) = r_y$, so using (25),

$$r'_{\text{out}}(r_x,r_y) = \max_{UV \in \mathcal{C}_{\text{out}}(r_x,r_y)} \big[I(X;U) + I(Y;V) - I(XY;UV)\big] = \max_{UV \in \mathcal{C}_{\text{out}}(r_x,r_y)} \big[r_x + r_y - I(XY;UV)\big] = r_{\text{out}}(r_x,r_y).$$

Thus, the desired equivalence $\mathcal{R}_{\text{out}} = \mathcal{R}'_{\text{out}}$ follows simply from the fact that at the surfaces, the inequalities defining each region become equalities.

The same line of argument as above of course also shows $\mathcal{R}_{\text{in}} = \mathcal{R}'_{\text{in}} = \mathcal{R}''_{\text{in}}$.

Appendix F. Simplification of Convex Hulls

In this appendix, we argue geometrically that the expressions for the convex hulls of the inner bound regions simplify to just one term in both the binary and Gaussian cases. To discuss both cases simultaneously, let us represent the surface of either inner bound by a positive-valued function $f: D \to \mathbb{R}_+$. Here, $D$ is a square region

$$D = \{r = (x,y) \in \mathbb{R}^2 : 0 \le x \le M,\ 0 \le y \le M\}$$

and $M$ is a positive constant. In the binary case, $f(r) = s(r)$ and $D = [0,1]\times[0,1]$; in the Gaussian case, $f(r) = S(r)$ and $D = [0,\infty)\times[0,\infty)$. Some important properties shared by both cases are that, for all $r \in D$,

$$f(x,y) \ge 0, \qquad f(0,y) = f(x,0) = 0, \qquad f_x(r), f_y(r) > 0, \qquad f_{xx}(r), f_{yy}(r) < 0$$

where the subscripts denote partial derivatives.

Denote the convex hull of f(r) by c(r). Generically, the boundary of the convex hull is

$$c(r) = \max_{\theta}\big[\theta f(r_1) + \bar{\theta} f(r_2)\big]$$

where the maximum is over all triples $(\theta, r_1, r_2)$ such that $r = \theta r_1 + \bar{\theta} r_2$, $\theta \in [0,1]$, and $r_1, r_2 \in D$. However, as argued next, for the cases under study this simplifies to

$$c(r) = \max_{\theta}\, \theta f(r')$$

where $r = \theta r'$.

The convex hull of a surface can be characterized in terms of its tangent planes. Given any point $r' = (x', y') \in D$, if its tangent plane lies entirely above the surface, then $(r', f(r'))$ is on the convex hull. If the tangent plane cuts through the surface at one or more other points, or if the tangent plane lies below the surface, then $(r', f(r'))$ is not on the convex hull. If the tangent plane intersects the surface at exactly two points, then both points are on the convex hull.

The tangent plane at an arbitrary point $r' = (x', y') \in D$ is the set of points satisfying

$$z(x,y) = f_x (x - x') + f_y (y - y') + z'$$

where the partial derivatives are evaluated at $r'$, i.e., $f_x = f_x(r')$, $f_y = f_y(r')$, and $z' = f(r')$. The tangent plane intersects the $z = 0$ plane in a line $\ell$. Setting $z(r) = 0$ and solving,

$$y = mx + b, \qquad \text{where } m = -(f_x/f_y), \quad b = \frac{1}{f_y}\big[x' f_x + y' f_y - z'\big].$$

Since fx, fy > 0, the slope m = −(fx/fy) is negative. This line intersects the positive orthant whenever the intercept b ≥ 0, in which case the tangent plane cuts through the surface, since f ≥ 0. Thus, the only points on the original surface f(x,y) that can be on the convex hull are those for which b ≤ 0.

Next, consider any path through $D$ along a line segment $y = \alpha x$, $\alpha > 0$, starting from one of the "outer edges" of $D$, where $x = M$ or $y = M$, and consider what happens to the tangent plane's line $\ell$ of intersection with the $z = 0$ plane as we move in along the path toward the origin $(0, 0)$. Initially, the tangent planes lie entirely above the surface, and the intercept of $\ell$ is negative, $b < 0$. This intercept increases along the path until $b = 0$, at which point $\ell$ intersects $(0, 0)$. Here, the tangent plane contains a line segment attached at one end to the point of tangency, and at the other end to the point $(r, f(r)) = (0, 0, 0)$; everywhere else, the tangent plane is above the surface. Continuing toward the origin, all other points along the path have tangent planes such that $\ell$ has a positive intercept $b > 0$; hence, these points are excluded from the convex hull.

These considerations imply that the convex hull $c(r)$ is composed entirely of two kinds of points. First, points which coincide with the original surface, $c(r) = \theta f(r')$ with $\theta = 1$. These points occur at values of $r = (x, y)$ "up and to the right" of $(0, 0)$. Second, points along line segments connecting surface points "up and to the right," $(r', f(r'))$, with the point $(r, f(r)) = (0, 0, 0)$; that is, $c(r) = \theta f(r') + \bar{\theta} f(0,0) = \theta f(r')$, where $r = \theta r'$ and $\theta \in [0, 1]$. Hence, for all $r \in D$, $c(r)$ has the desired form.

Two more examples of functions that behave in the same way just described are f(x, y) = (1 − (1 − x)2)(1 − (1 − y)2) and f(x, y) = xy, with D=[0,1]×[0,1].
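The simplification can also be checked numerically. The Python sketch below (ours, illustrative only) compares the generic two-point form of the hull with the one-term form $c(r) = \max_\theta \theta f(r')$, $r = \theta r'$, for the example surface $f(x,y) = xy$ on $[0,1]\times[0,1]$ mentioned above; the grids and test points are arbitrary.

```python
import numpy as np

def f(x, y):
    # Example surface from this appendix: f(x, y) = x * y on D = [0, 1]^2.
    return x * y

def hull_two_point(r, n=41):
    """Generic form: max over theta*f(r1) + (1-theta)*f(r2) with r = theta*r1 + (1-theta)*r2."""
    r = np.asarray(r, dtype=float)
    grid = np.linspace(0.0, 1.0, n)
    best = f(*r)
    for theta in np.linspace(0.01, 1.0, 100):
        for x2 in grid:
            for y2 in grid:
                r2 = np.array([x2, y2])
                r1 = (r - (1.0 - theta) * r2) / theta
                if np.all(r1 >= -1e-9) and np.all(r1 <= 1.0 + 1e-9):
                    best = max(best, theta * f(*r1) + (1.0 - theta) * f(*r2))
    return best

def hull_one_term(r):
    """Simplified form argued above: c(r) = max_theta theta * f(r / theta), with r / theta in D."""
    x, y = r
    if x == 0.0 or y == 0.0:
        return 0.0
    best = 0.0
    for theta in np.linspace(max(x, y), 1.0, 200):   # need r/theta inside [0,1]^2
        best = max(best, theta * f(x / theta, y / theta))
    return best

for r in [(0.3, 0.6), (0.5, 0.5), (0.8, 0.2)]:
    print(r, round(hull_two_point(r), 3), round(hull_one_term(r), 3))   # the two forms agree
```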

Appendix G. Properties of Gaussian Mutual Information

Our analysis of the Gaussian pattern recognition problem relies on the following well-known results, stated below without proof.

Lemma 8: The mutual information between two Gaussian random vectors X and Y depends only on the matrices of correlation coefficients. Specifically

$$I(X;Y) = \tfrac{1}{2}\log\det C_{x,x} - \tfrac{1}{2}\log\det C_{x,x|y}$$

where

$$C_{x,x|y} = C_{x,x} - C_{x,y} C_{y,y}^{-1} C_{y,x}.$$

In the special case Y = X + W, where X and W are independent Gaussian random variables with variances P and N, respectively, we have

$$I(X;Y) = \tfrac{1}{2}\log\left(1 + \frac{P}{N}\right) = -\tfrac{1}{2}\log\left(1 - \rho_{x,y}^2\right)$$

where the correlation coefficient $\rho_{x,y} = \sqrt{P/(P+N)}$.

Lemma 9: If $X$, $Y$, and $Z$ are zero-mean Gaussian random vectors that form a Markov chain $X - Y - Z$, then

$$C_{x,z} = C_{x,y} C_{y,y}^{-1} C_{y,z}.$$

Note that for dimension one, $X - Y - Z$ implies $\rho_{x,z} = \rho_{x,y}\rho_{y,z}$.

Lemma 10: Let $X$, $Y$, $U$, and $V$ be jointly Gaussian random variables such that $U - X - Y$ and $X - Y - V$ are Markov chains. Then the matrix of correlation coefficients $C_{xy,uv}$ decomposes as

$$C_{xy,uv} = \begin{bmatrix} 1 & \rho_{xy} \\ \rho_{xy} & 1 \end{bmatrix}\begin{bmatrix} \rho_{xu} & 0 \\ 0 & \rho_{yv} \end{bmatrix}.$$

This lemma follows immediately by using Lemma 9 to obtain the substitutions $C_{x,v} = C_{x,y} C_{y,y}^{-1} C_{y,v} = \rho_{xy}\rho_{yv}$ and $C_{u,y} = C_{u,x} C_{x,x}^{-1} C_{x,y} = \rho_{ux}\rho_{xy}$.
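As a quick cross-check of the scalar case of Lemma 8, the following Python snippet (ours) evaluates the three equivalent expressions for $I(X;Y)$ when $Y = X + W$; the variances $P$ and $N$ are arbitrary illustrative values, and the result is in nats.

```python
import numpy as np

P, N = 2.0, 0.5                          # Var(X) and Var(W) in Y = X + W (illustrative)
rho = np.sqrt(P / (P + N))               # correlation coefficient of X and Y

i_snr = 0.5 * np.log(1.0 + P / N)        # (1/2) log(1 + P/N)
i_rho = -0.5 * np.log(1.0 - rho ** 2)    # -(1/2) log(1 - rho^2)

# Determinant form of Lemma 8, with 1x1 "matrices" (covariances work equally well here).
C_xx, C_yy, C_xy = P, P + N, P
C_xx_given_y = C_xx - C_xy * (1.0 / C_yy) * C_xy
i_det = 0.5 * np.log(C_xx) - 0.5 * np.log(C_xx_given_y)

print(i_snr, i_rho, i_det)               # all three values agree
```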

Footnotes

Communicated by P. L. Bartlett, Associate Editor for Pattern Recognition, Statistical Learning and Inference.

1

If this were not the case, visual experience would be like watching television white noise.

2

Grenander [4] and Mumford [5] have argued that four “universal transformations” (noise and blur, superposition, domain warping, and interruptions) account for most of the ambiguity and variability in naturally occurring signals.

3

One should probably beware of the strange (and unnecessary) interpretation that, as we upgrade our camera (i.e., as we increase n), the number of patterns in the world consequently increases. More naturally, we may view the world as always presenting a practically unlimited number of patterns, while the number of patterns that can be taken advantage of by a system grows with increasing information processing resources.

4

In particular, our formulation is consistent with the “General Pattern Theory” framework of Grenander and colleagues, which has provided a basis for much of the generative modeling work in pattern recognition research [11].

5

This step relies on the long Markov chain $U - X - Y - V$ and the Markov lemma.

6

Alternatively, one can search for ways to tighten the outer bound.

7
The random variables in these sets behave differently in mutual information computations. For example, compare $I(XY;UV)$ for $UV$ in either set, using the same set of distributions $p(u \mid xq)$, $p(v \mid yq)$ as ingredients. For $UV \in \mathcal{P}_{\text{mix}}$

$$I_{\text{mix}} = I(XY;UV) = \sum_{xyuv} p(xy)\left(\sum_{q} p(q)\,p(u \mid xq)\,p(v \mid yq)\right)\log\frac{\sum_{q} p(q)\,p(u \mid xq)\,p(v \mid yq)}{\sum_{q} p(q)\,p(u \mid q)\,p(v \mid q)}$$

whereas for $UV \in \mathcal{P}_{\text{conv}}$

$$I_{\text{conv}} = I(XY;UV) = \sum_{q} p(q) \sum_{xyuv} p(xy)\,p(u \mid xq)\,p(v \mid yq)\,\log\frac{p(u \mid xq)\,p(v \mid yq)}{p(u \mid q)\,p(v \mid q)}.$$

It follows from the log-sum inequality [30, p. 29] that $I_{\text{conv}} \ge I_{\text{mix}}$.
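The comparison in this footnote can be illustrated numerically. In the sketch below (ours), $I_{\text{conv}}$ is taken as the per-component average of $I(XY;UV)$ computed under each component channel $p(u \mid xq)\,p(v \mid yq)$, and $I_{\text{mix}}$ as $I(XY;UV)$ under the mixture channel; by convexity of mutual information in the channel, the average dominates the mixture, matching $I_{\text{conv}} \ge I_{\text{mix}}$. Alphabet sizes and distributions are arbitrary.

```python
import numpy as np

def mi(joint):
    """I(A;B) in bits for a 2-D joint table with A on axis 0 and B on axis 1."""
    joint = joint / joint.sum()
    pa = joint.sum(axis=1, keepdims=True)
    pb = joint.sum(axis=0, keepdims=True)
    mask = joint > 0
    return float((joint[mask] * np.log2(joint[mask] / (pa @ pb)[mask])).sum())

rng = np.random.default_rng(2)
nx, ny, nu, nv, nq = 2, 2, 3, 3, 2
p_xy = rng.dirichlet(np.ones(nx * ny)).reshape(nx, ny)
p_u = rng.dirichlet(np.ones(nu), size=(nx, nq))   # p(u | x, q)
p_v = rng.dirichlet(np.ones(nv), size=(ny, nq))   # p(v | y, q)
p_q = rng.dirichlet(np.ones(nq))

def joint_for_channel(chan_uv_given_xy):
    """Joint over ((x,y),(u,v)) as a 2-D table from p(xy) and a channel p(uv | xy)."""
    j = p_xy[:, :, None, None] * chan_uv_given_xy
    return j.reshape(nx * ny, nu * nv)

# Per-component channels p_q(uv | xy) = p(u | x, q) p(v | y, q), and their mixture.
chans = [p_u[:, None, q, :, None] * p_v[None, :, q, None, :] for q in range(nq)]
chan_mix = sum(p_q[q] * chans[q] for q in range(nq))

i_avg = sum(p_q[q] * mi(joint_for_channel(chans[q])) for q in range(nq))
i_mix = mi(joint_for_channel(chan_mix))
print(i_avg >= i_mix - 1e-12, i_avg, i_mix)   # the averaged value dominates the mixture
```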
8

The notation is ours.

9

But see the footnote at the end of Section VIII.

10

This is related to the problem addressed in [24], [25], except that in that work there is no requirement that the memory data be compressed.

11

Similar comments are made in [24].

12

The optimization over distributions p(xyuv) reduces to a search over conditional distributions p(uv | xy) because p(xy) is fixed. Details of the optimization algorithm are given in [35].

13

These are called Hinton diagrams in the machine learning literature, after their inventor Geoffrey Hinton.

14

These distributions are not unique, as the mutual informations are unchanged under various reassignments of values of x, y, u, v, and consequent rearrangements of the entries of p(uv | xy); the distributions shown have been accordingly rearranged into a common format to facilitate comparison.

15

During the review process for this paper, Servetto indeed claimed a solution to the distributed source coding problem using a novel approach. Unfortunately, he suffered an untimely death on 7/24/2007, before finalizing his work. The most recent public draft of his paper on this topic is available on the arXiv (see [50]).

16

This last condition ensures $R_{cq} \le R_{xq} + R_{yq} - I(XY; U_qV_q)$.

Contributor Information

M. Brandon Westover, Department of Neurology, Massachusetts General Hospital, Boston, MA 02114-2622 USA.

Joseph A. O’Sullivan, Department of Electrical Engineering, Washington University, St. Louis, MO 63130 USA.

References

  • [1].Barlow HB, “Sensory mechanisms, the reduction of redundancy, and intelligence,” in Proc. Symp. Mechanization of Thought Processes, 1959, pp. 537–574. [Google Scholar]
  • [2].Van Essen DC and Anderson CH, “Information processing strategies and pathways in the primate retina and visual cortex,” in An Introduction to Neural and Electronic Networks, Lau C, Zornetzer SF, and Davis JL, Eds. San Diego, CA: Academic, 1995, pp. 43–72. [Google Scholar]
  • [3].Eddington AS, The Mathematical Theory of Relativity, 3rd ed Cambridge, U.K.: Cambridge Univ. Press, 1963. [Google Scholar]
  • [4].Grenander U, Elements of Pattern Theory, ser Johns Hopkins Series in the Mathematical Sciences. Baltimore, MD: Johns Hopkins Univ. Press, 1996. [Google Scholar]
  • [5].Mumford D, “Pattern theory: A unifying perspective,” in Perception as Bayesian Inference, Knill DC and Richards W, Eds. Cambridge, U.K.: Cambridge Univ. Press, 1996, ch. 1, pp. 25–62. [Google Scholar]
  • [6].Schmid NA and O’Sullivan JA, “Performance prediction methodology for biometric systems using a large deviations approach,” IEEE Trans. Signal Process., vol. 52, no. 10, pp. 3036–3045, Oct. 2004. [Google Scholar]
  • [7].Bishop CM, Pattern Recognition and Machine Learning. New York: Springer, 2006. [Google Scholar]
  • [8].MacKay DJC, Information Theory, Inference, and Learning Algorithms. Cambridge, U.K.: Cambridge Univ. Press, 2003. [Online]. Available: http://www.inference.phy.cam.ac.uk/mackay/itila/ [Google Scholar]
  • [9].Kearns MJ and Vazirani UV, An Introduction to Computational Learning Theory. Cambridge, MA: MIT Press, 1994. [Google Scholar]
  • [10].Vapnik VN, Statistical Learning Theory. New York: Wiley, 1998. [Google Scholar]
  • [11].Grenander U and Miller M, Pattern Theory: From Representation to Inference. New York: Oxford Univ. Press, 2006. [Google Scholar]
  • [12].Burges CJC, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998. [Google Scholar]
  • [13].Barlow HB, “The coding of sensory messages,” in Current Problems in Animal Behavior, Thorpe WH and Zangwill OL, Eds. Cambridge, U.K.: Cambridge Univ. Press, 1961. [Google Scholar]
  • [14].Barlow HB, “Understanding natural vision,” in Physical and Biological Processing of Images, ser. Springer Series in Information Sciences, Braddick OJ and Sleigh AC, Eds. Berlin, Germany: Springer-Verlag, 1983, vol. 11, pp. 2–14. [Google Scholar]
  • [15].Barlow H, “What is the computational goal of the neocortex?,” in Large Scale Neuronal Theories of the Brain, Koch C and Davis JL, Eds. Cambridge, MA: MIT Press, 1994, ch. 1, pp. 1–22. [Google Scholar]
  • [16].Barlow H, “Redundancy reduction revisited,” Network: Comput. Neural Syst., vol. 12, no. 3, pp. 241–253, 2001. [PubMed] [Google Scholar]
  • [17].Gardner-Medwin AR and Barlow HB, “The limits of counting accuracy in distributed neural representations,” Neural Comput, vol. 13, no. 3, pp. 477–504, 2001. [DOI] [PubMed] [Google Scholar]
  • [18].Rieke F, Warland D, de Ruyter van Steveninck R, and Bialek W, Spikes: Exploring the Neural Code. Cambridge, MA: MIT Press, 1996. [Google Scholar]
  • [19].Lennie P, “The cost of cortical computation,” Curr. Biol, vol. 13, no. 6, pp. 493–497, 2003. [DOI] [PubMed] [Google Scholar]
  • [20].LeCun Y, Bottou L, Bengio Y, and Haffner P, “Gradient-based learning applied to document recognition,” Proc. IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998. [Google Scholar]
  • [21].Jain AK, Duin RPW, and Mao J, “Statistical pattern recognition: A review,” IEEE Trans. Pattern Anal. Machine Intell, vol. 22, no. 1, pp. 4–37, Jan. 2000. [Google Scholar]
  • [22].Ahlswede RF and Csiszár I, “Hypothesis testing with communication constraints,” IEEE Trans. Inf. Theory, vol. IT-32, no. 4, pp. 533–542, Jul. 1986. [Google Scholar]
  • [23].Han TS and Amari SI, “Statistical inference under multiterminal data compression,” IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2300–2324, Oct. 1998. [Google Scholar]
  • [24].Ishwar P, Prabhakaran VM, and Ramchandran K, “Toward a theory for video coding using distributed compression principles,” in Proc. Int. Conf. Image Processing (ICIP 2), Barcelona, Spain, Sept. 2003, pp. 687–690. [Google Scholar]
  • [25].Ishwar P, Prabhakaran VM, and Ramchandran K, “On joint classification and compression in a distributed source coding framework,” in Proc. IEEE Workshop on Statistical Signal Processing, St. Louis, MO, 2003, pp. 25–28. [Google Scholar]
  • [26].Oehler KL and Gray RM, “Combining image compression and classification using vector quantization,” IEEE Trans. Pattern Anal. Machine Intell, vol. 17, no. 5, pp. 461–473, May 1995. [Google Scholar]
  • [27].Hinton GE and Salakhutdinov RR, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, no. 5786, pp. 504–507, Jul. 2006. [DOI] [PubMed] [Google Scholar]
  • [28].Dong Y, Chang S, and Carin L, “Rate-distortion bound for joint compression and classification with application to multi-aspect sensing,” IEEE Sensors J, vol. 5, no. 3, pp. 481–492, Jun. 2005. [Google Scholar]
  • [29].Csiszár I and Körner J, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic, 1981. [Google Scholar]
  • [30].Cover TM and Thomas JA, Elements of Information Theory. New York: Wiley, 1990. [Google Scholar]
  • [31].Berger T, “Multiterminal source coding,” in The Information Theory Approach to Communications, Longo G, Ed. New York: Springer-Verlag, 1977. [Google Scholar]
  • [32].Tung SY, “Multiterminal Source Coding,” Ph.D. dissertation, Cornell Univ., Ithaca, NY, 1978. [Google Scholar]
  • [33].Berger T, The Information Theory Approach to Communications. New York: Springer-Verlag, 1977, ch. Multiterminal Source Coding. [Google Scholar]
  • [34].Shannon CE, “A mathematical theory of communication,” Bell Syst. Tech. J, vol. 27, pp. 379–423, 623–656, 1948. [Google Scholar]
  • [35].Westover MB, “Image Representation and Pattern Recognition in Brains and Machines,” Ph.D. dissertation, Washington Univ., St. Louis, MO, 2004. [Google Scholar]
  • [36].Olshausen BA and Field DJ, “Sparse coding with an overcomplete basis set: A strategy employed by V1?,” Vision Res, vol. 37, pp. 3311–3325, 1997. [DOI] [PubMed] [Google Scholar]
  • [37].Sejnowski TJ and Bell AJ, “The “independent components” of natural scenes are edge filters,” Vision Res, vol. 37, pp. 3327–3338, 1997. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [38].Hyvärinen A, “Survey on independent component analysis,” Neural Comput. Sur, vol. 2, pp. 94–128, 1999. [Google Scholar]
  • [39].Schwartz O and Simoncelli EP, “Natural signal statistics and sensory gain control,” Nature Neurosc, vol. 4, no. 8, pp. 819–825, Aug. 2001. [DOI] [PubMed] [Google Scholar]
  • [40].O’Sullivan JA, Singla N, and Westover MB, “Successive refinement for pattern recognition,” in Proc. IEEE Information Theory Workshop, Punta del Este, Uruguay, 2006, pp. 141–145. [Google Scholar]
  • [41].Srivastava A, Lee AB, Simoncelli EP, and Zhu SC, “On advances in statistical modeling of natural images,” J. Math. Imaging and Vision, vol. 18, no. 1, pp. 17–33, Jan. 2003. [Google Scholar]
  • [42].Lee AB, Mumford D, and Huang J, “Occlusion models for natural images: A statistical study of a scale-invariant dead leaves model,” Int.J. Comp. Vision, vol. 41, pp. 35–59, 2001. [Google Scholar]
  • [43].Zhu SC, “Statistical modeling and conceptualization of visual patterns,” IEEE Trans. Pattern Anal. Machine Intell, vol. 25, no. 6, pp. 691–712, Jun. 2003. [Google Scholar]
  • [44].Heeger DJ and Bergen JR, “Pyramid-based texture analysis/synthesis,” in Proc. 22nd Annu. Conf. Computer Graphics and Interactive Techniques, 1995, pp. 229–238, ACM Press. [Google Scholar]
  • [45].Zhu SC, Wu YN, and Mumford D, “Filters, random fields and maximum entropy (frame): Toward a unified theory for texture modeling,” Int. J. Comp. Vision, vol. 27, no. 2, pp. 107–126, 1998. [Google Scholar]
  • [46].De Bonet JS and Viola PA, “A nonparametric multi-scale statistical model for natural images,” in Advances in Neural Information Processing Systems, Jordan MI, Kearns MJ, and Solla SA, Eds. Cambridge, MA: MIT Press, 1998, vol. 10. [Google Scholar]
  • [47].Portilla J and Simoncelli EP, “A parametric texture model based on joint statistics of complex wavelet coefficients,” Int. J. Comp. Vision, vol. 40, no. 1, pp. 49–71, 2000. [Google Scholar]
  • [48].Jelinek F, Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press, 1999. [Google Scholar]
  • [49].Blanchard G and Geman D, “Hierarchical testing designs for pattern recognition,” Ann. Statist, vol. 33, pp. 1155–1202, 2005. [Google Scholar]
  • [50].Servetto S, Multiterminal Source Coding With Two Encoders—I: A Computable Outer Bound Apr. 2006. [Online]. Available: http://www.arxiv.org/abs/cs.IT/0604005/
