Abstract
Abstract—Suppose a string X₁ⁿ = (X1, …, Xn) generated by a memoryless source (Xn)n≥1 with distribution P is to be compressed with distortion no greater than D ≥ 0, using a memoryless random codebook with distribution Q. The compression performance is determined by the “generalized asymptotic equipartition property” (AEP), which states that the probability of finding a D-close match between X₁ⁿ and any given codeword Y₁ⁿ is approximately 2^{−nR(P, Q, D)}, where the rate function R(P, Q, D) can be expressed as an infimum of relative entropies. The main purpose here is to remove various restrictive assumptions on the validity of this result that have appeared in the recent literature. Necessary and sufficient conditions for the generalized AEP are provided in the general setting of abstract alphabets and unbounded distortion measures. All possible distortion levels D ≥ 0 are considered; the source (Xn)n≥1 can be stationary and ergodic; and the codebook distribution can have memory. Moreover, the behavior of the matching probability is precisely characterized, even when the generalized AEP is not valid. Natural characterizations of the rate function R(P, Q, D) are established under equally general conditions.
Index Terms: Asymptotic equipartition property, data compression, large deviations, pattern-matching, random codebooks, rate-distortion theory
I. Introduction
Suppose a random string X₁ⁿ = (X1, …, Xn) produced by a memoryless source (Xn : n ≥ 1) with distribution P on a source alphabet S is to be compressed with distortion no more than some D ≥ 0 with respect to a single-letter distortion measure ρ(x, y).1 The basic information-theoretic model for understanding the best performance that can be achieved is the study of random codebooks. If we generate memoryless random strings according to some distribution Q on the reproduction alphabet T, we would like to know how many such strings are needed so that, with high probability, we will be able to find at least one codeword that matches the source string with distortion D or less. The crucial mathematical problem in answering this question is the evaluation of the probability that a given, typical source string x₁ⁿ will be D-close to a random codeword Y₁ⁿ = (Y1, …, Yn) drawn from the product distribution Qⁿ. This probability can be expressed as
Qⁿ(B(X₁ⁿ, D))  (1)

where B(X₁ⁿ, D) denotes the “distortion ball” consisting of all reproduction strings y₁ⁿ ∈ Tⁿ that are within distortion D (or less) from X₁ⁿ; note that the matching probability in (1) is itself a random quantity, as it depends on the source string X₁ⁿ.
The importance of evaluating (1) was already identified by Shannon in his classic study of rate-distortion theory [4], where he showed that, for the best codebook distribution Q = Q*, we have
(Q*)ⁿ(B(X₁ⁿ, D)) ≈ 2^{−nR(P, D)}  (2)
where R(P, D) is the rate-distortion function of the source.
The more general question of evaluating the matching probability (1) for distributions Q perhaps different from the optimal reproduction distribution Q*, arises naturally in a variety of contexts, including problems in pattern-matching, mismatched codebooks, Lempel-Ziv compression, combinatorial optimization on random strings, and others; see, e.g., [5]-[13], and the review and references in [14]. In this case, Shannon’s estimate (2) is replaced by the so-called “generalized asymptotic equipartition property” (or generalized AEP), which states that
lim_{n→∞} −(1/n) log Qⁿ(B(X₁ⁿ, D)) = R(P, Q, D)  a.s.  (3)
where “a.s.” stands for “almost surely” and refers to the random source string (Xn)n≥1. The rate function R(P, Q, D) is defined in a way that closely resembles the definition of the rate-distortion function,

R(P, Q, D) := inf { H(dist(U, V) ∥ P × Q) : U ~ P, E[ρ(U, V)] ≤ D }

where H(·∥·) denotes the relative entropy, dist(U, V) denotes the joint distribution of (U, V), and the infimum is over all (bivariate) probability distributions of random variables (U, V) with values in S and T, respectively, such that U has distribution P and the expected distortion E[ρ(U, V)] ≤ D. (For a broad introduction to the generalized AEP, its applications and refinements, see [14] and the references therein.)
Although much is known about the generalized AEP and about R(P, Q, D) [14], all known results are established under certain restrictive conditions. In particular, it is always assumed that Dave(P, Q) < ∞ and that D ≠ Dmin(P, Q), where

Dave(P, Q) := E[ρ(X, Y)]  and  Dmin(P, Q) := inf{D ≥ 0 : R(P, Q, D) < ∞}

for independent X ~ P and Y ~ Q. Often, the codebook distribution is required to be memoryless, and when it is not, it is further assumed that the distortion measure is bounded.
The main point of this paper is to remove these constraints, and to analyze which (if any) are essential for the validity of the generalized AEP. Our motivation is twofold. On one hand, unnecessarily stringent conditions make the theoretical picture incomplete. On the other, there are applications which naturally require more general statements, such as universal lossy compression, where the source distribution is not known a priori [3].
Thus motivated, we give necessary and sufficient conditions for the generalized AEP in (3), and we precisely characterize the behavior of the matching probability in the pathological situations when the generalized AEP fails. Our results hold for all values of D, and they cover arbitrary abstract alphabets and distortion measures. We also allow the source to be stationary and ergodic, and the codebook distribution to have a certain amount of memory. We similarly extend the characterization of the rate function R(P, Q, D) to the same level of generality. We show that it can always be written as a convex dual, and that a minimizer W in the definition of R(P, Q, D) always exists (unless, of course, the infimum is taken over the empty set).
II. Notation and Assumptions
The source and reproduction alphabets S and T, along with their associated σ-algebras, are assumed to be Borel spaces2 and the distortion measure ρ : S × T ↦ [0, ∞) is assumed to be product measurable. P and Q denote generic probability distributions on S and T, respectively, and we let X ~ P and Y ~ Q be independent random variables (r.v.). Define3

ρQ(x) := ess inf ρ(x, Y)

where the essential infimum is taken with respect to Y ~ Q.
If W is a probability distribution on S × T, then we use WS to denote the marginal distribution of W on S, and similarly for WT. Define

R(P, Q, D) := inf_{W ∈ W(P, D)} H(W ∥ P × Q)

where the infimum is over the subset of probability distributions on S × T defined by

W(P, D) := {W : WS = P and ∫ ρ dW ≤ D}

and where H(μ∥ν) denotes the relative entropy (in nats), namely,

H(μ∥ν) := ∫ log (dμ/dν) dμ if μ ≪ ν, and H(μ∥ν) := +∞ otherwise.

We use the convention that inf ∅ := +∞ and logarithms are base e. For independent r.v.s X ~ P and Y ~ Q, define

Λ(P, Q, λ) := E[log E[e^{λρ(X, Y)} ∣ X]], λ ∈ ℝ,  and  Λ*(P, Q, D) := sup_{λ ≤ 0} [λD − Λ(P, Q, λ)].
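For finite alphabets, the quantities just defined reduce to finite sums and can be computed directly. The following minimal Python sketch is our own illustration (the function names and the binary Hamming example are not from the text); it computes ρQ, Dave(P, Q) and Λ(P, Q, λ):

```python
import math

def rho_Q(x, Q, rho):
    # Essential infimum of rho(x, Y) for Y ~ Q: for a finite alphabet,
    # the minimum of rho(x, y) over y with Q(y) > 0.
    return min(rho(x, y) for y, qy in Q.items() if qy > 0)

def D_ave(P, Q, rho):
    # E[rho(X, Y)] for independent X ~ P and Y ~ Q.
    return sum(px * qy * rho(x, y)
               for x, px in P.items() for y, qy in Q.items())

def Lambda(P, Q, rho, lam):
    # Lambda(P, Q, lam) = E[ log E[ exp(lam * rho(X, Y)) | X ] ], in nats.
    return sum(px * math.log(sum(qy * math.exp(lam * rho(x, y))
                                 for y, qy in Q.items()))
               for x, px in P.items())

P = {0: 0.5, 1: 0.5}
Q = {0: 0.5, 1: 0.5}
ham = lambda x, y: float(x != y)   # Hamming distortion

print(rho_Q(0, Q, ham))          # 0.0: some y matches x exactly
print(D_ave(P, Q, ham))          # 0.5
print(Lambda(P, Q, ham, -1.0))   # log((1 + e^{-1})/2), about -0.3799
```

In this symmetric binary case Λ(P, Q, λ) = log((1 + e^λ)/2), which is the closed form used again in Section V.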
Consider now the product spaces Sⁿ and Tⁿ with the usual σ-algebras and with generic probability distributions Pₙ and Qₙ, respectively. Let X₁ⁿ ~ Pₙ and Y₁ⁿ ~ Qₙ be independent and let Wₙ be a probability distribution on Sⁿ × Tⁿ. We use Qⁿ to denote the product distribution of Q on Tⁿ. Define the sequence of single-letter distortion measures (ρₙ : n ≥ 1) by

ρₙ(x₁ⁿ, y₁ⁿ) := (1/n) Σ_{k=1}^n ρ(xk, yk),  x₁ⁿ ∈ Sⁿ, y₁ⁿ ∈ Tⁿ.
Since we are working with abstract alphabets, it makes sense to formally replace S, T, ρ, P, Q, W, X and Y with Sⁿ, Tⁿ, ρₙ, Pₙ, Qₙ, Wₙ, X₁ⁿ and Y₁ⁿ, respectively, in the above definitions of ρQ, W(P, D), R(P, Q, D), Λ(P, Q, λ) and Λ*(P, Q, D) in order to define ρQₙ, Wₙ(Pₙ, D), Rₙ(Pₙ, Qₙ, D), Λₙ(Pₙ, Qₙ, λ) and Λ*ₙ(Pₙ, Qₙ, D).
For each x₁ⁿ ∈ Sⁿ, let P̂x₁ⁿ denote the empirical probability distribution of x₁ⁿ on S and let Px₁ⁿ denote the probability distribution on Sⁿ that assigns probability 1 to the sequence x₁ⁿ. Define

Lₙ := −(1/n) log Qₙ(B(X₁ⁿ, D))  and  Rₙ := Rₙ(P_{X₁ⁿ}, Qₙ, D)

where

B(x₁ⁿ, D) := {y₁ⁿ ∈ Tⁿ : ρₙ(x₁ⁿ, y₁ⁿ) ≤ D}

denotes the distortion ball of radius D at the point x₁ⁿ.
Finally, consider the one-sided infinite sequence spaces S∞ and T∞ with the usual σ-algebras and with generic probability distributions ℙ and ℚ, respectively. Let (Xn : n ≥ 1) and (Yn : n ≥ 1) be independent random sequences with distributions ℙ and ℚ, respectively. Let Pₙ and Qₙ denote the marginal distributions of ℙ and ℚ on Sⁿ and Tⁿ, respectively, so that X₁ⁿ ~ Pₙ and Y₁ⁿ ~ Qₙ as before. Define

Λ∞(ℙ, ℚ, λ) := lim supₙ (1/n) E[log E[e^{nλρₙ(X₁ⁿ, Y₁ⁿ)} ∣ X₁ⁿ]],
Λ*∞(ℙ, ℚ, D) := sup_{λ ≤ 0} [λD − Λ∞(ℙ, ℚ, λ)],
R∞(ℙ, ℚ, D) := lim supₙ (1/n) Rₙ(Pₙ, Qₙ, D).

It is straightforward to verify that when ℙ is stationary with P1 = P and ℚ is memoryless with Qₙ = Qⁿ, then Λ∞(ℙ, ℚ, λ) = Λ(P, Q, λ) and hence Λ*∞(ℙ, ℚ, D) = Λ*(P, Q, D).
III. Memoryless Codebook Distributions
Our goal is to provide necessary and sufficient conditions for the generalized AEP in (3) and for the characterization of the limit as a convex dual. In this section we restrict attention to the case where the codebook distribution is memoryless and where the source distribution is stationary and ergodic.
Under appropriate regularity conditions, it is well known in the literature that R(P, Q, D) can be expressed as a convex dual [9]. (See [14] for a review and further references). Our first result is that the regularity conditions are not necessary. (Proofs of all results are collected in Sections V and VI.)
Proposition 1
R(P, Q, D) = Λ*(P, Q, D) for all D. If W(P, D) is not empty, then this set contains a W such that R(P, Q, D) = H(W ∥ P × Q).
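Proposition 1 can be checked numerically in a case where R(P, Q, D) has a closed form: for P = Q = Bernoulli(1/2) with Hamming distortion, R(P, Q, D) = log 2 − H(D) nats, where H is the binary entropy. The sketch below is our own illustration (a crude grid search stands in for the exact supremum); it evaluates the dual Λ*(P, Q, D) = sup_{λ ≤ 0} [λD − Λ(P, Q, λ)] and compares:

```python
import math

def Lambda(lam):
    # Lambda(P, Q, lam) for P = Q = Bernoulli(1/2) and Hamming distortion.
    return math.log((1.0 + math.exp(lam)) / 2.0)

def Lambda_star(D, lo=-60.0, steps=120001):
    # Dual formula: sup over lam <= 0 of lam*D - Lambda(lam), via grid search.
    return max(lam * D - Lambda(lam)
               for lam in (lo + i * (-lo) / (steps - 1) for i in range(steps)))

D = 0.25
closed_form = math.log(2) + D * math.log(D) + (1 - D) * math.log(1 - D)
print(Lambda_star(D), closed_form)   # both about 0.1308
```

The supremum is attained at λ_D = log(D/(1 − D)) < 0, which is the tilting parameter that reappears in Section V.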
We separate our results about the generalized AEP into two theorems. The first theorem asserts that the generalized AEP always holds along some subsequence (of n’s). Consequently, the generalized AEP can only fail if the limit does not exist.4 The second theorem gives necessary and sufficient conditions for the existence of the limit and also characterizes the pathology. Taken together, they provide necessary and sufficient conditions for the generalized AEP.
Theorem 2
Suppose ℙ is stationary and ergodic with P1 = P. Then
lim inf_{n→∞} Lₙ = R(P, Q, D)  ℙ-a.s.  (4)

for all D. The result also holds with Lₙ replaced by n⁻¹Rₙ.
Theorem 3
Suppose ℙ is stationary and ergodic with P1 = P. Then the limit lim_{n→∞} Lₙ fails to exist (in the extended sense) with positive probability if and only if 0 < D = Dmin(P, Q) < ∞, R(P, Q, D) < ∞, and ρQ(X1) is not a.s. constant. Furthermore, in this pathological situation

Prob{Lₙ = +∞ infinitely often} > 0  (5a)

Prob{Lₙ < ∞ infinitely often} = 1  (5b)

lim_{m→∞} L_{Nm} = R(P, Q, D)  ℙ-a.s.  (5c)

where (Nm : m ≥ 1) is the (a.s.) infinite random subsequence of (n ≥ 1) for which Lₙ is finite. All of the above also hold with Lₙ replaced by n⁻¹Rₙ.
The proofs below show that (Nm : m ≥ 1) can also be (a.s.) characterized as the random subsequence of (n ≥ 1) for which

(1/n) Σ_{k=1}^n ρQ(Xk) ≤ D.  (6)
Note that Dmin(P, Q) = E[ρQ(X)] whenever the former is finite.
A simple example that illustrates the pathology is the following: Let (Xn : n ≥ 1) be the sequence 1, 0, 1, 0, … with probability 1/2 and the sequence 0, 1, 0, 1, … with probability 1/2, namely, the binary, stationary, periodic Markov chain (which is ergodic). Let Q be the point mass at 0, let ρ(x, y) := |x − y| and let D = 1/2. Note that ρQ(X1) = X1 is not a.s. constant, that D = Dmin(P, Q) = 1/2 and that R(P, Q, D) = 0 is finite. In the case when X1 = 0, we have Lₙ = 0 for all n. In the case when X1 = 1, however, L_{2n} = 0 and L_{2n−1} = +∞ for all n, so the limit of Lₙ does not exist.
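The example can be verified mechanically. Since Q is a point mass at 0, Qⁿ(B(x₁ⁿ, D)) is 1 or 0 according to whether the fraction of ones in x₁ⁿ is at most D, so Lₙ takes only the values 0 and +∞. A small sketch (our own illustration):

```python
import math

def L_n(x, D):
    # Q = point mass at 0 and rho = |x - y|:  Q^n(B(x, D)) = 1{ mean(x) <= D },
    # so L_n = 0 when the ball has probability 1 and +inf when it has probability 0.
    return 0.0 if sum(x) / len(x) <= D else math.inf

D = 0.5
x_even = [0, 1] * 8   # realization starting with X1 = 0
x_odd  = [1, 0] * 8   # realization starting with X1 = 1

print([L_n(x_even[:n], D) for n in range(1, 9)])  # all 0.0
print([L_n(x_odd[:n], D) for n in range(1, 9)])   # inf, 0.0, inf, 0.0, ...
```

Along the even times (the subsequence (Nm)) the generalized AEP holds with limit R(P, Q, D) = 0, exactly as Theorem 3 predicts.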
IV. Extensions to the Case With Memory
Although the source (Xn : n ≥ 1) can have memory, the generalized AEP stated thus far is restricted to the case where the reproduction distribution is memoryless, that is, Lₙ is evaluated with a product measure Qⁿ. We relax this assumption here and replace it with a strong mixing condition, namely

C⁻¹ Qₙ(A) Qₘ(B) ≤ Prob{Y₁ⁿ ∈ A, (Yn+1, …, Yn+m) ∈ B} ≤ C Qₙ(A) Qₘ(B)  (7)

for some fixed 1 ≤ C < ∞, all measurable A ⊆ Tⁿ and B ⊆ Tᵐ, and all n, m ≥ 1. Examples include the cases where ℚ is memoryless (C = 1) and where ℚ is a hidden Markov model (HMM) whose underlying Markov chain has a finite state space with all (strictly) positive transition probabilities.
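For a two-state chain with strictly positive transition probabilities, a valid constant for the two-sided product bound in (7) can be computed and verified by enumeration, because the pointwise ratio ℚ(y₁^{n+m}) / [ℚ(y₁ⁿ) ℚ(y_{n+1}^{n+m})] collapses to p(y_{n+1} ∣ y_n)/π(y_{n+1}). The following sketch is our own illustration (the chain and its stationary distribution are hypothetical choices):

```python
from itertools import product

P_mat = [[0.7, 0.3], [0.2, 0.8]]   # transition probabilities, all > 0
pi = [0.4, 0.6]                     # stationary distribution of P_mat

def prob(seq):
    # Probability of a finite path under the stationary Markov chain.
    p = pi[seq[0]]
    for a, b in zip(seq, seq[1:]):
        p *= P_mat[a][b]
    return p

# The pointwise mixing ratio equals p(b0 | a_last) / pi(b0), so this C works:
C = max(max(P_mat[i][j] / pi[j] for i in range(2) for j in range(2)),
        max(pi[j] / P_mat[i][j] for i in range(2) for j in range(2)))

for n, m in [(1, 1), (2, 2), (3, 2)]:
    for seq in product([0, 1], repeat=n + m):
        joint, split = prob(seq), prob(seq[:n]) * prob(seq[n:])
        assert split / C <= joint <= C * split
print("C =", C)   # 2.0 for this chain
```

Pointwise control of the density ratio implies the set-level bounds in (7) by integration over A × B.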
Proposition 4
If ℙ and ℚ are stationary and ℚ satisfies (7), then

R∞(ℙ, ℚ, D) = Λ*∞(ℙ, ℚ, D)

for all D.
The proof shows that the limit supremum in the definition of Λ∞ is actually a limit, and Proposition 1 implies that Rₙ(Pₙ, Qₙ, D) = Λ*ₙ(Pₙ, Qₙ, D) for each n. For the special case when S and T are finite and ℚ is a finite state Markov chain, a formula for R∞(ℙ, ℚ, D) not involving limits was identified in [7].
Theorem 5
If ℚ is stationary and satisfies (7), then Theorems 2 and 3 remain valid when Qⁿ, Q^{Nm}, R(P, Q, D) and Λ*(P, Q, D) are replaced by Qₙ, Q_{Nm}, R∞(ℙ, ℚ, D) and Λ*∞(ℙ, ℚ, D), respectively.
The mixing conditions here are strong enough to ensure that

Dmin(Pₙ, Qₙ) = Dmin(P, Q) for all n  (8)

and that

ρQₙ(x₁ⁿ) = (1/n) Σ_{k=1}^n ρQ(xk) for all n and all x₁ⁿ ∈ Sⁿ  (9)

which is why the results for memory can still be stated in terms of Dmin(P, Q) and ρQ. Extending Theorem 3 to situations where these do not hold seems difficult. The generalized AEP for ℚ with memory can also be found in [7], [12], [14], [16], in some cases under more general mixing conditions, but always for a bounded distortion measure ρ and for D ≠ Dmin(ℙ, ℚ).
V. Proofs: Standard Cases
Owing to space constraints, the proofs presented here focus primarily on the case D = Dmin (P, Q), which has received little or no attention in the literature. We also provide some justification for removing the standard assumption that Dave (P, Q) < ∞. This is a moment condition on ρ and it is assumed in all previous treatments of the subject. More detailed proofs, including measurability issues, can be found in a longer version of this paper [2] and the technical report that preceded it [1].
When ℚ is stationary and satisfies (7), then

Qn+m(B(x₁^{n+m}, D)) ≥ C⁻¹ Qₙ(B(x₁ⁿ, D)) Qₘ(B((xn+1, …, xn+m), D))  (10)

and

| log E[e^{(n+m)λρn+m(x₁^{n+m}, Y₁^{n+m})}] − log E[e^{nλρₙ(x₁ⁿ, Y₁ⁿ)}] − log E[e^{mλρₘ((xn+1, …, xn+m), Y₁ᵐ)}] | ≤ log C  (11)

for all λ ≤ 0.
These two bounds make it relatively straightforward to extend the memoryless case to the case with memory. Replacing x with X in (11), taking expectations and using well-known facts about subadditive sequences (cf., [15, Lemma 10.21]) gives Proposition 4. Equation (10) combined with a blocking argument can be used to extend the memoryless generalized AEP to the case with memory. The details are tedious and analogous to existing arguments in the literature. We refer the interested reader to [2] and [1]. Henceforth, except for the special case D = Dmin (P, Q) and our remarks about the lower bound below, we consider only the memoryless setting.
It is convenient in the proofs to temporarily redefine

Dmin(P, Q) := E[ρQ(X)].

Proposition 1, once established, shows that this agrees with the definition in the text.
A. The Lower Bound and Replacing Ln With Rn
For each λ ≤ 0, the exponential Chebyshev’s inequality gives

Qₙ(B(x₁ⁿ, D)) = Prob{ρₙ(x₁ⁿ, Y₁ⁿ) ≤ D} ≤ e^{−nλD} E[e^{nλρₙ(x₁ⁿ, Y₁ⁿ)}]

that is, −(1/n) log Qₙ(B(x₁ⁿ, D)) ≥ λD − (1/n) log E[e^{nλρₙ(x₁ⁿ, Y₁ⁿ)}].
Taking limits, using (11) to justify the sub-additive ergodic theorem [15, Th. 10.22], and then optimizing over λ ≤ 0 gives
lim inf_{n→∞} Lₙ ≥ Λ*∞(ℙ, ℚ, D)  ℙ-a.s.  (12)

This holds for all D. Combined with the previous bound (which shows that Lₙ ≥ n⁻¹Rₙ and that n⁻¹Rₙ also satisfies (12)), this also shows that whenever Lₙ converges to the right side of (12), n⁻¹Rₙ converges to the same limit. Furthermore, using Lemma 7 below it is straightforward to show that Rₙ must be infinite when Lₙ is infinite. Using Propositions 1 and 4 to equate Rₙ and Λ*ₙ(Pₙ, Qₙ, D), these remarks justify our claims that Lₙ can be replaced by n⁻¹Rₙ in each of the theorems.
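In the binary symmetric case the matching probability is an exact binomial tail, so the convergence of Lₙ to Λ*(P, Q, D) = log 2 − H(D) can be watched directly. The sketch below is our own illustration (computations are done in log space, since the ball probability is astronomically small):

```python
import math

def L_n(n, D):
    # -(1/n) log Q^n(B(x, D)) for Q = Bernoulli(1/2) and Hamming distortion:
    # the ball probability is P{Bin(n, 1/2) <= nD}, the same for every x.
    logs = [math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            - n * math.log(2) for k in range(int(n * D) + 1)]
    mx = max(logs)                      # log-sum-exp to avoid underflow
    log_ball = mx + math.log(sum(math.exp(v - mx) for v in logs))
    return -log_ball / n

D = 0.25
rate = math.log(2) + D * math.log(D) + (1 - D) * math.log(1 - D)  # about 0.1308
for n in [100, 1000, 4000]:
    print(n, L_n(n, D))   # decreases toward the rate as n grows
```

The gap Lₙ − Λ* decays like (log n)/n, reflecting the polynomial prefactor in the exact large deviations asymptotics.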
There is also a lower bound for Proposition 1.
H(W ∥ P × Q) ≥ Λ*(P, Q, D) for every W ∈ W(P, D).  (13)

This only makes sense if W(P, D) ≠ ∅. A simple proof can be found in [14, Th. 2]. Note that the proof there does not make use of any regularity assumptions, other than the existence of a regular conditional distribution W(dy∣x).
B. The Upper Bound: D < Dmin(P, Q) or D > Dave(P, Q)
The upper bound is
lim sup_{n→∞} Lₙ ≤ Λ*(P, Q, D)  ℙ-a.s.  (14)
If D < Dmin(P, Q) then the upper bound is trivial since the right side is +∞. If D > Dave(P, Q) (the latter of which must then be finite), then the left side converges to 0 (Chebyshev’s inequality and the ergodic theorem) and again the upper bound must hold.
For Proposition 1, the upper bound is
H(W ∥ P × Q) ≤ Λ*(P, Q, D) for some W ∈ W(P, D).  (15)

Again, this only makes sense if W(P, D) ≠ ∅. If D < Dmin(P, Q), then any W ∈ W(P, D) trivially satisfies (15). If D ≥ Dave(P, Q), then W = P × Q ∈ W(P, D) trivially satisfies (15).
C. The Upper Bound: Dmin(P, Q) < D ≤ Dave(P, Q)
This is the case considered in the literature, at least under the assumption that Dave(P, Q) < ∞. Under appropriate regularity conditions, the upper bound (14) is an immediate consequence of the one-sided large deviations theorem in [17] applied to the random variables (−nρₙ(x₁ⁿ, Y₁ⁿ) : n ≥ 1) and the constants aₙ := −D for a typical, fixed realization (xn : n ≥ 1) of the source. See [14, Th. 1] for an overview of this technique using the Gärtner-Ellis theorem [18, Th. V.6], which is a generalization of [17]. The required regularity conditions are placed on Λ∞(ℙ, ℚ, λ).
For the memoryless codebook case, Λ∞(ℙ, ℚ, λ) = Λ(P, Q, λ) and the regularity conditions needed in [17] are given by the following Lemma.
Lemma 6
If Dmin(P, Q) < ∞, then Λ(λ) := Λ(P, Q, λ) is convex and continuously differentiable on (−∞, 0) with limλ↓−∞ Λ′(λ) = E(ρQ(X)) = Dmin(P, Q) and limλ↑0 Λ′(λ) = Dave(P, Q) (which may be infinite). If Dmin(P, Q) < Dave(P, Q), then Λ(λ) is strictly convex on (−∞, 0).
Remarks
The regularity assumption Dave(P, Q) < ∞ is used in the literature to establish these properties of Λ. It can be replaced by the much weaker assumption that Dmin(P, Q) < ∞. If Lemma 6 were known to be true for Λ∞, then the more general case with memory would also follow immediately from [17]. Establishing the strict convexity of Λ∞ seems challenging, however, and we resorted to a blocking argument as mentioned above to derive the general case from the memoryless case.
Proof
Let Z be a real-valued, nonnegative random variable. Define Γ(λ) := log E[e^{λZ}]. It is well known [19, Lem. 2.2.5, Ex. 2.2.24], [9], [14] that Γ is nondecreasing, convex and C∞ on (−∞, 0) with Γ′(λ) ↓ ess inf Z as λ ↓ −∞ and Γ′(λ) ↑ EZ (which may be infinite) as λ ↑ 0. Furthermore, if ess inf Z < EZ then Γ is strictly convex on (−∞, 0).
Now define Γ(x, λ) := log Eeλρ(x, Y). Using Z := ρ(x, Y), we see that Γ(x, λ) has all the above properties of Γ(λ) for each fixed x. Since Λ(λ) = EΓ(X, λ), which must be finite on (−∞, 0] if Dmin < ∞, we need only show that these properties are preserved by expectation. Convexity and strict convexity are immediately preserved. Preserving properties of the derivative requires some justification. Moving the expectation inside the derivative is justified by convexity and the monotone convergence theorem for the left hand derivative (including λ ↑ 0) and then finiteness and the dominated convergence theorem for the right hand derivative (including λ ↓ −∞). The fact that Dmin = limλ↓−∞ Λ′(λ) is a well-known property of Fenchel–Legendre transforms.
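The two derivative limits in Lemma 6 are easy to observe numerically on a finite example; the sketch below is our own illustration (finite differences stand in for Λ′; asymmetric Bernoulli marginals with Hamming distortion, so Dmin = 0 and Dave = 0.54):

```python
import math

P = {0: 0.3, 1: 0.7}
Q = {0: 0.6, 1: 0.4}
rho = lambda x, y: float(x != y)

def Lambda(lam):
    # Lambda(P, Q, lam) = E[ log E[ exp(lam * rho(X, Y)) | X ] ].
    return sum(px * math.log(sum(qy * math.exp(lam * rho(x, y))
                                 for y, qy in Q.items()))
               for x, px in P.items())

def dLambda(lam, h=1e-6):
    # Central finite-difference approximation to Lambda'(lam).
    return (Lambda(lam + h) - Lambda(lam - h)) / (2 * h)

D_min = 0.0                      # rho_Q(x) = 0 since Q charges every symbol
D_ave = 0.3 * 0.4 + 0.7 * 0.6    # = 0.54
print(dLambda(-30.0))   # close to D_min = 0
print(dLambda(-1e-4))   # close to D_ave = 0.54
```

Λ′ sweeps exactly the interval (Dmin, Dave), which is what makes the dual optimization over λ ≤ 0 behave differently in the three regimes D < Dmin, Dmin < D ≤ Dave, and D > Dave.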
Turning to Proposition 1, if Dmin(P, Q) < D < Dave(P, Q), a proof of the upper bound (15) can be found in [14, Th. 2]. The achieving W ∈ W(P, D) is identified as

W(dx, dy) := P(dx) Q(dy) e^{λD ρ(x, y) − Γ(x, λD)}

where λD < 0 is uniquely chosen so that Λ*(P, Q, D) = λDD − Λ(P, Q, λD). Note that the properties of Λ given in Lemma 6 are sufficient regularity conditions for ensuring that λD exists and has the desired properties. See [19, Lemma 2.2.5] and [20, Th. 23.5, Corollary 23.5.1, Th. 25.1].
VI. Proofs: Nonstandard Cases
Here we consider the special case when D = Dmin := Dmin(P, Q) < ∞. Lemma 6 is applicable, so

Λ*(P, Q, Dmin) = lim_{λ → −∞} [λDmin − Λ(P, Q, λ)].
Unlike most of the previous section, we explicitly treat the case of codebook distributions with memory. In particular, we assume that ℚ is stationary and satisfies (7). The bounds in (10) and (11) easily give (9) and

|Λ∞(ℙ, ℚ, λ) − Λ(P, Q, λ)| ≤ log C for all λ ≤ 0

which means that Λ*(Dmin) := Λ*(P, Q, Dmin) is finite (infinite) exactly when Λ*∞(ℙ, ℚ, Dmin) is finite (infinite). The lower bounds in (12) and (13) are also valid. These lower bounds address the upper bounds in (14) and (15) in the case Λ*(Dmin) = ∞. So in this section we will restrict attention to Λ*(Dmin) < ∞. The proofs make use of the sets

Aδ(x) := {y ∈ T : ρ(x, y) ≤ ρQ(x) + δ},  δ ≥ 0.

We use A(x) := A₀(x) = {y : ρ(x, y) = ρQ(x)} (the two descriptions agree up to a Q-null set, since ρ(x, Y) ≥ ρQ(x) a.s.) and we use the notation 1{B} to denote the indicator function of the event B.
Lemma 7
If Λ*(Dmin) < ∞, then

Λ*(Dmin) = H(W ∥ P × Q) = −E[log Q(A(X))]

for W ∈ W(P, Dmin) defined by

W(dx, dy) := P(dx) Q(dy) 1{y ∈ A(x)} / Q(A(x)).
Proof
Define

ρ̃(x, y) := ρ(x, y) − ρQ(x)

so that ρ̃ is a valid distortion measure (it is nonnegative for Q-a.e. y, for each fixed x) and so that

A(x) = {y : ρ̃(x, y) ≤ 0}, up to a Q-null set.

Let Λ̃ be defined analogously to Λ, except with ρ̃ instead of ρ. We have Λ(λ) = Λ̃(λ) + λDmin so that

Λ*(Dmin) = sup_{λ ≤ 0} [λDmin − Λ(λ)] = sup_{λ ≤ 0} [−Λ̃(λ)] = −lim_{λ → −∞} E[log E[e^{λρ̃(X, Y)} ∣ X]] = −E[log Q(A(X))].

We moved the limit inside the expectations using first the monotone convergence theorem and then the dominated convergence theorem. This also shows that the denominator in the definition of W is strictly positive P-a.s., so W is well-defined.

It is easy to see that W ∈ W(P, Dmin) and that

H(W ∥ P × Q) = −E[log Q(A(X))] = Λ*(Dmin)

which completes the proof.
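Lemma 7 can be sanity-checked numerically in a case with Dmin > 0: take ρ(x, y) := 1 + 1{x ≠ y} on binary alphabets, so that ρQ(x) = 1, Dmin = 1 and A(x) = {x}; then λDmin − Λ(P, Q, λ) should increase to −E[log Q(A(X))] = −E[log Q(X)] as λ → −∞. A sketch (our own illustration; the marginals are hypothetical choices):

```python
import math

P = {0: 0.3, 1: 0.7}
Q = {0: 0.6, 1: 0.4}
rho = lambda x, y: 1.0 + float(x != y)   # rho_Q(x) = 1, D_min = 1, A(x) = {x}

def Lambda(lam):
    # Lambda(P, Q, lam) = E[ log E[ exp(lam * rho(X, Y)) | X ] ].
    return sum(px * math.log(sum(qy * math.exp(lam * rho(x, y))
                                 for y, qy in Q.items()))
               for x, px in P.items())

D_min = 1.0
lemma7 = -sum(px * math.log(Q[x]) for x, px in P.items())  # -E log Q(A(X))
for lam in [-5.0, -10.0, -20.0]:
    print(lam * D_min - Lambda(lam))   # increases toward lemma7, about 0.7947
```

The monotone approach of λDmin − Λ(λ) to −E[log Q(A(X))] is exactly the monotone-convergence step in the proof above.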
Lemma 7 gives the upper bound (15) and completes the proof of Proposition 1 for all possible values of D. We now address the upper bound (14) for the generalized AEP. Because of (9), the product sets

Aₙ(x₁ⁿ) := A(x1) × ⋯ × A(xn) = {y₁ⁿ ∈ Tⁿ : ρₙ(x₁ⁿ, y₁ⁿ) = ρQₙ(x₁ⁿ)}  (up to a Qₙ-null set)

play the role of A(x) at the sequence level, and we can use (10) in conjunction with the subadditive ergodic theorem [15, Th. 10.22] to conclude that

limₙ −(1/n) log Qₙ(Aₙ(X₁ⁿ)) = limₙ −(1/n) E[log Qₙ(Aₙ(X₁ⁿ))] = limₙ (1/n) Λ*ₙ(Pₙ, Qₙ, Dmin) = R∞(ℙ, ℚ, Dmin)  ℙ-a.s.  (16)

The limit in the subadditive ergodic theorem is deterministic because it is shift invariant and the source is ergodic. The second equality comes from Lemma 7 and the third from Propositions 1 and 4. Note that if ρQ(X) is a.s. constant, then Aₙ(X₁ⁿ) ⊆ B(X₁ⁿ, Dmin) a.s. and (16) gives the upper bound (14).
Now suppose ρQ(X1) is not a.s. constant (and D = Dmin and Λ*(Dmin) < ∞). This is the only pathological situation where the upper bound does not hold. Our analysis makes use of recurrence properties for random walks with stationary and ergodic increments. What we need is summarized in the following lemma. The notation “i.o.” means “infinitely often.”
Lemma 8
Let (Un : n ≥ 1) be a real-valued stationary and ergodic process and define Wn := U1 + ⋯ + Un for n ≥ 1. If EU1 = 0 and Prob{U1 ≠ 0} > 0, then Prob{Wn > 0 i.o.} > 0 and Prob{Wn ≥ 0 i.o.} = 1.
Proof
Define W0 := 0 so that (Wn : n ≥ 0) is a random walk with stationary and ergodic increments [21]. Kesten [22] shows that {lim infₙ n⁻¹Wn > 0} and {Wn → ∞} differ by a null set. The ergodic theorem gives Prob{n⁻¹Wn → 0} = 1, so Prob{Wn → ∞} = 0. Similarly, by considering the process (−Wn), we see that Prob{Wn → −∞} = 0.
Now {|Wn| → ∞} is invariant and must have probability 0 or 1. If it has probability 1, then since we cannot have Wn → ∞ or Wn → −∞ we must have Wn oscillating between increasingly larger positive and negative values, which means Prob{Wn > 0 i.o.} = 1 and completes the proof.
Suppose Prob{|Wn| → ∞} = 0. Define

N(A) := Σ_{n ≥ 0} 1{Wn ∈ A}

to be the number of times the random walk visits the set A. [21, Corollary 2.3.4] shows that either N(J) < ∞ a.s. for all bounded intervals J or {N(J) = 0} ∪ {N(J) = ∞} has probability 1 for all intervals J (open or closed, bounded or unbounded, but not a single point). By assumption |Wn| ↛ ∞, so we can rule out the first possibility. Since Prob{W0 = 0} = 1, we see that for any interval J containing {0} we must have Prob{N(J) = ∞} = 1. In particular, taking J := [0, ∞) shows that Prob{Wn ≥ 0 i.o.} = 1. Similarly, taking J := (0, ∞) shows that either Prob{Wn > 0 i.o.} > 0 or Wn ≤ 0 for all n a.s.; the latter is impossible, because EWn = 0 would then force Wn = 0 a.s. for every n, contradicting Prob{U1 ≠ 0} > 0. This completes the proof.
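Lemma 8's conclusions can be illustrated on the example from Section III, whose increments Uk = ρQ(Xk) − D alternate between +1/2 and −1/2 with a random starting sign (a stationary, ergodic, mean-zero process). A finite run cannot certify an "infinitely often" statement, but it exhibits the two behaviors: every path returns to [0, ∞) repeatedly, while only the paths with a positive first increment ever visit (0, ∞). Sketch (our own illustration):

```python
def walk(first_positive, n):
    # Increments from the Section III example: rho_Q(X_k) - D alternates
    # between +1/2 and -1/2, with the starting sign determined by X_1.
    u = 0.5 if first_positive else -0.5
    w, path = 0.0, []
    for _ in range(n):
        w += u
        path.append(w)
        u = -u
    return path

for start in [True, False]:
    path = walk(start, 1000)
    visits_pos = sum(1 for w in path if w > 0)
    visits_nonneg = sum(1 for w in path if w >= 0)
    print(start, visits_pos, visits_nonneg)
# start = True:  500 visits to (0, inf) and 1000 to [0, inf)
# start = False: 0 visits to (0, inf) and 500 to [0, inf)
```

Here Prob{Wn > 0 i.o.} = 1/2 and Prob{Wn ≥ 0 i.o.} = 1, matching both conclusions of Lemma 8 and showing that the first one cannot be strengthened to probability 1.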
Returning to the main argument, ℙ-a.s.,

Qₙ(B(X₁ⁿ, D)) = 0 if and only if Wn > 0  (17)

where Wn := Σ_{k=1}^n (ρQ(Xk) − D). Lemma 8 shows that Prob{Wn > 0 i.o.} > 0. This and (17) prove (5a).
Lemma 8, applied to the increments (−Un), also shows that Prob{Wn ≤ 0 i.o.} = 1. Let (Nm : m ≥ 1) be the (a.s.) infinite, random subsequence of (n ≥ 1) such that Wn ≤ 0. Note that

A_{Nm}(X₁^{Nm}) ⊆ B(X₁^{Nm}, D) whenever W_{Nm} ≤ 0

so

L_{Nm} ≤ −(1/Nm) log Q_{Nm}(A_{Nm}(X₁^{Nm})) ≤ (1/Nm) Σ_{k=1}^{Nm} [−log Q(A(Xk))] + log C.  (18)
Now, the final expression in (18) is a.s. finite because −E[log Q(A(X1))] = Λ*(Dmin) < ∞. This proves (5b) and shows that (Nm : m ≥ 1) satisfies the claims of the theorem, including (6). Letting m → ∞ in (18) and using (16) gives (5c), the upper bound along the sequence (Nm : m ≥ 1). Note that it also shows that lim infₙ Lₙ is a.s. equal to R∞(ℙ, ℚ, D) even in this pathological case, which proves Theorem 2 and its generalization in Theorem 5.
Acknowledgments
The author would like to thank I. Kontoyiannis, M. Madiman, and two anonymous reviewers for many useful comments and corrections, and I. Kontoyiannis for invaluable advice and for suggesting the problems that led to this correspondence.
This work was supported in part by a National Defense Science and Engineering Graduate Fellowship.
Footnotes
1. Precise rigorous definitions are given in the following section.
2. Borel spaces include ℝd as well as a large class of infinite-dimensional spaces, including Polish spaces. This assumption is made so that we can avoid certain pathologies while working with random sequences and conditional distributions [15].
3. The essential infimum of a random variable η is ess inf η := inf{r : Prob{η < r} > 0}.
4. We are considering limits in the extended sense: if a sequence diverges to +∞ (or to −∞), then we still say that the limit exists and identify the limit as +∞ (or as −∞).

The material in this correspondence was presented in part at the 40th Annual Allerton Conference on Communications, Control, and Computers, Monticello, IL, October 2002.
References
1. Harrison M. The First Order Asymptotics of Waiting Times Between Stationary Processes Under Nonstandard Conditions. Providence, RI: Brown Univ., Div. Appl. Math., APPTS #03-3; 2003.
2. Harrison MT. The Generalized Asymptotic Equipartition Property: Necessary and Sufficient Conditions. 2007 [Online]. doi: 10.1109/TIT.2008.924668. Available: http://arxiv.org/abs/0711.2666.
3. Harrison M, Kontoyiannis I. Maximum likelihood estimation for lossy data compression. In: Proc. 40th Ann. Allerton Conf. Commun. Contr. Comput., Allerton, IL; Oct. 2002. pp. 596–604.
4. Shannon C. Coding theorems for a discrete source with a fidelity criterion. In: Slepian D, editor. IRE Nat. Conv. Rec., Vol. 4. New York: IEEE; 1959. pp. 142–163. Reprinted in Key Papers in the Development of Information Theory.
5. Zhang Z, Yang E-h, Wei VK. The redundancy of source coding with a fidelity criterion – Part I: Known statistics. IEEE Trans. Inf. Theory. 1997 Jan;43:71–91.
6. Łuczak T, Szpankowski W. A suboptimal lossy data compression based on approximate pattern matching. IEEE Trans. Inf. Theory. 1997 Sep;43:1439–1451.
7. Yang E-h, Kieffer J. On the performance of data compression algorithms based upon string matching. IEEE Trans. Inf. Theory. 1998 Jan;44:47–65.
8. Kontoyiannis I. An implementable lossy version of the Lempel-Ziv algorithm – Part I: Optimality for memoryless sources. IEEE Trans. Inf. Theory. 1999 Nov;45:2293–2305.
9. Yang E-h, Zhang Z. On the redundancy of lossy source coding with abstract alphabets. IEEE Trans. Inf. Theory. 1999 May;45:1092–1110.
10. Dembo A, Kontoyiannis I. The asymptotics of waiting times between stationary processes, allowing distortion. Ann. Appl. Prob. 1999 May;9:413–429.
11. Yang E-h, Zhang Z. The shortest common superstring problem: Average case analysis for both exact and approximate matching. IEEE Trans. Inf. Theory. 1999;45:1867–1886.
12. Chi Z. The first-order asymptotic of waiting times with distortion between stationary processes. IEEE Trans. Inf. Theory. 2001 Jan;47:338–347.
13. Szpankowski W. Average Case Analysis of Algorithms on Sequences. New York: Wiley; 2001.
14. Dembo A, Kontoyiannis I. Source coding, large deviations, and approximate pattern matching. IEEE Trans. Inf. Theory. 2002 Jun;48:1590–1615.
15. Kallenberg O. Foundations of Modern Probability. 2nd ed. New York: Springer; 2002.
16. Chi Z. Stochastic sub-additivity approach to the conditional large deviation principle. Ann. Prob. 2001;29(3):1303–1328.
17. Plachky D, Steinebach J. A theorem about probabilities of large deviations with an application to queuing theory. Period. Math. Hung. 1975;6(4):343–345.
18. den Hollander F. Large Deviations. Providence, RI: American Math. Soc.; 2000.
19. Dembo A, Zeitouni O. Large Deviations Techniques and Applications. 2nd ed. New York: Springer; 1998.
20. Rockafellar RT. Convex Analysis. Princeton, NJ: Princeton Univ. Press; 1970.
21. Berbee H. Random Walks With Stationary Increments and Renewal Theory. Mathematical Centre Tracts, Vol. 112. Amsterdam, The Netherlands: Mathematisch Centrum; 1979.
22. Kesten H. Sums of stationary sequences cannot grow slower than linearly. Proc. AMS. 1975 May;49(1):205–211.
