Entropy. 2021 Feb 5;23(2):199. doi: 10.3390/e23020199

Error Exponents and α-Mutual Information

Sergio Verdú 1
PMCID: PMC7915702  PMID: 33562882

Abstract

Over the last six decades, the representation of error exponent functions for data transmission through noisy channels at rates below capacity has seen three distinct approaches: (1) Through Gallager’s E0 functions (with and without cost constraints); (2) large deviations form, in terms of conditional relative entropy and mutual information; (3) through the α-mutual information and the Augustin–Csiszár mutual information of order α derived from the Rényi divergence. While a fairly complete picture has emerged in the absence of cost constraints, there have remained gaps in the interrelationships between the three approaches in the general case of cost-constrained encoding. Furthermore, no systematic approach has been proposed to solve the attendant optimization problems by exploiting the specific structure of the information functions. This paper closes those gaps and proposes a simple method to maximize Augustin–Csiszár mutual information of order α under cost constraints by means of the maximization of the α-mutual information subject to an exponential average constraint.

Keywords: information measures, relative entropy, Rényi divergence, mutual information, α-mutual information, Augustin–Csiszár mutual information, data transmission, error exponents, large deviations

1. Introduction

1.1. Phase 1: The MIT School

The capacity C of a stationary memoryless channel is equal to the maximal symbolwise input–output mutual information. Not long after Shannon [1] established this result, Rice [2] observed that, when operating at any encoding rate R<C, there exist codes whose error probability vanishes exponentially with blocklength, with a speed of decay that decreases as R approaches C. This early observation moved the center of gravity of information theory research towards the quest for the reliability function, a term coined by Shannon [3] to refer to the maximal achievable exponential decay as a function of R. The MIT information theory school, and most notably, Elias [4], Feinstein [5], Shannon [3,6], Fano [7], Gallager [8,9], and Shannon, Gallager and Berlekamp [10,11], succeeded in upper/lower bounding the reliability function by the sphere-packing error exponent function and the random coding error exponent function, respectively. Fortunately, these functions coincide for rates between C and a certain value, called the critical rate, thereby determining the reliability function in that region. The influential 1968 textbook by Gallager [9] set down the major error exponent results obtained during Phase 1 of research on this topic, including the expurgation technique to improve upon the random coding error exponent lower bound. Two aspects of those early works (and of Dobrushin’s contemporary papers [12,13] on the topic) stand out:

  • (a)
    The error exponent functions were expressed as the result of the Karush-Kuhn-Tucker optimization of ad hoc functions which, unlike mutual information, carried little insight. In particular, during the first phase, center stage is occupied by the parametrized function of the input distribution $P_X$ and the random transformation (or "channel") $P_{Y|X}$,
    $$E_0(\rho, P_X) = -\log \sum_{y\in\mathcal{B}} \left( \sum_{x\in\mathcal{A}} P_X(x)\, P_{Y|X}^{\frac{1}{1+\rho}}(y|x) \right)^{1+\rho}, \tag{1}$$
    introduced by Gallager in [8].
  • (b)

    Despite the large-deviations nature of the setup, none of the tools from that then-nascent field (other than the Chernoff bound) found their way to the first phase of the work on error exponents; in particular, relative entropy, introduced by Kullback and Leibler [14], failed to put in an appearance.

To this date, the reliability function remains open for low rates even for the binary symmetric channel, despite a number of refined converse and achievability results (e.g., [15,16,17,18,19,20,21]) obtained since [9]. Our focus in this paper is not on converse/achievability techniques but on the role played by various information measures in the formulation of error exponent results.

1.2. Phase 2: Relative Entropy

The second phase of the error exponent research was pioneered by Haroutunian [22] and Blahut [23], who infused the expressions for the error exponent functions with meaning by incorporating relative entropy. The sphere-packing error exponent function corresponding to a random transformation PY|X is given as

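$$E_{\mathrm{sp}}(R) = \sup_{P_X} \min_{\substack{Q_{Y|X}\colon \mathcal{A}\to\mathcal{B} \\ I(P_X, Q_{Y|X}) \le R}} D(Q_{Y|X} \,\|\, P_{Y|X} \,|\, P_X). \tag{2}$$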

Roughly speaking, optimal codes of rate $R<C$ incur errors due to atypical channel behavior, and large deviations theory establishes that the overwhelmingly most likely such behavior can be explained as if the channel were supplanted by the one with mutual information bounded by $R$ that is closest to the true channel in conditional relative entropy $D(Q_{Y|X}\|P_{Y|X}|P_X)$. Within the confines of finite-alphabet memoryless channels, this direction opened the possibility of using the combinatorial method of types to obtain refined results robustifying the choice of the optimal code against incomplete knowledge of the channel. The 1981 textbook by Csiszár and Körner [24] summarizes the main results obtained during Phase 2.

1.3. Phase 3: Rényi Information Measures

Entropy and relative entropy were generalized by Rényi [25], who introduced the notions of Rényi entropy and Rényi divergence of order α. He arrived at Rényi entropy by relaxing the axioms that Shannon proposed in [1] and showed to be satisfied by no measure but entropy. Shortly after [25], Campbell [26] realized the operational role of Rényi entropy in variable-length data compression if the usual average encoding length criterion $\mathbb{E}[\ell(c(X))]$ is replaced by an exponential average $\alpha^{-1}\log\mathbb{E}[\exp(\alpha\,\ell(c(X)))]$. Arimoto [27] put forward a generalized conditional entropy inspired by Rényi’s measures (now known as Arimoto-Rényi conditional entropy) and proposed a generalized mutual information by taking the difference between Rényi entropy and the Arimoto-Rényi conditional entropy. The role of the Arimoto-Rényi conditional entropy in the analysis of the error probability of Bayesian M-ary hypothesis testing problems has been recently shown in [28], tightening and generalizing a number of results dating back to Fano’s inequality [29].

Phase 3 of the error exponent research was pioneered by Csiszár [30] where he established a connection between Gallager’s E0 function and Rényi divergence by means of a Bayesian measure of the discrepancy among a finite collection of distributions introduced by Sibson [31]. Although [31] failed to realize its connection to mutual information, Csiszár [30,32] noticed that it could be viewed as a natural generalization of mutual information. Arimoto [27] also observed that the unconstrained maximization of his generalized mutual information measure with respect to the input distribution coincides with a scaled version of the maximal E0 function. This resulted in an extension of the Arimoto-Blahut algorithm useful for the computation of error exponent functions [33] (see also [34]) for finite-alphabet memoryless channels.

Within Haroutunian’s framework [22] applied in the context of the method of types, Poltyrev [35] proposed an alternative to Gallager’s E0 function, defined by means of a cumbersome maximization over a reverse random transformation. This measure turned out to coincide (modulo different parametrizations) with another generalized mutual information introduced four years earlier by Augustin in his unpublished thesis [36], by means of a minimization with respect to an output probability measure.

The key contribution in the development of this third phase is Csiszár’s paper [32] where he makes a compelling case for the adoption of Rényi’s information measures in the large deviations analysis of lossless data compression, hypothesis testing and data transmission. Recall that more than two decades earlier, Csiszár [30] had already established the connection of Gallager’s E0 function and the generalized mutual information inspired by Sibson [31], which, henceforth, we refer to as the α-mutual information. Therefore, its relevance to the error exponent analysis of error correcting codes had already been established. Incidentally, more recently, another operational role was found for α-mutual information in the context of the large deviations analysis of composite hypothesis testing [37]. In addition to α-mutual information, and always working with discrete alphabets, Csiszár [32] considers the generalized mutual informations due to Arimoto [27], and to Augustin [36], which we refer to as the Augustin–Csiszár mutual information of order α. Csiszár shows that all those three generalizations of mutual information coincide upon their unconstrained maximization with respect to the input distribution. Further relationships among those Rényi-based generalized mutual informations have been obtained in recent years in [38,39,40,41,42,43,44,45]. In [32] the maximal α-mutual information or generalized capacity of order α finds an operational characterization as a generalized cutoff rate–an equivalent way to express the reliability function. This would have been the final word on the topic if it weren’t for its limitation to discrete-alphabet channels, and more importantly, encoding without cost constraints.

1.4. Cost Constraints

If the transmitted codebook is cost-constrained, i.e., every codeword $(c_1,\ldots,c_n)$ is forced to satisfy $\sum_{i=1}^{n} b(c_i) \le n\theta$ for some nonnegative cost function $b(\cdot)$, then the channel capacity is equal to the input–output mutual information maximized over input probability measures restricted to satisfy $\mathbb{E}[b(X)] \le \theta$. Gallager [9] incorporated cost constraints in his treatment of error exponents by generalizing (1) to the function

$$E_0(\rho, P_X, r, \theta) = -\log \sum_{y\in\mathcal{B}} \left( \sum_{x\in\mathcal{A}} P_X(x) \exp\big(r\,b(x) - r\theta\big)\, P_{Y|X}^{\frac{1}{1+\rho}}(y|x) \right)^{1+\rho}, \tag{3}$$

with which he was able to prove an achievability result invoking Shannon’s random coding technique [1]. Gallager also suggested in the footnote of page 329 of [9] that the converse technique of [10] is amenable to extension to prove a sphere-packing converse based on (3). However, an important limitation is that that technique only applies to constant-composition codes (all codewords have the same empirical distribution). A more powerful converse circumventing that limitation (at least for symmetric channels) was given by [46] also expressing the upper bound on the reliability function by optimizing (3) with respect to ρ, r and PX. A notable success of the approach based on the optimization of (3) was the determination of the reliability function (for all rates below capacity) of the direct detection photon channel [47].

In contrast, the Phase 2 expression (2) for the sphere-packing error exponent for cost-constrained channels is much more natural and similar to the way the expression for channel capacity is impacted by cost constraints, namely, we simply constrain the maximization in (2) to satisfy $\mathbb{E}[b(X)] \le \theta$. Unfortunately, no general methods to solve the ensuing optimization have been reported.

Once cost constraints are incorporated, the equivalence among the maximal α-mutual information, maximal order-α Augustin–Csiszár mutual information, and maximal Arimoto mutual information of order α breaks down. Of those three alternatives, it is the maximal Augustin–Csiszár mutual information under cost constraints that appears in the error exponent functions. The challenge is that Augustin–Csiszár mutual information is much harder to evaluate, let alone maximize, than α-mutual information. The Phase 3 effort to encompass cost constraints started by Augustin [36] and was continued recently by Nakiboglu [43]. Their focus was to find a way to express (3) in terms of Rényi information measures. Although, as we explain in Item 62, they did not quite succeed, their efforts were instrumental in developing key properties of the Augustin–Csiszár mutual information.

1.5. Organization

To enhance readability and ease of reference, the rest of this work is organized in 81 items, grouped into 13 sections and an appendix.

Basic notions and notation (including the key concept of α-response) are collected in Section 2. Unlike much of the literature on the topic, we do not restrict attention to discrete input/output alphabets, nor do we impose any topological structures on them.

The paper is essentially self-contained. Section 3 covers the required background material on relative entropy, Rényi divergence of order α, and their conditional versions, including a key representation of Rényi divergence in terms of relative entropies and a tilted probability measure, and additive decompositions of Rényi divergence involving the α-response.

Section 4 studies the basic properties of α-mutual information and order-α Augustin–Csiszár mutual information. This includes their variational representations in terms of conventional (non-Rényi) information measures such as conditional relative entropy and mutual information, which are particularly simple to show in the main range of interest in applications to error exponents, namely, $\alpha\in(0,1)$.

The interrelationships between α-mutual information and order-α Augustin–Csiszár mutual information are covered in Section 5, which introduces the dual notions of α-adjunct and ⟨α⟩-adjunct of an input probability measure.

The maximizations with respect to the input distribution of α-mutual information and order-α Augustin–Csiszár mutual information account for their role in the fundamental limits in data transmission through noisy channels. Section 6 gives a brief review of the results in [45] for the maximization of α-mutual information. For Augustin–Csiszár mutual information, Section 7 covers its unconstrained maximization, which coincides with its α-mutual information counterpart. Section 8 proposes an approach to find $C^{\mathrm{c}}_{\alpha}(\theta)$, the maximal Augustin–Csiszár mutual information of order $\alpha\in(0,1)$ subject to $\mathbb{E}[b(X)]\le\theta$. Instead of trying to identify directly the input distribution that maximizes Augustin–Csiszár mutual information, the method seeks its α-adjunct. This is tantamount to maximizing α-mutual information over a larger set of distributions.

Section 9 shows

$$\rho\, C^{\mathrm{c}}_{\frac{1}{1+\rho}}(\theta) = \min_{r\ge 0}\, \max_{P_X} E_0(\rho, P_X, r, \theta), \tag{4}$$

where the maximization on the right side is unconstrained. In other words, the minimax of Gallager’s E0 function (3) with cost constraints is shown to be equal to the maximal Augustin–Csiszár mutual information, thereby bridging the existing gap between the Phase 1 and Phase 3 representations alluded to earlier in this introduction.

As in [48], Section 10 defines the sphere-packing and random-coding error exponent functions in the natural canonical form of Phase 2 (e.g., (2)), and gives a very simple proof of the nexus between the Phase 2 and Phase 3 representations, namely,

$$E_{\mathrm{sp}}(R) = \sup_{\rho\ge 0} \left\{ \rho\, C^{\mathrm{c}}_{\frac{1}{1+\rho}}(\theta) - \rho R \right\}, \tag{5}$$

with or without cost constraints. In this regard, we note that, although all the ingredients required were already present at the time the revised version of [24] was published three decades after the original, [48] does not cover the role of Rényi’s information measures in channel error exponents.

Examples illustrating the proposed method are given in Section 11 and Section 12 for the additive Gaussian noise channel under a quadratic cost function, and the additive exponential noise channel under a linear cost function, respectively. Simple parametric expressions are given for the error exponent functions, and the least favorable channels that account for the most likely error mechanism (Section 1.2) are identified in both cases.

2. Relative Information and Information Density

We begin with basic terminology and notation required for the subsequent development.

  • 1.

    If $(\mathcal{A},\mathscr{F},P)$ is a probability space, $X\sim P$ indicates $\mathbb{P}[X\in F]=P(F)$ for all $F\in\mathscr{F}$.

  • 2.

    If probability measures $P$ and $Q$ defined on the same measurable space $(\mathcal{A},\mathscr{F})$ satisfy $P(A)=0$ for all $A\in\mathscr{F}$ such that $Q(A)=0$, we say that $P$ is dominated by $Q$, denoted as $P\ll Q$. If $P$ and $Q$ dominate each other, we write $P\ll\gg Q$. If there is an event $A\in\mathscr{F}$ such that $P(A)=0$ and $Q(A)=1$, we say that $P$ and $Q$ are mutually singular, and we write $P\perp Q$.

  • 3.
    If $P\ll Q$, then $\frac{\mathrm{d}P}{\mathrm{d}Q}$ is the Radon-Nikodym derivative of the dominated measure $P$ with respect to the reference measure $Q$. Its logarithm is known as the relative information, namely, the random variable
    $$\imath_{P\|Q}(a) = \log\frac{\mathrm{d}P}{\mathrm{d}Q}(a) \in [-\infty,+\infty), \quad a\in\mathcal{A}. \tag{6}$$
    As with the Radon-Nikodym derivative, any identity involving relative informations can be changed on a set of measure zero under the reference measure without incurring any contradiction. If $P\ll Q\ll R$, then the chain rule of Radon-Nikodym derivatives yields
    $$\imath_{P\|Q}(a) + \imath_{Q\|R}(a) = \imath_{P\|R}(a), \quad a\in\mathcal{A}. \tag{7}$$
    Throughout the paper, the base of $\exp$ and $\log$ is the same and chosen by the reader unless explicitly indicated otherwise. We frequently define a probability measure $P$ from the specification of $\imath_{P\|Q}$ and $Q\gg P$ since
    $$P(A) = \int_A \exp\big(\imath_{P\|Q}(a)\big)\,\mathrm{d}Q(a), \quad A\in\mathscr{F}. \tag{8}$$
    If $X\sim P$ and $Y\sim Q$, it is often convenient to write $\imath_{X\|Y}(x)$ instead of $\imath_{P\|Q}(x)$. Note that
    $$\mathbb{E}\big[\exp\big(\imath_{X\|Y}(Y)\big)\big] = 1. \tag{9}$$

    Example 1.

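    If $X\sim\mathcal{N}(\mu_X,\sigma_X^2)$ (Gaussian with mean $\mu_X$ and variance $\sigma_X^2$) and $Y\sim\mathcal{N}(\mu_Y,\sigma_Y^2)$, then,
    $$\imath_{X\|Y}(a) = \frac{1}{2}\log\frac{\sigma_Y^2}{\sigma_X^2} + \frac{1}{2}\left(\frac{(a-\mu_Y)^2}{\sigma_Y^2} - \frac{(a-\mu_X)^2}{\sigma_X^2}\right)\log e. \tag{10}$$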
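
As a numerical sanity check of Example 1, the following sketch evaluates (10) in nats and verifies the normalization (9) with a midpoint rule. The parameter values, function names, and integration grid are illustrative assumptions, not taken from the paper.

```python
import math

def gaussian_pdf(a, mu, var):
    """Density of N(mu, var) evaluated at a."""
    return math.exp(-(a - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def relative_information(a, mu_x, var_x, mu_y, var_y):
    """Right side of (10) in nats (natural logarithms, so log e = 1)."""
    return 0.5 * math.log(var_y / var_x) + 0.5 * (
        (a - mu_y) ** 2 / var_y - (a - mu_x) ** 2 / var_x
    )

def expectation_under_y(f, mu_y, var_y, half_width=12.0, steps=20000):
    """Midpoint-rule approximation of E[f(Y)] with Y ~ N(mu_y, var_y)."""
    sd = math.sqrt(var_y)
    lo = mu_y - half_width * sd
    h = 2.0 * half_width * sd / steps
    total = 0.0
    for k in range(steps):
        a = lo + (k + 0.5) * h
        total += f(a) * gaussian_pdf(a, mu_y, var_y) * h
    return total

# Hypothetical parameters; var_x < var_y keeps exp(relative information) bounded.
MU_X, VAR_X, MU_Y, VAR_Y = 0.5, 1.0, -0.3, 2.0

# Identity (9): E[exp(i_{X||Y}(Y))] = 1 when Y ~ N(MU_Y, VAR_Y).
normalization = expectation_under_y(
    lambda a: math.exp(relative_information(a, MU_X, VAR_X, MU_Y, VAR_Y)),
    MU_Y,
    VAR_Y,
)

# Pointwise, exp of (10) should equal the ratio of the two Gaussian densities.
POINT = 0.7
density_ratio = gaussian_pdf(POINT, MU_X, VAR_X) / gaussian_pdf(POINT, MU_Y, VAR_Y)
from_formula = math.exp(relative_information(POINT, MU_X, VAR_X, MU_Y, VAR_Y))
print(normalization, density_ratio, from_formula)
```

Working in nats is only a convenience; any other base rescales (10) by a constant factor.
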
  • 4.

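    Let $(\mathcal{A},\mathscr{F})$ and $(\mathcal{B},\mathscr{G})$ be measurable spaces, known as the input and output spaces, respectively. Likewise, $\mathcal{A}$ and $\mathcal{B}$ are referred to as the input and output alphabets, respectively. The simplified notation $P_{Y|X}\colon\mathcal{A}\to\mathcal{B}$ denotes a random transformation from $(\mathcal{A},\mathscr{F})$ to $(\mathcal{B},\mathscr{G})$, i.e., for any $x\in\mathcal{A}$, $P_{Y|X=x}(\cdot)$ is a probability measure on $(\mathcal{B},\mathscr{G})$, and for any $B\in\mathscr{G}$, $P_{Y|X=\cdot}(B)$ is an $\mathscr{F}$-measurable function.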

  • 5.
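    We abbreviate by $\mathcal{P}_{\mathcal{A}}$ the set of probability measures on $(\mathcal{A},\mathscr{F})$, and by $\mathcal{P}_{\mathcal{A}\times\mathcal{B}}$ the set of probability measures on $(\mathcal{A}\times\mathcal{B},\mathscr{F}\otimes\mathscr{G})$. If $P\in\mathcal{P}_{\mathcal{A}}$ and $P_{Y|X}\colon\mathcal{A}\to\mathcal{B}$ is a random transformation, the corresponding joint probability measure is denoted by $PP_{Y|X}\in\mathcal{P}_{\mathcal{A}\times\mathcal{B}}$ (or, interchangeably, $P_{Y|X}P$). The notation $P\to P_{Y|X}\to Q$ simply indicates that the output marginal of the joint probability measure $PP_{Y|X}$ is denoted by $Q\in\mathcal{P}_{\mathcal{B}}$, namely,
    $$Q(B) = \int P_{Y|X}(B|x)\,\mathrm{d}P_X(x) = \mathbb{E}\big[P_{Y|X}(B|X)\big], \quad B\in\mathscr{G}. \tag{11}$$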
  • 6.
    If $P_X\to P_{Y|X}\to P_Y$ and $P_{Y|X=a}\ll P_Y$, the information density $\imath_{X;Y}\colon\mathcal{A}\times\mathcal{B}\to[-\infty,\infty)$ is defined as
    $$\imath_{X;Y}(a;b) = \imath_{P_{Y|X=a}\|P_Y}(b), \quad (a,b)\in\mathcal{A}\times\mathcal{B}. \tag{12}$$

    Following Rényi’s terminology [49], if $P_XP_{Y|X}\ll P_X\times P_Y$, the dependence between $X$ and $Y$ is said to be regular, and the information density can be defined on $(x,y)\in\mathcal{A}\times\mathcal{B}$. Henceforth, we assume that $P_{Y|X}$ is such that the dependence between its input and output is regular regardless of the input probability measure. For example, if $X=Y\in\mathbb{R}$, then $P_{Y|X=a}(A)=1\{a\in A\}$, and their dependence is not regular, since for any $P_X$ with non-discrete components $P_{XY}\not\ll P_X\times P_Y$.

  • 7.
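    Let $\alpha>0$, and $P_X\to P_{Y|X}\to P_Y$. The α-response to $P_X\in\mathcal{P}_{\mathcal{A}}$ is the output probability measure $P_{Y[\alpha]}\ll P_Y$ with relative information given by
    $$\imath_{Y[\alpha]\|Y}(y) = \frac{1}{\alpha}\log\mathbb{E}\big[\exp\big(\alpha\,\imath_{X;Y}(X;y) - \kappa_\alpha\big)\big], \quad X\sim P_X, \tag{13}$$
    where $\kappa_\alpha$ is a scalar that guarantees that $P_{Y[\alpha]}$ is a probability measure. Invoking (9), we obtain
    $$\kappa_\alpha = \alpha\log\mathbb{E}\Big[\mathbb{E}^{\frac{1}{\alpha}}\big[\exp\big(\alpha\,\imath_{X;Y}(X;\bar{Y})\big)\,\big|\,\bar{Y}\big]\Big], \quad (X,\bar{Y})\sim P_X\times P_Y. \tag{14}$$
    For brevity, the dependence of $\kappa_\alpha$ on $P_X$ and $P_{Y|X}$ is omitted. Jensen’s inequality applied to $(\cdot)^\alpha$ results in $\kappa_\alpha\le 0$ for $\alpha\in(0,1)$ and $\kappa_\alpha\ge 0$ for $\alpha>1$. Although the α-response has a long record of service to information theory, this terminology and notation were introduced recently in [45]. Alternative terminology and notation were proposed in [42], which refers to the α-response as the order α Rényi mean. Note that $\kappa_1=0$ and the 1-response to $P_X$ is $P_Y$. If $p_{Y[\alpha]}$ and $p_{Y|X}$ denote the densities of $P_{Y[\alpha]}$ and $P_{Y|X}$ with respect to some common dominating measure, then (13) becomes
    $$p_{Y[\alpha]}(y) = \exp\Big(-\frac{\kappa_\alpha}{\alpha}\Big)\,\mathbb{E}^{\frac{1}{\alpha}}\big[p^{\alpha}_{Y|X}(y|X)\big], \quad X\sim P_X. \tag{15}$$
    For $\alpha>1$ (resp. $\alpha<1$) we can think of the normalized version of $p^{\alpha}_{Y|X}$ as a random transformation with less (resp. more) "noise" than $p_{Y|X}$.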
  • 8.

    We will have opportunity to apply the following examples.

Example 2.

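If $Y=X+N$, where $X\sim\mathcal{N}(\mu_X,\sigma_X^2)$ is independent of $N\sim\mathcal{N}(\mu_N,\sigma_N^2)$, then the α-response to $P_X$ is

$$Y[\alpha]\sim\mathcal{N}\big(\mu_X+\mu_N,\ \alpha\sigma_X^2+\sigma_N^2\big). \tag{16}$$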

Example 3.

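Suppose that $Y=X+N$, where $N$ is exponential with mean $\zeta$, independent of $X$, which is a mixed random variable with density

$$f_X(t) = \frac{\zeta}{\alpha\mu}\,\delta(t) + \left(1-\frac{\zeta}{\alpha\mu}\right)\frac{1}{\mu}\,e^{-t/\mu}\,1\{t>0\}, \tag{17}$$

with $\alpha\mu\ge\zeta$. Then, $Y[\alpha]$, the α-response to $P_X$, is exponential with mean $\alpha\mu$.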
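
For finite alphabets, (14) and (15) reduce to sums, which makes the α-response easy to compute. The sketch below is a minimal illustration under that assumption. The channel matrix, the input distribution, and the function name `alpha_response` are hypothetical choices made here, and densities are taken with respect to counting measure.

```python
import math

def alpha_response(p_x, channel, alpha):
    """Finite-alphabet version of (15).

    p_x[x] is the input distribution and channel[x][y] = P_{Y|X}(y|x).
    Returns the alpha-response as a list together with kappa_alpha in nats.
    """
    n_out = len(channel[0])
    unnormalized = [
        sum(px * row[y] ** alpha for px, row in zip(p_x, channel)) ** (1.0 / alpha)
        for y in range(n_out)
    ]
    total = sum(unnormalized)
    kappa = alpha * math.log(total)  # normalizing constant, cf. (14)
    return [u / total for u in unnormalized], kappa

# Hypothetical binary-input, binary-output channel.
P_X = [0.3, 0.7]
CHANNEL = [[0.9, 0.1], [0.2, 0.8]]

output_marginal = [
    sum(px * row[y] for px, row in zip(P_X, CHANNEL)) for y in range(2)
]
response_1, kappa_1 = alpha_response(P_X, CHANNEL, 1.0)
response_half, kappa_half = alpha_response(P_X, CHANNEL, 0.5)
response_2, kappa_2 = alpha_response(P_X, CHANNEL, 2.0)
print(response_half, kappa_half, kappa_2)
```

With α = 1 the function returns the output marginal and a vanishing constant, and the signs of the constants for α = 1/2 and α = 2 agree with the Jensen argument in Item 7.
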

3. Relative Entropy and Rényi Divergence

Given a pair of probability measures $(P,Q)\in\mathcal{P}_{\mathcal{A}}^2$, relative entropy and Rényi divergence gauge the distinctness between $P$ and $Q$.

  • 9.
    Provided $P\ll Q$, the relative entropy is the expectation of the relative information with respect to the dominated measure
    $$D(P\|Q) = \mathbb{E}\big[\imath_{P\|Q}(X)\big], \quad X\sim P \tag{18}$$
    $$= \mathbb{E}\big[\exp\big(\imath_{P\|Q}(Y)\big)\,\imath_{P\|Q}(Y)\big], \quad Y\sim Q \tag{19}$$
    $$\ge 0, \tag{20}$$
    with equality if and only if $P=Q$. If $P\not\ll Q$, then $D(P\|Q)=\infty$. As in Item 3, if $X\sim P$ and $Y\sim Q$, we may write $D(X\|Y)$ instead of $D(P\|Q)$, in the same spirit that the expectation and entropy of $P$ are written as $\mathbb{E}[X]$ and $H(X)$, respectively.
  • 10.

    A common optimization in information theory, which arises in the sequel, finds, among the probability measures satisfying an average cost constraint, the one that is closest to a given reference measure $Q$ in the sense of $D(\cdot\|Q)$. For that purpose, the following result proves sufficient. Incidentally, we often refer to unconstrained maximizations over probability distributions. It should be understood that those optimizations are still constrained to the sets $\mathcal{P}_{\mathcal{A}}$ or $\mathcal{P}_{\mathcal{B}}$. As customary in information theory, we will abbreviate $\max_{P_X\in\mathcal{P}_{\mathcal{A}}}$ by $\max_X$ or $\max_{P_X}$.

    Theorem 1.

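    Let $P_Z\in\mathcal{P}_{\mathcal{A}}$ and suppose that $g\colon\mathcal{A}\to[0,\infty)$ is a Borel measurable mapping. Then,
    $$\min_X\big\{D(X\|Z)+\mathbb{E}[g(X)]\big\} = -\log\mathbb{E}[\exp(-g(Z))], \tag{21}$$
    achieved uniquely by $P_{X^*}\ll P_Z$ defined by
    $$\imath_{X^*\|Z}(a) = -g(a)-\log\mathbb{E}[\exp(-g(Z))], \quad a\in\mathcal{A}. \tag{22}$$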

    Proof. 

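    Note that since $g$ is nonnegative, $\eta=\mathbb{E}[\exp(-g(Z))]\in(0,1]$. Furthermore,
    $$\mathbb{E}[g(X^*)] = \frac{\int g(t)\exp(-g(t))\,\mathrm{d}P_Z(t)}{\mathbb{E}[\exp(-g(Z))]} \in \left[0,\frac{1}{e\,\eta}\right]. \tag{23}$$
    Therefore, the subset of $\mathcal{P}_{\mathcal{A}}$ for which the term in $\{\cdot\}$ in (21) is finite is nonempty. Fix any $P_X$ from that subset (which therefore satisfies $P_X\ll P_Z\ll\gg P_{X^*}$) and invoke the chain rule (7) to write
    $$D(X\|Z)+\mathbb{E}[g(X)] = \mathbb{E}\big[\imath_{X\|X^*}(X)+\imath_{X^*\|Z}(X)+g(X)\big] \tag{24}$$
    $$= D(X\|X^*)-\log\mathbb{E}[\exp(-g(Z))], \quad X\sim P_X, \tag{25}$$
    which is uniquely minimized by letting $P_X=P_{X^*}$. Note that for typographical convenience we have denoted $X^*\sim P_{X^*}$. □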
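
Theorem 1 can be checked numerically on a finite alphabet, where the minimizer (22) is an exponentially tilted version of the reference measure. The following is a minimal sketch under that assumption. The reference measure, the cost function, the random search, and all function names are illustrative choices, and logarithms are natural.

```python
import math
import random

def relative_entropy(p, q):
    """D(P||Q) in nats on a finite alphabet, with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def objective(p_x, p_z, g):
    """The bracketed term in (21): D(X||Z) + E[g(X)]."""
    return relative_entropy(p_x, p_z) + sum(pi * gi for pi, gi in zip(p_x, g))

def gibbs_minimizer(p_z, g):
    """Minimizer (22) and the optimal value -log E[exp(-g(Z))] from (21)."""
    weights = [pz * math.exp(-gi) for pz, gi in zip(p_z, g)]
    eta = sum(weights)
    return [w / eta for w in weights], -math.log(eta)

# Hypothetical reference measure and nonnegative cost function on four letters.
P_Z = [0.1, 0.2, 0.3, 0.4]
G = [0.0, 0.5, 1.0, 2.0]

p_star, optimum = gibbs_minimizer(P_Z, G)
value_at_p_star = objective(p_star, P_Z, G)

# Random competitors should never beat the value attained by p_star.
rng = random.Random(0)

def random_distribution(n):
    raw = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

gaps = [objective(random_distribution(4), P_Z, G) - optimum for _ in range(1000)]
print(optimum, value_at_p_star, min(gaps))
```

By (25), each gap equals the relative entropy between the competitor and the minimizer, which is why it cannot be negative.
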
  • 11.
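    Let $p$ and $q$ denote the Radon-Nikodym derivatives of probability measures $P$ and $Q$, respectively, with respect to a common dominating σ-finite measure $\mu$. The Rényi divergence of order $\alpha\in(0,1)\cup(1,\infty)$ between $P$ and $Q$ is defined as [25,50]
    $$D_\alpha(P\|Q) = \frac{1}{\alpha-1}\log\int_{\mathcal{A}} p^{\alpha}q^{1-\alpha}\,\mathrm{d}\mu \tag{26}$$
    $$= \frac{1}{\alpha-1}\log\mathbb{E}\big[\exp\big(\alpha\,\imath_{P\|R}(Z)+(1-\alpha)\,\imath_{Q\|R}(Z)\big)\big], \quad Z\sim R \tag{27}$$
    $$= \frac{1}{\alpha-1}\log\mathbb{E}\big[\exp\big(\alpha\,\imath_{P\|Q}(Y)\big)\big], \quad Y\sim Q \tag{28}$$
    $$= \frac{1}{\alpha-1}\log\mathbb{E}\big[\exp\big((\alpha-1)\,\imath_{P\|Q}(X)\big)\big], \quad X\sim P, \tag{29}$$
    where (28) and (29) hold if $P\ll Q$, and in (27), $R$ is a probability measure that dominates both $P$ and $Q$. Note that (28) and (29) state that $(t-1)D_t(X\|Y)$ and $tD_{1+t}(X\|Y)$ are the cumulant generating functions of the random variables $\imath_{X\|Y}(Y)$ and $\imath_{X\|Y}(X)$, respectively. The relative entropy is the limit of $D_\alpha(P\|Q)$ as $\alpha\uparrow 1$, so it is customary to let $D_1(P\|Q)=D(P\|Q)$. For any $\alpha>0$, $D_\alpha(P\|Q)\ge 0$ with equality if and only if $P=Q$. Furthermore, $D_\alpha(P\|Q)$ is non-decreasing in $\alpha$, satisfies the skew-symmetric property
    $$(1-\alpha)D_\alpha(P\|Q) = \alpha D_{1-\alpha}(Q\|P), \quad \alpha\in[0,1], \tag{30}$$
    and
    $$\inf_{\alpha\in(0,1)}D_\alpha(P\|Q)=\infty \iff P\perp Q \implies \inf_{\alpha>1}D_\alpha(P\|Q)=\infty. \tag{31}$$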
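
The properties listed in Item 11 are easy to observe on a finite alphabet, where (26) is a finite sum. The sketch below is illustrative only. The two distributions, the grid of orders, and the function names are assumptions made here, and both distributions are strictly positive so that every order is well defined.

```python
import math

def renyi_divergence(p, q, alpha):
    """D_alpha(P||Q) in nats via (26), for strictly positive p and q."""
    integral = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(integral) / (alpha - 1.0)

def relative_entropy(p, q):
    """D(P||Q) in nats, as in (18)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical distributions on three letters.
P = [0.5, 0.3, 0.2]
Q = [0.2, 0.2, 0.6]

# Monotonicity in the order alpha.
orders = [0.1, 0.3, 0.5, 0.7, 0.9, 1.5, 2.0, 4.0]
values = [renyi_divergence(P, Q, a) for a in orders]
is_nondecreasing = all(u <= v + 1e-12 for u, v in zip(values, values[1:]))

# Skew-symmetry (30) at one order in (0, 1).
ALPHA = 0.3
skew_left = (1.0 - ALPHA) * renyi_divergence(P, Q, ALPHA)
skew_right = ALPHA * renyi_divergence(Q, P, 1.0 - ALPHA)

# Relative entropy as the limit of D_alpha as alpha increases to 1.
near_one = renyi_divergence(P, Q, 1.0 - 1e-6)
limit = relative_entropy(P, Q)
print(values, skew_left, skew_right, near_one, limit)
```
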
  • 12.

    The expressions in the following pair of examples will come in handy in Section 11 and Section 12.

    Example 4.

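    Suppose that $\sigma_\alpha^2=\alpha\sigma_1^2+(1-\alpha)\sigma_0^2>0$ and $\alpha\in(0,1)\cup(1,\infty)$. Then,
    $$D_\alpha\big(\mathcal{N}(\mu_0,\sigma_0^2)\,\|\,\mathcal{N}(\mu_1,\sigma_1^2)\big) = \frac{1}{2}\log\frac{\sigma_1^2}{\sigma_0^2} + \frac{1}{2(\alpha-1)}\log\frac{\sigma_1^2}{\sigma_\alpha^2} + \frac{\alpha(\mu_1-\mu_0)^2}{2\sigma_\alpha^2}\log e, \tag{32}$$
    $$D\big(\mathcal{N}(\mu_0,\sigma_0^2)\,\|\,\mathcal{N}(\mu_1,\sigma_1^2)\big) = \frac{1}{2}\log\frac{\sigma_1^2}{\sigma_0^2} + \frac{1}{2}\left(\frac{\sigma_0^2}{\sigma_1^2}-1\right)\log e + \frac{(\mu_1-\mu_0)^2}{2\sigma_1^2}\log e \tag{33}$$
    $$= \lim_{\alpha\to 1} D_\alpha\big(\mathcal{N}(\mu_0,\sigma_0^2)\,\|\,\mathcal{N}(\mu_1,\sigma_1^2)\big). \tag{34}$$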

    Example 5.

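    Suppose $Z$ is exponentially distributed with unit mean, i.e., its probability density function is $e^{-t}1\{t\ge 0\}$. For $d_0\ge d_1$ and $\alpha$ such that $(1-\alpha)\mu_0+\alpha\mu_1>0$ we obtain
    $$D_\alpha(\mu_0Z+d_0\,\|\,\mu_1Z+d_1) = \frac{d_0-d_1}{\mu_1}\log e + \log\frac{\mu_1}{\mu_0} + \frac{1}{1-\alpha}\log\left(\alpha+(1-\alpha)\frac{\mu_0}{\mu_1}\right),$$
    $$D(\mu_0Z+d_0\,\|\,\mu_1Z+d_1) = \left(\frac{\mu_0}{\mu_1}-1+\frac{d_0-d_1}{\mu_1}\right)\log e + \log\frac{\mu_1}{\mu_0} \tag{35}$$
    $$= \lim_{\alpha\to 1} D_\alpha(\mu_0Z+d_0\,\|\,\mu_1Z+d_1). \tag{36}$$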
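
The closed form (32) can be compared against the definition (26) by direct numerical integration. The following sketch does so in nats for one order below 1 and one above 1. The parameter values, the integration window, and the function names are illustrative assumptions.

```python
import math

def renyi_gaussian_closed_form(mu0, var0, mu1, var1, alpha):
    """Right side of (32) in nats; requires alpha*var1 + (1-alpha)*var0 > 0."""
    var_alpha = alpha * var1 + (1.0 - alpha) * var0
    if var_alpha <= 0.0:
        raise ValueError("(32) requires alpha*var1 + (1-alpha)*var0 > 0")
    return (
        0.5 * math.log(var1 / var0)
        + math.log(var1 / var_alpha) / (2.0 * (alpha - 1.0))
        + alpha * (mu1 - mu0) ** 2 / (2.0 * var_alpha)
    )

def log_gaussian_pdf(a, mu, var):
    """Natural logarithm of the N(mu, var) density at a."""
    return -0.5 * math.log(2.0 * math.pi * var) - (a - mu) ** 2 / (2.0 * var)

def renyi_gaussian_numeric(mu0, var0, mu1, var1, alpha,
                           lo=-30.0, hi=30.0, steps=60000):
    """Definition (26) with the integral approximated by a midpoint rule."""
    h = (hi - lo) / steps
    integral = 0.0
    for k in range(steps):
        a = lo + (k + 0.5) * h
        # Work in the log domain to avoid dividing by underflowed densities.
        integral += math.exp(
            alpha * log_gaussian_pdf(a, mu0, var0)
            + (1.0 - alpha) * log_gaussian_pdf(a, mu1, var1)
        ) * h
    return math.log(integral) / (alpha - 1.0)

# Hypothetical parameters for N(mu0, var0) and N(mu1, var1).
MU0, VAR0, MU1, VAR1 = 0.0, 1.0, 1.0, 2.0

gaps = {
    alpha: abs(
        renyi_gaussian_closed_form(MU0, VAR0, MU1, VAR1, alpha)
        - renyi_gaussian_numeric(MU0, VAR0, MU1, VAR1, alpha)
    )
    for alpha in (0.5, 1.5)
}
half_order = renyi_gaussian_closed_form(MU0, VAR0, MU1, VAR1, 0.5)
print(gaps, half_order)
```
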
  • 13.
    Intimately connected with the notion of Rényi divergence is the tilted probability measure $P_\alpha$ defined, if $D_\alpha(P_1\|P_0)<\infty$, by
    $$\imath_{P_\alpha\|Q}(a) = \alpha\,\imath_{P_1\|Q}(a)+(1-\alpha)\,\imath_{P_0\|Q}(a)+(1-\alpha)D_\alpha(P_1\|P_0), \tag{37}$$
    where $Q$ is any probability measure that dominates both $P_0$ and $P_1$. Although (37) is defined in general, our main emphasis is on the range $\alpha\in(0,1)$, in which, as long as $P_0\not\perp P_1$, the tilted probability measure is defined and satisfies $P_\alpha\ll P_0$ and $P_\alpha\ll P_1$, with corresponding relative informations
    $$\imath_{P_\alpha\|P_0}(a) = \imath_{P_\alpha\|Q}(a)-\imath_{P_0\|Q}(a) \tag{38}$$
    $$= (1-\alpha)D_\alpha(P_1\|P_0)+\alpha\big(\imath_{P_1\|Q}(a)-\imath_{P_0\|Q}(a)\big), \tag{39}$$
    $$\imath_{P_\alpha\|P_1}(a) = \imath_{P_\alpha\|Q}(a)-\imath_{P_1\|Q}(a) \tag{40}$$
    $$= (1-\alpha)D_\alpha(P_1\|P_0)-(1-\alpha)\big(\imath_{P_1\|Q}(a)-\imath_{P_0\|Q}(a)\big), \tag{41}$$
    where we have used the chain rule for $P_\alpha\ll P_0\ll Q$ and $P_\alpha\ll P_1\ll Q$. Taking a linear combination of (38)–(41) we conclude that, for all $a\in\mathcal{A}$,
    $$(1-\alpha)D_\alpha(P_1\|P_0) = (1-\alpha)\,\imath_{P_\alpha\|P_0}(a)+\alpha\,\imath_{P_\alpha\|P_1}(a). \tag{42}$$

    Henceforth, we focus particular attention on the case $\alpha\in(0,1)$ since that is the region of interest in the application of Rényi information measures to the evaluation of error exponents in channel coding for codes whose rate is below capacity. In addition, proofs often simplify considerably for $\alpha\in(0,1)$.

  • 14.

    Much of the interplay between relative entropy and Rényi divergence hinges on the following identity, which appears, without proof, in (3) of [51].

    Theorem 2.

    Let $\alpha\in(0,1)$ and assume that $P_0\not\perp P_1$ are defined on the same measurable space. Then, for any $P\ll P_1$ and $P\ll P_0$,
    $$\alpha D(P\|P_1)+(1-\alpha)D(P\|P_0) = D(P\|P_\alpha)+(1-\alpha)D_\alpha(P_1\|P_0), \tag{43}$$
    where $P_\alpha$ is the tilted probability measure in (37) and (43) holds regardless of whether the relative entropies are finite. In particular,
    $$D(P\|P_\alpha)<\infty \iff \max\{D(P\|P_0),D(P\|P_1)\}<\infty. \tag{44}$$

    Proof. 

    We distinguish three overlapping cases:
    • (1)
      $D(P\|P_\alpha)<\infty$: Taking expectation of (42) with respect to $a\leftarrow X\sim P$ yields (43) because
      $$\mathbb{E}\big[\imath_{P_\alpha\|P_0}(X)\big] = D(P\|P_0)-D(P\|P_\alpha), \tag{45}$$
      $$\mathbb{E}\big[\imath_{P_\alpha\|P_1}(X)\big] = D(P\|P_1)-D(P\|P_\alpha), \tag{46}$$
      where, thanks to the assumption that $D(P\|P_\alpha)<\infty$, we have invoked Corollary A1 in Appendix A twice with $(P,Q,R)\leftarrow(P,P_\alpha,P_0)$ and $(P,Q,R)\leftarrow(P,P_\alpha,P_1)$, respectively;
    • (2)
      $\max\{D(P\|P_0),D(P\|P_1)\}<\infty$: The proof is identical since we are entitled to invoke Corollary A1 to show (45) (resp., (46)) because $D(P\|P_0)<\infty$ (resp., $D(P\|P_1)<\infty$).
    • (3)
      $D(P\|P_\alpha)=\infty$ and $\max\{D(P\|P_0),D(P\|P_1)\}=\infty$: both sides of (43) are equal to $\infty$.
    Finally, to show that (44) follows from (43), simply recall from (31) that $D_\alpha(P_1\|P_0)<\infty$. □
  • 15.

    Relative entropy and Rényi divergence are related by the following fundamental variational representation.

    Theorem 3.

    Fix $\alpha\in(0,1)$ and $(P_1,P_0)\in\mathcal{P}_{\mathcal{A}}^2$. Then, the Rényi divergence between $P_1$ and $P_0$ satisfies
    $$(1-\alpha)D_\alpha(P_1\|P_0) = \min_P\big\{\alpha D(P\|P_1)+(1-\alpha)D(P\|P_0)\big\}, \tag{47}$$
    where the minimum is over $\mathcal{P}_{\mathcal{A}}$. If $P_0\not\perp P_1$, then the right side of (47) is attained by the tilted measure $P_\alpha$, and the minimization can be restricted to the subset of probability measures which are dominated by both $P_1$ and $P_0$.

    Proof. 

    If $P_0\perp P_1$, then both sides of (47) are $+\infty$ since there is no probability measure that is dominated by both $P_0$ and $P_1$. If $P_0\not\perp P_1$, then minimizing both sides of (43) with respect to $P$ yields (47) and the fact that the tilted probability measure attains the minimum therein. □

    The variational representation in (47) was observed in [39] in the finite-alphabet case, and, contemporaneously, in full generality in [50]. Unlike Theorem 3, both of those references also deal with $\alpha>1$. The function $d(\alpha)=(1-\alpha)D_\alpha(P_1\|P_0)$, with $d(1)=\lim_{\alpha\uparrow 1}d(\alpha)$, is concave in $\alpha$ because the right side of (47) is a minimum of affine functions of $\alpha$.
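
On a finite alphabet the tilted measure (37) is a normalized geometric mixture, so the identity (43) and the variational representation (47) can be verified directly. The sketch below is a minimal illustration under that assumption. The distributions, the order, the random test points, and the function names are hypothetical, and logarithms are natural.

```python
import math
import random

def relative_entropy(p, q):
    """D(P||Q) in nats on a finite alphabet, with q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def tilted_measure(p1, p0, alpha):
    """Tilted measure (37) and the quantity (1 - alpha) * D_alpha(P1||P0)."""
    weights = [a ** alpha * b ** (1.0 - alpha) for a, b in zip(p1, p0)]
    total = sum(weights)
    return [w / total for w in weights], -math.log(total)

# Hypothetical distributions on three letters and an order in (0, 1).
P1 = [0.6, 0.3, 0.1]
P0 = [0.2, 0.3, 0.5]
ALPHA = 0.4

p_alpha, scaled_renyi = tilted_measure(P1, P0, ALPHA)

def mixture_cost(p):
    """The bracketed term in (47)."""
    return ALPHA * relative_entropy(p, P1) + (1.0 - ALPHA) * relative_entropy(p, P0)

rng = random.Random(1)

def random_distribution(n):
    raw = [rng.random() + 1e-3 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]

samples = [random_distribution(3) for _ in range(1000)]

# Identity (43): mixture_cost(P) - D(P||P_alpha) is the same for every P.
identity_errors = [
    abs(mixture_cost(p) - relative_entropy(p, p_alpha) - scaled_renyi)
    for p in samples
]
# Representation (47): the tilted measure attains the minimum.
value_at_tilted = mixture_cost(p_alpha)
smallest_sampled = min(mixture_cost(p) for p in samples)
print(scaled_renyi, value_at_tilted, smallest_sampled, max(identity_errors))
```
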

  • 16.
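    Given random transformations $P_{Y|X}\colon\mathcal{A}\to\mathcal{B}$, $Q_{Y|X}\colon\mathcal{A}\to\mathcal{B}$, and a probability measure $P_X\in\mathcal{P}_{\mathcal{A}}$ on the input space, the conditional relative entropy is
    $$D(P_{Y|X}\|Q_{Y|X}|P_X) = D(P_{Y|X}P_X\,\|\,Q_{Y|X}P_X) \tag{48}$$
    $$= \mathbb{E}\big[D\big(P_{Y|X}(\cdot|X)\,\|\,Q_{Y|X}(\cdot|X)\big)\big], \quad X\sim P_X. \tag{49}$$
    Analogously, the conditional Rényi divergence is defined as
    $$D_\alpha(P_{Y|X}\|Q_{Y|X}|P_X) = D_\alpha(P_{Y|X}P_X\,\|\,Q_{Y|X}P_X). \tag{50}$$
    A word of caution: the notation in (50) conforms to that in [38,45] but it is not universally adopted, e.g., [43] uses the left side of (50) to denote the Rényi generalization of the right side of (49). We can express the conditional Rényi divergence as
    $$D_\alpha(P_{Y|X}\|Q_{Y|X}|P_X)$$
    $$= \frac{1}{\alpha-1}\log\mathbb{E}\big[\exp\big((\alpha-1)D_\alpha\big(P_{Y|X}(\cdot|X)\,\|\,Q_{Y|X}(\cdot|X)\big)\big)\big], \quad X\sim P_X, \tag{51}$$
    $$= \frac{1}{\alpha-1}\log\mathbb{E}\left[\left(\frac{\mathrm{d}P_{Y|X}}{\mathrm{d}Q_{Y|X}}(Y|X)\right)^{\alpha-1}\right], \quad (X,Y)\sim P_XP_{Y|X}, \tag{52}$$
    where (52) holds if $P_XP_{Y|X}\ll P_XQ_{Y|X}$. Jensen’s inequality applied to (51) results in
    $$D_\alpha(P_{Y|X}\|Q_{Y|X}|P_X) \le \mathbb{E}\big[D_\alpha\big(P_{Y|X}(\cdot|X)\,\|\,Q_{Y|X}(\cdot|X)\big)\big], \quad \alpha\in(0,1); \tag{53}$$
    $$D_\alpha(P_{Y|X}\|Q_{Y|X}|P_X) \ge \mathbb{E}\big[D_\alpha\big(P_{Y|X}(\cdot|X)\,\|\,Q_{Y|X}(\cdot|X)\big)\big], \quad \alpha>1. \tag{54}$$
    Nevertheless, an immediate and crucial observation we can draw from (51) is that the unconstrained maximizations of the sides of (53) and of (54) over $P_X$ do coincide: for all $\alpha>0$,
    $$\sup_X D_\alpha(P_{Y|X}\|Q_{Y|X}|P_X) = \sup_X \mathbb{E}\big[D_\alpha\big(P_{Y|X}(\cdot|X)\,\|\,Q_{Y|X}(\cdot|X)\big)\big] \tag{55}$$
    $$= \sup_{a\in\mathcal{A}} D_\alpha(P_{Y|X=a}\,\|\,Q_{Y|X=a}). \tag{56}$$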
  • 17.

    Conditional Rényi divergence satisfies the following additive decomposition, originally pointed out, without proof, by Sibson [31] in the setting of finite A.

    Theorem 4.

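    Given $P_X\in\mathcal{P}_{\mathcal{A}}$, $Q_Y\in\mathcal{P}_{\mathcal{B}}$, $P_{Y|X}\colon\mathcal{A}\to\mathcal{B}$, and $\alpha\in(0,1)\cup(1,\infty)$, we have
    $$D_\alpha(P_{Y|X}\|Q_Y|P_X) = D_\alpha(P_{Y|X}\|P_{Y[\alpha]}|P_X)+D_\alpha(P_{Y[\alpha]}\|Q_Y). \tag{57}$$
    Furthermore, with $\kappa_\alpha$ as in (14),
    $$D_\alpha\big(P_{Y|X}\|P_{Y[\alpha]}|P_X\big) = \frac{\kappa_\alpha}{\alpha-1}. \tag{58}$$
    Proof. Select an arbitrary probability measure $R_Y\in\mathcal{P}_{\mathcal{B}}$ that dominates both $Q_Y$ and $P_Y$, and, therefore, $P_{Y[\alpha]}$ too. Letting $(X,Z)\sim P_X\times R_Y$, we have
    $$D_\alpha(P_{Y|X}\|Q_Y|P_X) = \frac{1}{\alpha-1}\log\mathbb{E}\left[\left(\frac{\mathrm{d}P_{XY}}{\mathrm{d}(P_X\times R_Y)}(X,Z)\right)^{\alpha}\left(\frac{\mathrm{d}Q_Y}{\mathrm{d}R_Y}(Z)\right)^{1-\alpha}\right] \tag{59}$$
    $$= \frac{1}{\alpha-1}\log\mathbb{E}\left[\mathbb{E}\big[\exp\big(\alpha\,\imath_{X;Y}(X;Z)\big)\,\big|\,Z\big]\left(\frac{\mathrm{d}P_Y}{\mathrm{d}R_Y}(Z)\right)^{\alpha}\left(\frac{\mathrm{d}Q_Y}{\mathrm{d}R_Y}(Z)\right)^{1-\alpha}\right] \tag{60}$$
    $$= \frac{\kappa_\alpha}{\alpha-1}+\frac{1}{\alpha-1}\log\mathbb{E}\left[\left(\frac{\mathrm{d}P_{Y[\alpha]}}{\mathrm{d}P_Y}(Z)\right)^{\alpha}\left(\frac{\mathrm{d}P_Y}{\mathrm{d}R_Y}(Z)\right)^{\alpha}\left(\frac{\mathrm{d}Q_Y}{\mathrm{d}R_Y}(Z)\right)^{1-\alpha}\right] \tag{61}$$
    $$= \frac{\kappa_\alpha}{\alpha-1}+\frac{1}{\alpha-1}\log\mathbb{E}\left[\left(\frac{\mathrm{d}P_{Y[\alpha]}}{\mathrm{d}R_Y}(Z)\right)^{\alpha}\left(\frac{\mathrm{d}Q_Y}{\mathrm{d}R_Y}(Z)\right)^{1-\alpha}\right] \tag{62}$$
    $$= \frac{\kappa_\alpha}{\alpha-1}+D_\alpha(P_{Y[\alpha]}\|Q_Y), \tag{63}$$
    where (61) follows from (13), and (62) follows from the chain rule of Radon-Nikodym derivatives applied to $P_{Y[\alpha]}\ll P_Y\ll R_Y$. Then, (58) follows by specializing $Q_Y=P_{Y[\alpha]}$, and the proof of (57) is complete, upon plugging (58) into the right side of (63). □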

    A proof of (57) in the discrete case can be found in Appendix A of [37].
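
The decomposition in Theorem 4 can be observed numerically on finite alphabets, where (50) and (15) are finite sums. The following sketch checks (57) and (58) for one order below 1 and one above 1. The channel, the input distribution, the output measure, and all function names are illustrative assumptions, and logarithms are natural.

```python
import math

def renyi_divergence(p, q, alpha):
    """D_alpha(P||Q) in nats for strictly positive finite distributions."""
    integral = sum(pi ** alpha * qi ** (1.0 - alpha) for pi, qi in zip(p, q))
    return math.log(integral) / (alpha - 1.0)

def conditional_renyi_divergence(p_x, channel, q_y, alpha):
    """D_alpha(P_{Y|X} || Q_Y | P_X) as defined in (50)."""
    integral = sum(
        px * sum(pyx ** alpha * qy ** (1.0 - alpha) for pyx, qy in zip(row, q_y))
        for px, row in zip(p_x, channel)
    )
    return math.log(integral) / (alpha - 1.0)

def alpha_response(p_x, channel, alpha):
    """Finite-alphabet alpha-response (15) and the constant kappa_alpha (14)."""
    n_out = len(channel[0])
    unnormalized = [
        sum(px * row[y] ** alpha for px, row in zip(p_x, channel)) ** (1.0 / alpha)
        for y in range(n_out)
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized], alpha * math.log(total)

# Hypothetical two-input, three-output channel and output measure.
P_X = [0.4, 0.6]
CHANNEL = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
Q_Y = [0.5, 0.25, 0.25]

decomposition_errors = []
kappa_errors = []
for alpha in (0.5, 2.0):
    response, kappa = alpha_response(P_X, CHANNEL, alpha)
    left = conditional_renyi_divergence(P_X, CHANNEL, Q_Y, alpha)
    right = (conditional_renyi_divergence(P_X, CHANNEL, response, alpha)
             + renyi_divergence(response, Q_Y, alpha))
    decomposition_errors.append(abs(left - right))            # checks (57)
    kappa_errors.append(abs(
        conditional_renyi_divergence(P_X, CHANNEL, response, alpha)
        - kappa / (alpha - 1.0)
    ))                                                         # checks (58)
print(decomposition_errors, kappa_errors)
```

Because the second term on the right side of (57) is nonnegative, the same computation illustrates that the α-response minimizes the conditional Rényi divergence over the output measure.
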

  • 18.
    For all $\alpha>0$, given two inputs $(P_X,Q_X)\in\mathcal{P}_{\mathcal{A}}^2$ and one random transformation $P_{Y|X}\colon\mathcal{A}\to\mathcal{B}$, Rényi divergence (and, in particular, relative entropy) satisfies the data processing inequality,
    $$D_\alpha(P_X\|Q_X) \ge D_\alpha(P_Y\|Q_Y), \tag{64}$$
    where $P_X\to P_{Y|X}\to P_Y$, and $Q_X\to P_{Y|X}\to Q_Y$. The data processing inequality for Rényi divergence was observed by Csiszár [52] in the more general context of f-divergences. More recently it was stated in [39,50]. Furthermore, given one input $P_X\in\mathcal{P}_{\mathcal{A}}$ and two transformations $P_{Y|X}\colon\mathcal{A}\to\mathcal{B}$ and $Q_{Y|X}\colon\mathcal{A}\to\mathcal{B}$, conditioning cannot decrease Rényi divergence,
    $$D_\alpha(P_{Y|X}\|Q_{Y|X}|P_X) \ge D_\alpha(P_Y\|Q_Y). \tag{65}$$

    Since $D_\alpha(P_{Y|X}\|Q_{Y|X}|P_X)=D_\alpha(P_XP_{Y|X}\,\|\,P_XQ_{Y|X})$, (65) follows by applying (64) to a deterministic transformation which takes an input pair and outputs the second component. Inequalities (53) and (65) imply the convexity of $D_\alpha(P\|Q)$ in $(P,Q)$ for $\alpha\in(0,1]$.

4. Dependence Measures

In this paper we are interested in three information measures that quantify the dependence between random variables $X$ and $Y$, such that $P_X\to P_{Y|X}\to P_Y$, namely, mutual information, and two of its generalizations, α-mutual information and Augustin–Csiszár mutual information of order α.

  • 19.
    The mutual information is
    $$I(X;Y) = I(P_X,P_{Y|X}) = D(P_{Y|X}\|P_Y|P_X) \tag{66}$$
    $$= \min_{Q_Y} D(P_{Y|X}\|Q_Y|P_X) \tag{67}$$
    $$= \min_{Q_Y} D(P_{XY}\,\|\,P_X\times Q_Y). \tag{68}$$
  • 20.
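    Given $\alpha\in(0,1)\cup(1,\infty)$, the α-mutual information is defined as (see [30,31,32,40,42,45])
    $$I_\alpha(X;Y) = I_\alpha(P_X,P_{Y|X}) \tag{69}$$
    $$= \min_{Q_Y} D_\alpha(P_{Y|X}\|Q_Y|P_X) \tag{70}$$
    $$= \min_{Q_Y} D_\alpha(P_{XY}\,\|\,P_X\times Q_Y) \tag{71}$$
    $$= D_\alpha\big(P_{Y|X}\|P_{Y[\alpha]}|P_X\big) \tag{72}$$
    $$= \frac{1}{\alpha-1}\log\mathbb{E}\big[\exp\big((\alpha-1)D_\alpha\big(P_{Y|X}(\cdot|X)\,\|\,P_{Y[\alpha]}\big)\big)\big], \quad X\sim P_X \tag{73}$$
    $$= D_\alpha\big(P_{Y|X}\|P_Y|P_X\big)-D_\alpha\big(P_{Y[\alpha]}\|P_Y\big) \tag{74}$$
    $$= \frac{\kappa_\alpha}{\alpha-1} \tag{75}$$
    $$= \frac{\alpha}{\alpha-1}\log\mathbb{E}\Big[\mathbb{E}^{\frac{1}{\alpha}}\big[\exp\big(\alpha\,\imath_{X;Y}(X;\bar{Y})\big)\,\big|\,\bar{Y}\big]\Big], \quad (X,\bar{Y})\sim P_X\times P_Y, \tag{76}$$
    where (72) and (74) follow from (57); (73) is a special case of (51); (75) follows from Theorem 4; and, (76) is (14). In view of (67) and (69), we let $I_1(X;Y)=I(X;Y)$. The notation we use for α-mutual information conforms to that used in [40,42,45,53]. Other notations include $K_\alpha$ in [32,38,39] and $I^{g}_{\alpha}$ in [43]. $I_0(X;Y)$ and $I_\infty(X;Y)$ are defined by taking the corresponding limits.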
  • 21.
    Theorem 4 and (72) result in the additive decomposition
    $$I_\alpha(X;Y) = D_\alpha(P_{Y|X}\|Q_Y|P_X)-D_\alpha(P_{Y[\alpha]}\|Q_Y), \tag{77}$$
    for any $Q_Y$ with $D_\alpha(P_{Y[\alpha]}\|Q_Y)<\infty$, thereby generalizing the well-known decomposition for mutual information,
    $$I(X;Y) = D(P_{Y|X}\|Q_Y|P_X)-D(P_Y\|Q_Y), \tag{78}$$
    which, in contrast to (77), is a simple consequence of the chain rule whenever the dependence between $X$ and $Y$ is regular, and of Lemma A1 in general.
  • 22.

    Example 6.

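    Additive independent Gaussian noise. If $Y=X+N$, where $X\sim\mathcal{N}(0,\sigma_X^2)$ is independent of $N\sim\mathcal{N}(0,\sigma_N^2)$, then, for $\alpha>0$,
    $$Y[\alpha]\sim\mathcal{N}\big(0,\ \alpha\sigma_X^2+\sigma_N^2\big), \tag{79}$$
    $$I_\alpha(X;X+N) = I_\alpha(X+N;X) = \frac{1}{2}\log\left(1+\alpha\frac{\sigma_X^2}{\sigma_N^2}\right). \tag{80}$$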
  • 23.
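    If $\alpha \in (0,1)$, (47) and (69) result in
    $(1-\alpha) I_\alpha(P_X, P_{Y|X}) = \min_{Q_X Q_{Y|X}} \left\{ D(Q_X \| P_X) + \alpha D(Q_{Y|X} \| P_{Y|X} | Q_X) + (1-\alpha) I(Q_X, Q_{Y|X}) \right\}$. (81)

    For $\alpha > 1$ a proof of (81) is given in [39] for finite alphabets.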

  • 24.
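    Unlike $I(P_X, P_{Y|X})$, we can express $I_\alpha(P_X, P_{Y|X})$ directly in terms of its arguments without involving the corresponding output distribution or the α-response to $P_X$. This is most evident in the case of discrete alphabets, in which (76) becomes
    $I_\alpha(X;Y) = \frac{\alpha}{\alpha-1} \log \sum_{y \in B} \left( \sum_{x \in A} P_X(x) P_{Y|X=x}^\alpha(y) \right)^{\frac{1}{\alpha}}$, (82)
    $I_0(X;Y) = -\log \max_{y \in B} \sum_{x \in A} P_X(x) 1\{P_{Y|X}(y|x) > 0\}$, (83)
    $I_\infty(X;Y) = \log \sum_{b \in B} \sup_{a \colon P_X(a) > 0} P_{Y|X}(b|a)$. (84)
    For example, if $X$ is discrete and $H_\alpha(X)$ denotes the Rényi entropy of order α, then for all $\alpha > 0$,
    $H_\alpha(X) = I_{\frac{1}{\alpha}}(X;X)$. (85)

    If $X$ and $Y$ are equiprobable with $\mathbb{P}[X \neq Y] = \delta$, then, in bits, $I_\alpha(X;Y) = 1 - h_\alpha(\delta)$, where $h_\alpha(\delta)$ denotes the binary Rényi entropy.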
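
The discrete formula (82) is straightforward to evaluate numerically. The following minimal Python sketch (the function names and the test values of δ and α are illustrative choices, not taken from the paper) computes the α-mutual information in bits and compares it with the two special cases just mentioned: the equiprobable binary pair with crossover probability δ, and the identity (85).

```python
import math

def alpha_mutual_information(p_x, channel, alpha):
    """alpha-mutual information in bits via the discrete formula (82).

    p_x[x] is the input pmf and channel[x][y] plays the role of P_{Y|X}(y|x).
    """
    if alpha <= 0 or alpha == 1:
        raise ValueError("this sketch covers alpha in (0,1) or (1,inf) only")
    total = 0.0
    for y in range(len(channel[0])):
        inner = sum(p * row[y] ** alpha for p, row in zip(p_x, channel))
        total += inner ** (1.0 / alpha)
    return alpha / (alpha - 1.0) * math.log2(total)

def renyi_entropy(p, alpha):
    """Renyi entropy of order alpha (alpha != 1) in bits."""
    return math.log2(sum(q ** alpha for q in p if q > 0)) / (1.0 - alpha)

# Equiprobable binary X and Y with P[X != Y] = delta (illustrative values).
delta, alpha = 0.11, 0.5
bsc = [[1 - delta, delta], [delta, 1 - delta]]
i_bsc = alpha_mutual_information([0.5, 0.5], bsc, alpha)
h_bin = renyi_entropy([delta, 1 - delta], alpha)
print(i_bsc, 1 - h_bin)

# Identity (85): H_alpha(X) = I_{1/alpha}(X;X), here with alpha = 2.
p = [0.2, 0.3, 0.5]
identity = [[1.0 if x == y else 0.0 for y in range(3)] for x in range(3)]
print(renyi_entropy(p, 2.0), alpha_mutual_information(p, identity, 0.5))
```

Each printed pair should agree up to floating-point rounding.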

  • 25.

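    In the main region of interest, namely, $\alpha \in (0,1)$, frequently we use a different parametrization in terms of $\rho > 0$, with $\alpha = \frac{1}{1+\rho}$.

    Theorem 5.

    For any $\rho > 0$, we have the upper bound
    $\rho\, I_{\frac{1}{1+\rho}}(X;Y) \le \min_{Q_{Y|X} \colon A \to B} \left\{ D(Q_{Y|X} \| P_{Y|X} | P_X) + \rho\, I(P_X, Q_{Y|X}) \right\}$. (86)

    Proof. 

    Fix $Q_{Y|X} \colon A \to B$, and let $P_X \to Q_{Y|X} \to Q_Y$. Then,
    $I_{\frac{1}{1+\rho}}(X;Y) \le D_{\frac{1}{1+\rho}}(P_{XY} \| P_X \times Q_Y)$ (87)
    $= \frac{1+\rho}{\rho} \min_{R_{XY}} \left\{ \frac{1}{1+\rho} D(R_{XY} \| P_{XY}) + \frac{\rho}{1+\rho} D(R_{XY} \| P_X \times Q_Y) \right\}$ (88)
    $\le \frac{1}{\rho} D(Q_{Y|X} P_X \| P_{XY}) + D(Q_{Y|X} P_X \| P_X \times Q_Y)$ (89)
    $= \frac{1}{\rho} D(Q_{Y|X} \| P_{Y|X} | P_X) + I(P_X, Q_{Y|X})$, (90)
    where (87), (88) and (90) follow from (69), (47) and (66) respectively. □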

    Just like (53), we will show in Section 7 that (86) becomes an equality upon the unconstrained maximization of both sides.

  • 26.
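    Before introducing the last dependence measure in this section, recall from Definition 7 and (58) that $P_{Y[\alpha]} \ll\gg P_Y$, the α-response (of $P_{Y|X}$) to $P_X$ defined by
    $\imath_{Y[\alpha] \| Y}(y) = \frac{1}{\alpha} \log \mathbb{E}\left[\exp\left(\alpha\,\imath_{X;Y}(X;y) + (1-\alpha) D_\alpha(P_{Y|X} \| P_{Y[\alpha]} | P_X)\right)\right]$, (91)
    attains $\min_{Q_Y} D_\alpha(P_{Y|X} \| Q_Y | P_X)$, where the expectation is with respect to $X \sim P_X$. We proceed to define $P_{Y\langle\alpha\rangle} \ll\gg P_Y$, the ⟨α⟩-response (of $P_{Y|X}$) to $P_X$ by means of
    $\imath_{Y\langle\alpha\rangle \| Y}(y) = \frac{1}{\alpha} \log \mathbb{E}\left[\exp\left(\alpha\,\imath_{X;Y}(X;y) + (1-\alpha) D_\alpha(P_{Y|X}(\cdot|X) \| P_{Y\langle\alpha\rangle})\right)\right]$, (92)
    with $X \sim P_X$. Note that $P_{Y\langle 1 \rangle} = P_{Y[1]} = P_Y$.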
  • 27.
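    In the case of discrete alphabets, (92) becomes the implicit equation
    $P_{Y\langle\alpha\rangle}^\alpha(y) = \sum_{a \in A} \frac{P_X(a)\, P_{Y|X}^\alpha(y|a)}{\sum_{b \in B} P_{Y|X}^\alpha(b|a)\, P_{Y\langle\alpha\rangle}^{1-\alpha}(b)}, \quad y \in B$, (93)
    which coincides with (9.24) in Fano’s 1961 textbook [7], with $s \leftarrow 1-\alpha$, and is also given by Haroutunian in (19) of [22]. For example, if $A = B$ is discrete and $Y = X$, then $P_{Y\langle\alpha\rangle} = P_X$, while $P_{Y[\alpha]}^\alpha(y) = c\,P_X(y)$, $y \in A$.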
  • 28.

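    The ⟨α⟩-response satisfies the following identity, which can be regarded as the counterpart of (57) satisfied by the α-response.

    Theorem 6.

    Fix $P_X \in \mathcal{P}_A$, $P_{Y|X} \colon A \to B$ and $Q_Y \in \mathcal{P}_B$. Then,
    $D_\alpha(P_{Y\langle\alpha\rangle} \| Q_Y) = \frac{1}{\alpha-1} \log \mathbb{E}\left[\exp\left((1-\alpha)\left(D_\alpha(P_{Y|X}(\cdot|X) \| P_{Y\langle\alpha\rangle}) - D_\alpha(P_{Y|X}(\cdot|X) \| Q_Y)\right)\right)\right]$. (94)

    Proof. 

    For brevity we assume $Q_Y \ll P_Y$. Otherwise, the proof is similar adopting a reference measure that dominates both $Q_Y$ and $P_Y$. The definition of unconditional Rényi divergence in Item 11 implies that we can write the exponential of $(\alpha-1)$ times the left side of (94) as
    $\exp\left((\alpha-1) D_\alpha(P_{Y\langle\alpha\rangle} \| Q_Y)\right) = \mathbb{E}\left[\left(\frac{dP_{Y\langle\alpha\rangle}}{dP_Y}(Y)\right)^{\alpha} \left(\frac{dQ_Y}{dP_Y}(Y)\right)^{1-\alpha}\right]$ (95)
    $= \mathbb{E}\left[\exp\left(\alpha\,\imath_{X;Y}(X;Y) + (1-\alpha) D_\alpha(P_{Y|X}(\cdot|X) \| P_{Y\langle\alpha\rangle})\right) \left(\frac{dQ_Y}{dP_Y}(Y)\right)^{1-\alpha}\right]$
    $= \mathbb{E}\left[\mathbb{E}\left[\exp\left(\alpha\,\imath_{X;Y}(X;Y) + (1-\alpha)\left(\imath_{Q_Y \| P_Y}(Y) + D_\alpha(P_{Y|X}(\cdot|X) \| P_{Y\langle\alpha\rangle})\right)\right) \middle| X\right]\right]$ (96)
    $= \mathbb{E}\left[\exp\left((1-\alpha)\left(D_\alpha(P_{Y|X}(\cdot|X) \| P_{Y\langle\alpha\rangle}) - D_\alpha(P_{Y|X}(\cdot|X) \| Q_Y)\right)\right)\right]$, (97)
    where $(X,Y) \sim P_X \times P_Y$, (96) follows from (92), and (97) follows from the definition of unconditional Rényi divergence in (27). □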

    Theorem 7.

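    If $\alpha \in (0,1]$, then
    $D_\alpha(P_{Y\langle\alpha\rangle} \| Q_Y) \le \mathbb{E}[D_\alpha(P_{Y|X}(\cdot|X) \| Q_Y)] - \mathbb{E}[D_\alpha(P_{Y|X}(\cdot|X) \| P_{Y\langle\alpha\rangle})]$ (98)
    $\le D(P_{Y\langle\alpha\rangle} \| Q_Y)$. (99)
    If $\alpha \ge 1$, inequalities (98) and (99) are reversed.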

    Proof. 

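    Assume $\alpha \in (0,1]$. Jensen’s inequality applied to the right side of (94) results in (98). To show (99), again we assume for brevity $Q_Y \ll P_Y$, and define the positive functions $V \colon A \times B \to (0,\infty)$ and $W \colon A \times B \to (0,\infty)$,
    $V(x,y) = \exp\left(\alpha\,\imath_{X;Y}(x;y) + (1-\alpha)\,\imath_{Y\langle\alpha\rangle \| Y}(y)\right)$, (100)
    $W(x,y) = \exp\left(\alpha\,\imath_{X;Y}(x;y) + (1-\alpha)\,\imath_{Q_Y \| P_Y}(y)\right)$. (101)
    Note that, with $(X,Y) \sim P_X \times P_Y$, and $(x,y) \in A \times B$,
    $\mathbb{E}[V(x,Y)] = \exp\left((\alpha-1) D_\alpha(P_{Y|X=x} \| P_{Y\langle\alpha\rangle})\right)$, (102)
    $\mathbb{E}[W(x,Y)] = \exp\left((\alpha-1) D_\alpha(P_{Y|X=x} \| Q_Y)\right)$, (103)
    $\mathbb{E}\left[\frac{V(X,y)}{\mathbb{E}[V(X,Y)|X]}\right] = \exp\left((1-\alpha)\,\imath_{Y\langle\alpha\rangle \| Y}(y)\right) \cdot \mathbb{E}\left[\exp\left(\alpha\,\imath_{X;Y}(X;y) + (1-\alpha) D_\alpha(P_{Y|X}(\cdot|X) \| P_{Y\langle\alpha\rangle})\right)\right]$ (104)
    $= \frac{dP_{Y\langle\alpha\rangle}}{dP_Y}(y)$, (105)
    where (104) uses (100) and (102), and (105) follows from (92). Then,
    $D_\alpha(P_{Y|X=x} \| Q_Y) - D_\alpha(P_{Y|X=x} \| P_{Y\langle\alpha\rangle})$
    $= \frac{1}{1-\alpha} \log \frac{\mathbb{E}[V(x,Y)]}{\mathbb{E}[W(x,Y)]}$ (106)
    $\le \frac{1}{1-\alpha} \mathbb{E}\left[\frac{V(x,Y)}{\mathbb{E}[V(x,Y)]} \log \frac{V(x,Y)}{W(x,Y)}\right]$ (107)
    $= \mathbb{E}\left[\frac{V(x,Y)}{\mathbb{E}[V(x,Y)]} \left(\imath_{Y\langle\alpha\rangle \| Y}(Y) - \imath_{Q_Y \| P_Y}(Y)\right)\right]$, (108)
    where the expectations are with respect to $Y \sim P_Y$, and
    • (107) follows from the log-sum inequality for integrable non-negative random variables,
      $\mathbb{E}[V] \log \frac{\mathbb{E}[V]}{\mathbb{E}[W]} \le \mathbb{E}\left[V \log \frac{V}{W}\right]$; (109)
    • (108) ⇐ (100) and (101).
    Taking expectation with respect to $X \sim P_X$ of (106)–(108) yields (99) because of Lemma A1 and (105). If $\alpha \ge 1$, then Jensen’s inequality applied to the right side of (94) results in (98) but with the opposite inequality. Moreover, (107) is reversed and the remainder of the proof holds verbatim. □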

    In the case of finite input-alphabets, a different proof of (99) is given in Appendix B of [54].

  • 29.
    Introduced in the unpublished dissertation [36] and rescued from oblivion in [32], the Augustin–Csiszár mutual information of order α is defined for $\alpha > 0$ as
    $I_\alpha^c(X;Y) = I_\alpha^c(P_X, P_{Y|X}) = \min_{Q_Y} \mathbb{E}[D_\alpha(P_{Y|X}(\cdot|X) \| Q_Y)]$ (110)
    $= \mathbb{E}[D_\alpha(P_{Y|X}(\cdot|X) \| P_{Y\langle\alpha\rangle})]$, (111)
    where (111) follows from (98) if $\alpha \in (0,1]$, and from the reverse of (99) if $\alpha \ge 1$. We conform to the notation in [40], where $I_\alpha^a$ was used to denote the difference between entropy and Arimoto-Rényi conditional entropy. In [32,39,43] the Augustin–Csiszár mutual information of order α is denoted by $I_\alpha$. In Augustin’s original notation [36], $I_\rho(P_X)$ means $I_{1-\rho}^c(P_X, P_{Y|X})$, $\rho \in (0,1)$. Independently of [36], Poltyrev [35] introduced a functional (expressed as a maximization over a reverse random transformation) which turns out to be $\rho\, I_{\frac{1}{1+\rho}}^c(X;Y)$ and which he denoted by $E_0(\rho, P_X)$, although in Gallager’s notation that corresponds to $\rho\, I_{\frac{1}{1+\rho}}(X;Y)$, as we will see in (233). $I_0^c(X;Y)$ and $I_\infty^c(X;Y)$ are defined by taking the corresponding limits.
  • 30.
    In the discrete case, (110) boils down to
    $I_\alpha^c(X;Y) = \min_{Q_Y} \frac{1}{\alpha-1} \sum_{x \in A} P_X(x) \log \sum_{y \in B} P_{Y|X}^\alpha(y|x)\, Q_Y^{1-\alpha}(y)$, (112)
    which can be juxtaposed with the much easier expression in (82) for $I_\alpha(X;Y)$ involving no further optimization. Minimizing the Lagrangian, we can verify that the minimizer in (112) satisfies (93). With $(X,\bar{Y}) \sim P_X \times Q_Y$, we have
    I0c(X;Y)=minQYElog1P[PY|X(Y¯|X)>0X], (113)
    Ic(X;Y)=minQYElogPY|X(Y¯|X)QY(Y¯), (114)
    where the expectations are with respect to X.
  • 31.
    The respective minimizers of (72) and (110), namely, the α-response and the ⟨α⟩-response, are quite different. Most notably, in contrast to Item 7, an explicit expression for $P_{Y\langle\alpha\rangle}$ is unknown. Instead of defining $P_{Y\langle\alpha\rangle}$ through (92), [36] defines it, equivalently, as the fixed point of the operator (dubbed the Augustin operator in [43]) which maps the set of probability measures on the output space to itself,
    $\frac{dT_\alpha(Q)}{dQ}(y) = \mathbb{E}\left[\left(\frac{dP_{Y|X}}{dQ}(y|X)\right)^{\alpha} \exp\left((1-\alpha) D_\alpha(P_{Y|X}(\cdot|X) \| Q)\right)\right]$, (115)
    where $X \sim P_X$. Although we do not rely on them, Lemma 34.2 of ($\alpha \in (0,1)$) and Lemma 13 of [43] ($\alpha > 1$) claim that the minimizer in (110), referred to in [43] as the Augustin mean of order α, is unique and is a fixed point of the operator $T_\alpha$ regardless of $P_X$. Moreover, Lemma 13(c) of [43] establishes that for $\alpha \in (0,1)$ and finite input alphabets, repeated iterations of the operator $T_\alpha$ with initial argument $P_{Y[\alpha]}$ converge to $P_{Y\langle\alpha\rangle}$.
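
For finite alphabets and α in (0,1), the fixed-point characterization suggests a simple numerical recipe: start from the α-response and apply the discrete form of the operator in (115) repeatedly. The Python sketch below is only an illustration under those assumptions (function names, iteration count and test channels are hypothetical choices); it does not come from the paper. For the erasure transformation with equiprobable inputs, the result can be compared with the closed form given later in (138).

```python
import math

def renyi_div(p, q, alpha):
    """Renyi divergence D_alpha(p||q) in nats for pmfs, alpha in (0,1)."""
    s = sum(pi ** alpha * qi ** (1 - alpha) for pi, qi in zip(p, q))
    return math.log(s) / (alpha - 1.0)

def alpha_response(p_x, channel, alpha):
    """Discrete alpha-response: proportional to (sum_x P_X(x) P(y|x)^alpha)^(1/alpha)."""
    w = [sum(px * row[y] ** alpha for px, row in zip(p_x, channel)) ** (1 / alpha)
         for y in range(len(channel[0]))]
    z = sum(w)
    return [v / z for v in w]

def augustin_step(q, p_x, channel, alpha):
    """One application of the operator T_alpha of (115) in the discrete case."""
    out = [0.0] * len(q)
    for px, row in zip(p_x, channel):
        if px == 0.0:
            continue
        tilted = [r ** alpha * qy ** (1 - alpha) for r, qy in zip(row, q)]
        norm = sum(tilted)
        for y, t in enumerate(tilted):
            out[y] += px * t / norm
    return out

def augustin_csiszar(p_x, channel, alpha, iters=500):
    """Return (I^c_alpha in nats, approximate <alpha>-response)."""
    q = alpha_response(p_x, channel, alpha)
    for _ in range(iters):
        q = augustin_step(q, p_x, channel, alpha)
    value = sum(px * renyi_div(row, q, alpha)
                for px, row in zip(p_x, channel) if px > 0)
    return value, q

# Erasure transformation (outputs 0, 1, e) with equiprobable inputs.
delta, alpha = 0.3, 0.5
erasure = [[1 - delta, 0.0, delta], [0.0, 1 - delta, delta]]
value, q = augustin_csiszar([0.5, 0.5], erasure, alpha)
closed = alpha / (alpha - 1) * math.log(delta + (1 - delta) * 2 ** (1 - 1 / alpha))
print(value, closed)
```

The fixed iteration count is a pragmatic choice for this sketch; a residual-based stopping rule would be the natural refinement.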
  • 32.

    It is interesting to contrast the next example with the formulas in Examples 2 and 6.

    Example 7.

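    Additive independent Gaussian noise. If $Y = X + N$, where $X \sim \mathcal{N}(0, \sigma_X^2)$ is independent of $N \sim \mathcal{N}(0, \sigma_N^2)$, then
    $Y_{\langle\alpha\rangle} \sim \mathcal{N}\left(0, \frac{\sigma_N^2}{2}\left(2 - \frac{1}{\alpha} + \Delta + \mathsf{snr}\right)\right)$, (116)
    $\mathsf{snr} = \frac{\sigma_X^2}{\sigma_N^2}$, (117)
    $\Delta = \sqrt{4\,\mathsf{snr} + \left(\frac{1}{\alpha} - \mathsf{snr}\right)^2}$. (118)
    This result can be obtained by postulating a zero-mean Gaussian distribution with variance $v_\alpha^2$ as $P_{Y\langle\alpha\rangle}$ and verifying that (92) is indeed satisfied if $v_\alpha^2$ is chosen as in (116). The first step is to invoke (32), which yields
    $D_\alpha(P_{Y|X=x} \| P_{Y\langle\alpha\rangle}) = \frac{\lambda_\alpha}{2} + \frac{\alpha x^2}{2 s_\alpha^2} \log e$, (119)
    $\lambda_\alpha = \log \frac{v_\alpha^2}{\sigma_N^2} + \frac{1}{\alpha-1} \log \frac{v_\alpha^2}{s_\alpha^2}$, (120)
    where we have denoted $s_\alpha^2 = \alpha v_\alpha^2 + (1-\alpha)\sigma_N^2$. Since $Y \sim \mathcal{N}(0, \sigma_X^2 + \sigma_N^2)$,
    $\imath_{X;Y}(x;y) = \frac{1}{2} \log \frac{\sigma_X^2 + \sigma_N^2}{\sigma_N^2} + \frac{1}{2}\left(\frac{y^2}{\sigma_X^2 + \sigma_N^2} - \frac{(y-x)^2}{\sigma_N^2}\right) \log e$, (121)
    $\imath_{Y\langle\alpha\rangle \| Y}(y) = \frac{1}{2} \log \frac{\sigma_X^2 + \sigma_N^2}{v_\alpha^2} + \frac{1}{2}\left(\frac{y^2}{\sigma_X^2 + \sigma_N^2} - \frac{y^2}{v_\alpha^2}\right) \log e$. (122)
    Assembling (120) and (121), the right side of (92) becomes
    $\frac{1}{\alpha} \log \mathbb{E}\left[\exp\left(\alpha\,\imath_{X;Y}(X;y) + (1-\alpha) D_\alpha(P_{Y|X}(\cdot|X) \| P_{Y\langle\alpha\rangle})\right)\right]$
    $= \frac{1}{2} \log \frac{\sigma_X^2 + \sigma_N^2}{\sigma_N^2} + \frac{1}{2} \frac{y^2 \log e}{\sigma_X^2 + \sigma_N^2} + \frac{1-\alpha}{2\alpha} \lambda_\alpha + \frac{1}{\alpha} \log \mathbb{E}\left[\exp_e\left(-\frac{\alpha (y-X)^2}{2\sigma_N^2} + \frac{\alpha(1-\alpha) X^2}{2 s_\alpha^2}\right)\right]$ (123)
    $= \frac{1}{2} \log \frac{\sigma_X^2 + \sigma_N^2}{\sigma_N^2} + \frac{1-\alpha}{2\alpha} \lambda_\alpha + \frac{y^2 \log e}{2}\left(\frac{1}{\sigma_X^2 + \sigma_N^2} - \frac{s_\alpha^2 - \alpha(1-\alpha)\sigma_X^2}{\sigma_N^2 s_\alpha^2 + \alpha^2 v_\alpha^2 \sigma_X^2}\right) + \frac{1}{2\alpha} \log \frac{\sigma_N^2 s_\alpha^2}{\sigma_N^2 s_\alpha^2 + \alpha^2 v_\alpha^2 \sigma_X^2}$ (124)
    $= \frac{1}{2} \log \frac{\sigma_X^2 + \sigma_N^2}{v_\alpha^2} + \frac{1}{2}\left(\frac{y^2}{\sigma_X^2 + \sigma_N^2} - \frac{y^2}{v_\alpha^2}\right) \log e$, (125)
    where (124) follows by Gaussian integration, and the marvelous simplification in (125) is satisfied provided that we choose
    $s_\alpha^2 = \frac{\alpha \sigma_X^2 v_\alpha^2}{v_\alpha^2 - \sigma_N^2}$. (126)
    Comparing (122) and (125), we see that (92) is indeed satisfied with $Y_{\langle\alpha\rangle} \sim \mathcal{N}(0, v_\alpha^2)$ if $v_\alpha^2$ satisfies the quadratic Equation (126), whose solution is in (116)–(118). Invoking (32) and (116), we obtain
    $I_\alpha^c(X; X+N) = \frac{\alpha\,\mathsf{snr}}{1 + \alpha\Delta + \alpha\,\mathsf{snr}} \log e + \frac{1}{2} \log\left(1 + \frac{1}{2}\left(\Delta + \mathsf{snr} - \frac{1}{\alpha}\right)\right) - \frac{1}{2(1-\alpha)} \log \frac{2 - \frac{1}{\alpha} + \Delta + \mathsf{snr}}{1 + \alpha\Delta + \alpha\,\mathsf{snr}}$. (127)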

    Beyond its role in evaluating the Augustin–Csiszár mutual information for Gaussian inputs, the Gaussian distribution in (116) has found some utility in the analysis of finite blocklength fundamental limits for data transmission [55].
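
As a sanity check on Example 7, the sketch below (plain Python; function names and the numerical values of α and the variances are arbitrary illustrative choices) evaluates the variance in (116)–(118), confirms that it solves the quadratic relation (126), and computes the Augustin–Csiszár mutual information in nats by averaging (119) over the Gaussian input. The resulting value can be compared with the α-mutual information (80) and with the ordinary mutual information.

```python
import math

def augustin_response_variance(alpha, var_x, var_n):
    """Variance of the <alpha>-response according to (116)-(118)."""
    snr = var_x / var_n
    delta = math.sqrt(4 * snr + (1 / alpha - snr) ** 2)
    return 0.5 * var_n * (2 - 1 / alpha + delta + snr)

def augustin_csiszar_gaussian(alpha, var_x, var_n):
    """E[D_alpha(N(X, var_n) || N(0, v2))] in nats, from (119)-(120), alpha != 1."""
    v2 = augustin_response_variance(alpha, var_x, var_n)
    s2 = alpha * v2 + (1 - alpha) * var_n
    lam = math.log(v2 / var_n) + math.log(v2 / s2) / (alpha - 1)
    return 0.5 * lam + alpha * var_x / (2 * s2)

alpha, var_x, var_n = 0.5, 1.0, 1.0
v2 = augustin_response_variance(alpha, var_x, var_n)
s2 = alpha * v2 + (1 - alpha) * var_n
print(s2, alpha * var_x * v2 / (v2 - var_n))          # both sides of (126)

i_alpha = 0.5 * math.log(1 + alpha * var_x / var_n)   # (80), in nats
i_c = augustin_csiszar_gaussian(alpha, var_x, var_n)
i_shannon = 0.5 * math.log(1 + var_x / var_n)
print(i_alpha, i_c, i_shannon)
```

For α in (0,1) the three printed information values should appear in increasing order, in line with the ordering stated in (143).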

  • 33.

    This item gives a variational representation for the Augustin–Csiszár mutual information in terms of mutual information and conditional relative entropy (i.e., non-Rényi information measures). As we will see in Section 10, this representation accounts for the role played by Augustin–Csiszár mutual information in expressing error exponent functions.

    Theorem 8.

    For $\alpha \in (0,1)$, the Augustin–Csiszár mutual information satisfies the variational representation in terms of conditional relative entropy and mutual information,
    $(1-\alpha) I_\alpha^c(P_X, P_{Y|X}) = \min_{Q_{Y|X}} \left\{ \alpha D(Q_{Y|X} \| P_{Y|X} | P_X) + (1-\alpha) I(P_X, Q_{Y|X}) \right\}$, (128)
    where the minimum is over all the random transformations from the input to the output spaces.

    Proof. 

    Invoking (47) with $(P_1, P_0) \leftarrow (P_{Y|X=x}, Q_Y)$ we obtain
    $(1-\alpha) D_\alpha(P_{Y|X=x} \| Q_Y) = \min_{R_Y} \left\{ \alpha D(R_Y \| P_{Y|X=x}) + (1-\alpha) D(R_Y \| Q_Y) \right\}$ (129)
    $= \min_{R_{Y|X=x}} \left\{ \alpha D(R_{Y|X=x} \| P_{Y|X=x}) + (1-\alpha) D(R_{Y|X=x} \| Q_Y) \right\}$. (130)
    Averaging over $x \sim P_X$, followed by minimization with respect to $Q_Y$ yields (128) upon recalling (67). □
    In the finite-alphabet case with $\alpha \in (0,1) \cup (1,\infty)$, the representation in (128) is implicit in the appendix of [32], and stated explicitly in [39], where it is shown by means of a minimax theorem. This is one of the instances in which the proof of the result is considerably easier for $\alpha \in (0,1)$; we can take the following route to show (128) for $\alpha > 1$. Neglecting to emphasize its dependence on $P_X$, denote
    $f_\alpha(Q_Y, R_{Y|X}) = \frac{\alpha}{1-\alpha} D(R_{Y|X} \| P_{Y|X} | P_X) + D(R_{Y|X} \| Q_Y | P_X)$. (131)
    Invoking (47) we obtain
    $D_\alpha(P_{Y|X=x} \| Q_Y) = \max_{R_{Y|X=x}} \left\{ \frac{\alpha}{1-\alpha} D(R_{Y|X=x} \| P_{Y|X=x}) + D(R_{Y|X=x} \| Q_Y) \right\}$. (132)
    Averaging (132) with respect to $P_X$ followed by minimization over $Q_Y$, results in
    $I_\alpha^c(P_X, P_{Y|X}) = \min_{Q_Y} \max_{R_{Y|X}} f_\alpha(Q_Y, R_{Y|X})$ (133)
    $\ge \max_{R_{Y|X}} \min_{Q_Y} f_\alpha(Q_Y, R_{Y|X})$ (134)
    $= \max_{R_{Y|X}} \left\{ \frac{\alpha}{1-\alpha} D(R_{Y|X} \| P_{Y|X} | P_X) + I(P_X, R_{Y|X}) \right\}$, (135)
    which shows ≥ in (128). If a minimax theorem can be invoked to show equality in (134), then (128) is established for $\alpha > 1$. To that end, note that for fixed $R_{Y|X}$, $f(\cdot, R_{Y|X})$ is convex and lower semicontinuous in $Q_Y$ on the set where it is finite. Rewriting
    $f(Q_Y, R_{Y|X}) = \frac{1}{1-\alpha} D(R_{Y|X} \| P_{Y|X} | P_X) + D(R_{Y|X} \| Q_Y | P_X) - D(R_{Y|X} \| P_{Y|X} | P_X)$, (136)
    it can be seen that $f(Q_Y, \cdot)$ is upper semicontinuous and concave (if $\alpha > 1$). A different, and considerably more intricate, route is taken in Lemma 13(d) of [43], which also gives (128) for $\alpha > 1$ assuming finite input alphabets.
  • 34.

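    Unlike mutual information, neither $I_\alpha(X;Y) = I_\alpha(Y;X)$ nor $I_\alpha^c(X;Y) = I_\alpha^c(Y;X)$ holds in general.

    Example 8.

    Erasure transformation. Let $A = \{0,1\}$, $B = \{0,1,e\}$,
    $P_{Y|X}(b|a) = 1-\delta$ if $a = b$; $\delta$ if $b = e$; $0$ if $a \neq b \neq e$, (137)
    with $\delta \in (0,1)$, and $P_X(0) = \frac{1}{2}$. Then, we obtain, for $\alpha \in (0,1) \cup (1,\infty)$,
    $I_\alpha(X;Y) = I_\alpha^c(X;Y) = \frac{\alpha}{\alpha-1} \log\left(\delta + (1-\delta)\, 2^{1-\frac{1}{\alpha}}\right)$, (138)
    $I_\alpha(Y;X) = \frac{1}{\alpha-1} \log\left(\delta + (1-\delta)\, 2^{\alpha-1}\right)$, (139)
    $I_\alpha^c(Y;X) = I(X;Y) = 1-\delta$ bits. (140)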
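
The asymmetry in Example 8 can be reproduced with a few lines of Python. The sketch below (names and parameter values are illustrative) evaluates the discrete formula (82) once with the erasure transformation as the channel and once with the reverse transformation from Y to X, and compares both with the closed forms (138) and (139), in bits.

```python
import math

def i_alpha_bits(p_in, channel, alpha):
    """alpha-mutual information in bits via (82); channel[i][j] = P(j | i)."""
    total = 0.0
    for j in range(len(channel[0])):
        inner = sum(p * row[j] ** alpha for p, row in zip(p_in, channel))
        total += inner ** (1.0 / alpha)
    return alpha / (alpha - 1.0) * math.log2(total)

def erasure_setup(delta):
    """Forward and reverse descriptions of the erasure example (outputs 0, 1, e)."""
    p_x = [0.5, 0.5]
    forward = [[1 - delta, 0.0, delta], [0.0, 1 - delta, delta]]
    p_y = [(1 - delta) / 2, (1 - delta) / 2, delta]
    reverse = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]   # P_{X|Y}
    return p_x, forward, p_y, reverse

def closed_138(delta, alpha):
    return alpha / (alpha - 1) * math.log2(delta + (1 - delta) * 2 ** (1 - 1 / alpha))

def closed_139(delta, alpha):
    return math.log2(delta + (1 - delta) * 2 ** (alpha - 1)) / (alpha - 1)

delta = 0.25
p_x, forward, p_y, reverse = erasure_setup(delta)
for alpha in (0.5, 2.0):
    xy = i_alpha_bits(p_x, forward, alpha)
    yx = i_alpha_bits(p_y, reverse, alpha)
    print(alpha, xy, closed_138(delta, alpha), yx, closed_139(delta, alpha))
```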
  • 35.
    It was shown in Theorem 5.2 of [38] that α-mutual information satisfies the data processing lemma, namely, if $X$ and $Z$ are conditionally independent given $Y$, then
    $I_\alpha(X;Z) \le \min\{I_\alpha(X;Y), I_\alpha(Y;Z)\}$, (141)
    $I_\alpha(Z;X) \le \min\{I_\alpha(Z;Y), I_\alpha(Y;X)\}$. (142)

    As shown by Csiszár [32] using the data processing inequality for Rényi divergence, the data processing lemma also holds for $I_\alpha^c$.

  • 36.
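    From (53), (54) and the monotonicity of $D_\alpha(P \| Q)$ in α, we obtain the ordering
    $I_\beta(X;Y) \le I_\alpha(X;Y) \le I_\alpha^c(X;Y) \le I_\nu^c(X;Y) \le I(X;Y), \quad 0 < \beta \le \alpha \le \nu < 1$; (143)
    $I(X;Y) \le I_\nu^c(X;Y) \le I_\alpha^c(X;Y) \le I_\alpha(X;Y) \le I_\beta(X;Y), \quad 1 < \nu \le \alpha \le \beta$. (144)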
  • 37.

    The convexity/concavity properties of the generalized mutual informations are summarized next.

    Theorem 9.

    • (a)
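      $\rho\, I_{\frac{1}{1+\rho}}(X;Y)$ and $\rho\, I_{\frac{1}{1+\rho}}^c(X;Y)$ are concave and monotonically non-decreasing in $\rho \ge 0$.
    • (b)
      $I(\cdot, P_{Y|X})$ and $I_\alpha^c(\cdot, P_{Y|X})$ are concave functions. The same holds for $I_\alpha(\cdot, P_{Y|X})$ if $\alpha > 1$.
    • (c)
      If $\alpha \in (0,1)$, then $I(P_X, \cdot)$, $I_\alpha(P_X, \cdot)$ and $I_\alpha^c(P_X, \cdot)$ are convex functions.

    Proof. 

    • (a)
      According to (81) and (128), respectively, with $\alpha = \frac{1}{1+\rho} \in (0,1)$, $\rho\, I_{\frac{1}{1+\rho}}(X;Y)$ and $\rho\, I_{\frac{1}{1+\rho}}^c(X;Y)$ are the infima of affine functions with nonnegative slopes.
    • (b)
      For mutual information the result goes back to [56] in the finite-alphabet case. In general, it holds since (67) is the infimum of linear functions of $P_X$. The same reasoning applies to Augustin–Csiszár mutual information in view of (110). For α-mutual information with $\alpha > 1$, notice from (51) that $D_\alpha(P_{Y|X} \| Q_Y | P_X)$ is concave in $P_X$ if $\alpha > 1$. Therefore,
      $I_\alpha(\lambda P_{X_1} + (1-\lambda) P_{X_0}, P_{Y|X})$ (145)
      $= \inf_{Q_Y} D_\alpha(P_{Y|X} \| Q_Y | \lambda P_{X_1} + (1-\lambda) P_{X_0})$ (146)
      $\ge \inf_{Q_Y} \left\{ \lambda D_\alpha(P_{Y|X} \| Q_Y | P_{X_1}) + (1-\lambda) D_\alpha(P_{Y|X} \| Q_Y | P_{X_0}) \right\}$ (147)
      $\ge \lambda I_\alpha(P_{X_1}, P_{Y|X}) + (1-\lambda) I_\alpha(P_{X_0}, P_{Y|X})$. (148)
    • (c)
      The convexity of $I(P_X, \cdot)$ and $I_\alpha(P_X, \cdot)$ follows from the convexity of $D_\alpha(P \| Q)$ in $(P,Q)$ for $\alpha \in (0,1]$ as we saw in Item 18. To show convexity of $I_\alpha^c(P_X, \cdot)$ if $\alpha \in (0,1)$, we apply (169) in Item 45 with $P_{Y|X} = \lambda P_{Y|X}^1 + (1-\lambda) P_{Y|X}^0$, and invoke the convexity of $I_\alpha(P_X, \cdot)$:
      $(1-\alpha) I_\alpha^c(P_X, P_{Y|X})$
      $= \max_{Q_X} \left\{ (1-\alpha) I_\alpha(Q_X, \lambda P_{Y|X}^1 + (1-\lambda) P_{Y|X}^0) - D(P_X \| Q_X) \right\}$ (149)
      $\le \max_{Q_X} \left\{ \lambda \left[(1-\alpha) I_\alpha(Q_X, P_{Y|X}^1) - D(P_X \| Q_X)\right] + (1-\lambda) \left[(1-\alpha) I_\alpha(Q_X, P_{Y|X}^0) - D(P_X \| Q_X)\right] \right\}$ (150)
      $\le (1-\alpha) \left[\lambda I_\alpha^c(P_X, P_{Y|X}^1) + (1-\lambda) I_\alpha^c(P_X, P_{Y|X}^0)\right]$. (151)

    Although not used in the sequel, we note, for completeness, that if $\alpha \in (0,1) \cup (1,\infty)$, [38] (see corrected version in [41]) shows that $\exp\left(\left(1 - \frac{1}{\alpha}\right) I_\alpha(\cdot, P_{Y|X})\right) / (\alpha - 1)$ is concave.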

5. Interplay between Iα(PX,PY|X) and Iαc(PX,PY|X)

In this section we study the interplay between both notions of mutual informations of order α, and, in particular, various variational representations of these information measures.

  • 38.
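    For given $\alpha \in (0,1) \cup (1,\infty)$ and $P_{Y|X} \colon A \to B$, define $Q_{X[\alpha]} \ll\gg P_X$, the α-adjunct of $P_X$ by
    $\imath_{Q_{X[\alpha]} \| P_X}(x) = (\alpha-1) D_\alpha(P_{Y|X=x} \| P_{Y[\alpha]}) - \kappa_\alpha$, (152)
    with $\kappa_\alpha$ the constant in (14) and $P_{Y[\alpha]}$, the α-response to $P_X$.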
  • 39.

    Example 9.

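    Let $Y = X + N$ with $X \sim \mathcal{N}(0, \sigma_X^2)$ independent of $N \sim \mathcal{N}(0, \sigma_N^2)$, and $\mathsf{snr} = \frac{\sigma_X^2}{\sigma_N^2}$. The α-adjunct of the input is
    $Q_{X[\alpha]} = \mathcal{N}\left(0, \sigma_X^2\, \frac{1 + \alpha^2\,\mathsf{snr}}{1 + \alpha\,\mathsf{snr}}\right)$. (153)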
  • 40.

    Theorem 10.

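    The ⟨α⟩-response to $Q_{X[\alpha]}$ is $P_{Y[\alpha]}$, the α-response to $P_X$.

    Proof. 

    We just need to verify that (92) is satisfied if we substitute $Y_{\langle\alpha\rangle}$ by $Y_{[\alpha]}$, and instead of taking the expectation in the right side with respect to $X \sim P_X$ we take it with respect to $\tilde{X} \sim Q_{X[\alpha]}$. Then,
    $\mathbb{E}\left[\exp\left(\alpha\,\imath_{X;Y}(\tilde{X};y) + (1-\alpha) D_\alpha(P_{Y|X}(\cdot|\tilde{X}) \| P_{Y[\alpha]})\right)\right]$
    $= \mathbb{E}\left[\exp\left(\imath_{Q_{X[\alpha]} \| P_X}(X) + \alpha\,\imath_{X;Y}(X;y) + (1-\alpha) D_\alpha(P_{Y|X}(\cdot|X) \| P_{Y[\alpha]})\right)\right]$ (154)
    $= \mathbb{E}\left[\exp\left(\alpha\,\imath_{X;Y}(X;y) - \kappa_\alpha\right)\right]$ (155)
    $= \exp\left(\alpha\,\imath_{Y[\alpha] \| Y}(y)\right)$, (156)
    where (154) is by change of measure, (155) follows by substitution of (152), and (156) is the same as (13). □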
  • 41.
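    For given $\alpha \in (0,1) \cup (1,\infty)$ and $P_{Y|X} \colon A \to B$, we define $Q_{X\langle\alpha\rangle} \ll\gg P_X$, the ⟨α⟩-adjunct of an input probability measure $P_X$ through
    $\imath_{Q_{X\langle\alpha\rangle} \| P_X}(x) = (1-\alpha) D_\alpha(P_{Y|X=x} \| P_{Y\langle\alpha\rangle}) + \upsilon_\alpha$, (157)
    where $P_{Y\langle\alpha\rangle}$ is the ⟨α⟩-response to $P_X$ and $\upsilon_\alpha$ is a normalizing constant so that $Q_{X\langle\alpha\rangle}$ is a probability measure. According to (9), we must have
    $\mathbb{E}\left[\exp\left(\imath_{Q_{X\langle\alpha\rangle} \| P_X}(X)\right)\right] = 1, \quad X \sim P_X$. (158)
    Hence,
    $\upsilon_\alpha = (\alpha-1) D_\alpha(P_{Y|X} \| P_{Y\langle\alpha\rangle} | Q_{X\langle\alpha\rangle})$. (159)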
  • 42.

    With the aid of the expression in Example 7, we obtain

    Example 10.

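    Let $Y = X + N$ with $X \sim \mathcal{N}(0, \sigma_X^2)$ independent of $N \sim \mathcal{N}(0, \sigma_N^2)$, and $\mathsf{snr} = \frac{\sigma_X^2}{\sigma_N^2}$. Then, the ⟨α⟩-adjunct of the input is
    $Q_{X\langle\alpha\rangle} = \mathcal{N}\left(0, \sigma_X^2\, \frac{1 + \alpha(\Delta + \mathsf{snr})}{1 + \alpha(\Delta - \mathsf{snr}) + 2\alpha^2\,\mathsf{snr}}\right)$, (160)
    which, in contrast to $Q_{X[\alpha]}$, has larger variance than $\sigma_X^2$ if $\alpha \in (0,1)$.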
  • 43.

    The following result is the dual of Theorem 10.

    Theorem 11.

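    The α-response to $Q_{X\langle\alpha\rangle}$ is $P_{Y\langle\alpha\rangle}$, the ⟨α⟩-response to $P_X$. Therefore,
    $\upsilon_\alpha = (\alpha-1) I_\alpha(Q_{X\langle\alpha\rangle}, P_{Y|X})$. (161)

    Proof. 

    The proof is similar to that of Theorem 10. We just need to verify that we obtain the right side of (92) if on the right side of (91) we substitute $P_X$ by $Q_{X\langle\alpha\rangle}$ and $P_{Y[\alpha]}$ by $P_{Y\langle\alpha\rangle}$. Let $\bar{X} \sim Q_{X\langle\alpha\rangle}$. Then,
    $\frac{1}{\alpha} \log \mathbb{E}\left[\exp\left(\alpha\,\imath_{X;Y}(\bar{X};y) + (1-\alpha) D_\alpha(P_{Y|X} \| P_{Y\langle\alpha\rangle} | Q_{X\langle\alpha\rangle})\right)\right]$
    $= \frac{1}{\alpha} \log \mathbb{E}\left[\exp\left(\imath_{Q_{X\langle\alpha\rangle} \| P_X}(X) + \alpha\,\imath_{X;Y}(X;y) - \upsilon_\alpha\right)\right]$ (162)
    $= \frac{1}{\alpha} \log \mathbb{E}\left[\exp\left(\alpha\,\imath_{X;Y}(X;y) + (1-\alpha) D_\alpha(P_{Y|X}(\cdot|X) \| P_{Y\langle\alpha\rangle})\right)\right]$ (163)
    $= \imath_{Y\langle\alpha\rangle \| Y}(y)$, (164)
    where (162)–(164) follow by change of measure, (157), and (92), respectively. □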
  • 44.

    By recourse to a minimax theorem, the following representation is given for $\alpha \in (0,1) \cup (1,\infty)$ in the case of finite alphabets in [39], and, dropping the restriction on the finiteness of the output space, in [43]. As we show, a very simple and general proof is possible for $\alpha \in (0,1)$.

    Theorem 12.

    Fix $\alpha \in (0,1)$, $P_X \in \mathcal{P}_A$ and $P_{Y|X} \colon A \to B$. Then,
    $(1-\alpha) I_\alpha(X;Y) = \min_{Q_X} \left\{ (1-\alpha) I_\alpha^c(Q_X, P_{Y|X}) + D(Q_X \| P_X) \right\}$, (165)
    where the minimum is attained by $Q_{X[\alpha]}$, the α-adjunct of $P_X$ defined in (152).

    Proof. 

    The variational representations in (81) and (128) result in (165). To show that the minimum is indeed attained by $Q_{X[\alpha]}$, recall from Theorem 10 that the ⟨α⟩-response to $Q_{X[\alpha]}$ is $P_{Y[\alpha]}$. Therefore, evaluating the term in {} in (165) for $Q_X \leftarrow Q_{X[\alpha]}$ yields, with $\tilde{X} \sim Q_{X[\alpha]}$,
    $(1-\alpha) I_\alpha^c(Q_{X[\alpha]}, P_{Y|X}) + D(Q_{X[\alpha]} \| P_X)$
    $= (1-\alpha)\, \mathbb{E}\left[D_\alpha(P_{Y|X}(\cdot|\tilde{X}) \| P_{Y[\alpha]})\right] + D(Q_{X[\alpha]} \| P_X)$ (166)
    $= -\kappa_\alpha$ (167)
    $= (1-\alpha) I_\alpha(X;Y)$, (168)
    where (167) follows from (152) and (168) results from (69)–(75). □
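
Theorem 12 lends itself to a direct numerical check on a small discrete example. The Python sketch below makes several assumptions that are not part of the paper: a hypothetical 2×2 channel, a fixed number of iterations of the discrete Augustin operator (started from the α-response) to approximate the Augustin–Csiszár mutual information, and natural logarithms throughout. It evaluates the bracketed term of (165) at the α-adjunct built from (152) and on a grid of alternative input distributions.

```python
import math

def renyi_div(p, q, a):
    return math.log(sum(pi ** a * qi ** (1 - a) for pi, qi in zip(p, q))) / (a - 1)

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def alpha_response(p_x, ch, a):
    w = [sum(px * row[y] ** a for px, row in zip(p_x, ch)) ** (1 / a)
         for y in range(len(ch[0]))]
    z = sum(w)
    return [v / z for v in w]

def i_alpha(p_x, ch, a):
    """alpha-mutual information in nats through (72)-(73)."""
    q = alpha_response(p_x, ch, a)
    m = sum(px * math.exp((a - 1) * renyi_div(row, q, a)) for px, row in zip(p_x, ch))
    return math.log(m) / (a - 1)

def i_alpha_c(p_x, ch, a, iters=500):
    """Augustin-Csiszar mutual information in nats by fixed-point iteration."""
    q = alpha_response(p_x, ch, a)
    for _ in range(iters):
        new = [0.0] * len(q)
        for px, row in zip(p_x, ch):
            t = [r ** a * qy ** (1 - a) for r, qy in zip(row, q)]
            s = sum(t)
            new = [n + px * ti / s for n, ti in zip(new, t)]
        q = new
    return sum(px * renyi_div(row, q, a) for px, row in zip(p_x, ch))

def alpha_adjunct(p_x, ch, a):
    """alpha-adjunct of p_x according to (152), with kappa = (a - 1) I_alpha."""
    q = alpha_response(p_x, ch, a)
    kappa = (a - 1) * i_alpha(p_x, ch, a)
    return [px * math.exp((a - 1) * renyi_div(row, q, a) - kappa)
            for px, row in zip(p_x, ch)]

def objective(q_x, p_x, ch, a):
    """Term in braces on the right side of (165)."""
    return (1 - a) * i_alpha_c(q_x, ch, a) + kl(q_x, p_x)

a = 0.5
p_x = [0.7, 0.3]                      # hypothetical input pmf
ch = [[0.9, 0.1], [0.3, 0.7]]         # hypothetical channel rows P(y|x)
lhs = (1 - a) * i_alpha(p_x, ch, a)
adj = alpha_adjunct(p_x, ch, a)
print(lhs, objective(adj, p_x, ch, a))
print(min(objective([1 - k / 100, k / 100], p_x, ch, a) for k in range(1, 100)))
```

The first line should show two matching numbers, and the grid minimum on the second line should not fall below them.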
  • 45.

    For finite-input alphabets, Lemma 18(b) of [43] (earlier Theorem 3.4 of [35] gave an equivalent variational characterization assuming, in addition, finite output alphabets) established the following dual to Theorem 12.

    Theorem 13.

    Fix $\alpha \in (0,1)$, $P_X \in \mathcal{P}_A$ and $P_{Y|X} \colon A \to B$. Then,
    $(1-\alpha) I_\alpha^c(X;Y) = \max_{Q_X} \left\{ (1-\alpha) I_\alpha(Q_X, P_{Y|X}) - D(P_X \| Q_X) \right\}$. (169)
    The maximum is attained by $Q_{X\langle\alpha\rangle}$, the ⟨α⟩-adjunct of $P_X$ defined by (157).

    Proof. 

    First observe that (165) implies that ≥ holds in (169). Second, the term in {} on the right side of (169) evaluated at $Q_X \leftarrow Q_{X\langle\alpha\rangle}$ becomes
    $(1-\alpha) I_\alpha(Q_{X\langle\alpha\rangle}, P_{Y|X}) - D(P_X \| Q_{X\langle\alpha\rangle})$
    $= (1-\alpha) I_\alpha(Q_{X\langle\alpha\rangle}, P_{Y|X}) + (1-\alpha) I_\alpha^c(P_X, P_{Y|X}) + \upsilon_\alpha$ (170)
    $= (1-\alpha) I_\alpha^c(P_X, P_{Y|X})$, (171)
    where (170) follows by taking the expectation of minus (157) with respect to $P_X$, and (171) follows from (161). Therefore, ≤ also holds in (169) and the maximum is attained by $Q_{X\langle\alpha\rangle}$, as we wanted to show. □

    Hinging on Theorem 8, Theorems 12 and 13 are given for $\alpha \in (0,1)$ which is the region of interest in the analysis of error exponents. Whenever, as in the finite-alphabet case, (128) holds for $\alpha > 1$, Theorems 12 and 13 also hold for $\alpha > 1$.

    Notice that since the definition of $Q_{X\langle\alpha\rangle}$ involves $P_{Y\langle\alpha\rangle}$, the fact that it attains the maximum in (169) does not bring us any closer to finding $I_\alpha^c(X;Y)$ for a specific input probability measure $P_X$. Fortunately, as we will see in Section 8, (169) proves to be the gateway to the maximization of $I_\alpha^c(X;Y)$ in the presence of input-cost constraints.

  • 46.
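    Focusing on the main range of interest, $\alpha \in (0,1)$, we can express (169) as
    $I_\alpha^c(P_X, P_{Y|X}) = \max_{Q_X} \left\{ I_\alpha(Q_X, P_{Y|X}) - \frac{1}{1-\alpha} D(P_X \| Q_X) \right\}$ (172)
    $= \max_{\xi \ge 0} \left\{ I(\xi) - \frac{\xi}{1-\alpha} \right\}$ (173)
    $= I(\xi_\alpha) - \frac{\xi_\alpha}{1-\alpha}$, (174)
    where we have defined the function (dependent on α, $P_X$, and $P_{Y|X}$)
    $I(\xi) = \max_{Q_X \colon D(P_X \| Q_X) \le \xi} I_\alpha(Q_X, P_{Y|X})$, (175)
    and $\xi_\alpha$ is the solution to
    $\dot{I}(\xi_\alpha) = \frac{1}{1-\alpha}$. (176)

    Recall that the maxima over the input distribution in (172) and (175) are attained by the ⟨α⟩-adjunct $Q_{X\langle\alpha\rangle}$ defined in Item 41.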

  • 47.

    At this point it is convenient to summarize the notions of input and output probability measures that we have defined for a given α, random transformation $P_{Y|X}$, and input probability measure $P_X$:

    • $P_Y$: The familiar output probability measure $P_X \to P_{Y|X} \to P_Y$, defined in Item 5.

    • $P_{Y[\alpha]}$: The α-response to $P_X$, defined in Item 7. It is the unique achiever of the minimization in the definition of α-mutual information in (67).

    • $P_{Y\langle\alpha\rangle}$: The ⟨α⟩-response to $P_X$ defined in Item 26. It is the unique achiever of the minimization in the definition of Augustin–Csiszár α-mutual information in (110).

    • $Q_{X[\alpha]}$: The α-adjunct of $P_X$, defined in (152). The ⟨α⟩-response to $Q_{X[\alpha]}$ is $P_{Y[\alpha]}$. Furthermore, $Q_{X[\alpha]}$ achieves the minimum in (165).

    • $Q_{X\langle\alpha\rangle}$: The ⟨α⟩-adjunct of $P_X$, defined in (157). The α-response to $Q_{X\langle\alpha\rangle}$ is $P_{Y\langle\alpha\rangle}$. Furthermore, $Q_{X\langle\alpha\rangle}$ achieves the maximum in (169).

6. Maximization of Iα(X;Y)

Just like the maximization of mutual information with respect to the input distribution yields the channel capacity (of course, subject to conditions [57]), the maximization of Iα(X;Y) and of Iαc(X;Y) arises in the analysis of error exponents, as we will see in Section 10. A recent in-depth treatment of the maximization of α-mutual information is given in [45]. As we see most clearly in (82) for the discrete case, when it comes to its optimization, one advantage of Iα(X;Y) over I(X;Y) is that the input distribution does not affect the expression through its influence on the output distribution.

  • 48.

    The maximization of α-mutual information is facilitated by the following result.

    Theorem 14 ([45]).

    Given $\alpha \in (0,1) \cup (1,\infty)$; a random transformation $P_{Y|X} \colon A \to B$; and a convex set $\mathcal{P} \subset \mathcal{P}_A$, the following are equivalent.
    • (a)
      $P_X^* \in \mathcal{P}$ attains the maximal α-mutual information on $\mathcal{P}$,
      $I_\alpha(P_X^*, P_{Y|X}) = \max_{P \in \mathcal{P}} I_\alpha(P, P_{Y|X}) < \infty$. (177)
    • (b)
      For any $P_X \in \mathcal{P}$, and any output distribution $Q_Y \in \mathcal{P}_B$,
      $D_\alpha(P_{Y|X} \| P_{Y[\alpha]}^* | P_X) \le D_\alpha(P_{Y|X} \| P_{Y[\alpha]}^* | P_X^*)$ (178)
      $\le D_\alpha(P_{Y|X} \| Q_Y | P_X^*)$, (179)
      where $P_{Y[\alpha]}^*$ is the α-response to $P_X^*$.
    Moreover, if $P_{Y[\alpha]}$ denotes the α-response to $P_X$, then
    $D_\alpha(P_{Y[\alpha]} \| P_{Y[\alpha]}^*) \le I_\alpha(P_X^*, P_{Y|X}) - I_\alpha(P_X, P_{Y|X}) < \infty$. (180)

    Note that, while $I_\alpha(\cdot, P_{Y|X})$ may not be maximized by a unique (or, in fact, by any) input distribution, the resulting α-response $P_{Y[\alpha]}^*$ is indeed unique. If $\mathcal{P}$ is such that none of its elements attain the maximal $I_\alpha$, it is known [42,45] that the α-response to any asymptotically optimal sequence of input distributions converges to $P_{Y[\alpha]}^*$. This is the counterpart of a result by Kemperman [58] concerning mutual information.

  • 49.

    The following example appears in [45].

    Example 11.

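    Let $Y = X + N$ where $N \sim \mathcal{N}(0, \sigma_N^2)$ is independent of $X$. Fix $\alpha \in (0,1)$ and $P > 0$. Suppose that the set, $\mathcal{P} \subset \mathcal{P}_A$, of allowable input probability measures consists of those that satisfy the constraint
    $\mathbb{E}\left[\exp_e\left(-\frac{\alpha(1-\alpha) X^2}{2(\alpha^2 P + \sigma_N^2)}\right)\right] \ge \sqrt{\frac{\alpha^2 P + \sigma_N^2}{\alpha P + \sigma_N^2}}$. (181)
    We can readily check that $X^* \sim \mathcal{N}(0, P)$ satisfies (181) with equality, and as we saw in Example 2, its α-response is $P_{Y[\alpha]}^* = \mathcal{N}(0, \alpha P + \sigma_N^2)$. Theorem 14 establishes that $P_X^*$ does indeed maximize the α-mutual information among all the distributions in $\mathcal{P}$, yielding (recall Example 6)
    $\max_{P_X \in \mathcal{P}} I_\alpha(X;Y) = \frac{1}{2} \log\left(1 + \frac{\alpha P}{\sigma_N^2}\right)$. (182)
    Curiously, if, instead of $\mathcal{P}$ defined by the constraint (181), we consider the more conventional $\mathcal{P} = \{X \colon \mathbb{E}[X^2] \le P\}$, then the left side of (182) is unknown at present. Numerical evidence shows that it can exceed the right side by employing non-Gaussian inputs.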
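
The claim that the Gaussian input meets the exponential-average constraint with equality can be checked numerically. The sketch below (plain Python; the trapezoidal rule, the integration range and the parameter values are illustrative assumptions) integrates the left side of (181) against the density of a zero-mean Gaussian with variance P and compares it with the square-root expression on the right side.

```python
import math

def lhs_181(alpha, power, noise_var, n=20001, width=12.0):
    """E[exp(-alpha(1-alpha) X^2 / (2(alpha^2 P + sigma_N^2)))] for X ~ N(0, P),
    approximated with the trapezoidal rule on [-width*sd, width*sd]."""
    t = alpha * (1 - alpha) / (2 * (alpha ** 2 * power + noise_var))
    sd = math.sqrt(power)
    lim = width * sd
    h = 2 * lim / (n - 1)
    total = 0.0
    for k in range(n):
        x = -lim + k * h
        weight = 0.5 if k in (0, n - 1) else 1.0
        total += weight * math.exp(-t * x * x - x * x / (2 * power))
    return total * h / math.sqrt(2 * math.pi * power)

def rhs_181(alpha, power, noise_var):
    return math.sqrt((alpha ** 2 * power + noise_var) / (alpha * power + noise_var))

alpha, power, noise_var = 0.4, 3.0, 1.5
print(lhs_181(alpha, power, noise_var), rhs_181(alpha, power, noise_var))
```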
  • 50.
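    Recalling (56), (178) implies that if $P_X^*$ attains the finite maximal unconstrained α-mutual information and its α-response is denoted by $P_{Y[\alpha]}^*$, then,
    $\max_X I_\alpha(X;Y) = \max_{P \in \mathcal{P}} I_\alpha(P, P_{Y|X}) = \max_{a \in A} D_\alpha(P_{Y|X=a} \| P_{Y[\alpha]}^*)$, (183)
    which requires that $P_X^*(A_\alpha^*) = 1$, with
    $A_\alpha^* = \left\{ x \in A \colon D_\alpha(P_{Y|X=x} \| P_{Y[\alpha]}^*) = \max_{a \in A} D_\alpha(P_{Y|X=a} \| P_{Y[\alpha]}^*) \right\}$. (184)
    For discrete alphabets, this requires that if $x \notin A_\alpha^*$, then $P_X^*(x) = 0$, which is tantamount to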
    yBPY|Xα(y|x)E1ααPY|Xα(y|X*)expα1αIα(X*;Y*), (185)
    with equality for all xA such that PX*(x)>0. For finite-alphabet random transformations this observation is equivalent to Theorem 5.6.5 in [9].
  • 51.
    Getting slightly ahead of ourselves, we note that, in view of (128), an important consequence of Theorem 15 below is that, as anticipated in Item 25, the unconstrained maximization of $I_\alpha(X;Y)$ for $\alpha \in (0,1)$ can be expressed in terms of the solution to an optimization problem involving only conventional mutual information and conditional relative entropy. For $\rho \ge 0$,
    $\rho \sup_X I_{\frac{1}{1+\rho}}(X;Y) = \sup_X \min_{Q_{Y|X} \colon A \to B} \left\{ D(Q_{Y|X} \| P_{Y|X} | P_X) + \rho\, I(P_X, Q_{Y|X}) \right\}$. (186)

7. Unconstrained Maximization of Iαc(X;Y)

  • 52.

    In view of the fact that it is much easier to determine the α-mutual information than the order-α Augustin–Csiszár information, it would be advantageous to show that the unconstrained maximum of $I_\alpha^c(X;Y)$ equals the unconstrained maximum of $I_\alpha(X;Y)$. In the finite-alphabet setting, in which it is possible to invoke a “minisup” theorem (e.g., see Section 7.1.7 of [59]), Csiszár [32] showed this result for $\alpha > 0$. The assumption of finite output alphabets was dropped in Theorem 1 of [42], and further generalized in Theorem 3 of the same reference. As we see next, for $\alpha \in (0,1)$, it is possible to give an elementary proof without restrictions on the alphabets.

    Theorem 15.

    Let $\alpha \in (0,1)$. If the suprema are over $\mathcal{P}_A$, the set of all probability measures defined on the input space, then
    $\sup_X I_\alpha^c(X;Y) = \sup_X I_\alpha(X;Y)$. (187)
    Proof. In view of (143), ≥ holds in (187). To show ≤, we assume $\sup_X I_\alpha(X;Y) < \infty$ as, otherwise, there is nothing left to prove. The unconstrained maximization identity in (183) implies
    $\sup_X I_\alpha(X;Y) = \sup_{a \in A} D_\alpha(P_{Y|X=a} \| P_{Y[\alpha]}^*)$ (188)
    $= \sup_{P_X \in \mathcal{P}} \mathbb{E}[D_\alpha(P_{Y|X}(\cdot|X) \| P_{Y[\alpha]}^*)]$ (189)
    $\ge \inf_{Q \in \mathcal{Q}} \sup_{P_X \in \mathcal{P}} \mathbb{E}[D_\alpha(P_{Y|X}(\cdot|X) \| Q)]$ (190)
    $\ge \sup_{P_X \in \mathcal{P}} \inf_{Q \in \mathcal{Q}} \mathbb{E}[D_\alpha(P_{Y|X}(\cdot|X) \| Q)]$ (191)
    $= \sup_X I_\alpha^c(X;Y)$, (192)
    where $P_{Y[\alpha]}^*$ is the unique α-response to any input that achieves the maximal α-mutual information, and if there is no such input, it is the limit of the α-responses to any asymptotically optimal input sequence (Item 48). □
    Furthermore, if $\{X_n\}$ is asymptotically optimal for $I_\alpha$, i.e., $\lim_{n \to \infty} I_\alpha(X_n;Y_n) = \sup_X I_\alpha(X;Y)$, then $\{X_n\}$ is also asymptotically optimal for $I_\alpha^c$ because for any $\delta > 0$, we can find $N$, such that for all $n > N$,
    $I_\alpha(X_n;Y_n) + \delta \ge \sup_{a \in A} D_\alpha(P_{Y|X=a} \| P_{Y[\alpha]}^*)$ (193)
    $\ge \mathbb{E}[D_\alpha(P_{Y|X}(\cdot|X_n) \| P_{Y[\alpha]}^*)]$ (194)
    $\ge I_\alpha^c(X_n;Y_n)$ (195)
    $\ge I_\alpha(X_n;Y_n)$. (196)

8. Maximization of Iαc(X;Y) Subject to Average Cost Constraints

This section is at the heart of the relevance of Rényi information measures to error exponent functions.

  • 53.
    Given $\alpha \in (0,1)$, $P_{Y|X} \colon A \to B$, a cost function $b \colon A \to [0,\infty)$ and real scalar $\theta \ge 0$, the objective is to maximize the Augustin–Csiszár mutual information allowing only those probability measures that satisfy $\mathbb{E}[b(X)] \le \theta$, namely,
    $C_\alpha^c(\theta) = \sup_{P_X \colon \mathbb{E}[b(X)] \le \theta} I_\alpha^c(P_X, P_{Y|X})$. (197)
    Unfortunately, identity (187) no longer holds when the maximizations over the input probability measure are cost-constrained, and, in general, we can only claim
    $C_\alpha^c(\theta) \ge \sup_{P_X \colon \mathbb{E}[b(X)] \le \theta} I_\alpha(P_X, P_{Y|X})$. (198)

    A conceptually simple approach to solve for $C_\alpha^c(\theta)$ is to

    • (a)

      postulate an input probability measure $P_X^*$ that achieves the supremum in (197);

    • (b)

      solve for its ⟨α⟩-response $P_Y^*$ using (92);

    • (c)
      show that $(P_X^*, P_Y^*)$ is a saddle point for the game with payoff function
      $B(P_X, Q_Y) = \int D_\alpha(P_{Y|X=x} \| Q_Y)\, dP_X$, (199)
      where $Q_Y \in \mathcal{P}_B$ and $P_X$ is chosen from the convex subset of $\mathcal{P}_A$ of probability measures which satisfy $\mathbb{E}[b(X)] \le \theta$.

    Since $P_Y^*$ is already known, by definition, to be the ⟨α⟩-response to $P_X^*$, verifying the saddle point is tantamount to showing that $B(P_X, P_Y^*)$ is maximized by $P_X^*$ among $\{P_X \in \mathcal{P}_A \colon \mathbb{E}[b(X)] \le \theta\}$. Theorem 1 of [43] guarantees the existence of a saddle point in the case of finite input alphabets. In addition to the fact that it is not always easy to guess the optimum input $P_X^*$ (see e.g., Section 12), the main stumbling block is the difficulty in determining the ⟨α⟩-response to any candidate input distribution, although sometimes this is indeed feasible as we saw in Example 7.

  • 54.
    Naturally, Theorem 15 implies
    $C_\alpha^c(\theta) \le \sup_X I_\alpha(X;Y)$. (200)

    If the unconstrained maximization of $I_\alpha^c(\cdot, P_{Y|X})$ is achieved by an input distribution $X$ that satisfies $\mathbb{E}[b(X)] \le \theta$, then equality holds in (200), which, in turn, is equal to $I_\alpha^c(P_X, P_{Y|X})$. In that case, the average cost constraint is said to be inactive. For most cost functions and random transformations of practical interest, the cost constraint is active for all $\theta > 0$. To ascertain whether it is, we simply verify whether there exists an input achieving the right side of (200), which happens to satisfy the constraint. If so, $C_\alpha^c(\theta)$ has been found. The same holds if we can find a sequence $\{X_n\}$ such that $\mathbb{E}[b(X_n)] \le \theta$ and $I_\alpha(X_n;Y_n) \to \sup_X I_\alpha(X;Y)$. Otherwise, we proceed with the method described below. Thus, henceforth, we assume that the cost constraint is active.

  • 55.
    The approach proposed in this paper to solve for $C_\alpha^c(\theta)$ for $\alpha \in (0,1)$ hinges on the variational representation in (172), which allows us to sidestep having to find any ⟨α⟩-response. Note that once we set out to maximize $I_\alpha^c(P_X, P_{Y|X})$ over $\mathcal{P} = \{P_X \in \mathcal{P}_A \colon \mathbb{E}[b(X)] \le \theta\}$, the allowable $Q_X$ in the maximization in (175) range over a ξ-blow-up of $\mathcal{P}$ defined by
    $\Gamma_\xi(\mathcal{P}) = \left\{ Q_X \in \mathcal{P}_A \colon \exists\, P_X \in \mathcal{P} \text{ such that } D(P_X \| Q_X) \le \xi \right\}$. (201)

    As we show in Item 56, we can accomplish such an optimization by solving an unconstrained maximization of the sum of α-mutual information and a term suitably derived from the cost function.

  • 56.
    It will not be necessary to solve for (176), as our goal is to further maximize (172) over $P_X$ subject to an average cost constraint. The Lagrangian corresponding to the constrained optimization in (197) is
    $L_\alpha(\nu, P_X) = I_\alpha^c(X;Y) - \nu\, \mathbb{E}[b(X)] + \nu\theta$, (202)
    where on the left side we have omitted, for brevity, the dependence on θ stemming from the last term on the right side. The Lagrange multiplier method (e.g., [60]) implies that if $X^*$ achieves the supremum in (197), then there exists $\nu^* \ge 0$ such that for all $P_X$ on $A$ and $\nu \ge 0$,
    $L_\alpha(\nu^*, P_X) \le L_\alpha(\nu^*, P_X^*) \le L_\alpha(\nu, P_X^*)$. (203)
    Note from (202) that the right inequality in (203) can only be achieved if
    $\mathbb{E}[b(X^*)] = \theta$, (204)
    and, consequently,
    $C_\alpha^c(\theta) = L_\alpha(\nu^*, P_X^*) = \min_{\nu \ge 0} \max_{P_X} L_\alpha(\nu, P_X) = \max_{P_X} \min_{\nu \ge 0} L_\alpha(\nu, P_X)$. (205)
    The pivotal result enabling us to obtain $C_\alpha^c(\theta)$ without the need to deal with Augustin–Csiszár mutual information is the following.

    Theorem 16.

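    Given $\alpha \in (0,1)$, $\nu \ge 0$, $P_{Y|X} \colon A \to B$, and $b \colon A \to [0,\infty)$, denote the function
    $A_\alpha(\nu) = \max_X \left\{ I_\alpha(X;Y) + \frac{1}{1-\alpha} \log \mathbb{E}\left[\exp\left(-(1-\alpha)\nu\, b(X)\right)\right] \right\}$. (206)
    Then,
    $\sup_{P_X \in \mathcal{P}_A} L_\alpha(\nu, P_X) = \nu\theta + A_\alpha(\nu)$, (207)
    and
    $C_\alpha^c(\theta) = \min_{\nu \ge 0} \left\{ \nu\theta + A_\alpha(\nu) \right\}$. (208)

    Proof. 

    Plugging (172) into (197) we obtain, with $X \sim P_X$, and $\hat{X} \sim Q_X$,
    $\sup_{P_X \in \mathcal{P}_A} L_\alpha(\nu, P_X) = \sup_{P_X} \left\{ I_\alpha^c(X;Y) - \nu\, \mathbb{E}[b(X)] + \nu\theta \right\}$ (209)
    $= \sup_{P_X \in \mathcal{P}_A} \max_{Q_X \in \mathcal{P}_A} \left\{ I_\alpha(Q_X, P_{Y|X}) - \frac{1}{1-\alpha} D(P_X \| Q_X) - \nu\, \mathbb{E}[b(X)] \right\} + \nu\theta$ (210)
    $= \nu\theta + \max_{Q_X \in \mathcal{P}_A} \left\{ I_\alpha(Q_X, P_{Y|X}) - \frac{1}{1-\alpha} \inf_{P_X} \left\{ D(P_X \| Q_X) + \nu(1-\alpha)\, \mathbb{E}[b(X)] \right\} \right\}$ (211)
    $= \nu\theta + \max_{Q_X \in \mathcal{P}_A} \left\{ I_\alpha(Q_X, P_{Y|X}) + \frac{1}{1-\alpha} \log \mathbb{E}\left[\exp\left(-\nu(1-\alpha)\, b(\hat{X})\right)\right] \right\}$ (212)
    $= \nu\theta + A_\alpha(\nu)$, (213)
    where (209) and (213) follow from (202) and (206), respectively, and (212) follows by invoking Theorem 1 with $Z \sim Q_X$ and
    $g(a) = (1-\alpha)\nu\, b(a)$, (214)
    which is nonnegative since $\alpha \in (0,1)$ and $\nu > 0$. Finally, (208) follows from (205) and (207). □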

    In conclusion, we have shown that the maximization of Augustin–Csiszár mutual information of order α subject to E[b(X)]θ boils down to the unconstrained maximization of a Lagrangian consisting of the sum of α-mutual information and an exponential average of the cost function. Circumventing the need to deal with α-responses and with Augustin–Csiszár mutual information of order α leads to a particularly simple optimization, as illustrated in Section 11 and Section 12.
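
To make the recipe of Theorem 16 concrete, the sketch below compares, for a binary-input example, a direct grid search of the constrained problem (197) with the value of the minimization in (208). Everything specific here is an assumption for illustration only: the binary channel, the cost function with b(0) = 0 and b(1) = cost1, the grids over ν and over the input distribution, the ν range, and the use of a fixed number of Augustin-operator iterations to evaluate the Augustin–Csiszár mutual information. Natural logarithms are used throughout.

```python
import math

def i_alpha(q1, ch, a):
    """alpha-mutual information (nats) of the input pmf (1 - q1, q1), via (82)."""
    p_x = [1 - q1, q1]
    z = sum(sum(px * row[y] ** a for px, row in zip(p_x, ch)) ** (1 / a)
            for y in range(len(ch[0])))
    return a / (a - 1) * math.log(z)

def i_alpha_c(p1, ch, a, iters=300):
    """Augustin-Csiszar mutual information (nats) of the input pmf (1 - p1, p1)."""
    p_x = [1 - p1, p1]
    w = [sum(px * row[y] ** a for px, row in zip(p_x, ch)) ** (1 / a)
         for y in range(len(ch[0]))]
    q = [v / sum(w) for v in w]            # start from the alpha-response
    for _ in range(iters):
        new = [0.0] * len(q)
        for px, row in zip(p_x, ch):
            if px == 0.0:
                continue
            t = [r ** a * qy ** (1 - a) for r, qy in zip(row, q)]
            s = sum(t)
            new = [n + px * ti / s for n, ti in zip(new, t)]
        q = new
    return sum(px * math.log(sum(r ** a * qy ** (1 - a) for r, qy in zip(row, q))) / (a - 1)
               for px, row in zip(p_x, ch) if px > 0)

def direct_value(theta, ch, a, cost1=1.0, grid=40):
    """Grid search of (197): E[b(X)] = cost1 * p1 must not exceed theta."""
    p_max = min(1.0, theta / cost1)
    return max(i_alpha_c(p_max * k / grid, ch, a) for k in range(grid + 1))

def dual_value(theta, ch, a, cost1=1.0, nu_max=5.0, n_nu=1000, n_q=500):
    """Grid evaluation of (208), with A_alpha(nu) from (206)."""
    q_grid = [k / n_q for k in range(n_q + 1)]
    i_vals = [i_alpha(q1, ch, a) for q1 in q_grid]
    best = float("inf")
    for j in range(n_nu + 1):
        nu = nu_max * j / n_nu
        tilt = math.exp(-(1 - a) * nu * cost1)
        a_nu = max(iv + math.log((1 - q1) + q1 * tilt) / (1 - a)
                   for q1, iv in zip(q_grid, i_vals))
        best = min(best, nu * theta + a_nu)
    return best

a, theta = 0.5, 0.2
noisy = [[0.9, 0.1], [0.2, 0.8]]          # hypothetical channel rows P(y|x)
print(direct_value(theta, noisy, a), dual_value(theta, noisy, a))
```

The two printed values should agree up to the grid resolution. The dual computation never evaluates the Augustin–Csiszár mutual information, which is the practical point of Theorem 16.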

  • 57.
    Theorem 16 solves for the maximal Augustin–Csiszár mutual information of order α under an average cost constraint without having to find out the input probability measure $P_X^*$ that attains it nor its ⟨α⟩-response $P_Y^*$ (using the notation in Item 53). Instead, it gives the solution as
    $C_\alpha^c(\theta) = \min_{\nu \ge 0} \left\{ \nu\theta + \max_X \left\{ I_\alpha(X;Y) + \frac{1}{1-\alpha} \log \mathbb{E}\left[\exp\left(-(1-\alpha)\nu\, b(X)\right)\right] \right\} \right\}$. (215)
    Although we are not going to invoke a minimax theorem, with the aid of Theorem 9-(b) we can see that the functional within the inner brackets is concave in $P_X$; furthermore, if $V \in (0,1]$, then $\log \mathbb{E}[V^\nu]$ is easily seen to be convex in ν with the aid of the Cauchy-Schwarz inequality. Before we characterize the saddle point $(\nu^*, Q_X^*)$ of the game in (215) we note that $(P_X^*, P_Y^*)$ can be readily obtained from $(\nu^*, Q_X^*)$.

    Theorem 17.

    Fix α(0,1). Let ν*>0 denote the minimizer on the right side of (215), and QX* the input probability measure that attains the maximum in (206) (or (215)) for ν=ν*. Then,
    • (a)
      QX* is the α-adjunct of PX*.
    • (b)
      PY*=QY[α]*, the α-response to QX*.
    • (c)
      PX*QX* with
      ıPX*QX*(a)=(1α)ν*b(a)+τα,aA, (216)
      where τα is a normalizing constant ensuring that PX* is a probability measure.

    Proof. 

    • (a)
      We had already established in Theorem 13 that the maximum on the right side of (210) is achieved by the α-adjunct of PX. In the special case ν=ν*, such PX is PX*. Therefore, QX*, the argument that achieves the maximum in (206) for ν=ν*, is the α-adjunct of PX*.
    • (b)
      According to Theorem 11, the α-response to QX* is the α-response to PX*, which is PY* by definition.
    • (c)
      For $\nu = \nu^*$, $P_X^*$ achieves the supremum in (209) and the infimum in (211). Therefore, (216) follows from Theorem 1 with $Z \sim Q_X^*$ and $g(\cdot)$ given by (214) particularized to $\nu = \nu^*$.

    The saddle point of (215) admits the following characterization.

    Theorem 18.

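    If $\alpha \in (0,1)$, the saddle point $(\nu^*, Q_X^*)$ of (215) satisfies
    $\mathbb{E}\left[ b(\bar{X}^*) \exp\left( -(1-\alpha)\,\nu^* b(\bar{X}^*) \right) \right] = \theta\, \mathbb{E}\left[ \exp\left( -(1-\alpha)\,\nu^* b(\bar{X}^*) \right) \right], \quad \bar{X}^* \sim Q_X^*$; (217)
    $D_\alpha\left( P_{Y|X=a} \,\|\, Q_{Y[\alpha]}^* \right) = \nu^* b(a) + c_\alpha(\nu^*), \quad a \in \mathcal{A}$, (218)
    where $Q_{Y[\alpha]}^*$ is the α-response to $Q_X^*$, and $c_\alpha(\nu^*)$ does not depend on $a \in \mathcal{A}$. Furthermore,
    $A_\alpha(\nu^*) = c_\alpha(\nu^*)$, (219)
    $C_\alpha^c(\theta) = \nu^* \theta + c_\alpha(\nu^*)$. (220)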
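    Proof. First, we show that the scalar $\nu^* \ge 0$ that minimizes
    $f(\nu) = \nu\theta + I_\alpha(Q_X^*, P_{Y|X}) + \frac{1}{1-\alpha} \log \mathbb{E}\left[ \exp\left( -(1-\alpha)\,\nu\, b(\bar{X}^*) \right) \right]$ (221)
    satisfies (217). If we abbreviate $V = \exp\left( -(1-\alpha)\, b(\bar{X}^*) \right) \in (0,1]$, then the dominated convergence theorem results in
    $\frac{d}{d\nu} \left\{ \nu\theta + \frac{1}{1-\alpha} \log \mathbb{E}[V^\nu] \right\} = \theta + \frac{1}{1-\alpha} \frac{\mathbb{E}[V^\nu \log V]}{\mathbb{E}[V^\nu]}$. (222)
    Therefore, (217) is equivalent to $\dot{f}(\nu^*) = 0$, which is all we need on account of the convexity of $f(\cdot)$. To show (218), notice that for all $a \in \mathcal{A}$,
    $(1-\alpha)\,\nu^* b(a) - \tau_\alpha = \imath_{Q_X^* \| P_X^*}(a)$ (223)
    $= (1-\alpha)\, D_\alpha(P_{Y|X=a} \| P_Y^*) + \upsilon_\alpha$, (224)
    where (223) is (216) and (224) is (157) applied with the output distribution $P_Y^*$, in view of Theorem 17-(b). In conclusion, (218) holds with
    $c_\alpha(\nu^*) = \frac{\upsilon_\alpha + \tau_\alpha}{\alpha - 1}$. (225)
    Finally, (206) implies
    $A_\alpha(\nu^*) = I_\alpha(Q_X^*, P_{Y|X}) + \frac{1}{1-\alpha} \log \mathbb{E}\left[ \exp\left( -(1-\alpha)\,\nu^* b(\bar{X}^*) \right) \right]$ (226)
    $= \frac{1}{\alpha-1} \log \mathbb{E}\left[ \exp\left( (\alpha-1)\, D_\alpha\left( P_{Y|X}(\cdot|\bar{X}^*) \,\|\, P_Y^* \right) \right) \right] + \frac{1}{1-\alpha} \log \mathbb{E}\left[ \exp\left( (\alpha-1)\,\nu^* b(\bar{X}^*) \right) \right]$ (227)
    $= \frac{1}{\alpha-1} \log \mathbb{E}\left[ \exp\left( (\alpha-1) \left( \nu^* b(\bar{X}^*) + c_\alpha(\nu^*) \right) \right) \right] + \frac{1}{1-\alpha} \log \mathbb{E}\left[ \exp\left( (\alpha-1)\,\nu^* b(\bar{X}^*) \right) \right]$ (228)
    $= c_\alpha(\nu^*)$, (229)
    where (227) follows from the definition of α-mutual information and Theorem 17-(b), and (228) follows from (218). Plugging (219) into (208) results in (220). □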
  • 58.
    Typically, the application of Theorem 18 involves
    • (a)
      guessing the form of the auxiliary input QX* (modulo some unknown parameter),
    • (b)
      obtaining its α-response QY[α]*, and
    • (c)
      verifying that (217) and (218) are satisfied for some specific choice of the unknown parameter.
    With the same approach, we can postulate, for every $\nu \ge 0$, an input distribution $R_X^\nu$, whose α-response $R_{Y[\alpha]}^\nu$ satisfies
    $D_\alpha\left( P_{Y|X=a} \,\|\, R_{Y[\alpha]}^\nu \right) = \nu\, b(a) + c_\alpha(\nu), \quad a \in \mathcal{A}$, (230)
    where the only condition we place on $c_\alpha(\nu)$ is that it not depend on $a \in \mathcal{A}$. If this is indeed the case, then the same derivation in (226)–(229) results in
    $A_\alpha(\nu) = c_\alpha(\nu)$, (231)
    and we determine $\nu^*$ as the solution to $\theta = -\dot{c}_\alpha(\nu^*)$, in lieu of (217). Section 11 and Section 12 illustrate the effortless nature of this approach to solve for $A_\alpha(\nu)$. Incidentally, (230) can be seen as the α-generalization of the condition in Problem 8.2 of [48], elaborated later in [61].

9. Gallager’s E0 Functions and the Maximal Augustin–Csiszár Mutual Information

In keeping with Gallager’s setting [9], we stick to discrete alphabets throughout this section.

  • 59.
    In his derivation of an achievability result for discrete memoryless channels, Gallager [8] introduced the function (1), which we repeat for convenience,
    $E_0(\rho, P_X) = -\log \sum_{y \in \mathcal{B}} \left( \sum_{x \in \mathcal{A}} P_X(x)\, P_{Y|X}^{\frac{1}{1+\rho}}(y|x) \right)^{1+\rho}$. (232)
    Comparing (82) and (232), we obtain
    $E_0(\rho, P_X) = \rho\, I_{\frac{1}{1+\rho}}(X;Y)$, (233)
    which, as we mentioned in Section 1, is the observation by Csiszár in [30] that triggered the third phase in the representation of error exponents. Popularized in [9], the $E_0$ function was employed by Shannon, Gallager and Berlekamp [10] for $\rho \ge 0$ and by Arimoto [62] for $\rho \in (-1,0)$ in the derivation of converse results in data transmission, the latter of which considers rates above capacity, a region in which error probability increases with blocklength, approaching one at an exponential rate. For the achievability part, [8] showed upper bounds on the error probability involving $E_0(\rho, P_X)$ for $\rho \in [0,1]$. Therefore, for rates below capacity, the α-mutual information only enters the picture for $\alpha \in (0,1)$. One exception in which Rényi divergence of order greater than 1 plays a role at rates below capacity was found by Sason [63], where a refined achievability result is shown for binary linear codes for output symmetric channels (a case in which equiprobable $P_X$ maximizes (233)), as a function of their Hamming weight distribution.
    Although Gallager did not have the benefit of the insight provided by the Rényi information measures, he did notice certain behaviors of $E_0$ reminiscent of mutual information. For example, the derivative of (233) with respect to ρ, evaluated at $\rho = 0$, is equal to $I(X;Y)$. As pointed out by Csiszár in [32], in the absence of cost constraints, Gallager’s $E_0$ function in (232) satisfies
    $\max_{P_X} E_0(\rho, P_X) = \rho \max_X I_{\frac{1}{1+\rho}}(X;Y) = \rho \max_X I^c_{\frac{1}{1+\rho}}(X;Y)$, (234)
    in view of (233) and (187).
    Recall that Gallager’s modified $E_0$ function in the case of cost constraints is
    $E_0(\rho, P_X, r, \theta) = -\log \sum_{y \in \mathcal{B}} \left( \sum_{x \in \mathcal{A}} P_X(x) \exp\left( r\, b(x) - r\theta \right) P_{Y|X}^{\frac{1}{1+\rho}}(y|x) \right)^{1+\rho}$, (235)
    which, like (232), he introduced in order to show an achievability result. Up until now, no counterpart to (234) has been found with cost constraints and (235). This is accomplished in the remainder of this section.
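
    The identity (233) is easy to check numerically on a finite channel. The following is a minimal sketch, assuming natural logarithms and the discrete expression for α-mutual information that appears on the right side of (238); the 2×2 channel matrix `w`, the input distribution `p_x` and the function names are illustrative choices, not taken from the paper.

    ```python
    import numpy as np

    def gallager_e0(rho, p_x, w):
        """Gallager's E0(rho, P_X) of (232) in nats; w[x, y] = P_{Y|X}(y|x)."""
        inner = p_x @ w ** (1.0 / (1.0 + rho))   # sum over x, one entry per output y
        return -np.log(np.sum(inner ** (1.0 + rho)))

    def alpha_mutual_information(alpha, p_x, w):
        """alpha-mutual information in nats for finite alphabets, alpha in (0, 1)."""
        inner = p_x @ w ** alpha
        return alpha / (alpha - 1.0) * np.log(np.sum(inner ** (1.0 / alpha)))

    # Illustrative (hypothetical) channel and input distribution.
    w = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    p_x = np.array([0.3, 0.7])

    for rho in (0.25, 1.0, 3.0):
        lhs = gallager_e0(rho, p_x, w)
        rhs = rho * alpha_mutual_information(1.0 / (1.0 + rho), p_x, w)
        print(f"rho={rho}: E0={lhs:.12f}, rho*I_alpha={rhs:.12f}")
    ```

    The two printed columns should agree up to floating-point rounding for every ρ > 0, since (233) is an algebraic identity.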
  • 60.

    In the finite alphabet case the following result is useful to obtain a numerical solution for the functional in (206). More importantly, it is relevant to the discussion in Item 61.

    Theorem 19.

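    In the special case of discrete alphabets, the function in (206) is equal to
    $A_\alpha(\nu) = \max_G \frac{\alpha}{\alpha-1} \log \sum_{y \in \mathcal{B}} \left( \sum_{a \in \mathcal{A}} G(a)\, P_{Y|X}^\alpha(y|a) \right)^{\frac{1}{\alpha}}$, (236)
    where the maximization is over all $G \colon \mathcal{A} \to [0,\infty)$ such that
    $\sum_{a \in \mathcal{A}} G(a) \exp\left( -(1-\alpha)\,\nu\, b(a) \right) = 1$. (237)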

    Proof. 

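    Recalling (82) we have
    $I_\alpha(X;Y) + \frac{1}{1-\alpha} \log \mathbb{E}\left[ \exp\left( -(1-\alpha)\,\nu\, b(X) \right) \right] = \frac{\alpha}{\alpha-1} \log \sum_{y \in \mathcal{B}} \left( \sum_{x \in \mathcal{A}} P_X(x)\, P_{Y|X=x}^\alpha(y) \right)^{\frac{1}{\alpha}} + \frac{1}{1-\alpha} \log \mathbb{E}\left[ \exp\left( -(1-\alpha)\,\nu\, b(X) \right) \right]$ (238)
    $= \frac{\alpha}{\alpha-1} \log \sum_{y \in \mathcal{B}} \left( \frac{\mathbb{E}\left[ P_{Y|X}^\alpha(y|X) \right]}{\mathbb{E}\left[ \exp\left( -(1-\alpha)\,\nu\, b(X) \right) \right]} \right)^{\frac{1}{\alpha}}$ (239)
    $= \frac{\alpha}{\alpha-1} \log \sum_{y \in \mathcal{B}} \left( \sum_{a \in \mathcal{A}} G(a)\, P_{Y|X}^\alpha(y|a) \right)^{\frac{1}{\alpha}}$, (240)
    where
    $G(x) = \frac{P_X(x)}{\sum_{a \in \mathcal{A}} P_X(a) \exp\left( -(1-\alpha)\,\nu\, b(a) \right)}$. (241)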
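
    The change of variable from $P_X$ to $G$ in (238)–(241) can be sanity-checked numerically. The sketch below assumes natural logarithms; the channel `w`, cost vector `b` and the function names are hypothetical examples introduced only for illustration.

    ```python
    import numpy as np

    def lagrangian_term(alpha, nu, p_x, w, b):
        """Left side of (238) in nats: I_alpha(X;Y) + (1/(1-alpha)) log E[exp(-(1-alpha) nu b(X))]."""
        inner = p_x @ w ** alpha
        i_alpha = alpha / (alpha - 1.0) * np.log(np.sum(inner ** (1.0 / alpha)))
        penalty = np.log(p_x @ np.exp(-(1.0 - alpha) * nu * b)) / (1.0 - alpha)
        return i_alpha + penalty

    def g_form(alpha, nu, p_x, w, b):
        """Right side of (240), with G built from P_X as in (241). Returns (value, G)."""
        g = p_x / (p_x @ np.exp(-(1.0 - alpha) * nu * b))
        inner = g @ w ** alpha
        return alpha / (alpha - 1.0) * np.log(np.sum(inner ** (1.0 / alpha))), g

    # Illustrative (hypothetical) inputs: binary channel, cost 0 for input 0 and 1 for input 1.
    w = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    b = np.array([0.0, 1.0])
    p_x = np.array([0.3, 0.7])
    alpha, nu = 0.5, 0.7

    value, g = g_form(alpha, nu, p_x, w, b)
    print(lagrangian_term(alpha, nu, p_x, w, b), value)
    print("constraint (237):", g @ np.exp(-(1.0 - alpha) * nu * b))
    ```

    Because every nonnegative $G$ satisfying (237) arises from some $P_X$ through (241), a numerical maximization over $P_X$ on the simplex and one over $G$ on the set defined by (237) should return the same value.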
  • 61.

    We can now proceed to close the circle between the maximization of Augustin–Csiszár mutual information subject to average cost constraints (Phase 3 in Section 1) and Gallager’s approach (Phase 1 in Section 1).

    Theorem 20.

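    In the discrete alphabet case, recalling the definitions in (202) and (235), for $\rho > 0$,
    $\max_{P_X} E_0(\rho, P_X, r, \theta) = \rho \max_{P_X} L_{\frac{1}{1+\rho}}\left( r + \frac{r}{\rho}, P_X \right), \quad r > 0$; (242)
    $\min_{r \ge 0} \max_{P_X} E_0(\rho, P_X, r, \theta) = \rho\, C^c_{\frac{1}{1+\rho}}(\theta)$, (243)
    where the maximizations are over $\mathcal{P}_{\mathcal{A}}$.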

    Proof. 

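    With
    $\alpha = \frac{1}{1+\rho}$ and $\nu = r\,\frac{1+\rho}{\rho} = \frac{r}{1-\alpha}$, (244)
    the maximization of (235) with respect to the input probability measure yields
    $\max_{P_X} E_0(\rho, P_X, r, \theta)$
    $= \max_{P_X} \left\{ (1+\rho)\, r\theta - \log \sum_{y \in \mathcal{B}} \left( \sum_{x \in \mathcal{A}} P_X(x) \exp\left( r\, b(x) \right) P_{Y|X}^{\frac{1}{1+\rho}}(y|x) \right)^{1+\rho} \right\}$ (245)
    $= \rho\nu\theta + \rho \max_{P_X} \frac{\alpha}{\alpha-1} \log \sum_{y \in \mathcal{B}} \left( \sum_{x \in \mathcal{A}} P_X(x) \exp\left( (1-\alpha)\,\nu\, b(x) \right) P_{Y|X}^\alpha(y|x) \right)^{\frac{1}{\alpha}}$ (246)
    $= \rho\nu\theta + \rho \max_G \frac{\alpha}{\alpha-1} \log \sum_{y \in \mathcal{B}} \left( \sum_{x \in \mathcal{A}} G(x)\, P_{Y|X}^\alpha(y|x) \right)^{\frac{1}{\alpha}}$ (247)
    $= \rho\nu\theta + \rho\, A_\alpha(\nu)$ (248)
    $= \rho \max_{P_X} L_\alpha(\nu, P_X)$, (249)
    where
    • the maximization on the right side of (247) is over all $G \colon \mathcal{A} \to [0,\infty)$ that satisfy (237), since that constraint is tantamount to enforcing the constraint that $P_X \in \mathcal{P}_{\mathcal{A}}$ on the left side of (247);
    • (248) ⟸ Theorem 19;
    • (249) ⟸ Theorem 16.
    The proof of (242) is complete once (244) is invoked to substitute α and ν from the right side of (249). If we now minimize the outer sides of (245)–(249) with respect to r we obtain, using (205) and (244),
    $\min_{r \ge 0} \max_{P_X} E_0(\rho, P_X, r, \theta) = \rho \min_{r \ge 0} \max_{P_X} L_\alpha\left( \frac{r}{1-\alpha}, P_X \right)$ (250)
    $= \rho \min_{\nu \ge 0} \max_{P_X} L_\alpha(\nu, P_X)$ (251)
    $= \rho\, C^c_{\frac{1}{1+\rho}}(\theta)$. (252)
    On p. 329 of [9], Gallager poses the unconstrained maximization (i.e., over $P_X \in \mathcal{P}_{\mathcal{A}}$) of the Lagrangian
    $E_0(\rho, P_X, r, \theta) + \gamma \sum_{a \in \mathcal{A}} P_X(a)\, b(a) - \gamma\theta$. (253)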

    Note the apparent discrepancy between the optimizations in (243) and (253): the latter is parametrized by r and γ (in addition to ρ and θ), while the maximization on the right side of (243) does not enforce any average cost constraint. In fact, there is no disparity since Gallager loc. cit. finds serendipitously that γ=0 regardless of r and θ, and, therefore, just one parameter is enough.
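
    The change of parameters (244) used in the step from (245) to (246) can be checked numerically for any fixed input distribution, before the maximization is carried out. This is a small sketch assuming natural logarithms; the channel `w`, cost vector `b`, threshold `theta` and the function names are hypothetical.

    ```python
    import numpy as np

    def e0_cost(rho, p_x, w, b, r, theta):
        """Gallager's cost-constrained E0(rho, P_X, r, theta) of (235), in nats."""
        weights = p_x * np.exp(r * (b - theta))
        inner = weights @ w ** (1.0 / (1.0 + rho))
        return -np.log(np.sum(inner ** (1.0 + rho)))

    def renyi_form(rho, p_x, w, b, r, theta):
        """Argument of the maximum in (246) for a fixed P_X, with alpha and nu as in (244)."""
        alpha = 1.0 / (1.0 + rho)
        nu = r / (1.0 - alpha)
        tilted = p_x * np.exp((1.0 - alpha) * nu * b)
        inner = tilted @ w ** alpha
        return rho * nu * theta + rho * alpha / (alpha - 1.0) * np.log(np.sum(inner ** (1.0 / alpha)))

    # Illustrative (hypothetical) inputs.
    w = np.array([[0.9, 0.1],
                  [0.2, 0.8]])
    b = np.array([0.0, 1.0])
    p_x = np.array([0.3, 0.7])
    theta = 0.5

    for rho, r in ((0.5, 0.2), (1.0, 1.0), (2.0, 0.05)):
        print(e0_cost(rho, p_x, w, b, r, theta), renyi_form(rho, p_x, w, b, r, theta))
    ```

    Each printed pair should coincide up to rounding, which only confirms the algebra of (244)–(246); the maximization over $P_X$ and the minimization over r in (243) are not attempted here.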

  • 62.
    The raison d’être for Augustin’s introduction of $I_\alpha^c$ in [36] was his quest to view Gallager’s approach with average cost constraints under the optic of Rényi information measures. Contrasting (232) and (235) and inspired by the fact that, in the absence of cost constraints, (232) satisfies a variational characterization in view of (69) and (233), Augustin [36] dealt, not with (235), but with
    $\min_{Q_Y} D_\alpha(\tilde{P}_{Y|X} \| Q_Y | P_X)$, where $\tilde{P}_{Y|X=x} = P_{Y|X=x} \exp\left( r\, b(x) \right)$.

    Assuming finite alphabets, Augustin was able to connect this quantity with the maximal $I_\alpha^c(X;Y)$ under cost constraints in an arcane analysis that invokes a minimax theorem. This line of work was continued in Section 5 of [43], which refers to $\min_{Q_Y} D_\alpha(\tilde{P}_{Y|X} \| Q_Y | P_X)$ as the Rényi-Gallager information. Unfortunately, since $\tilde{P}_{Y|X}$ is not a random transformation, the conditional pseudo-Rényi divergence $D_\alpha(\tilde{P}_{Y|X} \| Q_Y | P_X)$ need not satisfy the key additive decomposition in Theorem 4, so the approach of [36,43] fails to establish an identity equating the maximization of Gallager’s function (235) with the maximization of Augustin–Csiszár mutual information, which is what we have accomplished through a crisp and elementary analysis.

10. Error Exponent Functions

The central objects of interest in the error exponent analysis of data transmission are the functions $E_{\mathrm{sp}}(R, P_X)$ and $E_{\mathrm{r}}(R, P_X)$ of a random transformation $P_{Y|X} \colon \mathcal{A} \to \mathcal{B}$. Reflecting the three different phases referred to in Section 1, there is no unanimity in the definition of those functions. Following [48], we adopt the standard Phase 2 (Section 1.2) definitions of those functions, which are given in Items 63 and 67.

  • 63.
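    If $R \ge 0$ and $P_X \in \mathcal{P}_{\mathcal{A}}$, the sphere-packing error exponent function is (e.g., (10.19) of [48])
    $E_{\mathrm{sp}}(R, P_X) = \min_{Q_{Y|X} \colon \mathcal{A} \to \mathcal{B},\; I(P_X, Q_{Y|X}) \le R} D(Q_{Y|X} \| P_{Y|X} | P_X)$. (254)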
  • 64.

    As a function of $R \ge 0$, the basic properties of (254) for fixed $(P_X, P_{Y|X})$ are as follows.

    • (a)

      If $R \ge I(P_X, P_{Y|X})$, then $E_{\mathrm{sp}}(R, P_X) = 0$;

    • (b)

      If $R < I(P_X, P_{Y|X})$, then $E_{\mathrm{sp}}(R, P_X) > 0$;

    • (c)

      The infimum of the arguments for which the sphere-packing error exponent function is finite is denoted by $R_\infty(P_X)$;

    • (d)
      On the interval $R \in (R_\infty(P_X), I(P_X, P_{Y|X}))$, $E_{\mathrm{sp}}(R, P_X)$ is convex, strictly decreasing, continuous, and equal to (254) where the constraint is satisfied with equality. This implies that for R belonging to that interval, we can find $\rho_R \ge 0$ so that for all $r \ge 0$,
      $E_{\mathrm{sp}}(r, P_X) \ge E_{\mathrm{sp}}(R, P_X) - \rho_R\, r + \rho_R\, R$. (255)
  • 65.

    In view of Theorem 8 and its definition in (254), it is not surprising that Esp(R,PX) is intimately related to the Augustin–Csiszár mutual information, through the following key identity.

    Theorem 21.

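    $E_{\mathrm{sp}}(R, P_X) = \sup_{\rho \ge 0} \left\{ \rho\, I^c_{\frac{1}{1+\rho}}(X;Y) - \rho R \right\}, \quad R \ge 0$; (256)
    $R_\infty(P_X) = I_0^c(X;Y)$. (257)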

    Proof. 

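    First note that ≥ holds in (256) because from (128) we obtain, for all $\rho \ge 0$,
    $\rho\, I^c_{\frac{1}{1+\rho}}(X;Y) = \min_{Q_{Y|X}} \left\{ D(Q_{Y|X} \| P_{Y|X} | P_X) + \rho\, I(P_X, Q_{Y|X}) \right\}$ (258)
    $\le \min_{Q_{Y|X} \colon I(P_X, Q_{Y|X}) \le R} \left\{ D(Q_{Y|X} \| P_{Y|X} | P_X) + \rho\, I(P_X, Q_{Y|X}) \right\}$ (259)
    $\le E_{\mathrm{sp}}(R, P_X) + \rho R$, (260)
    where (260) follows from the definition in (254). To show ≤ in (256) for those R such that $0 < E_{\mathrm{sp}}(R, P_X) < \infty$, Property (d) in Item 64 allows us to write
    $\min_{Q_{Y|X}} \left\{ D(Q_{Y|X} \| P_{Y|X} | P_X) + \rho_R\, I(P_X, Q_{Y|X}) \right\} = \min_{r \ge 0} \left\{ E_{\mathrm{sp}}(r, P_X) + \rho_R\, r \right\}$ (261)
    $\ge E_{\mathrm{sp}}(R, P_X) + \rho_R\, R$, (262)
    where (262) follows from (255).
    To determine the region where the sphere-packing error exponent is infinite and show (257), first note that if $R < I_0^c(X;Y) = \lim_{\alpha \downarrow 0} I_\alpha^c(X;Y)$, then $E_{\mathrm{sp}}(R, P_X) = \infty$ because for any $\rho \ge 0$, the function in {} on the right side of (256) satisfies
    $\rho\, I^c_{\frac{1}{1+\rho}}(X;Y) - \rho R = \rho\, I^c_{\frac{1}{1+\rho}}(X;Y) - \rho\, I_0^c(X;Y) + \rho\, I_0^c(X;Y) - \rho R$ (263)
    $\ge \rho\, I_0^c(X;Y) - \rho R$, (264)
    where (264) follows from the monotonicity of $I_\alpha^c(X;Y)$ in α we saw in (143). Conversely, if $I_0^c(X;Y) < R < \infty$, there exists $\epsilon \in (0,1)$ such that $I_\epsilon^c(X;Y) < R$, which implies that in the minimization
    $I_\epsilon^c(X;Y) = \min_{Q_{Y|X}} \left\{ \frac{\epsilon}{1-\epsilon}\, D(Q_{Y|X} \| P_{Y|X} | P_X) + I(P_X, Q_{Y|X}) \right\}$ (265)
    we may restrict to those $Q_{Y|X}$ such that $I(P_X, Q_{Y|X}) \le R$, and consequently, $I_\epsilon^c(X;Y) \ge \frac{\epsilon}{1-\epsilon}\, E_{\mathrm{sp}}(R, P_X)$. Therefore, to avoid a contradiction, we must have $E_{\mathrm{sp}}(R, P_X) < \infty$.
    The remaining case is $I_0^c(X;Y) = \infty$. Again, the monotonicity of the Augustin–Csiszár mutual information implies that $I_\alpha^c(X;Y) = \infty$ for all $\alpha > 0$. So, (128) prescribes $D(Q_{Y|X} \| P_{Y|X} | P_X) = \infty$ for any $Q_{Y|X}$ such that $I(P_X, Q_{Y|X}) < \infty$. Therefore, $E_{\mathrm{sp}}(R, P_X) = \infty$ for all $R \ge 0$, as we wanted to show. □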

    Augustin [36] provided lower bounds on error probability for codes of type $P_X$ as a function of $I_\alpha^c(X;Y)$ but did not state (256); neither did Csiszár in [32] as he was interested in a non-conventional parametrization (generalized cutoff rates) of the reliability function. As pointed out in p. 5605 of [64], the ingredients for the proof of (256) were already present in the hint of Problem 23 of Section II.5 of [24]. In the discrete case, an exponential lower bound on error probability for codes with constant composition $P_X$ is given as a function of $I^c_{\frac{1}{1+\rho}}(P_X, P_{Y|X})$ in [44,64]. As in [64], Nakiboglu [65] gives (256) as the definition of the sphere-packing function and connects it with (254) in Lemma 3 therein, within the context of discrete input alphabets.

    In the discrete case, (257) is well-known (e.g., [66]), and given by (83). As pointed out in [40], $\max_X I_0^c(X;Y)$ is the zero-error capacity with noiseless feedback found by Shannon [67], provided there is at least a pair $(a_1, a_2) \in \mathcal{A}^2$ such that $P_{Y|X=a_1} \perp P_{Y|X=a_2}$. Otherwise, the zero-error capacity with feedback is zero.

  • 66.
    The critical rate, $R_c(P_X)$, is defined as the smallest abscissa at which the convex function $E_{\mathrm{sp}}(\cdot, P_X)$ meets its supporting line of slope $-1$. According to (256),
    $I^c_{\frac{1}{2}}(X;Y) = R_c(P_X) + E_{\mathrm{sp}}(R_c(P_X), P_X)$. (266)
  • 67.
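    If $R \ge 0$ and $P_X \in \mathcal{P}_{\mathcal{A}}$, the random-coding exponent function is (e.g., (10.15) of [48])
    $E_{\mathrm{r}}(R, P_X) = \min_{Q_{Y|X} \colon \mathcal{A} \to \mathcal{B}} \left\{ D(Q_{Y|X} \| P_{Y|X} | P_X) + \left[ I(P_X, Q_{Y|X}) - R \right]^+ \right\}$, (267)
    with $[t]^+ = \max\{0, t\}$.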
  • 68.

    The random-coding error exponent function is determined by the sphere-packing error exponent function through the following relation, illustrated in Figure 1.

    Theorem 22.

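    $E_{\mathrm{r}}(R, P_X) = \min_{r \ge R} \left\{ E_{\mathrm{sp}}(r, P_X) + r - R \right\}$ (268)
    $= \begin{cases} 0, & R \ge I(P_X, P_{Y|X}); \\ E_{\mathrm{sp}}(R, P_X), & R \in [R_c(P_X), I(P_X, P_{Y|X})]; \\ I^c_{\frac{1}{2}}(X;Y) - R, & R \in [0, R_c(P_X)] \end{cases}$ (269)
    $= \sup_{\rho \in [0,1]} \left\{ \rho\, I^c_{\frac{1}{1+\rho}}(X;Y) - \rho R \right\}$. (270)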

    Proof. 

    Identities (268) and (269) are well-known (e.g., Lemma 10.4 and Corollary 10.4 in [48]). To show (270), note that (256) expresses $E_{\mathrm{sp}}(\cdot, P_X)$ as the supremum of supporting lines parametrized by their slope $-\rho$. By definition of critical rate (for brevity, we do not show explicitly its dependence on $P_X$), if $R \in [R_c, I(P_X, P_{Y|X})]$, then $E_{\mathrm{sp}}(R, P_X)$ can be obtained by restricting the optimization in (256) to $\rho \in [0,1]$. In that segment of values of R, $E_{\mathrm{sp}}(R, P_X) = E_{\mathrm{r}}(R, P_X)$ according to (269). Moreover, on the interval $R \in [0, R_c]$, we have
    $\max_{\rho \in [0,1]} \left\{ \rho\, I^c_{\frac{1}{1+\rho}}(X;Y) - \rho R \right\} = I^c_{\frac{1}{2}}(X;Y) - R$ (271)
    $= E_{\mathrm{sp}}(R_c, P_X) + R_c - R$ (272)
    $= E_{\mathrm{r}}(R, P_X)$, (273)
    where we have used (266) and (269). □

    The first explicit connection between $E_{\mathrm{r}}(R, P_X)$ and the Augustin–Csiszár mutual information was made by Poltyrev [35] although he used a different form for $I_\alpha^c(X;Y)$, as we discussed in (29).

  • 69.
    The unconstrained maximizations over the input distribution of the sphere-packing and random coding error exponent functions are denoted, respectively, by
    $E_{\mathrm{sp}}(R) = \sup_{P_X} E_{\mathrm{sp}}(R, P_X)$, (274)
    $E_{\mathrm{r}}(R) = \sup_{P_X} E_{\mathrm{r}}(R, P_X)$. (275)

    Coding theorems [8,9,10,22,48] have shown that when these functions coincide they yield the reliability function (optimum speed at which the error probability vanishes with blocklength) as a function of the rate $R < \max_X I(X;Y)$. The intuition is that, for the most favorable input distribution, errors occur when the channel behaves so atypically that codes of rate R are not reliable. There are many ways in which the channel may exhibit such behavior and they are all unlikely, but the most likely among them is the one that achieves (254).

    It follows from (187), (256) and (270) that (274) and (275) can be expressed as
    $E_{\mathrm{sp}}(R) = \sup_{\rho \ge 0} \left\{ \rho \sup_X I_{\frac{1}{1+\rho}}(X;Y) - \rho R \right\}$, (276)
    $E_{\mathrm{r}}(R) = \sup_{\rho \in [0,1]} \left\{ \rho \sup_X I_{\frac{1}{1+\rho}}(X;Y) - \rho R \right\}$. (277)

    Therefore, we can sidestep working with the Augustin–Csiszár mutual information in the absence of cost constraints.

  • 70.

    Shannon [1] showed that, operating at rates below maximal mutual information, it is possible to find codes whose error probability vanishes with blocklength; for the converse, instead of error probability, Shannon measured reliability by the conditional entropy of the message given the channel output. That alternative reliability measure, as well as its generalization to Arimoto-Rényi conditional entropy, is also useful in analyzing the average performance over code ensembles. It turns out (see e.g., [28,68]) that, below capacity, those conditional entropies also vanish exponentially fast in much the same way as error probability, with bounds that are governed by $E_{\mathrm{sp}}(R)$ and $E_{\mathrm{r}}(R)$, thereby lending additional operational significance to those functions.

  • 71.
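    We now introduce a cost function $b \colon \mathcal{A} \to [0,\infty)$ and real scalar $\theta \ge 0$, and reexamine the optimizations in (274) and (275) allowing only those probability measures that satisfy $\mathbb{E}[b(X)] \le \theta$. With a patent, but unavoidable, abuse of notation we define
    $E_{\mathrm{sp}}(R, \theta) = \sup_{P_X \colon \mathbb{E}[b(X)] \le \theta} E_{\mathrm{sp}}(R, P_X)$ (278)
    $= \sup_{\rho \ge 0} \left\{ \rho \sup_{P_X \colon \mathbb{E}[b(X)] \le \theta} I^c_{\frac{1}{1+\rho}}(X;Y) - \rho R \right\}$ (279)
    $= \sup_{\rho \ge 0} \left\{ \rho\, C^c_{\frac{1}{1+\rho}}(\theta) - \rho R \right\}$ (280)
    $= \sup_{\rho \ge 0} \left\{ -\rho R + \rho \min_{\nu \ge 0} \left\{ \nu\theta + A_{\frac{1}{1+\rho}}(\nu) \right\} \right\}$ (281)
    $= \sup_{\rho \ge 0} \left\{ -\rho R + \min_{\nu \ge 0} \left\{ \rho\nu\theta + \max_X \left\{ \rho\, I_{\frac{1}{1+\rho}}(X;Y) + (1+\rho) \log \mathbb{E}\left[ \exp\left( -\frac{\rho\nu}{1+\rho}\, b(X) \right) \right] \right\} \right\} \right\}$, (282)
    where (279), (281) and (282) follow from (256), (208) and (206), respectively.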
  • 72.
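    In parallel to (278)–(281),
    $E_{\mathrm{r}}(R, \theta) = \sup_{P_X \colon \mathbb{E}[b(X)] \le \theta} E_{\mathrm{r}}(R, P_X)$ (283)
    $= \sup_{\rho \in [0,1]} \left\{ \rho \sup_{P_X \colon \mathbb{E}[b(X)] \le \theta} I^c_{\frac{1}{1+\rho}}(X;Y) - \rho R \right\}$ (284)
    $= \sup_{\rho \in [0,1]} \left\{ \rho\, C^c_{\frac{1}{1+\rho}}(\theta) - \rho R \right\}$, (285)
    where (284) follows from (270). In particular, if we define the critical rate and the cutoff rate as
    $R_c = \sup_{P_X \colon \mathbb{E}[b(X)] \le \theta} R_c(P_X)$, (286)
    $R_0 = \sup_{P_X \colon \mathbb{E}[b(X)] \le \theta} I^c_{\frac{1}{2}}(X;Y)$, (287)
    respectively, then it follows from (270) that
    $E_{\mathrm{r}}(R) = R_0 - R, \quad R \in [0, R_c]$. (288)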

    Summarizing, the evaluation of Esp(R,θ) and Er(R,θ) can be accomplished by the method proposed in Section 8, at the heart of which is the maximization in (206) involving α-mutual information instead of Augustin–Csiszár mutual information. In Section 11 and Section 12, we illustrate the evaluation of the error exponent functions with two important additive-noise examples.

Figure 1. $E_{\mathrm{sp}}(\cdot, P_X)$ and $E_{\mathrm{r}}(\cdot, P_X)$.

11. Additive Independent Gaussian Noise; Input Power Constraint

We illustrate the procedure in Item 58 by taking Example 6 considerably further.

  • 73.
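    Suppose $\mathcal{A} = \mathcal{B} = \mathbb{R}$, $b(x) = x^2$, and $P_{Y|X=a} = \mathcal{N}(a, \sigma_N^2)$. We start by testing whether we can find $R_X^\nu \in \mathcal{P}_{\mathcal{A}}$ such that its α-response satisfies (230). Naturally, it makes sense to try $R_X^\nu = \mathcal{N}(0, \sigma^2)$ for some yet to be determined $\sigma^2$. As we saw in Example 6, this choice implies that its α-response is $R_{Y[\alpha]}^\nu = \mathcal{N}(0, \alpha\sigma^2 + \sigma_N^2)$. Specializing Example 4, we obtain
    $D_\alpha\left( P_{Y|X=x} \,\|\, R_{Y[\alpha]}^\nu \right) = D_\alpha\left( \mathcal{N}(x, \sigma_N^2) \,\|\, \mathcal{N}(0, \alpha\sigma^2 + \sigma_N^2) \right)$ (289)
    $= \frac{1}{2} \log\left( 1 + \frac{\alpha\sigma^2}{\sigma_N^2} \right) - \frac{1}{2(1-\alpha)} \log\left( 1 + \frac{\alpha(1-\alpha)\sigma^2}{\alpha^2\sigma^2 + \sigma_N^2} \right) + \frac{1}{2} \frac{\alpha x^2}{\alpha^2\sigma^2 + \sigma_N^2} \log e$. (290)
    Therefore, (230) is indeed satisfied with
    $c_\alpha(\nu) = \frac{1}{2} \log\left( 1 + \frac{\alpha\sigma^2}{\sigma_N^2} \right) - \frac{1}{2(1-\alpha)} \log\left( 1 + \frac{\alpha(1-\alpha)\sigma^2}{\alpha^2\sigma^2 + \sigma_N^2} \right)$, (291)
    $\nu = \frac{1}{2} \frac{\alpha}{\alpha^2\sigma^2 + \sigma_N^2} \log e$, (292)
    where (292) follows if we choose the variance of the auxiliary input as
    $\sigma^2 = \frac{\log e}{2\alpha\nu} - \frac{\sigma_N^2}{\alpha^2}$ (293)
    $= \frac{\sigma_N^2}{\alpha^2} \left( \frac{\alpha}{\lambda} - 1 \right)$. (294)
    In (294) we have introduced an alternative, more convenient, parametrization for the Lagrange multiplier
    $\lambda = \frac{2\nu\sigma_N^2}{\log e} \in (0, \alpha)$. (295)
    In conclusion, with the choice in (293), $\mathcal{N}(0, \sigma^2)$ attains the maximum in (206), and in view of (231), $A_\alpha(\nu)$ is given by the right side of (291) substituting $\sigma^2$ by (293). Therefore, we have
    $\nu\theta + A_\alpha(\nu) = \frac{\lambda}{2}\, \mathrm{snr} \log e + c_\alpha\left( \frac{\lambda \log e}{2\sigma_N^2} \right)$ (296)
    $= \frac{\lambda}{2}\, \mathrm{snr} \log e + \frac{1}{2} \log\left( 1 + \frac{1}{\lambda} - \frac{1}{\alpha} \right) - \frac{1}{2(1-\alpha)} \log\left( \alpha - \lambda(1-\alpha) \right) + \frac{\log \alpha}{1-\alpha}$, (297)
    where we denoted $\mathrm{snr} = \frac{\theta}{\sigma_N^2}$.
    In accordance with Theorem 16 all that remains is to minimize (297) with respect to ν, or equivalently, with respect to λ. Differentiating (297) with respect to λ, the minimum is achieved at $\lambda^*$ satisfying
    $\mathrm{snr} = \frac{1}{\lambda^*} \cdot \frac{\alpha - \lambda^*}{\alpha - \lambda^* + \alpha\lambda^*}$, (298)
    whose only valid root (obtained by solving a quadratic equation) is
    $\lambda^* = \frac{1 + \alpha\,\mathrm{snr} - \alpha\Delta}{2\,\mathrm{snr}\,(1-\alpha)} \in (0, \alpha)$, (299)
    with Δ defined in (118). So, for $\alpha \in (0,1)$, (208) becomes
    $C_\alpha^c(\mathrm{snr}\,\sigma_N^2) = \frac{1 + \alpha\,\mathrm{snr} - \alpha\Delta}{4(1-\alpha)} \log e + \frac{1}{2} \log\left( 1 + \frac{2\,\mathrm{snr}\,(1-\alpha)}{1 + \alpha\,\mathrm{snr} - \alpha\Delta} - \frac{1}{\alpha} \right) - \frac{1}{2(1-\alpha)} \log \frac{\alpha\,\mathrm{snr} + \alpha\Delta - 1}{2\,\mathrm{snr}\,\alpha^2}$. (300)
    Letting $\alpha = \frac{1}{1+\rho}$, we obtain
    $C^c_{\frac{1}{1+\rho}}(\mathrm{snr}\,\sigma_N^2) = \frac{\mathrm{snr}}{2\rho} (1-\beta) \log e + \frac{1}{2} \log(1 + \beta\,\mathrm{snr}) - \frac{1+\rho}{2\rho} \log\left( (1+\rho)\beta \right)$, (301)
    with
    $\beta = \frac{1}{2} \left( 1 - \frac{1}{\alpha\,\mathrm{snr}} + \frac{\Delta}{\mathrm{snr}} \right) = \frac{1}{2} \left( 1 - \frac{1+\rho}{\mathrm{snr}} + \sqrt{ \frac{4}{\mathrm{snr}} + \left( \frac{1+\rho}{\mathrm{snr}} - 1 \right)^2 } \right)$. (302)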
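
    The closed forms above can be cross-checked numerically. The sketch below assumes natural logarithms (so that log e = 1) and takes $\Delta = \sqrt{4\,\mathrm{snr} + (\mathrm{snr} - 1/\alpha)^2}$, which is the value implied by the two expressions for β in (302) since (118) is not reproduced here; the function names and the grid size are illustrative choices. It evaluates (297) at the root (299), compares with (301) and (302), and confirms by brute force that the root is the minimizer over λ in (0, α).

    ```python
    import numpy as np

    def lagrangian(lam, alpha, snr):
        """Right side of (297) in nats, as a function of lambda in (0, alpha)."""
        return (0.5 * lam * snr
                + 0.5 * np.log(1.0 + 1.0 / lam - 1.0 / alpha)
                - np.log(alpha - lam * (1.0 - alpha)) / (2.0 * (1.0 - alpha))
                + np.log(alpha) / (1.0 - alpha))

    def cap_alpha(alpha, snr):
        """C_alpha^c via (299): returns (value of (297) at lambda*, lambda*)."""
        delta = np.sqrt(4.0 * snr + (snr - 1.0 / alpha) ** 2)  # assumed form of Delta
        lam = (1.0 + alpha * snr - alpha * delta) / (2.0 * snr * (1.0 - alpha))
        return lagrangian(lam, alpha, snr), lam

    def cap_rho(rho, snr):
        """C^c_{1/(1+rho)} via (301) and (302), in nats."""
        beta = 0.5 * (1.0 - (1.0 + rho) / snr
                      + np.sqrt(4.0 / snr + ((1.0 + rho) / snr - 1.0) ** 2))
        return (snr / (2.0 * rho) * (1.0 - beta)
                + 0.5 * np.log(1.0 + beta * snr)
                - (1.0 + rho) / (2.0 * rho) * np.log((1.0 + rho) * beta))

    def grid_minimum(alpha, snr, points=200001):
        """Brute-force minimum of (297) over an interior grid of (0, alpha)."""
        grid = np.linspace(1e-6, alpha - 1e-6, points)
        return float(np.min(lagrangian(grid, alpha, snr)))

    for alpha, snr in ((0.5, 4.0), (1.0 / 3.0, 1.0)):
        rho = (1.0 - alpha) / alpha
        value, lam = cap_alpha(alpha, snr)
        print(alpha, snr, lam, value, cap_rho(rho, snr), grid_minimum(alpha, snr))
    ```

    For each pair (α, snr) the last three printed numbers should agree to many digits, and λ* should fall strictly inside (0, α).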
  • 74.
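    Alternatively, it is instructive to apply Theorem 18 to the current Gaussian/quadratic cost setting. Suppose we let $Q_X^* = \mathcal{N}(0, \sigma_*^2)$, where $\sigma_*^2$ is to be determined. With the aid of the formulas
    $\mathbb{E}\left[ X^2 e^{-\mu X^2} \right] = \frac{\sigma^2}{(1 + 2\mu\sigma^2)^{3/2}}$, (303)
    $\mathbb{E}\left[ e^{-\mu X^2} \right] = \frac{1}{\sqrt{1 + 2\mu\sigma^2}}$, (304)
    where $\mu \ge 0$, and $X \sim \mathcal{N}(0, \sigma^2)$, (217) becomes
    $\frac{1}{\mathrm{snr}} = \frac{\sigma_N^2}{\sigma_*^2} + (1-\alpha)\lambda^*$, (305)
    upon substituting $\sigma^2 \leftarrow \sigma_*^2$ and
    $\mu \leftarrow \nu^* \frac{1-\alpha}{\log e} = \lambda^* \frac{1-\alpha}{2\sigma_N^2}$. (306)
    Likewise (218) translates into (291) and (292) with $(\nu, \sigma^2) \leftarrow (\nu^*, \sigma_*^2)$, namely,
    $c_\alpha(\nu^*) = \frac{1}{2} \log\left( 1 + \frac{\alpha\sigma_*^2}{\sigma_N^2} \right) - \frac{1}{2(1-\alpha)} \log\left( 1 + \frac{\alpha(1-\alpha)\sigma_*^2}{\alpha^2\sigma_*^2 + \sigma_N^2} \right)$, (307)
    $\lambda^* = \frac{\alpha\sigma_N^2}{\alpha^2\sigma_*^2 + \sigma_N^2}$. (308)

    Eliminating $\sigma_*^2$ from (305) by means of (308) results in (299) and the same derivation that led to (300) shows that it is equal to $\nu^*\theta + c_\alpha(\nu^*)$.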

  • 75.
    Applying Theorem 17, we can readily find the input distribution, $P_X^*$, that attains $C_\alpha^c(\theta)$ as well as its α-response $P_Y^*$ (recall the notation in Item 53). According to Example 2, $P_Y^*$, the α-response to $Q_X^*$, is Gaussian with zero mean and variance
    $\sigma_N^2 + \alpha\sigma_*^2 = \sigma_N^2 \left( 1 + \frac{1}{\lambda^*} - \frac{1}{\alpha} \right)$ (309)
    $= \frac{\sigma_N^2}{2} \left( 2 - \frac{1}{\alpha} + \Delta + \mathrm{snr} \right)$, (310)
    where (309) follows from (308) and (310) follows by using the expression for Δ in (118). Note from Example 7 that $P_Y^*$ is nothing but the α-response to $\mathcal{N}(0, \mathrm{snr}\,\sigma_N^2)$. We can easily verify from Theorem 17 that indeed $P_X^* = \mathcal{N}(0, \mathrm{snr}\,\sigma_N^2)$ since in this case (216) becomes
    $\imath_{P_X^* \| Q_X^*}(a) = -(1-\alpha)\,\nu^* a^2 + \tau_\alpha$, (311)
    which can only be satisfied by $P_X^* = \mathcal{N}(0, \mathrm{snr}\,\sigma_N^2)$ in view of (305). As an independent confirmation, we can verify, after some algebra, that the right sides of (127) and (300) are identical.

    In fact, in the current Gaussian setting, we could start by postulating that the distribution that maximizes the Augustin–Csiszár mutual information under the second moment constraint does not depend on α and is given by $P_X^* = \mathcal{N}(0, \theta)$. Its α-response $P_{Y\langle\alpha\rangle}^*$ was already obtained in Example 7. Then, an alternative method to find $C_\alpha^c(\theta)$, given in Section 6.2 of [43], is to follow the approach outlined in Item 53. To validate the choice of $P_X^*$ we must show that it maximizes $B(P_X, P_{Y\langle\alpha\rangle}^*)$ (in the notation introduced in (199)) among the subset of $\mathcal{P}_{\mathcal{A}}$ which satisfies $\mathbb{E}[X^2] \le \theta$. This follows from the fact that $D_\alpha\left( P_{Y|X=x} \,\|\, P_{Y\langle\alpha\rangle}^* \right)$ is an affine function of $x^2$.

  • 76.

    Let’s now use the result in Item 73 to evaluate, with a novel parametrization, the error exponent functions for the Gaussian channel under an average power constraint.

    Theorem 23.

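    Let $\mathcal{A} = \mathcal{B} = \mathbb{R}$, $b(x) = x^2$, and $P_{Y|X=a} = \mathcal{N}(a, \sigma_N^2)$. Then, for $\beta \in [0,1]$,
    $E_{\mathrm{sp}}(R, \mathrm{snr}\,\sigma_N^2) = \frac{\mathrm{snr}}{2} (1-\beta) \log e - \frac{1}{2} \log\left( 1 + \mathrm{snr}\,\beta(1-\beta) \right)$, (312)
    $R = \frac{1}{2} \log\left( 1 + \frac{\beta^2}{\beta(1-\beta) + \frac{1}{\mathrm{snr}}} \right)$. (313)
    The critical rate and cutoff rate are, respectively,
    $R_c = \frac{1}{2} \log\left( \frac{1}{2} + \frac{\mathrm{snr}}{4} + \frac{1}{2} \sqrt{1 + \frac{\mathrm{snr}^2}{4}} \right)$, (314)
    $R_0 = \frac{1}{2} \left( 1 + \frac{\mathrm{snr}}{2} - \sqrt{1 + \frac{\mathrm{snr}^2}{4}} \right) \log e + \frac{1}{2} \log\left( \frac{1}{2} + \frac{1}{2} \sqrt{1 + \frac{\mathrm{snr}^2}{4}} \right)$. (315)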

    Proof. 

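    Expression (315) for the cutoff rate follows by letting $\rho = 1$ in (301) and (302). The supremum in (281) is attained by $\rho^* \ge 0$ that satisfies (recall the concavity result in Theorem 9-(a))
    $R = \frac{d}{d\rho} \left[ \rho\, C^c_{\frac{1}{1+\rho}}(\mathrm{snr}\,\sigma_N^2) \right] \Big|_{\rho \leftarrow \rho^*}$ (316)
    $= \frac{1}{2} \log\left( \mathrm{snr} + \frac{1}{\beta} \right) - \frac{1}{2} \log(1 + \rho^*)$, (317)
    obtained after a dose of symbolic computation working with (301). In particular, letting $\rho^* = 1$, we obtain the critical rate in (314). Note that if in (302) we substitute $\rho \leftarrow \rho^*$, with $\rho^*$ given as a function of R, snr and β by (317), we end up with an equation involving R, snr, and β. We proceed to verify that that equation is, in fact, (313). By solving a quadratic equation, we can readily check that (302) is the positive root of
    $1 + \rho = \mathrm{snr}\,(1-\beta) + \frac{1}{\beta}$. (318)
    If we particularize (318) to $\rho \leftarrow \rho^*$, with $\rho^*$ given by (317), namely,
    $\rho^* = -1 + \exp(-2R) \left( \mathrm{snr} + \frac{1}{\beta} \right)$, (319)
    we obtain
    $\exp(2R) = \frac{\mathrm{snr}\,\beta + 1}{\mathrm{snr}\,\beta(1-\beta) + 1}$, (320)
    which is (313). Notice that the right side of (320) is monotonic increasing in $\beta > 0$ ranging from 1 (for $\beta = 0$) to $1 + \mathrm{snr}$ (for $\beta = 1$). Therefore, $\beta \in [0,1]$ spans the whole gamut of values of R of interest.
    Assembling (281), (301) and (317), we obtain
    $E_{\mathrm{sp}}(R, \mathrm{snr}\,\sigma_N^2)$
    $= -\rho^* R + \frac{\mathrm{snr}}{2} (1-\beta) \log e + \frac{\rho^*}{2} \log(1 + \beta\,\mathrm{snr}) - \frac{1+\rho^*}{2} \log\left( (1+\rho^*)\beta \right)$ (321)
    $= -\rho^* R + \frac{\mathrm{snr}}{2} (1-\beta) \log e + \frac{\rho^*}{2} \log(1 + \beta\,\mathrm{snr}) - \frac{1+\rho^*}{2} \log \beta + (1+\rho^*) R - \frac{1+\rho^*}{2} \log\left( \mathrm{snr} + \frac{1}{\beta} \right)$ (322)
    $= R + \frac{\mathrm{snr}}{2} (1-\beta) \log e - \frac{1}{2} \log(1 + \beta\,\mathrm{snr})$ (323)
    $= \frac{\mathrm{snr}}{2} (1-\beta) \log e - \frac{1}{2} \log\left( 1 + \mathrm{snr}\,\beta(1-\beta) \right)$, (324)
    where (324) follows by substituting (313) on the left side. □
    Note that the parametric expression in (312) and (313) (shown in Figure 2) is, in fact, a closed-form expression for $E_{\mathrm{sp}}(R, \mathrm{snr}\,\sigma_N^2)$ since we can invert (313) to obtain
    $\beta = \frac{1}{2} \left( 1 - \exp(-2R) \right) \left( 1 + \sqrt{ 1 + \frac{4}{\mathrm{snr}\,(1 - \exp(-2R))} } \right)$. (325)
    The random coding error exponent is
    $E_{\mathrm{r}}(R, \theta) = \begin{cases} E_{\mathrm{sp}}(R, \theta), & R \in \left( R_c, \frac{1}{2} \log(1 + \mathrm{snr}) \right); \\ R_0 - R, & R \in [0, R_c], \end{cases}$ (326)
    with the critical rate $R_c$ and cutoff rate $R_0$ in (314) and (315), respectively. It can be checked that (326) coincides with the expression given by Gallager [9] (p. 340) where he optimizes (235) with respect to ρ and r, but not $P_X$, which he just assumes to be $P_X = \mathcal{N}(0, \theta)$. The expression for $R_c$ in (314) can be found in (7.4.34) of [9]; $R_0$ in (315) is implicit in p. 340 of [9], and explicit in e.g., [69].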
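
    The closed-form expressions (312)–(315), (325) and (326) translate directly into a few lines of code. The following is a minimal sketch in nats (log e = 1); the function names are illustrative, and rates are assumed to lie in (0, ½ log(1 + snr)].

    ```python
    import math

    def beta_from_rate(rate, snr):
        """Invert (313) through (325); rate in nats, 0 < rate <= 0.5*log(1+snr)."""
        u = 1.0 - math.exp(-2.0 * rate)
        return 0.5 * u * (1.0 + math.sqrt(1.0 + 4.0 / (snr * u)))

    def e_sp(rate, snr):
        """Sphere-packing exponent (312), in nats."""
        beta = beta_from_rate(rate, snr)
        return 0.5 * snr * (1.0 - beta) - 0.5 * math.log(1.0 + snr * beta * (1.0 - beta))

    def critical_rate(snr):
        """Critical rate (314), in nats."""
        return 0.5 * math.log(0.5 + snr / 4.0 + 0.5 * math.sqrt(1.0 + snr ** 2 / 4.0))

    def cutoff_rate(snr):
        """Cutoff rate (315), in nats."""
        root = math.sqrt(1.0 + snr ** 2 / 4.0)
        return 0.5 * (1.0 + snr / 2.0 - root) + 0.5 * math.log(0.5 + 0.5 * root)

    def e_r(rate, snr):
        """Random coding exponent (326): straight line below R_c, E_sp above it."""
        if rate <= critical_rate(snr):
            return cutoff_rate(snr) - rate
        return e_sp(rate, snr)

    snr = 4.0
    capacity = 0.5 * math.log(1.0 + snr)
    for rate in (0.1, critical_rate(snr), 0.6, capacity):
        print(f"R={rate:.4f}  E_sp={e_sp(rate, snr):.6f}  E_r={e_r(rate, snr):.6f}")
    ```

    Two consistency checks are worth making with such a sketch: the two branches of (326) meet at $R_c$, where $E_{\mathrm{sp}}$ has slope −1, and $E_{\mathrm{sp}}$ vanishes at capacity, where (325) gives β = 1.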
  • 77.
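    The expression for $E_{\mathrm{sp}}(R, \theta)$ in Theorem 23 has more structure than meets the eye. The analysis in Item 73 has shown that $E_{\mathrm{sp}}(R, P_X)$ is maximized over $P_X$ with second moment not exceeding θ by $P_X^* = \mathcal{N}(0, \theta)$ regardless of $R \in \left( 0, \frac{1}{2} \log(1 + \mathrm{snr}) \right)$. The fact that we have found a closed-form expression for (254) when evaluated at such input probability measure and $P_{Y|X=a} = \mathcal{N}(a, \sigma_N^2)$ is indicative that the minimum therein is attained by a Gaussian random transformation $Q_{Y|X}^*$. This is indeed the case: define the random transformation
    $Q_{Y|X=a}^* = \mathcal{N}(\beta a, \sigma_1^2)$, (327)
    $\frac{\sigma_1^2}{\sigma_N^2} = 1 + \mathrm{snr}\,\beta(1-\beta)$. (328)
    In comparison with the nominal random transformation $P_{Y|X=a} = \mathcal{N}(a, \sigma_N^2)$, this channel attenuates the input and contaminates it with a more powerful noise. Then,
    $I(P_X^*, Q_{Y|X}^*) = \frac{1}{2} \log\left( 1 + \frac{\beta^2}{\beta(1-\beta) + \frac{1}{\mathrm{snr}}} \right) = R$. (329)
    Furthermore, invoking (33), we get
    $D(Q_{Y|X}^* \| P_{Y|X} | P_X^*) = \mathbb{E}\left[ D\left( \mathcal{N}(\beta X^*, \sigma_1^2) \,\|\, \mathcal{N}(X^*, \sigma_N^2) \right) \right]$ (330)
    $= \frac{1}{2} \left( (\beta-1)^2\, \mathrm{snr} + \frac{\sigma_1^2}{\sigma_N^2} - 1 \right) \log e - \frac{1}{2} \log \frac{\sigma_1^2}{\sigma_N^2}$ (331)
    $= \frac{\mathrm{snr}}{2} (1-\beta) \log e - \frac{1}{2} \log\left( 1 + \mathrm{snr}\,\beta(1-\beta) \right)$ (332)
    $= E_{\mathrm{sp}}(R, \mathrm{snr}\,\sigma_N^2)$, (333)
    where (333) is (312). Therefore, $Q_{Y|X}^*$ does indeed achieve the minimum in (254) if $P_{Y|X=a} = \mathcal{N}(a, \sigma_N^2)$ and $P_X^* = \mathcal{N}(0, \theta)$. So, the most likely error mechanism is the result of atypically large noise strength and an attenuated received signal. Both effects cannot be combined into additional noise variance: there is no $\sigma^2 > 0$ such that $Q_{Y|X=a} = \mathcal{N}(a, \sigma^2)$ achieves the minimum in (254).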
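
    The two computations (329) and (330)–(332) can be reproduced numerically from the standard formulas for Gaussian mutual information and Gaussian relative entropy. This is a sketch in nats; the function names and the default `noise_var=1.0` are illustrative assumptions.

    ```python
    import math

    def gaussian_kl(mean_gap_sq, var_q, var_p):
        """D(N(m_q, var_q) || N(m_p, var_p)) in nats, with mean_gap_sq = (m_q - m_p)^2."""
        return 0.5 * (math.log(var_p / var_q) + var_q / var_p - 1.0 + mean_gap_sq / var_p)

    def tilted_channel(beta, snr, noise_var=1.0):
        """Rate and conditional divergence of Q*_{Y|X=a} = N(beta*a, var1) under X ~ N(0, theta)."""
        theta = snr * noise_var
        var1 = noise_var * (1.0 + snr * beta * (1.0 - beta))        # (328)
        rate = 0.5 * math.log(1.0 + beta ** 2 * theta / var1)       # mutual information of Y = beta*X + noise
        # The divergence is affine in the squared mean gap, so averaging over X
        # amounts to plugging in E[(beta*X - X)^2] = (1 - beta)^2 * theta.
        divergence = gaussian_kl((1.0 - beta) ** 2 * theta, var1, noise_var)
        return rate, divergence

    beta, snr = 0.6, 2.5
    rate, divergence = tilted_channel(beta, snr)
    print("R    :", rate, 0.5 * math.log(1.0 + beta ** 2 / (beta * (1.0 - beta) + 1.0 / snr)))
    print("E_sp :", divergence, 0.5 * snr * (1.0 - beta) - 0.5 * math.log(1.0 + snr * beta * (1.0 - beta)))
    ```

    Each printed pair compares the direct computation with the parametric expressions (313) and (312), respectively.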

Figure 2. $E_{\mathrm{sp}}(R, \mathrm{snr}\,\sigma_N^2)$ in (312) and (313); logarithms in base 2.

12. Additive Independent Exponential Noise; Input-Mean Constraint

This section finds the sphere-packing error exponent for the additive independent exponential noise channel under an input-mean constraint.

  • 78.
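    Suppose that $\mathcal{A} = \mathcal{B} = [0, \infty)$, $b(x) = x$, and
    $Y = X + N$, (334)
    where N is exponentially distributed, independent of X, and $\mathbb{E}[N] = \zeta$. Therefore $P_{Y|X=a}$ has density
    $p_{Y|X=a}(t) = \frac{1}{\zeta}\, e^{-\frac{t-a}{\zeta}}\, 1\{t \ge a\}$. (335)
    It is shown in [70,71] that
    $\max_{X \colon \mathbb{E}[X] \le \theta} I(X; X+N) = \log(1 + \mathrm{snr})$, (336)
    $\mathrm{snr} = \frac{\theta}{\zeta}$, (337)
    achieved by a mixed random variable with density
    $f_X^*(t) = \frac{\zeta}{\zeta+\theta}\, \delta(t) + \frac{\theta}{(\zeta+\theta)^2}\, e^{-t/(\zeta+\theta)}\, 1\{t > 0\}$. (338)
    To determine $C_\alpha^c(\mathrm{snr}\,\zeta)$, $\alpha \in (0,1)$, we invoke Theorem 18. A sensible candidate for the auxiliary input distribution $Q_X^*$ is a mixed random variable with density
    $q_X^*(t) = \Gamma^* \delta(t) + (1 - \Gamma^*)\, \frac{1}{\mu}\, e^{-t/\mu}\, 1\{t > 0\}$, (339)
    $\mu = \frac{\zeta}{\alpha\Gamma^*}$, (340)
    where $\Gamma^* \in (0,1)$ is yet to be determined. This is an attractive choice because its α-response, $Q_{Y[\alpha]}^*$, is particularly simple: exponential with mean $\alpha\mu = \frac{\zeta}{\Gamma^*}$, as we can verify using Laplace transforms. Then, if Z is exponential with unit mean, with the aid of Example 5, we can write
    $D_\alpha\left( P_{Y|X=x} \,\|\, Q_{Y[\alpha]}^* \right) = D_\alpha(\zeta Z + x \,\|\, \alpha\mu Z)$ (341)
    $= \frac{x}{\alpha\mu} \log e + \log \frac{\alpha\mu}{\zeta} + \frac{1}{1-\alpha} \log\left( \alpha + (1-\alpha) \frac{\zeta}{\alpha\mu} \right)$ (342)
    $= \frac{\Gamma^* x}{\zeta} \log e - \log \Gamma^* + \frac{1}{1-\alpha} \log\left( \alpha + (1-\alpha)\Gamma^* \right)$. (343)
    So, (218) is satisfied with
    $\nu^* = \frac{\Gamma^*}{\zeta} \log e$, (344)
    $c_\alpha(\nu^*) = \frac{1}{1-\alpha} \log\left( \alpha + (1-\alpha)\Gamma^* \right) - \log \Gamma^*$. (345)
    To evaluate (217), it is useful to note that if $\gamma > -1$, then
    $\mathbb{E}\left[ Z e^{-\gamma Z} \right] = \frac{1}{(1+\gamma)^2}$, (346)
    $\mathbb{E}\left[ e^{-\gamma Z} \right] = \frac{1}{1+\gamma}$. (347)
    Therefore, the left side of (217) specializes to, with $\bar{X}^* \sim Q_X^*$,
    $\mathbb{E}\left[ b(\bar{X}^*) \exp\left( -(1-\alpha)\,\nu^* b(\bar{X}^*) \right) \right] = \frac{\mu (1 - \Gamma^*)}{\left( 1 + \mu(1-\alpha) \frac{\nu^*}{\log e} \right)^2}$ (348)
    $= \zeta\alpha \left( \frac{1}{\Gamma^*} - 1 \right)$, (349)
    while the expectation on the right side of (217) is given by
    $\mathbb{E}\left[ \exp\left( -(1-\alpha)\,\nu^* b(\bar{X}^*) \right) \right] = \alpha + \Gamma^* - \alpha\Gamma^*$. (350)
    Therefore, dividing (349) by (350) and by ζ, (217) yields
    $\mathrm{snr} = \frac{\alpha \left( \frac{1}{\Gamma^*} - 1 \right)}{\alpha + (1-\alpha)\Gamma^*}$ (351)
    whose solution is
    $\Gamma^* = \frac{1}{2\rho\,\mathrm{snr}} \left( \sqrt{ (1 + \mathrm{snr})^2 + 4\rho\,\mathrm{snr} } - 1 - \mathrm{snr} \right)$, (352)
    with $\rho = \frac{1-\alpha}{\alpha}$. So, finally, (220), (344) and (345) give the closed-form expression
    $C_\alpha^c(\theta) = \mathrm{snr}\,\Gamma^* \log e - \log \Gamma^* + \frac{1}{1-\alpha} \log\left( \alpha + (1-\alpha)\Gamma^* \right)$. (353)

    As in Item 73, we can postulate an auxiliary distribution that satisfies (230) for every $\nu \ge 0$. This is identical to what we did in (341)–(343) except that now (344) and (345) hold for generic ν and Γ. Then, (351) is the result of solving $\theta = -\dot{c}_\alpha(\nu^*)$, which is, in fact, somewhat simpler than obtaining it through (217).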
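
    The shortcut just described, namely minimizing $\nu\theta + c_\alpha(\nu)$ directly, is easy to reproduce numerically. The sketch below works in nats, so that by (344) $\nu\theta = \mathrm{snr}\,\Gamma$ with $\Gamma = \nu\zeta$; the function names and the grid size are illustrative. It compares the root (352) and the closed form (353) with a brute-force minimization over Γ in (0, 1), and checks that the value approaches log(1 + snr) in (336) as α tends to 1.

    ```python
    import numpy as np

    def gamma_star(alpha, snr):
        """Root (352), with rho = (1 - alpha)/alpha."""
        rho = (1.0 - alpha) / alpha
        return (np.sqrt((1.0 + snr) ** 2 + 4.0 * rho * snr) - 1.0 - snr) / (2.0 * rho * snr)

    def objective(gamma, alpha, snr):
        """nu*theta + c_alpha(nu) in nats, written in terms of Gamma = nu*zeta, from (344)-(345)."""
        return (snr * gamma
                + np.log(alpha + (1.0 - alpha) * gamma) / (1.0 - alpha)
                - np.log(gamma))

    def cap_alpha_exponential(alpha, snr):
        """Closed form (353), in nats."""
        return objective(gamma_star(alpha, snr), alpha, snr)

    def grid_minimum(alpha, snr, points=400001):
        """Brute-force minimum of the objective over an interior grid of (0, 1)."""
        grid = np.linspace(1e-6, 1.0 - 1e-6, points)
        return float(np.min(objective(grid, alpha, snr)))

    for alpha, snr in ((0.5, 3.0), (0.25, 1.0)):
        print(alpha, snr, gamma_star(alpha, snr),
              cap_alpha_exponential(alpha, snr), grid_minimum(alpha, snr))
    print(cap_alpha_exponential(1.0 - 1e-6, 3.0), np.log(1.0 + 3.0))
    ```

    The objective is convex in Γ on (0, 1), so the grid search and the stationarity condition (351) should single out the same point.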

  • 79.

    We proceed to get a very simple parametric expression for Esp(R,θ).

    Theorem 24.

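    Let $\mathcal{A} = \mathcal{B} = [0, \infty)$, $b(x) = x$, and $Y = X + N$, with N exponentially distributed, independent of X, and $\mathbb{E}[N] = \zeta$. Then, under the average cost constraint $\mathbb{E}[b(X)] \le \zeta\,\mathrm{snr}$,
    $E_{\mathrm{sp}}(R, \zeta\,\mathrm{snr}) = \left( \frac{1}{\eta} - 1 \right) \log e + \log \eta$, (354)
    $R = \log(1 + \eta\,\mathrm{snr})$, (355)
    where $\eta \in (0,1]$.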

    Proof. 

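    Rewriting (353) results in
    $\rho\, C^c_{\frac{1}{1+\rho}}(\theta) = \rho\,\mathrm{snr}\,\Gamma^* \log e - \rho \log \Gamma^* + (1+\rho) \log \frac{1 + \rho\Gamma^*}{1+\rho}$, (356)
    which is monotonically decreasing with ρ. With $\dot{\Gamma}^* = \frac{\partial}{\partial\rho} \Gamma^*(\rho, \mathrm{snr})$, the counterpart of (317) is now
    $R = \frac{d}{d\rho} \left[ \rho\, C^c_{\frac{1}{1+\rho}}(\theta) \right] \Big|_{\rho \leftarrow \rho^*}$ (357)
    $= (\Gamma^* + \rho^* \dot{\Gamma}^*) \left( \mathrm{snr} - \frac{1}{\Gamma^*} + \frac{1+\rho^*}{1 + \rho^*\Gamma^*} \right) \log e + \log \frac{1 + \rho^*\Gamma^*}{\Gamma^* + \rho^*\Gamma^*}$ (358)
    $= (\Gamma^* + \rho^* \dot{\Gamma}^*) \left( \mathrm{snr} + \frac{\Gamma^* - 1}{\Gamma^*} \cdot \frac{1}{1 + \rho^*\Gamma^*} \right) \log e + \log \frac{1 + \rho^*\Gamma^*}{\Gamma^* + \rho^*\Gamma^*}$ (359)
    $= \log \frac{1 + \rho^*\Gamma^*}{\Gamma^* + \rho^*\Gamma^*}$, (360)
    where the drastic simplification in (360) occurs because, with the current parametrization, (351) becomes
    $1 - \Gamma^* = (1 + \rho^*\Gamma^*)\,\Gamma^*\,\mathrm{snr}$. (361)
    Now we go ahead and express both $\rho^*$ and $\Gamma^*$ as functions of snr and R exclusively. We may rewrite (357)–(360) as
    $\rho^*\Gamma^* = \frac{\exp(-R) - \Gamma^*}{1 - \exp(-R)}$, (362)
    which, when plugged in (361), results in
    $\Gamma^* = \frac{1}{\mathrm{snr}} \left( 1 - \exp(-R) \right) < 1$, (363)
    $\rho^* = \frac{(1 + \mathrm{snr}) \exp(-R) - 1}{\left( 1 - \exp(-R) \right)^2} > 0$, (364)
    where the inequalities in (363) and (364) follow from $R < \log(1 + \mathrm{snr})$. So, in conclusion,
    $E_{\mathrm{sp}}(R, \theta) = \max_{\rho \ge 0} \left\{ \rho\, C^c_{\frac{1}{1+\rho}}(\theta) - \rho R \right\}$ (365)
    $= \rho^* C^c_{\frac{1}{1+\rho^*}}(\theta) - \rho^* R$ (366)
    $= \rho^*\,\mathrm{snr}\,\Gamma^* \log e - \rho^* \log \Gamma^* + (1+\rho^*) \log \frac{1 + \rho^*\Gamma^*}{1+\rho^*} - \rho^* R$ (367)
    $= \rho^*\,\mathrm{snr}\,\Gamma^* \log e - \rho^* \log \Gamma^* + (1+\rho^*)(R + \log \Gamma^*) - \rho^* R$ (368)
    $= \rho^*\,\mathrm{snr}\,\Gamma^* \log e + \log \Gamma^* + R$ (369)
    $= \left( \frac{\mathrm{snr}}{\exp(R) - 1} - 1 \right) \log e + \log \frac{\exp(R) - 1}{\mathrm{snr}}$ (370)
    $= \left( \frac{1}{\eta} - 1 \right) \log e + \log \eta$, (371)
    where we have introduced
    $\eta = \frac{\exp(R) - 1}{\mathrm{snr}} = \frac{\Gamma^*}{1 - \mathrm{snr}\,\Gamma^*}$. (372)
    Evidently, the left identity in (372) is the same as (355). □
    The critical rate and the cutoff rate are obtained by particularizing (360) and (356) to $\rho^* = 1$ and $\rho = 1$, respectively. This yields
    $R_c = \log \frac{1 + \Gamma_1^*}{2\Gamma_1^*}$, (373)
    $R_0 = \mathrm{snr}\,\Gamma_1^* \log e - \log(4\Gamma_1^*) + 2 \log(1 + \Gamma_1^*)$, (374)
    $\Gamma_1^* = \frac{ \sqrt{ (1 + \mathrm{snr})^2 + 4\,\mathrm{snr} } - 1 - \mathrm{snr} }{2\,\mathrm{snr}}$. (375)
    As in (326), the random coding error exponent is
    $E_{\mathrm{r}}(R, \zeta\,\mathrm{snr}) = \begin{cases} E_{\mathrm{sp}}(R, \zeta\,\mathrm{snr}), & R \in \left( R_c, \log(1 + \mathrm{snr}) \right); \\ R_0 - R, & R \in [0, R_c], \end{cases}$ (376)
    with the critical rate $R_c$ and cutoff rate $R_0$ in (373) and (374), respectively. This function is shown along with $E_{\mathrm{sp}}(R, \zeta\,\mathrm{snr})$ in Figure 3 for $\mathrm{snr} = 3$.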
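
    As in the Gaussian case, the exponents for the exponential-noise channel are straightforward to evaluate. The sketch below is in nats (log e = 1) and uses the value snr = 3 of Figure 3; the function names are illustrative, and rates are assumed to lie in (0, log(1 + snr)].

    ```python
    import math

    def e_sp_exp(rate, snr):
        """Sphere-packing exponent (354), with eta obtained from (372); rate in nats."""
        eta = (math.exp(rate) - 1.0) / snr
        return 1.0 / eta - 1.0 + math.log(eta)

    def gamma_one(snr):
        """Gamma_1^* in (375)."""
        return (math.sqrt((1.0 + snr) ** 2 + 4.0 * snr) - 1.0 - snr) / (2.0 * snr)

    def critical_rate_exp(snr):
        """Critical rate (373), in nats."""
        g = gamma_one(snr)
        return math.log((1.0 + g) / (2.0 * g))

    def cutoff_rate_exp(snr):
        """Cutoff rate (374), in nats."""
        g = gamma_one(snr)
        return snr * g - math.log(4.0 * g) + 2.0 * math.log(1.0 + g)

    def e_r_exp(rate, snr):
        """Random coding exponent (376)."""
        if rate <= critical_rate_exp(snr):
            return cutoff_rate_exp(snr) - rate
        return e_sp_exp(rate, snr)

    snr = 3.0
    for rate in (0.25, critical_rate_exp(snr), 1.2, math.log(1.0 + snr)):
        print(f"R={rate:.4f}  E_sp={e_sp_exp(rate, snr):.6f}  E_r={e_r_exp(rate, snr):.6f}")
    ```

    At $R = R_c$ the two branches of (376) coincide, and at $R = \log(1 + \mathrm{snr})$ the parameter η equals 1, so the sphere-packing exponent vanishes.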
  • 80.
    In parallel to Item 77, we find the random transformation that explains the most likely mechanism to produce errors at every rate R, namely the minimizer of (254) when $P_X = P_X^*$, the maximizer of the Augustin–Csiszár mutual information of order α. In this case, $P_X^*$ is not as trivial to guess as in Section 11, but since we already found $Q_X^*$ in (339) with $\Gamma = \Gamma^*$, we can invoke Theorem 17 to show that the density of $P_X^*$ achieving the maximal order-α Augustin–Csiszár mutual information is
    $p_X^*(t) = \frac{\Gamma^*}{\alpha + (1-\alpha)\Gamma^*}\, \delta(t) + \frac{1 - \Gamma^*}{\alpha + (1-\alpha)\Gamma^*} \cdot \frac{\alpha\Gamma^*}{\zeta}\, e^{-t\Gamma^*/\zeta}\, 1\{t > 0\}$, (377)
    whose mean is, as it should be,
    $\frac{\alpha\zeta}{\Gamma^*} \cdot \frac{1 - \Gamma^*}{\alpha + (1-\alpha)\Gamma^*} = \zeta\,\mathrm{snr} = \theta$. (378)
    Let $Q_Y^*$ be exponential with mean $\theta + \kappa$, and $Q_{Y|X=a}^*$ have density
    $q_{Y|X=a}^*(t) = \frac{1}{\kappa}\, e^{-\frac{t-a}{\kappa}}\, 1\{t \ge a\}$, (379)
    with
    $\kappa = \frac{\zeta}{\eta}$, (380)
    and η as defined in (372). Using Laplace transforms, we can verify that $P_X^* \to Q_{Y|X}^* \to Q_Y^*$ where $P_X^*$ is the probability measure with density in (377). Let Z be unit-mean exponentially distributed. Writing mutual information as the difference between the output differential entropy and the noise differential entropy we get
    $I(P_X^*, Q_{Y|X}^*) = h((\theta + \kappa) Z) - h(\kappa Z)$ (381)
    $= \log\left( 1 + \frac{\theta}{\kappa} \right)$ (382)
    $= R$, (383)
    in view of (363). Furthermore, using (335) and (379),
    $D(Q_{Y|X}^* \| P_{Y|X} | P_X^*) = \log \frac{\zeta}{\kappa} + \left( \frac{\kappa}{\zeta} - 1 \right) \log e$ (384)
    $= \log \eta + \left( \frac{1}{\eta} - 1 \right) \log e$ (385)
    $= E_{\mathrm{sp}}(R, \zeta\,\mathrm{snr})$, (386)
    where we have used (380) and (354). Therefore, we have shown that $Q_{Y|X}^*$ is indeed the minimizer of (254). In this case, the most likely mechanism for errors to happen is that the channel adds independent exponential noise with mean $\zeta/\eta$, instead of the nominal mean ζ. In this respect, the behavior is reminiscent of that of the exponential timing channel for which the error exponent is dominated (at least above critical rate) by an exponential server which is slower than the nominal [72].

Figure 3. Error exponent functions in (354), (355) and (376).
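The curves for the additive exponential noise channel can be tabulated from the parametric form implied by (380) to (386): with $\theta = \zeta\,\mathrm{snr}$ and $\kappa = \zeta/\eta$, the rate is $R = \log(1 + \eta\,\mathrm{snr})$ and the sphere-packing exponent is $\log\eta + (1/\eta - 1)\log e$. The sketch below works in nats. Since (373) and (375) are not reproduced in this part of the text, it assumes the standard characterization of the critical rate as the point where the sphere-packing exponent has slope $-1$, and sets the cutoff rate so that the straight-line portion of (376) joins it continuously. The function names are hypothetical.

```python
import math


def eta_of_rate(rate, snr):
    """Invert R = log(1 + eta * snr), with R in nats."""
    return math.expm1(rate) / snr


def e_sp(rate, snr):
    """Sphere-packing exponent in nats: log(eta) + 1/eta - 1."""
    capacity = math.log1p(snr)
    if not 0.0 < rate <= capacity:
        raise ValueError("rate must lie in (0, log(1 + snr)]")
    eta = eta_of_rate(rate, snr)
    return math.log(eta) + 1.0 / eta - 1.0


def critical_eta(snr):
    """Positive root of 2*snr*eta**2 - (snr - 1)*eta - 1 = 0, which is the
    ASSUMED tangency condition dE_sp/dR = -1 written in terms of eta."""
    return ((snr - 1.0) + math.sqrt((snr - 1.0) ** 2 + 8.0 * snr)) / (4.0 * snr)


def e_r(rate, snr):
    """Random-coding exponent with the piecewise structure of (376)."""
    r_c = math.log1p(critical_eta(snr) * snr)   # assumed critical rate
    r_0 = r_c + e_sp(r_c, snr)                  # assumed cutoff rate
    if rate <= r_c:
        return r_0 - rate
    return e_sp(rate, snr)


if __name__ == "__main__":
    snr = 3.0
    for k in range(1, 11):
        rate = k * math.log1p(snr) / 10.0
        print(f"R={rate:.3f}  E_sp={e_sp(rate, snr):.4f}  E_r={e_r(rate, snr):.4f}")
```

The quadratic in `critical_eta` comes from differentiating both parametric expressions with respect to $\eta$ and setting the ratio of the derivatives to $-1$.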

13. Recap

  • 81.

    The analysis of the fundamental limits of noisy channels in the regime of vanishing error probability with blocklength growing without bound expresses channel capacity in terms of a basic information measure: the input–output mutual information maximized over the input distribution. In the regime of fixed nonzero error probability, the asymptotic fundamental limit is a function of not only capacity but channel dispersion [73], which is also expressible in terms of an information measure: the variance of the information density obtained with the capacity-achieving distribution. In the regime of exponentially decreasing error probability (at fixed rate below capacity) the analysis of the fundamental limits has gone through three distinct phases. No information measures were involved during the first phase and any optimization with respect to various auxiliary parameters and input distribution had to rely on standard convex optimization techniques, such as Karush-Kuhn-Tucker conditions, which not only are cumbersome to solve in this particular setting, but shed little light on the structure of the solution. The second phase firmly anchored the problem in a large deviations foundation, with the fundamental limits expressed in terms of conditional relative entropy as well as mutual information. Unfortunately, the associated maximinimization in (2) did not immediately lend itself to analytical progress. Thanks to Csiszár’s realization of the relevance of Rényi’s information measures to this problem, the third phase has found a way to, not only express the error exponent functions as a function of information measures, but to solve the associated optimization problems in a systematic way. While, in the absence of cost constraints, the problem reduces to finding the maximal α-mutual information, cost constraints make the problem much more challenging because of the difficulty in determining the order-α Augustin–Csiszár mutual information. 
Fortunately, thanks to the introduction of an auxiliary input distribution (the α-adjunct of the distribution that maximizes Iαc), we have shown that α-mutual information also comes to the rescue in the maximization of the order-α Augustin–Csiszár mutual information in the presence of average cost constraints. We have also finally ended the isolation of Gallager’s E0 function with cost constraints from the representations in Phases 2 and 3. The pursuit of such a link is what motivated Augustin in 1978 to define a generalized mutual information measure. Overall, the analysis has given yet another instance of the benefits of variational representations of information measures, leading to solutions based on saddle points. However, we have steered clear of off-the-shelf minimax theorems and their associated topological constraints.

    We have worked out two channels/cost constraints (additive Gaussian noise with quadratic cost, and additive exponential noise with a linear cost) that admit closed-form error-exponent functions, most easily expressed in parametric form. Furthermore, in Items 77 and 80 we have illuminated the structure of those closed-form expressions by identifying the anomalous channel behavior responsible for most errors at every given rate. In the exponential noise case, the solution is simply a noisier exponential channel, while in the Gaussian case it is the result of both a noisier Gaussian channel and an attenuated input.

    These observations prompt the question of whether there might be an alternative general approach that eschews Rényi’s information measures to arrive at not only the most likely anomalous channel behavior, but the error exponent functions themselves.

Acknowledgments

The manuscript incorporates constructive suggestions by Academic Editor Igal Sason and the anonymous referees.

Appendix A

Recall that the relative information $\imath_{P\|Q}$ is defined only if $P \ll Q$, while $D(P\|Q) \in [0, +\infty]$ is always defined and equal to $+\infty$ if (but not only if) $P \not\ll Q$.

Lemma A1.

If $Q \ll R$ and $X \sim P \ll R$, then

$\mathbb{E}\left[\imath_{P\|R}(X) - \imath_{Q\|R}(X)\right] = D(P\|Q),$ (A1)

regardless of whether the right side is finite.

Proof. 

If $P \ll Q \ll R$, we may invoke the chain rule (7) to decompose

$\imath_{P\|R}(a) - \imath_{Q\|R}(a) = \imath_{P\|Q}(a).$ (A2)

Then, the result follows by taking expectations of (A2) when $a \leftarrow X \sim P$.

To show that (A1) also holds when $P \not\ll Q$, i.e., that the expectation on the left side is $+\infty$, we invoke the Lebesgue decomposition theorem (e.g., p. 384 of [74]), which ensures that we can find $\alpha \in [0,1)$, $P_0 \perp Q$ and $P_1 \ll Q$, such that

$P = \alpha P_1 + (1-\alpha) P_0.$ (A3)

Since $P_1 \perp P_0$, we have

$D(P_1\|P) = \log\frac{1}{\alpha},$ (A4)
$D(P_0\|P) = \log\frac{1}{1-\alpha}.$ (A5)

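If $X_1 \sim P_1$, then

$\mathbb{E}\left[\imath_{P\|R}(X_1) - \imath_{Q\|R}(X_1)\right] = \mathbb{E}\left[\imath_{P_1\|R}(X_1) - \imath_{Q\|R}(X_1)\right] - \mathbb{E}\left[\imath_{P_1\|R}(X_1) - \imath_{P\|R}(X_1)\right]$ (A6)
$= D(P_1\|Q) - D(P_1\|P)$ (A7)
$= D(P_1\|Q) - \log\frac{1}{\alpha},$ (A8)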

where

  • (A7) ⟸ (A1) with $(P,Q,R) \leftarrow (P_1,Q,R)$ and (A1) with $(P,Q,R) \leftarrow (P_1,P,R)$, which we are entitled to invoke since $P_1$ is dominated by both $Q$ and $R$;

  • (A8) ⟸ (A4).

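Analogously, if $X_0 \sim P_0$, then

$\mathbb{E}\left[\imath_{P\|R}(X_0)\right] = \mathbb{E}\left[\imath_{P_0\|R}(X_0)\right] - \mathbb{E}\left[\imath_{P_0\|R}(X_0) - \imath_{P\|R}(X_0)\right]$ (A9)
$= D(P_0\|R) - D(P_0\|P)$ (A10)
$= D(P_0\|R) - \log\frac{1}{1-\alpha}.$ (A11)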

Therefore, we are ready to conclude that

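$\mathbb{E}\left[\imath_{P\|R}(X) - \imath_{Q\|R}(X)\right]$
$= \alpha\,\mathbb{E}\left[\imath_{P\|R}(X_1) - \imath_{Q\|R}(X_1)\right] + (1-\alpha)\,\mathbb{E}\left[\imath_{P\|R}(X_0) - \imath_{Q\|R}(X_0)\right]$ (A12)
$= \alpha D(P_1\|Q) + (1-\alpha) D(P_0\|R) - (1-\alpha)\,\mathbb{E}\left[\imath_{Q\|R}(X_0)\right] - h(\alpha)$ (A13)
$= +\infty,$ (A14)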

where

  • (A12) ⟸ (A3);

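  • (A13) ⟸ (A8) and (A11), where $h(\cdot)$ is the binary entropy function;

  • (A14) ⟸ $\mathbb{E}\left[\imath_{Q\|R}(X_0)\right] = -\infty$ ⟸ $P_0\left(\left\{x \in A : \frac{dQ}{dR}(x) = 0\right\}\right) = 1$ ⟸ $P_0 \perp Q$.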
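On a finite alphabet, Lemma A1 and the decomposition used in its proof can be checked directly. The sketch below is only an illustration under stated assumptions: distributions are lists of probabilities, the reference measure `r` is strictly positive (so both `p` and `q` are dominated by it), and the Lebesgue decomposition reduces to splitting `p` according to the support of `q`. The function names and the example distributions are hypothetical; all quantities are in nats.

```python
import math


def kl(p, q):
    """D(P||Q) in nats on a finite alphabet; +inf if P is not dominated by Q."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf
            total += px * math.log(px / qx)
    return total


def rel_info(p, r, x):
    """Relative information log(dP/dR)(x); assumes r[x] > 0, -inf if p[x] == 0."""
    return math.log(p[x] / r[x]) if p[x] > 0 else -math.inf


def lhs_a1(p, q, r):
    """Left side of (A1): E[i_{P||R}(X) - i_{Q||R}(X)] with X ~ P."""
    total = 0.0
    for x, px in enumerate(p):
        if px > 0:
            total += px * (rel_info(p, r, x) - rel_info(q, r, x))
    return total


def lebesgue_split(p, q):
    """Split P = alpha*P1 + (1-alpha)*P0 with P1 << Q and P0 singular to Q.
    Assumes 0 < alpha < 1, i.e. P charges both the support of Q and its complement."""
    alpha = sum(px for px, qx in zip(p, q) if qx > 0)
    p1 = [px / alpha if qx > 0 else 0.0 for px, qx in zip(p, q)]
    p0 = [px / (1.0 - alpha) if qx == 0 else 0.0 for px, qx in zip(p, q)]
    return alpha, p1, p0


if __name__ == "__main__":
    r = [0.25, 0.25, 0.25, 0.25]
    p = [0.2, 0.3, 0.4, 0.1]
    q_dominating = [0.1, 0.2, 0.3, 0.4]   # P << Q: both sides of (A1) finite
    q_singular = [0.5, 0.5, 0.0, 0.0]     # P not dominated by Q: both sides +inf
    print(lhs_a1(p, q_dominating, r), kl(p, q_dominating))
    print(lhs_a1(p, q_singular, r), kl(p, q_singular))
    alpha, p1, p0 = lebesgue_split(p, q_singular)
    print(kl(p1, p), math.log(1 / alpha))         # (A4)
    print(kl(p0, p), math.log(1 / (1 - alpha)))   # (A5)
```

In the singular case the infinite value arises exactly as in (A14): the relative information of `q` with respect to `r` is $-\infty$ on the part of the alphabet that `p` charges but `q` does not.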

Corollary A1.

Suppose that $Q \ll R$ and $X \sim P \ll R$. Then,

$\mathbb{E}\left[\imath_{Q\|R}(X)\right] = D(P\|R) - D(P\|Q),$ (A15)

as long as at least one of the relative entropies on the right side is finite.
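Corollary A1 admits the same kind of finite-alphabet check. This is a minimal sketch under the assumption that the reference measure `r` is strictly positive, so that $D(P\|R)$ is finite and the corollary applies whether or not $D(P\|Q)$ is finite. The names and example distributions are hypothetical; quantities are in nats.

```python
import math


def kl(p, q):
    """D(P||Q) in nats on a finite alphabet; +inf if P is not dominated by Q."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf
            total += px * math.log(px / qx)
    return total


def expected_rel_info(p, q, r):
    """Left side of (A15): E[i_{Q||R}(X)] with X ~ P, assuming r > 0 everywhere.
    Evaluates to -inf when P puts mass where Q vanishes."""
    total = 0.0
    for px, qx, rx in zip(p, q, r):
        if px > 0:
            total += px * (math.log(qx / rx) if qx > 0 else -math.inf)
    return total


if __name__ == "__main__":
    r = [1 / 3, 1 / 3, 1 / 3]
    p = [0.2, 0.3, 0.5]
    q = [0.5, 0.25, 0.25]
    print(expected_rel_info(p, q, r), kl(p, r) - kl(p, q))
    q_zero = [0.5, 0.5, 0.0]   # D(P||Q) = +inf, D(P||R) finite
    print(expected_rel_info(p, q_zero, r), kl(p, r) - kl(p, q_zero))
```

The finiteness condition in the corollary rules out only the case in which both relative entropies are infinite, where the right side of (A15) would be the undefined difference of two infinities.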

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Shannon C.E. A Mathematical Theory of Communication. Bell Syst. Tech. J. 1948;27:379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x.
  • 2. Rice S.O. Communication in the Presence of Noise–Probability of Error for Two Encoding Schemes. Bell Syst. Tech. J. 1950;29:60–93. doi: 10.1002/j.1538-7305.1950.tb00933.x.
  • 3. Shannon C.E. Probability of Error for Optimal Codes in a Gaussian Channel. Bell Syst. Tech. J. 1959;38:611–656. doi: 10.1002/j.1538-7305.1959.tb03905.x.
  • 4. Elias P. Coding for Noisy Channels. IRE Conv. Rec. 1955;4:37–46.
  • 5. Feinstein A. Error Bounds in Noisy Channels without Memory. IRE Trans. Inf. Theory. 1955;1:13–14. doi: 10.1109/TIT.1955.1055131.
  • 6. Shannon C.E. Certain Results in Coding Theory for Noisy Channels. Inf. Control. 1957;1:6–25. doi: 10.1016/S0019-9958(57)90039-6.
  • 7. Fano R.M. Transmission of Information. Wiley; New York, NY, USA: 1961.
  • 8. Gallager R.G. A Simple Derivation of the Coding Theorem and Some Applications. IEEE Trans. Inf. Theory. 1965;11:3–18. doi: 10.1109/TIT.1965.1053730.
  • 9. Gallager R.G. Information Theory and Reliable Communication. Wiley; New York, NY, USA: 1968.
  • 10. Shannon C.E., Gallager R.G., Berlekamp E. Lower Bounds to Error Probability for Coding on Discrete Memoryless Channels, I. Inf. Control. 1967;10:65–103. doi: 10.1016/S0019-9958(67)90052-6.
  • 11. Shannon C.E., Gallager R.G., Berlekamp E. Lower Bounds to Error Probability for Coding on Discrete Memoryless Channels, II. Inf. Control. 1967;10:522–552. doi: 10.1016/S0019-9958(67)91200-4.
  • 12. Dobrushin R.L. Asymptotic Estimates of the Error Probability for Transmission of Messages over a Discrete Memoryless Communication Channel with a Symmetric Transition Probability Matrix. Theory Probab. Appl. 1962;7:270–300. doi: 10.1137/1107027.
  • 13. Dobrushin R.L. Optimal Binary Codes for Low Rates of Information Transmission. Theory Probab. Appl. 1962;7:208–213. doi: 10.1137/1107020.
  • 14. Kullback S., Leibler R.A. On Information and Sufficiency. Ann. Math. Stat. 1951;22:79–86. doi: 10.1214/aoms/1177729694.
  • 15. Csiszár I., Körner J. Graph Decomposition: A New Key to Coding Theorems. IEEE Trans. Inf. Theory. 1981;27:5–11. doi: 10.1109/TIT.1981.1056281.
  • 16. Barg A., Forney G.D., Jr. Random codes: Minimum Distances and Error Exponents. IEEE Trans. Inf. Theory. 2002;48:2568–2573. doi: 10.1109/TIT.2002.800480.
  • 17. Sason I., Shamai S. Performance Analysis of Linear Codes under Maximum-likelihood Decoding: A Tutorial. Found. Trends Commun. Inf. Theory. 2006;3:1–222. doi: 10.1561/0100000009.
  • 18. Ashikhmin A.E., Barg A., Litsyn S.N. A New Upper Bound on the Reliability Function of the Gaussian Channel. IEEE Trans. Inf. Theory. 2000;46:1945–1961. doi: 10.1109/18.868471.
  • 19. Haroutunian E.A., Haroutunian M.E., Harutyunyan A.N. Reliability Criteria in Information Theory and in Statistical Hypothesis Testing. Found. Trends Commun. Inf. Theory. 2007;4:97–263. doi: 10.1561/0100000008.
  • 20. Scarlett J., Peng L., Merhav N., Martinez A., Guillén i Fàbregas A. Expurgated Random-coding Ensembles: Exponents, Refinements, and Connections. IEEE Trans. Inf. Theory. 2014;60:4449–4462. doi: 10.1109/TIT.2014.2322033.
  • 21. Somekh-Baruch A., Scarlett J., Guillén i Fàbregas A. A Recursive Cost-Constrained Construction that Attains the Expurgated Exponent; Proceedings of the 2019 IEEE International Symposium on Information Theory; Paris, France. 7–12 July 2019; pp. 2938–2942.
  • 22. Haroutunian E.A. Estimates of the Exponent of the Error Probability for a Semicontinuous Memoryless Channel. Probl. Inf. Transm. 1968;4:29–39.
  • 23. Blahut R.E. Hypothesis Testing and Information Theory. IEEE Trans. Inf. Theory. 1974;20:405–417. doi: 10.1109/TIT.1974.1055254.
  • 24. Csiszár I., Körner J. Information Theory: Coding Theorems for Discrete Memoryless Systems. Academic; New York, NY, USA: 1981.
  • 25. Rényi A. On Measures of Information and Entropy. In: Neyman J., editor. Berkeley Symposium on Mathematical Statistics and Probability. University of California Press; Berkeley, CA, USA: 1961. pp. 547–561.
  • 26. Campbell L.L. A Coding Theorem and Rényi’s Entropy. Inf. Control. 1965;8:423–429. doi: 10.1016/S0019-9958(65)90332-3.
  • 27. Arimoto S. Topics in Information Theory. Bolyai; Keszthely, Hungary: 1975. Information Measures and Capacity of Order α for Discrete Memoryless Channels; pp. 41–52.
  • 28. Sason I., Verdú S. Arimoto-Rényi conditional entropy and Bayesian M-ary hypothesis testing. IEEE Trans. Inf. Theory. 2018;64:4–25. doi: 10.1109/TIT.2017.2757496.
  • 29. Fano R.M. Class Notes for Course 6.574: Statistical Theory of Information. Massachusetts Institute of Technology; Cambridge, MA, USA: 1953.
  • 30. Csiszár I. A Class of Measures of Informativity of Observation Channels. Period. Mat. Hung. 1972;2:191–213. doi: 10.1007/BF02018661.
  • 31. Sibson R. Information Radius. Z. Wahrscheinlichkeitstheorie Und Verw. Geb. 1969;14:149–161. doi: 10.1007/BF00537520.
  • 32. Csiszár I. Generalized Cutoff Rates and Rényi’s Information Measures. IEEE Trans. Inf. Theory. 1995;41:26–34. doi: 10.1109/18.370121.
  • 33. Arimoto S. Computation of Random Coding Exponent Functions. IEEE Trans. Inf. Theory. 1976;22:665–671. doi: 10.1109/TIT.1976.1055640.
  • 34. Candan C. Chebyshev Center Computation on Probability Simplex with α-divergence Measure. IEEE Signal Process. Lett. 2020;27:1515–1519. doi: 10.1109/LSP.2020.3018661.
  • 35. Poltyrev G.S. Random Coding Bounds for Discrete Memoryless Channels. Probl. Inf. Transm. 1982;18:9–21.
  • 36. Augustin U. Ph.D. Thesis. Universität Erlangen-Nürnberg; Erlangen, Germany: 1978. Noisy Channels.
  • 37. Tomamichel M., Hayashi M. Operational Interpretation of Rényi Information Measures via Composite Hypothesis Testing against Product and Markov Distributions. IEEE Trans. Inf. Theory. 2018;64:1064–1082. doi: 10.1109/TIT.2017.2776900.
  • 38. Polyanskiy Y., Verdú S. Arimoto Channel Coding Converse and Rényi Divergence; Proceedings of the 48th Annual Allerton Conference on Communication, Control, and Computing; Monticello, IL, USA. 29 September–1 October 2010; pp. 1327–1333.
  • 39. Shayevitz O. On Rényi Measures and Hypothesis Testing; Proceedings of the 2011 IEEE International Symposium on Information Theory; St. Petersburg, Russia. 31 July–5 August 2011; pp. 894–898.
  • 40. Verdú S. α-Mutual Information; Proceedings of the 2015 Information Theory and Applications Workshop (ITA); San Diego, CA, USA. 1–6 February 2015.
  • 41. Ho S.W., Verdú S. Convexity/Concavity of Rényi Entropy and α-Mutual Information; Proceedings of the 2015 IEEE International Symposium on Information Theory; Hong Kong, China. 15–19 June 2015; pp. 745–749.
  • 42. Nakiboglu B. The Rényi Capacity and Center. IEEE Trans. Inf. Theory. 2019;65:841–860. doi: 10.1109/TIT.2018.2861002.
  • 43. Nakiboglu B. The Augustin Capacity and Center. arXiv. 2018. arXiv:1803.07937.
  • 44. Dalai M. Some Remarks on Classical and Classical-Quantum Sphere Packing Bounds: Rényi vs. Kullback–Leibler. Entropy. 2017;19:355. doi: 10.3390/e19070355.
  • 45. Cai C., Verdú S. Conditional Rényi Divergence Saddlepoint and the Maximization of α-Mutual Information. Entropy. 2019;21:969. doi: 10.3390/e21100969.
  • 46. Vázquez-Vilar G., Martinez A., Guillén i Fàbregas A. A Derivation of the Cost-constrained Sphere-Packing Exponent; Proceedings of the 2015 IEEE International Symposium on Information Theory; Hong Kong, China. 15–19 June 2015; pp. 929–933.
  • 47. Wyner A.D. Capacity and Error Exponent for the Direct Detection Photon Channel. IEEE Trans. Inf. Theory. 1988;34:1449–1471. doi: 10.1109/18.21284.
  • 48. Csiszár I., Körner J. Information Theory: Coding Theorems for Discrete Memoryless Systems. 2nd ed. Cambridge University Press; Cambridge, UK: 2011.
  • 49. Rényi A. On Measures of Dependence. Acta Math. Hung. 1959;10:441–451. doi: 10.1007/BF02024507.
  • 50. van Erven T., Harremoës P. Rényi Divergence and Kullback-Leibler Divergence. IEEE Trans. Inf. Theory. 2014;60:3797–3820. doi: 10.1109/TIT.2014.2320500.
  • 51. Csiszár I., Matúš F. Information Projections Revisited. IEEE Trans. Inf. Theory. 2003;49:1474–1490. doi: 10.1109/TIT.2003.810633.
  • 52. Csiszár I. Information-type Measures of Difference of Probability Distributions and Indirect Observations. Stud. Sci. Math. Hung. 1967;2:299–318.
  • 53. Nakiboglu B. The Sphere Packing Bound via Augustin’s Method. IEEE Trans. Inf. Theory. 2019;65:816–840. doi: 10.1109/TIT.2018.2882547.
  • 54. Nakiboglu B. The Augustin Capacity and Center. Probl. Inf. Transm. 2019;55:299–342. doi: 10.1134/S003294601904001X.
  • 55. Vázquez-Vilar G. Error Probability Bounds for Gaussian Channels under Maximal and Average Power Constraints. arXiv. 2019. arXiv:1907.03163.
  • 56. Shannon C.E. Geometrische Deutung einiger Ergebnisse bei der Berechnung der Kanalkapazität. Nachrichtentechnische Z. 1957;10:1–4.
  • 57. Verdú S., Han T.S. A General Formula for Channel Capacity. IEEE Trans. Inf. Theory. 1994;40:1147–1157. doi: 10.1109/18.335960.
  • 58. Kemperman J.H.B. On the Shannon Capacity of an Arbitrary Channel. K. Ned. Akad. Van Wet. Indag. Math. 1974;77:101–115. doi: 10.1016/1385-7258(74)90000-6.
  • 59. Aubin J.P. Mathematical Methods of Game and Economic Theory. North-Holland; Amsterdam, The Netherlands: 1979.
  • 60. Luenberger D.G. Optimization by Vector Space Methods. Wiley; New York, NY, USA: 1969.
  • 61. Gastpar M., Rimoldi B., Vetterli M. To Code, or Not to Code: Lossy Source–Channel Communication Revisited. IEEE Trans. Inf. Theory. 2003;49:1147–1158. doi: 10.1109/TIT.2003.810631.
  • 62. Arimoto S. On the Converse to the Coding Theorem for Discrete Memoryless Channels. IEEE Trans. Inf. Theory. 1973;19:357–359. doi: 10.1109/TIT.1973.1055007.
  • 63. Sason I. On the Rényi Divergence, Joint Range of Relative Entropies, Measures and a Channel Coding Theorem. IEEE Trans. Inf. Theory. 2016;62:23–34. doi: 10.1109/TIT.2015.2504100.
  • 64. Dalai M., Winter A. Constant Compositions in the Sphere Packing Bound for Classical-quantum Channels. IEEE Trans. Inf. Theory. 2017;63:5603–5617. doi: 10.1109/TIT.2017.2726555.
  • 65. Nakiboglu B. The Sphere Packing Bound for Memoryless Channels. Probl. Inf. Transm. 2020;56:201–244. doi: 10.1134/S0032946020030011.
  • 66. Dalai M. Lower Bounds on the Probability of Error for Classical and Classical-quantum Channels. IEEE Trans. Inf. Theory. 2013;59:8027–8056. doi: 10.1109/TIT.2013.2283794.
  • 67. Shannon C.E. The Zero Error Capacity of a Noisy Channel. IRE Trans. Inf. Theory. 1956;2:8–19. doi: 10.1109/TIT.1956.1056798.
  • 68. Feder M., Merhav N. Relations Between Entropy and Error Probability. IEEE Trans. Inf. Theory. 1994;40:259–266. doi: 10.1109/18.272494.
  • 69. Einarsson G. Signal Design for the Amplitude-limited Gaussian Channel by Error Bound Optimization. IEEE Trans. Commun. 1979;27:152–158. doi: 10.1109/TCOM.1979.1094267.
  • 70. Anantharam V., Verdú S. Bits through Queues. IEEE Trans. Inf. Theory. 1996;42:4–18. doi: 10.1109/18.481773.
  • 71. Verdú S. The Exponential Distribution in Information Theory. Probl. Inf. Transm. 1996;32:86–95.
  • 72. Arikan E. On the Reliability Exponent of the Exponential Timing Channel. IEEE Trans. Inf. Theory. 2002;48:1681–1689. doi: 10.1109/TIT.2002.1003846.
  • 73. Polyanskiy Y., Poor H.V., Verdú S. Channel Coding Rate in the Finite Blocklength Regime. IEEE Trans. Inf. Theory. 2010;56:2307–2359. doi: 10.1109/TIT.2010.2043769.
  • 74. Royden H.L., Fitzpatrick P. Real Analysis. 4th ed. Prentice Hall; Boston, MA, USA: 2010.

Articles from Entropy are provided here courtesy of Multidisciplinary Digital Publishing Institute (MDPI)
