Skip to main content
Entropy logoLink to Entropy
. 2023 Feb 13;25(2):346. doi: 10.3390/e25020346

The Cauchy Distribution in Information Theory

Sergio Verdú 1
Editor: Sangun Park1
PMCID: PMC9955388  PMID: 36832712

Abstract

The Gaussian law reigns supreme in the information theory of analog random variables. This paper showcases a number of information theoretic results which find elegant counterparts for Cauchy distributions. New concepts such as that of equivalent pairs of probability measures and the strength of real-valued random variables are introduced here and shown to be of particular relevance to Cauchy distributions.

Keywords: information measures, Cauchy distribution, relative entropy, Kullback–Leibler divergence, differential entropy, Fisher’s information, entropy power inequality, f-divergence, Rényi divergence, mutual information, data transmission, lossy data compression

1. Introduction

Since the inception of information theory [1], the Gaussian distribution has emerged as the paramount example of a continuous random variable leading to closed-form expressions for information measures and extremality properties possessing great pedagogical value. In addition, the role of the Gaussian distribution as a ubiquitous model for analog information sources and for additive thermal noise has elevated the corresponding formulas for rate–distortion functions and capacity–cost functions to iconic status in information theory. Beyond discrete random variables, by and large, information theory textbooks confine their coverage and examples to Gaussian random variables.

The exponential distribution has also been shown [2] to lead to closed-form formulas for various information measures such as differential entropy, mutual information and relative entropy, as well as rate–distortion functions for Markov processes and the capacity of continuous-time timing channels with memory such as the exponential-server queue [3].

Despite its lack of moments, the Cauchy distribution also leads to pedagogically attractive closed-form expressions for various information measures. In addition to showcasing those, we introduce an attribute, which we refer to as the strength of a real-valued random variable, under which the Cauchy distribution is shown to possess optimality properties. Along with the stability of the Cauchy law, those properties result in various counterparts to the celebrated fundamental limits for memoryless Gaussian sources and channels.

To enhance readability and ease of reference, the rest of this work is organized in 120 items grouped into 17 sections, plus an appendix.

Section 2 presents the family of Cauchy random variables and their basic properties as well as multivariate generalizations, and the Rider univariate density which includes the Cauchy density as a special case and finds various information theoretic applications.

Section 3 gives closed-form expressions for the differential entropies of the univariate and multivariate densities covered in Section 2.

Introduced previously for unrelated purposes, the Shannon and η-transforms reviewed in Section 4 prove useful to derive several information theoretic results for Cauchy and related laws.

Applicable to any real-valued random variable and inspired by information theory, the central notion of strength is introduced in Section 5 along with its major properties. In particular, it is shown that convergence in strength is an intermediate criterion between convergence in probability and convergence in Lq, q>0, and that differential entropy is continuous with respect to the addition of independent vanishing strength noise.

Section 6 shows that, for any ρ>0 the maximal differential entropy density satisfying Elog1+|Z|ρθ can be obtained in closed form, but its shape (not just its scale) depends on the value of θ. In particular, the Cauchy density is the solution only if ρ=2, and θ=log4. In contrast, we show that, among all the random variables with a given strength, the centered Cauchy density has maximal differential entropy, regardless of the value of the constraint. This result suggests the definition of entropy strength of Z, as the strength of a Cauchy random variable whose differential entropy is the same as that of Z. Modulo a factor, entropy power is the square of entropy strength. Section 6 also gives a maximal differential entropy characterization of the standard spherical Cauchy multivariate density.

Information theoretic terminology for the logarithm of the Radon–Nikodym derivative, as well as its distribution, the relative information spectrum is given in Section 7. The relative information spectrum for Cauchy distributions is found and shown to depend on their location and scale through a single scalar. This is a rare property, not satisfied by most common families such as Gaussian, exponential, Laplace, etc. Section 8 introduces the notion of equivalent pairs of probability measures, which plays an important role not only in information theory but in statistical inference. Distinguishing P1 from Q1 has the same fundamental limits as distinguishing P2 from Q2 if (P1,Q1) and (P2,Q2) are equivalent pairs. Section 9 studies the interplay between f-divergences and equivalent pairs. A simple formula for the f-divergence between Cauchy distributions results from the explicit expression for the relative information spectrum found in Section 7. These results are then used to easily derive a host of explicit expressions for χ2-divergence, relative entropy, total variation distance, Hellinger divergence and Rényi divergence in Section 10, Section 11, Section 12, Section 13 and Section 14, respectively.

In addition to the Fisher information matrix of the Cauchy family, Section 15 finds a counterpart of de Bruijn’s identity [4] for convolutions with scaled Cauchy random variables, instead of convolutions with scaled Gaussian random variables as in the conventional setting.

Section 16 is devoted to mutual information. The mutual information between a Cauchy random variable and its noisy version contaminated by additive independent Cauchy noise exhibits a pleasing counterpart (modulo a factor of two) with the Gaussian case, in which the signal-to-noise ratio is now given by the ratio of strengths rather than variances. With Cauchy noise, Cauchy inputs maximize mutual information under an output strength constraint. The elementary fact that an output variance constraint translates directly into an input variance constraint does not carry over to input and output strengths, and indeed we identify non-Cauchy inputs that may achieve higher mutual information than a Cauchy input with the same strength. Section 16 also considers the dual setting in which the input is Cauchy, but the additive noise need not be. Lower bounds on the mutual information, attained by Cauchy noise, are offered. However, as the bounds do not depend exclusively on the noise strength, they do not rule out the possibility that a non-Cauchy noise with identical strength may be least favorable. If distortion is measured by strength, the rate–distortion function of a Cauchy memoryless source is shown to admit (modulo a factor of two) the same rate–distortion function as the memoryless Gaussian source with mean–square distortion, replacing the source variance by its strength. Theorem 17 gives a very general continuity result for mutual information that encompasses previous such results. While convergence in probability to zero of the input to an additive-noise transformation does not imply vanishing input-output mutual information, convergence in strength does under very general conditions on the noise distribution.

Some concluding observations about generalizations and open problems are collected in Section 17, including a generalization of the notion of strength.

Those definite integrals used in the main body are collected and justified in the Appendix A.

2. The Cauchy Distribution and Generalizations

In probability theory, the Cauchy (also known as Lorentz and as Breit–Wigner) distribution is the prime example of a real-valued random variable none of whose moments of order one or higher exists, and as such it is not encompassed by either the law of large numbers or the central limit theorem.

  1. A real-valued random variable V is said to be standard Cauchy if its probability density function is
    fV(x)=1π1x2+1,xR. (1)
    Furthermore, X is said to be Cauchy if there exist λ0 and μR such that X=λV+μ, in which case
    fX(x)=|λ|π1(xμ)2+λ2,xR, (2)
    where μ and |λ| are referred to as the location (or median) and scale, respectively, of the Cauchy distribution. If μ=0, (2) is said to be centered Cauchy.
  2. Since E[max{0,V}]=E[max{0,V}]=, the mean of a Cauchy random variable does not exist. Furthermore, E[|V|q]= for q1, and the moment generating function of V does not exist (except, trivially, at 0). The characteristic function of the standard Cauchy random variable is
    EeiωV=e|ω|,ωR. (3)
  3. Using (3), we can verify that a Cauchy random variable has the curious property that adding an independent copy to it has the same effect, statistically speaking, as adding an identical copy. In addition to the Gaussian and Lévy distributions, the Cauchy distribution is stable: a linear combination of independent copies remains in the family, and is infinitely divisible: it can be expressed as an n-fold convolution for any n. It follows from (3) that if {V1,V2,} are independent, standard Cauchy, and a is a deterministic sequence with finite 1-norm a1, then i=1aiVi has the same distribution as a1V. In particular, the time average of independent identically distributed Cauchy random variables has the same distribution as any of the random variables. The families {λV,λI} and {V+μ,μI}, with I any interval of the real line, are some of the simplest parametrized random variables that are not an exponential family.

  4. If Θ is uniformly distributed on [π2,π2], then tanΘ is standard Cauchy. This follows since, in view of (1) and (A1), the standard Cauchy cumulative distribution function is
    FV(x)=12+1πarctan(x),xR. (4)
    Therefore, V has unit semi-interquartile length. The functional inverse of (4) is the standard Cauchy quantile function given by
    QV(t)=tanπt12,t(0,1). (5)
  5. If X1 and X2 are standard Gaussian with correlation coefficient ρ(1,1), then X1/X2 is Cauchy with scale 1ρ2 and location ρ. This implies that the reciprocal of a standard Cauchy random variable is also standard Cauchy.

  6. Taking the cue from the Gaussian case, we say that a random vector is multivariate Cauchy if any linear combination of its components has a Cauchy distribution. Necessary and sufficient conditions for a characteristic function to be that of a multivariate Cauchy were shown by Ferguson [5]. Unfortunately, no general expression is known for the corresponding probability density function. This accounts for the fact that one aspect, in which the Cauchy distribution does not quite reach the wealth of information theoretic results attainable with the Gaussian distribution, is in the study of multivariate models of dependent random variables. Nevertheless, special cases of multivariate Cauchy distribution do admit some interesting information theoretic results as we will see below. The standard spherical multivariate Cauchy probability density function on Rn is (e.g., [6])
    fVn(x)=Γn+12πn+121+x2n+12, (6)
    where Γ(·) is the Gamma function. Therefore, Vn=(V1,,Vn) are exchangeable random variables. If X0,X1,,Xn are independent standard normal, then the vector X01Xn has the density in (6). With the aid of (A10), we can verify that any subset of k{1,,n1} components of Vn is distributed according to Vk. In particular, the marginals of (6) are given by (1). Generalizing (3), the characteristic function of (6) is
    EeitVn=et,tRn. (7)
  7. In parallel to Item 1, we may generalize (6) by dropping the restriction that it be centered at the origin and allowing ellipsoidal deformation, i.e., letting Zn=Λ12Vn+μ with μRn and a positive definite n×n matrix Λ. Therefore,
    fZn(x)=Γn+12πn+12det12(Λ)1+(xμ)Λ1(xμ)n+12. (8)

    While ρZn is a Cauchy random variable for ρRn{0}, (8) fails to encompass every multivariate Cauchy distribution—in particular, the important case of independent Cauchy random variables. Another reason the usefulness of the model in (8) is limited is that it is not closed under independent additions: if Vn and V¯n are independent, each distributed according to (6); then, Λ12Vn+Λ¯12V¯n, while multivariate Cauchy, does not have a density of the type in (8) unless Λ=αΛ¯ for some α>0.

  8. Another generalization of the (univariate) Cauchy distribution, which comes into play in our analysis, was introduced by Rider in 1958 [7]. With ρ>0 and βρ>1,
    fVβ,ρ(x)=κβ,ρ(1+|x|ρ)β,xR, (9)
    κβ,ρ=ρΓ(β)2Γ1ρΓβ1ρ. (10)

    In addition to the (β,ρ) parametrization in (9), we may introduce scale and location parameters by means of λVβ,ρ+μ, just as we did in the Cauchy case (β,ρ)=(1,2). Another notable special case is νVν+12,2, which is the centered Student-t random variable, itself equivalent to a Pearson type VII distribution.

3. Differential Entropy

  • 9.
    The differential entropy of a Cauchy random variable is
    h(λV+μ)=log|λ|+h(V), (11)
    h(V)=fV(t)logfV(t)dt=log(4π), (12)
    using (A3). Throughout this paper, unless the logarithm base is explicitly shown, it can be chosen by the reader as long as it is the same on both sides of the equation. For natural logarithms, the information measure unit is the nat.
  • 10.
    An alternative, sometimes advantageous, expression for the differential entropy of a real-valued random variable is feasible if its cumulative distribution function FX is continuous and strictly monotonic. Then, the quantile function is its functional inverse, i.e., FX(QX(t))=t for all t(0,1), which implies that Q˙X(t)fX(QX(t))=1 for all t(0,1). Moreover, since X and QX(U) with U uniformly distributed on [0,1] have identical distributions, we obtain
    h(X)=E[logfX(X)]=E[logfX(QX(U))]=01logQ˙X(t)dt. (13)

    Since (4) is indeed continuous and strictly monotonic, we can verify that we recover (12) by means of (5), (13) and (A2).

  • 11.
    Despite not having finite moments, an independent identically distributed sequence of Cauchy random variables {Zi} is information stable in the sense that
    1ni=1nlogfZ(Zi)h(Z),a.s. (14)
    because of the strong law of large numbers.
  • 12.
    With Vn distributed according to the standard spherical multivariate Cauchy density in (6), it is shown in [8] that
    Eloge1+Vn2=ψn+12+loge4+γ, (15)
    where γ is the Euler–Mascheroni constant and ψ(·) is the digamma function. Therefore, the differential entropy of (6) is, in nats, (see also [9])
    h(Vn)=n+12Eloge1+Vn2+n+12logeπlogeΓn+12 (16)
    =n+12loge(4π)+γ+ψn+12logeΓn+12, (17)
    whose growth is essentially linear with n: the conditional differential entropy
    h(Vn+1|Vn)=h(Vn+1)h(Vn) is monotonically decreasing with
    h(V2|V1)=32(γ+ψ(32))+loge4=2.306... (18)
    limnh(Vn+1|Vn)=12(1+γ+loge(4π))=2.054... (19)
  • 13.
    By the scaling law of differential entropy and its invariance to location, we obtain
    hΛ12Vn+μ=h(Vn)+12log|det(Λ)|. (20)
  • 14.
    Invoking (A6), we obtain a closed-form formula for the differential entropy, in nats, of the generalized Cauchy distribution (9) as
    h(Vβ,ρ)=logeκβ,ρ+βEloge1+|Vβ,ρ|ρ (21)
    =logeκβ,ρ+βψ(β)βψβ1ρ, (22)
    with κβ,ρ defined in (10).
  • 15.
    The Rényi differential entropy of order α(0,1)(1,) of an absolutely continuous random variable with probability density function fX is
    hα(X)=11αlogfXα(t)dt. (23)
    For Cauchy random variables, we obtain, with the aid of (A12),
    hα(λV+μ)=log|λ|+hα(V), (24)
    hα(V)=12α1αlogπ+11αlogΓ(α12)Γ(α),α>12, (25)
    which is infinite for α(0,12], converges to log(4π) (cf. (12)) as α1, and to logπ, the reciprocal of the mode height, as α.
  • 16.
    Invoking (A13), the Rényi differential entropy of order α1βρ,1(1,) of the generalized Cauchy distribution (9) is
    hα(Vβ,ρ)=α1αlogκβ,ρ+11αlog2Γβα1ρΓ1ρρΓ(βα). (26)

4. The Shannon- and η-Transforms

In this section, we recall the definitions of two notions introduced in [10] for the unrelated purpose of expressing the asymptotic singular value distribution of large random matrices.

  • 17.
    The Shannon transform of a nonnegative random variable X is the function VX:[0,)[0,), defined by
    VX(θ)=Eloge1+θX. (27)

    Unless VX(θ)= for all θ>0 (e.g., if X has the log-Cauchy density 1πx11+log2x,x>0), or VX(θ)=0, θ0, (which occurs if X=0 a.s.), the Shannon transform is a strictly concave continuous function from VX(0)=0, which grows without bound as θ.

  • 18.
    If V is standard Cauchy, then (A4) results in
    VV2(θ2)=2loge1+|θ|, (28)
    and the handy relationship
    Elogβ2+λ2V2=2log|β|+|λ|. (29)
  • 19.
    For the distribution in (9) with (β,ρ)=(2,2), (A7) results in
    VV2,22(θ2)=2loge1+|θ|2|θ|1+|θ|. (30)
  • 20.
    The η-transform ηX:[0,)(0,1] of a non-negative random variable is defined as the function
    ηX(θ)=E11+θX=1θV˙X(θ), (31)
    which is intimately related to the Cauchy–Stieltjes transform [11]. For example,
    ηV2(θ2)=11+|θ|, (32)
    ηV2,22(θ2)=1+2|θ|(1+|θ|)2. (33)

5. Strength

The purpose of this section is to introduce an attribute which is particularly useful to compare random variables that do not have finite moments.

  • 21.
    The strength ς(Z)[0,+] of a real-valued random variable Z is defined as
    ς(Z)=infς>0:Elog1+Z2ς2log4. (34)
    It follows that the only random variable with zero strength is Z=0, almost surely. If the inequality in (34) is not satisfied for any ς>0, then ς(Z)=. Otherwise, ς(Z) is the unique positive solution ς>0 to
    Elog1+Z2ς2=log4. (35)

    If ς(Z)ς, then (35) holds with ≤.

  • 22.
    The set of probability measures whose strength is upper bounded by a given finite nonnegative constant,
    Aς=PZ:ς(Z)ς, (36)
    is convex: The set A0 is a singleton as seen in Item 21, while, for 0<ς<, we can express (36) as
    Aς=PZ:Elog1+Z2ς2log4. (37)

    Therefore, if PZ0Aς and PZ1Aς, we must have αPZ1+(1α)PZ0Aς.

  • 23.
    The peculiar constant in the definition of strength is chosen so that if V is standard Cauchy, then its strength is ς(V)=1 because, in view of (29),
    Elog1+V2=log4. (38)
  • 24.
    If Z=kR, a.s., then its strength is
    ς(Z)=|k|3. (39)
  • 25.
    The left side of (35) is the Shannon transform of Z2 evaluated at ς2, which is continuous in ς2. If ς(Z)(0,) then, (35) can be written as
    ς2(Z)=1VZ21(loge4), (40)
    where, on the right side, we have denoted the functional inverse of the Shannon transform. Clearly, the square root of the right side of (40) cannot be expressed as the expectation with respect to Z of any b:RR that does not depend on PZ. Nevertheless, thanks to (37), (36) can be expressed as
    Aς=PZ:Ebς2Z1,withbς2(x)=log41+x2ς2. (41)
  • 26.

    Theorem 1.

    The strength of a real-valued random variable satisfies the following properties:
    • (a) 
      ς(λZ)=|λ|ς(Z). (42)
    • (b) 
      ς2(Z)13E[Z2], (43)
      with equality if and only if |Z| is deterministic.
    • (c) 
      If 0<q<2, and Zq=E1q[|Z|q]<, then
      ς(Z)κq1qZq,withκq=maxx>0log4(1+x2)xq. (44)
    • (d) 
      If V is standard Cauchy, independent of X, then ς(X+V) is the solution to
      VX2(ς+1)2=2log21+ς1, (45)
      if it exists, otherwise, ς(X+V)=. Moreover, ≤ holds in (45) if ς(X+V)ς.
    • (e) 
      2log2min{1,ς(Z)}Elog1+Z22log2max{1,ς(Z)}. (46)
    • (f) 
      If 0<ς(Z)<, then
      h(Z)=log(4πς(Z))D(Zς(Z)V), (47)
      where V is standard Cauchy, and D(XY) stands for the relative entropy with reference probability measure PY and dominated measure PX.
    • (g) 
      h(Z)<ς(Z)<Elog1+Z2<. (48)
    • (h) 
      If V is standard Cauchy, then
      ς(Z)<andh(Z)RD(ZλV)<,forallλ>0. (49)
    • (i) 
      The finiteness of strength is sufficient for the finiteness of the entropy of the integer part of the random variable, i.e.,
      H(Z)=ς(Z)=.
    • (j) 
      If ZnZ in Lq for any q(0,1], then ς(Zn)ς(Z).
    • (k) 
      Zn0i.p.Elog1+Zn20ς(Zn)0. (50)
    • (l) 
      If ς(Xn)0, then ς(Z+Xn)ς(Z).
    • (m) 
      If ς(Xn)0, ς(Z)< and Z is independent of Xn, then h(Z+Xn)h(Z).

    Proof. 

    For the first three properties, it is clear that they are satisfied if ς(Z)=0, i.e., Z=0 almost surely.
    • (a)
      If ς2(0,) is the solution to (35), then λ2ς2 is a solution to (35) with λZ taking the role of Z. If (35) has no solution, neither does its version in which λZ takes the role of Z.
    • (b)
      Jensen’s inequality applied to the left side of (35) results in 3ς2E[Z2]. The strict concavity of log(1+t) implies that equality holds if and only if Z2 is deterministic. If (35) has no solution, the same reasoning implies that E[Z2]=.
    • (c)
      First, it is easy to check that, for q(0,2), the function fq:(0,)(0,) given by fq(t)=tqlog4(1+t2) attains its maximum κq at a unique point. Assume ς(Z)(0,). Since κqtqlog4(1+t2) for all t>0, letting t=|Z|/ς(Z) and taking expectations, (35) (choosing 4 as the logarithm base) results in
      κqςq(Z)E|Z|q1, (51)
      which is the same as (44). If ς(Z)=, then =E[log(1+Z2)]κqE[|Z|q].
    • (d)
      Invoking (A4) with α2=ς2+x2 and |sinβ|=ςx2+ς2, we obtain
      Elog1+(x+V)2ς2=log(1+ς)2+x2ς2 (52)
      =log1+x2(1+ς)22logςς+1. (53)
      Substituting x by X and averaging over X, the result follows from the definition of strength.
    • (e)
      The result holds trivially if either ς(Z)=0 or ς(Z)=. Otherwise, we simply rewrite (35) as
      2log(2ς(Z))=Elogς2(Z)+Z2, (54)
      and upper/lower bound the right side by Elog1+Z2.
    • (f)
      D(Zς(Z)V)=h(Z)+log(ς(Z)π)+Elog1+Z2ς2(Z) (55)
      =log(4πς(Z))h(Z), (56)
      where (55) and (56) follow from (2) and (35), respectively.
    • (g)
      • If ς(Z)<, then Elog1+Z2< and h(Z)< follow from (46) and (47), respectively.
      • If Elog(1+Z2)<, the dominated convergence theorem implies
        limςElog1+Zς2=0. (57)
        Excluding the case Z=0 a.s. for which both Elog(1+Z2) and ς(Z) are zero, we have
        limς0Elog1+Zς2=limς0VZ21ς2=. (58)
        Since (35) is continuous in ς, it must have a finite solution in view of (57) and (58).
    • (h)
      It is sufficient to assume λ=1 for the condition on the right of (49) because the condition on the left holds if and only if it holds for αZ, for any α>0 and D(αZαV)=D(ZV). If h(Z)<, then
      D(ZV)=h(Z)+logπ+Elog1+Z2, (59)
      which is finite unless either h(Z)= or E[log(1+Z2)]=. This establishes ⟹ in view of (48). To establish ⟸, it is enough to show that
      D(ZV)<Elog1+Z2<, (60)
      in view of (48) and the fact that, according to (59), h(Z)> if both D(ZV) and Elog1+Z2 are finite. To show (60), we invoke the following variational representation of relative entropy (first noted by Kullback [12] for absolutely continuous random variables): If PZPV, then
      D(ZV)=maxQ:QPVElogdQdPV(Z), (61)
      attained only at Q=PZ. Let Q be the absolutely continuous random variable with probability density function
      q(x)=loge24|x|loge2|x|1{|x|2}+181{|x|<2}. (62)
      Then,
      >D(ZV)>Elogq(Z)fV(Z)=E1{|Z|2}logπloge24+log1+Z2log|Z|loge2|Z| (63)
      +E1{|Z|<2}logπ8+log1+Z2 (64)
      >15E1{Z2}log1+Z2+log5π8 (65)
      15Elog1+Z2+45log5log8π, (66)
      where (65) holds since
      45log(1+x2)log(πloge2)+2logloge|x|+log|x|,|x|>2. (67)
    • (i)
      ς(Z)<Elog(1+Z2)< (68)
      Elog(1+|Z|)< (69)
      H(Z)<, (70)
      where (68)–(70) follow from (48), log(1+x2)2log(1+|x|), and p. 3743 in [13], respectively.
    • (j)
      If ς(Z)=0, then Z=0 a.e., and the result follows from (44). For all (x,z)R2,
      loge1+(x+z)21+z2loge1+12(x2+|x|4+x2) (71)
      2q|x|q, (72)
      where (71) follows by maximizing the left side over zR. Denote the difference between the right side and the left side of (72) by fq(x), an even function which satisfies fq(0)=0, and
      f˙q(x)=2xq124+x2>0,x>0,0<q1. (73)
      Therefore, (72) follows. Assuming 0<ς(Z)<, we have
      Elog(1+Zn2)Elog(1+Z2)Elog(1+Zn2)log(1+Z2) (74)
      2qE|ZnZ|qloge. (75)
      Now, because of the scaling property in (42), we may assume without loss of generality that ς(Z)=1. Thus, (74) and (75) result in
      Elog(1+Zn2)log42qE|ZnZ|qloge, (76)
      which requires that ς(Zn)1, since, by assumption, the right side vanishes. Assume now that ς(Z)=, and therefore, Elog(1+Z2)=. Inequality (75) remains valid in this case, implying that, as soon as the right side is finite (which it must be for all sufficiently large n), Elog(1+Zn2)=, and therefore, ς(Zn)= in view of (48).
    • (k)
      • 1st ⟸
             For any ϵ>0, Markov’s inequality results in
        P[|Zn|>ϵ]=Plog1+Zn2>log1+ϵ2Elog1+Zn2log1+ϵ2. (77)
      • First, we show that, for any α>0, we have
        Elog1+Zn20Elog1+αZn20. (78)
        The case 0<α<1 is trivial. The case α>1 follows because Elog1+Zn20 implies
        Elog1+αZn2=Elog1+(α1)Zn2, (79)
        where ≥ is obvious, and ≤ holds because
        log1+αt2=log1+t2+log1+(α1)t21+t2 (80)
        log1+t2+log1+(α1)t2. (81)
        If ς(Zn)= infinitely often, so is Elog1+Zn2 in view of (48). Assume that lim supς(Zn)=ς(0,], and ς(Zn) is finite for all sufficiently large. Then, there is a subsequence such that ς(Zni), and
        log4=Elog1+Zniς(Zni)2Elog1+Zniλ2, (82)
        for all sufficiently large i and λ<ς. Consequently, (78) implies that Elog1+Zn2¬0.
      • 2nd ⟸
           Suppose that Elog1+Zn20. Therefore, there is a subsequence along which Elog1+Zni2>η>0. If ηlog4, then ς(Zni)>1 along the subsequence. Because of the continuity of the Shannon transform and the fact that it grows without bound as its argument goes to infinity (Item 25), if 0<η<log4, we can find 1<α< such that Elog1+αZni2>log4, which implies ς(Zni)>α1/2. Therefore, ς(Zn)0 as we wanted to show.
    • (l)
      We start by showing that
      Elog1+Xn20Ef(Xn)0, (83)
      where we have denoted the right side of (71) with arbitrary logarithm base by f(x). Since f˙(x)=2loge4+x2, it is easy to verify that
      0f(x)log(1+x2)log43,xR, (84)
      where the lower and upper bounds are attained uniquely at x=0 and |x|=12, respectively. The lower bound results in ⟸ in (83). To show ⟹, decompose, for arbitrary ϵ>0,
      Ef(Xn)=Ef(Xn)1{|Xn|<ϵ}+Ef(Xn)1{|Xn|ϵ} (85)
      f(ϵ)+Ef(Xn)1{|Xn|ϵ} (86)
      f(ϵ)+AϵElog1+Xn21{|Xn|ϵ} (87)
      f(ϵ)+Aϵϵ3, (88)
      where
      Aϵ=1+log43log(1+ϵ2), (89)
      (87) holds from the upper bound in (84), and the fact that (89) is decreasing in ϵ, and (88) holds for all sufficiently large n if Elog1+Xn20. Since the right side of (88) goes to 0 as ϵ0, (83) is established. Assume 0<ς(Z)<. From the linearity property (42), we have ς(Z+Xn)=ς(Z)·ς(Z¯+X¯n) with Z¯=ς1(Z)Z and X¯n=ς1(Z)Xn which satisfies ς(X¯n)0. Therefore, we may restrict attention to ς(Z)=1 without loss of generality. Following (71) and (74), and abbreviating Zn=Z+Xn, we obtain
      Elog(1+Zn2)log4Elog(1+Zn2)log(1+Z2) (90)
      Ef(Xn). (91)
      Thus, the desired result follows in view of (50) and (83). To handle the case ς(Z)=, we use the same reasoning as in the proof of (i) since (83) remains valid in that case.
    • (m)
      If ς(Z)=0, then Z=0 a.s., h(Z)= and h(Xn) in view of Part (f). Assume henceforth that ς(Z)>0. Since h(Z+Xn)h(Z), it suffices to show
      lim supnh(Xn+Z)h(Z). (92)
      Under the assumptions, Part (l) guarantees that
      ς(Xn+Z)ς(Z). (93)
      If V is a standard Cauchy random variable, then ς(Z+Xn)Vς(Z)V in distribution as the characteristic function converges: eς(Z+Xn)|t|eς(Z)|t| for all t. Analogously, according to Part (k), Z+XnDZ since Xn0 in probability. Since the strength of Xn+Z is finite for all sufficiently large n, we may invoke (47) to express, for those n,
      h(Xn+Z)h(Z)=logς(Z+Xn)ς(Z)D(Z+Xnς(Z+Xn)V)+D(Zς(Z)V). (94)
      The lower semicontinuity of relative entropy under weak convergence (which, in turn, is a corollary to the Donsker–Varadhan [14,15] variational representation of relative entropy) results in
      lim infnD(Z+Xnς(Z+Xn)V)D(Zς(Z)V), (95)
      because Z+XnDZ and ς(Z+Xn)VDς(Z)V. Therefore, (92) follows from (94) and (95).
         □
  • 27.
    In view of (42) and Item 23, ς(λV)=|λ| if V is standard Cauchy. Furthermore, if X1 and X2 are centered independent Cauchy random variables, then their sum is centered Cauchy with
    ς(X1+X2)=ς(X1)+ς(X2). (96)
    More generally, it follows from Theorem 1-(d) that, if X1 is centered Cauchy, and (96) holds for X2=αX and all αR, then X must be centered Cauchy. Invoking (52), we obtain
    ς(λV+μ)=|λ|3+134λ2+3μ2, (97)
    which is also valid for λ=0 as we saw in Item 24.
  • 28.
    If X is standard Gaussian, then ς2(X)=0.171085, and ς2(σX)=σ2ς2(X). Therefore, if X1 and X2 are zero-mean independent Gaussian random variables, then
    ς2(X1+X2)=ς2(X1)+ς2(X2). (98)
    Thus, in this case, ς(X1+X2)<ς(X1)+ς(X2).
  • 29.
    It follows from Theorem 1-(d) that, with X independent of standard Cauchy V, we obtain ς(X+V)>ς(X)+ς(V) whenever X is such that
    VX2(2+ς(X))2>2loge1+ς(X)2+ς(X). (99)
    An example is the heavy-tailed probability density function
    fX(x)=1πlog4(1+x2)1+x2, (100)
    for which 7.0158=ς(X+V)>ς(X)+ς(V)=6.8457.
  • 30.
    Using (A8), we can verify that, if X is zero-mean uniform with variance σ2, then
    ς2(X)=3c2σ2=0.221618σ2, (101)
    where c is the solution to loge(1+c2)+2carctan(c)=2+loge4.
  • 31.
    We say that Zn0 in strength if ς(Zn)0. Parts (j) and (k) of Theorem 1 show that this convergence criterion is intermediate between the traditional in probability and Lq criteria. It is not equivalent to either one: If
    Zn=0,with probability11n;2n,with probability1n, (102)
    then ς(Zn)1, while Zn0 in probability. If, instead, Zn=32n, with probability 1n, then Zn0 in strength, but not in Lq for any 0<q.
  • 32.
    The assumption in Theorem 1-(m) that Xn0 in strength cannot be weakened to convergence in probability. Suppose that Xn is absolutely continuous with probability density function
    fXn(t)=n1,t0,1n;0,t(,0)1n,2;1nloge2tloge2t,t[2,). (103)
    We have Xn0 in probability since, regardless of how small ϵ>0, P[Xn>ϵ]=1n for all n1ϵ. Furthermore,
    h(Xn+Z)h(Xn)=, (104)
    because (103) is the mixture of a uniform and an infinite differential entropy probability density function, and differential entropy is concave. We conclude that h(Xn+Z)h(Z), since h(Z)<.
  • 33.
    The following result on the continuity of differential entropy is shown in [16]: if X and Z are independent, E[|Z|]< and E[|X|]<, then
    limϵ0h(ϵX+Z)=h(Z). (105)

    This result is weaker than Theorem 1-(m) because finite first absolute moment implies finite strength as we saw in (44), and ϵX0 in L1 if ϵ0, and therefore, it vanishes in strength too.

  • 34.
    If Z and V are centered and standard Cauchy, respectively, then minλD(ZλV) is achieved by λ=ς(Z). Otherwise, in general, this does not hold. Since D(ZλV)=VZ2λ2h(Z)+loge(πλ), the minimum is attained at the solution to
    ηZ21λ2=12, (106)
    where we have used the η-transform in (31). If Z=V2,2, recalling (32), (106), results in λ=21, while ς(V2,2)=0.302.
  • 35.
    Using (28) and the concavity of log(1+x), we can verify that
    ς(Xα)ας(X1)+(1α)ς(X0),XααPX1+(1α)PX0, (107)
    if X0 and X1 are centered Cauchy, or, more generally, if X0=λ0X, X1=λ1X and VX2(θ2) is concave on θ. Not only is this property not satisfied if X=1 but (107) need not hold in that case, as we can verify numerically for α=0.1, λ1=1 and λ0>20.

6. Maximization of Differential Entropy

  • 36.
    Among random variables with a given second moment (resp. first absolute moment), differential entropy is maximized by the zero-mean Gaussian (resp. Laplace) distribution. More generally, among random variables with a given p-absolute moment μ, differential entropy is maximized by the parameter-p Subbotin (or generalized normal) distribution with p-absolute moment μ [17]
    fX(x)=p11p2Γ(1p)μ1pe|x|ppμ,xR. (108)
    Among nonnegative random variables with a given mean, differential entropy is maximized by the exponential distribution. In those well-known solutions, the cost function is an affine function of the negative logarithm of the maximal differential entropy probability density function. Is there a cost function such that, among all random variables with a given expected cost, the Cauchy distribution is the maximal differential entropy solution? To answer this question, we adopt a more general viewpoint. Consider the following result, whose special case ρ=2 was solved in [18] using convex optimization:

    Theorem 2.

    Fix ρ>0 and θ>0.
    maxZ:Eloge1+|Z|ρθh(Z)=h(Vβ,ρ), (109)
    where Vβ,ρ is defined in (9), the right side of (109) is given in (22), and β>ρ1 is the solution to
    θ=ψ(β)ψβ1ρ. (110)
    Therefore, the standard Cauchy distribution is the maximal differential entropy distribution provided that ρ=2 and θ=loge4.

    Proof. 

    • (a)
      For every ρ>0 and θ>0, there is a unique β>ρ1 that satisfies (110) because the function of β on the right side is strictly monotonically decreasing, grows without bound as β1ρ, and goes to zero as β.
    • (b)
      For any Z which satisfies Eloge1+|Z|ρθ, its relative entropy, in nats, with respect to Vβ,ρ is
      D(ZVβ,ρ)=h(Z)logeκβ,ρ+βEloge1+|Z|ρ (111)
      h(Z)logeκβ,ρ+βθ (112)
      =h(Z)logeκβ,ρ+βψ(β)βψβ1ρ (113)
      =h(Vβ,ρ)h(Z), (114)
      where (113) and (114) follow from (110) and (22), respectively. Since relative entropy is nonnegative, and zero only if both measures are identical, not only does (2) hold but any random variable other than Vβ,ρ achieves strictly lower differential entropy.
         □
  • 37.

    An unfortunate consequence stemming from Theorem 2 is that, while we were able to find out a cost function such that the Cauchy distribution is the maximal differential entropy distribution under an average cost constraint, this holds only for a specific value of the constraint. This behavior is quite different from the classical cases discussed in Item 36 for which the solution is, modulo scale, the same regardless of the value of the cost constraint. As we see next, this deficiency is overcome by the notion of strength introduced in Section 5.

  • 38.

    Theorem 3.

    Strength constraint. The differential entropy of a real-valued random variable with strength ς(Z) is upper bounded by
    h(Z)log4πς(Z). (115)
    If 0<ς(Z)<, equality holds if and only if Z has a centered Cauchy density, i.e., Z=λV for some λ>0.

    Proof. 

    • (a)
      If Z is not an absolutely continuous random variable, or more generally, h(Z)= such as in the case ς(Z)=0 in which Z=0 with probability one, then (115) is trivially satisfied.
    • (b)
      If 0<ς(Z)< and h(Z)>, then we invoke (47) to conclude that not only does (115) hold, but it is satisfied with equality if and only if Z=ς(Z)V.
       □
  • 39.
    The entropy power of a random variable Z is the variance of a Gaussian random variable whose differential entropy is h(Z), i.e.,
    N(Z)=12πeexp2h(Z). (116)
    While the power of a Cauchy random variable is infinite, its entropy power is given by
    N(λV+μ)=12πeexp2h(λV+μ)=8πλ2e. (117)
    In the same spirit as the definition of entropy power, Theorem 3 suggests the definition of NC(Z), the entropy strength of Z, as the strength of a centered Cauchy random variable whose differential entropy is h(Z), i.e., h(Z)=hNC(Z)V. Therefore,
    NC(Z)=14πexph(Z) (118)
    =ς(Z)expDZς(Z)V (119)
    ς(Z), (120)
    where (119) follows from (56), and (120) holds with equality if and only if Z is centered Cauchy. Note that, for all (α,μ)R2,
    NC(αZ+μ)=|α|NC(Z). (121)
    Comparing (116) and (118), we see that entropy power is simply a scaled version of the entropy strength squared,
    N(Z)=8πeNC2(Z). (122)
    The entropy power inequality (e.g., [19,20]) states that, if X1 and X2 are independent real-valued random variables, then
    N(X1+X2)N(X1)+N(X2), (123)
    regardless of whether they have moments. According to (122), we may rewrite the entropy power inequality (123) replacing each entropy power by the corresponding squared entropy strength. Therefore, the squared entropy strength of the sum of independent random variables satisfies
    NC2(X1+X2)NC2(X1)+NC2(X2). (124)

    It is well-known that equality holds in (123), and hence (124), if and only if both random variables are Gaussian. Indeed, if X1 and X2 are centered Cauchy with respective strengths ς1>0 and ς2>0, then (124) becomes ς1+ς22>ς12+ς22.

  • 40.
    Theorem 3 implies that any random variable with infinite differential entropy has infinite strength. There are indeed random variables with finite differential entropy and infinite strength. For example, let Z[2,) be an absolutely continuous random variable with probability density function
    fZ(t)=0.473991...loge2n,tn,n+1n,n{2,3,};0,elsewhere. (125)

    Then, h(Z)=1.99258... nats, while the entropy of the quantized version as well as the strength satisfy H(Z)==ς(Z).

  • 41.
    With the same approach, we may generalize Theorem 3 to encompass the full slew of the generalized Cauchy distributions in (9). To that end, fix ρ>0 and define the (ρ,θ)-strength of a random variable as
    ςρ,θ(Z)=infς>0:Eloge1+Zςρθ. (126)
    Therefore, ςρ,θ(Z)=ς(Z) for (ρ,θ)=(2,loge4), and if (β,ρ,θ) satisfy (110), then ςρ,θ(Vβ,ρ)=1. As in Item 25, if ςρ,θ(Z)(0,), we have
    ςρ,θρ(Z)=1V|Z|ρ1(θ). (127)
  • 42.

    Theorem 4.

    Generalized strength constraint. Fix ρ>0 and θ>0. The differential entropy of a real-valued random variable with (ρ,θ)-strength ςρ,θ(Z) is upper bounded by
    h(Z)logςρ,θ(Z)+h(Vβ,ρ), (128)
    where β is given by the solution to (110), Vβ,ρ has the generalized Cauchy density (9), and h(Vβ,ρ) is given in (21). If ςρ,θ(Z)<, equality holds if and only if Z is a constant times Vβ,ρ.

    Proof. 

    As with Theorem 3, in the proof, we may assume 0<ςρ,θ(Z)< to avoid trivialities. Then,
    Eloge1+Zςρ,θ(Z)ρ=θ, (129)
    and, in nats,
    DZςρ,θ(Z)Vβ,ρ=h(Z)logeκβ,ρςρ,θ(Z)+βEloge1+Zςρ,θ(Z)ρ (130)
    =h(Z)logeκβ,ρςρ,θ(Z)+βθ (131)
    =h(Z)logeκβ,ρςρ,θ(Z)+βψ(β)βψβ1ρ (132)
    =h(Z)+logeςρ,θ(Z)+h(Vβ,ρ), (133)
    where (130), (131), (132), and (133) follow from (9), (129), (110), and (22), respectively.    □
  • 43.

    In the multivariate case, we may find a simple upper bound on differential entropy based on the strength of the norm of the random vector.

    Theorem 5.

    The differential entropy of a random vector Zn is upper bounded by
    h(Zn)nlogς(Zn)+n+12log(4π)logΓn+12. (134)

    Proof. 

    As in the proof of Theorem 3, we may assume that 0<ς(Zn)<. As usual, Vn denotes the standard spherical multivariate Cauchy density in (6). Since for α0, fαVn(xn)=|α|nfVn(α1xn), we have
    D(Znς(Zn)Vn)=h(Zn)Elogfς(Zn)Vn(Zn) (135)
    =h(Zn)+nlogς(Zn)logΓn+12πn+12+n+12Elog1+Zn2ς2(Zn) (136)
    =h(Zn)+nlogς(Zn)logΓn+12+n+12log(4π), (137)
    where (136) and (137) follow from (6) and the definition of strength, respectively.    □

    For n=1, Theorem 5 becomes the bound in (115). For n=2,3,, the right side of (15) is greater than loge4, and, therefore, ς(Zn)>1. Consequently, in the multivariate case, there is no Zn such that (134) is tight.

  • 44.
    To obtain a full generalization of Theorem 3 in the multivariate case, it is advisable to define the strength of a random n-vector as
    ς(Zn)=infς>0:ElogfVnς1ZnhVn (138)
    =ς2,θn(Zn) (139)
    for θn=ψn+12+γ+loge4. To verify (139), note (15)–(17). Notice that ς(λVn)=|λ| and for n=1, (138) is equal to (34). The following result provides a maximal differential entropy characterization of the standard spherical multivariate Cauchy density.

    Theorem 6.

    Let Vn have the standard multivariate Cauchy density (6), Then,
    h(Zn)nlogς(Zn)+h(Vn), (140)
    where h(Vn) is given in (17). If 0<ς(Zn)<, equality holds in (140) if and only if Zn=λVn for some λ0.

    Proof. 

    Assume 0<ς(Zn)<. Then,
    DZnς(Zn)Vn=h(Zn)+nlogς(Zn)ElogfVnς1(Zn)Zn (141)
    =h(Zn)+nlogς(Zn)+h(Vn) (142)
    in view of (138). Hence, the difference between right and left sides of (140) is equal to zero if and only if Zn=λVn for some λ0; otherwise, it is positive.    □

7. Relative Information

  • 45.
    For probability measures P and Q on the same measurable space (A,F), such that PQ, the logarithm of their Radon–Nikodym derivative is the relative information denoted by
    ıPQ(x)=logdPdQ(x). (143)
  • 46.
    As usual, we may employ the notation ıXY(x) to denote ıPXPY(x). The distributions of the random variables ıXY(X) and ıXY(Y) are referred to as relative information spectra (e.g., [21]). It can be shown that there is a one-to-one correspondence between the cumulative distributions of ıXY(X) and ıXY(Y). For example, if they are absolutely continuous random variables with respective probability density functions fXY and f¯XY, then
    fXY(α)=exp(α)f¯XY(α),αR. (144)
    Obviously, the distributions of ıXY(X) and dPXdPY(X) determine each other. One caveat is that relative information may take the value . It can be shown that
    P[ıXY(X)=]=0, (145)
    P[ıXY(Y)=]=1Eexp(ıXY(X)). (146)
  • 47.

    The information spectra determine all measures of the distance between the respective probability measures of interest (e.g., [22,23]), including f-divergences and Rényi divergences. For example, the relative entropy (or Kullback–Leibler divergence) of the dominated measure P with respect to the reference measure Q is the average of the relative information when the argument is distributed according to P, i.e., D(XY)=E[ıXY(X)]. If P¬Q, then D(PQ)=.

  • 48.
    The information spectra also determine the fundamental trade-off in hypothesis testing. Let αν(P1,P0) denote the minimal probability of deciding P0 when P1 is true subject to the constraint that the probability of deciding P1 when P0 is true is no larger than ν. A consequence of the Neyman–Pearson lemma is
    αν(P1,P0)=minγRPıP1P0(Y1)γexp(γ)νPıP1P0(Y0)>γ, (147)
    where Y0P0 and Y1P1.
  • 49.
    Cauchy distributions are absolutely continuous with respect to each other and, in view of (2),
    ıλ1V+μ1λ0V+μ0(x)=log|λ1||λ0|+log(xμ0)2+λ02(xμ1)2+λ12. (148)
  • 50.
    The following result, proved in Item 58, shows that the relative information spectrum corresponding to Cauchy distributions with respective scale/locations (λ1,μ1) and (λ0,μ0) depends on the four parameters through the single scalar
    ζ(λ1,μ1,λ0,μ0)=λ12+λ02+(μ1μ0)22|λ0λ1|1, (149)
    where equality holds if and only if (λ1,μ1)=(λ0,μ0).

    Theorem 7.

    Suppose that λ1λ00, and V is standard Cauchy. Denote
    Z=dPλ1V+μ1dPλ0V+μ0(λ1V+μ1). (150)
    Then,
    • (a) 
      EZ=ζ(λ1,μ1,λ0,μ0), (151)
    • (b) 
      Z has the same distribution as the random variable
      ζ+ζ21cosΘ, (152)
      where Θ is uniformly distributed on [π,π] and ζ=ζ(λ1,μ1,λ0,μ0). Therefore, the probability density function of Z is
      fZ(z)=1π1ζ21(zζ)2, (153)
      on the interval 0<ζζ21<z<ζ+ζ21.
  • 51.
    The indefinite integral (e.g., see 2.261 in [24])
    dx2ζxx21=arcsinxζζ21 (154)
    results, with Xi=λiV+μi, i=0,1, in
    P[ıX1X0(X1)logt]=1,ζ+ζ21t;12+1πarcsintζζ21,ζζ21<t<ζ+ζ21;0,0<tζζ21. (155)
  • 52.
    For future use, note that the endpoints of the support of (153) are their respective reciprocals. Furthermore,
    fZ1z=zfZ(z), (156)
    which implies
    f1Z(z)=1zfZ(z). (157)

8. Equivalent Pairs of Probability Measures

  • 53.

    Suppose that P1 and Q1 are probability measures on (A1,F1) such that P1Q1 and P2 and Q2 are probability measures on (A2,F2) such that P2Q2. We say that (P1,Q1) and (P2,Q2) are equivalent pairs, and write (P1,Q1)(P2,Q2), if the cumulative distribution functions of ıP1Q1(X1) and ıP2Q2(X2) are identical with X1P1 and X2P2. Naturally, ≡ is an equivalence relationship. Because of the one-to-one correspondence indicated in Item 46, the definition of equivalent pairs does not change if we require equality of the information spectra under the dominated measure, i.e., that ıP1Q1(Y1) and ıP2Q2(Y2) be equally distributed Y1Q1 and Y2Q2. Obviously, the requirement that the information spectra coincide is the same as requiring that the distributions of dP1dQ1(Y1) and dP2dQ2(Y2) are equal. As in Item 46, we also employ the notation (X1,Y1)(X2,Y2) to indicate (P1,Q1)(P2,Q2) if X1P1, X2P2, Y1Q1, and Y2Q2.

  • 54.

    Suppose that the output probability measures of a certain (random or deterministic) transformation are Q0 and Q1 when the input is distributed according to P0 and P1, respectively. If (P0,P1)(Q0,Q1), then the transformation is a sufficient statistic for deciding between P0 and P1 (i.e., the case of a binary parameter).

  • 55.
    If (A,F) is a measurable space on which the probability measures PX1PX2 are defined, and ϕ:AB is a (F,G)-measurable injective function, then Pϕ(X1)Pϕ(X2) are probability measures on (B,G) and
    ıX1X2(x)=ıϕ(X1)ϕ(X2)ϕ(x). (158)

    Consequently, (X1,X2)(ϕ(X1),ϕ(X2)).

  • 56.
    The most important special case of Item 55 is an affine transformation of an arbitrary real-valued random variable X, which enables the reduction of four-parameter problems into two-parameter problems: for all (λ2,μ1,μ2)R3 and λ10,
    (λ1X+μ1,λ2X+μ2)(X,λX+μ), (159)
    with
    λ=λ2λ1andμ=μ2μ1λ1, (160)
    by choosing the affine function ϕ(x)=xμ1λ1.
  • 57.

    Theorem 8.

    If XnRn is an even random vector, i.e., PXn=PXn, then
    (Xn+μ1,Xn+μ2)(Xn+μ3,Xn+μ4), (161)
    whenever |μ1μ2|=|μ3μ4|.

    Proof. 

    • (a)
      If μ1μ2=μ3μ4, then (161) holds even if Xn is not even because the function xμ is injective, in particular, with μ=μ3μ1=μ4μ2.
    • (b)
      If μ1μ2=μ4μ3, then
      (Xn+μ1,Xn+μ2)(Xn,Xn+μ2μ1) (162)
      (Xn,Xn+μ3μ4) (163)
      (Xn+μ3μ4,Xn) (164)
      (Xn+μ3μ4,Xn) (165)
      (Xn+μ3,Xn+μ4), (166)
      where (162) and (166) follow from Part (a), (164) follows because x+μ3μ4 is injective, and (165) holds because Xn is even.
         □
  • 58.

    We now proceed to prove Theorem 7.

    Proof. 

    Since λV and λV have identical distributions, we may assume for convenience that λ1>0 and λ0>0. Furthermore, capitalizing on Item 56, we may assume λ1=1, μ1=0, λ0=λ, and μ0=μ, and then recover the general result letting λ=λ0λ1 and μ=μ0μ1λ1. Invoking (A9) and (A10), we have
    EdPVdPλV+μ(V)=1λE(Vμ)2+λ2V2+1 (167)
    =1πλ(tμ)2+λ2(t2+1)2dt (168)
    =λ2+μ2+12λ, (169)
    and we can verify that we recover (151) through the aforementioned substitution. Once we have obtained the expectation of Z=dPVdPλV+μ(V), we proceed to determine its distribution. Denoting the right side of (169) by ζ, we have
    ZE[Z]=1λλ2+(Vμ)21+V2ζ (170)
    =12λ(1λ2μ2)(V21)4μV1+V2 (171)
    =12λ(1λ2μ2)(sin2Θcos2Θ)4μsinΘcosΘ (172)
    =12λ(λ2+μ21)cos2Θ2μsin2Θ (173)
    =12λ(λ2+μ21)2+4μ2cos2Θ+ϕλ,μ (174)
    =ζ21cos2Θ+ϕλ,μ, (175)
    where Θ is uniformly distributed on [π,π]. We have substituted V=tanΘ (see Item 4) in (172), and invoked elementary trigonometric identities in (173) and (174). Since the phase in (175) does not affect it, the distribution of Z is indeed as claimed in (152), and (153) follows because the probability density function of cosΘ is
    fcosΘ(t)=1π11t2,|t|<1. (176)
       □
  • 59.
    In general, it need not hold that (X,Y)(Y,X)—for example, if X and Y are zero-mean Gaussian with different variances. However, the class of scalar Cauchy distributions does satisfy this property since the result of Theorem 7 is invariant to swapping λ1λ0 and μ1μ0. More generally, Theorem 7 implies that, if λ1λ0γ1γ00, then
    (λ1V+μ1,λ0V+μ0)(γ1V+ν1,γ0V+ν0)λ12+λ02+(μ1μ0)2|λ0λ1|=γ12+γ02+(ν1ν0)2|γ0γ1|. (177)

    Curiously, (177) implies that (V,V+1)(V,2V+1).

  • 60.

    For location–dilation families of random variables, we saw in Item 56 how to reduce a four-parameter problem into a two-parameter problem since (λ1V+μ1,λ0V+μ0)(V,λV+μ) with the appropriate substitution. In the Cauchy case, Theorem 7 reveals that, in fact, we can go one step further and turn it into a one-parameter problem. We have two basic ways of doing this:

    • (a)

      (λ1V+μ1,λ0V+μ0)(V,V+μ) with μ2=2ζ2.

    • (b)
      (λ1V+μ1,λ0V+μ0)(V,λV) with either
      λ=ζζ21<1,orλ=ζ+ζ21>1, (178)
      which are the solutions to ζ=λ2+12λ.

9. f-Divergences

This section studies the interplay of f-divergences and equivalent pairs of measures.

  • 61.
    If PQ and f:[0,)R is convex and right-continuous at 0, f-divergence is defined as
    Df(PQ)=EfdPdQ(Y),YQ. (179)
  • 62.
    The most important property of f-divergence is the data processing inequality
    Df(PXQX)Df(PYQY), (180)
    where PY and QY are the responses of a (random or deterministic) transformation to PX and QX, respectively. If f is strictly convex at 1 and Df(PXQX)<, then (PX,QX)(PY,QY) is necessary and sufficient for equality in (180).
  • 63.

    If (P,Q)(Q,P), then Df(PQ)=Df(PQ) with the transform f(t)=tf(1t), which satisfies f=f.

  • 64.

    Theorem 9.

    If P1Q1 and P2Q2, then
    (P1,Q1)(P2,Q2)Df(P1Q1)=Df(P2Q2),f, (181)
    where f stands for all convex right-continuous f:[0,)R.

    Proof. 

    As mentioned in Item 53, (P1,Q1)(P2,Q2) is equivalent to dP1dQ1(Y1) and dP2dQ2(Y2) having identical distributions with Y1Q1 and Y2Q2.
    • According to (179), Df(PQ) is determined by the distribution of the random variable dPdQ(Y), YQ.
    • For tR, the function ft(x)=etx, x0, is convex and right-continuous at 0, and Dft(PQ) is the moment generating function, evaluated at t, of the random variable dPdQ(Y), YQ. Therefore, Dft(P1Q1)=Dft(P2Q2) for all t implies that (P1,Q1)(P2,Q2).
         □
  • 65.

    Since PQ is not necessary in order to define (finite) Df(PQ), it is possible to enlarge the scope of Theorem 9 by defining (P1,Q1)(P2,Q2) dropping the restriction that P1Q1 and P2Q2. For that purpose, let μ1 and μ2 be σ-finite measures on (A1,F1) and (A2,F2), respectively, and denote pi=dPidμi, qi=dQidμi, i=1,2. Then, we say (P1,Q1)(P2,Q2) if

    • (a)

      when restricted to [0,1], the random variables p1(Y1)q1(Y1) and p2(Y2)q2(Y2) have identical distributions with Y1Q1 and Y2Q2;

    • (b)

      when restricted to [0,1], the random variables q1(X1)p1(X1) and q2(X2)p2(X2) have identical distributions with X1P1 and X2P2.

    Note that those conditions imply that

    • (c)

      Q1({ωA1:p1(ω)=q1(ω)})=Q2({ωA2:p2(ω)=q2(ω)});

    • (d)

      Q1({ωA1:p1(ω)=0})=Q2({ωA2:p2(ω)=0});

    • (e)

      P1({ωA1:q1(ω)=0})=P2({ωA2:q2(ω)=0}).

    For example, if P1Q1 and P2Q2, then (P1,Q1)(P2,Q2). To show the generalized version of Theorem 9, it is convenient to use the symmetrized form
    Df(PQ)=0p<qqfpqdμ+0q<ppfqpdμ+f(1)Q[p=q]. (182)
  • 66.
    Suppose that there is a class C of probability measures on a given measurable space with the property that there exists a convex function g:(0,)R (right-continuous at 0) such that, if (P1,Q1)C2 and (P2,Q2)C2, then
    Dg(P1Q1)=Dg(P2Q2)(P1,Q1)(P2,Q2). (183)

    In such case, Theorem 9 indicates that C2 can be partitioned into equivalence classes such that, within every equivalence class, the value of Df(PQ) is constant, though naturally dependent on f. Throughout C2, the value of Dg(PQ) determines the value of Df(PQ), i.e., we can express Df(PQ)=ϑf,gDg(PQ), where ϑf,g is a non-decreasing function. Consider the following examples:

    • (a)
      Let C be the class of real-valued Gaussian probability measures with given variance σ2>0. Then,
      DNμ1,σ2Nμ2,σ2=(μ1μ2)2σ2loge. (184)

      Since Theorem 8 implies that (Nμ1,σ2,Nμ2,σ2)(Nμ3,σ2,Nμ4,σ2) as long as (μ1μ2)2=(μ3μ4)2, (184) indicates that (183) is satisfied with g(t) given by the right-continuous extension of tlogt. Therefore, we can conclude that, regardless of f, DfNμ1,σ2Nμ2,σ2 depends on (μ1,μ2,σ2) only through (μ1μ2)2/σ2.

    • (b)
      Let C be the collection of all Cauchy random variables. Theorem 7 reveals that (183) is also satisfied if g(x)=x2 because, if XP and YQ, then
      EdPdQ(X)=EdPdQ(Y)2. (185)
  • 67.
    An immediate consequence of Theorems 7 and 9 is that, for any valid f, the f-divergence between Cauchy densities is symmetric,
    Df(λ1V+μ1λ0V+μ0)=Df(λ0V+μ0λ1V+μ1). (186)
    This property does not generalize to the multivariate case. While, in view of Theorem 8,
    (Λ12Vn+μ1,Λ12Vn+μ2)(Λ12Vn+μ2,Λ12Vn+μ1), (187)
    in general, (Λ12Vn,Vn)(Vn,Λ12Vn) since the corresponding relative entropies do not coincide as shown in [8].
  • 68.

    It follows from Item 66 and Theorem 7 that any f-divergence between Cauchy probability measures Df(λ1V+μ1λ0V+μ0) is a monotonically increasing function of ζ(λ1,μ1,λ0,μ0) given by (149). The following result shows how to obtain that function from f.

    Theorem 10.

    With fZ given in (153),
    Df(λ1V+μ1λ0V+μ0)=ζζ21ζ+ζ21f1zfZ(z)dz (188)
    =Efζ+ζ21cosΘ1 (189)
    =ζζ21ζ+ζ211zfzfZ(z)dz. (190)
    where Θ is uniformly distributed on [0,π] in (189).

    Proof. 

    In view of (179) and the definition of Z in Theorem 7,
    Df(λ1V+μ1λ0V+μ0)=Ef1Z, (191)
    thereby justifying (188) and (189) since we saw in Theorem 7 that Z has the distribution of ζ+ζ21cosΘ with Θ uniformly distributed on [0,π]. Item 52 results in (190). Alternatively, we can rely on Item 63 and substitute f by f on the right side of (188).    □
  • 69.
    Suppose now that we have two sequences of Cauchy measures with respective parameters (λ1(n),μ1(n)) and (λ0(n),μ0(n)) such that ζ(λ1(n),μ1(n),λ0(n),μ0(n))1. Then, Theorem 10 indicates that
    Dfλ1(n)V+μ1(n)λ0(n)V+μ0(n)f(1). (192)
    The most common f-divergences are such that f(1)=0 since in that case Df(PQ)0. In addition, adding the function αtα to f(t) does not change the value of Df(PQ) and with appropriately chosen α, we can turn f(t) into canonical form in which not only f(1)=0 but f(t)0. In the special case in which the second measure is fixed, Theorem 9 in [25] shows that, if esssupdPndQ(Y)1 with YQ, then
    limnDf(PnQ)Dg(PnQ)=limt1f(t)g(t), (193)
    provided the limit on the right side exists; otherwise, the left side lies between the left and right limits at 1. In the Cauchy case, we can allow the second probability to depend on n and sharpen that result by means of Theorem 10. In particular, it can be shown that
    limnDfλ1(n)V+μ1(n)λ0(n)V+μ0(n)Dgλ1(n)V+μ1(n)λ0(n)V+μ0(n)=f˙(0)+f˙(0+)g˙(0)+g˙(0+) (194)
    provided the right side is not 00.

10. χ2-Divergence

  • 70.
    With either f(x)=(x1)2 or f(x)=x21, f-divergence is the χ2-divergence,
    χ2(PQ)=EdPdQ(X)1,XP. (195)
  • 71.
    If P and Q are Cauchy distributions, then (149), (151) and (195) result in
    χ2(λ1V+μ1λ0V+μ0)=ζ(λ1,μ1,λ0,μ0)1 (196)
    =(|λ0||λ1|)2+(μ1μ0)22|λ0λ1|, (197)
    a formula obtained in Appendix D of [26] using complex analysis and the Cauchy integral formula. In addition, invoking complex analysis and the maximal group invariant results in [27,28], ref. [26] shows that any f-divergence between Cauchy distributions can be expressed as a function of their χ2 divergence, although [26] left open how to obtain that function, which is given by Theorem 10 substituting ζ=1+χ2.

11. Relative Entropy

  • 72.
    The relative entropy between Cauchy distributions is given by
    D(λ1V+μ1λ0V+μ0)=log(|λ0|+|λ1|)2+(μ1μ0)24|λ0λ1|, (198)
    where λ1λ00. The special case λ1=λ0 of (198) was found in Example 4 of [29]. The next four items give different simple justifications for (198). An alternative proof was recently given in Appendix C of [26] using complex analysis holomorphisms and the Cauchy integral formula. Yet another, much more involved, proof is reported in [30]. See also Remark 19 in [26] for another route invoking the Lévy–Khintchine formula and the Frullani integral.
  • 73.
    Since for absolutely continuous random variables D(XY)=h(X)E[logfY(X)],
    DVλV+μ=h(V)+logπ|λ|+Elogλ2+(Vμ)2 (199)
    =log(4|λ|)+log(1+|λ|)2+μ2, (200)
    where (200) follows from (12) and (A4) with α2=λ2+μ2 and cosβ=μ|α|.

    Now, substituting λ=λ0λ1 and μ=μ0μ1λ1, we obtain (198) since, according to Item 56, (V,λV+μ)(λ1V+μ1,λ0V+μ0).

  • 74.
    From the formula found in Example 4 of [29] and the fact that, according to (197), χ2=μ22λ2 when λ1=λ0=λ, we obtain
    D(λV+μλV)=log1+μ24λ2=log1+12χ2. (201)

    Moreover, as argued in Item 60, (201) is also valid for the relative entropy between Cauchy distributions with λ1λ0 as long as χ2 is given in (197). Indeed, we can verify that the right side of (201) becomes (198) with said substitution.

  • 75.
    By the definition of relative entropy, and Theorem 7,
    D(λ1V+μ1λ0V+μ0)=ElogZ (202)
    =12π02πlogζ+ζ21cosθdθ (203)
    =log1+ζ2, (204)
    where (204) follows from (A14). Then, (198) results by plugging into (204) the value of ζ in (149).
  • 76.

    Evaluating (190) with f(t)=tlogt results in (202).

  • 77.
    If $V$ is standard Cauchy, independent of Cauchy $V_1$ and $V_0$, then (198) results in
    $D(\lambda V+\epsilon V_1\,\|\,\lambda V+\epsilon V_0)=\frac{\epsilon^2}{4\lambda^2}\Big((\lambda_1-\lambda_0)^2+(\mu_1-\mu_0)^2\Big)\log e+o(\epsilon^2),$ (205)
    where $V_1=\lambda_1V'+\mu_1$ and $V_0=\lambda_0V'+\mu_0$, and $V'$ is an independent (or exact) copy of $V$. In contrast, the corresponding result in the Gaussian case, in which $X$, $X_1$, $X_0$ are independent Gaussian with means $\mu,\mu_1,\mu_0$ and variances $\sigma^2,\sigma_1^2,\sigma_0^2$, respectively, is
    $D(X+\epsilon X_1\,\|\,X+\epsilon X_0)=\frac{\epsilon^2}{2\sigma^2}(\mu_1-\mu_0)^2\log e+o(\epsilon^2).$ (206)

    In fact, it is shown in Lemma 1 of [31] that (206) holds even if X1 and X0 are not Gaussian but have finite variances. It is likely that (205) holds even if V1 and V0 are not Cauchy, but have finite strengths.

  • 78.
    An important information theoretic result due to Csiszár [32] is that if $Q_1\ll Q_2$ and $P$ is such that
    $\mathbb{E}\big[\imath_{Q_1\|Q_2}(X)\big]=D(Q_1\|Q_2),\quad X\sim P,$ (207)
    then the following Pythagorean identity holds
    $D(P\|Q_2)=D(P\|Q_1)+D(Q_1\|Q_2).$ (208)
    Among other applications, this result leads to elegant proofs of minimum relative entropy results. For example, the closest Gaussian to a given P with a finite second moment has the same first and second moments as P. If we let $Q_1$ and $Q_2$ be centered Cauchy with strengths $\lambda_1$ and $\lambda_2$, respectively, then the orthogonality condition (207) becomes, with the aid of (148) and (198),
    $\mathbb{E}\left[\log\left(1+\frac{X^2}{\lambda_2^2}\right)\right]-\mathbb{E}\left[\log\left(1+\frac{X^2}{\lambda_1^2}\right)\right]=2\log_e\left(1+\frac{\lambda_1}{\lambda_2}\right)-2\log_e 2.$ (209)

    If, in addition, P is centered Cauchy, we can use (28) to verify that (209) holds only in the trivial cases in which either λ1=λ2 or P=Q1. For non-Cauchy P, (208) may indeed be satisfied with λ1λ2. For example, using (30), if X=V2,2, then (209), and therefore (208), holds with (λ1,λ2)=(2,0.35459).

  • 79.
    Mutually absolutely continuous random variables may be such that
    $D(X\|Z)<\infty=D(Z\|X).$ (210)

    An easy example is that of Gaussian X and Cauchy Z, or, if we let X be Cauchy, (210) holds with Z having the very heavy-tailed density function in (62).

  • 80.
    While relative entropy is lower semi-continuous, it is not continuous. For example, using the Cauchy distribution, we can show that relative entropy is not stable against small contamination of a Gaussian random variable: if $X$ is Gaussian, independent of $V$, then no matter how small $\lambda\ne 0$,
    $D(\lambda V+X\,\|\,\lambda|V|+X)=\infty.$ (211)

12. Total Variation Distance

  • 81.
    With f(x)=|x1|, f-divergence becomes the total variation distance (with range [0,2]). Moreover, we have the following representation:

    Theorem 11.

    If $P\ll Q$ and the pairs $(P,Q)$ and $(Q,P)$ are equivalent, then
    $\tfrac12|P-Q|=2\,P[Z>1]-P[Z\ne 1],$ (212)
    with $Z=\frac{dP}{dQ}(X)$, $X\sim P$.

    Proof. 

    $\tfrac12|P-Q|=\max_{A\in\mathscr{F}}\,P(A)-Q(A)$ (213)
    $=P\Big[\omega:\tfrac{dP}{dQ}(\omega)>1\Big]-Q\Big[\omega:\tfrac{dP}{dQ}(\omega)>1\Big]$ (214)
    $=P\Big[\omega:\tfrac{dP}{dQ}(\omega)>1\Big]-P\Big[\omega:\tfrac{dQ}{dP}(\omega)>1\Big]$ (215)
    $=P[Z>1]-P[Z<1],$ (216)
    where (215) and (216) follow from the equivalence of $(P,Q)$ and $(Q,P)$ and from $P\ll Q$, respectively.    □
  • 82.
    Example 15 of [33] shows that the total variation distance between centered Cauchy distributions is
    $|P_{\lambda_1 V}-P_{\lambda_0 V}|=\frac{4}{\pi}\arctan\frac{\big||\lambda_1|-|\lambda_0|\big|}{2\sqrt{|\lambda_0\lambda_1|}}$ (217)
    $=\frac{4}{\pi}\arctan\sqrt{\tfrac12\chi^2(P_{\lambda_1 V}\|P_{\lambda_0 V})}$ (218)
    in view of (197). Since any f-divergence between Cauchy distributions depends on the parameters only through the corresponding χ²-divergence, (217)–(218) imply the general formula
    $|P_{\lambda_1 V+\mu_1}-P_{\lambda_0 V+\mu_0}|=\frac{4}{\pi}\arctan\sqrt{\tfrac12\chi^2(P_{\lambda_1 V+\mu_1}\|P_{\lambda_0 V+\mu_0})}.$ (219)
    Alternatively, applying Theorem 11 to the case of Cauchy random variables, note that, in this case, $Z$ is an absolutely continuous random variable with density function (153). Therefore, $P[Z\ne 1]=1$, and
    $P[Z>1]=\frac{1}{\pi}\int_1^{\zeta+\sqrt{\zeta^2-1}}\frac{dz}{\sqrt{2z\zeta-z^2-1}}$ (220)
    $=\frac12+\frac{1}{\pi}\arctan\sqrt{\tfrac12\chi^2},$ (221)
    where (221) follows from (154) and the identity $\arcsin\sqrt{\frac{\delta}{1+\delta}}=\arctan\sqrt{\delta}$ specialized to $\delta=\tfrac12\chi^2=\tfrac12(\zeta-1)$. Though more laborious (see [26]), (219) can also be verified by direct integration.
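    The following sketch (arbitrary parameters, illustrative helper names) checks (219) against a direct numerical integration of the total variation distance.

    import numpy as np
    from scipy.integrate import quad

    def cauchy_pdf(x, lam, mu):
        return lam / (np.pi * (lam**2 + (x - mu)**2))

    def tv_numeric(lam1, mu1, lam0, mu0):
        # total variation distance with range [0, 2]; the integrand is not smooth,
        # hence the generous subdivision limit
        f = lambda x: abs(cauchy_pdf(x, lam1, mu1) - cauchy_pdf(x, lam0, mu0))
        val, _ = quad(f, -np.inf, np.inf, limit=400)
        return val

    def tv_closed_form(lam1, mu1, lam0, mu0):      # formula (219) with (197)
        chi2 = ((abs(lam0) - abs(lam1))**2 + (mu1 - mu0)**2) / (2 * abs(lam0 * lam1))
        return (4 / np.pi) * np.arctan(np.sqrt(chi2 / 2))

    params = (1.0, 2.0, 3.0, -1.0)
    print(tv_numeric(*params), tv_closed_form(*params))   # the two values agree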

13. Hellinger Divergence

  • 83.
    The Hellinger divergence $H_\alpha(P\|Q)$ of order $\alpha\in(0,1)\cup(1,\infty)$ is the $f_\alpha$-divergence with
    $f_\alpha(t)=\frac{t^\alpha-1}{\alpha-1}.$ (222)
    Notable special cases are
    $H_2(P\|Q)=\chi^2(P\|Q),$ (223)
    $\lim_{\alpha\to 1}H_\alpha(P\|Q)=D(P\|Q),$ (224)
    $H_{\frac12}(P\|Q)=2\,\mathscr{H}^2(P\|Q),$ (225)
    where $\mathscr{H}^2(P\|Q)$ is known as the squared Hellinger distance.
  • 84.
    For Cauchy random variables, Theorem 10 yields
    $H_\alpha(\lambda_1V+\mu_1\,\|\,\lambda_0V+\mu_0)=\frac{1}{\alpha-1}\Big(\mathbb{E}\big[Z^{\alpha-1}\big]-1\Big)$ (226)
    $=\frac{P_{\alpha-1}(\zeta)-1}{\alpha-1},$ (227)
    where ζ is as given in (149), we have used (A15), and $P_\alpha(\cdot)$ denotes the Legendre function of the first kind, which satisfies $P_\alpha=P_{-\alpha-1}$ (see 8.2.1 in [34]).

14. Rényi Divergence

  • 85.
    For absolutely continuous probability measures $P$ and $Q$, with corresponding probability density functions $p$ and $q$, the Rényi divergence of order $\alpha\in[0,1)\cup(1,\infty)$ is [35]
    $D_\alpha(P\|Q)=\frac{1}{\alpha-1}\log\int_{-\infty}^{\infty}p^\alpha(t)\,q^{1-\alpha}(t)\,dt.$ (228)
    Note that, if $(P_1,Q_1)$ and $(P_2,Q_2)$ are equivalent pairs, then $D_\alpha(P_1\|Q_1)=D_\alpha(P_2\|Q_2)$. Moreover, although Rényi divergence of order α is not an f-divergence, it is in one-to-one correspondence with the Hellinger divergence of order α:
    $D_\alpha(P\|Q)=\frac{1}{\alpha-1}\log\big(1+(\alpha-1)H_\alpha(P\|Q)\big).$ (229)
  • 86.
    An extensive table of order-α Rényi divergences for various continuous random variables can be found in [36]. An addition to that list for Cauchy random variables can be obtained by plugging (227) into (229):
    $D_\alpha(\lambda_1V+\mu_1\,\|\,\lambda_0V+\mu_0)=\frac{\log P_{\alpha-1}(\zeta)}{\alpha-1}$ (230)
    $=\frac{1}{\alpha-1}\log P_{\alpha-1}\left(\frac{\lambda_1^2+\lambda_0^2+(\mu_1-\mu_0)^2}{2|\lambda_0\lambda_1|}\right),$ (231)
    for $\alpha\in(0,1)\cup(1,\infty)$.
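    As an illustration, the sketch below (assumed, arbitrary parameters) compares the definition (228) with the closed form (230)–(231), evaluating the Legendre function through the Laplace integral (A15).

    import numpy as np
    from scipy.integrate import quad

    def cauchy_pdf(x, lam, mu):
        return lam / (np.pi * (lam**2 + (x - mu)**2))

    def renyi_direct(alpha, lam1, mu1, lam0, mu0):
        # definition (228), in nats
        f = lambda x: cauchy_pdf(x, lam1, mu1)**alpha * cauchy_pdf(x, lam0, mu0)**(1 - alpha)
        val, _ = quad(f, -np.inf, np.inf, limit=400)
        return np.log(val) / (alpha - 1)

    def legendre_P(nu, beta):
        # Laplace integral (A15), valid here since beta = zeta >= 1
        g = lambda t: (beta + np.sqrt(beta**2 - 1) * np.cos(t))**nu
        val, _ = quad(g, 0, np.pi)
        return val / np.pi

    def renyi_closed(alpha, lam1, mu1, lam0, mu0):
        # formulas (230)-(231)
        zeta = (lam1**2 + lam0**2 + (mu1 - mu0)**2) / (2 * abs(lam0 * lam1))
        return np.log(legendre_P(alpha - 1, zeta)) / (alpha - 1)

    print(renyi_direct(0.7, 1.0, 2.0, 3.0, -1.0), renyi_closed(0.7, 1.0, 2.0, 3.0, -1.0))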
  • 87.
    Suppose that $\lambda\in(0,1)$. Then, (A16) yields
    $D_{\frac12}(V\,\|\,\lambda V)=-2\log\left(\frac{2\sqrt{\lambda}}{\pi}\,K\Big(\sqrt{1-\lambda^2}\Big)\right),$ (232)
    where $K(\cdot)$ stands for the complete elliptic integral of the first kind in (A18). As indicated in Item 60, to obtain $D_{\frac12}(\lambda_1V+\mu_1\,\|\,\lambda_0V+\mu_0)$, we just need to substitute $\lambda$ by $\zeta-\sqrt{\zeta^2-1}$ in (232), with ζ given by (149).
  • 88.
    Notice that, specializing (231) to $(\alpha,\mu_0,\mu_1,\lambda_0,\lambda_1)=(\tfrac12,0,0,\lambda,1)$, (232) results in the identity
    $P_{-\frac12}\left(\frac{1}{2\lambda}+\frac{\lambda}{2}\right)=\frac{2\sqrt{\lambda}}{\pi}\,K\Big(\sqrt{1-\lambda^2}\Big),\quad\lambda\in(0,1).$ (233)
    Writing the complete elliptic integral of the first kind and the Legendre function of the first kind as special cases of the Gauss hypergeometric function, González [37] noticed the simpler identity (see also 8.13.8 in [34])
    $P_{-\frac12}(\lambda)=\frac{2}{\pi}\,K\left(\sqrt{\frac{1-\lambda}{2}}\right),\quad\lambda\in(0,1).$ (234)

    We can view (233) and (234) as complementary of each other since they constrain the argument of the Legendre function to belong to $(1,\infty)$ and $(0,1)$, respectively.

  • 89.
    Since $P_1(z)=z$, particularizing (230), we obtain
    $D_2(\lambda_1V+\mu_1\,\|\,\lambda_0V+\mu_0)=\log\zeta=\log\frac{\lambda_1^2+\lambda_0^2+(\mu_1-\mu_0)^2}{2|\lambda_0\lambda_1|}.$ (235)
  • 90.
    Since $P_2(z)=\frac12(3z^2-1)$, for Cauchy random variables, we obtain
    $D_3(P\|Q)=\frac12\log\left(1+3\chi^2(P\|Q)+\frac32\chi^4(P\|Q)\right).$ (236)
  • 91.
    For Cauchy random variables, the Rényi divergence for integer order 4 or higher can be obtained through (235), (236) and the recursion (dropping $(P\|Q)$ for typographical convenience)
    $(n+1)\exp\big((n+1)D_{n+2}\big)=(2n+1)\,\zeta\exp\big(nD_{n+1}\big)-n\exp\big((n-1)D_n\big),$ (237)
    which follows from (230) and the recursion of the Legendre polynomials
    $(n+1)P_{n+1}(z)=(2n+1)\,z\,P_n(z)-n\,P_{n-1}(z),$ (238)
    which, in fact, also holds for non-integer $n$ (see 8.5.3 in [34]).
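    A minimal sketch of the recursion (237), seeded with (235) and (236); the parameter values are arbitrary assumptions and divergences are computed in nats.

    import numpy as np

    def renyi_integer_orders(lam1, mu1, lam0, mu0, n_max):
        zeta = (lam1**2 + lam0**2 + (mu1 - mu0)**2) / (2 * abs(lam0 * lam1))
        chi2 = zeta - 1
        D = {2: np.log(zeta),                                  # (235)
             3: 0.5 * np.log(1 + 3 * chi2 + 1.5 * chi2**2)}    # (236)
        for n in range(2, n_max - 1):
            # (n+1) exp((n+1) D_{n+2}) = (2n+1) zeta exp(n D_{n+1}) - n exp((n-1) D_n)
            rhs = (2 * n + 1) * zeta * np.exp(n * D[n + 1]) - n * np.exp((n - 1) * D[n])
            D[n + 2] = np.log(rhs / (n + 1)) / (n + 1)
        return D

    print(renyi_integer_orders(1.0, 2.0, 3.0, -1.0, 8))   # D_2, D_3, ..., D_8 in nats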
  • 92.
    The Chernoff information
    $C(P\|Q)=\sup_{\lambda\in(0,1)}\,(1-\lambda)\,D_\lambda(P\|Q)$ (239)
    satisfies $C(P\|Q)=C(Q\|P)$ regardless of $(P,Q)$. If, as in the case of Cauchy measures, $(P,Q)$ and $(Q,P)$ are equivalent pairs, then Chernoff information is equal to the Bhattacharyya distance:
    $C(P\|Q)=\tfrac12 D_{\frac12}(P\|Q)=\log\frac{1}{\int_{-\infty}^{\infty}\sqrt{p(t)\,q(t)}\,dt}=\log\frac{1}{1-\mathscr{H}^2(P\|Q)},$ (240)
    where $\mathscr{H}^2(P\|Q)$ is the squared Hellinger distance, which is the f-divergence with $f(t)=\frac12(1-\sqrt{t})^2$. Together with Item 87, (240) gives the Chernoff information for Cauchy distributions. While it involves the complete elliptic integral function, its simplicity should be contrasted with the formidable expression for Gaussian distributions, recently derived in [38]. The reason (240) holds is that the supremum in (239) is achieved at $\lambda=\frac12$. To see this, note that
    $f(\lambda)=(1-\lambda)D_\lambda(P\|Q)=\lambda D_{1-\lambda}(Q\|P)$ (241)
    $=\lambda D_{1-\lambda}(P\|Q)$ (242)
    $=f(1-\lambda),$ (243)
    where (241) reflects the skew-symmetry of Rényi divergence, and (242) holds because $(P,Q)$ and $(Q,P)$ are equivalent pairs. Since $f(\lambda)$, $\lambda\in[0,1]$, is concave and its own mirror image, it is maximized at $\lambda=\frac12$.

15. Fisher’s Information

  • 93.
    The score function of the standard Cauchy density (1) is
    $\rho_V(x)=\frac{d}{dx}\log_e f_V(x)=-\frac{d}{dx}\log_e(1+x^2)=-\frac{2x}{1+x^2}.$ (244)
    Then, $\rho_V(V)$ is a zero-mean random variable with second moment equal to Fisher’s information
    $J(V)=\mathbb{E}\big[\rho_V^2(V)\big]=\frac{1}{\pi}\int_{-\infty}^{\infty}\frac{4t^2}{(1+t^2)^3}\,dt=\frac12,$ (245)
    where we have used (A11). Since Fisher’s information is invariant to location and scales as $J(X)=\alpha^2 J(\alpha X)$, we obtain
    $J(\lambda V+\mu)=\frac{1}{2\lambda^2}.$ (246)

    Together with (117), the product of entropy power and Fisher information is $\frac{4\pi}{e}$, thereby abiding by Stam’s inequality [4], $1\le N(X)\,J(X)$.
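    A short numerical check of (245)–(246); the scale value below is an arbitrary assumption.

    import numpy as np
    from scipy.integrate import quad

    def fisher_cauchy(lam):
        # J = E[rho^2] with score rho(x) = -2x/(lam^2 + x^2) for the Cauchy(0, lam) density
        pdf = lambda x: lam / (np.pi * (lam**2 + x**2))
        integrand = lambda x: (2 * x / (lam**2 + x**2))**2 * pdf(x)
        val, _ = quad(integrand, -np.inf, np.inf)
        return val

    lam = 2.0
    print(fisher_cauchy(lam), 1 / (2 * lam**2))   # both ~ 0.125, cf. (246)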

  • 94.
    Introduced in [39], Fisher’s information of a density function (245) quantifies its similarity with a slightly shifted version of itself. A more general notion is the Fisher information matrix of a random transformation PY|X:RkY satisfying the regularity condition
    D(PY|X=αPY|X=θ)=o(αθ). (247)
    Then, the Fisher information matrix of PY|X at θ has coefficients
    Jij(θ,PY|X)=EαiıPY|X=αPY|X=θ(Yθ)αjıPY|X=αPY|X=θ(Yθ)|αθ, (248)
    and satisfies (with relative entropy in nats)
    D(PY|X=αPY|X=θ)=12(αθ)J(θ,PY|X)(αθ)+o(αθ2). (249)
    For the Cauchy family, the parametrization vector has two components, location and strength, namely, θ=(μ,λ). The regularity condition (247) is satisfied in view of (205), and we can use the closed-form expression in (205) to obtain
    J11(θ,PY|X)=J22(θ,PY|X)=12λ2, (250)
    J12(θ,PY|X)=J21(θ,PY|X)=0. (251)
  • 95.
    The relative Fisher information is defined as
    J(PQ)=EıPQ(X)2,XP. (252)
    Although the purpose of this definition is to avoid some of the pitfalls of the classical definition of Fisher’s information, not only do equivalent pairs fail to have the same relative Fisher information but, unlike relative entropy or f-divergence, relative Fisher information is not transparent to injective transformations. For example, J(XY)=λ2J(λXλY). Centered Cauchy random variables illustrate this fact since
    J(VλV)=(4+λ)(λ1)22λ(1+λ)2andJ(λVV)=(4λ+1)(λ1)22λ2(1+λ)2. (253)
  • 96.
    de Bruijn’s identity [4] states that, if NN0,1 is independent of X, then, in nats,
    $\frac{d}{dt}h\big(X+\sqrt{t}\,N\big)=\frac12\,J\big(X+\sqrt{t}\,N\big),\quad t>0.$ (254)
    As well as serving as the key component in the original proofs of the entropy power inequality, the differential equation in (254) provides a concrete link between Shannon theory and its prehistory. As we show in Theorem 12, it turns out that there is a Cauchy counterpart of de Bruijn’s identity (254). Before stating the result, we introduce the following notation for a parametrized random variable Yt (to be specified later):
    logefYt(y)=ylogefYt(y)=fYt1(y)yfYt(y), (255)
    2logefYt(y)=tlogefYt(y)=fYt1(y)tfYt(y), (256)
    J(Yt)=ElogefYt(Yt)2, (257)
    K(Yt)=E2logefYt(Yt)2, (258)
    i.e., J(Yt) and K(Yt) are the Fisher information with respect to location and with respect to dilation, respectively (corresponding to the coefficients J11 and J22 of the Fisher information matrix when θ=(μ,λ) as in Item 94. The key to (254) is that Yt=X+tN, NN0,1 satisfies the partial differential equation
    2y2fYt(y)=tfYt(y). (259)

    Theorem 12.

    Suppose that X is independent of standard Cauchy V. Then, in nats,
    $\frac{d^2}{dt^2}h(X+tV)=J(X+tV)-K(X+tV),\quad t>0.$ (260)

    Proof. 

    Equation (259) does not hold in the current case in which Yt=X+tV, and
    fYt(y)=tπE1t2+(Xy)2. (261)
    However, some algebra (the differentiation/integration swaps can be justified invoking the bounded convergence theorem) indicates that the convolution with the Cauchy density satisfies the Laplace partial differential equation
    2y2fYt(y)=2t2fYt(y)=2tπE3(Xy)2t2(t2+(Xy)2)3. (262)
    The derivative of the differential entropy of Yt is, in nats,
    ddthYt=tfYt(y)dylogefYt(y)tfYt(y)dy (263)
    =tfYt(y)dylogefYt(y)tfYt(y)dy. (264)
    Taking another derivative, the left side of (260) becomes
    d2dt2hYt=2t2fYt(y)logefYt(y)dytfYt(y)tlogefYt(y)dy (265)
    =2y2fYt(y)logefYt(y)dyfYt1(y)tfYt(y)2dy (266)
    =2y2fYt(y)logefYt(y)dyK(Yt) (267)
    =J(Yt)K(Yt), (268)
    where
    • (265) ⟸ the first term on the right side of (264) is zero;
    • (266) ⟸ (262);
    • (267) ⟸ (258);
    • (268) ⟸ integration by parts, exactly as in [4] (or p. 673 of [19]).
         □
  • 97.

    Theorem 12 reveals that the increasing function $f_X(t)=h(X+tV)$ is concave (which does not follow from the concavity of the differential entropy functional of the density). In contrast, it was shown by Costa [40] that the entropy power $N\big(X+\sqrt{t}\,N\big)$, with $N$ standard normal, is concave in $t$.

16. Mutual Information

  • 98.
    Most of this section is devoted to an additive noise model. We begin with the simplest case in which XC is centered Cauchy independent of WC, also centered Cauchy with ς(WC)>0. Then, (11) yields
    $I(X_C;X_C+W_C)=h(X_C+W_C)-h(W_C)$ (269)
    $=\log\big(4\pi(\varsigma(X_C)+\varsigma(W_C))\big)-\log\big(4\pi\,\varsigma(W_C)\big)$ (270)
    $=\log\left(1+\frac{\varsigma(X_C)}{\varsigma(W_C)}\right),$ (271)
    thereby establishing a pleasing parallelism with Shannon’s formula [1] for the mutual information between a Gaussian random variable and its sum with an independent Gaussian random variable. Aside from a factor of $\frac12$, in the Cauchy case, the role of the variance is taken by the strength. Incidentally, as shown in [2], if $N$ is standard exponential on $(0,\infty)$, an independent $X$ on $[0,\infty)$ can be found so that $X+N$ is exponential, in which case the formula (271) also applies because the ratio of strengths of exponentials is equal to the ratio of their means. More generally, if input and noise are independent non-centered Cauchy, their locations do not affect the mutual information, but they do affect their strengths, so, in that case, (271) holds provided that the strengths are evaluated for the centered versions of the Cauchy random variables.
  • 99.
    It is instructive, as well as useful in the sequel, to obtain (271) through a more circuitous route. Since YC=XC+WC is centered Cauchy with strength ς(YC)=ς(XC)+ς(WC), the information density (e.g., [41]) is defined as
    $\imath_{X_C;Y_C}(x;y)=\log\frac{dP_{X_CY_C}}{d(P_{X_C}\times P_{Y_C})}(x,y)$ (272)
    $=\log\frac{f_{Y_C|X_C}(y|x)}{f_{Y_C}(y)}$ (273)
    $=\log\frac{\varsigma(Y_C)}{\varsigma(W_C)}+\log\left(1+\frac{y^2}{\varsigma^2(Y_C)}\right)-\log\left(1+\frac{(y-x)^2}{\varsigma^2(W_C)}\right).$ (274)
    Averaging with respect to $(X_C,Y_C)=(X_C,X_C+W_C)$, we obtain
    $I(X_C;Y_C)=\mathbb{E}\big[\imath_{X_C;Y_C}(X_C;Y_C)\big]$ (275)
    $=\log\frac{\varsigma(Y_C)}{\varsigma(W_C)}+\log 4-\log 4=\log\left(1+\frac{\varsigma(X_C)}{\varsigma(W_C)}\right).$ (276)
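    The following Monte Carlo sketch (with assumed strengths and sample size) averages the information density (274) over samples of the pair and recovers (271) to within simulation error.

    import numpy as np

    rng = np.random.default_rng(0)
    s_x, s_w = 2.0, 1.0                    # assumed strengths of input and noise
    n = 10**6
    x = s_x * rng.standard_cauchy(n)
    w = s_w * rng.standard_cauchy(n)
    y = x + w
    s_y = s_x + s_w                        # strength of the Cauchy output

    # information density (274), in nats
    i_dens = (np.log(s_y / s_w) + np.log(1 + y**2 / s_y**2)
              - np.log(1 + (y - x)**2 / s_w**2))
    print(i_dens.mean())                   # roughly log(1 + s_x/s_w) = log 3 ~ 1.0986
    print(np.log(1 + s_x / s_w))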
  • 100.
    If the strengths of the output $Y=X+N$ and the independent noise $N$ are finite and their differential entropies are not $-\infty$, we can obtain a general representation of the mutual information without requiring that either input or noise be Cauchy. Invoking (56) and $I(X;X+N)=h(X+N)-h(N)$, we have
    $I(X;Y)=\log\frac{N_C(Y)}{N_C(N)}$ (277)
    $=\log\frac{\varsigma(Y)}{\varsigma(N)}+D(N\,\|\,\varsigma(N)V)-D(Y\,\|\,\varsigma(Y)V),$ (278)
    since, as we saw in (49), the finiteness of the strengths guarantees the finiteness of the relative entropies in (278). We can readily verify the alternative representation in which strength is replaced by standard deviation, and the standard Cauchy $V$ is replaced by standard normal $W$:
    $I(X;Y)=\frac12\log\frac{N(Y)}{N(N)}$ (279)
    $=\log\frac{\sigma(Y)}{\sigma(N)}+D(N\,\|\,\sigma(N)W)-D(Y\,\|\,\sigma(Y)W).$ (280)
    A byproduct of (278) is the upper bound
    $I(X;Y)\le\log\frac{\varsigma(Y)}{N_C(N)}$ (281)
    $=\log\frac{\varsigma(Y)}{\varsigma(N)}+D(N\,\|\,\varsigma(N)V),$ (282)
    where (281) follows from NC(Y)ς(Y), and (282) follows by dropping the last term on the right side of (278). Note that (281) is the counterpart of the upper bound given by Shannon [1] in which the standard deviation of Y takes the place of the strength in the numerator, and the square root of the noise entropy power takes the place of the entropy strength in the denominator. Shannon gave his bound three years before Kullback and Leibler introduced relative entropy in [42]. The counterpart of (282) with analogous substitutions of strengths by standard deviations was given by Pinsker [43], and by Ihara [44] for continuous-time processes.
  • 101.

    We proceed to investigate the maximal mutual information between the (possibly non-Cauchy) input and its additive Cauchy-noise contaminated version.

    Theorem 13.

    Maximal mutual information: output strength constraint. For any $\eta\ge\varsigma(W_C)>0$,
    $\max_{X:\ \varsigma(X+W_C)\le\eta}I(X;X+W_C)=\log\frac{\eta}{\varsigma(W_C)},$ (283)
    where $W_C$ is centered Cauchy independent of $X$. The maximum in (283) is attained uniquely by the centered Cauchy distribution with strength $\eta-\varsigma(W_C)$.

    Proof. 

    For centered Cauchy noise, the upper bound in (282) simplifies to
    $I(X;X+W_C)\le\log\frac{\varsigma(X+W_C)}{\varsigma(W_C)},$ (284)
    which shows ≤ in (283). If the input is centered Cauchy $X_C$ with strength $\eta-\varsigma(W_C)$, then $\varsigma(X_C+W_C)=\eta$, and $I(X_C;X_C+W_C)$ is equal to the right side in view of (271).
       □
  • 102.
    In the information theory literature, the maximization of mutual information over the input distribution is usually carried out under a constraint on the average cost E[b(X)] for some real-valued function b. Before we investigate whether the optimization in (283) can be cast into that conventional paradigm, it is instructive to realize that the maximization of mutual information in the case of input-independent additive Gaussian noise can be viewed as one in which we allow any input such that the output variance is constrained, and because the output variance is the sum of input and noise variances that the familiar optimization over variance constrained inputs obtains. Likewise, in the case of additive exponential noise and random variables taking nonnegative values, if we constrain the output mean, automatically we are constraining the input mean. In contrast, the output strength is not equal to the sum of Cauchy noise strength and the input strength, unless the input is Cauchy. Indeed, as we saw in Theorem 1-(d), the output strength depends not only on the input strength but on the shape of its probability density function. Since the noise is Cauchy, (45) yields
    ς(X+WC)ης2,θ(X)ς(WC)+η,withθ=2log2ηη+ς(WC) (285)
    Elogς(WC)+η+X22log2η, (286)
    which is the same input constraint found in [45] (see also Lemma 6 in [46] and Section V in [47]) in which η affects not only the allowed expected cost but the definition of the cost function itself. If X is centered Cauchy with strength ης(WC), then (286) is satisfied with equality, in keeping with the fact that that input achieves the maximum in (283). Any alternative input with the same strength that produces output strength lower than or equal to η can only result in lower mutual information. However, as we saw in Item 29, we can indeed find input distributions with strength ης(WC) that can produce output strength higher than η. Can any of those input distributions achieve I(X;Y)>logης(WC)? The answer is affirmative. If we let X=Vβ,2, defined in (9), we can verify numerically that, for β[0.8,1),
    I(X;X+V)>logς(X)+1. (287)
    We conclude that, at least for θς(WC)1,ς(V0.8,2)=(1,3.126), the capacity–input–strength function satisfies
    C(θ)=maxX:ς(X)θI(X;X+WC)>log1+θς(WC). (288)
  • 103.

    Although not always acknowledged, the key step in the maximization of mutual information over the input distribution for a given random transformation is to identify the optimal output distribution. The results in Items 101 and 102 point out that it is mathematically more natural to impose constraints on the attributes of the observed noisy signal than on the transmitted noiseless signal. In the usual framework of power constraints, both formulations are equivalent as an increase in the gain of the receiver antenna (or a decrease in the front-end amplifier thermal noise) of κ dB has the same effect as an increase of κ dB in the gain of the transmitter antenna (or increase in the output power of the transmitted amplifier). When, as in the case of strength, both formulations lead to different solutions, it is worthwhile to recognize that what we usually view as transmitter/encoder constraints also involve receiver features.

  • 104.
    Consider a multiaccess channel Yi=X1i+X2i+Wi, where Wi is a sequence of strength ς(W) independent centered Cauchy random variables. While the capacity region is unknown if we place individual cost or strength constraints on the transmitters, it is easily solvable if we impose an output strength constraint. In that case, the capacity region is the triangle
    $\mathcal{C}_\eta=\Big\{(R_1,R_2)\in[0,\infty)^2:\ R_1+R_2\le\log\frac{\eta}{\varsigma(W)}\Big\},$ (289)
    where $\eta>\varsigma(W)$ is the output strength constraint. To see this, note (a) the corner points are achievable thanks to Theorem 13; (b) if the transmitters are synchronous, a time-sharing strategy with Cauchy distributed inputs satisfies the output strength constraint in view of (107); (c) replacing the independent encoders by a single encoder which encodes both messages would not be able to achieve higher rate sum. It is also possible to achieve (289) using the successive decoding strategy invented by Cover [48] and Wyner [49] for the Gaussian multiple-access channel: fix $\alpha\in(0,1)$; to achieve $R_1=\alpha\log\frac{\eta}{\varsigma(W)}$ and $R_2=(1-\alpha)\log\frac{\eta}{\varsigma(W)}$, we let the transmitters use random coding with sequences of independent Cauchy random variables with respective strengths
    $\varsigma_1=\eta-\varsigma^{\alpha}(W)\,\eta^{1-\alpha}>0,$ (290)
    $\varsigma_2=\varsigma^{\alpha}(W)\,\eta^{1-\alpha}-\varsigma(W)>0,$ (291)
    which abide by the output strength constraint since $\varsigma_1+\varsigma_2+\varsigma(W)=\eta$, and
    $R_1=\log\left(1+\frac{\varsigma_1}{\varsigma_2+\varsigma(W)}\right),$ (292)
    $R_2=\log\left(1+\frac{\varsigma_2}{\varsigma(W)}\right),$ (293)
    a rate-pair which is achievable by successive decoding by using a single-user decoder for user 1, which treats the codeword transmitted by user 2 as noise; upon decoding the message of user 1, it is re-encoded and subtracted from the received signal, thereby presenting a single-user decoder for user 2 with a signal devoid of any trace of user 1 (with high probability).
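    A worked numerical instance of the successive-decoding split (290)–(293); the values of η, ς(W) and α below are arbitrary assumptions.

    import numpy as np

    eta, s_w, alpha = 10.0, 1.0, 0.3                  # assumed values
    s1 = eta - s_w**alpha * eta**(1 - alpha)          # (290)
    s2 = s_w**alpha * eta**(1 - alpha) - s_w          # (291)
    R1 = np.log(1 + s1 / (s2 + s_w))                  # (292): user 1 decoded first
    R2 = np.log(1 + s2 / s_w)                         # (293): user 2 decoded second
    print(s1 + s2 + s_w)                              # = eta, the output strength constraint
    print(R1, alpha * np.log(eta / s_w))              # R1 = alpha * log(eta / s_w)
    print(R1 + R2, np.log(eta / s_w))                 # sum rate on the boundary of (289)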
  • 105.
    The capacity per unit energy of the additive Cauchy-noise channel Yi=Xi+λVi, where {Vi} is an independent sequence of standard Cauchy random variables, was shown in [29] to be equal to (4λ2)1loge, even though the capacity-cost function of such a channel is unknown. A corollary to Theorem 13 is that the capacity per unit output strength of the same channel is
    $C_O=\frac{1}{\lambda}\max_{\eta\ge\lambda}\frac{\lambda}{\eta}\log\frac{\eta}{\lambda}=\frac{\log e}{\lambda e}.$ (294)
    By only considering Cauchy distributed inputs, the capacity per unit input strength is lower bounded by
    $C_I\ge\max_{\gamma>0}\frac{1}{\gamma}\log\left(1+\frac{\gamma}{\lambda}\right)=\frac{\log e}{\lambda},$ (295)
    but is otherwise unknown as it is not encompassed by the formula in [29].
  • 106.
    We turn to the scenario, dual to that in Theorem 13, in which the input is Cauchy but the noise need not be. As Shannon showed in [1], if the input is Gaussian, among all noise distributions with given second moment, independent Gaussian noise is the least favorable. Shannon showed that fact applying the entropy power inequality to the numerator on the right side of (279), and then further weakened the resulting lower bound by replacing the noise entropy power in the denominator by its variance. Taking a cue from this simple approach, we apply the entropy strength inequality (124) to (277) to obtain
    $I(X_C;X_C+W)=\frac12\log\frac{N_C^2(Y)}{N_C^2(W)}$ (296)
    $\ge\frac12\log\frac{N_C^2(X_C)+N_C^2(W)}{N_C^2(W)}$ (297)
    $=\frac12\log\left(1+\frac{\varsigma^2(X_C)}{N_C^2(W)}\right)$ (298)
    $\ge\frac12\log\left(1+\frac{\varsigma^2(X_C)}{\varsigma^2(W)}\right),$ (299)
    where (299) follows from $N_C^2(W)\le\varsigma^2(W)$. Unfortunately, unlike the case of Gaussian input, this route falls short of showing that Cauchy noise of a given strength is least favorable because the right side of (299) is strictly smaller than the Cauchy-input Cauchy-noise mutual information in (271). Evidently, while the entropy power inequality is tight for Gaussian random variables, it is not for Cauchy random variables, as we observed in Item 39. For this approach to succeed in showing that, under a strength constraint, the least favorable noise is centered Cauchy, we would need that, if $W$ is independent of standard Cauchy $V$, then $N_C(V+W)-N_C(W)\ge 1$. (See Item 119-(a).)
  • 107.

    As in Item 102, the counterpart in the Cauchy-input case is more challenging due to the fact that, unlike variance, the output strength need not be equal to the sum of input and noise strength. The next two results give lower bounds which, although achieved by Cauchy noise, do not just depend on the noise distribution through its strength.

    Theorem 14.

    If XC is centered Cauchy, independent of W with 0<ς(W)<, denote Y=XC+W. Then,
    I(XC;XC+W)logς(Y)ς(W)logς(W)ς(Y)ς(XC), (300)
    with equality if W is centered Cauchy.

    Proof. 

    Let us abbreviate ς=ς(Y)ς(XC). Consider the following chain:
    D(Yς(Y)V)D(Wς(W)V)=DXC+WXC+ςVD(Wς(W)V) (301)
    DWςVD(Wς(W)V) (302)
    =logς(W)ς+Elogς2+W2ς2(W)+W2 (303)
    logς(W)ς, (304)
    where
    • (301) ⟸XC is centered Cauchy;
    • (302) ⟸ relative entropy data processing theorem applied to a random transformation that consists of the addition of independent “noise” XC;
    • (303) ⟸ both relative entropies are finite since ς(W)<;
    • (304) ⟸ the elementary observation
      logς2+t2ς2(W)+t20,ς<ς(W);2logςς(W),ςς(W). (305)
    The desired bound (300) now follows in view of (278). It holds with equality in W being centered Cauchy as, in that case, ς(Y)=ς(XC)+ς(WC).    □

    Although the lower bound in Theorem 14 is achieved by a centered Cauchy, it does not rule out the existence of W such that ς(W)=ς(WC) and I(XC;XC+W)<I(XC;XC+WC).

  • 108.

    For the following lower bound, it is advisable to assume for notational simplicity and without loss of generality that ς(XC)=1. To remove that restriction, we may simply replace W by ς(XC)W.

    Theorem 15.

    Let $V$ be standard Cauchy independent of $W$. Then,
    $I(V;V+W)\ge\log\left(1+\frac{1}{\lambda(W)}\right),$ (306)
    where $\lambda(W)$ is the solution to
    $\mathbb{E}\left[\log\frac{(2+\lambda)^2+W^2}{\lambda^2+W^2}\right]=2\log\left(1+\frac{1}{\lambda}\right).$ (307)
    Equality holds in (306) if $W$ is a centered Cauchy random variable, in which case $\lambda(W)=\varsigma(W)$.

    Proof. 

    It can be shown that, if $P_{XY}=P_XP_{Y|X}=P_YP_{X|Y}$ and $Q_{Y|X}$ is an auxiliary random transformation such that $P_XQ_{Y|X}=Q_YQ_{X|Y}$, where $Q_Y$ is the response of $Q_{Y|X}$ to $P_X$, then
    $I(X;Y)=D(P_{X|Y}\,\|\,Q_{X|Y}\,|\,P_Y)+\mathbb{E}\big[\bar{\imath}_{X;Y}(X;Y)\big],$ (308)
    where $(X,Y)\sim P_XP_{Y|X}$ and the information density $\bar{\imath}_{X;Y}$ corresponds to the joint probability measure $P_XQ_{Y|X}$. We can particularize this decomposition of mutual information to the case where $P_X=P_V$, $P_{Y|X=x}=P_{W+x}$, and $Q_{Y|X=x}=P_{W_C+x}$, where $W_C$ is centered Cauchy with strength $\lambda>0$. Then, $P_XQ_{Y|X}$ is the joint distribution of $V$ and $V+W_C$, and
    $\bar{\imath}_{X;Y}(x;y)=\log\frac{\lambda}{1+\lambda}-\log\big(\lambda^2+(y-x)^2\big)+\log\big((1+\lambda)^2+y^2\big).$ (309)
    Taking the expectation with respect to $(x,y)=(V,V+t)$, and invoking (52), we obtain
    $\mathbb{E}\big[\bar{\imath}_{X;Y}(V;V+t)\big]=\log\frac{\lambda}{1+\lambda}+\mathbb{E}\left[\log\frac{(1+\lambda)^2+(V+t)^2}{\lambda^2+t^2}\right]$ (310)
    $=\log\frac{\lambda}{1+\lambda}+\log\frac{(2+\lambda)^2+t^2}{\lambda^2+t^2}.$ (311)
    Finally, taking the expectation with respect to $t=W$, we obtain
    $\mathbb{E}\big[\bar{\imath}_{X;Y}(V;V+W)\big]=\mathbb{E}\left[\log\frac{(2+\lambda)^2+W^2}{\lambda^2+W^2}\right]-\log\left(1+\frac{1}{\lambda}\right).$ (312)
    If $\lambda=\lambda(W)$, namely, the solution to (307), then (306) follows as a result of (308). If $W$ is centered Cauchy, then the solution to (307) is $\lambda(W)=\varsigma(W)$, and the equality in (306) can be seen by specializing (271) to $(\varsigma(X_C),\varsigma(W_C))=(1,\varsigma(W))$.    □
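    The fixed-point Equation (307) is easy to solve numerically. The sketch below (assumed noise strength; helper names are illustrative) does so for centered Cauchy noise and recovers λ(W)=ς(W), so that (306) matches (271).

    import numpy as np
    from scipy.integrate import quad
    from scipy.optimize import brentq

    def gap(lam, w_pdf):
        # E[log(((2+lam)^2 + W^2)/(lam^2 + W^2))] - 2 log(1 + 1/lam), cf. (307)
        f = lambda w: w_pdf(w) * np.log(((2 + lam)**2 + w**2) / (lam**2 + w**2))
        expect, _ = quad(f, -np.inf, np.inf, limit=400)
        return expect - 2 * np.log(1 + 1 / lam)

    s_w = 0.7                                           # assumed noise strength
    w_pdf = lambda w: s_w / (np.pi * (s_w**2 + w**2))   # centered Cauchy noise density
    lam = brentq(lambda l: gap(l, w_pdf), 1e-3, 100.0)
    print(lam)                                          # ~ 0.7 = s_w, as claimed in Theorem 15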
  • 109.
    As we just saw, if W is centered Cauchy, then the solution to (307) satisfies λ(W)=ς(W). On the other hand, we have
    0.302..=ς(V2,2)<λ(V2,2)=0.349 (313)
    4.961...=λ(W)<ς(W)=5.845 (314)
    if W has the probability density function in (100).
  • 110.
    As the proof indicates, at the expense of additional computation, we may sharpen the lower bound in Theorem 15 to show
    $I(V;V+W)\ge\max_{\lambda>0}\left\{\mathbb{E}\left[\log\frac{(2+\lambda)^2+W^2}{\lambda^2+W^2}\right]-\log\left(1+\frac{1}{\lambda}\right)\right\},$ (315)
    which is attained at the solution to
    λ2+ληW21(2+λ)2ηW21λ2+12λ+2=0. (316)
  • 111.

    Theorem 16.

    The rate–distortion function of a memoryless source whose distribution is centered Cauchy with strength $\varsigma(X)$, such that the time-average of the distortion strength is upper bounded by $D$, is given by
    $R(D)=\begin{cases}\log\frac{\varsigma(X)}{D}, & 0<D<\varsigma(X);\\ 0, & D\ge\varsigma(X).\end{cases}$ (317)

    Proof. 

    If $D\ge\varsigma(X)$, reproducing the source by $(0,\ldots,0)$ results in a time-average of the distortion strength equal to $\frac1n\sum_{i=1}^n\varsigma(X_i)=\varsigma(X)$. Therefore, $R(D)=0$. If $0<D<\varsigma(X)$, we proceed to determine the minimal $I(X;\hat X)$ among all $P_{\hat X|X}$ such that $\varsigma(X-\hat X)\le D$. For any such random transformation,
    $I(X;\hat X)=h(X)-h(X|\hat X)$ (318)
    $=h(X)-h(X-\hat X|\hat X)$ (319)
    $\ge h(X)-h(X-\hat X)$ (320)
    $=\log\big(4\pi\varsigma(X)\big)-h(X-\hat X)$ (321)
    $\ge\log\big(4\pi\varsigma(X)\big)-\log\big(4\pi\varsigma(X-\hat X)\big)$ (322)
    $\ge\log\frac{\varsigma(X)}{D},$ (323)
    where (320) holds because conditioning cannot increase differential entropy, and (322) follows from Theorem 3 applied to $Z=X-\hat X$. The fact that there is an allowable $P_{\hat X|X}$ that achieves the lower bound with equality is best seen by letting $X=\hat X+Z$, where $Z$ and $\hat X$ are independent centered Cauchy random variables with $\varsigma(Z)=D$ and $\varsigma(\hat X)=\varsigma(X)-D$. Then, $P_{\hat X|X}P_X=P_{X|\hat X}P_{\hat X}$ is such that the $X$ marginal is indeed centered Cauchy with strength $\varsigma(X)$, and $\varsigma(X-\hat X)=D$. Recalling (271),
    $I(\hat X;X)=\log\left(1+\frac{\varsigma(X)-D}{\varsigma(Z)}\right)=\log\frac{\varsigma(X)}{D},$ (324)
    and the lower bound in (323) can indeed be satisfied with equality. We are not finished yet since we need to justify that the rate–distortion function is indeed
    R(D)=minPX^|X:ς(XX^)DI(X;X^), (325)
    which does not follow from the conventional memoryless lossy compression theorem with average distortion because, although the distortion measure is separable, it is not the average of a function with respect to the joint probability measure PXX^. This departure from the conventional setting does not impact the direct part of the theorem (i.e., ≤ in (325)), but it does affect the converse and in particular the proof of the fact that the n-version of the right side of (325) single-letterizes. To that end, it is sufficient to show that the function of D on the right side of (325) is convex (e.g., see pp. 316–317 in [19]). In the conventional setting, this follows from the convexity of the mutual information in the random transformation since, with a distortion function d(·,·), we have
    E[d(X,X^α)]=αE[d(X,X^1)]+(1α)E[d(X,X^0)], (326)
    where (X,X^1)PXPX^|X1, (X,X^0)PXPX^|X0, and (X,X^α)αPXPX^|X1+(1α)PXPX^|X0. Unfortunately, as we saw in Item 35, strength is not convex on the probability measure so, in general, we cannot claim that
    ς(XX^α)ας(XX^1)+(1α)ς(XX^0). (327)
    The way out of this quandary is to realize that (327) is only needed for those PX^|X0 and PX^|X1 that attain the minimum on the right side of (325) for different distortion bounds D0 and D1. As we saw earlier in this proof, those optimal random transformations are such that XX^0 and XX^1 are centered Cauchy. Fortuitously, as we noted in (107), (327) does indeed hold when we restrict attention to mixtures of centered Cauchy distributions.    □

    Theorem 16 gives another example in which the Shannon lower bound to the rate–distortion function is tight. In addition to Gaussian sources with mean–square distortion, other examples can be found in [50]. Another interesting aspect of the lossy compression of memoryless Cauchy sources under strength distortion measure is that it is optimally successively refinable in the sense of [51,52]. As in the Gaussian case, this is a simple consequence of the stability of the Cauchy distribution and the fact that the strength of the sum of independent Cauchy random variables is equal to the sum of their respective strengths (Item 27).
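    A toy illustration of (317) and of the successive-refinement remark above: refining from distortion D1 to D2 < D1 costs exactly R(D2) − R(D1) = log(D1/D2), the rate–distortion function of a Cauchy source of strength D1 at distortion D2. The numbers below are arbitrary assumptions.

    import numpy as np

    def R(D, s_x):
        # rate-distortion function (317), in nats
        return np.log(s_x / D) if 0 < D < s_x else 0.0

    s_x, D1, D2 = 4.0, 1.0, 0.25                 # assumed source strength and distortions
    print(R(D1, s_x))                            # first-stage description rate
    print(R(D2, s_x) - R(D1, s_x))               # incremental rate of the refinement
    print(R(D2, s_x=D1))                         # = log(D1/D2): same as describing a
                                                 #   Cauchy "residual" of strength D1 at D2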

  • 112.

    The continuity of mutual information can be shown under the following sufficient conditions

    Theorem 17.

    Suppose that $X_n$ is a sequence of real-valued random variables that vanishes in strength, $Z$ is independent of $X_n$, $h(Z)>-\infty$ and $0<\varsigma(Z)<\infty$. Then,
    $\lim_{n\to\infty}I(X_n;X_n+Z)=0.$ (328)

    Proof. 

    Under the assumptions, h(Z)R. Therefore, I(Xn;Xn+Z)=h(Xn+Z)h(Z), and (328) follows from Theorem 1-(m).    □
  • 113.
    The assumption $h(Z)>-\infty$ is not superfluous for the validity of Theorem 17, even though it was not needed in Theorem 1-(m). Suppose that $Z$ is integer valued, and $X_n=(nL)^{-1}\in(0,\tfrac12)$, where $L\in\{2,3,\ldots\}$ has probability mass function
    $P_L(k)=\frac{0.986551\ldots}{k\log_2^2 k},\quad k=2,3,\ldots$ (329)

    Then, $I(X_n;X_n+Z)=H(X_n)=H(L)=\infty$, while $\mathbb{E}[|X_n|]=\frac{0.328289\ldots}{n}$, and therefore, $\varsigma(X_n)\to 0$.

  • 114.
    In the case in which Vn and Wn are standard spherical multivariate Cauchy random variables with densities in (6), it follows from (7) that λXVn+λWWn has the same distribution as (|λX|+|λW|)Vn. Therefore,
    IVn;λXVn+λWWn=hλXVn+λWWnhλWWn (330)
    =nlog1+|λX||λW|, (331)
    where we have used the scaling law h(αXn)=nlog|α|+h(Xn). There is no possibility of a Cauchy-counterpart of the celebrated log-determinant formula for additive Gaussian vectors (e.g., Theorem 9.2.1 in [41]) because, as pointed out in Item 7, Λ12Vn+Λ¯12Wn is not distributed according to the ellipsoidal density in (8) unless Λ and Λ¯ are proportional, in which case the setup reverts to that in (330).
  • 115.
    To conclude this section, we leave aside additive noise models and consider the mutual information between a partition of the components of the standard spherical multivariate Cauchy density (6). If $\mathcal{I}\cap\mathcal{J}=\emptyset$, then (17) yields
    $I\big(\{V_i,\,i\in\mathcal{I}\};\{V_j,\,j\in\mathcal{J}\}\big)=h_{|\mathcal{I}|}+h_{|\mathcal{J}|}-h_{|\mathcal{I}|+|\mathcal{J}|},$ (332)
    where $h_n$ stands for the right side of (17). For example, if $i\ne j$, then, in nats,
    $I(V_i;V_j)=2h(V_1)-h(V_1,V_2)$ (333)
    $=2\log_e(4\pi)-\frac32\Big(\log_e(4\pi)+\gamma+\psi\big(\tfrac32\big)\Big)+\log_e\Gamma\big(\tfrac32\big)$ (334)
    $=\log_e(8\pi)-3=0.22417\ldots$ (335)
    More generally, the dependence index among the $n$ random variables in the standard spherical multivariate Cauchy density is (see also [9,53]), in nats,
    $D(P_{V^n}\,\|\,P_{V_1}\times\cdots\times P_{V_n})=n\,h(V_1)-h(V^n)$ (336)
    $=\frac{n-1}{2}\log_e(4\pi)+\log_e\Gamma\Big(\frac{n+1}{2}\Big)-\frac{n+1}{2}\Big(\gamma+\psi\Big(\frac{n+1}{2}\Big)\Big)$ (337)
    $=\begin{cases}\frac{n}{2}\log_e(8\pi)+\sum_{k=1}^{n/2}\Big(\log_e(2k-1)-\frac{n+1}{2k-1}\Big), & n\ \mathrm{even};\\[4pt]\frac{n-1}{2}\log_e(4\pi)+\sum_{k=1}^{(n-1)/2}\Big(\log_e k-\frac{n+1}{2k}\Big), & n\ \mathrm{odd}.\end{cases}$ (338)
  • 116.
    The shared information of n random variables is a generalization of mutual information introduced in [54] for deriving the fundamental limit of interactive data exchange among agents who have access to the individual components and establish a dialog to ensure that all of them find out the value of the random vector. The shared information of Xn is defined as
    S(Xn)=minΠ1|Π|1DPXn=1|Π|PX(I), (339)
    where X(J)={Xi,iJ}, with JI={1,,n}, and the minimum is over all partitions of I:
    Π={I,=1,,|Π|},with=1|Π|I=I,IIj=,j,
    such that |Π|>1. If we divide (338) by n1, we obtain the shared information of n random variables distributed according to the standard spherical multivariate Cauchy model. This is a consequence of the following result, which is of independent interest.

    Theorem 18.

    If Xn are exchangeable random variables, any subset of which have finite differential entropy, then for any partition Π of {1,,n},
    1|Π|1DPXn=1|Π|PX(I)1n1D(PXnPX1××PXn). (340)

    Proof. 

    Fix any partition Π with |Π|=L{2,,n1} chunks. Denote by n the number of chunks in Π with cardinality {1,,n1}. Therefore,
    =1n1n=L,and=1n1n=n. (341)
    By exchangeability, any chunk of cardinality k has the same differential entropy, which we denote by hk. Then,
    DPXn=1|Π|PX(I)=hn+=1n1nh, (342)
    and the difference of the left minus the right sides of (340) multiplied by (n1)(L1) is readily seen to equal
    (n1)hn+(n1)=1n1nh+(L1)hn(L1)nh1
    =(n1)n1n(L1)h1+(Ln)hn+(n1)=2n1nh (343)
    (n1)n1n(L1)+=2n1(n)nh1+Ln+=2n1(1)nhn (344)
    =0 (345)
    where
    • (344) ⟸ for all {2,,n1},
      h1n1hn+nn1h1, (346)
      since h1,,hn is a concave sequence, i.e., 2hkhk1+hk+1 as a result of the sub-modularity of differential entropy.
    • (345) ⟸ (341).
         □

    Naturally, the same proof applies to n discrete exchangeable random variables with finite joint entropy.

17. Outlook

  • 117.
    We have seen that a number of key information theoretic properties pertaining to the Gaussian law are also satisfied in the Cauchy case. Conceptually, those extensions shed light on the underlying reason the conventional Gaussian results hold. Naturally, we would like to explore how far beyond the Cauchy law those results can be expanded. As far as the maximization of differential entropy is concerned, the essential step is to redefine strength tailoring it to the desired law: Fix a reference random variable W with probability density function fW and finite differential entropy h(W)R, and define the W-strength of a real valued random variable Z as
    $\varsigma_W(Z)=\inf\left\{\varsigma>0:\ \mathbb{E}\left[-\log f_W\!\left(\frac{Z}{\varsigma}\right)\right]\le h(W)\right\}.$ (347)
    For example,
    • (a)
      For α>0, ςW(αW)=α;
    • (b)
      if W is standard normal, then ςW2(Z)=E[Z2];
    • (c)
      if V is standard Cauchy, then ςV(Z)=ς(Z);
    • (d)
      if W is standard exponential, then ςW(Z)=E[Z] if Z0 a.s., otherwise, ςW(Z)=;
    • (e)
      if W is standard (μ=1) Subbotin (108) with p>0, then, ςWp(Z)=E[|Z|p];
    • (f)
      if W has the Rider distribution in (9), then ςW(Z)=ςρ,θ(Z) defined in (126) for θ chosen as in (110);
    • (g)
      if W is uniformly distributed on [1,1], ςW(Z)=esssup|Z|;
    • (h)
      if W is standard Rayleigh, then $\varsigma_W(Z)=\inf\left\{\varsigma>0:\ \mathbb{E}\left[\frac{Z^2}{\varsigma^2}-\log_e\frac{Z^2}{2\varsigma^2}\right]\le 2+\gamma\right\}$ if $Z\ge 0$ a.s., otherwise, $\varsigma_W(Z)=\infty$.
    The pivotal Theorems 3 and 4 admit the following generalization.

    Theorem 19.

    Suppose $h(W)\in\mathbb{R}$ and $\varsigma>0$. Then,
    $\max_{Z:\ \varsigma_W(Z)\le\varsigma}h(Z)=h(W)+\log\varsigma.$ (348)

    Proof. 

    Fix any $Z$ in the feasible set. For any $\sigma\ge\varsigma_W(Z)$ such that $\mathbb{E}\left[-\log f_W\!\left(\frac{Z}{\sigma}\right)\right]\le h(W)$, we have
    $0\le D(\sigma^{-1}Z\,\|\,W)=-h(Z)+\log\sigma-\mathbb{E}\left[\log f_W\!\left(\frac{Z}{\sigma}\right)\right]$ (349)
    $\le-h(Z)+\log\sigma+h(W).$ (350)
    Therefore, $h(Z)\le h(W)+\log\varsigma_W(Z)$, by definition of $\varsigma_W(Z)$, thereby establishing $\le$ in (348). Equality holds since $\varsigma_W(\varsigma W)=\varsigma$.    □
    A corollary to Theorem 19 is a very general form of the Shannon lower bound for the rate–distortion function of a memoryless source Z such that the distortion is constrained to have W-strength not higher than D, namely,
    $R(D)\ge h(Z)-h(W)-\log D.$ (351)
    Theorem 19 finds an immediate extension to the multivariate case
    $\max_{Z^n:\ \varsigma_{W^n}(Z^n)\le\varsigma}h(Z^n)=h(W^n)+n\log\varsigma,$ (352)
    where, for $W^n$ with $h(W^n)\in\mathbb{R}$, we have defined
    $\varsigma_{W^n}(Z^n)=\inf\left\{\varsigma>0:\ \mathbb{E}\left[-\log f_{W^n}\big(\varsigma^{-1}Z^n\big)\right]\le h(W^n)\right\}.$ (353)

    For example, if $W^n$ is zero-mean multivariate Gaussian with positive definite covariance $\Sigma$, then $\varsigma_{W^n}^2(Z^n)=\frac1n\,\mathbb{E}\big[(Z^n)^{\top}\Sigma^{-1}Z^n\big]$.
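    The W-strength (347) is straightforward to evaluate numerically. The sketch below approximates the expectation by a sample average (the test random variable, sample size and helper names are assumptions for illustration) and recovers example (b) for a standard normal reference, as well as the Cauchy strength for a standard Cauchy reference.

    import numpy as np
    from scipy.optimize import brentq

    def w_strength(z_samples, neg_log_fw, h_w):
        # sigma_W(Z) = inf{ s > 0 : E[-log f_W(Z/s)] <= h(W) }, cf. (347)
        g = lambda s: np.mean(neg_log_fw(z_samples / s)) - h_w
        return brentq(g, 1e-6, 1e6)

    rng = np.random.default_rng(1)
    z = rng.uniform(-3.0, 3.0, 10**6)          # an arbitrary test random variable

    # W standard normal: -log f_W(x) = 0.5 log(2 pi) + x^2/2, h(W) = 0.5 log(2 pi e)
    neg_log_normal = lambda x: 0.5 * np.log(2 * np.pi) + 0.5 * x**2
    print(w_strength(z, neg_log_normal, 0.5 * np.log(2 * np.pi * np.e)))
    print(np.sqrt(np.mean(z**2)))              # ~ sqrt(3); matches example (b)

    # W standard Cauchy: -log f_W(x) = log(pi) + log(1 + x^2), h(W) = log(4 pi)
    neg_log_cauchy = lambda x: np.log(np.pi) + np.log1p(x**2)
    print(w_strength(z, neg_log_cauchy, np.log(4 * np.pi)))   # the Cauchy strength of Z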

  • 118.

    One aspect in which we have shown that Cauchy distributions lend themselves to simplification unavailable in the Gaussian case is the single-parametrization of their likelihood ratio, which paves the way for a slew of closed-form expressions for f-divergences and Rényi divergences. It would be interesting to identify other multiparameter (even just scale/location) families of distributions that enjoy the same property. To that end, it is natural, though by no means hopeful, to study various generalizations of the Cauchy distribution such as the Student-t random variable, or more generally, the Rider distribution in (9). The information theoretic study of general stable distributions is hampered by the fact that they are characterized by their characteristic functions (e.g., p. 164 in [55]), which so far, have not lent themselves to the determination of relative entropy or even differential entropy.

  • 119.

    Although we cannot expect that the cornucopia of information theoretic results in the Gaussian case can be extended to other domains, we have been able to show that a number of those results do find counterparts in the Cauchy case. Nevertheless, much remains to be explored. To name a few,

    • (a)
      The concavity of the entropy-strength NC(X+tV)—a counterpart of Costa’s entropy power inequality [40] would guarantee the least favorability of Cauchy noise among all strength-constrained noises as well as the entropy strength inequality
      NC(X+tV)NC(tV)+NC(X). (354)
    • (b)

      Information theoretic analyses quantifying the approach to normality in the central limit theorem are well-known (e.g., [56,57,58]). It would be interesting to explore the decrease in the relative entropy (relative to the Cauchy law) of independent sums distributed according to a law in the domain of attraction of the Cauchy distribution [55].

    • (c)

      Since de Bruijn’s identity is one of the ancestors of the i-mmse formula of [59], and we now have a counterpart of de Bruijn’s identity for convolutions with scaled Cauchy, it is natural to wonder if there may be some sort of integral representation of the mutual information between a random variable and its noisy version contaminated by additive Cauchy noise. In this respect, note that counterparts for the i-mmse formula for models other than additive Gaussian noise have been found in [60,61,62].

    • (d)

      Mutual information is robust against the addition of small non-Gaussian contamination in the sense that its effects are the same as if it were Gaussian [63]. The proof methods rely on Taylor series expansions that require the existence of moments. Any Cauchy counterparts (recall Item 77) would require substantially different methods.

    • (e)

      Pinsker [41] showed that Gaussian processes are information stable imposing only very mild assumptions. The key is that, modulo a factor, the variance of the information density is upper bounded by its mean, the mutual information. Does the spherical multivariate Cauchy distribution enjoy similar properties?

  • 120.

    Although not surveyed here, there are indeed a number of results in the engineering literature advocating Cauchy models in certain heavy-tailed infinite-variance scenarios (see, e.g., [45] and the references therein). At the end, either we abide by the information theoretic maxim that “there is nothing more practical than a beautiful formula”, or we pay heed to Poisson, who, after pointing out in [64] that Laplace’s proof of the central limit theorem broke down for what we now refer to as the Cauchy law, remarked that “Mais nous ne tiendrons pas compte de ce cas particulier, qu’il nous suffira d’avoir remarqué à cause de sa singularité, et qui ne se rencontre sans doute pas dans la pratique”.

Appendix A. Definite Integrals

$\int_0^x\frac{1}{1+t^2}\,dt=\arctan(x),$ (A1)
$\int_{-\frac12}^{\frac12}\log\cos(\pi t)\,dt=\log\frac12,$ (A2)
$\int_{-\infty}^{\infty}\frac{\log(1+t^2)}{1+t^2}\,dt=\pi\log 4,$ (A3)
$\int_{-\infty}^{\infty}\frac{\log(\alpha^2-2\alpha t\cos\beta+t^2)}{1+t^2}\,dt=\pi\log\big(1+\alpha^2+2\alpha|\sin\beta|\big),$ (A4)
$\int_{-\infty}^{\infty}\frac{\log(1+t^2)}{1+(\xi t-\kappa)^2}\,dt=\frac{\pi}{\xi}\Big(\log\big(\kappa^2+(\xi+1)^2\big)-2\log\xi\Big),\quad\xi>0,$ (A5)
$\kappa_{\beta,\rho}\int_{-\infty}^{\infty}\frac{\log_e(1+|t|^{\rho})}{(1+|t|^{\rho})^{\beta}}\,dt=\psi(\beta)-\psi\Big(\beta-\frac1\rho\Big),\quad\beta\rho>1,$ (A6)
$\int_{-\infty}^{\infty}\frac{\log_e(1+\theta^2t^2)}{(1+t^2)^2}\,dt=\pi\Big(\log_e(1+|\theta|)-\frac{|\theta|}{1+|\theta|}\Big),$ (A7)
$\int_{-\alpha}^{\alpha}\log_e(t^2+\varsigma^2)\,dt=4\varsigma\arctan\frac{\alpha}{\varsigma}-4\alpha+2\alpha\log_e(\alpha^2+\varsigma^2),$ (A8)
$\int_{-\infty}^{\infty}\frac{t^2}{(1+t^2)^2}\,dt=\frac{\pi}{2},$ (A9)
$\int_{-\infty}^{\infty}\frac{1}{(1+t^2)^2}\,dt=\frac{\pi}{2},$ (A10)
$\int_{-\infty}^{\infty}\frac{t^2}{(1+t^2)^3}\,dt=\frac{\pi}{8},$ (A11)
$\int_{-\infty}^{\infty}\frac{1}{(\beta^2+t^2)^{\nu}}\,dt=\sqrt{\pi}\,\beta^{1-2\nu}\,\frac{\Gamma\big(\nu-\frac12\big)}{\Gamma(\nu)},\quad\nu>\frac12,$ (A12)
$\int_0^{\infty}\frac{1}{(1+t^{\rho})^{\nu}}\,dt=\frac{\Gamma\big(\nu-\frac1\rho\big)\Gamma\big(1+\frac1\rho\big)}{\Gamma(\nu)},\quad\nu>\frac1\rho>0,$ (A13)
$\int_0^{\pi}\log\big(\alpha+\beta\cos\theta\big)\,d\theta=\pi\log\frac{\alpha+\sqrt{\alpha^2-\beta^2}}{2},\quad\alpha\ge|\beta|>0,$ (A14)
$\int_0^{\pi}\Big(\beta+\sqrt{\beta^2-1}\,\cos\theta\Big)^{\alpha}\,d\theta=\pi P_{\alpha}(\beta),\quad\beta>0,$ (A15)
$\int_0^{\infty}\frac{dt}{\sqrt{(1+t^2)(\beta^2+t^2)}}=K\Big(\sqrt{1-\beta^2}\Big),\quad\beta\in(0,1),$ (A16)

where

  • (A2) is a special case of 4.384.21 in [24];

  • (A3) is a special case of (A4);

  • (A4) is 4.296.2 in [24];

  • (A5) follows from (A4) by change of variable;

  • (A6), with κβ,ρ defined in (10) and ψ(·) denoting the digamma function, follows from 4.256 in [24] by change of variable x=(1+tp)12n and n=mp;

  • (A7) is a special case of 4.295.25 in [24];

  • (A8) follows from 2.733.1 in [24];

  • (A9)–(A10) follow from 3.252.6 in [24];

  • (A11) can be obtained by integration by parts and (A10);

  • (A12), with Γ(·) denoting the gamma function, is a special case of 3.251.11 in [24];

  • (A13) can be obtained from 3.251.11 in [24] by change of variable;

  • (A14) is 4.224.9 in [24];

  • (A15) is 8.822.1 in [24] with Pα(x) the Legendre function of the first kind, which is a solution to
    $\frac{d}{dx}\Big[(1-x^2)\frac{du(x)}{dx}\Big]+\alpha(\alpha+1)\,u(x)=0;$ (A17)
  • (A16) is a special case of 3.152.1 in [24] with the complete elliptic integral of the first kind defined as 8.112.1 in [24], namely,
    $K(k)=\int_0^{\frac{\pi}{2}}\frac{d\alpha}{\sqrt{1-k^2\sin^2\alpha}},\quad|k|<1.$ (A18)
    Note that Mathematica defines the complete elliptic integral function EllipticK such that
    $K(k)=\frac{1}{\sqrt{1-k^2}}\,\mathrm{EllipticK}\!\left[-\frac{k^2}{1-k^2}\right],\quad|k|<1.$ (A19)

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

Funding Statement

This research received no external funding.

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

References

  • 1.Shannon C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948;27:379–423, 623–656. doi: 10.1002/j.1538-7305.1948.tb01338.x. [DOI] [Google Scholar]
  • 2.Verdú S. The exponential distribution in information theory. Probl. Inf. Transm. 1996;32:86–95. [Google Scholar]
  • 3.Anantharam V., Verdú S. Bits through queues. IEEE Trans. Inf. Theory. 1996;42:4–18. doi: 10.1109/18.481773. [DOI] [Google Scholar]
  • 4.Stam A. Some inequalities satisfied by the quantities of information of Fisher and Shannon. Inf. Control. 1959;2:101–112. doi: 10.1016/S0019-9958(59)90348-1. [DOI] [Google Scholar]
  • 5.Ferguson T.S. A representation of the symmetric bivariate Cauchy distribution. Ann. Math. Stat. 1962;33:1256–1266. doi: 10.1214/aoms/1177704357. [DOI] [Google Scholar]
  • 6.Fang K.T., Kotz S., Ng K.W. Symmetric Multivariate and Related Distributions. CRC Press; Boca Raton, FL, USA: 2018. [Google Scholar]
  • 7.Rider P.R. Generalized Cauchy distributions. Ann. Inst. Stat. Math. 1958;9:215–223. doi: 10.1007/BF02892507. [DOI] [Google Scholar]
  • 8.Bouhlel N., Rousseau D. A generic formula and some special cases for the Kullback–Leibler divergence between central multivariate Cauchy distributions. Entropy. 2022;24:838. doi: 10.3390/e24060838. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Abe S., Rajagopal A.K. Information theoretic approach to statistical properties of multivariate Cauchy-Lorentz distributions. J. Phys. A Math. Gen. 2001;34:8727–8731. doi: 10.1088/0305-4470/34/42/301. [DOI] [Google Scholar]
  • 10.Tulino A.M., Verdú S. Random matrix theory and wireless communications. Found. Trends Commun. Inf. Theory. 2004;1:1–182. doi: 10.1561/0100000001. [DOI] [Google Scholar]
  • 11.Widder D.V. The Stieltjes transform. Trans. Am. Math. Soc. 1938;43:7–60. doi: 10.1090/S0002-9947-1938-1501933-2. [DOI] [Google Scholar]
  • 12.Kullback S. Information Theory and Statistics. Dover; New York, NY, USA: 1968. Originally published in 1959 by JohnWiley. [Google Scholar]
  • 13.Wu Y., Verdú S. Rényi information dimension: Fundamental limits of almost lossless analog compression. IEEE Trans. Inf. Theory. 2010;56:3721–3747. doi: 10.1109/TIT.2010.2050803. [DOI] [Google Scholar]
  • 14.Donsker M.D., Varadhan S.R.S. Asymptotic evaluation of certain Markov process expectations for large time, I. Commun. Pure Appl. Math. 1975;28:1–47. doi: 10.1002/cpa.3160280102. [DOI] [Google Scholar]
  • 15.Donsker M.D., Varadhan S.R.S. Asymptotic evaluation of certain Markov process expectations for large time, III. Commun. Pure Appl. Math. 1977;29:369–461. doi: 10.1002/cpa.3160290405. [DOI] [Google Scholar]
  • 16.Lapidoth A., Moser S.M. Capacity bounds via duality with applications to multiple-antenna systems on flat-fading channels. IEEE Trans. Inf. Theory. 2003;49:2426–2467. doi: 10.1109/TIT.2003.817449. [DOI] [Google Scholar]
  • 17.Subbotin M.T. On the law of frequency of error. Mat. Sb. 1923;31:296–301. [Google Scholar]
  • 18.Kapur J.N. Maximum-Entropy Models in Science and Engineering. Wiley-Eastern; New Delhi, India: 1989. [Google Scholar]
  • 19.Cover T.M., Thomas J.A. Elements of Information Theory. 2nd ed. Wiley; New York, NY, USA: 2006. [Google Scholar]
  • 20.Dembo A., Cover T.M., Thomas J.A. Information theoretic inequalities. IEEE Trans. Inf. Theory. 1991;37:1501–1518. doi: 10.1109/18.104312. [DOI] [Google Scholar]
  • 21.Han T.S. Information Spectrum Methods in Information Theory. Springer; Heidelberg, Germany: 2003. [Google Scholar]
  • 22.Vajda I. Theory of Statistical Inference and Information. Kluwer; Dordrecht, The Netherlands: 1989. [Google Scholar]
  • 23.Deza E., Deza M.M. Dictionary of Distances. Elsevier; Amsterdam, The Netherlands: 2006. [Google Scholar]
  • 24.Gradshteyn I.S., Ryzhik I.M. Table of Integrals, Series, and Products. 7th ed. Academic Press; Burlington, MA, USA: 2007. [Google Scholar]
  • 25.Sason I., Verdú S. f-divergence inequalities. IEEE Trans. Inf. Theory. 2016;62:5973–6006. doi: 10.1109/TIT.2016.2603151. [DOI] [Google Scholar]
  • 26.Nielsen F., Okamura K. On f-divergences between Cauchy distributions; Proceedings of the International Conference on Geometric Science of Information; Paris, France. 21–23 July 2021; pp. 799–807. [Google Scholar]
  • 27.Eaton M.L. Proceedings of the Regional Conference Series in Probability and Statistics. Volume 1 Institute of Mathematical Statistics; Hayward, CA, USA: 1989. Group Invariance Applications in Statistics. [Google Scholar]
  • 28.McCullagh P. On the distribution of the Cauchy maximum-likelihood estimator. Proc. R. Soc. London. Ser. A Math. Phys. Sci. 1993;440:475–479. [Google Scholar]
  • 29.Verdú S. On channel capacity per unit cost. IEEE Trans. Inf. Theory. 1990;36:1019–1030. doi: 10.1109/18.57201. [DOI] [Google Scholar]
  • 30.Chyzak F., Nielsen F. A closed-form formula for the Kullback–Leibler divergence between Cauchy distributions. arXiv. 20191905.10965 [Google Scholar]
  • 31.Verdú S. Mismatched estimation and relative entropy. IEEE Trans. Inf. Theory. 2010;56:3712–3720. doi: 10.1109/TIT.2010.2050800. [DOI] [Google Scholar]
  • 32.Csiszár I. I-Divergence geometry of probability distributions and minimization problems. Ann. Probab. 1975;3:146–158. doi: 10.1214/aop/1176996454. [DOI] [Google Scholar]
  • 33.Sason I., Verdú S. Bounds among f-divergences. arXiv. 20151508.00335 [Google Scholar]
  • 34.Abramowitz M., Stegun I.A. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Volume 55 US Government Printing Office; Washington, DC, USA: 1964. [Google Scholar]
  • 35.Rényi A. On measures of information and entropy. In: Neyman J., editor. Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press; Berkeley, CA, USA: 1961. pp. 547–561. [Google Scholar]
  • 36.Gil M., Alajaji F., Linder T. Rényi divergence measures for commonly used univariate continuous distributions. Inf. Sci. 2013;249:124–131. doi: 10.1016/j.ins.2013.06.018. [DOI] [Google Scholar]
  • 37.González M. Elliptic integrals in terms of Legendre polynomials. Glasg. Math. J. 1954;2:97–99. doi: 10.1017/S2040618500033104. [DOI] [Google Scholar]
  • 38.Nielsen F. Revisiting Chernoff information with likelihood ratio exponential families. Entropy. 2022;24:1400. doi: 10.3390/e24101400. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Fisher R.A. Theory of statistical estimation. Math. Proc. Camb. Math. Soc. 1925;22:700–725. doi: 10.1017/S0305004100009580. [DOI] [Google Scholar]
  • 40.Costa M.H.M. A new entropy power inequality. IEEE Trans. Inf. Theory. 1985;31:751–760. doi: 10.1109/TIT.1985.1057105. [DOI] [Google Scholar]
  • 41.Pinsker M.S. Information and Information Stability of Random Variables and Processes. Holden-Day; San Francisco, CA, USA: 1964. Originally published in Russian in 1960. [Google Scholar]
  • 42.Kullback S., Leibler R.A. On information and sufficiency. Ann. Math. Stat. 1951;22:79–86. doi: 10.1214/aoms/1177729694. [DOI] [Google Scholar]
  • 43.Pinsker M.S. Calculation of the rate of message generation by a stationary random process and the capacity of a stationary channel. Dokl. Akad. Nauk. 1956;111:753–766. [Google Scholar]
  • 44.Ihara S. On the capacity of channels with additive non-Gaussian noise. Inf. Control. 1978;37:34–39. doi: 10.1016/S0019-9958(78)90413-8. [DOI] [Google Scholar]
  • 45.Fahs J., Abou-Faycal I.C. A Cauchy input achieves the capacity of a Cauchy channel under a logarithmic constraint; Proceedings of the 2014 IEEE International Symposium on Information Theory; Honolulu, HI, USA. 29 June–4 July 2014; pp. 3077–3081. [Google Scholar]
  • 46.Rioul O., Magossi J.C. On Shannon’s formula and Hartley’s rule: Beyond the mathematical coincidence. Entropy. 2014;16:4892–4910. doi: 10.3390/e16094892. [DOI] [Google Scholar]
  • 47.Dytso A., Egan M., Perlaza S., Poor H., Shamai S. Optimal inputs for some classes of degraded wiretap channels; Proceedings of the 2018 IEEE Information Theory Workshop; Guangzhou, China. 25–29 November 2018; pp. 1–7. [Google Scholar]
  • 48.Cover T.M. Some advances in broadcast channels. In: Viterbi A.J., editor. Advances in Communication Systems. Volume 4. Academic Press; New York, NY, USA: 1975. pp. 229–260. [Google Scholar]
  • 49.Wyner A.D. Recent results in the Shannon theory. IEEE Trans. Inf. Theory. 1974;20:2–9. doi: 10.1109/TIT.1974.1055171. [DOI] [Google Scholar]
  • 50.Berger T. Rate Distortion Theory. Prentice-Hall; Englewood Cliffs, NJ, USA: 1971. [Google Scholar]
  • 51.Koshelev V.N. Estimation of mean error for a discrete successive approximation scheme. Probl. Inf. Transm. 1981;17:20–33. [Google Scholar]
  • 52.Equitz W.H.R., Cover T.M. Successive refinement of information. IEEE Trans. Inf. Theory. 1991;37:269–274. doi: 10.1109/18.75242. [DOI] [Google Scholar]
  • 53.Kotz S., Nadarajah S. Multivariate t-Distributions and Their Applications. Cambridge University Press; Cambridge, UK: 2004. [Google Scholar]
  • 54.Csiszár I., Narayan P. The secret key capacity of multiple terminals. IEEE Trans. Inf. Theory. 2004;50:3047–3061. doi: 10.1109/TIT.2004.838380. [DOI] [Google Scholar]
  • 55.Kolmogorov A.N., Gnedenko B.V. Limit Distributions for Sums of Independent Random Variables. Addison-Wesley; Reading, MA, USA: 1954. [Google Scholar]
  • 56.Barron A.R. Entropy and the central limit theorem. Ann. Probab. 1986;14:336–342. doi: 10.1214/aop/1176992632. [DOI] [Google Scholar]
  • 57.Artstein S., Ball K., Barthe F., Naor A. Solution of Shannon’s problem on the monotonicity of entropy. J. Am. Math. Soc. 2004;17:975–982. doi: 10.1090/S0894-0347-04-00459-X. [DOI] [Google Scholar]
  • 58.Tulino A.M., Verdú S. Monotonic decrease of the non-Gaussianness of the sum of independent random variables: A simple proof. IEEE Trans. Inf. Theory. 2006;52:4295–4297. doi: 10.1109/TIT.2006.880066. [DOI] [Google Scholar]
  • 59.Guo D., Shamai S., Verdú S. Mutual information and minimum mean–square error in Gaussian channels. IEEE Trans. Inf. Theory. 2005;51:1261–1282. doi: 10.1109/TIT.2005.844072. [DOI] [Google Scholar]
  • 60.Guo D., Shamai S., Verdú S. Mutual information and conditional mean estimation in Poisson channels. IEEE Trans. Inf. Theory. 2008;54:1837–1849. doi: 10.1109/TIT.2008.920206. [DOI] [Google Scholar]
  • 61.Jiao J., Venkat K., Weissman T. Relations between information and estimation in discrete-time Lévy channels. IEEE Trans. Inf. Theory. 2017;63:3579–3594. doi: 10.1109/TIT.2017.2692211. [DOI] [Google Scholar]
  • 62.Arras B., Swan Y. IT formulae for gamma target: Mutual information and relative entropy. IEEE Trans. Inf. Theory. 2018;64:1083–1091. doi: 10.1109/TIT.2017.2759279. [DOI] [Google Scholar]
  • 63.Pinsker M.S., Prelov V., Verdú S. Sensitivity of channel capacity. IEEE Trans. Inf. Theory. 1995;41:1877–1888. doi: 10.1109/18.476313. [DOI] [Google Scholar]
  • 64.Poisson S.D. Connaisance des Tems, ou des Mouvemens Célestes a l’usage des Astronomes, et des Navigateurs, pour l’an 1827. Bureau des longitudes; Paris, France: 1824. Sur la probabilité des résultats moyens des observations; pp. 273–302. [Google Scholar]
