Abstract
The Gaussian law reigns supreme in the information theory of analog random variables. This paper showcases a number of information theoretic results which find elegant counterparts for Cauchy distributions. New concepts such as that of equivalent pairs of probability measures and the strength of real-valued random variables are introduced here and shown to be of particular relevance to Cauchy distributions.
Keywords: information measures, Cauchy distribution, relative entropy, Kullback–Leibler divergence, differential entropy, Fisher’s information, entropy power inequality, f-divergence, Rényi divergence, mutual information, data transmission, lossy data compression
1. Introduction
Since the inception of information theory [1], the Gaussian distribution has emerged as the paramount example of a continuous random variable leading to closed-form expressions for information measures and extremality properties possessing great pedagogical value. In addition, the role of the Gaussian distribution as a ubiquitous model for analog information sources and for additive thermal noise has elevated the corresponding formulas for rate–distortion functions and capacity–cost functions to iconic status in information theory. Beyond discrete random variables, by and large, information theory textbooks confine their coverage and examples to Gaussian random variables.
The exponential distribution has also been shown [2] to lead to closed-form formulas for various information measures such as differential entropy, mutual information and relative entropy, as well as rate–distortion functions for Markov processes and the capacity of continuous-time timing channels with memory such as the exponential-server queue [3].
Despite its lack of moments, the Cauchy distribution also leads to pedagogically attractive closed-form expressions for various information measures. In addition to showcasing those, we introduce an attribute, which we refer to as the strength of a real-valued random variable, under which the Cauchy distribution is shown to possess optimality properties. Along with the stability of the Cauchy law, those properties result in various counterparts to the celebrated fundamental limits for memoryless Gaussian sources and channels.
To enhance readability and ease of reference, the rest of this work is organized in 120 items grouped into 17 sections, plus an appendix.
Section 2 presents the family of Cauchy random variables and their basic properties as well as multivariate generalizations, and the Rider univariate density which includes the Cauchy density as a special case and finds various information theoretic applications.
Section 3 gives closed-form expressions for the differential entropies of the univariate and multivariate densities covered in Section 2.
Introduced previously for unrelated purposes, the Shannon and η-transforms reviewed in Section 4 prove useful for deriving several information theoretic results for Cauchy and related laws.
Applicable to any real-valued random variable and inspired by information theory, the central notion of strength is introduced in Section 5 along with its major properties. In particular, it is shown that convergence in strength is an intermediate criterion between convergence in probability and convergence in $L_p$, and that differential entropy is continuous with respect to the addition of independent vanishing-strength noise.
Section 6 shows that the maximal differential entropy density under an average cost constraint of the appropriate form can be obtained in closed form, but its shape (not just its scale) depends on the value of the constraint. In particular, the Cauchy density is the solution only for one specific value of the constraint. In contrast, we show that, among all the random variables with a given strength, the centered Cauchy density has maximal differential entropy, regardless of the value of the constraint. This result suggests the definition of the entropy strength of Z as the strength of a Cauchy random variable whose differential entropy is the same as that of Z. Modulo a constant factor, entropy power is the square of entropy strength. Section 6 also gives a maximal differential entropy characterization of the standard spherical Cauchy multivariate density.
Information theoretic terminology for the logarithm of the Radon–Nikodym derivative, as well as its distribution, the relative information spectrum, is given in Section 7. The relative information spectrum for Cauchy distributions is found and shown to depend on their locations and scales through a single scalar. This is a rare property, not satisfied by most common families such as Gaussian, exponential, Laplace, etc. Section 8 introduces the notion of equivalent pairs of probability measures, which plays an important role not only in information theory but also in statistical inference. Distinguishing $P_1$ from $Q_1$ has the same fundamental limits as distinguishing $P_2$ from $Q_2$ if $(P_1, Q_1)$ and $(P_2, Q_2)$ are equivalent pairs. Section 9 studies the interplay between f-divergences and equivalent pairs. A simple formula for the f-divergence between Cauchy distributions results from the explicit expression for the relative information spectrum found in Section 7. These results are then used to easily derive a host of explicit expressions for χ²-divergence, relative entropy, total variation distance, Hellinger divergence and Rényi divergence in Section 10, Section 11, Section 12, Section 13 and Section 14, respectively.
In addition to the Fisher information matrix of the Cauchy family, Section 15 finds a counterpart of de Bruijn’s identity [4] for convolutions with scaled Cauchy random variables, instead of convolutions with scaled Gaussian random variables as in the conventional setting.
Section 16 is devoted to mutual information. The mutual information between a Cauchy random variable and its noisy version contaminated by additive independent Cauchy noise exhibits a pleasing counterpart (modulo a factor of two) to the Gaussian case, in which the signal-to-noise ratio is now given by the ratio of strengths rather than variances. With Cauchy noise, Cauchy inputs maximize mutual information under an output strength constraint. The elementary fact that an output variance constraint translates directly into an input variance constraint does not carry over to input and output strengths, and indeed we identify non-Cauchy inputs that may achieve higher mutual information than a Cauchy input with the same strength. Section 16 also considers the dual setting in which the input is Cauchy, but the additive noise need not be. Lower bounds on the mutual information, attained by Cauchy noise, are offered. However, as the bounds do not depend exclusively on the noise strength, they do not rule out the possibility that a non-Cauchy noise with identical strength may be least favorable. If distortion is measured by strength, the rate–distortion function of a memoryless Cauchy source is shown to be (modulo a factor of two) the same as that of the memoryless Gaussian source with mean–square distortion, with the source variance replaced by its strength. Theorem 17 gives a very general continuity result for mutual information that encompasses previous such results. While convergence in probability to zero of the input to an additive-noise transformation does not imply vanishing input–output mutual information, convergence in strength does under very general conditions on the noise distribution.
Some concluding observations about generalizations and open problems are collected in Section 17, including a generalization of the notion of strength.
The definite integrals used in the main body are collected and justified in Appendix A.
2. The Cauchy Distribution and Generalizations
In probability theory, the Cauchy (also known as Lorentz and as Breit–Wigner) distribution is the prime example of a real-valued random variable none of whose moments of order one or higher exists, and as such it is not encompassed by either the law of large numbers or the central limit theorem.
-
A real-valued random variable V is said to be standard Cauchy if its probability density function is
$$f_V(x) = \frac{1}{\pi\,(1 + x^2)}, \quad x \in \mathbb{R}. \tag{1}$$
Furthermore, X is said to be Cauchy if there exist $\mu \in \mathbb{R}$ and $\lambda > 0$ such that $X = \mu + \lambda V$, in which case
$$f_X(x) = \frac{\lambda}{\pi\,\left(\lambda^2 + (x - \mu)^2\right)}, \tag{2}$$
where $\mu$ and $\lambda$ are referred to as the location (or median) and scale, respectively, of the Cauchy distribution. If $\mu = 0$, (2) is said to be centered Cauchy.
- Since $\mathbb{E}[|V|] = \infty$, the mean of a Cauchy random variable does not exist. Furthermore, $\mathbb{E}[|V|^p] = \infty$ for $p \geq 1$, and the moment generating function of V does not exist (except, trivially, at 0). The characteristic function of the standard Cauchy random variable is
$$\mathbb{E}\left[e^{\mathrm{i} t V}\right] = e^{-|t|}. \tag{3}$$
Using (3), we can verify that a Cauchy random variable has the curious property that adding an independent copy to it has the same effect, statistically speaking, as adding an identical copy. In addition to the Gaussian and Lévy distributions, the Cauchy distribution is stable: a linear combination of independent copies remains in the family; it is also infinitely divisible: it can be expressed as an n-fold convolution for any n. It follows from (3) that if $V_1, V_2, \ldots$ are independent, standard Cauchy, and $(a_1, a_2, \ldots)$ is a deterministic sequence with finite $\ell_1$-norm, then $\sum_i a_i V_i$ has the same distribution as $\left(\sum_i |a_i|\right) V$. In particular, the time average of independent identically distributed Cauchy random variables has the same distribution as any of the random variables. The families obtained by letting the location or the scale of (2) range over any interval are some of the simplest parametrized random variables that are not an exponential family.
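The stability property just described is easy to probe numerically. The following Python sketch (an illustration only; it assumes NumPy and SciPy are available) checks that the time average of n i.i.d. standard Cauchy samples is statistically indistinguishable from a single standard Cauchy variable.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m = 100, 200_000                       # n terms per average, m Monte Carlo replicas

# Time average of n i.i.d. standard Cauchy samples, repeated m times.
avg = rng.standard_cauchy((m, n)).mean(axis=1)

# The Kolmogorov-Smirnov distance to the standard Cauchy CDF is tiny and the
# p-value large, illustrating that the average is again standard Cauchy.
print(stats.kstest(avg, stats.cauchy.cdf))
```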
-
If U is uniformly distributed on $(0,1)$, then $\tan\!\left(\pi\left(U - \tfrac{1}{2}\right)\right)$ is standard Cauchy. This follows since, in view of (1) and (A1), the standard Cauchy cumulative distribution function is
$$F_V(x) = \frac{1}{2} + \frac{1}{\pi}\arctan(x). \tag{4}$$
Therefore, V has unit semi-interquartile length. The functional inverse of (4) is the standard Cauchy quantile function given by
$$Q_V(u) = \tan\!\left(\pi\left(u - \tfrac{1}{2}\right)\right), \quad u \in (0,1). \tag{5}$$
If $X_1$ and $X_2$ are standard Gaussian with correlation coefficient $\rho$, then $X_1/X_2$ is Cauchy with scale $\sqrt{1-\rho^2}$ and location $\rho$. This implies that the reciprocal of a standard Cauchy random variable is also standard Cauchy.
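Both representations in this item can be checked numerically. The sketch below (a hypothetical illustration; the correlated-Gaussian construction follows the location/scale parametrization stated above) samples a standard Cauchy variable via the quantile function (5) and via a recentered, rescaled ratio of correlated standard Gaussians.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
m = 200_000

# (i) Inverse-CDF sampling via the quantile function (5): tan(pi*(U - 1/2)).
u = rng.uniform(size=m)
v_quantile = np.tan(np.pi * (u - 0.5))

# (ii) Ratio of standard Gaussians with correlation rho, recentered by the
# location rho and rescaled by the scale sqrt(1 - rho^2) claimed above.
rho = 0.6
x2 = rng.standard_normal(m)
x1 = rho * x2 + np.sqrt(1 - rho**2) * rng.standard_normal(m)
v_ratio = (x1 / x2 - rho) / np.sqrt(1 - rho**2)

for sample in (v_quantile, v_ratio):
    print(stats.kstest(sample, stats.cauchy.cdf))   # both should pass
```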
- Taking the cue from the Gaussian case, we say that a random vector is multivariate Cauchy if any linear combination of its components has a Cauchy distribution. Necessary and sufficient conditions for a characteristic function to be that of a multivariate Cauchy were shown by Ferguson [5]. Unfortunately, no general expression is known for the corresponding probability density function. This accounts for the fact that one aspect, in which the Cauchy distribution does not quite reach the wealth of information theoretic results attainable with the Gaussian distribution, is in the study of multivariate models of dependent random variables. Nevertheless, special cases of the multivariate Cauchy distribution do admit some interesting information theoretic results as we will see below. The standard spherical multivariate Cauchy probability density function on $\mathbb{R}^n$ is (e.g., [6])
$$f_{\mathbf{V}}(\mathbf{x}) = \frac{\Gamma\left(\frac{n+1}{2}\right)}{\pi^{\frac{n+1}{2}}\left(1 + \|\mathbf{x}\|^2\right)^{\frac{n+1}{2}}}, \quad \mathbf{x} \in \mathbb{R}^n, \tag{6}$$
where $\Gamma(\cdot)$ is the Gamma function. Therefore, $V_1, \ldots, V_n$ are exchangeable random variables. If $W_0, W_1, \ldots, W_n$ are independent standard normal, then the vector $(W_1/W_0, \ldots, W_n/W_0)$ has the density in (6). With the aid of (A10), we can verify that any subset of k components of $\mathbf{V}$ is distributed according to the k-dimensional version of (6). In particular, the marginals of (6) are given by (1). Generalizing (3), the characteristic function of (6) is
$$\mathbb{E}\left[e^{\mathrm{i}\,\mathbf{t}^\top \mathbf{V}}\right] = e^{-\|\mathbf{t}\|}. \tag{7}$$
-
In parallel to Item 1, we may generalize (6) by dropping the restriction that it be centered at the origin and allowing ellipsoidal deformation, i.e., letting with and a positive definite matrix . Therefore,
(8) While is a Cauchy random variable for , (8) fails to encompass every multivariate Cauchy distribution—in particular, the important case of independent Cauchy random variables. Another reason the usefulness of the model in (8) is limited is that it is not closed under independent additions: if and are independent, each distributed according to (6); then, , while multivariate Cauchy, does not have a density of the type in (8) unless for some .
-
Another generalization of the (univariate) Cauchy distribution, which comes into play in our analysis, was introduced by Rider in 1958 [7]. With and ,
(9) (10) In addition to the parametrization in (9), we may introduce scale and location parameters by means of , just as we did in the Cauchy case . Another notable special case is , which is the centered Student-t random variable, itself equivalent to a Pearson type VII distribution.
3. Differential Entropy
-
9. The differential entropy of a Cauchy random variable with scale $\lambda$ is
$$h(X) = \int_{-\infty}^{\infty} f_X(x) \log \frac{1}{f_X(x)}\,\mathrm{d}x \tag{11}$$
$$= \log(4\pi\lambda), \tag{12}$$
using (A3). Throughout this paper, unless the logarithm base is explicitly shown, it can be chosen by the reader as long as it is the same on both sides of the equation. For natural logarithms, the information measure unit is the nat.
-
10. An alternative, sometimes advantageous, expression for the differential entropy of a real-valued random variable is feasible if its cumulative distribution function $F_X$ is continuous and strictly monotonic. Then, the quantile function $Q_X$ is its functional inverse, i.e., $F_X(Q_X(u)) = u$ for all $u \in (0,1)$, which implies that $f_X(Q_X(u))\, Q_X'(u) = 1$ for all $u \in (0,1)$. Moreover, since X and $Q_X(U)$ with U uniformly distributed on $(0,1)$ have identical distributions, we obtain
$$h(X) = \int_0^1 \log Q_X'(u)\,\mathrm{d}u. \tag{13}$$
Since (4) is indeed continuous and strictly monotonic, we can verify that we recover (12) by means of (5), (13) and (A2).
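The two routes to the differential entropy can be compared numerically. The following sketch (assuming the closed form log(4πλ) in (12) and the quantile representation (13)) evaluates h for an arbitrary Cauchy density by direct quadrature and by integrating the logarithm of the quantile derivative.

```python
import numpy as np
from scipy.integrate import quad

mu, lam = 1.7, 2.5                        # arbitrary location and scale

def f(x):                                 # Cauchy(mu, lam) density, Equation (2)
    return lam / (np.pi * (lam**2 + (x - mu)**2))

# Route 1: h(X) = -int f log f  (direct quadrature).
h_direct, _ = quad(lambda x: -f(x) * np.log(f(x)), -np.inf, np.inf)

# Route 2: h(X) = int_0^1 log Q'(u) du with Q(u) = mu + lam*tan(pi*(u - 1/2)),
# so that Q'(u) = lam*pi / cos^2(pi*(u - 1/2)).
h_quantile, _ = quad(lambda u: np.log(lam * np.pi / np.cos(np.pi * (u - 0.5))**2), 0, 1)

print(h_direct, h_quantile, np.log(4 * np.pi * lam))   # all three agree, in nats
```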
-
11. Despite not having finite moments, an independent identically distributed sequence $X_1, X_2, \ldots$ of Cauchy random variables is information stable in the sense that
$$\frac{1}{n} \sum_{i=1}^{n} \log \frac{1}{f_X(X_i)} \to h(X), \quad \text{a.s.}, \tag{14}$$
because of the strong law of large numbers.
-
12.With distributed according to the standard spherical multivariate Cauchy density in (6), it is shown in [8] that
where is the Euler–Mascheroni constant and is the digamma function. Therefore, the differential entropy of (6) is, in nats, (see also [9])(15) (16)
whose growth is essentially linear with n: the conditional differential entropy(17) is monotonically decreasing with(18) (19) -
13.By the scaling law of differential entropy and its invariance to location, we obtain
(20) - 14.
-
15. The Rényi differential entropy of order $\alpha \in (0,1) \cup (1,\infty)$ of an absolutely continuous random variable with probability density function $f_X$ is
$$h_\alpha(X) = \frac{1}{1-\alpha} \log \int_{-\infty}^{\infty} f_X^\alpha(x)\,\mathrm{d}x. \tag{23}$$
- 16.
4. The Shannon and η-Transforms
In this section, we recall the definitions of two notions introduced in [10] for the unrelated purpose of expressing the asymptotic singular value distribution of large random matrices.
-
17. The Shannon transform of a nonnegative random variable X is the function $\mathcal{V}_X \colon [0,\infty) \to [0,\infty]$, defined by
$$\mathcal{V}_X(\gamma) = \mathbb{E}\left[\log(1 + \gamma X)\right]. \tag{27}$$
Unless $\mathcal{V}_X(\gamma) = \infty$ for all $\gamma > 0$ (e.g., if X has the log-Cauchy density), or $\mathcal{V}_X(\gamma) = 0$ for all $\gamma \geq 0$ (which occurs if $X = 0$ a.s.), the Shannon transform is a strictly concave continuous function from $[0,\infty)$ to $[0,\infty)$, which grows without bound as $\gamma \to \infty$.
- 18.
- 19.
-
20. The η-transform of a non-negative random variable X is defined as the function
$$\eta_X(\gamma) = \mathbb{E}\left[\frac{1}{1 + \gamma X}\right], \quad \gamma \geq 0, \tag{31}$$
which is intimately related to the Cauchy–Stieltjes transform [11]. For example,
(32) (33)
5. Strength
The purpose of this section is to introduce an attribute which is particularly useful to compare random variables that do not have finite moments.
-
21. The strength of a real-valued random variable Z is defined as
$$s(Z) = \inf\left\{ \sigma > 0 \colon \mathbb{E}\left[\log\left(1 + \frac{Z^2}{\sigma^2}\right)\right] \leq \log 4 \right\}. \tag{34}$$
It follows that the only random variable with zero strength is $Z = 0$, almost surely. If the inequality in (34) is not satisfied for any $\sigma > 0$, then $s(Z) = \infty$. Otherwise, $s(Z)$ is the unique positive solution to
$$\mathbb{E}\left[\log\left(1 + \frac{Z^2}{s^2(Z)}\right)\right] = \log 4. \tag{35}$$
If $Z = 0$ almost surely, then (35) holds with ≤.
-
22. The set of probability measures whose strength is upper bounded by a given finite nonnegative constant $\lambda$,
$$\left\{ P_Z \colon s(Z) \leq \lambda \right\}, \tag{36}$$
is convex: The set is a singleton for $\lambda = 0$ as seen in Item 21, while, for $\lambda > 0$, we can express (36) as
$$\left\{ P_Z \colon \mathbb{E}\left[\log\left(1 + \frac{Z^2}{\lambda^2}\right)\right] \leq \log 4 \right\}, \tag{37}$$
whose defining constraint is linear in $P_Z$. Therefore, if $P_{Z_0}$ and $P_{Z_1}$ belong to (36), we must have $\alpha P_{Z_0} + (1-\alpha) P_{Z_1}$ in (36) for every $\alpha \in [0,1]$.
-
23. The peculiar constant in the definition of strength is chosen so that if V is standard Cauchy, then its strength is $s(V) = 1$ because, in view of (29),
$$\mathbb{E}\left[\log\left(1 + V^2\right)\right] = \log 4. \tag{38}$$
-
24. If $Z = c$, a.s., then its strength is
$$s(Z) = \frac{|c|}{\sqrt{3}}, \tag{39}$$
since $\log\left(1 + c^2/s^2\right) = \log 4$ requires $s^2 = c^2/3$.
-
25. The left side of (35) is the Shannon transform of $Z^2$ evaluated at $1/s^2(Z)$, which is continuous in its argument. If $s(Z) \in (0,\infty)$, then (35) can be written as
$$s^2(Z) = \frac{1}{\mathcal{V}_{Z^2}^{-1}(\log 4)}, \tag{40}$$
where, on the right side, we have denoted the functional inverse of the Shannon transform. Clearly, the square root of the right side of (40) cannot be expressed as the expectation with respect to Z of any function that does not depend on the distribution of Z. Nevertheless, thanks to (37), (36) can be expressed as
(41)
-
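Although the strength is not the expectation of any fixed cost function, it is easy to evaluate numerically. The sketch below (an illustration under the assumption that s(Z) solves E[log(1 + Z²/s²)] = log 4, as in (35)) computes the strength of an absolutely continuous random variable by root-finding; the standard Cauchy case returns 1, in agreement with Item 23.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

LOG4 = np.log(4.0)

def strength(pdf, lo=1e-6, hi=1e6):
    """Solve E[log(1 + Z^2/s^2)] = log 4 for s (assumed defining equation (35))."""
    def gap(s):
        val, _ = quad(lambda z: pdf(z) * np.log1p((z / s) ** 2), -np.inf, np.inf)
        return val - LOG4
    return brentq(gap, lo, hi)

cauchy = lambda z: 1.0 / (np.pi * (1.0 + z * z))                  # standard Cauchy
gauss  = lambda z: np.exp(-z * z / 2.0) / np.sqrt(2.0 * np.pi)    # standard normal

print(strength(cauchy))   # ~1.0, by the choice of the constant log 4 (Item 23)
print(strength(gauss))    # strictly smaller than 1: much lighter tails than Cauchy
```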
26.
Theorem 1.
The strength of a real-valued random variable satisfies the following properties:-
(a)
(42) -
(b)
(43) with equality if and only if is deterministic. -
(c) If , and , then
(44) -
(d) If V is standard Cauchy, independent of X, then is the solution to
if it exists, otherwise, . Moreover, ≤ holds in (45) if .(45) -
(e)
(46) -
(f) If , then
where V is standard Cauchy, and stands for the relative entropy with reference probability measure and dominated measure .(47) -
(g)
(48) -
(h) If V is standard Cauchy, then
(49) -
(i) The finiteness of strength is sufficient for the finiteness of the entropy of the integer part of the random variable, i.e.,
-
(j) If in for any , then .
-
(k)
(50) -
(l) If , then .
-
(m) If , and Z is independent of , then .
Proof.
For the first three properties, it is clear that they are satisfied if , i.e., almost surely.- (a)
- (b)
- (c)
-
(d)
Substituting x by X and averaging over X, the result follows from the definition of strength.
-
(e)The result holds trivially if either or . Otherwise, we simply rewrite (35) as
and upper/lower bound the right side by .(54) - (f)
-
(g)
-
(h)It is sufficient to assume for the condition on the right of (49) because the condition on the left holds if and only if it holds for , for any and . If , then
which is finite unless either or . This establishes ⟹ in view of (48). To establish ⟸, it is enough to show that(59)
in view of (48) and the fact that, according to (59), if both and are finite. To show (60), we invoke the following variational representation of relative entropy (first noted by Kullback [12] for absolutely continuous random variables): If , then(60)
attained only at . Let Q be the absolutely continuous random variable with probability density function(61) (62) - (i)
-
(j)If , then a.e., and the result follows from (44). For all ,
(71)
where (71) follows by maximizing the left side over . Denote the difference between the right side and the left side of (72) by , an even function which satisfies , and(72) (73) Now, because of the scaling property in (42), we may assume without loss of generality that . Thus, (74) and (75) result in
which requires that , since, by assumption, the right side vanishes. Assume now that , and therefore, . Inequality (75) remains valid in this case, implying that, as soon as the right side is finite (which it must be for all sufficiently large n), , and therefore, in view of (48).(76) -
(k)
-
1st ⟸
For any , Markov’s inequality results in
(77) -
⟹First, we show that, for any , we have
The case is trivial. The case follows because implies(78)
where ≥ is obvious, and ≤ holds because(79) (80)
If infinitely often, so is in view of (48). Assume that , and is finite for all sufficiently large. Then, there is a subsequence such that , and(81)
for all sufficiently large i and . Consequently, (78) implies that .(82) -
2nd ⟸
Suppose that . Therefore, there is a subsequence along which . If , then along the subsequence. Because of the continuity of the Shannon transform and the fact that it grows without bound as its argument goes to infinity (Item 25), if , we can find such that , which implies . Therefore, as we wanted to show.
-
1st ⟸
-
(l)We start by showing that
where we have denoted the right side of (71) with arbitrary logarithm base by . Since , it is easy to verify that(83)
where the lower and upper bounds are attained uniquely at and , respectively. The lower bound results in ⟸ in (83). To show ⟹, decompose, for arbitrary ,(84) (85) (86) (87)
where(88)
(87) holds from the upper bound in (84), and the fact that (89) is decreasing in , and (88) holds for all sufficiently large n if . Since the right side of (88) goes to 0 as , (83) is established. Assume . From the linearity property (42), we have with and which satisfies . Therefore, we may restrict attention to without loss of generality. Following (71) and (74), and abbreviating , we obtain(89) (90) (91) -
(m)If , then a.s., and in view of Part (f). Assume henceforth that . Since , it suffices to show
Under the assumptions, Part (l) guarantees that(92)
If V is a standard Cauchy random variable, then in distribution as the characteristic function converges: for all t. Analogously, according to Part (k), since in probability. Since the strength of is finite for all sufficiently large n, we may invoke (47) to express, for those n,(93)
The lower semicontinuity of relative entropy under weak convergence (which, in turn, is a corollary to the Donsker–Varadhan [14,15] variational representation of relative entropy) results in(94)
because and . Therefore, (92) follows from (94) and (95).(95) □
-
(a)
-
27. In view of (42) and Item 23, $s(\lambda V) = \lambda$ for $\lambda > 0$ if V is standard Cauchy. Furthermore, if $X_1$ and $X_2$ are centered independent Cauchy random variables, then their sum is centered Cauchy with
$$s(X_1 + X_2) = s(X_1) + s(X_2). \tag{96}$$
More generally, it follows from Theorem 1-(d) that, if $X_2$ is centered Cauchy, and (96) holds for $X_1 = X$ and all strengths of $X_2$, then X must be centered Cauchy. Invoking (52), we obtain
$$s(\mu + \lambda V) = \frac{\lambda + \sqrt{4\lambda^2 + 3\mu^2}}{3}, \tag{97}$$
which is also valid for $\lambda = 0$ as we saw in Item 24.
-
28.If X is standard Gaussian, then , and . Therefore, if and are zero-mean independent Gaussian random variables, then
Thus, in this case, .(98) -
29.It follows from Theorem 1-(d) that, with X independent of standard Cauchy V, we obtain whenever X is such that
An example is the heavy-tailed probability density function(99)
for which .(100) -
30.Using (A8), we can verify that, if X is zero-mean uniform with variance , then
where c is the solution to .(101) -
31. We say that $Z_n \to Z$ in strength if $s(Z_n - Z) \to 0$. Parts (j) and (k) of Theorem 1 show that this convergence criterion is intermediate between the traditional convergence in probability and convergence in $L_p$ criteria. It is not equivalent to either one: If
then , while in probability. If, instead, , with probability , then in strength, but not in for any .(102) -
32.The assumption in Theorem 1-(m) that in strength cannot be weakened to convergence in probability. Suppose that is absolutely continuous with probability density function
(103) We have in probability since, regardless of how small , for all . Furthermore,
because (103) is the mixture of a uniform and an infinite differential entropy probability density function, and differential entropy is concave. We conclude that , since .(104) -
33.The following result on the continuity of differential entropy is shown in [16]: if X and Z are independent, and , then
(105) This result is weaker than Theorem 1-(m) because finite first absolute moment implies finite strength as we saw in (44), and in if , and therefore, it vanishes in strength too.
- 34.
- 35.
6. Maximization of Differential Entropy
-
36.Among random variables with a given second moment (resp. first absolute moment), differential entropy is maximized by the zero-mean Gaussian (resp. Laplace) distribution. More generally, among random variables with a given p-absolute moment , differential entropy is maximized by the parameter-p Subbotin (or generalized normal) distribution with p-absolute moment [17]
Among nonnegative random variables with a given mean, differential entropy is maximized by the exponential distribution. In those well-known solutions, the cost function is an affine function of the negative logarithm of the maximal differential entropy probability density function. Is there a cost function such that, among all random variables with a given expected cost, the Cauchy distribution is the maximal differential entropy solution? To answer this question, we adopt a more general viewpoint. Consider the following result, whose special case was solved in [18] using convex optimization:(108) Theorem 2.
Fix and .(109) (110) Therefore, the standard Cauchy distribution is the maximal differential entropy distribution provided that andProof.
-
(a)For every and , there is a unique that satisfies (110) because the function of on the right side is strictly monotonically decreasing, grows without bound as , and goes to zero as .
-
(b)For any Z which satisfies , its relative entropy, in nats, with respect to is
(111) (112) (113)
where (113) and (114) follow from (110) and (22), respectively. Since relative entropy is nonnegative, and zero only if both measures are identical, not only does (2) hold but any random variable other than achieves strictly lower differential entropy.(114) □
-
(a)
-
37.
An unfortunate consequence stemming from Theorem 2 is that, while we were able to find a cost function such that the Cauchy distribution is the maximal differential entropy distribution under an average cost constraint, this holds only for a specific value of the constraint. This behavior is quite different from the classical cases discussed in Item 36, for which the solution is, modulo scale, the same regardless of the value of the cost constraint. As we see next, this deficiency is overcome by the notion of strength introduced in Section 5.
-
38.
Theorem 3.
Strength constraint. The differential entropy of a real-valued random variable Z with strength $s(Z) < \infty$ is upper bounded by
$$h(Z) \leq \log\left(4\pi\, s(Z)\right). \tag{115}$$
If $s(Z) \in (0,\infty)$, equality holds if and only if Z has a centered Cauchy density, i.e., $f_Z = f_{\lambda V}$ for some $\lambda > 0$.Proof.
-
(a)If Z is not an absolutely continuous random variable, or more generally, such as in the case in which with probability one, then (115) is trivially satisfied.
- (b)
□ -
(a)
-
39. The entropy power of a random variable Z is the variance of a Gaussian random variable whose differential entropy is $h(Z)$, i.e.,
$$N(Z) = \frac{1}{2\pi e}\, e^{2 h(Z)}. \tag{116}$$
While the power of a Cauchy random variable is infinite, its entropy power is given by
$$N(\mu + \lambda V) = \frac{8\pi}{e}\, \lambda^2. \tag{117}$$
In the same spirit as the definition of entropy power, Theorem 3 suggests the definition of $\sigma_{\mathrm{e}}(Z)$, the entropy strength of Z, as the strength of a centered Cauchy random variable whose differential entropy is $h(Z)$, i.e., $h(Z) = \log\left(4\pi\,\sigma_{\mathrm{e}}(Z)\right)$. Therefore,
(118) (119)
where (119) follows from (56), and (120) holds with equality if and only if Z is centered Cauchy. Note that, for all $\lambda > 0$,
(120) (121)
Comparing (116) and (118), we see that entropy power is simply a scaled version of the entropy strength squared,
$$N(Z) = \frac{8\pi}{e}\, \sigma_{\mathrm{e}}^2(Z). \tag{122}$$
The entropy power inequality (e.g., [19,20]) states that, if $X_1$ and $X_2$ are independent real-valued random variables, then
$$N(X_1 + X_2) \geq N(X_1) + N(X_2), \tag{123}$$
regardless of whether they have moments. According to (122), we may rewrite the entropy power inequality (123) replacing each entropy power by the corresponding squared entropy strength. Therefore, the squared entropy strength of the sum of independent random variables satisfies
$$\sigma_{\mathrm{e}}^2(X_1 + X_2) \geq \sigma_{\mathrm{e}}^2(X_1) + \sigma_{\mathrm{e}}^2(X_2). \tag{124}$$
It is well-known that equality holds in (123), and hence (124), if and only if both random variables are Gaussian. Indeed, if $X_1$ and $X_2$ are centered Cauchy with respective strengths $\lambda_1$ and $\lambda_2$, then (124) becomes $(\lambda_1 + \lambda_2)^2 \geq \lambda_1^2 + \lambda_2^2$.
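To make the interplay between entropy power, entropy strength and strength concrete, the following sketch (assuming N(Z) = e^{2h(Z)}/(2πe) as in (116) and the entropy strength defined through h(Z) = log(4π σₑ(Z)) as above) verifies numerically that a centered Cauchy variable has equal strength and entropy strength, whereas a Gaussian variable has entropy strength strictly below its strength, as Theorem 3 requires; it also checks the proportionality in (122).

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

LOG4 = np.log(4.0)

def strength(pdf):
    # Assumed defining equation (35): E[log(1 + Z^2/s^2)] = log 4.
    gap = lambda s: quad(lambda z: pdf(z) * np.log1p((z / s) ** 2), -np.inf, np.inf)[0] - LOG4
    return brentq(gap, 1e-6, 1e6)

def diff_entropy(pdf):
    # h(Z) = -E[log f(Z)], in nats.
    return quad(lambda z: -pdf(z) * np.log(pdf(z)), -np.inf, np.inf)[0]

entropy_power    = lambda h: np.exp(2 * h) / (2 * np.pi * np.e)   # Equation (116)
entropy_strength = lambda h: np.exp(h) / (4 * np.pi)              # from h = log(4*pi*sigma_e)

lam = 2.0
densities = {
    "Cauchy(0,2)": lambda z: lam / (np.pi * (lam**2 + z**2)),
    "N(0,1)":      lambda z: np.exp(-z**2 / 2) / np.sqrt(2 * np.pi),
}
for name, pdf in densities.items():
    h = diff_entropy(pdf)
    print(name, strength(pdf), entropy_strength(h),
          entropy_power(h), (8 * np.pi / np.e) * entropy_strength(h) ** 2)
# Cauchy: strength == entropy strength == 2.  Gaussian: entropy strength < strength.
# In both rows, the last two numbers coincide, illustrating (122).
```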
-
40.Theorem 3 implies that any random variable with infinite differential entropy has infinite strength. There are indeed random variables with finite differential entropy and infinite strength. For example, let be an absolutely continuous random variable with probability density function
(125) Then, nats, while the entropy of the quantized version as well as the strength satisfy .
-
41.With the same approach, we may generalize Theorem 3 to encompass the full slew of the generalized Cauchy distributions in (9). To that end, fix and define the -strength of a random variable as
(126) Therefore, for , and if satisfy (110), then . As in Item 25, if , we have(127) -
42.
Theorem 4.
Generalized strength constraint. Fix and . The differential entropy of a real-valued random variable with -strength is upper bounded by(128) Proof.
As with Theorem 3, in the proof, we may assume to avoid trivialities. Then,(129) and, in nats,(130) (131) (132) (133) -
43.
In the multivariate case, we may find a simple upper bound on differential entropy based on the strength of the norm of the random vector.
Theorem 5.
The differential entropy of a random vector is upper bounded by(134) Proof.
As in the proof of Theorem 3, we may assume that . As usual, denotes the standard spherical multivariate Cauchy density in (6). Since for , , we have(135) (136) (137) For , Theorem 5 becomes the bound in (115). For , the right side of (15) is greater than , and, therefore, . Consequently, in the multivariate case, there is no such that (134) is tight.
-
44.To obtain a full generalization of Theorem 3 in the multivariate case, it is advisable to define the strength of a random n-vector as
(138)
for . To verify (139), note (15)–(17). Notice that and for , (138) is equal to (34). The following result provides a maximal differential entropy characterization of the standard spherical multivariate Cauchy density.(139) Theorem 6.
Let have the standard multivariate Cauchy density (6), Then,(140) Proof.
Assume . Then,(141) (142)
7. Relative Information
-
45. For probability measures P and Q on the same measurable space, such that $P \ll Q$, the logarithm of their Radon–Nikodym derivative is the relative information denoted by
$$\imath_{P\|Q}(x) = \log \frac{\mathrm{d}P}{\mathrm{d}Q}(x). \tag{143}$$
46. As usual, we may employ the notation $\imath_{X\|Y}$ to denote $\imath_{P_X\|P_Y}$. The distributions of the random variables $\imath_{X\|Y}(X)$ and $\imath_{X\|Y}(Y)$ are referred to as relative information spectra (e.g., [21]). It can be shown that there is a one-to-one correspondence between the cumulative distributions of $\imath_{X\|Y}(X)$ and $\imath_{X\|Y}(Y)$. For example, if they are absolutely continuous random variables with respective probability density functions $f_X$ and $f_Y$, then
(144)
Obviously, the distributions of $\imath_{X\|Y}$ and $\imath_{Y\|X}$ determine each other. One caveat is that relative information may take the value $\pm\infty$. It can be shown that
(145) (146)
47.
The information spectra determine all measures of the distance between the respective probability measures of interest (e.g., [22,23]), including f-divergences and Rényi divergences. For example, the relative entropy (or Kullback–Leibler divergence) of the dominated measure P with respect to the reference measure Q is the average of the relative information when the argument is distributed according to P, i.e., $D(P\|Q) = \mathbb{E}\left[\imath_{P\|Q}(X)\right]$ with $X \sim P$. If $P \not\ll Q$, then $D(P\|Q) = \infty$.
-
48.The information spectra also determine the fundamental trade-off in hypothesis testing. Let denote the minimal probability of deciding when is true subject to the constraint that the probability of deciding when is true is no larger than . A consequence of the Neyman–Pearson lemma is
where and .(147) -
49. Cauchy distributions are absolutely continuous with respect to each other and, in view of (2), the relative information between the Cauchy measures P and Q with location/scale pairs $(\mu_0, \lambda_0)$ and $(\mu_1, \lambda_1)$, respectively, is
$$\imath_{P\|Q}(x) = \log \frac{\lambda_0\left(\lambda_1^2 + (x - \mu_1)^2\right)}{\lambda_1\left(\lambda_0^2 + (x - \mu_0)^2\right)}. \tag{148}$$
50. The following result, proved in Item 58, shows that the relative information spectrum corresponding to Cauchy distributions with respective location/scale pairs $(\mu_0, \lambda_0)$ and $(\mu_1, \lambda_1)$ depends on the four parameters only through the single scalar
$$\psi = \frac{(\mu_0 - \mu_1)^2 + \lambda_0^2 + \lambda_1^2}{2\,\lambda_0\lambda_1} \geq 1, \tag{149}$$
where equality holds if and only if $(\mu_0, \lambda_0) = (\mu_1, \lambda_1)$.
Theorem 7.
Suppose that , and V is standard Cauchy. Denote(150) Then,-
(a)
(151) -
(b) Z has the same distribution as the random variable
(152) where Θ is uniformly distributed on and . Therefore, the probability density function of Z is(153) on the interval .
-
(a)
- 51.
-
52.For future use, note that the endpoints of the support of (153) are their respective reciprocals. Furthermore,
which implies(156) (157)
8. Equivalent Pairs of Probability Measures
-
53.
Suppose that and are probability measures on such that and and are probability measures on such that . We say that and are equivalent pairs, and write , if the cumulative distribution functions of and are identical with and . Naturally, ≡ is an equivalence relationship. Because of the one-to-one correspondence indicated in Item 46, the definition of equivalent pairs does not change if we require equality of the information spectra under the dominated measure, i.e., that and be equally distributed and . Obviously, the requirement that the information spectra coincide is the same as requiring that the distributions of and are equal. As in Item 46, we also employ the notation to indicate if , , , and .
-
54.
Suppose that the output probability measures of a certain (random or deterministic) transformation are and when the input is distributed according to and , respectively. If , then the transformation is a sufficient statistic for deciding between and (i.e., the case of a binary parameter).
-
55.If is a measurable space on which the probability measures are defined, and is a -measurable injective function, then are probability measures on and
(158) Consequently, .
-
56.The most important special case of Item 55 is an affine transformation of an arbitrary real-valued random variable X, which enables the reduction of four-parameter problems into two-parameter problems: for all and ,
with(159)
by choosing the affine function .(160) -
57.
Theorem 8.
If is an even random vector, i.e., , then(161) whenever . -
58.
We now proceed to prove Theorem 7.
Proof.
Since and have identical distributions, we may assume for convenience that and . Furthermore, capitalizing on Item 56, we may assume , , , and , and then recover the general result letting and . Invoking (A9) and (A10), we have(167) (168) (169) and we can verify that we recover (151) through the aforementioned substitution. Once we have obtained the expectation of , we proceed to determine its distribution. Denoting the right side of (169) by , we have(170) (171) (172) (173) (174) (175) where is uniformly distributed on . We have substituted (see Item 4) in (172), and invoked elementary trigonometric identities in (173) and (174). Since the phase in (175) does not affect it, the distribution of Z is indeed as claimed in (152), and (153) follows because the probability density function of is(176) □ -
59.In general, it need not hold that —for example, if X and Y are zero-mean Gaussian with different variances. However, the class of scalar Cauchy distributions does satisfy this property since the result of Theorem 7 is invariant to swapping and . More generally, Theorem 7 implies that, if , then
(177) Curiously, (177) implies that .
-
60.
For location–dilation families of random variables, we saw in Item 56 how to reduce a four-parameter problem into a two-parameter problem since with the appropriate substitution. In the Cauchy case, Theorem 7 reveals that, in fact, we can go one step further and turn it into a one-parameter problem. We have two basic ways of doing this:
-
(a)
with .
-
(b)with either
which are the solutions to .(178)
-
(a)
9. f-Divergences
This section studies the interplay of f-divergences and equivalent pairs of measures.
-
61. If $P \ll Q$ and $f \colon (0,\infty) \to \mathbb{R}$ is convex and right-continuous at 0, the f-divergence is defined as
$$D_f(P\|Q) = \mathbb{E}\left[ f\!\left( \frac{\mathrm{d}P}{\mathrm{d}Q}(Y) \right) \right], \quad Y \sim Q. \tag{179}$$
62. The most important property of f-divergence is the data processing inequality
$$D_f(P_X \| Q_X) \geq D_f(P_Y \| Q_Y), \tag{180}$$
where $P_Y$ and $Q_Y$ are the responses of a (random or deterministic) transformation to $P_X$ and $Q_X$, respectively. If f is strictly convex at 1 and $P_X \neq Q_X$, then $(P_X, Q_X) \equiv (P_Y, Q_Y)$ is necessary and sufficient for equality in (180).
63.
If , then with the transform , which satisfies .
-
64.
Theorem 9.
If $P_1 \ll Q_1$ and $P_2 \ll Q_2$, then
$$(P_1, Q_1) \equiv (P_2, Q_2) \iff D_f(P_1\|Q_1) = D_f(P_2\|Q_2) \ \text{for all } f \in \mathcal{F}, \tag{181}$$
where $\mathcal{F}$ stands for the set of all convex functions on $(0,\infty)$ that are right-continuous at 0.Proof.
As mentioned in Item 53, is equivalent to and having identical distributions with and .-
⟹According to (179), is determined by the distribution of the random variable , .
-
⟸For , the function , , is convex and right-continuous at 0, and is the moment generating function, evaluated at t, of the random variable , . Therefore, for all t implies that .□
-
⟹
-
65.
Since is not necessary in order to define (finite) , it is possible to enlarge the scope of Theorem 9 by defining dropping the restriction that and . For that purpose, let and be -finite measures on and , respectively, and denote , , . Then, we say if
-
(a)
when restricted to , the random variables and have identical distributions with and ;
-
(b)
when restricted to , the random variables and have identical distributions with and .
Note that those conditions imply that
-
(c)
;
-
(d)
;
-
(e)
.
For example, if and , then . To show the generalized version of Theorem 9, it is convenient to use the symmetrized form(182) -
(a)
-
66.Suppose that there is a class of probability measures on a given measurable space with the property that there exists a convex function (right-continuous at 0) such that, if and , then
(183) In such case, Theorem 9 indicates that can be partitioned into equivalence classes such that, within every equivalence class, the value of is constant, though naturally dependent on f. Throughout , the value of determines the value of , i.e., we can express , where is a non-decreasing function. Consider the following examples:
-
(a)Let be the class of real-valued Gaussian probability measures with given variance . Then,
(184) Since Theorem 8 implies that as long as , (184) indicates that (183) is satisfied with given by the right-continuous extension of . Therefore, we can conclude that, regardless of f, depends on only through .
-
(b)Let be the collection of all Cauchy random variables. Theorem 7 reveals that (183) is also satisfied if because, if and , then
(185)
-
(a)
-
67. An immediate consequence of Theorems 7 and 9 is that, for any valid f, the f-divergence between Cauchy densities is symmetric,
$$D_f(P\|Q) = D_f(Q\|P). \tag{186}$$
This property does not generalize to the multivariate case. While, in view of Theorem 8,
(187)
f-divergences between multivariate Cauchy densities are not symmetric in general, since the corresponding relative entropies do not coincide as shown in [8].
68.
It follows from Item 66 and Theorem 7 that any f-divergence between Cauchy probability measures is a monotonically increasing function of the scalar $\psi$ given by (149). The following result shows how to obtain that function from f.
Proof.
In view of (179) and the definition of Z in Theorem 7,(191) -
69.Suppose now that we have two sequences of Cauchy measures with respective parameters and such that . Then, Theorem 10 indicates that
(192) The most common f-divergences are such that since in that case . In addition, adding the function to does not change the value of and with appropriately chosen , we can turn into canonical form in which not only but . In the special case in which the second measure is fixed, Theorem 9 in [25] shows that, if with , then
provided the limit on the right side exists; otherwise, the left side lies between the left and right limits at 1. In the Cauchy case, we can allow the second probability to depend on n and sharpen that result by means of Theorem 10. In particular, it can be shown that(193)
provided the right side is not .(194)
10. χ²-Divergence
-
70. With either $f(x) = (x-1)^2$ or $f(x) = x^2 - 1$, the f-divergence is the χ²-divergence,
$$\chi^2(P\|Q) = \mathbb{E}\left[\left(\frac{\mathrm{d}P}{\mathrm{d}Q}(Y)\right)^{\!2}\right] - 1, \quad Y \sim Q. \tag{195}$$
71. If P and Q are Cauchy distributions, then (149), (151) and (195) result in
$$\chi^2(P\|Q) = \psi - 1 = \frac{(\mu_0 - \mu_1)^2 + (\lambda_0 - \lambda_1)^2}{2\,\lambda_0\lambda_1}, \tag{196}$$
a formula obtained in Appendix D of [26] using complex analysis and the Cauchy integral formula. In addition, invoking complex analysis and the maximal group invariant results in [27,28], ref. [26] shows that any f-divergence between Cauchy distributions can be expressed as a function of their χ²-divergence, although [26] left open how to obtain that function, which is given by Theorem 10 substituting
$$\psi = 1 + \chi^2(P\|Q). \tag{197}$$
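The closed form (196) is straightforward to corroborate numerically; the sketch below compares it with a direct quadrature of ∫ p²/q − 1 (the parameter values are arbitrary, and the closed-form expression used is the one stated in (196)).

```python
import numpy as np
from scipy.integrate import quad

def cauchy_pdf(x, mu, lam):
    return lam / (np.pi * (lam**2 + (x - mu)**2))

mu0, lam0 = 0.3, 1.2      # dominated measure P
mu1, lam1 = -1.0, 2.5     # reference measure Q

chi2_quad = quad(lambda x: cauchy_pdf(x, mu0, lam0) ** 2 / cauchy_pdf(x, mu1, lam1),
                 -np.inf, np.inf)[0] - 1.0

chi2_closed = ((mu0 - mu1) ** 2 + (lam0 - lam1) ** 2) / (2 * lam0 * lam1)
print(chi2_quad, chi2_closed)     # agreement to quadrature precision
```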
11. Relative Entropy
-
72.The relative entropy between Cauchy distributions is given by
$$D(P\|Q) = \log \frac{(\lambda_0 + \lambda_1)^2 + (\mu_0 - \mu_1)^2}{4\,\lambda_0\lambda_1} = \log \frac{1 + \psi}{2}, \tag{198}$$
where $\psi$ is given in (149). The special case of (198) with $\lambda_0 = \lambda_1$ was found in Example 4 of [29]. The next four items give different simple justifications for (198). An alternative proof was recently given in Appendix C of [26] using complex analysis and the Cauchy integral formula. Yet another, much more involved, proof is reported in [30]. See also Remark 19 in [26] for another route invoking the Lévy–Khintchine formula and the Frullani integral.
73.Since for absolutely continuous random variables ,
(199)
where (200) follows from (12) and (A4) with and(200) Now, substituting and , we obtain (198) since, according to Item 56, .
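As with the χ²-divergence, the closed form (198) lends itself to a quick numerical check. The sketch below (using the expression log[((λ₀+λ₁)² + (μ₀−μ₁)²)/(4λ₀λ₁)] stated in (198)) compares it against a direct quadrature of ∫ p log(p/q), and incidentally exhibits the symmetry of Item 67.

```python
import numpy as np
from scipy.integrate import quad

def cauchy_pdf(x, mu, lam):
    return lam / (np.pi * (lam**2 + (x - mu)**2))

def kl_quad(mu0, lam0, mu1, lam1):
    integrand = lambda x: cauchy_pdf(x, mu0, lam0) * np.log(
        cauchy_pdf(x, mu0, lam0) / cauchy_pdf(x, mu1, lam1))
    return quad(integrand, -np.inf, np.inf)[0]

def kl_closed(mu0, lam0, mu1, lam1):
    return np.log(((lam0 + lam1) ** 2 + (mu0 - mu1) ** 2) / (4 * lam0 * lam1))

p, q = (2.0, 0.7), (-0.5, 1.9)
print(kl_quad(*p, *q), kl_closed(*p, *q))   # numerical vs. closed form
print(kl_quad(*q, *p), kl_closed(*q, *p))   # same value: D(P||Q) = D(Q||P) here
```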
-
74.From the formula found in Example 4 of [29] and the fact that, according to (197), when , we obtain
(201) Moreover, as argued in Item 60, (201) is also valid for the relative entropy between Cauchy distributions with as long as is given in (197). Indeed, we can verify that the right side of (201) becomes (198) with said substitution.
- 75.
- 76.
-
77.If V is standard Cauchy, independent of Cauchy and , then (198) results in
where and , and is an independent (or exact) copy of V. In contrast, the corresponding result in the Gaussian case in which X, , are independent Gaussian with means and variances , respectively, is(205) (206) In fact, it is shown in Lemma 1 of [31] that (206) holds even if and are not Gaussian but have finite variances. It is likely that (205) holds even if and are not Cauchy, but have finite strengths.
-
78.An important information theoretic result due to Csiszár [32] is that if and P is such that
then the following Pythagorean identity holds(207) (208) Among other applications, this result leads to elegant proofs of minimum relative entropy results. For example, the closest Gaussian to a given P with a finite second moment has the same first and second moments as P. If we let and be centered Cauchy with strengths and , respectively, then the orthogonality condition (207) becomes, with the aid of (148) and (198),(209) If, in addition, P is centered Cauchy, we can use (28) to verify that (209) holds only in the trivial cases in which either or . For non-Cauchy P, (208) may indeed be satisfied with . For example, using (30), if , then (209), and therefore (208), holds with .
-
79.Mutually absolutely continuous random variables may be such that
(210) An easy example is that of Gaussian X and Cauchy Z, or, if we let X be Cauchy, (210) holds with Z having the very heavy-tailed density function in (62).
-
80.While relative entropy is lower semi-continuous, it is not continuous. For example, using the Cauchy distribution, we can show that relative entropy is not stable against small contamination of a Gaussian random variable: if X is Gaussian independent of V, then no matter how small ,
(211)
12. Total Variation Distance
- 81.
-
82.Example 15 of [33] shows that the total variation distance between centered Cauchy distributions is
(217)
in view of (197). Since any f-divergence between Cauchy distributions depends on the parameters only through the corresponding χ²-divergence, (217)–(218) imply the general formula(218) (219) Alternatively, applying Theorem 11 to the case of Cauchy random variables, note that, in this case, Z is an absolutely continuous random variable with density function (153). Therefore, = 1, and(220)
where (221) follows from (154) and the identity specialized to . Though more laborious (see [26]), (219) can also be verified by direct integration.(221)
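Since, by (219), the total variation distance between Cauchy laws depends on the parameters only through their χ²-divergence, it can be cross-checked numerically against ½∫|p − q|. The candidate closed form used below, (2/π)arctan(√(χ²/2)), is an assumption consistent with the centered-scale case; it is offered as a sketch rather than a formula quoted verbatim from (219).

```python
import numpy as np
from scipy.integrate import quad

def cauchy_pdf(x, mu, lam):
    return lam / (np.pi * (lam**2 + (x - mu)**2))

mu0, lam0 = 0.0, 1.0
mu1, lam1 = 1.5, 2.0

tv_quad = 0.5 * quad(lambda x: abs(cauchy_pdf(x, mu0, lam0) - cauchy_pdf(x, mu1, lam1)),
                     -np.inf, np.inf, limit=200)[0]

chi2 = ((mu0 - mu1) ** 2 + (lam0 - lam1) ** 2) / (2 * lam0 * lam1)
tv_candidate = (2 / np.pi) * np.arctan(np.sqrt(chi2 / 2))
print(tv_quad, tv_candidate)     # the two values should agree closely
```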
13. Hellinger Divergence
-
83. The Hellinger divergence of order $\alpha \in (0,1) \cup (1,\infty)$ is the f-divergence with
$$f_\alpha(x) = \frac{x^\alpha - 1}{\alpha - 1}. \tag{222}$$
Notable special cases are
(223) (224)
where is known as the squared Hellinger distance.(225) - 84.
14. Rényi Divergence
-
85. For absolutely continuous probability measures P and Q, with corresponding probability density functions p and q, the Rényi divergence of order $\alpha \in (0,1) \cup (1,\infty)$ is [35]
$$D_\alpha(P\|Q) = \frac{1}{\alpha - 1} \log \int_{-\infty}^{\infty} p^\alpha(x)\, q^{1-\alpha}(x)\,\mathrm{d}x. \tag{228}$$
Note that, as $\alpha \to 1$, $D_\alpha(P\|Q) \to D(P\|Q)$. Moreover, although Rényi divergence of order $\alpha$ is not an f-divergence, it is in one-to-one correspondence with the Hellinger divergence of order $\alpha$:
$$D_\alpha(P\|Q) = \frac{1}{\alpha - 1} \log\left(1 + (\alpha - 1)\,\mathscr{H}_\alpha(P\|Q)\right). \tag{229}$$
- 86.
- 87.
-
88.
Writing the complete elliptical integral of the first kind and the Legendre function of the first kind as special cases of the Gauss hypergeometric function, González [37] noticed the simpler identity (see also 8.13.8 in [34])
(234) We can view (233) and (234) as complementary of each other since they constrain the argument of the Legendre function to belong to and , respectively.
-
89.Since , particularizing (230), we obtain
(235) -
90.Since , for Cauchy random variables, we obtain
(236) -
91.For Cauchy random variables, the Rényi divergence for integer order 4 or higher can be obtained through (235), (236) and the recursion (dropping for typographical convenience)
which follows from (230) and the recursion of the Legendre polynomials(237)
which, in fact, also holds for non-integer n (see 8.5.3 in [34]).(238) -
92.The Chernoff information
satisfies regardless of . If, as in the case of Cauchy measures, , then Chernoff information is equal to the Bhattacharyya distance:(239)
where $\mathscr{H}^2$ is the squared Hellinger distance, which is the f-divergence with $f(x) = (\sqrt{x} - 1)^2$. Together with Item 87, (240) gives the Chernoff information for Cauchy distributions. While it involves the complete elliptical integral function, its simplicity should be contrasted with the formidable expression for Gaussian distributions, recently derived in [38]. The reason (240) holds is that the supremum in (239) is achieved at $\alpha = \tfrac{1}{2}$. To see this, note that(241) (242)
where (241) reflects the skew-symmetry of Rényi divergence, and (242) holds because, for Cauchy measures, Rényi divergence is symmetric in its arguments. Since the function being maximized is concave and its own mirror image about $\alpha = \tfrac{1}{2}$, it is maximized at $\alpha = \tfrac{1}{2}$.(243)
15. Fisher’s Information
-
93. The score function of the standard Cauchy density (1) is
$$\rho_V(x) = \frac{\mathrm{d}}{\mathrm{d}x}\log f_V(x) = -\frac{2x}{1+x^2}. \tag{244}$$
Then, $\rho_V(V)$ is a zero-mean random variable with second moment equal to Fisher's information
$$J(V) = \mathbb{E}\left[\rho_V^2(V)\right] = \frac{1}{2}, \tag{245}$$
where we have used (A11). Since Fisher's information is invariant to location and scales as $J(\lambda V) = \lambda^{-2} J(V)$, we obtain
$$J(\mu + \lambda V) = \frac{1}{2\lambda^2}. \tag{246}$$
Together with (117), the product of entropy power and Fisher information is $\frac{4\pi}{e} \approx 4.62$, thereby abiding by Stam's inequality [4], $N(X)\,J(X) \geq 1$.
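A short numerical check of the values above (a sketch assuming J = E[ρ²] with the score ρ in (244)):

```python
import numpy as np
from scipy.integrate import quad

def fisher_information_cauchy(lam):
    # J = E[rho^2(X)] with score rho(x) = d/dx log f(x) = -2x / (lam^2 + x^2).
    pdf   = lambda x: lam / (np.pi * (lam**2 + x**2))
    score = lambda x: -2 * x / (lam**2 + x**2)
    return quad(lambda x: pdf(x) * score(x) ** 2, -np.inf, np.inf)[0]

for lam in (1.0, 2.0, 5.0):
    print(lam, fisher_information_cauchy(lam), 1 / (2 * lam**2))   # matches 1/(2*lam^2)

# With entropy power 8*pi*lam^2/e from (117), the Stam product N*J is the
# constant 4*pi/e ~ 4.62, comfortably above the lower bound of 1.
print(4 * np.pi / np.e)
```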
-
94.Introduced in [39], Fisher’s information of a density function (245) quantifies its similarity with a slightly shifted version of itself. A more general notion is the Fisher information matrix of a random transformation satisfying the regularity condition
(247) Then, the Fisher information matrix of at has coefficients
and satisfies (with relative entropy in nats)(248) (249) -
95.The relative Fisher information is defined as
(252) Although the purpose of this definition is to avoid some of the pitfalls of the classical definition of Fisher’s information, not only do equivalent pairs fail to have the same relative Fisher information but, unlike relative entropy or f-divergence, relative Fisher information is not transparent to injective transformations. For example, . Centered Cauchy random variables illustrate this fact since(253) -
96.de Bruijn’s identity [4] states that, if is independent of X, then, in nats,
As well as serving as the key component in the original proofs of the entropy power inequality, the differential equation in (254) provides a concrete link between Shannon theory and its prehistory. As we show in Theorem 12, it turns out that there is a Cauchy counterpart of de Bruijn’s identity (254). Before stating the result, we introduce the following notation for a parametrized random variable (to be specified later):(254) (255) (256) (257)
i.e., and are the Fisher information with respect to location and with respect to dilation, respectively (corresponding to the coefficients and of the Fisher information matrix when as in Item 94. The key to (254) is that , satisfies the partial differential equation(258) (259) Theorem 12.
Suppose that X is independent of standard Cauchy V. Then, in nats,(260) Proof.
Equation (259) does not hold in the current case in which , and(261) However, some algebra (the differentiation/integration swaps can be justified invoking the bounded convergence theorem) indicates that the convolution with the Cauchy density satisfies the Laplace partial differential equation(262) The derivative of the differential entropy of is, in nats,(263) (264) Taking another derivative, the left side of (260) becomes(265) (266) (267) (268) where-
□
-
97.
Theorem 12 reveals that the increasing function $t \mapsto h(X + tV)$ is concave (which does not follow from the concavity of the differential entropy as a functional of the density). In contrast, it was shown by Costa [40] that the entropy power $N(X + \sqrt{t}\,W)$, with W standard Gaussian, is concave in t.
16. Mutual Information
-
98. Most of this section is devoted to an additive noise model. We begin with the simplest case in which the input X is centered Cauchy with strength $\lambda_X$, independent of the noise N, also centered Cauchy, with strength $\lambda_N$. Then, (11) yields
$$I(X; X+N) = h(X+N) - h(X+N \mid X) \tag{269}$$
$$= \log\left(4\pi(\lambda_X + \lambda_N)\right) - \log\left(4\pi\lambda_N\right) \tag{270}$$
$$= \log\left(1 + \frac{\lambda_X}{\lambda_N}\right), \tag{271}$$
thereby establishing a pleasing parallelism with Shannon's formula [1] for the mutual information between a Gaussian random variable and its sum with an independent Gaussian random variable. Aside from a factor of 2, in the Cauchy case, the role of the variance is taken by the strength. Incidentally, as shown in [2], if N is standard exponential on $[0,\infty)$, an independent X on $[0,\infty)$ can be found so that $X+N$ is exponential, in which case the formula (271) also applies because the ratio of strengths of exponentials is equal to the ratio of their means. More generally, if input and noise are independent non-centered Cauchy, their locations do not affect the mutual information, but they do affect their strengths, so, in that case, (271) holds provided that the strengths are evaluated for the centered versions of the Cauchy random variables.
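The closed form (271) can be corroborated by Monte Carlo, exploiting the stability of the Cauchy law (the output is centered Cauchy with strength λ_X + λ_N). A sketch under those assumptions:

```python
import numpy as np

rng = np.random.default_rng(7)
lam_x, lam_n, m = 3.0, 1.5, 500_000

def cauchy_logpdf(z, scale):
    return np.log(scale / np.pi) - np.log(scale**2 + z**2)

x = lam_x * rng.standard_cauchy(m)    # centered Cauchy input, strength lam_x
n = lam_n * rng.standard_cauchy(m)    # independent centered Cauchy noise
y = x + n                             # output: centered Cauchy, strength lam_x + lam_n

# Information density i(X;Y) = log f_{Y|X}(Y|X) - log f_Y(Y); its mean is I(X;Y).
i_density = cauchy_logpdf(y - x, lam_n) - cauchy_logpdf(y, lam_x + lam_n)
print(i_density.mean(), np.log(1 + lam_x / lam_n))   # Monte Carlo vs. (271), in nats
```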
99.It is instructive, as well as useful in the sequel, to obtain (271) through a more circuitous route. Since is centered Cauchy with strength , the information density (e.g., [41]) is defined as
(272) (273) (274) Averaging with respect to , we obtain(275) (276) -
100.If the strengths of output and independent noise N are finite and their differential entropies are not , we can obtain a general representation of the mutual information without requiring that either input or noise be Cauchy. Invoking (56) and , we have
(277)
since, as we saw in (49), the finiteness of the strengths guarantees the finiteness of the relative entropies in (278). We can readily verify the alternative representation in which strength is replaced by standard deviation, and the standard Cauchy V is replaced by standard normal W:(278) (279) (280) A byproduct of (278) is the upper bound(281)
where (281) follows from , and (282) follows by dropping the last term on the right side of (278). Note that (281) is the counterpart of the upper bound given by Shannon [1] in which the standard deviation of Y takes the place of the strength in the numerator, and the square root of the noise entropy power takes the place of the entropy strength in the denominator. Shannon gave his bound three years before Kullback and Leibler introduced relative entropy in [42]. The counterpart of (282) with analogous substitutions of strengths by standard deviations was given by Pinsker [43], and by Ihara [44] for continuous-time processes.(282) -
101.
We proceed to investigate the maximal mutual information between the (possibly non-Cauchy) input and its additive Cauchy-noise contaminated version.
Theorem 13.
Maximal mutual information: output strength constraint. For any $\sigma \geq \lambda_N$,
$$\max_{X \colon\, s(X+N) \leq \sigma} I(X; X+N) = \log \frac{\sigma}{\lambda_N}, \tag{283}$$
where N is centered Cauchy with strength $\lambda_N$, independent of X. The maximum in (283) is attained uniquely by the centered Cauchy distribution with strength $\sigma - \lambda_N$.Proof.
For centered Cauchy noise, the upper bound in (282) simplifies to
(284)
which shows ≤ in (283). If the input is centered Cauchy with strength $\sigma - \lambda_N$, then $s(X+N) = \sigma$, and $I(X; X+N)$ is equal to the right side in view of (271).□
102.In the information theory literature, the maximization of mutual information over the input distribution is usually carried out under a constraint on the average cost for some real-valued function . Before we investigate whether the optimization in (283) can be cast into that conventional paradigm, it is instructive to realize that the maximization of mutual information in the case of input-independent additive Gaussian noise can be viewed as one in which we allow any input such that the output variance is constrained, and because the output variance is the sum of input and noise variances that the familiar optimization over variance constrained inputs obtains. Likewise, in the case of additive exponential noise and random variables taking nonnegative values, if we constrain the output mean, automatically we are constraining the input mean. In contrast, the output strength is not equal to the sum of Cauchy noise strength and the input strength, unless the input is Cauchy. Indeed, as we saw in Theorem 1-(d), the output strength depends not only on the input strength but on the shape of its probability density function. Since the noise is Cauchy, (45) yields
(285)
which is the same input constraint found in [45] (see also Lemma 6 in [46] and Section V in [47]) in which affects not only the allowed expected cost but the definition of the cost function itself. If X is centered Cauchy with strength , then (286) is satisfied with equality, in keeping with the fact that that input achieves the maximum in (283). Any alternative input with the same strength that produces output strength lower than or equal to can only result in lower mutual information. However, as we saw in Item 29, we can indeed find input distributions with strength that can produce output strength higher than . Can any of those input distributions achieve ? The answer is affirmative. If we let , defined in (9), we can verify numerically that, for ,(286) (287) We conclude that, at least for , the capacity–input–strength function satisfies(288) -
103.
Although not always acknowledged, the key step in the maximization of mutual information over the input distribution for a given random transformation is to identify the optimal output distribution. The results in Items 101 and 102 point out that it is mathematically more natural to impose constraints on the attributes of the observed noisy signal than on the transmitted noiseless signal. In the usual framework of power constraints, both formulations are equivalent as an increase in the gain of the receiver antenna (or a decrease in the front-end amplifier thermal noise) of dB has the same effect as an increase of dB in the gain of the transmitter antenna (or increase in the output power of the transmitted amplifier). When, as in the case of strength, both formulations lead to different solutions, it is worthwhile to recognize that what we usually view as transmitter/encoder constraints also involve receiver features.
-
104.Consider a multiaccess channel , where is a sequence of strength independent centered Cauchy random variables. While the capacity region is unknown if we place individual cost or strength constraints on the transmitters, it is easily solvable if we impose an output strength constraint. In that case, the capacity region is the triangle
where is the output strength constraint. To see this, note (a) the corner points are achievable thanks to Theorem 13; (b) if the transmitters are synchronous, a time-sharing strategy with Cauchy distributed inputs satisfies the output strength constraint in view of (107); (c) replacing the independent encoders by a single encoder which encodes both messages would not be able to achieve higher rate sum. It is also possible to achieve (289) using the successive decoding strategy invented by Cover [48] and Wyner [49] for the Gaussian multiple-access channel: fix ; to achieve and , we let the transmitters use random coding with sequences of independent Cauchy random variables with respective strengths(289) (290)
which abide by the output strength constraint since , and(291) (292)
a rate-pair which is achievable by successive decoding by using a single-user decoder for user 1, which treats the codeword transmitted by user 2 as noise; upon decoding the message of user 1, it is re-encoded and subtracted from the received signal, thereby presenting a single-user decoder for user 2 with a signal devoid of any trace of user 1 (with high probability).(293) -
105.The capacity per unit energy of the additive Cauchy-noise channel , where is an independent sequence of standard Cauchy random variables, was shown in [29] to be equal to , even though the capacity-cost function of such a channel is unknown. A corollary to Theorem 13 is that the capacity per unit output strength of the same channel is
(294) By only considering Cauchy distributed inputs, the capacity per unit input strength is lower bounded by
but is otherwise unknown as it is not encompassed by the formula in [29].(295) -
106.We turn to the scenario, dual to that in Theorem 13, in which the input is Cauchy but the noise need not be. As Shannon showed in [1], if the input is Gaussian, among all noise distributions with given second moment, independent Gaussian noise is the least favorable. Shannon showed that fact applying the entropy power inequality to the numerator on the right side of (279), and then further weakened the resulting lower bound by replacing the noise entropy power in the denominator by its variance. Taking a cue from this simple approach, we apply the entropy strength inequality (124) to (277) to obtain
(296) (297) (298)
where (299) follows from . Unfortunately, unlike the case of Gaussian input, this route falls short of showing that Cauchy noise of a given strength is least favorable because the right side of (299) is strictly smaller than the Cauchy-input Cauchy-noise mutual information in (271). Evidently, while the entropy power inequality is tight for Gaussian random variables, it is not for Cauchy random variables, as we observed in Item 39. For this approach to succeed in showing that, under a strength constraint, the least favorable noise is centered Cauchy, we would need that, if W is independent of standard Cauchy V, then . (See Item 119c-(a).)(299)
107.
As in Item 102, the counterpart in the Cauchy-input case is more challenging due to the fact that, unlike variance, the output strength need not be equal to the sum of input and noise strength. The next two results give lower bounds which, although achieved by Cauchy noise, do not just depend on the noise distribution through its strength.
Theorem 14.
If is centered Cauchy, independent of W with , denote . Then, (300) with equality if W is centered Cauchy.
Proof.
Let us abbreviate . Consider the following chain: (301) (302) (303) (304) where . □
Although the lower bound in Theorem 14 is achieved by a centered Cauchy W, it does not rule out the existence of W such that and .
-
108.
For the following lower bound, it is advisable to assume for notational simplicity and without loss of generality that . To remove that restriction, we may simply replace W by .
Theorem 15.
Let V be standard Cauchy, independent of W. Then, (306) where is the solution to (307). Equality holds in (306) if W is a centered Cauchy random variable, in which case .
Proof.
It can be shown that, if and is an auxiliary random transformation such that , where is the response of to , then (308) where and the information density corresponds to the joint probability measure . We particularize this decomposition of mutual information to the case in which Wc is centered Cauchy with strength λ > 0. Then, is the joint distribution of V and V + Wc, and (309). Taking expectation with respect to , and invoking (52), we obtain (310) (311). Finally, taking expectation with respect to , we obtain (312). □
- 109.
-
110.As the proof indicates, at the expense of additional computation, we may sharpen the lower bound in Theorem 15 to show
which is attained at the solution to(315) (316) -
111.
Theorem 16.
The rate–distortion function of a memoryless source whose distribution is centered Cauchy with strength such that the time-average of the distortion strength is upper bounded by D is given by(317) Proof.
If , reproducing the source by results in a time average of the distortion strength equal to . Therefore, . If , we proceed to determine the minimal among all such that . For any such random transformation, (318) (319) (320) (321) (322) (323) where (320) holds because conditioning cannot increase differential entropy, and (322) follows from Theorem 3 applied to . The fact that there is an allowable that achieves the lower bound with equality is best seen by letting , where Z and are independent centered Cauchy random variables with and . Then, is such that the X marginal is indeed centered Cauchy with strength , and . Recalling (271), (324) and the lower bound in (323) can indeed be satisfied with equality.
We are not finished yet, since we need to justify that the rate–distortion function is indeed (325) which does not follow from the conventional memoryless lossy compression theorem with average distortion because, although the distortion measure is separable, it is not the average of a function with respect to the joint probability measure . This departure from the conventional setting does not impact the direct part of the theorem (i.e., ≤ in (325)), but it does affect the converse, and in particular the proof of the fact that the n-version of the right side of (325) single-letterizes. To that end, it is sufficient to show that the function of D on the right side of (325) is convex (e.g., see pp. 316–317 in [19]). In the conventional setting, this follows from the convexity of the mutual information in the random transformation since, with a distortion function , we have (326) where , , and . Unfortunately, as we saw in Item 35, strength is not convex in the probability measure, so, in general, we cannot claim that (327). The way out of this quandary is to realize that (327) is only needed for those and that attain the minimum on the right side of (325) for different distortion bounds and . As we saw earlier in this proof, those optimal random transformations are such that and are centered Cauchy. Fortuitously, as we noted in (107), (327) does indeed hold when we restrict attention to mixtures of centered Cauchy distributions. □
Theorem 16 gives another example in which the Shannon lower bound to the rate–distortion function is tight. In addition to Gaussian sources with mean–square distortion, other examples can be found in [50]. Another interesting aspect of the lossy compression of memoryless Cauchy sources under the strength distortion measure is that it is optimally successively refinable in the sense of [51,52]. As in the Gaussian case, this is a simple consequence of the stability of the Cauchy distribution and the fact that the strength of the sum of independent Cauchy random variables is equal to the sum of their respective strengths (Item 27), as sketched below.
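To make the successive refinability claim explicit, here is a sketch in which λ denotes the source strength and D2 < D1 < λ are two target distortion levels (the symbols are ours), under the assumption, consistent with the proof above, that (317) reads R(D) = log(λ/D) for 0 < D < λ. Let E2, E1 and X1 be independent centered Cauchy random variables with strengths D2, D1 − D2 and λ − D1, respectively, and let X2 = X1 + E1 and Z = X2 + E2. Then Z is centered Cauchy with strength λ, the coarse reproduction X1 incurs distortion strength D1, the refined reproduction X2 incurs distortion strength D2, and
\[
I(Z; X_1) = \log \frac{\lambda}{D_1} = R(D_1), \qquad I(Z; X_1, X_2) = I(Z; X_2) = \log \frac{\lambda}{D_2} = R(D_2),
\]
where the middle equality uses the Markov chain Z → X2 → X1. Hence, the incremental rate of the refinement stage is exactly R(D2) − R(D1) = log(D1/D2), with no rate loss.
The scale additivity underlying this construction (Item 27) is easy to check numerically; the following sketch is illustrative only, assumes numpy, and uses the fact that the median of |X| equals the scale of a centered Cauchy random variable X.
import numpy as np

rng = np.random.default_rng(0)
n = 10**6
# For a centered Cauchy random variable of scale gamma, P(|X| <= gamma) = 1/2,
# so the median of |X| is a consistent estimator of the scale.
x = 2.0 * rng.standard_cauchy(n)   # centered Cauchy, scale 2
y = 3.0 * rng.standard_cauchy(n)   # independent centered Cauchy, scale 3
print(np.median(np.abs(x)))        # approximately 2
print(np.median(np.abs(y)))        # approximately 3
print(np.median(np.abs(x + y)))    # approximately 5 = 2 + 3: scales add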
-
112.
The continuity of mutual information can be shown under the following sufficient conditions.
Theorem 17.
Suppose that is a sequence of real-valued random variables that vanishes in strength, Z is independent of , and . Then, (328).
Proof.
Under the assumptions, . Therefore, , and (328) follows from Theorem 1-(m). □ -
113.The assumption is not superfluous for the validity of Theorem 17 even though it was not needed in Theorem 1-(m). Suppose that Z is integer valued, and where has probability mass function
(329) Then, , while , and therefore, .
-
114.In the case in which and are standard spherical multivariate Cauchy random variables with densities in (6), it follows from (7) that has the same distribution as . Therefore,
(330)
where we have used the scaling law (331). There is no possibility of a Cauchy counterpart of the celebrated log-determinant formula for additive Gaussian vectors (e.g., Theorem 9.2.1 in [41]) because, as pointed out in Item 7, is not distributed according to the ellipsoidal density in (8) unless and are proportional, in which case the setup reverts to that in (330).
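For the reader's convenience, here is the computation behind (330) in a slightly more general form, writing the two independent vectors as aV1 and bV2 with a, b > 0 and V1, V2 standard spherical multivariate Cauchy (the weights a and b are ours). It uses only the stability property invoked in the text, namely that aV1 + bV2 has the same distribution as (a + b)V, together with the scaling law h(aV) = h(V) + n log a:
\[
I(aV_1;\, aV_1 + bV_2) = h(aV_1 + bV_2) - h(bV_2) = h\big((a+b)V\big) - h(bV_2) = n \log (a+b) - n \log b = n \log\Big(1 + \frac{a}{b}\Big).
\]
In particular, when a = b, the mutual information equals n log 2 nats.
-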
115.To conclude this section, we leave aside additive noise models and consider the mutual information between a partition of the components of the standard spherical multivariate Cauchy density (6). If , then (17) yields
where stands for the right side of (17). For example, if , then, in nats,(332) (333) (334) (335) -
116.The shared information of n random variables is a generalization of mutual information introduced in [54] for deriving the fundamental limit of interactive data exchange among agents who have access to the individual components and establish a dialog to ensure that all of them find out the value of the random vector. The shared information of is defined as
where , with , and the minimum is over all partitions of :(339)
such that . If we divide (338) by , we obtain the shared information of n random variables distributed according to the standard spherical multivariate Cauchy model. This is a consequence of the following result, which is of independent interest.
Theorem 18.
If are exchangeable random variables, any subset of which has finite differential entropy, then for any partition Π of , (340).
Proof.
Fix any partition with chunks. Denote by the number of chunks in with cardinality . Therefore, (341). By exchangeability, any chunk of cardinality k has the same differential entropy, which we denote by . Then, (342), and the difference between the left and right sides of (340), multiplied by , is readily seen to equal (343) (344) (345), where
- (344) ⟸ for all ;
- (345) ⟸ the fact that is a concave sequence, i.e., (346), as a result of the sub-modularity of differential entropy. -
□
Naturally, the same proof applies to n discrete exchangeable random variables with finite joint entropy.
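The concavity step in (346) can be spelled out as follows; this is a sketch based on the sub-modularity of differential entropy, with h_k denoting the common differential entropy of any k of the exchangeable random variables (the symbol h_k is ours, and h_0 = 0). For any index sets A and B,
\[
h(X_A) + h(X_B) \ge h(X_{A \cup B}) + h(X_{A \cap B}).
\]
Choosing A and B with |A| = |B| = k, |A ∩ B| = k − 1 and |A ∪ B| = k + 1, exchangeability yields
\[
2 h_k \ge h_{k+1} + h_{k-1}, \qquad k = 1, \dots, n-1,
\]
i.e., the sequence (h_k) is concave, which is precisely the property invoked in the last step of the proof.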
17. Outlook
-
117.We have seen that a number of key information theoretic properties pertaining to the Gaussian law are also satisfied in the Cauchy case. Conceptually, those extensions shed light on the underlying reason why the conventional Gaussian results hold. Naturally, we would like to explore how far beyond the Cauchy law those results can be expanded. As far as the maximization of differential entropy is concerned, the essential step is to redefine strength, tailoring it to the desired law: Fix a reference random variable W with probability density function and finite differential entropy , and define the W-strength of a real-valued random variable Z as
For example,(347) -
(a)For , ;
-
(b)if W is standard normal, then ;
-
(c)if V is standard Cauchy, then ;
-
(d)if W is standard exponential, then if a.s., otherwise, ;
-
(e)if W is standard Subbotin (108) with , then, ;
- (f)
-
(g)if W is uniformly distributed on , ;
-
(h)if W is standard Rayleigh, then if a.s., otherwise, .
Theorem 19.
Suppose and . Then,(348) Proof.
Fix any Z in the feasible set. For any such that , we have (349) (350). Therefore, , by definition of , thereby establishing ≤ in (348). Equality holds since . □
A corollary to Theorem 19 is a very general form of the Shannon lower bound for the rate–distortion function of a memoryless source Z such that the distortion is constrained to have W-strength not higher than D, namely, (351). Theorem 19 finds an immediate extension to the multivariate case
where, for with , we have defined(352) (353) For example, if is zero-mean multivariate Gaussian with positive definite covariance , then .
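As a sanity check on Theorem 19, consider the Gaussian reference: under the reading that example (b) above gives the standard-normal W-strength of Z as the root mean square \(\sqrt{\mathbb{E}[Z^2]}\), and that (348) states that the maximal differential entropy under a W-strength constraint s equals h(sW) = h(W) + log s, Theorem 19 recovers the classical fact that the Gaussian law maximizes differential entropy under a second-moment constraint:
\[
\max_{Z:\; \mathbb{E}[Z^2] \le s^2} h(Z) = \tfrac{1}{2} \log\big(2 \pi e\, s^2\big) = h(W) + \log s, \qquad W \sim \mathcal{N}(0,1).
\]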
-
118.
One aspect in which we have shown that Cauchy distributions lend themselves to simplification unavailable in the Gaussian case is the single-parametrization of their likelihood ratio, which paves the way for a slew of closed-form expressions for f-divergences and Rényi divergences. It would be interesting to identify other multiparameter (even just scale/location) families of distributions that enjoy the same property. To that end, it is natural, though by no means assured of success, to study various generalizations of the Cauchy distribution, such as the Student-t random variable or, more generally, the Rider distribution in (9). The information theoretic study of general stable distributions is hampered by the fact that they are characterized by their characteristic functions (e.g., p. 164 in [55]), which, so far, have not lent themselves to the determination of relative entropy or even differential entropy.
-
119.
Although we cannot expect that the cornucopia of information theoretic results in the Gaussian case can be extended to other domains, we have been able to show that a number of those results do find counterparts in the Cauchy case. Nevertheless, much remains to be explored. To name a few open problems:
-
(a)The concavity of the entropy strength (a counterpart of Costa’s entropy power inequality [40]) would guarantee the least favorability of Cauchy noise among all strength-constrained noises, as well as the entropy strength inequality
(354) -
(b)
Information theoretic analyses quantifying the approach to normality in the central limit theorem are well known (e.g., [56,57,58]). It would be interesting to explore the decrease in the relative entropy (relative to the Cauchy law) of sums of independent random variables distributed according to a law in the domain of attraction of the Cauchy distribution [55].
-
(c)
Since de Bruijn’s identity is one of the ancestors of the I-MMSE formula of [59], and we now have a counterpart of de Bruijn’s identity for convolutions with scaled Cauchy random variables, it is natural to wonder whether there may be some sort of integral representation of the mutual information between a random variable and its noisy version contaminated by additive Cauchy noise. In this respect, note that counterparts of the I-MMSE formula for models other than additive Gaussian noise have been found in [60,61,62].
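For reference, the I-MMSE formula of [59], whose Cauchy analog is being asked about here, can be stated as follows (the notation snr for the signal-to-noise ratio is ours):
\[
\frac{d}{d\,\mathrm{snr}}\, I\big(X; \sqrt{\mathrm{snr}}\, X + N\big) = \frac{1}{2}\, \mathbb{E}\Big[\big(X - \mathbb{E}\big[X \,\big|\, \sqrt{\mathrm{snr}}\, X + N\big]\big)^2\Big], \qquad N \sim \mathcal{N}(0,1) \text{ independent of } X,
\]
in nats, e.g., whenever E[X^2] < ∞; the right side is one half of the minimum mean-square error of estimating X from its noisy observation.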
-
(d)
Mutual information is robust against the addition of small non-Gaussian contamination in the sense that its effect is the same as if the contamination were Gaussian [63]. The proof methods rely on Taylor series expansions that require the existence of moments. Any Cauchy counterparts (recall Item 77) would require substantially different methods.
-
(e)
Pinsker [41] showed that Gaussian processes are information stable under only very mild assumptions. The key is that, modulo a factor, the variance of the information density is upper bounded by its mean, namely the mutual information. Does the spherical multivariate Cauchy distribution enjoy similar properties?
-
120.
Although not surveyed here, there are indeed a number of results in the engineering literature advocating Cauchy models in certain heavy-tailed infinite-variance scenarios (see, e.g., [45] and the references therein). In the end, either we abide by the information theoretic maxim that “there is nothing more practical than a beautiful formula”, or we pay heed to Poisson, who, after pointing out in [64] that Laplace’s proof of the central limit theorem broke down for what we now refer to as the Cauchy law, remarked that “Mais nous ne tiendrons pas compte de ce cas particulier, qu’il nous suffira d’avoir remarqué à cause de sa singularité, et qui ne se rencontre sans doute pas dans la pratique” (“But we shall not take into account this particular case, which it will suffice for us to have noted on account of its singularity, and which doubtless does not arise in practice”).
Appendix A. Definite Integrals
(A1)
(A2)
(A3)
(A4)
(A5)
(A6)
(A7)
(A8)
(A9)
(A10)
(A11)
(A12)
(A13)
(A14)
(A15)
(A16)
where
(A6), with defined in (10) and denoting the digamma function, follows from 4.256 in [24] by change of variable and ;
(A12), with denoting the gamma function, is a special case of 3.251.11 in [24];
(A13) can be obtained from 3.251.11 in [24] by change of variable;
-
(A16) is a special case of 3.152.1 in [24], with the complete elliptic integral of the first kind defined as in 8.112.1 of [24], namely,
(A18) K(k) = \int_0^{\pi/2} \frac{d\theta}{\sqrt{1 - k^2 \sin^2\theta}}.
Note that Mathematica defines the complete elliptic integral function EllipticK in terms of the parameter m = k^2, namely,
(A19) \mathrm{EllipticK}[m] = \int_0^{\pi/2} \frac{d\theta}{\sqrt{1 - m \sin^2\theta}} = K(\sqrt{m}).
Institutional Review Board Statement
Not applicable.
Data Availability Statement
Not applicable.
Conflicts of Interest
The author declares no conflict of interest.
Funding Statement
This research received no external funding.
References
- 1. Shannon C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948;27:379–423, 623–656. doi: 10.1002/j.1538-7305.1948.tb01338.x.
- 2. Verdú S. The exponential distribution in information theory. Probl. Inf. Transm. 1996;32:86–95.
- 3. Anantharam V., Verdú S. Bits through queues. IEEE Trans. Inf. Theory. 1996;42:4–18. doi: 10.1109/18.481773.
- 4. Stam A. Some inequalities satisfied by the quantities of information of Fisher and Shannon. Inf. Control. 1959;2:101–112. doi: 10.1016/S0019-9958(59)90348-1.
- 5. Ferguson T.S. A representation of the symmetric bivariate Cauchy distribution. Ann. Math. Stat. 1962;33:1256–1266. doi: 10.1214/aoms/1177704357.
- 6. Fang K.T., Kotz S., Ng K.W. Symmetric Multivariate and Related Distributions. CRC Press; Boca Raton, FL, USA: 2018.
- 7. Rider P.R. Generalized Cauchy distributions. Ann. Inst. Stat. Math. 1958;9:215–223. doi: 10.1007/BF02892507.
- 8. Bouhlel N., Rousseau D. A generic formula and some special cases for the Kullback–Leibler divergence between central multivariate Cauchy distributions. Entropy. 2022;24:838. doi: 10.3390/e24060838.
- 9. Abe S., Rajagopal A.K. Information theoretic approach to statistical properties of multivariate Cauchy-Lorentz distributions. J. Phys. A Math. Gen. 2001;34:8727–8731. doi: 10.1088/0305-4470/34/42/301.
- 10. Tulino A.M., Verdú S. Random matrix theory and wireless communications. Found. Trends Commun. Inf. Theory. 2004;1:1–182. doi: 10.1561/0100000001.
- 11. Widder D.V. The Stieltjes transform. Trans. Am. Math. Soc. 1938;43:7–60. doi: 10.1090/S0002-9947-1938-1501933-2.
- 12. Kullback S. Information Theory and Statistics. Dover; New York, NY, USA: 1968. Originally published in 1959 by John Wiley.
- 13. Wu Y., Verdú S. Rényi information dimension: Fundamental limits of almost lossless analog compression. IEEE Trans. Inf. Theory. 2010;56:3721–3747. doi: 10.1109/TIT.2010.2050803.
- 14. Donsker M.D., Varadhan S.R.S. Asymptotic evaluation of certain Markov process expectations for large time, I. Commun. Pure Appl. Math. 1975;28:1–47. doi: 10.1002/cpa.3160280102.
- 15. Donsker M.D., Varadhan S.R.S. Asymptotic evaluation of certain Markov process expectations for large time, III. Commun. Pure Appl. Math. 1977;29:369–461. doi: 10.1002/cpa.3160290405.
- 16. Lapidoth A., Moser S.M. Capacity bounds via duality with applications to multiple-antenna systems on flat-fading channels. IEEE Trans. Inf. Theory. 2003;49:2426–2467. doi: 10.1109/TIT.2003.817449.
- 17. Subbotin M.T. On the law of frequency of error. Mat. Sb. 1923;31:296–301.
- 18. Kapur J.N. Maximum-Entropy Models in Science and Engineering. Wiley-Eastern; New Delhi, India: 1989.
- 19. Cover T.M., Thomas J.A. Elements of Information Theory. 2nd ed. Wiley; New York, NY, USA: 2006.
- 20. Dembo A., Cover T.M., Thomas J.A. Information theoretic inequalities. IEEE Trans. Inf. Theory. 1991;37:1501–1518. doi: 10.1109/18.104312.
- 21. Han T.S. Information Spectrum Methods in Information Theory. Springer; Heidelberg, Germany: 2003.
- 22. Vajda I. Theory of Statistical Inference and Information. Kluwer; Dordrecht, The Netherlands: 1989.
- 23. Deza E., Deza M.M. Dictionary of Distances. Elsevier; Amsterdam, The Netherlands: 2006.
- 24. Gradshteyn I.S., Ryzhik I.M. Table of Integrals, Series, and Products. 7th ed. Academic Press; Burlington, MA, USA: 2007.
- 25. Sason I., Verdú S. f-divergence inequalities. IEEE Trans. Inf. Theory. 2016;62:5973–6006. doi: 10.1109/TIT.2016.2603151.
- 26. Nielsen F., Okamura K. On f-divergences between Cauchy distributions. Proceedings of the International Conference on Geometric Science of Information; Paris, France, 21–23 July 2021; pp. 799–807.
- 27. Eaton M.L. Group Invariance Applications in Statistics. Regional Conference Series in Probability and Statistics, Volume 1. Institute of Mathematical Statistics; Hayward, CA, USA: 1989.
- 28. McCullagh P. On the distribution of the Cauchy maximum-likelihood estimator. Proc. R. Soc. London Ser. A Math. Phys. Sci. 1993;440:475–479.
- 29. Verdú S. On channel capacity per unit cost. IEEE Trans. Inf. Theory. 1990;36:1019–1030. doi: 10.1109/18.57201.
- 30. Chyzak F., Nielsen F. A closed-form formula for the Kullback–Leibler divergence between Cauchy distributions. arXiv 2019, arXiv:1905.10965.
- 31. Verdú S. Mismatched estimation and relative entropy. IEEE Trans. Inf. Theory. 2010;56:3712–3720. doi: 10.1109/TIT.2010.2050800.
- 32. Csiszár I. I-divergence geometry of probability distributions and minimization problems. Ann. Probab. 1975;3:146–158. doi: 10.1214/aop/1176996454.
- 33. Sason I., Verdú S. Bounds among f-divergences. arXiv 2015, arXiv:1508.00335.
- 34. Abramowitz M., Stegun I.A. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Volume 55. US Government Printing Office; Washington, DC, USA: 1964.
- 35. Rényi A. On measures of information and entropy. In: Neyman J., editor. Proceedings of the 4th Berkeley Symposium on Mathematical Statistics and Probability. University of California Press; Berkeley, CA, USA: 1961; pp. 547–561.
- 36. Gil M., Alajaji F., Linder T. Rényi divergence measures for commonly used univariate continuous distributions. Inf. Sci. 2013;249:124–131. doi: 10.1016/j.ins.2013.06.018.
- 37. González M. Elliptic integrals in terms of Legendre polynomials. Glasg. Math. J. 1954;2:97–99. doi: 10.1017/S2040618500033104.
- 38. Nielsen F. Revisiting Chernoff information with likelihood ratio exponential families. Entropy. 2022;24:1400. doi: 10.3390/e24101400.
- 39. Fisher R.A. Theory of statistical estimation. Math. Proc. Camb. Philos. Soc. 1925;22:700–725. doi: 10.1017/S0305004100009580.
- 40. Costa M.H.M. A new entropy power inequality. IEEE Trans. Inf. Theory. 1985;31:751–760. doi: 10.1109/TIT.1985.1057105.
- 41. Pinsker M.S. Information and Information Stability of Random Variables and Processes. Holden-Day; San Francisco, CA, USA: 1964. Originally published in Russian in 1960.
- 42. Kullback S., Leibler R.A. On information and sufficiency. Ann. Math. Stat. 1951;22:79–86. doi: 10.1214/aoms/1177729694.
- 43. Pinsker M.S. Calculation of the rate of message generation by a stationary random process and the capacity of a stationary channel. Dokl. Akad. Nauk. 1956;111:753–766.
- 44. Ihara S. On the capacity of channels with additive non-Gaussian noise. Inf. Control. 1978;37:34–39. doi: 10.1016/S0019-9958(78)90413-8.
- 45. Fahs J., Abou-Faycal I.C. A Cauchy input achieves the capacity of a Cauchy channel under a logarithmic constraint. Proceedings of the 2014 IEEE International Symposium on Information Theory; Honolulu, HI, USA, 29 June–4 July 2014; pp. 3077–3081.
- 46. Rioul O., Magossi J.C. On Shannon’s formula and Hartley’s rule: Beyond the mathematical coincidence. Entropy. 2014;16:4892–4910. doi: 10.3390/e16094892.
- 47. Dytso A., Egan M., Perlaza S., Poor H., Shamai S. Optimal inputs for some classes of degraded wiretap channels. Proceedings of the 2018 IEEE Information Theory Workshop; Guangzhou, China, 25–29 November 2018; pp. 1–7.
- 48. Cover T.M. Some advances in broadcast channels. In: Viterbi A.J., editor. Advances in Communication Systems. Volume 4. Academic Press; New York, NY, USA: 1975; pp. 229–260.
- 49. Wyner A.D. Recent results in the Shannon theory. IEEE Trans. Inf. Theory. 1974;20:2–9. doi: 10.1109/TIT.1974.1055171.
- 50. Berger T. Rate Distortion Theory. Prentice-Hall; Englewood Cliffs, NJ, USA: 1971.
- 51. Koshelev V.N. Estimation of mean error for a discrete successive approximation scheme. Probl. Inf. Transm. 1981;17:20–33.
- 52. Equitz W.H.R., Cover T.M. Successive refinement of information. IEEE Trans. Inf. Theory. 1991;37:269–274. doi: 10.1109/18.75242.
- 53. Kotz S., Nadarajah S. Multivariate t-Distributions and Their Applications. Cambridge University Press; Cambridge, UK: 2004.
- 54. Csiszár I., Narayan P. The secret key capacity of multiple terminals. IEEE Trans. Inf. Theory. 2004;50:3047–3061. doi: 10.1109/TIT.2004.838380.
- 55. Kolmogorov A.N., Gnedenko B.V. Limit Distributions for Sums of Independent Random Variables. Addison-Wesley; Reading, MA, USA: 1954.
- 56. Barron A.R. Entropy and the central limit theorem. Ann. Probab. 1986;14:336–342. doi: 10.1214/aop/1176992632.
- 57. Artstein S., Ball K., Barthe F., Naor A. Solution of Shannon’s problem on the monotonicity of entropy. J. Am. Math. Soc. 2004;17:975–982. doi: 10.1090/S0894-0347-04-00459-X.
- 58. Tulino A.M., Verdú S. Monotonic decrease of the non-Gaussianness of the sum of independent random variables: A simple proof. IEEE Trans. Inf. Theory. 2006;52:4295–4297. doi: 10.1109/TIT.2006.880066.
- 59. Guo D., Shamai S., Verdú S. Mutual information and minimum mean–square error in Gaussian channels. IEEE Trans. Inf. Theory. 2005;51:1261–1282. doi: 10.1109/TIT.2005.844072.
- 60. Guo D., Shamai S., Verdú S. Mutual information and conditional mean estimation in Poisson channels. IEEE Trans. Inf. Theory. 2008;54:1837–1849. doi: 10.1109/TIT.2008.920206.
- 61. Jiao J., Venkat K., Weissman T. Relations between information and estimation in discrete-time Lévy channels. IEEE Trans. Inf. Theory. 2017;63:3579–3594. doi: 10.1109/TIT.2017.2692211.
- 62. Arras B., Swan Y. IT formulae for gamma target: Mutual information and relative entropy. IEEE Trans. Inf. Theory. 2018;64:1083–1091. doi: 10.1109/TIT.2017.2759279.
- 63. Pinsker M.S., Prelov V., Verdú S. Sensitivity of channel capacity. IEEE Trans. Inf. Theory. 1995;41:1877–1888. doi: 10.1109/18.476313.
- 64. Poisson S.D. Sur la probabilité des résultats moyens des observations. In: Connaissance des Tems, ou des Mouvemens Célestes, à l’usage des Astronomes et des Navigateurs, pour l’an 1827. Bureau des Longitudes; Paris, France: 1824; pp. 273–302.