Entropy. 2019 Dec 30;22(1):51. doi: 10.3390/e22010051

An Integral Representation of the Logarithmic Function with Applications in Information Theory

Neri Merhav 1, Igal Sason 1,*

Abstract

We explore a well-known integral representation of the logarithmic function, and demonstrate its usefulness in obtaining compact, easily computable exact formulas for quantities that involve expectations and higher moments of the logarithm of a positive random variable (or the logarithm of a sum of i.i.d. positive random variables). The integral representation of the logarithm is proved useful in a variety of information-theoretic applications, including universal lossless data compression, entropy and differential entropy evaluations, and the calculation of the ergodic capacity of the single-input, multiple-output (SIMO) Gaussian channel with random parameters (known to both transmitter and receiver). This integral representation and its variants are anticipated to serve as a useful tool in additional applications, as a rigorous alternative to the popular (but non-rigorous) replica method (at least in some situations).

Keywords: integral representation, logarithmic expectation, universal data compression, entropy, differential entropy, ergodic capacity, SIMO channel, multivariate Cauchy distribution

1. Introduction

In analytic derivations pertaining to many problem areas in information theory, one frequently encounters the need to calculate expectations and higher moments of expressions that involve the logarithm of a positive-valued random variable, or more generally, the logarithm of the sum of several i.i.d. positive random variables. The common practice, in such situations, is either to resort to upper and lower bounds on the desired expression (e.g., using Jensen’s inequality or any other well-known inequalities), or to apply the Taylor series expansion of the logarithmic function. A more modern approach is to use the replica method (see, e.g., in [1] (Chapter 8)), which is a popular (but non-rigorous) tool that has been borrowed from the field of statistical physics with considerable success.

The purpose of this work is to point out an alternative approach and to demonstrate its usefulness in some frequently encountered situations. In particular, we consider the following integral representation of the logarithmic function (to be proved in the sequel),

\ln x = \int_0^\infty \frac{e^{-u} - e^{-ux}}{u}\, du, \quad x > 0. \qquad (1)

The immediate use of this representation is in situations where the argument of the logarithmic function is a positive-valued random variable, X, and we wish to calculate the expectation, E{lnX}. By commuting the expectation operator with the integration over u (assuming that this commutation is valid), the calculation of E{lnX} is replaced by the (often easier) calculation of the moment-generating function (MGF) of X, as

E\{\ln X\} = \int_0^\infty \left( e^{-u} - E\{e^{-uX}\} \right) \frac{du}{u}. \qquad (2)

Moreover, if X1,,Xn are positive i.i.d. random variables, then

E\{\ln(X_1 + \cdots + X_n)\} = \int_0^\infty \left( e^{-u} - \left[ E\{e^{-uX_1}\} \right]^n \right) \frac{du}{u}. \qquad (3)

This simple idea is not quite new. It has been used in the physics literature, see, e.g., [1] (Exercise 7.6, p. 140), [2] (Equation (2.4) and onward) and [3] (Equation (12) and onward). With the exception of [4], we are not aware of any work in the information theory literature where it has been used. The purpose of this paper is to demonstrate additional information-theoretic applications, as the need to evaluate logarithmic expectations is not rare at all in many problem areas of information theory. Moreover, the integral representation (1) is useful also for evaluating higher moments of lnX, most notably, the second moment or variance, in order to assess the statistical fluctuations around the mean.
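As a concrete illustration of (2), the following minimal numerical sketch (assuming Python with NumPy/SciPy, which are not used in the paper itself) verifies the identity for X ~ Exp(1), whose MGF at -u is 1/(1+u) and whose logarithmic expectation is known to equal -γ (the Euler–Mascheroni constant):

    import numpy as np
    from scipy.integrate import quad

    # E{ln X} for X ~ Exp(1) via (2): the MGF term is E{e^{-uX}} = 1/(1+u).
    integral, _ = quad(lambda u: (np.exp(-u) - 1.0/(1.0 + u)) / u, 0, np.inf)

    print(integral)           # ~ -0.5772
    print(-np.euler_gamma)    # exact value: E{ln X} = -gamma for Exp(1)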

We demonstrate the usefulness of this approach in several application areas, including entropy and differential entropy evaluations, performance analysis of universal lossless source codes, and calculations of the ergodic capacity of the Rayleigh single-input multiple-output (SIMO) channel. In some of these examples, we also demonstrate the calculation of variances associated with the relevant random variables of interest. As a side remark, in the same spirit of introducing integral representations and applying them, Simon and Divsalar [5,6] brought to the attention of communication theorists useful, definite-integral forms of the Q–function (Craig’s formula [7]) and Marcum Q–function, and demonstrated their utility in applications.

It should be pointed out that most of our results remain in the form of a single- or double-definite integral of certain functions that depend on the parameters of the problem in question. Strictly speaking, such a definite integral may not be considered a closed-form expression, but nevertheless, we can say the following.

  • (a)

    In most of our examples, the expression we obtain is more compact, more elegant, and often more insightful than the original quantity.

  • (b)

    The resulting definite integral can actually be considered a closed-form expression “for every practical purpose” since definite integrals in one or two dimensions can be calculated instantly using built-in numerical integration operations in MATLAB, Maple, Mathematica, or other mathematical software tools. This is largely similar to the case of expressions that include standard functions (e.g., trigonometric, logarithmic, exponential functions, etc.), which are commonly considered to be closed-form expressions.

  • (c)

    The integrals can also be evaluated by power series expansions of the integrand, followed by term-by-term integration.

  • (d)

    Owing to Item (c), the asymptotic behavior in the parameters of the model can be evaluated.

  • (e)

    At least in two of our examples, we show how to pass from an n–dimensional integral (with an arbitrarily large n) to one– or two–dimensional integrals. This passage is in the spirit of the transition from a multiletter expression to a single–letter expression.

To give some preliminary flavor of our message in this work, we conclude this introduction by mentioning a possible use of the integral representation in the context of calculating the entropy of a Poissonian random variable. For a Poissonian random variable, N, with parameter λ, the entropy (in nats) is given by

H(\lambda) = -E\left\{ \ln \frac{e^{-\lambda} \lambda^N}{N!} \right\} = \lambda - \lambda \ln \lambda + E\{\ln N!\}, \qquad (4)

where the nontrivial part of the calculation is associated with the last term, E{lnN!}. In [8], this term was handled by using a nontrivial formula due to Malmstén (see [9] (pp. 20–21)), which represents the logarithm of Euler’s Gamma function in an integral form (see also [10]). In Section 2, we derive the relevant quantity using (1), in a simpler and more transparent form which is similar to [11] ((2.3)–(2.4)).

The outline of the remaining part of this paper is as follows. In Section 2, we provide some basic mathematical background concerning the integral representation (2) and some of its variants. In Section 3, we present the application examples. Finally, in Section 4, we summarize and provide some outlook.

2. Mathematical Background

In this section, we present the main mathematical background associated with the integral representation (1), and provide several variants of this relation, most of which are later used in this paper. For reasons that will become apparent shortly, we extend the scope to the complex plane.

Proposition 1.

\ln z = \int_0^\infty \frac{e^{-u} - e^{-uz}}{u}\, du, \quad \mathrm{Re}(z) \ge 0. \qquad (5)

Proof. 

\ln z = (z-1) \int_0^1 \frac{dv}{1 + v(z-1)} \qquad (6)
= (z-1) \int_0^1 \int_0^\infty e^{-u[1 + v(z-1)]}\, du\, dv \qquad (7)
= (z-1) \int_0^\infty e^{-u} \int_0^1 e^{-uv(z-1)}\, dv\, du \qquad (8)
= \int_0^\infty \frac{e^{-u}}{u} \left( 1 - e^{-u(z-1)} \right) du \qquad (9)
= \int_0^\infty \frac{e^{-u} - e^{-uz}}{u}\, du, \qquad (10)

where Equation (7) holds since \mathrm{Re}\{1 + v(z-1)\} > 0 for all v \in (0,1), based on the assumption that \mathrm{Re}(z) \ge 0; (8) holds by switching the order of integration. □

Remark 1.

In [12] (p. 363, Identity (3.434.2)), it is stated that

\int_0^\infty \frac{e^{-\mu x} - e^{-\nu x}}{x}\, dx = \ln \frac{\nu}{\mu}, \quad \mathrm{Re}(\mu) > 0,\ \mathrm{Re}(\nu) > 0. \qquad (11)

Proposition 1 also applies to any purely imaginary number, z, which is of interest too (see Corollary 1 in the sequel, and the identity with the characteristic function in (14)).

Proposition 1 paves the way to obtaining some additional related integral representations of the logarithmic function for the reals.

Corollary 1.

([12] (p. 451, Identity 3.784.1)) For every x>0,

\ln x = \int_0^\infty \frac{\cos(u) - \cos(ux)}{u}\, du. \qquad (12)

Proof. 

By Proposition 1 and the identity \ln x \equiv \mathrm{Re}\{\ln(ix)\} (with i := \sqrt{-1}), we get

\ln x = \int_0^\infty \frac{e^{-u} - \cos(ux)}{u}\, du. \qquad (13)

Subtracting from both sides the integral in (13) at x = 1 (which is equal to zero) gives (12). □

Let X be a positive random variable, and let \Phi_X(\nu) := E\{e^{i\nu X}\} be the characteristic function of X. Then, by Corollary 1,

E\{\ln X\} = \int_0^\infty \frac{\cos(u) - \mathrm{Re}\{\Phi_X(u)\}}{u}\, du, \qquad (14)

where we are assuming, here and throughout the sequel, that the expectation operation and the integration over u are commutable, i.e., Fubini’s theorem applies.

Similarly, by returning to Proposition 1 (confined to a real-valued argument of the logarithm), the calculation of E{lnX} can be replaced by the calculation of the MGF of X, as

E\{\ln X\} = \int_0^\infty \left( e^{-u} - E\{e^{-uX}\} \right) \frac{du}{u}. \qquad (15)

In particular, if X1,,Xn are positive i.i.d. random variables, then

E\{\ln(X_1 + \cdots + X_n)\} = \int_0^\infty \left( e^{-u} - \left[ E\{e^{-uX_1}\} \right]^n \right) \frac{du}{u}. \qquad (16)

Remark 2.

One may further manipulate (15) and (16) as follows. As \ln x \equiv \frac{1}{s} \ln(x^s) for any s \ne 0 and x > 0, the expectation of \ln X can also be represented as

E\{\ln X\} = \frac{1}{s} \int_0^\infty \left( e^{-u} - E\{e^{-uX^s}\} \right) \frac{du}{u}, \quad s \ne 0. \qquad (17)

The idea is that if, for some s \notin \{0,1\}, E\{e^{-uX^s}\} can be expressed in closed form, whereas it cannot for s = 1 (or even E\{e^{-uX^s}\} < \infty for some s \notin \{0,1\}, but not for s = 1), then (17) may prove useful. Moreover, if X_1, \ldots, X_n are positive i.i.d. random variables, s > 0, and Y := (X_1^s + \cdots + X_n^s)^{1/s}, then

E\{\ln Y\} = \frac{1}{s} \int_0^\infty \left( e^{-u} - \left[ E\{e^{-uX_1^s}\} \right]^n \right) \frac{du}{u}. \qquad (18)

For example, if \{X_i\} are i.i.d. standard Gaussian random variables and s = 2, then (18) enables one to calculate the expected value of the logarithm of a chi-squared distributed random variable with n degrees of freedom. In this case,

E\{e^{-uX_1^2}\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{-ux^2} e^{-x^2/2}\, dx = \frac{1}{\sqrt{2u+1}}, \qquad (19)

and, from (18) with s=2,

E\{\ln Y\} = \frac{1}{2} \int_0^\infty \left( e^{-u} - (2u+1)^{-n/2} \right) \frac{du}{u}. \qquad (20)

Note that, according to the pdf of a chi-squared distribution, one can express E\{\ln Y\} as a one-dimensional integral even without using (18). However, for general s > 0, the direct calculation of E\{\ln \sum_{i=1}^n |X_i|^s\} leads to an n-dimensional integral, whereas (18) provides a one-dimensional integral whose integrand involves, in turn, the calculation of a one-dimensional integral.
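As a quick numerical check of (20) (a sketch assuming Python/SciPy, not part of the text above), one may compare it with the classical formula E\{\ln \chi_n^2\} = \psi(n/2) + \ln 2, where \psi is the digamma function; (20) gives half of that quantity, since Y is the square root of a chi-squared variable:

    import numpy as np
    from scipy.integrate import quad
    from scipy.special import digamma

    n = 5
    # (20): E{ln Y} with Y = sqrt(X_1^2 + ... + X_n^2), i.e., half of E{ln chi^2_n}.
    integral, _ = quad(lambda u: (np.exp(-u) - (2*u + 1)**(-n/2)) / u, 0, np.inf)

    print(0.5 * integral)                       # via (20)
    print(0.5 * (digamma(n/2) + np.log(2)))     # known value of E{ln chi^2_n}, halved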

Identity (1) also proves useful when one is interested not only in the expected value of \ln X, but also in its higher moments, in particular, its second moment or variance. In this case, the one-dimensional integral becomes a two-dimensional one. Specifically, for any s > 0,

\mathrm{Var}\{\ln X\} = E\{\ln^2 X\} - \left[ E\{\ln X\} \right]^2 = \frac{1}{s^2}\, E\left\{ \int_0^\infty \int_0^\infty \left( e^{-u} - e^{-uX^s} \right)\left( e^{-v} - e^{-vX^s} \right) \frac{du\, dv}{uv} \right\} \qquad (21)
\quad - \frac{1}{s^2} \int_0^\infty \int_0^\infty \left( e^{-u} - E\{e^{-uX^s}\} \right)\left( e^{-v} - E\{e^{-vX^s}\} \right) \frac{du\, dv}{uv} \qquad (22)
= \frac{1}{s^2} \int_0^\infty \int_0^\infty \left( E\{e^{-(u+v)X^s}\} - E\{e^{-uX^s}\}\, E\{e^{-vX^s}\} \right) \frac{du\, dv}{uv} \qquad (23)
= \frac{1}{s^2} \int_0^\infty \int_0^\infty \mathrm{Cov}\{e^{-uX^s}, e^{-vX^s}\}\, \frac{du\, dv}{uv}. \qquad (24)

More generally, for a pair of positive random variables, (X,Y), and for s>0,

\mathrm{Cov}\{\ln X, \ln Y\} = \frac{1}{s^2} \int_0^\infty \int_0^\infty \mathrm{Cov}\{e^{-uX^s}, e^{-vY^s}\}\, \frac{du\, dv}{uv}. \qquad (25)

For later use, we present the following variation of the basic identity.

Proposition 2.

Let X be a random variable, and let

M_X(s) := E\{e^{sX}\}, \quad s \in \mathbb{R}, \qquad (26)

be the MGF of X. If X is non-negative, then

E\{\ln(1+X)\} = \int_0^\infty \frac{e^{-u} \left[ 1 - M_X(-u) \right]}{u}\, du, \qquad (27)
\mathrm{Var}\{\ln(1+X)\} = \int_0^\infty \int_0^\infty \frac{e^{-(u+v)}}{uv} \left[ M_X(-u-v) - M_X(-u)\, M_X(-v) \right] du\, dv. \qquad (28)

Proof. 

Equation (27) is a trivial consequence of (15). As for (28), we have

\mathrm{Var}\{\ln(1+X)\} = E\{\ln^2(1+X)\} - \left[ E\{\ln(1+X)\} \right]^2
= E\left\{ \int_0^\infty \frac{e^{-u}}{u} \left( 1 - e^{-uX} \right) du \int_0^\infty \frac{e^{-v}}{v} \left( 1 - e^{-vX} \right) dv \right\} - \int_0^\infty \int_0^\infty \frac{e^{-(u+v)}}{uv} \left[ 1 - M_X(-u) \right] \left[ 1 - M_X(-v) \right] du\, dv
= \int_0^\infty \int_0^\infty \frac{e^{-(u+v)}}{uv}\, E\left\{ \left( 1 - e^{-uX} \right)\left( 1 - e^{-vX} \right) \right\} du\, dv \qquad (29)
\quad - \int_0^\infty \int_0^\infty \frac{e^{-(u+v)}}{uv} \left[ 1 - M_X(-u) - M_X(-v) + M_X(-u)\, M_X(-v) \right] du\, dv
= \int_0^\infty \int_0^\infty \frac{e^{-(u+v)}}{uv} \left[ 1 - M_X(-u) - M_X(-v) + M_X(-u-v) \right] du\, dv \qquad (30)
\quad - \int_0^\infty \int_0^\infty \frac{e^{-(u+v)}}{uv} \left[ 1 - M_X(-u) - M_X(-v) + M_X(-u)\, M_X(-v) \right] du\, dv \qquad (31)
= \int_0^\infty \int_0^\infty \frac{e^{-(u+v)}}{uv} \left[ M_X(-u-v) - M_X(-u)\, M_X(-v) \right] du\, dv. \qquad (32)

 □

The following result relies on the validity of (5) on the right-half complex plane, and its derivation is based on the identity \ln(1+x^2) \equiv \ln(1+ix) + \ln(1-ix) for all x \in \mathbb{R}. In general, it may be used when the characteristic function of a random variable X has a closed-form expression, whereas the MGF of X^2 does not (see Proposition 2). We introduce the result, although it is not directly used in this paper.

Proposition 3.

Let X be a real-valued random variable, and let

\Phi_X(u) := E\{e^{iuX}\}, \quad u \in \mathbb{R}, \qquad (33)

be the characteristic function of X. Then,

E\{\ln(1+X^2)\} = 2 \int_0^\infty \frac{e^{-u}}{u} \left( 1 - \mathrm{Re}\{\Phi_X(u)\} \right) du, \qquad (34)

and

\mathrm{Var}\{\ln(1+X^2)\} = 2 \int_0^\infty \int_0^\infty \frac{e^{-u-v}}{uv} \left[ \mathrm{Re}\{\Phi_X(u+v)\} + \mathrm{Re}\{\Phi_X(u-v)\} - 2\, \mathrm{Re}\{\Phi_X(u)\}\, \mathrm{Re}\{\Phi_X(v)\} \right] du\, dv. \qquad (35)

As a final note, we point out that the integral representation (2), which replaces the expectation of the logarithm of X by the expectation of an exponential function of X, has an additional interesting consequence: an expression like \ln(n!) becomes the integral of the sum of a geometric series, which, in turn, is easy to express in closed form (see [11] ((2.3)–(2.4))). Specifically,

\ln(n!) = \sum_{k=1}^n \ln k = \sum_{k=1}^n \int_0^\infty \left( e^{-u} - e^{-uk} \right) \frac{du}{u} = \int_0^\infty \left( n e^{-u} - \sum_{k=1}^n e^{-uk} \right) \frac{du}{u} = \int_0^\infty e^{-u} \left( n - \frac{1 - e^{-un}}{1 - e^{-u}} \right) \frac{du}{u}. \qquad (36)

Thus, for a positive integer-valued random variable, N, the calculation of E\{\ln N!\} requires merely the calculation of E\{N\} and the MGF, E\{e^{-uN}\}. For example, if N is a Poissonian random variable, as discussed near the end of the Introduction, both E\{N\} and E\{e^{-uN}\} are easy to evaluate. This approach is a simple, direct alternative to the one taken in [8] (see also [10]), where Malmstén's nontrivial formula for \ln \Gamma(z) (see [9] (pp. 20–21)) was invoked. (Malmstén's formula for \ln \Gamma(z) applies to a general, complex-valued z with \mathrm{Re}(z) > 0; in the present context, however, only positive integer values of z are needed, which allows the simplification shown in (36).) The above-described idea of the geometric series will also be used in one of our application examples, in Section 3.4.
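To make the Poisson example concrete, the following numerical sketch (assuming Python/SciPy, and using the standard Poisson MGF E\{e^{-uN}\} = \exp\{\lambda(e^{-u}-1)\}) evaluates E\{\ln N!\} by taking the expectation of (36), then computes the entropy H(\lambda) in (4), and checks the result against a truncated direct sum:

    import numpy as np
    from scipy.integrate import quad
    from scipy.special import gammaln

    lam = 3.0

    # Expectation of (36): E{ln N!} = int_0^inf (e^{-u}/u) [lam - (1 - E{e^{-uN}})/(1 - e^{-u})] du
    def integrand(u):
        mgf = np.exp(lam * (np.exp(-u) - 1.0))     # E{e^{-uN}} for N ~ Poisson(lam)
        return np.exp(-u) / u * (lam - (1.0 - mgf) / (1.0 - np.exp(-u)))

    E_lnfact, _ = quad(integrand, 0, np.inf)
    H = lam - lam * np.log(lam) + E_lnfact          # entropy (4), in nats

    # Direct check: truncate the Poisson sum of ln(k!) at a large k.
    k = np.arange(0, 200)
    pmf = np.exp(-lam + k * np.log(lam) - gammaln(k + 1))
    print(E_lnfact, np.sum(pmf * gammaln(k + 1)))
    print(H)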

3. Applications

In this section, we show the usefulness of the integral representation of the logarithmic function in several problem areas in information theory. To demonstrate the direct computability of the relevant quantities, we also present graphs of their numerical calculation. In some of the examples, we also demonstrate calculations of the second moments and variances.

3.1. Differential Entropy for Generalized Multivariate Cauchy Densities

Let (X1,,Xn) be a random vector whose probability density function is of the form

f(x_1, \ldots, x_n) = \frac{C_n}{\left[ 1 + \sum_{i=1}^n g(x_i) \right]^q}, \quad (x_1, \ldots, x_n) \in \mathbb{R}^n, \qquad (37)

for a certain non–negative function g and positive constant q such that

\int_{\mathbb{R}^n} \frac{dx_1 \cdots dx_n}{\left[ 1 + \sum_{i=1}^n g(x_i) \right]^q} < \infty. \qquad (38)

We refer to this kind of density as a generalized multivariate Cauchy density, because the multivariate Cauchy density is obtained as the special case where g(x) = x^2 and q = (n+1)/2. Using the Laplace transform relation,

\frac{1}{s^q} = \frac{1}{\Gamma(q)} \int_0^\infty t^{q-1} e^{-st}\, dt, \quad q > 0,\ \mathrm{Re}(s) > 0, \qquad (39)

f can be represented as a mixture of product measures:

f(x_1, \ldots, x_n) = \frac{C_n}{\left[ 1 + \sum_{i=1}^n g(x_i) \right]^q} = \frac{C_n}{\Gamma(q)} \int_0^\infty t^{q-1} e^{-t} \exp\left\{ -t \sum_{i=1}^n g(x_i) \right\} dt. \qquad (40)

Defining

Z(t) := \int_{-\infty}^\infty e^{-t g(x)}\, dx, \quad t > 0, \qquad (41)

we get from (40),

1 = \frac{C_n}{\Gamma(q)} \int_0^\infty t^{q-1} e^{-t} \int_{\mathbb{R}^n} \exp\left\{ -t \sum_{i=1}^n g(x_i) \right\} dx_1 \cdots dx_n\, dt = \frac{C_n}{\Gamma(q)} \int_0^\infty t^{q-1} e^{-t} \left( \int_{-\infty}^\infty e^{-t g(x)}\, dx \right)^{\!n} dt = \frac{C_n}{\Gamma(q)} \int_0^\infty t^{q-1} e^{-t} Z^n(t)\, dt, \qquad (42)

and so,

C_n = \frac{\Gamma(q)}{\int_0^\infty t^{q-1} e^{-t} Z^n(t)\, dt}. \qquad (43)

The calculation of the differential entropy of f is associated with the evaluation of the expectation E\{\ln(1 + \sum_{i=1}^n g(X_i))\}. Using (27),

E\left\{ \ln\left( 1 + \sum_{i=1}^n g(X_i) \right) \right\} = \int_0^\infty \frac{e^{-u}}{u} \left( 1 - E\left\{ \exp\left( -u \sum_{i=1}^n g(X_i) \right) \right\} \right) du. \qquad (44)

From (40) and by interchanging the integration,

E\left\{ \exp\left( -u \sum_{i=1}^n g(X_i) \right) \right\} = \frac{C_n}{\Gamma(q)} \int_0^\infty t^{q-1} e^{-t} \int_{\mathbb{R}^n} \exp\left\{ -(t+u) \sum_{i=1}^n g(x_i) \right\} dx_1 \cdots dx_n\, dt = \frac{C_n}{\Gamma(q)} \int_0^\infty t^{q-1} e^{-t} Z^n(t+u)\, dt. \qquad (45)

In view of (40), (44), and (45), the differential entropy of (X1,,Xn) is therefore given by

h(X_1, \ldots, X_n) = q\, E\left\{ \ln\left( 1 + \sum_{i=1}^n g(X_i) \right) \right\} - \ln C_n = q \int_0^\infty \frac{e^{-u}}{u} \left( 1 - \frac{C_n}{\Gamma(q)} \int_0^\infty t^{q-1} e^{-t} Z^n(t+u)\, dt \right) du - \ln C_n = \frac{q\, C_n}{\Gamma(q)} \int_0^\infty \int_0^\infty \frac{t^{q-1} e^{-(t+u)}}{u} \left[ Z^n(t) - Z^n(t+u) \right] dt\, du - \ln C_n. \qquad (46)

For g(x) = |x|^\theta, with an arbitrary \theta > 0, we obtain from (41) that

Z(t) = \frac{2\, \Gamma(1/\theta)}{\theta\, t^{1/\theta}}. \qquad (47)

In particular, for \theta = 2 and q = (n+1)/2, we get the multivariate Cauchy density from (37). In this case, as \Gamma(1/2) = \sqrt{\pi}, it follows from (47) that Z(t) = \sqrt{\pi/t} for t > 0, and, from (43),

C_n = \frac{\Gamma\left( \frac{n+1}{2} \right)}{\pi^{n/2} \int_0^\infty t^{(n+1)/2 - 1} e^{-t}\, t^{-n/2}\, dt} = \frac{\Gamma\left( \frac{n+1}{2} \right)}{\pi^{n/2}\, \Gamma\left( \frac{1}{2} \right)} = \frac{\Gamma\left( \frac{n+1}{2} \right)}{\pi^{(n+1)/2}}. \qquad (48)

Combining (46), (47) and (48) gives

h(X_1, \ldots, X_n) = \frac{n+1}{2\sqrt{\pi}} \int_0^\infty \int_0^\infty \frac{e^{-(t+u)}}{u \sqrt{t}} \left[ 1 - \left( \frac{t}{t+u} \right)^{\!n/2} \right] dt\, du + \frac{n+1}{2} \ln \pi - \ln \Gamma\left( \frac{n+1}{2} \right). \qquad (49)

Figure 1 displays the normalized differential entropy, h(X_1, \ldots, X_n)/n, for 1 \le n \le 100.

Figure 1. The normalized differential entropy, h(X_1, \ldots, X_n)/n (see (49)), for the multivariate Cauchy density f(x_1, \ldots, x_n) = C_n / [1 + \sum_{i=1}^n x_i^2]^{(n+1)/2}, with C_n given in (48).
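As a sanity check on (49) (a numerical sketch assuming Python/SciPy; the integrand has an integrable singularity at t = 0, so the quadrature may issue accuracy warnings), the two-dimensional integral can be evaluated directly; for n = 1, the result should equal the differential entropy of the standard Cauchy density, \ln(4\pi):

    import numpy as np
    from scipy.integrate import dblquad
    from scipy.special import gammaln

    n = 1
    # Integrand of (49); dblquad integrates t (inner) for each u (outer).
    integrand = lambda t, u: np.exp(-(t + u)) / (u * np.sqrt(t)) * (1 - (t/(t + u))**(n/2))

    I, _ = dblquad(integrand, 0, np.inf, 0, np.inf)
    h = (n + 1)/(2*np.sqrt(np.pi)) * I + (n + 1)/2 * np.log(np.pi) - gammaln((n + 1)/2)

    print(h, np.log(4*np.pi))    # for n = 1, h should match ln(4*pi)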

We believe that the interesting point conveyed in this application example is that (46) provides a kind of "single-letter expression": the n-dimensional integral associated with the original expression of the differential entropy h(X_1, \ldots, X_n) is replaced by the two-dimensional integral in (46), independently of n.

As a final note, we mention that a lower bound on the differential entropy of a different form of extended multivariate Cauchy distributions (cf. [13] (Equation (42))) was derived in [13] (Theorem 6). The latter result relies on obtaining lower bounds on the differential entropy of random vectors whose densities are symmetric log-concave or γ-concave (i.e., densities f for which fγ is concave for some γ<0).

3.2. Ergodic Capacity of the Fading SIMO Channel

Consider the SIMO channel with L receive antennas, and assume that the channel transfer coefficients, \{h_\ell\}_{\ell=1}^L, are independent, zero-mean, circularly symmetric complex Gaussian random variables with variances \{\sigma_\ell^2\}_{\ell=1}^L. Its ergodic capacity (in nats per channel use) is given by

C = E\left\{ \ln\left( 1 + \rho \sum_{\ell=1}^L |h_\ell|^2 \right) \right\} = E\left\{ \ln\left( 1 + \rho \sum_{\ell=1}^L \left( f_\ell^2 + g_\ell^2 \right) \right) \right\}, \qquad (50)

where f_\ell := \mathrm{Re}\{h_\ell\}, g_\ell := \mathrm{Im}\{h_\ell\}, and \rho := P/N_0 is the signal-to-noise ratio (SNR) (see, e.g., [14,15]).

Paper [14] is devoted, among other things, to the exact evaluation of (50) by finding the density of the random variable \sum_{\ell=1}^L (f_\ell^2 + g_\ell^2), and then taking the expectation w.r.t. that density. Here, we show that the integral representation in (5) suggests a more direct approach to the evaluation of (50). It should also be pointed out that this approach is more flexible than the one in [14], as the latter strongly depends on the assumption that the \{h_\ell\} are Gaussian and statistically independent. The integral representation approach also allows other distributions of the channel transfer gains, as well as possible correlations between the coefficients and/or the channel inputs. Moreover, we are also able to calculate the variance of \ln(1 + \rho \sum_{\ell=1}^L |h_\ell|^2), as a measure of the fluctuations around the mean, which is obviously related to the outage.

Specifically, in view of Proposition 2 (see (27)), let

X := \rho \sum_{\ell=1}^L \left( f_\ell^2 + g_\ell^2 \right). \qquad (51)

For all u>0,

M_X(-u) = E\left\{ \exp\left( -\rho u \sum_{\ell=1}^L \left( f_\ell^2 + g_\ell^2 \right) \right) \right\} = \prod_{\ell=1}^L E\{e^{-u\rho f_\ell^2}\}\, E\{e^{-u\rho g_\ell^2}\} = \prod_{\ell=1}^L \frac{1}{1 + u\rho\sigma_\ell^2}, \qquad (52)

where (52) holds since

E\{e^{-u\rho f_\ell^2}\} = E\{e^{-u\rho g_\ell^2}\} = \int_{-\infty}^\infty \frac{dw}{\sqrt{\pi \sigma_\ell^2}}\, e^{-w^2/\sigma_\ell^2}\, e^{-u\rho w^2} = \frac{1}{\sqrt{1 + u\rho\sigma_\ell^2}}. \qquad (53)

From (27), (50) and (52), the ergodic capacity (in nats per channel use) is given by

C = E\left\{ \ln\left( 1 + \rho \sum_{\ell=1}^L \left( f_\ell^2 + g_\ell^2 \right) \right) \right\} = \int_0^\infty \frac{e^{-u}}{u} \left( 1 - \prod_{\ell=1}^L \frac{1}{1 + u\rho\sigma_\ell^2} \right) du = \int_0^\infty \frac{e^{-x/\rho}}{x} \left( 1 - \prod_{\ell=1}^L \frac{1}{1 + \sigma_\ell^2 x} \right) dx. \qquad (54)

A similar approach appears in [4] (Equation (12)).

As for the variance, from Proposition 2 (see (28)) and (52),

\mathrm{Var}\left\{ \ln\left( 1 + \rho \sum_{\ell=1}^L \left( f_\ell^2 + g_\ell^2 \right) \right) \right\} = \int_0^\infty \int_0^\infty \frac{e^{-(x+y)/\rho}}{xy} \left[ \prod_{\ell=1}^L \frac{1}{1 + \sigma_\ell^2 (x+y)} - \prod_{\ell=1}^L \frac{1}{\left( 1 + \sigma_\ell^2 x \right)\left( 1 + \sigma_\ell^2 y \right)} \right] dx\, dy. \qquad (55)

A similar analysis holds for the multiple-input single-output (MISO) channel. By partial–fraction decomposition of the expression (see the right side of (54))

\frac{1}{x} \left( 1 - \prod_{\ell=1}^L \frac{1}{1 + \sigma_\ell^2 x} \right),

the ergodic capacity C can be expressed as a linear combination of integrals of the form

\int_0^\infty \frac{e^{-x/\rho}\, dx}{1 + \sigma_\ell^2 x} = \frac{1}{\sigma_\ell^2} \int_0^\infty \frac{e^{-t}\, dt}{t + 1/(\sigma_\ell^2 \rho)} = \frac{e^{1/(\sigma_\ell^2 \rho)}}{\sigma_\ell^2} \int_{1/(\sigma_\ell^2 \rho)}^\infty \frac{e^{-s}}{s}\, ds = \frac{1}{\sigma_\ell^2}\, e^{1/(\sigma_\ell^2 \rho)}\, E_1\!\left( \frac{1}{\sigma_\ell^2 \rho} \right), \qquad (56)

where E1(·) is the (modified) exponential integral function, defined as

E_1(x) := \int_x^\infty \frac{e^{-s}}{s}\, ds, \quad x > 0. \qquad (57)

A similar representation appears also in [14] (Equation (7)).

Consider the example of L = 2, \sigma_1^2 = 1/2, and \sigma_2^2 = 1. From (54), the ergodic capacity of the SIMO channel is given by

C = \int_0^\infty \frac{e^{-x/\rho}}{x} \left( 1 - \frac{1}{(x/2+1)(x+1)} \right) dx = \int_0^\infty \frac{e^{-x/\rho}\, (x+3)\, dx}{(x+1)(x+2)} = 2\, e^{1/\rho}\, E_1\!\left( \frac{1}{\rho} \right) - e^{2/\rho}\, E_1\!\left( \frac{2}{\rho} \right). \qquad (58)

The variance in this example (see (55)) is given by

\mathrm{Var}\left\{ \ln\left( 1 + \rho \sum_{\ell=1}^2 \left( f_\ell^2 + g_\ell^2 \right) \right) \right\} = \int_0^\infty \int_0^\infty \frac{e^{-(x+y)/\rho}}{xy} \left[ \frac{1}{\left( 1 + \tfrac{x+y}{2} \right)(1+x+y)} - \frac{1}{\left( 1 + \tfrac{x}{2} \right)\left( 1 + \tfrac{y}{2} \right)(1+x)(1+y)} \right] dx\, dy = \int_0^\infty \int_0^\infty \frac{e^{-(x+y)/\rho}\, (2xy + 6x + 6y + 10)\, dx\, dy}{(x+1)(y+1)(x+2)(y+2)(x+y+1)(x+y+2)}. \qquad (59)

Figure 2 depicts the ergodic capacity C as a function of the SNR, \rho, in dB (see (58), and divide by \ln 2 for conversion to bits per channel use). Exactly the same example appears in the lower graph of Figure 1 in [14]. The variance appears in Figure 3 (see (59), and similarly divide by \ln^2 2).
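For reproducibility of this example (a sketch assuming Python/SciPy, where scipy.special.exp1 implements the exponential integral E_1), one can evaluate (54) numerically and compare it with the closed form (58):

    import numpy as np
    from scipy.integrate import quad
    from scipy.special import exp1   # exponential integral E_1

    rho = 10.0   # SNR on a linear scale (10 dB)

    # (54) specialized to L = 2, sigma_1^2 = 1/2, sigma_2^2 = 1 (integrand as in (58)):
    numeric, _ = quad(lambda x: np.exp(-x/rho) * (x + 3) / ((x + 1)*(x + 2)), 0, np.inf)

    # Closed form (58): C = 2 e^{1/rho} E_1(1/rho) - e^{2/rho} E_1(2/rho), in nats
    closed = 2*np.exp(1/rho)*exp1(1/rho) - np.exp(2/rho)*exp1(2/rho)

    print(numeric / np.log(2), closed / np.log(2))   # bits per channel use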

Figure 2. The ergodic capacity C (in bits per channel use) of the SIMO channel as a function of \rho = SNR (in dB) for L = 2 receive antennas, with channel-coefficient variances \sigma_1^2 = 1/2 and \sigma_2^2 = 1.

Figure 3. The variance of \ln(1 + \rho \sum_{\ell=1}^L |h_\ell|^2) (in [bits per channel use]^2) of the SIMO channel as a function of \rho = SNR (in dB) for L = 2 receive antennas, with channel-coefficient variances \sigma_1^2 = 1/2 and \sigma_2^2 = 1.

3.3. Universal Source Coding for Binary Arbitrarily Varying Sources

Consider a source coding setting where there are n binary DMSs, and let x_i \in [0,1] denote the Bernoulli parameter of source no. i \in \{1, \ldots, n\}. Assume that a hidden memoryless switch selects one of these sources uniformly at random, and the data are then emitted by the selected source. Since it is unknown a priori which source is selected at each instant, a universal lossless source encoder (e.g., a Shannon or Huffman code) is designed to match a binary DMS whose Bernoulli parameter is given by \frac{1}{n}\sum_{i=1}^n x_i. Neglecting integer length constraints, the average redundancy in the compression rate (measured in nats per symbol), due to the unknown realization of the hidden switch, is about

R_n = h_b\!\left( \frac{1}{n} \sum_{i=1}^n x_i \right) - \frac{1}{n} \sum_{i=1}^n h_b(x_i), \qquad (60)

where h_b: [0,1] \to [0, \ln 2] is the binary entropy function (defined to the base e), and the redundancy is given in nats per source symbol. Now, let us assume that the Bernoulli parameters of the n sources are i.i.d. random variables, X_1, \ldots, X_n, all having the same density as that of some generic random variable X, whose support is the interval [0,1]. We wish to evaluate the expected value of the above-defined redundancy, under the assumption that the realizations of X_1, \ldots, X_n are known. We are then facing the need to evaluate

\bar{R}_n = E\left\{ h_b\!\left( \frac{1}{n} \sum_{i=1}^n X_i \right) \right\} - E\{h_b(X)\}. \qquad (61)

We now express the first and second terms on the right-hand side of (61) as a function of the MGF of X.

In view of (5), the binary entropy function hb admits the integral representation

h_b(x) = \int_0^\infty \frac{1}{u} \left[ x e^{-ux} + (1-x) e^{-u(1-x)} - e^{-u} \right] du, \quad x \in [0,1], \qquad (62)

which implies that

E\{h_b(X)\} = \int_0^\infty \frac{1}{u} \left[ E\{X e^{-uX}\} + E\{(1-X) e^{-u(1-X)}\} - e^{-u} \right] du. \qquad (63)

The expectations on the right-hand side of (63) can be expressed as functionals of the MGF of X, M_X(\nu) := E\{e^{\nu X}\}, and its derivative M_X'. For all u \in \mathbb{R},

E\{X e^{-uX}\} = M_X'(-u), \qquad (64)

and

E\{(1-X) e^{-u(1-X)}\} = M_{1-X}'(-u) = \frac{d}{ds} \left[ e^{s} M_X(-s) \right] \Big|_{s=-u} = e^{-u} \left[ M_X(u) - M_X'(u) \right]. \qquad (65)

On substituting (64) and (65) into (63), we readily obtain

E\{h_b(X)\} = \int_0^\infty \frac{1}{u} \left\{ M_X'(-u) + e^{-u} \left[ M_X(u) - M_X'(u) - 1 \right] \right\} du. \qquad (66)

Define Y_n := \frac{1}{n} \sum_{i=1}^n X_i. Then,

M_{Y_n}(u) = \left[ M_X\!\left( \frac{u}{n} \right) \right]^n, \quad u \in \mathbb{R}, \qquad (67)

which yields, in view of (66), (67), and the change of integration variable t = u/n, the following:

E\left\{ h_b\!\left( \frac{1}{n} \sum_{i=1}^n X_i \right) \right\} = E\{h_b(Y_n)\} = \int_0^\infty \frac{1}{u} \left\{ M_{Y_n}'(-u) + e^{-u} \left[ M_{Y_n}(u) - M_{Y_n}'(u) - 1 \right] \right\} du = \int_0^\infty \frac{1}{t} \left\{ M_X^{n-1}(-t)\, M_X'(-t) + e^{-nt} \left[ M_X^n(t) - M_X^{n-1}(t)\, M_X'(t) - 1 \right] \right\} dt. \qquad (68)

As in Section 3.1, here too, we pass from an n-dimensional integral to a one-dimensional integral. In general, similar calculations can be carried out for higher integer moments, thus passing from n-dimensional integration for a moment of order s to an s-dimensional integral, independently of n.

For example, if X_1, \ldots, X_n are i.i.d. and uniformly distributed on [0,1], then the MGF of a generic random variable X distributed like all the \{X_i\} is given by

M_X(t) = \begin{cases} \dfrac{e^t - 1}{t}, & t \ne 0, \\ 1, & t = 0. \end{cases} \qquad (69)

From (68), it can be verified numerically that E\{h_b(\frac{1}{n}\sum_{i=1}^n X_i)\} is monotonically increasing in n, being equal (in nats) to 1/2, 0.602, 0.634, 0.650, 0.659 for n = 1, \ldots, 5, respectively, with the limit h_b(1/2) = \ln 2 \approx 0.693 as n \to \infty (as expected by the law of large numbers).
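These values can be reproduced with a short numerical sketch (assuming Python/SciPy), which implements (68) with the uniform MGF (69); the integrand below is algebraically rearranged so that the factors e^{-nt} are absorbed into the MGF terms, which avoids numerical overflow at large t:

    import numpy as np
    from scipy.integrate import quad

    def E_hb(n):  # E{h_b((X_1+...+X_n)/n)} via (68), for Uniform[0,1] parameters, in nats
        def f(t):
            a = (1 - np.exp(-t)) / t                # = M(-t) = e^{-t} M(t), with M as in (69)
            b = (1 - (t + 1)*np.exp(-t)) / t**2     # = M'(-t)
            c = ((t - 1) + np.exp(-t)) / t**2       # = e^{-t} M'(t)
            return (a**(n - 1)*b + a**n - a**(n - 1)*c - np.exp(-n*t)) / t
        return quad(f, 0, np.inf)[0]

    print([round(E_hb(n), 3) for n in range(1, 6)])   # ~[0.5, 0.602, 0.634, 0.65, 0.659]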

3.4. Moments of the Empirical Entropy and the Redundancy of K–T Universal Source Coding

Consider a stationary, discrete memoryless source (DMS), P, with a finite alphabet \mathcal{X} of size |\mathcal{X}| and letter probabilities \{P(x), x \in \mathcal{X}\}. Let (X_1, \ldots, X_n) be an n-vector emitted from P, and let \{\hat{P}(x), x \in \mathcal{X}\} be the empirical distribution associated with (X_1, \ldots, X_n), that is, \hat{P}(x) = n(x)/n for all x \in \mathcal{X}, where n(x) is the number of occurrences of the letter x in (X_1, \ldots, X_n).

It is well known that in many universal lossless source codes for the class of memoryless sources, the dominant term of the length function for encoding (X_1, \ldots, X_n) is n\hat{H}, where \hat{H} is the empirical entropy,

\hat{H} = -\sum_{x \in \mathcal{X}} \hat{P}(x) \ln \hat{P}(x). \qquad (70)

For code length performance analysis (as well as for entropy estimation per se), there is therefore interest in calculating the expected value E\{\hat{H}\} as well as \mathrm{Var}\{\hat{H}\}. Another motivation comes from the quest for estimating the entropy as an objective in its own right, in which case the expectation and the variance suffice for the calculation of the mean square error of the estimate \hat{H}. Most of the results available in the literature in this context concern the asymptotic behavior for large n, as well as bounds (see, e.g., [16,17,18,19,20,21,22,23,24,25,26,27,28,29,30], and many other related references therein). The integral representation of the logarithm in (5), on the other hand, allows exact calculations of the expectation and the variance. The expected value of the empirical entropy is given by

E\{\hat{H}\} = -\sum_{x} E\{\hat{P}(x) \ln \hat{P}(x)\} = \sum_{x} E\left\{ \int_0^\infty \frac{du}{u} \left[ \hat{P}(x) e^{-u\hat{P}(x)} - \hat{P}(x) e^{-u} \right] \right\} = \int_0^\infty \frac{du}{u} \left[ \sum_{x} E\{\hat{P}(x) e^{-u\hat{P}(x)}\} - e^{-u} \right]. \qquad (71)

For convenience, let us define the function \phi_n: \mathcal{X} \times \mathbb{R} \to (0, \infty) as

\phi_n(x,t) := E\{e^{t\hat{P}(x)}\} = \left[ 1 - P(x) + P(x) e^{t/n} \right]^n, \qquad (72)

which yields,

E\{\hat{P}(x) e^{-u\hat{P}(x)}\} = \phi_n'(x,-u), \qquad (73)
E\{\hat{P}^2(x) e^{-u\hat{P}(x)}\} = \phi_n''(x,-u), \qquad (74)

where \phi_n' and \phi_n'' are the first- and second-order derivatives of \phi_n w.r.t. t, respectively. From (71) and (73),

E\{\hat{H}\} = \int_0^\infty \frac{du}{u} \left[ \sum_{x} \phi_n'(x,-u) - e^{-u} \right] = \int_0^\infty \frac{du}{u} \left[ e^{-u} \sum_{x} P(x) \left( 1 - P(x)\left( 1 - e^{-u} \right) \right)^{n-1} - e^{-nu} \right], \qquad (75)

where the integration variable in (75) was changed using a simple scaling by n.

Before proceeding with the calculation of the variance of \hat{H}, let us first compare the integral representation in (75) to the alternative sum obtained by a direct calculation of the expected value of the empirical entropy. A straightforward calculation gives

E\{\hat{H}\} = -\sum_{x} \sum_{k=0}^n \binom{n}{k} P^k(x) \left[ 1 - P(x) \right]^{n-k} \cdot \frac{k}{n} \cdot \ln \frac{k}{n} \qquad (76)
= \sum_{x} \sum_{k=1}^n \binom{n-1}{k-1} P^k(x) \left[ 1 - P(x) \right]^{n-k} \cdot \ln \frac{n}{k}. \qquad (77)

We next compare the computational complexity of implementing (75) to that of (77). For large n, in order to avoid numerical problems in computing (77) by standard software, one may use the Gammaln function in Matlab/Excel or the LogGamma in Mathematica (a built-in function for calculating the natural logarithm of the Gamma function) to obtain that

\binom{n-1}{k-1} P^k(x) \left[ 1 - P(x) \right]^{n-k} = \exp\Big\{ \mathrm{Gammaln}(n) - \mathrm{Gammaln}(k) - \mathrm{Gammaln}(n-k+1) + k \ln P(x) + (n-k) \ln\left( 1 - P(x) \right) \Big\}. \qquad (78)

The right-hand side of (75) is the sum of |\mathcal{X}| integrals, and the computational complexity of each integral depends neither on n nor on |\mathcal{X}|. Hence, the computational complexity of the right-hand side of (75) scales linearly with |\mathcal{X}|. On the other hand, the double sum on the right-hand side of (77) consists of n \cdot |\mathcal{X}| terms. Let \alpha := n/|\mathcal{X}| be fixed, which is expected to be large (\alpha \gg 1) if a good estimate of the entropy is sought. The computational complexity of the double sum on the right-hand side of (77) then grows like \alpha |\mathcal{X}|^2, i.e., quadratically in |\mathcal{X}|. Hence, for a DMS with a large alphabet, or when n \gg |\mathcal{X}|, there is a significant computational reduction in evaluating (75) in comparison to the right-hand side of (77).
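A small numerical sketch (assuming Python/SciPy, with an arbitrary illustrative source) that evaluates E\{\hat{H}\} both via the single integral (75) and via the double sum (77), with the terms of the latter computed through (78):

    import numpy as np
    from scipy.integrate import quad
    from scipy.special import gammaln

    P = np.array([0.5, 0.3, 0.2])   # an illustrative DMS; any probability vector works
    n = 100

    # Integral form (75): one one-dimensional integral covering the whole alphabet.
    def EH_integral():
        f = lambda u: (np.exp(-u)*np.sum(P*(1 - P*(1 - np.exp(-u)))**(n - 1))
                       - np.exp(-n*u)) / u
        return quad(f, 0, np.inf)[0]

    # Direct double sum (77), with the terms written as in (78) to avoid overflow.
    def EH_sum():
        k = np.arange(1, n + 1)
        total = 0.0
        for p in P:
            logterm = (gammaln(n) - gammaln(k) - gammaln(n - k + 1)
                       + k*np.log(p) + (n - k)*np.log(1 - p))
            total += np.sum(np.exp(logterm)*np.log(n/k))
        return total

    print(EH_integral(), EH_sum())   # both approximate E{H_hat} in nats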

We next move on to calculate the variance of \hat{H}.

\mathrm{Var}\{\hat{H}\} = E\{\hat{H}^2\} - E^2\{\hat{H}\} \qquad (79)
= \sum_{x, x'} E\{\hat{P}(x) \ln \hat{P}(x) \cdot \hat{P}(x') \ln \hat{P}(x')\} - E^2\{\hat{H}\}. \qquad (80)

The second term on the right-hand side of (80) has already been calculated. For the first term, let us define, for x \ne x',

\psi_n(x, x', s, t) := E\left\{ \exp\left\{ s\hat{P}(x) + t\hat{P}(x') \right\} \right\} \qquad (81)
= \sum_{\{(k,\ell):\ k+\ell \le n\}} \frac{n!}{k!\, \ell!\, (n-k-\ell)!} \cdot P^k(x)\, P^\ell(x') \left[ 1 - P(x) - P(x') \right]^{n-k-\ell} e^{sk/n + t\ell/n} \qquad (82)
= \sum_{\{(k,\ell):\ k+\ell \le n\}} \frac{n!}{k!\, \ell!\, (n-k-\ell)!} \left[ P(x) e^{s/n} \right]^k \left[ P(x') e^{t/n} \right]^\ell \left[ 1 - P(x) - P(x') \right]^{n-k-\ell} \qquad (83)
= \left[ 1 - P(x)\left( 1 - e^{s/n} \right) - P(x')\left( 1 - e^{t/n} \right) \right]^n. \qquad (84)

Observe that

E\left\{ \hat{P}(x) \hat{P}(x') \exp\left\{ -u\hat{P}(x) - v\hat{P}(x') \right\} \right\} = \frac{\partial^2 \psi_n(x, x', s, t)}{\partial s\, \partial t} \bigg|_{s=-u,\ t=-v} \qquad (85)
=: \ddot{\psi}_n(x, x', -u, -v). \qquad (86)

For x \ne x', we have

E\{\hat{P}(x) \ln \hat{P}(x) \cdot \hat{P}(x') \ln \hat{P}(x')\}
= E\left\{ \hat{P}(x) \hat{P}(x') \int_0^\infty \int_0^\infty \frac{du\, dv}{uv} \cdot \left( e^{-u} - e^{-u\hat{P}(x)} \right)\left( e^{-v} - e^{-v\hat{P}(x')} \right) \right\}
= \int_0^\infty \int_0^\infty \frac{du\, dv}{uv} \Big[ e^{-u-v}\, E\{\hat{P}(x)\hat{P}(x')\} - e^{-v}\, E\{\hat{P}(x)\hat{P}(x') e^{-u\hat{P}(x)}\} \qquad (87)
\quad - e^{-u}\, E\{\hat{P}(x)\hat{P}(x') e^{-v\hat{P}(x')}\} + E\{\hat{P}(x)\hat{P}(x') e^{-u\hat{P}(x) - v\hat{P}(x')}\} \Big]
= \int_0^\infty \int_0^\infty \frac{du\, dv}{uv} \Big[ e^{-u-v}\, \ddot{\psi}_n(x,x',0,0) - e^{-v}\, \ddot{\psi}_n(x,x',-u,0) \qquad (88)
\quad - e^{-u}\, \ddot{\psi}_n(x,x',0,-v) + \ddot{\psi}_n(x,x',-u,-v) \Big], \qquad (89)

and for x = x',

E\left\{ \left[ \hat{P}(x) \ln \hat{P}(x) \right]^2 \right\} = E\left\{ \hat{P}^2(x) \int_0^\infty \int_0^\infty \frac{du\, dv}{uv} \cdot \left( e^{-u} - e^{-u\hat{P}(x)} \right)\left( e^{-v} - e^{-v\hat{P}(x)} \right) \right\}
= \int_0^\infty \int_0^\infty \frac{du\, dv}{uv} \Big[ e^{-u-v}\, E\{\hat{P}^2(x)\} - e^{-v}\, E\{\hat{P}^2(x) e^{-u\hat{P}(x)}\} \qquad (90)
\quad - e^{-u}\, E\{\hat{P}^2(x) e^{-v\hat{P}(x)}\} + E\{\hat{P}^2(x) e^{-(u+v)\hat{P}(x)}\} \Big]
= \int_0^\infty \int_0^\infty \frac{du\, dv}{uv} \Big[ e^{-u-v}\, \phi_n''(x,0) - e^{-v}\, \phi_n''(x,-u) \qquad (91)
\quad - e^{-u}\, \phi_n''(x,-v) + \phi_n''(x,-u-v) \Big]. \qquad (92)

Therefore,

\mathrm{Var}\{\hat{H}\} = \sum_{x} \int_0^\infty \int_0^\infty \frac{du\, dv}{uv} \Big[ e^{-u-v}\, \phi_n''(x,0) - e^{-v}\, \phi_n''(x,-u) - e^{-u}\, \phi_n''(x,-v) + \phi_n''(x,-u-v) \Big] + \sum_{x \ne x'} \int_0^\infty \int_0^\infty \frac{du\, dv}{uv} \Big[ e^{-u-v}\, \ddot{\psi}_n(x,x',0,0) - e^{-v}\, \ddot{\psi}_n(x,x',-u,0) - e^{-u}\, \ddot{\psi}_n(x,x',0,-v) + \ddot{\psi}_n(x,x',-u,-v) \Big] - E^2\{\hat{H}\}. \qquad (93)

Defining (see (74) and (86))

Z(r, s, t) := \sum_{x} \phi_n''(x, r) + \sum_{x \ne x'} \ddot{\psi}_n(x, x', s, t), \qquad (94)

we have

\mathrm{Var}\{\hat{H}\} = \int_0^\infty \int_0^\infty \frac{du\, dv}{uv} \Big[ e^{-u-v}\, Z(0,0,0) - e^{-v}\, Z(-u,-u,0) - e^{-u}\, Z(-v,0,-v) + Z(-u-v,-u,-v) \Big] - E^2\{\hat{H}\}. \qquad (95)

To obtain numerical results, it is convenient to now particularize the analysis to the binary symmetric source (BSS). From (75),

E\{\hat{H}\} = \int_0^\infty \frac{du}{u} \left[ e^{-u} \left( \frac{1 + e^{-u}}{2} \right)^{\!n-1} - e^{-un} \right]. \qquad (96)

For the variance, it follows from (84) that, for x \ne x' with x, x' \in \{0,1\} and s, t \in \mathbb{R},

\psi_n(x, x', s, t) = \left( \frac{e^{s/n} + e^{t/n}}{2} \right)^{\!n}, \qquad (97)
\ddot{\psi}_n(x, x', s, t) = \frac{\partial^2 \psi_n(x, x', s, t)}{\partial s\, \partial t} = \frac{1}{4} \left( 1 - \frac{1}{n} \right) \left( \frac{e^{s/n} + e^{t/n}}{2} \right)^{\!n-2} e^{(s+t)/n}, \qquad (98)

and, from (87)–(89), for x \ne x',

E\{\hat{P}(x) \ln \hat{P}(x) \cdot \hat{P}(x') \ln \hat{P}(x')\} = \frac{1}{4} \left( 1 - \frac{1}{n} \right) \int_0^\infty \int_0^\infty \frac{du\, dv}{uv} \left[ e^{-u-v} - e^{-(u/n + v)} \left( \frac{1 + e^{-u/n}}{2} \right)^{\!n-2} - e^{-(u + v/n)} \left( \frac{1 + e^{-v/n}}{2} \right)^{\!n-2} + e^{-(u+v)/n} \left( \frac{e^{-u/n} + e^{-v/n}}{2} \right)^{\!n-2} \right]. \qquad (99)

From (72), for x \in \{0,1\} and t \in \mathbb{R},

\phi_n(x,t) = \left( \frac{1 + e^{t/n}}{2} \right)^{\!n}, \qquad (100)
\phi_n''(x,t) = \frac{\partial^2 \phi_n(x,t)}{\partial t^2} = \frac{e^{t/n}}{4n} \left( \frac{1 + e^{t/n}}{2} \right)^{\!n-2} \left( 1 + n e^{t/n} \right), \qquad (101)

and, from (90)–(92), for x \in \{0,1\},

E\left\{ \left[ \hat{P}(x) \ln \hat{P}(x) \right]^2 \right\} = \frac{1}{4n} \int_0^\infty \int_0^\infty \frac{du\, dv}{uv} \left[ (n+1) e^{-u-v} - e^{-(u/n + v)} \left( \frac{1 + e^{-u/n}}{2} \right)^{\!n-2} \left( 1 + n e^{-u/n} \right) - e^{-(u + v/n)} \left( \frac{1 + e^{-v/n}}{2} \right)^{\!n-2} \left( 1 + n e^{-v/n} \right) + e^{-(u+v)/n} \left( \frac{1 + e^{-(u+v)/n}}{2} \right)^{\!n-2} \left( 1 + n e^{-(u+v)/n} \right) \right]. \qquad (102)

Combining Equations (93), (99), and (102) gives the following closed-form expression for the variance of the empirical entropy:

\mathrm{Var}\{\hat{H}\} = \frac{1}{2} \left( 1 + \frac{1}{n} \right) \int_0^\infty \int_0^\infty \frac{du\, dv}{uv} \left[ e^{-(u+v)} - e^{-v} f_n\!\left( \frac{u}{n} \right) - e^{-u} f_n\!\left( \frac{v}{n} \right) + f_n\!\left( \frac{u+v}{n} \right) \right] + \frac{1}{2} \left( 1 - \frac{1}{n} \right) \int_0^\infty \int_0^\infty \frac{du\, dv}{uv} \left[ e^{-(u+v)} - e^{-v} g_n\!\left( \frac{u}{n}, 0 \right) - e^{-u} g_n\!\left( 0, \frac{v}{n} \right) + g_n\!\left( \frac{u}{n}, \frac{v}{n} \right) \right] - \left\{ \int_0^\infty \frac{du}{u} \left[ e^{-u} \left( \frac{1 + e^{-u}}{2} \right)^{\!n-1} - e^{-un} \right] \right\}^2, \qquad (103)

where

f_n(s) := \frac{e^{-s} \left( \frac{1 + e^{-s}}{2} \right)^{n-2} \left( 1 + n e^{-s} \right)}{n+1}, \qquad (104)
g_n(s,t) := e^{-s-t} \left( \frac{e^{-s} + e^{-t}}{2} \right)^{\!n-2}. \qquad (105)

For the BSS, \ln 2 - E\{\hat{H}\} = E\{D(\hat{P} \| P)\} and the standard deviation of \hat{H} both decay at the rate of 1/n as n grows without bound, according to Figure 4. This asymptotic behavior of E\{D(\hat{P} \| P)\} is supported by the well-known result [31] (see also [18] (Section 3.C) and references therein) that, for the class of discrete memoryless sources \{P\} with a given finite alphabet \mathcal{X},

\ln \frac{\hat{P}(X_1, \ldots, X_n)}{P(X_1, \ldots, X_n)} \to \frac{1}{2} \chi_d^2, \qquad (106)

in law, where \chi_d^2 is a chi-squared random variable with d := |\mathcal{X}| - 1 degrees of freedom. The left-hand side of (106) can be rewritten as

\ln \frac{\exp\{-n\hat{H}\}}{\exp\{-n\hat{H} - nD(\hat{P} \| P)\}} = n D(\hat{P} \| P), \qquad (107)

and so, E\{D(\hat{P} \| P)\} decays like d/(2n), which is equal to 1/(2n) for the BSS. In Figure 4, the base of the logarithm is 2, and therefore E\{D(\hat{P} \| P)\} = 1 - E\{\hat{H}\} decays like \frac{\log_2 e}{2n} \approx \frac{0.7213}{n}. It can be verified numerically that 1 - E\{\hat{H}\} (in bits) is equal to 7.25 \cdot 10^{-3} and 7.217 \cdot 10^{-4} for n = 100 and n = 1000, respectively (see Figure 4), which confirms (106) and (107). Furthermore, the exact result here for the standard deviation, which decays like 1/n, scales similarly to the concentration inequality in [32] (Equation (9)).

Figure 4. 1 - E\{\hat{H}\} and the standard deviation of \hat{H} for a BSS (in bits per source symbol), as a function of n.

We conclude this subsection by exploring a quantity related to the empirical entropy, namely, the expected code length associated with the universal lossless source code due to Krichevsky and Trofimov [23]. In a nutshell, this is a predictive universal code which, at each time instant t, sequentially assigns probabilities to the next symbol according to (a biased version of) the empirical distribution pertaining to the data seen thus far, x_1, \ldots, x_t. Specifically, consider the code length function (in nats),

L(x^n) = -\sum_{t=0}^{n-1} \ln Q(x_{t+1} | x^t), \qquad (108)

where

Q(x_{t+1} = x | x_1, \ldots, x_t) = \frac{N_t(x) + s}{t + s|\mathcal{X}|}, \qquad (109)

N_t(x) is the number of occurrences of the symbol x \in \mathcal{X} in (x_1, \ldots, x_t), and s > 0 is a fixed bias parameter needed for the initial coding distribution (t = 0).

We now calculate the redundancy of this universal code,

R_n = \frac{E\{L(X^n)\}}{n} - H, \qquad (110)

where H is the entropy of the underlying source. From Equations (108), (109), and (110), we can represent Rn as follows,

R_n = \frac{1}{n} \sum_{t=0}^{n-1} E\left\{ \ln \frac{(t + s|\mathcal{X}|)\, P(X_{t+1})}{N_t(X_{t+1}) + s} \right\}. \qquad (111)

The expectation on the right-hand side of (111) satisfies

E\left\{ \ln \frac{(t + s|\mathcal{X}|)\, P(X_{t+1})}{N_t(X_{t+1}) + s} \right\} = \sum_{x} P(x)\, E\left\{ \ln \frac{(t + s|\mathcal{X}|)\, P(x)}{N_t(x) + s} \right\} = \int_0^\infty \left( e^{-us} \sum_{x} P(x)\, E\{e^{-uN_t(x)}\} - \sum_{x} P(x)\, e^{-u(s|\mathcal{X}| + t) P(x)} \right) \frac{du}{u} = \int_0^\infty \left( e^{-us} \sum_{x} P(x) \left[ 1 - P(x)\left( 1 - e^{-u} \right) \right]^t - \sum_{x} P(x)\, e^{-u(s|\mathcal{X}| + t) P(x)} \right) \frac{du}{u}, \qquad (112)

and, combining (111) and (112), the redundancy is given by

R_n = \frac{1}{n} \sum_{t=0}^{n-1} E\left\{ \ln \frac{(t + s|\mathcal{X}|)\, P(X_{t+1})}{N_t(X_{t+1}) + s} \right\}
= \frac{1}{n} \int_0^\infty \left( e^{-us} \sum_{x} P(x) \sum_{t=0}^{n-1} \left[ 1 - P(x)\left( 1 - e^{-u} \right) \right]^t - \sum_{x} P(x)\, e^{-us|\mathcal{X}| P(x)} \sum_{t=0}^{n-1} e^{-u P(x) t} \right) \frac{du}{u}
= \frac{1}{n} \int_0^\infty \left( \frac{e^{-us}}{1 - e^{-u}} \sum_{x} \left( 1 - \left[ 1 - P(x)\left( 1 - e^{-u} \right) \right]^n \right) - \sum_{x} \frac{P(x)\, e^{-us|\mathcal{X}| P(x)} \left( 1 - e^{-unP(x)} \right)}{1 - e^{-uP(x)}} \right) \frac{du}{u}
= \frac{1}{n} \int_0^\infty \left( \frac{e^{-us} \left( |\mathcal{X}| - \sum_{x} \left[ 1 - P(x)\left( 1 - e^{-u} \right) \right]^n \right)}{1 - e^{-u}} - \sum_{x} \frac{P(x)\, e^{-us|\mathcal{X}| P(x)} \left( 1 - e^{-unP(x)} \right)}{1 - e^{-uP(x)}} \right) \frac{du}{u}. \qquad (113)

Figure 5 displays nR_n as a function of \ln n for s = 1/2, in the range 1 \le n \le 5000. As can be seen, the graph is nearly a straight line with slope 1/2, which is in agreement with the theoretical result that R_n \approx \frac{\ln n}{2n} (in nats per symbol) for large n (see [23] (Theorem 2)).
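The curve in Figure 5 can be reproduced with a short numerical sketch (assuming Python/SciPy), which evaluates (113) for a general source P and bias s, specialized here to the BSS with s = 1/2:

    import numpy as np
    from scipy.integrate import quad

    P = np.array([0.5, 0.5])    # binary symmetric source
    s = 0.5                     # Krichevsky-Trofimov bias parameter
    A = len(P)                  # alphabet size |X|

    def R(n):   # redundancy R_n of (113), in nats per source symbol
        def f(u):
            e = np.exp(-u)
            term1 = np.exp(-u*s)*(A - np.sum((1 - P*(1 - e))**n))/(1 - e)
            term2 = np.sum(P*np.exp(-u*s*A*P)*(1 - np.exp(-u*n*P))/(1 - np.exp(-u*P)))
            return (term1 - term2)/u
        return quad(f, 0, np.inf)[0]/n

    for n in (10, 100, 1000, 5000):
        print(n, n*R(n), 0.5*np.log(n))   # n*R_n grows roughly like (ln n)/2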

Figure 5. The function nR_n vs. \ln n for the BSS and s = 1/2, in the range 2 \le n \le 5000.

4. Summary and Outlook

In this work, we have explored a well-known integral representation of the logarithmic function, and demonstrated its applications in obtaining exact formulas for quantities that involve expectations and second order moments of the logarithm of a positive random variable (or the logarithm of a sum of i.i.d. such random variables). We anticipate that this integral representation and its variants can serve as useful tools in many additional applications, representing a rigorous alternative to the replica method in some situations.

Our work in this paper has focused on exact results. In future research, it would be interesting to explore whether the integral representation used here is useful also for obtaining upper and lower bounds on expectations (and higher-order moments) of expressions that involve logarithms of positive random variables. In particular, could the integrand of (1) be bounded from below and/or above in a nontrivial manner that would lead to new, interesting bounds? Moreover, it would be even more useful if the corresponding bounds on the integrand lent themselves to closed-form expressions for the resulting definite integrals.

Another route for further research relies on [12] (p. 363, Identity (3.434.1)), which states that

\int_0^\infty \frac{e^{-\nu u} - e^{-\mu u}}{u^{\rho+1}}\, du = \frac{\mu^\rho - \nu^\rho}{\rho} \cdot \Gamma(1 - \rho), \quad \mathrm{Re}(\mu) > 0,\ \mathrm{Re}(\nu) > 0,\ \mathrm{Re}(\rho) < 1. \qquad (114)

Let \nu := 1 and \mu := \sum_{i=1}^n X_i, where \{X_i\}_{i=1}^n are positive i.i.d. random variables. Taking expectations of both sides of (114) and rearranging terms gives

E\left\{ \left( \sum_{i=1}^n X_i \right)^{\!\rho} \right\} = 1 + \frac{\rho}{\Gamma(1 - \rho)} \int_0^\infty \frac{e^{-u} - M_X^n(-u)}{u^{\rho+1}}\, du, \quad \rho \in (0,1), \qquad (115)

where X is a random variable having the same density as the X_i's, and M_X(u) := E\{e^{uX}\} (for u \in \mathbb{R}) denotes the MGF of X. Since

\ln x = \lim_{\rho \to 0} \frac{x^\rho - 1}{\rho}, \quad x > 0, \qquad (116)

it follows that (115) generalizes (3) for the logarithmic expectation. Identity (115), for the \rho-th moment of a sum of i.i.d. positive random variables with \rho \in (0,1), may be used in some information-theoretic contexts rather than invoking Jensen's inequality.
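As a simple numerical illustration of (115) (a sketch assuming Python/SciPy), take X_i ~ Exp(1), for which M_X(-u) = 1/(1+u) and the sum X_1 + \cdots + X_n is Gamma(n,1) distributed, so the exact \rho-th moment is \Gamma(n+\rho)/\Gamma(n):

    import numpy as np
    from scipy.integrate import quad
    from scipy.special import gamma

    n, rho = 3, 0.5
    # Right-hand side of (115) with M_X(-u) = 1/(1+u) for Exp(1) variables:
    integral, _ = quad(lambda u: (np.exp(-u) - (1 + u)**(-n)) / u**(rho + 1), 0, np.inf)
    moment = 1 + rho/gamma(1 - rho)*integral

    print(moment, gamma(n + rho)/gamma(n))   # exact: E{(X_1+...+X_n)^rho} = Gamma(n+rho)/Gamma(n)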

Acknowledgments

The authors are thankful to Cihan Tepedelenlioǧlu and Zbigniew Golebiewski for bringing references [4,11], respectively, to their attention.

Author Contributions

Investigation, N.M. and I.S.; Writing-original draft, N.M. and I.S.; Writing-review & editing, N.M. and I.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  • 1.Mézard M., Montanari A. Information, Physics, and Computation. Oxford University Press; New York, NY, USA: 2009. [Google Scholar]
  • 2.Esipov S.E., Newman T.J. Interface growth and Burgers turbulence: The problem of random initial conditions. Phys. Rev. E. 1993;48:1046–1050. doi: 10.1103/PhysRevE.48.1046. [DOI] [PubMed] [Google Scholar]
  • 3.Song J., Still S., Rojas R.D.H., Castillo I.P., Marsili M. Optimal work extraction and mutual information in a generalized Szilárd engine. arXiv 2019, arXiv:1910.04191. doi: 10.1103/PhysRevE.103.052121. [DOI] [PubMed] [Google Scholar]
  • 4.Rajan A., Tepedelenlioǧlu C. Stochastic ordering of fading channels through the Shannon transform. IEEE Trans. Inform. Theory. 2015;61:1619–1628. doi: 10.1109/TIT.2015.2400432. [DOI] [Google Scholar]
  • 5.Simon M.K. A new twist on the Marcum Q-function and its application. IEEE Commun. Lett. 1998;2:39–41. doi: 10.1109/4234.660797. [DOI] [Google Scholar]
  • 6.Simon M.K., Divsalar D. Some new twists to problems involving the Gaussian probability integral. IEEE Trans. Commun. 1998;46:200–210. doi: 10.1109/26.659479. [DOI] [Google Scholar]
  • 7.Craig J.W. A new, simple and exact result for calculating the probability of error for two-dimensional signal constellations. MILCOM 91 — Conference Record; IEEE: Piscataway, NJ, USA, 1991; pp. 25.5.1–25.5.5. [Google Scholar]
  • 8.Appledorn C.R. The entropy of a Poisson distribution. SIAM Rev. 1988;30:314–317. doi: 10.1137/1029046. [DOI] [Google Scholar]
  • 9.Erdélyi A., Magnus W., Oberhettinger F., Tricomi F.G., Bateman H. Higher Transcendental Functions. Volume 1 McGraw-Hill; New York, NY, USA: 1987. [Google Scholar]
  • 10.Martinez A. Spectral efficiency of optical direct detection. JOSA B. 2007;24:739–749. doi: 10.1364/JOSAB.24.000739. [DOI] [Google Scholar]
  • 11.Knessl C. Integral representations and asymptotic expansions for Shannon and Rényi entropies. Appl. Math. Lett. 1998;11:69–74. doi: 10.1016/S0893-9659(98)00013-5. [DOI] [Google Scholar]
  • 12.Ryzhik I.M., Gradshteĭn I.S. Tables of Integrals, Series, and Products. Academic Press; New York, NY, USA: 1965. [Google Scholar]
  • 13.Marsiglietti A., Kostina V. A lower bound on the differential entropy of log-concave random vectors with applications. Entropy. 2018;20:185. doi: 10.3390/e20030185. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Dong A., Zhang H., Wu D., Yuan D. 2015 IEEE 82nd Vehicular Technology Conference (VTC2015-Fall) IEEE; Piscataway, NJ, USA: 2015. Logarithmic expectation of the sum of exponential random variables for wireless communication performance evaluation. [Google Scholar]
  • 15.Tse D., Viswanath P. Fundamentals of Wireless Communication. Cambridge University Press; Cambridge, UK: 2005. [Google Scholar]
  • 16.Barron A., Rissanen J., Yu B. The minimum description length principle in coding and modeling. IEEE Trans. Inf. Theory. 1998;44:2743–2760. doi: 10.1109/18.720554. [DOI] [Google Scholar]
  • 17.Blumer A.C. Minimax universal noiseless coding for unifilar and Markov sources. IEEE Trans. Inf. Theory. 1987;33:925–930. doi: 10.1109/TIT.1987.1057366. [DOI] [Google Scholar]
  • 18.Clarke B.S., Barron A.R. Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inf. Theory. 1990;36:453–471. doi: 10.1109/18.54897. [DOI] [Google Scholar]
  • 19.Clarke B.S., Barron A.R. Jeffreys’ prior is asymptotically least favorable under entropy risk. J. Stat. Plan. Infer. 1994;41:37–60. doi: 10.1016/0378-3758(94)90153-8. [DOI] [Google Scholar]
  • 20.Davisson L.D. Universal noiseless coding. IEEE Trans. Inf. Theory. 1973;29:783–795. doi: 10.1109/TIT.1973.1055092. [DOI] [Google Scholar]
  • 21.Davisson L.D. Minimax noiseless universal coding for Markov sources. IEEE Trans. Inf. Theory. 1983;29:211–215. doi: 10.1109/TIT.1983.1056652. [DOI] [Google Scholar]
  • 22.Davisson L.D., McEliece R.J., Pursley M.B., Wallace M.S. Efficient universal noiseless source codes. IEEE Trans. Inf. Theory. 1981;27:269–278. doi: 10.1109/TIT.1981.1056355. [DOI] [Google Scholar]
  • 23.Krichevsky R.E., Trofimov V.K. The performance of universal encoding. IEEE Trans. Inf. Theory. 1981;27:199–207. doi: 10.1109/TIT.1981.1056331. [DOI] [Google Scholar]
  • 24.Merhav N., Feder M. Universal prediction. IEEE Trans. Inf. Theory. 1998;44:2124–2147. doi: 10.1109/18.720534. [DOI] [Google Scholar]
  • 25.Rissanen J. A universal data compression system. IEEE Trans. Inf. Theory. 1983;29:656–664. doi: 10.1109/TIT.1983.1056741. [DOI] [Google Scholar]
  • 26.Rissanen J. Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Theory. 1984;30:629–636. doi: 10.1109/TIT.1984.1056936. [DOI] [Google Scholar]
  • 27.Rissanen J. Fisher information and stochastic complexity. IEEE Trans. Inf. Theory. 1996;42:40–47. doi: 10.1109/18.481776. [DOI] [Google Scholar]
  • 28.Shtarkov Y.M. Universal sequential coding of single messages. IPPI. 1987;23:175–186. [Google Scholar]
  • 29.Weinberger M.J., Rissanen J., Feder M. A universal finite memory source. IEEE Trans. Inf. Theory. 1995;41:643–652. doi: 10.1109/18.382011. [DOI] [Google Scholar]
  • 30.Xie Q., Barron A.R. Asymptotic minimax regret for data compression, gambling and prediction. IEEE Trans. Inf. Theory. 1997;46:431–445. [Google Scholar]
  • 31.Wald A. Tests of statistical hypotheses concerning several parameters when the number of observations is large. Trans. Am. Math. Soc. 1943;54:426–482. doi: 10.1090/S0002-9947-1943-0012401-3. [DOI] [Google Scholar]
  • 32.Mardia J., Jiao J., Tánczos E., Nowak R.D., Weissman T. Concentration inequalities for the empirical distribution of discrete distributions: Beyond the method of types. Inf. Inference. 2019:1–38. doi: 10.1093/imaiai/iaz025. [DOI] [Google Scholar]
