Entropy. 2018 Nov 22;20(12):896. doi: 10.3390/e20120896

Tight Bounds on the Rényi Entropy via Majorization with Applications to Guessing and Compression

Igal Sason 1

Abstract

This paper provides tight bounds on the Rényi entropy of a function of a discrete random variable with a finite number of possible values, where the considered function is not one to one. To that end, a tight lower bound on the Rényi entropy of a discrete random variable with a finite support is derived as a function of the size of the support, and the ratio of the maximal to minimal probability masses. This work was inspired by the recently published paper by Cicalese et al., which is focused on the Shannon entropy, and it strengthens and generalizes the results of that paper to Rényi entropies of arbitrary positive orders. In view of these generalized bounds and the works by Arikan and Campbell, non-asymptotic bounds are derived for guessing moments and lossless data compression of discrete memoryless sources.

Keywords: Majorization, Rényi entropy, Rényi divergence, cumulant generating functions, guessing moments, lossless source coding, fixed-to-variable source codes, Huffman algorithm, Tunstall codes

1. Introduction

Majorization theory is a simple and productive concept in the theory of inequalities, which also unifies a variety of familiar bounds [1,2]. These mathematical tools find various applications in diverse fields (see, e.g., [3]) such as economics [2,4,5], combinatorial analysis [2,6], geometric inequalities [2], matrix theory [2,6,7,8], Shannon theory [5,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25], and wireless communications [26,27,28,29,30,31,32,33].

This work, which relies on majorization theory, has been greatly inspired by the recent insightful paper by Cicalese et al. [12] (the research work in the present paper was initiated while the author handled [12] as an associate editor). The work in [12] provides tight bounds on the Shannon entropy of a function of a discrete random variable with a finite number of possible values, where the considered function is not one to one. For that purpose, and while being of interest in its own right (see [12], Section 6), a tight lower bound on the Shannon entropy of a discrete random variable with a finite support was derived in [12] as a function of the size of the support and the ratio of the maximal to minimal probability masses. The present paper aims to extend the bounds in [12] to Rényi entropies of arbitrary positive orders (note that the Shannon entropy is equal to the Rényi entropy of order 1), and to study the information-theoretic applications of these (non-trivial) generalizations in the context of a non-asymptotic analysis of guessing moments and lossless data compression.

The motivation for this work is rooted in the diverse information-theoretic applications of Rényi measures [34]. These include (but are not limited to) asymptotically tight bounds on guessing moments [35], information-theoretic applications such as guessing subject to distortion [36], joint source-channel coding and guessing with application to sequential decoding [37], guessing with a prior access to a malicious oracle [38], guessing while allowing the guesser to give up and declare an error [39], guessing in secrecy problems [40,41], guessing with limited memory [42], and guessing under source uncertainty [43]; encoding tasks [44,45]; Bayesian hypothesis testing [9,22,23], and composite hypothesis testing [46,47]; Rényi generalizations of the rejection sampling problem in [48], motivated by the communication complexity in distributed channel simulation, where these generalizations distinguish between causal and noncausal sampler scenarios [49]; Wyner’s common information in distributed source simulation under Rényi divergence measures [50]; various other source coding theorems [23,39,51,52,53,54,55,56,57,58], channel coding theorems [23,58,59,60,61,62,63,64], including coding theorems in quantum information theory [65,66,67].

The presentation in this paper is structured as follows: Section 2 provides notation and essential preliminaries for the analysis in this paper. Section 3 and Section 4 strengthen and generalize, in a non-trivial way, the bounds on the Shannon entropy in [12] to Rényi entropies of arbitrary positive orders (see Theorems 1 and 2). Section 5 relies on the generalized bound from Section 4 and the work by Arikan [35] to derive non-asymptotic bounds for guessing moments (see Theorem 3); Section 5 also relies on the generalized bound in Section 4 and the source coding theorem by Campbell [51] (see Theorem 4) for the derivation of non-asymptotic bounds for lossless compression of discrete memoryless sources (see Theorem 5).

2. Notation and Preliminaries

Let

  • $P$ be a probability mass function defined on a finite set $\mathcal{X}$;

  • $p_{\max}$ and $p_{\min}$ be, respectively, the maximal and minimal positive masses of $P$;

  • $G_P(k)$ be the sum of the $k$ largest masses of $P$ for $k \in \{1, \ldots, |\mathcal{X}|\}$ (note that $G_P(1) = p_{\max}$ and $G_P(|\mathcal{X}|) = 1$);

  • $\mathcal{P}_n$, for an integer $n \geq 2$, be the set of all probability mass functions defined on $\mathcal{X}$ with $|\mathcal{X}| = n$; without any loss of generality, let $\mathcal{X} = \{1, \ldots, n\}$;

  • $\mathcal{P}_n(\rho)$, for $\rho \geq 1$ and an integer $n \geq 2$, be the subset of all probability mass functions $P \in \mathcal{P}_n$ such that
    $$\frac{p_{\max}}{p_{\min}} \leq \rho. \tag{1}$$

Definition 1 (Majorization).

Consider discrete probability mass functions $P$ and $Q$ defined on the same (finite or countably infinite) set $\mathcal{X}$. It is said that $P$ is majorized by $Q$ (or $Q$ majorizes $P$), and it is denoted by $P \prec Q$, if $G_P(k) \leq G_Q(k)$ for all $k \in \{1, \ldots, |\mathcal{X}|-1\}$ (recall that $G_P(|\mathcal{X}|) = G_Q(|\mathcal{X}|) = 1$). If $P$ and $Q$ are defined on finite sets of different cardinalities, then the probability mass function which is defined over the smaller set is first padded by zeros to make the cardinalities of these sets equal.

By Definition 1, a unit mass majorizes any other distribution; on the other hand, the equiprobable distribution on a finite set is majorized by any other distribution defined on the same set.
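As a quick numerical illustration of Definition 1 (not part of the original paper; the helper names below are ours), the following Python sketch evaluates the partial sums $G_P(k)$ and checks whether one probability mass function is majorized by another, zero-padding the shorter one as described above.

```python
import numpy as np

def G(p, k):
    """Sum of the k largest masses of the pmf p, i.e., G_P(k) in the notation above."""
    return np.sort(np.asarray(p, float))[::-1][:k].sum()

def majorized_by(p, q, tol=1e-12):
    """True if p is majorized by q (p ≺ q); pmfs of different lengths are zero-padded."""
    n = max(len(p), len(q))
    p = np.pad(np.asarray(p, float), (0, n - len(p)))
    q = np.pad(np.asarray(q, float), (0, n - len(q)))
    return all(G(p, k) <= G(q, k) + tol for k in range(1, n))

uniform = np.ones(4) / 4
unit_mass = [1.0]                     # a unit mass majorizes any other distribution
p = [0.4, 0.3, 0.2, 0.1]
print(majorized_by(uniform, p), majorized_by(p, unit_mass))   # True True
```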

Definition 2 (Schur-convexity/concavity).

A function $f \colon \mathcal{P}_n \to \mathbb{R}$ is said to be Schur-convex if, for every $P, Q \in \mathcal{P}_n$ such that $P \prec Q$, we have $f(P) \leq f(Q)$. Likewise, $f$ is said to be Schur-concave if $-f$ is Schur-convex, i.e., $P, Q \in \mathcal{P}_n$ and $P \prec Q$ imply that $f(P) \geq f(Q)$.

Definition 3 

(Rényi entropy [34]). Let $X$ be a random variable taking values on a finite or countably infinite set $\mathcal{X}$, and let $P_X$ be its probability mass function. The Rényi entropy of order $\alpha \in (0,1) \cup (1,\infty)$ is given by

$$H_\alpha(X) = H_\alpha(P_X) = \frac{1}{1-\alpha} \log \sum_{x \in \mathcal{X}} P_X^\alpha(x). \tag{2}$$

Unless explicitly stated, the logarithm base can be chosen by the reader, with exp indicating the inverse function of log.

By its continuous extension,

$$H_0(X) = \log \bigl|\{x \in \mathcal{X} : P_X(x) > 0\}\bigr|, \tag{3}$$
$$H_1(X) = H(X), \tag{4}$$
$$H_\infty(X) = \log \frac{1}{p_{\max}}, \tag{5}$$

where H(X) is the (Shannon) entropy of X.
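For concreteness, here is a short Python sketch (ours, not from the paper) of the Rényi entropy in (2), together with its continuous extensions (3)–(5); logarithms are taken to base 2, so the values are in bits.

```python
import numpy as np

def renyi_entropy(p, alpha):
    """Rényi entropy of order alpha in bits, per (2), with the extensions (3)-(5)."""
    p = np.asarray(p, float)
    p = p[p > 0]
    if alpha == 0:                      # (3): logarithm of the support size
        return np.log2(len(p))
    if alpha == 1:                      # (4): Shannon entropy, the limit as alpha -> 1
        return -(p * np.log2(p)).sum()
    if np.isinf(alpha):                 # (5): min-entropy, -log of the maximal mass
        return -np.log2(p.max())
    return np.log2((p ** alpha).sum()) / (1.0 - alpha)

p = [0.5, 0.25, 0.125, 0.125]
# H_alpha is non-increasing in alpha: H_0 >= H_1/2 >= H_1 >= H_2 >= H_inf
print([round(renyi_entropy(p, a), 4) for a in (0, 0.5, 1, 2, np.inf)])
```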

Proposition 1

(Schur-concavity of the Rényi entropy; [2] (Appendix F.3.a, p. 562)). The Rényi entropy of an arbitrary order $\alpha > 0$ is Schur-concave; in particular, for $\alpha = 1$, the Shannon entropy is Schur-concave.

Remark 1.

[17] (Theorem 2) strengthens Proposition 1, though it is not needed for our analysis.

Definition 4 

(Rényi divergence [34]). Let $P$ and $Q$ be probability mass functions defined on a finite or countably infinite set $\mathcal{X}$. The Rényi divergence of order $\alpha \in [0, \infty]$ is defined as follows:

  • If $\alpha \in (0,1) \cup (1,\infty)$, then
    $$D_\alpha(P \| Q) = \frac{1}{\alpha-1} \log \sum_{x \in \mathcal{X}} P^\alpha(x)\, Q^{1-\alpha}(x). \tag{6}$$
  • By the continuous extension of $D_\alpha(P \| Q)$,
    $$D_0(P \| Q) = \max_{\mathcal{A} : P(\mathcal{A}) = 1} \log \frac{1}{Q(\mathcal{A})}, \tag{7}$$
    $$D_1(P \| Q) = D(P \| Q), \tag{8}$$
    $$D_\infty(P \| Q) = \log \sup_{x \in \mathcal{X}} \frac{P(x)}{Q(x)}, \tag{9}$$
    where $D(P \| Q)$ in the right side of (8) is the relative entropy (a.k.a. the Kullback–Leibler divergence).
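Analogously, the Rényi divergence in (6)–(9) can be evaluated numerically; the sketch below (ours, with hypothetical function names) assumes that the support of $P$ is contained in the support of $Q$ whenever the expression requires it, and it uses base-2 logarithms. The identity $D_\alpha(P \| U_n) = \log n - H_\alpha(P)$, used later in Appendix B, is easy to check with it.

```python
import numpy as np

def renyi_divergence(p, q, alpha):
    """Rényi divergence D_alpha(P||Q) in bits, per (6)-(9); assumes supp(P) ⊆ supp(Q)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    if alpha == 0:                      # (7): -log of Q(supp(P))
        return -np.log2(q[m].sum())
    if alpha == 1:                      # (8): relative entropy (Kullback-Leibler divergence)
        return (p[m] * np.log2(p[m] / q[m])).sum()
    if np.isinf(alpha):                 # (9): log of the maximal ratio P/Q
        return np.log2((p[m] / q[m]).max())
    return np.log2((p[m] ** alpha * q[m] ** (1.0 - alpha)).sum()) / (alpha - 1.0)

p = np.array([0.4, 0.3, 0.2, 0.1])
u = np.ones(4) / 4
# both prints agree: D_2(P||U_4) = log 4 - H_2(P)
print(renyi_divergence(p, u, 2.0), np.log2(4) - np.log2((p ** 2).sum()) / (1 - 2))
```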

Throughout this paper, for $a \in \mathbb{R}$, $\lceil a \rceil$ denotes the ceiling of $a$ (i.e., the smallest integer not smaller than $a$), and $\lfloor a \rfloor$ denotes the floor of $a$ (i.e., the largest integer not larger than $a$).

3. A Tight Lower Bound on the Rényi Entropy

We provide in this section a tight lower bound on the Rényi entropy, of an arbitrary order $\alpha > 0$, when the probability mass function of the discrete random variable is defined on a finite set of cardinality $n$, and the ratio of the maximal to minimal probability masses is upper bounded by an arbitrary fixed value $\rho \in [1, \infty)$. In other words, we derive the largest possible gap between the order-$\alpha$ Rényi entropies of an equiprobable distribution and a non-equiprobable distribution (defined on a finite set of the same cardinality) with a given value of the ratio of the maximal to minimal probability masses. The basic tool used for the development of our result in this section is majorization theory. Our result strengthens the result in [12] (Theorem 2) for the Shannon entropy, and it further provides a generalization to the Rényi entropy of an arbitrary order $\alpha > 0$ (recall that the Shannon entropy is equal to the Rényi entropy of order $\alpha = 1$; see (4)). Furthermore, the approach for proving the main result in this section differs significantly from the proof in [12] for the Shannon entropy. The main result in this section is a key result for all that follows in this paper.

The following lemma is a restatement of [12] (Lemma 6).

Lemma 1.

Let $P \in \mathcal{P}_n(\rho)$ with $\rho \geq 1$ and an integer $n \geq 2$, and assume without any loss of generality that the probability mass function $P$ is defined on the set $\mathcal{X} = \{1, \ldots, n\}$. Let $Q \in \mathcal{P}_n$ be defined on $\mathcal{X}$ as follows:

$$Q(j) = \begin{cases} \rho\, p_{\min}, & j \in \{1, \ldots, i\}, \\ 1 - (n + i\rho - i - 1)\, p_{\min}, & j = i+1, \\ p_{\min}, & j \in \{i+2, \ldots, n\} \end{cases} \tag{10}$$

where

$$i := \left\lfloor \frac{1 - n\, p_{\min}}{(\rho - 1)\, p_{\min}} \right\rfloor. \tag{11}$$

Then,

  • (1) 

    $Q \in \mathcal{P}_n(\rho)$, and $Q(1) \geq Q(2) \geq \cdots \geq Q(n) > 0$;

  • (2) 

    $P \prec Q$.

Proof. 

See [12] (p. 2236) (top of the second column). ☐

Lemma 2.

Let $\rho > 1$, $\alpha > 0$, and let $n \geq 2$ be an integer. For

$$\beta \in \left[\frac{1}{1+(n-1)\rho}, \frac{1}{n}\right] =: \Gamma_\rho(n), \tag{12}$$

let $Q_\beta \in \mathcal{P}_n(\rho)$ be defined on $\mathcal{X} = \{1, \ldots, n\}$ as follows:

$$Q_\beta(j) = \begin{cases} \rho\beta, & j \in \{1, \ldots, i_\beta\}, \\ 1 - (n + i_\beta\rho - i_\beta - 1)\beta, & j = i_\beta + 1, \\ \beta, & j \in \{i_\beta + 2, \ldots, n\} \end{cases} \tag{13}$$

where

$$i_\beta := \left\lfloor \frac{1 - n\beta}{(\rho-1)\beta} \right\rfloor. \tag{14}$$

Then, for every $\alpha > 0$,

$$\min_{P \in \mathcal{P}_n(\rho)} H_\alpha(P) = \min_{\beta \in \Gamma_\rho(n)} H_\alpha(Q_\beta). \tag{15}$$

Proof. 

See Appendix A. ☐

Lemma 3.

For $\rho > 1$ and $\alpha > 0$, let

$$c_\alpha^{(n)}(\rho) := \log n - \min_{P \in \mathcal{P}_n(\rho)} H_\alpha(P), \quad n = 2, 3, \ldots \tag{16}$$

with $c_\alpha^{(1)}(\rho) := 0$. Then, for every $n \in \mathbb{N}$,

$$0 \leq c_\alpha^{(n)}(\rho) \leq \log\rho, \tag{17}$$
$$c_\alpha^{(n)}(\rho) \leq c_\alpha^{(2n)}(\rho), \tag{18}$$

and $c_\alpha^{(n)}(\rho)$ is monotonically increasing in $\alpha \in [0,\infty]$.

Proof. 

See Appendix B. ☐

Lemma 4.

For $\alpha > 0$ and $\rho > 1$, the limit

$$c_\alpha^{(\infty)}(\rho) := \lim_{n \to \infty} c_\alpha^{(n)}(\rho) \tag{19}$$

exists, having the following properties:

  • (a) 
    If $\alpha \in (0,1) \cup (1,\infty)$, then
    $$c_\alpha^{(\infty)}(\rho) = \frac{1}{\alpha-1} \log\!\left(1 + \frac{1+\alpha(\rho-1)-\rho^\alpha}{(1-\alpha)(\rho-1)}\right) - \frac{\alpha}{\alpha-1} \log\!\left(1 + \frac{1+\alpha(\rho-1)-\rho^\alpha}{(1-\alpha)(\rho^\alpha-1)}\right), \tag{20}$$
    and
    $$\lim_{\alpha \to \infty} c_\alpha^{(\infty)}(\rho) = \log\rho. \tag{21}$$
  • (b) 
    If $\alpha = 1$, then
    $$c_1^{(\infty)}(\rho) = \lim_{\alpha \to 1} c_\alpha^{(\infty)}(\rho) = \frac{\rho\log\rho}{\rho-1} - \log\!\left(\frac{e\,\rho\ln\rho}{\rho-1}\right). \tag{22}$$
  • (c) 
    For all $\alpha > 0$,
    $$\lim_{\rho \to 1} c_\alpha^{(\infty)}(\rho) = 0. \tag{23}$$
  • (d) 
    For every $n \in \mathbb{N}$, $\alpha > 0$ and $\rho \geq 1$,
    $$0 \leq c_\alpha^{(n)}(\rho) \leq c_\alpha^{(2n)}(\rho) \leq c_\alpha^{(\infty)}(\rho) \leq \log\rho. \tag{24}$$

Proof. 

See Appendix C. ☐

In view of Lemmata 1–4, we obtain the following main result in this section:

Theorem 1.

Let $\alpha > 0$, $\rho > 1$, $n \geq 2$, and let $c_\alpha^{(n)}(\rho)$ in (16) designate the maximal gap between the order-$\alpha$ Rényi entropies of equiprobable and arbitrary distributions in $\mathcal{P}_n(\rho)$. Then,

  • (a) 

    The non-negative sequence $\{c_\alpha^{(n)}(\rho)\}_{n=2}^{\infty}$ can be calculated by the real-valued single-parameter optimization in the right side of (15).

  • (b) 

    The asymptotic limit as $n \to \infty$, denoted by $c_\alpha^{(\infty)}(\rho)$, admits the closed-form expressions in (20) and (22), and it satisfies the properties in (21), (23) and (24).

Remark 2.

Setting $\alpha = 2$ in Theorem 1 gives that, for all $P \in \mathcal{P}_n(\rho)$ (with $\rho > 1$ and an integer $n \geq 2$),

$$H_2(P) \geq \log n - c_2^{(n)}(\rho) \tag{25}$$
$$\geq \log n - c_2^{(\infty)}(\rho) \tag{26}$$
$$= \log\frac{4\rho n}{(1+\rho)^2} \tag{27}$$

where (25)–(27) hold, respectively, due to (16), (24) and (20). This strengthens the result in [68] (Proposition 2), which gives the same lower bound as in the right side of (27) for $H(P)$ rather than for $H_2(P)$ (recall that $H(P) \geq H_2(P)$).

For a numerical illustration of Theorem 1, Figure 1 provides a plot of $c_\alpha^{(\infty)}(\rho)$ in (20) and (22) as a function of $\rho \geq 1$, confirming numerically the properties in (21) and (23). Furthermore, Figure 2 provides plots of $c_\alpha^{(n)}(\rho)$ in (16) as a function of $\alpha > 0$, for $\rho = 2$ (left plot) and $\rho = 256$ (right plot), with several values of $n \geq 2$; the calculation of the curves in these plots relies on (15), (20) and (22), and they illustrate the monotonicity and boundedness properties in (24).
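To make the numerical recipe behind these curves concrete, the following Python sketch (ours, not the authors' code) evaluates $c_\alpha^{(n)}(\rho)$ via the single-parameter minimization in (15)–(16), using a grid over $\beta \in \Gamma_\rho(n)$ and the distribution $Q_\beta$ of (13)–(14), and compares it with the closed forms (20) and (22) for $c_\alpha^{(\infty)}(\rho)$; logarithms are to base 2.

```python
import numpy as np

def Q_beta(n, rho, beta):
    """The pmf Q_beta of (13)-(14)."""
    i = int(np.clip(np.floor((1.0 - n * beta) / ((rho - 1.0) * beta)), 0, n - 1))
    return np.concatenate([np.full(i, rho * beta),
                           [1.0 - (n + i * rho - i - 1) * beta],
                           np.full(n - i - 1, beta)])

def renyi_entropy(p, alpha):
    p = np.asarray(p, float); p = p[p > 0]
    if alpha == 1:
        return -(p * np.log2(p)).sum()
    return np.log2((p ** alpha).sum()) / (1.0 - alpha)

def c_finite(n, rho, alpha, grid=2000):
    """c_alpha^(n)(rho) of (16), via a grid search over beta in Gamma_rho(n), per (15)."""
    betas = np.linspace(1.0 / (1.0 + (n - 1) * rho), 1.0 / n, grid)
    return np.log2(n) - min(renyi_entropy(Q_beta(n, rho, b), alpha) for b in betas)

def c_infty(rho, alpha):
    """Closed forms (20) (alpha != 1) and (22) (alpha = 1), in bits."""
    if alpha == 1:
        return rho * np.log2(rho) / (rho - 1) - np.log2(np.e * rho * np.log(rho) / (rho - 1))
    t = 1 + alpha * (rho - 1) - rho ** alpha
    return (np.log2(1 + t / ((1 - alpha) * (rho - 1)))
            - alpha * np.log2(1 + t / ((1 - alpha) * (rho ** alpha - 1)))) / (alpha - 1)

# monotonicity and the limit in (24): c^(8) <= c^(32) <= c^(512) <= c^(inf), here for rho = 2, alpha = 2
print([round(c_finite(n, 2.0, 2.0), 4) for n in (8, 32, 512)], round(c_infty(2.0, 2.0), 4))
```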

Figure 1. A plot of $c_\alpha^{(\infty)}(\rho)$ in (20) and (22) (logarithms to base 2) as a function of $\rho$, confirming numerically the properties in (21) and (23).

Figure 2. Plots of $c_\alpha^{(n)}(\rho)$ in (16) (logarithms to base 2) as a function of $\alpha > 0$, for $\rho = 2$ (left plot) and $\rho = 256$ (right plot), with several values of $n \geq 2$.

Remark 3.

Theorem 1 strengthens the result in [12] (Theorem 2) for the Shannon entropy (i.e., for $\alpha = 1$), in addition to generalizing it to Rényi entropies of arbitrary orders $\alpha > 0$. This is because our lower bound on the Shannon entropy is given by

$$H(P) \geq \log n - c_1^{(n)}(\rho), \quad P \in \mathcal{P}_n(\rho), \tag{28}$$

whereas the looser bound in [12] is given by (see [12] (Equation (7)) and (22) here)

$$H(P) \geq \log n - c_1^{(\infty)}(\rho), \quad P \in \mathcal{P}_n(\rho), \tag{29}$$

and we recall that $0 \leq c_1^{(n)}(\rho) \leq c_1^{(\infty)}(\rho)$ (see (24)). Figure 3 shows the improvement of the new lower bound (28) over (29) by comparing $c_1^{(\infty)}(\rho)$ with $c_1^{(n)}(\rho)$ for $\rho \in [1, 10^5]$ and several values of $n$. Figure 3 indicates that the improvement of the lower bound (28) over (29) is very marginal for $\rho \leq 30$ (even for small values of $n$), whereas it is significant for large values of $\rho$; as $n$ increases, larger values of $\rho$ are needed in order to observe an improvement of (28) over (29) (see Figure 3).

An improvement of the bound in (28) over (29) leads to a tightening of the upper bound in [12] (Theorem 4) on the compression rate of Tunstall codes for discrete memoryless sources, which further tightens the bound by Jelinek and Schneider in [69] (Equation (9)). More explicitly, in view of [12] (Section 6), an improved upper bound on the compression rate of these variable-to-fixed lossless source codes is obtained by combining [12] (Equations (36) and (38)) with a tightened lower bound on the entropy $H(W)$ of the leaves of the tree graph for Tunstall codes. From (28), the latter lower bound is given by $H(W) \geq \log_2 n - c_1^{(n)}(\rho)$, where $c_1^{(n)}(\rho)$ is expressed in bits, $\rho := \frac{1}{p_{\min}}$ is the reciprocal of the minimal positive probability of the source symbols, and $n$ is the number of codewords (so all codewords are of length $\lceil \log_2 n \rceil$ bits). This yields a reduction in the upper bound on the non-asymptotic compression rate $R$ of Tunstall codes from $\frac{\lceil \log_2 n \rceil\, H(X)}{\log_2 n - c_1^{(\infty)}(\rho)}$ (see [12] (Equation (40)) and (22)) to $\frac{\lceil \log_2 n \rceil\, H(X)}{\log_2 n - c_1^{(n)}(\rho)}$ bits per source symbol, where $H(X)$ denotes the source entropy (in view of (17), both bounds converge to $H(X)$ as $n \to \infty$).

Figure 3. A plot of $c_1^{(\infty)}(\rho)$ in (22) versus $c_1^{(n)}(\rho)$ for finite $n$ ($n = 512, 128, 32$, and $8$) as a function of $\rho$.

Remark 4.

Equality (15) with the minimizing probability mass function of the form (13) holds, in general, by replacing the Rényi entropy with an arbitrary Schur-concave function (as it can be easily verified from the proof of Lemma 2 in Appendix A). However, the analysis leading to Lemmata 3–4 and Theorem 1 applies particularly to the Rényi entropy.

4. Bounds on the Rényi Entropy of a Function of a Discrete Random Variable

This section relies on Theorem 1 and majorization for extending [12] (Theorem 1), which applies to the Shannon entropy, to Rényi entropies of any positive order. More explicitly, let $\alpha \in (0, \infty)$ and

  • $\mathcal{X}$ and $\mathcal{Y}$ be finite sets of cardinalities $|\mathcal{X}| = n$ and $|\mathcal{Y}| = m$ with $n > m \geq 2$; without any loss of generality, let $\mathcal{X} = \{1, \ldots, n\}$ and $\mathcal{Y} = \{1, \ldots, m\}$;

  • $X$ be a random variable taking values on $\mathcal{X}$ with a probability mass function $P_X \in \mathcal{P}_n$;

  • $\mathcal{F}_{n,m}$ be the set of deterministic functions $f \colon \mathcal{X} \to \mathcal{Y}$; note that $f \in \mathcal{F}_{n,m}$ is not one to one since $m < n$.

The main result in this section sharpens the inequality $H_\alpha(f(X)) \leq H_\alpha(X)$, for every deterministic function $f \in \mathcal{F}_{n,m}$ with $n > m \geq 2$ and $\alpha > 0$, by obtaining non-trivial upper and lower bounds on $\max_{f \in \mathcal{F}_{n,m}} H_\alpha(f(X))$. The calculation of the exact value of $\min_{f \in \mathcal{F}_{n,m}} H_\alpha(f(X))$ is much easier, and it is expressed in closed form by capitalizing on the Schur-concavity of the Rényi entropy.

The following main result extends [12] (Theorem 1) to Rényi entropies of arbitrary positive orders.

Theorem 2.

Let $X \in \{1, \ldots, n\}$ be a random variable which satisfies $P_X(1) \geq P_X(2) \geq \cdots \geq P_X(n)$.

  • (a) 
    For $m \in \{2, \ldots, n-1\}$, if $P_X(1) < \frac{1}{m}$, let $\tilde{X}_m$ be the equiprobable random variable on $\{1, \ldots, m\}$; otherwise, if $P_X(1) \geq \frac{1}{m}$, let $\tilde{X}_m \in \{1, \ldots, m\}$ be a random variable with the probability mass function
    $$P_{\tilde{X}_m}(i) = \begin{cases} P_X(i), & i \in \{1, \ldots, n^\star\}, \\ \dfrac{1}{m-n^\star} \displaystyle\sum_{j=n^\star+1}^{n} P_X(j), & i \in \{n^\star+1, \ldots, m\}, \end{cases} \tag{30}$$
    where $n^\star$ is the maximal integer $i \in \{1, \ldots, m-1\}$ such that
    $$P_X(i) \geq \frac{1}{m-i} \sum_{j=i+1}^{n} P_X(j). \tag{31}$$
    Then, for every $\alpha > 0$,
    $$\max_{f \in \mathcal{F}_{n,m}} H_\alpha(f(X)) \in \bigl[ H_\alpha(\tilde{X}_m) - v(\alpha),\; H_\alpha(\tilde{X}_m) \bigr], \tag{32}$$
    where
    $$v(\alpha) := c_\alpha^{(\infty)}(2) = \begin{cases} \log\dfrac{\alpha-1}{2^\alpha-2} - \dfrac{\alpha}{\alpha-1}\log\dfrac{\alpha}{2^\alpha-1}, & \alpha \neq 1, \\[6pt] \log\dfrac{2}{e\ln 2} \approx 0.08607 \ \text{bits}, & \alpha = 1. \end{cases} \tag{33}$$
  • (b) 
    There exists an explicit construction of a deterministic function $f^\star \in \mathcal{F}_{n,m}$ such that
    $$H_\alpha(f^\star(X)) \in \bigl[ H_\alpha(\tilde{X}_m) - v(\alpha),\; H_\alpha(\tilde{X}_m) \bigr] \tag{34}$$
    where $f^\star$ is independent of $\alpha$, and it is obtained by using Huffman coding (as in [12] for $\alpha = 1$).
  • (c) 
    Let $\tilde{Y}_m \in \{1, \ldots, m\}$ be a random variable with the probability mass function
    $$P_{\tilde{Y}_m}(i) = \begin{cases} \displaystyle\sum_{k=1}^{n-m+1} P_X(k), & i = 1, \\ P_X(n-m+i), & i \in \{2, \ldots, m\}. \end{cases} \tag{35}$$
    Then, for every $\alpha > 0$,
    $$\min_{f \in \mathcal{F}_{n,m}} H_\alpha(f(X)) = H_\alpha(\tilde{Y}_m). \tag{36}$$

Remark 5.

Setting α=1 specializes Theorem 2 to [12] (Theorem 1) (regarding the Shannon entropy). This point is further elaborated in Remark 8, after the proof of Theorem 2.

Remark 6.

Similarly to [12] (Lemma 1), an exact solution of the maximization problem in the left side of (32) is strongly NP-hard [70]; this means that, unless P = NP, there is no polynomial-time algorithm which, for an arbitrarily small $\varepsilon > 0$, computes an admissible deterministic function $f_\varepsilon \in \mathcal{F}_{n,m}$ such that

$$H_\alpha(f_\varepsilon(X)) \geq (1-\varepsilon) \max_{f \in \mathcal{F}_{n,m}} H_\alpha(f(X)). \tag{37}$$

This motivates the derivation of the bounds in (32), and the simple construction of a deterministic function $f^\star \in \mathcal{F}_{n,m}$ achieving (34).

A proof of Theorem 2 relies on the following lemmata.

Lemma 5.

Let $X \in \{1, \ldots, n\}$, $m < n$ and $\alpha > 0$. Then,

$$\max_{Q \in \mathcal{P}_m : P_X \prec Q} H_\alpha(Q) = H_\alpha(\tilde{X}_m) \tag{38}$$

where the probability mass function of $\tilde{X}_m$ is given in (30).

Proof. 

Since $P_X \prec P_{\tilde{X}_m}$ (see [12] (Lemma 2)) with $P_{\tilde{X}_m} \in \mathcal{P}_m$, and $P_{\tilde{X}_m} \prec Q$ for all $Q \in \mathcal{P}_m$ such that $P_X \prec Q$ (see [12] (Lemma 4)), the result follows from the Schur-concavity of the Rényi entropy. ☐

Lemma 6.

Let $X \in \{1, \ldots, n\}$, $\alpha > 0$, and $f \in \mathcal{F}_{n,m}$ with $m < n$. Then,

$$H_\alpha(f(X)) \leq H_\alpha(\tilde{X}_m). \tag{39}$$

Proof. 

Since $f$ is a deterministic function in $\mathcal{F}_{n,m}$ with $m < n$, the probability mass function of $f(X)$ is an element of $\mathcal{P}_m$ which majorizes $P_X$ (see [12] (Lemma 3)). Inequality (39) then follows from Lemma 5. ☐

We are now ready to prove Theorem 2.

Proof. 

In view of (39),

$$\max_{f \in \mathcal{F}_{n,m}} H_\alpha(f(X)) \leq H_\alpha(\tilde{X}_m). \tag{40}$$

We next construct a function $f^\star \in \mathcal{F}_{n,m}$ such that, for all $\alpha > 0$,

$$H_\alpha(f^\star(X)) \geq \max_{Q \in \mathcal{P}_m : P_X \prec Q} H_\alpha(Q) - v(\alpha) \tag{41}$$
$$\geq \max_{f \in \mathcal{F}_{n,m}} H_\alpha(f(X)) - v(\alpha) \tag{42}$$

where the function $v \colon (0,\infty) \to (0,\infty)$ in the right side of (41) is given in (33), and (42) holds due to (38) and (40). The function $f^\star$ in our proof coincides with the construction in [12], and it is, therefore, independent of $\alpha$.

We first review and follow the concept of the proof of [12] (Lemma 5), and we then deviate from the analysis there for proving our result. The idea behind the proof of [12] (Lemma 5) relies on the following algorithm:

  • (1)

    Start from the probability mass function $P_X \in \mathcal{P}_n$ with $P_X(1) \geq \cdots \geq P_X(n)$;

  • (2)

    Merge successively pairs of probability masses by applying the Huffman algorithm;

  • (3)

    Stop the merging process in Step 2 when a probability mass function $Q \in \mathcal{P}_m$ is obtained (with $Q(1) \geq \cdots \geq Q(m)$);

  • (4)

    Construct the deterministic function $f^\star \in \mathcal{F}_{n,m}$ by setting $f^\star(k) = j \in \{1, \ldots, m\}$ for all probability masses $P_X(k)$, with $k \in \{1, \ldots, n\}$, which are merged in Steps 2–3 into the node of $Q(j)$.

Let $i \in \{0, \ldots, m-1\}$ be the largest index such that $P_X(1) = Q(1), \ldots, P_X(i) = Q(i)$ (note that $i = 0$ corresponds to the case where each node $Q(j)$, with $j \in \{1, \ldots, m\}$, is constructed by merging at least two masses of the probability mass function $P_X$). Then, according to [12] (p. 2225),

$$Q(i+1) \leq 2\, Q(m). \tag{43}$$

Let

$$S := \sum_{j=i+1}^{m} Q(j) \tag{44}$$

be the sum of the $m-i$ smallest masses of the probability mass function $Q$. In view of (43), the vector

$$\bar{Q} := \left(\frac{Q(i+1)}{S}, \ldots, \frac{Q(m)}{S}\right) \tag{45}$$

represents a probability mass function whose ratio of maximal to minimal masses is upper bounded by 2.

At this point, our analysis deviates from [12] (p. 2225). Applying Theorem 1 to $\bar{Q}$ with $\rho = 2$ gives

$$H_\alpha(\bar{Q}) \geq \log(m-i) - c_\alpha^{(\infty)}(2) \tag{46}$$

with

$$c_\alpha^{(\infty)}(2) = \frac{1}{\alpha-1} \log\!\left(1 + \frac{1+\alpha-2^\alpha}{1-\alpha}\right) - \frac{\alpha}{\alpha-1} \log\!\left(1 + \frac{1+\alpha-2^\alpha}{(1-\alpha)(2^\alpha-1)}\right) \tag{47}$$
$$= \log\frac{\alpha-1}{2^\alpha-2} - \frac{\alpha}{\alpha-1}\log\frac{\alpha}{2^\alpha-1} \tag{48}$$
$$= v(\alpha) \tag{49}$$

where (47) follows from (20); (48) is straightforward algebra, and (49) is the definition in (33).

For $\alpha \in (0,1) \cup (1,\infty)$, we get

$$H_\alpha(Q) = \frac{1}{1-\alpha} \log \sum_{j=1}^{m} Q^\alpha(j) \tag{50}$$
$$= \frac{1}{1-\alpha} \log\!\left( \sum_{j=1}^{i} Q^\alpha(j) + \sum_{j=i+1}^{m} Q^\alpha(j) \right) \tag{51}$$
$$= \frac{1}{1-\alpha} \log\!\left( \sum_{j=1}^{i} Q^\alpha(j) + S^\alpha \exp\bigl((1-\alpha)\, H_\alpha(\bar{Q})\bigr) \right) \tag{52}$$
$$\geq \frac{1}{1-\alpha} \log\!\left( \sum_{j=1}^{i} Q^\alpha(j) + S^\alpha \exp\Bigl((1-\alpha)\bigl[\log(m-i) - v(\alpha)\bigr]\Bigr) \right) \tag{53}$$
$$= \frac{1}{1-\alpha} \log\!\left( \sum_{j=1}^{i} Q^\alpha(j) + S^\alpha (m-i)^{1-\alpha} \exp\bigl((\alpha-1)\, v(\alpha)\bigr) \right) \tag{54}$$

where (51) holds since $i \in \{0, \ldots, m-1\}$; (52) follows from (2) and (45); (53) holds by (46)–(49).

In view of (44), let $Q' \in \mathcal{P}_m$ be the probability mass function which is given by

$$Q'(j) = \begin{cases} Q(j), & j = 1, \ldots, i, \\ \dfrac{S}{m-i}, & j = i+1, \ldots, m. \end{cases} \tag{55}$$

From (50)–(55), we get

$$H_\alpha(Q) \geq \frac{1}{1-\alpha} \log\!\left( \sum_{j=1}^{i} Q'(j)^\alpha + \sum_{j=i+1}^{m} Q'(j)^\alpha \exp\bigl((\alpha-1)\, v(\alpha)\bigr) \right) \tag{56}$$
$$= \frac{1}{1-\alpha} \log\!\left( \sum_{j=1}^{m} Q'(j)^\alpha + \sum_{j=i+1}^{m} Q'(j)^\alpha \Bigl[\exp\bigl((\alpha-1)\, v(\alpha)\bigr) - 1\Bigr] \right) \tag{57}$$
$$= H_\alpha(Q') + \frac{1}{1-\alpha} \log\!\left( 1 + T \Bigl[\exp\bigl((\alpha-1)\, v(\alpha)\bigr) - 1\Bigr] \right) \tag{58}$$

with

$$T := \frac{\sum_{j=i+1}^{m} Q'(j)^\alpha}{\sum_{j=1}^{m} Q'(j)^\alpha} \in [0,1]. \tag{59}$$

Since $T \in [0,1]$ and $v(\alpha) > 0$ for $\alpha > 0$, it can be verified from (56)–(58) that, for $\alpha \in (0,1) \cup (1,\infty)$,

$$H_\alpha(Q) \geq H_\alpha(Q') - v(\alpha). \tag{60}$$

The validity of (60) is extended to $\alpha = 1$ by taking the limit $\alpha \to 1$ on both sides of this inequality, and due to the continuity of $v(\cdot)$ in (33) at $\alpha = 1$. Applying the majorization result $Q' \prec P_{\tilde{X}_m}$ in [12] (Equation (31)), it follows from (60) and the Schur-concavity of the Rényi entropy that, for all $\alpha > 0$,

$$H_\alpha(Q) \geq H_\alpha(Q') - v(\alpha) \geq H_\alpha(\tilde{X}_m) - v(\alpha), \tag{61}$$

which, together with (40), proves Items (a) and (b) of Theorem 2 (note that, in view of the construction of the deterministic function $f^\star \in \mathcal{F}_{n,m}$ in Step 4 of the above algorithm, we get $H_\alpha(f^\star(X)) = H_\alpha(Q)$).

We next prove Item (c). Equality (36) is due to the Schur-concavity of the Rényi entropy, and since we have

  • $f(X)$ is an aggregation of $X$, i.e., the probability mass function $Q \in \mathcal{P}_m$ of $f(X)$ satisfies $Q(j) = \sum_{i \in \mathcal{I}_j} P_X(i)$ ($1 \leq j \leq m$) where $\mathcal{I}_1, \ldots, \mathcal{I}_m$ partition $\{1, \ldots, n\}$ into $m$ disjoint subsets as follows:
    $$\mathcal{I}_j := \bigl\{i \in \{1, \ldots, n\} : f(i) = j\bigr\}, \quad j = 1, \ldots, m; \tag{62}$$
  • By the assumption $P_X(1) \geq P_X(2) \geq \cdots \geq P_X(n)$, it follows that $Q \prec P_{\tilde{Y}_m}$ for every such $Q \in \mathcal{P}_m$;

  • From (35), $\tilde{Y}_m = \tilde{f}(X)$ where the function $\tilde{f} \in \mathcal{F}_{n,m}$ is given by $\tilde{f}(k) := 1$ for all $k \in \{1, \ldots, n-m+1\}$, and $\tilde{f}(n-m+i) := i$ for all $i \in \{2, \ldots, m\}$. Hence, $P_{\tilde{Y}_m}$ is an element of the set of probability mass functions of $f(X)$ with $f \in \mathcal{F}_{n,m}$ which majorizes every other element of this set.

 ☐
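To make Steps 1–4 concrete, the following Python sketch (ours, not the authors' implementation; the helper names are hypothetical) performs the Huffman-style merging of the two smallest masses until $m$ masses remain, and returns the resulting aggregation map $f^\star$ from $\{1, \ldots, n\}$ to $\{1, \ldots, m\}$.

```python
import heapq

def huffman_aggregation(p, m):
    """Steps 1-4 sketch: repeatedly merge the two smallest masses (Huffman style)
    until m masses remain; return f_star mapping source symbols 1..n to merged nodes 1..m."""
    n = len(p)
    groups = {j: [j] for j in range(n)}          # each symbol starts in its own group
    heap = [(p[j], j) for j in range(n)]
    heapq.heapify(heap)
    next_id = n
    while len(groups) > m:
        m1, a = heapq.heappop(heap)
        m2, b = heapq.heappop(heap)
        groups[next_id] = groups.pop(a) + groups.pop(b)
        heapq.heappush(heap, (m1 + m2, next_id))
        next_id += 1
    # relabel the m surviving groups as 1..m in decreasing order of total mass
    order = sorted(groups, key=lambda g: -sum(p[i] for i in groups[g]))
    return {i + 1: j + 1 for j, g in enumerate(order) for i in groups[g]}

p = [0.4, 0.2, 0.15, 0.1, 0.08, 0.07]            # P_X(1) >= ... >= P_X(6)
print(huffman_aggregation(p, 3))                 # maps each of the 6 symbols to one of 3 merged symbols
```

Summing $P_X$ over the preimages of this map yields the pmf $Q \in \mathcal{P}_m$ analyzed in the proof above.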

Remark 7.

The solid line in the left plot of Figure 2 depicts $v(\alpha) := c_\alpha^{(\infty)}(2)$ in (33) for $\alpha > 0$. In view of Lemma 4, and by the definition in (33), the function $v \colon (0,\infty) \to (0,\infty)$ is indeed monotonically increasing and continuous.

Remark 8.

Inequality (43) leads to the application of Theorem 1 with $\rho = 2$ (see (46)). In the derivation of Theorem 2, we refer to $v(\alpha) := c_\alpha^{(\infty)}(2)$ (see (47)–(49)) rather than to $c_\alpha^{(n)}(2)$ (although, from (24), we have $0 \leq c_\alpha^{(n)}(2) \leq v(\alpha)$ for all $\alpha > 0$). We do so since, for $n \geq 16$, the difference between the curves of $c_\alpha^{(n)}(2)$ (as a function of $\alpha > 0$) and the curve of $c_\alpha^{(\infty)}(2)$ is marginal (see the dashed and solid lines in the left plot of Figure 2), and also because the function $v$ in (33) is expressed in closed form whereas $c_\alpha^{(n)}(2)$ is subject to numerical optimization for finite $n$ (see (15) and (16)). For this reason, Theorem 2 coincides with the result in [12] (Theorem 1) for the Shannon entropy (i.e., for $\alpha = 1$), while providing a generalization of the latter result to Rényi entropies of arbitrary positive orders $\alpha$. Theorem 1, however, both strengthens the bounds in [12] (Theorem 2) for the Shannon entropy with finite cardinality $n$ (see Remark 3), and it also generalizes these bounds to Rényi entropies of all positive orders.

Remark 9.

The minimizing probability mass function in (35) for the optimization problem (36), and the maximizing probability mass function in (30) for the optimization problem (38), remain valid in general when the Rényi entropy of a positive order is replaced by an arbitrary Schur-concave function. However, the main results in (32)–(34) hold specifically for the Rényi entropy.

Remark 10.

Theorem 2 makes use of the random variables denoted by $\tilde{X}_m$ and $\tilde{Y}_m$, rather than (more simply) $X_m$ and $Y_m$ respectively, because Section 5 considers i.i.d. samples $\{X_i\}_{i=1}^{k}$ and $\{Y_i\}_{i=1}^{k}$ with $X_i \sim P_X$ and $Y_i \sim P_Y$; note, however, that the probability mass functions of $\tilde{X}_m$ and $\tilde{Y}_m$ are different from $P_X$ and $P_Y$, respectively, and for that reason we use the tilde symbols in the left sides of (30) and (35).

5. Information-Theoretic Applications: Non-Asymptotic Bounds for Lossless Compression and Guessing

Theorem 2 is applied in this section to derive non-asymptotic bounds for lossless compression of discrete memoryless sources and guessing moments. Each of the two subsections starts with a short background for making the presentation self-contained.

5.1. Guessing

5.1.1. Background

The problem of guessing discrete random variables has various theoretical and operational aspects in information theory (see [35,36,37,38,40,41,43,56,71,72,73,74,75,76,77,78,79,80,81]). The central object of interest is the distribution of the number of guesses required to identify a realization of a random variable $X$, taking values on a finite or countably infinite set $\mathcal{X} = \{1, \ldots, |\mathcal{X}|\}$, by successively asking questions of the form “Is $X$ equal to $x$?” until the value of $X$ is guessed correctly. A guessing function is a one-to-one function $g \colon \mathcal{X} \to \mathcal{X}$, which can be viewed as a permutation of the elements of $\mathcal{X}$ in the order in which they are guessed. The required number of guesses is therefore equal to $g(x)$ when $X = x$ with $x \in \mathcal{X}$.

Lower and upper bounds on the minimal expected number of guesses required to correctly identify the realization of $X$, expressed as a function of the Shannon entropy $H(X)$, were derived by Massey [77] and by McEliece and Yu [78], respectively, followed by improved upper and lower bounds by De Santis et al. [80]. More generally, given a probability mass function $P_X$ on $\mathcal{X}$, it is of interest to minimize the generalized guessing moment $\mathbb{E}[g^\rho(X)] = \sum_{x \in \mathcal{X}} P_X(x)\, g^\rho(x)$ for $\rho > 0$. For an arbitrary positive $\rho$, the $\rho$-th moment of the number of guesses is minimized by selecting the guessing function to be a ranking function $g_X$, for which $g_X(x) = \ell$ if $P_X(x)$ is the $\ell$-th largest mass [77]. Although tie breaking affects the choice of $g_X$, the distribution of $g_X(X)$ does not depend on how ties are resolved. Not only does this strategy minimize the average number of guesses, but it also minimizes the $\rho$-th moment of the number of guesses for every $\rho > 0$. Upper and lower bounds on the $\rho$-th moment of ranking functions, expressed in terms of Rényi entropies, were derived by Arikan [35] and Boztaş [71], followed by recent improvements in the non-asymptotic regime by Sason and Verdú [56]. Although the guessing moments can be evaluated numerically in a straightforward way when $|\mathcal{X}|$ is small, the benefit of bounds expressed in terms of Rényi entropies is particularly relevant when dealing with a random vector $X^k = (X_1, \ldots, X_k)$ whose letters belong to a finite alphabet $\mathcal{X}$; computing all the probabilities of the mass function $P_{X^k}$ over the set $\mathcal{X}^k$, and then sorting them in decreasing order in order to calculate the $\rho$-th moment of the optimal guessing function for the elements of $\mathcal{X}^k$, becomes infeasible even for moderate values of $k$. In contrast, regardless of the value of $k$, bounds on guessing moments which depend on the Rényi entropy are readily computable if, for example, $\{X_i\}_{i=1}^{k}$ are independent, in which case the Rényi entropy of the vector is equal to the sum of the Rényi entropies of its components. Arikan's bounds in [35] are asymptotically tight for random vectors of length $k$ as $k \to \infty$, thus providing the correct exponential growth rate of the guessing moments for sufficiently large $k$.
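As a small illustration of these quantities (ours, not part of the paper), the sketch below computes the $\rho$-th moment of the optimal (ranking-function) number of guesses for a single symbol, i.e., the case $k = 1$, and compares its exponent with the Rényi-entropy upper bound of Arikan that appears in (63) below.

```python
import numpy as np

def guessing_moment(p, rho):
    """rho-th moment of the number of guesses under the optimal ranking function."""
    p = np.sort(np.asarray(p, float))[::-1]      # guess symbols in decreasing probability
    ranks = np.arange(1, len(p) + 1, dtype=float)
    return (p * ranks ** rho).sum()

def renyi_entropy(p, alpha):
    p = np.asarray(p, float); p = p[p > 0]
    return np.log2((p ** alpha).sum()) / (1.0 - alpha)

p = [0.5, 0.25, 0.125, 0.0625, 0.0625]
for rho in (0.5, 1.0, 2.0):
    lhs = np.log2(guessing_moment(p, rho))                 # log of the optimal rho-th moment
    rhs = rho * renyi_entropy(p, 1.0 / (1.0 + rho))        # Renyi-entropy upper bound (k = 1)
    print(rho, round(lhs, 3), "<=", round(rhs, 3))
```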

5.1.2. Analysis

We next analyze the following setup of guessing. Let $\{X_i\}_{i=1}^{k}$ be i.i.d. random variables where $X_1 \sim P_X$ takes values on a finite set $\mathcal{X}$ with $|\mathcal{X}| = n$. To cluster the data [82] (see also [12] (Section 3.A) and references therein), suppose that each $X_i$ is mapped to $Y_i = f(X_i)$ where $f \in \mathcal{F}_{n,m}$ is an arbitrary deterministic function (independent of the index $i$) with $m < n$. Consequently, $\{Y_i\}_{i=1}^{k}$ are i.i.d., and each $Y_i$ takes values on a finite set $\mathcal{Y}$ with $|\mathcal{Y}| = m < |\mathcal{X}|$.

Let $g_{X^k} \colon \mathcal{X}^k \to \{1, \ldots, n^k\}$ and $g_{Y^k} \colon \mathcal{Y}^k \to \{1, \ldots, m^k\}$ be, respectively, the ranking functions of the random vectors $X^k = (X_1, \ldots, X_k)$ and $Y^k = (Y_1, \ldots, Y_k)$, obtained by sorting in (separate) decreasing orders the probabilities $P_{X^k}(x^k) = \prod_{i=1}^{k} P_X(x_i)$ for $x^k \in \mathcal{X}^k$, and $P_{Y^k}(y^k) = \prod_{i=1}^{k} P_Y(y_i)$ for $y^k \in \mathcal{Y}^k$, where ties in both cases are resolved arbitrarily. In view of Arikan's bounds on the $\rho$-th moment of ranking functions (see [35] (Theorem 1) for the lower bound, and [35] (Proposition 4) for the upper bound), since $|\mathcal{X}^k| = n^k$ and $|\mathcal{Y}^k| = m^k$, the following bounds hold for all $\rho > 0$:

$$\rho\, H_{\frac{1}{1+\rho}}(X) - \frac{\rho \log(1 + k \ln n)}{k} \leq \frac{1}{k} \log \mathbb{E}\bigl[g_{X^k}^\rho(X^k)\bigr] \leq \rho\, H_{\frac{1}{1+\rho}}(X), \tag{63}$$
$$\rho\, H_{\frac{1}{1+\rho}}(Y) - \frac{\rho \log(1 + k \ln m)}{k} \leq \frac{1}{k} \log \mathbb{E}\bigl[g_{Y^k}^\rho(Y^k)\bigr] \leq \rho\, H_{\frac{1}{1+\rho}}(Y). \tag{64}$$

In the following, we rely on Theorem 2 and the bounds in (63) and (64) to obtain bounds on the exponential reduction of the $\rho$-th moment of the ranking function of $X^k$ as a result of its mapping to $Y^k$. First, the combination of (63) and (64) yields

$$\rho\Bigl[ H_{\frac{1}{1+\rho}}(X) - H_{\frac{1}{1+\rho}}(Y) \Bigr] - \frac{\rho \log(1 + k \ln n)}{k} \leq \frac{1}{k} \log \frac{\mathbb{E}\bigl[g_{X^k}^\rho(X^k)\bigr]}{\mathbb{E}\bigl[g_{Y^k}^\rho(Y^k)\bigr]} \tag{65}$$
$$\leq \rho\Bigl[ H_{\frac{1}{1+\rho}}(X) - H_{\frac{1}{1+\rho}}(Y) \Bigr] + \frac{\rho \log(1 + k \ln m)}{k}. \tag{66}$$

In view of Theorem 2-(a) and (65), it follows that, for an arbitrary $f \in \mathcal{F}_{n,m}$ and $\rho > 0$,

$$\frac{1}{k} \log \frac{\mathbb{E}\bigl[g_{X^k}^\rho(X^k)\bigr]}{\mathbb{E}\bigl[g_{Y^k}^\rho(Y^k)\bigr]} \geq \rho \Bigl[ H_{\frac{1}{1+\rho}}(X) - H_{\frac{1}{1+\rho}}(\tilde{X}_m) \Bigr] - \frac{\rho \log(1 + k \ln n)}{k} \tag{67}$$

where $\tilde{X}_m$ is a random variable whose probability mass function is given in (30). Note that

$$H_{\frac{1}{1+\rho}}(\tilde{X}_m) \leq H_{\frac{1}{1+\rho}}(X), \qquad \frac{\rho \log(1 + k \ln n)}{k} \xrightarrow[k \to \infty]{} 0 \tag{68}$$

where the first inequality in (68) holds since $P_X \prec P_{\tilde{X}_m}$ (see Lemma 5) and the Rényi entropy is Schur-concave.

By the explicit construction of the function $f^\star \in \mathcal{F}_{n,m}$ according to the algorithm in Steps 1–4 in the proof of Theorem 2 (based on the Huffman procedure), and by setting $Y_i := f^\star(X_i)$ for every $i \in \{1, \ldots, k\}$, it follows from (34) and (66) that, for all $\rho > 0$,

$$\frac{1}{k} \log \frac{\mathbb{E}\bigl[g_{X^k}^\rho(X^k)\bigr]}{\mathbb{E}\bigl[g_{Y^k}^\rho(Y^k)\bigr]} \leq \rho \Bigl[ H_{\frac{1}{1+\rho}}(X) - H_{\frac{1}{1+\rho}}(\tilde{X}_m) + v\bigl(\tfrac{1}{1+\rho}\bigr) \Bigr] + \frac{\rho \log(1 + k \ln m)}{k} \tag{69}$$

where the monotonically increasing function $v \colon (0,\infty) \to (0,\infty)$ is given in (33), and it is depicted by the solid line in the left plot of Figure 2. In view of (33), it can be shown that the linear approximation $v(\alpha) \approx v(1)\,\alpha$ is excellent for all $\alpha \in [0,1]$, and therefore, for all $\rho > 0$,

$$v\bigl(\tfrac{1}{1+\rho}\bigr) \approx \frac{0.08607}{1+\rho} \ \text{bits}. \tag{70}$$

Hence, for a sufficiently large value of $k$, the gap between the lower and upper bounds in (67) and (69) is marginal, being approximately equal to $\frac{0.08607\,\rho}{1+\rho}$ bits for all $\rho > 0$.

The following theorem summarizes our result in this section.

Theorem 3.

Let

  • $\{X_i\}_{i=1}^{k}$ be i.i.d. with $X_1 \sim P_X$ taking values on a set $\mathcal{X}$ with $|\mathcal{X}| = n$;

  • $Y_i = f(X_i)$, for every $i \in \{1, \ldots, k\}$, where $f \in \mathcal{F}_{n,m}$ is a deterministic function with $m < n$;

  • $g_{X^k} \colon \mathcal{X}^k \to \{1, \ldots, n^k\}$ and $g_{Y^k} \colon \mathcal{Y}^k \to \{1, \ldots, m^k\}$ be, respectively, ranking functions of the random vectors $X^k = (X_1, \ldots, X_k)$ and $Y^k = (Y_1, \ldots, Y_k)$.

Then, for every ρ>0,

  • (a) 

    The lower bound in (67) holds for every deterministic function $f \in \mathcal{F}_{n,m}$;

  • (b) 

    The upper bound in (69) holds for the specific $f^\star \in \mathcal{F}_{n,m}$ whose construction relies on the Huffman algorithm (see Steps 1–4 of the procedure in the proof of Theorem 2);

  • (c) 

    The gap between these bounds, for $f = f^\star$ and sufficiently large $k$, is at most $\rho\, v\bigl(\tfrac{1}{1+\rho}\bigr) \approx \frac{0.08607\,\rho}{1+\rho}$ bits.

5.1.3. Numerical Result

The following simple example illustrates the tightness of the achievable upper bound and the universal lower bound in Theorem 3, especially for sufficiently long sequences.

Example 1.

Let $X$ be geometrically distributed, restricted to $\{1, \ldots, n\}$, with the probability mass function

$$P_X(j) = \frac{(1-a)\, a^{j-1}}{1 - a^n}, \quad j \in \{1, \ldots, n\} \tag{71}$$

where $a = \frac{24}{25}$ and $n = 128$. Assume that $X_1, \ldots, X_k$ are i.i.d. with $X_1 \sim P_X$, and let $Y_i = f(X_i)$ for a deterministic function $f \in \mathcal{F}_{n,m}$ with $n = 128$ and $m = 16$. We compare the upper and lower bounds in Theorem 3 for the two cases where the sequence $X^k = (X_1, \ldots, X_k)$ is of length $k = 100$ or $k = 1000$. The lower bound in (67) holds for an arbitrary deterministic $f \in \mathcal{F}_{n,m}$, and the achievable upper bound in (69) holds for the construction of the deterministic function $f = f^\star \in \mathcal{F}_{n,m}$ (based on the Huffman algorithm) in Theorem 3. Numerical results are shown in Figure 4, providing plots of the upper and lower bounds on $\frac{1}{k} \log_2 \frac{\mathbb{E}[g_{X^k}^\rho(X^k)]}{\mathbb{E}[g_{Y^k}^\rho(Y^k)]}$ in Theorem 3, and illustrating the improved tightness of these bounds when the value of $k$ is increased from 100 (left plot) to 1000 (right plot). From Theorem 3-(c), for sufficiently large $k$, the gap between the upper and lower bounds is less than 0.08607 bits (for all $\rho > 0$); this is consistent with the right plot of Figure 4, where $k = 1000$.
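The bound curves in Figure 4 can be reproduced, up to plotting, with a short script; the sketch below (ours, not the authors' code) builds the pmf (71), the auxiliary pmf of $\tilde{X}_m$ via (30)–(31), and evaluates the right sides of (67) and (69) at a few values of $\rho$ for $k = 1000$.

```python
import numpy as np

def renyi(p, alpha):
    p = np.asarray(p, float); p = p[p > 0]
    if alpha == 1:
        return -(p * np.log2(p)).sum()
    return np.log2((p ** alpha).sum()) / (1.0 - alpha)

def x_tilde(p, m):
    """pmf of X~_m per (30)-(31); p is assumed sorted in decreasing order."""
    p = np.asarray(p, float); n = len(p)
    if p[0] < 1.0 / m:
        return np.ones(m) / m
    n_star = max(i for i in range(1, m) if p[i - 1] >= p[i:].sum() / (m - i))
    return np.concatenate([p[:n_star], np.full(m - n_star, p[n_star:].sum() / (m - n_star))])

def v(alpha):
    """v(alpha) of (33), in bits."""
    if alpha == 1:
        return np.log2(2.0 / (np.e * np.log(2.0)))
    return np.log2((alpha - 1) / (2 ** alpha - 2)) - alpha / (alpha - 1) * np.log2(alpha / (2 ** alpha - 1))

a, n, m, k = 24 / 25, 128, 16, 1000
p = (1 - a) * a ** np.arange(n) / (1 - a ** n)                            # (71)
pt = x_tilde(p, m)
for rho in (0.5, 1.0, 2.0):
    alpha = 1.0 / (1.0 + rho)
    gap = rho * (renyi(p, alpha) - renyi(pt, alpha))
    lower = gap - rho * np.log2(1 + k * np.log(n)) / k                    # right side of (67)
    upper = gap + rho * v(alpha) + rho * np.log2(1 + k * np.log(m)) / k   # right side of (69)
    print(rho, round(lower, 3), round(upper, 3))
```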

Figure 4. Plots of the upper and lower bounds on $\frac{1}{k} \log_2 \frac{\mathbb{E}[g_{X^k}^\rho(X^k)]}{\mathbb{E}[g_{Y^k}^\rho(Y^k)]}$ in Theorem 3, as a function of $\rho > 0$, for random vectors of length $k = 100$ (left plot) or $k = 1000$ (right plot) in the setting of Example 1. Each plot shows the universal lower bound for an arbitrary deterministic $f \in \mathcal{F}_{128,16}$, and the achievable upper bound with the construction of the deterministic function $f = f^\star \in \mathcal{F}_{128,16}$ (based on the Huffman algorithm) in Theorem 3 (see, respectively, (67) and (69)).

5.2. Lossless Source Coding

5.2.1. Background

For uniquely decodable (UD) lossless source coding, Campbell [51,83] proposed the cumulant generating function of the codeword lengths as a generalization to the frequently used design criterion of average code length. Campbell’s motivation in [51] was to control the contribution of the longer codewords via a free parameter in the cumulant generating function: if the value of this parameter tends to zero, then the resulting design criterion becomes the average code length per source symbol; on the other hand, by increasing the value of the free parameter, the penalty for longer codewords is more severe, and the resulting code optimization yields a reduction in the fluctuations of the codeword lengths.

We next state the coding theorem by Campbell [51] for lossless compression of a discrete memoryless source (DMS) with UD codes, which is used in our analysis jointly with Theorem 2.

Theorem 4 

(Campbell 1965, [51]). Consider a DMS which emits symbols with a probability mass function $P_X$ defined on a (finite or countably infinite) set $\mathcal{X}$. Consider a UD fixed-to-variable source code operating on source sequences of $k$ symbols, with a codeword alphabet of size $D$. Let $\ell(x^k)$ be the length of the codeword which corresponds to the source sequence $x^k := (x_1, \ldots, x_k) \in \mathcal{X}^k$. Consider the scaled cumulant generating function of the codeword lengths

$$\Lambda_k(\rho) := \frac{1}{k} \log_D \sum_{x^k \in \mathcal{X}^k} P_{X^k}(x^k)\, D^{\rho\, \ell(x^k)}, \quad \rho > 0 \tag{72}$$

where

$$P_{X^k}(x^k) = \prod_{i=1}^{k} P_X(x_i), \quad x^k \in \mathcal{X}^k. \tag{73}$$

Then, for every $\rho > 0$, the following hold:

  • (a) 
    Converse result:
    $$\frac{\Lambda_k(\rho)}{\rho} \geq \frac{H_{\frac{1}{1+\rho}}(X)}{\log D}. \tag{74}$$
  • (b) 
    Achievability result: there exists a UD source code for which
    $$\frac{\Lambda_k(\rho)}{\rho} \leq \frac{H_{\frac{1}{1+\rho}}(X)}{\log D} + \frac{1}{k}. \tag{75}$$

The term scaled cumulant generating function is used in view of [56] (Remark 20). The bounds in Theorem 4, expressed in terms of the Rényi entropy, imply that for sufficiently long source sequences, it is possible to make the scaled cumulant generating function of the codeword lengths approach the Rényi entropy as closely as desired by a proper fixed-to-variable UD source code; moreover, the converse result shows that there is no UD source code for which the scaled cumulant generating function of its codeword lengths lies below the Rényi entropy. By invoking L’Hôpital’s rule, one gets from (72)

$$\lim_{\rho \to 0} \frac{\Lambda_k(\rho)}{\rho} = \frac{1}{k} \sum_{x^k \in \mathcal{X}^k} P_{X^k}(x^k)\, \ell(x^k) = \frac{1}{k}\, \mathbb{E}\bigl[\ell(X^k)\bigr]. \tag{76}$$

Hence, by letting ρ tend to zero in (74) and (75), it follows from (4) that Campbell’s result in Theorem 4 generalizes the well-known bounds on the optimal average length of UD fixed-to-variable source codes (see, e.g., [84] ((5.33) and (5.37))):

$$\frac{H(X)}{\log D} \leq \frac{1}{k}\, \mathbb{E}\bigl[\ell(X^k)\bigr] \leq \frac{H(X)}{\log D} + \frac{1}{k}, \tag{77}$$

and (77) is satisfied by Huffman coding (see, e.g., [84] (Theorem 5.8.1)). Campbell’s result therefore generalizes Shannon’s fundamental result in [85] for the average codeword lengths of lossless compression codes, expressed in terms of the Shannon entropy.

Following the work by Campbell [51], Courtade and Verdú derived in [52] non-asymptotic bounds on the scaled cumulant generating function of the codeword lengths for $P_X$-optimal variable-length lossless codes [23,86]. These bounds were used in [52] to obtain simple proofs of the asymptotic normality of the distribution of codeword lengths, and of the reliability function of memoryless sources allowing countably infinite alphabets. Sason and Verdú recently derived in [56] improved non-asymptotic bounds on the cumulant generating function of the codeword lengths for fixed-to-variable optimal lossless source coding without prefix constraints, and non-asymptotic bounds on the reliability function of a DMS, tightening the bounds in [52].

5.2.2. Analysis

The following analysis for lossless source compression with UD codes relies on a combination of Theorems 2 and 4.

Let $X_1, \ldots, X_k$ be i.i.d. symbols which are emitted from a DMS according to a probability mass function $P_X$ whose support is a finite set $\mathcal{X}$ with $|\mathcal{X}| = n$. Similarly to Section 5.1, to cluster the data, suppose that each symbol $X_i$ is mapped to $Y_i = f(X_i)$ where $f \in \mathcal{F}_{n,m}$ is an arbitrary deterministic function (independent of the index $i$) with $m < n$. Consequently, the i.i.d. symbols $Y_1, \ldots, Y_k$ take values on a set $\mathcal{Y}$ with $|\mathcal{Y}| = m < |\mathcal{X}|$. Consider two UD fixed-to-variable source codes: one operating on the sequences $x^k \in \mathcal{X}^k$, and the other operating on the sequences $y^k \in \mathcal{Y}^k$; let $D$ be the size of the alphabets of both source codes. Let $\ell(x^k)$ and $\bar{\ell}(y^k)$ denote the lengths of the codewords for the source sequences $x^k$ and $y^k$, respectively, and let $\Lambda_k(\cdot)$ and $\bar{\Lambda}_k(\cdot)$ denote their corresponding scaled cumulant generating functions (see (72)).

In view of Theorem 4-(b), for every $\rho > 0$, there exists a UD source code for the sequences in $\mathcal{X}^k$ such that the scaled cumulant generating function of its codeword lengths satisfies (75). Furthermore, from Theorem 4-(a), we get

$$\frac{\bar{\Lambda}_k(\rho)}{\rho} \geq \frac{H_{\frac{1}{1+\rho}}(Y)}{\log D}. \tag{78}$$

From (75), (78) and Theorem 2 (a) and (b), for every $\rho > 0$, there exist a UD source code for the sequences in $\mathcal{X}^k$ and a construction of a deterministic function $f^\star \in \mathcal{F}_{n,m}$ (as specified by Steps 1–4 in the proof of Theorem 2, borrowed from [12]) such that the difference between the two scaled cumulant generating functions satisfies

$$\Lambda_k(\rho) - \bar{\Lambda}_k(\rho) \leq \frac{\rho}{\log D} \Bigl[ H_{\frac{1}{1+\rho}}(X) - H_{\frac{1}{1+\rho}}(\tilde{X}_m) + v\bigl(\tfrac{1}{1+\rho}\bigr) \Bigr] + \frac{\rho}{k}, \tag{79}$$

where (79) holds for every UD source code operating on the sequences in $\mathcal{Y}^k$ with $Y_i = f^\star(X_i)$ (for $i = 1, \ldots, k$) and the specific construction of $f^\star \in \mathcal{F}_{n,m}$ as above, and $\tilde{X}_m$ in the right side of (79) is a random variable whose probability mass function is given in (30). The right side of (79) can be very well approximated (for all $\rho > 0$) by using (70).

We proceed with a derivation of a lower bound on the left side of (79). In view of Theorem 4, inequality (74) is satisfied for every UD source code which operates on the sequences in $\mathcal{X}^k$; furthermore, Theorems 2 and 4 imply that, for every $f \in \mathcal{F}_{n,m}$, there exists a UD source code which operates on the sequences in $\mathcal{Y}^k$ such that

$$\frac{\bar{\Lambda}_k(\rho)}{\rho} \leq \frac{H_{\frac{1}{1+\rho}}(Y)}{\log D} + \frac{1}{k} \tag{80}$$
$$\leq \frac{H_{\frac{1}{1+\rho}}(\tilde{X}_m)}{\log D} + \frac{1}{k}, \tag{81}$$

where (81) is due to (39) since $Y_i = f(X_i)$ (for $i = 1, \ldots, k$) with an arbitrary deterministic function $f \in \mathcal{F}_{n,m}$, and $Y_i \sim P_Y$ for every $i$; hence, from (74), (80) and (81),

$$\Lambda_k(\rho) - \bar{\Lambda}_k(\rho) \geq \frac{\rho}{\log D} \Bigl[ H_{\frac{1}{1+\rho}}(X) - H_{\frac{1}{1+\rho}}(\tilde{X}_m) \Bigr] - \frac{\rho}{k}. \tag{82}$$

We summarize our result as follows.

Theorem 5.

Let

  • $X_1, \ldots, X_k$ be i.i.d. symbols which are emitted from a DMS according to a probability mass function $P_X$ whose support is a finite set $\mathcal{X}$ with $|\mathcal{X}| = n$;

  • Each symbol $X_i$ be mapped to $Y_i = f^\star(X_i)$, where $f^\star \in \mathcal{F}_{n,m}$ is the deterministic function (independent of the index $i$) with $m < n$, as specified by Steps 1–4 in the proof of Theorem 2 (borrowed from [12]);

  • Two UD fixed-to-variable source codes be used: one code encodes the sequences $x^k \in \mathcal{X}^k$, and the other code encodes their mappings $y^k \in \mathcal{Y}^k$; let the common size of the alphabets of both codes be $D$;

  • $\Lambda_k(\cdot)$ and $\bar{\Lambda}_k(\cdot)$ be, respectively, the scaled cumulant generating functions of the codeword lengths of the $k$-length sequences in $\mathcal{X}^k$ (see (72)) and their mappings to $\mathcal{Y}^k$.

Then, for every $\rho > 0$, the following holds for the difference between the scaled cumulant generating functions $\Lambda_k(\cdot)$ and $\bar{\Lambda}_k(\cdot)$:

  • (a) 

    There exists a UD source code for the sequences in $\mathcal{X}^k$ such that the upper bound in (79) is satisfied for every UD source code which operates on the sequences in $\mathcal{Y}^k$;

  • (b) 

    There exists a UD source code for the sequences in $\mathcal{Y}^k$ such that the lower bound in (82) holds for every UD source code for the sequences in $\mathcal{X}^k$; furthermore, the lower bound in (82) holds in general for every deterministic function $f \in \mathcal{F}_{n,m}$;

  • (c) 

    The gap between the upper and lower bounds in (79) and (82), respectively, is at most $\frac{\rho}{\log D}\, v\bigl(\tfrac{1}{1+\rho}\bigr) + \frac{2\rho}{k}$ (the function $v \colon (0,\infty) \to (0,\infty)$ is introduced in (33)), which is approximately $\frac{0.08607\,\rho}{(1+\rho)\log_2 D} + \frac{2\rho}{k}$;

  • (d) 

    The UD source codes in Items (a) and (b) for the sequences in $\mathcal{X}^k$ and $\mathcal{Y}^k$, respectively, can be constructed to be prefix codes by the algorithm in Remark 11.

Remark 11 (An Algorithm for Theorem 5 (d)).

A construction of the UD source codes for the sequences in $\mathcal{X}^k$ and $\mathcal{Y}^k$, whose existence is assured by Theorem 5 (a) and (b) respectively, is obtained by the following three-step algorithm, which also constructs them as prefix codes:

  • (1) 

    As a preparatory step, we first calculate the probability mass function $P_Y$ from the given probability mass function $P_X$ and the deterministic function $f^\star \in \mathcal{F}_{n,m}$ which is obtained by Steps 1–4 in the proof of Theorem 2; accordingly, $P_Y(y) = \sum_{x \in \mathcal{X} : f^\star(x) = y} P_X(x)$ for all $y \in \mathcal{Y}$. We then further calculate the probability mass functions of the i.i.d. sequences in $\mathcal{X}^k$ and $\mathcal{Y}^k$ (see (73)); recall that the number of types in $\mathcal{X}^k$ and $\mathcal{Y}^k$ is polynomial in $k$ (being upper bounded by $(k+1)^{n-1}$ and $(k+1)^{m-1}$, respectively), and the values of these probability mass functions are fixed over each type;

  • (2) 
    The sets of codeword lengths of the two UD source codes, for the sequences in $\mathcal{X}^k$ and $\mathcal{Y}^k$, can (separately) be designed according to the achievability proof in Campbell's paper (see [51] (p. 428)). More explicitly, let $\alpha := \frac{1}{1+\rho}$; for all $x^k \in \mathcal{X}^k$, let $\ell(x^k) \in \mathbb{N}$ be given by
    $$\ell(x^k) = \Bigl\lceil -\alpha \log_D P_{X^k}(x^k) + \log_D Q_k \Bigr\rceil \tag{83}$$
    with
    $$Q_k := \sum_{x^k \in \mathcal{X}^k} P_{X^k}^\alpha(x^k) = \left(\sum_{x \in \mathcal{X}} P_X^\alpha(x)\right)^{k}, \tag{84}$$
    and let $\bar{\ell}(y^k) \in \mathbb{N}$, for all $y^k \in \mathcal{Y}^k$, be given similarly to (83) and (84) by replacing $P_X$ with $P_Y$, and $P_{X^k}$ with $P_{Y^k}$. This yields codeword lengths for the two codes which fulfil (75) and (80), and both sets of lengths satisfy Kraft's inequality (a small numerical sketch of this length assignment is given after this remark);
  • (3) 

    The separate construction of two prefix codes (a.k.a. instantaneous codes) based on their given sets of codeword lengths $\{\ell(x^k)\}_{x^k \in \mathcal{X}^k}$ and $\{\bar{\ell}(y^k)\}_{y^k \in \mathcal{Y}^k}$, as determined in Step 2, is standard (see, e.g., the construction in the proof of [84] (Theorem 5.2.1)).
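A minimal numerical sketch of the length assignment in Step 2 (ours; only practical for small alphabets and short $k$, since it enumerates $\mathcal{X}^k$) computes the lengths (83)–(84) and verifies that they satisfy Kraft's inequality, so that a prefix code with these lengths exists (Step 3).

```python
import numpy as np
from itertools import product

def campbell_lengths(p, k, rho, D=2):
    """Codeword lengths per (83)-(84), enumerated over all length-k source sequences."""
    alpha = 1.0 / (1.0 + rho)
    p = np.asarray(p, float)
    logD = np.log(D)
    log_Qk = k * np.log((p ** alpha).sum()) / logD            # log_D Q_k, by (84)
    lengths = {}
    for xk in product(range(len(p)), repeat=k):
        log_P = sum(np.log(p[x]) for x in xk) / logD          # log_D P_{X^k}(x^k)
        lengths[xk] = int(np.ceil(-alpha * log_P + log_Qk))   # (83)
    return lengths

ell = campbell_lengths([0.5, 0.3, 0.2], k=3, rho=1.0, D=2)
print("Kraft sum:", sum(2.0 ** (-l) for l in ell.values()))   # <= 1, so a prefix code exists
```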

Theorem 5 is of interest since it provides upper and lower bounds on the reduction in the cumulant generating function of close-to-optimal UD source codes as a result of clustering the data, and Remark 11 suggests an algorithm to construct such UD codes which are also prefix codes. For long enough sequences (as $k \to \infty$), the upper and lower bounds on the difference between the scaled cumulant generating functions of the suggested source codes for the original and clustered data almost match (see (79) and (82)), being roughly equal to $\rho \bigl[ H_{\frac{1}{1+\rho}}(X) - H_{\frac{1}{1+\rho}}(\tilde{X}_m) \bigr]$ (with logarithms to base $D$, the alphabet size of the source codes); as $k \to \infty$, the gap between these upper and lower bounds is less than $\frac{0.08607}{\log_2 D}$. Furthermore, in view of (76),

$$\lim_{\rho \to 0} \frac{\Lambda_k(\rho) - \bar{\Lambda}_k(\rho)}{\rho} = \frac{1}{k}\Bigl(\mathbb{E}\bigl[\ell(X^k)\bigr] - \mathbb{E}\bigl[\bar{\ell}(Y^k)\bigr]\Bigr), \tag{85}$$

so, it follows from (4), (33), (79) and (82) that the difference between the average code lengths (normalized by $k$) of the original and clustered data satisfies

$$\frac{H(X) - H(\tilde{X}_m)}{\log D} - \frac{1}{k} \leq \frac{1}{k}\Bigl(\mathbb{E}\bigl[\ell(X^k)\bigr] - \mathbb{E}\bigl[\bar{\ell}(Y^k)\bigr]\Bigr) \leq \frac{H(X) - H(\tilde{X}_m) + 0.08607 \log 2}{\log D} + \frac{1}{k}, \tag{86}$$

where the gap between the upper and lower bounds is equal to $0.08607 \log_D 2 + \frac{2}{k}$.

Appendix A. Proof of Lemma 2

We first find the extreme values of $p_{\min}$ under the assumption that $P \in \mathcal{P}_n(\rho)$. If $\frac{p_{\max}}{p_{\min}} = 1$, then $P$ is the equiprobable distribution on $\mathcal{X}$ and $p_{\min} = \frac{1}{n}$. On the other hand, if $\frac{p_{\max}}{p_{\min}} = \rho$, then the minimal possible value of $p_{\min}$ is obtained when $P$ is the one-odd-mass distribution with $n-1$ masses equal to $\rho\, p_{\min}$ and a smaller mass equal to $p_{\min}$. The latter case yields $p_{\min} = \frac{1}{1+(n-1)\rho}$.

Let $\beta := p_{\min}$, so $\beta$ can take any value in the interval $\Gamma_\rho(n) = \bigl[\frac{1}{1+(n-1)\rho}, \frac{1}{n}\bigr]$. From Lemma 1, $P \prec Q_\beta$ and $Q_\beta \in \mathcal{P}_n(\rho)$, and the Schur-concavity of the Rényi entropy yields $H_\alpha(P) \geq H_\alpha(Q_\beta)$ for all $P \in \mathcal{P}_n(\rho)$ with $p_{\min} = \beta$. Minimizing $H_\alpha(P)$ over $P \in \mathcal{P}_n(\rho)$ can therefore be restricted to minimizing $H_\alpha(Q_\beta)$ over $\beta \in \Gamma_\rho(n)$.

Appendix B. Proof of Lemma 3

The sequence $\{c_\alpha^{(n)}(\rho)\}_{n \in \mathbb{N}}$ is non-negative since $H_\alpha(P) \leq \log n$ for all $P \in \mathcal{P}_n$. To prove (17),

$$0 \leq c_\alpha^{(n)}(\rho) = \log n - \min_{P \in \mathcal{P}_n(\rho)} H_\alpha(P) \tag{A1}$$
$$\leq \log n - \min_{P \in \mathcal{P}_n(\rho)} H_\infty(P) \tag{A2}$$
$$\leq \log n - \log\frac{n}{\rho} = \log\rho \tag{A3}$$

where (A2) holds since $H_\alpha(P)$ is monotonically decreasing in $\alpha$, and (A3) is due to (5) and $p_{\max} \leq \frac{\rho}{n}$.

Let $U_n$ denote the equiprobable probability mass function on $\{1, \ldots, n\}$. By the identity

$$D_\alpha(P \| U_n) = \log n - H_\alpha(P), \tag{A4}$$

and since, by Lemma 2, $H_\alpha(\cdot)$ attains its minimum over the set of probability mass functions $\mathcal{P}_n(\rho)$, it follows that $D_\alpha(\cdot \| U_n)$ attains its maximum over this set. Let $P^\star \in \mathcal{P}_n(\rho)$ be the probability mass function which achieves the minimum in (16); then, from (A4),

$$c_\alpha^{(n)}(\rho) = \max_{P \in \mathcal{P}_n(\rho)} D_\alpha(P \| U_n) \tag{A5}$$
$$= D_\alpha(P^\star \| U_n). \tag{A6}$$

Let $Q$ be the probability mass function which is defined on $\{1, \ldots, 2n\}$ as follows:

$$Q(i) = \begin{cases} \tfrac{1}{2} P^\star(i), & i \in \{1, \ldots, n\}, \\ \tfrac{1}{2} P^\star(i-n), & i \in \{n+1, \ldots, 2n\}. \end{cases} \tag{A7}$$

Since, by assumption, $P^\star \in \mathcal{P}_n(\rho)$, it is easy to verify from (A7) that

$$Q \in \mathcal{P}_{2n}(\rho). \tag{A8}$$

Furthermore, from (A7),

$$D_\alpha(Q \| U_{2n}) = \frac{1}{\alpha-1} \log \sum_{i=1}^{2n} Q(i)^\alpha \left(\frac{1}{2n}\right)^{1-\alpha} \tag{A9}$$
$$= \frac{1}{\alpha-1} \log\!\left( \frac{1}{2} \sum_{i=1}^{n} P^\star(i)^\alpha \left(\frac{1}{n}\right)^{1-\alpha} + \frac{1}{2} \sum_{i=n+1}^{2n} P^\star(i-n)^\alpha \left(\frac{1}{n}\right)^{1-\alpha} \right) \tag{A10}$$
$$= \frac{1}{\alpha-1} \log \sum_{i=1}^{n} P^\star(i)^\alpha \left(\frac{1}{n}\right)^{1-\alpha} \tag{A11}$$
$$= D_\alpha(P^\star \| U_n). \tag{A12}$$

Combining (A5)–(A12) yields

$$c_\alpha^{(2n)}(\rho) = \max_{Q' \in \mathcal{P}_{2n}(\rho)} D_\alpha(Q' \| U_{2n}) \tag{A13}$$
$$\geq D_\alpha(Q \| U_{2n}) \tag{A14}$$
$$= D_\alpha(P^\star \| U_n) \tag{A15}$$
$$= c_\alpha^{(n)}(\rho), \tag{A16}$$

proving (18). Finally, in view of (A5), $c_\alpha^{(n)}(\rho)$ is monotonically increasing in $\alpha$ since so is the Rényi divergence of order $\alpha$ (see [87] (Theorem 3)).

Appendix C. Proof of Lemma 4

From Lemma 2, the minimizing distribution of $H_\alpha$ is given by $Q_\beta \in \mathcal{P}_n(\rho)$ where

$$Q_\beta = \bigl(\underbrace{\rho\beta, \ldots, \rho\beta}_{i},\; 1-(n+i\rho-i-1)\beta,\; \underbrace{\beta, \ldots, \beta}_{n-i-1}\bigr) \tag{A17}$$

with $\beta \in \bigl[\frac{1}{1+(n-1)\rho}, \frac{1}{n}\bigr]$, and $1-(n+i\rho-i-1)\beta \leq \rho\beta \leq \frac{\rho}{n}$. It therefore follows that the influence of the middle probability mass of $Q_\beta$ on $H_\alpha(Q_\beta)$ tends to zero as $n \to \infty$. Therefore, in this asymptotic case, one can instead minimize $H_\alpha(\tilde{Q}_m)$ where

$$\tilde{Q}_m = \bigl(\underbrace{\rho\beta, \ldots, \rho\beta}_{m},\; \underbrace{\beta, \ldots, \beta}_{n-m}\bigr) \tag{A18}$$

with the free parameter $m \in \{1, \ldots, n\}$ and $\beta = \frac{1}{n + m(\rho-1)}$ (so that the total mass of $\tilde{Q}_m$ is equal to 1).

For $\alpha \in (0,1) \cup (1,\infty)$, a straightforward calculation shows that

$$H_\alpha(\tilde{Q}_m) = \frac{1}{1-\alpha} \log \sum_{j=1}^{n} \tilde{Q}_m^\alpha(j) = \log n - \frac{1}{\alpha-1} \log \frac{1 + \frac{m}{n}(\rho^\alpha - 1)}{\bigl(1 + \frac{m}{n}(\rho - 1)\bigr)^\alpha}, \tag{A19}$$

and, by letting $n \to \infty$, the limit of the sequence $\{c_\alpha^{(n)}(\rho)\}_{n \in \mathbb{N}}$ exists and is equal to

$$c_\alpha^{(\infty)}(\rho) := \lim_{n \to \infty} c_\alpha^{(n)}(\rho) = \lim_{n \to \infty} \Bigl[ \log n - \min_{m \in \{1, \ldots, n\}} H_\alpha(\tilde{Q}_m) \Bigr] = \lim_{n \to \infty} \max_{m \in \{1, \ldots, n\}} \frac{1}{\alpha-1} \log \frac{1 + \frac{m}{n}(\rho^\alpha - 1)}{\bigl(1 + \frac{m}{n}(\rho - 1)\bigr)^\alpha} = \max_{x \in [0,1]} \frac{1}{\alpha-1} \log \frac{1 + (\rho^\alpha - 1)x}{\bigl(1 + (\rho - 1)x\bigr)^\alpha}. \tag{A20}$$

Let $f_\alpha \colon [0,1] \to \mathbb{R}$ be given by

$$f_\alpha(x) = \frac{1 + (\rho^\alpha - 1)x}{\bigl(1 + (\rho-1)x\bigr)^\alpha}, \quad x \in [0,1]. \tag{A21}$$

Then, $f_\alpha(0) = f_\alpha(1) = 1$, and a straightforward calculation shows that the derivative $f_\alpha'$ vanishes if and only if

$$x = x^\star := \frac{1 + \alpha(\rho-1) - \rho^\alpha}{(1-\alpha)(\rho-1)(\rho^\alpha-1)}. \tag{A22}$$

We rely here on a specialized version of the mean value theorem, known as Rolle's theorem, which states that any real-valued differentiable function that attains equal values at two distinct points must have a point between them where its first derivative is zero. By Rolle's theorem, and due to the uniqueness of the point $x^\star$ in (A22), it follows that $x^\star \in (0,1)$. Substituting (A22) into (A20) gives (20). Taking the limit of (20) as $\alpha \to \infty$ gives the result in (21).

In the limit where $\alpha \to 1$, the Rényi entropy of order $\alpha$ tends to the Shannon entropy. Hence, letting $\alpha \to 1$ in (20), it follows that for the Shannon entropy

$$c_1^{(\infty)}(\rho) = \lim_{\alpha \to 1} c_\alpha^{(\infty)}(\rho) = \lim_{\alpha \to 1} \left[ \frac{1}{\alpha-1} \log\!\left(1 + \frac{1+\alpha(\rho-1)-\rho^\alpha}{(1-\alpha)(\rho-1)}\right) - \frac{\alpha}{\alpha-1} \log\!\left(1 + \frac{1+\alpha(\rho-1)-\rho^\alpha}{(1-\alpha)(\rho^\alpha-1)}\right) \right] = \frac{\rho\log\rho}{\rho-1} - \log e - \log\frac{\rho\ln\rho}{\rho-1}, \tag{A23}$$

where (A23) follows by invoking L'Hôpital's rule. This proves (22).

From (17)–(19), we get $0 \leq c_\alpha^{(n)}(\rho) \leq c_\alpha^{(\infty)}(\rho)$. Since $c_\alpha^{(n)}(\rho)$ is monotonically increasing in $\alpha \in [0,\infty]$ for every $n \in \mathbb{N}$, so is $c_\alpha^{(\infty)}(\rho)$; hence, (21) yields $c_\alpha^{(\infty)}(\rho) \leq \log\rho$. This proves (24).

Funding

This research received no external funding.

Conflicts of Interest

The author declares no conflict of interest.

References

  • 1.Hardy G.H., Littlewood J.E., Pólya G. Inequalities. 2nd ed. Cambridge University Press; Cambridge, UK: 1952. [Google Scholar]
  • 2.Marshall A.W., Olkin I., Arnold B.C. Inequalities: Theory of Majorization and Its Applications. 2nd ed. Springer; New York, NY, USA: 2011. [Google Scholar]
  • 3.Arnold B.C. Majorization: Here, there and everywhere. Stat. Sci. 2007;22:407–413. doi: 10.1214/0883423060000000097. [DOI] [Google Scholar]
  • 4.Arnold B.C., Sarabia J.M. Majorization and the Lorenz Order with Applications in Applied Mathematics and Economics. Springer; New York, NY, USA: 2018. Statistics for Social and Behavioral Sciences. [Google Scholar]
  • 5.Cicalese F., Gargano L., Vaccaro U. Information theoretic measures of distances and their econometric applications; Proceedings of the 2013 IEEE International Symposium on Information Theory; Istanbul, Turkey. 7–12 July 2013; pp. 409–413. [Google Scholar]
  • 6.Steele J.M. The Cauchy-Schwarz Master Class. Cambridge University Press; Cambridge, UK: 2004. [Google Scholar]
  • 7.Bhatia R. Matrix Analysis Graduate Texts in Mathematics. Springer; New York, NY, USA: 1997. [Google Scholar]
  • 8.Horn R.A., Johnson C.R. Matrix Analysis. 2nd ed. Cambridge University Press; Cambridge, UK: 2013. [Google Scholar]
  • 9.Ben-Bassat M., Raviv J. Rényi’s entropy and probability of error. IEEE Trans. Inf. Theory. 1978;24:324–331. doi: 10.1109/TIT.1978.1055890. [DOI] [Google Scholar]
  • 10.Cicalese F., Vaccaro U. Bounding the average length of optimal source codes via majorization theory. IEEE Trans. Inf. Theory. 2004;50:633–637. doi: 10.1109/TIT.2004.825026. [DOI] [Google Scholar]
  • 11.Cicalese F., Gargano L., Vaccaro U. How to find a joint probability distribution with (almost) minimum entropy given the marginals; Proceedings of the 2017 IEEE International Symposium on Information Theory; Aachen, Germany. 25–30 June 2017; pp. 2178–2182. [Google Scholar]
  • 12.Cicalese F., Gargano L., Vaccaro U. Bounds on the entropy of a function of a random variable and their applications. IEEE Trans. Inf. Theory. 2018;64:2220–2230. doi: 10.1109/TIT.2017.2787181. [DOI] [Google Scholar]
  • 13.Cicalese F., Vaccaro U. Maximum entropy interval aggregations; Proceedings of the 2018 IEEE International Symposium on Information Theory; Vail, CO, USA. 17–20 June 2018; pp. 1764–1768. [Google Scholar]
  • 14.Harremoës P. A new look on majorization; Proceedings of the 2004 IEEE International Symposium on Information Theory and Its Applications; Parma, Italy. 10–13 October 2004; pp. 1422–1425. [Google Scholar]
  • 15.Ho S.W., Yeung R.W. The interplay between entropy and variational distance. IEEE Trans. Inf. Theory. 2010;56:5906–5929. doi: 10.1109/TIT.2010.2080452. [DOI] [Google Scholar]
  • 16.Ho S.W., Verdú S. On the interplay between conditional entropy and error probability. IEEE Trans. Inf. Theory. 2010;56:5930–5942. doi: 10.1109/TIT.2010.2080891. [DOI] [Google Scholar]
  • 17.Ho S.W., Verdú S. Convexity/concavity of the Rényi entropy and α-mutual information; Proceedings of the 2015 IEEE International Symposium on Information Theory; Hong Kong, China. 14–19 June 2015; pp. 745–749. [Google Scholar]
  • 18.Joe H. Majorization, entropy and paired comparisons. Ann. Stat. 1988;16:915–925. doi: 10.1214/aos/1176350843. [DOI] [Google Scholar]
  • 19.Joe H. Majorization and divergence. J. Math. Anal. Appl. 1990;148:287–305. doi: 10.1016/0022-247X(90)90002-W. [DOI] [Google Scholar]
  • 20.Koga H. Characterization of the smooth Rényi entropy using majorization; Proceedings of the 2013 IEEE Information Theory Workshop; Seville, Spain. 9–13 September 2013; pp. 604–608. [Google Scholar]
  • 21.Puchala Z., Rudnicki L., Zyczkowski K. Majorization entropic uncertainty relations. J. Phys. A Math. Theor. 2013;46:1–12. doi: 10.1088/1751-8113/46/27/272002. [DOI] [Google Scholar]
  • 22.Sason I., Verdú S. Arimoto-Rényi conditional entropy and Bayesian M-ary hypothesis testing. IEEE Trans. Inf. Theory. 2018;64:4–25. doi: 10.1109/TIT.2017.2757496. [DOI] [Google Scholar]
  • 23.Verdú S. Information Theory. 2018. in preparation.
  • 24.Witsenhhausen H.S. Some aspects of convexity useful in information theory. IEEE Trans. Inf. Theory. 1980;26:265–271. doi: 10.1109/TIT.1980.1056173. [DOI] [Google Scholar]
  • 25.Xi B., Wang S., Zhang T. Information Computing and Applications. Volume 7030. Springer; New York, NY, USA: 2011. Schur-convexity on generalized information entropy and its applications; pp. 153–160. Lecture Notes in Computer Science. [Google Scholar]
  • 26.Inaltekin H., Hanly S.V. Optimality of binary power control for the single cell uplink. IEEE Trans. Inf. Theory. 2012;58:6484–6496. doi: 10.1109/TIT.2012.2206071. [DOI] [Google Scholar]
  • 27.Jorshweick E., Bosche H. Majorization and matrix-monotone functions in wireless communications. Found. Trends Commun. Inf. Theory. 2006;3:553–701. doi: 10.1561/0100000026. [DOI] [Google Scholar]
  • 28.Palomar D.P., Jiang Y. MIMO transceiver design via majorization theory. Found. Trends Commun. Inf. Theory. 2006;3:331–551. doi: 10.1561/0100000018. [DOI] [Google Scholar]
  • 29.Roventa I. Recent Trends in Majorization Theory and Optimization: Applications to Wireless Communications. Editura Pro Universitaria & Universitaria Craiova; Bucharest, Romania: 2015. [Google Scholar]
  • 30.Sezgin A., Jorswieck E.A. Applications of majorization theory in space-time cooperative communications. In: Uysal M., editor. Cooperative Communications for Improved Wireless Network Transmission: Framework for Virtual Antenna Array Applications. IGI Global; Hershey, PA, USA: 2010. pp. 429–470. Information Science Reference. [Google Scholar]
  • 31.Viswanath P., Anantharam V. Optimal sequences and sum capacity of synchronous CDMA systems. IEEE Trans. Inf. Theory. 1999;45:1984–1993. doi: 10.1109/18.782121. [DOI] [Google Scholar]
  • 32.Viswanath P., Anantharam V., Tse D.N.C. Optimal sequences, power control, and user capacity of synchronous CDMA systems with linear MMSE multiuser receivers. IEEE Trans. Inf. Theory. 1999;45:1968–1983. doi: 10.1109/18.782119. [DOI] [Google Scholar]
  • 33.Viswanath P., Anantharam V. Optimal sequences for CDMA under colored noise: A Schur-saddle function property. IEEE Trans. Inf. Theory. 2002;48:1295–1318. doi: 10.1109/TIT.2002.1003823. [DOI] [Google Scholar]
  • 34.Rényi A. On measures of entropy and information; Proceedings of the 4th Berkeley Symposium on Probability Theory and Mathematical Statistics; Berkeley, CA, USA. 8–9 August 1961; pp. 547–561. [Google Scholar]
  • 35.Arikan E. An inequality on guessing and its application to sequential decoding. IEEE Trans. Inf. Theory. 1996;42:99–105. doi: 10.1109/18.481781. [DOI] [Google Scholar]
  • 36.Arikan E., Merhav N. Guessing subject to distortion. IEEE Trans. Inf. Theory. 1998;44:1041–1056. doi: 10.1109/18.669158. [DOI] [Google Scholar]
  • 37.Arikan E., Merhav N. Joint source-channel coding and guessing with application to sequential decoding. IEEE Trans. Inf. Theory. 1998;44:1756–1769. doi: 10.1109/18.705557. [DOI] [Google Scholar]
  • 38.Burin A., Shayevitz O. Reducing guesswork via an unreliable oracle. IEEE Trans. Inf. Theory. 2018;64:6941–6953. doi: 10.1109/TIT.2018.2856190. [DOI] [Google Scholar]
  • 39.Kuzuoka S. On the conditional smooth Rényi entropy and its applications in guessing and source coding. arXiv. 2018. 1810.09070
40. Merhav N., Arikan E. The Shannon cipher system with a guessing wiretapper. IEEE Trans. Inf. Theory. 1999;45:1860–1866. doi: 10.1109/18.782106.
41. Sundaresan R. Guessing based on length functions; Proceedings of the 2007 IEEE International Symposium on Information Theory; Nice, France. 24–29 June 2007; pp. 716–719.
42. Salamatian S., Beirami A., Cohen A., Médard M. Centralized versus decentralized multi-agent guesswork; Proceedings of the 2017 IEEE International Symposium on Information Theory; Aachen, Germany. 25–30 June 2017; pp. 2263–2267.
43. Sundaresan R. Guessing under source uncertainty. IEEE Trans. Inf. Theory. 2007;53:269–287. doi: 10.1109/TIT.2006.887466.
44. Bracher A., Lapidoth A., Pfister C. Distributed task encoding; Proceedings of the 2017 IEEE International Symposium on Information Theory; Aachen, Germany. 25–30 June 2017; pp. 1993–1997.
45. Bunte C., Lapidoth A. Encoding tasks and Rényi entropy. IEEE Trans. Inf. Theory. 2014;60:5065–5076. doi: 10.1109/TIT.2014.2329490.
46. Shayevitz O. On Rényi measures and hypothesis testing; Proceedings of the 2011 IEEE International Symposium on Information Theory; Saint Petersburg, Russia. 31 July–5 August 2011; pp. 800–804.
47. Tomamichel M., Hayashi M. Operational interpretation of Rényi conditional mutual information via composite hypothesis testing against Markov distributions; Proceedings of the 2016 IEEE International Symposium on Information Theory; Barcelona, Spain. 10–15 July 2016; pp. 585–589.
48. Harsha P., Jain R., McAllester D., Radhakrishnan J. The communication complexity of correlation. IEEE Trans. Inf. Theory. 2010;56:438–449. doi: 10.1109/TIT.2009.2034824.
49. Liu J., Verdú S. Rejection sampling and noncausal sampling under moment constraints; Proceedings of the 2018 IEEE International Symposium on Information Theory; Vail, CO, USA. 17–22 June 2018; pp. 1565–1569.
50. Yu L., Tan V.Y.F. Wyner’s common information under Rényi divergence measures. IEEE Trans. Inf. Theory. 2018;64:3616–3632. doi: 10.1109/TIT.2018.2806569.
51. Campbell L.L. A coding theorem and Rényi’s entropy. Inf. Control. 1965;8:423–429. doi: 10.1016/S0019-9958(65)90332-3.
52. Courtade T., Verdú S. Cumulant generating function of codeword lengths in optimal lossless compression; Proceedings of the 2014 IEEE International Symposium on Information Theory; Honolulu, HI, USA. 29 June–4 July 2014; pp. 2494–2498.
53. Courtade T., Verdú S. Variable-length lossy compression and channel coding: Non-asymptotic converses via cumulant generating functions; Proceedings of the 2014 IEEE International Symposium on Information Theory; Honolulu, HI, USA. 29 June–4 July 2014; pp. 2499–2503.
54. Hayashi M., Tan V.Y.F. Equivocations, exponents, and second-order coding rates under various Rényi information measures. IEEE Trans. Inf. Theory. 2017;63:975–1005. doi: 10.1109/TIT.2016.2636154.
55. Kuzuoka S. On the smooth Rényi entropy and variable-length source coding allowing errors; Proceedings of the 2016 IEEE International Symposium on Information Theory; Barcelona, Spain. 10–15 July 2016; pp. 745–749.
56. Sason I., Verdú S. Improved bounds on lossless source coding and guessing moments via Rényi measures. IEEE Trans. Inf. Theory. 2018;64:4323–4346. doi: 10.1109/TIT.2018.2803162.
57. Tan V.Y.F., Hayashi M. Analysis of remaining uncertainties and exponents under various conditional Rényi entropies. IEEE Trans. Inf. Theory. 2018;64:3734–3755. doi: 10.1109/TIT.2018.2792495.
58. Tyagi H. Coding theorems using Rényi information measures; Proceedings of the 2017 IEEE Twenty-Third National Conference on Communications; Chennai, India. 2–4 March 2017; pp. 1–6.
59. Csiszár I. Generalized cutoff rates and Rényi information measures. IEEE Trans. Inf. Theory. 1995;41:26–34. doi: 10.1109/18.370121.
60. Polyanskiy Y., Verdú S. Arimoto channel coding converse and Rényi divergence; Proceedings of the Forty-Eighth Annual Allerton Conference on Communication, Control and Computing; Monticello, IL, USA. 29 September–1 October 2010; pp. 1327–1333.
61. Sason I. On the Rényi divergence, joint range of relative entropies, and a channel coding theorem. IEEE Trans. Inf. Theory. 2016;62:23–34. doi: 10.1109/TIT.2015.2504100.
62. Yu L., Tan V.Y.F. Rényi resolvability and its applications to the wiretap channel. In: Lecture Notes in Computer Science, Proceedings of the 10th International Conference on Information Theoretic Security, Hong Kong, China, 29 November–2 December 2017; Volume 10681. Springer; New York, NY, USA: 2017; pp. 208–233.
63. Arimoto S. On the converse to the coding theorem for discrete memoryless channels. IEEE Trans. Inf. Theory. 1973;19:357–359. doi: 10.1109/TIT.1973.1055007.
64. Arimoto S. Information measures and capacity of order α for discrete memoryless channels. In: Csiszár I., Elias P., editors. Proceedings of the 2nd Colloquium on Information Theory; Keszthely, Hungary. 25–30 August 1975; Amsterdam, The Netherlands: Colloquia Mathematica Societatis János Bolyai; 1977. pp. 41–52.
65. Dalai M. Lower bounds on the probability of error for classical and classical-quantum channels. IEEE Trans. Inf. Theory. 2013;59:8027–8056. doi: 10.1109/TIT.2013.2283794.
66. Leditzky F., Wilde M.M., Datta N. Strong converse theorems using Rényi entropies. J. Math. Phys. 2016;57:1–33. doi: 10.1063/1.4960099.
67. Mosonyi M., Ogawa T. Quantum hypothesis testing and the operational interpretation of the quantum Rényi relative entropies. Commun. Math. Phys. 2015;334:1617–1648. doi: 10.1007/s00220-014-2248-x.
68. Simic S. Jensen’s inequality and new entropy bounds. Appl. Math. Lett. 2009;22:1262–1265. doi: 10.1016/j.aml.2009.01.040.
69. Jelinek F., Schneider K.S. On variable-length-to-block coding. IEEE Trans. Inf. Theory. 1972;18:765–774. doi: 10.1109/TIT.1972.1054899.
70. Garey M.R., Johnson D.S. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company; New York, NY, USA: 1979.
71. Boztaş S. Comments on “An inequality on guessing and its application to sequential decoding”. IEEE Trans. Inf. Theory. 1997;43:2062–2063. doi: 10.1109/18.641578.
72. Bracher A., Hof E., Lapidoth A. Guessing attacks on distributed-storage systems; Proceedings of the 2015 IEEE International Symposium on Information Theory; Hong Kong, China. 14–19 June 2015; pp. 1585–1589.
73. Christiansen M.M., Duffy K.R. Guesswork, large deviations, and Shannon entropy. IEEE Trans. Inf. Theory. 2013;59:796–802. doi: 10.1109/TIT.2012.2219036.
74. Hanawal M.K., Sundaresan R. Guessing revisited: A large deviations approach. IEEE Trans. Inf. Theory. 2011;57:70–78. doi: 10.1109/TIT.2010.2090221.
75. Hanawal M.K., Sundaresan R. The Shannon cipher system with a guessing wiretapper: General sources. IEEE Trans. Inf. Theory. 2011;57:2503–2516. doi: 10.1109/TIT.2011.2110870.
76. Huleihel W., Salamatian S., Médard M. Guessing with limited memory; Proceedings of the 2017 IEEE International Symposium on Information Theory; Aachen, Germany. 25–30 June 2017; pp. 2258–2262.
77. Massey J.L. Guessing and entropy; Proceedings of the 1994 IEEE International Symposium on Information Theory; Trondheim, Norway. 27 June–1 July 1994; p. 204.
78. McEliece R.J., Yu Z. An inequality on entropy; Proceedings of the 1995 IEEE International Symposium on Information Theory; Whistler, BC, Canada. 17–22 September 1995; p. 329.
79. Pfister C.E., Sullivan W.G. Rényi entropy, guesswork moments and large deviations. IEEE Trans. Inf. Theory. 2004;50:2794–2800. doi: 10.1109/TIT.2004.836665.
80. De Santis A., Gaggia A.G., Vaccaro U. Bounds on entropy in a guessing game. IEEE Trans. Inf. Theory. 2001;47:468–473. doi: 10.1109/18.904564.
81. Yona Y., Diggavi S. The effect of bias on the guesswork of hash functions; Proceedings of the 2017 IEEE International Symposium on Information Theory; Aachen, Germany. 25–30 June 2017; pp. 2253–2257.
82. Gan G., Ma C., Wu J. Data Clustering: Theory, Algorithms, and Applications. SIAM; Philadelphia, PA, USA: 2007. (ASA-SIAM Series on Statistics and Applied Probability).
83. Campbell L.L. Definition of entropy by means of a coding problem. Probab. Theory Relat. Fields. 1966;6:113–118. doi: 10.1007/BF00537132.
84. Cover T.M., Thomas J.A. Elements of Information Theory. 2nd ed. John Wiley & Sons; Hoboken, NJ, USA: 2006.
85. Shannon C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948;27:379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x.
86. Kontoyiannis I., Verdú S. Optimal lossless data compression: Non-asymptotics and asymptotics. IEEE Trans. Inf. Theory. 2014;60:777–795. doi: 10.1109/TIT.2013.2291007.
87. Van Erven T., Harremoës P. Rényi divergence and Kullback-Leibler divergence. IEEE Trans. Inf. Theory. 2014;60:3797–3820. doi: 10.1109/TIT.2014.2320500.
