Entropy. 2018 May 17;20(5):371. doi: 10.3390/e20050371

Normal Laws for Two Entropy Estimators on Infinite Alphabets

Chen Chen 1, Michael Grabchak 1, Ann Stewart 1, Jialin Zhang 1,*, Zhiyi Zhang 1
PMCID: PMC7512892  PMID: 33265461

Abstract

This paper offers sufficient conditions for the Miller–Madow estimator and the jackknife estimator of entropy to have respective asymptotic normalities on countably infinite alphabets.

Keywords: entropy, nonparametric estimator, Miller–Madow estimator, jackknife estimator, asymptotic normality

MSC: Primary 62F10, 62F12, 62G05, 62G20

1. Introduction

Let $\mathcal{X}=\{\ell_k; k\ge 1\}$ be a finite or countably infinite alphabet, let $\mathbf{p}=\{p_k; k\ge 1\}$ be a probability distribution on $\mathcal{X}$, and define $K=\sum_{k\ge 1}1[p_k>0]$, where $1[\cdot]$ is the indicator function, to be the effective cardinality of $\mathcal{X}$ under $\mathbf{p}$. An important quantity associated with $\mathbf{p}$ is entropy, which is defined by [1] as

$H = -\sum_{k\ge 1} p_k\ln p_k.$ (1)

Here and throughout, we adopt the convention that 0ln0=0.

Many properties of entropy and related quantities are discussed in [2]. The problem of statistical estimation of entropy has a long history (see the survey paper [3] or the recent book [4]). It is well-known that no unbiased estimators of entropy exist, and, for this reason, much energy has been focused on deriving estimators with relatively little bias (see [5] and the references therein for a discussion of some (but far from all) of these). Perhaps the most commonly used estimator is the plug-in. Its theoretical properties have been studied going back, at least, to [6], where conditions for consistency and asymptotic normality, in the case of finite alphabets, were derived. It would be almost fifty years before corresponding conditions for the countably infinite case would appear in the literature. Specifically, consistency, both in terms of almost sure and $L^2$ convergence, was verified in [7]. Later, sufficient conditions for asymptotic normality were derived in two steps in [3,8].

Despite a simple form and nice theoretical properties, the plug-in suffers from large finite sample bias, which has led to the development of modifications that aim to reduce this bias. Two of the most popular are the Miller–Madow estimator of [6] and the jackknife estimator of [9]. Theoretical properties of these have not been studied, as extensively, in the literature. In this paper, we give sufficient conditions for the asymptotic normality of these two estimators. This is important for deriving confidence intervals and hypothesis tests, and it immediately implies consistency (see e.g., [4]).

We begin by introducing some notation. We say that a distribution $\mathbf{p}=\{p_k;k\ge 1\}$ is uniform if and only if its effective cardinality $K<\infty$ and, for each $k=1,2,\ldots$, either $p_k=1/K$ or $p_k=0$. We write $f\sim g$ to denote $\lim_{n\to\infty}f(n)/g(n)=1$ and we write $f=O(g(n))$ to denote $\limsup_{n\to\infty}f(n)/g(n)<\infty$. Furthermore, we write $\stackrel{L}{\to}$ to denote convergence in law and $\stackrel{p}{\to}$ to denote convergence in probability. If $a$ and $b$ are real numbers, we write $a\vee b$ to denote the maximum of $a$ and $b$. When it is not specified, all limits are assumed to be taken as $n\to\infty$.

Let $X_1,\ldots,X_n$ be independent and identically distributed (iid) random variables on $\mathcal{X}$ under $\mathbf{p}$. Let $\{Y_k;k\ge 1\}$ be the observed letter counts in the sample, i.e., $Y_k=\sum_{i=1}^{n}1[X_i=\ell_k]$, and let $\hat{\mathbf{p}}=\{\hat{p}_k;k\ge 1\}$, where $\hat{p}_k=Y_k/n$, be the corresponding relative frequencies. Perhaps the most intuitive estimator of $H$ is the plug-in, which is given by

$\hat{H} = -\sum_{k\ge 1}\hat{p}_k\ln\hat{p}_k.$ (2)
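To make the definition concrete, the following minimal sketch (ours, not from the paper) computes the plug-in estimator (2) from an iid sample given as a list of hashable symbols; the function name and input format are illustrative assumptions.

    # Minimal sketch of the plug-in estimator (2); names are illustrative.
    import math
    from collections import Counter

    def plug_in_entropy(sample):
        """H_hat = -sum_k p_hat_k * ln(p_hat_k); unseen letters contribute 0."""
        n = len(sample)
        counts = Counter(sample)              # observed letter counts Y_k
        return -sum((y / n) * math.log(y / n) for y in counts.values())

    # Example: a small sample on a three-letter alphabet.
    print(plug_in_entropy(["a", "b", "a", "c", "a", "b"]))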

When the effective cardinality, $K$, is finite, [10] showed that the bias of $\hat{H}$ is

$E(\hat{H})-H = -\frac{K-1}{2n} + \frac{1}{12n^2}\left(1-\sum_{k=1}^{K}\frac{1}{p_k}\right) + O\left(n^{-3}\right).$ (3)

One of the simplest and earliest approaches aiming to reduce the bias of $\hat{H}$ is to estimate the first order term. Specifically, let $\hat{m}=\sum_{k\ge 1}1[Y_k>0]$ be the number of letters observed in the sample and consider an estimator of the form,

$\hat{H}_{MM} = \hat{H} + \frac{\hat{m}-1}{2n}.$ (4)

This estimator is often attributed to [6] and is known as the Miller–Madow estimator. Note that, for finite $K$,

$E\left(\frac{\hat{m}-1}{2n}\right) = \frac{K-1}{2n} - \frac{\sum_{k}(1-p_k)^n}{2n}.$

Since $\sum_{k}(1-p_k)^n \le K(1-p_{*})^n$, where $p_{*}=\min\{p_k: p_k>0\}$, decays exponentially fast, it follows that, for finite $K$, the bias of $\hat{H}_{MM}$ is

$E(\hat{H}_{MM})-H = \frac{1}{12n^2}\left(1-\sum_{k=1}^{K}\frac{1}{p_k}\right) + O\left(n^{-3}\right).$
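In more detail (our own intermediate bookkeeping, combining (3) with the preceding display):

$E(\hat{H}_{MM})-H = \left[E(\hat{H})-H\right] + E\left(\frac{\hat{m}-1}{2n}\right) = \frac{1}{12n^2}\left(1-\sum_{k=1}^{K}\frac{1}{p_k}\right) + O\left(n^{-3}\right) - \frac{\sum_{k}(1-p_k)^n}{2n},$

and the last term decays exponentially fast, so it is absorbed into the $O(n^{-3})$ remainder.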

Among the many estimators in the literature aimed at reducing bias in entropy estimation, the Miller–Madow estimator is one of the most commonly used. Its popularity is due to its simplicity, its intuitive appeal, and, more importantly, its good performance across a wide range of different distributions including those on countably infinite alphabets. See, for instance, the simulation study in [5].
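A corresponding sketch of the Miller–Madow estimator (4), again with illustrative names of our own, only needs the number of distinct letters observed in addition to the plug-in:

    # Minimal sketch of the Miller-Madow estimator (4): plug-in estimator
    # plus the first-order correction (m_hat - 1) / (2n).
    import math
    from collections import Counter

    def miller_madow_entropy(sample):
        n = len(sample)
        counts = Counter(sample)
        m_hat = len(counts)                   # number of distinct letters observed
        plug_in = -sum((y / n) * math.log(y / n) for y in counts.values())
        return plug_in + (m_hat - 1) / (2 * n)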

The jackknife entropy estimator is another commonly used estimator designed to reduce the bias of the plug-in. It is calculated in three steps:

  1. for each $i\in\{1,2,\ldots,n\}$ construct $\hat{H}_{(i)}$, which is a plug-in estimator based on a sub-sample of size $n-1$ obtained by leaving the $i$th observation out;

  2. obtain $\hat{H}^{(i)} = n\hat{H} - (n-1)\hat{H}_{(i)}$ for $i=1,\ldots,n$; and then

  3. compute the jackknife estimator
    $\hat{H}_{JK} = \frac{\sum_{i=1}^{n}\hat{H}^{(i)}}{n}.$ (5)

Equivalently, (5) can be written as

$\hat{H}_{JK} = n\hat{H} - (n-1)\frac{\sum_{i=1}^{n}\hat{H}_{(i)}}{n}.$ (6)
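The jackknife estimator can also be computed without forming all $n$ sub-samples explicitly: the leave-one-out plug-in $\hat{H}_{(i)}$ depends only on the letter of the deleted observation, so the $n$ terms can be grouped by letter. The sketch below (ours; function names are illustrative) uses this grouping together with formula (6).

    # Minimal sketch of the jackknife estimator via (6).  H_(i) depends only
    # on which letter observation i is, so the leave-one-out estimates are
    # grouped by letter instead of looping over all n observations.
    import math
    from collections import Counter

    def _plug_in(counts, n):
        return -sum((y / n) * math.log(y / n) for y in counts.values() if y > 0)

    def jackknife_entropy(sample):
        n = len(sample)
        counts = Counter(sample)
        h_full = _plug_in(counts, n)
        loo_sum = 0.0                         # sum over i of H_(i)
        for letter, y in counts.items():
            reduced = dict(counts)
            reduced[letter] = y - 1           # delete one observation of this letter
            loo_sum += y * _plug_in(reduced, n - 1)
        return n * h_full - (n - 1) * loo_sum / n   # formula (6)

Averaging the pseudo-values in (5) gives the same number.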

The jackknife estimator of entropy was first described by [9]. From (3), it may be verified that, when $K<\infty$, the bias of $\hat{H}_{JK}$ is

$E\left(\hat{H}_{JK}\right)-H = O\left(n^{-2}\right).$ (7)

Both the Miller–Madow and the jackknife estimators are adjusted versions of the plug-in. When the effective cardinality is finite, i.e., $K<\infty$, the asymptotic normalities of both can be easily verified. A question of theoretical interest is whether these normalities still hold when the effective cardinality is countably infinite. In this paper, we give sufficient conditions for $\sqrt{n}(\hat{H}_{MM}-H)$ and $\sqrt{n}(\hat{H}_{JK}-H)$ to have asymptotic normalities on countably infinite alphabets and provide several illustrative examples. The rest of the paper is organized as follows. Our main results for both the Miller–Madow and the jackknife estimators are given in Section 2. A small simulation study is given in Section 3. This is followed by a brief discussion in Section 4. Proofs are postponed to Section 5.

2. Main Results

We begin by recalling a sufficient condition due to [8] for the asymptotic normality of the plug-in estimator.

Condition 1.

The distribution, $\mathbf{p}=\{p_k;k\ge 1\}$, satisfies

$\sum_{k\ge 1}p_k\ln^2 p_k < \infty,$ (8)

and there exists an integer-valued function $K(n)$ such that, as $n\to\infty$,

  • 1. $K(n)\to\infty$,

  • 2. $K(n)/n\to 0$, and

  • 3. $\sqrt{n}\sum_{k\ge K(n)}p_k\ln p_k\to 0$.

Note that, by Jensen’s inequality (see e.g., [2]), (8) implies that

$H^2 = \left(-\sum_{k\ge 1}p_k\ln p_k\right)^2 \le \sum_{k\ge 1}p_k\ln^2 p_k < \infty,$

where equality holds, i.e., $H^2 = \sum_{k\ge 1}p_k\ln^2 p_k$, if and only if $\mathbf{p}$ is a uniform distribution. Thus, when (8) holds, we have $H<\infty$. The following result is given in [8].

Lemma 1.

Let $\mathbf{p}=\{p_k;k\ge 1\}$ be a distribution, which is not uniform, and set

$\sigma^2 = \sum_{k\ge 1}p_k\ln^2 p_k - H^2 \quad\text{and}\quad \hat{\sigma}^2 = \sum_{k\ge 1}\hat{p}_k\ln^2\hat{p}_k - \hat{H}^2.$ (9)

If $\mathbf{p}$ satisfies Condition 1, then $\hat{\sigma}\stackrel{p}{\to}\sigma$,

$\frac{\sqrt{n}(\hat{H}-H)}{\sigma}\stackrel{L}{\to}N(0,1),$

and

$\frac{\sqrt{n}(\hat{H}-H)}{\hat{\sigma}}\stackrel{L}{\to}N(0,1).$

The following is useful for checking when Condition 1 holds.

Lemma 2.

Let $\mathbf{p}=\{p_k;k\ge 1\}$ and $\mathbf{p}'=\{p'_k;k\ge 1\}$ be two distributions and assume that $\mathbf{p}'$ satisfies Condition 1. If there exists a $C>0$ such that, for large enough $k$,

$p_k \le C p'_k,$

then $\mathbf{p}$ satisfies Condition 1 as well.

In [8], it is shown that Condition 1 holds for $\mathbf{p}=\{p_k;k\ge 1\}$ with

$p_k = \frac{C}{k^2\ln^2 k}, \quad k=2,3,\ldots,$

where C>0 is a normalizing constant. It follows from Lemma 2 that any distribution with tails lighter than this satisfies Condition 1 as well.

We are interested in finding conditions under which the result of Lemma 1 can be extended to bias adjusted modifications of $\hat{H}$. Let $\hat{H}^*$ be any bias-adjusted estimator of the form

$\hat{H}^* = \hat{H} + \hat{B}^*,$ (10)

where $\hat{B}^*$ is an estimate of the bias. Combining Lemma 1 with Slutsky’s theorem immediately gives the following.

Theorem 1.

Let $\mathbf{p}=\{p_k;k\ge 1\}$ be a distribution, which is not uniform, and let $\sigma^2$ and $\hat{\sigma}^2$ be as in (9). If Condition 1 holds and $\sqrt{n}\hat{B}^*\stackrel{p}{\to}0$, then $\hat{\sigma}\stackrel{p}{\to}\sigma$,

$\frac{\sqrt{n}(\hat{H}^*-H)}{\sigma}\stackrel{L}{\to}N(0,1),$

and

$\frac{\sqrt{n}(\hat{H}^*-H)}{\hat{\sigma}}\stackrel{L}{\to}N(0,1).$

For the Miller–Madow estimator and the jackknife estimator, respectively, the bias correction term, $\hat{B}^*$, in (10) takes the form

$\text{Miller–Madow: } \hat{B}_{MM} = \frac{\hat{m}-1}{2n}, \qquad \text{Jackknife: } \hat{B}_{JK} = \frac{n-1}{n}\sum_{i=1}^{n}\left(\hat{H}-\hat{H}_{(i)}\right).$

Below, we give sufficient conditions for when $\sqrt{n}\hat{B}_{MM}\stackrel{p}{\to}0$ and when $\sqrt{n}\hat{B}_{JK}\stackrel{p}{\to}0$.
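As an informal illustration (ours, not part of the paper's argument), the sketch below computes the scaled corrections $\sqrt{n}\hat{B}_{MM}$ and $\sqrt{n}\hat{B}_{JK}$ on simulated Geometric(0.5) samples, the distribution later used in Section 3; under the conditions below, both scaled corrections should drift toward zero as $n$ grows.

    # Illustrative check (ours): scaled bias-correction terms on Geometric(0.5)
    # samples of increasing size.
    import math
    import random
    from collections import Counter

    def _plug_in(counts, n):
        return -sum((y / n) * math.log(y / n) for y in counts.values() if y > 0)

    def scaled_corrections(n, p=0.5, seed=0):
        rng = random.Random(seed)
        # Inverse-transform sampling of a Geometric(p) variable on {1, 2, ...}.
        sample = [max(1, math.ceil(math.log(1.0 - rng.random()) / math.log(1.0 - p)))
                  for _ in range(n)]
        counts = Counter(sample)
        h_hat = _plug_in(counts, n)
        b_mm = (len(counts) - 1) / (2 * n)               # Miller-Madow correction
        loo_sum = 0.0                                    # sum over i of (H_hat - H_(i))
        for letter, y in counts.items():
            reduced = dict(counts)
            reduced[letter] = y - 1
            loo_sum += y * (h_hat - _plug_in(reduced, n - 1))
        b_jk = (n - 1) / n * loo_sum                     # jackknife correction
        return math.sqrt(n) * b_mm, math.sqrt(n) * b_jk

    for n in (100, 1000, 10000):
        print(n, scaled_corrections(n))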

2.1. Results for the Miller–Madow Estimator

Condition 2.

The distribution, $\mathbf{p}=\{p_k;k\ge 1\}$, satisfies that, for sufficiently large $k$,

$p_k \le \frac{1}{a(k)\,b(k)\,k^3},$ (11)

where $a(k)>0$ and $b(k)>0$ are two sequences such that

  • 1. $a(k)\to\infty$ as $k\to\infty$, and, furthermore,
    • (a) the function $a(k)$ is eventually nondecreasing, and
    • (b) there exists an $\varepsilon>0$ such that
      $\limsup_{k\to\infty}\frac{(a(k))^{2\varepsilon}}{a\left(\sqrt{k}/(a(k))^{\varepsilon}\right)} < \infty;$ (12)
  • 2. $\sum_{k\ge 1}\frac{1}{k\,b(k)} < \infty.$ (13)

Since this condition only requires that $p_k$, for sufficiently large $k$, is upper bounded in the appropriate way, we immediately get the following.

Lemma 3.

Let $\mathbf{p}=\{p_k;k\ge 1\}$ and $\mathbf{p}'=\{p'_k;k\ge 1\}$ be two distributions and assume that $\mathbf{p}'$ satisfies Condition 2. If there exists a $C>0$ such that, for large enough $k$,

$p_k \le C p'_k,$

then $\mathbf{p}$ satisfies Condition 2 as well.

We now give our main results for the Miller–Madow estimator.

Theorem 2.

Let $\mathbf{p}=\{p_k;k\ge 1\}$ be a distribution, which is not uniform, and let $\sigma^2$ and $\hat{\sigma}^2$ be as in (9). If Condition 2 holds, then $\hat{\sigma}\stackrel{p}{\to}\sigma$,

$\frac{\sqrt{n}(\hat{H}_{MM}-H)}{\sigma}\stackrel{L}{\to}N(0,1)$

and

$\frac{\sqrt{n}(\hat{H}_{MM}-H)}{\hat{\sigma}}\stackrel{L}{\to}N(0,1).$

In the proof of the theorem, we will show that Condition 2 implies that Condition 1 holds. Condition 2 requires $p_k$ to decay slightly faster than $k^{-3}$ by two factors $1/a(k)$ and $1/b(k)$, where $a(k)$ and $b(k)$ satisfy (12) and (13), respectively. While (13) is clear in its implication on $b(k)$, (12) is much less so on $a(k)$. To have a better understanding of (12), we give an important situation where (12) holds. Consider the case $a(n)=\ln n$. In this case, for any $\varepsilon\in(0,0.5)$,

$\frac{(a(n))^{2\varepsilon}}{a\left(\sqrt{n}/(a(n))^{\varepsilon}\right)} = \frac{(\ln n)^{2\varepsilon}}{0.5\ln n-\varepsilon\ln\ln n} \sim \frac{(\ln n)^{2\varepsilon}}{0.5\ln n} \to 0.$

We now give a more general situation, which shows just how slow $a(k)$ can be. First, we recall the iterated logarithm function. Define $\ln_{(r)}(x)$, recursively for sufficiently large $x>0$, by $\ln_{(0)}(x)=x$ and $\ln_{(r)}(x)=\ln\left(\ln_{(r-1)}(x)\right)$ for $r\ge 1$. By induction, it can be shown that $\frac{d}{dx}\ln_{(r)}(x)=\prod_{i=0}^{r-1}\left(\ln_{(i)}(x)\right)^{-1}$ for $r\ge 1$.

Lemma 4.

The function $a(n)=\ln_{(r)}(n)$ satisfies (12) with $\varepsilon=0.5$ for any $r\ge 2$.

We now give three examples.

Example 1.

Let $\mathbf{p}=\{p_k;k\ge 1\}$ be such that, for sufficiently large $k$,

$p_k \le \frac{C}{k^3(\ln k)(\ln\ln k)^{2+\varepsilon}},$

where $\varepsilon>0$ and $C>0$ are fixed constants. In this case, Condition 2 holds with $a(k)=\ln\ln k$ and $b(k)=(\ln k)(\ln\ln k)^{1+\varepsilon}/C$ in (11).
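Spelling out the check (our own verification): with these choices, for large $k$,

$\frac{1}{a(k)\,b(k)\,k^3} = \frac{C}{k^3(\ln k)(\ln\ln k)^{2+\varepsilon}},$

so (11) is exactly the assumed tail bound, and

$\sum_{k}\frac{1}{k\,b(k)} = C\sum_{k}\frac{1}{k(\ln k)(\ln\ln k)^{1+\varepsilon}} < \infty$

by the integral test, which gives (13). Finally, (12) holds for $a(k)=\ln\ln k=\ln_{(2)}k$ by Lemma 4 (with exponent $0.5$ there; that exponent is unrelated to the $\varepsilon$ in the tail bound).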

We can consider a more general form, which allows for even heavier tails.

Example 2.

Let $r$ be an integer with $r\ge 2$ and let $\mathbf{p}=\{p_k;k\ge 1\}$ be such that, for sufficiently large $k$,

$p_k \le \frac{C}{k^3\left(\prod_{i=1}^{r-1}\ln_{(i)}k\right)\left(\ln_{(r)}k\right)^{2+\varepsilon}},$

where $\varepsilon>0$ and $C>0$ are fixed constants. In this case, Condition 2 holds with $a(k)=\ln_{(r)}k$ and $b(k)=\left(\prod_{i=1}^{r-1}\ln_{(i)}k\right)\left(\ln_{(r)}k\right)^{1+\varepsilon}/C$ in (11). The fact that $b(k)$ satisfies (13) follows by the integral test for convergence.

It follows from Lemma 3 that any distribution with tails lighter than those in this example must satisfy Condition 2. On the other hand, the tails cannot get too much heavier.

Example 3.

Let $\mathbf{p}=\{p_k;k\ge 1\}$ be such that $p_k=Ck^{-3}$, where $C>0$ is a normalizing constant. In this case, Condition 2 does not hold. However, Condition 1 does hold.

2.2. Results for the Jackknife Estimator

For any distribution $\mathbf{p}$, let $B_n=E(\hat{H})-H$ be the bias of the plug-in based on a sample of size $n$.

Condition 3.

The distribution, $\mathbf{p}=\{p_k;k\ge 1\}$, satisfies

$\lim_{n\to\infty}n^{3/2}\left(B_n-B_{n-1}\right)=0.$

Theorem 3.

Let $\mathbf{p}=\{p_k;k\ge 1\}$ be a distribution, which is not uniform, and let $\sigma^2$ and $\hat{\sigma}^2$ be as in (9). If Conditions 1 and 3 hold, then $\hat{\sigma}\stackrel{p}{\to}\sigma$,

$\frac{\sqrt{n}(\hat{H}_{JK}-H)}{\sigma}\stackrel{L}{\to}N(0,1)$

and

$\frac{\sqrt{n}(\hat{H}_{JK}-H)}{\hat{\sigma}}\stackrel{L}{\to}N(0,1).$

It is not clear to us whether Conditions 1 and 3 are equivalent or, if not, which is more stringent. For that reason, in the statement of Theorem 3, both conditions are imposed. The proof of the theorem uses the following lemma, which gives some insight into $\hat{B}_{JK}$ and Condition 3.

Lemma 5.

For any probability distribution $\mathbf{p}=\{p_k;k\ge 1\}$, we have

$\hat{B}_{JK} = \frac{n-1}{n}\sum_{i=1}^{n}\left(\hat{H}-\hat{H}_{(i)}\right) \ge 0$

and

$E\left(\hat{B}_{JK}\right) = (n-1)\left(B_n-B_{n-1}\right) \ge 0.$

We now give a condition that implies Condition 3 and tends to be easier to check.

Proposition 1.

If the distribution $\mathbf{p}=\{p_k;k\ge 1\}$ is such that there exists an $\varepsilon\in(1/2,1)$ with $\sum_{k\ge 1}p_k^{1-\varepsilon}<\infty$, then Condition 3 is satisfied.

We now give an example where this holds.

Example 4.

Let $\mathbf{p}=\{Ck^{-(2+\delta)};k\ge 1\}$, where $\delta>0$ is fixed and $C>0$ is a normalizing constant. In this case, the assumption of Proposition 1 holds and thus Condition 3 is satisfied.

To see that the assumption of Proposition 1 holds in this case, fix $\varepsilon\in\left(1/2,\,(1+\delta)/(2+\delta)\right)$. Note that $-(1+\delta/2) < -(2+\delta)(1-\varepsilon) < -1$, and thus

$\sum_{k\ge 1}p_k^{1-\varepsilon} = C^{1-\varepsilon}\sum_{k\ge 1}k^{-(2+\delta)(1-\varepsilon)} < \infty.$

3. Simulations

The main application of the asymptotic normality results given in this paper is the construction of asymptotic confidence intervals and hypothesis tests. For instance, if $\mathbf{p}$ satisfies the assumptions of Theorem 2, then an asymptotic $(1-\alpha)100\%$ confidence interval for $H$ is given by

$\left(\hat{H}_{MM} - z_{\alpha/2}\frac{\hat{\sigma}}{\sqrt{n}},\ \hat{H}_{MM} + z_{\alpha/2}\frac{\hat{\sigma}}{\sqrt{n}}\right),$

where $z_{\alpha/2}$ is a number such that $P(Z>z_{\alpha/2})=\alpha/2$ and $Z$ is a standard normal random variable. Similarly, if the assumptions of Theorem 3 are satisfied, then we can replace $\hat{H}_{MM}$ with $\hat{H}_{JK}$, and if the assumptions of Lemma 1 are satisfied, then we can replace $\hat{H}_{MM}$ with $\hat{H}$. In this section, we give a small-scale simulation study to evaluate the finite sample performance of these confidence intervals.
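A sketch of how such an interval might be computed from a sample (ours; the helper name and the use of the standard-library normal quantile function are illustrative choices):

    # Illustrative helper: Miller-Madow point estimate with the asymptotic
    # (1 - alpha) confidence interval based on sigma_hat from (9).
    import math
    from collections import Counter
    from statistics import NormalDist

    def miller_madow_ci(sample, alpha=0.05):
        n = len(sample)
        counts = Counter(sample)
        h_hat = -sum((y / n) * math.log(y / n) for y in counts.values())
        h_mm = h_hat + (len(counts) - 1) / (2 * n)
        second_moment = sum((y / n) * math.log(y / n) ** 2 for y in counts.values())
        sigma_hat = math.sqrt(max(second_moment - h_hat ** 2, 0.0))   # from (9)
        z = NormalDist().inv_cdf(1.0 - alpha / 2.0)                   # z_{alpha/2}
        half_width = z * sigma_hat / math.sqrt(n)
        return h_mm - half_width, h_mm + half_width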

For concreteness, we focus on the geometric distribution, which corresponds to

$p_k = p(1-p)^{k-1}, \quad k=1,2,\ldots,$

where $p\in(0,1)$ is a parameter. The true entropy of this distribution is given by $H = -p^{-1}\left[p\ln p + (1-p)\ln(1-p)\right]$. In this case, Conditions 1, 2, and 3 all hold. For our simulations, we took $p=0.5$. The simulations were performed as follows. We began by simulating a random sample of size $n$ and used it to evaluate a 95% confidence interval for the given estimator. We then checked to see if the true value of $H$ was in the interval or not. This was repeated 5000 times and the proportion of times when the true value was in the interval was calculated. This proportion should be close to 0.95 when the confidence interval works well. We repeated this for sample sizes ranging from 20 to 1000 in increments of 10. The results are given in Figure 1. We can see that the Miller–Madow and jackknife estimators consistently outperform the plug-in. It may be interesting to note that, although the proofs of Theorems 1–3 are based on showing that the bias correction term approaches zero, this does not mean that the bias correction term is not useful. On the contrary, bias correction improves the finite sample performance of the asymptotic confidence intervals.
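The sketch below (ours) reruns this experiment on a much smaller scale: it estimates the empirical coverage of the Miller–Madow interval for Geometric(0.5) at a single sample size. The replication count and sample size are illustrative; the paper uses 5000 replications and sample sizes from 20 to 1000.

    # Small-scale coverage check for the Miller-Madow interval under
    # Geometric(0.5); parameters here are illustrative only.
    import math
    import random
    from collections import Counter
    from statistics import NormalDist

    P, N, REPS, ALPHA = 0.5, 200, 500, 0.05
    H_TRUE = -(P * math.log(P) + (1 - P) * math.log(1 - P)) / P
    Z = NormalDist().inv_cdf(1 - ALPHA / 2)
    rng = random.Random(1)

    hits = 0
    for _ in range(REPS):
        sample = [max(1, math.ceil(math.log(1.0 - rng.random()) / math.log(1.0 - P)))
                  for _ in range(N)]
        n, counts = len(sample), Counter(sample)
        h_hat = -sum((y / n) * math.log(y / n) for y in counts.values())
        h_mm = h_hat + (len(counts) - 1) / (2 * n)
        second = sum((y / n) * math.log(y / n) ** 2 for y in counts.values())
        sigma_hat = math.sqrt(max(second - h_hat ** 2, 0.0))
        half = Z * sigma_hat / math.sqrt(n)
        hits += (h_mm - half <= H_TRUE <= h_mm + half)
    print("empirical coverage:", hits / REPS)            # should be near 0.95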

Figure 1.


Effectiveness of the 95% confidence intervals as a function of sample size. The plot on the top left gives the proportions for the Miller–Madow and the plug-in estimators, while the one on the top right gives the proportions for the jackknife and the plug-in estimators. The horizontal line is at 0.95. The closer the proportion is to this line, the better the performance. The plot on the bottom left gives the proportion for Miller–Madow minus the proportion for the plug-in, while the one on the bottom right gives the proportion for the jackknife minus the proportion for the plug-in. The larger the value, the greater the improvement due to bias correction. Here, the horizontal line is at 0.

4. Discussion

In this paper, we gave sufficient conditions for the asymptotic normality of the Miller–Madow and the jackknife estimators of entropy. While our focus is on the case of countably infinite alphabets, our results are formulated and proved in the case where the effective cardinality $K$ may be finite or countably infinite. As such, they hold in the case of finite alphabets as well. In fact, for finite alphabets, Conditions 1–3 always hold and we have asymptotic normality so long as the underlying distribution is not uniform. The difficulty with the uniform distribution is that it is the unique distribution for which $\sigma^2$, as given by (9), is zero (see the discussion just below Condition 1). When the distribution is uniform, the asymptotic distribution is chi-squared with $(K-1)$ degrees of freedom (see [6]).

In general, we do not know if our conditions are necessary. However, they cover most distributions of interest. The only distributions they preclude are ones with extremely heavy tails. While, in complete generality, Conditions 1–3 may look complicated, they are easily checked in many situations. For instance, Condition 2 always holds when, for large enough $k$, $p_k\le Ck^{-3-\delta}$ for some $C,\delta>0$, i.e., when

$\sum_{k=1}^{\infty}k^2 p_k < \infty.$

If the alphabet $\mathcal{X}=\mathbb{N}$ is the set of natural numbers, then this is equivalent to the distribution $\mathbf{p}$ having a finite variance. Similarly, Conditions 1 and 3 both hold when, for large enough $k$, $p_k\le Ck^{-2-\delta}$ for some $C,\delta>0$, i.e., when

$\sum_{k=1}^{\infty}k\,p_k < \infty.$

If the alphabet $\mathcal{X}=\mathbb{N}$ is the set of natural numbers, then this is equivalent to the distribution $\mathbf{p}$ having a finite mean.

5. Proofs

Proof of Lemma 2.

Without loss of generality, assume that $C>1$ and thus that $\ln C>0$. Let $f(x)=x\ln x$ for $x\in(0,1)$. It is readily checked that $f$ is negative and decreasing for $x\in(0,e^{-1})$. Since $Cp'_k\to 0$ as $k\to\infty$, it follows that $Cp'_k<e^{-1}$ for large enough $k$. Now, let $K(n)$ be the sequence that works for $\mathbf{p}'$ in Condition 1. For large enough $n$,

$0 \le \sqrt{n}\left|\sum_{k\ge K(n)}p_k\ln p_k\right| \le C\sqrt{n}\left|\sum_{k\ge K(n)}p'_k\ln(Cp'_k)\right| \le C\ln(C)\sqrt{n}\sum_{k\ge K(n)}p'_k + C\sqrt{n}\left|\sum_{k\ge K(n)}p'_k\ln(p'_k)\right| \le C\left(\ln(C)+1\right)\sqrt{n}\left|\sum_{k\ge K(n)}p'_k\ln(p'_k)\right| \to 0.$

Similarly, the function $g(x)=x\ln^2 x$ for $x\in(0,1)$ is positive and increasing for $x\in(0,e^{-2})$. Thus, there is an integer $M>0$ such that if $k\ge M$, then $Cp'_k<e^{-2}$ and

$0 \le \sum_{k\ge 1}p_k\ln^2 p_k \le \sum_{k=1}^{M-1}p_k\ln^2 p_k + C\sum_{k=M}^{\infty}p'_k\ln^2(Cp'_k) = \sum_{k=1}^{M-1}p_k\ln^2 p_k + C\sum_{k=M}^{\infty}p'_k\ln^2(p'_k) + C\ln^2(C)\sum_{k=M}^{\infty}p'_k + 2C\ln(C)\sum_{k=M}^{\infty}p'_k\ln(p'_k) < \infty,$

as required. ☐

To prove Theorem 2, the following Lemma is needed.

Lemma 6.

If Condition 2 holds, then there exists a $K_1>0$ such that for all $k\ge K_1$,

$p_k \le \frac{1}{a(k)\,b(k)\,k^3} \le 1-\left(1-\frac{2}{k\,b(k)}\right)^{\frac{1}{a(k)k^2}}.$ (14)

Proof. 

Observing that $e^{-x}\ge 1-x$ holds for all real $x$ and that $\lim_{x\to 0}(1-e^{-x})/x=1$, we have $e^{-2/(k\,b(k))}\ge 1-2/(k\,b(k))$, and hence

$1-\left(1-\frac{2}{k\,b(k)}\right)^{\frac{1}{a(k)k^2}} \ge 1-e^{-\frac{2}{a(k)b(k)k^3}} \sim \frac{2}{a(k)b(k)k^3}.$

This implies that there is a $K_1>0$ such that for all $k\ge K_1$ (11) holds and

$\frac{2/\left(a(k)b(k)k^3\right)}{1-\left(1-\frac{2}{k\,b(k)}\right)^{\frac{1}{a(k)k^2}}} \le 2.$

It follows that, for such $k$,

$p_k \le \frac{1}{a(k)b(k)k^3} = 0.5\,\frac{2/\left(a(k)b(k)k^3\right)}{1-\left(1-\frac{2}{k\,b(k)}\right)^{\frac{1}{a(k)k^2}}}\left[1-\left(1-\frac{2}{k\,b(k)}\right)^{\frac{1}{a(k)k^2}}\right] \le 1-\left(1-\frac{2}{k\,b(k)}\right)^{\frac{1}{a(k)k^2}},$

as required. ☐

Proof of Theorem 2.

By Theorem 1, it suffices to show that Condition 2 implies that both Condition 1 and $\sqrt{n}\hat{B}_{MM}\stackrel{p}{\to}0$ hold. The fact that Condition 2 implies Condition 1 follows by Example 3, Lemmas 2 and 3. We now show that $\sqrt{n}\hat{B}_{MM}\stackrel{p}{\to}0$.

Fix $\varepsilon_0\in(0,\varepsilon)$. From (12) and the facts that $a(k)$ is positive, eventually nondecreasing, and approaches infinity, it follows that

$\limsup_{k\to\infty}\frac{(a(k))^{2\varepsilon_0}}{a\left(\sqrt{k}/(a(k))^{\varepsilon_0}\right)} = \limsup_{k\to\infty}(a(k))^{-2(\varepsilon-\varepsilon_0)}\frac{(a(k))^{2\varepsilon}}{a\left(\sqrt{k}/(a(k))^{\varepsilon_0}\right)} \le \limsup_{k\to\infty}(a(k))^{-2(\varepsilon-\varepsilon_0)}\frac{(a(k))^{2\varepsilon}}{a\left(\sqrt{k}/(a(k))^{\varepsilon}\right)} = 0.$ (15)

Let $K_2$ be a positive integer such that, for all $n\ge K_2$, $a(n)$ is nondecreasing, and let $r_n=\left(\sqrt{n}/(a(n))^{\varepsilon_0}\right)\vee K_3$, where $K_3=K_1\vee K_2$ and $K_1$ is as in Lemma 6. It follows that

$E\left(\sqrt{n}\hat{B}_{MM}\right) = \sqrt{n}\,E\left(\frac{\hat{m}-1}{2n}\right) \le \frac{1}{\sqrt{n}}E(\hat{m}) = \frac{1}{\sqrt{n}}\sum_{k\ge 1}\left[1-(1-p_k)^n\right] \le \frac{1}{\sqrt{n}}\sum_{k\le r_n}1 + \frac{1}{\sqrt{n}}\sum_{k>r_n}\left[1-\left(1-\frac{2}{k\,b(k)}\right)^{\frac{n}{a(k)k^2}}\right] =: S_1+S_2.$

We have

$S_1 \le \frac{r_n}{\sqrt{n}} = (a(n))^{-\varepsilon_0}\vee\frac{K_3}{\sqrt{n}} \to 0; \quad\text{and}\quad S_2 \le \frac{1}{\sqrt{n}}\sum_{k>r_n}\left[1-\left(1-\frac{2}{k\,b(k)}\right)^{\frac{n}{a(r_n)r_n^2}}\right] \le \frac{1}{\sqrt{n}}\sum_{k>r_n}\left[1-\left(1-\frac{2}{k\,b(k)}\right)^{\frac{(a(n))^{2\varepsilon_0}}{a\left(\sqrt{n}/(a(n))^{\varepsilon_0}\right)}}\right].$

By (15), the exponent in the last expression is at most 1 for all large enough $n$; since $1-(1-x)^t\le x$ for $x\in(0,1)$ and $t\in(0,1]$, it follows that, for such $n$,

$S_2 \le \frac{1}{\sqrt{n}}\sum_{k>r_n}\left[1-\left(1-\frac{2}{k\,b(k)}\right)\right] = \frac{1}{\sqrt{n}}\sum_{k>r_n}\frac{2}{k\,b(k)} \le \left(\sum_{k\ge 1}\frac{1}{k\,b(k)}\right)\frac{2}{\sqrt{n}} \to 0.$

From here, Markov’s inequality implies that $\sqrt{n}\hat{B}_{MM}\stackrel{p}{\to}0$. ☐

Proof of Lemma 4.

First note that $\ln_{(r-1)}\left(0.5\ln n - 0.5\ln_{(v)}n\right) \sim \ln_{(r-1)}\left(0.5\ln n\right)$ for any $v\ge 2$ and $r\ge 1$. This can be shown by induction on $r$. Specifically, the result is immediate for $r=1$. If the result is true for $r=m$, then for $r=m+1$

$\ln_{(m)}\left(0.5\ln n-0.5\ln_{(v)}n\right) = \ln\left(\ln_{(m-1)}\left(0.5\ln n-0.5\ln_{(v)}n\right)\right) = \ln\left(\frac{\ln_{(m-1)}\left(0.5\ln n-0.5\ln_{(v)}n\right)}{\ln_{(m-1)}\left(0.5\ln n\right)}\right) + \ln_{(m)}\left(0.5\ln n\right) \sim \ln_{(m)}\left(0.5\ln n\right).$

It follows that for $r\ge 2$

$\lim_{n\to\infty}\frac{\ln_{(r)}n}{\ln_{(r)}\left(\sqrt{n}/\sqrt{\ln_{(r)}n}\right)} = \lim_{n\to\infty}\frac{\ln_{(r)}n}{\ln_{(r-1)}\left(0.5\ln n-0.5\ln_{(r+1)}n\right)} = \lim_{n\to\infty}\frac{\ln_{(r-1)}(\ln n)}{\ln_{(r-1)}(0.5\ln n)} = 1,$

where the final equality follows by the fact that $\ln_{(r-1)}(x)$ is a slowly varying function. Recall that a positive-valued function $\ell$ is called slowly varying if for any $t>0$

$\lim_{x\to\infty}\frac{\ell(xt)}{\ell(x)} = 1$

(see [11] for a standard reference). To see that $\ln_{(r-1)}(x)$ is slowly varying, note that $\ln$ is slowly varying and that compositions of slowly varying functions are slowly varying by Proposition 1.3.6 in [11]. ☐

Proof of Lemma 5.

Observing the convention that $0\ln 0=0$,

$\sum_{i=1}^{n}\hat{H}_{(i)} = \sum_{k\ge 1}\sum_{i:X_i=\ell_k}\hat{H}_{(i)}$
$= \sum_{k\ge 1}Y_k\left[-\frac{Y_k-1}{n-1}\ln\frac{Y_k-1}{n-1} - \sum_{j:j\ge 1,j\ne k}\frac{Y_j}{n-1}\ln\frac{Y_j}{n-1}\right]$
$= \sum_{k\ge 1}Y_k\left[-\frac{Y_k-1}{n-1}\left(\ln\frac{Y_k-1}{Y_k}+\ln\frac{Y_k}{n-1}\right) - \sum_{j:j\ge 1,j\ne k}\frac{Y_j}{n-1}\ln\frac{Y_j}{n-1}\right]$
$= \sum_{k\ge 1}Y_k\left[-\frac{Y_k-1}{n-1}\ln\frac{Y_k-1}{Y_k} + \frac{1}{n-1}\ln\frac{Y_k}{n-1} - \sum_{j\ge 1}\frac{Y_j}{n-1}\ln\frac{Y_j}{n-1}\right]$
$= -\frac{1}{n-1}\sum_{k\ge 1}Y_k(Y_k-1)\ln\frac{Y_k-1}{Y_k} + \sum_{k\ge 1}\frac{Y_k}{n-1}\ln\frac{Y_k}{n-1} - \sum_{k\ge 1}Y_k\sum_{j\ge 1}\frac{Y_j}{n-1}\ln\frac{Y_j}{n-1}$
$= -\frac{1}{n-1}\sum_{k\ge 1}Y_k(Y_k-1)\ln\frac{Y_k-1}{Y_k} - (n-1)\sum_{k\ge 1}\frac{Y_k}{n-1}\ln\frac{Y_k}{n-1}$
$= -\frac{1}{n-1}\sum_{k\ge 1}Y_k(Y_k-1)\ln\frac{Y_k-1}{Y_k} - \sum_{k\ge 1}Y_k\left(\ln\frac{Y_k}{n}+\ln\frac{n}{n-1}\right)$
$= -\frac{1}{n-1}\sum_{k\ge 1}Y_k(Y_k-1)\ln\frac{Y_k-1}{Y_k} - n\ln\frac{n}{n-1} + n\hat{H}.$

Therefore,

$\sum_{i=1}^{n}\left(\hat{H}-\hat{H}_{(i)}\right) = \frac{1}{n-1}\sum_{k\ge 1}Y_k(Y_k-1)\ln\frac{Y_k-1}{Y_k} + n\ln\frac{n}{n-1} = \frac{1}{n-1}\sum_{k\ge 1}Y_k\left[(Y_k-1)\ln\frac{Y_k-1}{Y_k} + (n-1)\ln\frac{n}{n-1}\right] = \frac{1}{n-1}\sum_{k\ge 1}Y_k\left[(n-1)\ln\frac{n}{n-1} - (Y_k-1)\ln\frac{Y_k}{Y_k-1}\right].$

It suffices to show that for any $y\in\{1,2,\ldots,n\}$

$(y-1)\ln y - (y-1)\ln(y-1) \le (n-1)\ln n - (n-1)\ln(n-1).$ (16)

Towards that end, first note that the inequality of (16) holds for $y=1$. Now, let

$f(y) = (y-1)\ln y - (y-1)\ln(y-1)$

and, therefore, letting $s=1-1/y$,

$f'(y) = \ln\frac{y}{y-1} - \frac{1}{y} = \left(1-\frac{1}{y}\right) - 1 - \ln\left(1-\frac{1}{y}\right) = (s-1)-\ln s.$

Since $s-1\ge\ln s$ for all $s>0$ (see e.g., 4.1.36 in [12]), $f'(y)\ge 0$ for all $y$, $1<y\le n$, which implies (16).

For the second part, we use the first part to get

$0 \le E\left[\sum_{i=1}^{n}\left(\hat{H}-\hat{H}_{(i)}\right)\right] = nE\left[\hat{H}-H\right] - \sum_{i=1}^{n}E\left[\hat{H}_{(i)}-H\right] = n\left(B_n-B_{n-1}\right),$

where the last equality follows from the facts that, for each $i$, $\hat{H}_{(i)}$ is a plug-in estimator of $H$ based on a sample of size $(n-1)$ and that $E\left(\hat{H}_{(i)}\right)$ does not depend on $i$ due to symmetry. From here, the result follows. ☐

Proof of Theorem 3.

By Theorem 1, it suffices to show $\sqrt{n}\hat{B}_{JK}\stackrel{p}{\to}0$. Note that, by Lemma 5,

$0 \le E\left(\sqrt{n}\hat{B}_{JK}\right) = \sqrt{n}(n-1)\left(B_n-B_{n-1}\right) \le n^{3/2}\left(B_n-B_{n-1}\right) \to 0,$

where the convergence follows by Condition 3. From here, the result follows by Markov’s inequality. ☐

To prove Proposition 1, we need several lemmas, which may be of independent interest.

Lemma 7.

Let $S_n$ and $S_{n-1}$ be binomial random variables with parameters $(n,p)$ and $(n-1,p)$, respectively. If $n\ge 2$ and $p\in(0,1)$, then

$E\left(S_n\ln S_n\right) = E\left[np\ln\left(S_{n-1}+1\right)\right].$ (17)

The proof is given on page 178 in [7].
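As a quick numerical sanity check (ours, not part of [7]), identity (17) can be verified by exact summation over the two binomial probability mass functions:

    # Exact check of identity (17): E(S_n ln S_n) = n p E[ln(S_{n-1} + 1)].
    import math

    def binom_pmf(s, m, q):
        return math.comb(m, s) * q ** s * (1 - q) ** (m - s)

    def both_sides(n, p):
        lhs = sum(binom_pmf(s, n, p) * s * math.log(s) for s in range(1, n + 1))
        rhs = n * p * sum(binom_pmf(s, n - 1, p) * math.log(s + 1) for s in range(n))
        return lhs, rhs

    for n, p in [(2, 0.3), (10, 0.5), (25, 0.1)]:
        print(n, p, both_sides(n, p))          # the two values should agree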

Lemma 8.

Let $X_1,\ldots,X_n$ be iid Bernoulli random variables with parameter $p\in(0,1)$. For $m=1,\ldots,n$, let $S_m=\sum_{i=1}^{m}X_i$, $\hat{p}_m=S_m/m$, $\hat{h}_m=-\hat{p}_m\ln\hat{p}_m$, and $\Delta_m=E[\hat{h}_m-\hat{h}_{m-1}]$. Then,

$\Delta_n = E[\hat{h}_n-\hat{h}_{n-1}] \le \frac{p(2-p)}{(n-1)[(n-2)p+2]} \le \frac{2p}{(n-1)[(n-2)p+2]}.$ (18)

Proof. 

Applying Lemma 7 to $\Delta_n$ gives

$\Delta_n = p\ln\frac{n}{n-1} + pE\left[\ln\frac{S_{n-2}+1}{S_{n-1}+1}\right] = p\ln\frac{n}{n-1} + pE\left[\ln\frac{S_{n-2}+1}{S_{n-2}+X_{n-1}+1}\right].$

Conditioning on $X_{n-1}$ gives

$\Delta_n = p\ln\frac{n}{n-1} + p^2E\left[\ln\frac{S_{n-2}+1}{S_{n-2}+2}\right].$

Noting that $f(x)=\ln\frac{x}{x+1}$ is a concave function for $x>0$, by Jensen’s inequality,

$\Delta_n \le p\ln\frac{n}{n-1} + p^2\ln\frac{(n-2)p+1}{(n-2)p+2}.$

Applying the following inequalities (both of which follow from 4.1.36 in [12]) to the terms of the above expression,

$\ln\frac{x}{x-1} < \frac{1}{x-1} \ \text{for}\ x>1 \quad\text{and}\quad \ln\frac{x}{x+1} < -\frac{1}{x+1} \ \text{for}\ x>0,$

it follows that

$\Delta_n \le \frac{p}{n-1} - \frac{p^2}{(n-2)p+2} = \frac{p(2-p)}{(n-1)[(n-2)p+2]},$

which completes the proof. ☐

For fixed $\varepsilon>0$, rewriting the upper bound of (18) gives

$\frac{2p}{(n-1)[(n-2)p+2]} = \frac{2p^{1-\varepsilon}}{n-1}\cdot\frac{p^{\varepsilon}}{(n-2)p+2} =: \frac{2p^{1-\varepsilon}}{n-1}\,g(n,p,\varepsilon).$ (19)

Lemma 9.

For any $\varepsilon\in(0,1)$ and $n\ge 3$, there exists a $p_0\in(0,1)$ such that $g(n,p,\varepsilon)$ defined in (19) is maximized at $p_0$ and

$0 \le g(n,p_0,\varepsilon) = O\left(n^{-\varepsilon}\right).$ (20)

Proof. 

Taking the derivative of $\ln g(n,p,\varepsilon)$ with respect to $p$ gives

$\frac{\partial}{\partial p}\ln g(n,p,\varepsilon) = \frac{\varepsilon}{p} - \frac{n-2}{(n-2)p+2}.$

It is readily checked that this equals zero only at

$p_0 = \frac{2\varepsilon}{(1-\varepsilon)(n-2)}$

and is positive for $0<p<p_0$ and negative for $p_0<p<1$. Thus, $p_0$ is the global maximum. For a fixed $\varepsilon$, we have

$g(n,p_0,\varepsilon) = \frac{(2\varepsilon)^{\varepsilon}}{(1-\varepsilon)^{\varepsilon}(n-2)^{\varepsilon}\left[\frac{2\varepsilon}{1-\varepsilon}+2\right]} = \left(\frac{\varepsilon}{n-2}\right)^{\varepsilon}\left(\frac{1-\varepsilon}{2}\right)^{1-\varepsilon} = O\left(n^{-\varepsilon}\right),$

as required. ☐

Proof of Proposition 1.

For every $k$ and every $m\le n$, let

$S_{m,k} = \sum_{i=1}^{m}1[X_i=\ell_k] \quad\text{and}\quad \hat{H}_m = -\sum_{k\ge 1}\frac{S_{m,k}}{m}\ln\frac{S_{m,k}}{m}$

be the observed letter counts and the plug-in estimator of entropy based on the first $m$ observations. Thus, $S_{n,k}=Y_k$ and $\hat{H}_n=\hat{H}$. We are interested in evaluating

$B_n-B_{n-1} = E\left[\hat{H}_n-\hat{H}_{n-1}\right] = \sum_{k\ge 1}E\left[\frac{S_{n-1,k}}{n-1}\ln\frac{S_{n-1,k}}{n-1} - \frac{S_{n,k}}{n}\ln\frac{S_{n,k}}{n}\right] \le \sum_{k\ge 1}\frac{2p_k}{(n-1)[(n-2)p_k+2]} = \frac{2\sum_{k\ge 1}p_k^{1-\varepsilon}g(n,p_k,\varepsilon)}{n-1},$

where the inequality follows by Lemma 8 and the final equality by (19). Now, applying Lemmas 5 and 9 gives

$0 \le n^{3/2}\left(B_n-B_{n-1}\right) \le \frac{2n^{3/2}\sum_{k\ge 1}p_k^{1-\varepsilon}g(n,p_k,\varepsilon)}{n-1} \le \frac{2n^{3/2}g(n,p_0,\varepsilon)}{n-1}\sum_{k\ge 1}p_k^{1-\varepsilon} = O\left(n^{1/2-\varepsilon}\right),$

which converges to zero when $\varepsilon\in(1/2,1)$. ☐

Author Contributions

C.C., M.G., A.S., J.Z. and Z.Z. contributed to the proofs; C.C., M.G. and J.Z. contributed editorial input; Z.Z. wrote the paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  • 1. Shannon C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948;27:379–423, 623–656. doi: 10.1002/j.1538-7305.1948.tb01338.x.
  • 2. Cover T.M., Thomas J.A. Elements of Information Theory. 2nd ed. John Wiley & Sons, Inc.; Hoboken, NJ, USA: 2006.
  • 3. Paninski L. Estimation of entropy and mutual information. Neural Comput. 2003;15:1191–1253. doi: 10.1162/089976603321780272.
  • 4. Zhang Z. Statistical Implications of Turing’s Formula. John Wiley & Sons; New York, NY, USA: 2017.
  • 5. Zhang Z., Grabchak M. Bias adjustment for a nonparametric entropy estimator. Entropy. 2013;15:1999–2011. doi: 10.3390/e15061999.
  • 6. Miller G.A., Madow W.G. On the Maximum-Likelihood Estimate of the Shannon-Wiener Measure of Information. Luce R.D., Bush R.R., Galanter E., editors. Operational Applications Laboratory, Air Force Cambridge Research Center, Air Research and Development Command; Bolling Air Force Base, Washington, DC, USA: 1954. Report AFCRC-TR-54-75.
  • 7. Antos A., Kontoyiannis I. Convergence properties of functional estimates for discrete distributions. Random Struct. Algorithms. 2001;19:163–193. doi: 10.1002/rsa.10019.
  • 8. Zhang Z., Zhang X. A normal law for the plug-in estimator of entropy. IEEE Trans. Inf. Theory. 2012;58:2745–2747. doi: 10.1109/TIT.2011.2179702.
  • 9. Zahl S. Jackknifing an index of diversity. Ecology. 1977;58:907–913. doi: 10.2307/1936227.
  • 10. Harris B. The statistical estimation of entropy in the non-parametric case. In: Csiszár I., editor. Topics in Information Theory. North-Holland; Amsterdam, The Netherlands: 1975. pp. 323–355.
  • 11. Bingham N.H., Goldie C.M., Teugels J.L. Regular Variation. Cambridge University Press; New York, NY, USA: 1987.
  • 12. Abramowitz M., Stegun I.A. Handbook of Mathematical Functions. 10th ed. Dover Publications; New York, NY, USA: 1972.
