Entropy. 2018 May 23; 20(6): 397. doi: 10.3390/e20060397

Shannon Entropy Estimation in ∞-Alphabets from Convergence Results: Studying Plug-In Estimators

Jorge F Silva 1
PMCID: PMC7512916  PMID: 33265487

Abstract

This work addresses the problem of Shannon entropy estimation in countably infinite alphabets by studying and adopting some recent convergence results for the entropy functional, which is known to be discontinuous in the space of probabilities on ∞-alphabets. Sufficient conditions for the convergence of the entropy are used in conjunction with some deviation inequalities (covering scenarios with both finitely and infinitely supported assumptions on the target distribution). From this perspective, four plug-in histogram-based estimators are studied, showing that the convergence results are instrumental for deriving new strongly consistent estimators of the entropy. The main application of this methodology is a new data-driven partition (plug-in) estimator. This scheme uses the data to restrict the support where the distribution is estimated, finding an optimal balance between estimation and approximation errors. The proposed scheme offers a consistent (distribution-free) estimator of the entropy in ∞-alphabets and optimal rates of convergence under certain regularity conditions on the problem (finite but unknown support, and tail bounded conditions on the target distribution).

Keywords: Shannon entropy estimation, countably infinite alphabets, entropy convergence results, statistical learning, histogram-based estimators, data-driven partitions, strong consistency, rates of convergence

1. Introduction

Shannon entropy estimation has a long history in information theory, statistics, and computer science [1]. Entropy and related information measures (conditional entropy and mutual information) have a fundamental role in information theory and statistics [2,3] and, as a consequence, they have found numerous applications in learning and decision-making tasks [4,5,6,7,8,9,10,11,12,13,14,15]. In many of these contexts, distributions are not available and the entropy needs to be estimated from empirical data. This problem belongs to the category of scalar functional estimation, which has been thoroughly studied in non-parametric statistics.

Starting with the finite alphabet scenario, the classical plug-in estimator (i.e., the entropy functional evaluated at the empirical distribution) is well known to be consistent, minimax optimal, and asymptotically efficient [16] (Sections 8.7–8.9). More recent research has focused on the so-called large alphabet (or large dimensional) regime, meaning a non-asymptotic under-sampling regime where the number of samples n is on the order of, or even smaller than, the size of the alphabet, denoted by k. In this context, it has been shown that the classical plug-in estimator is sub-optimal, as it suffers from severe bias [17,18]. To characterize optimality in this high dimensional context, a non-asymptotic minimax mean square error analysis (for finite n and k) has been explored by several authors [17,18,19,20,21], considering the minimax risk

$R^*(k,n)=\inf_{\hat{H}(\cdot)}\ \sup_{\mu\in\mathcal{P}(k)}\ \mathbb{E}_{X_1,\dots,X_n\sim\mu^n}\!\left[\left(\hat{H}(X_1,\dots,X_n)-H(\mu)\right)^2\right],$

where $\mathcal{P}(k)$ denotes the collection of probabilities on $[k]\equiv\{1,\dots,k\}$ and $H(\mu)$ is the entropy of $\mu$ (details in Section 2). Paninski [19] first showed that it is possible to construct an entropy estimator that uses a sub-linear sampling size to achieve minimax consistency when k goes to infinity, in the sense that there is a sequence $(n_k)=o(k)$ such that $R^*(k,n_k)\rightarrow 0$ as k goes to infinity. A set of results by Valiant et al. [20,21] shows that the optimal scaling of the sampling size with respect to k to achieve this asymptotic consistency for entropy estimation is $O(k/\log(k))$. A refined set of results for the complete characterization of $R^*(k,n)$, the specific scaling of the sampling complexity, and the achievability of the obtained minimax $L_2$ risk for the family $\{\mathcal{P}(k): k\geq 1\}$ with practical estimators has been presented in [17,18]. On the other hand, it is well known that the problem of estimating the distribution (consistently in total variation) in finite alphabets requires a sampling complexity that scales as O(k) [22]. Consequently, in finite alphabets the task of entropy estimation is simpler than that of estimating the distribution in terms of sampling complexity. These findings are consistent with the observation that the entropy is a continuous functional on the space of distributions (in the total variation distance sense) in the finite alphabet case [2,23,24,25].
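As an aside, the severity of the plug-in bias in the under-sampled regime is easy to reproduce numerically. The following small simulation sketch is not from the paper; the alphabet size, sample sizes, and number of repetitions are arbitrary illustrative choices for a uniform distribution on k symbols:

```python
import numpy as np

rng = np.random.default_rng(0)

def plugin_entropy(samples):
    """Plug-in (empirical) Shannon entropy in nats."""
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

k = 10_000                  # alphabet size
true_H = np.log(k)          # entropy of the uniform distribution on [k]
for n in [1_000, 10_000, 100_000]:
    # average the plug-in estimate over a few independent runs of n i.i.d. samples
    est = np.mean([plugin_entropy(rng.integers(0, k, size=n)) for _ in range(20)])
    print(f"n = {n:>7d}:  E[H_plugin] ~ {est:.3f}   true H = {true_H:.3f}")
```

When n is comparable to (or smaller than) k, the plug-in estimate is far below log k, illustrating the negative bias discussed above.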

1.1. The Challenging Infinite Alphabet Learning Scenario

In this work, we are interested in the countably infinite alphabet scenario, i.e., on the estimation of the entropy when the alphabet is countably infinite and we have a finite number of samples. This problem can be seen as an infinite dimensional regime as the size of the alphabet goes unbounded and n is kept finite for the analysis, which differs from the large dimensional regime mentioned above. As argued in [26] (Section IV), this is a challenging non-parametric learning problem because some of the finite alphabet properties of the entropy do not extend to this infinite dimensional context. Notably, it has been shown that the Shannon entropy is not a continuous functional with respect to the total variational distance in infinite alphabets [24,26,27]. In particular, Ho et al. [24] (Theorem 2) showed concrete examples where convergence in χ2-divergence and in direct information divergence (I-divergence) of a set of distributions to a limit, both stronger than total variational convergence [23,28], do not imply the convergence of the entropy. In addition, Harremoës [27] showed the discontinuity of the entropy with respect to the reverse I-divergence [29], and consequently, with respect to the total variational distance (the distinction between reverse and direct I-divergence was pointed out in the work of Barron et al. [29]). In entropy estimation, the discontinuity of the entropy implies that the minimax mean square error goes unbounded, i.e.,

$R_n^*=\inf_{\hat{H}(\cdot)}\ \sup_{\mu\in H(\mathbb{X})}\ \mathbb{E}_{X_1,\dots,X_n\sim\mu^n}\!\left[\left(\hat{H}(X_1,\dots,X_n)-H(\mu)\right)^2\right]=\infty,$

where H(X) denotes the family of finite entropy distributions over the countable alphabet X (the proof of this result follows from [26] (Theorem 1), and the argument is presented in Appendix A). Consequently, there is no universal minimax consistent estimator (in the mean square error sense) of the entropy over the family of finite entropy distributions.

Considering a sample-wise (or point-wise) convergence to zero of the estimation error (instead of the minimax expected error analysis mentioned above), Antos et al. [30] (Theorem 2 and Corollary 1) showed the remarkable result that the classical plug-in estimate is strongly consistent and consistent in the mean square error sense for any finite entropy distribution (point-wise). The classical plug-in entropy estimator is therefore universal, meaning that the convergence to the correct limiting value H(μ) is achieved almost surely despite the discontinuity of the entropy. Moving on to the analysis of the (point-wise) rate of convergence of the estimation error, Antos et al. [30] (Theorem 3) presented a finite-length lower bound for the error of any arbitrary estimation scheme, showing as a corollary that no universal rate of convergence (to zero) can be achieved for entropy estimation in infinite alphabets [30] (Theorem 4). Finally, constraining the problem to a family of distributions with specific power tail bounded conditions, Antos et al. [30] (Theorem 7) presented a finite-length expression for the rate of convergence of the estimation error of the classical plug-in estimate.

1.2. From Convergence Results to Entropy Estimation

In view of the discontinuity of the entropy in ∞-alphabets [24] and the results that guarantee entropy convergence [25,26,27,31], this work revisits the problem of point-wise almost-sure entropy estimation in ∞-alphabets from the perspective of studying and applying entropy convergence results and their derived bounds [25,26,31]. Importantly, entropy convergence results have established concrete conditions on both the limiting distribution μ and the way a sequence of distributions {μn : n ≥ 0} converges to μ such that lim_{n→∞} H(μn) = H(μ) is satisfied. The natural observation that motivates this work is that consistency is basically a convergence to the true entropy value that happens with probability one. Our main conjecture is that putting these conditions in the context of a learning task, i.e., where {μn : n ≥ 0} is a random sequence of distributions driven by the classical empirical process, offers the possibility of studying a broad family of plug-in estimators with the objective of deriving new strong consistency and rates of convergence results. On the practical side, this work proposes and analyzes a data-driven histogram-based estimator as a key learning scheme, since this approach offers the flexibility to adapt to the learning task when appropriate bounds for the estimation and approximation errors are derived.

1.3. Contributions

We begin by revisiting the classical plug-in entropy estimator in the relevant scenario where μ (the unknown distribution that produces the i.i.d. samples) has a finite, but arbitrarily large and unknown, support. This is declared to be a challenging problem by Ho and Yeung [26] (Theorem 13) because of the discontinuity of the entropy. Finite-length (non-asymptotic) deviation inequalities and confidence intervals are derived, extending the results presented in [26] (Section IV). From this, it is shown that the classical plug-in estimate achieves optimal rates of convergence. Relaxing the finite support restriction on μ, two concrete histogram-based plug-in estimators are presented: one built upon the celebrated Barron–Györfi–van der Meulen histogram-based approach [29,32,33], and the other on a data-driven partition of the space [34,35,36]. For the Barron plug-in scheme, almost-sure consistency is shown for entropy estimation and distribution estimation in direct I-divergence under some mild support conditions on μ. For the data-driven partition scheme, the main context of application of this work, it is shown that this estimator is strongly consistent distribution-free, matching the universal result obtained for the classical plug-in approach in [30]. Furthermore, new almost-sure rates of convergence results (in the estimation error) are obtained for distributions with finite but unknown support and for families of distributions with power and exponential tail dominating conditions. In this context, our results show that this adaptive scheme has a concrete design solution that offers a very good convergence rate of the overall estimation error, as it approaches the rate O(1/√n) that is considered optimal for the finite alphabet case [16]. Importantly, the parameter selection of this scheme relies on, first, obtaining expressions to bound the estimation and approximation errors and, second, finding the optimal balance between these two learning errors.

1.4. Organization

The rest of the paper is organized as follows. Section 2 introduces some basic concepts, notation, and summarizes the main entropy convergence results used in this work. Section 3, Section 4 and Section 5 state and elaborate the main results of this work. Discussion of the results and final remarks are given in Section 6. The technical derivation of the main results are presented in Section 7. Finally, proofs of auxiliary results are relegated to the Appendix Section.

2. Preliminaries

Let X be a countably infinite set and let P(X) denote the collection of probability measures on X. For μ and v in P(X), with μ absolutely continuous with respect to v (i.e., μ ≪ v), dμ/dv(x) denotes the Radon–Nikodym (RN) derivative of μ with respect to v. Every μ ∈ P(X) is equipped with its probability mass function (pmf), which we denote by fμ(x) ≡ μ({x}), x ∈ X. Finally, for any μ ∈ P(X), Aμ ≡ {x ∈ X : fμ(x) > 0} denotes its support and

$F(\mathbb{X})\equiv\left\{\mu\in P(\mathbb{X}):|A_\mu|<\infty\right\}$ (1)

denotes the collection of probabilities with finite support.

Let μ and v be in P(X), then the total variation distance of μ and v is given by [28]

$V(\mu,v)\equiv\sup_{A\in 2^{\mathbb{X}}}\left|v(A)-\mu(A)\right|,$ (2)

where 2X denotes the subsets of X. The Kullback–Leibler divergence or I-divergence of μ with respect to v is given by

$D(\mu||v)\equiv\sum_{x\in A_\mu}f_\mu(x)\log\frac{f_\mu(x)}{f_v(x)}\geq 0,$ (3)

when μ ≪ v, while D(μ||v) is set to infinity otherwise [37].

The Shannon entropy of μP(X) is given by [1,2,38]:

$H(\mu)\equiv-\sum_{x\in A_\mu}f_\mu(x)\log f_\mu(x)\geq 0.$ (4)

In this context, let H(X) ⊆ P(X) be the collection of probabilities for which (4) is finite, let AC(X|v) ⊆ P(X) denote the collection of probabilities absolutely continuous with respect to v ∈ P(X), and let H(X|v) ⊆ AC(X|v) denote the collection of probabilities for which (3) is finite, for v ∈ P(X).

Concerning convergence, a sequence {μn : n ∈ ℕ} ⊆ P(X) is said to converge in total variation to μ ∈ P(X) if

limnV(μn,μ)=0. (5)

For countable alphabets, ref. [31] (Lemma 3) shows that convergence in total variation is equivalent to weak convergence, which is denoted here by μn ⇒ μ, and to the point-wise convergence of the pmf's. Furthermore, from (2), convergence in total variation implies the uniform convergence of the pmf's, i.e., lim_{n→∞} sup_{x∈X} |μn(x) − μ(x)| = 0. Therefore, in this countable case, all four previously mentioned notions of convergence are equivalent: total variation, weak convergence, point-wise convergence of the pmf's, and uniform convergence of the pmf's.

We conclude with the convergence in I-divergence introduced by Barron et al. [29]. It is said that {μn : n ∈ ℕ} converges to μ in direct and in reverse I-divergence if lim_{n→∞} D(μ||μn) = 0 and lim_{n→∞} D(μn||μ) = 0, respectively. From Pinsker's inequality [39,40,41], convergence in I-divergence implies the weak convergence in (5), while it is known that the converse is not true [27].
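To make the quantities of this section concrete, the following minimal Python sketch computes (2), (3), and (4) for pmfs represented as dictionaries over a countable alphabet (truncated to a finite range for computation; the truncation and the particular geometric-like pmfs are illustrative choices, not part of the paper):

```python
import math

def total_variation(p, q):
    """V(p, q) of (2); equals one half of the l1 distance between the pmfs."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

def kl_divergence(p, q):
    """D(p||q) of (3) in bits; returns inf if p is not absolutely continuous w.r.t. q."""
    d = 0.0
    for x, px in p.items():
        if px == 0.0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0.0:
            return math.inf
        d += px * math.log2(px / qx)
    return d

def shannon_entropy(p):
    """H(p) of (4) in bits."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0.0)

# Two (truncated) geometric-like pmfs on the non-negative integers.
p = {x: 0.5 ** (x + 1) for x in range(30)}
q = {x: (2.0 / 3.0) * (1.0 / 3.0) ** x for x in range(30)}
print(total_variation(p, q), kl_divergence(p, q), shannon_entropy(p))
```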

2.1. Convergence Results for the Shannon Entropy

The discontinuity of the entropy in ∞-alphabets raises the problem of finding conditions under which convergence of the entropy can be obtained. On this topic, Ho et al. [26] have studied the interplay between entropy and the total variation distance, specifying conditions for convergence by assuming a finite support on the involved distributions. On the other hand, Harremoës [27] (Theorem 21) obtained convergence of the entropy by imposing a power dominating condition [27] (Definition 17) on the limiting probability measure μ for all the sequences μn:n0 converging in reverse I-divergence to μ [29]. More recently, Silva et al. [25] have addressed the entropy convergence studying a number of new settings that involve conditions on the limiting measure μ, as well as the way the sequence μn:n0 converges to μ in the space of distributions. These results offer sufficient conditions where the entropy evaluated in a sequence of distributions converges to the entropy of its limiting distribution and, consequently, the possibility of applying these when analyzing plug-in entropy estimators. The results used in this work are summarized in the rest of this section.

Let us begin with the case when μ ∈ F(X), i.e., when the support of the limiting measure is finite and unknown.

Proposition 1.

Let us assume that μ ∈ F(X) and {μn : n ∈ ℕ} ⊆ AC(X|μ). If μn ⇒ μ, then lim_{n→∞} D(μn||μ) = 0 and lim_{n→∞} H(μn) = H(μ).

This result is well known because when Aμn ⊆ Aμ for all n, the scenario reduces to the finite alphabet case, where the entropy is known to be continuous [2,23]. Since the proof yields two inequalities that are used in the following sections, a short proof is provided here.

Proof. 

Both μ and μn belong to H(X) by the finite-support assumption. The same argument shows that D(μn||μ) < ∞, since μn ≪ μ for all n. Let us consider the following identity:

$H(\mu)-H(\mu_n)=\sum_{x\in A_\mu}\left(f_{\mu_n}(x)-f_\mu(x)\right)\log f_\mu(x)+D(\mu_n||\mu).$ (6)

The first term on the right hand side (RHS) of (6) is upper bounded by Mμ·V(μn,μ) where

$M_\mu=\log\frac{1}{m_\mu}\equiv\sup_{x\in A_\mu}\left|\log\mu(\{x\})\right|<\infty,\quad\text{with } m_\mu\equiv\min_{x\in A_\mu}\mu(\{x\}).$ (7)

For the second term, we have that

$D(\mu_n||\mu)\leq\log e\cdot\sum_{x\in A_{\mu_n}}f_{\mu_n}(x)\left(\frac{f_{\mu_n}(x)}{f_\mu(x)}-1\right)\leq\frac{\log e}{m_\mu}\cdot\sup_{x\in A_\mu}\left|f_{\mu_n}(x)-f_\mu(x)\right|\leq\frac{\log e}{m_\mu}\cdot V(\mu_n,\mu),$ (8)

and, consequently,

$\left|H(\mu)-H(\mu_n)\right|\leq\left(M_\mu+\frac{\log e}{m_\mu}\right)\cdot V(\mu_n,\mu).$ (9)

Under the assumptions of Proposition 1, we note that the reverse I-divergence and the entropy difference are bounded by the total variation distance via (8) and (9), respectively. Note, however, that these bounds depend on the distribution through mμ (equivalently Mμ) in (7); it is direct to show that mμ > 0 (i.e., Mμ < ∞) if, and only if, μ ∈ F(X). The next result relaxes the assumption that μn ≪ μ and offers a necessary and sufficient condition for the convergence of the entropy.
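Before that, the bounds (8) and (9) can be checked numerically. The following small sanity check is not part of the paper: it works in nats (so log e = 1), draws a finitely supported μ and several fully supported μn ≪ μ at random, and verifies (9):

```python
import numpy as np

rng = np.random.default_rng(1)

def entropy(p):
    """Shannon entropy in nats."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def tv(p, q):
    """Total variation distance as half of the l1 distance, cf. (2)."""
    return 0.5 * np.sum(np.abs(p - q))

k = 8
mu = rng.dirichlet(np.ones(k))      # limiting distribution with finite support
m_mu = mu.min()                     # m_mu in (7)
M_mu = -np.log(m_mu)                # M_mu = log(1/m_mu), in nats

for _ in range(5):
    mu_n = rng.dirichlet(np.ones(k))           # A_{mu_n} is contained in A_mu
    lhs = abs(entropy(mu) - entropy(mu_n))
    rhs = (M_mu + 1.0 / m_mu) * tv(mu_n, mu)   # bound (9) with log e = 1
    assert lhs <= rhs
    print(f"|H(mu) - H(mu_n)| = {lhs:.4f}  <=  (M_mu + 1/m_mu) * V = {rhs:.4f}")
```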

Lemma 1.

Ref. [25] (Theorem 1) Let μ ∈ F(X) and {μn : n ∈ ℕ} ⊆ F(X). If μn ⇒ μ, then there exists N > 0 such that μ ≪ μn for all n ≥ N, and

$\lim_{n\rightarrow\infty}D(\mu||\mu_n)=0.$

Furthermore, lim_{n→∞} H(μn) = H(μ) if, and only if,

$\lim_{n\rightarrow\infty}\mu_n(A_{\mu_n}\setminus A_\mu)\cdot H\!\left(\mu_n(\cdot|A_{\mu_n}\setminus A_\mu)\right)=0\ \Longleftrightarrow\ \lim_{n\rightarrow\infty}\sum_{x\in A_{\mu_n}\setminus A_\mu}f_{\mu_n}(x)\log\frac{1}{f_{\mu_n}(x)}=0,$ (10)

where μ(·|B) denotes the conditional probability of μ given the event B ⊆ X.

Lemma 1 tells us that to achieve entropy convergence (on top of the weak convergence), it is necessary and sufficient to ask for a vanishing expression (with n) of the entropy of μn restricted to the elements of the set Aμn\Aμ. Two remarks about this result are: (1) convergence in direct I-divergence does not imply convergence of the entropy (concrete examples are presented in [24] (Section III) and [25]); (2) under the assumption that μ ∈ F(X), μ is eventually absolutely continuous with respect to μn, and the convergence in total variation is equivalent to the convergence in direct I-divergence.
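To illustrate the role of condition (10), consider a minimal instance of the type of counterexample referenced above (presented here only for illustration and not taken from [24,25]):

$\mu=\delta_0,\qquad \mu_n(\{0\})=1-\frac{1}{n},\qquad \mu_n(\{x\})=\frac{1}{n\,2^n}\ \ \text{for } x\in\{1,\dots,2^n\}.$

Here $V(\mu_n,\mu)=1/n\rightarrow 0$ and $D(\mu||\mu_n)=\log\frac{1}{1-1/n}\rightarrow 0$, but the term in (10) is $\mu_n(A_{\mu_n}\setminus A_\mu)\cdot H(\mu_n(\cdot|A_{\mu_n}\setminus A_\mu))=\frac{1}{n}\cdot\log 2^n=\log 2$, which does not vanish; consistently, $H(\mu_n)\rightarrow\log 2\neq 0=H(\mu)$.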

This section concludes with the case when the support of μ is infinite and unknown, i.e., |Aμ| = ∞. In this context, two results are highlighted:

Lemma 2.

Ref. [31] (Theorem 4) Let us consider μ ∈ H(X) and {μn : n ≥ 0} ⊆ AC(X|μ). If μn ⇒ μ and

$M\equiv\sup_{n\geq 1}\sup_{x\in A_\mu}\frac{f_{\mu_n}(x)}{f_\mu(x)}<\infty,$ (11)

then μn ∈ H(X) ∩ H(X|μ) for all n and it follows that

$\lim_{n\rightarrow\infty}D(\mu_n||\mu)=0\quad\text{and}\quad\lim_{n\rightarrow\infty}H(\mu_n)=H(\mu).$

Interpreting Lemma 2, we have that, to obtain convergence of the entropy functional (without imposing a finite support assumption on μ), a uniform bounding condition (UBC), μ-almost everywhere, was added in (11). By adding this UBC, convergence in reverse I-divergence is also obtained as a byproduct. Finally, for the case when μ ≪ μn for all n, the following result is considered:

Lemma 3.

Ref. [25] (Theorem 3) Let μ ∈ H(X) and let {μn : n ≥ 1} ⊆ H(X) be a sequence of measures such that μ ≪ μn for all n ≥ 1. If μn ⇒ μ and

$\sup_{n\geq 1}\sup_{x\in A_\mu}\left|\log\frac{f_{\mu_n}(x)}{f_\mu(x)}\right|<\infty,$ (12)

then μ ∈ H(X|μn) for all n ≥ 1, and

$\lim_{n\rightarrow\infty}D(\mu||\mu_n)=0.$

Furthermore, lim_{n→∞} H(μn) = H(μ) if, and only if,

$\lim_{n\rightarrow\infty}\sum_{x\in A_{\mu_n}\setminus A_\mu}f_{\mu_n}(x)\log\frac{1}{f_{\mu_n}(x)}=0.$ (13)

This result shows the non-sufficiency of convergence in direct I-divergence to achieve entropy convergence in the regime when μ ≪ μn. In fact, Lemma 3 may be interpreted as an extension of Lemma 1 when the finite support assumption on μ is relaxed.

3. Shannon Entropy Estimation

Let μ be a probability in H(X), and let X1, X2, X3, … denote the empirical process induced by i.i.d. realizations of a random variable driven by μ, i.e., Xi ∼ μ for all i ≥ 1. Let Pμ denote the distribution of the empirical process on the sequence space and let Pμn denote the finite-block distribution of X^n ≡ (X1,…,Xn) on the product space (X^n, B(X^n)). Given a realization of X1, X2, X3, …, Xn, we can construct a histogram-based estimator such as the classical empirical probability given by:

$\hat{\mu}_n(A)\equiv\frac{1}{n}\sum_{k=1}^{n}\mathbf{1}_A(X_k),\quad\forall A\subseteq\mathbb{X},$ (14)

with pmf given by fμ^n(x)=μ^n(x) for all xX. A natural estimator of the entropy is the plug-in estimate of μ^n given by

$H(\hat{\mu}_n)=-\sum_{x\in\mathbb{X}}f_{\hat{\mu}_n}(x)\log f_{\hat{\mu}_n}(x),$ (15)

which is a measurable function of X1,,Xn (this dependency on the data will be implicit for the rest of the paper).
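A direct sketch of (14) and (15) in Python is given below; it assumes only that the samples are hashable symbols from a countable alphabet (the base of the logarithm is a free choice):

```python
from collections import Counter
import math

def empirical_pmf(samples):
    """Empirical measure of (14), returned as a dict over the observed symbols."""
    n = len(samples)
    return {x: c / n for x, c in Counter(samples).items()}

def plugin_entropy(samples, base=2.0):
    """Plug-in entropy estimate of (15)."""
    pmf = empirical_pmf(samples)
    return -sum(p * math.log(p, base) for p in pmf.values())

# Usage: samples from any countable alphabet (here, small integers).
print(plugin_entropy([0, 0, 1, 2, 1, 0, 3, 0]))
```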

For the rest of this Section and Section 4 and Section 5, the convergence results in Section 2.1 are used to derive strong consistency results for plug-in histogram-based estimators, like H(μ^n) in (15), as well as finite length concentration inequalities to obtain almost-sure rates of convergence for the overall estimation error H(μ^n)H(μ).

3.1. Revisiting the Classical Plug-In Estimator for Finite and Unknown Supported Distributions

We start by analyzing the case when μ has a finite but unknown support. A consequence of the strong law of large numbers [42,43] is that, for all x ∈ X, lim_{n→∞} μ̂n(x) = μ(x), Pμ-almost surely (a.s.); hence lim_{n→∞} V(μ̂n, μ) = 0, Pμ-a.s. On the other hand, it is clear that Aμ̂n ⊆ Aμ holds with probability one. Then Proposition 1 implies that

limnD(μ^n||μ)=0andlimnH(μ^n)=H(μ),Pμ-a.s., (16)

i.e., μ^n is a strongly consistent estimator of μ in reverse I-divergence and H(μ^n) is a strongly consistent estimate of H(μ) distribution-free in F(X). Furthermore, the following can be stated:

Theorem 1.

Let μ ∈ F(X) and let us consider μ̂n in (14). Then μ̂n ∈ H(X) ∩ H(X|μ), Pμ-a.s., and for all n ≥ 1 and ϵ > 0,

$P_\mu^n\!\left(D(\hat{\mu}_n||\mu)>\epsilon\right)\leq 2^{|A_\mu|+1}\cdot e^{-\frac{2 m_\mu^2 n\epsilon^2}{(\log e)^2}},$ (17)
$P_\mu^n\!\left(\left|H(\hat{\mu}_n)-H(\mu)\right|>\epsilon\right)\leq 2^{|A_\mu|+1}\cdot e^{-\frac{2 n\epsilon^2}{(M_\mu+\log e/m_\mu)^2}}.$ (18)

Moreover, D(μ||μ̂n) is eventually finite with probability one and, for all ϵ > 0 and n ≥ 1,

$P_\mu^n\!\left(D(\mu||\hat{\mu}_n)>\epsilon\right)\leq 2^{|A_\mu|+1}\cdot\left(e^{-\frac{2 n\epsilon^2}{(\log e)^2(1/m_\mu+1)^2}}+e^{-\frac{n m_\mu^2}{2}}\right).$ (19)

This result implies that, for any τ ∈ (0,1/2) and μ ∈ F(X), |H(μ̂n) − H(μ)|, D(μ̂n||μ), and D(μ||μ̂n) go to zero as o(n^{−τ}), Pμ-a.s. Furthermore, E_{Pμn}|H(μ̂n) − H(μ)| and E_{Pμn}(D(μ̂n||μ)) behave like O(1/√n) for all μ ∈ F(X) from (30) in Section 7, which is the optimal rate of convergence of the finite alphabet scenario. As a corollary of (18), it is possible to derive confidence intervals for the estimation error |H(μ̂n) − H(μ)|: for all δ > 0 and n ≥ 1,

$P_\mu^n\!\left(\left|H(\hat{\mu}_n)-H(\mu)\right|\leq\left(M_\mu+\frac{\log e}{m_\mu}\right)\sqrt{\frac{1}{2n}\ln\frac{2^{|A_\mu|+1}}{\delta}}\right)\geq 1-\delta.$ (20)

This confidence interval behaves like O(1/√n) as a function of n, and like O(√(ln(1/δ))) as a function of δ, which are the same optimal asymptotic trends that can be obtained for V(μ, μ̂n) in (30).
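The following short Python function evaluates the half-width of the interval in (20); it works in nats (so the constant Mμ + log e/mμ becomes log(1/mμ) + 1/mμ), and the support size |Aμ| and minimal mass mμ passed below are hypothetical values used purely for illustration:

```python
import math

def entropy_confidence_bound(n, delta, support_size, m_mu):
    """Half-width of the (1 - delta)-confidence interval for |H(mu_hat_n) - H(mu)| in (20), in nats."""
    c = math.log(1.0 / m_mu) + 1.0 / m_mu
    log_term = (support_size + 1) * math.log(2.0) + math.log(1.0 / delta)
    return c * math.sqrt(log_term / (2.0 * n))

# Hypothetical example: |A_mu| = 10, m_mu = 0.01, n = 10^5 samples, 95% confidence.
print(entropy_confidence_bound(n=100_000, delta=0.05, support_size=10, m_mu=0.01))
```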

Finally, we observe that Aμ̂n ⊆ Aμ, Pμn-a.s., while for any n ≥ 1, Pμn(Aμ̂n ≠ Aμ) > 0, implying that E_{Pμn}(D(μ||μ̂n)) = ∞ for all finite n. Then, even in the finite and unknown supported scenario, μ̂n is not consistent in expected direct I-divergence, which is congruent with the results in [29,44]. Despite this negative result, strong consistency in direct I-divergence can be obtained from (19), in the sense that lim_{n→∞} D(μ||μ̂n) = 0, Pμ-a.s.

3.2. A Simplified Version of the Barron Estimator for Finite Supported Probabilities

It is well understood that consistency in expected direct I-divergence is of critical importance for the construction of a lossless universal source coding scheme [2,23,29,44,45,46,47,48]. Here, we explore an estimator that achieves this learning objective, in addition to entropy estimation. For that, let μ ∈ F(X) and let us assume v ∈ F(X) such that μ ≪ v. Barron et al. [29] proposed a modified version of the empirical measure in (14) to estimate μ from i.i.d. realizations, adopting a mixture estimate of the form

μ˜n(B)=(1an)·μ^n(B)+an·v(B), (21)

for all B ⊆ X, and with (an)_{n∈ℕ} a sequence of real numbers in (0,1). Note that Aμ̃n = Av; then μ ≪ μ̃n for all n and, from the finite support assumption, H(μ̃n) < ∞ and D(μ||μ̃n) < ∞, Pμ-a.s. The following result derives from the convergence result in Lemma 1.
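A minimal Python sketch of the mixture estimate (21) is given below; the prior pmf v and the choice an = n^{−3} are illustrative assumptions (an = n^{−p} with p > 2 matches the regime of Theorem 2(ii) below):

```python
import math
from collections import Counter

def barron_mixture_pmf(samples, prior_pmf, a_n):
    """Mixture estimate (21): (1 - a_n) * empirical + a_n * prior, over the prior's support."""
    n = len(samples)
    counts = Counter(samples)
    return {x: (1.0 - a_n) * counts.get(x, 0) / n + a_n * vx for x, vx in prior_pmf.items()}

def entropy(pmf):
    """Shannon entropy in nats of a pmf given as a dict."""
    return -sum(p * math.log(p) for p in pmf.values() if p > 0.0)

# Usage sketch: v uniform over {0,...,9}, a_n = n^{-3}.
samples = [0, 1, 1, 2, 0, 3, 1, 0]
v = {x: 0.1 for x in range(10)}
mu_tilde = barron_mixture_pmf(samples, v, a_n=len(samples) ** -3)
print(entropy(mu_tilde))
```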

Theorem 2.

Let v ∈ F(X), μ ≪ v, and let us consider μ̃n in (21) induced from i.i.d. realizations of μ.

  • (i) 

    If (an) is o(1), then limnH(μ˜n)=H(μ), limnD(μ||μ˜n)=0, Pμ-a.s., and limnEPμ(D(μ||μ˜n))=0.

  • (ii) 

Furthermore, if (an) is O(n^{−p}) with p > 2, then for all τ ∈ (0,1/2), |H(μ̃n) − H(μ)| and D(μ||μ̃n) are o(n^{−τ}), Pμ-a.s., and E_{Pμ}(|H(μ̃n) − H(μ)|) and E_{Pμ}(D(μ||μ̃n)) are O(1/√n).

Using this approach, we achieve estimation of the true distribution in expected information divergence as well as strong consistency for entropy estimation as intended. In addition, optimal rates of convergence are obtained under the finite support assumption on μ.

4. The Barron-Györfi-van der Meulen Estimator

The celebrated Barron estimator was proposed by Barron, Györfi and van der Meulen [29] in the context of an abstract and continuous measurable space. It is designed as a variation of the classical histogram-based scheme to achieve a consistent estimation of the distribution in direct I-divergence [29] (Theorem 2). Here, the Barron estimator is revisited in the countable alphabet scenario, with the objective of estimating the Shannon entropy consistently, which, to the best of our knowledge, has not been previously addressed in literature. For that purpose, the convergence result in Lemma 3 will be used as a key result.

Let v ∈ P(X) be of infinite support (i.e., mv ≡ inf_{x∈Av} v(x) = 0). We want to construct a strongly consistent estimate of the entropy restricted to the collection of probabilities in H(X|v). For that, let us consider a sequence (hn)_{n≥0} with values in (0,1) and let us denote by πn = {An,1, An,2, …, An,mn} the finite partition of X with maximal cardinality satisfying

$v(A_{n,i})\geq h_n,\quad\forall i\in\{1,\dots,m_n\}.$ (22)

Note that mn = |πn| ≤ 1/hn for all n ≥ 1 and, because inf_{x∈Av} v(x) = 0, it is simple to verify that if (hn) is o(1) then lim_{n→∞} mn = ∞. Thus πn offers an approximately statistically equivalent partition of X with respect to the reference measure v. In this context, given X1,…,Xn, i.i.d. realizations of μ ∈ H(X|v), the idea proposed by Barron et al. [29] is to estimate the RN derivative dμ/dv(x) by the following histogram-based construction:

$\frac{d\mu_n^*}{dv}(x)=(1-a_n)\cdot\frac{\hat{\mu}_n(A_n(x))}{v(A_n(x))}+a_n,\quad\forall x\in A_v,$ (23)

where an is a real number in (0,1), An(x) denotes the cell in πn that contains the point x, and μ^n is the empirical measure in (14). Note that

$f_{\mu_n^*}(x)=\frac{d\mu_n^*}{d\lambda}(x)=f_v(x)\cdot\left[(1-a_n)\cdot\frac{\hat{\mu}_n(A_n(x))}{v(A_n(x))}+a_n\right],$

for all x ∈ X and, consequently, for all B ⊆ X,

$\mu_n^*(B)=(1-a_n)\sum_{i=1}^{m_n}\hat{\mu}_n(A_{n,i})\cdot\frac{v(B\cap A_{n,i})}{v(A_{n,i})}+a_n\, v(B).$ (24)

By construction, Aμ ⊆ Av = Aμn* and, consequently, μ ≪ μn* for all n ≥ 1. The next result shows sufficient conditions on the sequences (an) and (hn) that guarantee a strongly consistent estimation of the entropy H(μ) and of μ in direct I-divergence, distribution-free in H(X|v). The proof is based on verifying that the sufficient conditions of Lemma 3 are satisfied Pμ-a.s. A sketch of this construction on a countable alphabet is given below.
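The following Python sketch assumes X is identified with a (truncated) set of integers and that v is given as a pmf dictionary. The greedy consecutive grouping is one illustrative way of producing cells satisfying (22); the construction in the text additionally asks for maximal cardinality, which the greedy grouping only approximates:

```python
from collections import Counter

def greedy_partition(v_pmf, h_n, alphabet):
    """Group consecutive symbols into cells of v-mass >= h_n (cf. (22));
    the remainder is merged into the last cell."""
    cells, current, mass = [], [], 0.0
    for x in alphabet:
        current.append(x)
        mass += v_pmf[x]
        if mass >= h_n:
            cells.append(current)
            current, mass = [], 0.0
    if current:                       # absorb the tail into the last cell
        if cells:
            cells[-1].extend(current)
        else:
            cells.append(current)
    return cells

def barron_pmf(samples, v_pmf, cells, a_n):
    """Histogram-based pmf of mu_n^* from (23)-(24) over the support of v."""
    n = len(samples)
    counts = Counter(samples)
    cell_of = {x: i for i, cell in enumerate(cells) for x in cell}
    emp_cell = [sum(counts.get(x, 0) for x in cell) / n for cell in cells]
    v_cell = [sum(v_pmf[x] for x in cell) for cell in cells]
    return {x: v_pmf[x] * ((1.0 - a_n) * emp_cell[cell_of[x]] / v_cell[cell_of[x]] + a_n)
            for x in v_pmf}
```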

Theorem 3.

Let v be in P(X) ∩ H(X) with infinite support, and let us consider μ in H(X|v). If we have that:

  • (i) 

    (an) is o(1) and (hn) is o(1),

  • (ii) 

∃ τ ∈ (0,1/2) such that the sequence (1/(an·hn)) is o(n^τ),

then μ ∈ H(X) ∩ H(X|μn*) for all n ≥ 1 and

$\lim_{n\rightarrow\infty}H(\mu_n^*)=H(\mu)\quad\text{and}\quad\lim_{n\rightarrow\infty}D(\mu||\mu_n^*)=0,\quad P_\mu\text{-a.s.}$ (25)

This result shows an admissible regime for the design parameters, and their scaling with the number of samples, that guarantees that the Barron plug-in entropy estimator is strongly consistent in H(X|v). As a byproduct, we obtain that the distribution μ is estimated consistently in direct information divergence.

The Barron estimator [29] was originally proposed in the context of distributions defined in an abstract measurable space. Then if we restrict [29] (Theorem 2) to the countable alphabet case, the following result is obtained:

Corollary 1.

Ref. [29] (Theorem 2) Let us consider v ∈ P(X) and μ ∈ H(X|v). If (an) is o(1), (hn) is o(1) and limsup_{n→∞} 1/(n·an·hn) ≤ 1, then

limnD(μ||μn*)=0,Pμ-a.s.

When the only objective is the estimation of distributions consistently in direct I-divergence, Corollary 1 should be considered to be a better result than Theorem 3 (Corollary 1 offers weaker conditions than Theorem 3 in particular condition (ii)). The proof of Theorem 3 is based on verifying the sufficient conditions of Lemma 3, where the objective is to achieve the convergence of the entropy, and as a consequence, the convergence in direct I-divergence. Therefore, we can say that the stronger conditions of Theorem 3 are needed when the objective is entropy estimation. This is justified from the observation that convergence in direct I-divergence does not imply entropy convergence in ∞-alphabets, as is discussed in Section 2.1 (see, Lemmas 1 and 3).

5. A Data-Driven Histogram-Based Estimator

Data-driven partitions offer a better approximation to the data distribution in the sample space than conventional non-adaptive histogram-based approaches [34,49]. They have the capacity to improve the approximation quality of histogram-based learning schemes, which translates in better performance in different non-parametric learning settings [34,35,36,50,51]. One of the basic design principles of this approach is to partition or select a sub-set of elements of X in a data-dependent way to preserve a critical number of samples per cell. In our problem, this last condition proves to be crucial to derive bounds for the estimation and approximation errors. Finally, these expressions will be used to propose design solutions that offer an optimal balance between estimation and approximation errors (Theorems 5 and 6).

Given X1,,Xn i.i.d. realizations driven by μH(X) and ϵ>0, let us define the data-driven set

$\Gamma_\epsilon\equiv\left\{x\in\mathbb{X}:\hat{\mu}_n(x)\geq\epsilon\right\},$ (26)

and φϵ ≡ Γϵ^c. Let Πϵ ≡ {{x} : x ∈ Γϵ} ∪ {φϵ} ⊆ 2^X be the data-driven partition with maximal resolution in Γϵ, and let σϵ ≡ σ(Πϵ) be the smallest sigma field that contains Πϵ (as Πϵ is a finite partition, σϵ is the collection of sets that are unions of elements of Πϵ). We propose the conditional empirical probability restricted to Γϵ:

μ^n,ϵμ^n(·|Γϵ). (27)

By construction, it follows that A_{μ̂n,ϵ} = Γϵ ⊆ Aμ, Pμ-a.s., which implies that μ̂n,ϵ ≪ μ for all n ≥ 1. Furthermore, |Γϵ| ≤ 1/ϵ and, importantly in the context of the entropy functional, it follows that

$m_{\hat{\mu}_{n,\epsilon}}\equiv\inf_{x\in\Gamma_\epsilon}\hat{\mu}_n(x)\geq\epsilon.$ (28)

The next result establishes a mild sufficient condition on (ϵn) under which H(μ̂n,ϵn) is strongly consistent, distribution-free in H(X). Considering that we are in the regime where μ̂n,ϵn ≪ μ, Pμ-a.s., the proof of this result uses the convergence result in Lemma 2 as a central tool. A minimal sketch of the estimator is given below.
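A short Python sketch of the data-driven restricted plug-in estimator in (26) and (27) follows; the sample values and the choice ϵn = n^{−1/2.5} (i.e., τ* = 1/(2 + 1/p) for p = 2, cf. Theorem 5 below) are illustrative assumptions:

```python
from collections import Counter
import math

def restricted_plugin_entropy(samples, eps_n):
    """Data-driven plug-in estimate H(mu_hat_{n,eps_n}) from (26)-(27):
    keep only symbols with empirical mass >= eps_n and renormalize."""
    n = len(samples)
    counts = Counter(samples)
    gamma = {x: c / n for x, c in counts.items() if c / n >= eps_n}   # the set Gamma_eps
    mass = sum(gamma.values())                                        # mu_hat_n(Gamma_eps)
    return -sum((p / mass) * math.log(p / mass) for p in gamma.values())

# Usage sketch (entropy in nats).
samples = [0] * 50 + [1] * 30 + [2] * 15 + [3] * 4 + [7] * 1
print(restricted_plugin_entropy(samples, eps_n=len(samples) ** (-1 / 2.5)))
```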

Theorem 4.

If (ϵn) is O(n^{−τ}) with τ ∈ (0,1), then for all μ ∈ H(X),

limnH(μ^n,ϵn)=H(μ),Pμ-a.s.

Complementing Theorem 4, the next result offers almost-sure rates of convergence for a family of distributions with a power tail bounded condition (TBC). In particular, the family of distributions studied in [30] (Theorem 7) is considered.

Theorem 5.

Let us assume that for some p > 1 there are two constants 0 < k0 ≤ k1 and N > 0 such that k0·x^{−p} ≤ μ({x}) ≤ k1·x^{−p} for all x ≥ N. If we consider (ϵn) ≈ (n^{−τ*}) for τ* = 1/(2 + 1/p), then

$\left|H(\mu)-H(\hat{\mu}_{n,\epsilon_n})\right|\ \text{is}\ O\!\left(n^{-\frac{1-1/p}{2+1/p}}\right),\quad P_\mu\text{-a.s.}$

This result shows that, under the mentioned p-power TBC on fμ(·), the plug-in estimator H(μ̂n,ϵn) achieves a rate of convergence to the true limit that is O(n^{−(1−1/p)/(2+1/p)}) with probability one. For the derivation of this result, the approximation sequence (ϵn) is defined as a function of p (adapted to the problem) by finding an optimal tradeoff between estimation and approximation errors while performing a finite-length (non-asymptotic) analysis of the expression |H(μ) − H(μ̂n,ϵn)| (the details of this analysis are presented in Section 7).

It is insightful to look at two extreme regimes of this result: p approaching 1, where the rate is arbitrarily slow (approaching a non-decaying behavior); and p → ∞, where |H(μ) − H(μ̂n,ϵn)| is O(n^{−q}) for all q ∈ (0,1/2), Pμ-a.s. This last power decaying range q ∈ (0,1/2) matches what is achieved in the finite alphabet scenario (for instance in Theorem 1, Equation (18)), which is known to be the optimal rate for finite alphabets.
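For concreteness, the design choice behind Theorem 5 (made explicit in Section 7) can be checked directly; the following short evaluation at τ*(p) = 1/(2 + 1/p) is only a verification of the stated rate, not an additional result:

$\tau^*(p)=\frac{p}{2p+1}:\qquad \text{approximation exponent } \tau^*\frac{p-1}{p}=\frac{p-1}{2p+1},\qquad \text{estimation exponent (up to arbitrarily small slack) } \frac{1-\tau^*/p}{2}=\frac{p}{2p+1},$

so the binding (slower) exponent is $\frac{p-1}{2p+1}=\frac{1-1/p}{2+1/p}$. For example, p = 2 yields O(n^{−1/5}), while p → ∞ approaches the finite alphabet rate O(n^{−1/2}).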

Extending Theorem 5, the following result addresses the more constrained case of distributions with an exponential TBC.

Theorem 6.

Let us consider α > 0 and let us assume that there are k0, k1 with 0 < k0 ≤ k1 and N > 0 such that k0·e^{−αx} ≤ μ({x}) ≤ k1·e^{−αx} for all x ≥ N. If we consider (ϵn) ≈ (n^{−τ}) with τ ∈ (0,1/2), then

$\left|H(\mu)-H(\hat{\mu}_{n,\epsilon_n})\right|\ \text{is}\ O\!\left(n^{-\tau}\log n\right),\quad P_\mu\text{-a.s.}$

Under this more stringent TBC on fμ(·), it is observed that |H(μ) − H(μ̂n,ϵn)| is o(n^{−q}), Pμ-a.s., for any arbitrary q ∈ (0,1/2), by selecting (ϵn) ≈ (n^{−τ}) with q < τ < 1/2. This last condition on τ is universal over α > 0. Remarkably, for any distribution with this exponential TBC, we can approximate (arbitrarily closely) the optimal almost-sure rate of convergence achieved for the finite alphabet problem.

Finally, the finite and unknown supported scenario is revisited, where it is shown that the data-driven estimator exhibits the optimal almost sure convergence rate of the classical plug-in entropy estimator presented in Section 3.1.

Theorem 7.

Let us assume that μ ∈ F(X) and that (ϵn) is o(1). Then for all ϵ > 0 there is N > 0 such that for all n ≥ N,

$P_\mu^n\!\left(\left|H(\hat{\mu}_{n,\epsilon_n})-H(\mu)\right|>\epsilon\right)\leq 2^{|A_\mu|+1}\cdot\left(e^{-\frac{2n\epsilon^2}{(M_\mu+\log e/m_\mu)^2}}+e^{-\frac{n m_\mu^2}{4}}\right).$ (29)

The proof of this result reduces to verifying that μ̂n,ϵn detects Aμ almost surely as n goes to infinity; from this, it follows that H(μ̂n,ϵn) eventually matches the optimal almost-sure performance of H(μ̂n) under the key assumption that μ ∈ F(X). Finally, the concentration bound in (29) implies that |H(μ̂n,ϵn) − H(μ)| is o(n^{−q}) almost surely for all q ∈ (0,1/2), as long as ϵn → 0 with n.

6. Discussion of the Results and Final Remarks

This work shows that entropy convergence results are instrumental to derive new (strongly consistent) estimation results for the Shannon entropy in ∞-alphabets and, as a byproduct, distribution estimators that are strongly consistent in direct and reverse I-divergence. Adopting a set of sufficient conditions for entropy convergence in the context of four plug-in histogram-based schemes, this work shows concrete design conditions where strong consistency for entropy estimation in ∞-alphabets can be obtained (Theorems 2–4). In addition, the relevant case where the target distribution has a finite but unknown support is explored, deriving almost sure rates of convergence results for the overall estimation error (Theorems 1 and 7) that match the optimal asymptotic rate that can be obtained in the finite alphabet version of the problem (i.e., the finite and known supported case).

As the main context of application, this work focuses on a data-driven plug-in estimator that restricts the support where the distribution is estimated. The idea is to have design parameters that control the estimation and approximation error effects and to find an adequate balance between these two learning errors. Adopting the entropy convergence result in Lemma 2, it is shown that this data-driven scheme offers the same universal estimation attributes as the classical plug-in estimate under some mild conditions on its threshold design parameter (Theorem 4). In addition, by addressing the technical task of deriving concrete closed-form expressions for the estimation and approximation errors in this learning context, a solution is presented where almost-sure rates of convergence of the overall estimation error are obtained over families of distributions with concrete tail bounded conditions (Theorems 5 and 6). These results show the capacity that data-driven frameworks offer for adapting aspects of their learning scheme to the complexity of the entropy estimation task in ∞-alphabets.

Concerning the classical plug-in estimator presented in Section 3.1, it is important to mention that the work of Antos et al. [30] shows that lim_{n→∞} H(μ̂n) = H(μ) happens almost surely and distribution-free and, furthermore, provides rates of convergence for families with specific tail-bounded conditions [30] (Theorem 7). Theorem 1 focuses on the case when μ ∈ F(X), where new finite-length deviation inequalities and confidence intervals are derived. From that perspective, Theorem 1 complements the results presented in [30] in the unexplored scenario when μ ∈ F(X). It is also important to mention two results by Ho and Yeung [26] (Theorems 11 and 12) for the plug-in estimator in (15). They derived bounds for Pμn(|H(μ̂n) − H(μ)| ≥ ϵ) and determined confidence intervals under a finite and known support restriction on μ. In contrast, Theorem 1 resolves the case of a finite and unknown supported distribution, which is declared to be a challenging problem from the arguments presented in [26] (Theorem 13) concerning the discontinuity of the entropy.

7. Proof of the Main Results

Proof of Theorem 1.

Let μ be in F(X); then |Aμ| ≤ k for some finite k ≥ 1. From Hoeffding's inequality [28], for all n ≥ 1 and ϵ > 0,

$P_\mu^n\!\left(V(\hat{\mu}_n,\mu)>\epsilon\right)\leq 2^{k+1}\cdot e^{-2n\epsilon^2}\quad\text{and}\quad \mathbb{E}_{P_\mu^n}\!\left(V(\hat{\mu}_n,\mu)\right)\leq\sqrt{\frac{2(k+1)\log 2}{n}}.$ (30)

Considering that μ̂n ≪ μ, Pμ-a.s., we can use Proposition 1 to obtain that

$D(\hat{\mu}_n||\mu)\leq\frac{\log e}{m_\mu}\cdot V(\hat{\mu}_n,\mu)\quad\text{and}\quad\left|H(\hat{\mu}_n)-H(\mu)\right|\leq\left(M_\mu+\frac{\log e}{m_\mu}\right)\cdot V(\hat{\mu}_n,\mu).$ (31)

Hence, (17) and (18) derive from (30).

For the direct I-divergence, let us consider a sequence (xi)i1 and the following function (a stopping time):

$T_o(x_1,x_2,\dots)\equiv\inf\left\{n\geq 1: A_{\hat{\mu}_n(x^n)}=A_\mu\right\}.$ (32)

T_o(x1,x2,…) is the first time at which the support of μ̂n(x^n) equals Aμ and, consequently, the direct I-divergence is finite (since μ ∈ F(X)). In fact, by the uniform convergence of μ̂n to μ (Pμ-a.s.) and the finite support assumption on μ, it is simple to verify that Pμ(T_o(X1,X2,…) < ∞) = 1. Let us define the event:

$B_n\equiv\left\{(x_1,x_2,\dots): T_o(x_1,x_2,\dots)\leq n\right\}\subseteq\mathbb{X}^{\mathbb{N}},$ (33)

i.e., the collection of sequences in X^ℕ for which, at time n, Aμ̂n = Aμ and, consequently, D(μ||μ̂n) < ∞. Restricted to this set,

D(μ||μ^n)xAμ^n||μfμ^n(x)logfμ^n(x)fμ(x)+xAμ\Aμ^n||μfμ^n(x)logfμ(x)fμ^n(x)loge·xAμ^n||μfμ^n(x)·fμ^n(x)fμ(x)1 (34)
+loge·μ(Aμ\Aμ^n||μ)μ^n((Aμ\Aμ^n||μ)) (35)
loge·1/mu+1V(μ,μ^n), (36)

where in the first inequality A_{μ̂n||μ} ≡ {x ∈ Aμ̂n : fμ̂n(x) > fμ(x)}, and the last step is obtained from the definition of the total variation distance. In addition, let us define the ϵ-deviation set A_n^ϵ ≡ {(x1,x2,…) : D(μ||μ̂n(x^n)) > ϵ} ⊆ X^ℕ. Then, by additivity and monotonicity of Pμ, we have that

$P_\mu(A_n^\epsilon)\leq P_\mu(A_n^\epsilon\cap B_n)+P_\mu(B_n^c).$ (37)

By the definition of Bn, (36) and (30), it follows that

$P_\mu(A_n^\epsilon\cap B_n)\leq P_\mu\!\left(V(\mu,\hat{\mu}_n)\cdot\log e\cdot(1/m_\mu+1)>\epsilon\right)\leq 2^{|A_\mu|+1}\cdot e^{-\frac{2n\epsilon^2}{(\log e)^2(1/m_\mu+1)^2}}.$ (38)

On the other hand, for ϵo ∈ (0, mμ), if V(μ, μ̂n) ≤ ϵo then T_o ≤ n. Consequently, B_n^c ⊆ {(x1,x2,…) : V(μ, μ̂n(x^n)) > ϵo} and, again from (30),

$P_\mu(B_n^c)\leq 2^{|A_\mu|+1}\cdot e^{-2n\epsilon_o^2},$ (39)

for all n ≥ 1 and ϵo ∈ (0, mμ). Combining the results in (38) and (39) and considering ϵo = mμ/2 suffices to show the bound in (19). ☐

Proof of Theorem 2.

As (an) is o(1), it is simple to verify that lim_{n→∞} V(μ̃n, μ) = 0, Pμ-a.s. Also note that the support disagreement between μ̃n and μ is bounded by hypothesis; then

$\lim_{n\rightarrow\infty}\tilde{\mu}_n(A_{\tilde{\mu}_n}\setminus A_\mu)\cdot\log\left|A_{\tilde{\mu}_n}\setminus A_\mu\right|\leq\lim_{n\rightarrow\infty}\tilde{\mu}_n(A_{\tilde{\mu}_n}\setminus A_\mu)\cdot\log\left|A_v\right|=0,\quad P_\mu\text{-a.s.}$ (40)

Therefore from Lemma 1, we have the strong consistency of H(μ˜n) and the almost sure convergence of D(μ||μ˜n) to zero. Note that D(μ||μ˜n) is uniformly upper bounded by loge·(1/mμ+1)V(μ,μ˜n) (see (36) in the proof of Theorem 1). Then the convergence in probability of D(μ||μ˜n) implies the convergence of its mean [42], which concludes the proof of the first part.

Concerning rates of convergence, we use the following:

$H(\mu)-H(\tilde{\mu}_n)=\sum_{x\in A_\mu\cap A_{\tilde{\mu}_n}}\left(f_{\tilde{\mu}_n}(x)-f_\mu(x)\right)\log f_\mu(x)+\sum_{x\in A_\mu\cap A_{\tilde{\mu}_n}}f_{\tilde{\mu}_n}(x)\log\frac{f_{\tilde{\mu}_n}(x)}{f_\mu(x)}-\sum_{x\in A_{\tilde{\mu}_n}\setminus A_\mu}f_{\tilde{\mu}_n}(x)\log\frac{1}{f_{\tilde{\mu}_n}(x)}.$ (41)

The absolute value of the first term on the right hand side (RHS) of (41) is bounded by Mμ·V(μ̃n, μ) and the second term is bounded by (log e/mμ)·V(μ̃n, μ), from the assumption that μ ∈ F(X). For the last term, note that fμ̃n(x) = an·v(x) for all x ∈ Aμ̃n\Aμ and that Aμ̃n = Av; then

$0\leq\sum_{x\in A_{\tilde{\mu}_n}\setminus A_\mu}f_{\tilde{\mu}_n}(x)\log\frac{1}{f_{\tilde{\mu}_n}(x)}\leq a_n\cdot\left(H(v)+\log\frac{1}{a_n}\cdot v(A_v\setminus A_\mu)\right).$

On the other hand,

$V(\tilde{\mu}_n,\mu)=\frac{1}{2}\left[\sum_{x\in A_\mu}\left|(1-a_n)\hat{\mu}_n(x)+a_n v(x)-\mu(x)\right|+\sum_{x\in A_v\setminus A_\mu}a_n v(x)\right]\leq(1-a_n)\cdot V(\hat{\mu}_n,\mu)+a_n.$

Integrating these bounds in (41),

$\left|H(\mu)-H(\tilde{\mu}_n)\right|\leq\left(M_\mu+\frac{\log e}{m_\mu}\right)\cdot\left((1-a_n)\cdot V(\hat{\mu}_n,\mu)+a_n\right)+a_n\cdot H(v)+a_n\cdot\log\frac{1}{a_n}\leq K_1\cdot V(\hat{\mu}_n,\mu)+K_2\cdot a_n+a_n\cdot\log\frac{1}{a_n},$ (42)

for constants K1>0 and K2>0 function of μ and v.

Under the assumption that μF(X), the Hoeffding’s inequality [28,52] tells us that Pμ(V(μ^n,μ)>ϵ)C1·eC2nϵ2 (for some distribution free constants C1>0 and C2>0). From this inequality, V(μ^n,μ) goes to zero as o(nτ) Pμ-a.s. τ(0,1/2) and EPμ(V(μ^n,μ)) is O(1/n). On the other hand, under the assumption in ii) (K2·an+an·log1an) is O(1/n), which from (42) proves the rate of convergence results for H(μ)H(μ˜n).

Considering the direct I-divergence, D(μ||μ̃n) ≤ log e·Σ_{x∈Aμ} fμ(x)·|fμ(x)/fμ̃n(x) − 1| ≤ (log e/mμ̃n)·V(μ̃n, μ), where mμ̃n ≡ min_{x∈Aμ} fμ̃n(x). Then the uniform convergence of μ̃n(x) to μ(x), Pμ-a.s., on Aμ and the fact that |Aμ| < ∞ imply that for an arbitrarily small ϵ > 0 (in particular smaller than mμ),

$\lim_{n\rightarrow\infty}D(\mu||\tilde{\mu}_n)\leq\frac{\log e}{m_\mu-\epsilon}\cdot\lim_{n\rightarrow\infty}V(\tilde{\mu}_n,\mu),\quad P_\mu\text{-a.s.}$ (43)

(43) suffices to obtain the convergence result for the I-divergence. ☐

Proof of Theorem 3.

Let us define the oracle Barron measure μ˜n by:

$f_{\tilde{\mu}_n}(x)=\frac{d\tilde{\mu}_n}{d\lambda}(x)=f_v(x)\left[(1-a_n)\cdot\frac{\mu(A_n(x))}{v(A_n(x))}+a_n\right],$ (44)

where we consider the true probability instead of its empirical version in (23). Then, the following convergence result can be obtained (see Proposition A2 in Appendix B),

$\lim_{n\rightarrow\infty}\sup_{x\in A_{\tilde{\mu}_n}}\left|\frac{d\tilde{\mu}_n}{d\mu_n^*}(x)-1\right|=0,\quad P_\mu\text{-a.s.}$ (45)

Let A denote the collection of sequences x1,x2, where the convergence in (45) is holding (this set is typical meaning that Pμ(A)=1). The rest of the proof reduces to show that for any arbitrary (xn)n1A, its respective sequence of induced measures μn*:n1 (the dependency of μn* on the sequence (xn)n1 will be considered implicit for the rest of the proof) satisfies the sufficient conditions of Lemma 3.

Let us fix an arbitrary (xn)n1A:

Weak convergence μn*μ: Without loss of generality we consider that Aμ˜n=Av for all n1. Since an0 and hn0, fμ˜n(x)μ(x) xAv, we got the weak convergence of μ˜n to μ. On the other hand by definition of A, limnsupxAμ˜nfμ˜n(x)fμn*(x)1=0 that implies that limnfμn*(x)fμ˜n(x)=0 for all xAv and, consequently, μn*μ.

The condition in (12): By construction μμn*, μμ˜n and μ˜nμn* for all n, then we will use the following equality:

$\log\frac{d\mu}{d\mu_n^*}(x)=\log\frac{d\mu}{d\tilde{\mu}_n}(x)+\log\frac{d\tilde{\mu}_n}{d\mu_n^*}(x),$ (46)

for all x ∈ Aμ. Concerning the approximation error term of (46), i.e., log(dμ/dμ̃n(x)), we have for all x ∈ Aμ

$\frac{d\tilde{\mu}_n}{d\mu}(x)=(1-a_n)\frac{\mu(A_n(x))}{\mu(x)}\cdot\frac{v(x)}{v(A_n(x))}+a_n\frac{v(x)}{\mu(x)}.$ (47)

Given that μ ∈ H(X|v), this is equivalent to stating that log(dμ/dv(x)) is bounded μ-almost everywhere, which is equivalent to saying that m ≡ inf_{x∈Aμ} dμ/dv(x) > 0 and M ≡ sup_{x∈Aμ} dμ/dv(x) < ∞. From this, for all A ⊆ Aμ,

$m\cdot v(A)\leq\mu(A)\leq M\cdot v(A).$ (48)

Then we have that, xAμ mMμ(An(x))μ(x)v(x)v(An(x))Mm. Therefore for n sufficient large, 0<12mMdμ˜ndμ(x)Mm+M< for all x in Aμ. Hence, there exists No>0 such that supnNosupxAμlogdμ˜ndμ(x)<.

For the estimation error term of (46), i.e., logdμ˜ndμn*(x), note that from the fact that (xn)A, and the convergence in (45), there exists N1>0 such that for all nN1 supxAμlogdμ˜ndμn*(x)<, given that AμAμ˜n=Av. Then using (46), for all nmaxN0,N1 supxAμlogdμn*dμ(x)<, which verifies (12).

The condition in (13): Defining the function ϕn*(x)1Av\Aμ(x)·fμn*(x)log(1/fμn*(x)), we want to verify that limnXϕn*(x)dλ(x)=0. Considering that (xn)A for all ϵ>0, there exists N(ϵ)>0 such that supxAμ˜nfμ˜n(x)fμn*(x)1<ϵ and then

(1ϵ)fμ˜n(x)<fμn*(x)<(1+ϵ)fμ˜n(x),forallxAv. (49)

From (49), 0ϕn*(x)(1+ϵ)fμ˜n(x)log(1/(1ϵ)fμ˜n(x)) for all nN(ϵ). Analyzing fμ˜n(x) in (44), there are two scenarios: An(x)Aμ= where fμ˜n(x)=anfv(x) and, otherwise, fμ˜n(x)=fv(x)(an+(1an)μ(An(x)Aμ)/v(An(x))). Let us define:

BnxAv\Aμ:An(x)Aμ=andCnxAv\Aμ:An(x)Aμ. (50)

Then for all nN(ϵ),

xXϕn*(x)xAv\Aμ(1+ϵ)fμ˜n(x)log1/((1ϵ)fμ˜n(x))=xBn(1+ϵ)anfv(x)log1(1ϵ)anfv(x)+xXϕ˜n(x), (51)

with ϕ˜n(x)1Cn(x)·(1+ϵ)fμ˜n(x)log1(1ϵ)fμ˜n(x). The left term in (51) is upper bounded by an(1+ϵ)(H(v)+log(1/an)), which goes to zero with n from (an) being o(1) and the fact that vH(X). For the right term in (51), (hn) being o(1) implies that x belongs to Bn eventually (in n) xAv\Aμ, then ϕ˜n(x) tends to zero point-wise as n goes to infinity. On the other hand, for all xCn (see (50)), we have that

11/m+1μ(An(x)Aμ)v(An(x)Aμ)+v(Av\Aμ)μ(An(x))v(An(x))μ(An(x)Aμ)v(An(x)Aμ)M. (52)

These inequalities derive from (48). Consequently for all xX, if n sufficiently large such that an<0.5, then

0ϕ˜n(x)(1+ϵ)(an+(1an)M)fv(x)log1(1ϵ)(an+(1an)m/(m+1))(1+ϵ)(1+M)fv(x)log2(m+1)(1ϵ)+log1fv(x). (53)

Hence from (50), ϕ˜n(x) is bounded by a fix function that is 1(X) by the assumption that vH(X). Then by the dominated convergence theorems [43] and (51),

limnXϕn*(x)limnXϕ˜n(x).

In summary, we have shown that for any arbitrary (xn)A the sufficient conditions of Lemma 3 are satisfied, which proves the result in (25) reminding that Pμ(A)=1 from (45). ☐

Proof of Theorem 4.

Let us first introduce the oracle probability

μϵnμ(·|Γϵn)P(X). (54)

Note that μϵn is a random probability measure (function of the i.i.d sequence X1,,Xn) as Γϵn is a data-driven set, see (26). We will first show that:

limnH(μϵn)=H(μ)andlimnD(μϵn||μ)=0,Pμ-a.s. (55)

Under the assumption on (ϵn) of Theorem 4, limnμ(Γϵn)μ^n(Γϵn)=0, Pμ-a.s. (this result derives from the fact that limnV(μ/σϵn,μ^n/σϵn)=0, Pμ-a.s. , from (63)) In addition, since (ϵn) is o(1) then limnμ^n(Γϵn)=1, which implies that limnμ(Γϵn)=1 Pμ-a.s. From this μϵnμ, Pμ-a.s. Let us consider a sequences (xn) where limnμ(Γϵn)=1. Constrained to that

$\limsup_{n\rightarrow\infty}\sup_{x\in A_\mu}\frac{f_{\mu_{\epsilon_n}}(x)}{f_\mu(x)}=\limsup_{n\rightarrow\infty}\frac{1}{\mu(\Gamma_{\epsilon_n})}<\infty.$ (56)

Then there is N>0 such that supn>NsupxAμfμϵn(x)fμ(x)<. Hence from Lemma 2, limnD(μϵn||μ)= 0 and limnH(μϵn)H(μ)=0. Finally, the set of sequences (xn) where limnμ(Γϵn)=1 has probability one (with respect to Pμ), which proves (55).

For the rest of the proof, we concentrate on the analysis of H(μ^n,ϵn)H(μϵn) that can be attributed to the estimation error aspect of the problem. It is worth noting that by construction Aμ^n,ϵn=Aμϵn=Γϵn, Pμ-a.s.. Consequently, we can use

$H(\hat{\mu}_{n,\epsilon_n})-H(\mu_{\epsilon_n})=\sum_{x\in\Gamma_{\epsilon_n}}\left(\mu_{\epsilon_n}(x)-\hat{\mu}_{n,\epsilon_n}(x)\right)\log\hat{\mu}_{n,\epsilon_n}(x)+D(\mu_{\epsilon_n}||\hat{\mu}_{n,\epsilon_n}).$ (57)

The first term on the RHS of (57) is upper bounded by log1/mμ^nϵn·V(μϵn,μ^n,ϵn) log1/ϵn·V(μϵn,μ^n,ϵn). Concerning the second term on the RHS of (57), it is possible to show (details presented in Appendix C) that

$D(\mu_{\epsilon_n}||\hat{\mu}_{n,\epsilon_n})\leq\frac{2\log\frac{e}{\epsilon_n}}{\mu(\Gamma_{\epsilon_n})}\cdot V(\mu/\sigma_{\epsilon_n},\hat{\mu}_n/\sigma_{\epsilon_n}),$ (58)

where

$V(\mu/\sigma_{\epsilon_n},\hat{\mu}_n/\sigma_{\epsilon_n})\equiv\sup_{A\in\sigma_{\epsilon_n}}\left|\mu(A)-\hat{\mu}_n(A)\right|.$ (59)

In addition, it can be verified (details presented in Appendix D) that

V(μϵn,μ^n,ϵn)K·V(μ/σϵn,μ^n/σϵn), (60)

for some universal constant K>0. Therefore from (57), (58) and (60), there is C>0 such that

$\left|H(\hat{\mu}_{n,\epsilon_n})-H(\mu_{\epsilon_n})\right|\leq\frac{C}{\mu(\Gamma_{\epsilon_n})}\cdot\log\frac{1}{\epsilon_n}\cdot V(\mu/\sigma_{\epsilon_n},\hat{\mu}_n/\sigma_{\epsilon_n}).$ (61)

As mentioned before, μ(Γϵn) goes to 1 almost surely, then we need to concentrate on the analysis of the asymptotic behavior of log1/ϵn·V(μ/σϵn,μ^n/σϵn). From Hoeffding’s inequality [28], we have that δ>0

$P_\mu^n\!\left(\log\frac{1}{\epsilon_n}\cdot V(\mu/\sigma_{\epsilon_n},\hat{\mu}_n/\sigma_{\epsilon_n})>\delta\right)\leq 2^{|\Gamma_{\epsilon_n}|+1}\cdot e^{-\frac{2n\delta^2}{(\log(1/\epsilon_n))^2}},$ (62)

considering that by construction σϵn2Γϵn+121/ϵn+1. Assuming that (ϵn) is O(nτ),

lnPμnlog1/ϵn·V(μ/σϵn,μ^n/σϵn)>δ(nτ+1)ln22nδ2τlogn.

Therefore for all τ(0,1), δ>0 and any arbitrary l(τ,1)

limsupn1nl·lnPμnlog1/ϵn·V(μ/σϵn,μ^n/σϵn)>δ<0. (63)

This last result is sufficient to show that n1Pμnlog1/ϵn·V(μ/σϵn,μ^n/σϵn)>δ< that concludes the argument from the Borel-Cantelli Lemma. ☐

Proof of Theorem 5.

We consider the expression

$\left|H(\mu)-H(\hat{\mu}_{n,\epsilon_n})\right|\leq\left|H(\mu)-H(\mu_{\epsilon_n})\right|+\left|H(\mu_{\epsilon_n})-H(\hat{\mu}_{n,\epsilon_n})\right|$ (64)

to analyze the approximation error and the estimation error terms separately.

• Approximation Error Analysis

Note that H(μ)H(μϵn) is a random object as μϵn in (54) is a function of the data-dependent partition and, consequently, a function of X1,,Xn. In the following, we consider the oracle set

$\tilde{\Gamma}_{\epsilon_n}\equiv\left\{x\in\mathbb{X}:\mu(x)\geq\epsilon_n\right\},$ (65)

and the oracle conditional probability

$\tilde{\mu}_{\epsilon_n}\equiv\mu(\cdot|\tilde{\Gamma}_{\epsilon_n})\in P(\mathbb{X}).$ (66)

Note that Γ˜ϵn is a deterministic function of (ϵn) and so is the measure μ˜ϵn in (66). From definitions and triangular inequality:

H(μ)H(μ˜ϵn)xΓ˜ϵncμ(x)log1μ(x)+log1μ(Γ˜ϵn)+1μ(Γ˜ϵn)1·xΓ˜ϵnμ(x)log1μ(x), (67)

and, similarly, the approximation error is bounded by

H(μ)H(μϵn)xΓϵncμ(x)log1μ(x)+log1μ(Γϵn)+1μ(Γϵn)1·xΓϵnμ(x)log1μ(x). (68)

We denote the RHS of (67) and (68) by aϵn and bϵn(X1,,Xn), respectively.

We can show that if (ϵn) is O(nτ) and τ(0,1/2), then

limsupnbϵn(X1,,Xn)a2ϵn0,Pμ-a.s., (69)

which from (68) implies that H(μ)H(μϵn) is O(a2ϵn), Pμ-a.s. The proof of (69) is presented in Appendix E.

Then, we need to analyze the rate of convergence of the deterministic sequence (a2ϵn). Analyzing the RHS of (67), we recognize two independent terms: the partial entropy sum xΓ˜ϵncμ(x)log1μ(x) and the rest that is bounded asymptotically by μ(Γ˜ϵnc)(1+H(μ)), using the fact that lnxx1 for x1. Here is where the tail condition on μ plays a role. From the tail condition, we have that

μ(Γ˜ϵnc)μ(ko/ϵn)1/p+1,(ko/ϵn)1/p+2,(ko/ϵn)1/p+3,=x(koϵn)1/p+1μ(x)k1·S(koϵn)1/p+1, (70)

where Sxoxxoxp. Similarly as 0,1,,(ko/ϵn)1/pΓ˜ϵn, then

xΓ˜ϵncμ(x)log1μ(x)x(koϵn)1/p+1μ(x)log1μ(x)x(koϵn)1/p+1k1xp·log1k0xpk1logp·R(koϵn)1/p+1+k1log1/k0·S(koϵn)1/p+1, (71)

where Rxoxxoxplogx.

In Appendix F, it is shown that SxoC0·xo1p and RxoC1·xo1p for constants C1>0 and C0>0. Integrating these results in the RHS of (70) and (71) and considering that (ϵn) is O(nτ), we have that both μ(Γ˜ϵnc) and xΓ˜ϵncμ(x)log1μ(x) are O(nτ(p1)p). This implies that our oracle sequence (aϵn) is O(nτ(p1)p).

In conclusion, if ϵn is O(nτ) for τ(0,1/2), it follows that

$\left|H(\mu)-H(\mu_{\epsilon_n})\right|\ \text{is}\ O\!\left(n^{-\frac{\tau(p-1)}{p}}\right),\quad P_\mu\text{-a.s.}$ (72)

• Estimation Error Analysis

Let us consider H(μϵn)H(μ^n,ϵn). From the bound in (61) and the fact that for any τ(0,1), limnμ(Γϵn)=1 Pμ-a.s. from (63), the problem reduces to analyze the rate of convergence of the following random object:

ρn(X1,,Xn)log1ϵn·V(μ/σ(Γϵn),μ^n/σ(Γϵn)). (73)

We will analize, instead, the oracle version of ρn(X1,,Xn) given by:

ξn(X1,,Xn)log1ϵn·V(μ/σ(Γ˜ϵn/2),μ^n/σ(Γ˜ϵn/2)), (74)

where Γ˜ϵxX:μ(x)ϵ is the oracle counterpart of Γϵ in (26). To do so, we can show that if ϵn is O(nτ) with τ(0,1/2), then

liminfnξn(X1,,Xn)ρn(X1,,Xn)0,Pμ-a.s.. (75)

The proof of (75) is presented in Appendix G.

Moving to the almost sure rate of convergence of ξn(X1,,Xn), it is simple to show for our p-power dominating distribution that if (ϵn) is O(nτ) and τ(0,p) then

limnξn(X1,,Xn)=0Pμ-a.s.,

and, more specifically,

ξn(X1,,Xn)iso(nq)forallq(0,(1τ/p)/2),Pμ-a.s.. (76)

The argument is presented in Appendix H.

In conclusion, if ϵn is O(nτ) for τ(0,1/2), it follows that

H(μϵn)H(μ^n,ϵn)isO(nq),Pμ-a.s., (77)

for all q(0,(1τ/p)/2).

• Estimation vs. Approximation Errors

Coming back to (64) and using (72) and (77), the analysis reduces to finding the solution τ* in (0,1/2) that offers the best trade-off between the estimation and approximation error rate:

$\tau^*\equiv\arg\max_{\tau\in(0,1/2)}\ \min\left\{\frac{1-\tau/p}{2},\ \frac{\tau(p-1)}{p}\right\}.$ (78)

It is simple to verify that τ* = 1/2. Then, by considering τ arbitrarily close to the admissible limit 1/2, we can achieve a rate of convergence for |H(μ) − H(μ̂n,ϵn)| that is arbitrarily close to O(n^{−(1−1/p)/2}), Pμ-a.s.

More formally, for any l ∈ (0, (1−1/p)/2) we can take τ ∈ (l/(1−1/p), 1/2) such that |H(μ) − H(μ̂n,ϵn)| is o(n^{−l}), Pμ-a.s., from (72) and (77).

Finally, a simple corollary of this analysis is to consider τ(p) = 1/(2 + 1/p) < 1/2, for which:

$\left|H(\mu)-H(\hat{\mu}_{n,\epsilon_n})\right|\ \text{is}\ O\!\left(n^{-\frac{1-1/p}{2+1/p}}\right),\quad P_\mu\text{-a.s.},$ (79)

which concludes the argument. ☐

Proof of Theorem 6.

The argument follows the proof of Theorem 5. In particular, we use the estimation-approximation error bound:

$\left|H(\mu)-H(\hat{\mu}_{n,\epsilon_n})\right|\leq\left|H(\mu)-H(\mu_{\epsilon_n})\right|+\left|H(\mu_{\epsilon_n})-H(\hat{\mu}_{n,\epsilon_n})\right|,$ (80)

and the following two results derived in the proof of Theorem 5: If (ϵn) is O(nτ) with τ(0,1/2) then (for the approximation error)

H(μ)H(μϵn)isO(a2ϵn)Pμ-a.s., (81)

with aϵn=xΓ˜ϵncμ(x)log1μ(x)+μ(Γ˜ϵnc)(1+H(μ)), while (for the estimation error)

H(μϵn)H(μ^n,ϵn)isO(ξn(X1,,Xn))Pμ-a.s., (82)

with ξn(X1,,Xn)=log1ϵn·V(μ/σ(Γ˜ϵn/2),μ^n/σ(Γ˜ϵn/2)).

For the estimation error, we need to bound the rate of convergence of ξn(X1,,Xn) to zero almost surely. We first note that 1,,xo(ϵn)=Γ˜ϵn with xo(ϵn)=1/αln(k0/ϵn). Then from Hoeffding’s inequality we have that

Pμnξn(X1,,Xn)>δ2(Γ˜ϵn/2)·e2nδ2log(1/ϵn)221/αln(2k0/ϵn)+1·e2nδ2log(1/ϵn)2. (83)

Considering ϵn=O(nτ), an arbitrary sequence (δn) being o(1) and l>0, it follows from (83) that

1nl·lnPμnξn(X1,,Xn)>δn1nlln(2)1/αln(2k0/ϵn)+1n1lδn2log(1/ϵn)2. (84)

We note that the first term in the RHS of (84) is O(1nllogn) and goes to zero for all l>0, while the second term is O(n1lδn2logn2). If we consider δn=O(nq), this second term is O(n12ql·1logn2). Therefore, for any q(0,1/2) we can take an arbitrary l(0,12q] such that Pμnξn(X1,,Xn)>δn is O(enl) from (84). This result implies, from the Borel-Cantelli Lemma, that ξn(X1,,Xn) is o(δn), Pμ-a.s, which in summary shows that H(μϵn)H(μ^n,ϵn) is O(nq) for all q(0,1/2).

For the approximation error, it is simple to verify that:

μ(Γ˜ϵnc)k1·xxo(ϵn)+1eαx=k1·S˜xo(ϵn)+1 (85)

and

xΓ˜ϵncμ(x)log1μ(x)xxo(ϵn)+1k1eαxlog1k0eαx=k1log1k0·S˜xo(ϵn)+1+αloge·k1·R˜xo(ϵn)+1, (86)

where S˜xoxxoeαx and R˜xoxxox·eαx. At this point, it is not difficult to show that S˜xoM1eαxo and R˜xoM2eαxo·xo for some constants M1>0 and M2>0. Integrating these partial steps, we have that

aϵnk1(1+H(μ)+log1k0)·S˜xo(ϵn)+1+αloge·k1·R˜xo(ϵn)+1O1·ϵn+O2·ϵnlog1ϵn (87)

for some constant O1>0 and O2>0. The last step is from the evaluation of xo(ϵn)=1/αln(k0/ϵn). Therefore from (81) and (87), it follows that H(μ)H(μϵn) is O(nτlogn) Pμ-a.s. for all τ(0,1/2).

The argument concludes by integrating in (80) the almost sure convergence results obtained for the estimation and approximation errors. ☐

Proof of Theorem 7.

Let us define the event

$B_n^\epsilon=\left\{x^n\in\mathbb{X}^n:\Gamma_\epsilon(x^n)=A_\mu\right\},$ (88)

that represents the detection of the support of μ from the data for a given ϵ>0 in (26). Note that the dependency on the data for Γϵ is made explicit in this notation. In addition, let us consider the deviation event

$A_n^\epsilon(\mu)=\left\{x^n\in\mathbb{X}^n: V(\mu,\hat{\mu}_n)>\epsilon\right\}.$ (89)

By the hypothesis that |Aμ| < ∞, we have mμ = min_{x∈Aμ} fμ(x) > 0. Therefore, if x^n ∈ (A_n^{mμ/2}(μ))^c then μ̂n(x) ≥ mμ/2 for all x ∈ Aμ, which implies that (B_n^ϵ)^c ⊆ A_n^{mμ/2}(μ) as long as 0 < ϵ ≤ mμ/2. Using the hypothesis that ϵn → 0, there is N > 0 such that for all n ≥ N, (B_n^{ϵn})^c ⊆ A_n^{mμ/2}(μ) and, consequently,

$P_\mu^n\!\left((B_n^{\epsilon_n})^c\right)\leq P_\mu^n\!\left(A_n^{m_\mu/2}(\mu)\right)\leq 2^{k+1}\cdot e^{-\frac{n m_\mu^2}{4}},$ (90)

the last from Hoeffding's inequality, considering k = |Aμ| < ∞.

If we consider the events:

$C_n^\epsilon(\mu)=\left\{x^n\in\mathbb{X}^n:\left|H(\hat{\mu}_{n,\epsilon_n})-H(\mu)\right|>\epsilon\right\}\quad\text{and}$ (91)
$D_n^\epsilon(\mu)=\left\{x^n\in\mathbb{X}^n:\left|H(\hat{\mu}_n)-H(\mu)\right|>\epsilon\right\},$ (92)

and we use the fact that, by definition, μ̂n,ϵn = μ̂n conditioned on B_n^{ϵn}; it follows that C_n^ϵ(μ) ∩ B_n^{ϵn} ⊆ D_n^ϵ(μ). Then, for all ϵ > 0 and n ≥ N,

$P_\mu^n(C_n^\epsilon(\mu))\leq P_\mu^n(C_n^\epsilon(\mu)\cap B_n^{\epsilon_n})+P_\mu^n((B_n^{\epsilon_n})^c)\leq P_\mu^n(D_n^\epsilon(\mu))+P_\mu^n((B_n^{\epsilon_n})^c)\leq 2^{k+1}\left(e^{-\frac{2n\epsilon^2}{(M_\mu+\log e/m_\mu)^2}}+e^{-\frac{n m_\mu^2}{4}}\right),$ (93)

the last inequality from Theorem 1 and (90). ☐

Acknowledgments

The author is grateful to Patricio Parada for his insights and stimulating discussion in the initial stage of this work. The author thanks the anonymous reviewers for their valuable comments and suggestions, and his colleagues Claudio Estevez, Rene Mendez and Ruben Claveria for proofreading this material.

Appendix A. Minimax risk for Finite Entropy Distributions in ∞-Alphabets

Proposition A1.

$R_n^*=\infty.$

For the proof, we use the following lemma that follows from [26] (Theorem 1).

Lemma A1.

Let us fix two arbitrary real numbers δ > 0 and ϵ > 0. Then there are two finitely supported distributions P, Q in H(X) that satisfy D(P||Q) < ϵ while |H(Q) − H(P)| > δ.

The proof of Lemma A1 derives from the same construction presented in the proof of [26] (Theorem 1), i.e., P=(p1,,pL) and a modification of it QM=(p1·(11/M),p2+p1/MM,,pL+p1/MM,p1/MM,,p1/MM) both distribution of finite support and consequently in H(X). It is simple to verify that as M goes to infinity D(P||QM)0 while H(QM)H(P).

Proof. 

For any pair of distribution P, Q in H(X), Le Cam’s two point method [53] shows that:

$R_n^*\geq\frac{1}{4}\left(H(Q)-H(P)\right)^2\exp\left(-n\,D(P||Q)\right).$ (A1)

Adopting Lemma A1 and Equation (A1), for any n and any arbitrary ϵ > 0 and δ > 0, we have that R_n^* > (δ²/4)·exp(−nϵ). Then, exploiting the discontinuity of the entropy in infinite alphabets, we can fix ϵ and make δ arbitrarily large. ☐

Appendix B. Proposition A2

Proposition A2.

Under the assumptions of Theorem 3:

limnsupxAμ˜ndμ˜ndμn*(x)1=0,Pμ-a.s. (A2)

Proof. 

First note that Aμ˜n=Aμn*, then dμ˜ndμn*(x) is finite and xAμ˜n

dμ˜ndμn*(x)=(1an)·μ(An(x))+anv(An(x))(1an)·μ^n(An(x))+anv(An(x)). (A3)

Then by construction,

supxAμ˜ndμ˜ndμn*(x)1supAπnμ^n(A)μ(A)an·hn. (A4)

From Hoeffding’s inequality, we have that ϵ>0

PμnsupAπnμ^n(A)μ(A)>ϵ2·πn·exp2nϵ2. (A5)

By condition ii), given that (1/anhn) is o(nτ) for some τ(0,1/2), then there exists τo(0,1) such that

limn1nτolnPμnsupxAμ˜ndμ˜ndμn*(x)1>ϵlimn1nτoln(2πn)2·(n1τo2anhnϵ)2=.

This implies that PμnsupxAμ˜ndμ˜ndμn*(x)1>ϵ is eventually dominated by a constant time (enτo)n1, which from the Borel-Cantelli Lemma [43] implies that

limnsupxAμ˜ndμ˜ndμn*(x)1=0,Pμ-a.s.. (A6)

Appendix C. Proposition A3

Proposition A3.

D(μϵn||μ^n,ϵn)2logeϵnμ(Γϵn)·V(μ/σϵn,μ^n/σϵn)

Proof. 

From definition,

D(μϵn||μ^n,ϵn)=1μ(Γϵn)xΓϵnfμ(x)logfμ(x)fμ^n(x)+logμ^n(Γϵn)μ(Γϵn). (A7)

For the right term in the RHS of (A7):

logμ^n(Γϵn)μ(Γϵn)log(e)μ(Γϵn)μ^n(Γϵn)μ(Γϵn). (A8)

For the left term in the RHS of (A7):

xΓϵnfμ(x)logfμ(x)fμ^n(x)=xΓϵnfμ(x)fμ^n(x)fμ(x)logfμ(x)fμ^n(x)+xΓϵnfμ(x)>fμ^n(x)ϵnfμ(x)logfμ(x)fμ^n(x)xΓϵnfμ(x)fμ^n(x)fμ(x)logfμ^n(x)fμ(x)+xΓϵnfμ(x)>fμ^n(x)ϵnfμ^n(x)logfμ(x)fμ^n(x)+xΓϵnfμ(x)>fμ^n(x)ϵn(fμ(x)fμ^n(x))·logfμ(x)fμ^n(x)logexΓϵnfμ(x)fμ^n(x)(fμ^n(x)fμ(x))+xΓϵnfμ(x)>fμ^n(x)(fμ(x)fμ^n(x)) (A9)
+log1ϵn·xΓϵnfμ(x)>fμ^n(x)(fμ(x)-fμ^n(x)) (A10)
(loge+log1ϵn)·xΓϵnfμ(x)-fμ^n(x). (A11)

The first inequality in (A9) is by triangular inequality, the second in (A10) is from the fact that lnxx1 for x>0. Finally, from definition of the total variational distance over σϵn in (59) we have that

2·V(μ/σϵn,μ^n/σϵn)=xΓϵnfμ(x)fμ^n(x)+μ^n(Γϵn)μ(Γϵn), (A12)

which concludes the argument from (A7)–(A9). ☐

Appendix D. Proposition A4

Proposition A4.

Considering that (kn), there exists K>0 and N>0 such that nN,

V(μ˜kn,μ^kn,n*)K·V(μ/σkn,μ^n/σkn). (A13)

Proof. 

V(μ˜kn,μ^kn,n*)=12xAμΓknμxμ(Γkn)μ^nxμ^n(Γkn)12μ(Γkn)xAμΓknμ^nxμx+xAμΓknμ^nxμ(Γkn)μ^n(Γkn)1=12μ(Γkn)2·V(μ/σkn,μ^n/σkn)+μ(Γkn)μ^n(Γkn)3·V(μ/σkn,μ^n/σkn)2μ(Γkn). (A14)

By the hypothesis μ(Γkn)1, which concludes the proof. ☐

Appendix E. Proposition A5

Proposition A5.

If ϵn is O(nτ) with τ(0,1/2), then

limsupnbϵn(X1,,Xn)a2ϵn0,Pμa.s..

Proof. 

Let us define the set

Bn=(x1,,xn):Γ˜2ϵnΓϵnXn.

From definition every sequence (x1,,xn)Bn is such that bϵn(x1,,xn)a2ϵn and, consequently, we just need to prove that Pμ(liminfnBn)=Pμ(n1knBk)=1 [42]. Furthermore, if supxΓ˜2ϵnμ^n(x)μ(x)ϵn, then by definition of Γ˜2ϵn in (65), we have that μ^n(x)ϵn for all xΓ2ϵn (i.e., Γ˜2ϵnΓϵn). From this

Pμn(Bnc)PμnsupxΓ˜2ϵnμ^n(x)μ(x)>ϵnΓ˜2ϵn·e2nϵn212ϵn·e2nϵn2, (A15)

from the Hoeffding’s inequality [28,52], the union bound and the fact that by construction Γ˜2ϵn12ϵn. If we consider ϵn=O(nτ) and l>0, we have that:

$\frac{1}{n^l}\cdot\ln\mathbb{P}^n_\mu(B_n^c)\leq\frac{1}{n^l}\ln\left(\frac{1}{2}\cdot n^{\tau}\right)-2n^{1-2\tau-l}.$ (A16)

From (A16), for any $\tau\in(0,1/2)$ there is $l\in(0,1-2\tau]$ such that $\mathbb{P}^n_\mu(B_n^c)$ is bounded by a term $O(e^{-n^l})$. This implies that $\sum_{n\geq1}\mathbb{P}^n_\mu(B_n^c)<\infty$, which (by the Borel–Cantelli Lemma) suffices to show that $\mathbb{P}_\mu\left(\bigcup_{n\geq1}\bigcap_{k\geq n}B_k\right)=1$. ☐
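
To illustrate the Borel–Cantelli step, the following sketch takes $\epsilon_n=n^{-\tau}$ for an arbitrary $\tau\in(0,1/2)$ (an assumption of the illustration) and evaluates partial sums of the bound $\frac{1}{2\epsilon_n}e^{-2n\epsilon_n^2}$ from (A15), showing numerically that they stabilize, consistent with $\sum_{n\geq1}\mathbb{P}^n_\mu(B_n^c)<\infty$.

```python
import numpy as np

# Illustrative choice (assumption of the sketch): eps_n = n^{-tau}, tau in (0, 1/2).
tau = 0.3
n = np.arange(1, 200001, dtype=float)
eps = n ** (-tau)

# Upper bound on P(B_n^c) from (A15): (1/(2 eps_n)) * exp(-2 n eps_n^2).
bound = np.exp(-2.0 * n * eps**2) / (2.0 * eps)

# Partial sums stabilize quickly, consistent with sum_n P(B_n^c) < infinity,
# which is what the Borel-Cantelli argument requires.
for N in (10, 100, 1000, 10000, 200000):
    print(f"sum over n <= {N:>6}: {bound[:N].sum():.6f}")
```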

Appendix F. Auxiliary Results for Theorem 5

Let us first consider the series

$S_{x_o}\equiv\sum_{x\geq x_o}x^{-p}=x_o^{-p}\cdot\left[1+\left(\frac{x_o}{x_o+1}\right)^p+\left(\frac{x_o}{x_o+2}\right)^p+\cdots\right]=x_o^{-p}\cdot\left[\tilde{S}_{x_o,0}+\tilde{S}_{x_o,1}+\cdots+\tilde{S}_{x_o,x_o-1}\right],$ (A17)

where $\tilde{S}_{x_o,j}\equiv\sum_{k\geq1}\left(\frac{x_o}{k\cdot x_o+j}\right)^p$ for all $j\in\{0,\ldots,x_o-1\}$. It is simple to verify that, for all $j\in\{0,\ldots,x_o-1\}$, $\tilde{S}_{x_o,j}\leq\tilde{S}_{x_o,0}=\sum_{k\geq1}k^{-p}<\infty$, given that by hypothesis $p>1$. Consequently, $S_{x_o}\leq x_o^{1-p}\cdot\sum_{k\geq1}k^{-p}$.

Similarly, for the second series we have that:

$R_{x_o}\equiv\sum_{x\geq x_o}x^{-p}\log x=x_o^{-p}\cdot\left[\log(x_o)+\left(\frac{x_o}{x_o+1}\right)^p\log(x_o+1)+\left(\frac{x_o}{x_o+2}\right)^p\log(x_o+2)+\cdots\right]=x_o^{-p}\cdot\left[\tilde{R}_{x_o,0}+\tilde{R}_{x_o,1}+\cdots+\tilde{R}_{x_o,x_o-1}\right],$ (A18)

where $\tilde{R}_{x_o,j}\equiv\sum_{k\geq1}\left(\frac{x_o}{k\cdot x_o+j}\right)^p\cdot\log(k\,x_o+j)$ for all $j\in\{0,\ldots,x_o-1\}$. Note again that $\tilde{R}_{x_o,j}\leq\tilde{R}_{x_o,0}=\sum_{k\geq1}k^{-p}\log(k\,x_o)<\infty$ for all $j\in\{0,\ldots,x_o-1\}$ (using that $t\mapsto t^{-p}\log t$ is decreasing for $t\geq e^{1/p}$) and, consequently, $R_{x_o}\leq x_o^{1-p}\cdot\sum_{k\geq1}k^{-p}\log(k\,x_o)$ from (A18).
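
Both bounds can be checked numerically. In the sketch below, the values of $p$ and $x_o$ and the truncation point of the infinite sums are assumptions of the illustration; it compares truncated versions of $S_{x_o}$ and $R_{x_o}$ with $x_o^{1-p}\sum_{k\geq1}k^{-p}$ and $x_o^{1-p}\sum_{k\geq1}k^{-p}\log(k\,x_o)$, respectively.

```python
import numpy as np

def check_series_bounds(p, x0, N=1_000_000):
    """Truncated numerical check of the bounds on S_{x0} and R_{x0} (Appendix F)."""
    x = np.arange(x0, N, dtype=float)
    k = np.arange(1, N, dtype=float)
    S = np.sum(x ** (-p))                              # S_{x0} (truncated)
    R = np.sum(x ** (-p) * np.log2(x))                 # R_{x0} (truncated)
    S_bound = x0 ** (1 - p) * np.sum(k ** (-p))
    R_bound = x0 ** (1 - p) * np.sum(k ** (-p) * np.log2(k * x0))
    return S, S_bound, R, R_bound

for p, x0 in [(1.5, 5), (2.0, 10), (3.0, 3)]:          # illustrative values only
    S, Sb, R, Rb = check_series_bounds(p, x0)
    print(f"p={p}, x0={x0}:  S={S:.4f} <= {Sb:.4f}   R={R:.4f} <= {Rb:.4f}")
```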

Appendix G. Proposition A6

Proposition A6.

If $(\epsilon_n)$ is $O(n^{-\tau})$ with $\tau\in(0,1/2)$, then

$\liminf_{n\rightarrow\infty}\left(\xi_n(X_1,\ldots,X_n)-\rho_n(X_1,\ldots,X_n)\right)\geq 0,\quad\mathbb{P}_\mu\text{-a.s.}$

Proof. 

By definition, if $\sigma(\Gamma_{\epsilon_n})\subseteq\sigma(\tilde{\Gamma}_{\epsilon_n/2})$, then $\xi_n(X_1,\ldots,X_n)\geq\rho_n(X_1,\ldots,X_n)$. Consequently, if we define the set:

$B_n\equiv\left\{(x_1,\ldots,x_n):\sigma(\Gamma_{\epsilon_n})\subseteq\sigma(\tilde{\Gamma}_{\epsilon_n/2})\right\},$ (A19)

then the proof reduces to verifying that $\mathbb{P}_\mu(\liminf_{n}B_n)=\mathbb{P}_\mu\left(\bigcup_{n\geq1}\bigcap_{k\geq n}B_k\right)=1$.

On the other hand, if $\sup_{x\in\Gamma_{\epsilon_n}}\left|\hat{\mu}_n(x)-\mu(x)\right|\leq\epsilon_n/2$, then by the definition of $\Gamma_{\epsilon}$, $\mu(x)\geq\epsilon_n/2$ for all $x\in\Gamma_{\epsilon_n}$, i.e., $\Gamma_{\epsilon_n}\subseteq\tilde{\Gamma}_{\epsilon_n/2}$. In other words,

$C_n\equiv\left\{(x_1,\ldots,x_n):\sup_{x\in\Gamma_{\epsilon_n}}\left|\hat{\mu}_n(x)-\mu(x)\right|\leq\epsilon_n/2\right\}\subseteq B_n.$ (A20)

Finally,

$\mathbb{P}^n_\mu(C_n^c)=\mathbb{P}^n_\mu\left(\sup_{x\in\Gamma_{\epsilon_n}}\left|\hat{\mu}_n(x)-\mu(x)\right|>\epsilon_n/2\right)\leq\left|\Gamma_{\epsilon_n}\right|\cdot e^{-n\epsilon_n^2/2}\leq\frac{1}{\epsilon_n}\cdot e^{-n\epsilon_n^2/2}.$ (A21)

In this context, if we consider $\epsilon_n=O(n^{-\tau})$ and $l>0$, then we have that:

$\frac{1}{n^l}\cdot\ln\mathbb{P}^n_\mu(C_n^c)\leq\frac{\tau\cdot\ln n}{n^l}-\frac{n^{1-2\tau-l}}{2}.$ (A22)

Therefore, for any $\tau\in(0,1/2)$ we can take $l\in(0,1-2\tau]$ such that $\mathbb{P}^n_\mu(C_n^c)$ is bounded by a term $O(e^{-n^l})$. Then, the Borel–Cantelli Lemma tells us that $\mathbb{P}_\mu\left(\bigcup_{n\geq1}\bigcap_{k\geq n}C_k\right)=1$, which concludes the proof from (A20). ☐

Appendix H. Proposition A7

Proposition A7.

For the $p$-power tail dominating distribution stated in Theorem 5, if $(\epsilon_n)$ is $O(n^{-\tau})$ with $\tau\in(0,p)$, then $\xi_n(X_1,\ldots,X_n)$ is $o(n^{-q})$ for all $q\in\left(0,\frac{1-\tau/p}{2}\right)$, $\mathbb{P}_\mu$-a.s.

Proof. 

From Hoeffding's inequality, we have that

$\mathbb{P}^n_\mu\left(\left\{x_1,\ldots,x_n:\xi_n(x_1,\ldots,x_n)>\delta\right\}\right)\leq\left|\sigma(\tilde{\Gamma}_{\epsilon_n/2})\right|\cdot e^{-2n\delta^2\log(1/\epsilon_n)^2}\leq 2^{(2k_0/\epsilon_n)^{1/p}+1}\cdot e^{-2n\delta^2\log(1/\epsilon_n)^2},$ (A23)

the second inequality using that $|\tilde{\Gamma}_{\epsilon}|\leq(k_0/\epsilon)^{1/p}+1$ from the definition of $\tilde{\Gamma}_\epsilon$ in (65) and the tail bounded assumption on $\mu$. If we consider $\epsilon_n=O(n^{-\tau})$ and $l>0$, then we have that:

$\frac{1}{n^l}\cdot\ln\mathbb{P}^n_\mu\left(\left\{x_1,\ldots,x_n:\xi_n(x_1,\ldots,x_n)>\delta\right\}\right)\leq\ln2\cdot\left(Cn^{\tau/p-l}+n^{-l}\right)-2\delta^2\tau^2\cdot n^{1-l}\left(\log n\right)^2$ (A24)

for some constant $C>0$. Then, in order to obtain from (A24) that $\xi_n(X_1,\ldots,X_n)$ converges almost surely to zero, it is sufficient that $l>0$, $l<1$, and $l>\tau/p$. This implies that, if $\tau<p$, there is $l\in(\tau/p,1)$ such that $\mathbb{P}^n_\mu\left(\xi_n(x_1,\ldots,x_n)>\delta\right)$ is bounded by a term $O(e^{-n^l})$ and, consequently, $\lim_{n\rightarrow\infty}\xi_n(X_1,\ldots,X_n)=0$, $\mathbb{P}_\mu$-a.s. (by the same steps used in Appendix G).

Moving to the rate of convergence of $\xi_n(X_1,\ldots,X_n)$ (assuming that $\tau<p$), let us consider $\delta_n=n^{-q}$ for some $q\geq 0$. From (A24):

$\frac{1}{n^l}\cdot\ln\mathbb{P}^n_\mu\left(\left\{x_1,\ldots,x_n:\xi_n(x_1,\ldots,x_n)>\delta_n\right\}\right)\leq\ln2\cdot\left(Cn^{\tau/p-l}+n^{-l}\right)-2\tau^2\cdot n^{1-2q-l}\left(\log n\right)^2.$ (A25)

To make $\xi_n(X_1,\ldots,X_n)$ be $o(n^{-q})$ $\mathbb{P}_\mu$-a.s., a sufficient condition is that $l>0$, $l>\tau/p$, and $l<1-2q$. Therefore (considering that $\tau<p$), the admissibility condition for the existence of an exponential rate of convergence $O(e^{-n^l})$, for some $l>0$, for the deviation event $\left\{x_1,\ldots,x_n:\xi_n(x_1,\ldots,x_n)>\delta_n\right\}$ is that $\tau/p<1-2q$, which is equivalent to $0<q<\frac{1-\tau/p}{2}$. ☐
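
To make the admissibility condition concrete, the following sketch (with hypothetical values of $\tau$ and $p$, used only for illustration) computes the admissible range of the rate exponent $q$ from Proposition A7 and, for a $q$ in that range, the interval of exponents $l$ that yields the $O(e^{-n^l})$ bound in (A25).

```python
def admissible_rates(tau, p):
    """Range of q such that xi_n is o(n^{-q}) per Proposition A7: q in (0, (1 - tau/p)/2)."""
    assert 0 < tau < p, "Proposition A7 requires tau in (0, p)"
    return 0.0, (1.0 - tau / p) / 2.0

def admissible_l(tau, p, q):
    """Exponents l giving an O(exp(-n^l)) bound for the deviation event:
    l > tau/p and l < 1 - 2q (both conditions are needed in (A25))."""
    lo, hi = tau / p, 1.0 - 2.0 * q
    return (lo, hi) if lo < hi else None

for tau, p in [(0.25, 2.0), (0.4, 1.5), (1.0, 3.0)]:   # illustrative values only
    q_lo, q_hi = admissible_rates(tau, p)
    q = 0.5 * (q_lo + q_hi)                            # a rate inside the admissible range
    print(f"tau={tau}, p={p}: q in ({q_lo}, {q_hi:.4f}); "
          f"for q={q:.4f}, l in {admissible_l(tau, p, q)}")
```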

Funding

The work is supported by funding from FONDECYT Grant 1170854, CONICYT-Chile and the Advanced Center for Electrical and Electronic Engineering (AC3E), Basal Project FB0008.

Conflicts of Interest

The author declares no conflict of interest.

References

  • 1. Beirlant J., Dudewicz E., Györfi L., van der Meulen E.C. Nonparametric entropy estimation: An overview. Int. J. Math. Stat. Sci. 1997;6:17–39.
  • 2. Cover T.M., Thomas J.A. Elements of Information Theory. 2nd ed. Wiley; New York, NY, USA: 2006.
  • 3. Kullback S. Information Theory and Statistics. Wiley; New York, NY, USA: 1959.
  • 4. Principe J. Information Theoretic Learning: Renyi Entropy and Kernel Perspective. Springer; New York, NY, USA: 2010.
  • 5. Fisher J.W., III, Wainwright M., Sudderth E., Willsky A.S. Statistical and information-theoretic methods for self-organization and fusion of multimodal, networked sensors. Int. J. High Perform. Comput. Appl. 2002;16:337–353. doi: 10.1177/10943420020160031201.
  • 6. Liu J., Moulin P. Information-theoretic analysis of interscale and intrascale dependencies between image wavelet coefficients. IEEE Trans. Image Process. 2001;10:1647–1658. doi: 10.1109/83.967393.
  • 7. Thévenaz P., Unser M. Optimization of mutual information for multiresolution image registration. IEEE Trans. Image Process. 2000;9:2083–2099. doi: 10.1109/83.887976.
  • 8. Butz T., Thiran J.P. From error probability to information theoretic (multi-modal) signal processing. Signal Process. 2005;85:875–902. doi: 10.1016/j.sigpro.2004.11.027.
  • 9. Kim J., Fisher J.W., III, Yezzi A., Cetin M., Willsky A.S. A nonparametric statistical method for image segmentation using information theory and curve evolution. IEEE Trans. Image Process. 2005;14:1486–1502. doi: 10.1109/tip.2005.854442.
  • 10. Padmanabhan M., Dharanipragada S. Maximizing information content in feature extraction. IEEE Trans. Speech Audio Process. 2005;13:512–519. doi: 10.1109/TSA.2005.848876.
  • 11. Silva J., Narayanan S. Minimum probability of error signal representation; Presented at the IEEE Workshop on Machine Learning for Signal Processing; Thessaloniki, Greece, 27–29 August 2007; pp. 348–353.
  • 12. Silva J., Narayanan S. Discriminative wavelet packet filter bank selection for pattern recognition. IEEE Trans. Signal Process. 2009;57:1796–1810. doi: 10.1109/TSP.2009.2013898.
  • 13. Gokcay E., Principe J.C. Information theoretic clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2002;24:158–171. doi: 10.1109/34.982897.
  • 14. Arellano-Valle R.B., Contreras-Reyes J.E., Stehlik M. Generalized skew-normal negentropy and its application to fish condition factor time series. Entropy. 2017;19:528. doi: 10.3390/e19100528.
  • 15. Lake D.E. Nonparametric entropy estimation using kernel densities. Methods Enzymol. 2009;467:531–546. doi: 10.1016/S0076-6879(09)67020-8.
  • 16. Van der Vaart A.W. Asymptotic Statistics. Volume 3. Cambridge University Press; Cambridge, UK: 2000.
  • 17. Wu Y., Yang P. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Trans. Inf. Theory. 2016;62:3702–3720. doi: 10.1109/TIT.2016.2548468.
  • 18. Jiao J., Venkat K., Han Y., Weissman T. Minimax estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory. 2015;61:2835–2885. doi: 10.1109/TIT.2015.2412945.
  • 19. Paninski L. Estimating entropy on m bins given fewer than m samples. IEEE Trans. Inf. Theory. 2004;50:2200–2203. doi: 10.1109/TIT.2004.833360.
  • 20. Valiant G., Valiant P. Estimating the unseen: An n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs; Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing; San Jose, CA, USA, 6–8 June 2011; pp. 685–694.
  • 21. Valiant G., Valiant P. A CLT and Tight Lower Bounds for Estimating Entropy. Technical Report TR10-179, Volume 17, p. 9; Electronic Colloquium on Computational Complexity; Potsdam, Germany: 2011.
  • 22. Braess D., Forster J., Sauer T., Simon H.U. How to achieve minimax expected Kullback-Leibler distance from an unknown finite distribution; Proceedings of the International Conference on Algorithmic Learning Theory; Lübeck, Germany, 24–26 November 2002; Springer; Berlin/Heidelberg, Germany: 2002; pp. 380–394.
  • 23. Csiszár I., Shields P.C. Information theory and statistics: A tutorial. In: Foundations and Trends® in Communications and Information Theory. Now Publishers Inc.; Breda, The Netherlands: 2004; pp. 417–528.
  • 24. Ho S.W., Yeung R.W. On the discontinuity of the Shannon information measures. IEEE Trans. Inf. Theory. 2009;55:5362–5374.
  • 25. Silva J., Parada P. Shannon entropy convergence results in the countable infinite case; Proceedings of the International Symposium on Information Theory; Cambridge, MA, USA, 1–6 July 2012; pp. 155–159.
  • 26. Ho S.W., Yeung R.W. The interplay between entropy and variational distance. IEEE Trans. Inf. Theory. 2010;56:5906–5929. doi: 10.1109/TIT.2010.2080452.
  • 27. Harremoës P. Information topologies with applications. In: Csiszár I., Katona G.O.H., Tardos G., editors. Entropy, Search, Complexity. Volume 16. Springer; New York, NY, USA: 2007; pp. 113–150.
  • 28. Devroye L., Lugosi G. Combinatorial Methods in Density Estimation. Springer; New York, NY, USA: 2001.
  • 29. Barron A., Györfi L., van der Meulen E.C. Distribution estimation consistent in total variation and in two types of information divergence. IEEE Trans. Inf. Theory. 1992;38:1437–1454. doi: 10.1109/18.149496.
  • 30. Antos A., Kontoyiannis I. Convergence properties of functional estimates for discrete distributions. Random Struct. Algorithms. 2001;19:163–193. doi: 10.1002/rsa.10019.
  • 31. Piera F., Parada P. On convergence properties of Shannon entropy. Probl. Inf. Transm. 2009;45:75–94. doi: 10.1134/S003294600902001X.
  • 32. Berlinet A., Vajda I., van der Meulen E.C. About the asymptotic accuracy of Barron density estimates. IEEE Trans. Inf. Theory. 1998;44:999–1009. doi: 10.1109/18.669143.
  • 33. Vajda I., van der Meulen E.C. Optimization of Barron density estimates. IEEE Trans. Inf. Theory. 2001;47:1867–1883. doi: 10.1109/18.930924.
  • 34. Lugosi G., Nobel A.B. Consistency of data-driven histogram methods for density estimation and classification. Ann. Stat. 1996;24:687–706.
  • 35. Silva J., Narayanan S. Information divergence estimation based on data-dependent partitions. J. Stat. Plan. Inference. 2010;140:3180–3198. doi: 10.1016/j.jspi.2010.04.011.
  • 36. Silva J., Narayanan S.N. Nonproduct data-dependent partitions for mutual information estimation: Strong consistency and applications. IEEE Trans. Signal Process. 2010;58:3497–3511. doi: 10.1109/TSP.2010.2046077.
  • 37. Kullback S., Leibler R. On information and sufficiency. Ann. Math. Stat. 1951;22:79–86. doi: 10.1214/aoms/1177729694.
  • 38. Gray R.M. Entropy and Information Theory. Springer; New York, NY, USA: 1990.
  • 39. Kullback S. A lower bound for discrimination information in terms of variation. IEEE Trans. Inf. Theory. 1967;13:126–127. doi: 10.1109/TIT.1967.1053968.
  • 40. Csiszár I. Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 1967;2:299–318.
  • 41. Kemperman J. On the optimum rate of transmitting information. Ann. Math. Stat. 1969;40:2156–2177. doi: 10.1214/aoms/1177697293.
  • 42. Breiman L. Probability. Addison-Wesley; Boston, MA, USA: 1968.
  • 43. Varadhan S. Probability Theory. American Mathematical Society; Providence, RI, USA: 2001.
  • 44. Györfi L., Páli I., van der Meulen E.C. There is no universal source code for an infinite source alphabet. IEEE Trans. Inf. Theory. 1994;40:267–271. doi: 10.1109/18.272495.
  • 45. Rissanen J. Information and Complexity in Statistical Modeling. Springer; New York, NY, USA: 2007.
  • 46. Boucheron S., Garivier A., Gassiat E. Coding on countably infinite alphabets. IEEE Trans. Inf. Theory. 2009;55:358–373. doi: 10.1109/TIT.2008.2008150.
  • 47. Silva J.F., Piantanida P. The redundancy gains of almost lossless universal source coding over envelope families; Proceedings of the IEEE International Symposium on Information Theory; Aachen, Germany, 25–30 June 2017; pp. 2003–2007.
  • 48. Silva J.F., Piantanida P. Almost lossless variable-length source coding on countably infinite alphabets; Proceedings of the IEEE International Symposium on Information Theory; Barcelona, Spain, 10–15 July 2016; pp. 1–5.
  • 49. Nobel A.B. Histogram regression estimation using data-dependent partitions. Ann. Stat. 1996;24:1084–1105. doi: 10.1214/aos/1032526958.
  • 50. Silva J., Narayanan S. Complexity-regularized tree-structured partition for mutual information estimation. IEEE Trans. Inf. Theory. 2012;58:1940–1952. doi: 10.1109/TIT.2011.2177771.
  • 51. Darbellay G.A., Vajda I. Estimation of the information by an adaptive partition of the observation space. IEEE Trans. Inf. Theory. 1999;45:1315–1321. doi: 10.1109/18.761290.
  • 52. Devroye L., Györfi L., Lugosi G. A Probabilistic Theory of Pattern Recognition. Springer; New York, NY, USA: 1996.
  • 53. Tsybakov A.B. Introduction to Nonparametric Estimation. Springer; New York, NY, USA: 2009.
