Abstract
This work addresses the problem of Shannon entropy estimation in countably infinite alphabets by studying and adopting some recent convergence results for the entropy functional, which is known to be discontinuous with respect to the distribution in ∞-alphabets. Sufficient conditions for the convergence of the entropy are used in conjunction with deviation inequalities (covering scenarios with both finitely and infinitely supported assumptions on the target distribution). From this perspective, four plug-in histogram-based estimators are studied, showing that the convergence results are instrumental for deriving new strongly consistent estimators of the entropy. The main application of this methodology is a new data-driven partition (plug-in) estimator. This scheme uses the data to restrict the support where the distribution is estimated by finding an optimal balance between estimation and approximation errors. The proposed scheme offers a consistent (distribution-free) estimator of the entropy in ∞-alphabets and optimal rates of convergence under certain regularity conditions on the problem (a finite but unknown support and tail-bounded conditions on the target distribution).
Keywords: Shannon entropy estimation, countably infinite alphabets, entropy convergence results, statistical learning, histogram-based estimators, data-driven partitions, strong consistency, rates of convergence
1. Introduction
Shannon entropy estimation has a long history in information theory, statistics, and computer science [1]. Entropy and related information measures (conditional entropy and mutual information) have a fundamental role in information theory and statistics [2,3] and, as a consequence, they have found numerous applications in learning and decision-making tasks [4,5,6,7,8,9,10,11,12,13,14,15]. In many of these contexts, distributions are not available and the entropy needs to be estimated from empirical data. This problem belongs to the category of scalar functional estimation, which has been thoroughly studied in non-parametric statistics.
Starting with the finite alphabet scenario, the classical plug-in estimator (i.e., the entropy functional evaluated at the empirical distribution) is well known to be consistent, minimax optimal, and asymptotically efficient [16] (Section 8.7–8.9). More recent research has focused on the so-called large alphabet (or large dimensional) regime, meaning a non-asymptotic under-sampling regime where the number of samples n is on the order of, or even smaller than, the size of the alphabet, denoted by k. In this context, it has been shown that the classical plug-in estimator is sub-optimal as it suffers from severe bias [17,18]. For characterizing optimality in this high dimensional context, a non-asymptotic minimax mean square error analysis (under a finite n and k) has been explored by several authors [17,18,19,20,21] considering the minimax risk
where denotes the collection of probabilities on and is the entropy of (details in Section 2). Paninski [19] first showed that it was possible to construct an entropy estimator that uses a sub-linear sampling size to achieve minimax consistency when k goes to infinity, in the sense that there is a sequence where as k goes to infinity. A set of results by Valiant and Valiant [20,21] shows that the optimal scaling of the sampling size with respect to k is O(k/log(k)) to achieve the aforementioned asymptotic consistency for entropy estimation. A refined set of results for the complete characterization of , the specific scaling of the sampling complexity, and the achievability of the obtained minimax risk for the family with practical estimators have been presented in [17,18]. On the other hand, it is well known that the problem of estimating the distribution (consistently in total variation) in finite alphabets requires a sampling complexity that scales as O(k) [22]. Consequently, in finite alphabets the task of entropy estimation is simpler than estimating the distribution in terms of sampling complexity. These findings are consistent with the observation that the entropy is a continuous functional on the space of distributions (in the total variational distance sense) in the finite alphabet case [2,23,24,25].
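As a purely illustrative aside (not part of the analyses cited above), the severe bias of the plug-in estimator in the under-sampled regime is easy to reproduce numerically; in the sketch below, the uniform target, the alphabet size k, and the sample sizes are arbitrary choices made only for this demonstration.

```python
import numpy as np

def plugin_entropy(samples):
    """Classical plug-in estimate: the entropy (in nats) of the empirical distribution."""
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

rng = np.random.default_rng(0)
k = 10_000                       # alphabet size
true_H = np.log(k)               # entropy of the uniform distribution on k symbols

for n in (k // 10, k, 10 * k):   # under-sampled, comparable, and over-sampled regimes
    estimates = [plugin_entropy(rng.integers(0, k, size=n)) for _ in range(20)]
    print(f"n = {n:6d}: average plug-in bias = {np.mean(estimates) - true_H:+.3f} nats")
```

In the under-sampled runs the estimate systematically falls short of log(k), which is the bias phenomenon that motivates the large-alphabet estimators of [17,18,20,21].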
1.1. The Challenging Infinite Alphabet Learning Scenario
In this work, we are interested in the countably infinite alphabet scenario, i.e., on the estimation of the entropy when the alphabet is countably infinite and we have a finite number of samples. This problem can be seen as an infinite dimensional regime as the size of the alphabet goes unbounded and n is kept finite for the analysis, which differs from the large dimensional regime mentioned above. As argued in [26] (Section IV), this is a challenging non-parametric learning problem because some of the finite alphabet properties of the entropy do not extend to this infinite dimensional context. Notably, it has been shown that the Shannon entropy is not a continuous functional with respect to the total variational distance in infinite alphabets [24,26,27]. In particular, Ho et al. [24] (Theorem 2) showed concrete examples where convergence in -divergence and in direct information divergence (I-divergence) of a set of distributions to a limit, both stronger than total variational convergence [23,28], do not imply the convergence of the entropy. In addition, Harremoës [27] showed the discontinuity of the entropy with respect to the reverse I-divergence [29], and consequently, with respect to the total variational distance (the distinction between reverse and direct I-divergence was pointed out in the work of Barron et al. [29]). In entropy estimation, the discontinuity of the entropy implies that the minimax mean square error goes unbounded, i.e.,
where denotes the family of finite entropy distributions over the countable alphabet set (the proof of this result follows from [26] (Theorem 1) and the argument is presented in Appendix A). Consequently, there is no universal minimax consistent estimator (in the mean square error sense) of the entropy over the family of finite entropy distributions.
Considering a sample-wise (or point-wise) convergence to zero of the estimation error (instead of the minimax expected error analysis mentioned above), Antos et al. [30] (Theorem 2 and Corollary 1) show the remarkable result that the classical plug-in estimate is strongly consistent and consistent in the mean square error sense for any finite entropy distribution (point-wise). Then, the classical plug-in entropy estimator is universal, meaning that the convergence to the right limiting value is achieved almost surely despite the discontinuity of the entropy. Moving on to the analysis of the (point-wise) rate of convergence of the estimation error, Antos et al. [30] (Theorem 3) present a finite-length lower bound for the error of any arbitrary estimation scheme, showing as a corollary that no universal rate of convergence (to zero) can be achieved for entropy estimation in infinite alphabets [30] (Theorem 4). Finally, constraining the problem to a family of distributions with specific power tail bounded conditions, Antos et al. [30] (Theorem 7) present a finite-length expression for the rate of convergence of the estimation error of the classical plug-in estimate.
1.2. From Convergence Results to Entropy Estimation
In view of the discontinuity of the entropy in ∞-alphabets [24] and the results that guarantee entropy convergence [25,26,27,31], this work revisits the problem of point-wise almost-sure entropy estimation in ∞-alphabets from the perspective of studying and applying entropy convergence results and their derived bounds [25,26,31]. Importantly, entropy convergence results have established concrete conditions on both the limiting distribution and the way a sequence of distributions converges to such that is satisfied. The natural observation that motivates this work is that consistency is basically a convergence to the true entropy value that happens with probability one. Then, our main conjecture is that putting these conditions in the context of a learning task, i.e., where is a random sequence of distributions driven by the classical empirical process, offers the possibility of studying a broad family of plug-in estimators with the objective of deriving new strong consistency and rate of convergence results. On the practical side, this work proposes and analyzes a data-driven histogram-based estimator as a key learning scheme, since this approach offers the flexibility to adapt to the learning task when appropriate bounds for the estimation and approximation errors are derived.
1.3. Contributions
We begin by revisiting the classical plug-in entropy estimator in the relevant scenario where (the unknown distribution that produces the i.i.d. samples) has a finite but arbitrarily large and unknown support. This is declared to be a challenging problem by Ho and Yeung [26] (Theorem 13) because of the discontinuity of the entropy. Finite-length (non-asymptotic) deviation inequalities and intervals of confidence are derived, extending the results presented in [26] (Section IV). From this, it is shown that the classical plug-in estimate achieves optimal rates of convergence. Relaxing the finite support restriction on , two concrete histogram-based plug-in estimators are presented: one built upon the celebrated Barron–Györfi–van der Meulen histogram-based approach [29,32,33], and the other on a data-driven partition of the space [34,35,36]. For the Barron plug-in scheme, almost-sure consistency is shown for entropy estimation and distribution estimation in direct I-divergence under some mild support conditions on . For the data-driven partition scheme, the main context of application of this work, it is shown that this estimator is strongly consistent distribution-free, matching the universal result obtained for the classical plug-in approach in [30]. Furthermore, new almost-sure rate of convergence results (in the estimation error) are obtained for distributions with finite but unknown support and for families of distributions with power and exponential tail dominating conditions. In this context, our results show that this adaptive scheme has a concrete design solution that offers a very good convergence rate for the overall estimation error, as it approaches the rate that is considered optimal for the finite alphabet case [16]. Importantly, the parameter selection of this scheme relies on, first, obtaining expressions to bound the estimation and approximation errors and, second, finding the optimal balance between these two learning errors.
1.4. Organization
The rest of the paper is organized as follows. Section 2 introduces basic concepts and notation, and summarizes the main entropy convergence results used in this work. Section 3, Section 4 and Section 5 state and elaborate the main results of this work. A discussion of the results and final remarks are given in Section 6. The technical derivations of the main results are presented in Section 7. Finally, proofs of auxiliary results are relegated to the Appendices.
2. Preliminaries
Let be a countably infinite set and let denote the collection of probability measures in . For and v in , and absolutely continuous with respect to v (i.e., ), denotes the Radon-Nikodym (RN) derivative of with respect to v. Every is equipped with its probability mass function (pmf) that we denote by , . Finally, for any , denotes its support and
(1) $\mathcal{F}(\mathbb{X}) \equiv \{\mu \in \mathcal{P}(\mathbb{X}) : |\mathrm{supp}(\mu)| < \infty\}$
denotes the collection of probabilities with finite support.
Let and v be in , then the total variation distance of and v is given by [28]
(2) $V(\mu, v) \equiv \sup_{A \in 2^{\mathbb{X}}} |\mu(A) - v(A)|,$
where denotes the subsets of . The Kullback–Leibler divergence or I-divergence of with respect to v is given by
(3) $D(\mu \| v) \equiv \sum_{x \in \mathrm{supp}(\mu)} \mu(x) \log \frac{\mu(x)}{v(x)},$
when is absolutely continuous with respect to v, while it is set to infinity otherwise [37].
The Shannon entropy of is given by [1,2,38]:
(4) $H(\mu) \equiv -\sum_{x \in \mathbb{X}} \mu(x) \log \mu(x).$
In this context, let be the collection of probabilities where (4) is finite, let denote the collection of probabilities absolutely continuous with respect to , and let denote the collection of probabilities where (3) is finite for .
Concerning convergence, a sequence is said to converge in total variation to if
(5) $\lim_{n \to \infty} V(\mu_n, \mu) = 0.$
For countable alphabets, ref. [31] (Lemma 3) shows that the convergence in total variation is equivalent to the weak convergence, which is denoted here by , and the point-wise convergence of the pmf's. Furthermore, from (2), the convergence in total variation implies the uniform convergence of the pmf's, i.e., . Therefore, in this countable case, all four previously mentioned notions of convergence are equivalent: total variation, weak convergence, point-wise convergence of the pmf's, and uniform convergence of the pmf's.
We conclude with the convergence in I-divergence introduced by Barron et al. [29]. It is said that converges to in direct and in reverse I-divergence if and , respectively. From Pinsker’s inequality [39,40,41], the convergence in I-divergence implies the weak convergence in (5), where it is known that the converse result is not true [27].
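For concreteness, the following minimal numerical sketch evaluates the quantities defined in (2)–(4) on truncated pmfs; the truncation of the countable alphabet and the use of the identity that the supremum over subsets equals half of the l1 distance between the pmfs are conveniences of this illustration, not part of the development above.

```python
import numpy as np

def total_variation(p, q):
    """V(p, q): the sup over subsets equals half the l1 distance between the pmfs."""
    return 0.5 * np.abs(p - q).sum()

def kl_divergence(p, q):
    """D(p || q); set to infinity when p is not absolutely continuous w.r.t. q."""
    s = p > 0
    return np.inf if np.any(q[s] == 0) else np.sum(p[s] * np.log(p[s] / q[s]))

def entropy(p):
    """Shannon entropy H(p) in nats."""
    s = p > 0
    return -np.sum(p[s] * np.log(p[s]))

# Two geometric-type pmfs, truncated to a large finite grid purely for illustration.
x = np.arange(1, 500)
mu = 0.5 ** x / np.sum(0.5 ** x)
nu = 0.4 * 0.6 ** (x - 1) / np.sum(0.4 * 0.6 ** (x - 1))

print(total_variation(mu, nu))   # small total variation distance between the two pmfs
print(kl_divergence(nu, mu))     # reverse I-divergence D(nu || mu)
print(entropy(mu), entropy(nu))
```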
2.1. Convergence Results for the Shannon Entropy
The discontinuity of the entropy in ∞-alphabets raises the problem of finding conditions under which convergence of the entropy can be obtained. On this topic, Ho et al. [26] have studied the interplay between entropy and the total variation distance, specifying conditions for convergence by assuming a finite support on the involved distributions. On the other hand, Harremoës [27] (Theorem 21) obtained convergence of the entropy by imposing a power dominating condition [27] (Definition 17) on the limiting probability measure for all the sequences converging in reverse I-divergence to [29]. More recently, Silva et al. [25] have addressed entropy convergence by studying a number of new settings that involve conditions on the limiting measure , as well as the way the sequence converges to in the space of distributions. These results offer sufficient conditions under which the entropy evaluated on a sequence of distributions converges to the entropy of its limiting distribution and, consequently, the possibility of applying them when analyzing plug-in entropy estimators. The results used in this work are summarized in the rest of this section.
Let us begin with the case when , i.e., when the support of the limiting measure is finite and unknown.
Proposition 1.
Let us assume that and . If , then and .
This result is well known because when for all n, the scenario reduces to the finite alphabet case, where the entropy is known to be continuous [2,23]. Since the argument yields two inequalities that are used in the following sections, a simple proof is provided here.
Proof.
and belong to from the finite-support assumption. The same argument can be used to show that , since for all n. Let us consider the following identity:
(6) The first term on the right hand side (RHS) of (6) is upper bounded by where
(7) For the second term, we have that
(8) and, consequently,
(9) ☐
Under the assumptions of Proposition 1, we note that the reverse I-divergence and the entropy difference are bounded by the total variation by (8) and (9), respectively. Note, however, that these bounds are a distribution-dependent function of () in (7) (it is direct to show that if, and only if, ). The next result relaxes the assumption that and offers a necessary and sufficient condition for the convergence of the entropy.
Lemma 1.
Ref. [25] (Theorem 1) Let and . If , then there exists such that , and
Furthermore, , if and only if,
(10) where denotes the conditional probability of μ given the event .
Lemma 1 tells us that to achieve entropy convergence (on top of the weak convergence), it is necessary and sufficient to ask for a vanishing expression (with n) of the entropy of restricted to the elements of the set . Two remarks about this result are: (1) The convergence in direct I-divergence does not imply the convergence of the entropy (concrete examples are presented in [24] (Section III) and [25]); (2) Under the assumption that , is eventually absolutely continuous with respect to , and the convergence in total variation is equivalent to the convergence in direct I-divergence.
This section concludes with the case when the support of is infinite and unknown, i.e., . In this context, two results are highlighted:
Lemma 2.
Ref. [31] (Theorem 4) Let us consider that and . If and
(11) then for all n and it follows that
Interpreting Lemma 2, we have that to obtain the convergence of the entropy functional (without imposing a finite support assumption on ), a uniform bounding condition (UBC) -almost everywhere was added in (11). By adding this UBC, the convergence in reverse I-divergence is also obtained as a byproduct. Finally, when for all n, the following result is considered:
Lemma 3.
Ref. [25] (Theorem 3) Let and a sequence of measures such that for all . If and
(12) then, for all , and
Furthermore, , if and only if,
(13)
This result shows the non-sufficiency of the convergence in direct I-divergence to achieve entropy convergence in the regime when . In fact, Lemma 3 may be interpreted as an extension of Lemma 1 when the finite support assumption over is relaxed.
3. Shannon Entropy Estimation
Let be a probability in , and let us denote by the empirical process induced by i.i.d. realizations of a random variable driven by , i.e., , for all . Let denote the distribution of the empirical process in and denote the finite block distribution of in the product space . Given a realization of , we can construct a histogram-based estimator such as the classical empirical probability given by:
(14) $\hat{\mu}_n(A) \equiv \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}_A(X_i), \quad \text{for all } A \subseteq \mathbb{X},$
with pmf given by for all . A natural estimator of the entropy is the plug-in estimate of given by
(15) $\hat{H}_n \equiv H(\hat{\mu}_n) = -\sum_{x \in \mathbb{X}} \hat{\mu}_n(x) \log \hat{\mu}_n(x),$
which is a measurable function of (this dependency on the data will be implicit for the rest of the paper).
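A minimal sketch of the plug-in construction in (14) and (15) is given below; the geometric source (whose entropy is 2 log 2 nats) and the sample sizes are arbitrary choices used only to illustrate the point-wise convergence discussed in [30].

```python
import numpy as np

def plugin_entropy(samples):
    """H(mu_n): entropy (in nats) of the empirical pmf built from the i.i.d. samples."""
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

# Geometric(1/2) source on {1, 2, ...}: countably infinite support, H(mu) = 2*log(2) nats.
rng = np.random.default_rng(1)
true_H = 2 * np.log(2)
for n in (10**2, 10**4, 10**6):
    est = plugin_entropy(rng.geometric(p=0.5, size=n))
    print(f"n = {n:7d}: |H(mu) - H(mu_n)| = {abs(true_H - est):.4f}")
```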
In the rest of this section and in Section 4 and Section 5, the convergence results in Section 2.1 are used to derive strong consistency results for plug-in histogram-based estimators, such as (15), as well as finite-length concentration inequalities to obtain almost-sure rates of convergence for the overall estimation error .
3.1. Revisiting the Classical Plug-In Estimator for Finite and Unknown Supported Distributions
We start by analyzing the case when has a finite but unknown support. A consequence of the strong law of large numbers [42,43] is that , , -almost surely (a.s.), hence , -a.s. On the other hand, it is clear that holds with probability one. Then Proposition 1 implies that
(16)
i.e., is a strongly consistent estimator of in reverse I-divergence and is a strongly consistent estimate of distribution-free in . Furthermore, the following can be stated:
Theorem 1.
Let and let us consider in (14). Then , -a.s and , ,
(17)
(18) Moreover, is eventually finite with probability one and , and for any ,
(19)
This result implies that for any and , , , and goes to zero as -a.s. Furthermore, and behave like for all from (30) in Section 7, which is the optimal rate of convergence of the finite alphabet scenario. As a corollary of (18), it is possible to derive intervals of confidence for the estimation error : for all and ,
(20)
This confidence interval behaves like as a function of n, and like as a function of , which are the same optimal asymptotic trends that can be obtained for in (30).
Finally, we observe that -a.s. where for any , implying that for all finite n. Then, even in the finite and unknown support scenario, is not consistent in expected direct I-divergence, which is congruent with the results in [29,44]. Despite this negative result, strong consistency in direct I-divergence can be obtained from (19), in the sense that , -a.s.
3.2. A Simplified Version of the Barron Estimator for Finite Supported Probabilities
It is well understood that consistency in expected direct I-divergence is of critical importance for the construction of a lossless universal source coding scheme [2,23,29,44,45,46,47,48]. Here, we explore an estimator that achieves this learning objective, in addition to entropy estimation. For that, let and let us assume such that . Barron et al. [29] proposed a modified version of the empirical measure in (14) to estimate from i.i.d. realizations, adopting a mixture estimate of the form
(21)
for all , and with a sequence of real numbers in . Note that then for all n and from the finite support assumption and , -a.s. The following result derives from the convergence result in Lemma 1.
Theorem 2.
Let , and let us consider in (21) induced from i.i.d. realizations of μ.
- (i)
If is , then , , -a.s., and .
- (ii)
Furthermore, if is with , then for all , and are -a.s, and and are .
Using this approach, we achieve estimation of the true distribution in expected information divergence as well as strong consistency for entropy estimation as intended. In addition, optimal rates of convergence are obtained under the finite support assumption on .
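A minimal sketch of this construction is given below, assuming that the mixture in (21) takes the standard Barron-type form, i.e., a convex combination of the empirical pmf and the reference pmf v with weight a_n; the Dirichlet-generated target, the uniform reference, and the choice a_n = 1/n are illustrative assumptions and not the conditions of Theorem 2.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def kl(p, q):
    s = p > 0
    return np.sum(p[s] * np.log(p[s] / q[s]))

def barron_mixture_pmf(samples, v, a_n):
    """Mixture of the empirical pmf and the reference pmf v with weight a_n in (0, 1)."""
    emp = np.bincount(samples, minlength=len(v)) / len(samples)
    return (1.0 - a_n) * emp + a_n * v

rng = np.random.default_rng(2)
k = 20
mu = rng.dirichlet(np.ones(k))          # unknown finite-support target
v = np.ones(k) / k                      # reference measure covering supp(mu)
for n in (10**2, 10**4, 10**6):
    tilde = barron_mixture_pmf(rng.choice(k, size=n, p=mu), v, a_n=1.0 / n)
    print(f"n = {n:7d}:  |H(mu) - H(tilde)| = {abs(entropy(mu) - entropy(tilde)):.4f}"
          f"   D(mu || tilde) = {kl(mu, tilde):.2e}")
```

Because the mixture always puts positive mass on every symbol of the reference, the direct I-divergence D(mu || tilde) remains finite for every n, which is the attribute that the plain empirical measure lacks.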
4. The Barron-Györfi-van der Meulen Estimator
The celebrated Barron estimator was proposed by Barron, Györfi and van der Meulen [29] in the context of an abstract and continuous measurable space. It is designed as a variation of the classical histogram-based scheme to achieve a consistent estimation of the distribution in direct I-divergence [29] (Theorem 2). Here, the Barron estimator is revisited in the countable alphabet scenario, with the objective of estimating the Shannon entropy consistently, which, to the best of our knowledge, has not been previously addressed in the literature. For that purpose, the convergence result in Lemma 3 will be used as a key result.
Let be of infinite support (i.e., ). We want to construct a strongly consistent estimate of the entropy restricted to the collection of probabilities in . For that, let us consider a sequence with values in and let us denote by the finite partition of with maximal cardinality satisfying that
(22)
Note that for all , and because of the fact that , it is simple to verify that if is then . offers an approximately statistically equivalent partition of with respect to the reference measure v. In this context, given , i.i.d. realizations of , the idea proposed by Barron et al. [29] is to estimate the RN derivative by the following histogram-based construction:
(23)
where is a real number in , denotes the cell in that contains the point x, and is the empirical measure in (14). Note that
, and, consequently,
(24)
By construction and, consequently, for all . The next result shows sufficient conditions on the sequences and to guarantee a strongly consistent estimation of the entropy and of in direct I-divergence, distribution free in . The proof is based on verifying that the sufficient conditions of Lemma 3 are satisfied -a.s.
Theorem 3.
Let v be in with infinite support, and let us consider μ in . If we have that:
- (i)
is and is ,
- (ii)
, such that the sequence is ,
then for all and
(25)
This result shows an admissible regime of design parameters and its scaling with the number of samples that guarantees that the Barron plug-in entropy estimator is strongly consistent in . As a byproduct, we obtain that the distribution is estimated consistently in direct information divergence.
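Since the precise displays (22) and (23) appear only in the original layout, the sketch below assumes the standard Barron histogram form, in which the empirical mass of the cell containing x is mixed (with weight a_n) with the reference mass and then redistributed within the cell proportionally to v; the greedy partition over consecutive symbols, the truncated geometric reference, and the numerical values of h_n and a_n are illustrative assumptions.

```python
import numpy as np

def barron_partition(v_pmf, h_n):
    """Group consecutive symbols into cells of reference mass >= h_n: a maximal-cardinality,
    approximately statistically equivalent partition in the spirit of (22)."""
    cells, current, mass = [], [], 0.0
    for x, vx in enumerate(v_pmf):
        current.append(x)
        mass += vx
        if mass >= h_n:
            cells.append(np.array(current))
            current, mass = [], 0.0
    if current:                              # leftover tail mass joins the last cell
        cells[-1] = np.concatenate([cells[-1], current])
    return cells

def barron_histogram_pmf(samples, v_pmf, h_n, a_n):
    """Sketch of (23): mu*_n(x) = [(1 - a_n) * mu_hat_n(A(x)) / v(A(x)) + a_n] * v(x)."""
    n = len(samples)
    counts = np.bincount(samples, minlength=len(v_pmf))
    ratio = np.empty_like(v_pmf)
    for cell in barron_partition(v_pmf, h_n):
        ratio[cell] = (1.0 - a_n) * counts[cell].sum() / (n * v_pmf[cell].sum()) + a_n
    return ratio * v_pmf                     # dominated by v, so mu*_n << v by construction

# Truncated geometric reference measure; samples drawn from v itself in this toy run.
v = 0.5 ** np.arange(1, 200); v /= v.sum()
rng = np.random.default_rng(3)
samples = rng.choice(len(v), size=5000, p=v)
mu_star = barron_histogram_pmf(samples, v, h_n=0.05, a_n=0.01)
print(mu_star.sum())                         # sums to 1 by construction
```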
The Barron estimator [29] was originally proposed in the context of distributions defined in an abstract measurable space. Then if we restrict [29] (Theorem 2) to the countable alphabet case, the following result is obtained:
Corollary 1.
Ref. [29] (Theorem 2) Let us consider and . If is , is and then
When the only objective is the estimation of distributions consistently in direct I-divergence, Corollary 1 should be considered a better result than Theorem 3 (Corollary 1 requires weaker conditions than Theorem 3, in particular condition (ii)). The proof of Theorem 3 is based on verifying the sufficient conditions of Lemma 3, where the objective is to achieve the convergence of the entropy and, as a consequence, the convergence in direct I-divergence. Therefore, we can say that the stronger conditions of Theorem 3 are needed when the objective is entropy estimation. This is justified from the observation that convergence in direct I-divergence does not imply entropy convergence in ∞-alphabets, as discussed in Section 2.1 (see Lemmas 1 and 3).
5. A Data-Driven Histogram-Based Estimator
Data-driven partitions offer a better approximation to the data distribution in the sample space than conventional non-adaptive histogram-based approaches [34,49]. They have the capacity to improve the approximation quality of histogram-based learning schemes, which translates into better performance in different non-parametric learning settings [34,35,36,50,51]. One of the basic design principles of this approach is to partition or select a subset of elements of in a data-dependent way so as to preserve a critical number of samples per cell. In our problem, this last condition proves to be crucial for deriving bounds on the estimation and approximation errors. Finally, these expressions will be used to propose design solutions that offer an optimal balance between estimation and approximation errors (Theorems 5 and 6).
Given i.i.d. realizations driven by and , let us define the data-driven set
(26)
and . Let be a data-driven partition with maximal resolution in , and be the smallest sigma field that contains (as is a finite partition, is the collection of sets that are union of elements of ). We propose the conditional empirical probability restricted to by:
(27)
By construction, it follows that , -a.s. and this implies that for all . Furthermore, and, importantly in the context of the entropy functional, it follows that
(28)
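A minimal sketch of this scheme is given below; it assumes that the data-driven set in (26) collects the symbols whose empirical mass is at least a threshold b_n, that (27) is the empirical pmf renormalized on that set, and that a threshold of the form b_n = n^(-0.9) is admissible, all of which are assumptions of this illustration rather than statements of the results that follow.

```python
import numpy as np

def data_driven_entropy(samples, b_n):
    """Sketch of (26)-(28): restrict the empirical pmf to the data-driven set
    {x : mu_hat_n(x) >= b_n}, renormalize it, and plug it into the entropy."""
    _, counts = np.unique(samples, return_counts=True)
    emp = counts / counts.sum()
    kept = emp[emp >= b_n]          # data-driven support, assumed thresholding form of (26)
    p = kept / kept.sum()           # conditional empirical probability, as in (27)
    return -np.sum(p * np.log(p))

# Geometric(1/2) target: infinite support, H(mu) = 2*log(2) ≈ 1.3863 nats.
rng = np.random.default_rng(4)
true_H = 2 * np.log(2)
for n in (10**3, 10**5, 10**7):
    est = data_driven_entropy(rng.geometric(p=0.5, size=n), b_n=n ** (-0.9))
    print(f"n = {n:8d}: estimate = {est:.4f}, error = {abs(true_H - est):.4f}")
```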
The next result establishes a mild sufficient condition on for which is strongly consistent distribution-free in . Considering that we are in the regime where , -a.s., the proof of this result uses the convergence result in Lemma 2 as a central result.
Theorem 4.
If is with , then for all
Complementing Theorem 4, the next result offers almost-sure rates of convergence for a family of distributions with a power tail bounded condition (TBC). In particular, the family of distributions studied in [30] (Theorem 7) is considered.
Theorem 5.
Let us assume that for some there are two constants and such that for all . If we consider that for , then
This result shows that under the mentioned p-power TBC on , the plug-in estimator can achieve a rate of convergence to the true limit that is with probability one. For the derivation of this result, the approximation sequence is defined as a function of p (adapted to the problem) by finding an optimal tradeoff between estimation and approximation errors while performing a finite length (non-asymptotic) analysis of the expression (the details of this analysis are presented in Section 7).
It is insightful to look at two extreme regimes of this result: p approaching 1, in which case the rate is arbitrarily slow (approaching a non-decaying behavior); and , where is for all -a.s. This last power decaying range matches what is achieved for the finite alphabet scenario (for instance in Theorem 1, Equation (18)), which is known to be the optimal rate for finite alphabets.
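The behavior described by Theorem 5 can be probed numerically; in the sketch below, the Zipf-type pmf mu(x) proportional to x^(-2) (truncated only so that it can be sampled), the threshold sequence b_n = n^(-0.8), and the sample sizes are illustrative assumptions and not the constants or sequences prescribed by the theorem.

```python
import numpy as np

def data_driven_entropy(samples, b_n):
    """Restricted plug-in sketch: threshold the empirical pmf at b_n and renormalize."""
    _, counts = np.unique(samples, return_counts=True)
    emp = counts / counts.sum()
    kept = emp[emp >= b_n]
    kept = kept / kept.sum()
    return -np.sum(kept * np.log(kept))

# Zipf-type target mu(x) ~ x^(-p) with p = 2 (a power tail-bounded condition with p > 1),
# truncated to a large finite grid purely so that it can be sampled.
p = 2.0
x = np.arange(1, 10**6, dtype=float)
mu = x ** (-p); mu /= mu.sum()
true_H = -np.sum(mu * np.log(mu))

rng = np.random.default_rng(5)
for n in (10**3, 10**4, 10**5, 10**6):
    samples = rng.choice(x, size=n, p=mu)
    err = abs(true_H - data_driven_entropy(samples, b_n=n ** (-0.8)))
    print(f"n = {n:7d}: |H(mu) - estimate| = {err:.4f}")
```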
Extending Theorem 5, the following result addresses the more constrained case of distributions with an exponential TBC.
Theorem 6.
Let us consider and let us assume that there are with and such that for all . If we consider with , then
Under this stringent TBC on , it is observed that , for any arbitrary , by selecting with . This last condition on is universal over . Remarkably, for any distribution with this exponential TBC, we can approximate (arbitrarily closely) the optimal almost-sure rate of convergence achieved for the finite alphabet problem.
Finally, the finite and unknown supported scenario is revisited, where it is shown that the data-driven estimator exhibits the optimal almost sure convergence rate of the classical plug-in entropy estimator presented in Section 3.1.
Theorem 7.
Let us assume that and being . Then for all there is such that
(29)
The proof of this result reduces to verifying that detects almost surely when n goes to infinity and, from this, it follows that eventually matches the optimal almost-sure performance of under the key assumption that . Finally, the concentration bound in (29) implies that is almost surely for all as long as with n.
6. Discussion of the Results and Final Remarks
This work shows that entropy convergence results are instrumental for deriving new (strongly consistent) estimation results for the Shannon entropy in ∞-alphabets and, as a byproduct, distribution estimators that are strongly consistent in direct and reverse I-divergence. Adopting a set of sufficient conditions for entropy convergence in the context of four plug-in histogram-based schemes, this work shows concrete design conditions where strong consistency for entropy estimation in ∞-alphabets can be obtained (Theorems 2–4). In addition, the relevant case where the target distribution has a finite but unknown support is explored, deriving almost-sure rate of convergence results for the overall estimation error (Theorems 1 and 7) that match the optimal asymptotic rate that can be obtained in the finite alphabet version of the problem (i.e., the finite and known support case).
As the main context of application, this work focuses on the case of a data-driven plug-in estimator that restricts the support where the distribution is estimated. The idea is to have design parameters that control the estimation and approximation error effects and to find an adequate balance between these two learning errors. Adopting the entropy convergence result in Lemma 2, it is shown that this data-driven scheme offers the same universal estimation attributes as the classical plug-in estimate under some mild conditions on its threshold design parameter (Theorem 4). In addition, by addressing the technical task of deriving concrete closed-form expressions for the estimation and approximation errors in this learning context, a solution is presented where almost-sure rates of convergence of the overall estimation error are obtained over a family of distributions with some concrete tail bounded conditions (Theorems 5 and 6). These results show the capacity that data-driven frameworks offer for adapting aspects of their learning scheme to the complexity of the entropy estimation task in ∞-alphabets.
Concerning the classical plug-in estimator presented in Section 3.1, it is important to mention that the work of Antos et al. [30] shows that happens almost surely and distribution-free and, furthermore, it provides rates of convergence for families with specific tail-bounded conditions [30] (Theorem 7). Theorem 1 focuses on the case when , where new finite-length deviation inequalities and confidence intervals are derived. From that perspective, Theorem 1 complements the results presented in [30] in the non-explored scenario when . It is also important to mention two results by Ho and Yeung [26] (Theorems 11 and 12) for the plug-in estimator in (15). They derived bounds for and determined confidence intervals under a finite and known support restriction on . In contrast, Theorem 1 resolves the case for a finite and unknown supported distribution, which is declared to be a challenging problem from the arguments presented in [26] (Theorem 13) concerning the discontinuity of the entropy.
7. Proof of the Main Results
Proof of Theorem 1.
Let be in , then for some . From Hoeffding’s inequality [28], , and for any
(30) Considering that -a.s, we can use Proposition 1 to obtain that
(31) Hence, (17) and (18) derive from (30).
For the direct I-divergence, let us consider a sequence and the following function (a stopping time):
(32) is the point where the support of is equal to and, consequently, the direct I-divergence is finite (since ). In fact, by the uniform convergence of to (-a.s.) and the finite support assumption of , it is simple to verify that . Let us define the event:
(33) i.e., the collection of sequences in where at time n, and, consequently, . Restricted to this set
(34)
(35)
(36) where in the first inequality , and the last is obtained by the definition of the total variational distance. In addition, let us define the -deviation set . Then by additivity and monotonicity of , we have that
(37) By definition of , (36) and (30) it follows that
(38) On the other hand, if then . Consequently , and again from (30)
(39) for all and . Integrating the results in (38) and (39) and considering suffices to show the bound in (19). ☐
Proof of Theorem 2.
As is , it is simple to verify that , -a.s. Also note that the support disagreement between and is bounded by the hypothesis, then
(40) Therefore from Lemma 1, we have the strong consistency of and the almost sure convergence of to zero. Note that is uniformly upper bounded by (see (36) in the proof of Theorem 1). Then the convergence in probability of implies the convergence of its mean [42], which concludes the proof of the first part.
Concerning rates of convergence, we use the following:
(41) The absolute value of the first term in the right hand side (RHS) of (41) is bounded by and the second term is bounded by , from the assumption that . For the last term, note that for all and that , then
On the other hand,
Integrating these bounds in (41),
(42) for constants and function of and v.
Under the assumption that , the Hoeffding’s inequality [28,52] tells us that (for some distribution free constants and ). From this inequality, goes to zero as -a.s. and is . On the other hand, under the assumption in ii) is , which from (42) proves the rate of convergence results for .
Considering the direct I-divergence, . Then the uniform convergence of to -a.s. in and the fact that imply that for an arbitrarily small (in particular smaller than )
(43) (43) suffices to obtain the convergence result for the I-divergence. ☐
Proof of Theorem 3.
Let us define the oracle Barron measure by:
(44) where we consider the true probability instead of its empirical version in (23). Then, the following convergence result can be obtained (see Proposition A2 in Appendix B),
(45) Let denote the collection of sequences where the convergence in (45) holds (this set is typical, meaning that ). The rest of the proof reduces to showing that for any arbitrary , its respective sequence of induced measures (the dependency of on the sequence will be considered implicit for the rest of the proof) satisfies the sufficient conditions of Lemma 3.
Let us fix an arbitrary :
Weak convergence : Without loss of generality, we consider that for all . Since and , , we obtain the weak convergence of to . On the other hand, by definition of , that implies that for all and, consequently, .
The condition in (12): By construction , and for all n, then we will use the following equality:
(46) for all . Concerning the approximation error term of (46), i.e., ,
(47) Given that , this is equivalent to stating that is bounded -almost everywhere, which is equivalent to saying that and . From this, ,
(48) Then we have that . Therefore, for n sufficiently large, for all x in . Hence, there exists such that .
For the estimation error term of (46), i.e., , note that from the fact that , and the convergence in (45), there exists such that for all , given that . Then using (46), for all , which verifies (12).
The condition in (13): Defining the function , we want to verify that . Considering that for all , there exists such that and then
(49) From (49), for all . Analyzing in (44), there are two scenarios: where and, otherwise, . Let us define:
(50) Then for all ,
(51) with . The left term in (51) is upper bounded by , which goes to zero with n from being and the fact that . For the right term in (51), being implies that x eventually (in n) belongs to , and then tends to zero point-wise as n goes to infinity. On the other hand, for all (see (50)), we have that
(52) These inequalities derive from (48). Consequently for all , if n sufficiently large such that , then
(53) Hence, from (50), is bounded by a fixed function that is by the assumption that . Then, by the dominated convergence theorem [43] and (51),
In summary, we have shown that for any arbitrary the sufficient conditions of Lemma 3 are satisfied, which proves the result in (25), recalling that from (45). ☐
Proof of Theorem 4.
Let us first introduce the oracle probability
(54) Note that is a random probability measure (function of the i.i.d sequence ) as is a data-driven set, see (26). We will first show that:
(55) Under the assumption on of Theorem 4, , -a.s. (this result derives from the fact that , -a.s. , from (63)). In addition, since is , then , which implies that -a.s. From this, , -a.s. Let us consider a sequence where . Constrained to that
(56) Then there is such that . Hence from Lemma 2, 0 and . Finally, the set of sequences where has probability one (with respect to ), which proves (55).
For the rest of the proof, we concentrate on the analysis of , which can be attributed to the estimation error aspect of the problem. It is worth noting that by construction , -a.s. Consequently, we can use
(57) The first term on the RHS of (57) is upper bounded by . Concerning the second term on the RHS of (57), it is possible to show (details presented in Appendix C) that
(58) where
(59) In addition, it can be verified (details presented in Appendix D) that
(60) for some universal constant . Therefore from (57), (58) and (60), there is such that
(61) As mentioned before, goes to 1 almost surely, then we need to concentrate on the analysis of the asymptotic behavior of . From Hoeffding’s inequality [28], we have that
(62) considering that by construction . Assuming that is ,
Therefore for all , and any arbitrary
(63) This last result is sufficient to show that , which concludes the argument from the Borel-Cantelli Lemma. ☐
Proof of Theorem 5.
We consider the expression
(64) to analyze the approximation error and the estimation error terms separately.
• Approximation Error Analysis
Note that is a random object as in (54) is a function of the data-dependent partition and, consequently, a function of . In the following, we consider the oracle set
(65) and the oracle conditional probability
(66) Note that is a deterministic function of and so is the measure in (66). From the definitions and the triangle inequality:
(67) and, similarly, the approximation error is bounded by
(68) We denote the RHS of (67) and (68) by and , respectively.
We can show that if is and , then
(69) which from (68) implies that is , -a.s. The proof of (69) is presented in Appendix E.
Then, we need to analyze the rate of convergence of the deterministic sequence . Analyzing the RHS of (67), we recognize two independent terms: the partial entropy sum and the rest that is bounded asymptotically by , using the fact that for . Here is where the tail condition on plays a role. From the tail condition, we have that
(70) where . Similarly as , then
(71) where .
In Appendix F, it is shown that and for constants and . Integrating these results in the RHS of (70) and (71) and considering that is , we have that both and are . This implies that our oracle sequence is .
In conclusion, if is for , it follows that
(72) • Estimation Error Analysis
Let us consider . From the bound in (61) and the fact that for any , -a.s. from (63), the problem reduces to analyzing the rate of convergence of the following random object:
We will analyze, instead, the oracle version of given by:
(74) where is the oracle counterpart of in (26). To do so, we can show that if is with , then
(75) The proof of (75) is presented in Appendix G.
Moving to the almost sure rate of convergence of , it is simple to show for our p-power dominating distribution that if is and then
and, more specifically,
(76) The argument is presented in Appendix H.
In conclusion, if is for , it follows that
(77) for all .
• Estimation vs. Approximation Errors
Coming back to (64) and using (72) and (77), the analysis reduces to finding the solution in that offers the best trade-off between the estimation and approximation error rate:
(78) It is simple to verify that . Then, by considering arbitrarily close to the admissible limit , we can achieve a rate of convergence for that is arbitrarily close to , -a.s.
More formally, for any we can take where is , -a.s., from (72) and (77).
Finally, a simple corollary of this analysis is to consider where:
(79) which concludes the argument. ☐
Proof of Theorem 6.
The argument follows the proof of Theorem 5. In particular, we use the estimation-approximation error bound:
(80) and the following two results derived in the proof of Theorem 5: If is with then (for the approximation error)
(81) with while (for the estimation error)
(82) with
For the estimation error, we need to bound the rate of convergence of to zero almost surely. We first note that with . Then from Hoeffding’s inequality we have that
(83) Considering , an arbitrary sequence being and , it follows from (83) that
(84) We note that the first term in the RHS of (84) is and goes to zero for all , while the second term is . If we consider , this second term is . Therefore, for any we can take an arbitrary such that is from (84). This result implies, from the Borel-Cantelli Lemma, that is , -a.s, which in summary shows that is for all .
For the approximation error, it is simple to verify that:
(85) and
(86) where and . At this point, it is not difficult to show that and for some constants and . Integrating these partial steps, we have that
(87) for some constant and . The last step is from the evaluation of . Therefore from (81) and (87), it follows that is -a.s. for all .
The argument concludes by integrating in (80) the almost sure convergence results obtained for the estimation and approximation errors. ☐
Proof of Theorem 7.
Let us define the event
(88) that represents the detection of the support of from the data for a given in (26). Note that the dependency on the data for is made explicit in this notation. In addition, let us consider the deviation event
(89) By the hypothesis that , then . Therefore if then for all , which implies that as long as . Using the hypothesis that , there is such that for all and, consequently,
(90) the last from Hoeffding’s inequality considering .
If we consider the events:
(91)
(92) and we use the fact that by definition conditioning on , it follows that . Then, for all and
(93) the last inequality from Theorem 1 and (90). ☐
Acknowledgments
The author is grateful to Patricio Parada for his insights and stimulating discussion in the initial stage of this work. The author thanks the anonymous reviewers for their valuable comments and suggestions, and his colleagues Claudio Estevez, Rene Mendez and Ruben Claveria for proofreading this material.
Appendix A. Minimax risk for Finite Entropy Distributions in ∞-Alphabets
Proposition A1.
For the proof, we use the following lemma that follows from [26] (Theorem 1).
Lemma A1.
Let us fix two arbitrary real numbers and . Then there are P, Q two finite supported distributions on that satisfy that while .
The proof of Lemma A1 derives from the same construction presented in the proof of [26] (Theorem 1), i.e., and a modification of it, both distributions with finite support and, consequently, in . It is simple to verify that as M goes to infinity while .
Proof.
For any pair of distributions P, Q in , Le Cam's two-point method [53] shows that:
(A1) Adopting Lemma A1 and Equation (A1), for any n and any arbitrary and , we have that . Then, exploiting the discontinuity of the entropy in infinite alphabets, we can fix and make arbitrarily large. ☐
Appendix B. Proposition A2
Proposition A2.
Under the assumptions of Theorem 3:
(A2)
Proof.
First note that , then is finite and
(A3) Then by construction,
(A4) From Hoeffding’s inequality, we have that
(A5) By condition ii), given that is for some , then there exists such that
This implies that is eventually dominated by a constant times , which from the Borel-Cantelli Lemma [43] implies that
(A6) ☐
Appendix C. Proposition A3
Proposition A3.
Proof.
From definition,
(A7) For the right term in the RHS of (A7):
(A8) For the left term in the RHS of (A7):
(A9)
(A10)
(A11) The first inequality in (A9) is by the triangle inequality, and the second in (A10) is from the fact that for . Finally, from the definition of the total variational distance over in (59), we have that
(A12) which concludes the argument from (A7)–(A9). ☐
Appendix D. Proposition A4
Proposition A4.
Considering that , there exists and such that ,
(A13)
Proof.
(A14) By the hypothesis , which concludes the proof. ☐
Appendix E. Proposition A5
Proposition A5.
If is with , then
Proof.
Let us define the set
By definition, every sequence is such that and, consequently, we just need to prove that [42]. Furthermore, if , then by definition of in (65), we have that for all (i.e., ). From this
(A15) from the Hoeffding’s inequality [28,52], the union bound and the fact that by construction . If we consider and , we have that:
(A16) From (A16), for any there is such that is bounded by a term . This implies that , which suffices to show that . ☐
Appendix F. Auxiliary Results for Theorem 5
Let us first consider the series
(A17)
where for all . It is simple to verify that for all , given that by hypothesis . Consequently, .
Similarly, for the second series we have that:
(A18)
where for all . Note again that for all , and, consequently, from (A18).
Appendix G. Proposition A6
Proposition A6.
If is with , then
Proof.
By definition if then . Consequently, if we define the set:
(A19) then the proof reduces to verifying that .
On the other hand, if then by definition of , for all , i.e., . In other words,
(A20) Finally,
(A21) In this context, if we consider and , then we have that:
(A22) Therefore, we have that for any we can take such that is bounded by a term . Then, the Borel-Cantelli Lemma tells us that , which concludes the proof from (A20). ☐
Appendix H. Proposition A7
Proposition A7.
For the p-power tail dominating distribution stated in Theorem 5, if is with then , -a.s.
Proof.
From the Hoeffding’s inequality we have that
(A23) the second inequality using that from the definition of in (65) and the tail bounded assumption on . If we consider and , then we have that:
(A24) for some constant . Then, in order to obtain that converges almost surely to zero from (A24), it is sufficient that , , and . This implies that if , there is such that is bounded by a term and, consequently, , -a.s. (by the same steps used in Appendix G).
Moving to the rate of convergence of (assuming that ), let us consider for some . From (A24):
(A25) To make -a.s., a sufficient condition is that , , and . Therefore (considering that ), the admissibility condition for the existence of an exponential rate of convergence for the deviation event is that , which is equivalent to . ☐
Funding
The work is supported by funding from FONDECYT Grant 1170854, CONICYT-Chile and the Advanced Center for Electrical and Electronic Engineering (AC3E), Basal Project FB0008.
Conflicts of Interest
The author declares no conflict of interest.
References
- 1. Beirlant J., Dudewicz E., Györfi L., van der Meulen E.C. Nonparametric entropy estimation: An overview. Int. Math. Stat. Sci. 1997;6:17–39.
- 2. Cover T.M., Thomas J.A. Elements of Information Theory. 2nd ed. Wiley; New York, NY, USA: 2006.
- 3. Kullback S. Information Theory and Statistics. Wiley; New York, NY, USA: 1959.
- 4. Principe J. Information Theoretic Learning: Renyi Entropy and Kernel Perspective. Springer; New York, NY, USA: 2010.
- 5. Fisher J.W., III, Wainwright M., Sudderth E., Willsky A.S. Statistical and information-theoretic methods for self-organization and fusion of multimodal, networked sensors. Int. J. High Perform. Comput. Appl. 2002;16:337–353.
- 6. Liu J., Moulin P. Information-theoretic analysis of interscale and intrascale dependencies between image wavelet coefficients. IEEE Trans. Image Process. 2001;10:1647–1658.
- 7. Thévenaz P., Unser M. Optimization of mutual information for multiresolution image registration. IEEE Trans. Image Process. 2000;9:2083–2099.
- 8. Butz T., Thiran J.P. From error probability to information theoretic (multi-modal) signal processing. Elsevier Signal Process. 2005;85:875–902.
- 9. Kim J., Fisher J.W., III, Yezzi A., Cetin M., Willsky A.S. A nonparametric statistical method for image segmentation using information theory and curve evolution. IEEE Trans. Image Process. 2005;14:1486–1502.
- 10. Padmanabhan M., Dharanipragada S. Maximizing information content in feature extraction. IEEE Trans. Speech Audio Process. 2005;13:512–519.
- 11. Silva J., Narayanan S. Minimum probability of error signal representation. Presented at the IEEE Workshop on Machine Learning for Signal Processing; Thessaloniki, Greece, 27–29 August 2007; pp. 348–353.
- 12. Silva J., Narayanan S. Discriminative wavelet packet filter bank selection for pattern recognition. IEEE Trans. Signal Process. 2009;57:1796–1810.
- 13. Gokcay E., Principe J.C. Information theoretic clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2002;24:158–171.
- 14. Arellano-Valle R.B., Contreras-Reyes J.E., Stehlik M. Generalized skew-normal negentropy and its applications to fish condition factor time series. Entropy. 2017;19:528.
- 15. Lake D.E. Nonparametric entropy estimation using kernel densities. Methods Enzymol. 2009;467:531–546.
- 16. Van der Vaart A.W. Asymptotic Statistics. Volume 3. Cambridge University Press; Cambridge, UK: 2000.
- 17. Wu Y., Yang P. Minimax rates of entropy estimation on large alphabets via best polynomial approximation. IEEE Trans. Inf. Theory. 2016;62:3702–3720.
- 18. Jiao J., Venkat K., Han Y., Weissman T. Minimax estimation of functionals of discrete distributions. IEEE Trans. Inf. Theory. 2015;61:2835–2885.
- 19. Paninski L. Estimating entropy on m bins given fewer than m samples. IEEE Trans. Inf. Theory. 2004;50:2200–2203.
- 20. Valiant G., Valiant P. Estimating the unseen: An n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs. Proceedings of the Forty-Third Annual ACM Symposium on Theory of Computing; San Jose, CA, USA, 6–8 June 2011; pp. 685–694.
- 21. Valiant G., Valiant P. A CLT and Tight Lower Bounds for Estimating Entropy. Volume 17. Electronic Colloquium on Computational Complexity; Potsdam, Germany: 2011. p. 9. Technical Report TR 10-179.
- 22. Braess D., Forster J., Sauer T., Simon H.U. How to achieve minimax expected Kullback-Leibler distance from an unknown finite distribution. Proceedings of the International Conference on Algorithmic Learning Theory; Lübeck, Germany, 24–26 November 2002; Springer; Berlin/Heidelberg, Germany: 2002; pp. 380–394.
- 23. Csiszár I., Shields P.C. Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory. Now Publishers Inc.; Breda, The Netherlands: 2004; pp. 417–528.
- 24. Ho S.W., Yeung R.W. On the discontinuity of the Shannon information measures. IEEE Trans. Inf. Theory. 2009;55:5362–5374.
- 25. Silva J., Parada P. Shannon entropy convergence results in the countable infinite case. Proceedings of the International Symposium on Information Theory; Cambridge, MA, USA, 1–6 July 2012; pp. 155–159.
- 26. Ho S.W., Yeung R.W. The interplay between entropy and variational distance. IEEE Trans. Inf. Theory. 2010;56:5906–5929.
- 27. Harremoës P. Information topologies with applications. In: Csiszár I., Katona G.O.H., Tardos G., editors. Entropy, Search, Complexity. Volume 16. Springer; New York, NY, USA: 2007; pp. 113–150.
- 28. Devroye L., Lugosi G. Combinatorial Methods in Density Estimation. Springer; New York, NY, USA: 2001.
- 29. Barron A., Györfi L., van der Meulen E.C. Distribution estimation consistent in total variation and in two types of information divergence. IEEE Trans. Inf. Theory. 1992;38:1437–1454.
- 30. Antos A., Kontoyiannis I. Convergence properties of functional estimates for discrete distributions. Random Struct. Algorithms. 2001;19:163–193.
- 31. Piera F., Parada P. On convergence properties of Shannon entropy. Probl. Inf. Transm. 2009;45:75–94.
- 32. Berlinet A., Vajda I., van der Meulen E.C. About the asymptotic accuracy of Barron density estimates. IEEE Trans. Inf. Theory. 1998;44:999–1009.
- 33. Vajda I., van der Meulen E.C. Optimization of Barron density estimates. IEEE Trans. Inf. Theory. 2001;47:1867–1883.
- 34. Lugosi G., Nobel A.B. Consistency of data-driven histogram methods for density estimation and classification. Ann. Stat. 1996;24:687–706.
- 35. Silva J., Narayanan S. Information divergence estimation based on data-dependent partitions. J. Stat. Plan. Inference. 2010;140:3180–3198.
- 36. Silva J., Narayanan S.N. Nonproduct data-dependent partitions for mutual information estimation: Strong consistency and applications. IEEE Trans. Signal Process. 2010;58:3497–3511.
- 37. Kullback S., Leibler R. On information and sufficiency. Ann. Math. Stat. 1951;22:79–86.
- 38. Gray R.M. Entropy and Information Theory. Springer; New York, NY, USA: 1990.
- 39. Kullback S. A lower bound for discrimination information in terms of variation. IEEE Trans. Inf. Theory. 1967;13:126–127.
- 40. Csiszár I. Information-type measures of difference of probability distributions and indirect observations. Studia Sci. Math. Hungar. 1967;2:299–318.
- 41. Kemperman J. On the optimum rate of transmitting information. Ann. Math. Stat. 1969;40:2156–2177.
- 42. Breiman L. Probability. Addison-Wesley; Boston, MA, USA: 1968.
- 43. Varadhan S. Probability Theory. American Mathematical Society; Providence, RI, USA: 2001.
- 44. Györfi L., Páli I., van der Meulen E.C. There is no universal source code for an infinite source alphabet. IEEE Trans. Inf. Theory. 1994;40:267–271.
- 45. Rissanen J. Information and Complexity in Statistical Modeling. Springer; New York, NY, USA: 2007.
- 46. Boucheron S., Garivier A., Gassiat E. Coding on countably infinite alphabets. IEEE Trans. Inf. Theory. 2009;55:358–373.
- 47. Silva J.F., Piantanida P. The redundancy gains of almost lossless universal source coding over envelope families. Proceedings of the IEEE International Symposium on Information Theory; Aachen, Germany, 25–30 June 2017; pp. 2003–2007.
- 48. Silva J.F., Piantanida P. Almost lossless variable-length source coding on countably infinite alphabets. Proceedings of the IEEE International Symposium on Information Theory; Barcelona, Spain, 10–15 July 2016; pp. 1–5.
- 49. Nobel A.B. Histogram regression estimation using data-dependent partitions. Ann. Stat. 1996;24:1084–1105.
- 50. Silva J., Narayanan S. Complexity-regularized tree-structured partition for mutual information estimation. IEEE Trans. Inf. Theory. 2012;58:1940–1952.
- 51. Darbellay G.A., Vajda I. Estimation of the information by an adaptive partition of the observation space. IEEE Trans. Inf. Theory. 1999;45:1315–1321.
- 52. Devroye L., Györfi L., Lugosi G. A Probabilistic Theory of Pattern Recognition. Springer; New York, NY, USA: 1996.
- 53. Tsybakov A.B. Introduction to Nonparametric Estimation. Springer; New York, NY, USA: 2009.