Ensemble estimators for multivariate entropy estimation

Kumar Sricharan; Dennis Wei; Alfred O Hero, III

doi:10.1109/TIT.2013.2251456

. Author manuscript; available in PMC: 2015 Apr 18.

Published in final edited form as: IEEE Trans Inf Theory. 2013 Mar 7;59(7):4374–4388. doi: 10.1109/TIT.2013.2251456

Ensemble estimators for multivariate entropy estimation

Kumar Sricharan ¹, Dennis Wei ¹, Alfred O Hero III ¹

PMCID: PMC4401872 NIHMSID: NIHMS676617 PMID: 25897177

Abstract

The problem of estimation of density functionals like entropy and mutual information has received much attention in the statistics and information theory communities. A large class of estimators of functionals of the probability density suffer from the curse of dimensionality, wherein the mean squared error (MSE) decays increasingly slowly as a function of the sample size T as the dimension d of the samples increases. In particular, the rate is often glacially slow of order O(T⁻^γ^/^d), where γ > 0 is a rate parameter. Examples of such estimators include kernel density estimators, k-nearest neighbor (k-NN) density estimators, k-NN entropy estimators, intrinsic dimension estimators and other examples. In this paper, we propose a weighted affine combination of an ensemble of such estimators, where optimal weights can be chosen such that the weighted estimator converges at a much faster dimension invariant rate of O(T⁻¹). Furthermore, we show that these optimal weights can be determined by solving a convex optimization problem which can be performed offline and does not require training data. We illustrate the superior performance of our weighted estimator for two important applications: (i) estimating the Panter-Dite distortion-rate factor and (ii) estimating the Shannon entropy for testing the probability distribution of a random sample.

1 Introduction

Non-linear functionals of probability densities f of the form G(f) = ∫g(f(x), x)f(x)dx arise in applications of information theory, machine learning, signal processing and statistical estimation. Important examples of such functionals include Shannon g(f, x) = −log(f) and Rényi g(f, x) = f^α ⁻¹ entropy, and the quadratic functional g(f, x) = f². In these applications, the functional of interest often must be estimated empirically from sample realizations of the underlying densities.

Functional estimation has received significant attention in the mathematical statistics community. However, estimators of functionals of multivariate probability densities f suffer from mean square error (MSE) rates which typically decrease with dimension d of the sample as O(T⁻^γ^/^d), where T is the number of samples and γ is a positive rate parameter. Examples of such estimators include kernel density estimators [18], k-nearest neighbor (k-NN) density estimators [4], k-NN entropy functional estimators [9, 17, 13], intrinsic dimension estimators [17], divergence estimators [19], and mutual information estimators. This slow convergence is due to the curse of dimensionality. In this paper, we introduce a simple affine combination of an ensemble of such slowly convergent estimators and show that the weights in this combination can be chosen to significantly improve the rate of MSE convergence of the weighted estimator. In fact our ensemble averaging method can improve MSE convergence to the parametric rate O(T⁻¹).

Specifically, for d-dimensional data, it has been observed that the variance of estimators of functionals G(f) decays as O(T⁻¹) while the bias decays as O(T⁻¹^/⁽¹⁺^d⁾). To accelerate the slow rate of convergence of the bias in high dimensions, we propose a weighted ensemble estimator for ensembles of estimators that satisfy conditions Inline graphic .1(2.1) and .2(2.2) defined in Sec. II below. Optimal weights, which serve to lower the bias of the ensemble estimator to O(T ^−1/2), can be determined by solving a convex optimization problem. Remarkably, this optimization problem does not involve any density-dependent parameters and can therefore be performed offline. This then ensures MSE convergence of the weighted estimator at the parametric rate of O(T⁻¹).

1.1 Related work

When the density f is s > d/4 times differentiable, certain estimators of functionals of the form ∫g(f(x), x)f(x)dx, proposed by Birge and Massart [2], Laurent [11] and Giné and Mason [5], can achieve the parametric MSE convergence rate of O(T ⁻¹). The key ideas in [2, 11, 5] are: (i) estimation of quadratic functionals ∫f²(x)dx with MSE convergence rate O(T ⁻¹); (ii) use of kernel density estimators with kernels that satisfy the following symmetry constraints:

\int K (x) d x = 1, \int x^{r} K (x) d x = 0,

(1.1)

for r = 1,.., s; and finally (iii) truncating the kernel density estimate so that it is bounded away from 0. By using these ideas, the estimators proposed by [2, 11, 5] are able to achieve parametric convergence rates.

In contrast, the estimators proposed in this paper require additional higher order smoothness conditions on the density, i. e. the density must be s > d times differentiable. However, our estimators are much simpler to implement in contrast to the estimators proposed in [2, 11, 5]. In particular, the estimators in [2, 11, 5] require separately estimating quadratic functionals of the form ∫f²(x)dx, and using truncated kernel density estimators with symmetric kernels (1.1), conditions that are not required in this paper. Our estimator is a simple affine combination of an ensemble of estimators, where the ensemble satisfies conditions Inline graphic .1 and .2. Such an ensemble can be trivial to implement. For instance, in this paper we show that simple uniform kernel plug-in estimators (3.3) satisfy conditions .1 and .2.

Ensemble based methods have been previously proposed in the context of classification. For example, in both boosting [16] and multiple kernel learning [10] algorithms, lower complexity weak learners are combined to produce classifiers with higher accuracy. Our work differs from these methods in several ways. First and foremost, our proposed method performs estimation rather than classification. An important consequence of this is that the weights we use are data independent, while the weights in boosting and multiple kernel learning must be estimated from training data since they depend on the unknown distribution.

1.2 Organization

The remainder of the paper is organized as follows. We formally describe the weighted ensemble estimator for a general ensemble of estimators in Section 2, and specify conditions Inline graphic .1 and .2 on the ensemble that ensure that the ensemble estimator has a faster rate of MSE convergence. Under the assumption that conditions .1 and .2 are satisfied, we provide an MSE optimal set of weights as the solution to a convex optimization(2.3). Next, we shift the focus to entropy estimation in Section 3, propose an ensemble of simple uniform kernel plug-in entropy estimators, and show that this ensemble satisfies conditions Inline graphic .1 and .2. Subsequently, we apply the ensemble estimator theory in Section 2 to the problem of entropy estimation using this ensemble of kernel plug-in estimators. We present simulation results in Section 4 that illustrate the superior performance of this ensemble entropy estimator in the context of (i) estimation of the Panter-Dite distortion-rate factor [6] and (ii) testing the probability distribution of a random sample. We conclude the paper in Section 5.

Notation

We will use bold face type to indicate random variables and random vectors and regular type face for constants. We denote the statistical expectation operator by the symbol Inline graphic and the conditional expectation given random variable Z using the notation . We also define the variance operator as [X] = [(X − [X])²] and the covariance operator as Cov[X, Y] = [(X − [X])(Y − [Y])]. We denote the bias of an estimator by .

2 Ensemble estimators

Let l̄ = {l₁, .., l_L} denote a set of parameter values. For a parameterized ensemble of estimators {Ê_l}_l_∈_l̄ of E, define the weighted ensemble estimator with respect to weights w = {w(l₁), …, w(l_L)} as

{\hat{E}}_{w} = \sum_{l \in \bar{l}} w (l) {\hat{E}}_{l}

where the weights satisfy Σ_l_∈_l̄w(l) = 1. This latter sum-to-one condition guarantees that Ê_w is asymptotically unbiased if the component estimators {Ê_l}_l_∈_l̄ are asymptotically unbiased. Let this ensemble of estimators {Ê_l}_l_∈_l̄ satisfy the following two conditions:

.1 The bias is given by
$B ({\hat{E}}_{l}) = \sum_{i \in J} c_{i} ψ_{i} (l) T^{- i / 2 d} + O (1 / \sqrt{T}),$ (2.1)

where c_i are constants that depend on the underlying density, = {i₁, .., i_I} is a finite index set with cardinality I < L, min( ) = i₀ > 0 and max( ) = i_d ≤ d, and ψ_i(l) are basis functions that depend only on the estimator parameter l.
.2 The variance is given by
$V ({\hat{E}}_{l}) = c_{v} (\frac{1}{T}) + o (\frac{1}{T}) .$ (2.2)

Theorem 1

For an ensemble of estimators {Ê_l}_l_∈_l̄, assume that the conditions Inline graphic .1 and .2 hold. Then, there exists a weight vector w_o such that

E [{({\hat{E}}_{w_{o}} - E)}^{2}] = O (1 / T) .

This weight vector can be found by solving the following convex optimization problem:

\begin{array}{l} \underset{w}{minimize} & {‖ w ‖}_{2} \\ subject t o & \sum_{l \in \bar{l}} w (l) = 1, \\ γ_{w} (i) = \sum_{l \in \bar{l}} w (l) ψ_{i} (l) = 0, i \in J, \end{array}

(2.3)

where ψ_i(l) is the basis defined in (2.1).

Proof

The bias of the ensemble estimator is given by

\begin{array}{l} B ({\hat{E}}_{w}) = \sum_{i \in J} c_{i} γ_{w} (i) T^{- i / 2 d} + O (\frac{{‖ w ‖}_{1}}{\sqrt{T}}) \\ = \sum_{i \in J} c_{i} γ_{w} (i) T^{- i / 2 d} + O (\frac{\sqrt{L} {‖ w ‖}_{2}}{\sqrt{T}}) \end{array}

(2.4)

Denote the covariance matrix of {Ê_l; l ∈ l̄} by Σ_L. Let Σ̄_L = Σ_LT. Observe that by (2.2) and the Cauchy-Schwarz inequality, the entries of Σ̄_L are O(1). The variance of the weighted estimator Ê_w can then be bounded as follows:

\begin{array}{l} V ({\hat{E}}_{w}) = V (\sum_{l \in \bar{l}} w_{l} {\hat{E}}_{l}) = w^{'} \sum_{L} w = \frac{w^{'} {\sum^{¯}}_{L} w}{T} \\ \leq \frac{λ_{max} ({\sum^{¯}}_{L}) {‖ w ‖}_{2}^{2}}{T} \leq \frac{trace ({\sum^{¯}}_{L}) {‖ w ‖}_{2}^{2}}{T} \leq \frac{L {‖ w ‖}_{2}^{2}}{T} \end{array}

(2.5)

We seek a weight vector w that (i) ensures that the bias of the weighted estimator is O(T^−1/2) and (ii) has low ℓ₂ norm ||w||₂ in order to limit the contribution of the variance, and the higher order bias terms of the weighted estimator. To this end, let w_o be the solution to the convex optimization problem defined in (2.3). The solution w_o is the solution of

\begin{array}{l} \underset{w}{minimize} & {‖ w ‖}_{2}^{2} \\ subject to & A_{0} w = b, \end{array}

where A₀ and b are defined below. Let a₀ be the vector of ones: [1, 1…, 1]₁_×L; and let a_i, for each i ∈ Inline graphic be given by a_i = [ψ_i(l₁), .., ψ_i(l_L)]. Define $A_{0} = {[a_{0}^{'}, a_{i_{1}}^{'}, \dots, a_{i_{I}}^{'}]}^{'}, A_{1} = {[a_{i_{1}}^{'}, \dots, a_{i_{I}}^{'}]}^{'}$ and b = [1; 0; 0; ..; 0]₍_I_+1)×1.

Since L > I, the system of equations A₀w = b is guaranteed to have at least one solution (assuming linear independence of the rows a_i). The minimum squared norm $η_{L} (d) : = {‖ w_{0} ‖}_{2}^{2}$ is then given by

η_{L} (d) = \frac{det (A_{1} A_{1}^{'})}{det (A_{0} A_{0}^{'})} .

Consequently, by (2.4), the bias $B [{\hat{E}}_{w_{0}}] = O (\sqrt{L η_{L} (d)} / \sqrt{T})$ . By (2.5), the estimator variance Inline graphic [Ê_w₀] = O(Lη_L(d)/T). The overall MSE is also therefore of order O(Lη_L(d)/T).

For any fixed dimension d and fixed number of estimators L > I in the ensemble independent of sample size T, the value of η_L(d) is also independent of T. Stated mathematically, Lη_L(d) = Θ(1) for any fixed dimension d and fixed number of estimators L > I independent of sample size T. This concludes the proof.

In the next section, we will verify conditions Inline graphic .1(2.1) and .2(2.2) for plug-in estimators Ĝ_k(f) of entropy-like functionals G(f) = ∫g(f(x), x)f(x)dx.

3 Application to estimation of functionals of a density

Our focus is the estimation of general non-linear functionals G(f) of d-dimensional multivariate densities f with known finite support Inline graphic = [a, b]^d, where G(f) has the form

G (f) = \int g (f (x), x) f (x) d x,

(3.1)

for some smooth function g(f, x). Let Inline graphic denote the boundary of . Assume that T = N +M i.i.d realizations {X₁, …, X_N, X_N₊₁, …, X_N₊_M} are available from the density f.

3.1 Plug-in estimators of entropy

The truncated uniform kernel density estimator is defined below. For any positive real number k ≤ M, define the distance d_k to be: d_k = (k/M)^1/^d. Define the truncated kernel region for each X ∈ Inline graphic to be S_k(X) = {Y ∈ : ||X − Y ||_∞ ≤ d_k/2}, and the volume of the truncated uniform kernel to be V_k(X) = ∫_{S_k(X)} dz. Note that when the smallest distance from X to is greater than d_k/2, $V_{k} (X) = d_{k}^{d} = k / M$ . Let l_k(X) denote the number of samples falling in $S_{k} (X) : l_{k} (X) = \sum_{i = 1}^{M} 1_{{X_{i} \in S_{k} (X)}}$ . The truncated uniform kernel density estimator is defined as

{\hat{f}}_{k} (X) = \frac{l_{k} (X)}{M V_{k} (X)} .

(3.2)

The plug-in estimator of the density functional is constructed using a data splitting approach as follows. The data is randomly subdivided into two parts {X₁, …, X_N } and {X_N₊₁, …, X_N₊_M} of N and M points respectively. In the first stage, we form the kernel density estimate f̂_k at the N points {X₁, …, X_N } using the M realizations {X_N₊₁, …, X_N₊_M}. Subsequently, we use the N samples {X₁, …, X_N} to approximate the functional G(f) and obtain the plug-in estimator:

{\hat{G}}_{k} = \frac{1}{N} \sum_{i = 1}^{N} g ({\hat{f}}_{k} (X_{i}), X_{i}) .

(3.3)

Also define a standard kernel density estimator f̃_k, which is identical to f̂_k except that the volume V_k(X) is always set to the untruncated value V_k(X) = k/M. Define

{\tilde{G}}_{k} = \frac{1}{N} \sum_{i = 1}^{N} g ({\tilde{f}}_{k} (X_{i}), X_{i}) .

(3.4)

The estimator G̃_k is identical to the estimator of Györfi and van der Meulen [8]. Observe that the implementation of G̃_k, unlike Ĝ_k, does not require knowledge about the support of the density.

3.1.1 Assumptions

We make a number of technical assumptions that will allow us to obtain tight MSE convergence rates for the kernel density estimators defined above. ( Inline graphic .0) : Assume that k = k₀M^β for some rate constant 0 < β < 1, and assume that M, N and T are linearly related through the proportionality constant α_frac with: 0 < α_frac < 1, M = α_fracT and N = (1 − α_frac)T. ( .1) : Let the density f be uniformly bounded away from 0 and upper bounded on the set Inline graphic , i.e., there exist constants ε₀, ε_∞ such that 0 < ε₀ ≤ f(x) ≤ ε_∞ < ∞ ∀x ∈ . ( .2): Assume that the density f has continuous partial derivatives of order d in the interior of the set , and that these derivatives are upper bounded. ( .3): Assume that the function g(f, x) has max{λ, d} partial derivatives w.r.t. the argument f, where λ satisfies the condition λβ > 1. Denote the n-th partial derivative of g(f, x) wrt x by g⁽ⁿ⁾(f, x). ( Inline graphic .4): Assume that the absolute value of the functional g(f, x) and its partial derivatives are strictly upper bounded in the range ε₀ ≤ f ≤ ε_∞ for all x. ( .5): Let ε ∈ (0, 1) and δ ∈ (2/3, 1). Let (M) be a positive function satisfying the condition (M) = Θ(exp(−M^β^{(1 −} ^δ⁾)). For some fixed 0 < ε < 1, define p_l = (1 − ε)ε₀ and p_u = (1 + ε)ε_∞. Assume that the conditions

$sup_{x} ∣ h (0, x) ∣ < G_{1} < \infty$ ,
$sup_{f \in (p_{l}, p_{u}), x} ∣ h (f, x) ∣ < G_{2} < \infty$ ,
$sup_{f \in (1 / k, p_{u}), x} ∣ h (f, x) ∣ C (M) < G_{3} < \infty \forall M$ ,
$sup_{f \in (p_{l}, 2^{d} M / k), x} ∣ h (f, x) ∣ C (M) < G_{4} < \infty \forall M$ ,

are satisfied by h(f, x) = g(f, x), g⁽³⁾(f, x) and g⁽^λ⁾(f, x), for some constants G₁, G₂, G₃ and G₄.

These assumptions are comparable to other rigorous treatments of entropy estimation. The assumption ( Inline graphic .0) is equivalent to choosing the bandwidth of the kernel to be a fractional power of the sample size [15]. The rest of the above assumptions can be divided into two categories: (i) assumptions on the density f, and (ii) assumptions on the functional g. The assumptions on the smoothness, boundedness away from 0 and ∞ of the density f are similar to the assumptions made by other estimators of entropy as listed in Section II, [1]. The assumptions on the functional g ensure that g is sufficiently smooth and that the estimator is bounded. These assumptions on the functional are readily satisfied by the common functionals that are of interest in literature: Shannon g(f, x) = −log(f)I(f > 0) + I(f = 0) and Rényi g(f, x) = f^α⁻¹I(f > 0) + I(f = 0) entropy, where I(.) is the indicator function, and the quadratic functional g(f, x) = f².

3.1.2 Analysis of MSE

Under the assumptions stated above, we have shown the following in the Appendix:

Theorem 2

The biases of the plug-in estimators Ĝ_k, G̃_k are given by

\begin{array}{l} B ({\hat{G}}_{k}) = \sum_{i = 1}^{d} c_{1, i} {(\frac{k}{M})}^{i / d} + \frac{c_{2}}{k} + o (\frac{1}{k} + \frac{k}{M}) \\ B ({\tilde{G}}_{k}) = c_{1} {(\frac{k}{M})}^{1 / d} + \frac{c_{2}}{k} + o (\frac{1}{k} + \frac{k}{M}), \end{array}

where c_1,_i, c₁ and c₂ are constants that depend on g and f.

Theorem 3

The variances of the plug-in estimators Ĝ_k, G̃_k are identical up to leading terms, and are given by

\begin{array}{l} V ({\hat{G}}_{k}) = c_{4} (\frac{1}{N}) + c_{5} (\frac{1}{M}) + o (\frac{1}{M} + \frac{1}{N}) \\ V ({\tilde{G}}_{k}) = c_{4} (\frac{1}{N}) + c_{5} (\frac{1}{M}) + o (\frac{1}{M} + \frac{1}{N}), \end{array}

where c₄ and c₅ are constants that depend on g and f.

3.1.3 Optimal MSE rate

From Theorem 2, observe that the conditions k → ∞ and k/M → 0 are necessary for the estimators Ĝ_k and G̃_k to be unbiased. Likewise from Theorem 3, the conditions N → ∞ and M → ∞ are necessary for the variance of the estimator to converge to 0. Below, we optimize the choice of bandwidth k for minimum MSE, and also show that the optimal MSE rate is invariant to the choice of α_frac.

Optimal choice of k

Minimizing the MSE over k is equivalent to minimizing the square of the bias over k. The optimal choice of k is given by

k_{opt} = Θ (M^{1 / 1 + d}),

(3.5)

and the bias evaluated at k_opt is Θ(M^−1/1+^d).

Choice of α_frac

Observe that the MSE of Ĝ_k and G̃_k are dominated by the squared bias (Θ(M^−2/1+^d)) as contrasted to the variance (Θ(1/N + 1/M)). This implies that the asymptotic MSE rate of convergence is invariant to the selected proportionality constant α_frac.

In view of (a) and (b) above, the optimal MSE for the estimators Ĝ_k and G̃_k is therefore achieved for the choice of k = Θ(M^1/(1+^d⁾), and is given by Θ(T^−2/(1+^d⁾). Our goal is to reduce the estimator MSE to O(T⁻¹). We do so by applying the method of weighted ensembles described in Section 2.

3.2 Weighted ensemble entropy estimator

For a positive integer L > I = d−1, choose l̄ = {l₁, …, l_L} to be positive real numbers. Define the mapping $k (l) = l \sqrt{M}$ and let k̄ = {k(l); l ∈ l̄}. Define the weighted ensemble estimator

{\hat{G}}_{w} = \sum_{l \in \bar{l}} w (l) {\hat{G}}_{k (l)} .

(3.6)

From Theorems 2 and 3, we see that the biases of the ensemble of estimators {Ĝ_k₍_l₎; l ∈ l̄} satisfy Inline graphic .1(2.1) when we set ψ_i(l) = l^i/d and = {1, .., d−1}. Furthermore, the general form of the variance of Ĝ_k₍_l₎ follows .2(2.2) because N, M = Θ(T). This implies that we can use the weighted ensemble estimator Ĝ_w to estimate entropy at O(Lη_L(d)/T) convergence rate by setting w equal to the optimal weight w_o given by (2.3).

4 Experiments

We illustrate the superior performance of the proposed weighted ensemble estimator for two applications: (i) estimation of the Panter-Dite rate distortion factor, and (ii) estimation of entropy to test for randomness of a random sample.

For finite T direct use of Theorem 1 can lead to excessively high variance. This is because forcing the condition (2.3) that γ_w(i) = 0 is too strong and, in fact, not necessary. The careful reader may notice that to obtain O(T⁻¹) MSE convergence rate in Theorem 1 it is sufficient that γ_w(i) be of order O(T^−1/2+ⁱ^/2^d). Therefore, in practice we determine the optimal weights according to the optimization:

\begin{array}{l} min_{w} & ε \\ subject to & γ_{w} (0) = 1, \\ ∣ γ_{w} (i) T^{1 / 2 - i / 2 d} ∣ \leq ε, i \in J, \\ {‖ w ‖}_{2}^{2} \leq η . \end{array}

(4.1)

The optimization (4.1) is also convex. Note that, as contrasted to (2.3), the norm of the weight vector w is bounded instead of being minimized. By relaxing the constraints γ_w(i) = 0 in (2.3) to the softer constraints in (4.1), the upper bound η on ${‖ w ‖}_{2}^{2}$ can be reduced from the value η_L(d) obtained by solving (2.3). This results in a more favorable trade-off between bias and variance for moderate sample sizes. In our experiments, we find that setting η = 3d yields good MSE performance. Note that as T → ∞, we must have γ_w(i) → 0 for i ∈ Inline graphic in order to keep ε finite, thus recovering the strict constraints in (2.3).

For fixed sample size T and dimension d, observe that increasing L increases the number of degrees of freedom in the convex problem (4.1), and therefore will result in a smaller value of ε and in turn improved estimator performance. In our simulations, we choose l̄ to be L = 50 equally spaced values between 0.3 and 3, ie the l_i are uniformly spaced as

l_{i} = \frac{x}{a} + \frac{(a - 1) i x}{a L}; i = 1, .., L,

with scale and range parameters a = 10 and x = 3 respectively. We limit L to 50 because we find that the gains beyond L = 50 are negligible. The reason for this diminishing return is a direct result of the increasing similarity among the entries in l̄, which translates to increasingly similar basis functions ψ_i(l) = lⁱ^/^d.

4.1 Panter-Dite factor estimation

For a d-dimensional source with underlying density f, the Panter-Dite distortion-rate function [6] for a q-dimensional vector quantizer with n levels of quantization is given by δ(n) = n^−2/^q ∫ f^q^/(^q⁺²⁾(x)dx. The Panter-Dite factor corresponds to the functional G(f) with g(f, x) = n^−2/^qf^−2/(^q⁺²⁾I(f > 0) + I(f = 0). The Panter-Dite factor is directly related to the Rényi α-entropy, for which several other estimators have been proposed [7, 3, 14, 12].

In our simulations we compare six different choices of functional estimators - the three estimators previously introduced: (i) the standard kernel plug-in estimator G̃_k, (ii) the boundary truncated plug-in estimator Ĝ_k and (iii) the weighted estimator Ĝ_w with optimal weight w = w^* given by (4.1), and in addition the following popular entropy estimators: (iv) histogram plug-in estimator [7], (v) k-nearest neighbor (k-NN) entropy estimator [12] and (vi) entropic k-NN graph estimator [3, 14]. For both G̃_k and Ĝ_k, we select the bandwidth parameter k as a function of M according to the optimal proportionality k = M^1/(1+^d⁾ and N = M = T/2.

We choose f to be the d dimensional mixture density f(a, b, p, d) = pf_β(a, b, d) + (1−p)f_u(d); where d = 6, f_β(a, b, d) is a d-dimensional Beta density with parameters a = 6, b = 6, f_u(d) is a d-dimensional uniform density and the mixing ratio p is 0.8. The reason we choose the beta-uniform mixture for our experiments is because it trivially satisfies all the assumptions on the density f listed in Section 3.1, including the assumptions of finite support and strict boundedness away from 0 on the support. The true value of the Panter-Dite factor δ(n) for the beta-uniform mixture is calculated using numerical integration methods via the ‘Mathematica’ software (http://www.wolfram.com/mathematica/). Numerical integration is used because evaluating the entropy in closed form for the beta-uniform mixture is not tractable.

The MSE values for each of the six estimators are calculated by averaging the squared error [δ̂_i(n) − δ(n)]², i = 1, .., m over m = 1000 Monte-Carlo trials, where each δ̂_i(n) corresponds to an independent instance of the estimator.

4.1.1 Variation of MSE with sample size T

The MSE results of the different estimators are shown in Fig. 1(a) as a function of sample size T, for fixed dimension d = 6. It is clear from the figure that the proposed ensemble estimator Ĝ_w has significantly faster rate of convergence while the MSE of the rest of the estimators, including the truncated kernel plug-in estimator, have similar, slow rates of convergence. It is therefore clear that the proposed optimal ensemble averaging significantly accelerates the MSE convergence rate.

Variation of MSE of Panter-Dite factor estimates using standard kernel plug-in estimator [14], truncated kernel plug-in estimator (3.3), histogram plug-in estimator[17], k-NN estimator [20], entropic graph estimator [18] and the weighted ensemble estimator (3.6).

(a) Variation of MSE of Panter-Dite factor estimates as a function of sample size T. From the figure, we see that the proposed weighted estimator has the fastest MSE rate of convergence wrt sample size T (d = 6).

(b) Variation of MSE of Panter-Dite factor estimates as a function of dimension d. From the figure, we see that the MSE of the proposed weighted estimator has the slowest rate of growth with increasing dimension d (T = 3000).

4.1.2 Variation of MSE with dimension d

For fixed sample size T and fixed number of estimators L, it can be seen that ε increases monotonically with d. This follows from the fact that the number of constraints in the convex problem 4.1 is equal to d + 1 and each of the basis functions ψ_i(l) = lⁱ^/^d monotonically approaches 1 as d grows, . This in turn implies that for a fixed sample size T and number of estimators L, the overall MSE of the ensemble estimator should increase monotonically with the dimension d.

The MSE results of the different estimators are shown in Fig. 1(b) as a function of dimension d, for fixed sample size T = 3000. For the standard kernel plug-in estimator and truncated kernel plug-in estimator, the MSE increases rapidly with d as expected. The MSE of the histogram and k-NN estimators increase at a similar rate, indicating that these estimators suffer from the curse of dimensionality as well. On the other hand, the MSE of the weighted estimator also increases with the dimension as predicted, but at a slower rate. Also observe that the MSE of the weighted estimator is smaller than the MSE of the other estimators for all dimensions d > 3.

4.2 Distribution testing

In this section, we illustrate the weighted ensemble estimator for non-parametric estimation of Shannon differential entropy. The Shannon differential entropy is given by G(f) where g(f, x) = −log(f)I(f > 0) + I(f = 0). The improved accuracy of the weighted ensemble estimator is demonstrated in the context of hypothesis testing using estimated entropy as a statistic to test for the underlying probability distribution of a random sample. Specifically, the samples under the null and alternate hypotheses H₀ and H₁ are drawn from the probability distribution f(a, b, p, d), described in Section IV.A, with fixed d = 6, p = 0.75 and two sets of values of a, b under the null and alternate hypothesis, H₀ : a = a₀, b = b₀ versus H₁ : a = a₁, b = b₁.

First, we fix a₀ = b₀ = 6 and a₁ = b₁ = 5. The density under the null hypothesis f(6, 6, 0.75, 6) has greater curvature relative to f(5, 5, 0.75, 6) and therefore has smaller entropy. Five hundred (500) experiments are performed under each hypothesis with each experiment consisting of 1000 samples drawn from the corresponding distribution. The true entropy and estimates G̃_k, Ĝ_k and Ĝ_w obtained from each instance of 10³ samples are shown in Fig. 2(a) for the 1000 experiments. This figure suggests that the ensemble weighted estimator provides better discrimination ability by suppressing the bias, at the cost of some additional variance.

Entropy estimates using standard kernel plug-in estimator, truncated kernel plug-in estimator and the weighted estimator, for random samples corresponding to hypothesis H₀ and H₁. The weighted estimator provides better discrimination ability by suppressing the bias, at the cost of some additional variance.

(a) Entropy estimates for random samples corresponding to hypothesis H₀ (experiments 1–500) and H₁ (experiments 501–1000).

(b) Histogram envelopes of entropy estimates for random samples corresponding to hypothesis H₀ (blue) and H₁ (red).

To demonstrate that the weighted estimator provides better discrimination, we plot the histogram envelope of the entropy estimates using standard kernel plug-in estimator, truncated kernel plug-in estimator and the weighted estimator for the cases corresponding to the hypothesis H₀ (color coded blue) and H₁ (color coded red) in Fig. 2(b). Furthermore, we quantitatively measure the discriminative ability of the different estimators using the deflection statistic $d s = ∣ μ_{1} - μ_{0} ∣ / \sqrt{σ_{0}^{2} + σ_{1}^{2}}$ , where μ₀ and σ₀ (respectively μ₁ and σ₁) are the sample mean and standard deviation of the entropy estimates. The deflection statistic was found to be 1.49, 1.60 and 1.89 for the standard kernel plug-in estimator, truncated kernel plug-in estimator and the weighted estimator respectively. The receiver operating curves (ROC) for this entropy-based test using the three different estimators are shown in Fig. 3(a). The corresponding areas under the ROC curves (AUC) are given by 0.9271, 0.9459 and 0.9619.

Comparison of performance in terms of ROC for the distribution testing problem. The weighted estimator uniformly outperforms the individual plug-in estimators.

In our final experiment, we fix a₀ = b₀ = 10 and set a₁ = b₁ = 10 − δ, perform 500 experiments each under the null and alternate hypotheses with samples of size 5000, and plot the AUC as δ varies from 0 to 1 in Fig. 3(b). For comparison, we also plot the AUC for the Neyman-Pearson likelihood ratio test. The Neyman-Pearson likelihood ratio test, unlike the Shannon entropy based tests, is an omniscient test that assumes knowledge of both the underlying beta-uniform mixture parametric model of the density and the parameter values a₀, b₀ and a₁, b₁ under the null and alternate hypothesis respectively. Figure 4 shows that the weighted estimator uniformly and significantly outperforms the individual plug-in estimators and comes closest to the performance of the omniscient Neyman-Pearson likelihood test. The relatively superior performance of the Neyman-Pearson likelihood test is due to the fact that the weighted estimator is a nonparametric estimator that has marginally higher variance (proportional to ${‖ w^{*} ‖}_{2}^{2}$ ) as compared to the underlying parametric model for which the Neyman-Pearson test statistic provides the most powerful test.

5 Conclusions

We have proposed a new estimator of functionals of a multivariate density based on weighted ensembles of kernel density estimators. For ensembles of estimators that satisfy general conditions on bias and variance as specified by Inline graphic .1(2.1) and .2(2.2) respectively, the weight optimized ensemble estimator has parametric O(T⁻¹) MSE convergence rate that can be much faster than the rate of convergence of any of the individual estimators in the ensemble. The optimal weights are determined as a solution to a convex optimization problem that can be performed offline and does not require training data. We illustrated this estimator for uniform kernel plug-in estimators and demonstrated the superior performance of the weighted ensemble entropy estimator for (i) estimation of the Panter-Dite factor and (ii) non-parametric hypothesis testing.

Several extensions of the framework of this paper are being pursued: (i) using k-nearest neighbor (k-NN) estimators in place of kernel estimators; (ii) extending the framework to the case where support Inline graphic is not known, but for which conditions .1 and .2 hold; (iii) using ensemble estimators for estimation of other functionals of probability densities including divergence, mutual information and intrinsic dimension; and (iv) using an l₁ norm ||w||₁ in place of the l₂ norm ||w||₂ in the weight optimization algorithm (2.3) so as to introduce sparsity into the weighted ensemble.

Acknowledgments

This work was partially supported by (i) ARO grant W911NF-12-1-0443 and (ii) NIH grant 2P01CA087634-06A2.

Appendices: Outline of appendix

We first establish moment properties for uniform kernel density estimates in Appendix A. Subsequently, we prove theorems 2 and 3 in Appendix B.

A Moment properties of boundary compensated uniform kernel density estimates

Throughout this section, we assume without loss of generality that the support Inline graphic = [−1, 1]^d. Observe that l_k(X) is a binomial random variable with parameters M and U_k(X) = Pr(Z ∈ S_k(X)). The probability mass function of the binomial random variable l_k(X) is given by

P r (l_{k} (X) = l) = (\begin{matrix} M \\ l \end{matrix}) {(U_{k} (X))}^{l} {(1 - U_{k} (X))}^{M - l} .

Define the error function of the truncated uniform kernel density,

\begin{array}{l} {\hat{e}}_{k} (X) = {\hat{f}}_{k} (X) - E [{\hat{f}}_{k} (X)] \\ = \frac{l_{k} (X)}{M V_{k} (X)} - \frac{U_{k} (X)}{V_{k} (X)} \\ = \frac{\sum_{i = 1}^{M} (1_{X_{i} \in S_{k} (X)} - U_{k} (X))}{M V_{k} (X)} . \end{array}

(A.1)

Also define the error function of the standard uniform kernel density,

\begin{array}{l} {\tilde{e}}_{k} (X) = {\tilde{f}}_{k} (X) - E [{\tilde{f}}_{k} (X)] \\ = (M V_{k} (X) / k) {\hat{e}}_{k} (X), \end{array}

and note that when X ∈ Inline graphic (k), ẽ_k(X) = ê_k(X).

A.1 Taylor series expansion of coverage

For any X ∈ Inline graphic , the coverage function U_k(X) can be represented by using a d order Taylor series expansion of f about X as follows. Because the density f has continuous partial derivatives of order d in , for any X ∈ ,

\begin{array}{l} U_{k} (X) = \int_{S_{k} (X)} f (z) d z \\ = f (X) V_{k} (X) + \sum_{i = 1}^{d} c_{i, k} (X) V_{k}^{1 + i / d} (X) + o ({(k / M)}^{2}), \end{array}

(A.2)

where c_i_,_k are functions which depend on k and the unknown density f. This implies that the expectation of the density estimate is given by

\begin{array}{l} E [{\hat{f}}_{k} (X)] = U_{k} (X) / V_{k} (X) \\ = f (X) + \sum_{i = 1}^{d} c_{i, k} (X) {(\frac{k}{M})}^{i / d} + o ((\frac{k}{M})) . \end{array}

(A.3)

A.2 Concentration inequalities for uniform kernel density estimator

Because l_k(X) is a binomial random variable, standard Chernoff inequalities can be applied to obtain concentration bounds on l_k(X). In particular, for 0 < p < 1/2,

\begin{array}{l} P r (l_{k} (X) > (1 + p) M U_{k} (X)) \leq e^{- M U_{k} (X) p^{2} / 4}, \\ P r (l_{k} (X) > (1 - p) M U_{k} (X)) \leq e^{- M U_{k} (X) p^{2} / 4} . \end{array}

(A.4)

Let Inline graphic (X) denote the event (1 − p_k)MU_k(X) < l_k(X) < (1+ p_k)MU_k(X), where p_k = 1/(k^δ^/2) for some fixed δ ∈ (2/3, 1). Then, for k = O(M^β),

P r (♮^{c} (X)) = O (e^{- p_{k}^{2} k}) = C (M),

(A.5)

where Inline graphic (M) satisfies the condition lim_M_→∞ M^a/ (M) = 0 for any a > 0. Also observe that under the event (X),

\begin{array}{l} {\hat{e}}_{k} (X) = \frac{l_{k} (X)}{M V_{k} (X)} - \frac{U_{k} (X)}{V_{k} (X)} \\ = O (p_{k} U_{k} (X) / V_{k} (X)) = O (p_{k}) = O (1 / (k^{δ / 2})) . \end{array}

(A.6)

A.3 Bounds on uniform kernel density estimator

Let B_r(X) be an Euclidean ball of radius r centered at X. Let X be a Lebesgue point of f, i.e., an X for which

lim_{r \to 0} \frac{\int_{B_{r} (X)} f (y) d y}{\int_{B_{r} (x)} d y} = f (X) .

Because f is an density, we know that almost all X ∈ Inline graphic satisfy the above property. Now, fix ε ∈ (0, 1) and find ε_r > 0 such that

sup_{0 < r \leq ε_{r}} \frac{\int_{B_{r} (X)} f (y) d y}{\int_{B_{r} (x)} d y} - f (X) \leq ε / 2 f (X) .

For small values of k/M, B_{ε_r}(X) ⊂ S_k(X) and therefore

(1 - ε / 2) f (X) V_{k} (X) \leq U_{k} (X) \leq (1 + ε / 2) f (X) V_{k} (X)

(A.7)

This implies that under the event Inline graphic (X) defined in the previous subsection,

(1 - ε) ε_{0} \leq {\hat{f}}_{k} (X) \leq (1 + ε) ε_{\infty} .

(A.8)

Let Inline graphic (X) denote the event that f̂_k(X) = 0. Let (X) denote the event 1 <= l_k(X) <= (1 − p_k)MU_k(X) and (X) denote l_k(X) >= (1 + p_k)MU_k(X). Then conditioned on the event (X)

1 / k \leq {\hat{f}}_{k} (X) \leq (1 + ε) ε_{\infty} .

(A.9)

and conditioned on the event Inline graphic (X)

(1 - ε) ε_{0} \leq {\hat{f}}_{k} (X) \leq 2^{d} M / k .

(A.10)

Observe that Inline graphic (X), (X), (X) and (X) form a disjoint partition of the event space.

A.4 Bias

Lemma 4

Let γ(x, y) be an arbitrary function with d partial derivatives wrt x and sup_x_,_y |γ(x, y)| < ∞. Let X₁, .., X_M, X denote M + 1 i.i.d realizations of the density f. Then,

E [γ ({\overset{ˇ}{f}}_{k} (Z), (Z))] - E [γ (f (Z), Z)] = \sum_{i = 1}^{d} c_{1, i} (γ (x, y)) {(k / M)}^{i / d} + o ((k / M)),

(A.11)

where c_1,_i(γ(x, y)) are functionals of γ and f.

Proof

To analyze the bias, first extend the density function f as follows. In particular, extend the definition of f to the domain Inline graphic = [−2, 2]^d while ensuring that the extended function f_e is differentiable d times on this extended domain. Let s_k(X) = {Y : ||X − Y||₁ ≤ d_k/2} be the natural un-truncated ball. Let u_k(X) = ∫_{z∈s_k(X)} f_e(z)dz. Define the function f̄_k(X) = u_k(X)/(k/M). For any X ∈ Inline graphic , using this extended definition,

\begin{array}{l} u_{k} (X) = \int_{s_{k} (X)} f_{e} (z) d z \\ = f (X) (k / M) + \sum_{i = 1}^{d} c_{i} (X) {(k / M)}^{1 + i / d} + o ({(k / M)}^{2}), \end{array}

(A.12)

where c_i are only functions of the unknown density f_e. Also define f̌_k(X) = Inline graphic [f̂_k(X)|X]. Define the interior region (k) = {X ∈ : s_k(X) ∩ = ϕ}. Note that f̄_k(X) = f̌_k(X) for all X ∈ (k). Now,

\begin{array}{l} E [γ ({\overset{ˇ}{f}}_{k} (Z), Z)] - E [γ (f (Z), Z)] = E [γ ({\bar{f}}_{k} (Z), Z) - γ (f (Z), Z)] + E [γ ({\overset{ˇ}{f}}_{k} (Z), Z) - γ ({\bar{f}}_{k} (Z), Z)] \\ = E [γ ({\bar{f}}_{k} (Z), Z) - γ (f (Z), Z)] + E [1_{Z \in S - S_{I} (k)} (γ ({\overset{ˇ}{f}}_{k} (Z), Z) - γ ({\bar{f}}_{k} (Z), Z))] \\ = I + I I . \end{array}

(A.13)

A.4.1 Evaluation of I

\begin{array}{l} I = E [γ ({\bar{f}}_{k} (Z), Z) - γ (f (Z), Z)] \\ = \sum_{i = 1}^{d} E [γ^{(i)} (f (Z), Z) {({\bar{f}}_{k} (Z) - f (Z))}^{i}] \\ = \sum_{i = 1}^{d} c_{11, i} (γ (x, y)) {(k / M)}^{i / d} + o ((k / M)), \end{array}

(A.14)

where c_11,_i(γ(x, y)) are functionals of γ(x, y) and its derivatives.

A.4.2 Evaluation of II

Let m = M/2, k_M = k/M and k_m = (k/m)^1/^d. Define mappings Inline graphic , and : − (k) → as follows. Let u(X) denote the unit vector from the origin to X, and define (X) = u(X) ∩ . Let (m) be a reference set. Define (X) = u(X) ∩ (m). Let l_b(X) = || (X) − X||. Finally define (X) = n(X)u(X), where n(X) satisfies || (X) − (X)|| = (m/k)^1/^dl_b(X). For each X ∈ Inline graphic − (k), let l_r(X) = || (X) − (X)|| and l_max(X) = || (X) − (X)||. Let denote the set of all unit vectors: = u(X). Observe that, by definition, the shape of the regions S_k(X) and S_m( (X)) is identical. This is illustrated in Fig. 4.

Analysis of f̄_m( (X)), f̌_m( (X))

Inline graphic (X) can represented in terms of (X) as (X) = (X)+l_s(X)u(X). Using Taylor series around (X), f̌_m( (X)) can then be evaluated as

\begin{array}{l} {\overset{ˇ}{f}}_{m} (F_{s} (X)) = U_{m} (F_{s} (X)) / V_{m} (F_{s} (X)) \\ = f (F_{b} (X)) + \sum_{i = 1}^{d} {\overset{ˋ}{c}}_{i, F_{b} (X)} (F_{b} (X)) l_{r}^{i} (X) + o (l_{r}^{d} (X)), \end{array}

(A.15)

where the functionals Inline graphic depend only on the shape of the regions S_k(X) or S_m( (X)) and therefore only on (X). Similarly,

\begin{array}{l} {\bar{f}}_{m} (X) = u_{m} (X) / (1 / 2) \\ = f (F_{b} (X)) + \sum_{i = 1}^{d} {\overset{´}{c}}_{i, F_{b} (X)} (F_{b} (X)) l_{r}^{i} (X) + o (l_{r}^{d} (X)), \end{array}

(A.16)

where the functionals Inline graphic again depend only on (X). This implies that for any fixed u ∈ and corresponding X_b ∈ , for any function η(x) and positive integer q ∈ {1, .., d}, integration over the line l(X_b) = {X_b − cu(X_b); c ∈ (0, l_max(X_b))}

\int_{Z \in l (X_{b})} η (Z) {({\overset{ˇ}{f}}_{m} (Z) - f (Z))}^{q} d Z \sum_{i = q}^{d} {\overset{ˋ}{c}}_{i, q, η} (X_{b}) l_{\max}^{i} (X) + o (l_{\max}^{d} (X)),

(A.17)

and

\int_{Z \in l (X_{b})} η (Z) {({\bar{f}}_{m} (Z) - f (Z))}^{r} d Z \sum_{i = q}^{d} {\overset{´}{c}}_{i, q, η} (X_{b}) l_{\max}^{q} (X) + o (l_{\max}^{d} (X)),

(A.18)

where the functions c`_i_,_q_,_η (X_b) and c`_i_,_q_,_η (X_b) depend only on X_b, q, η and are independent of Z and k.

Analysis of f̄_k(X), f̌_k(X)

Inline graphic (X) can be represented in terms of X as (X) = X + k_ml_r(X)u(X). Identically, this gives,

\begin{array}{l} {\overset{ˇ}{f}}_{k} (X) = U_{k} (X) / V_{k} (X) \\ = f (F_{b} (X)) + \sum_{i = 1}^{d} {\overset{ˋ}{c}}_{i, F_{b} (X)} (F_{b} (X)) k_{m}^{i} l_{r}^{i} (X) + o (k_{M}^{d} l_{r}^{d} (X)) . \end{array}

(A.19)

and

\begin{array}{l} {\bar{f}}_{k} (X) = u_{k} (X) / k_{M} \\ = f (F_{b} (X)) + \sum_{i = 1}^{d} {\overset{´}{c}}_{i, F_{b} (X)} (F_{b} (X)) k_{M}^{i} l_{r}^{i} (X) + o (k_{M}^{d} l_{r}^{d} (X)) . \end{array}

(A.20)

This implies that for any fixed u ∈ Inline graphic and corresponding X_b ∈ , integration over the line l(X_b) = {X_b − cu(X_b); c ∈ (0, k_ml_max(X_b))}

\int_{Z \in l (X_{b})} η (Z) {({\overset{ˇ}{f}}_{k} (Z) - f (Z))}^{q} d Z \sum_{i = q}^{d} {\overset{ˋ}{c}}_{i, r, η} (X_{b}) k_{M}^{i} l_{\max}^{i} (X) + o (k_{m}^{d} l_{\max}^{d} (X)),

(A.21)

and

\int_{Z \in l (X_{b})} η (Z) {({\bar{f}}_{k} (Z) - f (Z))}^{q} d Z \sum_{i = q}^{d} {\overset{´}{c}}_{i, r, η} (X_{b}) k_{M}^{i} l_{\max}^{i} (X) + o (k_{m}^{d} l_{\max}^{d} (X)) .

(A.22)

Analysis of II

\begin{array}{l} I I = E [1_{Z \in S - S_{I} (k)} (γ ({\overset{ˇ}{f}}_{k} (Z), Z) - γ ({\bar{f}}_{k} (Z), Z))] \\ = \int_{Z \in S - S_{I} (k)} (γ ({\overset{ˇ}{f}}_{k} (Z), Z) - γ ({\bar{f}}_{k} (Z), Z)) f (Z) d Z \\ = \int_{{X_{b} \in B} \cup {c \in (0, k_{m} l_{\max} (X_{b}))}} 1_{{Z = X_{b} - c u (X_{b})}} \sum_{i = 1}^{d} [γ^{(i)} (f (X_{b}), X_{b}) {({\overset{ˇ}{f}}_{k} (Z) - f (Z))}^{i}] f (Z) d Z \\ - \int_{{X_{b} \in B} \cup {c \in (0, k_{m} l_{\max} (X_{b}))}} 1_{{Z = X_{b} - c u (X_{b})}} \sum_{i = 1}^{d} [γ^{(i)} (f (X_{b}), X_{b}) {({\bar{f}}_{k} (Z) - f (Z))}^{i}] f (Z) d Z \\ = \sum_{i = 1}^{d} c_{12, i} (γ (x, y)) {(k / M)}^{i / d} + 0 ((k / M)), \end{array}

(A.23)

where c_12,_i(γ(x, y)) are functionals of γ(x, y) and its derivatives. This implies that

\begin{array}{l} E [γ ({\overset{ˇ}{f}}_{k} (Z))] - E [γ (f (Z))] = I + I I \\ = \sum_{i = 1}^{d} c_{1, i} (γ (x, y)) {(k / M)}^{i / d} + o ((k / M)), \end{array}

(A.24)

where the functionals c_1,_i(γ(x, y)) are independent of k.

A.5 Central Moments

Since l_k(X) is a binomial random variable, we can easily obtain moments of the uniform kernel density estimate in terms of U_k(X). These are listed below.

Lemma 5

Let γ(x) be an arbitrary function satisfying sup_x |γ(x)| < ∞. Let X₁, .., X_M, X denote M + 1 i.i.d realizations of the density f. Then,

E [γ (X) {\hat{e}}_{k}^{q} (X)] = 1_{{q = 2}} c_{2} (γ (x)) (\frac{1}{k}) + o (\frac{1}{k}),

(A.25)

E [γ (X) {\tilde{e}}_{k}^{q} (X)] = 1_{{q = 2}} c_{2} (γ (x)) (\frac{1}{k}) + o (\frac{1}{k}),

(A.26)

where c₂(γ(x)) is a functional of γ and f.

Proof

When r = 2,

\begin{array}{l} V [{\hat{f}}_{k} (X)] = E [{\hat{e}}_{k}^{2} (X)] \\ = \frac{U_{k} (X) (1 - U_{k} (X))}{M V_{k}^{2} (X)} \\ = \frac{f (X)}{M V_{k} (X)} + o (\frac{1}{k}) . \end{array}

(A.27)

For any integer r ≥ 3,

\begin{array}{l} E [{\hat{e}}_{k}^{r} (X)] = E [1_{♮ (X)} {\hat{e}}_{k}^{r} (X)] + E [1_{♮^{c} (X)} {\hat{e}}_{k}^{r} (X)] \\ = O (\frac{1}{k^{δ r / 2}}) = o (1 / k) . \end{array}

(A.28)

Observe that V_k(X) = Θ(k/M) and therefore $E [{\hat{e}}_{k}^{2} (X)] = Θ (1 / k) + o (1 / k)$ . This implies,

E [γ (X) {\hat{e}}_{k}^{q} (X)] = 1_{{q = 2}} c_{2} (γ (x)) (\frac{1}{k}) + o (\frac{1}{k}) .

When X ∈ Inline graphic (k), ẽ_k(X) = ê_k(X). Also Pr(X ∈ (k)) = o(1). This result in conjunction with the fact that ẽ_k(X) = (MV_k(X)/k) ê_k(X), and V_k(X) = Θ(k/M) gives

E [γ (X) {\tilde{e}}_{k}^{q} (X)] = 1_{{q = 2}} c_{2} (γ (x)) (\frac{1}{k}) + o (\frac{1}{k}) .

A.6 Cross moments

Let X and Y be two distinct points. Clearly the density estimates at X and Y are not independent. Observe that the uniform kernel regions S_k(X), S_k(Y) are disjoint for the set of points given by Ψ_k:= {X, Y}: ||X − Y||₁ ≥ 2(k/M)^1/^d, and have finite intersection on the complement of Ψ_k.

Intersecting balls

Lemma 6

For a fixed pair of points {X, Y} ∈ Ψ_k, and positive integers q, r,

Cov [{\hat{e}}_{k}^{q} (X), {\hat{e}}_{k}^{r} (Y)] = 1_{{q = 1, r = 1}} (\frac{- f (X) f (Y)}{M}) + o (\frac{1}{M}) .

Proof

For a fixed pair of points {X, Y} ∈ Ψ_K, the joint probability mass function of the functions l_k(X),l_k(Y) is given by

P r (l_{k} (X) = l_{x}, l_{k} (Y) = l_{y}) = 1_{{l_{x} + l_{y} \leq M}} (\begin{matrix} M \\ l_{x}, l_{y} \end{matrix}) {(U_{k} (X))}^{l_{x}} {(U_{k} (Y))}^{l_{y}} {(1 - U_{k} (X) - U_{k} (Y))}^{M - l_{x} - l_{y}} .

Denote the high probability event Inline graphic (X)∩ (Y) by (X, Y). Define l̂_k(X), l̂_k(Y) to be binomial random variables with parameters {U_k(X), M − q} and {U_k(Y), M − r} respectively. The covariance between powers of density estimates is then given by

\begin{array}{l} Cov ({\hat{f}}_{k}^{q} (X), {\hat{f}}_{k}^{r} (Y)) = (\frac{1}{M^{q + r} V_{k}^{q} (X) V_{k}^{r} (Y)}) Cov (l_{k}^{q} (X), l_{k}^{r} (Y)) \\ = (\frac{1}{M^{q + r} V_{k}^{q} (X) V_{k}^{r} (Y)}) \sum l_{x}^{q} l_{y}^{r} [P r (l_{k} (X) = l_{x}, l_{k} (Y) = l_{y}) - P r (l_{k} (X) = l_{x}) P r (l_{k} (Y) = l_{y})] \\ = (\frac{1}{M^{q + r} V_{k}^{q} (X) V_{k}^{r} (Y)}) \sum_{♮ (X, Y)} l_{x}^{q} l_{y}^{r} [P r (l_{k} (X) = l_{x}, l_{k} (Y) = l_{y}) - P r (l_{k} (X) = l_{x}) P r (l_{k} (Y) = l_{y})] + o (\frac{1}{M}) \\ = (\frac{1}{M^{q + r} V_{k}^{q} (X) V_{k}^{r} (Y)}) \sum_{♮ (X, Y)} \frac{l_{x}^{q} l_{y}^{r} U_{k}^{q} (X) U_{k}^{r} (Y)}{(l_{x} \times \dots \times l_{x} - q + 1) (l_{y} \times \dots \times l_{y} - r + 1)} \times \\ [(M \times \dots \times M - (q + r - 1)) P r ({\hat{l}}_{k} (X) = l_{x}, {\hat{l}}_{k} (Y) = l_{y}) \\ - (M \times \dots \times M - q + 1) (M \times \dots \times M - r + 1) P r ({\hat{l}}_{k} (X) = l_{x}) P r ({\hat{l}}_{k} (Y) = l_{y})] + o (\frac{1}{M}) \\ = (\frac{f^{q} (X) f^{r} (Y)}{M^{q + r}}) \times \\ \sum_{♮ (X, Y)} [(M \times \dots \times M - (q + r - 1)) P r ({\hat{l}}_{k} (X) = l_{x}, {\hat{l}}_{k} (Y) = l_{y}) \\ - (M \times \dots \times M - (q - 1)) (M \times \dots \times M - (r - 1)) P r ({\hat{l}}_{k} (X) = l_{x}) P r ({\hat{l}}_{k} (Y) = l_{y})] + o (\frac{1}{M}) \\ = (\frac{f^{q} (X) f^{r} (Y)}{M^{q + r}}) \times \\ [(M \times \dots \times M - (q + r - 1)) - (M \times \dots \times M - (q - 1)) (M \times \dots \times M - (r - 1))] \\ = \frac{- q r f^{q} (X) f^{r} (Y)}{M} + o (\frac{1}{M}) . \end{array}

Then, the covariance between the powers of the error function is given by

\begin{array}{l} Cov ({\hat{e}}_{k}^{q} (X), {\hat{e}}_{k}^{r} (Y)) = Cov ({({\hat{f}}_{k} (X) - E [{\hat{f}}_{k} (X)])}^{q}, {({\hat{f}}_{k} (Y) - E [{\hat{f}}_{k} (Y)])}^{r}) \\ = \sum_{a = 1}^{q} \sum_{b = 1}^{r} (\begin{matrix} q \\ a \end{matrix}) (\begin{matrix} r \\ b \end{matrix}) {(- E [{\hat{f}}_{k} (X)])}^{a} {(- E [{\hat{f}}_{k} (Y)])}^{b} Cov ({\hat{f}}_{k}^{a} (X), {\hat{f}}_{k}^{b} (Y)) \\ = \sum_{a = 1}^{q} \sum_{b = 1}^{r} (\begin{matrix} q \\ a \end{matrix}) (\begin{matrix} r \\ b \end{matrix}) [{(- f (X))}^{a} {(- f (Y))}^{b} + o (1)] Cov ({\hat{f}}_{k}^{a} (X), {\hat{f}}_{k}^{b} (Y)) \\ = - f^{q} (X) f^{r} (Y) \sum_{a = 1}^{q} \sum_{b = 1}^{r} (\begin{matrix} q \\ a \end{matrix}) (\begin{matrix} r \\ b \end{matrix}) \frac{{(- 1)}^{a + b} a b}{M} + o (\frac{1}{M}) \\ = 1_{{q = 1, r = 1}} (\frac{- f (X) f (Y)}{M}) + o (\frac{1}{M}) . \end{array}

Disjoint balls

For ${X, Y} \in Ψ_{k}^{c}$ , there is no closed form expression for the covariance. However we have the following lemma by applying the Cauchy-Schwartz inequality:

Lemma 7

For a fixed pair of points ${X, Y} \in Ψ_{k}^{c}$ ,

Cov [{\hat{e}}_{k}^{q} (X), {\hat{e}}_{k}^{r} (Y)] = 1_{{q = 1, r = 1}} O (\frac{1}{k}) + o (\frac{1}{k}) .

Proof

\begin{array}{r} ∣ Cov [{\hat{e}}_{k}^{q} (X), {\hat{e}}_{k}^{r} (Y)] ∣ \leq \sqrt{V [{\hat{e}}_{k}^{2 q} (X)] V [{\hat{e}}_{k}^{2 r} (Y)]} \\ 1_{{q = 1, r = 1}} O (\frac{1}{k}) + o (\frac{1}{k}) . \end{array}

Joint expression

Lemma 8

Let γ₁(x), γ₂(x) be arbitrary functions with 1 partial derivative wrt x and sup_x |γ₁(x)| < ∞, sup_x |γ₂(x)| < ∞. Let X₁, .., X_M,X, Y denote M + 2 i.i.d realizations of the density f. Then,

Cov [γ_{1} (X) {\hat{e}}_{k}^{q} (X), γ_{2} (Y) {\hat{e}}_{k}^{q} (Y)] = 1_{{q = 1, r = 1}} c_{5} (γ_{1} (x), γ_{2} (x)) (\frac{1}{M}) + o (\frac{1}{M}),

(A.29)

Cov [γ_{1} (X) {\tilde{e}}_{k}^{q} (X), γ_{2} (Y) {\tilde{e}}_{k}^{q} (Y)] = 1_{{q = 1, r = 1}} c_{5} (γ_{1} (x), γ_{2} (x)) (\frac{1}{M}) + o (\frac{1}{M}),

(A.30)

where c₅(γ₁(x), γ₂(x)) is a functional of γ₁(x), γ₂(x) and f.

Proof

Let the indicator function 1_{Δ_k}(X, Y) denote the event $Δ_{k} : {X, Y} \in Ψ_{k}^{c}$ . Then

Cov [γ_{1} (X) {\hat{e}}_{k}^{q} (X), γ_{2} (Y) {\hat{e}}_{k}^{r} (Y)] = I + D,

where ‘I’ stands for the contribution form the intersecting balls and ‘D’ for the contribution from the dis-joint balls. I and D are given by

\begin{array}{l} I & = E [1_{Δ_{k}} (X, Y) Cov [γ_{1} (X) {\hat{e}}_{k}^{q} (X), γ_{2} (Y) {\hat{e}}_{k}^{r} (Y)]], \\ D & = E [(1 - 1_{Δ_{k}} (X, Y) Cov [γ_{1} (X) {\hat{e}}_{k}^{q} (X), γ_{2} (Y) {\hat{e}}_{k}^{r} (Y)]] . \end{array}

When 1_{Δ_k}(X, Y) ≠= 0, we have ${X, Y} \in Ψ_{k}^{c}$ . Then,

\begin{array}{l} I = E [1_{Δ_{k}} (X, Y) γ_{1} (X) γ_{2} (Y) {\hat{e}}_{k}^{q} (X) {\hat{e}}_{k}^{r} (Y)] \\ = E [1_{Δ_{k}} (X, Y) γ_{1} (X) γ_{2} (Y) E_{X, Y} [{\hat{e}}_{k}^{q} (X) {\hat{e}}_{k}^{r} (Y)]] \\ \leq E [1_{Δ_{k}} (X, Y) γ_{1} (X) γ_{2} (Y) \sqrt{E_{X} [{\hat{e}}_{k}^{2 q} (X)] E_{Y} [{\hat{e}}_{k}^{2 r} (Y)]}] \\ = E [1_{Δ_{k}} (X, Y) γ_{1} (X) γ_{2} (Y) (1_{{q = 1, r = 1}} O (\frac{1}{k}) + o (\frac{1}{k}))] \\ = \int [(1_{{q = 1, r = 1}} O (\frac{1}{k}) + o (\frac{1}{k})) (γ_{1} (X) γ_{2} (X) + o (1))] (\int Δ_{k} (x, y) d y) d x \\ = \int [(1_{{q = 1, r = 1}} O (\frac{1}{k}) + o (\frac{1}{k})) (γ_{1} (X) γ_{2} (X) + o (1))] (2^{d} \frac{k}{M}) d x \\ = 1_{{q = 1, r = 1}} c_{5, 1} (γ_{1}, γ_{2}) (\frac{1}{M}) + o (\frac{1}{M}), \end{array}

where the bound is obtained using the Cauchy-Schwarz inequality and using Eq.A.28. Also,

\begin{array}{l} D = E [(1 - 1_{Δ_{k}} (X, Y)) γ_{1} (X) γ_{2} (Y) E_{X, Y} [Cov ({\hat{e}}_{k}^{q} (X), {\hat{e}}_{k}^{r} (Y))]] \\ = 1_{{q = 1, r = 1}} c_{5, 2} (γ_{1}, γ_{2}) (\frac{1}{M}) + o (\frac{1}{M}) . \end{array}

(A.31)

This gives

Cov [γ_{1} (X) {\hat{e}}_{k}^{q} (X), γ_{2} (Y) {\hat{e}}_{k}^{q} (Y)] = 1_{{q = 1, r = 1}} c_{5} (γ_{1} (x), γ_{2} (x)) (\frac{1}{M}) + o (\frac{1}{M}) .

Again, since X ∈ Inline graphic (k) implies ẽ_k(X) = ê_k(X) and Pr(X ∈ _I(k)) = o(1),

Cov [γ_{1} (X) {\tilde{e}}_{k}^{q} (X), γ_{2} (Y) {\tilde{e}}_{k}^{q} (Y)] = 1_{{q = 1, r = 1}} c_{5} (γ_{1} (x), γ_{2} (x)) (\frac{1}{M}) + o (\frac{1}{M}) .

This concludes the proof.

B Bias and variance results

Lemma 9

Assume that U(x, y) is any arbitrary functional which satisfies

$sup_{y} ∣ U (0, y) ∣ = G_{1} < \infty$ ,
$sup_{x \in (p_{l}, p_{u}), y} ∣ U (x, y) ∣ = G_{2} / 4 < \infty$ ,
$sup_{x \in (1 / k, p_{u}), y} ∣ U (x, y) ∣ C (M) = G_{3} < \infty$ ,
$E [sup_{x \in (p_{l}, 2^{d} M / k), y} ∣ U (x, y) ∣] C (M) = G_{4} < \infty$ .

Let Z denote X_i for some fixed i ∈ {1, ..,N}. Let ζ_Z be any random variable which almost surely lies in the range (f(Z), f̂_k(Z)). Then,

E [∣ U (ζ_{Z}, Z) ∣] < \infty .

Proof

We will show that the conditional expectation Inline graphic [|U(ζ_Z, Z)| | ] < ∞. Because 0 < ε₀ < f(X) < ε_∞ < ∞ by ( .1), it immediately follows that

E [∣ U (ζ_{Z}, Z) ∣] = E [E [∣ U (ζ_{Z}, Z) ∣ ∣ X_{N}]] < \infty .

Also observe that ε₀ < f(Z) < ε_∞ and therefore p_l < f(Z) < p_u. Finally observe that the events Inline graphic (Z) and (Z) occur with probability O( (M)). Using (A.8), (A.9), (A.10), conditioned on ,

\begin{array}{l} E [∣ U (ζ_{Z}, Z) ∣] = E [1_{♮_{0} (Z)} ∣ U (ζ_{Z}, Z) ∣] + E [1_{♮_{1} (Z)} ∣ U (ζ_{Z}, Z) ∣] + E [1_{♮_{2} (Z)} ∣ U (ζ_{Z}, Z) ∣] + E [1_{♮ (Z)} ∣ U (ζ_{Z}, Z) ∣] \\ \leq (G_{1} + G_{2}) + (G_{3} + G_{2}) + (G_{4} + G_{2}) + (G_{2}) \\ = G_{1} + 4 G_{2} + G_{3} + G_{4} < \infty . \end{array}

(B.1)

Proof of Theorem 2

Proof

Using the continuity of g‴(x, y), construct the following third order Taylor series of g(f̂_k(Z), Z) around the conditional expected value f̌_k(Z) = Inline graphic [f̂_k(Z) | Z].

g ({\hat{f}}_{k} (Z), Z) = g ({\overset{ˇ}{f}}_{k} (Z), Z) + g^{'} ({\overset{ˇ}{f}}_{k} (Z), Z) {\hat{e}}_{k} (Z) + \frac{1}{2} g^{″} ({\overset{ˇ}{f}}_{k} (Z), Z) {\hat{e}}_{k}^{2} (Z) + \frac{1}{6} g^{(3)} (ζ_{Z}, Z) {\hat{e}}_{k}^{3} (Z),

where ζ_Z ∈ (f̌_k(Z), f̂_k(Z)) is defined by the mean value theorem. This gives

E [(g ({\hat{f}}_{k} (Z), Z) - g ({\overset{ˇ}{f}}_{k} (Z), Z))] = E [\frac{1}{2} g^{″} ({\overset{ˇ}{f}}_{k} (Z), Z) {\hat{e}}_{k}^{2} (Z)] + E [\frac{1}{6} g^{(3)} (ζ_{Z}, Z) {\hat{e}}_{k}^{3} (Z)]

Let $Δ (Z) = \frac{1}{6} g^{(3)} (ζ_{Z}, Z)$ . Direct application of Lemma 9 in conjunction with assumption ( Inline graphic .5) implies that [Δ²(Z)] = O(1). By Cauchy-Schwarz and applying Lemma 5 for the choice q = 6,

∣ E [Δ (Z) {\hat{e}}_{k}^{3} (Z)] ∣ \leq \sqrt{E [Δ^{2} (Z)] E [{\hat{e}}_{k}^{6} (Z)]} = o (\frac{1}{k}) .

By observing that the density estimates {f̂_k(X_i)}, i = 1, …, N are identical, we therefore have

\begin{array}{l} E [{\hat{G}}_{k}] - G (f) = E [g ({\hat{f}}_{k} (Z), Z) - g (f (Z), Z)] \\ = E [g ({\overset{ˇ}{f}}_{k} (Z), Z) - g (f (Z), Z)] + E [\frac{1}{2} g^{″} ({\overset{ˇ}{f}}_{k} (Z), Z) e_{k}^{2} (Z)] + o (1 / k) . \end{array}

By Lemma 4 and Lemma 5 for the choice q = 2, in conjunction with assumptions ( Inline graphic .3) and ( .4), this implies that

\begin{array}{l} E [{\hat{G}}_{k}] - G (f) = \sum_{i = 1}^{d} c_{1, i} (g (x, y)) {(\frac{k}{M})}^{i / d} + c_{2} (g^{″} ({\overset{ˇ}{f}}_{k} (x), x)) (\frac{1}{k}) + o (\frac{1}{k} + \frac{k}{M}) \\ = \sum_{i = 1}^{d} c_{1, i} (g (x, y)) {(\frac{k}{M})}^{i / d} + c_{2} (g^{″} (f (x), x)) (\frac{1}{k}) + o (\frac{1}{k} + \frac{k}{M}) \\ = \sum_{i = 1}^{d} c_{1, i} {(\frac{k}{M})}^{i / d} + c_{2} (\frac{1}{k}) + o (\frac{1}{k} + \frac{k}{M}), \end{array}

where the last but one step follows because, by (A.3), we know f̌_k(Z) = f(Z) + o(1). This in turn implies c₂(f²(x)g″(f̌_k(x), x)) = c₂(f²(x)g″(f(x), x)) + o(1). Finally, by assumptions ( Inline graphic .2) and ( .4), the leading constants c_1,_i and c₂ are bounded.

Note that the natural density estimate f̃_k(X) is identical to the truncated kernel density estimate f̂_k(X) on the set X ∈ Inline graphic (k). From the definition of set (k), Pr(Z ∉ = ) = O((k/M)^1/^d) = o(1).

\begin{array}{l} E [{\tilde{G}}_{k}] - G (f) = E [g ({\tilde{f}}_{k} (Z), Z) - g (f (Z), Z)] \\ = E [1_{{Z \in S_{I} (k)}} g ({\hat{f}}_{k} (Z), Z) - g (f (Z), Z)] + E [1_{{Z \in S - S_{I} (k)}} g ({\hat{f}}_{k} (Z), Z) - g (f (Z), Z)] \\ = I + I I \end{array}

(B.2)

Using the exact same method as in the Proof of Theorem 2, using (A.3) and (A.25), and the fact that Pr(Z ∉ = Inline graphic (k)) = O((k/M)^1/^d) = o(1), we have

I = c_{1, 1} (g (x, y)) {(\frac{k}{M})}^{1 / d} + c_{2} (g^{″} (f (x))) (\frac{1}{k}) + o (\frac{1}{k} + {(\frac{k}{M})}^{2 / d}),

Because we assume that g satisfies assumption ( Inline graphic .5), from the proof of Lemma 9, for Z ∈ − (k), we have [g(f̃_k(Z), Z) − g(f(Z), Z)] = O(1). This implies that,

\begin{array}{l} I I = E [1_{{Z \in S - S_{I} (k)}} g ({\hat{f}}_{k} (Z), Z) - g (f (Z), Z)] \\ = E [E [g ({\hat{f}}_{k} (Z), Z) - g (f (Z), Z)] ∣ {Z \in S - S_{I} (k)}] \times P r (Z \notin S_{I} (k)) \\ = O (1) \times O ({(k / M)}^{1 / d}) = O ({(k / M)}^{1 / d}) . \end{array}

(B.3)

This implies that

\begin{array}{l} E [{\tilde{G}}_{k}] - G (f) = I + I I \\ = c_{1} {(\frac{k}{M})}^{1 / d} + c_{2} (\frac{1}{k}) + o (\frac{1}{k} + {(\frac{k}{M})}^{1 / d}) . \end{array}

Proof of Theorem 3

Proof

By the continuity of g⁽^λ⁾(x, y), we can construct the following Taylor series of g(f̌_k(Z), Z) around the conditional expected value f̌_k(Z).

g ({\hat{f}}_{k} (Z), Z) = g ({\overset{ˇ}{f}}_{k} (Z), Z) + g^{'} ({\overset{ˇ}{f}}_{k} (Z), Z) {\hat{e}}_{k} (Z) + (\sum_{i = 2}^{λ - 1} \frac{g^{(i)} ({\overset{ˇ}{f}}_{k} (Z), Z)}{i!} {\hat{e}}_{k}^{i} (Z)) + \frac{g^{(λ)} (ξ_{Z}, Z)}{λ!} {\hat{e}}_{k}^{λ} (Z),

where ξ_Z ∈ (g( Inline graphic [f̌_k(Z)], g(f̌_k(Z))). Denote (g^λ(ξ_Z, Z))/λ! by Ψ(Z). Further define the operator (Z) = Z − [Z] and

\begin{array}{l} p_{i} = M (g ({\overset{ˇ}{f}}_{k} (X_{i}), X_{i})), \\ q_{i} = M (g^{'} ({\overset{ˇ}{f}}_{k} (X_{i}), X_{i}) {\hat{e}}_{k} (X_{i})), \\ r_{i} = M (\sum_{i = 2}^{λ} \frac{g^{(i)} ({\overset{ˇ}{f}}_{k} (X_{i}), X_{i})}{i!} {\hat{e}}_{k}^{i} (X_{i})) \\ s_{i} = M (Ψ (X_{i}) {\hat{e}}_{k}^{λ} (X_{i})) \end{array}

The variance of the estimator Ĝ_N(f̂_k) is given by

\begin{array}{l} V [{\hat{G}}_{k}] = E [{(\hat{G} (f) - E [\hat{G} (f)])}^{2}] \\ = \frac{1}{N} E [{(p_{1} + q_{1} + r_{1} + s_{1})}^{2}] + \frac{N - 1}{N} E [(p_{1} + q_{1} + r_{1} + s_{1}) (p_{2} + q_{2} + r_{2} + s_{2})] . \end{array}

Because X₁, X₂ are independent, we have Inline graphic [(p₁)(p₂ + q₂ + r₂ + s₂)] = 0. Furthermore,

E [{(p_{1} + q_{1} + r_{1} + s_{1})}^{2}] = E [{p_{1}}^{2}] + o (1) = V [g ({\overset{ˇ}{f}}_{k} (Z), Z)] + o (1) .

Applying Lemma 5 and Lemma 8, in conjunction with assumptions ( Inline graphic .3) and ( .4), it follows that

[p₁²] = [g(f̌_k(Z), Z)] = c₄(g(f̌_k(x), x))
$E [q_{1} q_{2}] = c_{5} (g^{'} ({\overset{ˇ}{f}}_{k} (x), x), g^{'} ({\overset{ˇ}{f}}_{k} (x), x)) (\frac{1}{M}) + o (\frac{1}{M})$
$E [q_{1} r_{2}] = o (\frac{1}{M})$
$E [r_{1} r_{2}] = o (\frac{1}{M})$

Since q₁ and s₂ are 0 mean random variables

\begin{array}{l} E [q_{1} s_{2}] = E [q_{1} Ψ (X_{2}) {(\hat{f} (X_{2}) - {\overset{ˇ}{f}}_{k} (X_{2}))}^{λ}] \\ = E [q_{1} Ψ (X_{2}) {\hat{e}}_{k}^{λ} (X_{2})] \\ \leq \sqrt{E [Ψ^{2} (X_{2})] E [q_{1}^{2} {\hat{e}}_{k}^{2 λ} (X_{2})]} \\ = \sqrt{E [Ψ^{2} (Z)]} (o (\frac{1}{k^{λ}})) \end{array}

Direct application of Lemma 9 in conjunction with assumptions ( Inline graphic .5) implies that [Ψ²(Z)] = O(1). Note that from assumption ( .3), $o (\frac{1}{k^{λ}}) = o (1 / M)$ . In a similar manner, it can be shown that $E [r_{1} s_{2}] = o (\frac{1}{M})$ and $E [s_{1} s_{2}] = o (\frac{1}{M})$ . This implies that

\begin{array}{l} V [{\hat{G}}_{k}] = \frac{1}{N} E [{p_{1}}^{2}] + \frac{(N - 1)}{N} E [q_{1} q_{2}] + o (\frac{1}{M} + \frac{1}{N}) \\ = c_{4} (g ({\overset{ˇ}{f}}_{k} (x), x)) (\frac{1}{N}) + c_{5} (g^{'} ({\overset{ˇ}{f}}_{k} (x), x), g^{'} ({\overset{ˇ}{f}}_{k} (x), x)) (\frac{1}{M}) + o (\frac{1}{M} + \frac{1}{N}) \\ = c_{4} (g (f (x), x)) (\frac{1}{N}) + c_{5} (g^{'} (f (x), x), g^{'} (f (x), x)) (\frac{1}{M}) + o (\frac{1}{M} + \frac{1}{N}) \\ = c_{4} (\frac{1}{N}) + c_{5} (\frac{1}{M}) + o (\frac{1}{M} + \frac{1}{N}), \end{array}

where the last but one step follows because, by (A.3), we know f̌_k(Z) = f(Z) + o(1). This in turn implies c₄(g(f̌_k(x), x)) = c₄(g(f(x), x)) + o(1) and c₅(g′(f̌_k(x), x), g′(f̌_k(x), x)) = c₅(g′(f(x), x), g′(f(x), x)) + o(1). Finally, by assumptions ( Inline graphic .2) and ( .4), the leading constants c₄ and c₅ are bounded.

Because of the identical nature of the expressions of ê_k(X) and ẽ_k(X) in Lemma 5 and Lemma 8, it immediately follows that

V [{\tilde{G}}_{k}] = c_{4} (\frac{1}{N}) + c_{5} (\frac{1}{M}) + o (\frac{1}{M} + \frac{1}{N}) .

This concludes the proof of Theorem 3.

Contributor Information

Kumar Sricharan, Email: kksreddy@umich.edu.

Dennis Wei, Email: dlwei@umich.edu.

Alfred O. Hero, III, Email: hero@umich.edu.

References

1.Beirlant J, Dudewicz EJ, Györfi L, Van der Meulen EC. Nonparametric entropy estimation: An overview. Intl Journal of Mathematical and Statistical Sciences. 1997;6:17–40. [Google Scholar]
2.Birge L, Massart P. Estimation of integral functions of a density. The Annals of Statistics. 1995;23(1):11–29. [Google Scholar]
3.Costa JA, Hero AO. Geodesic entropic graphs for dimension and entropy estimation in manifold learning. Signal Processing, IEEE Transactions on. 2004;52(8):2210–2221. [Google Scholar]
4.Fukunaga K, Hostetler LD. IEEE Transactions on Information Theory. 1973. Optimization of k-nearest-neighbor density estimates. [Google Scholar]
5.Giné E, Mason DM. Uniform in bandwidth estimation of integral functionals of the density function. Scandinavian Journal of Statistics. 2008;35:739761. [Google Scholar]
6.Gupta R. PhD thesis. University of Michigan; Ann Arbor: 2001. Quantization Strategies for Low-Power Communications. [Google Scholar]
7.Györfi L, van der Meulen EC. Density-free convergence properties of various estimators of entropy. Comput Statist Data Anal. 1987:425–436. [Google Scholar]
8.Györfi L, van der Meulen EC. An entropy estimate based on a kernel density estimation. Limit Theorems in Probability and Statistics. 1989:229–240. [Google Scholar]
9.Hero AO, Costa J, Ma B. Technical Report, Communications and Signal Processing Laboratory. The University of Michigan; Mar, 2003. Asymptotic relations between minimal graphs and alpha-entropy. [Google Scholar]
10.Lanckriet G, Cristianini N, Bartlett P, El Ghaoui L. Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research. 2002;5:2004. [Google Scholar]
11.Laurent B. Efficient estimation of integral functionals of a density. The Annals of Statistics. 1996;24(2):659–681. [Google Scholar]
12.Leonenko N, Prozanto L, Savani V. A class of Rényi information estimators for multidimensional densities. Annals of Statistics. 2008;36:2153–2182. [Google Scholar]
13.Liitiäinen E, Lendasse A, Corona F. On the statistical estimation of rényi entropies. Proceedings of IEEE/MLSP 2009 International Workshop on Machine Learning for Signal Processing; Grenoble (France). September 2–4 2009. [Google Scholar]
14.Pal D, Poczos B, Szepesvari C. Proc Advances in Neural Information Processing Systems (NIPS) MIT Press; 2010. Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. [Google Scholar]
15.Raykar VC, Duraiswami R. Fast optimal bandwidth selection for kernel density estimation. In: Ghosh J, Lambert D, Skillicorn D, Srivastava J, editors. Proceedings of the sixth SIAM International Conference on Data Mining; 2006. pp. 524–528. [Google Scholar]
16.Schapire Robert E. The strength of weak learnability. Machine Learning. 1990 Jun;5(2):197, 227–227. [Google Scholar]
17.Sricharan K, Raich R, Hero AO. Empirical estimation of entropy functionals with confidence. Dec, 2010. ArXiv e-prints. [Google Scholar]
18.Turlach B. Bandwidth selection in kernel density estimation: A review. [Google Scholar]
19.Wang Q, Kulkarni SR, Verdú S. Divergence estimation of continuous distributions based on data-dependent partitions. Information Theory, IEEE Transactions on. 2005;51(9):3064–3074. [Google Scholar]

[R1] 1.Beirlant J, Dudewicz EJ, Györfi L, Van der Meulen EC. Nonparametric entropy estimation: An overview. Intl Journal of Mathematical and Statistical Sciences. 1997;6:17–40. [Google Scholar]

[R2] 2.Birge L, Massart P. Estimation of integral functions of a density. The Annals of Statistics. 1995;23(1):11–29. [Google Scholar]

[R3] 3.Costa JA, Hero AO. Geodesic entropic graphs for dimension and entropy estimation in manifold learning. Signal Processing, IEEE Transactions on. 2004;52(8):2210–2221. [Google Scholar]

[R4] 4.Fukunaga K, Hostetler LD. IEEE Transactions on Information Theory. 1973. Optimization of k-nearest-neighbor density estimates. [Google Scholar]

[R5] 5.Giné E, Mason DM. Uniform in bandwidth estimation of integral functionals of the density function. Scandinavian Journal of Statistics. 2008;35:739761. [Google Scholar]

[R6] 6.Gupta R. PhD thesis. University of Michigan; Ann Arbor: 2001. Quantization Strategies for Low-Power Communications. [Google Scholar]

[R7] 7.Györfi L, van der Meulen EC. Density-free convergence properties of various estimators of entropy. Comput Statist Data Anal. 1987:425–436. [Google Scholar]

[R8] 8.Györfi L, van der Meulen EC. An entropy estimate based on a kernel density estimation. Limit Theorems in Probability and Statistics. 1989:229–240. [Google Scholar]

[R9] 9.Hero AO, Costa J, Ma B. Technical Report, Communications and Signal Processing Laboratory. The University of Michigan; Mar, 2003. Asymptotic relations between minimal graphs and alpha-entropy. [Google Scholar]

[R10] 10.Lanckriet G, Cristianini N, Bartlett P, El Ghaoui L. Learning the kernel matrix with semi-definite programming. Journal of Machine Learning Research. 2002;5:2004. [Google Scholar]

[R11] 11.Laurent B. Efficient estimation of integral functionals of a density. The Annals of Statistics. 1996;24(2):659–681. [Google Scholar]

[R12] 12.Leonenko N, Prozanto L, Savani V. A class of Rényi information estimators for multidimensional densities. Annals of Statistics. 2008;36:2153–2182. [Google Scholar]

[R13] 13.Liitiäinen E, Lendasse A, Corona F. On the statistical estimation of rényi entropies. Proceedings of IEEE/MLSP 2009 International Workshop on Machine Learning for Signal Processing; Grenoble (France). September 2–4 2009. [Google Scholar]

[R14] 14.Pal D, Poczos B, Szepesvari C. Proc Advances in Neural Information Processing Systems (NIPS) MIT Press; 2010. Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. [Google Scholar]

[R15] 15.Raykar VC, Duraiswami R. Fast optimal bandwidth selection for kernel density estimation. In: Ghosh J, Lambert D, Skillicorn D, Srivastava J, editors. Proceedings of the sixth SIAM International Conference on Data Mining; 2006. pp. 524–528. [Google Scholar]

[R16] 16.Schapire Robert E. The strength of weak learnability. Machine Learning. 1990 Jun;5(2):197, 227–227. [Google Scholar]

[R17] 17.Sricharan K, Raich R, Hero AO. Empirical estimation of entropy functionals with confidence. Dec, 2010. ArXiv e-prints. [Google Scholar]

[R18] 18.Turlach B. Bandwidth selection in kernel density estimation: A review. [Google Scholar]

[R19] 19.Wang Q, Kulkarni SR, Verdú S. Divergence estimation of continuous distributions based on data-dependent partitions. Information Theory, IEEE Transactions on. 2005;51(9):3064–3074. [Google Scholar]

PERMALINK

Ensemble estimators for multivariate entropy estimation

Kumar Sricharan

Dennis Wei

Alfred O Hero III

Abstract

1 Introduction

1.1 Related work

1.2 Organization

Notation

2 Ensemble estimators

Theorem 1

Proof

3 Application to estimation of functionals of a density

3.1 Plug-in estimators of entropy

3.1.1 Assumptions

3.1.2 Analysis of MSE

Theorem 2

Theorem 3

3.1.3 Optimal MSE rate

Optimal choice of k

Choice of αfrac

3.2 Weighted ensemble entropy estimator

4 Experiments

4.1 Panter-Dite factor estimation

4.1.1 Variation of MSE with sample size T

Figure 1.

4.1.2 Variation of MSE with dimension d

4.2 Distribution testing

Figure 2.

Figure 3.

Figure 4.

5 Conclusions

Acknowledgments

Appendices: Outline of appendix

A Moment properties of boundary compensated uniform kernel density estimates

A.1 Taylor series expansion of coverage

A.2 Concentration inequalities for uniform kernel density estimator

A.3 Bounds on uniform kernel density estimator

A.4 Bias

Lemma 4

Proof

A.4.1 Evaluation of I

A.4.2 Evaluation of II

Analysis of f̄m( (X)), f̌m( (X))

Analysis of f̄k(X), f̌k(X)

Analysis of II

A.5 Central Moments

Lemma 5

Proof

A.6 Cross moments

Intersecting balls

Lemma 6

Proof

Disjoint balls

Lemma 7

Proof

Joint expression

Lemma 8

Proof

B Bias and variance results

Lemma 9

Proof

Proof of Theorem 2

Proof

Proof of Theorem 3

Proof

Contributor Information

References

ACTIONS

PERMALINK

RESOURCES

Similar articles

Cited by other articles

Links to NCBI Databases

Choice of α_frac

Analysis of f̄_m( (X)), f̌_m( (X))

Analysis of f̄_k(X), f̌_k(X)