. 2022 Aug 3;24(8):1071. doi: 10.3390/e24081071

A Bayesian Motivated Two-Sample Test Based on Kernel Density Estimates

Naveed Merchant 1,*, Jeffrey D Hart 1
Editor: Udo von Toussaint
PMCID: PMC9407360  PMID: 36010735

Abstract

A new nonparametric test of equality of two densities is investigated. The test statistic is an average of log-Bayes factors, each of which is constructed from a kernel density estimate. Prior densities for the bandwidths of the kernel estimates are required, and it is shown how to choose priors so that the log-Bayes factors can be calculated exactly. Critical values of the test statistic are determined by a permutation distribution, conditional on the data. An attractive property of the methodology is that a critical value of 0 leads to a test for which both type I and II error probabilities tend to 0 as sample sizes tend to ∞. Existing results on Kullback–Leibler loss of kernel estimates are crucial to obtaining these asymptotic results, and also imply that the proposed test works best with heavy-tailed kernels. Finite sample characteristics of the test are studied via simulation, and extensions to multivariate data are straightforward, as illustrated by an application to bivariate connectionist data.

Keywords: Bayes factors, permutation tests, cross-validation, consistent tests, Kolmogorov–Smirnov test

1. Introduction

Ref. [1] proposed the use of cross-validation Bayes factors in the classic two-sample problem of comparing two distributions. Their basic idea is to randomly divide the data into two distinct parts, call them A and B, and to define two models based on kernel density estimates from part A. One model assumes that the two distributions are the same and the other allows them to be different. A Bayes factor comparing the two part A models is then defined from the part B data. In order to stabilize the Bayes factor, Ref. [1] suggest that a number of different random data splits be used, and the resulting log-Bayes factors averaged.

In the current paper we consider a special case of this approach in which the part A data consists of all the available observations save one. If the sample sizes of the two data sets are m and n, this entails that a total of m+n log-Bayes factors may be calculated. The average of these m+n quantities becomes the test statistic here considered, and is termed ALB.

Although ALB is an average of log-Bayes factors, it does not lead to a consistent Bayes test because each of the log-Bayes factors is based on just a single observation. Ref. [1] suppose that the validation set size grows to ∞, while in our case it remains of size 1. This results in the ALB converging to the Kullback–Leibler divergence of the two densities, and not to ∞ as in the case of [1]. We therefore use frequentist ideas to construct our test. The exact null distribution of ALB conditional on order statistics is obtained using permutations of the data. Doing so leads to a consistent frequentist test whose size is controlled exactly. The problem of bandwidth selection is dealt with by using leave-one-out likelihood cross-validation applied to the combination of the two data sets. This method is computationally efficient in that the resulting bandwidth is invariant to permutations of the combined data, and therefore has to be computed just once. Our methodology is easily extended to bivariate data, and we do so in a real data example.

Ref. [2] also use a permutation test based on kernel estimates for the two-sample problem, their statistic being based on an L2 distance. Ref. [3] shows how other distances and divergences compare when applying them to the general k-sample problem, restricting their comparisons to the one-dimensional case. Our method mainly differs from these procedures by virtue of its Bayesian motivation. Existing methodology that most closely resembles ours is that of [4], who use a kernel-based marginal likelihood ratio to test goodness of fit of parametric models for a distribution. Their marginal likelihood employs a prior for a bandwidth, as does ours.

2. Methodology

We assume that X=(X_1,…,X_m) are independent and identically distributed (i.i.d.) from density f, and independently Y=(Y_1,…,Y_n) are i.i.d. from density g. We are interested in the problem of testing the null hypothesis that f and g are identical on the basis of the data X and Y. Let U=(U_1,…,U_k) be an arbitrary set of k scalar observations, and define a kernel density estimate by

f^K(u|h,U) = (1/(kh)) ∑_{i=1}^{k} K((u − U_i)/h),  −∞ < u < ∞,

where K is the kernel and h>0 the bandwidth.

2.1. The Test Statistic

Let Z_i=X_i, i=1,…,m, Z_i=Y_{i−m}, i=m+1,…,m+n, Z=(Z_1,…,Z_{m+n}), and let Z_{−i} be the vector Z with all its components except Z_i, i=1,…,m+n. Furthermore, let X_{−i} be all the components of X except X_i, i=1,…,m, and Y_{−j} all the components of Y except Y_j, j=1,…,n. If we assume that f is identical to g, then potential models for f are M_{0i}={f^K(·|h,Z_{−i}) : h>0}, i=1,…,m+n. Suppose that 1 ≤ i ≤ m. If we allow that f and g are different, then a model for the datum Z_i is M_{1i}={f^K(·|a,X_{−i}) : a>0}. In this case a legitimate Bayes factor for comparing M_{0i} and M_{1i} on the basis of the datum Z_i has the form

B_i = [∫_0^∞ π(a) f^K(Z_i|a,X_{−i}) da] / [∫_0^∞ π(h) f^K(Z_i|h,Z_{−i}) dh],  i=1,…,m,

where, mainly for convenience, we have assumed that the bandwidth priors are the same in all cases. Likewise, if i=m+1,…,m+n, then M_{1i}={f^K(·|b,Y_{−(i−m)}) : b>0} is a model for the datum Z_i, and a Bayes factor for comparing M_{0i} and M_{1i} is

B_i = [∫_0^∞ π(a) f^K(Z_i|a,Y_{−(i−m)}) da] / [∫_0^∞ π(h) f^K(Z_i|h,Z_{−i}) dh],  i=m+1,…,m+n.

When m and n are large, it is expected that M_{1i} will be a good model for f if i=1,…,m and for g if i=m+1,…,m+n. Likewise, each of M_{0i} will be a good model for the common density on the assumption that f and g are identical. However, none of B_1,…,B_{m+n} will be Bayes factors that can provide convincing evidence for either hypothesis simply because each one uses likelihoods based on a single datum. At first blush one might think that a solution to this problem is to take the average of the m+n log-Bayes factors:

ALB = (1/(m+n)) ∑_{i=1}^{m+n} log B_i. (1)

However, this results in a statistic that will consistently estimate 0 or a positive constant in the respective cases f ≡ g or f ≢ g. In neither case does the statistic have the property of Bayes consistency, i.e., the property that the Bayes factor tends to 0 and ∞ when f ≡ g and f ≢ g, respectively.

The discussion immediately above points out a fundamental fact that seems not to have been widely discussed: combining a large number of inconsistent Bayes factors does not necessarily lead to a consistent Bayes factor. A guiding principle in [1] was that of averaging log-Bayes factors from different random splits of the data with the aim of producing a more stable log-Bayes factor. However, in order for this practice to yield a consistent Bayes factor, it is important that each of the log-Bayes factors being averaged is consistent. Furthermore, to ensure this consistency, it is necessary that the sizes of both the training and validation sets tend to ∞ with the sample sizes m and n. Obviously this is not the case when the size of each validation set is just 1, as in the current paper.

An advantage of the approach proposed herein is that the practitioner does not have to choose the size of the training sets. The cost is that the resulting statistic does not have the property of Bayes consistency. We thus propose that the statistic be used in frequentist fashion. An appealing way of doing so is to use a permutation test, which (save for certain practical issues to be discussed) leads to a test with exact type I error probability for all m>1 and n>1. Let Z_{(1)} < Z_{(2)} < ⋯ < Z_{(m+n)} be the order statistics for the combined sample. Let j=(j_1,…,j_{m+n}) be a random permutation of 1,…,m+n, and define T(j) to be the statistic (1) when the X-sample is taken to be Z_{j_1},…,Z_{j_m} and the Y-sample to be Z_{j_{m+1}},…,Z_{j_{m+n}}. It follows that, conditional on the order statistics Z_{(1)},…,Z_{(m+n)}, the (m+n)! values taken on by T(·) are equally likely. Therefore, if t_{m,n} is a 1−α quantile of the empirical distribution of T(·), then the test that rejects f ≡ g when T ≥ t_{m,n} will have an (unconditional) type I error probability of α. As will be shown in the Appendix A.3, ALB is negative with probability tending to 1 as m,n→∞, implying that for any α>0, t_{m,n} will be negative for m and n large enough. From an evidentiary standpoint, it is nonsense to reject H_0 for a negative value of ALB. We therefore suggest using the critical value max(0, t_{m,n}), which ensures that the test is sensible and has level no greater than α.
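The procedure just described can be sketched in a few lines of code. The snippet below is our own minimal illustration, not the authors' implementation: it assumes a Student-t(3) kernel for L (the choice made later in the paper), a crude grid search for the cross-validation bandwidth, and function names of our choosing.

```python
import numpy as np
from scipy.stats import t as t_dist

def loo_log_kde(data, b):
    """Leave-one-out log kernel density estimate at each data point,
    using a Student-t(3) kernel for L."""
    z = (data[:, None] - data[None, :]) / b
    k = t_dist.pdf(z, df=3)
    np.fill_diagonal(k, 0.0)              # leave the ith point out
    return np.log(k.sum(axis=1) / ((len(data) - 1) * b))

def alb(x, y, b):
    """Average log-Bayes factor, via Equation (3)."""
    num = loo_log_kde(x, b).sum() + loo_log_kde(y, b).sum()
    den = loo_log_kde(np.concatenate([x, y]), b).sum()
    return (num - den) / (len(x) + len(y))

def alb_perm_test(x, y, n_perm=500, seed=None):
    rng = np.random.default_rng(seed)
    z = np.concatenate([x, y])
    # CV bandwidth from the combined sample: invariant to permutations,
    # so it is computed only once (crude grid search for illustration).
    grid = np.geomspace(0.05, 2.0, 40)
    b = grid[np.argmax([loo_log_kde(z, g).sum() for g in grid])]
    obs = alb(x, y, b)
    perm_stats = np.empty(n_perm)
    for i in range(n_perm):
        p = rng.permutation(z)
        perm_stats[i] = alb(p[:len(x)], p[len(x):], b)
    return obs, (perm_stats >= obs).mean()   # statistic and p-value

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 50)
y = rng.normal(0.0, 2.0, 50)
stat, pval = alb_perm_test(x, y, n_perm=200, seed=2)
```

Because the bandwidth is selected from the combined sample, it need not be recomputed for each permutation, which is the computational saving noted above.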

2.2. The Effect of Using Scale Family Priors

Let π_0 be an arbitrary density with support (0,∞). A possible family of priors is one that contains all rescaled versions of π_0. For b>0, using the prior π(h)=π_0(h/b)/b and making the change of variable h/b=u in the denominator of B_i, we have

∫_0^∞ b^{−1} π_0(h/b) f^K(Z_i|h,Z_{−i}) dh = f^L(Z_i|b,Z_{−i}),

where the kernel L is

L(z) = ∫_0^∞ u^{−1} π_0(u) K(z/u) du,  for all z. (2)

So, by using this type of prior, each marginal likelihood comprising ALB becomes a kernel density estimate with bandwidth equal to the scale parameter of the prior. In one sense this is disappointing since it means that averaging kernel estimates with respect to a bandwidth prior does not actually sidestep the issue of choosing a smoothing parameter. One has simply traded bandwidth choice for choice of the prior’s scale. However, it turns out that there is a quantifiable advantage to using a prior for the bandwidth of K. As detailed in the Appendix A.2, likelihood cross-validation is often more efficient when applied to f^L rather than to f^K.

When using a scale family of priors, the result immediately above implies that

(m+n)ALB = ∑_{i=1}^{m} log(f^L(X_i|b,X_{−i})) + ∑_{j=1}^{n} log(f^L(Y_j|b,Y_{−j})) − ∑_{i=1}^{m+n} log(f^L(Z_i|b,Z_{−i})), (3)

and so the proposed statistic is proportional to the log of a likelihood ratio. The two likelihoods are cross-validation likelihoods, and the numerator and denominator of the ratio correspond to the hypotheses of different and equal densities, respectively.

In practice one must select both the kernel L and bandwidth b. For the moment we assume that L is given. The denominator of exp((m+n)ALB) as a function of b is the likelihood cross-validation criterion, as studied by [5], based on the combined sample. We propose using b=b̂, the maximizer of this denominator. This bandwidth has the desirable property that it is invariant to the ordering of the data in the combined sample. Let ALB* be the value of test statistic (1) for a permuted data set. One should use the principle that ALB* is the same function of the permuted data as ALB is of the original data. So, in principle the bandwidth should be selected for every permuted data set, but because of the invariance of b̂ to the ordering of the combined sample, this data-driven bandwidth equals b̂ for every permuted data set. This results in a large computational savings relative to a procedure that selects the bandwidth differently for the X- and Y-samples. Using the same bandwidth under both null and alternative hypotheses also fits with the principle espoused by [6].

Concerning L, Ref. [5] showed that kernels must be relatively heavy-tailed in order for them to perform well with respect to likelihood cross-validation. In particular, he shows that likelihood cross-validation fails miserably as a method for choosing the bandwidth of a kde based on a Gaussian kernel. The tails of the kernel must be considerably heavier than those of a Gaussian density in order for likelihood cross-validation to be effective. Proposition A1 in the Appendix A.1 shows that under very general conditions L (as defined in (2)) has heavier tails than those of K. Therefore, the Bayesian notion of averaging commonly used kernel estimates with respect to a prior brings the resulting kernel estimate more in line with the conditions of [5]. This has a substantial benefit for our statistic inasmuch as we use a likelihood cross-validation bandwidth in its construction.

Consider the following kernel proposed by [5]:

L_0(u) = [√(8πe) Φ(1)]^{−1} exp(−(1/2)[log(1+|u|)]²),  −∞ < u < ∞.

Suppose that a kde is defined using kernel L0 and its bandwidth is chosen by likelihood cross-validation. Ref. [5] shows that, in general, this cross-validation bandwidth will be asymptotically optimal in a Kullback–Leibler sense. We will therefore use L0 in all subsequent simulations. Results in the Appendix A.2 provide a kernel K and corresponding prior that produce L0.
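As a quick numerical sanity check (our own sketch, not code from the paper), the Hall kernel can be implemented directly; the constant [√(8πe) Φ(1)]^{−1} normalizes it to integrate to 1, and its tails are far heavier than Gaussian tails.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def hall_kernel(u):
    # L0(u) = [sqrt(8*pi*e) * Phi(1)]^{-1} * exp(-(1/2) * log(1+|u|)**2)
    c = 1.0 / (np.sqrt(8.0 * np.pi * np.e) * norm.cdf(1.0))
    return c * np.exp(-0.5 * np.log1p(np.abs(u)) ** 2)

half, _ = quad(hall_kernel, 0.0, np.inf)
total = 2.0 * half                      # the kernel is symmetric about 0
```

The check `total ≈ 1` confirms the normalizing constant, and comparing `hall_kernel(10)` with the standard normal density at 10 illustrates how much heavier the tails are.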

2.3. Further Properties of ALB

In the Appendix A.3 we will show that the ALB test is consistent in the frequentist sense. In other words, for any alternative the power of an ALB test of fixed level tends to 1 as m and n tend to ∞.

Interestingly, ALB has the property of being sharply bounded above. It can be rewritten as follows:

(m+n)ALB = ∑_{i=1}^{m} log(f^L(X_i|b,X_{−i}) / f^L(X_i|b,Z_{−i})) + ∑_{j=1}^{n} log(f^L(Y_j|b,Y_{−j}) / f^L(Y_j|b,Z_{−(m+j)})).

Defining p_{m,n} = (m−1)/(m+n−1),

f^L(X_i|b,Z_{−i}) = p_{m,n} f^L(X_i|b,X_{−i}) + (1 − p_{m,n}) f^L(X_i|b,Y),  i=1,…,m,

and therefore

f^L(X_i|b,X_{−i}) / f^L(X_i|b,Z_{−i}) = (1/p_{m,n}) · [p_{m,n} f^L(X_i|b,X_{−i})] / [p_{m,n} f^L(X_i|b,X_{−i}) + (1 − p_{m,n}) f^L(X_i|b,Y)] ≤ 1/p_{m,n}.

A similar bound applies for the other component of ALB, implying that

ALB ≤ −(m/(m+n)) log(p_{m,n}) − (1 − m/(m+n)) log((n−1)/(m+n−1)). (4)

Using the fact that −[x log(x) + (1−x) log(1−x)] has its maximum at x=1/2 when 0 ≤ x ≤ 1, bound (4) implies that

ALB ≤ log(2) · max{m/(m−1), n/(n−1)}.

Unless one of m and n is very small, the effective bound on ALB is log(2). This reinforces the fact that ALB does not have the property of Bayes consistency. While it is true that ALB is an average of Bayes factors, none of these Bayes factors can ever provide compelling evidence in favor of the alternative. To reiterate, this problem is overcome by employing ALB in frequentist fashion.
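These bounds are easy to verify numerically. The sketch below (our own code) evaluates the right-hand side of (4) and the simpler log(2)-type cap for a few sample sizes; for moderate, equal m and n the bound is essentially log(2) ≈ 0.693.

```python
import numpy as np

def alb_upper_bound(m, n):
    # Right-hand side of bound (4), with p_{m,n} = (m-1)/(m+n-1)
    p = (m - 1.0) / (m + n - 1.0)
    q = (n - 1.0) / (m + n - 1.0)
    return -(m / (m + n)) * np.log(p) - (n / (m + n)) * np.log(q)

def log2_cap(m, n):
    # The simpler bound log(2) * max{m/(m-1), n/(n-1)}
    return np.log(2.0) * max(m / (m - 1.0), n / (n - 1.0))
```

For example, `alb_upper_bound(500, 500)` is within 0.01 of log(2), while `alb_upper_bound(5, 5)` is noticeably larger, in line with the remark that the log(2) cap is effective unless a sample size is very small.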

While ALB can take on positive values when the null hypothesis is true, our proof of frequentist consistency shows that, under H_0, P(ALB<0)→1 as m,n→∞. This implies that if 0 is used as a critical value, then the level of the resulting test tends to 0 as m,n→∞. So, even though |ALB| does not tend to ∞, the sign of ALB provides compelling evidence for the hypotheses of interest when the sample sizes are large.

The exact conditional distribution of ALB is known under the null hypothesis, as we use a permutation test. Nonetheless, it is of some interest to have an impression of the unconditional distribution of ALB. To this end, we randomly select two normal mixture densities that differ. The number of components M in the first mixture is between 2 and 20 and chosen from a distribution such that the probability of m is proportional to m^{−1}, m=2,…,20. Given M=m, mixture weights are drawn from a Dirichlet distribution with all m parameters equal to 1/2. Given M=m and the mixture weights, variances σ_1²,…,σ_m² of the normal components are a random sample from an inverse gamma distribution with both parameters equal to 1/2. Finally, means μ_1,…,μ_m of the normal components are such that μ_1,…,μ_m given σ_1,…,σ_m are independent with μ_j|σ_j ∼ N(0, σ_j²), j=1,…,m. The second normal mixture is independently selected using exactly the same mechanism. Random selection of densities in this manner for simulation studies has been proposed and explored in [7].
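The random-density mechanism just described can be sketched as follows; this is our own code with hypothetical function names, but the distributional choices follow the text exactly.

```python
import numpy as np

def random_normal_mixture(rng):
    ms = np.arange(2, 21)
    probs = (1.0 / ms) / (1.0 / ms).sum()     # P(M = m) proportional to 1/m
    m = rng.choice(ms, p=probs)
    w = rng.dirichlet(np.full(m, 0.5))        # Dirichlet(1/2, ..., 1/2) weights
    var = 1.0 / rng.gamma(0.5, 2.0, size=m)   # inverse gamma(1/2, 1/2) variances
    mu = rng.normal(0.0, np.sqrt(var))        # mu_j | sigma_j ~ N(0, sigma_j^2)
    return w, mu, var

def sample_mixture(w, mu, var, size, rng):
    comp = rng.choice(len(w), p=w, size=size)
    return rng.normal(mu[comp], np.sqrt(var[comp]))

rng = np.random.default_rng(0)
w, mu, var = random_normal_mixture(rng)
x = sample_mixture(w, mu, var, 100, rng)
```

Note that an inverse gamma(1/2, 1/2) draw is obtained as the reciprocal of a gamma draw with shape 1/2 and scale 2 (i.e., rate 1/2).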

We draw a sample of size 100 from each of the two randomly generated densities (so that m=n=100), and then compute ALB. This procedure is replicated on the same two densities 100 times. After this, we repeat the whole procedure for nine more pairs of randomly selected densities. The results are seen in Figure 1. Save for case 3, the proportion of positive ALBs is nearly 1 in all cases.

Figure 1.

Figure 1

Distribution of ALB under various alternative hypotheses.

We repeated a similar procedure for the null hypothesis setting. The simulation was exactly the same except that in each of the ten cases, only one density was generated, and a pair of independent samples (of size 100 each) was selected from this same density. The resulting ALB distributions can be seen in Figure 2. The proportion of the cases where ALB<0 for the 10 densities were, respectively, 0.89, 0.83, 0.83, 0.84, 0.85, 0.87, 0.91, 0.84, 0.84, and 0.76. These results are consistent with the fact that P(ALB<0) tends to 1 with sample size.

Figure 2.

Figure 2

Distribution of ALB under various null hypotheses.

We feel that ALB has potential for screening variables in a binary classification problem. Since ALB is negative with high probability under H0, we feel that 0 is a nicely interpretable cutoff for variable inclusion. However, we leave this topic for future research.

3. Simulations

We perform a small simulation study to investigate the size and power of our test. To explore the effect of the number of permutations, we generate 500 pairs of data sets, with one data set being a random sample of size m=50 from a standard normal distribution, and the other a random sample of size n=50 from a normal distribution with mean 0 and standard deviation 2. For each of the 500 pairs of data sets, the 95th percentile of ALBs is approximated using a range of different numbers (N) of permutations starting at 100 and increasing by a factor of 1.5 up to 3845. Results are indicated by the boxplots in Figure 3. The percentiles are centered at approximately the same value for all N. Not surprisingly, the variability of the percentiles becomes smaller as N increases. This implies a certain amount of mismatch between percentiles at N=3845 and those at smaller N.

Figure 3.

Figure 3

Effect of number of permutations on the 95th percentile of permutation distributions.

The consequence of the mismatch just alluded to can be investigated by determining the true conditional and unconditional levels of tests based on small N. For the null case, two data sets, each of size 50, are generated from a common normal distribution. Since the distribution of ALB is invariant to location and scale in the null case, we use a standard normal without loss of generality. For each pair of data sets, the data are randomly permuted 338 times, which leads to 338 values of ALB. A second set of 3845 permutations is then performed, leading to 3845 more values of ALB. The proportion of ALBs from the second set that exceed the 95th percentile of the ALBs formed from the first set is then determined. This proportion is approximately equal to the conditional level of the test based on 338 permutations. This same procedure is used for each of 500 data sets, and the resulting distribution of approximate levels is shown in Figure 4.

Figure 4.

Figure 4

Distribution of approximate conditional levels of permutation tests under the null hypothesis. Each conditional level is the proportion of 3845 ALBs from permuted data sets that exceed the 95th percentile of ALBs formed from 338 permuted data sets. Results are based on 500 replications in each of which both distributions are standard normal.

The histogram is centered near 0.05, and 87% of the conditional levels are between 0.03 and 0.07. Furthermore, an approximation to the unconditional level is (1/500) ∑_{i=1}^{500} α̂_i = 0.053, where α̂_i is the approximate conditional level for the ith data set, i=1,…,500. Based on these results, use of only 338 permutations is arguably adequate.

The same experiment is repeated except now the two data sets are drawn from different distributions, a standard normal and a normal with mean 0 and standard deviation 2. Results from this experiment are given in Figure 5.

Figure 5.

Figure 5

Distribution of approximate conditional levels of permutation tests under an alternative hypothesis. Each conditional level is the proportion of 3845 ALBs from permuted data sets that exceed the 95th percentile of ALBs formed from 338 permuted data sets. Results are based on 500 replications in each of which one distribution is standard normal and the other is normal with mean 0 and standard deviation 2.

As in the null case, the conditional levels based on the use of 338 permutations are quite good. Eighty-eight percent of the levels are between 0.03 and 0.07, and the approximate unconditional level is 0.051.

The proportion of ALBs from permuted data sets that are larger than the ALB computed from the original data provides a p-value. The p-values obtained with our method (based on 3845 permutations) are compared to the p-values obtained with the Kolmogorov–Smirnov test and Bowman’s two-sample test. Results are summarized in Figure 6 and Figure 7. In 98% of the replications the K-S p-value was larger than the ALB p-value, and in 57% of the cases the Bowman p-value was equal to or larger than the ALB p-value. These results suggest that in this case our test has much better power than that of the Kolmogorov–Smirnov test and power at least comparable to that of Bowman’s test.

Figure 6.

Figure 6

Kolmogorov–Smirnov p-values versus ALB p-values. Results are based on 500 data sets in each of which one distribution is standard normal and the other is normal with mean 0 and standard deviation 2. The ALB p-value is less than the KS-test p-value in 98% of cases. There are only 183 p-values from the KS-test that are less than 0.05.

Figure 7.

Figure 7

Bowman p-values versus ALB p-values. Results are based on 500 data sets in each of which one distribution is standard normal and the other is normal with mean 0 and standard deviation 2. The number of p-values less than 0.05 for Bowman’s test and the ALB test are 454 and 458, respectively. The ALB p-value is less than, more than and equal to the Bowman p-value in 49%, 43% and 8% of cases, respectively.

4. A Bivariate Extension of the Two-Sample Test and Application to Connectionist Bench Data

Our method can be extended to the bivariate case by using a bivariate kernel density estimate. Assume now that X=(X_1,…,X_m) are independent and identically distributed from density f and Y=(Y_1,…,Y_n) are independent and identically distributed from g, where X_i and Y_j are each bivariate observations, i=1,…,m, j=1,…,n.

A product kernel K will be used, i.e., the bivariate kernel K is the product of two univariate kernels. For k arbitrary bivariate observations U=(U_1,…,U_k), U_i=(U_{i1},U_{i2}), i=1,…,k, and u=(u_1,u_2), the kernel estimate is defined by

f^K(u|h,U) = (1/(k h_1 h_2)) ∑_{i=1}^{k} K((u_1 − U_{i1})/h_1) K((u_2 − U_{i2})/h_2),

where −∞ < u_1 < ∞, −∞ < u_2 < ∞ and h=(h_1,h_2) is a two-vector of (positive) bandwidths.

We will use the same sort of notation as before, i.e., Z_i=X_i, i=1,…,m, Z_i=Y_{i−m}, i=m+1,…,m+n, Z=(Z_1,…,Z_{m+n}) and Z_{−i} is the object Z with all its components except Z_i, i=1,…,m+n. In this case the ith Bayes factor is defined as

B_i = [∫_0^∞ ∫_0^∞ π(h_1,h_2) f^K(Z_i|h,X_{−i}) dh_1 dh_2] / [∫_0^∞ ∫_0^∞ π(h_1,h_2) f^K(Z_i|h,Z_{−i}) dh_1 dh_2],  i=1,…,m,

and similarly for i=m+1,…,m+n. As before, the test statistic is ALB = (1/(m+n)) ∑_{i=1}^{m+n} log B_i.

This form may seem daunting, but reduces to a more familiar form if we take π(h1,h2)=π0(h1/b1)π0(h2/b2)/(b1b2). In this case, proceeding exactly as in Section 2, Bi has the form

B_i = f^L(Z_i|b,X_{−i}) / f^L(Z_i|b,Z_{−i}),  i=1,…,m,

and similarly for i=m+1,…,m+n, where b=(b_1,b_2) and L is defined by (2).

We will analyze a subset of the connectionist bench data, which consist of measurements obtained after bouncing sonar waves off of either rocks or metal cylinders. The data may be found at the UCI Machine Learning repository, Ref. [8]. There are 60 variables in the data set, with m=111 and n=97 measurements of each variable for the metal cylinders and rocks, respectively. Variable numbers (1 to 60) correspond to increasing aspect angles at which signals are bounced off of either metal or rock, and each of the 60 numbers is an amount of energy within a particular frequency band, integrated over a certain period of time. We will apply our test to see if the first two variables (corresponding to the smallest aspect angles) have a different distribution for rocks than they do for metal cylinders. In our analysis K is taken to be ϕ, the standard normal density, and π0 to be of the form (A1). In this event L is a t-density with ν degrees of freedom. We will use ν=3, leading to a fairly heavy-tailed kernel, which is desirable for reasons discussed previously.

The data for each variable are inherently between 0 and 1, and bivariate kernel estimates display boundary effects along the lines x=0 and y=0, with the largest bias near the origin. We therefore use a reflection technique to reduce bias along these two lines. Suppose one has k observations (x_1,y_1),…,(x_k,y_k) on the unit square. Each observation (x_i,y_i) is reflected to create three new observations: (−x_i,y_i), (x_i,−y_i) and (−x_i,−y_i), i=1,…,k. One then simply computes, at points in the unit square, a standard kernel density estimate from the data set of size 4k, and multiplies it by 4 to ensure integration to 1. The value of ALB is computed as described previously except that each leave-out estimate leaves out four values: the observation at which the estimate is evaluated plus its three reflected versions. In this way the kde is constructed from data that are independent of the value at which the kde is evaluated.
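The reflection device can be sketched as follows. This is our own illustration, assuming a product t(3) kernel and arbitrary bandwidths, and it omits the leave-four-out modification used when computing ALB.

```python
import numpy as np
from scipy.stats import t as t_dist

def reflect(xy):
    """Augment points on the unit square with their three reflections
    through the lines x=0 and y=0."""
    x, y = xy[:, 0], xy[:, 1]
    return np.vstack([np.column_stack([sx * x, sy * y])
                      for sx in (1.0, -1.0) for sy in (1.0, -1.0)])

def kde_reflected(points, data, h1, h2, df=3):
    """Boundary-corrected product-kernel kde evaluated at `points`;
    both arguments are (n, 2) arrays."""
    big = reflect(data)                                   # 4k pseudo-data
    kx = t_dist.pdf((points[:, 0][:, None] - big[None, :, 0]) / h1, df)
    ky = t_dist.pdf((points[:, 1][:, None] - big[None, :, 1]) / h2, df)
    dens = (kx * ky).sum(axis=1) / (len(big) * h1 * h2)
    return 4.0 * dens          # times 4 so the estimate integrates to ~1

rng = np.random.default_rng(3)
data = rng.uniform(0.0, 1.0, size=(200, 2))
g = np.linspace(0.025, 0.975, 20)
grid = np.column_stack([np.repeat(g, 20), np.tile(g, 20)])
dens = kde_reflected(grid, data, 0.15, 0.15)
```

For uniform data on the square, the estimate averages close to 1 over a grid, with some mass still escaping past the unreflected boundaries x=1 and y=1.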

Kernel density estimates for variables 1 and 2 in the form of heat maps are shown in Figure 8 and Figure 9, and contours of the estimates are given in Figure 10. The latter figure suggests that the distributions for metal cylinders and rock are different. The value of ALB turned out to be 0.013, and an approximate p-value based on 10,000 permuted data sets was 0.0076. So, there is strong evidence of a difference between the rock and metal bivariate distributions. Interestingly, the proportion of negative ALBs among the 10,000 permutations was 0.9785. A kernel density estimate based on the 10,000 values of ALB is shown in Figure 11.

Figure 8.

Figure 8

A heat map of the first two variables for the signals bounced off the metal cylinder. Variables x and y correspond to the smallest and next to smallest aspect angles, respectively.

Figure 9.

Figure 9

A heat map of the first two variables for the signals bounced off the rock object. Variables x and y are as defined in Figure 8.

Figure 10.

Figure 10

Contour plots of the first two variables of both rock and cylinder objects. The blue contours correspond to the rock measurements and red to the cylinder measurements. Variables x and y are as defined in Figure 8.

Figure 11.

Figure 11

A kernel density estimate computed using 10,000 values of ALB from permuted data sets. The value of ALB for the original data set was 0.013.

5. Conclusions and Future Work

We have proposed a new nonparametric test of the null hypothesis that two densities are equal. An attractive property of the test is that its critical values are defined by a permutation distribution, allaying essentially any concern about test validity. The fact that the statistic is an average of log-Bayes factors leads to another attractive property: a critical value of 0 leads to a test with type I error probability tending to 0 with sample size. A simulation study showed the new test to have much better power than the Kolmogorov–Smirnov test in a case where the two densities differed with respect to scale. An application to connectionist data illustrated the usefulness of our methodology for bivariate data.

Future work includes efforts to increase the speed of computing the test statistic and its permutation distribution, especially for large data sets. We are also interested in applying the new test to the problem of screening variables prior to performing binary classification. A common method of doing so is to compute a two-sample test statistic for each variable, and to then select variables whose statistics exceed some threshold. An inherent problem in this approach is objectively choosing a threshold. Results of the current paper suggest that 0 would be a natural and effective threshold for variable screening.

Appendix A

Appendix A.1. Relationship of K and L

By far the most popular choice of kernel in practice is the Gaussian kernel, K(x)=ϕ(x), −∞ < x < ∞, where ϕ is the standard normal density. For ν>0, define

π_0(u) = [2(ν/2)^{ν/2}/Γ(ν/2)] u^{−(ν+1)} exp(−ν/(2u²)),  u>0. (A1)

If one takes K to be the standard normal kernel and uses prior (A1), then the corresponding kernel L is a t-density with ν degrees of freedom. An interesting aspect of these kernels is that they have heavier tails than those of the Gaussian kernel. This is especially true for the more diffuse, or noninformative, priors, i.e., those for which ν is small. (The mean and variance of (A1) exist for ν>2. At ν=3, the two are 1.382 and 1.090, respectively, and as ν→∞ they converge to 1 and 0.)
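This Gaussian-to-t relationship is easy to verify numerically. The sketch below (our code, not the authors') evaluates the mixed kernel of Equation (2) with K = ϕ and prior (A1) at a few points and compares it with the t density, taking ν = 3 for illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, t as t_dist
from scipy.special import gamma as gamma_fn

nu = 3.0

def prior_a1(u):
    # pi0(u) = 2*(nu/2)^{nu/2}/Gamma(nu/2) * u^{-(nu+1)} * exp(-nu/(2u^2))
    c = 2.0 * (nu / 2.0) ** (nu / 2.0) / gamma_fn(nu / 2.0)
    return c * u ** (-(nu + 1.0)) * np.exp(-nu / (2.0 * u ** 2))

def mixed_kernel(z):
    # Equation (2): L(z) = integral of u^{-1} * pi0(u) * K(z/u) du, K = phi
    val, _ = quad(lambda u: prior_a1(u) * norm.pdf(z / u) / u, 0.0, np.inf)
    return val

zs = np.array([0.0, 0.5, 1.0, 2.5])
mixed = np.array([mixed_kernel(z) for z in zs])
target = t_dist.pdf(zs, df=nu)
```

The two arrays agree to numerical integration accuracy, confirming that (A1) is the scale prior turning a Gaussian kernel into a t(ν) kernel.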

The fact that the kernel L is more heavy-tailed than K in the previous example is not an isolated phenomenon, as indicated by the following proposition (which is straightforward to prove):

Proposition A1.

If π_0 has support (0,C) with 1 < C ≤ ∞ and the tails of K decay exponentially, then the tails of L are heavier than those of K in the sense that K(u)/L(u)→0 as u→∞.

In principle, many different choices of π0 and K could produce the same kernel L. Or, one might ask “given kernel K, what prior π0 would produce a specified L?” When K is Gaussian, the latter question is answered by solving an integral equation. Unfortunately, doing so, at least in a general sense, exceeds our mathematical abilities. In the case where K is uniform, though, an elegant solution exists, as seen in the next section.

Appendix A.2. When K Is Uniform

In the special case where K is uniform on the interval (−1/2, 1/2), it is easy to check that, for all u,

L(u) = ∫_{2|u|}^∞ α^{−1} π_0(α) dα. (A2)

If π_0 has support (0,∞), then L has support (−∞,∞), and hence we see again that averaging kernels with respect to a prior leads to a more heavy-tailed kernel.

Since our statistic ends up being a log-likelihood ratio based on kernel L, an interesting question is “what prior π_0 gives rise to a specified kernel L?” Taking u>0, (A2) implies that

π_0(2u) = −u L′(u). (A3)

When L is decreasing on [0,∞) it follows that π_0 is a density. (Under mild tail conditions on L, and assuming that L(0+) exists and is finite, it is easy to show using integration by parts that (A3) integrates to 1 on (0,∞).)
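As a numerical illustration (ours, with L taken to be the t(3) density as an assumption), recovering π_0 from (A3) does yield a bona fide density: it is nonnegative, since the t density is decreasing on [0,∞), and it integrates to 1.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import t as t_dist

NU = 3.0

def t_kernel_deriv(u):
    # Closed-form derivative of the t(nu) density:
    # L'(u) = L(u) * ( -(nu+1) * u / (nu + u^2) )
    return t_dist.pdf(u, NU) * (-(NU + 1.0) * u / (NU + u ** 2))

def pi0(alpha):
    # (A3) rearranged: pi0(alpha) = -(alpha/2) * L'(alpha/2)
    u = alpha / 2.0
    return -u * t_kernel_deriv(u)

total, _ = quad(pi0, 0.0, np.inf)
```

Here `total` comes out to 1, matching the integration-by-parts argument in the text.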

Suppose that a kde is defined using the Hall kernel L0 and its bandwidth is chosen by likelihood cross-validation. Ref. [5] shows that, in general, this cross-validation bandwidth will be asymptotically optimal in a Kullback–Leibler sense. In contrast, using cross-validation to choose the bandwidth of a uniform kernel kde will produce a bandwidth that diverges to ∞ as the sample size tends to ∞.

Using (A3) the prior, shown in Figure A1, that produces L0 is

π_0(2u) = L_0(u) · u log(1+u)/(1+u).

This shape for the bandwidth prior could be considered canonical inasmuch as L will be similarly shaped for kernels that are decreasing on (0,∞).

Figure A1.

Figure A1

The prior that produces the Hall kernel when K is uniform.

Appendix A.3. Consistency

Here we prove

  • R1.

    frequentist consistency of our test, and

  • R2.

    P(ALB<0) → 1 as m,n → ∞.

Our proof uses the following assumptions.

  • A1.
    Under the null and alternative hypotheses the following integrals exist and are finite:
    I_X = ∫ f(x) log f(x) dx  and  I_Y = ∫ g(y) log g(y) dy.

    When the alternative hypothesis is true, f and g are assumed to be different in the sense that the total variation distance, δ(f,g), is positive.

  • A2.

    The kernel L in ALB (expression (3)) is the Hall kernel, L0.

  • A3.

    The combined data likelihood cross-validation is maximized over an interval of the form [(m+n)^{−1+ϵ}, (m+n)^{−ϵ}], where ϵ is an arbitrarily small positive constant. The maximizer of this cross-validation is denoted b̂_{m+n}.

  • A4.

    The ratio m/(m+n) tends to ρ, 0 < ρ < 1, as m,n tend to ∞.

  • A5.

    The densities f, g and ρf(x) + (1−ρ)g(x) satisfy the conditions of [5] that are needed for the asymptotic optimality of a likelihood cross-validation bandwidth.

  • A6.
    Under the null hypothesis, let ℓ_k(b) be the Kullback–Leibler risk of a kernel density estimate based on sample size k, kernel L_0 and bandwidth b. Then ℓ_k satisfies
    ℓ_k(b) = C_V (kb)^{−1+a} + C_B b^4 + o((kb)^{−1+a} + b^4)
    for positive constants a, C_V and C_B with 0 < a < 1.

Before proceeding to the proof, remarks about assumption A6 are in order. This condition is needed only in proving R2, and represents a subset of the cases studied by [5]. It has been assumed merely to allow a more concise proof of R2, which remains true under more general conditions on ℓ_k.

The critical values of a test with fixed size α > 0 will tend to 0 as m, n tend to ∞ so long as ALB tends to 0 in probability under the null hypothesis. Therefore, the power of the test will tend to 1 if we can show that ALB tends to a positive constant under the alternative. Our proof of consistency thus boils down to showing that, as m, n tend to ∞, ALB converges in probability to 0 and a positive number under the null and alternative hypotheses, respectively.

For data U = (U_1, …, U_k), define

\[
\mathrm{CV}(b \mid U) = \frac{1}{k} \sum_{i=1}^{k} \log\!\left( \hat f_L(U_i \mid b, U_{-i}) \right), \qquad b > 0,
\]

where U_{-i} denotes the sample with U_i deleted.

The statistic ALB may then be written

\[
\mathrm{ALB} = \frac{m}{m+n}\,\mathrm{CV}(\hat b \mid X) + \frac{n}{m+n}\,\mathrm{CV}(\hat b \mid Y) - \mathrm{CV}(\hat b \mid Z),
\]

where b̂ maximizes CV(b | Z) for b ∈ [(m+n)^{-1+ε}, (m+n)^{-ε}].
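A minimal numerical sketch of the statistic follows. As before, a Cauchy kernel stands in for the Hall kernel, and the bandwidth grid is illustrative rather than the exact interval of A3.

```python
import numpy as np

def cv(data, b):
    """CV(b | U): leave-one-out log-likelihood, Cauchy kernel as heavy-tailed stand-in."""
    k = len(data)
    u = (data[:, None] - data[None, :]) / b
    w = 1.0 / (np.pi * (1.0 + u ** 2))
    np.fill_diagonal(w, 0.0)
    return np.mean(np.log(w.sum(axis=1) / ((k - 1) * b)))

def alb(x, y, grid):
    """ALB = (m/(m+n)) CV(b_hat|X) + (n/(m+n)) CV(b_hat|Y) - CV(b_hat|Z)."""
    m, n = len(x), len(y)
    z = np.concatenate([x, y])
    b_hat = max(grid, key=lambda b: cv(z, b))   # b_hat maximizes CV(b | Z)
    return m / (m + n) * cv(x, b_hat) + n / (m + n) * cv(y, b_hat) - cv(z, b_hat)

rng = np.random.default_rng(1)
x = rng.standard_normal(150)
y_null = rng.standard_normal(150)            # same density: ALB should be near 0
y_alt = rng.standard_normal(150) + 2.0       # shifted density: ALB should be positive
grid = np.geomspace(0.05, 2.0, 40)
alb_null, alb_alt = alb(x, y_null, grid), alb(x, y_alt, grid)
```

In line with R1 and R2, alb_null is close to 0 (and negative for large samples) while alb_alt settles near a positive constant.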

Now suppose that U is a random sample from density d, ℓ_k(b) is the expectation of the Kullback–Leibler loss of f̂_L(· | b, U), and define

\[
Q(k) = \frac{1}{k} \sum_{i=1}^{k} \log d(U_i) - \int d(x) \log d(x)\,dx,
\]

where ∫ d(x) log d(x) dx exists and is finite. Then if d satisfies the conditions of [5] and k → ∞,

\[
\mathrm{CV}(b \mid U) = \int d(x) \log d(x)\,dx - \ell_k(b) + Q(k) + o_p\!\left(\ell_k(b)\right) \tag{A4}
\]

uniformly in b ∈ [k^{-1+ε}, k^{-ε}], where ε is arbitrarily small. By the strong law of large numbers, Q(k) converges to 0 in probability. Furthermore, max_{b ∈ [k^{-1+ε}, k^{-ε}]} ℓ_k(b) tends to 0 as k → ∞. If the maximizer b̃ of CV(b | U) is in [k^{-1+ε}, k^{-ε}], it therefore follows that CV(b̃ | U) converges in probability to ∫ d(x) log d(x) dx as k → ∞.
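This limit can be checked numerically. In the sketch below (a Cauchy kernel stands in for the heavy-tailed kernel; the sample size and grid are illustrative), the maximized cross-validation score for standard normal data should approach ∫ φ(x) log φ(x) dx = −½ log(2πe) ≈ −1.419.

```python
import numpy as np

def cv(data, b):
    # CV(b | U): leave-one-out log-likelihood, Cauchy kernel as heavy-tailed stand-in
    k = len(data)
    u = (data[:, None] - data[None, :]) / b
    w = 1.0 / (np.pi * (1.0 + u ** 2))
    np.fill_diagonal(w, 0.0)
    return np.mean(np.log(w.sum(axis=1) / ((k - 1) * b)))

rng = np.random.default_rng(2)
sample = rng.standard_normal(2000)
grid = np.geomspace(0.02, 1.0, 50)        # illustrative version of [k^(-1+eps), k^(-eps)]
cv_max = max(cv(sample, b) for b in grid)
target = -0.5 * np.log(2 * np.pi * np.e)  # integral of d log d for d = N(0, 1)
```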

In the null case, (A4) implies that

\[
\frac{m}{m+n}\,\mathrm{CV}(b \mid X) + \frac{n}{m+n}\,\mathrm{CV}(b \mid Y) - \mathrm{CV}(b \mid Z)
= -\frac{m}{m+n}\,\ell_m(b) - \frac{n}{m+n}\,\ell_n(b) + \ell_{m+n}(b) + o_p\!\left(\ell_m(b)\right), \tag{A5}
\]

uniformly in b ∈ [(m+n)^{-1+ε}, (m+n)^{-ε}], where we have used all of A1–A5. Since b̂_{m+n} ∈ [(m+n)^{-1+ε}, (m+n)^{-ε}], (A5) implies that ALB converges to 0 in probability as m, n → ∞, which proves one part of R1.

To prove R2, we first observe that the bias component of ℓ_k(b) is free of sample size, and hence the first-order term of (A5) is free of bias components. Along with A3 and A6, this implies that

\[
\frac{m}{m+n}\,\mathrm{CV}(b \mid X) + \frac{n}{m+n}\,\mathrm{CV}(b \mid Y) - \mathrm{CV}(b \mid Z)
= -C_V \big((m+n)b\big)^{-1+a}\big(\rho^{a} + (1-\rho)^{a} - 1\big) + o_p\!\Big(\big((m+n)b\big)^{-1+a} + b^4\Big), \tag{A6}
\]

uniformly in b ∈ [(m+n)^{-1+ε}, (m+n)^{-ε}]. By A5, b̂_{m+n} is asymptotic in probability to b_{m+n}, the minimizer of the Kullback–Leibler risk ℓ_{m+n}. Along with (A6), this implies that

\[
\mathrm{ALB} = -C_V \big((m+n)b_{m+n}\big)^{-1+a}\big(\rho^{a} + (1-\rho)^{a} - 1\big) + o_p\!\Big(\big((m+n)b_{m+n}\big)^{-1+a} + b_{m+n}^4\Big).
\]

By A6, we have

\[
b_{m+n} \sim C_0 (m+n)^{-(1-a)/(5-a)},
\]

where

\[
C_0 = \left( \frac{C_V (1-a)}{4 C_B} \right)^{1/(5-a)}.
\]
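For completeness, this rate follows by minimizing the two leading terms of the A6 expansion of the risk (writing k for m + n):

```latex
% Minimize the leading terms of the Kullback--Leibler risk in A6:
%   \ell_k(b) \approx C_V (kb)^{-1+a} + C_B b^4 .
% Setting the derivative with respect to b equal to zero,
\frac{d}{db}\left\{ C_V (kb)^{-1+a} + C_B b^4 \right\}
  = -(1-a)\, C_V\, k^{-1+a}\, b^{-2+a} + 4 C_B b^3 = 0 ,
% so that
b^{\,5-a} = \frac{C_V (1-a)}{4 C_B}\, k^{-(1-a)} ,
\qquad
b_k = \left( \frac{C_V (1-a)}{4 C_B} \right)^{1/(5-a)} k^{-(1-a)/(5-a)}
    = C_0\, k^{-(1-a)/(5-a)} .
```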

Combining the previous results yields

\[
\mathrm{ALB} = -C_V C_0^{-1+a}\big(\rho^{a} + (1-\rho)^{a} - 1\big)(m+n)^{-4(1-a)/(5-a)} + o_p\!\left((m+n)^{-4(1-a)/(5-a)}\right).
\]

Using the fact that ρ^a + (1−ρ)^a − 1 > 0, it now follows that P(ALB < 0) → 1 as m, n → ∞.

Turning to the alternative case, we apply (A4) to conclude that CV(b̂ | X), CV(b̂ | Y) and CV(b̂ | Z) are consistent for ∫ f(x) log f(x) dx, ∫ g(x) log g(x) dx, and ∫ f_ρ(x) log f_ρ(x) dx, respectively, where

\[
f_\rho(x) = \rho f(x) + (1-\rho) g(x).
\]

It follows that ALB is consistent for Δ = ρ KL(f, f_ρ) + (1−ρ) KL(g, f_ρ), where KL(f_1, f_2) denotes the Kullback–Leibler divergence between f_1 and f_2. By the Csiszár–Kemperman–Kullback–Pinsker inequality,

\[
\Delta \ge \frac{\log e}{2}\left[\rho\,\delta(f, f_\rho)^2 + (1-\rho)\,\delta(g, f_\rho)^2\right]
= \frac{\log e}{2}\left[\rho(1-\rho)^2\,\delta(f, g)^2 + (1-\rho)\rho^2\,\delta(f, g)^2\right]
= \frac{\log e}{2}\,\rho(1-\rho)\,\delta(f, g)^2 > 0,
\]

with the last inequality following by assumption. This completes the proof of R1.

Author Contributions

Conceptualization, N.M. and J.D.H.; Methodology, N.M. and J.D.H.; Software, N.M. and J.D.H.; Investigation, N.M. and J.D.H.; Writing, N.M. and J.D.H. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Funding Statement

The APC was funded by the Department of Statistics, Texas A&M University.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Merchant N., Hart J., Choi T. Use of cross-validation Bayes factors to test equality of two densities. arXiv 2020, arXiv:2003.06368. [Google Scholar]
  • 2. Bowman A.W., Azzalini A. Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. Volume 18. OUP Oxford; New York, NY, USA: 1997. [Google Scholar]
  • 3.Baranzano R. Ph.D. Thesis. Uppsala University; Uppsala, Sweden: 2011. Non-Parametric Kernel Density Estimation-Based Permutation Test: Implementation and Comparisons. [Google Scholar]
  • 4.Hart J.D., Choi T., Yi S. Frequentist nonparametric goodness-of-fit tests via marginal likelihood ratios. Comput. Stat. Data Anal. 2016;96:120–132. doi: 10.1016/j.csda.2015.10.013. [DOI] [Google Scholar]
  • 5.Hall P. On Kullback-Leibler loss and density estimation. Ann. Stat. 1987;15:1491–1519. doi: 10.1214/aos/1176350606. [DOI] [Google Scholar]
  • 6.Young S.G., Bowman A.W. Non-parametric analysis of covariance. Biometrics. 1995;51:920–931. doi: 10.2307/2532993. [DOI] [Google Scholar]
  • 7.Hart J.D. Use of BayesSim and smoothing to enhance simulation studies. Open J. Stat. 2017;7:153–172. doi: 10.4236/ojs.2017.71012. [DOI] [Google Scholar]
  • 8.Dua D., Graff C. UCI Machine Learning Repository. School of Information and Computer Sciences, University of California, Irvine. 2017. [(accessed on 15 March 2022)]. Available online: http://archive.ics.uci.edu/ml.
