. 2022 Aug 3;24(8):1071. doi: 10.3390/e24081071

A Bayesian Motivated Two-Sample Test Based on Kernel Density Estimates

Naveed Merchant 1,*, Jeffrey D Hart 1
Editor: Udo von Toussaint
PMCID: PMC9407360  PMID: 36010735

Abstract

A new nonparametric test of equality of two densities is investigated. The test statistic is an average of log-Bayes factors, each of which is constructed from a kernel density estimate. Prior densities for the bandwidths of the kernel estimates are required, and it is shown how to choose priors so that the log-Bayes factors can be calculated exactly. Critical values of the test statistic are determined by a permutation distribution, conditional on the data. An attractive property of the methodology is that a critical value of 0 leads to a test for which both type I and II error probabilities tend to 0 as sample sizes tend to ∞. Existing results on Kullback–Leibler loss of kernel estimates are crucial to obtaining these asymptotic results, and also imply that the proposed test works best with heavy-tailed kernels. Finite sample characteristics of the test are studied via simulation, and extensions to multivariate data are straightforward, as illustrated by an application to bivariate connectionist data.

Keywords: Bayes factors, permutation tests, cross-validation, consistent tests, Kolmogorov–Smirnov test

1. Introduction

Ref. [1] proposed the use of cross-validation Bayes factors in the classic two-sample problem of comparing two distributions. Their basic idea is to randomly divide the data into two distinct parts, call them A and B, and to define two models based on kernel density estimates from part A. One model assumes that the two distributions are the same and the other allows them to be different. A Bayes factor comparing the two part A models is then defined from the part B data. In order to stabilize the Bayes factor, Ref. [1] suggest that a number of different random data splits be used, and the resulting log-Bayes factors averaged.

In the current paper we consider a special case of this approach in which the part A data consists of all the available observations save one. If the sample sizes of the two data sets are m and n, this entails that a total of m+n log-Bayes factors may be calculated. The average of these m+n quantities becomes the test statistic here considered, and is termed ALB.

Although ALB is an average of log-Bayes factors, it does not lead to a consistent Bayes test because each of the log-Bayes factors is based on just a single observation. Ref. [1] suppose that the validation set size grows to ∞, while in our case it remains of size 1. This results in the ALB converging to the Kullback–Leibler divergence of the two densities, and not to ∞ as in the case of [1]. We therefore use frequentist ideas to construct our test. The exact null distribution of ALB conditional on order statistics is obtained using permutations of the data. Doing so leads to a consistent frequentist test whose size is controlled exactly. The problem of bandwidth selection is dealt with by using leave-one-out likelihood cross-validation applied to the combination of the two data sets. This method is computationally efficient in that the resulting bandwidth is invariant to permutations of the combined data, and therefore has to be computed just once. Our methodology is easily extended to bivariate data, and we do so in a real data example.

Ref. [2] also use a permutation test based on kernel estimates for the two-sample problem, their statistic being based on an L2 distance. Ref. [3] shows how other distances and divergences compare when applying them to the general k-sample problem, restricting their comparisons to the one-dimensional case. Our method mainly differs from these procedures by virtue of its Bayesian motivation. Existing methodology that most closely resembles ours is that of [4], who use a kernel-based marginal likelihood ratio to test goodness of fit of parametric models for a distribution. Their marginal likelihood employs a prior for a bandwidth, as does ours.

2. Methodology

We assume that X=(X_1,…,X_m) are independent and identically distributed (i.i.d.) from density f, and independently Y=(Y_1,…,Y_n) are i.i.d. from density g. We are interested in the problem of testing the null hypothesis that f and g are identical on the basis of the data X and Y. Let U=(U_1,…,U_k) be an arbitrary set of k scalar observations, and define a kernel density estimate by

f^K(u|h,U) = (1/(kh)) ∑_{i=1}^{k} K((u − U_i)/h),  −∞ < u < ∞,

where K is the kernel and h>0 the bandwidth.

2.1. The Test Statistic

Let Z_i=X_i, i=1,…,m, Z_i=Y_{i−m}, i=m+1,…,m+n, Z=(Z_1,…,Z_{m+n}), and let Z_{−i} be the vector Z with all its components except Z_i, i=1,…,m+n. Furthermore, let X_{−i} be all the components of X except X_i, i=1,…,m, and Y_{−j} all the components of Y except Y_j, j=1,…,n. If we assume that f is identical to g, then potential models for f are M_{0i}={f^K(·|h,Z_{−i}) : h>0}, i=1,…,m+n. Suppose that 1 ≤ i ≤ m. If we allow that f and g are different, then a model for the datum Z_i is M_{1i}={f^K(·|a,X_{−i}) : a>0}. In this case a legitimate Bayes factor for comparing M_{0i} and M_{1i} on the basis of the datum Z_i has the form

B_i = [∫_0^∞ π(a) f^K(Z_i|a,X_{−i}) da] / [∫_0^∞ π(h) f^K(Z_i|h,Z_{−i}) dh],  i=1,…,m,

where, mainly for convenience, we have assumed that the bandwidth priors are the same in all cases. Likewise, if i=m+1,…,m+n, then M_{1i}={f^K(·|b,Y_{−(i−m)}) : b>0} is a model for the datum Z_i, and a Bayes factor for comparing M_{0i} and M_{1i} is

B_i = [∫_0^∞ π(a) f^K(Z_i|a,Y_{−(i−m)}) da] / [∫_0^∞ π(h) f^K(Z_i|h,Z_{−i}) dh],  i=m+1,…,m+n.

When m and n are large, it is expected that M_{1i} will be a good model for f if i=1,…,m and for g if i=m+1,…,m+n. Likewise, each of M_{0i} will be a good model for the common density on the assumption that f and g are identical. However, none of B_1,…,B_{m+n} will be Bayes factors that can provide convincing evidence for either hypothesis simply because each one uses likelihoods based on a single datum. At first blush one might think that a solution to this problem is to take the average of the m+n log-Bayes factors:

ALB = (1/(m+n)) ∑_{i=1}^{m+n} log B_i. (1)

However, this results in a statistic that will consistently estimate 0 or a positive constant in the respective cases f ≡ g or f ≢ g. In neither case does the statistic have the property of Bayes consistency, i.e., the property that the Bayes factor tends to 0 and ∞ when f ≡ g and f ≢ g, respectively.

The discussion immediately above points out a fundamental fact that seems not to have been widely discussed: combining a large number of inconsistent Bayes factors does not necessarily lead to a consistent Bayes factor. A guiding principle in [1] was that of averaging log-Bayes factors from different random splits of the data with the aim of producing a more stable log-Bayes factor. However, in order for this practice to yield a consistent Bayes factor, it is important that each of the log-Bayes factors being averaged is consistent. Furthermore, to ensure this consistency, it is necessary that the sizes of both the training and validation sets tend to ∞ with the sample sizes m and n. Obviously this is not the case when the size of each validation set is just 1, as in the current paper.

An advantage of the approach proposed herein is that the practitioner does not have to choose the size of the training sets. The cost is that the resulting statistic does not have the property of Bayes consistency. We thus propose that the statistic be used in frequentist fashion. An appealing way of doing so is to use a permutation test, which (save for certain practical issues to be discussed) leads to a test with exact type I error probability for all m>1 and n>1. Let Z_{(1)} < Z_{(2)} < ⋯ < Z_{(m+n)} be the order statistics for the combined sample. Let j=(j_1,…,j_{m+n}) be a random permutation of 1,…,m+n, and define T(j) to be the statistic (1) when the X-sample is taken to be Z_{j_1},…,Z_{j_m} and the Y-sample to be Z_{j_{m+1}},…,Z_{j_{m+n}}. It follows that, conditional on the order statistics Z_{(1)},…,Z_{(m+n)}, the (m+n)! values taken on by T(·) are equally likely. Therefore, if t_{m,n} is a 1−α quantile of the empirical distribution of T(·), then the test that rejects f ≡ g when T ≥ t_{m,n} will have an (unconditional) type I error probability of α. As will be shown in the Appendix A.3, ALB is negative with probability tending to 1 as m,n→∞, implying that for any α>0, t_{m,n} will be negative for m and n large enough. From an evidentiary standpoint, it is nonsense to reject H_0 for a negative value of ALB. We therefore suggest using the critical value max(0, t_{m,n}), which ensures that the test is sensible and has level no greater than α.
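The procedure just described can be sketched in a few lines of code. The snippet below is our own minimal illustration, not the authors' implementation: it assumes a Student-t(3) kernel for L (the choice made later in the paper), a crude grid search for the cross-validation bandwidth, and function names of our choosing.

```python
import numpy as np
from scipy.stats import t as t_dist

def loo_log_kde(data, b):
    """Leave-one-out log kernel density estimate at each data point,
    using a Student-t(3) kernel for L."""
    z = (data[:, None] - data[None, :]) / b
    k = t_dist.pdf(z, df=3)
    np.fill_diagonal(k, 0.0)              # leave the ith point out
    return np.log(k.sum(axis=1) / ((len(data) - 1) * b))

def alb(x, y, b):
    """Average log-Bayes factor, via Equation (3)."""
    num = loo_log_kde(x, b).sum() + loo_log_kde(y, b).sum()
    den = loo_log_kde(np.concatenate([x, y]), b).sum()
    return (num - den) / (len(x) + len(y))

def alb_perm_test(x, y, n_perm=500, seed=None):
    rng = np.random.default_rng(seed)
    z = np.concatenate([x, y])
    # CV bandwidth from the combined sample: invariant to permutations,
    # so it is computed only once (crude grid search for illustration).
    grid = np.geomspace(0.05, 2.0, 40)
    b = grid[np.argmax([loo_log_kde(z, g).sum() for g in grid])]
    obs = alb(x, y, b)
    perm_stats = np.empty(n_perm)
    for i in range(n_perm):
        p = rng.permutation(z)
        perm_stats[i] = alb(p[:len(x)], p[len(x):], b)
    return obs, (perm_stats >= obs).mean()   # statistic and p-value

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, 50)
y = rng.normal(0.0, 2.0, 50)
stat, pval = alb_perm_test(x, y, n_perm=200, seed=2)
```

Because the bandwidth is selected from the combined sample, it need not be recomputed for each permutation, which is the computational saving noted above.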

2.2. The Effect of Using Scale Family Priors

Let π_0 be an arbitrary density with support (0,∞). A possible family of priors is one that contains all rescaled versions of π_0. For b>0, using the prior π(h)=π_0(h/b)/b and making the change of variable h/b=u in the denominator of B_i, we have

∫_0^∞ b^{−1} π_0(h/b) f^K(Z_i|h,Z_{−i}) dh = f^L(Z_i|b,Z_{−i}),

where the kernel L is

L(z) = ∫_0^∞ u^{−1} π_0(u) K(z/u) du,  for all z. (2)

So, by using this type of prior, each marginal likelihood comprising ALB becomes a kernel density estimate with bandwidth equal to the scale parameter of the prior. In one sense this is disappointing since it means that averaging kernel estimates with respect to a bandwidth prior does not actually sidestep the issue of choosing a smoothing parameter. One has simply traded bandwidth choice for choice of the prior’s scale. However, it turns out that there is a quantifiable advantage to using a prior for the bandwidth of K. As detailed in the Appendix A.2, likelihood cross-validation is often more efficient when applied to f^L rather than to f^K.

When using a scale family of priors, the result immediately above implies that

(m+n)ALB = ∑_{i=1}^{m} log(f^L(X_i|b,X_{−i})) + ∑_{j=1}^{n} log(f^L(Y_j|b,Y_{−j})) − ∑_{i=1}^{m+n} log(f^L(Z_i|b,Z_{−i})), (3)

and so the proposed statistic is proportional to the log of a likelihood ratio. The two likelihoods are cross-validation likelihoods, and the numerator and denominator of the ratio correspond to the hypotheses of different and equal densities, respectively.

In practice one must select both the kernel L and bandwidth b. For the moment we assume that L is given. The denominator of exp((m+n)ALB) as a function of b is the likelihood cross-validation criterion, as studied by [5], based on the combined sample. We propose using b=b̂, the maximizer of this denominator. This bandwidth has the desirable property that it is invariant to the ordering of the data in the combined sample. Let ALB* be the value of test statistic (1) for a permuted data set. One should use the principle that ALB* is the same function of the permuted data as ALB is of the original data. So, in principle the bandwidth should be selected for every permuted data set, but because of the invariance of b̂ to the ordering of the combined sample, this data-driven bandwidth equals b̂ for every permuted data set. This results in a large computational savings relative to a procedure that selects the bandwidth differently for the X- and Y-samples. Using the same bandwidth under both null and alternative hypotheses also fits with the principle espoused by [6].

Concerning L, Ref. [5] showed that kernels must be relatively heavy-tailed in order for them to perform well with respect to likelihood cross-validation. In particular, he shows that likelihood cross-validation fails miserably as a method for choosing the bandwidth of a kde based on a Gaussian kernel. The tails of the kernel must be considerably heavier than those of a Gaussian density in order for likelihood cross-validation to be effective. Proposition A1 in the Appendix A.1 shows that under very general conditions L (as defined in (2)) has heavier tails than those of K. Therefore, the Bayesian notion of averaging commonly used kernel estimates with respect to a prior brings the resulting kernel estimate more in line with the conditions of [5]. This has a substantial benefit for our statistic inasmuch as we use a likelihood cross-validation bandwidth in its construction.

Consider the following kernel proposed by [5]:

L_0(u) = [√(8πe) Φ(1)]^{−1} exp(−(1/2)[log(1+|u|)]²),  −∞ < u < ∞.

Suppose that a kde is defined using kernel L0 and its bandwidth is chosen by likelihood cross-validation. Ref. [5] shows that, in general, this cross-validation bandwidth will be asymptotically optimal in a Kullback–Leibler sense. We will therefore use L0 in all subsequent simulations. Results in the Appendix A.2 provide a kernel K and corresponding prior that produce L0.
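As a quick numerical sanity check (our own sketch, not code from the paper), the Hall kernel can be implemented directly; the constant [√(8πe) Φ(1)]^{−1} normalizes it to integrate to 1, and its tails are far heavier than Gaussian tails.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

def hall_kernel(u):
    # L0(u) = [sqrt(8*pi*e) * Phi(1)]^{-1} * exp(-(1/2) * log(1+|u|)**2)
    c = 1.0 / (np.sqrt(8.0 * np.pi * np.e) * norm.cdf(1.0))
    return c * np.exp(-0.5 * np.log1p(np.abs(u)) ** 2)

half, _ = quad(hall_kernel, 0.0, np.inf)
total = 2.0 * half                      # the kernel is symmetric about 0
```

The check `total ≈ 1` confirms the normalizing constant, and comparing `hall_kernel(10)` with the standard normal density at 10 illustrates how much heavier the tails are.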

2.3. Further Properties of ALB

In the Appendix A.3 we will show that the ALB test is consistent in the frequentist sense. In other words, for any alternative the power of an ALB test of fixed level tends to 1 as m and n tend to ∞.

Interestingly, ALB has the property of being sharply bounded above. It can be rewritten as follows:

(m+n)ALB = ∑_{i=1}^{m} log(f^L(X_i|b,X_{−i}) / f^L(X_i|b,Z_{−i})) + ∑_{j=1}^{n} log(f^L(Y_j|b,Y_{−j}) / f^L(Y_j|b,Z_{−(m+j)})).

Defining p_{m,n} = (m−1)/(m+n−1),

f^L(X_i|b,Z_{−i}) = p_{m,n} f^L(X_i|b,X_{−i}) + (1 − p_{m,n}) f^L(X_i|b,Y),  i=1,…,m,

and therefore

f^L(X_i|b,X_{−i}) / f^L(X_i|b,Z_{−i}) = (1/p_{m,n}) · [p_{m,n} f^L(X_i|b,X_{−i})] / [p_{m,n} f^L(X_i|b,X_{−i}) + (1 − p_{m,n}) f^L(X_i|b,Y)] ≤ 1/p_{m,n}.

A similar bound applies for the other component of ALB, implying that

ALB ≤ −(m/(m+n)) log(p_{m,n}) − (1 − m/(m+n)) log((n−1)/(m+n−1)). (4)

Using the fact that −[x log(x) + (1−x) log(1−x)] has its maximum at x=1/2 when 0 ≤ x ≤ 1, bound (4) implies that

ALB ≤ log(2) · max{m/(m−1), n/(n−1)}.

Unless one of m and n is very small, the effective bound on ALB is log(2). This reinforces the fact that ALB does not have the property of Bayes consistency. While it is true that ALB is an average of Bayes factors, none of these Bayes factors can ever provide compelling evidence in favor of the alternative. To reiterate, this problem is overcome by employing ALB in frequentist fashion.
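These bounds are easy to verify numerically. The sketch below (our own code) evaluates the right-hand side of (4) and the simpler log(2)-type cap for a few sample sizes; for moderate, equal m and n the bound is essentially log(2) ≈ 0.693.

```python
import numpy as np

def alb_upper_bound(m, n):
    # Right-hand side of bound (4), with p_{m,n} = (m-1)/(m+n-1)
    p = (m - 1.0) / (m + n - 1.0)
    q = (n - 1.0) / (m + n - 1.0)
    return -(m / (m + n)) * np.log(p) - (n / (m + n)) * np.log(q)

def log2_cap(m, n):
    # The simpler bound log(2) * max{m/(m-1), n/(n-1)}
    return np.log(2.0) * max(m / (m - 1.0), n / (n - 1.0))
```

For example, `alb_upper_bound(500, 500)` is within 0.01 of log(2), while `alb_upper_bound(5, 5)` is noticeably larger, in line with the remark that the log(2) cap is effective unless a sample size is very small.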

While ALB can take on positive values when the null hypothesis is true, our proof of frequentist consistency shows that, under H_0, P(ALB<0)→1 as m,n→∞. This implies that if 0 is used as a critical value, then the level of the resulting test tends to 0 as m,n→∞. So, even though |ALB| does not tend to ∞, the sign of ALB provides compelling evidence for the hypotheses of interest when the sample sizes are large.

The exact conditional distribution of ALB is known under the null hypothesis, as we use a permutation test. Nonetheless, it is of some interest to have an impression of the unconditional distribution of ALB. To this end, we randomly select two normal mixture densities that differ. The number of components M in the first mixture is between 2 and 20 and chosen from a distribution such that the probability of m is proportional to m^{−1}, m=2,…,20. Given M=m, mixture weights are drawn from a Dirichlet distribution with all m parameters equal to 1/2. Given M=m and the mixture weights, variances σ_1²,…,σ_m² of the normal components are a random sample from an inverse gamma distribution with both parameters equal to 1/2. Finally, means μ_1,…,μ_m of the normal components are such that μ_1,…,μ_m given σ_1,…,σ_m are independent with μ_j|σ_j ∼ N(0, σ_j²), j=1,…,m. The second normal mixture is independently selected using exactly the same mechanism. Random selection of densities in this manner for simulation studies has been proposed and explored in [7].
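The random-density mechanism just described can be sketched as follows; this is our own code with hypothetical function names, but the distributional choices follow the text exactly.

```python
import numpy as np

def random_normal_mixture(rng):
    ms = np.arange(2, 21)
    probs = (1.0 / ms) / (1.0 / ms).sum()     # P(M = m) proportional to 1/m
    m = rng.choice(ms, p=probs)
    w = rng.dirichlet(np.full(m, 0.5))        # Dirichlet(1/2, ..., 1/2) weights
    var = 1.0 / rng.gamma(0.5, 2.0, size=m)   # inverse gamma(1/2, 1/2) variances
    mu = rng.normal(0.0, np.sqrt(var))        # mu_j | sigma_j ~ N(0, sigma_j^2)
    return w, mu, var

def sample_mixture(w, mu, var, size, rng):
    comp = rng.choice(len(w), p=w, size=size)
    return rng.normal(mu[comp], np.sqrt(var[comp]))

rng = np.random.default_rng(0)
w, mu, var = random_normal_mixture(rng)
x = sample_mixture(w, mu, var, 100, rng)
```

Note that an inverse gamma(1/2, 1/2) draw is obtained as the reciprocal of a gamma draw with shape 1/2 and scale 2 (i.e., rate 1/2).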

We draw a sample of size 100 from each of the two randomly generated densities (so that m=n=100), and then compute ALB. This procedure is replicated on the same two densities 100 times. After this, we repeat the whole procedure for nine more pairs of randomly selected densities. The results are seen in Figure 1. Save for case 3, the proportion of positive ALBs is nearly 1 in all cases.

Figure 1.

Figure 1

Distribution of ALB under various alternative hypotheses.

We repeated a similar procedure for the null hypothesis setting. The simulation was exactly the same except that in each of the ten cases, only one density was generated, and a pair of independent samples (of size 100 each) was selected from this same density. The resulting ALB distributions can be seen in Figure 2. The proportion of the cases where ALB<0 for the 10 densities were, respectively, 0.89, 0.83, 0.83, 0.84, 0.85, 0.87, 0.91, 0.84, 0.84, and 0.76. These results are consistent with the fact that P(ALB<0) tends to 1 with sample size.

Figure 2.

Figure 2

Distribution of ALB under various null hypotheses.

We feel that ALB has potential for screening variables in a binary classification problem. Since ALB is negative with high probability under H0, we feel that 0 is a nicely interpretable cutoff for variable inclusion. However, we leave this topic for future research.

3. Simulations

We perform a small simulation study to investigate the size and power of our test. To explore the effect of the number of permutations, we generate 500 pairs of data sets, with one data set being a random sample of size m=50 from a standard normal distribution, and the other a random sample of size n=50 from a normal distribution with mean 0 and standard deviation 2. For each of the 500 pairs of data sets, the 95th percentile of ALBs is approximated using a range of different numbers (N) of permutations starting at 100 and increasing by a factor of 1.5 up to 3845. Results are indicated by the boxplots in Figure 3. The percentiles are centered at approximately the same value for all N. Not surprisingly, the variability of the percentiles becomes smaller as N increases. This implies a certain amount of mismatch between percentiles at N=3845 and those at smaller N.

Figure 3.

Figure 3

Effect of number of permutations on the 95th percentile of permutation distributions.

The consequence of the mismatch just alluded to can be investigated by determining the true conditional and unconditional levels of tests based on small N. For the null case, two data sets, each of size 50, are generated from a common normal distribution. Since the distribution of ALB is invariant to location and scale in the null case, we use a standard normal without loss of generality. For each pair of data sets, the data are randomly permuted 338 times, which leads to 338 values of ALB. A second set of 3845 permutations is then performed, leading to 3845 more values of ALB. The proportion of ALBs from the second set that exceed the 95th percentile of the ALBs formed from the first set is then determined. This proportion is approximately equal to the conditional level of the test based on 338 permutations. This same procedure is used for each of 500 data sets, and the resulting distribution of approximate levels is shown in Figure 4.

Figure 4.

Figure 4

Distribution of approximate conditional levels of permutation tests under the null hypothesis. Each conditional level is the proportion of 3845 ALBs from permuted data sets that exceed the 95th percentile of ALBs formed from 338 permuted data sets. Results are based on 500 replications in each of which both distributions are standard normal.

The histogram is centered near 0.05, and 87% of the conditional levels are between 0.03 and 0.07. Furthermore, an approximation to the unconditional level is (1/500) ∑_{i=1}^{500} α̂_i = 0.053, where α̂_i is the approximate conditional level for the ith data set, i=1,…,500. Based on these results, use of only 338 permutations is arguably adequate.

The same experiment is repeated except now the two data sets are drawn from different distributions, a standard normal and a normal with mean 0 and standard deviation 2. Results from this experiment are given in Figure 5.

Figure 5.

Figure 5

Distribution of approximate conditional levels of permutation tests under an alternative hypothesis. Each conditional level is the proportion of 3845 ALBs from permuted data sets that exceed the 95th percentile of ALBs formed from 338 permuted data sets. Results are based on 500 replications in each of which one distribution is standard normal and the other is normal with mean 0 and standard deviation 2.

As in the null case, the conditional levels based on the use of 338 permutations are quite good. Eighty-eight percent of the levels are between 0.03 and 0.07, and the approximate unconditional level is 0.051.

The proportion of ALBs from permuted data sets that are larger than the ALB computed from the original data provides a p-value. The p-values obtained with our method (based on 3845 permutations) are compared to the p-values obtained with the Kolmogorov–Smirnov test and Bowman’s two-sample test. Results are summarized in Figure 6 and Figure 7. In 98% of the replications the K-S p-value was larger than the ALB p-value, and in 57% of the cases the Bowman p-value was equal to or larger than the ALB p-value. These results suggest that in this case our test has much better power than that of the Kolmogorov–Smirnov test and power at least comparable to that of Bowman’s test.

Figure 6.

Figure 6

Kolmogorov–Smirnov p-values versus ALB p-values. Results are based on 500 data sets in each of which one distribution is standard normal and the other is normal with mean 0 and standard deviation 2. The ALB p-value is less than the KS-test p-value in 98% of cases. There are only 183 p-values from the KS-test that are less than 0.05.

Figure 7.

Figure 7

Bowman p-values versus ALB p-values. Results are based on 500 data sets in each of which one distribution is standard normal and the other is normal with mean 0 and standard deviation 2. The number of p-values less than 0.05 for Bowman’s test and the ALB test are 454 and 458, respectively. The ALB p-value is less than, more than and equal to the Bowman p-value in 49%, 43% and 8% of cases, respectively.

4. A Bivariate Extension of the Two-Sample Test and Application to Connectionist Bench Data

Our method can be extended to the bivariate case by using a bivariate kernel density estimate. Assume now that X=(X_1,…,X_m) are independent and identically distributed from density f and Y=(Y_1,…,Y_n) are independent and identically distributed from g, where X_i and Y_j are each bivariate observations, i=1,…,m, j=1,…,n.

A product kernel K will be used, i.e., the bivariate kernel K is the product of two univariate kernels. For k arbitrary bivariate observations U=(U_1,…,U_k), U_i=(U_{i1},U_{i2}), i=1,…,k, and u=(u_1,u_2), the kernel estimate is defined by

f^K(u|h,U) = (1/(k h_1 h_2)) ∑_{i=1}^{k} K((u_1 − U_{i1})/h_1) K((u_2 − U_{i2})/h_2),

where −∞ < u_1 < ∞, −∞ < u_2 < ∞ and h=(h_1,h_2) is a two-vector of (positive) bandwidths.

We will use the same sort of notation as before, i.e., Z_i=X_i, i=1,…,m, Z_i=Y_{i−m}, i=m+1,…,m+n, Z=(Z_1,…,Z_{m+n}) and Z_{−i} is the object Z with all its components except Z_i, i=1,…,m+n. In this case the ith Bayes factor is defined as

B_i = [∫_0^∞ ∫_0^∞ π(h_1,h_2) f^K(Z_i|h,X_{−i}) dh_1 dh_2] / [∫_0^∞ ∫_0^∞ π(h_1,h_2) f^K(Z_i|h,Z_{−i}) dh_1 dh_2],  i=1,…,m,

and similarly for i=m+1,…,m+n. As before, the test statistic is ALB = (1/(m+n)) ∑_{i=1}^{m+n} log B_i.

This form may seem daunting, but reduces to a more familiar form if we take π(h1,h2)=π0(h1/b1)π0(h2/b2)/(b1b2). In this case, proceeding exactly as in Section 2, Bi has the form

B_i = f^L(Z_i|b,X_{−i}) / f^L(Z_i|b,Z_{−i}),  i=1,…,m,

and similarly for i=m+1,…,m+n, where b=(b_1,b_2) and L is defined by (2).

We will analyze a subset of the connectionist bench data, which consist of measurements obtained after bouncing sonar waves off of either rocks or metal cylinders. The data may be found at the UCI Machine Learning repository, Ref. [8]. There are 60 variables in the data set, with m=111 and n=97 measurements of each variable for the metal cylinders and rocks, respectively. Variable numbers (1 to 60) correspond to increasing aspect angles at which signals are bounced off of either metal or rock, and each of the 60 numbers is an amount of energy within a particular frequency band, integrated over a certain period of time. We will apply our test to see if the first two variables (corresponding to the smallest aspect angles) have a different distribution for rocks than they do for metal cylinders. In our analysis K is taken to be ϕ, the standard normal density, and π0 to be of the form (A1). In this event L is a t-density with ν degrees of freedom. We will use ν=3, leading to a fairly heavy-tailed kernel, which is desirable for reasons discussed previously.

The data for each variable are inherently between 0 and 1, and bivariate kernel estimates display boundary effects along the lines x=0 and y=0, with the largest bias near the origin. We therefore use a reflection technique to reduce bias along these two lines. Suppose one has k observations (x_1,y_1),…,(x_k,y_k) on the unit square. Each observation (x_i,y_i) is reflected to create three new observations: (−x_i,y_i), (x_i,−y_i) and (−x_i,−y_i), i=1,…,k. One then simply computes, at points in the unit square, a standard kernel density estimate from the data set of size 4k, and multiplies it by 4 to ensure integration to 1. The value of ALB is computed as described previously except that each leave-out estimate leaves out four values: the observation at which the estimate is evaluated plus its three reflected versions. In this way the kde is constructed from data that are independent of the value at which the kde is evaluated.
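The reflection device can be sketched as follows. This is our own illustration, assuming a product t(3) kernel and arbitrary bandwidths, and it omits the leave-four-out modification used when computing ALB.

```python
import numpy as np
from scipy.stats import t as t_dist

def reflect(xy):
    """Augment points on the unit square with their three reflections
    through the lines x=0 and y=0."""
    x, y = xy[:, 0], xy[:, 1]
    return np.vstack([np.column_stack([sx * x, sy * y])
                      for sx in (1.0, -1.0) for sy in (1.0, -1.0)])

def kde_reflected(points, data, h1, h2, df=3):
    """Boundary-corrected product-kernel kde evaluated at `points`;
    both arguments are (n, 2) arrays."""
    big = reflect(data)                                   # 4k pseudo-data
    kx = t_dist.pdf((points[:, 0][:, None] - big[None, :, 0]) / h1, df)
    ky = t_dist.pdf((points[:, 1][:, None] - big[None, :, 1]) / h2, df)
    dens = (kx * ky).sum(axis=1) / (len(big) * h1 * h2)
    return 4.0 * dens          # times 4 so the estimate integrates to ~1

rng = np.random.default_rng(3)
data = rng.uniform(0.0, 1.0, size=(200, 2))
g = np.linspace(0.025, 0.975, 20)
grid = np.column_stack([np.repeat(g, 20), np.tile(g, 20)])
dens = kde_reflected(grid, data, 0.15, 0.15)
```

For uniform data on the square, the estimate averages close to 1 over a grid, with some mass still escaping past the unreflected boundaries x=1 and y=1.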

Kernel density estimates for variables 1 and 2 in the form of heat maps are shown in Figure 8 and Figure 9, and contours of the estimates are given in Figure 10. The latter figure suggests that the distributions for metal cylinders and rock are different. The value of ALB turned out to be 0.013, and an approximate p-value based on 10,000 permuted data sets was 0.0076. So, there is strong evidence of a difference between the rock and metal bivariate distributions. Interestingly, the proportion of negative ALBs among the 10,000 permutations was 0.9785. A kernel density estimate based on the 10,000 values of ALB is shown in Figure 11.

Figure 8.

Figure 8

A heat map of the first two variables for the signals bounced off the metal cylinder. Variables x and y correspond to the smallest and next to smallest aspect angles, respectively.

Figure 9.

Figure 9

A heat map of the first two variables for the signals bounced off the rock object. Variables x and y are as defined in Figure 8.

Figure 10.

Figure 10

Contour plots of the first two variables of both rock and cylinder objects. The blue contours correspond to the rock measurements and red to the cylinder measurements. Variables x and y are as defined in Figure 8.

Figure 11.

Figure 11

A kernel density estimate computed using 10,000 values of ALB from permuted data sets. The value of ALB for the original data set was 0.013.

5. Conclusions and Future Work

We have proposed a new nonparametric test of the null hypothesis that two densities are equal. An attractive property of the test is that its critical values are defined by a permutation distribution, allaying essentially any concern about test validity. The fact that the statistic is an average of log-Bayes factors leads to another attractive property: a critical value of 0 leads to a test with type I error probability tending to 0 with sample size. A simulation study showed the new test to have much better power than the Kolmogorov–Smirnov test in a case where the two densities differed with respect to scale. An application to connectionist data illustrated the usefulness of our methodology for bivariate data.

Future work includes efforts to increase the speed of computing the test statistic and its permutation distribution, especially for large data sets. We are also interested in applying the new test to the problem of screening variables prior to performing binary classification. A common method of doing so is to compute a two-sample test statistic for each variable, and to then select variables whose statistics exceed some threshold. An inherent problem in this approach is objectively choosing a threshold. Results of the current paper suggest that 0 would be a natural and effective threshold for variable screening.

Appendix A

Appendix A.1. Relationship of K and L

By far the most popular choice of kernel in practice is the Gaussian kernel, K(x)=ϕ(x), −∞ < x < ∞, where ϕ is the standard normal density. For ν>0, define

π_0(u) = [2(ν/2)^{ν/2}/Γ(ν/2)] u^{−(ν+1)} exp(−ν/(2u²)),  u>0. (A1)

If one takes K to be the standard normal kernel and uses prior (A1), then the corresponding kernel L is a t-density with ν degrees of freedom. An interesting aspect of these kernels is that they have heavier tails than those of the Gaussian kernel. This is especially true for the more diffuse, or noninformative, priors, i.e., those for which ν is small. (The mean and variance of (A1) exist for ν>2. At ν=3, the two are 1.382 and 1.090, respectively, and as ν→∞ they converge to 1 and 0.)
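This Gaussian-to-t relationship is easy to verify numerically. The sketch below (our code, not the authors') evaluates the mixed kernel of Equation (2) with K = ϕ and prior (A1) at a few points and compares it with the t density, taking ν = 3 for illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm, t as t_dist
from scipy.special import gamma as gamma_fn

nu = 3.0

def prior_a1(u):
    # pi0(u) = 2*(nu/2)^{nu/2}/Gamma(nu/2) * u^{-(nu+1)} * exp(-nu/(2u^2))
    c = 2.0 * (nu / 2.0) ** (nu / 2.0) / gamma_fn(nu / 2.0)
    return c * u ** (-(nu + 1.0)) * np.exp(-nu / (2.0 * u ** 2))

def mixed_kernel(z):
    # Equation (2): L(z) = integral of u^{-1} * pi0(u) * K(z/u) du, K = phi
    val, _ = quad(lambda u: prior_a1(u) * norm.pdf(z / u) / u, 0.0, np.inf)
    return val

zs = np.array([0.0, 0.5, 1.0, 2.5])
mixed = np.array([mixed_kernel(z) for z in zs])
target = t_dist.pdf(zs, df=nu)
```

The two arrays agree to numerical integration accuracy, confirming that (A1) is the scale prior turning a Gaussian kernel into a t(ν) kernel.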

The fact that the kernel L is more heavy-tailed than K in the previous example is not an isolated phenomenon, as indicated by the following proposition (which is straightforward to prove):

Proposition A1.

If π_0 has support (0,C) with 1 < C ≤ ∞ and the tails of K decay exponentially, then the tails of L are heavier than those of K in the sense that K(u)/L(u)→0 as u→∞.

In principle, many different choices of π0 and K could produce the same kernel L. Or, one might ask “given kernel K, what prior π0 would produce a specified L?” When K is Gaussian, the latter question is answered by solving an integral equation. Unfortunately, doing so, at least in a general sense, exceeds our mathematical abilities. In the case where K is uniform, though, an elegant solution exists, as seen in the next section.

Appendix A.2. When K Is Uniform

In the special case where K is uniform on the interval (−1/2, 1/2), it is easy to check that, for all u,

L(u) = ∫_{2|u|}^∞ α^{−1} π_0(α) dα. (A2)

If π_0 has support (0,∞), then L has support (−∞,∞), and hence we see again that averaging kernels with respect to a prior leads to a more heavy-tailed kernel.

Since our statistic ends up being a log-likelihood ratio based on kernel L, an interesting question is “what prior π_0 gives rise to a specified kernel L?” Taking u>0, (A2) implies that

π_0(2u) = −u L′(u). (A3)

When L is decreasing on [0,∞) it follows that π_0 is a density. (Under mild tail conditions on L, and assuming that L(0+) exists and is finite, it is easy to show using integration by parts that (A3) integrates to 1 on (0,∞).)
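As a numerical illustration (ours, with L taken to be the t(3) density as an assumption), recovering π_0 from (A3) does yield a bona fide density: it is nonnegative, since the t density is decreasing on [0,∞), and it integrates to 1.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import t as t_dist

NU = 3.0

def t_kernel_deriv(u):
    # Closed-form derivative of the t(nu) density:
    # L'(u) = L(u) * ( -(nu+1) * u / (nu + u^2) )
    return t_dist.pdf(u, NU) * (-(NU + 1.0) * u / (NU + u ** 2))

def pi0(alpha):
    # (A3) rearranged: pi0(alpha) = -(alpha/2) * L'(alpha/2)
    u = alpha / 2.0
    return -u * t_kernel_deriv(u)

total, _ = quad(pi0, 0.0, np.inf)
```

Here `total` comes out to 1, matching the integration-by-parts argument in the text.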

Suppose that a kde is defined using the Hall kernel L0 and its bandwidth is chosen by likelihood cross-validation. Ref. [5] shows that, in general, this cross-validation bandwidth will be asymptotically optimal in a Kullback–Leibler sense. In contrast, using cross-validation to choose the bandwidth of a uniform kernel kde will produce a bandwidth that diverges to ∞ as the sample size tends to ∞.

Using (A3) the prior, shown in Figure A1, that produces L0 is

π_0(2u) = L_0(u) · u log(1+u)/(1+u).

This shape for the bandwidth prior could be considered canonical inasmuch as L will be similarly shaped for kernels that are decreasing on (0,∞).

Figure A1.

Figure A1

The prior that produces the Hall kernel when K is uniform.

Appendix A.3. Consistency

Here we prove

  • R1.

    frequentist consistency of our test, and

  • R2.

    P(ALB<0) → 1 as m,n → ∞.

Our proof uses the following assumptions.

  • A1.
    Under the null and alternative hypotheses the following integrals exist and are finite:
    I_X = ∫ f(x) log f(x) dx  and  I_Y = ∫ g(y) log g(y) dy.

    When the alternative hypothesis is true, f and g are assumed to be different in the sense that the total variation distance, δ(f,g), is positive.

  • A2.

    The kernel L in ALB (expression (3)) is the Hall kernel, L0.

  • A3.

    The combined data likelihood cross-validation is maximized over an interval of the form [(m+n)^{−1+ϵ}, (m+n)^{−ϵ}], where ϵ is an arbitrarily small positive constant. The maximizer of this cross-validation is denoted b̂_{m+n}.

  • A4.

    The ratio m/(m+n) tends to ρ, 0 < ρ < 1, as m,n tend to ∞.

  • A5.

    The densities f, g and ρf(x) + (1−ρ)g(x) satisfy the conditions of [5] that are needed for the asymptotic optimality of a likelihood cross-validation bandwidth.

  • A6.
    Under the null hypothesis, let ℓ_k(b) be the Kullback–Leibler risk of a kernel density estimate based on sample size k, kernel L_0 and bandwidth b. Then ℓ_k satisfies
    ℓ_k(b) = C_V (kb)^{−1+a} + C_B b^4 + o((kb)^{−1+a} + b^4)
    for positive constants a, C_V and C_B with 0 < a < 1.

Before proceeding to the proof, remarks about assumption A6 are in order. This condition is needed only in proving R2, and represents a subset of the cases studied by [5]. It has been assumed merely to allow a more concise proof of R2, which remains true under more general conditions on ℓ_k.

The critical values of a test with fixed size α > 0 will tend to 0 as m, n tend to ∞ so long as ALB tends to 0 in probability under the null hypothesis. Therefore, the power of the test will tend to 1 if we can show that ALB tends to a positive constant under the alternative. Our proof of consistency thus boils down to showing that, as m, n tend to ∞, ALB converges in probability to 0 and a positive number under the null and alternative hypotheses, respectively.

For data U = (U_1, …, U_k), define

\[
\mathrm{CV}(b \mid U) = \frac{1}{k} \sum_{i=1}^{k} \log\!\left( \hat f_L(U_i \mid b, U_{-i}) \right), \qquad b > 0,
\]

where U_{-i} denotes the sample with U_i deleted.

The statistic ALB may then be written

\[
\mathrm{ALB} = \frac{m}{m+n}\,\mathrm{CV}(\hat b \mid X) + \frac{n}{m+n}\,\mathrm{CV}(\hat b \mid Y) - \mathrm{CV}(\hat b \mid Z),
\]

where b̂ maximizes CV(b | Z) for b ∈ [(m+n)^{-1+ε}, (m+n)^{-ε}].
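A minimal numerical sketch of the statistic follows. As before, a Cauchy kernel stands in for the Hall kernel, and the bandwidth grid is illustrative rather than the exact interval of A3.

```python
import numpy as np

def cv(data, b):
    """CV(b | U): leave-one-out log-likelihood, Cauchy kernel as heavy-tailed stand-in."""
    k = len(data)
    u = (data[:, None] - data[None, :]) / b
    w = 1.0 / (np.pi * (1.0 + u ** 2))
    np.fill_diagonal(w, 0.0)
    return np.mean(np.log(w.sum(axis=1) / ((k - 1) * b)))

def alb(x, y, grid):
    """ALB = (m/(m+n)) CV(b_hat|X) + (n/(m+n)) CV(b_hat|Y) - CV(b_hat|Z)."""
    m, n = len(x), len(y)
    z = np.concatenate([x, y])
    b_hat = max(grid, key=lambda b: cv(z, b))   # b_hat maximizes CV(b | Z)
    return m / (m + n) * cv(x, b_hat) + n / (m + n) * cv(y, b_hat) - cv(z, b_hat)

rng = np.random.default_rng(1)
x = rng.standard_normal(150)
y_null = rng.standard_normal(150)            # same density: ALB should be near 0
y_alt = rng.standard_normal(150) + 2.0       # shifted density: ALB should be positive
grid = np.geomspace(0.05, 2.0, 40)
alb_null, alb_alt = alb(x, y_null, grid), alb(x, y_alt, grid)
```

In line with R1 and R2, alb_null is close to 0 (and negative for large samples) while alb_alt settles near a positive constant.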

Now suppose that U is a random sample from density d, ℓ_k(b) is the expectation of the Kullback–Leibler loss of f̂_L(· | b, U), and define

\[
Q(k) = \frac{1}{k} \sum_{i=1}^{k} \log d(U_i) - \int d(x) \log d(x)\,dx,
\]

where ∫ d(x) log d(x) dx exists and is finite. Then if d satisfies the conditions of [5] and k → ∞,

\[
\mathrm{CV}(b \mid U) = \int d(x) \log d(x)\,dx - \ell_k(b) + Q(k) + o_p\!\left(\ell_k(b)\right) \tag{A4}
\]

uniformly in b ∈ [k^{-1+ε}, k^{-ε}], where ε is arbitrarily small. By the strong law of large numbers, Q(k) converges to 0 in probability. Furthermore, max_{b ∈ [k^{-1+ε}, k^{-ε}]} ℓ_k(b) tends to 0 as k → ∞. If the maximizer b̃ of CV(b | U) is in [k^{-1+ε}, k^{-ε}], it therefore follows that CV(b̃ | U) converges in probability to ∫ d(x) log d(x) dx as k → ∞.
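This limit can be checked numerically. In the sketch below (a Cauchy kernel stands in for the heavy-tailed kernel; the sample size and grid are illustrative), the maximized cross-validation score for standard normal data should approach ∫ φ(x) log φ(x) dx = −½ log(2πe) ≈ −1.419.

```python
import numpy as np

def cv(data, b):
    # CV(b | U): leave-one-out log-likelihood, Cauchy kernel as heavy-tailed stand-in
    k = len(data)
    u = (data[:, None] - data[None, :]) / b
    w = 1.0 / (np.pi * (1.0 + u ** 2))
    np.fill_diagonal(w, 0.0)
    return np.mean(np.log(w.sum(axis=1) / ((k - 1) * b)))

rng = np.random.default_rng(2)
sample = rng.standard_normal(2000)
grid = np.geomspace(0.02, 1.0, 50)        # illustrative version of [k^(-1+eps), k^(-eps)]
cv_max = max(cv(sample, b) for b in grid)
target = -0.5 * np.log(2 * np.pi * np.e)  # integral of d log d for d = N(0, 1)
```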

In the null case, (A4) implies that

\[
\frac{m}{m+n}\,\mathrm{CV}(b \mid X) + \frac{n}{m+n}\,\mathrm{CV}(b \mid Y) - \mathrm{CV}(b \mid Z)
= -\frac{m}{m+n}\,\ell_m(b) - \frac{n}{m+n}\,\ell_n(b) + \ell_{m+n}(b) + o_p\!\left(\ell_m(b)\right), \tag{A5}
\]

uniformly in b ∈ [(m+n)^{-1+ε}, (m+n)^{-ε}], where we have used all of A1–A5. Since b̂_{m+n} ∈ [(m+n)^{-1+ε}, (m+n)^{-ε}], (A5) implies that ALB converges to 0 in probability as m, n → ∞, which proves one part of R1.

To prove R2, we first observe that the bias component of ℓ_k(b) is free of sample size, and hence the first-order term of (A5) is free of bias components. Along with A3 and A6, this implies that

\[
\frac{m}{m+n}\,\mathrm{CV}(b \mid X) + \frac{n}{m+n}\,\mathrm{CV}(b \mid Y) - \mathrm{CV}(b \mid Z)
= -C_V \big((m+n)b\big)^{-1+a}\big(\rho^{a} + (1-\rho)^{a} - 1\big) + o_p\!\Big(\big((m+n)b\big)^{-1+a} + b^4\Big), \tag{A6}
\]

uniformly in b ∈ [(m+n)^{-1+ε}, (m+n)^{-ε}]. By A5, b̂_{m+n} is asymptotic in probability to b_{m+n}, the minimizer of the Kullback–Leibler risk ℓ_{m+n}. Along with (A6), this implies that

\[
\mathrm{ALB} = -C_V \big((m+n)b_{m+n}\big)^{-1+a}\big(\rho^{a} + (1-\rho)^{a} - 1\big) + o_p\!\Big(\big((m+n)b_{m+n}\big)^{-1+a} + b_{m+n}^4\Big).
\]

By A6, we have

\[
b_{m+n} \sim C_0 (m+n)^{-(1-a)/(5-a)},
\]

where

\[
C_0 = \left( \frac{C_V (1-a)}{4 C_B} \right)^{1/(5-a)}.
\]
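For completeness, this rate follows by minimizing the two leading terms of the A6 expansion of the risk (writing k for m + n):

```latex
% Minimize the leading terms of the Kullback--Leibler risk in A6:
%   \ell_k(b) \approx C_V (kb)^{-1+a} + C_B b^4 .
% Setting the derivative with respect to b equal to zero,
\frac{d}{db}\left\{ C_V (kb)^{-1+a} + C_B b^4 \right\}
  = -(1-a)\, C_V\, k^{-1+a}\, b^{-2+a} + 4 C_B b^3 = 0 ,
% so that
b^{\,5-a} = \frac{C_V (1-a)}{4 C_B}\, k^{-(1-a)} ,
\qquad
b_k = \left( \frac{C_V (1-a)}{4 C_B} \right)^{1/(5-a)} k^{-(1-a)/(5-a)}
    = C_0\, k^{-(1-a)/(5-a)} .
```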

Combining the previous results yields

\[
\mathrm{ALB} = -C_V C_0^{-1+a}\big(\rho^{a} + (1-\rho)^{a} - 1\big)(m+n)^{-4(1-a)/(5-a)} + o_p\!\left((m+n)^{-4(1-a)/(5-a)}\right).
\]

Using the fact that ρ^a + (1−ρ)^a − 1 > 0, it now follows that P(ALB < 0) → 1 as m, n → ∞.

Turning to the alternative case, we apply (A4) to conclude that CV(b̂ | X), CV(b̂ | Y) and CV(b̂ | Z) are consistent for ∫ f(x) log f(x) dx, ∫ g(x) log g(x) dx, and ∫ f_ρ(x) log f_ρ(x) dx, respectively, where

\[
f_\rho(x) = \rho f(x) + (1-\rho) g(x).
\]

It follows that ALB is consistent for Δ = ρ KL(f, f_ρ) + (1−ρ) KL(g, f_ρ), where KL(f_1, f_2) denotes the Kullback–Leibler divergence between f_1 and f_2. By the Csiszár–Kemperman–Kullback–Pinsker inequality,

\[
\Delta \ge \frac{\log e}{2}\left[\rho\,\delta(f, f_\rho)^2 + (1-\rho)\,\delta(g, f_\rho)^2\right]
= \frac{\log e}{2}\left[\rho(1-\rho)^2\,\delta(f, g)^2 + (1-\rho)\rho^2\,\delta(f, g)^2\right]
= \frac{\log e}{2}\,\rho(1-\rho)\,\delta(f, g)^2 > 0,
\]

with the last inequality following by assumption. This completes the proof of R1.

Author Contributions

Conceptualization, N.M. and J.D.H.; Methodology, N.M. and J.D.H.; Software, N.M. and J.D.H.; Investigation, N.M. and J.D.H.; Writing, N.M. and J.D.H. All authors have read and agreed to the published version of the manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

Funding Statement

The APC was funded by the Department of Statistics, Texas A&M University.

Footnotes

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  • 1. Merchant N., Hart J., Choi T. Use of cross-validation Bayes factors to test equality of two densities. arXiv 2020, arXiv:2003.06368. [Google Scholar]
  • 2. Bowman A.W., Azzalini A. Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. Volume 18. OUP Oxford; New York, NY, USA: 1997. [Google Scholar]
  • 3.Baranzano R. Ph.D. Thesis. Uppsala University; Uppsala, Sweden: 2011. Non-Parametric Kernel Density Estimation-Based Permutation Test: Implementation and Comparisons. [Google Scholar]
  • 4.Hart J.D., Choi T., Yi S. Frequentist nonparametric goodness-of-fit tests via marginal likelihood ratios. Comput. Stat. Data Anal. 2016;96:120–132. doi: 10.1016/j.csda.2015.10.013. [DOI] [Google Scholar]
  • 5.Hall P. On Kullback-Leibler loss and density estimation. Ann. Stat. 1987;15:1491–1519. doi: 10.1214/aos/1176350606. [DOI] [Google Scholar]
  • 6.Young S.G., Bowman A.W. Non-parametric analysis of covariance. Biometrics. 1995;51:920–931. doi: 10.2307/2532993. [DOI] [Google Scholar]
  • 7.Hart J.D. Use of BayesSim and smoothing to enhance simulation studies. Open J. Stat. 2017;7:153–172. doi: 10.4236/ojs.2017.71012. [DOI] [Google Scholar]
  • 8.Dua D., Graff C. UCI Machine Learning Repository. School of Information and Computer Sciences, University of California, Irvine. 2017. [(accessed on 15 March 2022)]. Available online: http://archive.ics.uci.edu/ml.
