. Author manuscript; available in PMC: 2013 Oct 1.
Published in final edited form as: J Multivar Anal. 2012 Apr 17;111:1–19. doi: 10.1016/j.jmva.2012.02.020

Nonparametric Bayes Classification and Hypothesis Testing on Manifolds

Abhishek Bhattacharya 1, David Dunson 2
PMCID: PMC3384705  NIHMSID: NIHMS371284  PMID: 22754028

Abstract

Our first focus is prediction of a categorical response variable using features that lie on a general manifold. For example, the manifold may correspond to the surface of a hypersphere. We propose a general kernel mixture model for the joint distribution of the response and predictors, with the kernel expressed in product form and dependence induced through the unknown mixing measure. We provide simple sufficient conditions for large support and weak and strong posterior consistency in estimating both the joint distribution of the response and predictors and the conditional distribution of the response. Focusing on a Dirichlet process prior for the mixing measure, these conditions hold using von Mises-Fisher kernels when the manifold is the unit hypersphere. In this case, Bayesian methods are developed for efficient posterior computation using slice sampling. Next we develop Bayesian nonparametric methods for testing whether there is a difference in distributions between groups of observations on the manifold having unknown densities. We prove consistency of the Bayes factor and develop efficient computational methods for its calculation. The proposed classification and testing methods are evaluated using simulation examples and applied to spherical data applications.

Key words and phrases: Bayes factor, Classification, Dirichlet process mixture, Flexible prior, Hypothesis testing, Non-Euclidean manifold, Nonparametric Bayes, Posterior consistency, Spherical data

1. Introduction

Classification is one of the fundamental problems in statistics and machine learning. Let (X, Y) denote a pair of random variables with X ∈ 𝒳 the predictors and Y ∈ 𝒴 = {1, …, L} the response. The focus in classification is on predicting Y given features X. From a model-based perspective, one can address this problem by first estimating p(y, x) = Pr(Y = y | X = x), for y = 1, …, L and all x ∈ 𝒳, based on a training sample of n subjects. Then, under a 0–1 loss function, the optimal predictive value for y_{n+1}, the unknown response for an unlabeled (n+1)st subject, is simply the value of y that maximizes the estimate p̂(y, x_{n+1}) over y ∈ {1, …, L}. The model-based perspective has the advantage of providing measures of uncertainty in classification. However, performance will be critically dependent on obtaining an accurate estimate of p(y, x).
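The 0–1-loss prediction rule above is just an argmax over estimated class probabilities; a minimal sketch, where `p_hat` is a hypothetical stand-in for whatever model supplies the estimate of Pr(Y = y | X = x):

```python
import numpy as np

def classify(p_hat, x_new, L):
    """Predict the label maximizing the estimated Pr(Y = y | X = x_new)
    over y in {1, ..., L} (the Bayes-optimal rule under 0-1 loss)."""
    probs = [p_hat(y, x_new) for y in range(1, L + 1)]
    return int(np.argmax(probs)) + 1  # labels are 1-based

# Toy estimate: class 2 is most probable regardless of x.
p_hat = lambda y, x: {1: 0.2, 2: 0.5, 3: 0.3}[y]
label = classify(p_hat, x_new=0.0, L=3)  # -> 2
```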

A common strategy for addressing this problem is to use a discriminant analysis approach, which lets

p(y, x) = Pr(Y = y | X = x) = Pr(Y = y) f(x | Y = y) / Σ_{j=1}^{L} Pr(Y = j) f(x | Y = j),  y = 1, …, L,

with Pr(Y = y) the marginal probability of having label y, and f(x |Y = y) the conditional density of the features (predictors) for subjects in class y. Then, one can simply use the proportion of subjects in class y as an estimate of Pr(Y = y), while applying a multivariate density estimator to learn f(x |Y = y) separately within each class. For example, [25] proposed a popular approach which estimates f(x |Y = y) using a mixture of multivariate Gaussians (refer also to [19]). [1] instead use mixtures of von Mises-Fisher (vMF) distributions, but for unsupervised clustering on the unit hypersphere instead of classification. [5] extends the idea to features lying on a general manifold and uses mixtures of complex Watson distributions on the shape space.
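As a concrete Euclidean sketch of the discriminant-analysis recipe, the following fits a single Gaussian per class (a deliberate simplification of the mixture of Gaussians in [25]; all function names are illustrative):

```python
import numpy as np

def fit_gaussian_discriminant(X, y):
    """Estimate Pr(Y = c) by class proportions and f(x | Y = c) by a single
    Gaussian fitted within each class (a one-component stand-in for a
    class-specific mixture)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        cov = np.cov(Xc.T) + 1e-6 * np.eye(X.shape[1])  # ridge for stability
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), cov)
    return params

def log_gauss(x, mu, Sigma):
    """Log of the multivariate normal density at x."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(Sigma, diff))

def predict(params, x):
    """Bayes rule: argmax_c Pr(Y = c) f(x | Y = c), computed on the log scale."""
    scores = {c: np.log(pi) + log_gauss(x, mu, S) for c, (pi, mu, S) in params.items()}
    return max(scores, key=scores.get)

# Two well-separated synthetic classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
y = np.array([1] * 20 + [2] * 20)
params = fit_gaussian_discriminant(X, y)
```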

Even when features can be assumed to have support on ℜp, there are two primary issues that arise. Firstly, the number of mixture components is typically unknown and can be difficult to estimate reliably, and secondly it may be difficult to accurately estimate a multivariate density specific to each class unless p is small and there are abundant training data in each class. To address these issues, a nonparametric Bayes discriminant analysis approach can be used in which the prior incorporates dependence in the unknown class-specific densities ([13]; [14]). An important challenge in implementing nonparametric Bayes methods is showing that the prior is sufficiently flexible to accurately approximate any classification function p(y, x). A primary goal of this article is to provide methods for classification that allow the predictors to have support on a manifold, utilizing priors with full support that lead to posterior consistency for the classification function.

As noted in [23], it is routine in many application areas, ranging from genomics to computer vision, to normalize the data prior to analysis to remove artifacts. This leads to feature vectors that lie on the surface of the unit hypersphere, though due to the lack of straightforward methods for the analysis of spherical data, Gaussian approximations in Euclidean space are typically used. [23] show that treating spherical features as Euclidean can lead to poor classification performance if the feature vectors are not approximately spherical-homoscedastic. We propose a class of product kernel mixture models that can be designed to have full support on the set of densities on a manifold and that lead to strong posterior consistency. In important special cases, such as spherical data, these models also facilitate computationally convenient Gibbs sampling algorithms for posterior computation.

A problem closely related to classification is testing for differences in the distribution of features across groups. In the testing setting, the nonparametric Bayes literature is surprisingly limited, perhaps due to the computational challenges that arise in calculating Bayes factors. For recent articles on nonparametric testing of differences between groups, refer to [15], [32] and [26]. The former two articles considered interval null hypotheses, while the latter considered a point null for testing differences between two groups using Polya tree priors. Here, we modify the methodology developed for the classification problem to obtain an easy-to-implement approach for nonparametric Bayes testing of differences between groups, with the data within each group constrained to lie on a compact metric space or Riemannian manifold, and we prove consistency of this testing procedure.

Here is a very brief overview of the sections to follow. §2 describes the general modeling framework for classification on manifolds. §3 adapts the methodology to the testing problem. §4 focuses on the special case in which the features lie on the surface of a unit hypersphere and adapts the theory and methodology of the earlier sections to this manifold. §5 contains results from simulation studies where the developed methods of classification and testing are compared with existing ones. §6 applies the methods to spherical data applications, and §7 discusses the results. Proofs are included in an Appendix (§8).

2. Nonparametric Bayes Classification

2.1. Kernel mixture model

Let (𝒳, ρ) be a compact metric space, ρ being the distance metric, and 𝒴 = {1, …, L} a finite set. Consider a pair of random variables (X, Y) taking values in 𝒳 × 𝒴. To induce a flexible model on the classification function p(y, x), we propose to model the joint distribution of the (X, Y) pair. The approach of inducing a flexible model on the conditional distribution through a nonparametric model for the joint distribution was proposed by [31]. In particular, they used a Dirichlet process mixture (DPM) of multivariate Gaussians for (X, Y) to induce a flexible prior on E(Y |X = x). [35, 24] recently generalized this approach beyond Gaussian mixtures. [9] used a joint DPM for random effects underlying a functional predictor and a response to induce a flexible model for prediction.

As an alternative to the joint modeling approach, one can directly place a non-parametric Bayesian model on the conditional p(y, x) = Pr(Y = y|X = x). There is a rich applied literature on this topic, with most approaches focusing on categorical predictors and/or predictors in a Euclidean space without consideration of theoretical properties. [22] showed posterior consistency in the binary case 𝒴 = {1, 2} using a Gaussian process prior mapped to the unit interval. [12] showed minimax optimal rates of posterior contraction up to a log factor for a binary response kernel regression model, with Gaussian priors on the weights leading to a conditionally Gaussian process. These and other theoretical articles on nonparametric Bayes classification focus on 𝒳 being a compact d-dimensional subset of ℜ^d, with the optimal rate for α-smooth functions being n^{−α/(d+2α)}.

Our focus is on the case in which Y is an unordered categorical variable and X is constrained to have support on a compact metric space, with non-Euclidean Riemannian manifolds of particular interest. Potentially, one could induce a conditional model for p(y, x) without modeling the marginal distribution of the predictors p(x) through appropriate mappings of Gaussian processes defined over 𝒳. However, Gaussian processes lead to well-known computational challenges even in Euclidean spaces, and there are advantages of the joint modeling approach in terms of simplifying theory, computation and modifications (e.g., to the testing problem considered in §3) while also allowing missing predictors. Most of the literature on nonparametric Bayes asymptotic theory follows a similar path, with some early papers on the type of strategy we will follow including [2, 20]. However, it is far from straightforward to establish the necessary conditions in new settings, and essentially all of the nonparametric Bayes asymptotic literature has focused on Euclidean spaces and relatively simple problems.

Assume that the joint distribution of (X, Y) has a joint density with respect to some fixed base measure λ on 𝒳 × 𝒴. Let λ = λ1 ⊗ Σ_{j=1}^{L} δ_j, where δ_j denotes the Dirac measure at j. If 𝒳 is a Riemannian manifold, the natural choice for λ1 will be the Riemannian volume form V. The distance ρ will be chosen to maintain the topology of the manifold. Letting 𝒟(𝒳 × 𝒴) denote the space of all densities with respect to λ, we propose the following joint density model

f(x, y; P, κ) = ∫_{𝒳 × S^{L−1}} ν_y K(x; μ, κ) P(dμ dν),  (x, y) ∈ 𝒳 × 𝒴,  (2.1)

where ν = (ν1, …, νL)′ ∈ S^{L−1} is a probability vector in the simplex S^{L−1} = {ν ∈ [0, 1]^L : Σ_j ν_j = 1}, K(·; μ, κ) is a kernel located at μ ∈ 𝒳 with precision or inverse-scale κ ∈ ℜ+, and P ∈ ℳ(𝒳 × S^{L−1}) is a mixing measure, with ℳ(𝒳 × S^{L−1}) denoting the space of all probability measures on 𝒳 × S^{L−1}.

One can interpret this model in the following hierarchical way. Draw (μ, ν) from P. Given (μ, ν, κ), X and Y are conditionally independent with X having the conditional density K(.; μ, κ) with respect to λ1 and

Pr(Y = l | μ, ν, κ) = ν_l,  1 ≤ l ≤ L.

If K(.; μ, κ) is a valid probability kernel, i.e.

∫_𝒳 K(x; μ, κ) λ1(dx) = 1  for all (μ, κ) ∈ 𝒳 × ℜ+,

one can show that f(x, y; P, κ) is a valid probability density with respect to λ.
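The hierarchical description above can be simulated directly when P is a finite mixing measure; in this sketch the kernel is the von Mises density on the circle, a purely illustrative stand-in for the manifold kernels of §4:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_joint(locs, nus, weights, kappa, n):
    """Generative sketch of model (2.1) with a finite P: draw an atom
    (mu, nu) = (locs[j], nus[j]) with probability weights[j], then
    X | mu ~ K(.; mu, kappa) (von Mises on the circle) and
    Pr(Y = l | nu) = nu_l."""
    idx = rng.choice(len(weights), size=n, p=weights)
    X = np.array([rng.vonmises(locs[j], kappa) for j in idx])
    Y = np.array([rng.choice(len(nus[j]), p=nus[j]) + 1 for j in idx])  # labels 1..L
    return X, Y

X, Y = draw_joint(locs=[0.0, np.pi], nus=[[0.9, 0.1], [0.1, 0.9]],
                  weights=[0.5, 0.5], kappa=20.0, n=100)
```

Because X and Y are drawn conditionally independently given the atom, dependence between them comes entirely through the shared mixing measure P, exactly as in (2.1).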

To justify model (2.1), it is necessary to show flexibility in approximating any joint density in 𝒟(𝒳 × 𝒴), and hence in approximating p(y, x). Our focus is on nonparametric Bayes methods that place a prior on the joint distribution of the measure P and the precision κ to induce a prior over 𝒟(𝒳 × 𝒴). Flexibility of the model is quantified in terms of the size of the support of this prior. In particular, our goal is to choose a specification that leads to full L∞ and Kullback-Leibler (KL) support, meaning that the prior assigns positive probability to arbitrarily small neighborhoods of any density in 𝒟(𝒳 × 𝒴). This property will not necessarily hold for arbitrarily chosen kernels, and one of our primary theoretical contributions is to provide sufficient conditions under which KL support and posterior consistency hold. This is not just of theoretical interest, as it is important to verify that the model is sufficiently flexible to approximate any classification function, with the accuracy of the estimate improving as the amount of training data grows. This is not automatic for nonparametric models, in which there is often concern about over-fitting.

Remark 2.1

In the joint model (2.1), one may also mix across the precision parameter to make the model more flexible. Posterior computation for such a model is a straightforward extension of the algorithm illustrated in §2.3.

Remark 2.2

One can allow 𝒳 = 𝒳1 ⊗ · · · ⊗ 𝒳k, with each 𝒳j corresponding to a different type of space (e.g., 𝒳1 a hypersphere, 𝒳2 a subset of ℜ^d, 𝒳3 a discrete space), by replacing K(x; μ, κ) with a product of kernels appropriate to each space.

2.2. Support of the prior and consistency

Assume that the joint distribution of (X, Y) has a density ft(x, y) = gt(x) pt(y, x) with respect to λ, where gt is the true marginal density of X and pt(y, x) is the true Pr(Y = y|X = x). For Bayesian inference, we choose a prior Π1 on (P, κ) in (2.1), with one possible choice corresponding to DP(w0 P0) ⊗ Gam(a, b), with DP(w0 P0) denoting a Dirichlet process prior with precision w0 and base probability measure P0 ∈ ℳ(𝒳 × S^{L−1}), and Gam denoting the gamma distribution. The prior Π1 induces a corresponding prior Π on the space 𝒟(𝒳 × 𝒴) through (2.1). Under minor assumptions on Π1 and hence Π, the theorem below shows that the prior probability of any uniform neighborhood of a continuous true density is positive.

Theorem 2.1

Under the assumptions

  • A1: K is continuous in its arguments,

  • A2: for any continuous function φ: 𝒳 → ℜ,
    lim_{κ→∞} sup_{x∈𝒳} |φ(x) − ∫_𝒳 K(x; μ, κ) φ(μ) λ1(dμ)| = 0,

  • A3: for any κ > 0, there exists a κ̃ ≥ κ such that (Pt, κ̃) ∈ supp(Π1), where Pt ∈ ℳ(𝒳 × S^{L−1}) is defined as
    Pt(dμ dν) = Σ_{j∈𝒴} ft(μ, j) λ1(dμ) δ_{e_j}(dν),
    with e_j ∈ ℜ^L denoting a zero vector with a single one in position j,

  • A4: ft(·, j) is continuous for all j ∈ 𝒴,

given any ε > 0,

Π({f ∈ 𝒟(𝒳 × 𝒴): sup_{x∈𝒳, y∈𝒴} |f(x, y) − ft(x, y)| < ε}) > 0.

Assumption A4 restricts the true conditional density of X given Y = j to be continuous for all j. Assumptions A1 and A2 place minor regularity conditions on the kernel K. If K(x; μ, κ) is symmetric in x and μ, as will be the case in most examples, A2 implies that K(·; μ, κ) converges to δ_μ in the weak sense uniformly in μ as κ → ∞. This justifies the names ‘location’ and ‘precision’ for the parameters. Assumption A3 provides a minimal condition on the support of the prior for (P, κ). These assumptions provide general sufficient conditions for the induced prior Π on the joint density of (X, Y) to have full L∞ support.

Although full uniform support is an appealing property, much of the theoretical work on asymptotic properties of nonparametric Bayes estimators relies on KL support. The following corollary shows that KL support follows from A1–A4 and the additional assumption that the true density is everywhere positive. The proof is along the same lines as that of Corollary 1 of [5]. The KL divergence of a density f from ft is defined as KL(ft; f) = ∫_{𝒳×𝒴} ft log(ft/f) λ(dx dy). Given ε > 0, Kε(ft) = {f : KL(ft; f) < ε} will denote an ε-sized KL neighborhood of ft. The prior Π is said to satisfy the KL condition at ft, or ft is said to be in its KL support, if Π{Kε(ft)} > 0 for any ε > 0.
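On a discretized 𝒳 × 𝒴 the KL divergence reduces to a weighted sum; a small numerical sketch (the grid cells and density values below are toy inputs):

```python
import numpy as np

def kl_divergence(ft, f, cell_mass):
    """Approximate KL(ft; f) = integral of ft log(ft/f) d(lambda):
    ft and f hold density values on grid cells of X x Y, and cell_mass
    holds the lambda-mass of each cell."""
    ft, f, cell_mass = map(np.asarray, (ft, f, cell_mass))
    return float(np.sum(cell_mass * ft * np.log(ft / f)))

# KL(ft; ft) = 0, and KL(ft; f) > 0 whenever the densities differ.
p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
m = np.array([1.0, 1.0])  # unit lambda-mass per cell
```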

Corollary 2.2

Under assumptions A1A4 and

  • A5:

    ft(x, y) > 0 for all x, y,

    ft is in the KL support of Π.

Suppose we have an independent and identically distributed (iid) sample (x^n, y^n) ≡ (x_i, y_i)_{i=1}^{n} from ft. Since ft is unknown, we take the likelihood function to be Π_{i=1}^{n} f(x_i, y_i; P, κ). Using the prior Π on f and the observed sample, we obtain the posterior distribution of f, denoted Π(· | x^n, y^n). By the Schwartz theorem ([33]), Corollary 2.2 implies weak posterior consistency. This in turn implies that for any measurable subset A of 𝒳 with λ1(A) > 0 and λ1(∂A) = 0, and any y ∈ 𝒴, the posterior conditional probability of Y being y given X in A converges to the true conditional probability almost surely. Here ∂A denotes the boundary of A.

To give more flexibility to the classification function, we may replace the location-scale kernel by some broader family of parametric distributions on 𝒳, such as K(·; μ, κ, Θ) with Θ denoting additional kernel parameters. When performing posterior computations, we may set hyperpriors on the parameters of the prior. Then the conclusions of Theorem 2.1 and Corollary 2.2 hold, and hence weak consistency follows, as long as the assumptions are verified for hyperparameters in a set of positive prior probability. This is immediate and is verified in Lemma 1 of [41].

Under stronger assumptions on the kernel and the prior, we prove strong posterior consistency for the joint model. We will illustrate how these conditions are met for a von Mises-Fisher (vMF) mixture model for hyperspherical data through Proposition 4.1.

Theorem 2.3

Under assumptions A1A5 and

  • A6: there exist positive constants 𝒦0, a1, A1 such that for all 𝒦 ≥ 𝒦0 and μ, ν ∈ 𝒳,
    sup_{x∈𝒳, κ∈[0,𝒦]} |K(x; μ, κ) − K(x; ν, κ)| ≤ A1 𝒦^{a1} ρ(μ, ν),

  • A7: there exist positive constants a2, A2 such that for all 𝒦 ≥ 𝒦0 and κ1, κ2 ∈ [0, 𝒦],
    sup_{x,μ∈𝒳} |K(x; μ, κ1) − K(x; μ, κ2)| ≤ A2 𝒦^{a2} |κ1 − κ2|,

  • A8: there exist positive constants a3, A3, A4 such that, given any ε > 0, 𝒳 can be covered by at most A3 ε^{−a3} + A4 subsets of diameter at most ε,

  • A9: Π1(ℳ(𝒳 × S^{L−1}) × (n^a, ∞)) is exponentially small for some a < (a1 a3)^{−1},

the posterior probability of any total variation neighborhood of ft converges to 1 almost surely.

Given the training data, we can classify a new subject based on their features using the posterior mean classification function p̂. As a corollary to Theorem 2.3, we show that p̂ converges to pt in an L1 sense.

Corollary 2.4

(a) Strong consistency for the posterior of f implies that

Π{f : max_{y∈𝒴} ∫_𝒳 |p(y, x) − pt(y, x)| gt(x) λ1(dx) < ε | x^n, y^n}  (2.2)

converges to 1 as n → ∞ a.s. (b) Under assumptions A4–A5 on ft, this implies that

Π{f : max_{y∈𝒴} ∫_𝒳 |p(y, x) − pt(y, x)| w(x) λ1(dx) < ε | x^n, y^n}

converges to 1 a.s. for any non-negative function w with sup_x w(x) < ∞.

Remark 2.3

Part (a) of Corollary 2.4 holds even when 𝒳 is non-compact; it requires only strong posterior consistency for the joint model.

From part (b) of Corollary 2.4, it would seem intuitive that point-wise posterior consistency can be obtained for the predictive probability function. However, this is not immediate because the convergence rate may depend on the choice of w.

Assumption A9 is hard to satisfy, especially when the feature space is high dimensional; this type of problem was noted by [42] in a different setting. In that case a1 and a3 turn out to be very large, so the prior is required to have very light tails and to place small mass at high precisions. This is undesirable in applications; instead we can let Π1 depend on the sample size n and obtain weak and strong consistency under weaker assumptions.

Theorem 2.5

Let Π1 = Π11 ⊗ πn, where πn is a sequence of densities on ℜ+. Assume the following.

  • A10:

    The prior Π11 has full support.

  • A11: for any β > 0, there exists a κ0 ≥ 0 such that for all κ ≥ κ0,
    lim inf_{n→∞} exp(nβ) πn(κ) = ∞,

  • A12: for some β0 > 0 and a < (a1 a3)^{−1},
    lim_{n→∞} exp(nβ0) πn{(n^a, ∞)} = 0.
    1. Under assumptions A1A2 on the kernel, A10A11 on the prior and A4A5 on ft, the posterior probability of any weak neighborhood of ft converges to one a.s.

    2. Under assumptions A1A2, A4A8 and A10A12, the posterior probability of any total variation neighborhood of ft converges to 1 a.s.

The proof is similar to those of Theorems 2.6 and 2.9 in [6] and hence is omitted. With Π11 = DP(w0 P0) and πn = Gam(a, bn), the conditions in Theorem 2.5 are satisfied (for example) when P0 has full support and bn = b1 n/{log(n)}^{b2} for any b1, b2 > 0. Then from Corollary 2.4, we have L1 consistency of the estimated classification function.

2.3. Computation

Given the training sample (xn, yn), we classify a new subject based on the predictive probability of allocating it to category j, which is expressed as

Pr(y_{n+1} = j | x_{n+1}, x^n, y^n),  j ∈ 𝒴,  (2.3)

where x_{n+1} denotes the feature for the new subject and y_{n+1} its unknown class label. It follows from Theorem 2.1 and Corollary 2.4 that the classification rule is consistent if the kernel and prior are chosen appropriately. Following the recommendation in §2.2, for the prior we let P ~ DP(w0 P0) independently of κ ~ π, with P0 = P01 ⊗ P02, P01 a distribution on 𝒳, P02 a Dirichlet distribution Diri(a) (a = (a1, …, aL)) on S^{L−1}, and π a base distribution on ℜ+.

Since it is not possible to get a closed form expression for the predictive probability, we use a Markov chain Monte Carlo (MCMC) algorithm. Specifically, we rely on a simple-to-implement Gibbs sampling algorithm that utilizes slice sampling. Slice sampling was proposed for posterior computation in DPMs by [38], with efficiency later improved by Papaspiliopoulos (2008, unpublished technical report). Our approach is related to that used by [43], with [28] providing a general class of algorithms for slice sampling in mixture models.

Using the stick-breaking representation of [34] and introducing cluster allocation indices S = (S1, …, Sn) (Si ∈ {1, …, ∞}), the generative model (2.1) can be expressed in hierarchical form as

x_i ~ K(·; μ_{S_i}, κ),  y_i ~ Multi({1, …, L}; ν_{S_i}),  S_i ~ Σ_{j=1}^{∞} w_j δ_j,  θ_j = (μ_j, ν_j),  (2.4)

where w_j = V_j Π_{h<j}(1 − V_h) is the probability that subject i is allocated to cluster S_i = j, θ_j is the vector of parameters specific to cluster j, and V_j ~ Beta(1, w0), μ_j ~ P01 and ν_j ~ Diri(a) are mutually independent for j = 1, …, ∞.
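The stick-breaking weights in (2.4) are easy to simulate; a sketch of the first J sticks:

```python
import numpy as np

def stick_breaking(w0, J, rng):
    """Sample the first J weights w_j = V_j * prod_{h<j} (1 - V_h) with
    V_j ~ Beta(1, w0) iid, per the representation of [34]."""
    V = rng.beta(1.0, w0, size=J)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))  # prod_{h<j}(1 - V_h)
    return V * remaining

w = stick_breaking(w0=1.0, J=50, rng=np.random.default_rng(2))
```

The leftover mass 1 − Σ_{j≤J} w_j belongs to the infinitely many sticks not yet drawn, which is why the sampler below only instantiates sticks up to a data-driven truncation.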

The joint posterior density of V = {V_j}_{j=1}^{∞}, θ = {θ_j}_{j=1}^{∞} = (μ, ν), S and κ given the training data is proportional to

{Π_{i=1}^{n} K(x_i; μ_{S_i}, κ) ν_{S_i y_i} w_{S_i}} {Π_{j=1}^{∞} Beta(V_j; 1, w0) P01(dμ_j) Diri(ν_j; a)} π(κ).

To avoid the need for posterior computation for infinitely many unknowns, we introduce slice sampling latent variables u = {u_i}_{i=1}^{n} drawn iid from Unif(0, 1), such that the augmented posterior density is proportional to

π(u, V, θ, S, κ | x^n, y^n) ∝ {Π_{i=1}^{n} K(x_i; μ_{S_i}, κ) ν_{S_i y_i} I(u_i < w_{S_i})} × {Π_{j=1}^{∞} Beta(V_j; 1, w0) P01(dμ_j) Diri(ν_j; a)} π(κ).  (2.5)

Letting Smax = max{Si}, the conditional posterior distribution of {Vj, θj, j > Smax} is the same as the prior, and we can use this to bypass the need for updating infinitely-many unknowns in the Gibbs sampler. After choosing initial values, the sampler iterates through the following steps.

  1. Update S_i, for i = 1, …, n, given (u, V, θ, κ, x^n, y^n) by sampling from the multinomial distribution with
    Pr(S_i = j) ∝ K(x_i; μ_j, κ) ν_{j y_i}  for j ∈ A_i = {j : 1 ≤ j ≤ J, w_j > u_i},

    with J the smallest index satisfying 1 − min(u) < Σ_{j=1}^{J} w_j. In implementing this step, draw V_j ~ Beta(1, w0) and θ_j ~ P0 for j > Smax as needed.

  2. Update the scale parameter κ by sampling from its full conditional posterior, which is proportional to
    π(κ) Π_{i=1}^{n} K(x_i; μ_{S_i}, κ).

    If direct sampling is not possible, rejection sampling or Metropolis-Hastings (MH) sampling can be used.

  3. Update the atoms θ_j, j = 1, …, Smax, from the full conditional posterior distribution, which is equivalent to independently sampling from
    π(μ_j | −) ∝ P01(dμ_j) Π_{i: S_i=j} K(x_i; μ_j, κ),
    (ν_j | −) ~ Diri(a1 + Σ_{i: S_i=j} I(y_i = 1), …, aL + Σ_{i: S_i=j} I(y_i = L)).

    If P01 is not conjugate, then rejection or MH sampling can be used to update μj.

  4. Update the stick-breaking random variables V_j, for j = 1, …, Smax, from their conditional posterior distributions given the cluster allocation S but marginalizing out the slice sampling latent variables u. In particular,
    V_j ~ Beta(1 + Σ_i I(S_i = j), w0 + Σ_i I(S_i > j)).
  5. Update the slice sampling latent variables from their conditional posterior by letting
    u_i ~ Unif(0, w_{S_i}),  i = 1, …, n.
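The truncation level J in step 1, the smallest index with 1 − min(u) < Σ_{j≤J} w_j, can be located with a cumulative sum; a sketch, assuming enough sticks have already been drawn to cover the required mass:

```python
import numpy as np

def truncation_level(w, u):
    """Smallest J with sum_{j<=J} w_j > 1 - min(u). Beyond J, every w_j is
    below every u_i, so each slice set A_i = {j <= J : w_j > u_i} is
    already complete."""
    target = 1.0 - np.min(u)
    cumw = np.cumsum(w)
    J = int(np.searchsorted(cumw, target, side="right")) + 1
    if J > len(w):
        raise ValueError("draw more sticks before truncating")
    return J
```

For example, with weights (0.5, 0.3, 0.15, 0.05) and slice variables (0.2, 0.4), the target mass is 0.8 and the smallest index whose cumulative weight strictly exceeds it is J = 3.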

These steps are repeated for a large number of iterations, with a burn-in discarded to allow convergence. Given a draw from the posterior, the predictive probability of allocating a new observation to category l, l ∈ 𝒴, as defined through (2.3), is proportional to

Σ_{j=1}^{Smax} w_j ν_{jl} K(x_{n+1}; μ_j, κ) + w_{Smax+1} ∫_{𝒳×S^{L−1}} ν_l K(x_{n+1}; μ, κ) P0(dμ dν),  (2.6)

where w_{Smax+1} = 1 − Σ_{j=1}^{Smax} w_j. We can average these conditional predictive probabilities across the MCMC iterations after burn-in to estimate the predictive probabilities. For moderate to large numbers of training samples n, Σ_{j=1}^{Smax} w_j ≈ 1 with high probability, so an accurate approximation can be obtained by setting the final term to zero, bypassing the need to calculate the integral.
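Per posterior draw, the truncated version of (2.6) (with the final P0 term dropped) is a weighted kernel sum over occupied clusters; a sketch using an unnormalized von Mises kernel on the circle as an illustrative K, whose normalizing constant cancels once the scores are normalized over labels:

```python
import numpy as np

def predictive_probs(w, nu, mu, kappa, x_new):
    """Conditional predictive probabilities, proportional to
    sum_{j<=Smax} w_j * nu_{jl} * K(x_new; mu_j, kappa), normalized over l.
    exp(kappa * cos(x - mu)) is the von Mises kernel up to a constant."""
    k = np.exp(kappa * np.cos(x_new - np.asarray(mu)))          # K(x_new; mu_j, kappa)
    scores = (np.asarray(w)[:, None] * np.asarray(nu) * k[:, None]).sum(axis=0)
    return scores / scores.sum()

# Two clusters: one near x_new favoring label 1, one far away favoring label 2.
probs = predictive_probs(w=[0.6, 0.4], nu=[[0.9, 0.1], [0.1, 0.9]],
                         mu=[0.0, np.pi], kappa=10.0, x_new=0.05)
```

Averaging such vectors over post-burn-in draws gives the Monte Carlo estimate of (2.3).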

3. Nonparametric Bayes Testing

3.1. Hypotheses and Bayes factor

The previous section focused on the classification problem in which there is interest in predicting a class label yn+1 for a new subject based on training data (xn, yn). In many applications, the primary focus is instead on conducting inferences on differences in the distribution of features X across groups Y. For example, Y may correspond to a patient’s ethnicity and X to health outcomes in an epidemiology study, with interest being in assessing evidence in the data of differences in the distribution of health outcomes across ethnic groups. This testing problem is somewhat orthogonal to the classification problem considered in §2.3, with classification being of primary interest in certain applications and testing of primary interest in others. However, as we demonstrate in this section, the models, theory and computational algorithms we developed for the classification problem can be adapted to accommodate testing. Although testing of group differences is one of the canonical problems in statistics, the literature addressing this problem from a nonparametric Bayes perspective while providing theoretical guarantees is essentially non-existent. Hence, the material in this section is a major contribution of the paper, which should hopefully stimulate additional research.

Although our methods can allow testing of pairwise differences between groups, for simplicity of exposition we focus on the case in which the null hypothesis corresponds to homogeneity across the groups. Formally, the alternative hypothesis H1 corresponds to any joint density in 𝒟(𝒳 × 𝒴) excluding densities of the form

H0:f(x,y)=g(x)p(y) (3.1)

for all (x, y) outside of a λ-null set. Note that model (2.1) will in general assign zero probability to H0, and hence is an appropriate model for the joint density under H1.

As a model for the joint density under the null hypothesis H0 in (3.1), we replace P(dμ dν) in (2.1) with P1(dμ) ⊗ P2(dν), so that the joint density becomes

f(x, y; P1, P2, κ) = g(x; P1, κ) p(y; P2),  where  (3.2)
g(x; P1, κ) = ∫_𝒳 K(x; μ, κ) P1(dμ),  p(y; P2) = ∫_{S^{L−1}} ν_y P2(dν).  (3.3)

We set priors Π1 and Π0 for the parameters in the models under H1 and H0, respectively. The Bayes factor in favor of H1 over H0 is then the ratio of the marginal likelihoods under H1 and H0,

BF(H1 : H0) = [∫_{ℳ(𝒳×S^{L−1})×ℜ+} Π_{i=1}^{n} f(x_i, y_i; P, κ) Π1(dP dκ)] / [∫_{ℳ(𝒳)×ℳ(S^{L−1})×ℜ+} Π_{i=1}^{n} g(x_i; P1, κ) p(y_i; P2) Π0(dP1 dP2 dκ)].

The priors should be suitably constructed so that we get consistency of the Bayes factor and computation is straightforward and efficient. [3] propose an approach for calculating Bayes factors for comparing Dirichlet process mixture (DPM) models, but their algorithm is quite involved to implement and is limited to DPM models with a single DP prior on an unknown mixture distribution. Simple conditions for consistency of Bayes factors for testing a point null versus a nonparametric alternative have been provided by [11], but there has been limited work on consistency of Bayes tests in more complex cases, such as we are faced with here. [21] develop general theory on nonparametric Bayesian model selection and averaging, with their Corollary 3.1 providing conditions for Bayes factor consistency. [29] shows Bayes factor and model selection consistency in the setting of semiparametric linear models with nonparametric residual distributions. Our theory takes a different direction appropriate to the group testing problem.

The prior Π1 on (P, κ) under H1 can be constructed as in §2. To choose a prior Π0 for (P1, P2, κ) under H0, we take (P1, κ) to be independent of P2 so that the marginal likelihood becomes a product of the X and Y marginals if H0 is true. Dependence in the priors for the mixing measures would induce dependence between the X and Y densities, and it is important to maintain independence under H0. In addition, as in parametric Bayes model selection [10], it is important to maintain compatibility of the prior specifications under H0 and H1. In the parametric literature, compatibility is often obtained by first specifying a prior for the unknowns in the larger model and then marginalizing to induce priors for the unknowns in smaller nested models. Following this idea, we let the prior for (P1, κ) under H0 correspond to the marginal on (P1, κ) from the prior Π1 on (P, κ) assuming P = P1 ⊗ P2. It remains to specify a prior for P2 under H0. Expression (3.3) suggests that under H0 the density of Y depends on P2 only through

p = (p(1; P2), p(2; P2), …, p(L; P2))′ ∈ S^{L−1}.

Hence, it is sufficient to choose a prior for p, such as Diri(b) with b = (b1, …, bL)′, instead of specifying a full prior for P2. For compatibility, the Dirichlet hyperparameters can potentially be chosen to approximately match the first two moments of the marginal prior on p induced from Π1.

Under prior Π0, noting that P = P1 ⊗ P2, the marginal likelihood under H0 is

∫_{ℳ(𝒳×S^{L−1})×ℜ+} Π_{i=1}^{n} g(x_i; P1, κ) Π1(dP dκ) ∫_{S^{L−1}} Π_{j=1}^{L} p_j^{Σ_{i=1}^{n} I(y_i=j)} Diri(dp; b)
= [D(b_n)/D(b)] ∫_{ℳ(𝒳×S^{L−1})×ℜ+} Π_{i=1}^{n} g(x_i; P1, κ) Π1(dP dκ),  (3.4)

with b_n being the L-dimensional vector with jth coordinate b_j + Σ_{i=1}^{n} I(y_i = j), 1 ≤ j ≤ L, D being the normalizing constant of the Dirichlet distribution, given by D(a) = Π_{j=1}^{L} Γ(a_j) / Γ(Σ_{j=1}^{L} a_j), and Γ denoting the gamma function. The marginal likelihood under H1 is

∫_{ℳ(𝒳×S^{L−1})×ℜ+} Π_{i=1}^{n} f(x_i, y_i; P, κ) Π1(dP dκ).  (3.5)

The Bayes factor in favor of H1 against H0 is the ratio of the marginal likelihood (3.5) to (3.4).
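The ratio D(b_n)/D(b) appearing in (3.4) is best computed on the log scale via log-gamma; a sketch (function names are illustrative):

```python
from math import lgamma, log

def log_dirichlet_norm(a):
    """log D(a) = sum_j log Gamma(a_j) - log Gamma(sum_j a_j)."""
    return sum(lgamma(aj) for aj in a) - lgamma(sum(a))

def log_null_factor(b, y, L):
    """log of D(b_n)/D(b) from (3.4), where b_n adds the observed class
    counts of y (labels 1..L) to the hyperparameter vector b."""
    counts = [sum(1 for yi in y if yi == j + 1) for j in range(L)]
    bn = [bj + cj for bj, cj in zip(b, counts)]
    return log_dirichlet_norm(bn) - log_dirichlet_norm(b)
```

For instance, with b = (1, 1) and one observation in each of two classes, D(b) = 1 and D(b_n) = Γ(2)Γ(2)/Γ(4) = 1/6.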

3.2. Consistency of the Bayes factor

Let Π be the prior induced on the space of all densities 𝒟(𝒳 × 𝒴) through Π1. For any density f(x, y), let g(x) = Σ_j f(x, j) denote the marginal density of X, while p(y) = ∫_𝒳 f(x, y) λ1(dx) denotes the marginal probability vector of Y. Let ft, gt and pt be the corresponding values for the true distribution of (X, Y). The Bayes factor in favor of the alternative, as obtained in the last section, can be expressed as

BF = [D(b)/D(b_n)] ∫ Π_i f(x_i, y_i) Π(df) / ∫ Π_i g(x_i) Π(df).  (3.6)

Theorem 3.1 proves consistency of the Bayes factor at an exponential rate if the alternative hypothesis of dependence holds.

Theorem 3.1

If X and Y are not independent under the true density ft, and the prior Π satisfies the KL condition at ft, then there exists a β0 > 0 for which lim inf_{n→∞} exp(−nβ0) BF = ∞ a.s. ft.

3.3. Computation

Seemingly, one of the major reasons for the lack of methodology literature on Bayesian nonparametric testing is the substantial computational hurdles involved in accurately calculating Bayes factors for comparing hypotheses. Except in very simple models, marginal likelihoods cannot be calculated in closed form, accurate analytic approximations are not available and Monte Carlo methods are prohibitively expensive computationally when run sufficiently long to produce an accurate estimate. Outside of conjugate models, one of the most successful strategies for calculating Bayes factors based on MCMC algorithms relies on an encompassing approach in which a single MCMC algorithm is designed, which moves freely between the models. There has been limited success in designing such algorithms for nonparametric Bayes testing, but we devise an efficient data augmentation algorithm that appears to accurately produce Bayes factors and posterior hypothesis probabilities based on a single MCMC chain.

We introduce a latent variable z = I(H1 is true) which takes value 1 if H1 is true and 0 if H0 is true. Assuming equal prior probabilities for H0 and H1, the conditional likelihood of (xn, yn) given z is

Π(x^n, y^n | z = 0) = [D(b_n)/D(b)] ∫ ∏_{i=1}^n g(x_i; P_1, κ) Π_1(dP dκ)  and  Π(x^n, y^n | z = 1) = ∫ ∏_{i=1}^n f(x_i, y_i; P, κ) Π_1(dP dκ).

In addition, the Bayes factor can be expressed as

BF = Pr(z = 1 | x^n, y^n) / Pr(z = 0 | x^n, y^n).  (3.7)

Next introduce latent parameters μ, ν, V, S, κ as in §2.3 such that

Π(x^n, y^n, μ, V, S, κ, z = 0) = [D(b_n)/D(b)] π(κ) ∏_{i=1}^n {w_{S_i} K(x_i; μ_{S_i}, κ)} × ∏_{j=1}^∞ {Be(V_j; 1, w_0) P_{01}(dμ_j)},  (3.8)
Π(x^n, y^n, μ, ν, V, S, κ, z = 1) = π(κ) ∏_{i=1}^n {w_{S_i} ν_{S_i y_i} K(x_i; μ_{S_i}, κ)} × ∏_{j=1}^∞ {Be(V_j; 1, w_0) P_0(dμ_j dν_j)}.  (3.9)

Marginalize out ν from equation (3.9) to get

Π(x^n, y^n, μ, V, S, κ, z = 1) = π(κ) ∏_{j=1}^∞ [D(a + ã_j(S))/D(a)] × ∏_{i=1}^n {w_{S_i} K(x_i; μ_{S_i}, κ)} ∏_{j=1}^∞ {Be(V_j; 1, w_0) P_{01}(dμ_j)},  (3.10)

with ã_j(S), 1 ≤ j < ∞, being L-dimensional vectors with lth coordinate Σ_{i: S_i = j} I(y_i = l), l ∈ Y. Integrating out z by adding equations (3.8) and (3.10), the joint distribution of (μ, V, S, κ) given the data becomes

Π(μ, V, S, κ | x^n, y^n) ∝ {C_0 + C_1(S)} π(κ) ∏_{i=1}^n {w_{S_i} K(x_i; μ_{S_i}, κ)} × ∏_{j=1}^∞ {Be(V_j; 1, w_0) P_{01}(dμ_j)},  with  C_0 = D(b_n)/D(b)  and  C_1(S) = ∏_{j=1}^∞ D(a + ã_j(S))/D(a).  (3.11)

To estimate the Bayes factor, first make repeated draws from the posterior in (3.11). For each draw, compute the posterior probability distribution of z from equations (3.8) and (3.10) and take their average after discarding a suitable burn-in. The averages estimate the posterior distribution of z given the data, from which we can get an estimate for BF from (3.7). The sampling steps are accomplished as follows

  1. Update the cluster labels S given (μ, V, κ) and the data from their joint posterior which is proportional to
    {C_0 + C_1(S)} π(κ) ∏_{i=1}^n {w_{S_i} K(x_i; μ_{S_i}, κ)}.  (3.12)

    Introduce slice sampling latent variables u as in §2.3 and replace w_{S_i} by I(u_i < w_{S_i}) to make the total number of possible states finite. However, unlike in §2.3, the S_i are no longer conditionally independent. We propose a Metropolis-Hastings block update step in which a candidate for (S_1, …, S_n), or some subset of this vector if n is large, is sampled independently from multinomials with Pr(S_i = j) ∝ K(x_i; μ_j, κ) for j ∈ A_i, where A_i = {j: 1 ≤ j ≤ J, w_j > u_i} and J is the smallest index satisfying 1 − min_i(u_i) < Σ_{j=1}^J w_j. In implementing this step, draw V_j ~ Be(1, w_0) and μ_j ~ P_{01} for j > S_max as needed. The acceptance probability is simply the ratio of C_0 + C_1(S) evaluated at the candidate and current values of S.

  2. Update κ, {μ_j}_{j=1}^{S_max}, {V_j}_{j=1}^{S_max} and {u_i}_{i=1}^n as in Steps (2) – (5) of the algorithm in §2.3.

  3. Compute the full conditional posterior distribution of z which is given by
    Pr(z | μ, S, x^n, y^n) ∝ D(b_n)/D(b) if z = 0, and ∝ ∏_{j=1}^{S_max} D(a + ã_j(S))/D(a) if z = 1.
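Both the acceptance ratio in Step 1 and the BF estimate in (3.7) depend only on C_0 and C_1(S). The sketch below is an illustrative implementation under that observation (helper names are ours; `log_c0` = log D(b_n) − log D(b) is assumed precomputed, and the infinite product in C_1(S) reduces to the occupied clusters, since empty clusters contribute D(a)/D(a) = 1):

```python
import numpy as np
from scipy.special import gammaln

def log_D(a):
    """log Dirichlet normalizing constant: log D(a) = sum_l log Gamma(a_l) - log Gamma(sum_l a_l)."""
    return np.sum(gammaln(a)) - gammaln(np.sum(a))

def log_C1(S, y, a, L):
    """log C1(S) = sum over occupied clusters j of log[D(a + a_tilde_j(S)) / D(a)],
    where a_tilde_j(S) has lth coordinate sum_{i: S_i = j} I(y_i = l)."""
    total = 0.0
    for j in np.unique(S):
        counts = np.bincount(y[S == j], minlength=L).astype(float)
        total += log_D(a + counts) - log_D(a)
    return total

def accept_prob(S_cur, S_prop, y, a, L, log_c0):
    """MH acceptance probability for the block update of S:
    the ratio of C0 + C1(S) at the candidate and current label vectors."""
    num = np.logaddexp(log_c0, log_C1(S_prop, y, a, L))
    den = np.logaddexp(log_c0, log_C1(S_cur, y, a, L))
    return min(1.0, np.exp(num - den))

def estimate_bf(log_c0, log_c1_draws, burn_in=0):
    """BF estimate via (3.7): average Pr(z = 1 | draw) = C1/(C0 + C1) over
    saved draws of S, then form the ratio of averaged z-probabilities."""
    lc1 = np.asarray(log_c1_draws, dtype=float)[burn_in:]
    p1 = 1.0 / (1.0 + np.exp(log_c0 - lc1))   # Pr(z = 1) per draw, stable in logs
    return np.mean(p1) / np.mean(1.0 - p1)
```

Note that the multinomial proposal probabilities cancel in the acceptance ratio, which is why only C_0 + C_1(S) appears.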

4. Application to the unit sphere Sd

4.1. vMF Kernel Mixture Models

For classification using predictors X lying on the hypersphere

X = S^d = {x ∈ ℝ^{d+1}: ||x||^2 ≡ Σ_{j=1}^{d+1} x_j^2 = 1},

we recommend using a von Mises-Fisher (vMF) kernel in the mixture model (2.1) to induce a prior over D(S^d × Y). Although the other distributions on S^d described in [30] could be used, the vMF kernel provides a relatively simple and computationally tractable choice. As shown in Proposition 4.1, this kernel also satisfies the assumptions in §2.2 for building a flexible joint density model and for posterior consistency. For a proof, see [6].

The vMF distribution has the density ([37], [18], [39])

vMF(x; μ, κ) = c^{-1}(κ) exp(κ x^T μ),  x, μ ∈ S^d, κ ∈ [0, ∞),  (4.1)

with respect to the invariant volume form on S^d, where

c(κ) = 2π^{d/2} Γ(d/2)^{-1} ∫_{-1}^{1} exp(κt)(1 − t^2)^{d/2 − 1} dt

is its normalizing constant. This distribution has a unique extrinsic mean (as defined in [8]) equal to μ, so that μ can be interpreted as the kernel location. The parameter κ is a measure of concentration: κ = 0 corresponds to the uniform distribution, while as κ diverges to ∞ the distribution converges weakly to δ_μ, uniformly in μ. Sampling from the vMF is straightforward using results in [36] and [40].
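As an illustration of how such draws can be generated, the following is a minimal rejection sampler in the style of the Ulrich/Wood algorithms for the vMF; it is a sketch under that assumption, not necessarily the exact scheme of [36] or [40]. It samples the cosine t = x^T μ by rejection and rotates the pole to μ with a Householder reflection.

```python
import numpy as np

def sample_vmf(mu, kappa, n, rng=None):
    """Draw n samples from vMF(mu, kappa) on the unit sphere in R^p, p = len(mu)."""
    rng = np.random.default_rng(rng)
    mu = np.asarray(mu, dtype=float)
    p = mu.size                      # ambient dimension; the sphere is S^{p-1}
    d = p - 1
    if kappa == 0:                   # kappa = 0 is the uniform distribution
        x = rng.standard_normal((n, p))
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    # rejection envelope for the cosine t = x^T mu (Wood-style)
    b = (-2 * kappa + np.sqrt(4 * kappa**2 + d**2)) / d
    x0 = (1 - b) / (1 + b)
    c = kappa * x0 + d * np.log(1 - x0**2)
    out = np.empty((n, p))
    for i in range(n):
        while True:
            z = rng.beta(d / 2, d / 2)
            t = (1 - (1 + b) * z) / (1 - (1 - b) * z)
            if kappa * t + d * np.log(1 - x0 * t) - c >= np.log(rng.uniform()):
                break
        v = rng.standard_normal(d)   # uniform direction orthogonal to the pole
        v /= np.linalg.norm(v)
        out[i] = np.concatenate(([t], np.sqrt(1 - t**2) * v))
    # rotate the pole e1 to mu via the Householder reflection H = I - 2 u u^T
    e1 = np.zeros(p); e1[0] = 1.0
    u = e1 - mu
    if np.linalg.norm(u) > 1e-12:
        u /= np.linalg.norm(u)
        out = out - 2 * np.outer(out @ u, u)
    return out
```

For large κ the samples concentrate tightly around μ, consistent with the weak convergence to δ_μ noted above.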

Proposition 4.1

(a) The vMF kernel K defined in (4.1) satisfies assumptions A1 and A2. It satisfies A6 with a_1 = d/2 + 1 and A7 with a_2 = d/2. The compact metric space S^d endowed with the chord distance satisfies A8 with a_3 = d.

In the sequel, we will apply the general methods for classification and testing developed earlier in the paper to hyperspherical features.

4.2. MCMC Details

When features lie on Sd and we choose vMF kernels and priors as in §2.3, simplifications result in the MCMC steps for updating κ and μ. Letting P01 = vMF(μ0, κ0), we obtain conjugacy for the full conditional of the kernel locations,

(μ_j | −) ~ vMF(μ̄_j, κ_j)

with μ̄_j = v_j/||v_j||, κ_j = ||v_j||, v_j = κ_0 μ_0 + κ X_j and X_j = Σ_i x_i I(S_i = j). As a default, we can set μ_0 equal to the sample extrinsic mean of x^n, or we can choose κ_0 = 0 to induce a uniform P_{01}. The full conditional posterior of κ is
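The conjugate location update then amounts to computing (μ̄_j, κ_j) from the cluster data; a minimal sketch with illustrative names (the resulting vMF can be drawn from with any vMF sampler):

```python
import numpy as np

def mu_full_conditional(x, S, j, kappa, mu0, kappa0):
    """Parameters of the vMF full conditional of mu_j:
    v_j = kappa0 * mu0 + kappa * X_j, with X_j = sum_i x_i I(S_i = j);
    posterior mean direction v_j / ||v_j||, posterior concentration ||v_j||."""
    Xj = x[S == j].sum(axis=0)          # X_j = sum of observations in cluster j
    vj = kappa0 * np.asarray(mu0, dtype=float) + kappa * Xj
    norm = np.linalg.norm(vj)
    return vj / norm, norm              # (mu_bar_j, kappa_j)
```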

π(κ | −) ∝ π(κ) c^{-n}(κ) κ^{-nd/2} e^{nκ} · κ^{nd/2} exp{−κ(n − Σ_i x_i^T μ_{S_i})}.  (4.2)

If we set

π(κ) ∝ c^n(κ) κ^{a + nd/2 − 1} e^{−κ(n + b)},  a, b > 0,  (4.3)

the posterior simplifies to

Gam(a + nd/2, b + n − Σ_i x_i^T μ_{S_i}).  (4.4)
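With the prior (4.3), updating κ therefore reduces to a single gamma draw from (4.4); a sketch (note that numpy parameterizes the gamma by shape and scale, so the rate in (4.4) enters as its reciprocal):

```python
import numpy as np

def draw_kappa(x, mu, S, a, b, d, rng=None):
    """Gibbs draw of kappa from (4.4): Gam(a + n d / 2, b + n - sum_i x_i^T mu_{S_i})."""
    rng = np.random.default_rng(rng)
    n = x.shape[0]
    resid = n - np.sum(np.einsum('ij,ij->i', x, mu[S]))  # n - sum_i x_i . mu_{S_i}
    shape = a + n * d / 2.0
    rate = b + resid
    return rng.gamma(shape, scale=1.0 / rate)
```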

We can make the MCMC more efficient by marginalizing out μ while updating κ. In particular

π(κ | S, −) ∝ c^{-n}(κ) π(κ) ∏_{j: m_S(j) > 0} c(||κ X_j + κ_0 μ_0||)

with m_S(j) = Σ_i I(S_i = j). This is easy to compute in the case d = 2, κ_0 = 0 and π ~ Gam(a, b), when it simplifies to

π(κ | S, −) ∝ Gam(κ; n − Σ_j I(m_S(j) > 0) + a, n − Σ_j ||X_j|| + b) × {1 − exp(−2κ)}^{−n} ∏_{j: m_S(j) > 0} {1 − exp(−2κ||X_j||)}.

For this choice, we suggest a Metropolis-Hastings proposal that corresponds to the gamma component. This leads to a high acceptance probability when κ is large, and has performed well in the general cases we have considered.
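This Metropolis-Hastings step can be sketched as follows for d = 2, κ_0 = 0 and π ~ Gam(a, b): the gamma factor above serves as an independence proposal, so only the remaining correction factors enter the acceptance ratio (illustrative code, not the authors' implementation):

```python
import numpy as np

def mh_update_kappa(kappa, resultant_norms, n, a, b, rng=None):
    """One MH update of kappa for d = 2, kappa0 = 0, pi ~ Gam(a, b).
    Target: Gam(kappa; alpha, beta) * h(kappa); proposing from the gamma part
    leaves only h in the acceptance ratio.
    resultant_norms : ||X_j|| for the occupied clusters j."""
    rng = np.random.default_rng(rng)
    r = np.asarray(resultant_norms, dtype=float)
    alpha = n - r.size + a              # n - #occupied clusters + a
    beta = n - r.sum() + b              # n - sum_j ||X_j|| + b
    def log_h(k):
        # h(k) = (1 - e^{-2k})^{-n} * prod_j (1 - e^{-2k ||X_j||})
        return -n * np.log1p(-np.exp(-2 * k)) + np.sum(np.log1p(-np.exp(-2 * k * r)))
    prop = rng.gamma(alpha, scale=1.0 / beta)
    if np.log(rng.uniform()) < log_h(prop) - log_h(kappa):
        return prop
    return kappa
```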

In the predictive probability expression in (2.6), the integral simplifies to

(a_j/Σ_i a_i) c^{-1}(κ) c^{-1}(κ_0) c(||κ x_{n+1} + κ_0 μ_0||).

5. Simulation Examples

5.1. Classification

We draw iid samples on S9 × Y, Y = {1, 2, 3}, from

f_t(x, y) = (1/3) Σ_{l=1}^3 I(y = l) vMF(x; μ_l, 200)

where μ_1 = (1, 0, …)^T, μ_j = cos(0.2) μ_1 + sin(0.2) v_j for j = 2, 3, v_2 = (0, 1, 0, …)^T and v_3 = (0, 0.5, √0.75, 0, …)^T. Hence the three response classes y ∈ {1, 2, 3} are equally likely, and the distribution of the features within each class is a vMF on S9 with a distinct location parameter. We purposely chose the separation between the kernel locations to be small, so that the classification task is challenging.
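The geometry of these kernel locations is easy to verify numerically; the sketch below constructs μ_1, μ_2, μ_3 on S9 and checks that each is a unit vector at geodesic distance 0.2 from μ_1:

```python
import numpy as np

mu1 = np.zeros(10); mu1[0] = 1.0                      # (1, 0, ..., 0)
v2 = np.zeros(10); v2[1] = 1.0                        # (0, 1, 0, ...)
v3 = np.zeros(10); v3[1], v3[2] = 0.5, np.sqrt(0.75)  # (0, 0.5, sqrt(0.75), 0, ...)
mu2 = np.cos(0.2) * mu1 + np.sin(0.2) * v2
mu3 = np.cos(0.2) * mu1 + np.sin(0.2) * v3

for m in (mu1, mu2, mu3):
    assert abs(np.linalg.norm(m) - 1.0) < 1e-12       # all lie on the sphere
# geodesic distance from mu1 is arccos of the inner product
assert abs(np.arccos(mu1 @ mu2) - 0.2) < 1e-10
assert abs(np.arccos(mu1 @ mu3) - 0.2) < 1e-10
# mu2 . mu3 = cos^2(0.2) + 0.5 sin^2(0.2), so mu2 and mu3 are slightly
# closer to each other than either is to mu1
assert np.arccos(mu2 @ mu3) < 0.2
```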

We implemented the approach described in §2.3 to perform nonparametric Bayes classification. The hyperparameters were chosen to be w_0 = 1, P_0 = vMF(μ_n, 10) ⊗ Diri(1, 1, 1), with μ_n the feature sample extrinsic mean, and π as in (4.3) with a = 1, b = 0.1. Cross-validation was used to assess classification performance, with posterior computation applied to a training sample of size 200 and the results used to predict y given the x values for subjects in a test sample of size 100. The MCMC algorithm was run for 5 × 10⁴ iterations after a 10⁴ iteration burn-in. Based on examination of trace plots for the predictive probabilities of y for representative test subjects, the proposed algorithm exhibits good rates of convergence and mixing. Note that we purposely avoid examining trace plots for component-specific parameters due to label switching. The out-of-sample misclassification rates for categories y = 1, 2 and 3 were 18.9%, 9.7% and 12.5%, respectively, with an overall rate of 14%.

As an alternative method for flexible model-based classification, we considered a discriminant analysis approach, which models the conditional density of x given y as a finite mixture of 10-dimensional Gaussians. In the literature it is very common to treat data lying on a hypersphere as if the data had support in a Euclidean space to simplify the analysis. Using the EM algorithm to fit the finite mixture model, we encountered singularity problems when allowing more than two Gaussian components per response class. Hence, we present the results only for mixtures of one or two multivariate Gaussian components. In the one component case, we obtained class-specific misclassification rates of 27%, 12.9% and 18.8%, with the overall rate being 20%. The corresponding results for the two component mixture were 21.6%, 16.1% and 28.1% with an overall misclassification rate of 22%.

Hence, the results from a parametric Gaussian discriminant analysis and a mixture of Gaussians classifier were much worse than those for our proposed Bayesian nonparametric approach. There are several possible factors contributing to the improvement in performance. Firstly, the discriminant analysis approach requires separate fitting of different mixture models to each of the response categories. When the amount of data in each category is small, it is difficult to reliably estimate all these parameters, leading to high variance and unstable estimates. In contrast our approach of joint modeling of ft using a DPM favors a more parsimonious representation. Secondly, inappropriately modeling the data as having support on a Euclidean space has some clear drawbacks. The size of the space over which the densities are estimated is increased from a compact subset S9 to an unbounded space ℜ10. This can lead to an inflated variance and difficulties with convergence of EM and MCMC algorithms. In addition, the properties of the approach are expected to be poor even in larger samples. As Gaussian mixtures give zero probability to the embedded hypersphere, one cannot expect strong posterior consistency.

5.2. Hypothesis Testing

We draw an iid sample of size 100 on S9 × Y, Y = {1, 2, 3}, from the distribution

f_t(x, y) = (1/3) Σ_{l=1}^3 I(y = l) Σ_{j=1}^3 w_{lj} vMF(x; μ_j, 200),

where μj, j = 1, 2, 3 are as in the earlier example and the weights {wlj} are chosen so that w11 = 1 and wlj = 0.5 for l = 2, 3 and j = 2, 3. Hence, in group y = 1, the features are drawn from a single vMF density, while in groups y = 2 and 3, the feature distributions are equally weighted mixtures of the same two vMFs.

Letting f_j denote the conditional density of X given Y = j for j = 1, 2, 3, the global null hypothesis of no difference among the three groups is H_0: f_1 = f_2 = f_3, while the alternative H_1 is that they are not all the same. We set the hyperparameters as w_0 = 1, P_0 = vMF(μ_n, 10) ⊗ Diri(a), with μ_n the X-sample extrinsic mean, b = a = (0.28, 0.36, 0.36), the sample proportions of observations from the three groups, and a prior π on κ as in (4.3) with a = 1 and b = 0.1. We ran the proposed MCMC algorithm for calculating the Bayes factor (BF) in favor of H_1 over H_0 for 6 × 10⁴ iterations, updating the cluster labels S in 4 blocks of 25 each iteration. The starting value of S was obtained by the k-means algorithm (k = 10) applied to the X component of the sample using the geodesic distance on S9, and we started with κ = 200. The trace plots exhibit a good rate of convergence of the algorithm. After discarding a burn-in of 4 × 10⁴ iterations, the estimated BF was 2.23 × 10¹⁵, suggesting strong evidence in the data in favor of H_1. We tried multiple starting points and different hyperparameter choices and found the conclusions to be robust, with the estimated BFs not exactly the same but within an order of magnitude. We also obtained similar estimates using substantially shorter and longer chains.

We can also use the proposed methodology for pairwise hypothesis testing of H_{0,ll′}: f_l = f_{l′} against the alternative H_{1,ll′}: f_l ≠ f_{l′} for any pair l, l′ with l ≠ l′. The analysis is otherwise implemented exactly as in the global hypothesis testing case. The resulting BFs in favor of H_{1,ll′} over H_{0,ll′} for the different choices of (l, l′) are shown in Table 1. We obtain very large BFs in testing differences between groups 1 and 2 and between groups 1 and 3, but a moderately small BF for testing a difference between groups 2 and 3, suggesting mild evidence that these two groups are equal. These conclusions are all consistent with the truth. We have noted a general tendency for the BF in favor of the alternative to be large when the alternative is true even in modest sample sizes, suggesting a rapid rate of convergence under the alternative, in agreement with our theoretical results. When the null is true, the BF appears to converge to zero based on empirical results in our simulations, but at a slow rate.

Table 1.

Nonparametric Bayes and frequentist test results for data simulated for three groups with the second and third groups identical.

groups    BF           p-value
(1,2,3)   2.3 × 10¹⁵   2 × 10⁻⁶
(1,2)     2.4 × 10⁴    1.8 × 10⁻⁴
(1,3)     1.7 × 10⁶    1.5 × 10⁻⁵
(2,3)     0.235        0.43

For comparison, we also considered a frequentist nonparametric test for detecting differences in the groups based on comparing the sample extrinsic means of the f_l. The test statistic has an asymptotic X²_{d(L−1)} distribution, where d = 9 is the feature space dimension and L is the number of groups being compared. It takes the expression n Σ_{j=1}^L p̂_j X̄_j^T B (B^T Σ̂ B)^{-1} B^T X̄_j, where n = 100 is the sample size, p̂_j is the sample proportion in group j, X̄_j is the X sample mean for group j, X̄ denotes the overall sample mean, Σ̂ is the sample covariance, and B is a matrix whose columns form an orthonormal basis for the tangent space of S9 at μ̂_n = X̄/||X̄||. The statistic is obtained via a Taylor expansion of the map x ↦ x/||x|| from ℜ^{d+1} to S^d. For more details, see §4.8 of [4]. The corresponding p-values are shown in Table 1. The conclusions are all consistent with those from the nonparametric Bayes approach.
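A sketch of this test statistic, built directly from the expression above (illustrative code with assumed helper names; §4.8 of [4] gives the precise estimator):

```python
import numpy as np
from scipy.stats import chi2
from scipy.linalg import null_space

def extrinsic_mean_test(x, y, L):
    """Chi-square test comparing group sample means projected onto the tangent
    space of the sphere at the overall sample extrinsic mean.
    x : (n, d+1) array of unit vectors; y : group labels in {0, ..., L-1}."""
    n, p = x.shape
    d = p - 1
    xbar = x.mean(axis=0)
    mu_hat = xbar / np.linalg.norm(xbar)   # overall sample extrinsic mean
    B = null_space(mu_hat[None, :])        # (d+1) x d orthonormal tangent basis
    Sigma = np.cov(x.T)                    # sample covariance of the features
    W = np.linalg.inv(B.T @ Sigma @ B)
    stat = 0.0
    for j in range(L):
        xj = x[y == j]
        p_hat = xj.shape[0] / n
        t = B.T @ xj.mean(axis=0)          # tangent-space coordinates of group mean
        stat += n * p_hat * t @ W @ t
    return stat, chi2.sf(stat, d * (L - 1))
```

Note that Σ_j p̂_j B^T X̄_j = B^T X̄ = 0, since X̄ is proportional to μ̂_n and hence orthogonal to the tangent space, so the statistic implicitly compares deviations of the group means from the pooled mean.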

5.3. Testing with No Differences in Mean

In this example, we draw iid samples on S2 × Y, Y = {1, 2}, from the distribution

f_t(x, y) = (1/2) Σ_{l=1}^2 I(y = l) Σ_{j=1}^3 w_{lj} vMF(x; μ_j, 200),

where the weight matrix is w = [1, 0, 0; 0, 0.5, 0.5], μ_1 = (1, 0, 0)^T, μ_j = cos(0.2) μ_1 + sin(0.2) v_j (j = 2, 3) and v_2 = −v_3 = (0, 1, 0)^T. The two groups are equally likely; within the first, the features follow a single vMF, while within the second they follow an equally weighted mixture of two different vMFs. The locations μ_j are chosen so that both groups have the same extrinsic mean μ_1.

We draw 10 samples of 50 observations each from the model f_t and carry out hypothesis testing for association between X and Y via our method and the frequentist one. The prior, hyperparameters and the algorithm for Bayes factor (BF) computation are as in the earlier example. In each case we get insignificant p-values, often over 0.5, but very high BFs, often exceeding 10⁶. The values are listed in Table 2.

Table 2.

Nonparametric Bayes and frequentist test results for 10 simulations of 50 observations each for two groups with same population means.

BF 6.1e9 6.4e8 1.3e9 4.3e8 703.1 4.4e7 42.6 4.7e6 1.9e6 379.1
p-value 1.00 0.48 0.31 0.89 0.89 0.49 0.71 0.53 0.56 0.60

The frequentist test fails because it relies on comparing the group-specific sample extrinsic means, and in this example the difference between them is negligible. Our method, on the other hand, compares the full conditional distributions and hence can detect differences that are not reflected in the means.

6. Applications

6.1. Magnetization direction data

In this example from [16], measurements of remanent magnetization in red silts and claystones were made at four locations. This yields samples of directions on the sphere S2 from four groups, with sample sizes 36, 39, 16 and 16. The goal is to compare the magnetization direction distributions across the groups and test for any significant difference. Figure 1 shows a 3D plot of the sample clouds; the plot suggests no major differences. To test this statistically, we calculate the Bayes factor (BF) in favor of the alternative, as in §5.2. As mixing was not quite as good as in the simulated examples, we implemented label switching moves and updated the cluster configurations in two blocks of sizes 54 and 53. The estimated BF was ≈ 1, suggesting no evidence in favor of the alternative hypothesis that the distribution of magnetization directions varies across locations.

Figure 1.

Figure 1

3D coordinates of 4 groups in §6.1: 1(r), 2(b), 3(g), 4(c).

To assess sensitivity to the prior specification, we repeated the analysis with different hyperparameter values of a, b equal to the proportions of samples within each group, and with P_{01} corresponding to a uniform distribution on the sphere. In addition, we tried different starting clusterings of the data, with a default choice obtained by running k-means with 10 clusters. In each case we obtained BF ≈ 1, so the results were robust.

In Example 7.7 of [17], a coordinate-based parametric test was conducted to compare mean directions in these data, producing a p-value of 1 − 1.4205 × 10⁻⁵ based on a X²_6 statistic. They also compared the mean directions for the first two groups and obtained a non-significant p-value. Repeating this two-sample test using our Bayesian nonparametric method, we obtained a Bayes factor of 1.00. The nonparametric frequentist test from §5.2 yields p-values of 0.06 and 0.38 for the two tests.

6.2. Volcano location data

The NOAA National Geophysical Data Center Volcano Location Database contains information on the locations and characteristics of volcanoes across the globe. The locations, in latitude-longitude coordinates, are plotted in Figure 2. We are interested in testing whether there is any association between the location and the type of a volcano. We consider the three most common types, Strato, Shield and Submarine, with data available for 999 volcanoes of these types worldwide. Their location coordinates are shown in Figure 3. Denoting by X the volcano location, which lies on S2, and by Y its type, which takes values in {1, 2, 3}, we compute the Bayes factor (BF) for testing whether X and Y are independent.

Figure 2.

Figure 2

Longitude-Latitude coordinates of volcano locations in §6.2.

Figure 3.

Figure 3

Coordinates of the locations of the 3 major volcano types: Strato (r), Shield (b), Submarine (g), together with the sample extrinsic mean location of each type and the full-sample extrinsic mean.

As should be apparent from the figures, the volcano data are particularly challenging in terms of density estimation because the locations tend to be concentrated along fault lines. Potentially, data on distance to the closest fault, volcano elevation and other information could be utilized to improve performance, but we do not have access to such data. It would be straightforward to include such additional predictors, as explained in Remark 2.2. Without such information, the data present a challenging test case for the methodology, in that one may need very many vMF kernels to accurately characterize the density of volcano locations across the globe, with moderate to large numbers of kernels leading to challenging mixing issues. Indeed, we did encounter sensitivity to the starting cluster configuration in our initial analyses.

We found that one of the issues exacerbating the problem with mixing of the cluster allocation was the ordering of the weights in the stick-breaking representation utilized by the exact block Gibbs sampler. Although label switching moves can lead to some improvement, they proved insufficient in this case. Hence, we modified the computational algorithm slightly to instead use the finite Dirichlet approximation to the Dirichlet process proposed in [27]. The finite Dirichlet treats the components as exchangeable and so eliminates sensitivity to the indices of the starting clusters, which we obtained using k-means with 50 clusters. We used K = 50 as the dimension of the finite Dirichlet and hence as the upper bound on the number of occupied clusters. Another issue that led to mixing problems was the use of a hyperprior on κ. In particular, when the initial clusters were not well chosen, the kernel precision would tend to drift towards smaller than optimal values, and as a result too few clusters would be occupied to adequately fit the data. We did not observe such issues in a variety of other simulated and real data applications, but the volcano data are particularly difficult, as noted above.

To address this second issue, we chose the kernel precision parameter κ by cross-validation. In particular, we split the sample into training and test sets, and then ran our Bayesian nonparametric analysis on the training data separately for a wide range of κ values between 0 and 1,000. We chose the value that produced the highest expected posterior log likelihood in the test data, leading to κ̂ = 80. In this analysis and the subsequent analyses for estimating the BF, we chose the prior on the mixture weights to be Diri((w_0/K) 1_K) (K = 50). The other hyperparameters were chosen to be w_0 = 1, a = b = (0.71, 0.17, 0.11), the sample proportions of the different volcano types, κ_0 = 10, and μ_0 equal to the X-sample extrinsic mean. We collected 5 × 10⁴ MCMC iterations after discarding a burn-in of 10⁴. Using a fixed bandwidth considerably improved the convergence rate of the algorithm.

Based on the complete data set of 999 volcanoes, the resulting BF in favor of the alternative was estimated to be over 10¹⁰⁰, providing conclusive evidence that the different types of volcanoes have different spatial distributions across the globe. For the same fixed κ̂, we reran the analysis for a variety of alternative hyperparameter values and different starting points, obtaining similar BF estimates and the same conclusion. We also repeated the analysis for a randomly selected subsample of 300 observations, obtaining BF = 5.4 × 10¹¹. Repeating the test on other subsamples also resulted in very high BFs, and we obtained a high BF when repeating the analysis with a hyperprior on κ.

For comparison, we performed the asymptotic X² test described in §5.2, obtaining a p-value of 3.6 × 10⁻⁷, which again favors H_1. The large sample sizes for the three types (713, 172, 114) justify the use of asymptotic theory. However, given that the volcanoes are spread all over the globe, the validity of the assumption that the three conditionals have unique extrinsic means may be questioned.

We also performed a coordinate-based test comparing the means of the latitude-longitude coordinates of the three subsamples using a X²_4 statistic. The three coordinate means are (12.6, 27.9), (21.5, 9.2) and (9.97, 21.5) (latitude, longitude). The value of the statistic is 17.07 and the asymptotic p-value equals 1.9 × 10⁻³, which is larger by orders of magnitude than its coordinate-free counterpart, but still significant. Coordinate-based methods, however, can be very misleading because of the discontinuity at the boundaries; they heavily distort the geometry of the sphere, as is evident from the figures.

7. Discussion

We have proposed a novel Bayesian approach for classification and testing relying on modeling the joint distribution of the categorical response and continuous predictors as a Dirichlet process product mixture. The product mixture likelihood includes a multinomial for the categorical response and an arbitrary kernel for the predictors, with dependence induced through the DP prior on the unknown joint mixing measure. By modifying the kernel for the predictors, one can modify the support, with multivariate Gaussian kernels for predictors in ℜp and von Mises-Fisher kernels for predictors on the hypersphere. For other predictor spaces, one can appropriately modify the kernel.

Although our focus has been on hyperspherical predictors for concreteness, the proposed product mixture formulation is broadly applicable to classification problems for predictors in general spaces and we can easily consider predictors having a variety of supports. For example, some predictors can be in a Euclidean space and some on a hypersphere. The framework has some clear practical advantages over frequentist and nonparametric Bayes discriminant analysis approaches, which rely on separately modeling the conditional distributions of the feature (predictor) distributions specific to each response category.

One of our primary contributions was showing theoretical properties, including large support and posterior consistency, in modeling of the classification function. In addition, we have added to the under-developed literature on nonparametric Bayes testing of differences in distributions, not only on ℜp but on more general manifolds. We provide a novel computational approach for estimating Bayes factors as well as prove theoretical results on Bayes factor consistency. The proposed method can be implemented in broad applications.

An area of substantial interest in ongoing research is the development of models that reduce dimensionality through projecting predictors X to a lower-dimensional subspace. Such dimensionality reduction is critical in addressing the curse of dimensionality that arises inevitably in estimating classification or regression functions with high-dimensional predictors. [7] develop a promising approach for nonparametric Bayes learning of affine subspaces for density estimation and classification in Euclidean spaces, but it remains to develop related methods in non-Euclidean spaces while also obtaining theory on rates of convergence.

Acknowledgments

This research was partially supported by grant R01 ES017240 from the National Institute of Environmental Health Sciences (NIEHS) of the National Institutes of Health (NIH).

8. Appendix

8.1. Proof of Theorem 2.1

Before proving the Theorem, we prove the following Lemma.

Lemma 8.1

Under assumptions A2 and A4,

lim_{κ→∞} sup{|f(x, y; P_t, κ) − f_t(x, y)|: (x, y) ∈ X × Y} = 0.
Proof

From the definition of Pt, we can write

f(x, y; P_t, κ) = ∫_X K(x; μ, κ) φ_y(μ) λ_1(dμ)

for φ_y(μ) = f_t(μ, y). Then from A4, it follows that φ_y is continuous for all y ∈ Y. Hence from A2, it follows that

lim_{κ→∞} sup_{x∈X} |f_t(x, y) − ∫_X K(x; μ, κ) f_t(μ, y) λ_1(dμ)| = 0

for any yInline graphic. Since Inline graphic is finite, the proof is complete.

Proof of Theorem 2.1

Throughout this proof we will view X as a compact metric space and M(X × S^{L−1}) as a topological space under the weak topology. From Lemma 8.1, it follows that there exists a κ_t ≡ κ_t(ε) > 0 such that

sup_{x,y} |f(x, y; P_t, κ) − f_t(x, y)| < ε/3  (8.1)

for all κ ≥ κ_t. From assumption A3, it follows that by choosing κ_t sufficiently large, we can ensure that (P_t, κ_t) ∈ supp(Π_1). From assumption A1, it follows that K is uniformly continuous at κ_t, i.e. there exists an open set W(ε) ⊆ ℜ⁺ containing κ_t such that

sup_{x,μ∈X} |K(x; μ, κ) − K(x; μ, κ_t)| < ε/3  ∀ κ ∈ W(ε).

This in turn implies that, for all κ ∈ W(ε) and P ∈ M(X × S^{L−1}),

sup_{x,y} |f(x, y; P, κ) − f(x, y; P, κ_t)| < ε/3  (8.2)

because the left expression in (8.2) is

sup_{x,y} |∫ ν_y {K(x; μ, κ) − K(x; μ, κ_t)} P(dμ dν)| ≤ sup_{x,μ∈X} |K(x; μ, κ) − K(x; μ, κ_t)|.

Since X is compact and K(·; ·, κ_t) is uniformly continuous on X × X, we can cover X by finitely many open sets U_1, …, U_K such that

sup_{μ∈X, x,x′∈U_i} |K(x; μ, κ_t) − K(x′; μ, κ_t)| < ε/12  (8.3)

for each i ≤ K. For fixed x, y, κ, f(x, y; P, κ) is a continuous function of P. Hence for x_i ∈ U_i and y = j ∈ Y,

W_{ij}(ε) = {P ∈ M(X × S^{L−1}): |f(x_i, j; P, κ_t) − f(x_i, j; P_t, κ_t)| < ε/6},

1 ≤ i ≤ K, 1 ≤ j ≤ L, define open neighborhoods of P_t. Let W̃(ε) = ∩_{i,j} W_{ij}(ε), which is also an open neighborhood of P_t. For a general x ∈ X and y ∈ Y, find U_i containing x. Then for any P ∈ W̃(ε),

|f(x, y; P, κ_t) − f(x, y; P_t, κ_t)| ≤ |f(x, y; P, κ_t) − f(x_i, j; P, κ_t)| + |f(x_i, j; P, κ_t) − f(x_i, j; P_t, κ_t)| + |f(x_i, j; P_t, κ_t) − f(x, y; P_t, κ_t)|.  (8.4)

Denote the three terms on the right of (8.4) by T_1, T_2 and T_3. Since x ∈ U_i, it follows from (8.3) that T_1, T_3 < ε/12. Since P ∈ W̃(ε), T_2 < ε/6 by the definition of W̃(ε). Hence sup_{x,y} |f(x, y; P, κ_t) − f(x, y; P_t, κ_t)| < ε/3. Therefore

W_2(ε) ≡ {P: sup_{x,y} |f(x, y; P, κ_t) − f(x, y; P_t, κ_t)| < ε/3}

contains W̃(ε). Since (P_t, κ_t) ∈ supp(Π_1) and W̃(ε) × W(ε) contains an open neighborhood of (P_t, κ_t), it follows that

Π_1(W_2(ε) × W(ε)) > 0.

Let (P, κ) ∈ W_2(ε) × W(ε). Then for (x, y) ∈ X × Y,

|f(x, y; P, κ) − f_t(x, y)| ≤ |f(x, y; P, κ) − f(x, y; P, κ_t)| + |f(x, y; P, κ_t) − f(x, y; P_t, κ_t)| + |f(x, y; P_t, κ_t) − f_t(x, y)|.  (8.5)

The first term on the right of (8.5) is < ε/3 since κ ∈ W(ε). The second is < ε/3 because P ∈ W_2(ε). The third is also < ε/3, which follows from equation (8.1). Therefore

Π_1({(P, κ): sup_{x,y} |f(x, y; P, κ) − f_t(x, y)| < ε}) > 0.

This completes the proof.

8.2. Proof of Theorem 2.3

The proof uses Proposition 8.2, proved in [6]. Let M be a compact metric space. Denote by D(M) the space of all probability densities on M with respect to some fixed finite base measure τ, endowed with the total variation distance ||·||. Let z^n = {z_i}_{i=1}^n be an iid sample from some density f_t on M. Consider the collection of mixture densities on M given by

f(m; P, κ) = ∫_M K(m; μ, κ) P(dμ),  m ∈ M, κ ∈ ℜ⁺, P ∈ M(M),  (8.6)

with ∫_M K(m; μ, κ) τ(dm) = 1. Set a prior Π_1 on M(M) × ℜ⁺, which induces a prior Π on D(M) through (8.6). For D̃ ⊆ D(M) and ε > 0, the L1-metric entropy N(ε, D̃) is defined as the logarithm of the minimum number of ε-sized (or smaller) L1 subsets needed to cover D̃.

Proposition 8.2

For a positive sequence {κn} diverging to ∞, define

D_n = {f(·; P, κ): P ∈ M(M), κ ∈ [0, κ_n]}.

(a) Under assumptions A6–A8, given any ε > 0 and n sufficiently large, N(ε, D_n) ≤ C(ε) κ_n^{a_1 a_3} for some C(ε) > 0. (b) Under the further assumption A9, the posterior probability of any total variation neighborhood of f_t converges to 1 a.s. f_t, provided f_t is in the KL support of Π.

Proof of Theorem 2.3

For a density f ∈ D(X × Y), let p(y) be the marginal probability of Y = y and g(x, y) be the conditional density of X = x given Y = y, so that f(x, y) = p(y) g(x, y). For f_1, f_2 ∈ D(X × Y),

||f_1 − f_2|| = ∫ |f_1(x, y) − f_2(x, y)| λ(dx dy) = Σ_{j=1}^L ∫ |p_1(j) g_1(x, j) − p_2(j) g_2(x, j)| λ_1(dx) ≤ max_j ||g_1(·, j) − g_2(·, j)|| + Σ_j |p_1(j) − p_2(j)|.  (8.7)

Hence an ε diameter ball in D(X × Y) contains the intersection of L many ε/2 diameter balls from D(X) with an ε/2 diameter subset of S^{L−1}. For any f(·; P, κ) as in (2.1), its X-conditional g(·, j) for j ∈ Y can be expressed as

g(x, j) = [∫_{X×S^{L−1}} ν_j K(x; μ, κ) P(dμ dν)] / [∫_{X×S^{L−1}} ν_j P(dμ dν)] = ∫_X K(x; μ, κ) P_j(dμ),  with  P_j(dμ) = [∫_{S^{L−1}} ν_j P(dμ dν)] / [∫_{X×S^{L−1}} ν_j P(dμ dν)].

Hence g(·, j) is of the form (8.6) with M = X. Define

D̃_n = {f(·; P, κ): P ∈ M(X × S^{L−1}), κ ∈ [0, κ_n]}.  Then  D̃_n ⊆ {f ∈ D(X × Y): g(·, j) ∈ D_n ∀ j ∈ Y},  where  D_n = {g(·; P, κ): P ∈ M(X), κ ∈ [0, κ_n]}.  (8.8)

From Proposition 8.2(a), N(ε, D_n) is of order at most κ_n^{a_1 a_3}, and hence from (8.7) and (8.8), N(ε, D̃_n) ≤ C κ_n^{a_1 a_3}, with C depending on ε. Therefore, from part (b) of the Proposition, strong posterior consistency follows under assumptions A1–A9.

8.3. Proof of Corollary 2.4

Proof
  1. For any y ∈ Y,
    ∫_X |p(y, x) − p_t(y, x)| g_t(x) λ_1(dx) = ∫_X |f_t(x, y) − f(x, y) + p(y, x) g(x) − p(y, x) g_t(x)| λ_1(dx) ≤ ∫_X |f_t(x, y) − f(x, y)| λ_1(dx) + ∫_X |g_t(x) − g(x)| λ_1(dx) ≤ 2 Σ_{j=1}^L ∫_X |f(x, j) − f_t(x, j)| λ_1(dx) = 2 ||f − f_t||_1,

    and hence any neighborhood of p_t of the form {p: max_{y∈Y} ∫_X |p(y, x) − p_t(y, x)| g_t(x) λ_1(dx) < ε} contains an L1 neighborhood of f_t. Now part (a) follows from strong consistency of the posterior distribution of f.

  2. Since X is compact, f_t being continuous and positive implies that c = inf_{x∈X} g_t(x) > 0. Hence
    ∫_X |p(y, x) − p_t(y, x)| w(x) λ_1(dx) ≤ c^{-1} sup_x w(x) ∫_X g_t(x) |p(y, x) − p_t(y, x)| λ_1(dx).

    Now the result follows from part (a).

8.4. Proof of Theorem 3.1

The proof uses Lemma 8.3. This lemma is fundamental to proving weak posterior consistency via the Schwartz theorem, and its proof can be found in standard texts that present that theorem.

Lemma 8.3

(a) If Π includes ft in its KL support, then

lim inf_{n→∞} exp(nβ) ∫ ∏_i [f(x_i, y_i)/f_t(x_i, y_i)] Π(df) = ∞

a.s. f_t for any β > 0. (b) If U is a weak open neighborhood of f_t and Π_0 is a prior on D(X × Y) with support in U^c, then there exists a β_0 > 0 for which

lim_{n→∞} exp(nβ_0) ∫ ∏_i [f(x_i, y_i)/f_t(x_i, y_i)] Π_0(df) = 0

a.s. ft.

Proof of Theorem 3.1

Express BF as

BF = {∏_i p_t(y_i)} [D(b)/D(b_n)] × [∫ ∏_i (f(x_i, y_i)/f_t(x_i, y_i)) Π(df)] / [∫ ∏_i (g(x_i) p_t(y_i)/f_t(x_i, y_i)) Π(df)] = T_1 T_2/T_3

with T_1 = {∏_i p_t(y_i)} D(b)/D(b_n), T_2 = ∫ ∏_i (f(x_i, y_i)/f_t(x_i, y_i)) Π(df), and T_3 = ∫ ∏_i (g(x_i) p_t(y_i)/f_t(x_i, y_i)) Π(df).

Since Π satisfies the KL condition, Lemma 8.3(a) implies that lim inf_{n→∞} exp(nβ) T_2 = ∞ a.s. for any β > 0.

Let U be the space of all dependent densities; that is,

U^c = {f ∈ D(X × Y): f(x, y) = g(x) p(y) a.s. λ(dx dy)}.

The prior Π induces a prior Π0 on Uc via f ↦ {Σj f (., j)}pt and T3 can be expressed as if(xi,yi)ft(xi,yi)Π0(df). It is easy to show that U is open under the weak topology and hence under H1 is a weak open neighborhood of ft. Then using Lemma 8.3(b), it follows that limn→∞ exp(0)T3 = 0 a.s. for some β0 > 0.

The proof is complete if we can show that $\liminf_{n\to\infty}\exp(n\beta)T_1=\infty$ a.s. for any $\beta>0$, or equivalently that $\log(T_1)=o(n)$ a.s. For a positive sequence $a_n$ diverging to $\infty$, Stirling's formula implies that $\log\Gamma(a_n)=a_n\log(a_n)-a_n+o(a_n)$. Express $\log(T_1)$ as

$$\sum_i\log(p_t(y_i))-\log(D(b_n))+o(n). \tag{8.9}$$
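The Stirling approximation invoked here is easy to verify numerically; `math.lgamma` gives $\log\Gamma$ directly:

```python
from math import lgamma, log

# log Gamma(a) = a log a - a + O(log a), so the remainder is o(a)
errs = [abs(lgamma(a) - (a * log(a) - a)) / a for a in (1e2, 1e4, 1e6)]
assert errs[0] > errs[1] > errs[2]   # relative error shrinks as a grows
assert errs[2] < 1e-4
```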

Since $p_t(j)>0$ for all $j=1,\dots,L$, by the SLLN,

$$\sum_i\log(p_t(y_i))=n\sum_j p_t(j)\log(p_t(j))+o(n)\quad\text{a.s.} \tag{8.10}$$

Let $b_{nj}=b_j+\sum_i I(y_i=j)$ be the $j$th component of $b_n$. Then $\lim_{n\to\infty}b_{nj}/n=p_t(j)$, that is, $b_{nj}=np_t(j)+o(n)$ a.s., and hence Stirling's formula implies that

$$\log(\Gamma(b_{nj}))=b_{nj}\log(b_{nj})-b_{nj}+o(n)=np_t(j)\log(p_t(j))-np_t(j)+b_{nj}\log(n)+o(n)\quad\text{a.s.},$$

which implies

$$\log(D(b_n))=\sum_{j=1}^{L}\log(\Gamma(b_{nj}))-\log\Gamma\Big(\sum_j b_j+n\Big)=n\sum_j p_t(j)\log(p_t(j))+o(n)\quad\text{a.s.} \tag{8.11}$$

From (8.9), (8.10) and (8.11), $\log(T_1)=o(n)$ a.s., which completes the proof.
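The conclusion $\log(T_1)=o(n)$ can also be checked by simulation; $p_t$ and the prior Dirichlet parameters $b$ below are hypothetical choices:

```python
from math import lgamma
import numpy as np

def log_D(b):
    # log of the Dirichlet normalizing constant prod_j Gamma(b_j) / Gamma(sum_j b_j)
    return sum(lgamma(bj) for bj in b) - lgamma(sum(b))

rng = np.random.default_rng(3)
pt = np.array([0.5, 0.3, 0.2])     # hypothetical true label probabilities p_t
b = np.ones(3)                     # hypothetical prior Dirichlet parameters

rates = []
for n in (10**3, 10**5):
    y = rng.choice(3, size=n, p=pt)
    bn = b + np.bincount(y, minlength=3)            # posterior parameters b_n
    log_T1 = np.log(pt[y]).sum() + log_D(b) - log_D(bn)
    rates.append(abs(log_T1) / n)

assert rates[0] < 0.1 and rates[1] < 0.01           # log(T1)/n -> 0
```

The cancellation of the $n\sum_j p_t(j)\log(p_t(j))$ terms in (8.10) and (8.11) is what makes $|\log(T_1)|/n$ shrink here.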


Contributor Information

Abhishek Bhattacharya, Theoretical Statistics & Mathematics Division, Indian Statistical Institute.

David Dunson, Department of Statistical Science, Duke University.

References

  • 1. Banerjee A, Dhillon IS, Ghosh J, Sra S. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research. 2005;6:1345–1382.
  • 2. Barron A, Schervish MJ, Wasserman L. The consistency of posterior distributions in nonparametric problems. Annals of Statistics. 1999;27:536–561.
  • 3. Basu S, Chib S. Marginal likelihood and Bayes factors for Dirichlet process mixture models. Journal of the American Statistical Association. 2003;98:224–235.
  • 4. Bhattacharya A, Bhattacharya R. Nonparametric Inference on Manifolds. Cambridge University Press; Cambridge: 2012.
  • 5. Bhattacharya A, Dunson D. Nonparametric Bayesian density estimation on manifolds with applications to planar shapes. Biometrika. 2010;97:851–865. doi: 10.1093/biomet/asq044.
  • 6. Bhattacharya A, Dunson D. Strong consistency of nonparametric Bayes density estimation on compact metric spaces. Annals of the Institute of Statistical Mathematics. 2011;63. doi: 10.1007/s10463-011-0341-x.
  • 7. Bhattacharya A, Page G, Dunson D. Density estimation and classification via Bayesian nonparametric learning of affine subspaces. 2011. arXiv:1105.5737v1. doi: 10.1080/01621459.2013.763566.
  • 8. Bhattacharya RN, Patrangenaru V. Large sample theory of intrinsic and extrinsic sample means on manifolds. Annals of Statistics. 2003;31:1–29.
  • 9. Bigelow J, Dunson D. Bayesian semiparametric joint models for functional predictors. Journal of the American Statistical Association. 2009;104:26–36. doi: 10.1198/jasa.2009.0001.
  • 10. Consonni G, Veronese P. Compatibility of prior specifications across linear models. Statistical Science. 2008;23:332–353.
  • 11. Dass SC, Lee J. A note on the consistency of Bayes factors for testing point null versus non-parametric alternatives. Journal of Statistical Planning and Inference. 2004;119:143–152.
  • 12. de Jonge R, van Zanten JH. Adaptive nonparametric Bayesian inference using location-scale mixture priors. Annals of Statistics. 2010;38:3300–3320.
  • 13. De la Cruz-Mesia R, Quintana FA, Müller P. Semiparametric Bayesian classification with longitudinal markers. Applied Statistics. 2007;56:119–137. doi: 10.1111/j.1467-9876.2007.00569.x.
  • 14. Dunson DB. Multivariate kernel partition process mixtures. Statistica Sinica. 2010;20:1395–1422.
  • 15. Dunson DB, Peddada SD. Bayesian nonparametric inference on stochastic ordering. Biometrika. 2008;95:859–874. doi: 10.1093/biomet/asn043.
  • 16. Embleton BJJ, McDonnell KL. Magnetostratigraphy in the Sydney Basin, Southeastern Australia. Journal of Geomagnetism and Geoelectricity. 1980;32:304.
  • 17. Fisher NI, Lewis T, Embleton BJJ. Statistical Analysis of Spherical Data. Cambridge University Press; New York: 1987.
  • 18. Fisher RA. Dispersion on a sphere. Proceedings of the Royal Society of London, Series A. 1953;217:295–305.
  • 19. Fraley C, Raftery AE. Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical Association. 2002;97:611–631.
  • 20. Ghosal S, Ghosh JK, Ramamoorthi RV. Posterior consistency of Dirichlet mixtures in density estimation. Annals of Statistics. 1999;27:143–158.
  • 21. Ghosal S, Lember J, van der Vaart A. Nonparametric Bayesian model selection and averaging. Electronic Journal of Statistics. 2008;2:63–89.
  • 22. Ghosal S, Roy A. Posterior consistency of a Gaussian process prior for nonparametric binary regression. Annals of Statistics. 2006;34:2413–2429.
  • 23. Hamsici OC, Martinez AM. Spherical-homoscedastic distributions: The equivalency of spherical and normal distributions in classification. Journal of Machine Learning Research. 2007;8:1583–1623.
  • 24. Hannah LA, Blei DM, Powell WB. Dirichlet process mixtures of generalized linear models. Journal of Machine Learning Research. 2011;12:1923–1953.
  • 25. Hastie T, Tibshirani R. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B. 1996;58:155–176.
  • 26. Holmes CC, Caron F, Griffin JE, Stephens DA. Two-sample Bayesian nonparametric hypothesis testing. Technical Report. http://arxiv.org/abs/0910.5060.
  • 27. Ishwaran H, Zarepour M. Dirichlet prior sieves in finite normal mixtures. Statistica Sinica. 2002;12:941–963.
  • 28. Kalli M, Griffin JE, Walker SG. Slice sampling mixture models. Statistics and Computing. 2011;21:93–105.
  • 29. Kundu S, Dunson DB. Bayes variable selection in semiparametric linear models. 2011. arXiv:1108.2722. doi: 10.1080/01621459.2014.881153.
  • 30. Mardia KV, Jupp PE. Directional Statistics. John Wiley & Sons; West Sussex, England: 2000.
  • 31. Müller P, Erkanli A, West M. Bayesian curve fitting using multivariate normal mixtures. Biometrika. 1996;83:67–79.
  • 32. Pennell ML, Dunson DB. Nonparametric Bayes testing of changes in a response distribution with an ordinal predictor. Biometrics. 2008;64:413–423. doi: 10.1111/j.1541-0420.2007.00885.x.
  • 33. Schwartz L. On Bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete. 1965;4:10–26.
  • 34. Sethuraman J. A constructive definition of Dirichlet priors. Statistica Sinica. 1994;4:639–650.
  • 35. Shahbaba B, Neal R. Nonlinear models using Dirichlet process mixtures. Journal of Machine Learning Research. 2009;10:1829–1850.
  • 36. Ulrich G. Computer generation of distributions on the m-sphere. Applied Statistics. 1984;33:158–163.
  • 37. von Mises R. Über die "Ganzzahligkeit" der Atomgewichte und verwandte Fragen. Physikalische Zeitschrift. 1918;19:490–500.
  • 38. Walker SG. Sampling the Dirichlet mixture model with slices. Communications in Statistics - Simulation and Computation. 2007;36:45–54.
  • 39. Watson GS, Williams EJ. Construction of significance tests on the circle and sphere. Biometrika. 1956;43:344–352.
  • 40. Wood ATA. Simulation of the von Mises-Fisher distribution. Communications in Statistics - Simulation and Computation. 1994;23:157–164.
  • 41. Wu Y, Ghosal S. Kullback-Leibler property of kernel mixture priors in Bayesian density estimation. Electronic Journal of Statistics. 2008;2:298–331.
  • 42. Wu Y, Ghosal S. L1-consistency of Dirichlet mixtures in multivariate Bayesian density estimation. Journal of Multivariate Analysis. 2010. In press.
  • 43. Yau C, Papaspiliopoulos O, Roberts GO, Holmes C. Nonparametric hidden Markov models with applications in genomics. Journal of the Royal Statistical Society, Series B. 2010;73. doi: 10.1111/j.1467-9868.2010.00756.x.
