. Author manuscript; available in PMC: 2013 Oct 1.
Published in final edited form as: J Multivar Anal. 2012 Apr 17;111:1–19. doi: 10.1016/j.jmva.2012.02.020

Nonparametric Bayes Classification and Hypothesis Testing on Manifolds

Abhishek Bhattacharya 1, David Dunson 2
PMCID: PMC3384705  NIHMSID: NIHMS371284  PMID: 22754028

Abstract

Our first focus is prediction of a categorical response variable using features that lie on a general manifold. For example, the manifold may correspond to the surface of a hypersphere. We propose a general kernel mixture model for the joint distribution of the response and predictors, with the kernel expressed in product form and dependence induced through the unknown mixing measure. We provide simple sufficient conditions for large support and weak and strong posterior consistency in estimating both the joint distribution of the response and predictors and the conditional distribution of the response. Focusing on a Dirichlet process prior for the mixing measure, these conditions hold using von Mises-Fisher kernels when the manifold is the unit hypersphere. In this case, Bayesian methods are developed for efficient posterior computation using slice sampling. Next we develop Bayesian nonparametric methods for testing whether there is a difference in distributions between groups of observations on the manifold having unknown densities. We prove consistency of the Bayes factor and develop efficient computational methods for its calculation. The proposed classification and testing methods are evaluated using simulation examples and applied to spherical data applications.

Key words and phrases: Bayes factor, Classification, Dirichlet process mixture, Flexible prior, Hypothesis testing, Non-Euclidean manifold, Nonparametric Bayes, Posterior consistency, Spherical data

1. Introduction

Classification is one of the fundamental problems in statistics and machine learning. Let (X, Y) denote a pair of random variables with X ∈ 𝒳 the predictors and Y ∈ 𝒴 = {1, …, L} the response. The focus in classification is on predicting Y given features X. From a model-based perspective, one can address this problem by first estimating p(y, x) = Pr(Y = y | X = x), for y = 1, …, L and all x ∈ 𝒳, based on a training sample of n subjects. Then, under a 0–1 loss function, the optimal predictive value for y_{n+1}, the unknown response for an unlabeled (n+1)st subject, is simply the value of y that maximizes the estimate p̂(y, x_{n+1}) over y ∈ {1, …, L}. The model-based perspective has the advantage of providing measures of uncertainty in classification. However, performance will be critically dependent on obtaining an accurate estimate of p(y, x).
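The 0–1-loss prediction rule above is just an argmax over estimated class probabilities; a minimal sketch, where `p_hat` is a hypothetical stand-in for whatever model supplies the estimate of Pr(Y = y | X = x):

```python
import numpy as np

def classify(p_hat, x_new, L):
    """Predict the label maximizing the estimated Pr(Y = y | X = x_new)
    over y in {1, ..., L} (the Bayes-optimal rule under 0-1 loss)."""
    probs = [p_hat(y, x_new) for y in range(1, L + 1)]
    return int(np.argmax(probs)) + 1  # labels are 1-based

# Toy estimate: class 2 is most probable regardless of x.
p_hat = lambda y, x: {1: 0.2, 2: 0.5, 3: 0.3}[y]
label = classify(p_hat, x_new=0.0, L=3)  # -> 2
```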

A common strategy for addressing this problem is to use a discriminant analysis approach, which lets

p(y, x) = Pr(Y = y | X = x) = Pr(Y = y) f(x | Y = y) / Σ_{j=1}^{L} Pr(Y = j) f(x | Y = j),  y = 1, …, L,

with Pr(Y = y) the marginal probability of having label y, and f(x |Y = y) the conditional density of the features (predictors) for subjects in class y. Then, one can simply use the proportion of subjects in class y as an estimate of Pr(Y = y), while applying a multivariate density estimator to learn f(x |Y = y) separately within each class. For example, [25] proposed a popular approach which estimates f(x |Y = y) using a mixture of multivariate Gaussians (refer also to [19]). [1] instead use mixtures of von Mises-Fisher (vMF) distributions, but for unsupervised clustering on the unit hypersphere instead of classification. [5] extends the idea to features lying on a general manifold and uses mixtures of complex Watson distributions on the shape space.
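As a concrete Euclidean sketch of the discriminant-analysis recipe, the following fits a single Gaussian per class (a deliberate simplification of the mixture of Gaussians in [25]; all function names are illustrative):

```python
import numpy as np

def fit_gaussian_discriminant(X, y):
    """Estimate Pr(Y = c) by class proportions and f(x | Y = c) by a single
    Gaussian fitted within each class (a one-component stand-in for a
    class-specific mixture)."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        cov = np.cov(Xc.T) + 1e-6 * np.eye(X.shape[1])  # ridge for stability
        params[c] = (len(Xc) / len(X), Xc.mean(axis=0), cov)
    return params

def log_gauss(x, mu, Sigma):
    """Log of the multivariate normal density at x."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + diff @ np.linalg.solve(Sigma, diff))

def predict(params, x):
    """Bayes rule: argmax_c Pr(Y = c) f(x | Y = c), computed on the log scale."""
    scores = {c: np.log(pi) + log_gauss(x, mu, S) for c, (pi, mu, S) in params.items()}
    return max(scores, key=scores.get)

# Two well-separated synthetic classes.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(5, 1, (20, 2))])
y = np.array([1] * 20 + [2] * 20)
params = fit_gaussian_discriminant(X, y)
```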

Even when features can be assumed to have support on ℜp, there are two primary issues that arise. Firstly, the number of mixture components is typically unknown and can be difficult to estimate reliably, and secondly it may be difficult to accurately estimate a multivariate density specific to each class unless p is small and there are abundant training data in each class. To address these issues, a nonparametric Bayes discriminant analysis approach can be used in which the prior incorporates dependence in the unknown class-specific densities ([13]; [14]). An important challenge in implementing nonparametric Bayes methods is showing that the prior is sufficiently flexible to accurately approximate any classification function p(y, x). A primary goal of this article is to provide methods for classification that allow the predictors to have support on a manifold, utilizing priors with full support that lead to posterior consistency for the classification function.

As noted in [23], it is routine in many application areas, ranging from genomics to computer vision, to normalize the data prior to analysis to remove artifacts. This leads to feature vectors that lie on the surface of the unit hypersphere, though due to the lack of straightforward methods for the analysis of spherical data, Gaussian approximations in Euclidean space are typically used. [23] show that treating spherical features as Euclidean can lead to poor classification performance if the feature vectors are not approximately spherical-homoscedastic. We propose a class of product kernel mixture models that can be designed to have full support on the set of densities on a manifold and that lead to strong posterior consistency. In important special cases, such as spherical data, these models also facilitate computationally convenient Gibbs sampling algorithms for posterior computation.

A problem closely related to classification is testing for differences in the distribution of features across groups. In the testing setting, the nonparametric Bayes literature is surprisingly limited, perhaps due to the computational challenges that arise in calculating Bayes factors. For recent articles on nonparametric testing of differences between groups, refer to [15], [32] and [26]. The former two articles considered interval null hypotheses, while the latter considered a point null for testing differences between two groups using Polya tree priors. Here, we modify the methodology developed for the classification problem to obtain an easy-to-implement approach for nonparametric Bayes testing of differences between groups, with the data within each group constrained to lie on a compact metric space or Riemannian manifold, and we prove consistency of this testing procedure.

Here is a very brief overview of the sections to follow. §2 describes the general modeling framework for classification on manifolds. §3 adapts the methodology to the testing problem. §4 focuses on the special case in which the features lie on the surface of a unit hypersphere and adapts the theory and methodology of the earlier sections to this manifold. §5 contains results from simulation studies where the developed methods of classification and testing are compared with existing ones. §6 applies the methods to spherical data applications, and §7 discusses the results. Proofs are included in an Appendix (§8).

2. Nonparametric Bayes Classification

2.1. Kernel mixture model

Let (𝒳, ρ) be a compact metric space, ρ being the distance metric, and 𝒴 = {1, …, L} a finite set. Consider a pair of random variables (X, Y) taking values in 𝒳 × 𝒴. To induce a flexible model on the classification function p(y, x), we propose to model the joint distribution of the (X, Y) pair. The approach of inducing a flexible model on the conditional distribution through a nonparametric model for the joint distribution was proposed by [31]. In particular, they used a Dirichlet process mixture (DPM) of multivariate Gaussians for (X, Y) to induce a flexible prior on E(Y |X = x). [35, 24] recently generalized this approach beyond Gaussian mixtures. [9] used a joint DPM for random effects underlying a functional predictor and a response to induce a flexible model for prediction.

As an alternative to the joint modeling approach, one can directly place a non-parametric Bayesian model on the conditional p(y, x) = Pr(Y = y|X = x). There is a rich applied literature on this topic, with most approaches focusing on categorical predictors and/or predictors in a Euclidean space without consideration of theoretical properties. [22] showed posterior consistency in the binary case 𝒴 = {1, 2} using a Gaussian process prior mapped to the unit interval. [12] showed minimax optimal rates of posterior contraction up to a log factor for a binary response kernel regression model, with Gaussian priors on the weights leading to a conditionally Gaussian process. These and other theoretical articles on nonparametric Bayes classification focus on 𝒳 being a compact d-dimensional subset of ℜ^d, with the optimal rate for α-smooth functions being n^{−α/(d+2α)}.

Our focus is on the case in which Y is an unordered categorical variable and X is constrained to have support on a compact metric space, with non-Euclidean Riemannian manifolds of particular interest. Potentially, one could induce a conditional model for p(y, x) without modeling the marginal distribution of the predictors p(x) through appropriate mappings of Gaussian processes defined over 𝒳. However, Gaussian processes lead to well-known computational challenges even in Euclidean spaces, and there are advantages of the joint modeling approach in terms of simplifying theory, computation and modifications (e.g., to the testing problem considered in §3) while also allowing missing predictors. Most of the literature on nonparametric Bayes asymptotic theory follows a similar path, with some early papers on the type of strategy we will follow including [2, 20]. However, it is far from straightforward to establish the necessary conditions in new settings, and essentially all of the nonparametric Bayes asymptotic literature has focused on Euclidean spaces and relatively simple problems.

Assume that the joint distribution of (X, Y) has a joint density with respect to some fixed base measure λ on 𝒳 × 𝒴. Let λ = λ1 ⊗ Σ_{j=1}^{L} δ_j, where δ_j denotes the Dirac measure at j. If 𝒳 is a Riemannian manifold, the natural choice for λ1 will be the Riemannian volume form V. The distance ρ will be chosen to maintain the topology of the manifold. Letting 𝒟(𝒳 × 𝒴) denote the space of all densities with respect to λ, we propose the following joint density model

f(x, y; P, κ) = ∫_{𝒳 × S^{L−1}} ν_y K(x; μ, κ) P(dμ dν),  (x, y) ∈ 𝒳 × 𝒴,  (2.1)

where ν = (ν1, …, νL)′ ∈ S^{L−1} is a probability vector in the simplex S^{L−1} = {ν ∈ [0, 1]^L : Σ_j ν_j = 1}, K(·; μ, κ) is a kernel located at μ ∈ 𝒳 with precision or inverse-scale κ ∈ ℜ+, and P ∈ ℳ(𝒳 × S^{L−1}) is a mixing measure, with ℳ(𝒳 × S^{L−1}) denoting the space of all probability measures on 𝒳 × S^{L−1}.

One can interpret this model in the following hierarchical way. Draw (μ, ν) from P. Given (μ, ν, κ), X and Y are conditionally independent with X having the conditional density K(.; μ, κ) with respect to λ1 and

Pr(Y = l | μ, ν, κ) = ν_l,  1 ≤ l ≤ L.

If K(.; μ, κ) is a valid probability kernel, i.e.

∫_𝒳 K(x; μ, κ) λ1(dx) = 1  for all (μ, κ) ∈ 𝒳 × ℜ+,

one can show that f(x, y; P, κ) is a valid probability density with respect to λ.
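The hierarchical description above can be simulated directly when P is a finite mixing measure; in this sketch the kernel is the von Mises density on the circle, a purely illustrative stand-in for the manifold kernels of §4:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_joint(locs, nus, weights, kappa, n):
    """Generative sketch of model (2.1) with a finite P: draw an atom
    (mu, nu) = (locs[j], nus[j]) with probability weights[j], then
    X | mu ~ K(.; mu, kappa) (von Mises on the circle) and
    Pr(Y = l | nu) = nu_l."""
    idx = rng.choice(len(weights), size=n, p=weights)
    X = np.array([rng.vonmises(locs[j], kappa) for j in idx])
    Y = np.array([rng.choice(len(nus[j]), p=nus[j]) + 1 for j in idx])  # labels 1..L
    return X, Y

X, Y = draw_joint(locs=[0.0, np.pi], nus=[[0.9, 0.1], [0.1, 0.9]],
                  weights=[0.5, 0.5], kappa=20.0, n=100)
```

Because X and Y are drawn conditionally independently given the atom, dependence between them comes entirely through the shared mixing measure P, exactly as in (2.1).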

To justify model (2.1), it is necessary to show flexibility in approximating any joint density in 𝒟(𝒳 × 𝒴), and hence in approximating p(y, x). Our focus is on nonparametric Bayes methods that place a prior on the joint distribution of the measure P and the precision κ to induce a prior over 𝒟(𝒳 × 𝒴). Flexibility of the model is quantified in terms of the size of the support of this prior. In particular, our goal is to choose a specification that leads to full L∞ and Kullback-Leibler (KL) support, meaning that the prior assigns positive probability to arbitrarily small neighborhoods of any density in 𝒟(𝒳 × 𝒴). This property will not necessarily hold for arbitrarily chosen kernels, and one of our primary theoretical contributions is to provide sufficient conditions under which KL support and posterior consistency hold. This is not just of theoretical interest, as it is important to verify that the model is sufficiently flexible to approximate any classification function, with the accuracy of the estimate improving as the amount of training data grows. This is not automatic for nonparametric models, in which there is often concern about over-fitting.

Remark 2.1

In the joint model (2.1), one may also mix across the precision parameter to make the model more flexible. Posterior computation for such a model is a straightforward extension of the algorithm illustrated in §2.3.

Remark 2.2

One can allow 𝒳 = 𝒳1 ⊗ · · · ⊗ 𝒳k, with each 𝒳j corresponding to a different type of space (e.g., 𝒳1 a hypersphere, 𝒳2 a subset of ℜ^d, 𝒳3 a discrete space), by replacing K(x; μ, κ) with a product of kernels appropriate to each space.

2.2. Support of the prior and consistency

Assume that the joint distribution of (X, Y) has a density ft(x, y) = gt(x) pt(y, x) with respect to λ, where gt is the true marginal density of X and pt(y, x) is the true Pr(Y = y|X = x). For Bayesian inference, we choose a prior Π1 on (P, κ) in (2.1), with one possible choice corresponding to DP(w0 P0) ⊗ Gam(a, b), with DP(w0 P0) denoting a Dirichlet process prior with precision w0 and base probability measure P0 ∈ ℳ(𝒳 × S^{L−1}), and Gam denoting the gamma distribution. The prior Π1 induces a corresponding prior Π on the space 𝒟(𝒳 × 𝒴) through (2.1). Under minor assumptions on Π1 and hence Π, the theorem below shows that the prior probability of any uniform neighborhood of a continuous true density is positive.

Theorem 2.1

Under the assumptions

  • A1: K is continuous in its arguments,

  • A2: for any continuous function φ: 𝒳 → ℜ,
    lim_{κ→∞} sup_{x∈𝒳} |φ(x) − ∫_𝒳 K(x; μ, κ) φ(μ) λ1(dμ)| = 0,

  • A3: for any κ > 0, there exists a κ̃ ≥ κ such that (Pt, κ̃) ∈ supp(Π1), where Pt ∈ ℳ(𝒳 × S^{L−1}) is defined as
    Pt(dμ dν) = Σ_{j∈𝒴} ft(μ, j) λ1(dμ) δ_{e_j}(dν),
    with e_j ∈ ℜ^L denoting a zero vector with a single one in position j,

  • A4: ft(·, j) is continuous for all j ∈ 𝒴,

given any ε > 0,

Π({f ∈ 𝒟(𝒳 × 𝒴): sup_{x∈𝒳, y∈𝒴} |f(x, y) − ft(x, y)| < ε}) > 0.

Assumption A4 restricts the true conditional density of X given Y = j to be continuous for all j. Assumptions A1 and A2 place minor regularity conditions on the kernel K. If K(x; μ, κ) is symmetric in x and μ, as will be the case in most examples, A2 implies that K(·; μ, κ) converges to δ_μ in the weak sense uniformly in μ as κ → ∞. This justifies the names ‘location’ and ‘precision’ for the parameters. Assumption A3 provides a minimal condition on the support of the prior for (P, κ). These assumptions provide general sufficient conditions for the induced prior Π on the joint density of (X, Y) to have full L∞ support.

Although full uniform support is an appealing property, much of the theoretical work on asymptotic properties of nonparametric Bayes estimators relies on KL support. The following corollary shows that KL support follows from A1–A4 and the additional assumption that the true density is everywhere positive. The proof is along the same lines as that of Corollary 1 of [5]. The KL divergence of a density f from ft is defined as KL(ft; f) = ∫_{𝒳×𝒴} ft log(ft/f) λ(dx dy). Given ε > 0, Kε(ft) = {f : KL(ft; f) < ε} will denote an ε-sized KL neighborhood of ft. The prior Π is said to satisfy the KL condition at ft, or ft is said to be in its KL support, if Π{Kε(ft)} > 0 for any ε > 0.
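On a discretized 𝒳 × 𝒴 the KL divergence reduces to a weighted sum; a small numerical sketch (the grid cells and density values below are toy inputs):

```python
import numpy as np

def kl_divergence(ft, f, cell_mass):
    """Approximate KL(ft; f) = integral of ft log(ft/f) d(lambda):
    ft and f hold density values on grid cells of X x Y, and cell_mass
    holds the lambda-mass of each cell."""
    ft, f, cell_mass = map(np.asarray, (ft, f, cell_mass))
    return float(np.sum(cell_mass * ft * np.log(ft / f)))

# KL(ft; ft) = 0, and KL(ft; f) > 0 whenever the densities differ.
p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
m = np.array([1.0, 1.0])  # unit lambda-mass per cell
```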

Corollary 2.2

Under assumptions A1A4 and

  • A5:

    ft(x, y) > 0 for all x, y,

    ft is in the KL support of Π.

Suppose we have an independent and identically distributed (iid) sample (x^n, y^n) ≡ (x_i, y_i)_{i=1}^{n} from ft. Since ft is unknown, we take the likelihood function to be Π_{i=1}^{n} f(x_i, y_i; P, κ). Using the prior Π on f and the observed sample, we obtain the posterior distribution of f, denoted Π(· | x^n, y^n). By the Schwartz theorem ([33]), Corollary 2.2 implies weak posterior consistency. This in turn implies that for any measurable subset A of 𝒳 with λ1(A) > 0 and λ1(∂A) = 0, and any y ∈ 𝒴, the posterior conditional probability of Y being y given X in A converges to the true conditional probability almost surely. Here ∂A denotes the boundary of A.

To give more flexibility to the classification function, we may replace the location-scale kernel by some broader family of parametric distributions on 𝒳, such as K(·; μ, κ, Θ) with Θ denoting additional kernel parameters. When performing posterior computations, we may set hyperpriors on the parameters of the prior. Then the conclusions of Theorem 2.1 and Corollary 2.2 hold, and hence weak consistency follows, as long as the assumptions are verified for hyperparameters in a set of positive prior probability. This is immediate and is verified in Lemma 1 of [41].

Under stronger assumptions on the kernel and the prior, we prove strong posterior consistency for the joint model. We will illustrate how these conditions are met for a von Mises-Fisher (vMF) mixture model for hyperspherical data through Proposition 4.1.

Theorem 2.3

Under assumptions A1A5 and

  • A6: there exist positive constants 𝒦0, a1, A1 such that for all 𝒦 ≥ 𝒦0 and μ, ν ∈ 𝒳,
    sup_{x∈𝒳, κ∈[0,𝒦]} |K(x; μ, κ) − K(x; ν, κ)| ≤ A1 𝒦^{a1} ρ(μ, ν),

  • A7: there exist positive constants a2, A2 such that for all 𝒦 ≥ 𝒦0 and κ1, κ2 ∈ [0, 𝒦],
    sup_{x,μ∈𝒳} |K(x; μ, κ1) − K(x; μ, κ2)| ≤ A2 𝒦^{a2} |κ1 − κ2|,

  • A8: there exist positive constants a3, A3, A4 such that, given any ε > 0, 𝒳 can be covered by at most A3 ε^{−a3} + A4 subsets of diameter at most ε,

  • A9: Π1(ℳ(𝒳 × S^{L−1}) × (n^a, ∞)) is exponentially small for some a < (a1 a3)^{−1},

the posterior probability of any total variation neighborhood of ft converges to 1 almost surely.

Given the training data, we can classify a new subject based on their features using the posterior mean classification function p̂. As a corollary to Theorem 2.3, we show that p̂ converges to pt in an L1 sense.

Corollary 2.4

(a) Strong consistency for the posterior of f implies that

Π{f : max_{y∈𝒴} ∫_𝒳 |p(y, x) − pt(y, x)| gt(x) λ1(dx) < ε | x^n, y^n}  (2.2)

converges to 1 as n → ∞ a.s. (b) Under assumptions A4–A5 on ft, this implies that

Π{f : max_{y∈𝒴} ∫_𝒳 |p(y, x) − pt(y, x)| w(x) λ1(dx) < ε | x^n, y^n}

converges to 1 a.s. for any non-negative function w with sup_x w(x) < ∞.

Remark 2.3

Part (a) of Corollary 2.4 holds even when 𝒳 is non-compact; it requires only strong posterior consistency for the joint model.

From part (b) of Corollary 2.4, it would seem intuitive that point-wise posterior consistency can be obtained for the predictive probability function. However, this is not immediate because the convergence rate may depend on the choice of w.

Assumption A9 is hard to satisfy, especially when the feature space is high dimensional; this type of problem was noted by [42] in a different setting. In that case a1 and a3 turn out to be very large, so the prior is required to have very light tails and to place small mass at high precisions. This is undesirable in applications; instead we can let Π1 depend on the sample size n and obtain weak and strong consistency under weaker assumptions.

Theorem 2.5

Let Π1 = Π11 ⊗ πn, where πn is a sequence of densities on ℜ+. Assume the following.

  • A10:

    The prior Π11 has full support.

  • A11: for any β > 0, there exists a κ0 ≥ 0 such that for all κ ≥ κ0,
    lim inf_{n→∞} exp(nβ) πn(κ) = ∞,

  • A12: for some β0 > 0 and a < (a1 a3)^{−1},
    lim_{n→∞} exp(nβ0) πn{(n^a, ∞)} = 0.
    1. Under assumptions A1A2 on the kernel, A10A11 on the prior and A4A5 on ft, the posterior probability of any weak neighborhood of ft converges to one a.s.

    2. Under assumptions A1A2, A4A8 and A10A12, the posterior probability of any total variation neighborhood of ft converges to 1 a.s.

The proof is similar to those of Theorems 2.6 and 2.9 in [6] and hence is omitted. With Π11 = DP(w0 P0) and πn = Gam(a, bn), the conditions in Theorem 2.5 are satisfied (for example) when P0 has full support and bn = b1 n/{log(n)}^{b2} for any b1, b2 > 0. Then from Corollary 2.4, we have L1 consistency of the estimated classification function.

2.3. Computation

Given the training sample (xn, yn), we classify a new subject based on the predictive probability of allocating it to category j, which is expressed as

Pr(y_{n+1} = j | x_{n+1}, x^n, y^n),  j ∈ 𝒴,  (2.3)

where x_{n+1} denotes the feature for the new subject and y_{n+1} its unknown class label. It follows from Theorem 2.1 and Corollary 2.4 that the classification rule is consistent if the kernel and prior are chosen appropriately. Following the recommendation in §2.2, for the prior we let P ~ DP(w0 P0) independently of κ ~ π, with P0 = P01 ⊗ P02, P01 a distribution on 𝒳, P02 a Dirichlet distribution Diri(a) (a = (a1, …, aL)) on S^{L−1}, and π a base distribution on ℜ+.

Since it is not possible to get a closed form expression for the predictive probability, we use a Markov chain Monte Carlo (MCMC) algorithm. Specifically, we rely on a simple-to-implement Gibbs sampling algorithm that utilizes slice sampling. Slice sampling was proposed for posterior computation in DPMs by [38], with efficiency later improved by Papaspiliopoulos (2008, unpublished technical report). Our approach is related to that used by [43], with [28] providing a general class of algorithms for slice sampling in mixture models.

Using the stick-breaking representation of [34] and introducing cluster allocation indices S = (S1, …, Sn) (Si ∈ {1, …, ∞}), the generative model (2.1) can be expressed in hierarchical form as

x_i ~ K(·; μ_{S_i}, κ),  y_i ~ Multi({1, …, L}; ν_{S_i}),  S_i ~ Σ_{j=1}^{∞} w_j δ_j,  θ_j = (μ_j, ν_j),  (2.4)

where w_j = V_j Π_{h<j}(1 − V_h) is the probability that subject i is allocated to cluster S_i = j, θ_j is the vector of parameters specific to cluster j, and V_j ~ Beta(1, w0), μ_j ~ P01 and ν_j ~ Diri(a) are mutually independent for j = 1, …, ∞.
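The stick-breaking weights in (2.4) are easy to simulate; a sketch of the first J sticks:

```python
import numpy as np

def stick_breaking(w0, J, rng):
    """Sample the first J weights w_j = V_j * prod_{h<j} (1 - V_h) with
    V_j ~ Beta(1, w0) iid, per the representation of [34]."""
    V = rng.beta(1.0, w0, size=J)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - V)[:-1]))  # prod_{h<j}(1 - V_h)
    return V * remaining

w = stick_breaking(w0=1.0, J=50, rng=np.random.default_rng(2))
```

The leftover mass 1 − Σ_{j≤J} w_j belongs to the infinitely many sticks not yet drawn, which is why the sampler below only instantiates sticks up to a data-driven truncation.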

The joint posterior density of V = {V_j}_{j=1}^{∞}, θ = {θ_j}_{j=1}^{∞} = (μ, ν), S and κ given the training data is proportional to

{Π_{i=1}^{n} K(x_i; μ_{S_i}, κ) ν_{S_i y_i} w_{S_i}} {Π_{j=1}^{∞} Beta(V_j; 1, w0) P01(dμ_j) Diri(ν_j; a)} π(κ).

To avoid the need for posterior computation for infinitely many unknowns, we introduce slice sampling latent variables u = {u_i}_{i=1}^{n} drawn iid from Unif(0, 1), such that the augmented posterior density is proportional to

π(u, V, θ, S, κ | x^n, y^n) ∝ {Π_{i=1}^{n} K(x_i; μ_{S_i}, κ) ν_{S_i y_i} I(u_i < w_{S_i})} × {Π_{j=1}^{∞} Beta(V_j; 1, w0) P01(dμ_j) Diri(ν_j; a)} π(κ).  (2.5)

Letting Smax = max{Si}, the conditional posterior distribution of {Vj, θj, j > Smax} is the same as the prior, and we can use this to bypass the need for updating infinitely-many unknowns in the Gibbs sampler. After choosing initial values, the sampler iterates through the following steps.

  1. Update S_i, for i = 1, …, n, given (u, V, θ, κ, x^n, y^n) by sampling from the multinomial distribution with
    Pr(S_i = j) ∝ K(x_i; μ_j, κ) ν_{j y_i}  for j ∈ A_i = {j : 1 ≤ j ≤ J, w_j > u_i},

    with J the smallest index satisfying 1 − min(u) < Σ_{j=1}^{J} w_j. In implementing this step, draw V_j ~ Beta(1, w0) and θ_j ~ P0 for j > Smax as needed.

  2. Update the scale parameter κ by sampling from its full conditional posterior, which is proportional to
    π(κ) Π_{i=1}^{n} K(x_i; μ_{S_i}, κ).

    If direct sampling is not possible, rejection sampling or Metropolis-Hastings (MH) sampling can be used.

  3. Update the atoms θ_j, j = 1, …, Smax, from the full conditional posterior distribution, which is equivalent to independently sampling from
    π(μ_j | −) ∝ P01(dμ_j) Π_{i: S_i=j} K(x_i; μ_j, κ),
    (ν_j | −) ~ Diri(a1 + Σ_{i: S_i=j} I(y_i = 1), …, aL + Σ_{i: S_i=j} I(y_i = L)).

    If P01 is not conjugate, then rejection or MH sampling can be used to update μj.

  4. Update the stick-breaking random variables V_j, for j = 1, …, Smax, from their conditional posterior distributions given the cluster allocation S but marginalizing out the slice sampling latent variables u. In particular,
    V_j ~ Beta(1 + Σ_i I(S_i = j), w0 + Σ_i I(S_i > j)).
  5. Update the slice sampling latent variables from their conditional posterior by letting
    u_i ~ Unif(0, w_{S_i}),  i = 1, …, n.
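The truncation level J in step 1, the smallest index with 1 − min(u) < Σ_{j≤J} w_j, can be located with a cumulative sum; a sketch, assuming enough sticks have already been drawn to cover the required mass:

```python
import numpy as np

def truncation_level(w, u):
    """Smallest J with sum_{j<=J} w_j > 1 - min(u). Beyond J, every w_j is
    below every u_i, so each slice set A_i = {j <= J : w_j > u_i} is
    already complete."""
    target = 1.0 - np.min(u)
    cumw = np.cumsum(w)
    J = int(np.searchsorted(cumw, target, side="right")) + 1
    if J > len(w):
        raise ValueError("draw more sticks before truncating")
    return J
```

For example, with weights (0.5, 0.3, 0.15, 0.05) and slice variables (0.2, 0.4), the target mass is 0.8 and the smallest index whose cumulative weight strictly exceeds it is J = 3.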

These steps are repeated for a large number of iterations, with a burn-in discarded to allow convergence. Given a draw from the posterior, the predictive probability of allocating a new observation to category l, l ∈ 𝒴, as defined through (2.3), is proportional to

Σ_{j=1}^{Smax} w_j ν_{jl} K(x_{n+1}; μ_j, κ) + w_{Smax+1} ∫_{𝒳×S^{L−1}} ν_l K(x_{n+1}; μ, κ) P0(dμ dν),  (2.6)

where w_{Smax+1} = 1 − Σ_{j=1}^{Smax} w_j. We can average these conditional predictive probabilities across the MCMC iterations after burn-in to estimate the predictive probabilities. For moderate to large numbers of training samples n, Σ_{j=1}^{Smax} w_j ≈ 1 with high probability, so an accurate approximation can be obtained by setting the final term to zero, bypassing the need to calculate the integral.
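Per posterior draw, the truncated version of (2.6) (with the final P0 term dropped) is a weighted kernel sum over occupied clusters; a sketch using an unnormalized von Mises kernel on the circle as an illustrative K, whose normalizing constant cancels once the scores are normalized over labels:

```python
import numpy as np

def predictive_probs(w, nu, mu, kappa, x_new):
    """Conditional predictive probabilities, proportional to
    sum_{j<=Smax} w_j * nu_{jl} * K(x_new; mu_j, kappa), normalized over l.
    exp(kappa * cos(x - mu)) is the von Mises kernel up to a constant."""
    k = np.exp(kappa * np.cos(x_new - np.asarray(mu)))          # K(x_new; mu_j, kappa)
    scores = (np.asarray(w)[:, None] * np.asarray(nu) * k[:, None]).sum(axis=0)
    return scores / scores.sum()

# Two clusters: one near x_new favoring label 1, one far away favoring label 2.
probs = predictive_probs(w=[0.6, 0.4], nu=[[0.9, 0.1], [0.1, 0.9]],
                         mu=[0.0, np.pi], kappa=10.0, x_new=0.05)
```

Averaging such vectors over post-burn-in draws gives the Monte Carlo estimate of (2.3).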

3. Nonparametric Bayes Testing

3.1. Hypotheses and Bayes factor

The previous section focused on the classification problem in which there is interest in predicting a class label yn+1 for a new subject based on training data (xn, yn). In many applications, the primary focus is instead on conducting inferences on differences in the distribution of features X across groups Y. For example, Y may correspond to a patient’s ethnicity and X to health outcomes in an epidemiology study, with interest being in assessing evidence in the data of differences in the distribution of health outcomes across ethnic groups. This testing problem is somewhat orthogonal to the classification problem considered in §2.3, with classification being of primary interest in certain applications and testing of primary interest in others. However, as we demonstrate in this section, the models, theory and computational algorithms we developed for the classification problem can be adapted to accommodate testing. Although testing of group differences is one of the canonical problems in statistics, the literature addressing this problem from a nonparametric Bayes perspective while providing theoretical guarantees is essentially non-existent. Hence, the material in this section is a major contribution of the paper, which should hopefully stimulate additional research.

Although our methods can allow testing of pairwise differences between groups, for simplicity of exposition we focus on the case in which the null hypothesis corresponds to homogeneity across the groups. Formally, the alternative hypothesis H1 corresponds to any joint density in 𝒟(𝒳 × 𝒴) excluding densities of the form

H0:f(x,y)=g(x)p(y) (3.1)

for all (x, y) outside of a λ-null set. Note that model (2.1) will in general assign zero probability to H0, and hence is an appropriate model for the joint density under H1.

As a model for the joint density under the null hypothesis H0 in (3.1), we replace P(dμ dν) in (2.1) with P1(dμ) ⊗ P2(dν), so that the joint density becomes

f(x, y; P1, P2, κ) = g(x; P1, κ) p(y; P2),  where  (3.2)
g(x; P1, κ) = ∫_𝒳 K(x; μ, κ) P1(dμ),  p(y; P2) = ∫_{S^{L−1}} ν_y P2(dν).  (3.3)

We set priors Π1 and Π0 for the parameters in the models under H1 and H0, respectively. The Bayes factor in favor of H1 over H0 is then the ratio of the marginal likelihoods under H1 and H0,

BF(H1 : H0) = [∫_{ℳ(𝒳×S^{L−1})×ℜ+} Π_{i=1}^{n} f(x_i, y_i; P, κ) Π1(dP dκ)] / [∫_{ℳ(𝒳)×ℳ(S^{L−1})×ℜ+} Π_{i=1}^{n} g(x_i; P1, κ) p(y_i; P2) Π0(dP1 dP2 dκ)].

The priors should be suitably constructed so that we get consistency of the Bayes factor and computation is straightforward and efficient. [3] propose an approach for calculating Bayes factors for comparing Dirichlet process mixture (DPM) models, but their algorithm is quite involved to implement and is limited to DPM models with a single DP prior on an unknown mixture distribution. Simple conditions for consistency of Bayes factors for testing a point null versus a nonparametric alternative have been provided by [11], but there has been limited work on consistency of Bayes tests in more complex cases, such as we are faced with here. [21] develop general theory on nonparametric Bayesian model selection and averaging, with their Corollary 3.1 providing conditions for Bayes factor consistency. [29] shows Bayes factor and model selection consistency in the setting of semiparametric linear models with nonparametric residual distributions. Our theory takes a different direction appropriate to the group testing problem.

The prior Π1 on (P, κ) under H1 can be constructed as in §2. To choose a prior Π0 for (P1, P2, κ) under H0, we take (P1, κ) to be independent of P2 so that the marginal likelihood becomes a product of the X and Y marginals if H0 is true. Dependence in the priors for the mixing measures would induce dependence between the X and Y densities, and it is important to maintain independence under H0. In addition, as in parametric Bayes model selection [10], it is important to maintain compatibility of the prior specifications under H0 and H1. In the parametric literature, compatibility is often obtained by first specifying a prior for the unknowns in the larger model and then marginalizing to induce priors for the unknowns in smaller nested models. Following this idea, we let the prior for (P1, κ) under H0 correspond to the marginal on (P1, κ) from the prior Π1 on (P, κ) assuming P = P1 ⊗ P2. It remains to specify a prior for P2 under H0. Expression (3.3) suggests that under H0 the density of Y depends on P2 only through

p = (p(1; P2), p(2; P2), …, p(L; P2))′ ∈ S^{L−1}.

Hence, it is sufficient to choose a prior for p, such as Diri(b) with b = (b1, …, bL)′, instead of specifying a full prior for P2. For compatibility, the Dirichlet hyperparameters can potentially be chosen to approximately match the first two moments of the marginal prior on p induced from Π1.

Under prior Π0, noting that P = P1 ⊗ P2, the marginal likelihood under H0 is

∫_{ℳ(𝒳×S^{L−1})×ℜ+} Π_{i=1}^{n} g(x_i; P1, κ) Π1(dP dκ) ∫_{S^{L−1}} Π_{j=1}^{L} p_j^{Σ_{i=1}^{n} I(y_i=j)} Diri(dp; b)
= [D(b_n)/D(b)] ∫_{ℳ(𝒳×S^{L−1})×ℜ+} Π_{i=1}^{n} g(x_i; P1, κ) Π1(dP dκ),  (3.4)

with b_n being the L-dimensional vector with jth coordinate b_j + Σ_{i=1}^{n} I(y_i = j), 1 ≤ j ≤ L, D being the normalizing constant of the Dirichlet distribution, given by D(a) = Π_{j=1}^{L} Γ(a_j) / Γ(Σ_{j=1}^{L} a_j), and Γ denoting the gamma function. The marginal likelihood under H1 is

∫_{ℳ(𝒳×S^{L−1})×ℜ+} Π_{i=1}^{n} f(x_i, y_i; P, κ) Π1(dP dκ).  (3.5)

The Bayes factor in favor of H1 against H0 is the ratio of the marginal likelihood (3.5) to (3.4).
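The ratio D(b_n)/D(b) appearing in (3.4) is best computed on the log scale via log-gamma; a sketch (function names are illustrative):

```python
from math import lgamma, log

def log_dirichlet_norm(a):
    """log D(a) = sum_j log Gamma(a_j) - log Gamma(sum_j a_j)."""
    return sum(lgamma(aj) for aj in a) - lgamma(sum(a))

def log_null_factor(b, y, L):
    """log of D(b_n)/D(b) from (3.4), where b_n adds the observed class
    counts of y (labels 1..L) to the hyperparameter vector b."""
    counts = [sum(1 for yi in y if yi == j + 1) for j in range(L)]
    bn = [bj + cj for bj, cj in zip(b, counts)]
    return log_dirichlet_norm(bn) - log_dirichlet_norm(b)
```

For instance, with b = (1, 1) and one observation in each of two classes, D(b) = 1 and D(b_n) = Γ(2)Γ(2)/Γ(4) = 1/6.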

3.2. Consistency of the Bayes factor

Let Π be the prior induced on the space of all densities 𝒟(𝒳 × 𝒴) through Π1. For any density f(x, y), let g(x) = Σ_j f(x, j) denote the marginal density of X, while p(y) = ∫_𝒳 f(x, y) λ1(dx) denotes the marginal probability vector of Y. Let ft, gt and pt be the corresponding values for the true distribution of (X, Y). The Bayes factor in favor of the alternative, as obtained in the last section, can be expressed as

BF = [D(b)/D(b_n)] ∫ Π_i f(x_i, y_i) Π(df) / ∫ Π_i g(x_i) Π(df).  (3.6)

Theorem 3.1 proves consistency of the Bayes factor at an exponential rate if the alternative hypothesis of dependence holds.

Theorem 3.1

If X and Y are not independent under the true density ft, and the prior Π satisfies the KL condition at ft, then there exists a β0 > 0 for which lim inf_{n→∞} exp(−nβ0) BF = ∞ a.s. ft.

3.3. Computation

Seemingly, one of the major reasons for the lack of methodology literature on Bayesian nonparametric testing is the substantial computational hurdles involved in accurately calculating Bayes factors for comparing hypotheses. Except in very simple models, marginal likelihoods cannot be calculated in closed form, accurate analytic approximations are not available and Monte Carlo methods are prohibitively expensive computationally when run sufficiently long to produce an accurate estimate. Outside of conjugate models, one of the most successful strategies for calculating Bayes factors based on MCMC algorithms relies on an encompassing approach in which a single MCMC algorithm is designed, which moves freely between the models. There has been limited success in designing such algorithms for nonparametric Bayes testing, but we devise an efficient data augmentation algorithm that appears to accurately produce Bayes factors and posterior hypothesis probabilities based on a single MCMC chain.

We introduce a latent variable z = I(H1 is true) which takes value 1 if H1 is true and 0 if H0 is true. Assuming equal prior probabilities for H0 and H1, the conditional likelihood of (xn, yn) given z is

Π(x^n, y^n | z = 0) = [D(b_n)/D(b)] ∫ ∏_{i=1}^n g(x_i; P_1, κ) Π_1(dP dκ)  and  Π(x^n, y^n | z = 1) = ∫ ∏_{i=1}^n f(x_i, y_i; P, κ) Π_1(dP dκ).

In addition, the Bayes factor can be expressed as

BF = Pr(z = 1 | x^n, y^n) / Pr(z = 0 | x^n, y^n).  (3.7)

Next introduce latent parameters μ, ν, V, S, κ as in §2.3 such that

Π(x^n, y^n, μ, V, S, κ, z = 0) = [D(b_n)/D(b)] π(κ) ∏_{i=1}^n {w_{S_i} K(x_i; μ_{S_i}, κ)} × ∏_{j=1}^∞ {Be(V_j; 1, w_0) P_{01}(dμ_j)},  (3.8)
Π(x^n, y^n, μ, ν, V, S, κ, z = 1) = π(κ) ∏_{i=1}^n {w_{S_i} ν_{S_i y_i} K(x_i; μ_{S_i}, κ)} × ∏_{j=1}^∞ {Be(V_j; 1, w_0) P_0(dμ_j dν_j)}.  (3.9)

Marginalize out ν from equation (3.9) to get

Π(x^n, y^n, μ, V, S, κ, z = 1) = π(κ) ∏_{j=1}^∞ [D(a + ã_j(S))/D(a)] × ∏_{i=1}^n {w_{S_i} K(x_i; μ_{S_i}, κ)} ∏_{j=1}^∞ {Be(V_j; 1, w_0) P_{01}(dμ_j)},  (3.10)

with ã_j(S), 1 ≤ j < ∞, being L-dimensional vectors with lth coordinate Σ_{i: S_i = j} I(y_i = l), l ∈ Y. Integrating out z by adding equations (3.8) and (3.10), the joint distribution of (μ, V, S, κ) given the data becomes

Π(μ, V, S, κ | x^n, y^n) ∝ {C_0 + C_1(S)} π(κ) ∏_{i=1}^n {w_{S_i} K(x_i; μ_{S_i}, κ)} × ∏_{j=1}^∞ {Be(V_j; 1, w_0) P_{01}(dμ_j)},  with  C_0 = D(b_n)/D(b)  and  C_1(S) = ∏_{j=1}^∞ D(a + ã_j(S))/D(a).  (3.11)

To estimate the Bayes factor, first make repeated draws from the posterior in (3.11). For each draw, compute the posterior probability distribution of z from equations (3.8) and (3.10) and take their average after discarding a suitable burn-in. The averages estimate the posterior distribution of z given the data, from which we can get an estimate for BF from (3.7). The sampling steps are accomplished as follows

  1. Update the cluster labels S given (μ, V, κ) and the data from their joint posterior which is proportional to
    {C_0 + C_1(S)} π(κ) ∏_{i=1}^n {w_{S_i} K(x_i; μ_{S_i}, κ)}.  (3.12)

    Introduce slice sampling latent variables u as in §2.3 and replace w_{S_i} by I(u_i < w_{S_i}) to make the total number of possible states finite. However, unlike in §2.3, the S_i are no longer conditionally independent. We propose a Metropolis-Hastings block update step in which a candidate for (S_1, …, S_n), or some subset of this vector if n is large, is sampled independently from multinomials with Pr(S_i = j) ∝ K(x_i; μ_j, κ) for j ∈ A_i, where A_i = {j: 1 ≤ j ≤ J, w_j > u_i} and J is the smallest index satisfying 1 − min_i(u_i) < Σ_{j=1}^J w_j. In implementing this step, draw V_j ~ Be(1, w_0) and μ_j ~ P_{01} for j > S_max as needed. The acceptance probability is simply the ratio of C_0 + C_1(S) evaluated at the candidate and current values of S.

  2. Update κ, {μ_j}_{j=1}^{S_max}, {V_j}_{j=1}^{S_max} and {u_i}_{i=1}^n as in Steps (2) – (5) of the algorithm in §2.3.

  3. Compute the full conditional posterior distribution of z which is given by
    Pr(z | μ, S, x^n, y^n) ∝ D(b_n)/D(b) if z = 0, and ∝ ∏_{j=1}^{S_max} D(a + ã_j(S))/D(a) if z = 1.
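Both the acceptance ratio in Step 1 and the BF estimate in (3.7) depend only on C_0 and C_1(S). The sketch below is an illustrative implementation under that observation (helper names are ours; `log_c0` = log D(b_n) − log D(b) is assumed precomputed, and the infinite product in C_1(S) reduces to the occupied clusters, since empty clusters contribute D(a)/D(a) = 1):

```python
import numpy as np
from scipy.special import gammaln

def log_D(a):
    """log Dirichlet normalizing constant: log D(a) = sum_l log Gamma(a_l) - log Gamma(sum_l a_l)."""
    return np.sum(gammaln(a)) - gammaln(np.sum(a))

def log_C1(S, y, a, L):
    """log C1(S) = sum over occupied clusters j of log[D(a + a_tilde_j(S)) / D(a)],
    where a_tilde_j(S) has lth coordinate sum_{i: S_i = j} I(y_i = l)."""
    total = 0.0
    for j in np.unique(S):
        counts = np.bincount(y[S == j], minlength=L).astype(float)
        total += log_D(a + counts) - log_D(a)
    return total

def accept_prob(S_cur, S_prop, y, a, L, log_c0):
    """MH acceptance probability for the block update of S:
    the ratio of C0 + C1(S) at the candidate and current label vectors."""
    num = np.logaddexp(log_c0, log_C1(S_prop, y, a, L))
    den = np.logaddexp(log_c0, log_C1(S_cur, y, a, L))
    return min(1.0, np.exp(num - den))

def estimate_bf(log_c0, log_c1_draws, burn_in=0):
    """BF estimate via (3.7): average Pr(z = 1 | draw) = C1/(C0 + C1) over
    saved draws of S, then form the ratio of averaged z-probabilities."""
    lc1 = np.asarray(log_c1_draws, dtype=float)[burn_in:]
    p1 = 1.0 / (1.0 + np.exp(log_c0 - lc1))   # Pr(z = 1) per draw, stable in logs
    return np.mean(p1) / np.mean(1.0 - p1)
```

Note that the multinomial proposal probabilities cancel in the acceptance ratio, which is why only C_0 + C_1(S) appears.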

4. Application to the unit sphere Sd

4.1. vMF Kernel Mixture Models

For classification using predictors X lying on the hypersphere

X = S^d = {x ∈ ℝ^{d+1}: ||x||^2 ≡ Σ_{j=1}^{d+1} x_j^2 = 1},

we recommend using a von Mises-Fisher (vMF) kernel in the mixture model (2.1) to induce a prior over D(S^d × Y). Although the other distributions on S^d described in [30] could be used, the vMF kernel provides a relatively simple and computationally tractable choice. As shown in Proposition 4.1, this kernel also satisfies the assumptions in §2.2 for building a flexible joint density model and for posterior consistency. For a proof, see [6].

The vMF distribution has the density ([37], [18], [39])

vMF(x; μ, κ) = c^{-1}(κ) exp(κ x^T μ),  x, μ ∈ S^d, κ ∈ [0, ∞),  (4.1)

with respect to the invariant volume form on S^d, where

c(κ) = 2π^{d/2} Γ(d/2)^{-1} ∫_{-1}^{1} exp(κt)(1 − t^2)^{d/2 − 1} dt

is its normalizing constant. This distribution has a unique extrinsic mean (as defined in [8]) equal to μ, so that μ can be interpreted as the kernel location. The parameter κ is a measure of concentration: κ = 0 corresponds to the uniform distribution, while as κ diverges to ∞ the distribution converges weakly to δ_μ, uniformly in μ. Sampling from the vMF is straightforward using results in [36] and [40].
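As an illustration of how such draws can be generated, the following is a minimal rejection sampler in the style of the Ulrich/Wood algorithms for the vMF; it is a sketch under that assumption, not necessarily the exact scheme of [36] or [40]. It samples the cosine t = x^T μ by rejection and rotates the pole to μ with a Householder reflection.

```python
import numpy as np

def sample_vmf(mu, kappa, n, rng=None):
    """Draw n samples from vMF(mu, kappa) on the unit sphere in R^p, p = len(mu)."""
    rng = np.random.default_rng(rng)
    mu = np.asarray(mu, dtype=float)
    p = mu.size                      # ambient dimension; the sphere is S^{p-1}
    d = p - 1
    if kappa == 0:                   # kappa = 0 is the uniform distribution
        x = rng.standard_normal((n, p))
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    # rejection envelope for the cosine t = x^T mu (Wood-style)
    b = (-2 * kappa + np.sqrt(4 * kappa**2 + d**2)) / d
    x0 = (1 - b) / (1 + b)
    c = kappa * x0 + d * np.log(1 - x0**2)
    out = np.empty((n, p))
    for i in range(n):
        while True:
            z = rng.beta(d / 2, d / 2)
            t = (1 - (1 + b) * z) / (1 - (1 - b) * z)
            if kappa * t + d * np.log(1 - x0 * t) - c >= np.log(rng.uniform()):
                break
        v = rng.standard_normal(d)   # uniform direction orthogonal to the pole
        v /= np.linalg.norm(v)
        out[i] = np.concatenate(([t], np.sqrt(1 - t**2) * v))
    # rotate the pole e1 to mu via the Householder reflection H = I - 2 u u^T
    e1 = np.zeros(p); e1[0] = 1.0
    u = e1 - mu
    if np.linalg.norm(u) > 1e-12:
        u /= np.linalg.norm(u)
        out = out - 2 * np.outer(out @ u, u)
    return out
```

For large κ the samples concentrate tightly around μ, consistent with the weak convergence to δ_μ noted above.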

Proposition 4.1

(a) The vMF kernel K defined in (4.1) satisfies assumptions A1 and A2. It satisfies A6 with a_1 = d/2 + 1 and A7 with a_2 = d/2. The compact metric space S^d endowed with the chord distance satisfies A8 with a_3 = d.

In the sequel, we will apply the general methods for classification and testing developed earlier in the paper to hyperspherical features.

4.2. MCMC Details

When features lie on Sd and we choose vMF kernels and priors as in §2.3, simplifications result in the MCMC steps for updating κ and μ. Letting P01 = vMF(μ0, κ0), we obtain conjugacy for the full conditional of the kernel locations,

(μ_j | −) ~ vMF(μ̄_j, κ_j)

with μ̄_j = v_j/||v_j||, κ_j = ||v_j||, v_j = κ_0 μ_0 + κ X_j and X_j = Σ_i x_i I(S_i = j). As a default, we can set μ_0 equal to the sample extrinsic mean of x^n, or we can choose κ_0 = 0 to induce a uniform P_{01}. The full conditional posterior of κ is
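The conjugate location update then amounts to computing (μ̄_j, κ_j) from the cluster data; a minimal sketch with illustrative names (the resulting vMF can be drawn from with any vMF sampler):

```python
import numpy as np

def mu_full_conditional(x, S, j, kappa, mu0, kappa0):
    """Parameters of the vMF full conditional of mu_j:
    v_j = kappa0 * mu0 + kappa * X_j, with X_j = sum_i x_i I(S_i = j);
    posterior mean direction v_j / ||v_j||, posterior concentration ||v_j||."""
    Xj = x[S == j].sum(axis=0)          # X_j = sum of observations in cluster j
    vj = kappa0 * np.asarray(mu0, dtype=float) + kappa * Xj
    norm = np.linalg.norm(vj)
    return vj / norm, norm              # (mu_bar_j, kappa_j)
```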

π(κ | −) ∝ π(κ) c^{-n}(κ) κ^{-nd/2} e^{nκ} · κ^{nd/2} exp{−κ(n − Σ_i x_i^T μ_{S_i})}.  (4.2)

If we set

π(κ) ∝ c^n(κ) κ^{a + nd/2 − 1} e^{−κ(n + b)},  a, b > 0,  (4.3)

the posterior simplifies to

Gam(a + nd/2, b + n − Σ_i x_i^T μ_{S_i}).  (4.4)
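With the prior (4.3), updating κ therefore reduces to a single gamma draw from (4.4); a sketch (note that numpy parameterizes the gamma by shape and scale, so the rate in (4.4) enters as its reciprocal):

```python
import numpy as np

def draw_kappa(x, mu, S, a, b, d, rng=None):
    """Gibbs draw of kappa from (4.4): Gam(a + n d / 2, b + n - sum_i x_i^T mu_{S_i})."""
    rng = np.random.default_rng(rng)
    n = x.shape[0]
    resid = n - np.sum(np.einsum('ij,ij->i', x, mu[S]))  # n - sum_i x_i . mu_{S_i}
    shape = a + n * d / 2.0
    rate = b + resid
    return rng.gamma(shape, scale=1.0 / rate)
```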

We can make the MCMC more efficient by marginalizing out μ while updating κ. In particular

π(κ | S, −) ∝ c^{-n}(κ) π(κ) ∏_{j: m_S(j) > 0} c(||κ X_j + κ_0 μ_0||)

with m_S(j) = Σ_i I(S_i = j). This is easy to compute in the case d = 2, κ_0 = 0 and π ~ Gam(a, b), when it simplifies to

π(κ | S, −) ∝ Gam(κ; n − Σ_j I(m_S(j) > 0) + a, n − Σ_j ||X_j|| + b) × {1 − exp(−2κ)}^{−n} ∏_{j: m_S(j) > 0} {1 − exp(−2κ||X_j||)}.

For this choice, we suggest a Metropolis-Hastings proposal that corresponds to the gamma component. This leads to a high acceptance probability when κ is large, and has performed well in the general cases we have considered.
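This Metropolis-Hastings step can be sketched as follows for d = 2, κ_0 = 0 and π ~ Gam(a, b): the gamma factor above serves as an independence proposal, so only the remaining correction factors enter the acceptance ratio (illustrative code, not the authors' implementation):

```python
import numpy as np

def mh_update_kappa(kappa, resultant_norms, n, a, b, rng=None):
    """One MH update of kappa for d = 2, kappa0 = 0, pi ~ Gam(a, b).
    Target: Gam(kappa; alpha, beta) * h(kappa); proposing from the gamma part
    leaves only h in the acceptance ratio.
    resultant_norms : ||X_j|| for the occupied clusters j."""
    rng = np.random.default_rng(rng)
    r = np.asarray(resultant_norms, dtype=float)
    alpha = n - r.size + a              # n - #occupied clusters + a
    beta = n - r.sum() + b              # n - sum_j ||X_j|| + b
    def log_h(k):
        # h(k) = (1 - e^{-2k})^{-n} * prod_j (1 - e^{-2k ||X_j||})
        return -n * np.log1p(-np.exp(-2 * k)) + np.sum(np.log1p(-np.exp(-2 * k * r)))
    prop = rng.gamma(alpha, scale=1.0 / beta)
    if np.log(rng.uniform()) < log_h(prop) - log_h(kappa):
        return prop
    return kappa
```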

In the predictive probability expression in (2.6), the integral simplifies to

(a_j/Σ_i a_i) c^{-1}(κ) c^{-1}(κ_0) c(||κ x_{n+1} + κ_0 μ_0||).

5. Simulation Examples

5.1. Classification

We draw iid samples on S9 × Y, Y = {1, 2, 3}, from

f_t(x, y) = (1/3) Σ_{l=1}^3 I(y = l) vMF(x; μ_l, 200)

where μ_1 = (1, 0, …)^T, μ_j = cos(0.2) μ_1 + sin(0.2) v_j for j = 2, 3, v_2 = (0, 1, 0, …)^T and v_3 = (0, 0.5, √0.75, 0, …)^T. Hence the three response classes y ∈ {1, 2, 3} are equally likely, and the distribution of the features within each class is a vMF on S9 with a distinct location parameter. We purposely chose the separation between the kernel locations to be small, so that the classification task is challenging.
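The geometry of these kernel locations is easy to verify numerically; the sketch below constructs μ_1, μ_2, μ_3 on S9 and checks that each is a unit vector at geodesic distance 0.2 from μ_1:

```python
import numpy as np

mu1 = np.zeros(10); mu1[0] = 1.0                      # (1, 0, ..., 0)
v2 = np.zeros(10); v2[1] = 1.0                        # (0, 1, 0, ...)
v3 = np.zeros(10); v3[1], v3[2] = 0.5, np.sqrt(0.75)  # (0, 0.5, sqrt(0.75), 0, ...)
mu2 = np.cos(0.2) * mu1 + np.sin(0.2) * v2
mu3 = np.cos(0.2) * mu1 + np.sin(0.2) * v3

for m in (mu1, mu2, mu3):
    assert abs(np.linalg.norm(m) - 1.0) < 1e-12       # all lie on the sphere
# geodesic distance from mu1 is arccos of the inner product
assert abs(np.arccos(mu1 @ mu2) - 0.2) < 1e-10
assert abs(np.arccos(mu1 @ mu3) - 0.2) < 1e-10
# mu2 . mu3 = cos^2(0.2) + 0.5 sin^2(0.2), so mu2 and mu3 are slightly
# closer to each other than either is to mu1
assert np.arccos(mu2 @ mu3) < 0.2
```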

We implemented the approach described in §2.3 to perform nonparametric Bayes classification. The hyperparameters were chosen to be w_0 = 1, P_0 = vMF(μ_n, 10) ⊗ Diri(1, 1, 1), with μ_n the feature sample extrinsic mean, and π as in (4.3) with a = 1, b = 0.1. Cross-validation was used to assess classification performance, with posterior computation applied to a training sample of size 200 and the results used to predict y given the x values for subjects in a test sample of size 100. The MCMC algorithm was run for 5 × 10⁴ iterations after a 10⁴ iteration burn-in. Based on examination of trace plots for the predictive probabilities of y for representative test subjects, the proposed algorithm exhibits good rates of convergence and mixing. Note that we purposely avoid examining trace plots for component-specific parameters due to label switching. The out-of-sample misclassification rates for categories y = 1, 2 and 3 were 18.9%, 9.7% and 12.5%, respectively, with an overall rate of 14%.

As an alternative method for flexible model-based classification, we considered a discriminant analysis approach, which models the conditional density of x given y as a finite mixture of 10-dimensional Gaussians. In the literature it is very common to treat data lying on a hypersphere as if the data had support in a Euclidean space to simplify the analysis. Using the EM algorithm to fit the finite mixture model, we encountered singularity problems when allowing more than two Gaussian components per response class. Hence, we present the results only for mixtures of one or two multivariate Gaussian components. In the one component case, we obtained class-specific misclassification rates of 27%, 12.9% and 18.8%, with the overall rate being 20%. The corresponding results for the two component mixture were 21.6%, 16.1% and 28.1% with an overall misclassification rate of 22%.

Hence, the results from a parametric Gaussian discriminant analysis and a mixture of Gaussians classifier were much worse than those for our proposed Bayesian nonparametric approach. There are several possible factors contributing to the improvement in performance. Firstly, the discriminant analysis approach requires separate fitting of different mixture models to each of the response categories. When the amount of data in each category is small, it is difficult to reliably estimate all these parameters, leading to high variance and unstable estimates. In contrast our approach of joint modeling of ft using a DPM favors a more parsimonious representation. Secondly, inappropriately modeling the data as having support on a Euclidean space has some clear drawbacks. The size of the space over which the densities are estimated is increased from a compact subset S9 to an unbounded space ℜ10. This can lead to an inflated variance and difficulties with convergence of EM and MCMC algorithms. In addition, the properties of the approach are expected to be poor even in larger samples. As Gaussian mixtures give zero probability to the embedded hypersphere, one cannot expect strong posterior consistency.

5.2. Hypothesis Testing

We draw an iid sample of size 100 on S9 × Y, Y = {1, 2, 3}, from the distribution

f_t(x, y) = (1/3) Σ_{l=1}^3 I(y = l) Σ_{j=1}^3 w_{lj} vMF(x; μ_j, 200),

where μj, j = 1, 2, 3 are as in the earlier example and the weights {wlj} are chosen so that w11 = 1 and wlj = 0.5 for l = 2, 3 and j = 2, 3. Hence, in group y = 1, the features are drawn from a single vMF density, while in groups y = 2 and 3, the feature distributions are equally weighted mixtures of the same two vMFs.

Letting f_j denote the conditional density of X given Y = j for j = 1, 2, 3, the global null hypothesis of no difference among the three groups is H_0: f_1 = f_2 = f_3, while the alternative H_1 is that they are not all the same. We set the hyperparameters as w_0 = 1, P_0 = vMF(μ_n, 10) ⊗ Diri(a), with μ_n the X-sample extrinsic mean, b = a = (0.28, 0.36, 0.36), the sample proportions of observations from the three groups, and a prior π on κ as in (4.3) with a = 1 and b = 0.1. We ran the proposed MCMC algorithm for calculating the Bayes factor (BF) in favor of H_1 over H_0 for 6 × 10⁴ iterations, updating the cluster labels S in 4 blocks of 25 each iteration. The starting value of S was obtained by the k-means algorithm (k = 10) applied to the X component of the sample using the geodesic distance on S9, and we started with κ = 200. The trace plots exhibit a good rate of convergence of the algorithm. After discarding a burn-in of 4 × 10⁴ iterations, the estimated BF was 2.23 × 10¹⁵, suggesting strong evidence in the data in favor of H_1. We tried multiple starting points and different hyperparameter choices and found the conclusions to be robust, with the estimated BFs not exactly the same but within an order of magnitude. We also obtained similar estimates using substantially shorter and longer chains.

We can also use the proposed methodology for pairwise hypothesis testing of H_{0,ll′}: f_l = f_{l′} against the alternative H_{1,ll′}: f_l ≠ f_{l′} for any pair l, l′ with l ≠ l′. The analysis is otherwise implemented exactly as in the global hypothesis testing case. The resulting BFs in favor of H_{1,ll′} over H_{0,ll′} for the different choices of (l, l′) are shown in Table 1. We obtain very large BFs in testing differences between groups 1 and 2 and between groups 1 and 3, but a moderately small BF for testing a difference between groups 2 and 3, suggesting mild evidence that these two groups are equal. These conclusions are all consistent with the truth. We have noted a general tendency for the BF in favor of the alternative to be large when the alternative is true even in modest sample sizes, suggesting a rapid rate of convergence under the alternative, in agreement with our theoretical results. When the null is true, the BF appears to converge to zero based on empirical results in our simulations, but at a slow rate.

Table 1.

Nonparametric Bayes and frequentist test results for data simulated for three groups with the second and third groups identical.

groups    BF           p-value
(1,2,3)   2.3 × 10¹⁵   2 × 10⁻⁶
(1,2)     2.4 × 10⁴    1.8 × 10⁻⁴
(1,3)     1.7 × 10⁶    1.5 × 10⁻⁵
(2,3)     0.235        0.43

For comparison, we also considered a frequentist nonparametric test for detecting differences in the groups based on comparing the sample extrinsic means of the f_l. The test statistic has an asymptotic X²_{d(L−1)} distribution, where d = 9 is the feature space dimension and L is the number of groups being compared. It takes the expression n Σ_{j=1}^L p̂_j X̄_j^T B (B^T Σ̂ B)^{-1} B^T X̄_j, where n = 100 is the sample size, p̂_j is the sample proportion in group j, X̄_j is the X sample mean for group j, X̄ denotes the overall sample mean, Σ̂ is the sample covariance, and B is a matrix whose columns form an orthonormal basis for the tangent space of S9 at μ̂_n = X̄/||X̄||. The statistic is obtained via a Taylor expansion of the map x ↦ x/||x|| from ℜ^{d+1} to S^d. For more details, see §4.8 of [4]. The corresponding p-values are shown in Table 1. The conclusions are all consistent with those from the nonparametric Bayes approach.
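A sketch of this test statistic, built directly from the expression above (illustrative code with assumed helper names; §4.8 of [4] gives the precise estimator):

```python
import numpy as np
from scipy.stats import chi2
from scipy.linalg import null_space

def extrinsic_mean_test(x, y, L):
    """Chi-square test comparing group sample means projected onto the tangent
    space of the sphere at the overall sample extrinsic mean.
    x : (n, d+1) array of unit vectors; y : group labels in {0, ..., L-1}."""
    n, p = x.shape
    d = p - 1
    xbar = x.mean(axis=0)
    mu_hat = xbar / np.linalg.norm(xbar)   # overall sample extrinsic mean
    B = null_space(mu_hat[None, :])        # (d+1) x d orthonormal tangent basis
    Sigma = np.cov(x.T)                    # sample covariance of the features
    W = np.linalg.inv(B.T @ Sigma @ B)
    stat = 0.0
    for j in range(L):
        xj = x[y == j]
        p_hat = xj.shape[0] / n
        t = B.T @ xj.mean(axis=0)          # tangent-space coordinates of group mean
        stat += n * p_hat * t @ W @ t
    return stat, chi2.sf(stat, d * (L - 1))
```

Note that Σ_j p̂_j B^T X̄_j = B^T X̄ = 0, since X̄ is proportional to μ̂_n and hence orthogonal to the tangent space, so the statistic implicitly compares deviations of the group means from the pooled mean.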

5.3. Testing with No Differences in Mean

In this example, we draw iid samples on S2 × Y, Y = {1, 2}, from the distribution

f_t(x, y) = (1/2) Σ_{l=1}^2 I(y = l) Σ_{j=1}^3 w_{lj} vMF(x; μ_j, 200),

where the weight matrix is w = [1, 0, 0; 0, 0.5, 0.5], μ_1 = (1, 0, 0)^T, μ_j = cos(0.2) μ_1 + sin(0.2) v_j (j = 2, 3) and v_2 = −v_3 = (0, 1, 0)^T. The two groups are equally likely; within the first, the features follow a single vMF, while within the second they follow an equally weighted mixture of two different vMFs. The locations μ_j are chosen so that both groups have the same extrinsic mean μ_1.

We draw 10 samples of 50 observations each from the model f_t and carry out hypothesis testing for association between X and Y via our method and the frequentist one. The prior, hyperparameters and the algorithm for Bayes factor (BF) computation are as in the earlier example. In each case we get insignificant p-values, often over 0.5, but very high BFs, often exceeding 10⁶. The values are listed in Table 2.

Table 2.

Nonparametric Bayes and frequentist test results for 10 simulations of 50 observations each for two groups with same population means.

BF 6.1e9 6.4e8 1.3e9 4.3e8 703.1 4.4e7 42.6 4.7e6 1.9e6 379.1
p-value 1.00 0.48 0.31 0.89 0.89 0.49 0.71 0.53 0.56 0.60

The frequentist test fails because it relies on comparing the group-specific sample extrinsic means, and in this example the difference between them is negligible. Our method, on the other hand, compares the full conditional distributions and hence can detect differences that are not reflected in the means.

6. Applications

6.1. Magnetization direction data

In this example from [16], measurements of remanent magnetization in red silts and claystones were made at four locations. This yields samples of directions on the sphere S2 from four groups, with sample sizes 36, 39, 16 and 16. The goal is to compare the magnetization direction distributions across the groups and test for any significant difference. Figure 1 shows a 3D plot of the sample clouds; the plot suggests no major differences. To test this statistically, we calculate the Bayes factor (BF) in favor of the alternative, as in §5.2. As mixing was not quite as good as in the simulated examples, we implemented label switching moves and updated the cluster configurations in two blocks of sizes 54 and 53. The estimated BF was ≈ 1, suggesting no evidence in favor of the alternative hypothesis that the distribution of magnetization directions varies across locations.

Figure 1.

Figure 1

3D coordinates of 4 groups in §6.1: 1(r), 2(b), 3(g), 4(c).

To assess sensitivity to the prior specification, we repeated the analysis with different hyperparameter values of a, b equal to the proportions of samples within each group, and with P_{01} corresponding to a uniform distribution on the sphere. In addition, we tried different starting clusterings of the data, with a default choice obtained by running k-means with 10 clusters. In each case we obtained BF ≈ 1, so the results were robust.

In Example 7.7 of [17], a coordinate-based parametric test was conducted to compare mean directions in these data, producing a p-value of 1 − 1.4205 × 10⁻⁵ based on a X²_6 statistic. They also compared the mean directions for the first two groups and obtained a non-significant p-value. Repeating this two-sample test using our Bayesian nonparametric method, we obtained a Bayes factor of 1.00. The nonparametric frequentist test from §5.2 yields p-values of 0.06 and 0.38 for the two tests.

6.2. Volcano location data

The NOAA National Geophysical Data Center Volcano Location Database contains information on the locations and characteristics of volcanoes across the globe. The locations, in latitude-longitude coordinates, are plotted in Figure 2. We are interested in testing whether there is any association between the location and the type of a volcano. We consider the three most common types, Strato, Shield and Submarine, with data available for 999 volcanoes of these types worldwide. Their location coordinates are shown in Figure 3. Denoting by X the volcano location, which lies on S2, and by Y its type, which takes values in {1, 2, 3}, we compute the Bayes factor (BF) for testing whether X and Y are independent.

Figure 2.

Figure 2

Longitude-Latitude coordinates of volcano locations in §6.2.

Figure 3.

Figure 3

Coordinates of the locations of the 3 major volcano types: Strato (r), Shield (b), Submarine (g), together with the sample extrinsic mean location of each type and the full-sample extrinsic mean.

As should be apparent from the figures, the volcano data are particularly challenging in terms of density estimation because the locations tend to be concentrated along fault lines. Potentially, data on distance to the closest fault, volcano elevation and other information could be utilized to improve performance, but we do not have access to such data. It would be straightforward to include such additional predictors, as explained in Remark 2.2. Without such information, the data present a challenging test case for the methodology, in that one may need very many vMF kernels to accurately characterize the density of volcano locations across the globe, with moderate to large numbers of kernels leading to challenging mixing issues. Indeed, we did encounter sensitivity to the starting cluster configuration in our initial analyses.

We found that one of the issues exacerbating the problem with mixing of the cluster allocation was the ordering of the weights in the stick-breaking representation utilized by the exact block Gibbs sampler. Although label switching moves can lead to some improvement, they proved insufficient in this case. Hence, we modified the computational algorithm slightly to instead use the finite Dirichlet approximation to the Dirichlet process proposed in [27]. The finite Dirichlet treats the components as exchangeable and so eliminates sensitivity to the indices of the starting clusters, which we obtained using k-means with 50 clusters. We used K = 50 as the dimension of the finite Dirichlet and hence as the upper bound on the number of occupied clusters. Another issue that led to mixing problems was the use of a hyperprior on κ. In particular, when the initial clusters were not well chosen, the kernel precision would tend to drift towards smaller than optimal values, and as a result too few clusters would be occupied to adequately fit the data. We did not observe such issues in a variety of other simulated and real data applications, but the volcano data are particularly difficult, as noted above.

To address this second issue, we chose the kernel precision parameter κ by cross-validation. In particular, we split the sample into training and test sets, and then ran our Bayesian nonparametric analysis on the training data separately for a wide range of κ values between 0 and 1,000. We chose the value that produced the highest expected posterior log likelihood in the test data, leading to κ̂ = 80. In this analysis and the subsequent analyses for estimating the BF, we chose the prior on the mixture weights to be Diri((w_0/K) 1_K) (K = 50). The other hyperparameters were chosen to be w_0 = 1, a = b = (0.71, 0.17, 0.11), the sample proportions of the different volcano types, κ_0 = 10, and μ_0 equal to the X-sample extrinsic mean. We collected 5 × 10⁴ MCMC iterations after discarding a burn-in of 10⁴. Using a fixed bandwidth considerably improved the convergence rate of the algorithm.

Based on the complete data set of 999 volcanoes, the resulting BF in favor of the alternative was estimated to be over 10¹⁰⁰, providing conclusive evidence that the different types of volcanoes have different spatial distributions across the globe. For the same fixed κ̂, we reran the analysis for a variety of alternative hyperparameter values and different starting points, obtaining similar BF estimates and the same conclusion. We also repeated the analysis for a randomly selected subsample of 300 observations, obtaining BF = 5.4 × 10¹¹. Repeating the test on other subsamples also resulted in very high BFs, and we obtained a high BF when repeating the analysis with a hyperprior on κ.

For comparison, we performed the asymptotic X² test described in §5.2, obtaining a p-value of 3.6 × 10⁻⁷, which again favors H_1. The large sample sizes for the three types (713, 172, 114) justify the use of asymptotic theory. However, given that the volcanoes are spread all over the globe, the validity of the assumption that the three conditionals have unique extrinsic means may be questioned.

We also performed a coordinate-based test comparing the means of the latitude-longitude coordinates of the three subsamples using a X²_4 statistic. The three coordinate means are (12.6, 27.9), (21.5, 9.2) and (9.97, 21.5) (latitude, longitude). The value of the statistic is 17.07 and the asymptotic p-value equals 1.9 × 10⁻³, which is larger by orders of magnitude than its coordinate-free counterpart, but still significant. Coordinate-based methods, however, can be very misleading because of the discontinuity at the boundaries; they heavily distort the geometry of the sphere, as is evident from the figures.

7. Discussion

We have proposed a novel Bayesian approach for classification and testing relying on modeling the joint distribution of the categorical response and continuous predictors as a Dirichlet process product mixture. The product mixture likelihood includes a multinomial for the categorical response and an arbitrary kernel for the predictors, with dependence induced through the DP prior on the unknown joint mixing measure. By modifying the kernel for the predictors, one can modify the support, with multivariate Gaussian kernels for predictors in ℜp and von Mises-Fisher kernels for predictors on the hypersphere. For other predictor spaces, one can appropriately modify the kernel.

Although our focus has been on hyperspherical predictors for concreteness, the proposed product mixture formulation is broadly applicable to classification problems for predictors in general spaces and we can easily consider predictors having a variety of supports. For example, some predictors can be in a Euclidean space and some on a hypersphere. The framework has some clear practical advantages over frequentist and nonparametric Bayes discriminant analysis approaches, which rely on separately modeling the conditional distributions of the feature (predictor) distributions specific to each response category.

One of our primary contributions was showing theoretical properties, including large support and posterior consistency, in modeling of the classification function. In addition, we have added to the under-developed literature on nonparametric Bayes testing of differences in distributions, not only on ℜp but on more general manifolds. We provide a novel computational approach for estimating Bayes factors as well as prove theoretical results on Bayes factor consistency. The proposed method can be implemented in broad applications.

An area of substantial interest in ongoing research is the development of models that reduce dimensionality through projecting predictors X to a lower-dimensional subspace. Such dimensionality reduction is critical in addressing the curse of dimensionality that arises inevitably in estimating classification or regression functions with high-dimensional predictors. [7] develop a promising approach for nonparametric Bayes learning of affine subspaces for density estimation and classification in Euclidean spaces, but it remains to develop related methods in non-Euclidean spaces while also obtaining theory on rates of convergence.

Acknowledgments

This research was partially supported by grant R01 ES017240 from the National Institute of Environmental Health Sciences (NIEHS) of the National Institutes of Health (NIH).

8. Appendix

8.1. Proof of Theorem 2.1

Before proving the Theorem, we prove the following Lemma.

Lemma 8.1

Under assumptions A2 and A4,

lim_{κ→∞} sup{|f(x, y; P_t, κ) − f_t(x, y)|: (x, y) ∈ X × Y} = 0.
Proof

From the definition of Pt, we can write

f(x, y; P_t, κ) = ∫_X K(x; μ, κ) φ_y(μ) λ_1(dμ)

for φ_y(μ) = f_t(μ, y). Then from A4, it follows that φ_y is continuous for all y ∈ Y. Hence from A2, it follows that

lim_{κ→∞} sup_{x∈X} |f_t(x, y) − ∫_X K(x; μ, κ) f_t(μ, y) λ_1(dμ)| = 0

for any yInline graphic. Since Inline graphic is finite, the proof is complete.

Proof of Theorem 2.1

Throughout this proof we will view X as a compact metric space and M(X × S^{L−1}) as a topological space under the weak topology. From Lemma 8.1, it follows that there exists a κ_t ≡ κ_t(ε) > 0 such that

sup_{x,y} |f(x, y; P_t, κ) − f_t(x, y)| < ε/3  (8.1)

for all κ ≥ κ_t. From assumption A3, it follows that by choosing κ_t sufficiently large, we can ensure that (P_t, κ_t) ∈ supp(Π_1). From assumption A1, it follows that K is uniformly continuous at κ_t, i.e. there exists an open set W(ε) ⊆ ℜ⁺ containing κ_t such that

sup_{x,μ∈X} |K(x; μ, κ) − K(x; μ, κ_t)| < ε/3  ∀ κ ∈ W(ε).

This in turn implies that, for all κ ∈ W(ε) and P ∈ M(X × S^{L−1}),

sup_{x,y} |f(x, y; P, κ) − f(x, y; P, κ_t)| < ε/3  (8.2)

because the left expression in (8.2) is

sup_{x,y} |∫ ν_y {K(x; μ, κ) − K(x; μ, κ_t)} P(dμ dν)| ≤ sup_{x,μ∈X} |K(x; μ, κ) − K(x; μ, κ_t)|.

Since X is compact and K(·; ·, κ_t) is uniformly continuous on X × X, we can cover X by finitely many open sets U_1, …, U_K such that

sup_{μ∈X, x,x′∈U_i} |K(x; μ, κ_t) − K(x′; μ, κ_t)| < ε/12  (8.3)

for each i ≤ K. For fixed x, y, κ, f(x, y; P, κ) is a continuous function of P. Hence for x_i ∈ U_i and y = j ∈ Y,

W_{ij}(ε) = {P ∈ M(X × S^{L−1}): |f(x_i, j; P, κ_t) − f(x_i, j; P_t, κ_t)| < ε/6},

1 ≤ i ≤ K, 1 ≤ j ≤ L, define open neighborhoods of P_t. Let W̃(ε) = ∩_{i,j} W_{ij}(ε), which is also an open neighborhood of P_t. For a general x ∈ X and y ∈ Y, find U_i containing x. Then for any P ∈ W̃(ε),

|f(x, y; P, κ_t) − f(x, y; P_t, κ_t)| ≤ |f(x, y; P, κ_t) − f(x_i, j; P, κ_t)| + |f(x_i, j; P, κ_t) − f(x_i, j; P_t, κ_t)| + |f(x_i, j; P_t, κ_t) − f(x, y; P_t, κ_t)|.  (8.4)

Denote the three terms on the right of (8.4) by T_1, T_2 and T_3. Since x ∈ U_i, it follows from (8.3) that T_1, T_3 < ε/12. Since P ∈ W̃(ε), T_2 < ε/6 by the definition of W̃(ε). Hence sup_{x,y} |f(x, y; P, κ_t) − f(x, y; P_t, κ_t)| < ε/3. Therefore

W_2(ε) ≡ {P: sup_{x,y} |f(x, y; P, κ_t) − f(x, y; P_t, κ_t)| < ε/3}

contains W̃(ε). Since (P_t, κ_t) ∈ supp(Π_1) and W̃(ε) × W(ε) contains an open neighborhood of (P_t, κ_t), it follows that

Π_1(W_2(ε) × W(ε)) > 0.

Let (P, κ) ∈ W_2(ε) × W(ε). Then for (x, y) ∈ X × Y,

|f(x, y; P, κ) − f_t(x, y)| ≤ |f(x, y; P, κ) − f(x, y; P, κ_t)| + |f(x, y; P, κ_t) − f(x, y; P_t, κ_t)| + |f(x, y; P_t, κ_t) − f_t(x, y)|.  (8.5)

The first term on the right of (8.5) is < ε/3 since κ ∈ W(ε). The second is < ε/3 because P ∈ W_2(ε). The third is also < ε/3, which follows from equation (8.1). Therefore

Π_1({(P, κ): sup_{x,y} |f(x, y; P, κ) − f_t(x, y)| < ε}) > 0.

This completes the proof.

8.2. Proof of Theorem 2.3

The proof uses Proposition 8.2, proved in [6]. Let M be a compact metric space. Denote by D(M) the space of all probability densities on M with respect to some fixed finite base measure τ, endowed with the total variation distance ||·||. Let z^n = {z_i}_{i=1}^n be an iid sample from some density f_t on M. Consider the collection of mixture densities on M given by

f(m; P, κ) = ∫_M K(m; μ, κ) P(dμ),  m ∈ M, κ ∈ ℜ⁺, P ∈ M(M),  (8.6)

with ∫_M K(m; μ, κ) τ(dm) = 1. Set a prior Π_1 on M(M) × ℜ⁺, which induces a prior Π on D(M) through (8.6). For D̃ ⊆ D(M) and ε > 0, the L1-metric entropy N(ε, D̃) is defined as the logarithm of the minimum number of ε-sized (or smaller) L1 subsets needed to cover D̃.

Proposition 8.2

For a positive sequence {κn} diverging to ∞, define

D_n = {f(·; P, κ): P ∈ M(M), κ ∈ [0, κ_n]}.

(a) Under assumptions A6–A8, given any ε > 0 and n sufficiently large, N(ε, D_n) ≤ C(ε) κ_n^{a_1 a_3} for some C(ε) > 0. (b) Under the further assumption A9, the posterior probability of any total variation neighborhood of f_t converges to 1 a.s. f_t, provided f_t is in the KL support of Π.

Proof of Theorem 2.3

For a density f ∈ D(X × Y), let p(y) be the marginal probability of Y = y and g(x, y) be the conditional density of X = x given Y = y, so that f(x, y) = p(y) g(x, y). For f_1, f_2 ∈ D(X × Y),

||f_1 − f_2|| = ∫ |f_1(x, y) − f_2(x, y)| λ(dx dy) = Σ_{j=1}^L ∫ |p_1(j) g_1(x, j) − p_2(j) g_2(x, j)| λ_1(dx) ≤ max_j ||g_1(·, j) − g_2(·, j)|| + Σ_j |p_1(j) − p_2(j)|.  (8.7)

Hence an ε diameter ball in D(X × Y) contains the intersection of L many ε/2 diameter balls from D(X) with an ε/2 diameter subset of S^{L−1}. For any f(·; P, κ) as in (2.1), its X-conditional g(·, j) for j ∈ Y can be expressed as

g(x, j) = [∫_{X×S^{L−1}} ν_j K(x; μ, κ) P(dμ dν)] / [∫_{X×S^{L−1}} ν_j P(dμ dν)] = ∫_X K(x; μ, κ) P_j(dμ),  with  P_j(dμ) = [∫_{S^{L−1}} ν_j P(dμ dν)] / [∫_{X×S^{L−1}} ν_j P(dμ dν)].

Hence g(·, j) is of the form (8.6) with M = X. Define

D̃_n = {f(·; P, κ): P ∈ M(X × S^{L−1}), κ ∈ [0, κ_n]}.  Then  D̃_n ⊆ {f ∈ D(X × Y): g(·, j) ∈ D_n ∀ j ∈ Y},  where  D_n = {g(·; P, κ): P ∈ M(X), κ ∈ [0, κ_n]}.  (8.8)

From Proposition 8.2(a), N(ε, D_n) is of order at most κ_n^{a_1 a_3}, and hence from (8.7) and (8.8), N(ε, D̃_n) ≤ C κ_n^{a_1 a_3}, with C depending on ε. Therefore, from part (b) of the Proposition, strong posterior consistency follows under assumptions A1–A9.

8.3. Proof of Corollary 2.4

Proof
  1. For any y ∈ Y,
    ∫_X |p(y, x) − p_t(y, x)| g_t(x) λ_1(dx) = ∫_X |f_t(x, y) − f(x, y) + p(y, x) g(x) − p(y, x) g_t(x)| λ_1(dx) ≤ ∫_X |f_t(x, y) − f(x, y)| λ_1(dx) + ∫_X |g_t(x) − g(x)| λ_1(dx) ≤ 2 Σ_{j=1}^L ∫_X |f(x, j) − f_t(x, j)| λ_1(dx) = 2 ||f − f_t||_1,

    and hence any neighborhood of p_t of the form {p: max_{y∈Y} ∫_X |p(y, x) − p_t(y, x)| g_t(x) λ_1(dx) < ε} contains an L1 neighborhood of f_t. Now part (a) follows from strong consistency of the posterior distribution of f.

  2. Since X is compact, f_t being continuous and positive implies that c = inf_{x∈X} g_t(x) > 0. Hence
    ∫_X |p(y, x) − p_t(y, x)| w(x) λ_1(dx) ≤ c^{-1} sup_x w(x) ∫_X g_t(x) |p(y, x) − p_t(y, x)| λ_1(dx).

    Now the result follows from part (a).

8.4. Proof of Theorem 3.1

The proof uses Lemma 8.3. This lemma is fundamental to proving weak posterior consistency via the Schwartz theorem, and its proof can be found in standard texts that present that theorem.

Lemma 8.3

(a) If Π includes ft in its KL support, then

lim inf_{n→∞} exp(nβ) ∫ ∏_i [f(x_i, y_i)/f_t(x_i, y_i)] Π(df) = ∞

a.s. f_t for any β > 0. (b) If U is a weak open neighborhood of f_t and Π_0 is a prior on D(X × Y) with support in U^c, then there exists a β_0 > 0 for which

lim_{n→∞} exp(nβ_0) ∫ ∏_i [f(x_i, y_i)/f_t(x_i, y_i)] Π_0(df) = 0

a.s. ft.

Proof of Theorem 3.1

Express BF as

BF = {∏_i p_t(y_i)} [D(b)/D(b_n)] × [∫ ∏_i (f(x_i, y_i)/f_t(x_i, y_i)) Π(df)] / [∫ ∏_i (g(x_i) p_t(y_i)/f_t(x_i, y_i)) Π(df)] = T_1 T_2/T_3

with T_1 = {∏_i p_t(y_i)} D(b)/D(b_n), T_2 = ∫ ∏_i (f(x_i, y_i)/f_t(x_i, y_i)) Π(df), and T_3 = ∫ ∏_i (g(x_i) p_t(y_i)/f_t(x_i, y_i)) Π(df).

Since Π satisfies the KL condition, Lemma 8.3(a) implies that lim inf_{n→∞} exp(nβ) T_2 = ∞ a.s. for any β > 0.

Let U be the space of all dependent densities; that is,

U^c = {f ∈ D(X × Y): f(x, y) = g(x) p(y) a.s. λ(dx dy)}.

The prior Π induces a prior Π0 on Uc via f ↦ {Σj f (., j)}pt and T3 can be expressed as if(xi,yi)ft(xi,yi)Π0(df). It is easy to show that U is open under the weak topology and hence under H1 is a weak open neighborhood of ft. Then using Lemma 8.3(b), it follows that limn→∞ exp(0)T3 = 0 a.s. for some β0 > 0.

The proof is complete if we can show that $\liminf_{n\to\infty}\exp(n\beta)T_1=\infty$ a.s. for any $\beta>0$, or equivalently that $\log(T_1)=o(n)$ a.s. For a positive sequence $a_n$ diverging to $\infty$, Stirling's formula implies that $\log\Gamma(a_n)=a_n\log(a_n)-a_n+o(a_n)$. Express $\log(T_1)$ as

$$\sum_i\log(p_t(y_i))-\log(D(b_n))+o(n). \tag{8.9}$$
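The Stirling approximation invoked here is easy to verify numerically; `math.lgamma` gives $\log\Gamma$ directly:

```python
from math import lgamma, log

# log Gamma(a) = a log a - a + O(log a), so the remainder is o(a)
errs = [abs(lgamma(a) - (a * log(a) - a)) / a for a in (1e2, 1e4, 1e6)]
assert errs[0] > errs[1] > errs[2]   # relative error shrinks as a grows
assert errs[2] < 1e-4
```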

Since $p_t(j)>0$ for all $j=1,\dots,L$, by the SLLN,

$$\sum_i\log(p_t(y_i))=n\sum_j p_t(j)\log(p_t(j))+o(n)\quad\text{a.s.} \tag{8.10}$$

Let $b_{nj}=b_j+\sum_i I(y_i=j)$ be the $j$th component of $b_n$. Then $\lim_{n\to\infty}b_{nj}/n=p_t(j)$, that is, $b_{nj}=np_t(j)+o(n)$ a.s., and hence Stirling's formula implies that

$$\log(\Gamma(b_{nj}))=b_{nj}\log(b_{nj})-b_{nj}+o(n)=np_t(j)\log(p_t(j))-np_t(j)+b_{nj}\log(n)+o(n)\quad\text{a.s.},$$

which implies

$$\log(D(b_n))=\sum_{j=1}^{L}\log(\Gamma(b_{nj}))-\log\Gamma\Big(\sum_j b_j+n\Big)=n\sum_j p_t(j)\log(p_t(j))+o(n)\quad\text{a.s.} \tag{8.11}$$

From (8.9), (8.10) and (8.11), $\log(T_1)=o(n)$ a.s., which completes the proof.
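The conclusion $\log(T_1)=o(n)$ can also be checked by simulation; $p_t$ and the prior Dirichlet parameters $b$ below are hypothetical choices:

```python
from math import lgamma
import numpy as np

def log_D(b):
    # log of the Dirichlet normalizing constant prod_j Gamma(b_j) / Gamma(sum_j b_j)
    return sum(lgamma(bj) for bj in b) - lgamma(sum(b))

rng = np.random.default_rng(3)
pt = np.array([0.5, 0.3, 0.2])     # hypothetical true label probabilities p_t
b = np.ones(3)                     # hypothetical prior Dirichlet parameters

rates = []
for n in (10**3, 10**5):
    y = rng.choice(3, size=n, p=pt)
    bn = b + np.bincount(y, minlength=3)            # posterior parameters b_n
    log_T1 = np.log(pt[y]).sum() + log_D(b) - log_D(bn)
    rates.append(abs(log_T1) / n)

assert rates[0] < 0.1 and rates[1] < 0.01           # log(T1)/n -> 0
```

The cancellation of the $n\sum_j p_t(j)\log(p_t(j))$ terms in (8.10) and (8.11) is what makes $|\log(T_1)|/n$ shrink here.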


Contributor Information

Abhishek Bhattacharya, Theoretical Statistics & Mathematics Division, Indian Statistical Institute.

David Dunson, Department of Statistical Science, Duke University.

References

  • 1. Banerjee A, Dhillon IS, Ghosh J, Sra S. Clustering on the unit hypersphere using von Mises-Fisher distributions. Journal of Machine Learning Research. 2005;6:1345–1382.
  • 2. Barron A, Schervish MJ, Wasserman L. The consistency of posterior distributions in nonparametric problems. Annals of Statistics. 1999;27:536–561.
  • 3. Basu S, Chib S. Marginal likelihood and Bayes factors for Dirichlet process mixture models. Journal of the American Statistical Association. 2003;98:224–235.
  • 4. Bhattacharya A, Bhattacharya R. Nonparametric Inference on Manifolds. Cambridge University Press; Cambridge: 2012.
  • 5. Bhattacharya A, Dunson D. Nonparametric Bayesian density estimation on manifolds with applications to planar shapes. Biometrika. 2010;97:851–865. doi: 10.1093/biomet/asq044.
  • 6. Bhattacharya A, Dunson D. Strong consistency of nonparametric Bayes density estimation on compact metric spaces. Annals of the Institute of Statistical Mathematics. 2011;63. doi: 10.1007/s10463-011-0341-x.
  • 7. Bhattacharya A, Page G, Dunson D. Density estimation and classification via Bayesian nonparametric learning of affine subspaces. 2011. arXiv:1105.5737v1. doi: 10.1080/01621459.2013.763566.
  • 8. Bhattacharya RN, Patrangenaru V. Large sample theory of intrinsic and extrinsic sample means on manifolds. Annals of Statistics. 2003;31:1–29.
  • 9. Bigelow J, Dunson D. Bayesian semiparametric joint models for functional predictors. Journal of the American Statistical Association. 2009;104:26–36. doi: 10.1198/jasa.2009.0001.
  • 10. Consonni G, Veronese P. Compatibility of prior specifications across linear models. Statistical Science. 2008;23:332–353.
  • 11. Dass SC, Lee J. A note on the consistency of Bayes factors for testing point null versus non-parametric alternatives. Journal of Statistical Planning and Inference. 2004;119:143–152.
  • 12. de Jonge R, van Zanten JH. Adaptive nonparametric Bayesian inference using location-scale mixture priors. Annals of Statistics. 2010;38:3300–3320.
  • 13. De la Cruz-Mesia R, Quintana FA, Müller P. Semiparametric Bayesian classification with longitudinal markers. Applied Statistics. 2007;56:119–137. doi: 10.1111/j.1467-9876.2007.00569.x.
  • 14. Dunson DB. Multivariate kernel partition process mixtures. Statistica Sinica. 2010;20:1395–1422.
  • 15. Dunson DB, Peddada SD. Bayesian nonparametric inference on stochastic ordering. Biometrika. 2008;95:859–874. doi: 10.1093/biomet/asn043.
  • 16. Embleton BJJ, McDonnell KL. Magnetostratigraphy in the Sydney Basin, Southeastern Australia. Journal of Geomagnetism and Geoelectricity. 1980;32:304.
  • 17. Fisher NI, Lewis T, Embleton BJJ. Statistical Analysis of Spherical Data. Cambridge University Press; New York: 1987.
  • 18. Fisher RA. Dispersion on a sphere. Proceedings of the Royal Society of London, Series A. 1953;217:295–305.
  • 19. Fraley C, Raftery AE. Model-based clustering, discriminant analysis and density estimation. Journal of the American Statistical Association. 2002;97:611–631.
  • 20. Ghosal S, Ghosh JK, Ramamoorthi RV. Posterior consistency of Dirichlet mixtures in density estimation. Annals of Statistics. 1999;27:143–158.
  • 21. Ghosal S, Lember J, van der Vaart A. Nonparametric Bayesian model selection and averaging. Electronic Journal of Statistics. 2008;2:63–89.
  • 22. Ghosal S, Roy A. Posterior consistency of a Gaussian process prior for nonparametric binary regression. Annals of Statistics. 2006;34:2413–2429.
  • 23. Hamsici OC, Martinez AM. Spherical-homoscedastic distributions: The equivalency of spherical and normal distributions in classification. Journal of Machine Learning Research. 2007;8:1583–1623.
  • 24. Hannah LA, Blei DM, Powell WB. Dirichlet process mixtures of generalized linear models. Journal of Machine Learning Research. 2011;12:1923–1953.
  • 25. Hastie T, Tibshirani R. Discriminant analysis by Gaussian mixtures. Journal of the Royal Statistical Society, Series B. 1996;58:155–176.
  • 26. Holmes CC, Caron F, Griffin JE, Stephens DA. Two-sample Bayesian nonparametric hypothesis testing. Technical Report. http://arxiv.org/abs/0910.5060.
  • 27. Ishwaran H, Zarepour M. Dirichlet prior sieves in finite normal mixtures. Statistica Sinica. 2002;12:941–963.
  • 28. Kalli M, Griffin JE, Walker SG. Slice sampling mixture models. Statistics and Computing. 2011;21:93–105.
  • 29. Kundu S, Dunson DB. Bayes variable selection in semiparametric linear models. 2011. arXiv:1108.2722. doi: 10.1080/01621459.2014.881153.
  • 30. Mardia KV, Jupp PE. Directional Statistics. John Wiley & Sons; West Sussex, England: 2000.
  • 31. Müller P, Erkanli A, West M. Bayesian curve fitting using multivariate normal mixtures. Biometrika. 1996;83:67–79.
  • 32. Pennell ML, Dunson DB. Nonparametric Bayes testing of changes in a response distribution with an ordinal predictor. Biometrics. 2008;64:413–423. doi: 10.1111/j.1541-0420.2007.00885.x.
  • 33. Schwartz L. On Bayes procedures. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete. 1965;4:10–26.
  • 34. Sethuraman J. A constructive definition of Dirichlet priors. Statistica Sinica. 1994;4:639–650.
  • 35. Shahbaba B, Neal R. Nonlinear models using Dirichlet process mixtures. Journal of Machine Learning Research. 2009;10:1829–1850.
  • 36. Ulrich G. Computer generation of distributions on the m-sphere. Applied Statistics. 1984;33:158–163.
  • 37. von Mises R. Über die "Ganzzahligkeit" der Atomgewichte und verwandte Fragen. Physikalische Zeitschrift. 1918;19:490–500.
  • 38. Walker SG. Sampling the Dirichlet mixture model with slices. Communications in Statistics - Simulation and Computation. 2007;36:45–54.
  • 39. Watson GS, Williams EJ. Construction of significance tests on the circle and sphere. Biometrika. 1956;43:344–352.
  • 40. Wood ATA. Simulation of the von Mises-Fisher distribution. Communications in Statistics - Simulation and Computation. 1994;23:157–164.
  • 41. Wu Y, Ghosal S. Kullback-Leibler property of kernel mixture priors in Bayesian density estimation. Electronic Journal of Statistics. 2008;2:298–331.
  • 42. Wu Y, Ghosal S. L1-consistency of Dirichlet mixtures in multivariate Bayesian density estimation. Journal of Multivariate Analysis. 2010. In press.
  • 43. Yau C, Papaspiliopoulos O, Roberts GO, Holmes C. Nonparametric hidden Markov models with applications in genomics. Journal of the Royal Statistical Society, Series B. 2010;73. doi: 10.1111/j.1467-9868.2010.00756.x.
