Published in final edited form as: J Comput Graph Stat. 2011 Mar 1;20(1):260–278. doi: 10.1198/jcgs.2011.09066

A Product Partition Model With Regression on Covariates

Peter Müller 1, Fernando Quintana 2, Gary L Rosner 3

Abstract

We propose a probability model for random partitions in the presence of covariates. In other words, we develop a model-based clustering algorithm that exploits available covariates. The motivating application is predicting time to progression for patients in a breast cancer trial. We proceed by reporting a weighted average of the responses of clusters of earlier patients. The weights should be determined by the similarity of the new patient’s covariates with the covariates of patients in each cluster. We achieve the desired inference by defining a random partition model that includes a regression on covariates. Patients with similar covariates are a priori more likely to be clustered together. Posterior predictive inference in this model formalizes the desired prediction.

We build on product partition models (PPM). We define an extension of the PPM to include a regression on covariates by adding to the cohesion function a new factor that increases the probability that experimental units with similar covariates are included in the same cluster. We discuss implementations suitable for any combination of continuous, categorical, count, and ordinal covariates.

An implementation of the proposed model as an R package is available for download.

Keywords: Clustering, Nonparametric Bayes, Variable selection

1. INTRODUCTION

We develop a probability model for clustering with covariates, that is, a probability model for partitioning a set of experimental units, where the probability of any particular partition is allowed to depend on covariates. The motivating application is inference in a clinical trial. The outcome is time to progression for breast cancer patients. The covariates include treatment dose, initial tumor burden, an indicator for menopause, and more. We wish to define a probability model for clustering patients with the specific feature that patients with equal or similar covariates should be a priori more likely to co-cluster than others.

Let i = 1, … , n, index experimental units, and let ρn = {S1, … , Skn} denote a partition of the n experimental units into kn subsets Sj. Let xi and yi denote the covariates and response reported for the ith unit. Let xn = (x1, … , xn) and yn = (y1, … , yn) denote the entire set of recorded covariates and response data, and let x*j = (xi, i ∈ Sj) and y*j = (yi, i ∈ Sj) denote covariates and response data arranged by clusters. Sometimes it is convenient to introduce cluster membership indicators ei ∈ {1, … , kn} with ei = j if i ∈ Sj, and use (kn, e1, … , en) to describe the partition. We call a probability model p(ρn) a clustering model, excluding in particular purely constructive definitions of clustering as a deterministic algorithm (without reference to probability models). In particular, the number of clusters kn is itself unknown. The clustering model p(ρn) implies a prior model p(kn). Many clustering models include, implicitly or explicitly, a sampling model p(yn|xn, ρn). Probability models p(ρn) and inference for clustering have been extensively discussed over the past few years. See the article of Quintana (2006) for a recent review. In this article we are interested in adding a regression, replacing p(ρn) with p(ρn|xn).

We focus on the product partition model (PPM). The PPM (Hartigan 1990; Barry and Hartigan 1993; Crowley 1997) constructs p(ρn) by introducing cohesion functions c(A) ≥ 0 for A ⊆ {1, … , n} that measure how tightly grouped the elements in A are thought to be, and defines a probability model for a partition ρn and data yn as

p(\rho_n) \propto \prod_{j=1}^{k_n} c(S_j) \quad\text{and}\quad p(y^n \mid \rho_n) = \prod_{j=1}^{k_n} p_j(y^\star_j).   (1.1)

Model (1.1) is conjugate. The posterior p(ρn|yn) is again in the same product form.

Alternatively, the species sampling model (SSM) (Pitman 1996; Ishwaran and James 2003) defines an exchangeable probability model p(ρn) that depends on ρn only indirectly through the cardinality of the partitioning subsets, p(ρn) = p(|S1|, … , |Skn|). The SSM can be alternatively characterized by a sequence of predictive probability functions (PPFs) that describe how individuals are sequentially assigned to either already formed clusters or to start new ones. The choice of the PPF is not arbitrary. One has to make sure that a sequence of random variables that are sampled by iteratively applying the PPF is exchangeable. The popular Dirichlet process (DP) model (Ferguson 1973) is a special case of a SSM. Moreover, the marginal distribution that a DP induces on partitions is also a PPM with cohesions c(Sj) = M(|Sj| − 1)! (Quintana and Iglesias 2003). Here M denotes the total mass parameter of the DP prior.
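
The equivalence between the DP-induced partition and a PPM with cohesions c(Sj) = M(|Sj| − 1)! can be made concrete by sampling from the implied model. The following is a minimal R sketch under that cohesion choice; it is our illustration, not code from the paper's accompanying package, and the function name sample_dp_partition is our own.

```r
# Minimal sketch: draw a random partition from the PPM implied by a
# DP(M, G*) prior, i.e., cohesions c(S_j) = M (|S_j| - 1)!, by
# sequentially applying the Polya urn PPF.
sample_dp_partition <- function(n, M) {
  e <- integer(n)                      # cluster membership indicators e_i
  e[1] <- 1
  for (i in 2:n) {
    sizes <- tabulate(e[1:(i - 1)])    # current cluster sizes n_j
    probs <- c(sizes, M)               # existing clusters, then a new one
    e[i] <- sample.int(length(probs), 1, prob = probs)
  }
  e
}

set.seed(1)
table(sample_dp_partition(100, M = 1)) # cluster sizes of one draw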

Model-based clustering (Banfield and Raftery 1993; Dasgupta and Raftery 1998; Fraley and Raftery 2002) implicitly defines a probability model on clustering by assuming a mixture model p(yi | θ, τ, kn) = Σ_{j=1}^{kn} τj pj(yi | θj), where θ = (θ1, … , θkn) and τ = (τ1, … , τkn). Together with a prior p(kn) on kn and p(θ, τ|kn), the mixture implicitly defines a probability model on clustering. Consider the equivalent hierarchical model

p(y_i \mid e_i = j, k_n, \theta, \tau) = p_j(y_i \mid \theta_j) \quad\text{and}\quad \Pr(e_i = j \mid k_n, \theta, \tau) = \tau_j.   (1.2)

The implied posterior distribution on (e1, … , en) and kn defines a probability model on ρn. Richardson and Green (1997) developed posterior simulation strategies for mixtures of normal models. Green and Richardson (1999) discussed the relationship to DP mixture models.

In this article we build on the PPM (1.1) to define a covariate-dependent random partition model by augmenting the PPM with an additional factor that induces the desired dependence on the covariates. We refer to the additional factor as similarity function. Similar approaches are discussed by Shahbaba and Neal (2009) and Park and Dunson (2010). Both effectively included the covariates as part of an augmented response vector. We discuss more details of these two and other alternative approaches in Section 5 where we also carry out a small Monte Carlo study for an empirical comparison.

In Section 2 we state the proposed model and considerations in choosing the similarity function. In Section 3 we show that the computational effort of posterior simulation remains essentially unchanged from PPM models without covariates. In Section 4 we propose specific choices of the similarity function for common data formats. Section 5 reviews some alternative approaches, and we summarize a small Monte Carlo study to compare some of these approaches. In Section 6 we show a simulation study and a data analysis example.

2. THE PPMX MODEL

We build on the PPM (1.1), modifying the cohesion function c(Sj) with an additional factor that achieves the desired regression. Recall that x*j = (xi, i ∈ Sj) denotes all covariates for units in the jth cluster. Let g(x*j) denote a nonnegative function of x*j that formalizes similarity of the xi, with larger values of g(x*j) for sets of covariates that are judged to be similar. We define the model

p(\rho_n \mid x^n) \propto \prod_{j=1}^{k_n} g(x^\star_j)\, c(S_j)   (2.1)

with the normalization constant gn(xn) = Σ_{ρn} ∏_{j=1}^{kn} g(x*j) c(Sj). By a slight abuse of notation we include x behind the conditioning bar even when x is not a random variable. We will later discuss specific choices for the similarity function g. As a default choice we propose to define g(·) as the marginal probability in an auxiliary probability model q, even if the xi are not considered random,

g(x^\star_j) \equiv \int \prod_{i \in S_j} q(x_i \mid \xi_j)\, q(\xi_j)\, d\xi_j.   (2.2)

There is no notion of the xi being random variables, but the use of a probability density q(·) in the construction of g(·) is convenient since it allows for easy calculus. The correlation that is induced by the cluster-specific parameters ξj in (2.2) leads to higher values of g(x*j) for tightly clustered, similar xi, as desired. More importantly, we show below that under some minimal assumptions a similarity function g(·) is necessarily of the form (2.2). The function (2.2) satisfies the following two properties that are desirable for a similarity function in (2.1). First, we require symmetry with respect to permutations of the sample indices i. The probability model must not depend on the order of introducing the experimental units. This implies that the similarity function g(·) must be symmetric in its arguments. Second, we require that the similarity function scales across sample sizes, in the sense that g(x*) = ∫ g(x*, x) dx. In words, the similarity of any cluster is the average similarity of any augmented cluster.

Under these two requirements (2.2) is not only technically convenient. It is the only possible similarity function that satisfies these two constraints.

Theorem

Assume that (i) a similarity function g(x*) satisfies the two constraints; and (ii) g(x*j) has a finite integral over the covariate space, ∫ g(x*j) dx*j < ∞. Then g(x*j) is proportional to the marginal distribution of x*j under a hierarchical auxiliary model of the form (2.2).

Proof

The result is a direct consequence of De Finetti’s representation theorem for exchangeable probability measures, applied to a normalized version of q(·). See, for example, the book by Bernardo and Smith (1994, chapter 4.3). The representation theorem applies for an infinite sequence of variables that are subject to the symmetry constraint. The result establishes that all similarity functions that satisfy the symmetry constraint are of the form (2.2).

The definition of the similarity function with the auxiliary model q(·) also implies another important property. The random partition model (2.1) is coherent across sample sizes. The model for the first n experimental units follows from the model for (n + 1) observations by appropriate marginalization. Without covariates we would simply require p(ρn) = Σen+1 p(ρn+1), with the summation defined over all possible values of en+1. With covariates we consider the following condition. Assume that the similarity function is defined by means of an auxiliary model, as in (2.2). The covariate-dependent PPM (2.1) is then coherent across sample sizes, as formalized by the following relationship of p(ρn|xn) and p(ρn+1|xn, xn+1):

p(\rho_n \mid x^n) = \sum_{e_{n+1}} \int p(\rho_{n+1} \mid x^n, x_{n+1})\, q(x_{n+1} \mid x^n)\, dx_{n+1}   (2.3)

for the probability model q(xn+1|xn) ∝ gn+1(xn+1)/gn(xn).

We complete the random partition model (2.1) with a sampling model that defines independence across clusters and exchangeability within each cluster. We include cluster-specific parameters θj and common hyperparameters η:

p(y^n \mid \rho_n, \theta, \eta, x^n) = \prod_{j=1}^{k_n} \prod_{i \in S_j} p(y_i \mid x_i, \theta_j, \eta) \quad\text{and}\quad p(\theta \mid \eta) = \prod_{j=1}^{k_n} p(\theta_j \mid \eta),   (2.4)

where θ = (θ1, … , θkn). We refer to (2.1) together with (2.4) as PPM with covariates, and write PPMx for short. The resulting model extends PPMs of the type (1.1) while keeping the product structure. Note that the sampling model p(yi|xi, θj, η) in (2.4) can include a regression of the response data on the covariate xi. For example, Shahbaba and Neal (2009) included a logistic regression for yi on xi, in addition to a covariate-dependent partition model p(ρn|xn). Similarly, cluster-specific covariates wj could be included by replacing p(θj|η) by p(θj|η, wj).
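
To make the construction concrete, the following is a minimal R sketch (our illustration, not code from the paper's accompanying package) that evaluates log p(ρn|xn) in (2.1) up to the normalization constant gn(xn), using the DP cohesion c(Sj) = M(nj − 1)!. The function name log_ppmx_prior and the plug-in argument log_similarity, which returns log g(x*j) for the covariate submatrix of one cluster (concrete choices follow in Section 4), are our own assumptions.

```r
# Minimal sketch: log of the PPMx prior (2.1), up to the normalization
# constant g_n(x^n), for a partition given by membership indicators e.
log_ppmx_prior <- function(e, x, log_similarity, M = 1) {
  sum(sapply(unique(e), function(j) {
    idx <- which(e == j)
    log(M) + lgamma(length(idx)) +     # log c(S_j) = log M + log (n_j - 1)!
      log_similarity(x[idx, , drop = FALSE])
  }))
}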

3. POSTERIOR INFERENCE

3.1 Markov Chain Monte Carlo Posterior Simulation

A practical advantage of the proposed default choice for g(x*j) is that it greatly simplifies posterior simulation. In words, posterior inference in the PPMx (2.1) and (2.4) is identical to the posterior inference that we would obtain if the xi were part of the random response vector yi. The converse is not true: not every model with an augmented response vector is equivalent to a PPMx model.

Formally, and as a computational device only, consider the auxiliary model q(·) defined as

q(y^n, x^n \mid \rho_n, \theta, \eta) = \prod_{j=1}^{k_n} \prod_{i \in S_j} p(y_i \mid x_i, \theta_j, \eta)\, q(x_i \mid \xi_j) \quad\text{and}\quad q(\theta, \xi \mid \eta) = \prod_{j=1}^{k_n} p(\theta_j \mid \eta)\, q(\xi_j),   (3.1)

and replace the covariate-dependent prior p(ρn|xn) by the PPM q(ρn) ∝ ∏_{j=1}^{kn} c(Sj). The posterior distribution q(ρn|yn, xn) under the auxiliary model is identical to p(ρn|yn, xn) under the proposed model (2.4). But posterior simulation under q(·) can be carried out using standard techniques, for example, as discussed by Quintana (2006). An important caveat of this operational trick, though, is that ξj and θj must be a priori independent. In particular, the prior on θj and q(ξj) must not include common hyperparameters. If we let p(θj|η) and q(ξj|η) depend on a common hyperparameter η, then the posterior distribution under the auxiliary model (3.1) would differ from the posterior under the original model. Let q(xn|η) = Σ_{ρn} ∏_j ∫ q(x*j | ξj) dq(ξj | η). The two posterior distributions would differ by a factor q(xn|η). The implication for the desired inference can be substantial.

However, we do not consider the principled way of introducing the covariates into the PPM to be the main feature of the PPMx. Rather, we argue that the proposed model greatly simplifies the inclusion of covariates across a variety of data formats. It would be unnecessarily difficult to attempt the construction of a joint model for continuous responses and categorical, binary, and continuous covariates. We argue that it is far easier to focus on a modification of the cohesion function. This is illustrated in the data analysis example in Section 6. From a modeling perspective it is desirable to separate the choice of the prior on the partition from the sampling model conditional on an assumed partition.
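
For concreteness, here is a minimal R sketch of the resulting marginal Gibbs update for a single membership indicator ei under the auxiliary model (3.1) with DP cohesions: an existing cluster j receives weight nj⁻ · g(x*j, xi)/g(x*j) · m(yi|y*j), and a new cluster receives M · g(xi) · m(yi). The names update_e_i, log_g (a similarity function as in Section 4), and log_m (the marginal likelihood of yi given a cluster's responses) are our own assumptions, not the package's API.

```r
# Minimal sketch of the marginal Gibbs update for one indicator e_i.
update_e_i <- function(i, e, x, y, log_g, log_m, M = 1) {
  labs <- sort(unique(e[-i]))
  lw <- sapply(labs, function(j) {
    idx <- setdiff(which(e == j), i)   # cluster j without unit i
    log(length(idx)) +                 # cohesion ratio n_j^-
      log_g(x[c(idx, i), , drop = FALSE]) - log_g(x[idx, , drop = FALSE]) +
      log_m(y[i], y[idx])
  })
  lw <- c(lw, log(M) + log_g(x[i, , drop = FALSE]) +
                log_m(y[i], numeric(0)))         # open a new cluster
  w <- exp(lw - max(lw))
  c(labs, max(labs) + 1L)[sample.int(length(w), 1, prob = w)]
}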

3.2 Predictive Inference

A minor complication arises with posterior predictive inference, that is, when reporting p(yn+1|yn, xn, xn+1). Using x̃ = xn+1, ỹ = yn+1, and ẽ = en+1 to simplify notation, we find p(ỹ | x̃, xn, yn) = ∫ p(ỹ | x̃, ρn+1, xn, yn) dp(ρn+1 | x̃, yn, xn). The integral is simply a sum over all configurations ρn+1. But it is not immediately recognizable as a posterior integral with respect to p(ρn|xn, yn). This can easily be overcome by an importance sampling reweighting step. Let g(∅) = c(∅) ≡ 1. The prior on ρn+1 = (ρn, ẽ) can be written as

p(\tilde e = \ell, \rho_n \mid \tilde x, x^n) \propto \prod_{j \neq \ell} c(S_j)\, g(x^\star_j) \cdot c(S_\ell \cup \{n+1\})\, g(x^\star_\ell, \tilde x) = p(\rho_n \mid x^n)\, \frac{g(x^\star_\ell, \tilde x)}{g(x^\star_\ell)}\, \frac{c(S_\ell \cup \{n+1\})}{c(S_\ell)}.

Let q(x̃ | x*ℓ) ≡ g(x*ℓ, x̃)/g(x*ℓ). The posterior predictive distribution becomes

p(\tilde y \mid \tilde x, y^n, x^n) = \sum_{\ell=1}^{k_n+1} \int p(\tilde y \mid \tilde x, y^\star_\ell, x^\star_\ell, \tilde e = \ell)\, q(\tilde x \mid x^\star_\ell) \times \frac{c(S_\ell \cup \{n+1\})}{c(S_\ell)}\, p(\rho_n \mid y^n, x^n)\, d\rho_n.   (3.2)

The first factor reduces to p(ỹ | y*ℓ, ẽ = ℓ) when (2.4) does not include a regression on xi in the sampling model. Sampling from (3.2) is implemented on top of posterior simulation for ρn ~ p(ρn|xn, yn). For each imputed ρn, generate ẽ = ℓ with probabilities proportional to wℓ = q(x̃ | x*ℓ) c(Sℓ ∪ {n+1})/c(Sℓ), and generate ỹ from p(ỹ | y*ℓ, ẽ = ℓ), weighted with wℓ. In the special case ℓ = kn + 1 we get wkn+1 ∝ g(x̃) c({n+1}).
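
The following minimal R sketch (ours) implements one such predictive draw, assuming DP cohesions so that c(Sℓ ∪ {n+1})/c(Sℓ) = nℓ and c({n+1}) = M. The arguments log_g (a similarity function as in Section 4) and draw_y (a sampler from the predictive of the chosen cluster) are assumed given; all names are our own.

```r
# Minimal sketch of one draw of (e~, y~) from (3.2) under DP cohesions;
# assumes cluster labels in e are 1, ..., k.
predict_one <- function(e, x, xtilde, log_g, draw_y, M = 1) {
  k  <- max(e)
  lw <- numeric(k + 1)
  for (l in 1:k) {
    idx   <- which(e == l)
    lw[l] <- log(length(idx)) +                        # cohesion ratio n_l
      log_g(rbind(x[idx, , drop = FALSE], xtilde)) -   # g(x*_l, x~)
      log_g(x[idx, , drop = FALSE])                    #   / g(x*_l)
  }
  lw[k + 1] <- log(M) + log_g(matrix(xtilde, nrow = 1))  # new cluster
  w <- exp(lw - max(lw))
  draw_y(sample.int(k + 1, 1, prob = w))   # e~ = l w.p. propto w_l, then y~
}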

3.3 Posterior Inference on Clusters

In many applications inference on the number of clusters, kn, is of interest. Under the proposed model, p(kn = r | yn, xn) = Σ_{ρn: |ρn| = r} p(ρn | yn, xn) is the induced posterior distribution on the number of clusters. The marginal posterior p(kn = r|yn, xn) is reported as a simple summary of posterior Markov chain Monte Carlo (MCMC) simulation. Each iteration of the posterior simulation scheme in Section 3.1 involves imputing a partition ρn, and thus kn. A formal selection of the number of clusters could be carried out using a decision-theoretic approach, that is, a loss function that weighs model complexity against the value of learning about the inferential target. Such formal procedures were discussed, among others, by Quintana and Iglesias (2003) and Lau and Green (2007). The special case of finding the maximum a posteriori clustering in a class of product partition models is considered by Dahl (2009).

Reporting posterior inference for the random partition ρn is complicated by the label switching problem. The posterior distribution is invariant under permutations of the cluster labels when p(ρn) and p(θ) are both symmetric. This implies that inference on cluster-specific parameters is meaningless. See the article by Jasra, Holmes, and Stephens (2005) for a recent discussion and review of the literature. Usually the focus of inference in semi-parametric mixture models like (2.4) is on density estimation and prediction, rather than inference for specific clusters and parameter estimation. Predictive inference is not subject to the label switching problem, as the posterior predictive distribution marginalizes over all possible partitions. In the application reported later, when we want to highlight inference about specific clusters we use a simple ad hoc approach that we found particularly meaningful for reporting inference about the regression p(ρn|xn). We report inference stratified by kn. For given kn we first find a set of kn indices I = (i1, … , ikn) with high posterior probability of the corresponding cluster indicators being distinct, that is, Pr(DI|y) is high for DI = {ei ≠ ej for i ≠ j and i, j ∈ I}. To avoid computational complications, we do not insist on finding the kn-tuple with the highest posterior probability. We then use i1, … , ikn as anchors to define cluster labels by restricting posterior simulations to kn clusters with the units ij in distinct clusters. Postprocessing MCMC output, this is easily done by discarding all imputed parameter vectors that do not satisfy the constraint. We relabel clusters, indexing the cluster that contains unit ij as cluster j. Now it is meaningful to report posterior inference on specific clusters; a sketch of this postprocessing appears below. The proposed postprocessing is similar to the pivotal reordering suggested by Marin and Robert (2007, chapter 6.4).
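
As a concrete illustration (our sketch, not package code), the following R function postprocesses an (iterations × n) matrix E of imputed membership indicators: it keeps draws with kn = k clusters and the anchor units i1, … , ik in distinct clusters, and relabels so that the cluster containing anchor j is labeled j.

```r
# Minimal sketch of anchor-based relabeling of MCMC output.
relabel_by_anchors <- function(E, anchors) {
  k <- length(anchors)
  keep <- apply(E, 1, function(e)                     # enforce the constraint
    length(unique(e)) == k && length(unique(e[anchors])) == k)
  E <- E[keep, , drop = FALSE]
  t(apply(E, 1, function(e) match(e, e[anchors])))    # new label = anchor index
}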

4. SIMILARITY FUNCTIONS

For continuous covariates we suggest as a default choice for g(x*j) the marginal distribution of x*j under a normal sampling model. Let N(x; m, V) denote a normal model for the random variable x, with moments m and V, and let Ga(x; a, b) denote a gamma-distributed random variable with mean a/b. We use q(x*j | ξj) = ∏_{i∈Sj} N(xi; mj, vj), with a conjugate prior for ξj = (mj, vj), p(ξj) = N(mj; m, B) Ga(vj⁻¹; ν, S0), with fixed m, B, ν, S0. The main reason for this choice is operational simplicity. A simplified version uses fixed vj ≡ v. The resulting function g(x*j) is the joint density of a correlated multivariate t-distribution, with location parameter m and scaling matrix B/(vI + B). The fixed variance v specifies how strongly we weigh similarity of the xi. In implementations we used v = c1Ŝ, where Ŝ is the empirical variance of the covariate, and c1 = 0.5 is a scaling factor that specifies over what range we consider values of this covariate as similar. The choice between fixed v versus variable vj should reflect prior judgment on the variability of clusters. Variable vj allows some clusters to include a wider range of x values than others. Finally, a sufficiently vague prior for mj is important to ensure that the similarity is appropriately quantified even for a group of covariate values in the extreme areas of the covariate space. In our implementation we used B = c2Ŝ with c2 = 10.
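
Under the fixed-variance simplification, the marginal g(x*j) is multivariate normal with mean m·1 and covariance vI + B·11ᵀ and can be evaluated in closed form. A minimal R sketch of the resulting log similarity follows (our code; the defaults v = 0.5 and B = 10 correspond to c1Ŝ and c2Ŝ for a standardized covariate with Ŝ = 1).

```r
# Minimal sketch: log g(x*_j) for one continuous covariate with fixed v,
# i.e., the N(m 1, v I + B 11') marginal evaluated analytically.
log_g_continuous <- function(x, m = 0, v = 0.5, B = 10) {
  n  <- length(x)
  xc <- x - m
  qf <- (sum(xc^2) - B * sum(xc)^2 / (v + n * B)) / v  # (x-m)' Sigma^{-1} (x-m)
  ld <- (n - 1) * log(v) + log(v + n * B)              # log det(v I + B 11')
  -0.5 * (n * log(2 * pi) + ld + qf)
}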

When constructing a cohesion function for categorical covariates, a default choice is based on a Dirichlet prior. Assume xi is a categorical covariate, xi ∈ {1, … , C}. To define g(xj), let q(xi = c|ξj) = ξjc denote the probability mass function. Together with a conjugate Dirichlet prior, q(ξj) = Dir(α1, … , αC), we define the similarity function as a Dirichlet-categorical probability

g(x^\star_j) = \int \prod_{i \in S_j} \xi_{j, x_i}\, dq(\xi_j) = \int \prod_{c=1}^{C} \xi_{jc}^{n_{jc}}\, dq(\xi_j),   (4.1)

with njc = Σ_{i∈Sj} I(xi = c). This is a Dirichlet-multinomial model without the multinomial coefficient. For binary covariates the similarity function becomes a beta–binomial probability without the binomial coefficient. The choice of the hyperparameters α needs some care. To facilitate the formation of clusters that are characterized by the categorical covariates we recommend Dirichlet hyperparameters αc < 1. For example, for C = 2, the bimodal nature of a beta distribution with such parameters assigns high probability to binomial success probabilities ξj1 close to 0 or 1. Similarly, the Dirichlet distribution with parameters αc < 1 favors clusters corresponding to specific levels of the covariate.
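
The Dirichlet-categorical marginal in (4.1) has the closed form ∏c Γ(αc + njc)/Γ(αc) · Γ(A)/Γ(A + nj), with A = Σc αc. A minimal R sketch (our code):

```r
# Minimal sketch: log similarity (4.1) for a categorical covariate,
# the Dirichlet-multinomial probability without the multinomial coefficient.
log_g_categorical <- function(x, alpha) {
  njc <- tabulate(x, nbins = length(alpha))   # counts n_jc per level c
  lgamma(sum(alpha)) - sum(lgamma(alpha)) +
    sum(lgamma(alpha + njc)) - lgamma(sum(alpha) + length(x))
}

log_g_categorical(c(1, 1, 2, 1), alpha = c(0.1, 0.1))   # example call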

For ordinal covariates, a convenient specification of g(·) is an ordinal probit model. The model can be defined by means of latent variables and cutoffs. Assume an ordinal covariate x with C categories. Following Johnson and Albert (1999), consider a latent trait Z and cutoffs −∞ = γ0 < γ1 ≤ γ2 ≤ ⋯ ≤ γC−1 < γC = ∞, so that x = c if and only if γc−1 < Z ≤ γc. We use fixed cutoff values γc = c − 1, c = 1, … , C − 1, and a normally distributed latent trait, Zi ~ N(mj, vj). Let Φc(mj, vj) = Pr(γc−1 < Z ≤ γc | mj, vj) denote the normal interval probabilities, and let ξj = (mj, vj) denote the cluster-specific moments of the latent score. We define q(xi = c | ei = j, ξj) = Φc(mj, vj). The definition of the similarity function is completed with q(ξj) as a normal-inverse gamma distribution, as for the continuous covariates. We have g(x*j) = ∫ ∏_{c=1}^{C} Φc(mj, vj)^{ncj} N(mj; m, B) Ga(vj⁻¹; ν, S0) dmj dvj, where ncj = Σ_{i∈Sj} I{xi = c}.
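
The double integral over (mj, vj) has no convenient closed form, but it is easily approximated by Monte Carlo, averaging ∏c Φc(m, v)^{ncj} over prior draws of (m, v). A minimal R sketch under the stated cutoffs γc = c − 1 (our code; the hyperparameter defaults are illustrative):

```r
# Minimal sketch: Monte Carlo evaluation of the ordinal-probit similarity.
log_g_ordinal <- function(x, C, m0 = 0, B = 1, nu = 2, S0 = 1, nsim = 2000) {
  gam <- c(-Inf, 0:(C - 2), Inf)              # cutoffs gamma_c = c - 1
  ncj <- tabulate(x, nbins = C)
  m   <- rnorm(nsim, m0, sqrt(B))             # m_j ~ N(m0, B)
  v   <- 1 / rgamma(nsim, nu, S0)             # v_j^{-1} ~ Ga(nu, S0)
  ll  <- sapply(seq_len(nsim), function(s) {
    p <- diff(pnorm(gam, mean = m[s], sd = sqrt(v[s])))   # Phi_c(m, v)
    sum(ncj * log(pmax(p, 1e-300)))
  })
  max(ll) + log(mean(exp(ll - max(ll))))      # log of the Monte Carlo average
}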

Finally, to define g(·) for count-type covariates, we use a mixture of Poisson distributions

g(x^\star_j) = \frac{1}{\prod_{i \in S_j} x_i!} \int \xi_j^{\sum_{i \in S_j} x_i} \exp(-\xi_j\, |S_j|)\, dq(\xi_j).   (4.2)

With a conjugate gamma prior, q(ξj) = Ga(ξj; a, b), the similarity function g(x*j) allows easy analytic evaluation. As a default, we suggest choosing a = 1 and a/b = cŜ, where Ŝ is the empirical variance of the covariate and a/b is the expectation of the gamma prior.
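
With the gamma prior, (4.2) evaluates to [∏i xi!]⁻¹ · b^a/Γ(a) · Γ(a + Σi xi)/(b + nj)^{a + Σi xi}. A minimal R sketch (our code):

```r
# Minimal sketch: log similarity (4.2) for a count covariate,
# the Poisson-gamma marginal in closed form.
log_g_count <- function(x, a = 1, b = 1) {
  n <- length(x)
  s <- sum(x)
  -sum(lfactorial(x)) + a * log(b) - lgamma(a) +
    lgamma(a + s) - (a + s) * log(b + n)
}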

The main advantage of the proposed similarity functions is computational simplicity of posterior inference. A minor limitation is the fact that the proposed default similarity functions, in addition to the desired dependence of the random partition model on covariates, also include a dependence on the cluster size nj = |Sj|. From a modeling perspective this is undesirable. The mechanism to define size dependence should be the underlying PPM and the cohesion functions c(Sj) in (2.1). However, we argue that the additional cluster size penalty that is introduced through g(·) is negligible in comparison to, for example, the cohesion function c(Sj) = M(nj − 1)! that is implied by the popular DP prior. It is straightforward to show that the similarity function introduces the following additional cluster size penalty in model (2.1). Consider the case of constant covariates, xi ≡ x, and let nj denote the size of cluster j. The default choices of g(x*j) for continuous, categorical, and count covariates introduce a penalty for large nj, with lim_{nj→∞} g(x*j) = 0. But the rate of decrease is ignorable compared to the cohesion c(Sj). Let f(nj) be such that lim_{nj→∞} g(x*j)/f(nj) = M with 0 < M < ∞. For continuous covariates the rate is f(nj) = (2π)^{−nj/2} V^{−(nj−1)/2} (r + nj)^{−1/2}, with r = V/B. For categorical covariates it is f(nj) = (A + nj)^{−(A−αx)} with A = Σc αc. For count covariates it is f(nj) = C^{−nj/2} (a + nj x)^{−1/2}, with C = 2πx exp(1/(6x)).

5. ALTERNATIVE APPROACHES AND COMPARISON

5.1 Other Approaches

We propose the PPMx model as a principled approach to defining covariate-dependent random partition models. A related approach that has attracted considerable attention, especially in the recent marketing and psychometrics literature, is clusterwise regression. The basic model is defined in the article by Späth (1979). The model uses a partition of the experimental units (observations) into subsets Sj. For each partitioning subset a different linear regression mean function is defined. A recent implementation of clusterwise regression as an R package flexmix was described by Leisch (2004). We use flexmix in the comparison below.

Another popular class of models that implements flexible regression based on clustering experimental units is the class of hierarchical mixtures of experts (HME) and related models. In contrast to clusterwise regression, the HME model allows the cluster weights to depend on the covariates themselves. HME models were introduced by Jordan and Jacobs (1994). Bayesian posterior inference was discussed by Bishop and Svensén (2003). We consider a specific instance of the HME model that we will use for an empirical comparison in Section 5.2. The model is expressed as a finite mixture of normal linear regressions, with mixture weights that themselves depend on the available covariates as well,

p(y_i; \eta) = \sum_{j=1}^{k_n} w_{ij}(x_i; \alpha_j)\, N(y_i; \theta_j^T x_i, \sigma_j^2).   (5.1)

Here η = (θ1, … , θkn, σ1², … , σkn², α1, … , αkn) is the full vector of parameters. The size kn of the mixture is prespecified. The covariate vector xi includes a 1 for an intercept term. The weights 0 ≤ wij ≤ 1 are defined as wij ∝ exp(αjᵀxi). For identifiability we assume α1 ≡ 0. The model specification is completed by assuming θj ~ N(μθ, Vθ), αj ~ N(μα, Vα), σj⁻² ~ Ga(a0, a1), and prior independence of all parameters. The HME formulation defines a highly flexible class of parametric models. But there are some important limitations. The dependence of cluster membership probabilities on the covariates is strictly limited to the assumed parametric form of wij(·). For example, unless an interaction of two covariates is included in wij(·), no amount of data will allow clusters that are specific to interactions of covariates. Second, the number of mixture components kn is assumed known. Ad hoc fixes are possible, such as choosing a very large kn, or introducing a hyperprior on kn.
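
The covariate dependence of the weights is the defining feature of (5.1). A minimal R sketch of the weight computation, wij ∝ exp(αjᵀxi) with α1 ≡ 0 (our own helper, not from any package):

```r
# Minimal sketch: covariate-dependent mixture weights of the HME (5.1).
hme_weights <- function(X, alpha) {    # X: n x p (first column 1), alpha: p x k
  alpha[, 1] <- 0                      # reference component for identifiability
  eta <- X %*% alpha                   # n x k linear predictors alpha_j' x_i
  w <- exp(eta - apply(eta, 1, max))   # numerically stabilized softmax
  w / rowSums(w)
}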

Clusterwise regression and HME do not explicitly include a probability model over random partitions. A model p(ρn) is implicitly defined as in (1.2), as model-based clustering. The dependence of cluster membership on covariates remains by definition restricted to the given structure. Alternatively, some recently proposed methods in the Bayesian literature take a perspective similar to the proposed PPMx model and include an explicit probability model for a random partition.

Park and Dunson (2010) considered the special case of continuous covariates. They defined the desired covariate-dependent random partition model as the posterior random partition under a PPM model for the covariates xi, that is, (1.1) with covariates xi replacing the response data yi. The posterior p(ρn|xn) is used to define the prior for the random partition ρn. In other words, they proceeded with an augmented response vector (xi, yi), carefully separating the prior for parameters related to the x and y subvectors, as in (3.1). See Section 3.1 for a discussion of the need to separate prior parameters related to x and y.

Another recently proposed approach is that of Shahbaba and Neal (2009). They used a logistic regression for a categorical response variable on covariates together with a random partition of the samples, with cluster-specific regression parameters in each partitioning subset. The random partition is defined by a DP prior and includes the covariates as part of the response vector. The proposed model can be written as (3.1) with p(yi|xi, θj, η) specified as a logistic regression with cluster-specific parameters θj, and q(xi|ξj) as a normal model with moments ξj = (μj, Σj). The cohesion functions c(Sj) are defined by a DP prior.

Dahl (2008) recently proposed another interesting approach to covariate-dependent random clustering. Let e = (e1, … , en) and let e⁻ⁱ denote the vector of cluster membership indicators without the ith element. Let p(ei = j | e⁻ⁱ) denote the conditional prior probabilities for the cluster membership indicators. For the random clustering implied by the DP prior, G ~ DP(M, G*), these conditional probabilities are known as the Polya urn. Let kn⁻ denote the number of distinct elements in e⁻ⁱ, Sj⁻ = {h : eh = j and h ≠ i}, and nj⁻ = |Sj⁻|. The Polya urn specifies p(ei = j | e⁻ⁱ) = nj⁻/(n − 1 + M) for j = 1, … , kn⁻, and M/(n − 1 + M) for j = kn⁻ + 1. Dahl (2008) defined the desired covariate-dependent random partition model by modifying the Polya urn. Assume covariates xi are available and it is desired to modify p(e) such that any two units with similar covariates have increased prior probability of co-clustering. Assume dih = d(xi, xh) can be interpreted as a distance of xi and xh. Let d* = max_{i<h} dih and define hi(Sj⁻) = c Σ_{h∈Sj⁻} (d* − dih), with c chosen to achieve Σ_{j=1}^{kn⁻} hi(Sj⁻) = n. Dahl (2008) defined the modified Polya urn

p(e_i = j \mid e^{-i}, x^n) \propto \begin{cases} h_i(S^-_j), & j = 1, \ldots, k^-_n \\ M, & j = k^-_n + 1. \end{cases}   (5.2)

This set of transition probabilities defines an ergodic Markov chain. Thus it implicitly defines a joint probability distribution on e that is informed by the relative distances dij as desired.
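
A minimal R sketch of the modified urn (5.2) (our code): given the remaining memberships and a distance matrix D, it returns the conditional probabilities for unit i over the existing clusters and a new one.

```r
# Minimal sketch: conditional probabilities of the modified Polya urn (5.2).
dahl_urn_probs <- function(i, e, D, M = 1) {
  labs  <- sort(unique(e[-i]))
  dstar <- max(D)
  h <- sapply(labs, function(j)                      # h_i(S_j^-) up to c
    sum(dstar - D[i, setdiff(which(e == j), i)]))
  h <- h * length(e) / sum(h)        # c chosen so that sum_j h_i(S_j^-) = n
  p <- c(h, M)                       # existing clusters, then a new one
  p / sum(p)
}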

5.2 Comparison of Competing Approaches

We set up a Monte Carlo study to compare the performance of the proposed PPMx model against four of the described alternative approaches: (i) clusterwise regression using the flexmix R package (Leisch 2004); (ii) the HME model (5.1); (iii) the approach of Dahl (2008); and (iv) the model of Park and Dunson (2010).

The study is set up to compare the competing methods as nonlinear regression methods. We thus use mean squared error (MSE) as the criterion for the comparison. MSE is the average squared deviation between the fitted regression mean function and the known simulation truth, averaging with respect to repeated sampling. We report MSE for a set of covariate combinations, including some covariate combinations that are not observed in the data, that is, that require extrapolation. For a fair comparison we use a simulation truth that is not characterized as the assumed model under any of the competing approaches. In particular, we do not generate the simulation truth from assumed clusters. Instead we set up a nonlinear regression of a continuous dependent variable on three independent variables, including two binary variables and one continuous variable.

Let x1, x2, x3 denote the three covariates, with x1 ∈ ℝ and x2, x3 ∈ {0, 1}. The simulation truth is chosen to mimic the example in Section 6. Think of x2 as an indicator for high dose (HI), x3 as an indicator for ER positive tumors (ER+), and x1 as tumor size (TS), with −1, 0, and 1 representing small, medium, and large size. There is a strong interaction of x2 and x3, with longest survival for (x2, x3) = (1, 1), and a sizeable main effect for x1.

Figure 1 shows the simulation truth. The figure plots the distribution of the univariate response yi arranged by x1 ∈ {−1, 0, 1} (the three distributions shown in each panel) and the four combinations of (x2, x3) (one panel for each combination). Note the interaction of x1 and x2. We simulated M = 100 datasets of size n = 200 using the described sampling model. The simulated data included all covariate combinations shown in Figure 1 except for (x1, x2, x3) = (1, 0, 0) and (1, 1, 0). Simulation also included intermediate values of the continuous variable x1. The M = 100 datasets were generated by resampling one big set of N = 1000 simulated data points. The big dataset and the expectations E(Y|x1, x2, x3) under the simulation truth are available in the supplemental materials or at http://odin.mdacc.tmc.edu/~pmueller/prog.html#PPMx.

Figure 1. Simulation truth pdf for the outcome variable for different combinations of the three covariates.

We compare the proposed PPMx model and the alternative approaches (i) through (iv). As criterion we use the root mean squared error (RMSE) in estimating E(y|x1, x2, x3) for the 12 combinations of the covariates (x1, x2, x3) shown in Figure 1. We evaluated MSE by averaging squared errors over the M = 100 repeat simulations. Table 1 summarizes the results. Rows (x1, x2, x3) = (1, 0, 0) and (1, 1, 0) report covariate combinations that were not included in the dataset. Results there are thus based on extrapolation. The PPMx model performs well for these extrapolation problems. The last three rows correspond to the right lower panel in Figure 1. Performance for these scenarios reflects the adaptation of the model to extreme interaction effects. The HME performs surprisingly well. The PPMx reports reasonable MSE. The clusterwise regression (flexmix) is comparable to PPMx. Overall there is no clear winner in the comparison, although PPMx attains the lowest average RMSE across all covariate combinations. In any case we caution against over-interpreting the results for one example. The main conclusion is that all five approaches are reasonably comparable, with a caveat about extrapolation under the approach of Park and Dunson (2010). The choice of approach should depend on the inference goal. Clusterwise regression and HME are perfectly appropriate if the main objective is flexible regression. If available information about similarities is naturally expressed by pairwise distances dij, and the main focus is predictive inference, then the approach by Dahl (2008) is attractive. A limitation of the latter three approaches is the lack of specific inference on the random partition. Inference on cluster membership indicators can be reported. But under the HME cluster membership is strictly limited to the functional form of w(·; αj). Under the approach of Dahl (2008), although a model is implied by the set of conditional distributions (5.2), there is in general no clearly identified prior probability model for ρn. The approach of Park and Dunson (2010) is reasonable if the DP prior is chosen for the underlying PPM, the covariates are continuous, and the covariates can be considered random variables. The proposed PPMx is attractive when the set of covariates includes a mix of different data formats, and when there is specific prior information about how important different covariates should be for the judgment of similarity.

Table 1.

Root MSE for estimating E(y|x1, x2, x3) for 12 combinations of (x1, x2, x3) under the proposed model and the four competing models. Models (i) through (iv) and the proposed model are indicated as flexmix, HME, DAHL, P&D, and PPMx, respectively. Covariate combinations that require extrapolation beyond the range of the data are indicated by *

 x1  x2  x3     flexmix   PPMx   P&D    DAHL   HME
 −1   0   0       13.9     7.9    2.7    8.6   13.0
  0   0   0        7.7     3.9   15.0    4.6    8.3
  1   0   0 *      5.2     2.8   21.5    8.7    8.7
 −1   1   0        7.4     5.4    2.3    6.2    5.7
  0   1   0        4.4     4.6   15.5    7.9    5.5
  1   1   0 *      3.3     4.0   21.0   12.6    5.3
 −1   0   1        8.1     6.1    1.8    9.5    7.5
  0   0   1        6.8     4.2    7.0    2.4    3.5
  1   0   1        3.7     4.5   17.4    8.2    4.9
 −1   1   1        8.0     9.5   12.1   12.2    6.5
  0   1   1       11.5     8.3    8.7   10.2    5.4
  1   1   1        3.1     6.2    2.4    4.8    3.8
 Average           6.9     5.6   10.6    8.0    6.5

6. EXAMPLE: A SURVIVAL MODEL WITH PATIENT BASELINE COVARIATES

We consider data from a high-dose chemotherapy treatment of women with breast cancer. The data for this particular study have been discussed by Rosner (2005) and come from Cancer and Leukemia Group B (CALGB) Study 9082. They consist of measurements taken from 763 randomized patients, available as of October 1998 (enrollment had occurred between January 1991 and May 1998). The response of interest is the survival time, defined as the time until death from any cause, relapse, or diagnosis with a second malignancy. There are two treatments, one involving a low dose of the anti-cancer drugs, and the other consisting of aggressive high-dose chemotherapy. The high-dose patients were given considerable regenerative blood-cell supportive care (including marrow transplantation) to help decrease the impact of opportunistic infections arising from the severely affected immune system. The number of observed failures was 361, with 176 under high-dose and 185 under low-dose chemotherapy.

The dataset also includes information on the following covariates for each patient: a treatment indicator defined as 1 if a high dose was administered and 0 otherwise (HI); age in years at baseline (AGE); the number of positive lymph nodes found at diagnosis (POS) (the more positive nodes, the worse the prognosis, i.e., the more likely it is that the cancer has spread); tumor size in millimeters (TS), a one-dimensional measurement; an indicator of whether the tumor is positive for the estrogen or progesterone receptor (ER+) (patients who were positive also received the drug tamoxifen and are expected to have a better prognosis); and an indicator of the woman’s menopausal status, defined as 1 if she is either perimenopausal or postmenopausal and 0 otherwise (MENO). Two of these six covariates are continuous (AGE, TS), three are binary (ER+, MENO, HI), and one is a count (POS).

6.1 PPMx With One Covariate

First we carried out inference in a model using the indicator for high dose as the only covariate, that is, xi = HI. We implemented model (2.4) with a similarity function for the binary covariate based on the beta–binomial version of (4.1). We used α = (0.1, 0.1) to favor clusters with homogeneous dose assignment. Conditional on an assumed partition ρn we use a normal sampling model p(yi|θj) = N(yi; μj, Vj), with a conjugate normal–inverse gamma prior p(θj|η) = N(μj; my, By) Ga(Vj⁻¹; s/2, sSy²/2) and hyperpriors my ~ N(am, Am) and Sy ~ Ga(q, q/R). Here η = (am, Am, By, q, R) are fixed hyperparameters. We use am = m̂, the sample average of the yi, Am = 100, By = 100², s = q = 4, and R = 100. The cohesion functions were chosen as before, with c(Sj) = M(nj − 1)!, matching the PPM implied by the DP prior. We include a Ga(1, 1) hyperprior for the total mass parameter M.

Figure 2 shows inference summaries. The posterior distribution p(kn|data) for the number of clusters is shown in Figure 3. The three largest clusters contain 28%, 23%, and 14% of the experimental units.

Figure 2. Survival example: Estimated survival function (left panel) and hazard (center panel), arranged by x ∈ {HI, LO}. The gray shades show pointwise one posterior predictive standard deviation uncertainty. The right panel shows the data for comparison (Kaplan–Meier curve by dose).

Figure 3. Survival example: Posterior for the number of clusters kn (left panel), and proportion of patients with high dose (% HI) and average progression-free survival (PFS) by cluster. The size (area) of the bullets is proportional to the average cluster size.

6.2 PPMx With Multiple Covariates

Next we extended the covariate vector to include all six covariates. Denote by xi = (xi1, … , xi6) the six-dimensional covariate vector. We implement random clustering with regression on covariates as in model (2.4). The similarity function is defined as

g(x^\star_j) = \prod_{\ell=1}^{6} g_\ell(x^\star_{j\ell}).   (6.1)

Here x*jℓ = (xiℓ, i ∈ Sj) denotes the ℓth covariate arranged by cluster. For each covariate we follow the suggestions in Section 4 to define a factor gℓ(·) of the similarity function, using hyperparameters specified as follows. The similarity function for the three binary covariates is defined as in (4.1), with α = (0.1, 0.1) for HI, and α = (0.5, 0.5) for ER+ and MENO. The two continuous covariates AGE and TS were standardized to sample mean 0 and unit standard deviation. The similarity functions were specified as described in Section 4, with fixed s = 0.25, m = 0, and B = 1. Finally, for the count covariate POS we used the similarity function (4.2) with (a, b) = (1.5, 0.1). The sampling model is unchanged from before.

We assume that censoring times are independent of the event times and of all parameters in the model. Posterior predictive survival curves for various covariate combinations are shown in Figure 4. In the figure, “baseline” refers to HI = 0, tumor size 38 mm (the empirical median), ER = 0, MENO = 0, average age (44 years), and POS = 15 (empirical mean). Other survival curves are labeled to indicate how the covariates change from baseline, with TS− indicating tumor size 26 mm (the empirical first quartile), TS+ indicating tumor size 50 mm (third quartile), HI referring to high-dose chemotherapy, and ER+ indicating positive estrogen or progesterone receptor status. The inference suggests that treated patients with tumor size below the empirical median who were positive for the estrogen or progesterone receptor have almost uniformly higher predicted survival curves than any other combination of covariates.

Figure 4. Survival example: Posterior predictive survival function S(t|x) ≡ p(yn+1 > t|xn+1 = x, data), arranged by x. The “baseline” case refers to all continuous and count covariates equal to their empirical means and all binary covariates equal to 0. The legend indicates TS− and TS+ for tumor size equal to 26 mm and 50 mm (first and third empirical quartiles), HI for HI = 1, and ER+ for ER = 1. The legend is sorted by the survival probability at 5 years, indicated by a thin vertical line.

Figure 5 summarizes features of the posterior clustering. Interestingly, cluster membership is typically highly correlated with postmenopausal status, as seen in the right panel.

Figure 5. Survival example: Posterior distribution for the number of clusters (left panel), mean PFS and % high-dose patients per cluster (center panel), and mean PFS and % postmenopausal patients per cluster (right panel).

6.3 Variable Selection

The PPMx model replaces variable selection in a regression model by clustering of experimental units (patients). Consider, for example, inference in Figure 5. Clusters are heterogeneous with respect to HI, but very homogeneous with respect to MENO. In other words, cluster membership and thus predictive inference for a future observation varies more by MENO than by HI.

Similar summaries for all covariates lead us to consider formal variable selection. We explore the use of reduced models by comparing alternative models that include subsets of the p = 6 covariates. Let γ = (γ1, … , γp) denote a vector of binary indicators and modify (6.1) to include only the selected covariates, g(x*j) = ∏_{ℓ: γℓ=1} gℓ(x*jℓ). We complete the probability model with a prior on γ. Without loss of generality we assume p(γℓ = 1) = π, independently, for ℓ = 1, … , p.

Posterior simulation for γ involves a minor challenge related to the normalization constant gn(xn) in (2.1). We write gn(xn|γ) to highlight that the definition of g(·) is indexed by γ. We implement posterior simulation with a Metropolis–Hastings transition probability to change the indicators γℓ, keeping the currently imputed partition ρ and all other parameters unchanged. Let γ¹ and γ⁰ denote two vectors of variable selection indicators. Without loss of generality we assume that the two vectors differ only in the first indicator, with γ¹₁ = 1, γ⁰₁ = 0, and γ¹ℓ = γ⁰ℓ for ℓ = 2, … , p. The acceptance probability includes the ratio

\frac{p(\gamma^1 \mid \rho, x^n)}{p(\gamma^0 \mid \rho, x^n)} = \prod_{j=1}^{k_n} g_1(x^\star_{j1}) \cdot \frac{g_n(x^n \mid \gamma^0)}{g_n(x^n \mid \gamma^1)}.

We use (independent) Monte Carlo integration to compute gn(xn|γ) for all 2^p possible γ vectors. We do this once, up front, before starting the posterior MCMC simulation. Let g(xn|ρ, γ) = ∏j g(x*j|γ) denote the product of similarity functions under partition ρ and variable selection γ. Recall that q(ρ) ∝ ∏j c(Sj) denotes the random partition model under the PPM without the additional similarity function. We use

g_n(x^n \mid \gamma) = \sum_{\rho} g(x^n \mid \rho, \gamma)\, q(\rho) \approx \frac{1}{M} \sum_{m=1}^{M} g(x^n \mid \rho^m, \gamma)

for ρm ~ q(ρ), iid. We use Rao–Blackwellization to compute the marginal posterior probabilities p(γ|xn, yn) as ergodic averages of p(γ|ρ, xn), averaging over imputed partitions ρ.
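
A minimal R sketch of this up-front Monte Carlo step (our code, reusing sample_dp_partition() from the sketch in Section 1; the list log_g_list, holding one log similarity function per covariate column of x, is our own device):

```r
# Minimal sketch: Monte Carlo estimate of log g_n(x^n | gamma), averaging
# the product of selected similarity factors over partitions rho^m ~ q(rho).
log_gn <- function(x, gamma, log_g_list, M = 1, nsim = 1000) {
  lg <- replicate(nsim, {
    e <- sample_dp_partition(nrow(x), M)
    sum(sapply(unique(e), function(j) {
      idx <- which(e == j)
      sum(sapply(which(gamma == 1), function(l) log_g_list[[l]](x[idx, l])))
    }))
  })
  max(lg) + log(mean(exp(lg - max(lg))))   # log of the Monte Carlo average
}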

Using the described posterior simulation and the data from the previous example, we find the following posterior probabilities for variable selection. The posterior mode is achieved by the full model with all covariates, p(γ = (1, … , 1)|xn, yn) = 0.62. The next most likely model, with p(γ = (1, 1, 0, 1, 1, 1)|xn, yn) = 0.38, is the model without the indicator HI for high dose. It is clinically plausible that the effect of high dose would fade in comparison with initial tumor burden and other important baseline characteristics. One would still continue to include HI as the treatment covariate, that is, the only covariate that can be manipulated by choice. The cumulative posterior probability of all other 62 models is less than 0.001.

In summary, model-based inference confirms the inference reported by summaries such as Figure 5. In general we prefer the use of summaries of cluster homogeneity as in Figure 5, as it readily generalizes to larger numbers of covariates, including possibly interesting interactions.

7. CONCLUSION

We have proposed a novel model for random partitions with a regression on covariates. The model builds on the popular PPM random partition models by introducing an additional factor to modify the cohesion function. We refer to the additional factor as similarity function. It increases the prior probability that experimental units with similar covariates are co-clustered. We provide default choices of the similarity function for popular data formats.

The main features of the model are the possibility to include additional prior information related to the covariates, the principled nature of the model construction, and a computationally efficient implementation.

Among the limitations of the proposed method is an implicit penalty for cluster size that is implied by the similarity function. Consider all covariates equal, xi ≡ x. The value of the similarity functions proposed in Section 4 decreases with cluster size. This limitation could be mitigated by allowing an additional factor c*(|Sj|) in (2.1) to compensate for the size penalty implicit in the similarity function.

Supplementary Material

code

R-package: R-package PPMx containing code to carry out inference for the model described in the article. The function ppmx implements the proposed covariate-dependent random partition model for an arbitrary combination of continuous, categorical, binary, and count covariates, using a mixture of normal sampling model for yi. The package is also available for download from http://www.math.utexas.edu/users/pmueller/prog.html.

data

Simulation data: The data file contains the 1000 simulated data points used for the Monte Carlo study that is described in Section 5.2.

ACKNOWLEDGMENTS

The research of P. Müller and G. L. Rosner was supported in part by NIH grant 1R01CA75981. The research of F. Quintana was supported by FONDECYT grant 1060729 and the Laboratorio de Análisis Estocástico PBCT-ACT13.

Contributor Information

Peter Müller, Department of Biostatistics, M. D. Anderson Cancer Center, Houston, TX 77230-1402 (pmueller@math.utexas.edu).

Fernando Quintana, Departamento de Estadística, Pontificia Universidad Católica de Chile, Santiago, Chile.

Gary L. Rosner, Division of Oncology Biostatistics, The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins, Baltimore, MD 21205-2013.

REFERENCES

  1. Banfield JD, Raftery AE. Model-Based Gaussian and Non-Gaussian Clustering. Biometrics. 1993;49:803–821.
  2. Barry D, Hartigan JA. A Bayesian Analysis for Change Point Problems. Journal of the American Statistical Association. 1993;88:309–319.
  3. Bernardo J-M, Smith AFM. Bayesian Theory. Wiley Series in Probability and Mathematical Statistics. Chichester: Wiley; 1994.
  4. Bishop CM, Svensén M. Bayesian Hierarchical Mixtures of Experts. In: Kjaerulff U, Meek C, editors. Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence; 2003. pp. 57–64.
  5. Crowley EM. Product Partition Models for Normal Means. Journal of the American Statistical Association. 1997;92:192–198.
  6. Dahl DB. Distance-Based Probability Distribution for Set Partitions With Applications to Bayesian Nonparametrics. In: Proceedings of the Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association; 2008.
  7. Dahl DB. Modal Clustering in a Class of Product Partition Models. Bayesian Analysis. 2009;4:243–264.
  8. Dasgupta A, Raftery AE. Detecting Features in Spatial Point Processes With Clutter via Model-Based Clustering. Journal of the American Statistical Association. 1998;93:294–302.
  9. Ferguson TS. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics. 1973;1:209–230.
  10. Fraley C, Raftery AE. Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association. 2002;97:611–631.
  11. Green PJ, Richardson S. Modelling Heterogeneity With and Without the Dirichlet Process. Technical report, University of Bristol, Dept. of Mathematics; 1999.
  12. Hartigan JA. Partition Models. Communications in Statistics, Part A—Theory and Methods. 1990;19:2745–2756.
  13. Ishwaran H, James LF. Generalized Weighted Chinese Restaurant Processes for Species Sampling Mixture Models. Statistica Sinica. 2003;13:1211–1235.
  14. Jasra A, Holmes CC, Stephens DA. Markov Chain Monte Carlo Methods and the Label Switching Problem in Bayesian Mixture Modeling. Statistical Science. 2005;20:50–67.
  15. Johnson VE, Albert JH. Ordinal Data Modeling. Statistics for Social Science and Public Policy. New York: Springer-Verlag; 1999.
  16. Jordan M, Jacobs R. Hierarchical Mixtures-of-Experts and the EM Algorithm. Neural Computation. 1994;6:181–214.
  17. Lau JW, Green PJ. Bayesian Model-Based Clustering Procedures. Journal of Computational and Graphical Statistics. 2007;16:526–558.
  18. Leisch F. FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R. Journal of Statistical Software. 2004;11:1–18.
  19. Marin J-M, Robert CP. Bayesian Core: A Practical Approach to Computational Bayesian Statistics. New York: Springer-Verlag; 2007.
  20. Park J-H, Dunson D. Bayesian Generalized Product Partition Models. Statistica Sinica. 2010;20:1203–1226.
  21. Pitman J. Some Developments of the Blackwell–MacQueen Urn Scheme. In: Ferguson TS, Shapley LS, MacQueen JB, editors. Statistics, Probability and Game Theory: Papers in Honor of David Blackwell. IMS Lecture Notes—Monograph Series. Hayward, CA: Institute of Mathematical Statistics; 1996. pp. 245–268.
  22. Quintana FA. A Predictive View of Bayesian Clustering. Journal of Statistical Planning and Inference. 2006;136:2407–2429.
  23. Quintana FA, Iglesias PL. Bayesian Clustering and Product Partition Models. Journal of the Royal Statistical Society, Ser. B. 2003;65:557–574.
  24. Richardson S, Green PJ. On Bayesian Analysis of Mixtures With an Unknown Number of Components (with discussion). Journal of the Royal Statistical Society, Ser. B. 1997;59:731–792.
  25. Rosner GL. Bayesian Monitoring of Clinical Trials With Failure-Time Endpoints. Biometrics. 2005;61:239–245. doi: 10.1111/j.0006-341X.2005.031037.x.
  26. Shahbaba B, Neal RM. Nonlinear Models Using Dirichlet Process Mixtures. Journal of Machine Learning Research. 2009;10:1829–1850.
  27. Späth H. Clusterwise Linear Regression. Computing. 1979;22:93–119.
