Published in final edited form as: J Comput Graph Stat. 2011 Mar 1;20(1):260–278. doi: 10.1198/jcgs.2011.09066

A Product Partition Model With Regression on Covariates

Peter Müller 1, Fernando Quintana 2, Gary L Rosner 3

Abstract

We propose a probability model for random partitions in the presence of covariates. In other words, we develop a model-based clustering algorithm that exploits available covariates. The motivating application is predicting time to progression for patients in a breast cancer trial. We proceed by reporting a weighted average of the responses of clusters of earlier patients. The weights should be determined by the similarity of the new patient’s covariates with the covariates of patients in each cluster. We achieve the desired inference by defining a random partition model that includes a regression on covariates. Patients with similar covariates are a priori more likely to be clustered together. Posterior predictive inference in this model formalizes the desired prediction.

We build on product partition models (PPM). We define an extension of the PPM to include a regression on covariates by adding to the cohesion function a new factor that increases the probability that experimental units with similar covariates are included in the same cluster. We discuss implementations suitable for any combination of continuous, categorical, count, and ordinal covariates.

An implementation of the proposed model as an R package is available for download.

Keywords: Clustering, Nonparametric Bayes, Variable selection

1. INTRODUCTION

We develop a probability model for clustering with covariates, that is, a probability model for partitioning a set of experimental units, where the probability of any particular partition is allowed to depend on covariates. The motivating application is inference in a clinical trial. The outcome is time to progression for breast cancer patients. The covariates include treatment dose, initial tumor burden, an indicator for menopause, and more. We wish to define a probability model for clustering patients with the specific feature that patients with equal or similar covariates should be a priori more likely to co-cluster than others.

Let i = 1, … , n, index experimental units, and let ρn = {S1, … , Skn} denote a partition of the n experimental units into kn subsets Sj. Let xi and yi denote the covariates and response reported for the ith unit. Let xn = (x1, … , xn) and yn = (y1, … , yn) denote the entire set of recorded covariates and response data, and let x*j = (xi, i ∈ Sj) and y*j = (yi, i ∈ Sj) denote covariates and response data arranged by clusters. Sometimes it is convenient to introduce cluster membership indicators ei ∈ {1, … , kn} with ei = j if i ∈ Sj, and use (kn, e1, … , en) to describe the partition. We call a probability model p(ρn) a clustering model, excluding in particular purely constructive definitions of clustering as a deterministic algorithm (without reference to probability models). In particular, the number of clusters kn is itself unknown. The clustering model p(ρn) implies a prior model p(kn). Many clustering models include, implicitly or explicitly, a sampling model p(yn|xn, ρn). Probability models p(ρn) and inference for clustering have been extensively discussed over the past few years. See the article of Quintana (2006) for a recent review. In this article we are interested in adding a regression, replacing p(ρn) with p(ρn|xn).

We focus on the product partition model (PPM). The PPM (Hartigan 1990; Barry and Hartigan 1993; Crowley 1997) constructs p(ρn) by introducing cohesion functions c(A) ≥ 0 for A ⊆ {1, … , n} that measure how tightly grouped the elements in A are thought to be, and defines a probability model for a partition ρn and data yn as

p(\rho_n) \propto \prod_{j=1}^{k_n} c(S_j) \quad\text{and}\quad p(y^n \mid \rho_n) = \prod_{j=1}^{k_n} p_j(y^\star_j).   (1.1)

Model (1.1) is conjugate. The posterior p(ρn|yn) is again in the same product form.

Alternatively, the species sampling model (SSM) (Pitman 1996; Ishwaran and James 2003) defines an exchangeable probability model p(ρn) that depends on ρn only indirectly through the cardinality of the partitioning subsets, p(ρn) = p(|S1|, … , |Skn|). The SSM can be alternatively characterized by a sequence of predictive probability functions (PPFs) that describe how individuals are sequentially assigned to either already formed clusters or to start new ones. The choice of the PPF is not arbitrary. One has to make sure that a sequence of random variables that are sampled by iteratively applying the PPF is exchangeable. The popular Dirichlet process (DP) model (Ferguson 1973) is a special case of a SSM. Moreover, the marginal distribution that a DP induces on partitions is also a PPM with cohesions c(Sj) = M(|Sj| − 1)! (Quintana and Iglesias 2003). Here M denotes the total mass parameter of the DP prior.
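
The equivalence between the DP-induced partition and a PPM with cohesions c(Sj) = M(|Sj| − 1)! can be made concrete by sampling from the implied model. The following is a minimal R sketch under that cohesion choice; it is our illustration, not code from the paper's accompanying package, and the function name sample_dp_partition is our own.

```r
# Minimal sketch: draw a random partition from the PPM implied by a
# DP(M, G*) prior, i.e., cohesions c(S_j) = M (|S_j| - 1)!, by
# sequentially applying the Polya urn PPF.
sample_dp_partition <- function(n, M) {
  e <- integer(n)                      # cluster membership indicators e_i
  e[1] <- 1
  for (i in 2:n) {
    sizes <- tabulate(e[1:(i - 1)])    # current cluster sizes n_j
    probs <- c(sizes, M)               # existing clusters, then a new one
    e[i] <- sample.int(length(probs), 1, prob = probs)
  }
  e
}

set.seed(1)
table(sample_dp_partition(100, M = 1)) # cluster sizes of one draw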

Model-based clustering (Banfield and Raftery 1993; Dasgupta and Raftery 1998; Fraley and Raftery 2002) implicitly defines a probability model on clustering by assuming a mixture model p(yi | θ, τ, kn) = Σ_{j=1}^{kn} τj pj(yi | θj), where θ = (θ1, … , θkn) and τ = (τ1, … , τkn). Together with a prior p(kn) on kn and p(θ, τ|kn), the mixture implicitly defines a probability model on clustering. Consider the equivalent hierarchical model

p(y_i \mid e_i = j, k_n, \theta, \tau) = p_j(y_i \mid \theta_j) \quad\text{and}\quad \Pr(e_i = j \mid k_n, \theta, \tau) = \tau_j.   (1.2)

The implied posterior distribution on (e1, … , en) and kn defines a probability model on ρn. Richardson and Green (1997) developed posterior simulation strategies for mixtures of normal models. Green and Richardson (1999) discussed the relationship to DP mixture models.

In this article we build on the PPM (1.1) to define a covariate-dependent random partition model by augmenting the PPM with an additional factor that induces the desired dependence on the covariates. We refer to the additional factor as similarity function. Similar approaches are discussed by Shahbaba and Neal (2009) and Park and Dunson (2010). Both effectively included the covariates as part of an augmented response vector. We discuss more details of these two and other alternative approaches in Section 5 where we also carry out a small Monte Carlo study for an empirical comparison.

In Section 2 we state the proposed model and considerations in choosing the similarity function. In Section 3 we show that the computational effort of posterior simulation remains essentially unchanged from PPM models without covariates. In Section 4 we propose specific choices of the similarity function for common data formats. Section 5 reviews some alternative approaches, and we summarize a small Monte Carlo study to compare some of these approaches. In Section 6 we show a simulation study and a data analysis example.

2. THE PPMX MODEL

We build on the PPM (1.1), modifying the cohesion function c(Sj) with an additional factor that achieves the desired regression. Recall that x*j = (xi, i ∈ Sj) denotes all covariates for units in the jth cluster. Let g(x*j) denote a nonnegative function of x*j that formalizes similarity of the xi, with larger values of g(x*j) for sets of covariates that are judged to be similar. We define the model

p(\rho_n \mid x^n) \propto \prod_{j=1}^{k_n} g(x^\star_j)\, c(S_j)   (2.1)

with the normalization constant gn(xn) = Σ_{ρn} ∏_{j=1}^{kn} g(x*j) c(Sj). By a slight abuse of notation we include x behind the conditioning bar even when x is not a random variable. We will later discuss specific choices for the similarity function g. As a default choice we propose to define g(·) as the marginal probability in an auxiliary probability model q, even if the xi are not considered random,

g(x^\star_j) \equiv \int \prod_{i \in S_j} q(x_i \mid \xi_j)\, q(\xi_j)\, d\xi_j.   (2.2)

There is no notion of the xi being random variables, but the use of a probability density q(·) in the construction of g(·) is convenient since it allows for easy calculus. The correlation that is induced by the cluster-specific parameters ξj in (2.2) leads to higher values of g(x*j) for tightly clustered, similar xi, as desired. More importantly, we show below that under some minimal assumptions a similarity function g(·) is necessarily of the form (2.2). The function (2.2) satisfies the following two properties that are desirable for a similarity function in (2.1). First, we require symmetry with respect to permutations of the sample indices i. The probability model must not depend on the order of introducing the experimental units. This implies that the similarity function g(·) must be symmetric in its arguments. Second, we require that the similarity function scales across sample sizes, in the sense that g(x*) = ∫ g(x*, x) dx. In words, the similarity of any cluster is the average similarity of any augmented cluster.

Under these two requirements (2.2) is not only technically convenient. It is the only possible similarity function that satisfies these two constraints.

Theorem

Assume that (i) a similarity function g(x*) satisfies the two constraints; and (ii) g(x*j) has a finite integral over the covariate space, ∫ g(x*j) dx*j < ∞. Then g(x*j) is proportional to the marginal distribution of x*j under a hierarchical auxiliary model of the form (2.2).

Proof

The result is a direct consequence of De Finetti’s representation theorem for exchangeable probability measures, applied to a normalized version of q(·). See, for example, the book by Bernardo and Smith (1994, chapter 4.3). The representation theorem applies for an infinite sequence of variables that are subject to the symmetry constraint. The result establishes that all similarity functions that satisfy the symmetry constraint are of the form (2.2).

The definition of the similarity function with the auxiliary model q(·) also implies another important property. The random partition model (2.1) is coherent across sample sizes. The model for the first n experimental units follows from the model for (n + 1) observations by appropriate marginalization. Without covariates we would simply require p(ρn) = Σen+1 p(ρn+1), with the summation defined over all possible values of en+1. With covariates we consider the following condition. Assume that the similarity function is defined by means of an auxiliary model, as in (2.2). The covariate-dependent PPM (2.1) is then coherent across sample sizes, as formalized by the following relationship of p(ρn|xn) and p(ρn+1|xn, xn+1):

p(\rho_n \mid x^n) = \sum_{e_{n+1}} \int p(\rho_{n+1} \mid x^n, x_{n+1})\, q(x_{n+1} \mid x^n)\, dx_{n+1}   (2.3)

for the probability model q(xn+1|xn) ∝ gn+1(xn+1)/gn(xn).

We complete the random partition model (2.1) with a sampling model that defines independence across clusters and exchangeability within each cluster. We include cluster-specific parameters θj and common hyperparameters η:

p(y^n \mid \rho_n, \theta, \eta, x^n) = \prod_{j=1}^{k_n} \prod_{i \in S_j} p(y_i \mid x_i, \theta_j, \eta) \quad\text{and}\quad p(\theta \mid \eta) = \prod_{j=1}^{k_n} p(\theta_j \mid \eta),   (2.4)

where θ = (θ1, … , θkn). We refer to (2.1) together with (2.4) as PPM with covariates, and write PPMx for short. The resulting model extends PPMs of the type (1.1) while keeping the product structure. Note that the sampling model p(yi|xi, θj, η) in (2.4) can include a regression of the response data on the covariate xi. For example, Shahbaba and Neal (2009) included a logistic regression for yi on xi, in addition to a covariate-dependent partition model p(ρn|xn). Similarly, cluster-specific covariates wj could be included by replacing p(θj|η) by p(θj|η, wj).
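
To make the construction concrete, the following is a minimal R sketch (our illustration, not code from the paper's accompanying package) that evaluates log p(ρn|xn) in (2.1) up to the normalization constant gn(xn), using the DP cohesion c(Sj) = M(nj − 1)!. The function name log_ppmx_prior and the plug-in argument log_similarity, which returns log g(x*j) for the covariate submatrix of one cluster (concrete choices follow in Section 4), are our own assumptions.

```r
# Minimal sketch: log of the PPMx prior (2.1), up to the normalization
# constant g_n(x^n), for a partition given by membership indicators e.
log_ppmx_prior <- function(e, x, log_similarity, M = 1) {
  sum(sapply(unique(e), function(j) {
    idx <- which(e == j)
    log(M) + lgamma(length(idx)) +     # log c(S_j) = log M + log (n_j - 1)!
      log_similarity(x[idx, , drop = FALSE])
  }))
}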

3. POSTERIOR INFERENCE

3.1 Markov Chain Monte Carlo Posterior Simulation

A practical advantage of the proposed default choice for g(x*j) is that it greatly simplifies posterior simulation. In words, posterior inference in the PPMx (2.1) and (2.4) is identical to the posterior inference that we would obtain if the xi were part of the random response vector yi. The converse is not true: not every model with an augmented response vector is equivalent to a PPMx model.

Formally, and as a computational device only, consider the auxiliary model q(·) defined as

q(y^n, x^n \mid \rho_n, \theta, \eta) = \prod_{j=1}^{k_n} \prod_{i \in S_j} p(y_i \mid x_i, \theta_j, \eta)\, q(x_i \mid \xi_j) \quad\text{and}\quad q(\theta, \xi \mid \eta) = \prod_{j=1}^{k_n} p(\theta_j \mid \eta)\, q(\xi_j),   (3.1)

and replace the covariate-dependent prior p(ρn|xn) by the PPM q(ρn) ∝ ∏_{j=1}^{kn} c(Sj). The posterior distribution q(ρn|yn, xn) under the auxiliary model is identical to p(ρn|yn, xn) under the proposed model (2.4). But posterior simulation under q(·) can be carried out using standard techniques, for example, as discussed by Quintana (2006). An important caveat of this operational trick, though, is that ξj and θj must be a priori independent. In particular, the prior on θj and q(ξj) must not include common hyperparameters. If we let p(θj|η) and q(ξj|η) depend on a common hyperparameter η, then the posterior distribution under the auxiliary model (3.1) would differ from the posterior under the original model. Let q(xn|η) = Σ_{ρn} ∏_j ∫ q(x*j | ξj) dq(ξj | η). The two posterior distributions would differ by a factor q(xn|η). The implication for the desired inference can be substantial.

However, we do not consider the principled way of introducing the covariates into the PPM to be the main feature of the PPMx. Rather, we argue that the proposed model greatly simplifies the inclusion of covariates across a variety of data formats. It would be unnecessarily difficult to attempt the construction of a joint model for continuous responses and categorical, binary, and continuous covariates. We argue that it is far easier to focus on a modification of the cohesion function. This is illustrated in the data analysis example in Section 6. From a modeling perspective it is desirable to separate the choice of the prior on the partition from the sampling model conditional on an assumed partition.
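
For concreteness, here is a minimal R sketch of the resulting marginal Gibbs update for a single membership indicator ei under the auxiliary model (3.1) with DP cohesions: an existing cluster j receives weight nj⁻ · g(x*j, xi)/g(x*j) · m(yi|y*j), and a new cluster receives M · g(xi) · m(yi). The names update_e_i, log_g (a similarity function as in Section 4), and log_m (the marginal likelihood of yi given a cluster's responses) are our own assumptions, not the package's API.

```r
# Minimal sketch of the marginal Gibbs update for one indicator e_i.
update_e_i <- function(i, e, x, y, log_g, log_m, M = 1) {
  labs <- sort(unique(e[-i]))
  lw <- sapply(labs, function(j) {
    idx <- setdiff(which(e == j), i)   # cluster j without unit i
    log(length(idx)) +                 # cohesion ratio n_j^-
      log_g(x[c(idx, i), , drop = FALSE]) - log_g(x[idx, , drop = FALSE]) +
      log_m(y[i], y[idx])
  })
  lw <- c(lw, log(M) + log_g(x[i, , drop = FALSE]) +
                log_m(y[i], numeric(0)))         # open a new cluster
  w <- exp(lw - max(lw))
  c(labs, max(labs) + 1L)[sample.int(length(w), 1, prob = w)]
}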

3.2 Predictive Inference

A minor complication arises with posterior predictive inference, that is, when reporting p(yn+1|yn, xn, xn+1). Using x̃ = xn+1, ỹ = yn+1, and ẽ = en+1 to simplify notation, we find p(ỹ | x̃, xn, yn) = ∫ p(ỹ | x̃, ρn+1, xn, yn) dp(ρn+1 | x̃, yn, xn). The integral is simply a sum over all configurations ρn+1. But it is not immediately recognizable as a posterior integral with respect to p(ρn|xn, yn). This can easily be overcome by an importance sampling reweighting step. Let g(∅) = c(∅) ≡ 1. The prior on ρn+1 = (ρn, ẽ) can be written as

p(\tilde e = \ell, \rho_n \mid \tilde x, x^n) \propto \prod_{j \neq \ell} c(S_j)\, g(x^\star_j) \cdot c(S_\ell \cup \{n+1\})\, g(x^\star_\ell, \tilde x) = p(\rho_n \mid x^n)\, \frac{g(x^\star_\ell, \tilde x)}{g(x^\star_\ell)}\, \frac{c(S_\ell \cup \{n+1\})}{c(S_\ell)}.

Let q(x̃ | x*ℓ) ≡ g(x*ℓ, x̃)/g(x*ℓ). The posterior predictive distribution becomes

p(\tilde y \mid \tilde x, y^n, x^n) = \sum_{\ell=1}^{k_n+1} \int p(\tilde y \mid \tilde x, y^\star_\ell, x^\star_\ell, \tilde e = \ell)\, q(\tilde x \mid x^\star_\ell) \times \frac{c(S_\ell \cup \{n+1\})}{c(S_\ell)}\, p(\rho_n \mid y^n, x^n)\, d\rho_n.   (3.2)

The first factor reduces to p(ỹ | y*ℓ, ẽ = ℓ) when (2.4) does not include a regression on xi in the sampling model. Sampling from (3.2) is implemented on top of posterior simulation for ρn ~ p(ρn|xn, yn). For each imputed ρn, generate ẽ = ℓ with probabilities proportional to wℓ = q(x̃ | x*ℓ) c(Sℓ ∪ {n+1})/c(Sℓ), and generate ỹ from p(ỹ | y*ℓ, ẽ = ℓ), weighted with wℓ. In the special case ℓ = kn + 1 we get wkn+1 ∝ g(x̃) c({n+1}).
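
The following minimal R sketch (ours) implements one such predictive draw, assuming DP cohesions so that c(Sℓ ∪ {n+1})/c(Sℓ) = nℓ and c({n+1}) = M. The arguments log_g (a similarity function as in Section 4) and draw_y (a sampler from the predictive of the chosen cluster) are assumed given; all names are our own.

```r
# Minimal sketch of one draw of (e~, y~) from (3.2) under DP cohesions;
# assumes cluster labels in e are 1, ..., k.
predict_one <- function(e, x, xtilde, log_g, draw_y, M = 1) {
  k  <- max(e)
  lw <- numeric(k + 1)
  for (l in 1:k) {
    idx   <- which(e == l)
    lw[l] <- log(length(idx)) +                        # cohesion ratio n_l
      log_g(rbind(x[idx, , drop = FALSE], xtilde)) -   # g(x*_l, x~)
      log_g(x[idx, , drop = FALSE])                    #   / g(x*_l)
  }
  lw[k + 1] <- log(M) + log_g(matrix(xtilde, nrow = 1))  # new cluster
  w <- exp(lw - max(lw))
  draw_y(sample.int(k + 1, 1, prob = w))   # e~ = l w.p. propto w_l, then y~
}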

3.3 Posterior Inference on Clusters

In many applications inference on the number of clusters, kn, is of interest. Under the proposed model, p(kn = r | yn, xn) = Σ_{ρn: |ρn| = r} p(ρn | yn, xn) is the induced posterior distribution on the number of clusters. The marginal posterior p(kn = r|yn, xn) is reported as a simple summary of posterior Markov chain Monte Carlo (MCMC) simulation. Each iteration of the posterior simulation scheme in Section 3.1 involves imputing a partition ρn, and thus kn. A formal selection of the number of clusters could be carried out using a decision-theoretic approach, that is, a loss function that weighs model complexity against the value of learning about the inferential target. Such formal procedures were discussed, among others, by Quintana and Iglesias (2003) and Lau and Green (2007). The special case of finding the maximum a posteriori clustering in a class of product partition models is considered by Dahl (2009).

Reporting posterior inference for the random partition ρn is complicated by the label switching problem. The posterior distribution is invariant under permutations of the cluster labels when p(ρn) and p(θ) are both symmetric. This implies that inference on cluster-specific parameters is meaningless. See the article by Jasra, Holmes, and Stephens (2005) for a recent discussion and review of the literature. Usually the focus of inference in semi-parametric mixture models like (2.4) is on density estimation and prediction, rather than inference for specific clusters and parameter estimation. Predictive inference is not subject to the label switching problem, as the posterior predictive distribution marginalizes over all possible partitions. In the application reported later, when we want to highlight inference about specific clusters we use a simple ad hoc approach that we found particularly meaningful for reporting inference about the regression p(ρn|xn). We report inference stratified by kn. For given kn we first find a set of kn indices I = (i1, … , ikn) with high posterior probability of the corresponding cluster indicators being distinct, that is, Pr(DI|y) is high for DI = {ei ≠ ej for i ≠ j and i, j ∈ I}. To avoid computational complications, we do not insist on finding the kn-tuple with the highest posterior probability. We then use i1, … , ikn as anchors to define cluster labels by restricting posterior simulations to kn clusters with the units ij in distinct clusters. Postprocessing MCMC output, this is easily done by discarding all imputed parameter vectors that do not satisfy the constraint. We relabel clusters, indexing the cluster that contains unit ij as cluster j. Now it is meaningful to report posterior inference on specific clusters; a sketch of this postprocessing appears below. The proposed postprocessing is similar to the pivotal reordering suggested by Marin and Robert (2007, chapter 6.4).
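
As a concrete illustration (our sketch, not package code), the following R function postprocesses an (iterations × n) matrix E of imputed membership indicators: it keeps draws with kn = k clusters and the anchor units i1, … , ik in distinct clusters, and relabels so that the cluster containing anchor j is labeled j.

```r
# Minimal sketch of anchor-based relabeling of MCMC output.
relabel_by_anchors <- function(E, anchors) {
  k <- length(anchors)
  keep <- apply(E, 1, function(e)                     # enforce the constraint
    length(unique(e)) == k && length(unique(e[anchors])) == k)
  E <- E[keep, , drop = FALSE]
  t(apply(E, 1, function(e) match(e, e[anchors])))    # new label = anchor index
}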

4. SIMILARITY FUNCTIONS

For continuous covariates we suggest as a default choice for g(x*j) the marginal distribution of x*j under a normal sampling model. Let N(x; m, V) denote a normal model for the random variable x, with moments m and V, and let Ga(x; a, b) denote a gamma-distributed random variable with mean a/b. We use q(x*j | ξj) = ∏_{i∈Sj} N(xi; mj, vj), with a conjugate prior for ξj = (mj, vj), p(ξj) = N(mj; m, B) Ga(vj⁻¹; ν, S0), with fixed m, B, ν, S0. The main reason for this choice is operational simplicity. A simplified version uses fixed vj ≡ v. The resulting function g(x*j) is the joint density of a correlated multivariate t-distribution, with location parameter m and scaling matrix B/(vI + B). The fixed variance v specifies how strongly we weigh similarity of the xi. In implementations we used v = c1Ŝ, where Ŝ is the empirical variance of the covariate, and c1 = 0.5 is a scaling factor that specifies over what range we consider values of this covariate as similar. The choice between fixed v versus variable vj should reflect prior judgment on the variability of clusters. Variable vj allows some clusters to include a wider range of x values than others. Finally, a sufficiently vague prior for mj is important to ensure that the similarity is appropriately quantified even for a group of covariate values in the extreme areas of the covariate space. In our implementation we used B = c2Ŝ with c2 = 10.
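
Under the fixed-variance simplification, the marginal g(x*j) is multivariate normal with mean m·1 and covariance vI + B·11ᵀ and can be evaluated in closed form. A minimal R sketch of the resulting log similarity follows (our code; the defaults v = 0.5 and B = 10 correspond to c1Ŝ and c2Ŝ for a standardized covariate with Ŝ = 1).

```r
# Minimal sketch: log g(x*_j) for one continuous covariate with fixed v,
# i.e., the N(m 1, v I + B 11') marginal evaluated analytically.
log_g_continuous <- function(x, m = 0, v = 0.5, B = 10) {
  n  <- length(x)
  xc <- x - m
  qf <- (sum(xc^2) - B * sum(xc)^2 / (v + n * B)) / v  # (x-m)' Sigma^{-1} (x-m)
  ld <- (n - 1) * log(v) + log(v + n * B)              # log det(v I + B 11')
  -0.5 * (n * log(2 * pi) + ld + qf)
}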

When constructing a cohesion function for categorical covariates, a default choice is based on a Dirichlet prior. Assume xi is a categorical covariate, xi ∈ {1, … , C}. To define g(xj), let q(xi = c|ξj) = ξjc denote the probability mass function. Together with a conjugate Dirichlet prior, q(ξj) = Dir(α1, … , αC), we define the similarity function as a Dirichlet-categorical probability

g(x^\star_j) = \int \prod_{i \in S_j} \xi_{j, x_i}\, dq(\xi_j) = \int \prod_{c=1}^{C} \xi_{jc}^{n_{jc}}\, dq(\xi_j),   (4.1)

with njc = Σ_{i∈Sj} I(xi = c). This is a Dirichlet-multinomial model without the multinomial coefficient. For binary covariates the similarity function becomes a beta–binomial probability without the binomial coefficient. The choice of the hyperparameters α needs some care. To facilitate the formation of clusters that are characterized by the categorical covariates we recommend Dirichlet hyperparameters αc < 1. For example, for C = 2, the bimodal nature of a beta distribution with such parameters assigns high probability to binomial success probabilities ξj1 close to 0 or 1. Similarly, the Dirichlet distribution with parameters αc < 1 favors clusters corresponding to specific levels of the covariate.
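
The Dirichlet-categorical marginal in (4.1) has the closed form ∏c Γ(αc + njc)/Γ(αc) · Γ(A)/Γ(A + nj), with A = Σc αc. A minimal R sketch (our code):

```r
# Minimal sketch: log similarity (4.1) for a categorical covariate,
# the Dirichlet-multinomial probability without the multinomial coefficient.
log_g_categorical <- function(x, alpha) {
  njc <- tabulate(x, nbins = length(alpha))   # counts n_jc per level c
  lgamma(sum(alpha)) - sum(lgamma(alpha)) +
    sum(lgamma(alpha + njc)) - lgamma(sum(alpha) + length(x))
}

log_g_categorical(c(1, 1, 2, 1), alpha = c(0.1, 0.1))   # example call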

For ordinal covariates, a convenient specification of g(·) is an ordinal probit model. The model can be defined by means of latent variables and cutoffs. Assume an ordinal covariate x with C categories. Following Johnson and Albert (1999), consider a latent trait Z and cutoffs −∞ = γ0 < γ1 ≤ γ2 ≤ ⋯ ≤ γC−1 < γC = ∞, so that x = c if and only if γc−1 < Z ≤ γc. We use fixed cutoff values γc = c − 1, c = 1, … , C − 1, and a normally distributed latent trait, Zi ~ N(mj, vj). Let Φc(mj, vj) = Pr(γc−1 < Z ≤ γc | mj, vj) denote the normal interval probabilities, and let ξj = (mj, vj) denote the cluster-specific moments of the latent score. We define q(xi = c | ei = j, ξj) = Φc(mj, vj). The definition of the similarity function is completed with q(ξj) as a normal-inverse gamma distribution, as for the continuous covariates. We have g(x*j) = ∫ ∏_{c=1}^{C} Φc(mj, vj)^{ncj} N(mj; m, B) Ga(vj⁻¹; ν, S0) dmj dvj, where ncj = Σ_{i∈Sj} I{xi = c}.
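
The double integral over (mj, vj) has no convenient closed form, but it is easily approximated by Monte Carlo, averaging ∏c Φc(m, v)^{ncj} over prior draws of (m, v). A minimal R sketch under the stated cutoffs γc = c − 1 (our code; the hyperparameter defaults are illustrative):

```r
# Minimal sketch: Monte Carlo evaluation of the ordinal-probit similarity.
log_g_ordinal <- function(x, C, m0 = 0, B = 1, nu = 2, S0 = 1, nsim = 2000) {
  gam <- c(-Inf, 0:(C - 2), Inf)              # cutoffs gamma_c = c - 1
  ncj <- tabulate(x, nbins = C)
  m   <- rnorm(nsim, m0, sqrt(B))             # m_j ~ N(m0, B)
  v   <- 1 / rgamma(nsim, nu, S0)             # v_j^{-1} ~ Ga(nu, S0)
  ll  <- sapply(seq_len(nsim), function(s) {
    p <- diff(pnorm(gam, mean = m[s], sd = sqrt(v[s])))   # Phi_c(m, v)
    sum(ncj * log(pmax(p, 1e-300)))
  })
  max(ll) + log(mean(exp(ll - max(ll))))      # log of the Monte Carlo average
}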

Finally, to define g(·) for count-type covariates, we use a mixture of Poisson distributions

g(x^\star_j) = \frac{1}{\prod_{i \in S_j} x_i!} \int \xi_j^{\sum_{i \in S_j} x_i} \exp(-\xi_j\, |S_j|)\, dq(\xi_j).   (4.2)

With a conjugate gamma prior, q(ξj) = Ga(ξj; a, b), the similarity function g(x*j) allows easy analytic evaluation. As a default, we suggest choosing a = 1 and a/b = cŜ, where Ŝ is the empirical variance of the covariate and a/b is the expectation of the gamma prior.
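
With the gamma prior, (4.2) evaluates to [∏i xi!]⁻¹ · b^a/Γ(a) · Γ(a + Σi xi)/(b + nj)^{a + Σi xi}. A minimal R sketch (our code):

```r
# Minimal sketch: log similarity (4.2) for a count covariate,
# the Poisson-gamma marginal in closed form.
log_g_count <- function(x, a = 1, b = 1) {
  n <- length(x)
  s <- sum(x)
  -sum(lfactorial(x)) + a * log(b) - lgamma(a) +
    lgamma(a + s) - (a + s) * log(b + n)
}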

The main advantage of the proposed similarity functions is computational simplicity of posterior inference. A minor limitation is the fact that the proposed default similarity functions, in addition to the desired dependence of the random partition model on covariates, also include a dependence on the cluster size nj = |Sj|. From a modeling perspective this is undesirable. The mechanism to define size dependence should be the underlying PPM and the cohesion functions c(Sj) in (2.1). However, we argue that the additional cluster size penalty that is introduced through g(·) is negligible in comparison to, for example, the cohesion function c(Sj) = M(nj − 1)! that is implied by the popular DP prior. It is straightforward to show that the similarity function introduces the following additional cluster size penalty in model (2.1). Consider the case of constant covariates, xi ≡ x, and let nj denote the size of cluster j. The default choices of g(x*j) for continuous, categorical, and count covariates introduce a penalty for large nj, with lim_{nj→∞} g(x*j) = 0. But the rate of decrease is ignorable compared to the cohesion c(Sj). Let f(nj) be such that lim_{nj→∞} g(x*j)/f(nj) = M with 0 < M < ∞. For continuous covariates the rate is f(nj) = (2π)^{−nj/2} V^{−(nj−1)/2} (r + nj)^{−1/2}, with r = V/B. For categorical covariates it is f(nj) = (A + nj)^{−(A−αx)} with A = Σc αc. For count covariates it is f(nj) = C^{−nj/2} (a + nj x)^{−1/2}, with C = 2πx exp(1/(6x)).

5. ALTERNATIVE APPROACHES AND COMPARISON

5.1 Other Approaches

We propose the PPMx model as a principled approach to defining covariate-dependent random partition models. A related approach that has attracted considerable attention, especially in the recent marketing and psychometrics literature, is clusterwise regression. The basic model is defined in the article by Späth (1979). The model uses a partition of the experimental units (observations) into subsets Sj. For each partitioning subset a different linear regression mean function is defined. A recent implementation of clusterwise regression as an R package flexmix was described by Leisch (2004). We use flexmix in the comparison below.

Another popular class of models that implements flexible regression based on clustering experimental units is the class of hierarchical mixtures of experts (HME) and related models. In contrast to clusterwise regression, the HME model allows the cluster weights to depend on the covariates themselves. HME models were introduced by Jordan and Jacobs (1994). Bayesian posterior inference was discussed by Bishop and Svensén (2003). We consider a specific instance of the HME model that we will use for an empirical comparison in Section 5.2. The model is expressed as a finite mixture of normal linear regressions, with mixture weights that themselves depend on the available covariates as well,

p(y_i; \eta) = \sum_{j=1}^{k_n} w_{ij}(x_i; \alpha_j)\, N(y_i; \theta_j^T x_i, \sigma_j^2).   (5.1)

Here η = (θ1, … , θkn, σ1², … , σkn², α1, … , αkn) is the full vector of parameters. The size kn of the mixture is prespecified. The covariate vector xi includes a 1 for an intercept term. The weights 0 ≤ wij ≤ 1 are defined as wij ∝ exp(αjᵀxi). For identifiability we assume α1 ≡ 0. The model specification is completed by assuming θj ~ N(μθ, Vθ), αj ~ N(μα, Vα), σj⁻² ~ Ga(a0, a1), and prior independence of all parameters. The HME formulation defines a highly flexible class of parametric models. But there are some important limitations. The dependence of cluster membership probabilities on the covariates is strictly limited to the assumed parametric form of wij(·). For example, unless an interaction of two covariates is included in wij(·), no amount of data will allow clusters that are specific to interactions of covariates. Second, the number of mixture components kn is assumed known. Ad hoc fixes are possible, such as choosing a very large kn, or introducing a hyperprior on kn.
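
The covariate dependence of the weights is the defining feature of (5.1). A minimal R sketch of the weight computation, wij ∝ exp(αjᵀxi) with α1 ≡ 0 (our own helper, not from any package):

```r
# Minimal sketch: covariate-dependent mixture weights of the HME (5.1).
hme_weights <- function(X, alpha) {    # X: n x p (first column 1), alpha: p x k
  alpha[, 1] <- 0                      # reference component for identifiability
  eta <- X %*% alpha                   # n x k linear predictors alpha_j' x_i
  w <- exp(eta - apply(eta, 1, max))   # numerically stabilized softmax
  w / rowSums(w)
}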

Clusterwise regression and HME do not explicitly include a probability model over random partitions. A model p(ρn) is implicitly defined as in (1.2), as model-based clustering. The dependence of cluster membership on covariates remains by definition restricted to the given structure. Alternatively, some recently proposed methods in the Bayesian literature take a perspective similar to the proposed PPMx model and include an explicit probability model for a random partition.

Park and Dunson (2010) considered the special case of continuous covariates. They defined the desired covariate-dependent random partition model as the posterior random partition under a PPM model for the covariates xi, that is, (1.1) with covariates xi replacing the response data yi. The posterior p(ρn|xn) is used to define the prior for the random partition ρn. In other words, they proceeded with an augmented response vector (xi, yi), carefully separating the prior for parameters related to the x and y subvectors, as in (3.1). See Section 3.1 for a discussion of the need to separate prior parameters related to x and y.

Another recently proposed approach is that of Shahbaba and Neal (2009). They used a logistic regression for a categorical response variable on covariates together with a random partition of the samples, with cluster-specific regression parameters in each partitioning subset. The random partition is defined by a DP prior and includes the covariates as part of the response vector. The proposed model can be written as (3.1) with p(yi|xi, θj, η) specified as a logistic regression with cluster-specific parameters θj, and q(xi|ξj) as a normal model with moments ξj = (μj, Σj). The cohesion functions c(Sj) are defined by a DP prior.

Dahl (2008) recently proposed another interesting approach to covariate-dependent random clustering. Let e = (e1, … , en) and let e⁻ⁱ denote the vector of cluster membership indicators without the ith element. Let p(ei = j | e⁻ⁱ) denote the conditional prior probabilities for the cluster membership indicators. For the random clustering implied by the DP prior, G ~ DP(M, G*), these conditional probabilities are known as the Polya urn. Let kn⁻ denote the number of distinct elements in e⁻ⁱ, Sj⁻ = {h : eh = j and h ≠ i}, and nj⁻ = |Sj⁻|. The Polya urn specifies p(ei = j | e⁻ⁱ) = nj⁻/(n − 1 + M) for j = 1, … , kn⁻, and M/(n − 1 + M) for j = kn⁻ + 1. Dahl (2008) defined the desired covariate-dependent random partition model by modifying the Polya urn. Assume covariates xi are available and it is desired to modify p(e) such that any two units with similar covariates have increased prior probability of co-clustering. Assume dih = d(xi, xh) can be interpreted as a distance of xi and xh. Let d* = max_{i<h} dih and define hi(Sj⁻) = c Σ_{h∈Sj⁻} (d* − dih), with c chosen to achieve Σ_{j=1}^{kn⁻} hi(Sj⁻) = n. Dahl (2008) defined the modified Polya urn

p(e_i = j \mid e^{-i}, x^n) \propto \begin{cases} h_i(S^-_j), & j = 1, \ldots, k^-_n \\ M, & j = k^-_n + 1. \end{cases}   (5.2)

This set of transition probabilities defines an ergodic Markov chain. Thus it implicitly defines a joint probability distribution on e that is informed by the relative distances dij as desired.
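
A minimal R sketch of the modified urn (5.2) (our code): given the remaining memberships and a distance matrix D, it returns the conditional probabilities for unit i over the existing clusters and a new one.

```r
# Minimal sketch: conditional probabilities of the modified Polya urn (5.2).
dahl_urn_probs <- function(i, e, D, M = 1) {
  labs  <- sort(unique(e[-i]))
  dstar <- max(D)
  h <- sapply(labs, function(j)                      # h_i(S_j^-) up to c
    sum(dstar - D[i, setdiff(which(e == j), i)]))
  h <- h * length(e) / sum(h)        # c chosen so that sum_j h_i(S_j^-) = n
  p <- c(h, M)                       # existing clusters, then a new one
  p / sum(p)
}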

5.2 Comparison of Competing Approaches

We set up a Monte Carlo study to compare the performance of the proposed PPMx model against four of the described alternative approaches: (i) clusterwise regression using the flexmix R package (Leisch 2004); (ii) the HME model (5.1); (iii) the approach of Dahl (2008); and (iv) the model of Park and Dunson (2010).

The study is set up to compare the competing methods as nonlinear regression methods. We thus use mean squared error (MSE) as the criterion for the comparison. MSE is the average squared deviation between the fitted regression mean function and the known simulation truth, averaging with respect to repeated sampling. We report MSE for a set of covariate combinations, including some covariate combinations that are not observed in the data, that is, that require extrapolation. For a fair comparison we use a simulation truth that is not characterized as the assumed model under any of the competing approaches. In particular, we do not generate the simulation truth from assumed clusters. Instead we set up a nonlinear regression of a continuous dependent variable on three independent variables, including two binary variables and one continuous variable.

Let x1, x2, x3 denote the three covariates, with x1 ∈ ℝ and x2, x3 ∈ {0, 1}. The simulation truth is chosen to mimic the example in Section 6. Think of x2 as an indicator for high dose (HI), x3 as an indicator for ER positive tumors (ER+), and x1 as tumor size (TS), with −1, 0, and 1 representing small, medium, and large size. There is a strong interaction of x2 and x3, with longest survival for (x2, x3) = (1, 1), and a sizeable main effect for x1.

Figure 1 shows the simulation truth. The figure plots the distribution of the univariate response yi arranged by x1 ∈ {−1, 0, 1} (the three distributions shown in each panel) and the four combinations of (x2, x3) (one panel for each combination). Note the interaction of x1 and x2. We simulated M = 100 datasets of size n = 200 using the described sampling model. The simulated data included all covariate combinations shown in Figure 1 except for (x1, x2, x3) = (1, 0, 0) and (1, 1, 0). Simulation also included intermediate values of the continuous variable x1. The M = 100 datasets were generated by resampling one big set of N = 1000 simulated data points. The big dataset and the expectations E(Y|x1, x2, x3) under the simulation truth are available in the supplemental materials or at http://odin.mdacc.tmc.edu/~pmueller/prog.html#PPMx.

Figure 1. Simulation truth pdf for the outcome variable for different combinations of the three covariates.

We compare the proposed PPMx model and the alternative approaches (i) through (iv). As criterion we use the root mean squared error (RMSE) in estimating E(y|x1, x2, x3) for the 12 combinations of the covariates (x1, x2, x3) shown in Figure 1. We evaluated MSE by averaging squared errors over the M = 100 repeat simulations. Table 1 summarizes the results. Rows (x1, x2, x3) = (1, 0, 0) and (1, 1, 0) report covariate combinations that were not included in the dataset. Results there are thus based on extrapolation. The PPMx model performs well for these extrapolation problems. The last three rows correspond to the right lower panel in Figure 1. Performance for these scenarios reflects the adaptation of the model to extreme interaction effects. The HME performs surprisingly well. The PPMx reports reasonable MSE. The clusterwise regression (flexmix) is comparable to PPMx. Overall there is no clear winner in the comparison, although PPMx attains the lowest average RMSE across all covariate combinations. In any case we caution against over-interpreting the results for one example. The main conclusion is that all five approaches are reasonably comparable, with a caveat about extrapolation under the approach of Park and Dunson (2010). The choice of approach should depend on the inference goal. Clusterwise regression and HME are perfectly appropriate if the main objective is flexible regression. If available information about similarities is naturally expressed by pairwise distances dij, and the main focus is predictive inference, then the approach by Dahl (2008) is attractive. A limitation of the latter three approaches is the lack of specific inference on the random partition. Inference on cluster membership indicators can be reported. But under the HME cluster membership is strictly limited to the functional form of w(·; αj). Under the approach of Dahl (2008), although a model is implied by the set of conditional distributions (5.2), there is in general no clearly identified prior probability model for ρn. The approach of Park and Dunson (2010) is reasonable if the DP prior is chosen for the underlying PPM, the covariates are continuous, and the covariates can be considered random variables. The proposed PPMx is attractive when the set of covariates includes a mix of different data formats, and when there is specific prior information about how important different covariates should be for the judgment of similarity.

Table 1.

Root MSE for estimating E(y|x1, x2, x3) for 12 combinations of (x1, x2, x3) under the proposed model and the four competing models. Models (i) through (iv) and the proposed model are indicated as flexmix, HME, DAHL, P&D, and PPMx, respectively. Covariate combinations that require extrapolation beyond the range of the data are indicated by *

 x1  x2  x3     flexmix   PPMx   P&D    DAHL   HME
 −1   0   0       13.9     7.9    2.7    8.6   13.0
  0   0   0        7.7     3.9   15.0    4.6    8.3
  1   0   0 *      5.2     2.8   21.5    8.7    8.7
 −1   1   0        7.4     5.4    2.3    6.2    5.7
  0   1   0        4.4     4.6   15.5    7.9    5.5
  1   1   0 *      3.3     4.0   21.0   12.6    5.3
 −1   0   1        8.1     6.1    1.8    9.5    7.5
  0   0   1        6.8     4.2    7.0    2.4    3.5
  1   0   1        3.7     4.5   17.4    8.2    4.9
 −1   1   1        8.0     9.5   12.1   12.2    6.5
  0   1   1       11.5     8.3    8.7   10.2    5.4
  1   1   1        3.1     6.2    2.4    4.8    3.8
 Average           6.9     5.6   10.6    8.0    6.5

6. EXAMPLE: A SURVIVAL MODEL WITH PATIENT BASELINE COVARIATES

We consider data from a high-dose chemotherapy treatment of women with breast cancer. The data for this particular study have been discussed by Rosner (2005) and come from Cancer and Leukemia Group B (CALGB) Study 9082. They consist of measurements taken from 763 randomized patients, available as of October 1998 (enrollment had occurred between January 1991 and May 1998). The response of interest is the survival time, defined as the time until death from any cause, relapse, or diagnosis with a second malignancy. There are two treatments, one involving a low dose of the anti-cancer drugs, and the other consisting of aggressive high-dose chemotherapy. The high-dose patients were given considerable regenerative blood-cell supportive care (including marrow transplantation) to help decrease the impact of opportunistic infections arising from the severely affected immune system. The number of observed failures was 361, with 176 under high-dose and 185 under low-dose chemotherapy.

The dataset also includes information on the following covariates for each patient: a treatment indicator defined as 1 if a high dose was administered and 0 otherwise (HI); age in years at baseline (AGE); the number of positive lymph nodes found at diagnosis (POS) (the more positive nodes, the worse the prognosis, i.e., the more likely it is that the cancer has spread); tumor size in millimeters (TS), a one-dimensional measurement; an indicator of whether the tumor is positive for the estrogen or progesterone receptor (ER+) (patients who were positive also received the drug tamoxifen and are expected to have a better prognosis); and an indicator of the woman’s menopausal status, defined as 1 if she is either perimenopausal or postmenopausal and 0 otherwise (MENO). Two of these six covariates are continuous (AGE, TS), three are binary (ER+, MENO, HI), and one is a count (POS).

6.1 PPMx With One Covariate

First we carried out inference in a model using the indicator for high dose as the only covariate, that is, xi = HI. We implemented model (2.4) with a similarity function for the binary covariate based on the beta–binomial version of (4.1). We used α = (0.1, 0.1) to favor clusters with homogeneous dose assignment. Conditional on an assumed partition ρn we use a normal sampling model p(yi|θj) = N(yi; μj, Vj), with a conjugate normal–inverse gamma prior p(θj|η) = N(μj; my, By) Ga(Vj⁻¹; s/2, sSy²/2) and hyperpriors my ~ N(am, Am) and Sy ~ Ga(q, q/R). Here η = (am, Am, By, q, R) are fixed hyperparameters. We use am = m̂, the sample average of the yi, Am = 100, By = 100², s = q = 4, and R = 100. The cohesion functions were chosen as before, with c(Sj) = M(nj − 1)!, matching the PPM implied by the DP prior. We include a Ga(1, 1) hyperprior for the total mass parameter M.

Figure 2 shows inference summaries. The posterior distribution p(kn|data) for the number of clusters is shown in Figure 3. The three largest clusters contain 28%, 23%, and 14% of the experimental units.

Figure 2. Survival example: Estimated survival function (left panel) and hazard (center panel), arranged by x ∈ {HI, LO}. The gray shades show pointwise one posterior predictive standard deviation uncertainty. The right panel shows the data for comparison (Kaplan–Meier curve by dose).

Figure 3. Survival example: Posterior for the number of clusters kn (left panel), and proportion of patients with high dose (% HI) and average progression-free survival (PFS) by cluster. The size (area) of the bullets is proportional to the average cluster size.

6.2 PPMx With Multiple Covariates

Next we extended the covariate vector to include all six covariates. Denote by xi = (xi1, … , xi6) the six-dimensional covariate vector. We implement random clustering with regression on covariates as in model (2.4). The similarity function is defined as

g(x^\star_j) = \prod_{\ell=1}^{6} g_\ell(x^\star_{j\ell}).   (6.1)

Here x*jℓ = (xiℓ, i ∈ Sj) denotes the ℓth covariate arranged by cluster. For each covariate we follow the suggestions in Section 4 to define a factor gℓ(·) of the similarity function, using hyperparameters specified as follows. The similarity function for the three binary covariates is defined as in (4.1), with α = (0.1, 0.1) for HI, and α = (0.5, 0.5) for ER+ and MENO. The two continuous covariates AGE and TS were standardized to sample mean 0 and unit standard deviation. The similarity functions were specified as described in Section 4, with fixed s = 0.25, m = 0, and B = 1. Finally, for the count covariate POS we used the similarity function (4.2) with (a, b) = (1.5, 0.1). The sampling model is unchanged from before.

We assume that censoring times are independent of the event times and of all parameters in the model. Posterior predictive survival curves for various covariate combinations are shown in Figure 4. In the figure, “baseline” refers to HI = 0, tumor size 38 mm (the empirical median), ER = 0, MENO = 0, average age (44 years), and POS = 15 (empirical mean). Other survival curves are labeled to indicate how the covariates change from baseline, with TS− indicating tumor size 26 mm (the empirical first quartile), TS+ indicating tumor size 50 mm (third quartile), HI referring to high-dose chemotherapy, and ER+ indicating positive estrogen or progesterone receptor status. The inference suggests that treated patients with tumor size below the empirical median who were positive for the estrogen or progesterone receptor have almost uniformly higher predicted survival curves than any other combination of covariates.

Figure 4. Survival example: Posterior predictive survival function S(t|x) ≡ p(yn+1 > t|xn+1 = x, data), arranged by x. The “baseline” case refers to all continuous and count covariates equal to their empirical means and all binary covariates equal to 0. The legend indicates TS− and TS+ for tumor size equal to 26 mm and 50 mm (first and third empirical quartiles), HI for HI = 1, and ER+ for ER = 1. The legend is sorted by the survival probability at 5 years, indicated by a thin vertical line.

Figure 5 summarizes features of the posterior clustering. Interestingly, cluster membership is typically highly correlated with postmenopausal status, as seen in the right panel.

Figure 5. Survival example: Posterior distribution for the number of clusters (left panel), mean PFS and % high-dose patients per cluster (center panel), and mean PFS and % postmenopausal patients per cluster (right panel).

6.3 Variable Selection

The PPMx model replaces variable selection in a regression model by clustering of experimental units (patients). Consider, for example, inference in Figure 5. Clusters are heterogeneous with respect to HI, but very homogeneous with respect to MENO. In other words, cluster membership and thus predictive inference for a future observation varies more by MENO than by HI.

Similar summaries for all covariates lead us to consider formal variable selection. We explore the use of reduced models by comparing alternative models that include subsets of the p = 6 covariates. Let γ = (γ1, … , γp) denote a vector of binary indicators and modify (6.1) to include only the selected covariates, g(x*j) = ∏_{ℓ: γℓ=1} gℓ(x*jℓ). We complete the probability model with a prior on γ. Without loss of generality we assume p(γℓ = 1) = π, independently, for ℓ = 1, … , p.

Posterior simulation for γ involves a minor challenge related to the normalization constant gn(xn) in (2.1). We write gn(xn|γ) to highlight that the definition of g(·) is indexed by γ. We implement posterior simulation with a Metropolis–Hastings transition probability to change the indicators γℓ, keeping the currently imputed partition ρ and all other parameters unchanged. Let γ¹ and γ⁰ denote two vectors of variable selection indicators. Without loss of generality we assume that the two vectors differ only in the first indicator, with γ¹₁ = 1, γ⁰₁ = 0, and γ¹ℓ = γ⁰ℓ for ℓ = 2, … , p. The acceptance probability includes the ratio

\frac{p(\gamma^1 \mid \rho, x^n)}{p(\gamma^0 \mid \rho, x^n)} = \prod_{j=1}^{k_n} g_1(x^\star_{j1}) \cdot \frac{g_n(x^n \mid \gamma^0)}{g_n(x^n \mid \gamma^1)}.

We use (independent) Monte Carlo integration to compute gn(xn|γ) for all 2^p possible γ vectors. We do this once, up front, before starting the posterior MCMC simulation. Let g(xn|ρ, γ) = ∏j g(x*j|γ) denote the product of similarity functions under partition ρ and variable selection γ. Recall that q(ρ) ∝ ∏j c(Sj) denotes the random partition model under the PPM without the additional similarity function. We use

g_n(x^n \mid \gamma) = \sum_{\rho} g(x^n \mid \rho, \gamma)\, q(\rho) \approx \frac{1}{M} \sum_{m=1}^{M} g(x^n \mid \rho^m, \gamma)

for ρm ~ q(ρ), iid. We use Rao–Blackwellization to compute the marginal posterior probabilities p(γ|xn, yn) as ergodic averages of p(γ|ρ, xn), averaging over imputed partitions ρ.
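
A minimal R sketch of this up-front Monte Carlo step (our code, reusing sample_dp_partition() from the sketch in Section 1; the list log_g_list, holding one log similarity function per covariate column of x, is our own device):

```r
# Minimal sketch: Monte Carlo estimate of log g_n(x^n | gamma), averaging
# the product of selected similarity factors over partitions rho^m ~ q(rho).
log_gn <- function(x, gamma, log_g_list, M = 1, nsim = 1000) {
  lg <- replicate(nsim, {
    e <- sample_dp_partition(nrow(x), M)
    sum(sapply(unique(e), function(j) {
      idx <- which(e == j)
      sum(sapply(which(gamma == 1), function(l) log_g_list[[l]](x[idx, l])))
    }))
  })
  max(lg) + log(mean(exp(lg - max(lg))))   # log of the Monte Carlo average
}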

Using the described posterior simulation and the data from the previous example, we find the following posterior probabilities for variable selection. The posterior mode is achieved by the full model with all covariates, p(γ = (1, … , 1)|xn, yn) = 0.62. The next most likely model, with p(γ = (1, 1, 0, 1, 1, 1)|xn, yn) = 0.38, is the model without the indicator HI for high dose. It is clinically plausible that the effect of high dose would fade in comparison with initial tumor burden and other important baseline characteristics. One would still continue to include HI as the treatment covariate, that is, the only covariate that can be manipulated by choice. The cumulative posterior probability of all other 62 models is less than 0.001.

In summary, model-based inference confirms the inference reported by summaries such as Figure 5. In general we prefer the use of summaries of cluster homogeneity as in Figure 5, as it readily generalizes to larger numbers of covariates, including possibly interesting interactions.

7. CONCLUSION

We have proposed a novel model for random partitions with a regression on covariates. The model builds on the popular PPM random partition models by introducing an additional factor to modify the cohesion function. We refer to the additional factor as similarity function. It increases the prior probability that experimental units with similar covariates are co-clustered. We provide default choices of the similarity function for popular data formats.

The main features of the model are the possibility to include additional prior information related to the covariates, the principled nature of the model construction, and a computationally efficient implementation.

Among the limitations of the proposed method is an implicit penalty for cluster size that is implied by the similarity function. Consider all covariates equal, xi ≡ x. The value of the similarity functions proposed in Section 4 decreases with cluster size. This limitation could be mitigated by allowing an additional factor c*(|Sj|) in (2.1) to compensate for the size penalty implicit in the similarity function.

Supplementary Material

code

R-package: R-package PPMx containing code to carry out inference for the model described in the article. The function ppmx implements the proposed covariate-dependent random partition model for an arbitrary combination of continuous, categorical, binary, and count covariates, using a mixture of normal sampling model for yi. The package is also available for download from http://www.math.utexas.edu/users/pmueller/prog.html.

data

Simulation data: The data file contains the 1000 simulated data points used for the Monte Carlo study that is described in Section 5.2.

ACKNOWLEDGMENTS

The research of P. Müller and G. L. Rosner was supported in part by NIH grant 1R01CA75981. The research of F. Quintana was supported by FONDECYT grant 1060729 and the Laboratorio de Análisis Estocástico PBCT-ACT13.

Contributor Information

Peter Müller, Department of Biostatistics, M. D. Anderson Cancer Center, Houston, TX 77230-1402 (pmueller@math.utexas.edu).

Fernando Quintana, Departamento de Estadística, Pontificia Universidad Católica de Chile, Santiago, Chile.

Gary L. Rosner, Division of Oncology Biostatistics, The Sidney Kimmel Comprehensive Cancer Center at Johns Hopkins, Baltimore, MD 21205-2013.

REFERENCES

  1. Banfield JD, Raftery AE. Model-Based Gaussian and Non-Gaussian Clustering. Biometrics. 1993;49:803–821.
  2. Barry D, Hartigan JA. A Bayesian Analysis for Change Point Problems. Journal of the American Statistical Association. 1993;88:309–319.
  3. Bernardo J-M, Smith AFM. Bayesian Theory. Wiley Series in Probability and Mathematical Statistics. Chichester: Wiley; 1994.
  4. Bishop CM, Svensén M. Bayesian Hierarchical Mixtures of Experts. In: Kjaerulff U, Meek C, editors. Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence; 2003. pp. 57–64.
  5. Crowley EM. Product Partition Models for Normal Means. Journal of the American Statistical Association. 1997;92:192–198.
  6. Dahl DB. Distance-Based Probability Distribution for Set Partitions With Applications to Bayesian Nonparametrics. In: Proceedings of the Section on Bayesian Statistical Science. Alexandria, VA: American Statistical Association; 2008.
  7. Dahl DB. Modal Clustering in a Class of Product Partition Models. Bayesian Analysis. 2009;4:243–264.
  8. Dasgupta A, Raftery AE. Detecting Features in Spatial Point Processes With Clutter via Model-Based Clustering. Journal of the American Statistical Association. 1998;93:294–302.
  9. Ferguson TS. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics. 1973;1:209–230.
  10. Fraley C, Raftery AE. Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association. 2002;97:611–631.
  11. Green PJ, Richardson S. Modelling Heterogeneity With and Without the Dirichlet Process. Technical report, University of Bristol, Dept. of Mathematics; 1999.
  12. Hartigan JA. Partition Models. Communications in Statistics, Part A—Theory and Methods. 1990;19:2745–2756.
  13. Ishwaran H, James LF. Generalized Weighted Chinese Restaurant Processes for Species Sampling Mixture Models. Statistica Sinica. 2003;13:1211–1235.
  14. Jasra A, Holmes CC, Stephens DA. Markov Chain Monte Carlo Methods and the Label Switching Problem in Bayesian Mixture Modeling. Statistical Science. 2005;20:50–67.
  15. Johnson VE, Albert JH. Ordinal Data Modeling. Statistics for Social Science and Public Policy. New York: Springer-Verlag; 1999.
  16. Jordan M, Jacobs R. Hierarchical Mixtures-of-Experts and the EM Algorithm. Neural Computation. 1994;6:181–214.
  17. Lau JW, Green PJ. Bayesian Model-Based Clustering Procedures. Journal of Computational and Graphical Statistics. 2007;16:526–558.
  18. Leisch F. FlexMix: A General Framework for Finite Mixture Models and Latent Class Regression in R. Journal of Statistical Software. 2004;11:1–18.
  19. Marin J-M, Robert CP. Bayesian Core: A Practical Approach to Computational Bayesian Statistics. New York: Springer-Verlag; 2007.
  20. Park J-H, Dunson D. Bayesian Generalized Product Partition Models. Statistica Sinica. 2010;20:1203–1226.
  21. Pitman J. Some Developments of the Blackwell–MacQueen Urn Scheme. In: Ferguson TS, Shapley LS, MacQueen JB, editors. Statistics, Probability and Game Theory: Papers in Honor of David Blackwell. IMS Lecture Notes—Monograph Series. Hayward, CA: Institute of Mathematical Statistics; 1996. pp. 245–268.
  22. Quintana FA. A Predictive View of Bayesian Clustering. Journal of Statistical Planning and Inference. 2006;136:2407–2429.
  23. Quintana FA, Iglesias PL. Bayesian Clustering and Product Partition Models. Journal of the Royal Statistical Society, Ser. B. 2003;65:557–574.
  24. Richardson S, Green PJ. On Bayesian Analysis of Mixtures With an Unknown Number of Components (with discussion). Journal of the Royal Statistical Society, Ser. B. 1997;59:731–792.
  25. Rosner GL. Bayesian Monitoring of Clinical Trials With Failure-Time Endpoints. Biometrics. 2005;61:239–245. doi: 10.1111/j.0006-341X.2005.031037.x.
  26. Shahbaba B, Neal RM. Nonlinear Models Using Dirichlet Process Mixtures. Journal of Machine Learning Research. 2009;10:1829–1850.
  27. Späth H. Clusterwise Linear Regression. Computing. 1979;22:93–119.
