Nonparametric Bayes Stochastically Ordered Latent Class Models

Hongxia Yang; Sean O’Brien; David B Dunson

doi:10.1198/jasa.2011.ap10058

Author manuscript; available in PMC: 2012 Apr 11.

Published in final edited form as: J Am Stat Assoc. 2011 Sep 1;106(495):807–817. doi: 10.1198/jasa.2011.ap10058

PMCID: PMC3324040 NIHMSID: NIHMS363980 PMID: 22505787

Abstract

Latent class models (LCMs) are used increasingly for addressing a broad variety of problems, including sparse modeling of multivariate and longitudinal data, model-based clustering, and flexible inferences on predictor effects. Typical frequentist LCMs require estimation of a single finite number of classes, which does not increase with the sample size, and have a well-known sensitivity to parametric assumptions on the distributions within a class. Bayesian nonparametric methods have been developed to allow an infinite number of classes in the general population, with the number represented in a sample increasing with sample size. In this article, we propose a new nonparametric Bayes model that allows predictors to flexibly impact the allocation to latent classes, while limiting sensitivity to parametric assumptions by allowing class-specific distributions to be unknown subject to a stochastic ordering constraint. An efficient MCMC algorithm is developed for posterior computation. The methods are validated using simulation studies and applied to the problem of ranking medical procedures in terms of the distribution of patient morbidity.

Keywords: Factor analysis, Latent variables, Mixture model, Model-based clustering, Nested Dirichlet process, Order restriction, Random probability measure, Stick breaking

1. INTRODUCTION

Latent class models (LCMs) are routinely used for analysis and interpretation of multivariate data. LCMs comprise an extremely rich class of discrete mixture models, which allow units to be allocated to latent subpopulations or clusters, with the allocation probabilities potentially dependent on predictors. Suppose one collects response data y_i = (y_i1, …, y_ip)′ ∈ ℜ^p and predictors x_i = (x_i1, …, x_iq)′ for subjects i = 1, …, n. Then, a simple Gaussian LCM could be specified as

$$f (y_{i} \mid x_{i}) = \sum_{k = 1}^{K} π_{k} (x_{i}) \, N_{p} (y_{i}; μ_{k}, Σ_{k}),$$ (1)

where π_k(x_i) is the probability of allocation to latent class k given predictors x_i, the response data for subjects in class k are normally distributed with mean μ_k and covariance Σ_k, and K is the number of latent classes. In routine applications of such models, π_k(x_i) is typically specified as a logistic regression model and the EM algorithm is used for maximum likelihood estimation.

There are a number of well-known issues that arise in considering model (1) and related LCMs. First, there is the so-called label ambiguity problem, which results because there is nothing distinguishing class k from k′ a priori. The estimates produced by the EM algorithm correspond to a local mode, with an identical likelihood obtained for any permutation of the labels {1, …, K} on the K clusters. Label ambiguity is even more of a problem in Bayesian analyses of LCMs relying on Markov chain Monte Carlo (MCMC) for posterior computation, as label switching makes it difficult to obtain meaningful posterior summaries of the cluster-specific parameters from the MCMC output, though postprocessing can potentially be used (Stephens 2000; Jasra, Holmes, and Stephens 2005). Although constraints on the component-specific parameters, such as ordered means, are widely used to avoid label ambiguity, it is typically not clear what constraints are appropriate in multivariate models such as (1) and partial ambiguity may remain even with constraints. A second well-known issue is uncertainty in the choice of K. Although standard analyses rely on selection criteria, such as the BIC, the theoretical justification for use of the BIC in mixture models such as LCMs is unclear. In addition, conditioning on a selected value in a two-stage procedure clearly ignores uncertainty in the selection process. A third issue with LCMs is sensitivity to parametric assumptions, with a very different number of clusters and allocation to clusters potentially obtained if one replaces the normality assumption in (1) with a multivariate t distribution or other choice.

Our motivation is drawn from an application to ranking of medical procedures in terms of the distribution of patient morbidity following the procedure. In particular, we would like to obtain clusters (latent classes) of procedures having a similar morbidity distribution, while also estimating an ordering in severity of the procedures. Ideally, we would like to avoid some of the problems arising in typical LCMs through stochastic ordering restrictions that are natural in many applications, with nonparametric Bayes methods used to allow infinitely many classes and avoid parametric assumptions on the class-specific distributions. We will focus on the setting in which subjects are nested within prespecified groups, with i = 1, …, n indexing the groups and j = 1, …, n_i the subjects in the ith group. In the motivating application, groups correspond to different medical procedures.

For illustration, initially consider the case in which y_ij is a single outcome for subject j in group i, there are no predictors, and we let y_ij ~ F_i, with F_i the distribution specific to group i. Then, taking a nonparametric Bayes approach, we require a prior for the collection of distributions $\{F_{i}\}_{i = 1}^{n}$. Two possibilities that have been proposed in the literature are hierarchical Dirichlet process (HDP) mixtures (Teh et al. 2006) and nested Dirichlet process (nDP) mixtures (Rodriguez, Dunson, and Gelfand 2008). The HDP specification automatically allocates patients to clusters, with dependence incorporated in the cluster weights across the groups. The nDP is more relevant in clustering groups, with each cluster having a different distribution of subject-level outcomes. Specifically, the nDP mixture model would let F_i(y) = F_i′(y) with prior probability 1/(1 + α), with α a precision parameter. The densities specific to each cluster are then modeled using separate DP mixture models.

This approach partly addresses our interests in allowing clustering of procedures based on the distribution of patient outcomes, while allowing the number of clusters (latent classes) to be unknown. However, there is no allowance for predictors that provide information about the cluster allocation and there is no natural way to obtain a ranking of the procedures. Potentially, one may rank the procedures based on the mean of F_i, but it is not clear that the mean is the best summary to rank on, as the proportion of subjects having extreme or life-threatening adverse events may be more clinically relevant. With this motivation, we propose a nonparametric Bayes stochastically ordered LCM (SO-LCM) that is inspired by the nDP but has a fundamentally different structure.

Section 2 proposes the basic structure of the SO-LCM, with considerations of properties and extensions to more complex hierarchical models motivated in particular by the ranking of medical procedures application. Section 3 outlines an MCMC algorithm for posterior computation. Section 4 contains a simulation study assessing operating characteristics under a default prior. Section 5 applies the method to the medical procedures data, showing advantages relative to parametric methods, and Section 6 contains a discussion.

2. STOCHASTICALLY ORDERED LATENT CLASS PRIORS

2.1 Basic Formulation and Properties

Consider a collection of unknown distributions P = {P₁, …, P_n}, with P ~ 𝒫, where 𝒫 is a prior. In particular, the prior 𝒫 is induced by letting,

$$P_{i} \sim \sum_{k = 1}^{\infty} π_{k} (x_{i}) \, δ_{P_{k}^{*}}, \qquad P_{k}^{*} = \sum_{l = 1}^{\infty} v_{l} \, δ_{θ_{kl}},$$ (2)

where $π_{k} (x_{i}) = Pr (P_{i} = P_{k}^{*} | x_{i})$ is the conditional probability of allocating distribution i to cluster k given predictors x_i = (x_i1, …, x_iq)′, and each of the cluster-specific distributions is assumed to be discrete. In particular, the distribution $P_{k}^{*}$ specific to cluster k has probability weights {v_l} on atoms {θ_kl}. This discreteness assumption will be relaxed later by using P_i as a mixture distribution within a continuous kernel. Note that instead of setting P_i equal to a convex combination of the $P_{k}^{*}$ ’s, related to the approaches of Dunson, Pillai, and Park (2007) and Müller, Quintana, and Rosner (2004), P_i is set exactly equal to $P_{k}^{*}$ with probability π_k(x_i).

There are two main distinct features of prior (2) relative to the nested Dirichlet process. First, we allow covariates to impact the allocation to clusters. In the motivating application to ranking of medical procedures, this is an important modification, as we have preliminary rankings of the different procedures by physicians. These rankings can serve as a predictor informing the allocation to clusters. Hence, instead of simply relying on the preliminary physician rankings or the outcomes data in isolation, we allow for a combination or fusion of these data in ranking the procedures. Second, as we are interested in ranking the procedures, we impose a stochastic ordering restriction on the cluster-specific distributions with $P_{k}^{*} ⪯ P_{k'}^{*}$ for all k < k′, where $P_{k}^{*} ⪯ P_{k'}^{*}$ denotes that $P_{k}^{*}$ is stochastically no larger than $P_{k'}^{*}$ so that $P_{k}^{*} (a, \infty) \leq P_{k'}^{*} (a, \infty)$ for all a. This restriction implies that clusters with a higher index correspond to stochastically higher distributions.

Dunson and Peddada (2008) proposed a restricted dependent Dirichlet process (rDDP) prior for stochastically ordered distributions. Here, we apply the rDDP prior to the cluster-specific distributions $P^{*} = \{P_{k}^{*}\}_{k = 1}^{\infty}$. We could have instead used an alternative stochastically ordered prior, such as the approaches proposed by Karabatsos and Walker (2007). Jara and Hanson (2011) proposed a class of mixtures of dependent tail-free processes, with particular applications to the mixture of Pólya trees model, allowing each beta random variable in the tree to be covariate-dependent via a regression model. Potentially one can combine this approach with that of Karabatsos and Walker (2007) to obtain a new class of nonparametric Bayes stochastic ordering models. We used the rDDP mixture prior instead to avoid the partitioning effect of the Pólya tree prior. Such an effect can be removed using mixtures of Pólya trees, though the computation can be more intensive for such models and the resulting density estimates still tend to look quite spiky. In our experience, the estimates produced by DP mixtures of Gaussian kernels tend to match our prior beliefs for the latent variable density more closely.

The stochastic ordering prior from the rDDP is accomplished by first letting $v_{l} = ν_{l} \prod_{s < l} (1 - ν_{s})$ with ν_l ~ beta(1, α₂) independently for l = 1, …, ∞. Then, we let $θ_{l} = \{θ_{kl}\}_{k = 1}^{\infty} \sim P_{0}$ independently for l = 1, …, ∞, with P₀ chosen so that P₀(θ_1l ≤ θ_2l ≤ ⋯) = 1. The cluster k distribution, $P_{k}^{*}$, is marginally distributed according to a Dirichlet process prior with precision α₂ and base distribution P_0k, with P_0k the kth marginal distribution of P₀. This implies that θ_kl ~ P_0k marginally, where θ_kl is the kth element of the multivariate vector θ_l. In addition, $\Pr (P_{k}^{*} ⪯ P_{k'}^{*}) = 1$ for all k < k′ a priori (and hence a posteriori). Dependence in the elements of P* is incorporated through the use of fixed weights $\{v_{l}\}_{l = 1}^{\infty}$ for all k and dependent atoms. This dependence structure allows flexible borrowing of information across the cluster-specific distributions.

As a specific choice of P₀, let $θ_{1 l} = γ_{1 l}^{*} \sim N (m_{0}, s_{0}^{2})$ and $γ_{kl}^{*} = θ_{kl} - θ_{k - 1, l}$, for k = 2, …, ∞, with

$$γ_{kl}^{*} \sim w_{0} δ_{0} + (1 - w_{0}) N^{+} (0, κ^{- 1}), \qquad k = 2, \dots, \infty,$$ (3)

where $w_{0} = \Pr (γ_{kl}^{*} = 0)$ and N⁺ denotes the normal distribution truncated to have positive support. By including positive mass at zero, the prior allows a subset of the atoms in $P_{k}^{*}$ and $P_{k'}^{*}$ to be identical. This is appealing in allowing commonalities between the distributions specific to different latent classes. In addition, including a positive probability of zero values allows collapsing on an effectively lower-dimensional model through zeroing out the coefficients. This allows us to start with a very richly parameterized model and adaptively drop out parameters that are not needed. To allow the data to inform about the appropriate value for the point mass probability w₀, we choose a hyperprior w₀ ~ beta(a_w₀, b_w₀), with a_w₀ = b_w₀ = 1 used routinely as a default.
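The construction of the ordered atoms and shared stick-breaking weights can be sketched in a few lines of Python. This is only an illustrative draw from a truncated version of the prior, not the authors' implementation: the truncation at finite K and L, the convention of setting the last stick to 1, and all function names are assumptions made for the example.

```python
import random

def stick_breaking_weights(alpha2, L, rng):
    """Weights v_1..v_L with v_l = nu_l * prod_{s<l} (1 - nu_s)."""
    weights, remaining = [], 1.0
    for l in range(L):
        # Assumption: the last stick is set to 1 so the truncated weights sum to one.
        nu = 1.0 if l == L - 1 else rng.betavariate(1.0, alpha2)
        weights.append(nu * remaining)
        remaining *= 1.0 - nu
    return weights

def ordered_atoms(K, m0, s0, w0, kappa, rng):
    """One atom vector (theta_1l, ..., theta_Kl) with nondecreasing entries."""
    theta = [rng.gauss(m0, s0)]  # theta_1l = gamma*_1l ~ N(m0, s0^2)
    for _ in range(1, K):
        if rng.random() < w0:
            increment = 0.0  # point mass at zero: atom shared with the previous class
        else:
            # |N(0, 1/kappa)| follows the positive-truncated normal N^+(0, 1/kappa)
            increment = abs(rng.gauss(0.0, kappa ** -0.5))
        theta.append(theta[-1] + increment)
    return theta

def draw_prior(K, L, alpha2, m0, s0, w0, kappa, seed=0):
    """Return weights v and atoms, where atoms[l][k] is the l-th atom of class k."""
    rng = random.Random(seed)
    v = stick_breaking_weights(alpha2, L, rng)
    atoms = [ordered_atoms(K, m0, s0, w0, kappa, rng) for _ in range(L)]
    return v, atoms

def tail_prob(v, atoms, k, a):
    """Mass that the discrete class-k distribution puts on (a, infinity)."""
    return sum(w for w, theta in zip(v, atoms) if theta[k] > a)
```

Because every atom vector is nondecreasing in k and the weights are shared across classes, `tail_prob` is nondecreasing in k for any threshold a, which is the stochastic ordering restriction described above.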

To complete a specification of the SO-LCM, we require a prior for the predictor-dependent probabilities. For simplicity, we use the logistic regression-type model

$$π_{k} (x) = \frac{ψ_{k} \exp \{x' β_{k}\}}{\sum_{l = 1}^{K} ψ_{l} \exp \{x' β_{l}\}}, \qquad ψ_{k} \sim \text{Gamma} (α_{1} / K, 1), \qquad β_{k} \sim H,$$ (4)

where ψ_k ≥ 0 is a baseline weight for mixture component k, β_k are regression parameters controlling the impact of the predictors on the probabilities of allocation to each cluster (latent class), and H is a prior on the regression coefficients. For example, H can be chosen to be Gaussian or, to allow shrinkage towards zero for unimportant coefficients, we can choose a heavy-tailed Cauchy prior or a variable selection mixture prior with a mass at zero.

Unlike in typical generalized logistic regression models, we avoid placing identifiability constraints on the parameters, such as setting the coefficients equal to zero in a reference class. Unlike in frequentist models fitted by maximum likelihood, the choice of the reference class can impact the results, and it is important to maintain exchangeability of the cluster indices in model (4). Otherwise, there may be some bias introduced in which we favor stochastically smaller or larger distributions a priori. In Bayesian modeling, it is not necessary to satisfy frequentist identifiability criteria, and indeed it is often quite useful to consider over-parameterized models as long as inferences are based on identifiable quantities.
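The over-parameterization of model (4) can be made concrete with a small numerical check. The sketch below is illustrative only (the function name and the test values are made up for the example): it evaluates the allocation probabilities in log space and shows that rescaling every ψ_k by a common constant, or shifting every β_k by a common vector, leaves π_k(x) unchanged, so only the probabilities themselves are identifiable.

```python
import math

def allocation_probs(x, psi, beta):
    """pi_k(x) from model (4); psi[k] > 0 and beta[k] is a list as long as x."""
    log_terms = [math.log(p) + sum(xi * bi for xi, bi in zip(x, b))
                 for p, b in zip(psi, beta)]
    shift = max(log_terms)  # subtract the maximum for numerical stability
    unnorm = [math.exp(t - shift) for t in log_terms]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical values chosen only to illustrate the invariance.
x = [0.3, -1.2]
psi = [0.5, 1.0, 2.0]
beta = [[0.1, -0.4], [1.0, 0.2], [-0.7, 0.9]]

base = allocation_probs(x, psi, beta)
rescaled = allocation_probs(
    x,
    [3.0 * p for p in psi],                      # common rescaling of psi
    [[b[0] + 5.0, b[1] - 2.0] for b in beta],    # common shift of every beta_k
)
```

Here `base` and `rescaled` agree up to floating-point error, which is why posterior summaries are best reported for π_k(x) rather than for ψ_k or β_k individually.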

To further motivate models (2)–(4), it is useful to consider properties in the baseline case in which x = 0, so that we obtain $π_{k} = ψ_{k} / \sum_{l = 1}^{K} ψ_{l}$ . In this case, the particular gamma prior that was chosen for the cluster-specific weight parameters leads to (π₁, …, π_K) ~ Dir(α₁/K, …, α₁/K). This is the same distribution on the cluster-specific probabilities that was proposed by Ishwaran and Zarepour (2002) in developing a finite approximation to the Dirichlet process. It is straightforward to show (proof in Appendix) that the prior probability of clustering two groups in this baseline case is

$$\Pr (P_{i} = P_{i'}) = E \left( \sum_{k = 1}^{K} π_{k}^{2} \right) = \frac{1 + α_{1} / K}{1 + α_{1}},$$ (5)

which simplifies to 1/(1 + α₁) in the limit as K → ∞. In addition, the prior probability that group i is stochastically less than group i′ can be derived as

$$\Pr (P_{i} \prec P_{i'}) = \frac{1}{2} \{1 - \Pr (P_{i} = P_{i'})\} = \frac{α_{1}}{2 (1 + α_{1})} \left( 1 - \frac{1}{K} \right),$$ (6)

which reduces to $\frac{α_{1}}{2 (1 + α_{1})}$ in the limit as K → ∞.
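Expressions (5) and (6) are easy to check numerically. The following sketch, with hypothetical function names, evaluates the closed forms and compares (5) with a Monte Carlo estimate obtained by normalizing independent Gamma(α₁/K, 1) draws, which is the baseline case x = 0 described above.

```python
import random

def prior_cocluster_prob(alpha1, K):
    """Closed form (5): prior probability that two groups share a cluster."""
    return (1.0 + alpha1 / K) / (1.0 + alpha1)

def prior_order_prob(alpha1, K):
    """Closed form (6): prior probability that group i is ranked strictly below i'."""
    return 0.5 * (1.0 - prior_cocluster_prob(alpha1, K))

def monte_carlo_cocluster(alpha1, K, draws=20000, seed=0):
    """Estimate E(sum_k pi_k^2) with pi_k = psi_k / sum_l psi_l, psi_k ~ Gamma(alpha1/K, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(draws):
        psi = [rng.gammavariate(alpha1 / K, 1.0) for _ in range(K)]
        s = sum(psi)
        total += sum((p / s) ** 2 for p in psi)
    return total / draws
```

For example, with α₁ = 1 and K = 20 the closed forms give 0.525 for (5) and 0.2375 for (6), already close to the limiting values 1/2 and 1/4.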

Hence, α₁ is a key hyperparameter controlling the prior on clustering and ordering of the groups. For greater flexibility, we recommend letting α₁ ~ Gamma(a₁, b₁). In many applications, it is appealing to favor a slow rate of introduction of new clusters with sample size. As in the DP, clusters are introduced at a rate proportional to α₁ log n when K is sufficiently large. In order to favor few clusters relative to the number of groups n, one can choose the hyperparameters a₁, b₁ so that the prior is concentrated at values close to zero. In the application to ranking of medical procedures in terms of their severity, our physician collaborators have a strong preference for parsimony and expect a model with six (or fewer) clusters to fit the data adequately. This knowledge is used to elicit the a₁, b₁ hyperparameters. In the case in which covariates are included, (5) and (6) can potentially be extended, and it will be the case that prior clustering and ordering probabilities depend on the relative values of the predictors for the two groups. However, it is not straightforward to obtain simple analytic forms.

2.2 Applications to Ranking Medical Procedures

In the motivating application to ranking medical procedures based on the distribution of patient morbidity following each procedure, response data consist of a vector $y_{ij}^{*} = (y_{ij 1}^{*}, \dots, y_{ijp}^{*})'$ of p measures of morbidity on the jth patient having procedure i, for i = 1, …, n and j = 1, …, n_i. The first p₁ elements of $y_{ij}^{*}$ are continuous and the next p₂ elements are binary with p₁ + p₂ = p. Higher values of each of the measurements imply higher morbidity, and we relate the measurements to a latent morbidity score for each patient within each procedure through the following factor model,

$$\begin{aligned}
y_{ijt}^{*} &= h_{t} (y_{ijt}), \qquad h_{t} (y) = y, \quad t = 1, \dots, p_{1}, \qquad h_{t} (y) = 1 (y > 0), \quad t = p_{1} + 1, \dots, p, \\
y_{ij} &= μ + Λ η_{ij} + ε_{ij}, \qquad ε_{ijt} \sim N (0, σ_{it}^{2}), \\
η_{ij} &\sim f_{i}, \qquad f_{i} (η) = \int 𝒦 (η; θ) \, dP_{i} (θ), \\
𝒦 (η; θ) &= \int N (η; θ, φ) \, dQ (φ),
\end{aligned}$$ (7)

where y_ijt is a continuous variable underlying $y_{ijt}^{*}$, with $y_{ijt}^{*} = y_{ijt}$ for continuous responses and $y_{ijt}^{*} = 1 (y_{ijt} > 0)$ for binary responses, μ = (μ₁, …, μ_p)′ is a p × 1 intercept vector, Λ = (λ₁, …, λ_p)′ is a p × 1 vector of factor loadings, η_ij is a latent morbidity score for the jth patient having procedure i and 𝒦(·; θ) is an unknown unimodal kernel that is symmetric about θ. The procedure-specific latent variable density functions f_i are modeled as a flexible location mixture of scale mixtures of Gaussian kernels. By using an unknown kernel, we favor fewer and more biologically interpretable clusters. Letting $σ_{it}^{- 2} = c_{i} d_{t}$ for continuous responses, we obtain an additive log-linear model for the residual precision, with c_i a procedure-specific multiple and d_t a response type specific multiple, while fixing $σ_{it}^{- 2} = 1$ for binary responses. This allows the residual variance to change for the different procedures, while also allowing a shift specific to each measure of morbidity. The constraint on the residual variances for the continuous variables underlying the binary responses is a standard identifiability condition. Because higher values of $y_{ijt}^{*}$ imply higher morbidity, we constrain the factor loadings to be nonnegative so that λ_t ≥ 0 for t = 1, …, p. For the scale mixture component, we let Q ~ DP(α₀Q₀) where Q₀ = Inv-Gamma(c₀, d₀) is the base measure. Q can also be denoted as $Q (\cdot) = \sum_{t = 1}^{\infty} u_{t} δ_{φ_{t}^{*}}$ with $φ_{t}^{*} \sim \text{Inv-Gamma} (c_{0}, d_{0})$.

We avoid using P_i directly as the distribution of the latent factor scores within procedure i, since that would assume that the factor scores follow a discrete distribution. It seems more biologically realistic to allow a continuum of patient morbidity, while allowing patients with similar but not identical morbidity to be clustered. This is accomplished by the proposed model in that patients allocated to the same mixture component will be clustered. As mentioned above, we are more interested in clustering and ranking of the medical procedures instead of the patients. Because 𝒦(·; θ) is monotonically stochastically increasing in θ, we maintain the stochastic ordering restriction in the P_i’s. Note that two procedures i and i′ having P_i = P_i′, which is allowed by the proposed prior, will also have f_i = f_i′ and hence have the same morbidity density. In addition, f_i ≺ f_i′ (the distribution of patient morbidity under procedure i is stochastically less than that under procedure i′) if and only if P_i ≺ P_i′. Hence, the clustering and ranking properties of the prior for {P_i} proposed above extend directly to the continuous latent factor model in (7).

To complete a Bayesian specification of the SO-LCM model in (7), we choose priors as follows. The intercept vector is assigned a normal prior, $μ_{t} \sim N (μ_{0}, σ_{0}^{2})$ for t = 1, …, p, and the factor loadings are assigned robust truncated Cauchy priors by letting λ_t ~ N⁺(0, τ) for t = 1, …, p with τ ~ Inv-Gamma(1/2, 1/2). We use a common variance τ to induce dependent shrinkage across the loadings. The multiplicative terms in the variance model, {c_i} and {d_t}, are assigned gamma priors. Elicitation of the different hyperparameters in these priors is considered later.
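The description of the loading prior as a truncated Cauchy can be verified by integrating out τ. The derivation below is a sketch under two conventions that the text does not state explicitly and that are therefore assumptions: τ is read as the variance of the normal, and Inv-Gamma(a, b) is taken to have density proportional to τ^{−a−1} exp(−b/τ).

```latex
\begin{aligned}
p(\lambda_t)
  &= \int_0^\infty 2\, N(\lambda_t; 0, \tau)\, \mathrm{IG}(\tau; 1/2, 1/2)\, d\tau,
     \qquad \lambda_t \ge 0, \\
  &= 2 \int_0^\infty (2\pi\tau)^{-1/2} e^{-\lambda_t^2/(2\tau)}
     \cdot \frac{(1/2)^{1/2}}{\Gamma(1/2)}\, \tau^{-3/2} e^{-1/(2\tau)}\, d\tau \\
  &= \frac{1}{\pi} \int_0^\infty \tau^{-2}
     \exp\!\left\{ -\frac{1 + \lambda_t^2}{2\tau} \right\} d\tau
     \qquad \text{since } \Gamma(1/2) = \sqrt{\pi}, \\
  &= \frac{1}{\pi} \cdot \frac{2}{1 + \lambda_t^2}
   = \frac{2}{\pi\,(1 + \lambda_t^2)},
     \qquad \text{using } \int_0^\infty \tau^{-2} e^{-c/\tau}\, d\tau = \frac{1}{c}.
\end{aligned}
```

The result is the standard Cauchy density restricted to λ_t ≥ 0 and renormalized, so each loading has a half-Cauchy marginal prior, while the shared τ ties the loadings together.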

3. POSTERIOR COMPUTATION

Due to the structure of the model described in Section 2.1, it is straightforward to adapt previously proposed algorithms for posterior computation in DPMs and logistic regression models. We use the exact block Gibbs sampler (Yau et al. 2011) for posterior computation and update the polychotomous weights using the algorithm of Holmes and Held (2006). We focus on the simple model y_ij ~ f_i, f_i(η) = ∫ 𝒦(η; θ)dP_i(θ), 𝒦(η; θ) = ∫ N(η; θ, φ)dQ(φ), {P_i} ~ SO-LCM, as in expressions (2)–(4), and Q ~ DP(α₀Q₀). Denote the procedure location cluster index by ζ_i, the patient location cluster index by ξ_ij and the precision cluster index by ε_ij. In the sequel, let ζ_i = k, ξ_ij = l, and ε_ij = t iff $P_{i} = P_{k}^{*}$, $θ_{ij} = θ_{kl}^{*}$, and $φ_{ij} = φ_{t}^{*}$. The sampling steps are as follows:

Sample π_k(x_i) through the following steps. The polychotomous generalization of the logistic regression model is defined via
$$\begin{aligned} ζ_{i} &\sim ℳ (1; π_{1} (x_{i}), \dots, π_{K} (x_{i})), \\ π_{k} (x_{i}) &= \frac{ψ_{k} \exp (x_{i}^{'} β_{k})}{\sum_{l = 1}^{K} ψ_{l} \exp (x_{i}^{'} β_{l})}, \end{aligned}$$ (8)
where ζ_i is the procedure cluster indicator and ℳ(1; ·) denotes the single sample multinomial distribution. Defining β_[j] = (β₁, …, β_j−1, β_j+1, …, β_K), we have
$$L (β_{j} \mid ζ, β_{[j]}) \propto \prod_{i = 1}^{n} \prod_{k = 1}^{K} {[π_{k} (x_{i})]}^{1 (ζ_{i} = k)} \propto \prod_{i = 1}^{n} {[χ_{ij}]}^{1 (ζ_{i} = j)} {[1 - χ_{ij}]}^{1 (ζ_{i} \neq j)},$$ (9)
where
$$\begin{aligned} χ_{ij} &= \frac{\exp \{x_{i}^{'} β_{j} + \log (ψ_{j}) - C_{ij}\}}{1 + \exp \{x_{i}^{'} β_{j} + \log (ψ_{j}) - C_{ij}\}} = \frac{\exp (\hat{x}_{i}^{'} \hat{β}_{j} - C_{ij})}{1 + \exp (\hat{x}_{i}^{'} \hat{β}_{j} - C_{ij})}, \\ C_{ij} &= \log \Big[ \sum_{k \neq j} \exp \{x_{i}^{'} β_{k} + \log (ψ_{k})\} \Big] = \log \Big[ \sum_{k \neq j} \exp (\hat{x}_{i}^{'} \hat{β}_{k}) \Big], \end{aligned}$$ (10)
where $\hat{x}_{i} = (x_{i}^{'}, 1)'$ and $\hat{β}_{j} = (β_{j}^{'}, \log ψ_{j})'$. The prior for log(ψ_k) from (4) can be approximated as $\log (ψ_{k}) \overset{𝒟}{\approx} N (m, v)$. We use this approximation to obtain an efficient Metropolis independence chain proposal. The conditional likelihood L(β̂_j|ζ, β̂_[j]) has the form of a logistic regression on the class indicator 1(ζ_i = j), which allows us to use the algorithm of Holmes and Held (2006). Details are in the Appendix.
Sample the procedure cluster indicators ζ_i, for i = 1, …, n, from a multinomial distribution with probabilities
$Pr (ζ_{i} = k | \dots) \propto π_{k} (x_{i}) \prod_{j = 1}^{n_{i}} \sum_{l = 1}^{L} v_{l} N (y_{ij}; θ_{kl}, φ_{ij})$
and construct $m_{k} = \sum_{i = 1}^{n} 1 (ζ_{i} = k)$ . For K, we first choose a reasonable upper bound and then monitor the maximum index of the occupied components. If all the MCMC samples have maximum indices several units below the upper bound, then the upper bound is sufficiently high, while otherwise the upper bound can be increased, with the analysis rerun.
The joint prior distribution of the group indicator ξ_ij and a latent variable q_ij can be written as
$f (ξ_{ij}, q_{ij} | v) = \sum_{l : v_{l} > q_{ij}} δ_{l} (\cdot) = \sum_{l = 1}^{\infty} 1 (q_{ij} < v_{l}) δ_{l} (\cdot) .$
Implement the Exact Block Gibbs sampler steps:
1. Sample q_ij ~ Unif(0, v_{ξ_ij}) for j = 1, …, n_i with $v_{l} = ν_{l} \prod_{s < l} (1 - ν_{s})$.
2. Sample the stick-breaking random variables
  $ν_{l} \sim \text{beta} \Big( 1 + \sum_{k = 1}^{K} n_{kl}, \; α_{2} + \sum_{s = l + 1}^{L} \sum_{k = 1}^{K} n_{ks} \Big), \quad l = 1, \dots, L,$
  where n_kl is the number of observations assigned to atom l of distribution k with $n_{kl} = \sum_{i = 1}^{n} \sum_{j = 1}^{n_{i}} 1 (ζ_{i} = k, ξ_{ij} = l)$ and L the minimum value satisfying v₁ + ⋯ + v_L > 1 − min{q_ij}.
3. Sample ξ_ij for i = 1, …, n and j = 1, …, n_i from the multinomial conditional with
  $Pr (ξ_{ij} = l) \propto 1 (q_{ij} < v_{l}) N (y_{ij}; θ_{ζ_{i} l}^{*}, φ_{ij}) .$
Sample $γ_{l}^{*}$ from
$p (γ_{l}^{*} | \dots) \propto {\prod_{{(i, j) : ξ_{ij} = l}} N (y_{ij}; θ_{ζ_{i} l}^{*}, φ_{ij})} P_{0} (γ_{l}^{*}),$
where $θ_{kl} = w_{k}^{'} γ_{l}^{*}, w_{k} = (1_{k}^{'}, 0_{K - k}^{'})', γ_{l}^{*} = (γ_{1 l}^{*}, \dots, γ_{Kl}^{*})$ as defined in (3).
φ_ij is updated through the exact blocked Gibbs sampler similar to the above steps.
Use random walk Metropolis–Hastings method to update concentration parameter α₁.
Sample concentration parameter α₂ with conjugate prior Gamma(a₂, b₂) directly from
$(α_{2} | \dots) ~ Gamma (a_{2} + K (L - 1), b_{2} - K \sum_{l = 1}^{L - 1} log (1 - v_{l})) .$
α₀ is updated similarly.
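The role of the latent variables q_ij in the exact block Gibbs steps can be illustrated with a short sketch. It is not the authors' code: the function names are hypothetical, new sticks are drawn from the beta(1, α₂) prior as an assumption about how unoccupied atoms are added, and a cap on the number of atoms is included purely as a safeguard. The point is that only finitely many atoms can satisfy q_ij < v_l, so no fixed truncation of the patient-level mixture is needed.

```python
import random

def stick_weights(nu):
    """Return v_l = nu_l * prod_{s<l} (1 - nu_s) and the mass not yet assigned."""
    v, remaining = [], 1.0
    for n in nu:
        v.append(n * remaining)
        remaining *= 1.0 - n
    return v, remaining

def sample_slice_variables(xi, v, rng):
    """Step 1: q_ij ~ Unif(0, v_{xi_ij}); xi holds 0-based atom labels."""
    return [rng.uniform(0.0, v[label]) for label in xi]

def extend_sticks(nu, alpha2, q_min, rng, max_atoms=10000):
    """Add sticks until v_1 + ... + v_L > 1 - q_min, i.e. leftover mass < q_min."""
    nu = list(nu)
    v, remaining = stick_weights(nu)
    while remaining >= q_min and len(nu) < max_atoms:
        nu.append(rng.betavariate(1.0, alpha2))  # assumed prior draw for a new stick
        v, remaining = stick_weights(nu)
    return nu, v

def candidate_atoms(q, v):
    """Atoms with 1(q < v_l) = 1, the only ones with positive probability in step 3."""
    return [l for l, v_l in enumerate(v) if q < v_l]
```

Once the leftover stick mass falls below min{q_ij}, every atom beyond L has weight smaller than all of the q_ij, so step 3 only needs to evaluate the normal kernel on `candidate_atoms(q_ij, v)`.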

Note that this algorithm can be generalized easily to accommodate model (7), so the details are omitted.
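As one more illustration of the sampling steps, the full conditional for the procedure cluster indicator ζ_i involves a product over all patients in the group, which underflows quickly if computed on the probability scale. The sketch below evaluates it in log space. It assumes a finite number of atoms L per cluster, treats φ_ij as a kernel variance, and uses made-up names; none of these details are specified in the text.

```python
import math
import random

def log_normal_pdf(y, mean, var):
    """Log density of N(mean, var) evaluated at y."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)

def log_sum_exp(values):
    """Numerically stable log(sum(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(val - m) for val in values))

def cluster_indicator_probs(y_i, phi_i, pi_i, v, theta):
    """Pr(zeta_i = k | ...) for one group.

    y_i, phi_i : responses and kernel variances for the patients in group i
    pi_i       : allocation probabilities pi_k(x_i), all strictly positive
    v          : shared atom weights v_1..v_L
    theta      : theta[k][l] is atom l of cluster k
    """
    log_post = []
    for k, pi_k in enumerate(pi_i):
        lp = math.log(pi_k)
        for y, phi in zip(y_i, phi_i):
            lp += log_sum_exp([math.log(v_l) + log_normal_pdf(y, th, phi)
                               for v_l, th in zip(v, theta[k])])
        log_post.append(lp)
    norm = log_sum_exp(log_post)
    return [math.exp(lp - norm) for lp in log_post]

def sample_cluster_indicator(probs, rng):
    """Draw zeta_i (0-based) from the normalized probabilities."""
    return rng.choices(range(len(probs)), weights=probs)[0]
```

With the normalized probabilities in hand, `sample_cluster_indicator` performs the multinomial draw, and the counts m_k follow by tallying the sampled indicators.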

4. SIMULATION STUDY

We separate this section into two parts. Predictors are not considered in the first simulation but will be considered in the second simulation. Model (7) is studied and both simulations mimic the structure of the medical procedure data.

4.1 Without Predictors

Data y_ij are generated according to (7), with one continuous response (p₁ = 1) and six binary responses (p₂ = 6) for each of 100 patients (j = 1, …, 100) in each of 60 procedures (i = 1, …, 60). Parameters for the data-generating model are μ = (0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4)′, $Λ = 1_{p \times 1}^{'}$ , and Σ = diag(0.5, 1, 1, 1, 1, 1, 1). The latent morbidity η_ij is generated from one of four mixtures of Gaussian components outlined in Table 1, with the first 15 procedures being generated from mixture distribution T₁, the second 15 procedures generated from T₂, the third 15 procedures generated from T₃ and the last 15 procedures generated from T₄, where T₁ ≺ T₂ ≺ T₃ ≺ T₄ such that the generated latent morbidity distributions are stochastically ordered. As shown in Figure 1 and Table 1, distributions share components with each other and the ordering of the distributions is subtle.

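Table 1. Parameters for the true distributions used in the simulation study in Section 4.1

| Dist | Comp1 w | Comp1 (μ, σ²) | Comp2 w | Comp2 (μ, σ²) | Comp3 w | Comp3 (μ, σ²) | Comp4 w | Comp4 (μ, σ²) |
|---|---|---|---|---|---|---|---|---|
| T₁ | 0.75 | (−3, 0.5) | 0.25 | (0, 1) | | | | |
| T₂ | 0.75 | (−2.5, 0.5) | 0.25 | (0, 1) | | | | |
| T₃ | 0.2 | (1, 0.5) | 0.5 | (1.5, 1) | 0.3 | (2, 1) | | |
| T₄ | 0.2 | (2, 1) | 0.3 | (2.5, 0.5) | 0.4 | (3, 1) | 0.1 | (3.5, 0.5) |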

Figure 1. True distributions used in the simulation study in Section 4.1.
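The four true latent morbidity distributions can be written down directly from Table 1. The sketch below, with illustrative names, transcribes the mixture components, checks the ordering T₁ ≺ T₂ ≺ T₃ ≺ T₄ by comparing mixture CDFs on a grid (a numerical check on a finite grid, not a proof), and draws latent scores η_ij for one procedure. It assumes the second entry of each (μ, σ²) pair is a variance.

```python
import math
import random

# (weight, mean, variance) triples transcribed from Table 1.
TRUE_MIXTURES = {
    "T1": [(0.75, -3.0, 0.5), (0.25, 0.0, 1.0)],
    "T2": [(0.75, -2.5, 0.5), (0.25, 0.0, 1.0)],
    "T3": [(0.2, 1.0, 0.5), (0.5, 1.5, 1.0), (0.3, 2.0, 1.0)],
    "T4": [(0.2, 2.0, 1.0), (0.3, 2.5, 0.5), (0.4, 3.0, 1.0), (0.1, 3.5, 0.5)],
}

def mixture_cdf(components, a):
    """CDF of a Gaussian mixture at a, using the error function."""
    return sum(w * 0.5 * (1.0 + math.erf((a - mu) / math.sqrt(2.0 * var)))
               for w, mu, var in components)

def is_stochastically_ordered(lower, upper, grid, tol=1e-12):
    """True if the CDF of `lower` is at least that of `upper` at every grid point."""
    return all(mixture_cdf(lower, a) >= mixture_cdf(upper, a) - tol for a in grid)

def sample_latent(components, n, rng):
    """Draw n latent morbidity scores from a Gaussian mixture."""
    weights = [c[0] for c in components]
    draws = []
    for _ in range(n):
        _, mu, var = rng.choices(components, weights=weights)[0]
        draws.append(rng.gauss(mu, math.sqrt(var)))
    return draws

GRID = [-6.0 + 0.05 * i for i in range(241)]
ORDER_OK = all(
    is_stochastically_ordered(TRUE_MIXTURES[a], TRUE_MIXTURES[b], GRID)
    for a, b in [("T1", "T2"), ("T2", "T3"), ("T3", "T4")]
)
```

T₁ and T₂ differ only by a shift of 0.5 in their dominant component, which is one way to see why the ordering between the first two groups of procedures is hard to detect.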

To obtain an initial clustering of the medical procedures using standard methods, we first averaged the severity data for the different patients having each procedure to obtain ${y̅}_{i} = \frac{1}{n_{i}} \sum_{j = 1}^{n_{i}} y_{ij}$ as a p = 7 dimensional summary of severity for procedure i. We then applied model-based clustering (Fraley and Raftery 2002; Fraley, Raftery, and Wehrens 2005) to the data {y̅₁, …, y̅_n} using the R functions available in the package described in Fraley, Raftery, and Wehrens (2005). These approaches rely on fitting of finite mixture models with the EM algorithm, with the model fit for a variety of choices of the number of mixture components, which also corresponds to the number of clusters. The BIC is used to select the optimal number of clusters. Figure 2 plots the BIC for the simulated data versus the number of clusters for four different options on the cluster shapes. In this case, the best model according to BIC is EEI (equal size and shape) with six clusters. Note that this approach does not utilize the patient-specific data and instead clusters based on the mean severity measure across patients, while the proposed approach should have advantages in clustering procedures based on the entire distribution across patients.

Figure 2. Frequentist model-based clustering results implemented via the EM algorithm using the mclust package in R for the simulation study in Section 4.1, with the different symbols representing different model assumptions. EII: spherical, equal volume; EEI: diagonal, equal volume and shape; EEV: ellipsoidal, equal volume and shape but varying orientation; EEE: ellipsoidal, equal volume, shape, and orientation. Refer to Fraley, Raftery, and Wehrens (2005) for more details. The online version of this figure is in color.

For the SO-LCM estimation, parameters α₀, α₁, and α₂ are fixed to be 1 and a normal inverse-gamma prior distribution, NIG(0, 0.1, 2, 3), is chosen for the base measure parameters m₀ and $s_{0}^{2}$ described in (3), implying that $E (m_{0} \mid s_{0}^{2}) = 0$, $V (m_{0} \mid s_{0}^{2}) = 10 s_{0}^{2}$, $E (s_{0}^{2}) = 1$, and $V (s_{0}^{2}) = 3$. Additionally, we assign priors w₀ ~ beta(1, 1), κ ~ Gamma(1/2, 1/2), representing a robust and flexible default prior for the base measure P₀. Without predictors, the prior for the cluster-specific allocation probabilities turns out to be (π₁, …, π_K)′ ~ Dir(α₁/K, …, α₁/K). Posterior samples under this SO-LCM prior are obtained through the algorithm described in Section 3 with a prespecified truncation bound of K = 20. This truncation tends to be accurate for α₁ ≤ 1, where such values of α₁ favor a small number of mixture components. In this particular application, mixture components close to the upper bound are not occupied in any of the MCMC samples after the burn-in period; 20,000 iterations are found to be enough for parameters to converge. All results are based on 20,000 samples obtained after a burn-in period of 20,000 iterations.

For each pair of distributions P_i and P_i′ (i < i′), the probability Pr(P_i = P_i′) was estimated as the proportion of posterior samples for which P_i and P_i′ are assigned to the same cluster, and Pr(P_i ≺ P_i′) was calculated as the proportion of posterior samples for which P_i is assigned to a cluster with stochastically less morbidity than P_i′. Results are shown in Figure 3, where Figure 3(a) is the ranking plot with the (i, i′)th entry of the lower triangular matrix identifying the probability for P_i ≺ P_i′ and Figure 3(b) is the clustering plot with the (i, i′)th entry identifying the probability for P_i = P_i′. Figure 3 illustrates that there is not enough information in the data to differentiate the first 30 procedures, which is not surprising given the very subtle differences in T₁ and T₂ shown in Figure 1. However, the true rankings and clusterings in the medical procedures are otherwise accurately reflected in the results. The estimated density of T₁ is shown in Figure 4(a). For comparison, this density is also estimated under a DPM prior with the same base measure and precision parameter α₂ = 1 [in Figure 4(b)]. The estimate obtained using the SO-LCM prior distribution appears to capture both the small and large modes more accurately than the DPM alternative.
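The pairwise summaries described above are simple proportions over the saved MCMC draws of the cluster indicators. A minimal sketch follows, assuming the draws are stored as a list of lists `zeta_samples[s][i]` (a hypothetical layout) and that a lower cluster index is read as stochastically less morbidity, as implied by the ordering of the clusters.

```python
def pairwise_posterior(zeta_samples):
    """Estimate Pr(P_i = P_i') and Pr(P_i < P_i') from posterior draws.

    zeta_samples[s][i] is the cluster index of group i in posterior draw s.
    Returns two n x n matrices (lists of lists): same[i][ip] and less[i][ip].
    """
    n_draws = len(zeta_samples)
    n = len(zeta_samples[0])
    same_counts = [[0] * n for _ in range(n)]
    less_counts = [[0] * n for _ in range(n)]
    for zeta in zeta_samples:
        for i in range(n):
            for ip in range(n):
                if zeta[i] == zeta[ip]:
                    same_counts[i][ip] += 1
                elif zeta[i] < zeta[ip]:
                    less_counts[i][ip] += 1
    same = [[c / n_draws for c in row] for row in same_counts]
    less = [[c / n_draws for c in row] for row in less_counts]
    return same, less

# Toy example with four draws for three groups (made-up values).
toy_draws = [[1, 1, 2], [1, 2, 2], [1, 1, 3], [2, 1, 3]]
toy_same, toy_less = pairwise_posterior(toy_draws)
```

In the toy example, groups 0 and 1 share a cluster in two of the four draws, so `toy_same[0][1]` is 0.5, while group 0 sits below group 2 in every draw, so `toy_less[0][2]` is 1.0. Because the summaries depend only on the relative order of the indicators, they are unaffected by which labels happen to be occupied in a given draw.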

Figure 3. Posterior probability for ranking and clustering in the study of Section 4.1, with entry (i, i′) of the lower triangular matrix giving in (a) the probability for P_i ≺ P_i′ and in (b) the probability for P_i = P_i′.

Figure 4. True (solid lines) and estimated (dashed lines) densities from SO-LCM and DPM for distribution T₁.

4.2 With Predictors

Potentially, the incorporation of predictors may improve the ability to detect subtle differences in the distributions of patient morbidity between procedures. To assess this, we repeated the simulation study of Section 4.1 but modified the model to allow predictor-dependent mixture weights. Mimicking the real data, we assumed there was a single predictor corresponding to an initial physician severity score obtained from clinical experience and not from examination of the current data. In particular, to match the range of the potentially useful auxiliary covariate, the Aristotle Basic Complexity (ABC) level, we let the predictors for the first 15 procedures be drawn uniformly from (−1.5, −1), those for the second 15 procedures from (−0.6, −0.4), those for the third 15 procedures from (0.4, 0.6), and those for the last 15 procedures from (1, 1.5). Data are then generated from the assumed model exactly as described in Section 4.1 but assuming a logistic regression model (4) for the weights with ψ = (0.5, 1, 1.65, 1.45)′ and β = (−1.5, −0.5, 1, 1.2)′. Procedures with the first 15 predictors are then assigned to the first cluster and so forth.
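The predictor design just described can be reproduced in a few lines. This is only an illustrative sketch of the covariate draws and the true cluster assignment; the seed and variable names are assumptions, and the weight model (4) itself is not implemented here.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed, for reproducibility only

# Uniform ranges for the four blocks of 15 procedures, as stated in the text
intervals = [(-1.5, -1.0), (-0.6, -0.4), (0.4, 0.6), (1.0, 1.5)]

# One scalar predictor per procedure (60 procedures in total)
x = np.concatenate([rng.uniform(lo, hi, size=15) for lo, hi in intervals])

# True cluster index: first 15 procedures in cluster 0, next 15 in cluster 1, ...
true_cluster = np.repeat(np.arange(4), 15)
```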

In the analysis, priors are specified as described in Section 4.1 and model (4), and we additionally choose a N(0, 10I) prior for β to complete the specification. The MCMC algorithm was run for 20,000 iterations following a 20,000 iteration burn-in. Apparent convergence was rapid and mixing was adequate. The truncation level of K = 20 was still sufficiently high.

Figure 5(a) depicts the ranking performance and Figure 5(b) depicts the clustering plot. Both ranking and clustering performance are improved compared to the study in Section 4.1: the first 30 procedures are now ranked consistently with the true order and are clustered correctly.

Posterior probability for ranking and clustering in study of Section 4.2 with entry (i, j) in (a) being the lower triangular matrix identifying the probability for *P_i* ≺ *P_i′* and in (b) the probability for *P_i* = *P_i′*.

5. MEDICAL PROCEDURE APPLICATION

The push for accountability in medicine has led to a proliferation of “report cards” evaluating health care providers in various therapeutic areas such as adult and pediatric cardiac surgery, treatment of heart attacks, and management of chronic conditions. In adult cardiac surgery, report cards typically focus on a single commonly performed procedure, coronary artery bypass grafting (CABG), and a single endpoint, short-term all-cause mortality. Regression models are used to adjust for differences in each provider’s case mix that may impact short-term mortality. Although regression models are straightforward to apply in adult cardiac surgery, the development of such models for pediatric and congenital heart surgery is relatively challenging. In congenital heart surgery, the total number of cases is much smaller than for CABG, there are literally hundreds of different types of surgical procedures, and no single type of procedure accounts for the majority of cases. To address the challenge of multiple rare procedures, researchers have proposed methods to allow procedures with similar mortality and morbidity risk to be grouped together for analysis. Two widely used methods are the Risk Adjustment for Congenital Heart Surgery (RACHS-1) methodology (Jenkins 2004) and the Aristotle Basic Complexity Levels (Lacour-Gayet et al. 2004). RACHS-1 groups more than 100 types of congenital heart surgery procedures into six categories based on their estimated risk of in-hospital mortality. Similarly, the Aristotle method groups over 160 types of procedures into four categories (levels) based on their potential for mortality, morbidity, and technical difficulty. For both RACHS-1 and Aristotle, procedure categories were determined by panels of subject matter experts without using a formal statistical framework.
In this section, our goal is to show that the SO-LCM methodology provides a useful statistical framework for grouping procedures into categories of risk and for choosing the number of categories. More formally, we sought to identify clusters of congenital procedures with similar distributions of postprocedural morbidity.

Data for this analysis were obtained from the Society of Thoracic Surgeons (STS) Congenital Heart Surgery database. The study population consisted of N = 79,635 patients who underwent one of 145 types of congenital cardiovascular procedures at an STS-participating center during the years 2002–2008. Postoperative morbidity was regarded as a patient-level unobserved latent variable. Indicators of morbidity included a single continuous variable, postoperative length of stay (PLOS), modeled as y₁ = log(1 + PLOS); and six binary (yes/no) variables: y₂ = stroke, y₃ = renal failure, y₄ = heart block, y₅ = requirement for extracorporeal membrane oxygenation or ventricular assist device (ECMO/VAD), y₆ = phrenic nerve injury, and y₇ = in-hospital mortality. Responses from different patients were assumed to be independent. Multiple responses from the same patient were assumed to be conditionally independent given the latent morbidity variable. The joint model for all seven endpoints is

$y_{ij 1} | η_{ij} ~ N [μ_{1} + λ_{1} η_{ij}, σ_{i 1}^{2}]$ (PLOS),
y_ij2|η_ij ~ Bernoulli[Φ(μ₂ + λ₂η_ij)] (stroke),
y_ij3|η_ij ~ Bernoulli[Φ(μ₃ + λ₃η_ij)] (renal failure),
y_ij4|η_ij ~ Bernoulli[Φ(μ₄ + λ₄η_ij)] (heart block),
y_ij5|η_ij ~ Bernoulli[Φ(μ₅ + λ₅η_ij)] (ECMO/VAD),
y_ij6|η_ij ~ Bernoulli[Φ(μ₆ + λ₆η_ij)] (phrenic nerve injury),
y_ij7|η_ij ~ Bernoulli[Φ(μ₇ + λ₇η_ij)] (in-hospital mortality).
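To make the measurement model concrete, the sketch below simulates the seven endpoints for one patient given a value of the latent morbidity η. It is an illustration under stated assumptions rather than the authors' implementation: the intercepts, loadings, and residual standard deviation are made-up values, and `simulate_patient` and `std_normal_cdf` are hypothetical helper names.

```python
import math
import numpy as np

def std_normal_cdf(z):
    """Standard normal CDF (the probit link Phi), via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_patient(eta, mu, lam, sigma1, rng):
    """Draw (y1, ..., y7) given latent morbidity eta.

    y1 is Gaussian with mean mu[0] + lam[0] * eta and standard deviation
    sigma1 (the square root of the variance in the model); y2..y7 are
    Bernoulli with probit success probabilities.
    """
    y = np.empty(7)
    y[0] = rng.normal(mu[0] + lam[0] * eta, sigma1)
    for t in range(1, 7):
        p = std_normal_cdf(mu[t] + lam[t] * eta)
        y[t] = rng.binomial(1, p)
    return y

rng = np.random.default_rng(0)
mu = np.array([1.5, -2.0, -2.0, -2.0, -2.0, -2.0, -2.0])  # hypothetical intercepts
lam = np.ones(7)                                          # hypothetical loadings
y = simulate_patient(eta=0.3, mu=mu, lam=lam, sigma1=0.5, rng=rng)
```

Because all seven responses share the same η, larger latent morbidity shifts the length-of-stay mean and every complication probability in the same direction whenever the loadings are positive.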

Let f_i denote the density of η_ij among patients undergoing the ith type of procedure, that is, η_ij ~ f_i. Our goal is to estimate the densities f₁, f₂, …, f₁₄₅ nonparametrically under the assumption that they have an unknown stochastic ordering. This assumption facilitates ranking of the procedures and is less restrictive than alternative parametric models which assume a Gaussian distribution and procedure-specific location parameters. An important consideration for the analysis is that the procedure-specific sample sizes are small and highly variable (median = 50; range 10 to 2000). To account for these low sample sizes, we propose a method of analysis that permits borrowing of information across procedures and incorporates external prior information. A potentially useful auxiliary covariate is each procedure’s Aristotle Basic Complexity (ABC) level. As noted above, the ABC level represents the average subjective ranking by an international panel of congenital heart surgeons. Large ABC values imply that the procedure is considered to be a difficult operation with high potential for mortality and morbidity.

The procedure-specific morbidity distributions are estimated nonparametrically under the SO-LCM model as in Section 4.2. Hyperparameters are $ψ_{k} ~ Gamma (α_{1} / K, 1)$, $β_{k} \overset{ind}{~} N (0, 10)$, α₀ ~ Gamma(1, 1), α₁ ~ Gamma(1, 6/ ln 145), and α₂ ~ Gamma(1, 1). The prior for α₁ is chosen based on expert elicitation to favor six or fewer clusters in the procedures. The prespecified truncation bound K = 20 is found to be sufficient, since mixture components close to the upper bound are not occupied after the algorithm converges. The baseline distribution P₀ is constructed as in (3) with w₀ ~ beta(1, 1), κ ~ Gamma(1/2, 1/2), and $(m_{0}, s_{0}^{2}) ~ NIG (0, 0.1, 2, 3)$. Priors for the outcomes model are $μ_{t} \overset{ind}{~} N (0, 10)$ and $λ_{t} \overset{ind}{~} N^{+} (0, τ)$, t = 1, 2, …, p, where τ ~ Inv-Gamma(1/2, 1/2). Estimates are calculated with 50,000 MCMC iterations after a burn-in period of 20,000 iterations. We took a long burn-in and collection interval to be conservative, but similar results are obtained using shorter chains.

Figures 6 and 7 summarize posterior inferences for procedures (n = 66) having at least 200 occurrences in the database. For the ith procedure, let $S_{i} = \sum_{h = 1}^{66} I (f_{i} ⪯ f_{h})$ denote the number of procedures having a morbidity distribution that is stochastically no smaller than f_i. In Figure 6, the posterior mean of the latent score for each procedure (θ_i) is plotted on the vertical axis along with its 95% credible interval, and procedures are sorted in ascending order of the posterior mean E[S_i]. The relatively wide probability intervals indicate that there is uncertainty regarding the true rank ordering of the procedure-specific morbidity distributions. Nonetheless, several procedures have narrow intervals and are clearly distinguished as having either low (e.g., atrial septal defect repair) or high (e.g., Norwood operation) latent morbidity. Ranking and clustering performances are depicted in Figure 7(a) and (b).
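The rank score S_i can be summarized from MCMC output in a few lines. The sketch below is illustrative only. It assumes each draw stores an ordered integer cluster label per procedure, with a smaller label meaning a stochastically smaller distribution, so that f_i ⪯ f_h holds in a draw exactly when the label of i is no larger than the label of h; the function name and toy labels are hypothetical.

```python
import numpy as np

def rank_score_summary(labels):
    """Posterior mean and 95% interval of S_i = sum_h I(f_i <= f_h).

    labels: (n_draws, n_procedures) array of ordered cluster labels
    (assumed convention: smaller label = stochastically smaller).
    """
    labels = np.asarray(labels)
    # s[d, i] counts procedures h (including i itself) with label_i <= label_h
    s = (labels[:, :, None] <= labels[:, None, :]).sum(axis=2)
    mean = s.mean(axis=0)
    lower, upper = np.percentile(s, [2.5, 97.5], axis=0)
    return mean, lower, upper

# Toy example: 4 draws, 3 procedures (made-up labels)
toy_labels = np.array([[0, 0, 1],
                       [0, 1, 1],
                       [0, 0, 2],
                       [0, 1, 2]])
s_mean, s_lo, s_hi = rank_score_summary(toy_labels)
order = np.argsort(s_mean)  # ascending order of E[S_i], as used for sorting
```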

Posterior means of the latent scores for each procedure, sorted, with 95% credible intervals.

Ranking and clustering for selected 66 procedures with more than 200 patients, with entry (i, j) in (a) being the lower triangular matrix identifying the probability for *P_i* ≺ *P_i′* and in (b) the probability for *P_i* = *P_i′*.

The 145 procedures can be grouped into four, five, or six homogeneous clusters according to posterior clustering probabilities shown in Table 2. The data suggest high (99%) posterior probability of fewer than eight clusters, with 32% probability assigned to the posterior mode of 5.

Table 2.

Posterior probability for clustering (epidemiology application)

Number of clusters	Posterior probability
1	0.01
2	0.02
3	0.12
4	0.21
5	0.32
6	0.25
7	0.06
8	0.01


Quite often, an optimal point estimate of the ranked clustering is required. We propose the following method, which attains a much smaller Bayes risk than the existing and widely used Aristotle score for ranking congenital heart surgery procedures. For a vector-valued parameter θ = (θ₁, …, θ_n), denote the corresponding true ranks by R = (R₁, …, R_n), and let R̃ = (R̃₁, …, R̃_n) be a possible point estimate of R (we drop the dependency on θ for notational convenience). To find the optimal ranked clustering, we define the following loss function L(R, R̃):

L (R, R̃) = \sum_{(i, j) \in ℳ} {2 \times 1 ({R̃}_{i} < {R̃}_{j}, R_{i} > R_{j}) + 2 \times 1 ({R̃}_{i} > {R̃}_{j}, R_{i} < R_{j}) + 1 ({R̃}_{i} = {R̃}_{j}, R_{i} \neq R_{j}) + 1 ({R̃}_{i} \neq {R̃}_{j}, R_{i} = R_{j})} .

(11)

Here ℳ = {(i, j) : i < j; i, j ∈ {1, …, n}}. We penalize estimates of ties when the true ranks are strictly ordered half as much as estimates in the wrong direction. The posterior expected loss is

E (L (R, R̃) | y) = \sum_{(i, j) \in ℳ} {2 \times 1 ({R̃}_{i} < {R̃}_{j}) Pr {R_{i} > R_{j} | y} + 2 \times 1 ({R̃}_{i} > {R̃}_{j}) Pr {R_{i} < R_{j} | y} + 1 ({R̃}_{i} = {R̃}_{j}) Pr {R_{i} \neq R_{j} | y} + 1 ({R̃}_{i} \neq {R̃}_{j}) Pr {R_{i} = R_{j} | y}},

where Pr{R_i > R_j|y}, Pr{R_i < R_j|y}, Pr{R_i = R_j|y}, and Pr{R_i ≠ R_j|y} are estimated through the MCMC outputs. As a pragmatic approach to avoid additional computation, we estimate the Bayes risk for each MCMC ranked clustering sample, and choose the sample with the smallest risk as the optimal estimate.
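The pragmatic search described above can be sketched as follows. This is a minimal illustration under assumptions, not the authors' code: each MCMC draw is taken to be a vector of integer ranks (ties allowed), the pairwise posterior probabilities are estimated as proportions over draws, and the function names are hypothetical.

```python
import numpy as np

def expected_loss(candidate, p_gt, p_lt, p_eq):
    """Posterior expected loss of a candidate ranked clustering.

    p_gt[i, j], p_lt[i, j], p_eq[i, j] estimate Pr(R_i > R_j | y),
    Pr(R_i < R_j | y), and Pr(R_i = R_j | y), respectively.
    """
    c = np.asarray(candidate)
    lt = c[:, None] < c[None, :]
    gt = c[:, None] > c[None, :]
    eq = c[:, None] == c[None, :]
    # Weight 2 for a reversed order, weight 1 for a wrongly declared
    # tie or a wrongly declared difference
    terms = (2.0 * lt * p_gt + 2.0 * gt * p_lt
             + eq * (1.0 - p_eq) + (~eq) * p_eq)
    iu = np.triu_indices(len(c), k=1)  # pairs (i, j) with i < j
    return terms[iu].sum()

def best_sampled_ranking(rank_draws):
    """Evaluate every sampled ranking and return the one with smallest risk."""
    draws = np.asarray(rank_draws)
    p_gt = (draws[:, :, None] > draws[:, None, :]).mean(axis=0)
    p_lt = (draws[:, :, None] < draws[:, None, :]).mean(axis=0)
    p_eq = (draws[:, :, None] == draws[:, None, :]).mean(axis=0)
    risks = np.array([expected_loss(d, p_gt, p_lt, p_eq) for d in draws])
    return draws[int(np.argmin(risks))], risks

# Toy example: 4 draws of ranks for 3 procedures (made-up values)
toy_draws = np.array([[0, 0, 1],
                      [0, 1, 1],
                      [0, 0, 2],
                      [0, 1, 2]])
best, risks = best_sampled_ranking(toy_draws)
```

Restricting the search to sampled rankings avoids optimizing over all possible ranked clusterings, at the cost of possibly missing a lower-risk configuration that was never visited by the chain.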

To compare the performance of our method with that of the Aristotle score (which has four levels) in terms of Bayes risk, we set K = 4 so that at most four clusters are obtained. The Bayes risk for the ranked clustering obtained from the Aristotle score is 4900.5. Our optimal Bayesian ranked clustering achieves a much smaller risk: 749.8.

We compare groupings based on Aristotle to our final point estimate in Table 3. Several procedures that were predicted to be relatively low risk by the Aristotle score are actually moderate risk or high risk according to our proposed methodology. Among the 23 procedures in Aristotle Category 1 (lowest risk), only 15 are assigned to the lowest risk category according to our method. The correlation between these two ranked clusterings (from the SO-LCM and the Aristotle score) is 0.44. Taking log(1 + PLOS) as a critical factor for ranking surgical procedures, the correlation between the ranked clustering of the SO-LCM and log(1 + PLOS) is 0.81, whereas the correlation between the ranked clustering of the Aristotle level and log(1 + PLOS) is only 0.58.

Table 3.

Clustering comparison between Aristotle level and SO-LCM

	Aristotle

SO-LCM	Level 1	Level 2	Level 3	Level 4
Level 1	15	18	15	2
Level 2	4	17	28	11
Level 3	4	6	6	11
Level 4	0	0	1	7


6. DISCUSSION

We have formulated a novel extension of the nested Dirichlet process (nDP) to latent class modeling with a partial stochastic ordering that allows us to simultaneously rank and cluster procedures. The procedures are clustered by their entire distribution rather than by particular features of it. We avoid some of the problems arising in typical LCMs through stochastic ordering restrictions, and we also relax parametric assumptions on the class-specific distributions through nonparametric Bayes methods. Similar to the nDP, the SO-LCM also allows us to cluster subjects within procedures. It is also straightforward to embed the SO-LCM, as a prior for stochastically ordered mixture distributions, within a larger hierarchical model.

Although inspired by the pioneering work on the nDP, this article makes several important contributions. First, we develop stochastically ordered priors that allow covariates to impact the allocation to clusters, and we use them to nonparametrically estimate densities for multiple procedures subject to a stochastic ordering constraint. In addition, we can test the hypothesis of equality between procedures against stochastically ordered alternatives. After examining some of the theoretical properties of the model, we describe a computationally efficient implementation and demonstrate the flexibility of the model through both a simulation study and an application where the SO-LCM is used within a hierarchical model. Heat maps are also offered to summarize the ranking and clustering structures generated by the model.

It is straightforward to make several generalizations of the SO-LCM. One natural generalization is to include hyperparameters in the prior on the regression coefficients of the predictor dependent probabilities H and the baseline measure P₀. For H, we can choose a heavy-tailed Cauchy prior or a variable selection mixture prior with a mass at zero to shrink unimportant coefficients towards zero. We note that, conditional on P₀, the distinct atoms ${P_{k}^{*}}_{k = 1}^{\infty}$ are assumed to be independent. Therefore, including hyperparameters in P₀ allows us to parametrically borrow information across the distinct distributions.

Another natural generalization of the SO-LCM is to replace the beta(1, α₂) stick-breaking densities with more general forms beta(a_k, b_k) as considered in Ishwaran and James (2001), with the SO-LCM corresponding to the special case a_k = 1, b_k = α₂. This yields richer classes of priors that encompass the SO-LCM as a particular case, though in some regression contexts such priors do not always outperform the DP model with beta(1, α₂) in terms of the log-predictive marginal likelihood.

We can also generalize the procedure to incorporate multivariate latent factors whose distributions are stochastically ordered. This generalization is inspired by a valuable suggestion from the editors. By imposing a univariate stochastic ordering at the latent variable level, we actually induce a multivariate stochastic ordering for the responses (albeit in a somewhat restrictive manner). To directly place the stochastic ordering constraint on multivariate distributions, we can adopt multivariate monotone functions. In particular, in place of the scalar θ_kl we could have a vector θ_kl with P₀ chosen (e.g., multivariate truncated normal) so that the different elements are appropriately ordered to satisfy the constraint. In the simple ordering case, we could just apply (3) independently to each element of the θ_kl vector instead of only to the scalar θ_kl. We could even have different orders for different variables and could have some variables with no ordering.

APPENDIX

Clustering Probability

Under (2), the probability that P_i = P_i′ so that groups i and i′ are allocated to the same cluster is

ℙ (P_{i} = P_{i'}) = E (\sum_{k = 1}^{K} π_{k} π_{k}) = \sum_{k = 1}^{K} E (π_{k}^{2}) = \sum_{k = 1}^{K} \{Var (π_{k}) + {E (π_{k})}^{2}\} = \sum_{k = 1}^{K} \{\frac{(α_{1} / K) (α_{1} - α_{1} / K)}{α_{1}^{2} (α_{1} + 1)} + {(\frac{α_{1} / K}{α_{1}})}^{2}\} = K \{\frac{(α_{1} / K) (α_{1} - α_{1} / K)}{α_{1}^{2} (α_{1} + 1)} + {(\frac{α_{1} / K}{α_{1}})}^{2}\} = \frac{α_{1} - α_{1} / K}{α_{1} (α_{1} + 1)} + \frac{1}{K} .

As K goes to infinity,

\lim_{K \to \infty} \{\frac{α_{1} - α_{1} / K}{α_{1} (α_{1} + 1)} + \frac{1}{K}\} = \frac{1}{α_{1} + 1} .
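As a quick numerical sanity check on this derivation, the closed form can be compared with a Monte Carlo estimate of E(Σ π_k²) under the Dir(α₁/K, …, α₁/K) prior. The sketch below is illustrative; the choice α₁ = 1, K = 20 mirrors the simulation settings, while the seed, sample size, and function name are arbitrary assumptions.

```python
import numpy as np

def same_cluster_prob(alpha1, K):
    """Closed-form prior probability that two groups share a cluster."""
    return (alpha1 - alpha1 / K) / (alpha1 * (alpha1 + 1.0)) + 1.0 / K

rng = np.random.default_rng(1)
alpha1, K = 1.0, 20

# Draw allocation probabilities from the symmetric Dirichlet prior
pi = rng.dirichlet(np.full(K, alpha1 / K), size=100_000)

# Two groups pick clusters independently given pi, so the match
# probability is sum_k pi_k^2; average it over prior draws
mc_estimate = (pi ** 2).sum(axis=1).mean()

exact = same_cluster_prob(alpha1, K)   # value at the truncation level K
limit = 1.0 / (alpha1 + 1.0)           # limiting value as K grows
print(mc_estimate, exact, limit)
```

For α₁ = 1 the closed form gives 0.525 at K = 20, already close to the limiting value of 0.5.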

MCMC Supplement

We introduce a set of variables d_ij, i = 1, …, p, j = 1, …, K, and define ŷ_ij = 1 if the ith observation belongs to class j, j ∈ {1, …, K}, and ŷ_ij = 0 otherwise. An equivalent representation of equation (8) is

\begin{matrix} ŷ_{ij} = \begin{cases} 1, & s_{ij} \geq 0, \\ 0, & \text{else}, \end{cases} \\ s_{ij} = {x̂}_{i}^{'} {β̂}_{j} - C_{ij} + ε_{ij}, \quad ε_{ij} ~ N (0, d_{ij}), \\ d_{ij} = {(2 e_{ij})}^{2}, \quad e_{ij} ~ KS, \quad {β̂}_{j} ~ π ({β̂}_{j}), \\ λ_{j} ~ Gamma (\frac{α_{1}}{K}, 1) . \end{matrix}

(A.1)

The parameters of equation (A.1) are updated through the following steps:

Sampling β̂_j. In the case of a normal prior on β̂_j, π(β̂_j) = N(b₀, v₀), the full conditional distribution of β̂_j given s_·j and d_·j is still normal,
$\begin{matrix} {β̂}_{j} | s_{\cdot j}, d_{\cdot j} ~ & MN (B_{j}, V_{j}), \\ B_{j} = & V_{j} (v_{0}^{- 1} b_{0} + x̂' W_{j} (s_{\cdot j} + C_{\cdot j})), \\ V_{j} = & \begin{matrix} {(v_{0}^{- 1} + x̂' W_{j} x̂)}^{- 1}, & W_{j} = diag (d_{1 j}^{- 1}, \dots, d_{nj}^{- 1}), \end{matrix} \end{matrix}$
where x̂ = (x̂₁, …, x̂_p)′, s_·j = (s_1j, …, s_nj)′, C_·j = (C_1j, …, C_nj)′, and d_·j = (d_1j, …, d_nj)′.
Following advice of Holmes and Held (2006), we update {s_·j, d_·j} jointly given β̂_j,
$π (s_{\cdot j}, d_{\cdot j} | {β̂}_{j}, ŷ_{\cdot j}) = π (s_{\cdot j} | {β̂}_{j}, ŷ_{\cdot j}) π (d_{\cdot j} | s_{\cdot j}, {β̂}_{j})$
followed by an update to β̂_j|s_·j, d_·j. The marginal densities for the s_ij’s are independent truncated logistic distributions,
$s_{ij} | {β̂}_{j}, ŷ_{ij} \propto \begin{cases} Logistic ({x̂}_{i}^{'} {β̂}_{j} + log (λ_{j}) - C_{ij}, 1) I (s_{ij} > 0), & ŷ_{ij} = 1, \\ Logistic ({x̂}_{i}^{'} {β̂}_{j} + log (λ_{j}) - C_{ij}, 1) I (s_{ij} \leq 0), & \text{else}, \end{cases}$
where Logistic(a, b) denotes the density function of the logistic distribution with mean a and scale parameter b.
Sampling d_·j through rejection sampling. As advised by Holmes and Held (2006), we use the generalized inverse Gaussian distribution $g (d_{lj}) = GIG (0.5, 1, r^{2}) = \frac{r}{Inv-Gaussian (1, r)}$, where $r^{2} = {(s_{ij} - {x̂}_{i}^{'} {β̂}_{j} + C_{ij})}^{2}$, as the rejection sampling density. Following a draw from g(·), the sample is accepted with probability α(·),
$α (d_{lj}) = \frac{L (d_{lj}) π (d_{lj})}{Mg (d_{lj})},$
where $M \geq {sup}_{d_{lj}} \frac{L (d_{lj}) π (d_{lj})}{g (d_{lj})}$ , L(d_lj) denotes the likelihood, $L (d_{lj}) \propto d_{lj}^{- 1 / 2} exp (- 0.5 r^{2} / d_{lj})$ , and π(d_lj) is the prior,
$π (d_{lj}) = \frac{1}{4} d_{lj}^{- 1 / 2} KS (\frac{1}{2} d_{lj}^{1 / 2}),$
where KS(·) denotes the Kolmogorov–Smirnov density.
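Two of the draws used in these steps are easy to sketch in code: the truncated logistic draw for s_ij and the proposal draw for d_lj. The following is a hedged illustration, not the authors' implementation. The function names are hypothetical, `loc` stands for the location term of the logistic density, and the proposal relies on the standard fact that r/Z has the GIG(0.5, 1, r²) distribution when Z is inverse Gaussian with mean 1 and shape r. The accept/reject step involving the Kolmogorov–Smirnov density is deliberately omitted.

```python
import numpy as np

def sample_truncated_logistic(loc, positive, rng):
    """Inverse-CDF draw from Logistic(loc, 1) truncated to s > 0 or s <= 0."""
    f0 = 1.0 / (1.0 + np.exp(loc))  # CDF of Logistic(loc, 1) evaluated at 0
    if positive:
        u = rng.uniform(f0, 1.0)    # upper part of the CDF range: s > 0
    else:
        u = rng.uniform(0.0, f0)    # lower part of the CDF range: s <= 0
    # Logistic quantile function; boundary values of u are ignored here
    return loc + np.log(u) - np.log1p(-u)

def sample_gig_proposal(r, rng, size=None):
    """Proposal draw d ~ GIG(0.5, 1, r^2), generated as r / Z with
    Z ~ inverse Gaussian (Wald) with mean 1 and shape r, for r > 0."""
    z = rng.wald(1.0, r, size=size)
    return r / z

rng = np.random.default_rng(3)
s_pos = sample_truncated_logistic(loc=-1.0, positive=True, rng=rng)
d_prop = sample_gig_proposal(r=abs(s_pos + 1.0), rng=rng)  # r = |s - loc|
```

In a full sampler the proposed value would then be accepted or rejected using the acceptance probability given above, and the draw repeated until acceptance.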

Contributor Information

Hongxia Yang, Mathematical Sciences Department, Watson Research Center, IBM, Yorktown Heights, NY 10598 (yangho@us.ibm.com).

Sean O’Brien, Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27710 (obrie027@mc.duke.edu).

David B. Dunson, Department of Statistical Science, Duke University, Durham, NC 27708 (dunson@stat.duke.edu).

REFERENCES

Dunson D, Peddada S. Bayesian Nonparametric Inference on Stochastic Ordering. Biometrika. 2008;95:859–874. doi: 10.1093/biomet/asn043.
Dunson D, Pillai N, Park J. Bayesian Density Regression. Journal of the Royal Statistical Society, Ser. B. 2007;69:163–183.
Fraley C, Raftery AE. Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association. 2002;97:611–631.
Fraley C, Raftery AE, Wehrens R. mclust: Model-Based Cluster Analysis. R package version 2.1-11. 2005.
Holmes C, Held L. Bayesian Auxiliary Variable Models for Binary and Multinomial Regression. Bayesian Analysis. 2006;1:145–168.
Ishwaran H, James L. Gibbs Sampling Methods for Stick-Breaking Priors. Journal of the American Statistical Association. 2001;96:161–173.
Ishwaran H, Zarepour M. Exact and Approximate Sum-Representations for the Dirichlet Process. Canadian Journal of Statistics. 2002;30:269–283.
Jara A, Hanson T. A Class of Mixtures of Dependent Tail-Free Processes. Biometrika. 2011, to appear. doi: 10.1093/biomet/asq082.
Jasra A, Holmes CC, Stephens DA. Markov Chain Monte Carlo Methods and the Label Switching Problem in Bayesian Mixture Modelling. Statistical Science. 2005;20:50–67.
Jenkins K. Risk Adjustment for Congenital Heart Surgery: The RACHS-1 Method. Seminars in Thoracic and Cardiovascular Surgery: Pediatric Cardiac Surgery Annual. 2004;7:180–184. doi: 10.1053/j.pcsu.2004.02.009.
Karabatsos G, Walker S. Bayesian Nonparametric Inference of Stochastically Ordered Distributions, With Polya Trees and Bernstein Polynomials. Statistics and Probability Letters. 2007;77:907–913.
Lacour-Gayet F, Clarke D, Jacobs J, Comas J, Daebritz S, Daenen W, Gaynor W, Hamilton L, Jacobs M, Maruszewski B, Pozzi M, Spray T, Stellin G, Tchervenkov C, Mavroudis C, The Aristotle Committee. The Aristotle Score: A Complexity-Adjusted Method to Evaluate Surgical Results. European Journal of Cardio-Thoracic Surgery. 2004;25:911–924. doi: 10.1016/j.ejcts.2004.03.027.
Müller P, Quintana F, Rosner G. A Method for Combining Inference Across Related Nonparametric Bayesian Models. Journal of the Royal Statistical Society, Ser. B. 2004;66:735–749.
Rodriguez A, Dunson D, Gelfand A. The Nested Dirichlet Process. Journal of the American Statistical Association. 2008;103:1131–1154.
Stephens M. Dealing With Label Switching in Mixture Models. Journal of the Royal Statistical Society, Ser. B. 2000;62:795–809.
Teh Y, Jordan M, Beal M, Blei D. Hierarchical Dirichlet Processes. Journal of the American Statistical Association. 2006;101:1566–1581.
Yau C, Papaspiliopoulos O, Roberts G, Holmes C. Nonparametric Hidden Markov Models With Application to the Analysis of Copy-Number-Variation in Mammalian Genomes. Journal of the Royal Statistical Society, Ser. B. 2011;73:37–57. doi: 10.1111/j.1467-9868.2010.00756.x.

