Author manuscript; available in PMC: 2019 Nov 17.
Published in final edited form as: Stat Med. 2018 May 17:10.1002/sim.7697. doi: 10.1002/sim.7697

Clustering and Variable Selection in the Presence of Mixed Variable Types and Missing Data

C. B. Storlie, S. M. Myers, S. K. Katusic, A. L. Weaver, R. Voigt, P. E. Croarkin, R. E. Stoeckel, J. D. Port
PMCID: PMC6240391  NIHMSID: NIHMS976265  PMID: 29774571

Abstract

We consider the problem of model-based clustering in the presence of many correlated, mixed continuous and discrete variables, some of which may have missing values. Discrete variables are treated with a latent continuous variable approach and the Dirichlet process is used to construct a mixture model with an unknown number of components. Variable selection is also performed to identify the variables that are most influential for determining cluster membership. The work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder (ASD) on the basis of many cognitive and/or behavioral test scores. There are a modest number of patients (486) in the data set along with many (55) test score variables (many of which are discrete valued and/or missing). The goal of the work is to (i) cluster these patients into similar groups to help identify those with similar clinical presentation, and (ii) identify a sparse subset of tests that inform the clusters in order to eliminate unnecessary testing. The proposed approach compares very favorably to other methods via simulation of problems of this type. The results of the ASD analysis suggested three clusters to be most likely, while only four test scores had high (> 0.5) posterior probability of being informative. This will result in much more efficient and informative testing. The need to cluster observations on the basis of many correlated, continuous/discrete variables with missing values is a common problem in the health sciences as well as in many other disciplines.

Keywords: Model-Based Clustering, Dirichlet Process, Missing Data, Hierarchical Bayesian Modeling, Mixed Variable Types, Variable Selection

1 Introduction

Model-based clustering has become a very popular means for unsupervised learning1–4. This is due in part to the ability to use the model likelihood to inform not only the cluster membership, but also the number of clusters M, which has been a heavily researched problem for many years. The most widely used model-based approach is the normal mixture model, which is not suitable for mixed continuous/discrete variables. For example, this work is motivated by the need to cluster patients thought to potentially have autism spectrum disorder (ASD) on the basis of many correlated test scores. There are a modest number of patients (486) in the data set along with many (55) test score/self-report variables, many of which are discrete valued or have left or right boundaries. Figure 1 provides a look at the data across three of the variables; Beery_standard is discrete valued and ABC_irritability is continuous, but with significant mass at the left boundary of zero. The goals of this problem are to (i) cluster these patients into similar groups to help identify those with similar clinical presentation, and (ii) identify a sparse subset of tests that inform the clusters in an effort to eliminate redundant testing. This problem is also complicated by the fact that many patients in the data have missing test scores. The need to cluster incomplete observations on the basis of many correlated continuous/discrete variables is a common problem in the health sciences as well as in many other disciplines.

Figure 1. 3D scatter plot of three of the test score variables for potential ASD subjects.

When clustering in high dimensions, it becomes critically important to use some form of dimension reduction or variable selection to achieve accurate cluster formation. A common approach to deal with this is a principal components or factor approach5. However, such a solution does not address goal (ii) above for the ASD clustering problem. The problem of variable selection in regression or conditional density estimation has been well studied from both the L1 penalization6–8 and Bayesian perspectives9–11. However, variable selection in clustering is more challenging than that in regression as there is no response to guide (supervise) the selection. Still, there have been several articles considering this topic; see Fop and Murphy12 for a review. For example, Raftery and Dean13 propose a partition of the variables into informative (dependent on cluster membership even after conditioning on all of the other variables) and non-informative (conditionally independent of cluster membership given the values of the other variables). They use BIC to accomplish variable selection with a greedy search, which is implemented in the R package clustvarsel. Similar approaches are used by Maugis et al.14 and Fop et al.15. An efficient algorithm for identifying the optimal set of informative variables is provided by Marbac and Sedki16 and implemented in the R package VarSelLCM. Their approach also allows for mixed data types and missing data; however, it assumes both local and global independence (i.e., independence of variables within a cluster and unconditional independence of informative and non-informative variables, respectively). The popular LASSO or L1 type penalization has also been applied to shrink cluster means together for variable selection17–19. There have also been several approaches developed for sparse K-means and distance based clustering20–22.

In the Bayesian literature, Tadesse et al.4 consider variable selection in the finite normal mixture model using reversible jump (RJ) Markov chain Monte Carlo (MCMC)23. Kim et al.24 extend that work to the nonparametric Bayesian mixture model via the Dirichlet process model (DPM)25–28. The DPM has the advantage of allowing for a countably infinite number of possible components (thus making it nonparametric), while providing a posterior distribution for how many components have been observed in the data set at hand. Both Tadesse et al.4 and Kim et al.24 use a point mass prior to achieve a sparse representation of the informative variables. However, for simplicity they assume all non-informative variables are (unconditionally) independent of the informative variables. This assumption is frequently violated in practice and is particularly problematic in the case of the ASD analysis, as it would force far too many variables to be included in the informative set, as is demonstrated later in this paper.

There is no generally accepted best practice for clustering with mixed discrete and continuous variables. Hunt and Jorgensen29, Biernacki et al.30, and Murray and Reiter31 meld mixtures of independent multinomials for the categorical variables with mixtures of Gaussians for the continuous variables. However, it may not be desirable for the dependency between the discrete variables to be entirely represented by mixture components when clustering is the primary objective. As pointed out in Hennig and Liao32, mixture models can approximate any distribution arbitrarily well, so care must be taken to ensure the mixtures fall in line with the goals of clustering. When using mixtures of Gaussians combined with independent multinomials, a data set with many correlated discrete variables will tend to result in more clusters than a comparable data set with mostly continuous variables. A discrete measure of some quantity instead of the continuous version could therefore result in very different clusters. Thus, a Gaussian latent variable approach33–37 would seem more appropriate for treating discrete variables when clustering is the goal. An observed ordinal variable xj, for example, is assumed to be the result of thresholding a latent Gaussian variable zj. For binary variables, this reduces to the multivariate probit model38,39. There are also extensions of this approach to allow for unordered categorical variables.

In this paper, we propose a Bayesian nonparametric approach to perform simultaneous estimation of the number of clusters, cluster membership, and variable selection while explicitly accounting for discrete variables and partially observed data. The discrete variables as well as continuous variables with boundaries are treated with a Gaussian latent variable approach. The informative variable construct of Raftery and Dean13 for normal mixtures is then adopted. However, in order to effectively handle the missing values and account for uncertainty in the variable selection and number of clusters, the proposed model is cast in a fully Bayesian framework via the Dirichlet process. This is then similar to the work of Kim et al.24, however, they did not consider discrete variables or missing data. Further, a key result of this paper is a solution to allow for dependence between informative and non-informative variables in the nonparametric Bayesian mixture model. Thus, this work overcomes the assumption of (global) independence between informative and non-informative variables. Furthermore, by using the latent variable approach it also overcomes the (local) independence assumption among the informative/clustering variables often assumed when clustering data of mixed type12.

The solution takes a particularly simple form and also provides an intuitive means with which to define the prior distribution in a manner that decreases prior sensitivity. The component parameters are marginalized out to facilitate more efficient MCMC sampling via a modified version of the split-merge algorithm of Jain and Neal40. Finally, missing data is then handled in a principled manner by treating missing values as unknown parameters in the Bayesian framework41,42. This approach implicitly assumes a missing at random (MAR) mechanism43, which implies that the likelihood of a missing value can depend on the value of the unobserved variable(s), marginally, just not after conditioning on the observed variables.

The rest of the paper is laid out as follows. Section 2 describes the proposed nonparametric Bayesian approach to clustering observations of mixed discrete and continuous variables with variable selection. Section 3 evaluates the performance of this approach when compared to other methods on several simulation cases. The approach is then applied to the problem for which it was designed in Section 4 where a comprehensive analysis of the ASD problem is presented. Section 5 concludes the paper. This paper also has supplementary material which contains derivations, full exposition of the proposed MCMC algorithm, and MCMC trace plots.

2 Methodology

2.1 Dirichlet Process Mixture Models

As discussed above, the proposed model for clustering uses mixture distributions with a countably infinite number of components via the Dirichlet process prior25,44,45. Let y = (y1, . . . , yp) be a p-variate random vector and let yi, i = 1, . . . , n, denote the ith observation of y. It is assumed that yi are independent random vectors coming from distribution F (θi). The model parameters θi are assumed to come from a mixing distribution G which has a Dirichlet process prior, i.e., the familiar model,

$$y_i \mid \theta_i \sim F(\theta_i), \qquad \theta_i \sim G, \qquad G \sim \mathrm{DP}(G_0, \alpha), \tag{1}$$

where DP represents a Dirichlet Process distribution, G0 is the base distribution and α is a precision parameter, determining the concentration of the prior for G about G0 44. The prior distribution for θi in terms of successive conditional distributions is obtained by integrating over G, i.e.,

$$\theta_i \mid \theta_1, \ldots, \theta_{i-1} \sim \frac{1}{i-1+\alpha}\sum_{i'=1}^{i-1}\delta(\theta_{i'}) + \frac{\alpha}{i-1+\alpha}\, G_0, \tag{2}$$

where δ(θ) is a point mass distribution at θ. The representation in (2) makes it clear that (1) can be viewed as a countably infinite mixture model. Alternatively, let Ω = [ω1, ω2, . . . ] denote the unique values of the θi and let ϕi be the index for the component to which observation i belongs, i.e., so that ωϕi = θi. The following model26 is equivalent to (2)

$$P(\phi_i = m \mid \phi_1,\ldots,\phi_{i-1}) = \begin{cases} 1 & \text{if } i=1 \text{ and } m=1,\\ \dfrac{n_{i,m}}{i-1+\alpha} & \text{if } \phi_{i'}=m \text{ for some } i'<i,\\ \dfrac{\alpha}{i-1+\alpha} & \text{if } m=\max(\phi_1,\ldots,\phi_{i-1})+1,\\ 0 & \text{otherwise,} \end{cases} \tag{3}$$

with yi | ϕi, Ω ~ F(ω_{ϕi}), ωm ~ G0, and n_{i,m} the number of observations i′ < i with ϕ_{i′} = m. Thus, a new observation i is allocated to an existing cluster with probability proportional to the cluster size, or it is assigned to a new cluster with probability proportional to α. This is often called the Chinese restaurant representation of the Dirichlet process. It is common to assume that F is a normal distribution, in which case ωm = (μm, Σm) describes the mean and covariance of the mth component. This results in a normal mixture model with a countably infinite number of components.
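To make the Chinese restaurant representation in (3) concrete, the short R sketch below (illustrative only; the function name rcrp and the chosen values are ours, not from the authors' code) draws a sequence of cluster labels using exactly the allocation probabilities above.

```r
# Minimal sketch of the Chinese restaurant process in (3):
# sequentially allocate n observations to clusters given concentration alpha.
rcrp <- function(n, alpha) {
  phi <- integer(n)
  phi[1] <- 1                                   # first observation starts cluster 1
  for (i in 2:n) {
    counts <- tabulate(phi[1:(i - 1)])          # n_{i,m}: sizes of existing clusters
    probs  <- c(counts, alpha) / (i - 1 + alpha) # existing clusters vs. a new cluster
    phi[i] <- sample(seq_along(probs), 1, prob = probs)
  }
  phi
}

set.seed(1)
table(rcrp(100, alpha = 2))   # cluster sizes for one draw with alpha = 2
```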

2.2 Discrete Variables and Boundaries/Censoring

Normal mixture models are not effective for clustering when some of the variables are too discretized as demonstrated in Section 3. This is also a problem when the data have left or right boundaries that can be achieved (e.g., several people score the minimum or maximum on a test). However, a Gaussian latent variable approach can be used to circumvent these issues. Suppose that variables yj for j ∈ 𝒟 are discrete, ordinal variables taking on possible values dj = {dj,1, . . . , dj,Lj} and that yj for j ∈ 𝒞 = 𝒟c are continuous variables with lower and upper limits of bj and cj, which could be infinite. Assume for some latent, p-variate, continuous random vector z that

$$y_j = \begin{cases} \displaystyle\sum_{l=1}^{L_j} d_{j,l}\, I_{\{a_{j,l-1} < z_j \le a_{j,l}\}} & \text{for } j \in \mathcal{D},\\[4pt] z_j\, I_{\{b_j \le z_j \le c_j\}} + b_j\, I_{\{z_j < b_j\}} + c_j\, I_{\{z_j > c_j\}} & \text{for } j \in \mathcal{C}, \end{cases} \tag{4}$$

where IA is the indicator function equal to 1 if A and 0 otherwise, aj,0 = −∞, aj,Lj= ∞, and aj,l = dj,l for l = 1, . . . , Lj − 1. That is, the discrete yj are the result of thresholding the latent variable zj on the respective cut-points. The continuous yj variables are simply equal to the zj unless the zj cross the left or right boundary of what can be observed for yj. That is, if there are finite limits for yj, then yj is assumed to be a left and/or right censored version of zj, thus producing a positive mass at the boundary values of yj.
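As a small illustration of the mapping in (4), the following R sketch (hypothetical helper names; the interior cut-points here are chosen arbitrarily for the example, whereas the paper sets a_{j,l} = d_{j,l}) converts latent Gaussian draws into an ordinal score and into a censored continuous score.

```r
# Sketch of the mapping in (4) from a latent Gaussian z_j to the observed y_j.
# Ordinal case: threshold z at cut-points a (with a_0 = -Inf and a_L = Inf).
latent_to_ordinal <- function(z, d, a) {
  # d = possible ordinal values d_{j,1..L}; a = interior cut-points (length L - 1)
  cuts <- c(-Inf, a, Inf)
  d[findInterval(z, cuts, left.open = TRUE)]
}

# Censored continuous case: y is z clipped to the observable range [b, c].
latent_to_censored <- function(z, b = -Inf, c = Inf) pmin(pmax(z, b), c)

z <- rnorm(10)
latent_to_ordinal(z, d = 1:5, a = c(-1.5, -0.5, 0.5, 1.5))  # 5-level ordinal score
latent_to_censored(z, b = -1.4)                             # left censored at -1.4
```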

A joint mixture model for mixed discrete and continuous variables is then,

$$z_i \mid \phi_i, \Omega \sim \mathcal{N}(\mu_{\phi_i}, \Sigma_{\phi_i}), \tag{5}$$

with prior distributions for ωm and ϕ = [ϕ1, . . . , ϕn]′ as in (3).

Binary yj such as gender can be accommodated by setting dj = {0, 1}. However, if there is only one cut-point then the model must be restricted for identifiability39; namely, if yj is binary, then we must set Σm(j, j) = 1. The restriction that Σm(j, j) = 1 for binary yj complicates posterior inference, however, this problem has been relatively well studied in the multinomial probit setting and various proposed solutions exist46. It is also straight-forward to use the latent Gaussian variable approach to allow for unordered categorical variables47,46,48,49, however, inclusion of categorical variables also complicates notation and there are no such variables in the ASD data. For brevity, attention is restricted here to continuous and ordinal discrete variables.

2.3 Variable Selection

Variable selection in clustering problems is more challenging than in regression problems due to the lack of targeted information with which to guide the selection. Using model-based clustering allows a likelihood based approach to model selection, but exactly how the parameter space should be restricted when a variable is "out of the model" requires some care. Raftery and Dean13 defined a variable yj to be non-informative if, conditional on the values of the other variables, it is independent of cluster membership. This implies that a non-informative yj may still be quite dependent on cluster membership through its dependency with other variables. They assumed a Gaussian mixture distribution for the informative variables, with a conditional Gaussian distribution for the non-informative variables, and used maximum likelihood to obtain the change in BIC between candidate models. Thus, they accomplished variable selection with a greedy search to minimize BIC. They further considered restricted covariance parameterizations to reduce the parameter dimensionality (e.g., diagonal, common volume, common shape, common orientation, etc.). We instead take a Bayesian approach to this problem via Stochastic Search Variable Selection (SSVS)9,50, as this allows for straightforward treatment of uncertainty in the selected variables and that due to missing values. Kim et al.24 used such an approach with a DPM for infinite normal mixtures; however, due to the difficulty this imposes, they did not use the same definition as Raftery and Dean13 for a non-informative variable. They defined a non-informative variable to be one that is (unconditionally) independent of cluster membership and all other variables. This is not reasonable in many cases, particularly in the ASD problem, and can result in negative consequences as seen in Section 3. Below, we lay out a more flexible model specification, akin to that taken in Raftery and Dean13, to allow for (global) dependence between informative and non-informative variables in a DPM.

Let the informative variables be represented by the model γ, a vector of binary values such that {yj : γj = 1} is the set of informative variables. A priori it is assumed that Pr(γj = 1) = ρj. Without loss of generality assume that y has elements ordered such that y = [y(1), y(2)], with y(1) = {yj : γj = 1} and y(2) = {yj : γj = 0}, and similarly for z(1) and z(2). The model in (5) becomes,

$$z_i \mid \gamma, \phi_i, \Omega \sim \mathcal{N}(\mu_{\phi_i}, \Sigma_{\phi_i}), \tag{6}$$

with

$$\mu_m = \begin{pmatrix} \mu_m^{(1)} \\ \mu_m^{(2)} \end{pmatrix}, \qquad \Sigma_m = \begin{pmatrix} \Sigma_m^{(11)} & \Sigma_m^{(12)} \\ \Sigma_m^{(21)} & \Sigma_m^{(22)} \end{pmatrix}. \tag{7}$$

From standard multivariate normal theory, [z^(2) | z^(1), ϕ = m] ~ 𝒩(μ_{2|1}, Σ_{2|1}) with μ_{2|1} = μ_m^(2) + Σ_m^(21)(Σ_m^(11))^{-1}(z^(1) − μ_m^(1)) and Σ_{2|1} = Σ_m^(22) − Σ_m^(21)(Σ_m^(11))^{-1}Σ_m^(12). Now in order for the non-informative variables to follow the definition of Raftery and Dean13, the μm and Σm must be parameterized so that (μ_{2|1}, Σ_{2|1}) do not depend on m. In order to accomplish this, it is helpful to make use of the canonical parameterization of the Gaussian51,

$$z \mid \gamma, \Omega, \phi = m \sim \mathcal{N}_C(b_m, Q_m),$$

with precision Q_m = Σ_m^{-1} and b_m = Q_m μ_m. Partition the canonical parameters as,

$$b_m = \begin{pmatrix} b_m^{(1)} \\ b^{(2)} \end{pmatrix}, \qquad Q_m = \begin{pmatrix} Q_m^{(11)} & Q^{(12)} \\ Q^{(21)} & Q^{(22)} \end{pmatrix}. \tag{8}$$

Result 1

The parameterization in (8) results in (μ2|1, Σ2|1) that does not depend on m.

Proof

The inverse of a partitioned matrix directly implies that Σ_{2|1} = (Q^(22))^{-1}, which does not depend on m. It also implies that −(Q^(22))^{-1}Q^(21) = Σ_m^(21)(Σ_m^(11))^{-1}, and substituting Σ_m b_m for μ_m in μ_{2|1} gives μ_{2|1} = (Q^(22))^{-1}(b^(2) − Q^(21)z^(1)), which also does not depend on m.

Because Q^(21) does not depend on m, the dependency structure between z^(1) and z^(2) is the same across the mixture components. This is a necessary assumption in order for the z^(2) to be non-informative variables, i.e., so that cluster membership conditional on z^(1) is independent of z^(2).
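The practical content of Result 1 can be checked numerically. The R sketch below (illustrative names; constants are chosen only to keep the assembled precision matrices positive definite) builds two components that share (b^(2), Q^(21), Q^(22)) but differ in (b_m^(1), Q_m^(11)), converts each to moment form, and confirms that μ_{2|1} and Σ_{2|1} agree.

```r
# Numerical check of Result 1: two components share (b2, Q21, Q22) but have
# their own (bm1, Qm11); the conditional moments of z2 | z1 agree anyway.
set.seed(2)
p1 <- 2; p2 <- 3
rand_pd <- function(d) { A <- matrix(rnorm(d * d), d); crossprod(A) + d * diag(d) }

Q22 <- rand_pd(p2)                        # shared blocks (no m subscript)
Q21 <- matrix(rnorm(p2 * p1), p2, p1)
b2  <- rnorm(p2)

cond_moments <- function(Qm11, bm1, z1) {
  Qm <- rbind(cbind(Qm11, t(Q21)), cbind(Q21, Q22))   # assemble precision as in (8)
  Sm <- solve(Qm)                                     # Sigma_m
  mu <- Sm %*% c(bm1, b2)                             # mu_m = Sigma_m b_m
  S11 <- Sm[1:p1, 1:p1]
  S21 <- Sm[(p1 + 1):(p1 + p2), 1:p1]
  S22 <- Sm[(p1 + 1):(p1 + p2), (p1 + 1):(p1 + p2)]
  list(mu21    = mu[(p1 + 1):(p1 + p2)] + S21 %*% solve(S11, z1 - mu[1:p1]),
       Sigma21 = S22 - S21 %*% solve(S11, t(S21)))
}

z1 <- rnorm(p1)
m1 <- cond_moments(rand_pd(p1) + 20 * diag(p1), rnorm(p1), z1)   # component 1
m2 <- cond_moments(rand_pd(p1) + 30 * diag(p1), rnorm(p1), z1)   # component 2
all.equal(m1, m2)   # TRUE (up to numerical error), as Result 1 states
```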

Now the problem reduces to defining a prior distribution for Ω, i.e., ω_m = {b_m, Q_m}, m = 1, 2, . . . , conditional on the model γ, that maintains the form of (8). Let ω_m^(1) = {b_m^(1), Q_m^(11)} and ω_m^(2) = ω^(2) = {b^(2), Q^(21), Q^(22)}. The prior distribution for Ω will be defined first unconditionally for ω^(2) and then for ω_m^(1), m = 1, 2, . . . , conditional on ω^(2). There are several considerations in defining these distributions: (i) the resulting Q_m must be positive definite, (ii) it is desirable for the marginal distribution of (μ_m, Σ_m) to remain unchanged for any model γ, to limit the influence of the prior for ω_m on variable selection, and (iii) it is desirable for the priors to be conjugate to facilitate MCMC sampling26,40.

Let Ψ be a p × p positive definite matrix, partitioned just as Qm, and for a given γ assume the following distribution for ω(2),

$$Q^{(22)} \sim \mathcal{W}\bigl(\Psi_{2|1}^{-1},\, \eta\bigr), \qquad b^{(2)} \mid Q^{(22)} \sim \mathcal{N}\Bigl(0,\, \tfrac{1}{\lambda}Q^{(22)}\Bigr), \qquad Q^{(21)} \mid Q^{(22)} \sim \mathcal{MN}\Bigl(-Q^{(22)}\Psi^{(21)}\bigl(\Psi^{(11)}\bigr)^{-1},\, Q^{(22)},\, \bigl(\Psi^{(11)}\bigr)^{-1}\Bigr), \tag{9}$$

where 𝒲 denotes the Wishart distribution, ℳ𝒩 denotes the matrix normal distribution, and Ψ_{2|1} = Ψ^(22) − Ψ^(21)(Ψ^(11))^{-1}Ψ^(12).

The distribution of ωm(1), conditional on ω(2) is defined implicitly below. A prior distribution is not placed on (bm1, Qm11), directly. It is helpful to reparameterize from (b2, Q22, Q21, bm1, Qm11) to (b2, Q22, Q21, μm1, Σm11). By doing this, independent priors can be placed on (b2, Q22, Q21) and (μm1, Σm11) and still maintain all of the desired properties as will be seen in Results 2 and 3.

The prior distribution of (μm1, Σm11) is

$$\Sigma_m^{(11)} \overset{iid}{\sim} \mathcal{W}^{-1}\bigl(\Psi^{(11)},\, \eta - p_2\bigr), \qquad \mu_m^{(1)} \mid \Sigma_m^{(11)} \overset{ind}{\sim} \mathcal{N}\Bigl(0,\, \tfrac{1}{\lambda}\Sigma_m^{(11)}\Bigr), \tag{10}$$

where 𝒲^{-1} denotes the inverse-Wishart distribution and (μ_m^(1), Σ_m^(11)) are independent of ω^(2). The resulting distribution of (b_m^(1), Q_m^(11)) conditional on (b^(2), Q^(22), Q^(21)) is not a common or named distribution, but it is well defined via the relations
$$b_m^{(1)} = \bigl(\Sigma_m^{(11)}\bigr)^{-1}\mu_m^{(1)} + Q^{(12)}\bigl(Q^{(22)}\bigr)^{-1}b^{(2)}, \qquad Q_m^{(11)} = \bigl(\Sigma_m^{(11)}\bigr)^{-1} + Q^{(12)}\bigl(Q^{(22)}\bigr)^{-1}Q^{(21)}.$$
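To illustrate how a single draw of the component parameters is assembled from the two pieces, the following R sketch samples ω^(2) from (9) and (μ_m^(1), Σ_m^(11)) from (10), then applies the relations above. It is only a sketch: the helper functions riwish and rmatnorm, the hyper-parameter values, and the particular Wishart/inverse-Wishart parameterization conventions are ours, not the authors' implementation.

```r
# Sketch: draw one set of component parameters under (9)-(10) and assemble
# (bm1, Qm11).  Wishart/inverse-Wishart conventions vary across references;
# one common convention is used here purely for illustration.
set.seed(3)
p1 <- 2; p2 <- 3; p <- p1 + p2
lambda <- 1; eta <- p + 4
Psi <- diag(p) / (p + 2)                         # default scale from Section 2.4
Psi11 <- Psi[1:p1, 1:p1]; Psi21 <- Psi[(p1 + 1):p, 1:p1]
Psi22 <- Psi[(p1 + 1):p, (p1 + 1):p]
Psi2.1 <- Psi22 - Psi21 %*% solve(Psi11, t(Psi21))

riwish   <- function(nu, S) solve(rWishart(1, nu, solve(S))[, , 1])
rmatnorm <- function(M, U, V) M + t(chol(U)) %*%
  matrix(rnorm(nrow(M) * ncol(M)), nrow(M)) %*% chol(V)

## omega^(2): shared across components, eq. (9)
Q22 <- rWishart(1, eta, solve(Psi2.1))[, , 1]
b2  <- as.vector(t(chol(Q22 / lambda)) %*% rnorm(p2))        # N(0, Q22 / lambda)
Q21 <- rmatnorm(-Q22 %*% Psi21 %*% solve(Psi11), Q22, solve(Psi11))

## (mu_m^(1), Sigma_m^(11)): component specific, eq. (10)
Sigma_m11 <- riwish(eta - p2, Psi11)
mu_m1     <- as.vector(t(chol(Sigma_m11 / lambda)) %*% rnorm(p1))

## assemble the canonical parameters for component m via the relations above
b_m1  <- solve(Sigma_m11, mu_m1) + t(Q21) %*% solve(Q22, b2)
Q_m11 <- solve(Sigma_m11) + t(Q21) %*% solve(Q22, Q21)
```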

Result 2

The prior distribution defined in (9) and (10) results in a marginal distribution for (μ_m, Σ_m) of 𝒩ℐ𝒲(0, λ, Ψ, η), i.e., the same normal-inverse-Wishart regardless of γ.

Proof

It follows from Theorem 3 of Bodnar and Okhrin52 that Σm ~ ℐ 𝒲 (η, Ψ). It remains to show μm | Σm ~ 𝒩 (0, (1/λ) Σm). However, according to (9) and (10) and the independence assumption,

$$\begin{pmatrix} \mu_m^{(1)} \\ b^{(2)} \end{pmatrix} \Bigm| \Sigma_m \;\sim\; \mathcal{N}\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix},\; \frac{1}{\lambda}\begin{pmatrix} \Sigma_m^{(11)} & 0 \\ 0 & Q^{(22)} \end{pmatrix}\right).$$

Also, bm = Qmμm implies,

$$\begin{pmatrix} \mu_m^{(1)} \\ \mu_m^{(2)} \end{pmatrix} = \begin{pmatrix} I & 0 \\ -\bigl(Q^{(22)}\bigr)^{-1}Q^{(21)} & \bigl(Q^{(22)}\bigr)^{-1} \end{pmatrix} \begin{pmatrix} \mu_m^{(1)} \\ b^{(2)} \end{pmatrix}.$$

Using the relation Ax ~ 𝒩(Aμ, AΣA′) for x ~ 𝒩(μ, Σ) gives the desired result.

As mentioned above, the normal-inverse-Wishart distribution is conjugate for ω_m in the unrestricted (no variable selection) setting. It turns out that the distribution defined in (9) and (10) is conjugate for the parameterization in (8) as well, so that the component parameters can be integrated out of the likelihood. Let the (latent) observations be denoted as Z = [z_1, . . . , z_n], and the data likelihood as f(Z | γ, ϕ, Ω).

Result 3

The marginal likelihood of Z is given by

$$f(Z \mid \gamma, \phi) = \int f(Z \mid \gamma, \phi, \Omega)\, f(\Omega \mid \gamma)\, d\Omega = \pi^{-\frac{np}{2}} \left\{ \prod_{m=1}^{M} \left[ \left(\frac{\lambda}{n_m+\lambda}\right)^{\frac{p_1}{2}} \frac{\bigl|\Psi^{(11)}\bigr|^{\frac{\eta-p_2}{2}}\, \Gamma_{p_1}\!\left(\frac{n_m+\eta-p_2}{2}\right)}{\bigl|V_m^{(11)}\bigr|^{\frac{n_m+\eta-p_2}{2}}\, \Gamma_{p_1}\!\left(\frac{\eta-p_2}{2}\right)} \right] \right\} \left[ \left(\frac{\lambda}{n+\lambda}\right)^{\frac{p_2}{2}} \frac{\bigl|\Psi^{(11)}\bigr|^{\frac{p_2}{2}}\, \bigl|\Psi_{2|1}\bigr|^{\frac{\eta}{2}}\, \Gamma_{p_2}\!\left(\frac{n+\eta}{2}\right)}{\bigl|V^{(11)}\bigr|^{\frac{p_2}{2}}\, \bigl|V_{2|1}\bigr|^{\frac{n+\eta}{2}}\, \Gamma_{p_2}\!\left(\frac{\eta}{2}\right)} \right],$$

where (i) M = max(ϕ), i.e., the number of observed components, (ii) p_1 = Σ_j γ_j is the number of informative variables, (iii) p_2 = p − p_1, (iv) n_m is the number of observations with ϕ_i = m, (v) Γ_p(·) is the multivariate gamma function, and (vi) V_m^(11), V^(11), and V_{2|1} are defined as,

$$V_m^{(11)} = \sum_{\phi_i=m} \bigl(z_i^{(1)}-\bar z_m^{(1)}\bigr)\bigl(z_i^{(1)}-\bar z_m^{(1)}\bigr)' + \frac{n_m\lambda}{n_m+\lambda}\,\bar z_m^{(1)}\bar z_m^{(1)\prime} + \Psi^{(11)}, \qquad V^{(11)} = \sum_{i=1}^{n} \bigl(z_i^{(1)}-\bar z^{(1)}\bigr)\bigl(z_i^{(1)}-\bar z^{(1)}\bigr)' + \frac{n\lambda}{n+\lambda}\,\bar z^{(1)}\bar z^{(1)\prime} + \Psi^{(11)}, \qquad V_{2|1} = V^{(22)} - V^{(21)}\bigl(V^{(11)}\bigr)^{-1} V^{(12)},$$

with

$$\bar z_m^{(1)} = \frac{1}{n_m}\sum_{\phi_i=m} z_i^{(1)}, \qquad \bar z^{(1)} = \frac{1}{n}\sum_{i=1}^{n} z_i^{(1)}, \qquad \bar z^{(2)} = \frac{1}{n}\sum_{i=1}^{n} z_i^{(2)},$$

$$V^{(22)} = \sum_{i=1}^{n}\bigl(z_i^{(2)}-\bar z^{(2)}\bigr)\bigl(z_i^{(2)}-\bar z^{(2)}\bigr)' + \frac{n\lambda}{n+\lambda}\,\bar z^{(2)}\bar z^{(2)\prime} + \Psi^{(22)}, \quad \text{and} \quad V^{(21)} = \sum_{i=1}^{n}\bigl(z_i^{(2)}-\bar z^{(2)}\bigr)\bigl(z_i^{(1)}-\bar z^{(1)}\bigr)' + \frac{n\lambda}{n+\lambda}\,\bar z^{(2)}\bar z^{(1)\prime} + \Psi^{(21)}. \tag{11}$$

The derivation of Result 3 is provided in Web Appendix B.
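As a sketch of the quantities entering Result 3, the R function below (illustrative only, not the authors' code) computes V_m^(11), V^(11), and V_{2|1} from a complete latent data matrix Z, cluster labels ϕ, a model γ, and the hyper-parameters λ and Ψ.

```r
# Sketch: compute the scatter-type matrices in (11) given latent data Z (n x p),
# cluster labels phi, a logical model vector gamma (length p), lambda, and Psi.
V_matrices <- function(Z, phi, gamma, lambda, Psi) {
  Z1 <- Z[, gamma, drop = FALSE]; Z2 <- Z[, !gamma, drop = FALSE]
  n  <- nrow(Z)
  Psi11 <- Psi[gamma,  gamma,  drop = FALSE]
  Psi22 <- Psi[!gamma, !gamma, drop = FALSE]
  Psi21 <- Psi[!gamma, gamma,  drop = FALSE]

  scatter <- function(X) crossprod(sweep(X, 2, colMeans(X)))  # sum (x - xbar)(x - xbar)'

  zbar1 <- colMeans(Z1); zbar2 <- colMeans(Z2)
  V11  <- scatter(Z1) + n * lambda / (n + lambda) * tcrossprod(zbar1) + Psi11
  V22  <- scatter(Z2) + n * lambda / (n + lambda) * tcrossprod(zbar2) + Psi22
  V21  <- crossprod(sweep(Z2, 2, zbar2), sweep(Z1, 2, zbar1)) +
          n * lambda / (n + lambda) * tcrossprod(zbar2, zbar1) + Psi21
  V2.1 <- V22 - V21 %*% solve(V11, t(V21))

  # component-specific V_m^(11), one per observed cluster
  Vm11 <- lapply(sort(unique(phi)), function(m) {
    Zm <- Z1[phi == m, , drop = FALSE]; nm <- nrow(Zm); zm <- colMeans(Zm)
    scatter(Zm) + nm * lambda / (nm + lambda) * tcrossprod(zm) + Psi11
  })
  list(Vm11 = Vm11, V11 = V11, V2.1 = V2.1)
}
```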

2.4 Hyper-Prior Distributions

Kim et al.24 found there to be a lot of prior sensitivity due to the choice of prior for the component parameters. This is in part due to the separate prior specification for the parameters corresponding to informative and non-informative variables, respectively. The specification above treats all component parameters collectively, in a single prior, so that the choice will not be sensitive to the interplay between the priors chosen for informative and non-informative variables. A further stabilization can be obtained by rationale similar to that used in Raftery and Dean13 for restricted forms of the covariance (such as equal shape, orientation, etc.). We do not enforce such restrictions exactly, but one might expect the components to have similar covariances or similar means for some of the components. Thus it makes sense to put hierarchical priors on λ, Ψ, and η, to encourage such similarity if warranted by the data. A Gamma prior is also placed on the concentration parameter α, i.e.,

$$\lambda \sim \mathrm{Gamma}(A_\lambda, B_\lambda), \qquad \Psi \sim \mathcal{W}(P, N), \qquad \eta - (p+1) \sim \mathrm{Gamma}(A_\eta, B_\eta), \qquad \alpha \sim \mathrm{Gamma}(A_\alpha, B_\alpha). \tag{12}$$

In the analyses below, relatively vague priors were used with Aλ = Bλ = Aη = Bη = 2. The prior for α was set to Aα = 2, Bα = 2, to encourage anywhere from 1 to 15 clusters from 100 observations. The results still have some sensitivity to the choice of P. In addition, there are some drawbacks to Wishart priors which can be exaggerated when applied to variables of differing scale53,54. In order to alleviate these issues, we recommend first standardizing the columns of the data to mean zero and unit variance, then using N = p + 2, P = (1/N)I. Finally, the prior probability for variable inclusion was set to ρj =0.5 for all j. The data model in (4) and (6), the component prior distribution in (9) and (10), along with the hyper-priors in (12), completes the model specification.
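A minimal sketch of this recommended default setup, with illustrative object names (raw_data and hyper are placeholders, not identifiers from the authors' code):

```r
# Sketch of the default hyper-prior settings of Section 2.4 (illustrative).
Y <- scale(raw_data)               # standardize columns to mean zero, unit variance
p <- ncol(Y)
hyper <- list(
  A_lambda = 2, B_lambda = 2,      # lambda ~ Gamma(2, 2)
  A_eta    = 2, B_eta    = 2,      # eta - (p + 1) ~ Gamma(2, 2)
  A_alpha  = 2, B_alpha  = 2,      # alpha ~ Gamma(2, 2); ~1-15 clusters per 100 obs
  N        = p + 2,                # Wishart degrees of freedom for Psi
  P        = diag(p) / (p + 2),    # Wishart scale: P = (1/N) I
  rho      = rep(0.5, p)           # prior inclusion probability for each variable
)
```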

2.5 MCMC Algorithm

Complete MCMC details are provided in Web Appendix C. However, an overview is provided here to illustrate the main idea. The complete list of parameters to be sampled in the MCMC is Θ = {γ, ϕ, λ, η, Ψ, α, Z̃}, where Z̃ contains the latent elements of Z (i.e., those corresponding to missing data, discrete variables, or boundary/censored observations). The only update that depends on the raw observed data Y = [y_1, . . . , y_n] is the update of Z̃. All other parameters, when conditioned on Y and Z̃, only depend on Z. The Z̃ are block updated, each with a Metropolis-Hastings (MH) step, but with a proposal that looks almost conjugate and is therefore accepted with high probability; the block size can be adjusted to trade off between acceptance and speed (e.g., acceptance ~40%). A similar strategy is taken with the Ψ update, i.e., a nearly conjugate update is proposed and accepted/rejected via an MH step. Because the component parameters are integrated out, the ϕ_i can be updated with simple Gibbs sampling26; however, this approach has known mixing issues40,55. Thus, a modified split-merge algorithm40, similar to that used in Kim et al.24, was developed to sample from the posterior distribution of ϕ. The remaining parameters are updated in a hybrid Gibbs/MH fashion. The γ vector is updated with MH by proposing an add, delete, or swap move50. The λ, η, α parameters have standard MH random walk updates on the log scale. The MCMC routine then consists of applying each of the above updates in turn to complete a single MCMC iteration, with the exception that the γ update is applied Lg times each iteration.

Two modifications were also made to the above strategy to improve mixing. The algorithm above would at times have trouble breaking away from local modes when proposing ϕ and γ updates separately. Thus, an additional joint update of ϕ and γ is proposed each iteration, which substantially improved the chance of accepting a move. Also, as described in more detail in Web Appendix C, the traditional split-merge algorithm proposes an update by first selecting two points, i and i′, at random. If they are from the same cluster (according to the current ϕ), it assigns them to separate clusters and assigns the remaining points from that cluster to each of the two new clusters at random. It then conducts several (L) restricted (to one of the two clusters) Gibbs sampling updates of the remaining ϕ_h from the original cluster. The resulting ϕ* then becomes the proposal for a split move. We found that the following adjustment resulted in better acceptance of split/merge moves: instead of assigning the remaining points to the two clusters at random, simply assign them to the closer of the two observations i and i′, then conduct L restricted Gibbs updates to produce the proposal. We found little performance gain beyond L = 3. Lastly, it would be possible to instead use a finite mixture approximation via the stick-breaking representation of a DPM56,55. However, this approach would be complicated by the dependency between γ and the structure and dimensionality of the component parameters. This issue is entirely avoided with the proposed approach as the component parameters are integrated out. The code to perform the MCMC for this model has been made available in a GitHub repository at https://github.com/cbstorlie/DPM-vs.git.
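The modification to the split proposal amounts to a simple distance-based launch state before the L restricted Gibbs scans. A schematic R sketch of just that assignment step (illustrative names; squared Euclidean distance on the latent scale is our assumption):

```r
# Sketch of the modified launch state for a split proposal: points from the
# selected cluster are first assigned to whichever of the two anchor
# observations (i or i') is closer, before the L restricted Gibbs scans.
launch_split <- function(Z, members, i, i_prime) {
  d_i  <- rowSums(sweep(Z[members, , drop = FALSE], 2, Z[i, ])^2)
  d_ip <- rowSums(sweep(Z[members, , drop = FALSE], 2, Z[i_prime, ])^2)
  assign <- ifelse(d_i <= d_ip, 1L, 2L)       # 1 = cluster of i, 2 = cluster of i'
  assign[members == i]       <- 1L            # anchors stay in their own clusters
  assign[members == i_prime] <- 2L
  assign
}
```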

2.6 Inference for ϕ and γ

The estimated cluster membership ϕ̂ for all of the methods was taken to be the respective mode of the estimated cluster membership probabilities. For the DPM methods, the cluster membership probability matrix P (which is an n × ∞ matrix in principle) is not sampled in the MCMC, and is not identified due to many symmetric modes (thus there can be label switching in the posterior samples). However, the information theoretic approach of Stephens57 (applied to the DPM in Fu et al.58) can be used to address this issue and relabel the posterior samples of ϕ to provide an estimate P̂ of P. The (i, m)th entry of P̂ can be thought of as the proportion of the relabeled posterior samples of ϕ_i that have the value m. While technically P is an n × ∞ matrix, all columns after M* have zero entries in P̂, where M* is the maximum number of clusters observed in the posterior. For the results below, the point estimate γ̂ is determined by γ̂_j = 1 if the posterior probability Pr(γ_j = 1 | Y) > ρ_j = 0.5, and γ̂_j = 0 otherwise.
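Given a relabeled membership probability matrix and posterior samples of γ, these point estimates reduce to simple summaries. A small R sketch with illustrative names (P_hat and gamma_samples are placeholders):

```r
# Sketch: point estimates of cluster membership and the informative-variable set.
# P_hat:          n x M* matrix of relabeled posterior membership probabilities
# gamma_samples:  S x p matrix of 0/1 posterior draws of gamma
phi_hat   <- max.col(P_hat, ties.method = "first")   # row-wise posterior mode
incl_prob <- colMeans(gamma_samples)                 # posterior Pr(gamma_j = 1)
gamma_hat <- as.integer(incl_prob > 0.5)             # threshold at rho_j = 0.5
```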

3 Simulation Results

In this section the performance of the proposed approach for clustering is evaluated on two simulation cases similar in nature to the ASD clustering problem. Each of the cases is examined (i) without missing data or discrete variables/censoring, (ii) with missing data, but no discrete variables/censoring, (iii) with missing data and several discrete and/or censored variables.

The approaches to be compared are listed below.

DPM-vs – the proposed method.
DPM-cont – the proposed method without accounting for discrete variables/censoring (i.e., assuming all continuous variables).
DPM – the proposed method with variable selection turned off (i.e., a prior inclusion probability of ρj = 1).
DPM-ind – the approach of Kim et al.24 when all variables are continuous (i.e., assuming non-informative variables are independent of the rest), but modified to treat discrete variables/censoring and missing data, when applicable, just as the proposed approach does.
Mclust-vs – the approach of Raftery and Dean13 implemented with the clustvarsel package in R. When there are missing data, Random Forest imputation59, implemented with the missForest package in R, is used prior to application of clustvarsel. The Mclust-vs approach does not treat discrete variables differently; it treats all variables as continuous and uncensored.
VarSelLCM – the approach of Marbac and Sedki16 implemented in the R package VarSelLCM. It allows for mixed data types and missing data; however, it assumes both local independence of the variables within a cluster and global independence between informative and non-informative variables.

Each simulation case is described below. Figure 2 provides a graphical depiction of the problem for the first eight variables from the first of the 100 realizations of Case 2(c). Case 1 simulations resulted in very similar data patterns as well.

Figure 2. Pairwise scatter plots of the first eight variables for simulation Case 2(c).

Case 1(a) n = 150, p = 10. The true model has M = 3 components with mixing proportions 0.5, 0.25, 0.25, respectively, and y | ϕ is multivariate normal with no censoring nor missing data. Only two variables y(1) = [y1, y2]′ are informative, with means of (2, 0), (0, 2), (−1.5, −1.5), unit variances, and correlations of 0.5, 0.5, −0.5 in each component, respectively. The non-informative variables y(2) = [y3, . . . , y10]′ are generated as iid 𝒩(0, 1).
Case 1(b) Same as the setup in 1(a) only the non-informative variables y(2) are correlated with y(1) through the relation y(2) = By(1) + ε, where B is an 8 × 2 matrix whose elements are distributed as iid 𝒩(0, 0.3), and ε ~ 𝒩(0, Q22^{−1}), with Q22 ~ 𝒲(I, 10).
Case 1(c) Same as in 1(b), but variables y1, y6 are discretized to the closest integer, variables y2, y9 are left censored at −1.4 (~8% of the observations), and y3, y10 are right censored at 1.4 (see the data-generation sketch after this list).
Case 1(d) Same as 1(c), but the even numbered yj have ~30% of the observations MAR.
Case 2(a) n = 300, p = 30. The true model has M = 3 components with mixing proportions 0.5, 0.25, 0.25, respectively, and y | ϕ is multivariate normal with no censoring nor missing data. Only four variables (y1, y2, y3, y4) are informative, with means of (0.6, 0, 1.2, 0), (0, 1.5, −0.6, 1.9), (−2, −2, 0, 0.6), and all variables have unit variance in each of the three components, respectively. All correlations among informative variables are equal to 0.5 in components 1 and 2, while component 3 has correlation matrix Σ3^(11)(i, j) = 0.5 (−1)^{|i+j|} I{i ≠ j} + I{i = j}. The non-informative variables y(2) = [y5, . . . , y30]′ are generated as iid 𝒩(0, 1).
Case 2(b) Same as the setup in Case 2(a) only the non-informative variables y(2) are correlated with y(1) through the relation y(2) = By(1) + ε, where B is a 26 × 4 matrix whose elements are distributed as iid 𝒩(0, 0.3), and ε ~ 𝒩(0, Q22^{−1}), with Q22 ~ 𝒲(I, 30).
Case 2(c) Same setup as in Case 2(b), but now variables y1, y6, y11 are discretized to the closest integer, variables y2, y9, y10, y11 are left censored at −1.4 (~8% of the observations), and variables y3, y12, y13, y14 are right censored at 1.4.
Case 2(d) Same as Case 2(c), but the even numbered yj have ~30% MAR.
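The following R sketch (illustrative; the constants follow the case descriptions above, but the code is ours) generates one data set in the spirit of Cases 1(b)–(c): correlated non-informative variables, then discretization and censoring.

```r
# Sketch: generate one data set in the spirit of Cases 1(b)-(c).
set.seed(4)
n <- 150; p <- 10; p1 <- 2
mix_prop <- c(0.5, 0.25, 0.25)
mu  <- list(c(2, 0), c(0, 2), c(-1.5, -1.5))
rho <- c(0.5, 0.5, -0.5)

phi <- sample(1:3, n, replace = TRUE, prob = mix_prop)
y1  <- t(sapply(phi, function(m) {                 # informative variables
  S <- matrix(c(1, rho[m], rho[m], 1), 2)
  as.vector(mu[[m]] + t(chol(S)) %*% rnorm(2))
}))

B   <- matrix(rnorm((p - p1) * p1, sd = sqrt(0.3)), p - p1, p1)
Q22 <- rWishart(1, 10, diag(p - p1))[, , 1]
eps <- t(t(chol(solve(Q22))) %*% matrix(rnorm((p - p1) * n), p - p1, n))
y2  <- y1 %*% t(B) + eps                           # non-informative, correlated with y1
y   <- cbind(y1, y2)

y[, c(1, 6)]  <- round(y[, c(1, 6)])               # discretize to nearest integer
y[, c(2, 9)]  <- pmax(y[, c(2, 9)], -1.4)          # left censor at -1.4
y[, c(3, 10)] <- pmin(y[, c(3, 10)],  1.4)         # right censor at 1.4
```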

For each of the eight simulation cases, 100 data sets were randomly generated and each of the five methods above was fit to each data set. The methods are compared on the basis of the following statistics.

Acc – Accuracy, calculated as the proportion of observations in the estimated clusters that are in the same group as they are in the true clusters, when put in the arrangement (relabeling) that best matches the true clusters (see the sketch after this list).
FI – Fowlkes-Mallows index of ϕ̂ relative to the true clusters.
ARI – Adjusted Rand index.
M – The number of estimated clusters. The estimated number of clusters for the Mclust-vs and VarSelLCM methods was chosen as the best of the possible M = 1, . . . , 8 cluster models via BIC. The number of clusters for the Bayesian methods is chosen as the posterior mode and is inherently allowed to be as large as n.
p1 – The model size, p1 = Σj γ̂j.
PVC – The proportion of variables correctly included/excluded from the model, PVC = (1/p) Σj I{γ̂j = γj}.
CompT – The computation time in minutes (using 20,000 MCMC iterations for the Bayesian methods).
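The R sketch below (illustrative names phi_hat, phi_true, gamma_hat, gamma_true) computes several of these summaries for a single fit, using the mclust package for the best-relabeling accuracy and the adjusted Rand index.

```r
# Sketch: compute several of the comparison metrics for one fit (illustrative).
library(mclust)                                       # adjustedRandIndex, classError

acc <- 1 - classError(phi_hat, phi_true)$errorRate    # accuracy under best relabeling
ari <- adjustedRandIndex(phi_hat, phi_true)           # adjusted Rand index

fowlkes_mallows <- function(a, b) {                   # FM index from pair counts
  tab <- table(a, b)
  tp  <- sum(choose(tab, 2))
  sqrt(tp / sum(choose(rowSums(tab), 2)) * tp / sum(choose(colSums(tab), 2)))
}
fi  <- fowlkes_mallows(phi_hat, phi_true)
pvc <- mean(gamma_hat == gamma_true)                  # proportion of variables correct
```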

These measures are summarized in the columns of the tables below by their mean (and standard deviation) over the 100 data sets. Several trial runs suggested that 20,000 iterations (10,000 burn-in and 10,000 posterior samples) were sufficient for the Bayesian methods to summarize the posterior in the simulation cases; however, not every simulation result was inspected for convergence.

The simulation results from Cases 1(a)–(d) are summarized in Table 1. The summary score for the best method for each summary is in bold, along with that for any other method that was not statistically different from the best method on the basis of the 100 trials (via an uncorrected paired t-test with α = 0.05). As would be expected, DPM-ind is one of the best methods on Case 1(a); however, it is not significantly better than DPM-vs or Mclust-vs on any of the metrics. VarSelLCM performs slightly worse than the top three methods in this case since the local independence assumption is being violated. All of the other methods solidly outperform DPM though, which had a difficult time finding more than a single cluster since it had to include all 10 variables. In Case 1(b) the assumptions of DPM-ind are now being violated and it is unable to perform adequate variable selection. It must include far too many non-informative variables due to the correlation within y(2) and between y(1) and y(2). The clustering performance suffers as a result and, like DPM, it has difficulty finding more than a single cluster. VarSelLCM also struggles in this case for the same reason; the global independence assumption is being violated. Mclust-vs still performs well in this case, but DPM-vs (and DPM-cont) is significantly better on two of the metrics. In Case 1(c) DPM-vs is now explicitly accounting for the discrete and left/right censored variables, while DPM-cont does not. When the discrete variables are incorrectly assumed to be continuous, the model tends to create separate clusters at some of the unique values of the discrete variables. This is because a very high likelihood can be obtained by normal distributions that are almost singular along the direction of the discrete variables. Thus, DPM-vs substantially outperforms DPM-cont and Mclust-vs, demonstrating the importance of explicitly treating the discrete nature of the data when clustering. Finally, Case 1(d) shows that the loss of 30% of the data for half of the variables (including an informative variable) does not degrade the performance of DPM-vs by much. In this case Mclust-vs uses Random Forest imputation to first impute the data, then cluster. The imputation procedure does not explicitly take into account the cluster structure of the data; rather, it could mask this structure. This is another reason that its performance is worse than that of the proposed approach, which incorporates the missingness directly into the clustering model. Mclust-vs and VarSelLCM both have much faster run-times than the Bayesian methods; however, when there are local or global correlations along with discrete variables and/or missing data, they did not perform nearly as well as DPM-vs.

Table 1.

Simulation Case 1 Results.

Method Acc FI ARI M p1 PVC CompT
Case 1(a)

DPM-vs* 0.91 (0.11) 0.86 (0.07) 0.78 (0.16) 2.9 (0.4) 2.0 (0.3) 0.99 (0.04) 296 (18)
DPM 0.37 (0.02) 0.58 (0.00) 0.00 (0.00) 1.0 (0.1) 10.0 (0.0) 0.20 (0.00) 341 (28)
DPM-ind 0.90 (0.13) 0.86 (0.08) 0.76 (0.20) 3.0 (0.6) 1.9 (0.4) 0.99 (0.04) 295 (23)
Mclust-vs 0.91 (0.10) 0.86 (0.07) 0.78 (0.15) 3.0 (0.4) 2.0 (0.2) 0.99 (0.06) 1 (0)
VarSelLCM 0.74 (0.21) 0.73 (0.10) 0.52 (0.29) 2.9 (1.2) 3.5 (3.2) 0.84 (0.32) 6 (2)

Case 1(b)

DPM-vs* 0.89 (0.14) 0.85 (0.08) 0.76 (0.20) 3.0 (0.7) 1.9 (0.4) 0.98 (0.05) 267 (19)
DPM 0.37 (0.02) 0.58 (0.00) 0.00 (0.00) 1.0 (0.0) 10.0 (0.0) 0.20 (0.00) 292 (15)
DPM-ind 0.37 (0.02) 0.58 (0.00) 0.00 (0.00) 1.0 (0.0) 8.5 (0.8) 0.35 (0.08) 243 (18)
Mclust-vs 0.87 (0.12) 0.83 (0.10) 0.72 (0.19) 2.8 (0.4) 2.0 (0.1) 0.94 (0.12) 1 (0)
VarSelLCM 0.52 (0.06) 0.57 (0.05) 0.40 (0.06) 6.7 (0.9) 9.1 (0.8) 0.29 (0.08) 11 (2)

Case 1(c)

DPM-vs 0.85 (0.14) 0.81 (0.08) 0.68 (0.20) 2.8 (0.6) 1.9 (0.4) 0.97 (0.07) 254 (18)
DPM-cont 0.62 (0.06) 0.52 (0.06) 0.31 (0.07) 4.9 (0.7) 2.0 (0.5) 0.89 (0.11) 296 (25)
DPM 0.37 (0.02) 0.58 (0.00) 0.00 (0.00) 1.0 (0.0) 10.0 (0.0) 0.20 (0.00) 285 (19)
DPM-ind 0.37 (0.02) 0.58 (0.00) 0.00 (0.00) 1.0 (0.0) 8.2 (1.0) 0.38 (0.10) 243 (24)
Mclust-vs 0.49 (0.10) 0.43 (0.09) 0.19 (0.12) 6.7 (1.5) 2.7 (0.8) 0.67 (0.14) 1 (0)
VarSelLCM 0.48 (0.08) 0.51 (0.06) 0.34 (0.07) 6.8 (1.1) 8.4 (0.8) 0.36 (0.08) 10 (1)

Case 1(d)

DPM-vs 0.77 (0.19) 0.76 (0.10) 0.57 (0.25) 2.6 (0.8) 1.9 (0.6) 0.95 (0.09) 304 (21)
DPM-cont 0.62 (0.04) 0.52 (0.05) 0.31 (0.06) 4.8 (0.8) 1.9 (0.5) 0.89 (0.11) 355 (29)
DPM 0.37 (0.02) 0.58 (0.00) 0.00 (0.00) 1.0 (0.1) 10.0 (0.0) 0.20 (0.00) 343 (29)
DPM-ind 0.37 (0.02) 0.58 (0.00) 0.00 (0.00) 1.0 (0.0) 7.6 (1.3) 0.44 (0.13) 277 (40)
Mclust-vs 0.50 (0.10) 0.43 (0.08) 0.20 (0.11) 7.0 (1.1) 3.1 (0.9) 0.64 (0.14) 1 (0)
VarSelLCM 0.50 (0.08) 0.51 (0.06) 0.34 (0.07) 6.5 (1.1) 8.3 (0.8) 0.36 (0.08) 12 (9)

True 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 3.0 (0.0) 2.0 (0.0) 1.00 (0.00)
*

DPM-cont is identical to DPM-vs for cases 1(a) and 1(b) and is therefore not listed separately.

The simulation results from Cases 2(a)–(d) are summarized in Table 2. A similar story line carries over into Case 2, where there are now p = 30 (four informative) variables and n = 300 observations. Namely, DPM-vs is not significantly different from DPM-ind or Mclust-vs on any of the summary measures for Case 2(a), with the exception of computation time. DPM-vs is the best method on all summary statistics (except CompT) by a sizeable margin on the remaining cases. While Mclust-vs is much faster than DPM-vs, the cases of the most interest in this paper are those with discrete variables, censoring, and/or missing data (i.e., Cases 1(c), 1(d), 2(c), and 2(d)). In these cases, the additional computation time of DPM-vs might seem inconsequential relative to the enormous gain in accuracy. It is interesting that DPM-vs suffers far less from the missing values when moving from Case 2(c) to 2(d) than it did from Case 1(c) to 1(d). This is likely due to the fact that there are a larger number of observations to offset the additional complexity of a larger p. However, it is also likely that the additional (correlated) variables may help to reduce the posterior variance of the imputed values.

Table 2.

Simulation Case 2 Results.

Method Acc FI ARI M p1 PVC CompT
Case 2(a)

DPM-vs* 0.89 (0.11) 0.85 (0.09) 0.74 (0.20) 3.0 (0.6) 3.8 (0.7) 0.99 (0.02) 544 (45)
DPM 0.50 (0.03) 0.61 (0.01) 0.00 (0.00) 1.4 (0.8) 30.0 (0.0) 0.13 (0.00) 1158 (120)
DPM-ind 0.90 (0.10) 0.85 (0.08) 0.75 (0.17) 3.1 (0.6) 3.9 (0.4) 1.00 (0.02) 543 (36)
Mclust-vs 0.91 (0.12) 0.88 (0.09) 0.78 (0.23) 2.9 (0.5) 4.0 (0.8) 0.98 (0.07) 8 (3)
VarSelLCM 0.68 (0.07) 0.68 (0.04) 0.53 (0.06) 4.7 (0.6) 4.0 (0.0) 1.00 (0.00) 25 (1)

Case 2(b)

DPM-vs* 0.90 (0.10) 0.85 (0.08) 0.75 (0.17) 3.1 (0.6) 3.9 (0.5) 0.99 (0.03) 557 (27)
DPM 0.50 (0.03) 0.61 (0.01) 0.00 (0.00) 1.0 (0.0) 30.0 (0.0) 0.13 (0.00) 1009 (67)
DPM-ind 0.50 (0.03) 0.61 (0.01) 0.00 (0.00) 1.0 (0.0) 28.9 (0.2) 0.17 (0.01) 966 (71)
Mclust-vs 0.83 (0.14) 0.80 (0.12) 0.63 (0.24) 2.6 (0.5) 3.4 (0.8) 0.91 (0.10) 8 (3)
VarSelLCM 0.43 (0.04) 0.46 (0.04) 0.26 (0.05) 7.3 (0.6) 27.3 (1.4) 0.22 (0.05) 53 (3)

Case 2(c)

DPM-vs 0.93 (0.03) 0.89 (0.05) 0.82 (0.07) 3.3 (0.6) 4.0 (0.2) 1.00 (0.02) 589 (47)
DPM-cont 0.62 (0.16) 0.57 (0.14) 0.34 (0.20) 4.4 (1.2) 3.4 (1.4) 0.87 (0.05) 597 (49)
DPM 0.50 (0.03) 0.61 (0.01) 0.00 (0.00) 1.0 (0.0) 30.0 (0.0) 0.13 (0.00) 1078 (54)
DPM-ind 0.50 (0.03) 0.61 (0.01) 0.00 (0.00) 1.0 (0.0) 28.8 (0.5) 0.17 (0.02) 1060 (101)
Mclust-vs 0.45 (0.06) 0.40 (0.04) 0.13 (0.06) 6.4 (1.3) 2.8 (1.0) 0.81 (0.04) 8 (24)
VarSelLCM 0.42 (0.05) 0.43 (0.05) 0.21 (0.06) 7.3 (0.8) 26.6 (1.5) 0.24 (0.05) 59 (3)

Case 2(d)

DPM-vs 0.91 (0.08) 0.86 (0.08) 0.78 (0.14) 3.2 (0.6) 3.9 (0.4) 0.99 (0.03) 577 (54)
DPM-cont 0.61 (0.16) 0.55 (0.14) 0.32 (0.19) 4.3 (1.1) 3.2 (1.2) 0.87 (0.05) 581 (41)
DPM 0.50 (0.03) 0.61 (0.01) 0.00 (0.00) 1.3 (0.5) 30.0 (0.0) 0.13 (0.00) 1028 (85)
DPM-ind 0.50 (0.03) 0.61 (0.01) 0.00 (0.00) 1.0 (0.0) 28.2 (1.1) 0.19 (0.04) 958 (81)
Mclust-vs 0.46 (0.06) 0.41 (0.04) 0.14 (0.05) 6.7 (1.4) 3.0 (1.2) 0.81 (0.04) 6 (7)
VarSelLCM 0.43 (0.05) 0.43 (0.05) 0.20 (0.06) 7.0 (1.0) 26.0 (1.8) 0.26 (0.06) 73 (19)

True 1.00 (0.00) 1.00 (0.00) 1.00 (0.00) 3.0 (0.0) 4.0 (0.0) 1.00 (0.00)
*

DPM-cont is identical to DPM-vs for cases 2(a) and 2(b) and is therefore not listed separately.

4 Application to Autism and Related Disorders

The cohort for this study consists of subjects falling in the criteria for “potential ASD” (PASD) on the basis of various combinations of developmental and psychiatric diagnoses obtained from comprehensive medical and educational records as described in Katusic et al.60. The population of individuals with PASD is important because this group represents the pool of patients with developmental/behavioral symptoms from which clinicians have to determine who has ASD and/or other disorders. Subjects 18 years of age or older were invited to participate in a face-to-face session to complete psychometrist-administered assessments of autism symptoms, cognition/intelligence, memory/learning, speech and language, adaptive functions, and maladaptive behavior. In addition, guardians were asked to complete several self-reported, validated questionnaires. The goal is to describe how the patients’ test scores separate them in terms of clinical presentation and which test scores are the most useful for this purpose. This falls in line with the new Research Domain Criteria (RDoC) philosophy that has gained traction in the field of mental health research. RDoC is a new research framework for studying mental disorders. It aims to integrate many levels of information (cognitive/self-report tests, imaging, genetics) to understand how all of these might be related to similar clinical presentations.

A total of 87 test scores measuring cognitive and/or behavioral characteristics were considered from a broad list of commonly used tests for assessing such disorders. A complete list of the individual tests considered is provided in Web Appendix A. Using expert judgment to include several commonly used aggregates in place of individual subtest scores, this list was reduced to 55 test score variables to be considered in the clustering procedure. Five of the 55 variables have fewer than 15 possible values and are treated here as discrete, ordinal variables. A majority (46) of the 55 variables also have a lower bound, which is attained by a significant portion of the individuals, and are treated as left censored. Five of the variables have an upper bound that is attained by many of the individuals and are thus treated as right censored. There are 486 observations (individuals) in the dataset, however, only 67 individuals have complete data, i.e., a complete case analysis would throw out 86% of the observations.

DPM-vs was applied to these data; four chains with random starting points were run in parallel for 85,000 iterations each, which took ~40 hours on a 2.2GHz processor. The first 10,000 iterations were discarded as burn-in. More iterations were used here than in the simulation cases because this analysis is slightly more complicated (e.g., more variables and observations) and it only needed to be performed once. MCMC trace plots are provided in Web Appendix D. All chains converged to the same distribution (aside from relabeling) and were thus combined. Four of the tests (Beery_standard, CompTsc_ol, WJ_Pass_Comprehen, and Adaptive_Composite) had a high (> 0.88) posterior probability of being informative (Table 3). There is also evidence that Ach_abc_Attention and Ach_abc_AnxDep are informative. The posterior samples were split on which of these two should be included in the model (they were only informative together for 0.1% of the MCMC samples). The next highest posterior inclusion probability for any of the remaining variables was 0.17, and the sum of the inclusion probabilities for all remaining variables was only 0.28. Thus, there is strong evidence to suggest that only five of the 55 variables are sufficient to inform the cluster membership.

Table 3.

Posterior inclusion probabilities and sample means for the six most informative tests.

Variable Pr(γj = 1) Cluster Means
1 2 3
Beery_standard 1.000 0.77 −1.02 −0.44
CompTsc_ol 1.000 0.46 −1.21 −0.01
WJ_Pass_Comprehen 0.944 0.38 −1.26 0.30
Adaptive_Composite 0.889 0.44 −0.68 0.16
ach_abc_Attention 0.460 −0.18 0.21 0.04
ach_abc_AnxDep 0.427 −0.04 0.06 −0.01

A majority (54%) of the posterior samples identified three components/clusters, with 0.12 and 0.25 posterior probability of two and four clusters, respectively. The calculation of ϕ̂ also resulted in three components. Figure 3 displays the estimated cluster membership via pairwise scatterplots of the five most informative variables on a standardized scale. Ach_abc_Attention has also been multiplied by minus one so that higher values imply better functioning for all tests. The corresponding mean vectors of the three main components are also provided in Table 3. There are two groups that are very distinct (i.e., Clusters 1 and 2 are the "high" and "low" groups, respectively), but there is also a "middle" group (Cluster 3). Cluster 3 subjects generally have medium-to-high Adaptive_Composite, WJ_Pass_Comprehen, and Ach_abc_Attention scores, but low-to-medium Beery_standard and CompTsc_ol scores.

Figure 3. Pairwise scatter plots of the standardized version of the five most informative variables in Table 3 with estimated cluster membership above the diagonal and the raw data below.

Figure 4(a) provides a 3D scatter plot on the three most informative variables, highlighting separation between Cluster 1 and Clusters 2 and 3. However, Clusters 2 and 3 are not well differentiated in this plot. Figure 4(b) shows a 3D scatter plot on the variables CompTsc_ol, WJ_Pass_Comprehen, and Ach_abc_Attention, illustrating differentiation between Clusters 2 and 3.

Figure 4. Three dimensional scatter plots of the tests on standardized scale: (a) The most informative three variables with estimated cluster membership. (b) Observations plotted on the variables CompTsc_ol, WJ_Pass_Comprehen, and Ach_abc_Attention, to better illustrate the separation of clusters 2 and 3.

The goal of this work is not necessarily to identify clusters that align with clinical diagnosis of ASD, i.e., it is not a classification problem. The Research Domain Criteria (RDoC) philosophy is to move away from subjective diagnosis of disease. The hope is that these clusters provide groups of similar patients that may have similar underlying physiological causes and can be treated similarly (whether the clinical diagnosis was ASD or not). That being said, the "high" cluster aligned with no clinical diagnosis of ASD for 92% of its subjects, while the "low" cluster aligned with a positive clinical ASD diagnosis for 50% of its subjects. As these clusters result from a bottom-up, data-driven method, they may prove useful for determining imaging biomarkers that correspond better with cluster assignment than with a more subjective diagnosis provided by a physician. This will be the subject of future work.

5 Conclusions & Further Work

In this paper we developed a general approach to clustering via a Dirichlet process model that explicitly allows for discrete and censored variables via a latent variable approach, and for missing data. This approach overcomes the assumption of (global) independence between informative and non-informative variables and the assumption of (local) independence of variables within a cluster often assumed when clustering data of mixed type. The MCMC computation proceeds via a split/merge algorithm by integrating out the component parameters. This approach was shown to perform markedly better than other approaches on several simulated test cases. The approach was developed for moderate p in the range of ~10–300. The computation is 𝒪(p^3), which makes it ill-suited for extremely large dimensions. However, it may be possible to use a graphical model61,62 within the proposed framework to alleviate this burden for large p.

The approach was used to analyze test scores of individuals with potential ASD and identified three clusters. Further, it was determined that only five of the 55 variables were informative to assess the cluster membership of an observation. This could have a large impact for diagnosis of ASD as there are currently ~100 tests/subtest scores that could be used, and there is no universal standard. Further, the clustering results have served to generate hypotheses about what might show up in brain imaging to explain some of the differences between potential ASD patients. A follow-up study has been planned to investigate these possible connections.

Supplementary Material

Supp info

Acknowledgments

This study was supported by Public Health Service research grants MH093522 and AG034676 from the National Institutes of Health.

References

1. Fraley C, Raftery AE. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association. 2002;97:611–631.
2. Basu S, Chib S. Marginal likelihood and Bayes factors for Dirichlet process mixture models. Journal of the American Statistical Association. 2003;98(461):224–235.
3. Quintana FA, Iglesias PL. Bayesian clustering and product partition models. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2003;65(2):557–574.
4. Tadesse MG, Sha N, Vannucci M. Bayesian variable selection in clustering high-dimensional data. Journal of the American Statistical Association. 2005;100(470):602–617.
5. Liu JS, Zhang JL, Palumbo MJ, Lawrence CE. Bayesian clustering with variable and transformation selections. In: Bernardo JM, Bayarri MJ, Berger JO, Dawid AP, Heckerman D, Smith AFM, West M, editors. Bayesian Statistics 7: Proceedings of the Seventh Valencia International Meeting. Oxford University Press; USA: 2003. pp. 249–275.
6. Tibshirani RJ. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B. 1996;58:267–288.
7. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(2):301–320.
8. Lin Y, Zhang H. Component selection and smoothing in smoothing spline analysis of variance models. Annals of Statistics. 2006;34:2272–2297.
9. George EI, McCulloch RE. Variable selection via Gibbs sampling. Journal of the American Statistical Association. 1993;88:881–889.
10. Reich BJ, Storlie CB, Bondell HD. Variable selection in Bayesian smoothing spline ANOVA models: Application to deterministic computer codes. Technometrics. 2009;51:110–120. doi: 10.1198/TECH.2009.0013.
11. Chung Y, Dunson DB. Nonparametric Bayes conditional distribution modeling with variable selection. Journal of the American Statistical Association. 2012. doi: 10.1198/jasa.2009.tm08302.
12. Fop M, Murphy TB. Variable selection methods for model-based clustering. 2017. arXiv preprint arXiv:1707.00306.
13. Raftery AE, Dean N. Variable selection for model-based clustering. Journal of the American Statistical Association. 2006;101(473):168–178.
14. Maugis C, Celeux G, Martin-Magniette ML. Variable selection for clustering with Gaussian mixture models. Biometrics. 2009;65(3):701–709. doi: 10.1111/j.1541-0420.2008.01160.x.
15. Fop M, Smart K, Murphy TB. Variable selection for latent class analysis with application to low back pain diagnosis. 2015. arXiv preprint arXiv:1512.03350.
16. Marbac M, Sedki M. Variable selection for model-based clustering using the integrated complete-data likelihood. Statistics and Computing. 2017;27(4):1049–1063.
17. Pan W, Shen X. Penalized model-based clustering with application to variable selection. Journal of Machine Learning Research. 2007;8:1145–1164.
18. Wang S, Zhu J. Variable selection for model-based high-dimensional clustering and its application to microarray data. Biometrics. 2008;64(2):440–448. doi: 10.1111/j.1541-0420.2007.00922.x.
19. Xie B, Pan W, Shen X. Variable selection in penalized model-based clustering via regularization on grouped parameters. Biometrics. 2008;64(3):921–930. doi: 10.1111/j.1541-0420.2007.00955.x.
20. Friedman JH, Meulman JJ. Clustering objects on subsets of attributes (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2004;66(4):815–849.
21. Hoff PD. Model-based subspace clustering. Bayesian Analysis. 2006;1(2):321–344.
22. Witten DM, Tibshirani R. A framework for feature selection in clustering. Journal of the American Statistical Association. 2012. doi: 10.1198/jasa.2010.tm09415.
23. Richardson S, Green PJ. On Bayesian analysis of mixtures with an unknown number of components (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology). 1997;59(4):731–792.
24. Kim S, Tadesse MG, Vannucci M. Variable selection in clustering via Dirichlet process mixture models. Biometrika. 2006;93(4):877–893.
25. Ferguson TS. A Bayesian analysis of some nonparametric problems. The Annals of Statistics. 1973;1(2):209–230.
26. Neal RM. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics. 2000;9(2):249–265.
27. Teh YW, Jordan MI, Beal MJ, Blei DM. Hierarchical Dirichlet processes. Journal of the American Statistical Association. 2006;101:1566–1581.
28. Hjort NL, Holmes C, Müller P, Walker SG. Bayesian Nonparametrics. Cambridge University Press; New York, NY: 2010.
29. Hunt L, Jorgensen M. Mixture model clustering for mixed data with missing information. Computational Statistics & Data Analysis. 2003;41(3):429–440.
30. Biernacki C, Deregnaucourt T, Kubicki V. Model-based clustering with mixed/missing data using the new software MixtComp. CMStatistics 2015 (ERCIM 2015). 2015.
31. Murray JS, Reiter JP. Multiple imputation of missing categorical and continuous values via Bayesian mixture models with local dependence. 2016. arXiv preprint arXiv:1410.0438.
32. Hennig C, Liao TF. How to find an appropriate clustering for mixed-type variables with application to socio-economic stratification. Journal of the Royal Statistical Society: Series C (Applied Statistics). 2013;62(3):309–369.
33. Muthen B. Latent variable structural equation modeling with categorical data. Journal of Econometrics. 1983;22(1):43–65.
34. Everitt BS. A finite mixture model for the clustering of mixed-mode data. Statistics & Probability Letters. 1988;6(5):305–309.
35. Dunson DB. Bayesian latent variable models for clustered mixed outcomes. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2000;62(2):355–366.
36. Ranalli M, Rocci R. Mixture models for mixed-type data through a composite likelihood approach. Computational Statistics & Data Analysis. 2017;110:87–102.
37. McParland D, Gormley IC. Model based clustering for mixed data: clustMD. Advances in Data Analysis and Classification. 2016;10(2):155–169.
38. Lesaffre E, Molenberghs G. Multivariate probit analysis: a neglected procedure in medical statistics. Statistics in Medicine. 1991;10(9):1391–1403. doi: 10.1002/sim.4780100907.
39. Chib S, Greenberg E. Analysis of multivariate probit models. Biometrika. 1998;85(2):347–361.
40. Jain S, Neal RM. A split-merge Markov chain Monte Carlo procedure for the Dirichlet process mixture model. Journal of Computational and Graphical Statistics. 2004;13(1):158–182.
41. Storlie CB, Lane WA, Ryan EM, Gattiker JR, Higdon DM. Calibration of computational models with categorical parameters and correlated outputs via Bayesian smoothing spline ANOVA. Journal of the American Statistical Association. 2015;110(509):68–82.
42. Storlie CB, Therneau T, Carter R, Chia N, Bergquist J, Romero-Brufau S. Prediction and inference with missing data in patient alert systems. Journal of the American Statistical Association (in review). 2017. https://arxiv.org/pdf/1704.07904.pdf.
43. Rubin DB. Inference and missing data. Biometrika. 1976;63(3):581–592.
44. Escobar MD, West M. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association. 1995;90(430):577–588.
45. MacEachern SN, Müller P. Estimating mixture of Dirichlet process models. Journal of Computational and Graphical Statistics. 1998;7(2):223–238.
46. Imai K, van Dyk DA. A Bayesian analysis of the multinomial probit model using marginal data augmentation. Journal of Econometrics. 2005;124(2):311–334.
47. McCulloch RE, Polson NG, Rossi PE. A Bayesian analysis of the multinomial probit model with fully identified parameters. Journal of Econometrics. 2000;99(1):173–193.
48. Zhang X, Boscardin WJ, Belin TR. Bayesian analysis of multivariate nominal measures using multivariate multinomial probit models. Computational Statistics & Data Analysis. 2008;52(7):3697–3708. doi: 10.1016/j.csda.2007.12.012.
49. Bhattacharya A, Dunson DB. Simplex factor models for multivariate unordered categorical data. Journal of the American Statistical Association. 2012;107(497):362–377. doi: 10.1080/01621459.2011.646934.
50. George EI, McCulloch RE. Approaches for Bayesian variable selection. Statistica Sinica. 1997;7:339–373.
51. Rue H, Held L. Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall/CRC; Boca Raton, FL: 2005.
52. Bodnar T, Okhrin Y. Properties of the singular, inverse and generalized inverse partitioned Wishart distributions. Journal of Multivariate Analysis. 2008;99(10):2389–2405.
53. Gelman A. Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Analysis. 2006;1(3):515–534.
54. Huang A, Wand MP. Simple marginally noninformative prior distributions for covariance matrices. Bayesian Analysis. 2013;8(2):439–452.
55. Ishwaran H, James LF. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association. 2001.
56. Sethuraman J. A constructive definition of Dirichlet priors. Statistica Sinica. 1994;4:639–650.
57. Stephens M. Dealing with label switching in mixture models. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2000;62(4):795–809.
58. Fu AQ, Russell S, Bray SJ, Tavaré S. Bayesian clustering of replicated time-course gene expression data with weak signals. The Annals of Applied Statistics. 2013;7(3):1334–1361.
59. Stekhoven DJ, Bühlmann P. missForest: non-parametric missing value imputation for mixed-type data. Bioinformatics. 2012;28(1):112–118. doi: 10.1093/bioinformatics/btr597.
60. Katusic SK, Myers S, Colligan RC, Voigt R, Stoeckel RE, Port JD, Croarkin PE, Weaver A. Developmental brain dysfunction-related disorders and potential autism spectrum disorder (PASD) among children and adolescents - population based 1976–2000 birth cohort. Lancet Neurology. 2016 (in review).
61. Giudici P, Green PJ. Decomposable graphical Gaussian model determination. Biometrika. 1999;86(4):785–801.
62. Wong F, Carter CK, Kohn R. Efficient estimation of covariance selection models. Biometrika. 2003;90(4):809–830.
