Fast approximate inference for variable selection in Dirichlet process mixtures, with an application to pan-cancer proteomics

Oliver M Crook; Laurent Gatto; Paul DW Kirk

doi:10.1515/sagmb-2018-0065

Author manuscript; available in PMC: 2023 Jan 3.

Published in final edited form as: Stat Appl Genet Mol Biol. 2019 Dec 12;18(6):/j/sagmb.2019.18.issue-6/sagmb-2018-0065/sagmb-2018-0065.xml. doi: 10.1515/sagmb-2018-0065

PMCID: PMC7614016 EMSID: EMS158447 PMID: 31829970

Abstract

The Dirichlet Process (DP) mixture model has become a popular choice for model-based clustering, largely because it allows the number of clusters to be inferred. The sequential updating and greedy search (SUGS) algorithm (Wang & Dunson, 2011) was proposed as a fast method for performing approximate Bayesian inference in DP mixture models, by posing clustering as a Bayesian model selection (BMS) problem and avoiding the use of computationally costly Markov chain Monte Carlo methods. Here we consider how this approach may be extended to permit variable selection for clustering, and also demonstrate the benefits of Bayesian model averaging (BMA) in place of BMS. Through an array of simulation examples and well-studied examples from cancer transcriptomics, we show that our method performs competitively with the current state-of-the-art, while also offering computational benefits. We apply our approach to reverse-phase protein array (RPPA) data from The Cancer Genome Atlas (TCGA) in order to perform a pan-cancer proteomic characterisation of 5157 tumour samples. We have implemented our approach, together with the original SUGS algorithm, in an open-source R package named sugsvarsel, which accelerates analysis by performing intensive computations in C++ and provides automated parallel processing. The R package is freely available from: https://github.com/ococrook/sugsvarsel

Keywords: Bayesian clustering, cancer proteomics, variable selection

1. Introduction

Bayesian nonparametric methods have become commonplace in the statistics and machine learning literature due to their flexibility and wide applicability. For model-based clustering, Dirichlet process (Ferguson 1973; 1974) mixture models have become particularly popular (Antoniak, 1974; Lo, 1984; Escobar, 1994; Escobar & West, 1995; Blei & Jordan, 2006), partly because they allow the number of clusters supported by the data to be inferred. By introducing latent selection indicators, these models can be extended to perform variable selection for clustering (Kim, Tadesse & Vannucci, 2006), which is particularly relevant in high-dimensional settings (Law, Figueiredo & Jain, 2004; Constantinopoulos, Titsias & Likas, 2006). There are now several approaches for model-based clustering and variable selection (see Fop & Murphy, 2018, for a recent review), but current Markov chain Monte Carlo (MCMC) algorithms for Bayesian inference in Dirichlet process (DP) mixture models (e.g. Neal, 2000; Jain & Neal, 2004) are computationally costly, and often infeasible for large datasets.

A number of algorithms have been proposed for fast approximate inference in DP and related mixture models, which make possible the analysis of datasets with large numbers of observations. In the present paper, we focus on the sequential updating and greedy search (SUGS) algorithm (Wang & Dunson, 2011; Zhang et al., 2014), which we describe in more detail in Section 2.2. However, there are many other approximate inference procedures, a (non-exhaustive, but representative) selection of which we now briefly describe. Variational Bayes (VB) approaches for approximate inference in mixture models have a long history (Attias 1999; 2000), and were extended to DP mixture models by Blei and Jordan (2006). Despite well-known limitations in terms of generally underestimating the variance of the posterior, variational techniques have enabled (approximate) Bayesian inference to be applied to a large class of models and “big data” settings, and are now a mainstay of modern computational Bayesian statistics (Blei, Kucukelbir & McAuliffe, 2016). We note that SUGS was previously shown by Wang and Dunson (2011) to be 10 times faster than VB (largely due to the authors finding that VB required a computationally costly initialisation step in order to provide good results), while performing comparably to VB in terms of model fit. Daumé III (2007) provided an alternative approximate inference strategy that uses fast search algorithms to seek the maximum a posteriori (MAP) allocation of observations to clusters, and demonstrated that these techniques permit clustering of very large datasets. The results obtained depend upon the order in which observations are considered, and hence Daumé III (2007) considered a number of ordering strategies. 
Bayesian hierarchical clustering (Heller & Ghahramani, 2005; Savage et al., 2009; Cooke et al., 2011; Darkins et al., 2013) is another method for performing approximate inference for a DP mixture model that also identifies a single optimal clustering structure, but does so using an agglomerative hierarchical clustering approach that determines which clusters to merge at each step on the basis of computed marginal likelihoods. In contrast, by revisiting the widely used k-means algorithm from a Bayesian nonparametric viewpoint, Kulis and Jordan (2012) proposed a novel hard clustering algorithm called DP-means, which was subsequently generalised beyond the Gaussian mixtures case (Jiang, Kulis & Jordan, 2012) and was also adapted to cluster large sequencing datasets (Jiang et al., 2016). The MAP-DP approach of Raykov, Boukouvalas, and Little (2016a) is an approximate maximum a posteriori inference algorithm for DP mixtures, which has also been proposed as a principled alternative to k-means (Raykov et al., 2016b), but which – in contrast to DP-means – inherits the “rich get richer” property of the DP mixture model, and allows standard model selection and model fit diagnostics to be used (Raykov, Boukouvalas & Little, 2016a). Despite the advances provided by the above methods in terms of reduced computational cost and scalability to large datasets, we note that without variable selection all of these approaches may be ill-suited in high-dimensional settings.

In the spirit of the original SUGS algorithm, here we pose clustering and variable selection as a Bayesian model selection (BMS) problem. We consider variable selection for clustering in terms of partitioning variables into those which are relevant and those which are irrelevant for defining the clustering structure, and thereby pose the problem as one of using BMS to select both a partition of the variables and a partition of the observations. We moreover consider the benefits of performing Bayesian model averaging (BMA) (Madigan & Raftery, 1994; Hoeting et al., 1999) for summarising the SUGS output. For ease of exposition, we focus on the case of DP Gaussian mixtures, but note that all of our methods extend straightforwardly to other distributions for which conjugate priors may be chosen.

We consider a range of simulation settings and well-studied examples from cancer transcriptomics to show that our methods perform competitively with the current state-of-the-art. Having established the utility of our approach, we consider an application to reverse-phase protein arrays (RPPA) datasets in order to characterise the pan-cancer functional proteome. Such datasets have the potential to provide a deeper understanding of the biomolecular processes at work in cancer cells, and have previously been shown to offer additional insights beyond what may be captured by genomics or transcriptomics datasets (Akbani et al., 2014). Here we consider RPPA data for 5157 tumour samples obtained from The Cancer Genome Atlas (TCGA).

Section 2 recaps DP mixture models and the SUGS algorithm, then describes our extensions to SUGS including variable selection and BMA. Section 3 evaluates our method on simulated datasets and compares it with other approaches to clustering and variable selection. We then apply our method to a large proteomics dataset, highlighting its applicability. In the final section, we make some concluding remarks and discuss limitations and extensions. Our methods are implemented in an R package: https://github.com/ococrook/sugsvarsel.

2. Methods

2.1. Dirichlet process mixtures

We provide a very brief recap of DP mixture models, mainly to introduce notation, and refer to the overview provided in Section 3 of Teh et al. (2006) for further details. Let G ~ DP(βP₀) where β > 0 is the DP concentration parameter, P₀ is the base probability measure, and G is a random probability measure. We consider a Pólya urn scheme in which we have independent and identically distributed (i.i.d.) random variables θ₁, θ₂, … distributed according to G. Computing the sequential conditional distributions of θ_i given θ₁, …, θ_i−1, upon marginalising out the random G, we obtain (Blackwell & MacQueen, 1973):

θ_{i} ∣ θ_{1}, \dots, θ_{i - 1} \sim \frac{β}{β + i - 1} P_{0} + \frac{1}{β + i - 1} \sum_{l = 1}^{i - 1} δ_{θ_{l}}, \quad i = 1, \dots, n,

(1)

where δ_θ is a probability measure with mass concentrated at θ. It is clear from this equation that for any r = 1, 2, …, i − 1, the probability that θ_i is equal to θ_r is given by $\sum_{l = 1}^{i - 1} 𝕀 (θ_{l} = θ_{r}) / (β + i - 1)$ , where 𝕀(X) = 1 if X is true and 𝕀(X) = 0 otherwise. Thus θ_i has non-zero probability of being equal to one of the previous draws, and it is this clustering property that makes the DP a suitable prior for mixture models.

The DP mixture model is obtained by introducing an additional parametric probability distribution, F. More precisely, let observations x_i be modelled according to the following hierarchical model:

\begin{array}{l} G \sim D P (β P_{0}), \\ θ_{i} ∣ G \sim G, \\ x_{i} ∣ θ_{i} \sim F (θ_{i}), \end{array}

(2)

where F denotes the conditional distribution of the observation x_i given θ_i. For example, when F is chosen to be a Gaussian distribution we arrive at the DP Gaussian mixture model (also referred to as the infinite Gaussian mixture model; Rasmussen, 2000).

When performing inference for such models, it is common to introduce a set of latent variables (cluster labels) z₁, …, z_n associated with the observations, such that z_i is the cluster label for observation x_i. From the above specification of the DP mixture model, it follows that the conditional prior distribution of z_i given z_−i = (z₁, …, z_i−1) is categorical with:

π_{i k} := P (z_{i} = k ∣ z_{- i}, β) = \begin{cases} \frac{n_{k}}{β + i - 1}, & \text{for } k = 1, \dots, K - 1, \\ \frac{β}{β + i - 1}, & \text{for } k = K, \end{cases}

(3)

where β > 0 is the DP concentration parameter, n_k := $\sum_{l = 1}^{i - 1} 𝕀 (z_{l} = k)$ is the number of previous observations allocated to cluster k, and K = max{z_−i} + 1. Larger values of β encourage observations to be allocated to new clusters, hence β plays a role in controlling the number of clusters.
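As a small illustration of Equation (3), the following Python sketch computes the prior allocation probabilities for the next observation given the labels of the previous ones. The function name `crp_prior` and the list-based representation of the labels are illustrative assumptions, not part of the sugsvarsel package.

```python
def crp_prior(z_prev, beta):
    """Prior probabilities [pi_i1, ..., pi_iK] of Equation (3).

    z_prev holds the labels z_1, ..., z_{i-1} (integers starting at 1).
    The last entry of the result is the probability of opening cluster K.
    """
    i = len(z_prev) + 1                      # index of the current observation
    n_existing = max(z_prev) if z_prev else 0
    counts = [0] * n_existing                # n_k for k = 1, ..., K - 1
    for label in z_prev:
        counts[label - 1] += 1
    denom = beta + i - 1
    return [n_k / denom for n_k in counts] + [beta / denom]


# Three previous observations: two in cluster 1, one in cluster 2.
probs = crp_prior([1, 1, 2], beta=1.0)
```

With β = 1 the new cluster receives the same prior weight as a cluster containing a single observation; increasing `beta` shifts mass towards the final entry.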

Inference for DP mixture models can be performed using computationally intensive MCMC methods (Neal, 2000; Jain & Neal, 2004). However, as we discuss below, here we are interested in the SUGS algorithm for approximate inference, proposed by Wang and Dunson (2011).

2.2. Sequential updating and greedy search (SUGS)

SUGS is a sequential approach for allocating observations to clusters, which (greedily) allocates the i-th observation to a cluster, given the allocations of the previous i − 1 observations. Suppose that observations x_−i = (x₁, …, x_i−1) have previously been allocated to clusters. As described in Wang and Dunson (2011), the posterior probability of allocating observation i to cluster k according to the DP mixture model formulation above is given by:

P (z_{i} = k ∣ x_{i}, x_{- i}, z_{- i}, β) = \frac{π_{i k} L_{i k} (x_{i})}{\sum_{l = 1}^{K} π_{i l} L_{i l} (x_{i})},

(4)

where π_ik is defined as in Equation (3), and

L_{i k} = \int f (x_{i} ∣ θ_{k}) p (θ_{k} ∣ x_{- i}, z_{- i}) d θ_{k}

(5)

is the conditional marginal likelihood associated with x_i given allocation to cluster k and the cluster allocations for observations 1, …, i−1, with f(x_i|θ_k) denoting the likelihood associated with x_i as a function of θ_k. If k is a cluster to which previous observations have already been allocated, then p(θ_k|x_−i,z_−i) is the posterior distribution of θ_k given the observations previously allocated to cluster k; i.e. $p (θ_{k} ∣ x_{- i}, z_{- i}) \propto p_{0} (θ_{k}) \prod_{j : z_{j} = k, 1 \leq j \leq i - 1} f (x_{j} ∣ θ_{k})$ , where p₀(θ_k) is the prior on the cluster-specific parameters, θ_k. For a new cluster, i.e. for k = K, we have p(θ_k|x_−i,z_−i) = p₀(θ_k). If P₀ is taken to be conjugate for the likelihood f, then the posterior and conditional marginal likelihood are available analytically.

Assuming that the concentration parameter β is given and that conjugate priors are taken, the above suggests a computationally efficient deterministic clustering algorithm (the SUGS algorithm). That is, z₁ is initialised as z₁ = 1, and then subsequent observations are sequentially allocated to clusters by setting z_i = arg max_{k∈{1,…,K}} P(z_i = k|x_i, x_−i, z_−i, β), where we recall that K = max{z_−i} + 1 may change after each sequential allocation.
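The greedy allocation with fixed β can be sketched in a few lines of Python. This is a minimal illustration only: it assumes univariate data, a Gaussian likelihood with known variance `sigma_sq`, and a conjugate N(`m0`, `s0_sq`) prior on each cluster mean, so that the conditional marginal likelihood of Equation (5) is a Gaussian posterior predictive density. These modelling choices and all function names are assumptions made for the example and are not the priors used in the paper.

```python
import math


def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def predictive(x, members, m0, s0_sq, sigma_sq):
    """L_ik of Equation (5) for a Gaussian cluster with known variance."""
    post_prec = 1.0 / s0_sq + len(members) / sigma_sq
    post_mean = (m0 / s0_sq + sum(members) / sigma_sq) / post_prec
    return normal_pdf(x, post_mean, 1.0 / post_prec + sigma_sq)


def sugs_fixed_beta(xs, beta, m0=0.0, s0_sq=10.0, sigma_sq=1.0):
    """Greedy sequential allocation; returns labels starting at 1."""
    z = [1]
    clusters = [[xs[0]]]
    for n_prev in range(1, len(xs)):          # n_prev = i - 1 previous observations
        x = xs[n_prev]
        denom = beta + n_prev                 # beta + i - 1 in Equation (3)
        # Unnormalised posterior of Equation (4); the argmax is unaffected
        # by the normalising constant.
        scores = [len(c) / denom * predictive(x, c, m0, s0_sq, sigma_sq)
                  for c in clusters]
        scores.append(beta / denom * predictive(x, [], m0, s0_sq, sigma_sq))
        k = max(range(len(scores)), key=scores.__getitem__)
        if k == len(clusters):                # open a new cluster
            clusters.append([])
        clusters[k].append(x)
        z.append(k + 1)
    return z


labels = sugs_fixed_beta([0.1, -0.2, 8.0, 8.3, 0.0, 7.9], beta=1.0)
```

Because each step conditions only on earlier allocations, permuting the input can change the result, which is the order dependence discussed in Section 2.2.2.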

2.2.1. Dealing with unknown β

The DP concentration parameter β directly influences the number of clusters, thus we treat this as a random variable to be inferred, in the same way as in Wang and Dunson (2011). In particular, let $\hat{β} = ({\hat{β}}_{1}, \dots, {\hat{β}}_{L})$ be a discrete grid of permissible values for β with a large range, and then define the prior for β to be discrete with the following form:

p_{0} (β ∣ κ_{1}, \dots, κ_{L}) = \sum_{l = 1}^{L} κ_{l} 𝕀 (β = {\hat{β}}_{l}),

(6)

where $κ_{l} = p (β = {\hat{β}}_{l})$ . Further defining $ϕ_{l}^{(i - 1)} = p (β = {\hat{β}}_{l} ∣ x_{- i}, z_{- i})$ and $π_{i k l} = p (z_{i} = k ∣ β = {\hat{β}}_{l}, z_{- i})$ , the β parameter may be marginalised in Equation (4) to obtain:

p (z_{i} = k ∣ x_{- i}, x_{i}, z_{- i}) = \frac{\sum_{l = 1}^{L} ϕ_{l}^{(i - 1)} π_{i k l} L_{i k} (x_{i})}{\sum_{l = 1}^{L} ϕ_{l}^{(i - 1)} \sum_{k = 1}^{K} π_{i k l} L_{i k} (x_{i})},

(7)

where $π_{i k l} : = p (z_{i} = k ∣ β = {\hat{β}}_{l}, z_{- i})$ is given by Equation (3); $ϕ_{l}^{(0)} = κ_{l}$ and:

ϕ_{l}^{(i)} = p (β = {\hat{β}}_{l} ∣ x_{- i}, x_{i}, z_{- i}, z_{i}) = \frac{ϕ_{l}^{(i - 1)} π_{i z_{i} l}}{\sum_{s = 1}^{L} ϕ_{s}^{(i - 1)} π_{i z_{i} s}}

(8)

may be calculated sequentially for i = 1, …, n. The SUGS algorithm for allocating observations to clusters when β is unknown is then as presented in Algorithm 1.

Algorithm 1: The SUGS algorithm, when the DP precision parameter β is allowed to be unknown.

Input: Data $X = \{x_{i}\}_{i = 1}^{n}$ , prior P₀(θ), hyperparameters $\{κ_{l}\}_{l = 1}^{L}$

Output: Cluster allocations $Z = \{z_{i}\}_{i = 1}^{n}$

1 Initialise z₁ = 1, K = 2, and $\{ϕ_{l}^{(0)} = κ_{l}\}_{l = 1}^{L}$ ;
2 Evaluate p(θ₁|z₁, x₁) ∝ p₀(θ₁)f(x₁|θ₁);
3 Calculate $\{ϕ_{l}^{(1)}\}_{l = 1}^{L}$ according to Eq. (8);
4 for i = 2 to n do
5     for k = 1 to K do
6         Calculate L_ik according to Eq. (5);
7         Evaluate p(z_i = k|x₁, …, x_i, z₁, …, z_i−1) according to Eq. (7);
8     end
9     Set z_i = arg max_{k∈{1,…,K}} p(z_i = k|x₁, …, x_i, z₁, …, z_i−1);
10     Set K = max{z₁, …, z_i} + 1;
11     for l = 1 to L do
12         Calculate $ϕ_{l}^{(i)}$ according to Eq. (8);
13     end
14     Evaluate p(θ_{z_i}|x₁, …, x_i, z₁, …, z_i) ∝ p₀(θ_{z_i}) ∏_{j:z_j=z_i,1≤j≤i} f(x_j|θ_{z_i});
15 end
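Steps 11 to 13 of Algorithm 1 apply Equation (8) to every point of the β grid. The sketch below shows that update in Python; the helper names, the zero-based cluster index `k_chosen`, and the three-point grid are illustrative assumptions.

```python
def crp_weight(k, counts, beta, i):
    """pi_ikl of Equation (3); counts are the cluster sizes before observation i.

    k is zero-based; k == len(counts) denotes the new cluster.
    """
    denom = beta + i - 1
    if k < len(counts):
        return counts[k] / denom
    return beta / denom


def update_phi(phi_prev, beta_grid, counts, k_chosen, i):
    """Equation (8): posterior weights over the beta grid after allocating z_i."""
    unnorm = [phi * crp_weight(k_chosen, counts, beta, i)
              for phi, beta in zip(phi_prev, beta_grid)]
    total = sum(unnorm)
    return [u / total for u in unnorm]


beta_grid = [0.1, 1.0, 10.0]
phi0 = [1.0 / 3.0] * 3
counts = [3]          # three previous observations, all in one cluster (i = 4)

phi_if_new = update_phi(phi0, beta_grid, counts, k_chosen=1, i=4)
phi_if_existing = update_phi(phi0, beta_grid, counts, k_chosen=0, i=4)
```

Opening a new cluster moves weight towards the larger grid values, whereas joining the existing cluster moves weight towards the smaller ones, which is how the allocations inform β.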

2.2.2. Formulation of Bayesian model selection problem

A notable limitation of the (deterministic) SUGS algorithm as presented so far is that the clustering structure obtained is dependent upon the initial ordering of the observations. To remove this dependence, Wang and Dunson (2011) consider multiple permutations of this ordering, and pose SUGS as a Bayesian model selection (BMS) problem. More concretely, the algorithm is repeated for many random orderings of the data and a final partition of the observations is then chosen by optimising an appropriate objective function for BMS, such as the marginal likelihood (ML):

L (X ∣ Z) = \prod_{k = 1}^{K} \int_{θ_{k}} [\prod_{i : z_{i} = k} f (x_{i} ∣ θ_{k})] p_{0} (θ_{k}) d θ_{k} .

(9)

In practice, Wang and Dunson (2011) advocate optimising the pseudo-marginal likelihood (PML), since they found that the marginal likelihood often produces many small clusters. The PML is given by:

\begin{aligned} PML_{z} (X) & = \prod_{i = 1}^{n} p (x_{i} ∣ X_{n ∖ - i}, z_{n ∖ - i}) \\ & = \prod_{i = 1}^{n} \int_{θ} p (x_{i} ∣ θ) p (θ ∣ X_{n ∖ - i}, z_{n ∖ - i}) d θ \\ & = \prod_{i = 1}^{n} \sum_{k = 1}^{K} P (z_{i} = k ∣ X_{n ∖ - i}, z_{n ∖ - i}) \int_{θ_{k}} f (x_{i} ∣ θ_{k}) p (θ_{k} ∣ X_{n ∖ - i}, z_{n ∖ - i}) d θ_{k}, \end{aligned}

(10)

where, defining X = {x₁, …, x_n} and Z = {z₁, …, z_n}, X_n\−i = X\{x_i} is the set of all observations except the ith, and similarly z_n\−i = Z\{z_i}. In addition, Wang and Dunson (2011) remark that p(x_i|X, Z) can be used to approximate p(x_i|X_n\−i, z_n\−i) to speed up computations and that this approximation is accurate for large sample sizes.
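To make the model selection step concrete, the following Python sketch evaluates the logarithm of the marginal likelihood in Equation (9) for a given partition and picks the best of several candidate partitions (for example, those returned by different random orderings). It assumes the same illustrative model as a univariate Gaussian likelihood with known variance and a conjugate Gaussian prior on each cluster mean; under that assumption each cluster's integral factorises, by the chain rule, into sequential posterior predictive densities. The function names and hyperparameter values are hypothetical.

```python
import math


def log_predictive(x, members, m0, s0_sq, sigma_sq):
    """Log posterior predictive density of x given the current cluster members."""
    post_prec = 1.0 / s0_sq + len(members) / sigma_sq
    post_mean = (m0 / s0_sq + sum(members) / sigma_sq) / post_prec
    var = 1.0 / post_prec + sigma_sq
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - post_mean) ** 2 / var


def log_marginal_likelihood(xs, z, m0=0.0, s0_sq=10.0, sigma_sq=1.0):
    """log L(X | Z) of Equation (9), accumulated cluster by cluster."""
    clusters = {}
    total = 0.0
    for x, label in zip(xs, z):
        members = clusters.setdefault(label, [])
        total += log_predictive(x, members, m0, s0_sq, sigma_sq)
        members.append(x)
    return total


xs = [0.1, -0.2, 8.0, 8.3, 0.0, 7.9]
candidates = [
    [1, 1, 2, 2, 1, 2],     # two well-separated groups
    [1, 1, 1, 1, 1, 1],     # everything in one cluster
    [1, 2, 3, 4, 5, 6],     # all singletons
]
best = max(candidates, key=lambda z: log_marginal_likelihood(xs, z))
```

Working on the log scale avoids numerical underflow when the number of observations is large.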

2.3. SUGS for variable selection

Irrelevant variables in high-dimensions can present a considerable challenge for clustering models and algorithms, because the number of variables with no clustering structure can overwhelm those where a clustering structure exists (Witten & Tibshirani, 2010). There have been many approaches to model-based clustering and variable selection (e.g. Raftery & Dean, 2006; Maugis, Celeux & Martin-Magniette, 2009), and we direct readers to Fop and Murphy (2018) for a recent review. However, many of these scale poorly with increasing dataset dimension, and/or require the number of clusters to be determined as a separate analysis step. To address these challenges, here we extend the SUGS algorithm to simultaneously perform clustering and variable selection, and refer to the resulting procedure as SUGSVarSel.

Since we are in the high-dimensional setting, we assume for simplicity that variables are independent given the cluster allocations (which, in the Gaussian case, is equivalent to assuming a diagonal structure for the covariance matrix). Let x_i,d be the dth element of the ith observation vector, with d = 1, …, D, and D the number of variables. Introducing an indicator variable γ_d for each variable, which is 1 if the dth variable is relevant for the clustering structure and 0 if not, we follow a common approach from the literature (Law, Figueiredo & Jain, 2004; Tadesse, Sha & Vannucci, 2005; Kim, Tadesse & Vannucci, 2006) and assume that the cluster conditional likelihood can be factorised as follows:

f (x_{i} ∣ θ, θ_{0}, z_{i} = k) = \prod_{d = 1}^{D} f (x_{i, d} ∣ θ_{k, d})^{𝕀 (γ_{d} = 1)} f (x_{i, d} ∣ θ_{0, d})^{𝕀 (γ_{d} = 0)},

(11)

where θ₀ are “global” (i.e. not cluster-specific) parameters. In other words, the variables for which γ_d = 1 are modelled by a mixture distribution with cluster-specific parameters θ_k,d, while the variables for which γ_d = 0 are modelled by a single component with (global, not cluster-specific) parameters θ_0,d. Having introduced the D indicator variables γ_d, we now extend the SUGS algorithm in order to estimate them.

2.3.1. The SUGSVarSel algorithm

Given a realisation of the indicator variables, Γ = {γ₁, …, γ_D}, we may plug the cluster conditional likelihood given in Equation (11) into Equation (5) and proceed as before in order to identify a clustering, Z.

Conversely, suppose we have a realisation, Z, of the set of component allocation variables, but that the indicator variables Γ are unknown. In this case, the posterior probabilities associated with the variable indicators are given by:

P (γ_{d} = 1 ∣ X, Z) = \frac{p_{0} (γ_{d} = 1)}{B} \prod_{k \in Z} \int_{θ_{k, d}} (\prod_{i : z_{i} = k} f (x_{i, d} ∣ θ_{k, d})) p_{0} (θ_{k, d}) d θ_{k, d}

(12)

P (γ_{d} = 0 ∣ X, Z) = \frac{p_{0} (γ_{d} = 0)}{B} \int_{θ_{0, d}} (\prod_{i = 1}^{n} f (x_{i, d} ∣ θ_{0, d})) p_{0} (θ_{0, d}) d θ_{0, d},

(13)

where p₀(γ_d = q) indicates the prior probability that γ_d = q, and B is a normalising constant that ensures that P(γ_d = 0|X, Z) and P(γ_d = 1|X, Z) sum to 1. Thus, given a realisation, Z, of the set of component allocation variables, a greedy approach to finding γ_d is to set γ_d = arg max_{q∈{0,1}} P(γ_d = q|X, Z).
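Equations (12) and (13) compare, for a single variable, a cluster-wise marginal likelihood with a global one. The Python sketch below illustrates this comparison for one column of the data, again under an assumed Gaussian likelihood with known variance and a conjugate Gaussian prior on each mean; the hyperparameters, the prior probability `prior_on`, and the function names are illustrative assumptions rather than the paper's settings.

```python
import math


def log_ml_group(values, m0=0.0, s0_sq=10.0, sigma_sq=1.0):
    """Log marginal likelihood of values sharing a single Gaussian mean."""
    total, running_sum, count = 0.0, 0.0, 0
    for x in values:
        post_prec = 1.0 / s0_sq + count / sigma_sq
        post_mean = (m0 / s0_sq + running_sum / sigma_sq) / post_prec
        var = 1.0 / post_prec + sigma_sq
        total += -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - post_mean) ** 2 / var
        running_sum += x
        count += 1
    return total


def prob_relevant(column, z, prior_on=0.5):
    """P(gamma_d = 1 | X, Z) for one variable, following Eqs. (12) and (13)."""
    groups = {}
    for x, label in zip(column, z):
        groups.setdefault(label, []).append(x)
    log_on = math.log(prior_on) + sum(log_ml_group(g) for g in groups.values())
    log_off = math.log(1.0 - prior_on) + log_ml_group(column)
    shift = max(log_on, log_off)              # normalise on the log scale
    on, off = math.exp(log_on - shift), math.exp(log_off - shift)
    return on / (on + off)


z = [1, 1, 2, 2, 1, 2]
p_signal = prob_relevant([0.1, -0.2, 8.0, 8.3, 0.0, 7.9], z)
p_noise = prob_relevant([0.3, -0.5, 0.2, -0.1, 0.4, -0.3], z)
gamma = [int(p > 0.5) for p in (p_signal, p_noise)]   # greedy assignment
```

The variable whose values differ between clusters is switched on, while the variable with no cluster structure is switched off because the cluster-specific model pays for its extra parameters without improving the fit.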

Given an initial realisation of the indicator variables, Γ= Γ⁽⁰⁾, the above suggests an iterative strategy in which at each iteration we use the SUGS algorithm to find a partition Z^(t) given Γ^(t−1), and then greedily update the indicator variables according to Equations (12) and (13) above in order to obtain Γ^(t) given Z^(t). This algorithm, which we refer to as SUGSVarSel, is presented in Algorithm 2.

Algorithm 2: The SUGSVarSel algorithm.

Input: Data $X = \{x_{i}\}_{i = 1}^{n}$ , priors P₀(θ) and P₀(γ), hyperparameters $\{κ_{l}\}_{l = 1}^{L}$ , initial indicator switches Γ⁽⁰⁾, maximum iterations T.

Output: Cluster allocations $Z = \{z_{i}\}_{i = 1}^{n}$ , variable switches $Γ = \{γ_{d}\}_{d = 1}^{D}$

1 Initialise t = 1, z₁ = 1, K = 2, and $\{ϕ_{l}^{(0)} = κ_{l}\}_{l = 1}^{L}$ ;
2 Evaluate p(θ₁|z₁, x₁) ∝ p₀(θ₁)f(x₁|θ₁);
3 Calculate $\{ϕ_{l}^{(1)}\}_{l = 1}^{L}$ according to Eq. (8);
4 while t ≤ T do
5     for i = 2 to n do
6         for k = 1 to K do
7             Calculate L_ik given Γ^(t−1) according to Eqs. (5) and (11);
8             Evaluate p(z_i = k|x₁, …, x_i, z₁, …, z_i−1) according to Eq. (7);
9         end
10         Set z_i = arg max_{k∈{1,…,K}} p(z_i = k|x₁, …, x_i, z₁, …, z_i−1);
11         Set K = max{z₁, …, z_i} + 1;
12         for l = 1 to L do
13             Calculate $ϕ_{l}^{(i)}$ according to Eq. (8);
14         end
15         Evaluate, using the cluster conditional likelihood in Eq. (11), p(θ_{z_i}|x₁, …, x_i, z₁, …, z_i) ∝ p₀(θ_{z_i}) ∏_{j:z_j=z_i,1≤j≤i} f(x_j|θ_{z_i});
16     end
17     for d = 1 to D do
18         Calculate p(γ_d = r|X, Z) according to Eqs. (12) and (13);
19         Set γ_d = arg max_{r∈{0,1}} p(γ_d = r|X, Z);
20     end
21     t ← t + 1
22 end

2.3.2. Initialisation strategies for SUGSVarSel

Like the SUGS algorithm, the output of SUGSVarSel depends upon the initial ordering of the observations. It moreover depends upon the initialisation of the variable selection switches, Γ⁽⁰⁾. To address this latter issue, we propose a random sub-sampling initialisation strategy, which proceeds as follows. First, randomly select p₁ variables (with 1 < p₁ ≤ D) and apply SUGSVarSel to the resulting dataset $\tilde{X}$ of size n × p₁ with a small number of random orderings of the observations (we find that 10 works well in practice). The initial indicators for the variables of $\tilde{X}$ , which we write as ${\tilde{Γ}}^{(0)}$ , are set as all-on (γ_d = 1 for these p₁ variables), and ${\tilde{Γ}}^{(0)}$ is held the same for each of the random orderings. For each of the random orderings, this approach outputs $\tilde{Z}$ for all observations but $\tilde{Γ}$ for only the subset of p₁ variables. To obtain Γ for all D variables, we use the cluster allocations $\tilde{Z}$ and the full data X to compute probabilities for the remaining variables using Equations (12) and (13), and then greedily assign the indicator variables. A single best model generated by these random orderings is selected using the ML. This procedure returns a Γ₁ ∈ {0, 1}^D; that is, variable selection switches with some variables switched on and other variables switched off.

We repeat this process for a total of M random sub-samples of the variables to produce a set of clusterings Z₁, …, Z_M and a set of variable switches Γ₁, …, Γ_M. These variable sets are then used as initial inputs Γ⁽⁰⁾ = Γ_i for i = 1, …, M for the SUGSVarSel algorithm (which is now run using all D variables) with Q new random orderings (again we find that 10 is sufficient in practice). SUGSVarSel with this sub-sampling initialisation strategy returns Q models for each random sub-sample of the variables. Thus, we have QM models from which to choose. For each model obtained in this way, we calculate the marginal likelihood (see Section 2.2.2). We can then perform BMS to obtain a single “best” model, or we can use Bayesian model averaging (BMA; see next section).

2.4. Bayesian model-averaged co-clustering matrices

2.4.1. Bayesian model averaging

The output of our algorithm is a set of clusterings, each with an associated variable set and marginal likelihood. One can select a single “best” model amongst these possible clusterings; however, we can also average over these models to capture the model uncertainty. This idea is called Bayesian model averaging (BMA), and we apply it to clustering and variable selection (Madigan & Raftery, 1994; Hoeting et al., 1999; Russell, Murphy & Raftery, 2015).

For each model we form a co-clustering matrix S. S is defined in the following way:

S_{i j} = \begin{cases} 0, & \text{if } z_{i} \neq z_{j}, \\ 1, & \text{if } z_{i} = z_{j} . \end{cases}

(14)

That is, the (i, j)th entry of S is 1 if observations x_i and x_j are in the same cluster and 0 otherwise. We note that S is invariant to relabelling of the clusters and does not depend on the number of clusters. Now, suppose we have M models ℳ₁, …, ℳ_M, letting X be our observations and θ_m be the parameters associated with model ℳ_m. The posterior probability for ℳ_m is given by

p (M_{m} ∣ X) = \frac{p (X ∣ M_{m}) p_{0} (M_{m})}{\sum_{l = 1}^{M} p (X ∣ M_{l}) p_{0} (M_{l})},

(15)

where

P (X ∣ M_{m}) = \int P (X ∣ θ_{m}, M_{m}) P (θ_{m} ∣ M_{m}) d θ_{m} .

(16)

The marginal likelihood (16) is the key quantity for model comparison and can be interpreted as the weight given to each proposed model. Further note the two sources of averaging: the averaging over the parameters in the ML and the averaging over the models in Equation (15). We suppose that a priori all models are equally likely, choosing the prior on each model to be p₀(ℳ_m) = 1/M. One computational challenge posed by (15) is computing the summation, since it can involve evaluating possibly thousands of models. To overcome this, one can discard models that describe the observations poorly compared with the best model. More precisely, let us form Occam’s window (Hoeting et al., 1999):

W = \{ M_{k} : \frac{\max_{l} p (M_{l} ∣ X)}{p (M_{k} ∣ X)} \leq K \},

(17)

where K is a tuning parameter. Occam’s window is the set of all possible models within a reasonable Bayes factor from the best model under consideration. The summation in (15) is then replaced with a summation over the set 𝒲.
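A minimal Python sketch of Occam's window is given below. It assumes the uniform model prior described above, so that ratios of posterior model probabilities reduce to ratios of marginal likelihoods, and it works with log marginal likelihoods for numerical stability. The function names and the argument `max_ratio` (playing the role of the tuning parameter K) are illustrative.

```python
import math


def occam_window(log_mls, max_ratio):
    """Indices of models whose posterior ratio to the best model is at most max_ratio."""
    best = max(log_mls)
    threshold = math.log(max_ratio)
    return [m for m, lm in enumerate(log_mls) if best - lm <= threshold]


def model_weights(log_mls, window):
    """Posterior model probabilities of Equation (15), restricted to the window."""
    best = max(log_mls[m] for m in window)
    unnorm = {m: math.exp(log_mls[m] - best) for m in window}
    total = sum(unnorm.values())
    return {m: w / total for m, w in unnorm.items()}


log_mls = [-100.0, -101.0, -120.0]       # hypothetical log marginal likelihoods
window = occam_window(log_mls, max_ratio=20.0)
weights = model_weights(log_mls, window)
```

Subtracting the largest log marginal likelihood before exponentiating keeps the weights finite even when the raw marginal likelihoods would underflow.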

2.4.2. Averaging the co-clustering matrices

We can form the Bayesian model-averaged co-clustering matrix (BMAC) by taking the set of co-clustering matrices S_𝒲 and averaging, weighting by their ML:

S_{B M A C} = \frac{\sum_{m \in W} p (X ∣ M_{m}) S_{m}}{\sum_{l \in W} p (X ∣ M_{l})} .

(18)

The BMA of the variable set can be found in the same way by averaging over the weighted variable sets for each model:

F_{B M A} = \frac{\sum_{m \in W} p (X ∣ M_{m}) F_{m}}{\sum_{l \in W} p (X ∣ M_{l})},

(19)

where we denote by ℱ_m the variable set associated with model ℳ_m.
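The co-clustering matrix of Equation (14) and its model-averaged version can be sketched in Python as follows. The sketch assumes that normalised model weights (marginal likelihoods divided by their sum over Occam's window) have already been computed; the two partitions and their weights are made-up values for illustration.

```python
def coclustering(z):
    """Co-clustering matrix S of Equation (14) as a list of rows."""
    n = len(z)
    return [[1 if z[i] == z[j] else 0 for j in range(n)] for i in range(n)]


def averaged_coclustering(partitions, weights):
    """Weighted average of co-clustering matrices; weights are assumed to sum to 1."""
    n = len(partitions[0])
    avg = [[0.0] * n for _ in range(n)]
    for z, w in zip(partitions, weights):
        s = coclustering(z)
        for i in range(n):
            for j in range(n):
                avg[i][j] += w * s[i][j]
    return avg


partitions = [[1, 1, 2], [1, 2, 2]]      # two hypothetical models in the window
weights = [0.75, 0.25]                   # their normalised marginal likelihoods
s_bmac = averaged_coclustering(partitions, weights)
```

Each entry of the averaged matrix can be read as the weighted proportion of models in which the two observations are clustered together. Because the co-clustering matrix ignores cluster labels, models with different labellings or different numbers of clusters can be averaged directly.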

3. Comparisons with the state-of-the-art

We compare SUGSVarSel to a number of alternative algorithms, and demonstrate the performance of our method in two situations. The first is the p > n paradigm, where the number of variables exceeds the number of observations. The second considers n > p, with n = 1000. In both cases, we consider a variety of scenarios in which different proportions of the variables are relevant.

3.1. Alternative methods for clustering and variable selection

We compare our method with the current state-of-the-art, including methods that do and do not perform variable selection. These include: mclust, a finite mixture model based clustering method (Fraley & Raftery, 2002; Fraley et al., 2012; Scrucca et al., 2016); DP-means, a non-parametric interpretation of K-means (Kulis & Jordan, 2012); clustvarsel, a finite mixture model method with variable selection (Raftery & Dean, 2006; Maugis, Celeux & Martin-Magniette, 2009; Scrucca & Raftery, 2014); the original sequential updating and greedy search algorithm (Wang & Dunson, 2011) as implemented in our sugsvarsel R package; and VarSelLCM, a model-based clustering and variable selection approach using the integrated complete-data likelihood (Marbac & Sedki, 2017).

3.2. High-dimensional example

In the first example, we simulate a mixture of 3 Gaussians with mixture proportions 0.5, 0.3, 0.2 centred at (0,0,..,0), (2,2, …,2), (−2,−2,…,−2) respectively, each with variance-covariance matrix equal to the identity. The irrelevant variables are simulated from a standard Gaussian. First, we simulate 100 observations from this model with 200 variables and explore varying the number of relevant variables.
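A dataset of this form could be generated along the following lines in Python. This is a sketch under stated assumptions: component memberships are sampled at random with the given mixture proportions, the relevant variables are placed in the first columns, and the function name, seed, and default arguments are hypothetical rather than the simulation code used for the paper.

```python
import random


def simulate_mixture(n=100, n_vars=200, n_relevant=100, seed=0):
    """Simulate a 3-component Gaussian mixture with irrelevant N(0, 1) variables."""
    rng = random.Random(seed)
    proportions = [0.5, 0.3, 0.2]
    centres = [0.0, 2.0, -2.0]                # per-variable mean of each component
    labels = rng.choices([0, 1, 2], weights=proportions, k=n)
    data = []
    for label in labels:
        # Relevant variables: unit variance, centred at the component mean.
        row = [rng.gauss(centres[label], 1.0) for _ in range(n_relevant)]
        # Irrelevant variables: standard Gaussian regardless of the component.
        row += [rng.gauss(0.0, 1.0) for _ in range(n_vars - n_relevant)]
        data.append(row)
    return data, labels


data, labels = simulate_mixture(n=100, n_vars=200, n_relevant=10)
```

Varying `n_relevant` reproduces the different proportions of relevant variables considered in the scenarios.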

When running SUGS and SUGSVarSel we use the same prior specification for both methods and 30 random orderings of the data. Throughout this article, we always perform 2 iterations of variable selection in the SUGSVarSel algorithm. To initialise variable selection in SUGSVarSel, we subsample 10% of the variables 20 times to produce an initial variable selection set. For SUGS we choose the partition with maximal PML (as advised in the original SUGS paper by Wang & Dunson, 2011), while for SUGSVarSel we select the result with maximal ML. Prior choices for SUGS and SUGSVarSel can be found in the Supplementary Material. For mclust and clustvarsel, we find the appropriate number of clusters using a sequential search up to a maximum of 9 possible clusters. We then use the Bayesian Information Criterion (BIC) to select an appropriate model (Schwarz, 1978). For DP-means we repeat the algorithm over a range of penalty parameters λ ∈ {0.01, 0.1, 1, 10, 100, 200, 400, 600, 800, 1000} and select the partition which minimises the DP-means objective function. For VarSelLCM we run the algorithm up to a maximum of 9 possible clusters and select an appropriate model using the Maximum Integrated Complete-data Likelihood (MICL) (Marbac & Sedki, 2017, 2018). All methods are run in serial for fair comparison.

Results are presented in Tables 1–4. In all tables, we provide runtimes for each of the methods, indicate the proportion of relevant and irrelevant variables that each method correctly identified (for methods without variable selection this is reported as 1 for relevant and 0 for irrelevant variables), and report the adjusted Rand index (ARI; Rand, 1971; Hubert & Arabie, 1985) between the clustering produced and the truth. We repeat all methods for 10 different random realisations of the datasets to produce a distribution of scores. We report the median scores, along with the lower and upper quartiles.
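For reference, the adjusted Rand index can be computed from the contingency table of two labellings using the standard Hubert and Arabie formula. The Python sketch below is an illustrative implementation, not the code used to produce the tables; the convention of returning 1 when the index is undefined (for example, when both labellings place everything in one cluster) is an assumption made for the example.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two labellings of the same observations."""
    n = len(a)
    if n < 2:
        return 1.0
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)     # expected index under chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:                 # degenerate case, index undefined
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


same = adjusted_rand_index([1, 1, 2, 2], [2, 2, 1, 1])      # relabelled copy
crossed = adjusted_rand_index([1, 1, 2, 2], [1, 2, 1, 2])   # maximally discordant
```

The index equals 1 for identical partitions up to relabelling, is close to 0 for chance-level agreement, and can be negative when agreement is worse than chance.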

Table 1. High-dimensional simulation example where 100 observations are simulated from a Gaussian mixture distribution with 3 components and 200 variables, in which 50% of variables are relevant.

Method	Time, secs	Correct relevant variables	Correct irrelevant variables	ARI
mclust	<1	1	0	1 [1, 1]
DP-means	<1	1	0	0.60 [0.37, 0.66]
clustvarsel	14280.8 [10431.6, 20310.4]	0.47 [0.45, 0.48]	1 [1, 1]	1 [1, 1]
SUGS	0.92 [0.90, 0.97]	1	0	0.955 [0.90, 0.97]
SUGSVarSel	24.6 [23.8, 24.9]	1 [1, 1]	1 [1, 1]	1 [1, 1]
VarSelLCM	620.0 [574.9, 650.8]	1 [1, 1]	1 [1, 1]	1 [1, 1]


Table 4. High-dimensional simulation example where 100 observations are simulated from a Gaussian mixture distribution with 3 components and 200 variables, in which 5% of variables are relevant.

Method	Time, secs	Correct relevant variables	Correct irrelevant variables	ARI
mclust	<1	1	0	0 [0, 0]
DP-means	<1	1	0	0 [0, 0]
clustvarsel	2183.1 [802.5, 2992.4]	0.1 [0.1, 0.1]	0.879 [0.814, 0.959]	0 [0, 0]
SUGS	6.30 [6.07, 10.11]	1	0	0.04 [0.02, 0.05]
SUGSVarSel	19.9 [19.7, 20.5]	1 [1, 1]	1 [1, 1]	1 [1, 1]
VarSelLCM	583.5 [521.6, 532.0]	1 [1, 1]	1 [1, 1]	1 [1, 1]


It is evident that methods that do not perform variable selection, such as mclust and SUGS, perform poorly when there are many irrelevant variables. The performance of clustvarsel here seems volatile, and it performs poorly at correctly selecting relevant features. VarSelLCM and SUGSVarSel are competitive in terms of variable selection and clustering. However, VarSelLCM requires an exhaustive search over the number of clusters, which makes this method computationally costly to apply when the number of clusters is not known. SUGSVarSel is faster than the other methods that perform both variable selection and clustering, while also automatically inferring the number of clusters in the data. We proceed to evaluate the performance of SUGSVarSel on large simulated datasets.

3.2.1. Increasing the number of observations

We simulate from the same distribution as before, but instead sample 1000 observations and only 100 variables, and the irrelevant variables are simulated from a standard Gaussian distribution. All priors are the same as in the previous analysis and we sub-sample 10% of the variables 10 times to produce an initial variable selection set. We repeat SUGS and SUGSVarSel for 10 random orderings of the data. We compare the scalable methods mclust, DP-means, SUGS, SUGSVarSel and VarSelLCM, where 25%, 10% and 5% of the variables are relevant. For SUGS we choose the partition with maximal PML, while for SUGSVarSel we select the result with maximal ML. For VarSelLCM we run the algorithm for 1 to 4 possible clusters and select an appropriate model using the MICL, as previously. Results are presented in Tables 5–7.

Table 5. Simulation example where 1000 observations are simulated from a Gaussian mixture distribution with 3 components and 100 variables, in which 25% of variables are relevant.

Method	Time, secs	Correct relevant variables	Correct irrelevant variables	ARI
mclust	11.2 [10.9, 11.6]	1	0	0 [0, 0]
DP-means	22.3 [21.7, 23.1]	1	0	0.30 [0.25, 0.36]
SUGS	3.4 [3.1, 3.6]	1	0	0.98 [0.97, 0.98]
SUGSVarSel	31.2 [30.7, 31.8]	1 [1, 1]	1 [1, 1]	1 [1, 1]
VarSelLCM	3596.8 [2639.5, 7537.7]	1 [1, 1]	1 [1, 1]	1 [1, 1]


Table 7. Simulation example where 1000 observations are simulated from a Gaussian mixture distribution with 3 components and 100 variables, in which 5% of variables are relevant.

Method	Time, secs	Correct relevant variables	Correct irrelevant variables	ARI
mclust	11.4 [11.2, 15.7]	1	0	0 [0, 0]
DP-means	22.0 [21.1, 22.7]	1	0	0 [0, 0]
SUGS	6.3 [5.6, 11.1]	1	0	0 [0, 0]
SUGSVarSel	60.8 [59.8, 64.2]	1 [1, 1]	1 [0.99, 1]	0.78 [0.54, 0.92]
VarSelLCM	2688.8 [2588.9, 2878.6]	1 [1, 1]	1 [1, 1]	0.943 [0.931, 0.945]


mclust, SUGS and DP-means produce poor quality clusterings, because irrelevant variables present in the data render finding the true underlying clustering structure challenging. SUGSVarSel and VarSelLCM produce high quality answers in all situations, but SUGSVarSel is one to two orders of magnitude faster. Note, however, that to alleviate the computational burden we searched up to a maximum of only 4 clusters in VarSelLCM, providing it with an easier opportunity to produce high quality clusterings. In applications to real data this maximum would have to be much larger, adding considerably to computational time, whereas the inference of the number of clusters is automatic in SUGSVarSel.

3.3. Advantages of Bayesian model averaging

As an example, we simulate a dataset with 30 observations from a mixture of 3 Gaussians, where two of the Gaussians are isotropic and centred at (2, 2) and (−3, −3), respectively, each with mixing weight 0.4. The third component has mixing weight 0.2 and is centred at (−3, 4), but its covariance matrix has 2 on the diagonal and 1 on the off-diagonals, violating our independence assumption. We additionally include 2 components of irrelevant variables generated from standard Gaussians. Our prior specifications are set as in the previous section. Simply using the ML to pick a partition results in an ARI of 0.635 between the clustering produced and the truth. However, we can also perform BMA and then summarise our co-clustering. We applied hierarchical clustering with average linkage to compute a clustering, an approach which has previously been applied to posterior similarity matrices (Medvedovic, Yeung & Bumgarner, 2004; Fritsch & Ickstadt, 2009; Liverani et al., 2015) (see Supplementary Material for complete details). This clustering then produces an ARI of 0.875. The heatmap of the co-clustering matrix is plotted in Figure 1, allowing us to visualise the uncertainty in the clustering.
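The idea of a BMA co-clustering matrix can be illustrated with a short sketch. It is not the sugsvarsel implementation: it assumes that each explored model contributes one partition together with its log marginal likelihood, and that the models have equal prior probability, so that each model's weight is proportional to its marginal likelihood. The function name `bma_coclustering` and its arguments are hypothetical.

```python
from math import exp


def bma_coclustering(partitions, log_mls):
    """Model-averaged co-clustering matrix.

    partitions: list of label vectors, one per explored model
    log_mls:    log marginal likelihood of each model (same order)
    Returns an n x n matrix whose (i, j) entry is the total weight of
    the models that place observations i and j in the same cluster.
    """
    # Normalise the weights stably by subtracting the largest log value.
    top = max(log_mls)
    raw = [exp(v - top) for v in log_mls]
    total = sum(raw)
    weights = [r / total for r in raw]

    n = len(partitions[0])
    matrix = [[0.0] * n for _ in range(n)]
    for labels, w in zip(partitions, weights):
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += w
    return matrix
```

One minus this matrix can then be treated as a dissimilarity and passed to any average-linkage hierarchical clustering routine to obtain a single summary partition.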

A heatmap of the BMA co-clustering matrix, where dark blue indicates the probability of being in the same cluster is 1 and white indicates a probability of 0 of belonging to the same cluster. The component annotation bar indicates the true component labels and the cluster annotation bar indicates the clustering obtained from summarising the BMA co-clustering matrix.

4. Applications to cancer subtyping

4.1. Application to leukaemia dataset

In this section, we apply SUGSVarSel to real biological datasets. The first is a well-studied genomic clustering problem: the separation of acute myeloid leukaemia (AML) and the B/T-cell subtypes of acute lymphoblastic leukaemia (ALL) samples on the basis of microarray transcriptomic data. We use the dataset described by Golub et al. (1999), which comprises 38 samples, 27 of which are ALL (8 T-cell and 19 B-cell related), and 11 of which are AML cases. Initial preprocessing is performed as in Dudoit, Fridlyand, and Speed (2002), which reduces the dimension of the dataset from 6817 to 3051 genes. In Dudoit, Fridlyand, and Speed (2002), a further dimension reduction step is performed that makes use of the AML and ALL class labels, so that only those genes that have a high ratio of their between-class to within-class sums of squares are retained. Here we instead wish to adopt a completely unsupervised approach, so that we may use the known ALL-AML class labels in order to validate our results.

We select the 200 most variable genes and then normalise, so the expression values for each gene are mean-centred at 0 with variance 1. 200 genes were chosen because this led to good predictive performance in previous analysis of these data (Golub et al., 1999; Dudoit, Fridlyand & Speed, 2002). We then apply SUGSVarSel to the resultant dataset. We sub-sample 10% of the variables 20 times to produce an initial variable selection set, and run the algorithm for 100 random orderings. We adopt our default priors and summarise the output using BMA. A final summary clustering is obtained by performing hierarchical clustering with average linkage (Fritsch & Ickstadt, 2009). We use the ARI to compare our results to the truth (of 3 classes) and repeat the process 10 times and report the average results.
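The filtering and normalisation step can be sketched as below. This is an illustration only: it assumes the data are held as a list of samples (rows) by genes (columns), ranks genes by sample variance, and scales using the sample standard deviation. The function name `top_variable_standardised` is hypothetical, and the retained genes are assumed to be non-constant.

```python
from statistics import mean, stdev, variance


def top_variable_standardised(rows, k):
    """Keep the k highest-variance columns and standardise each of them.

    rows: list of samples, each a list of expression values
    k:    number of columns (genes) to retain
    Returns (kept column indices, standardised rows).
    """
    columns = list(zip(*rows))
    # Rank columns by sample variance, largest first.
    order = sorted(range(len(columns)),
                   key=lambda j: variance(columns[j]), reverse=True)
    keep = sorted(order[:k])

    scaled_columns = []
    for j in keep:
        centre, spread = mean(columns[j]), stdev(columns[j])
        # Mean-centre at 0 and scale to unit variance.
        scaled_columns.append([(v - centre) / spread for v in columns[j]])
    return keep, [list(r) for r in zip(*scaled_columns)]
```

For the leukaemia data described above, `k` would be 200.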

Results are illustrated in Figure 2. The final clustering result provides an ARI of 0.831, which is in line with previous analyses performed on this dataset (Golub et al., 1999; Dudoit, Fridlyand & Speed, 2002). The algorithm selects a total of 92 genes, including TCL1, TCRB, IL8, EPB72, IL7R, TCRG and NFIL6, which are all known to be associated with leukaemia (Natsuka et al., 1992; Pekarsky, Hallas & Croce, 2001; Van der Velden et al., 2004; Kuett et al., 2015; Chen, Tsau & Lin, 2010; Shochat et al., 2011). A full list of the selected genes (including their descriptions) can be found in the Supplementary Material. The advantage of our analysis over other methods is that we did not need to specify the number of clusters – the algorithm automatically inferred 3 clusters in the data, which have excellent correspondence to the known classes of AML and ALL, as well as the 2 ALL subgroups.

A PCA plot of the microarray expression data of 38 patients from the Golub et al. (1999) dataset, using the 200 most variable genes. The different symbols indicate the clustering produced by the SUGSVarSel algorithm after summarising the BMA co-clustering matrix using hierarchical clustering with average linkage. The colours indicate the annotated sub-types.

To assess the importance of variable selection, we also apply mclust and the original SUGS algorithm to the data. We run the mclust algorithm, performing a systematic search over the number of clusters up to a maximum of 9, and select the number of clusters which maximises the BIC. This criterion selects 3 clusters and the clustering produced gives an adjusted Rand index of 0.627 – the inclusion of irrelevant variables has led to reduced cluster quality. We run SUGS using our default prior choices and using the PML criterion to select a clustering. The algorithm was run for 100 random orderings and we repeated the process 10 times, reporting an average ARI of 0. The lack of variable selection renders SUGS unable to produce a meaningful clustering. In Figure 3, we visualise the BMA co-clustering matrix for these data when applying the SUGSVarSel algorithm.

A heatmap of the BMA co-clustering matrix for the 38 patients, when applying SUGSVarSel, demonstrating the added benefit of visualising uncertainty. The annotation bars on the left indicate the correspondence between the clusters and the subtypes.

4.2. Application to TCGA breast cancer dataset

We demonstrate SUGSVarSel on a further genomics dataset. We analyse a breast cancer tumour expression dataset from The Cancer Genome Atlas (TCGA) (Cancer Genome Atlas Network, 2012), which we pre-process in the same way as in Lock and Dunson (2013). The processed expression dataset comprises 348 tumours with 645 genes, of which 14 belong to the PAM50 (Prediction Analysis of Microarray) group of genes (Parker et al., 2009).

Analysis was performed in the following way. We first standardise our data so that each column is mean-centred with variance 1. We then subsample 10% of the variables 64 times to produce an initial variable set. We then apply the SUGSVarSel algorithm with default settings. We summarise our output by performing BMA and then hierarchical clustering with average linkage.

SUGSVarSel reveals two clusters in the dataset, the second of which is significantly associated with Basal-like tumours (Fisher test, p < 0.0001). The algorithm selects 245 variables to discriminate between the groups. We perform PCA before and after variable selection to demonstrate that the reduced variable set produces more separable and therefore more interpretable clusters. Furthermore, the algorithm selected 13 out of a total of 14 of the PAM50 genes, which is significantly better than random (Fisher Test, p < 0.0001).
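To give a feel for the PAM50 enrichment, the sketch below computes a one-sided hypergeometric tail probability, which is the quantity underlying a one-sided Fisher's exact test. It assumes the natural contingency table (645 genes in total, 14 of them PAM50 genes, 245 selected, 13 PAM50 genes among the selected); the exact table and sidedness used in the original analysis are not stated, so this is an illustration rather than a reproduction. The function name `hypergeom_upper_tail` is our own.

```python
from math import comb


def hypergeom_upper_tail(population, successes, draws, observed):
    """P(X >= observed) when drawing without replacement.

    population: total number of items (here, genes)
    successes:  number of marked items in the population (PAM50 genes)
    draws:      number of items drawn (selected genes)
    observed:   number of marked items among those drawn
    """
    total = comb(population, draws)
    upper = min(successes, draws)
    # Count the draws that contain at least `observed` marked items.
    hits = sum(comb(successes, k) * comb(population - successes, draws - k)
               for k in range(observed, upper + 1))
    return hits / total


# Probability of selecting at least 13 of the 14 PAM50 genes by chance.
p_value = hypergeom_upper_tail(645, 14, 245, 13)
```

Under these assumptions the tail probability falls below the 0.0001 threshold quoted above.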

There is perhaps concern that variable selection could remove genes that are relevant for clustering in the situation where we have a highly informative set of variables. We therefore consider the following task: clustering the breast cancer tumours using the PAM50 genes from the total unprocessed dataset (that is, without the filtering of Lock and Dunson (2013)), of which there are 48. We apply SUGSVarSel in identical fashion to before, sub-sampling 10% of the variables 4 times to produce an initial variable set. We obtain 5 clusters which correspond well to the different breast cancer subgroups.

Cluster 1 is associated with Luminal A cancers, cluster 2 is associated with Luminal B cancers, cluster 3 with basal-like tumours, and cluster 4 contains mostly HER2 type breast cancers (chi-squared p < 0.0001). Thus, hardly surprisingly, the clusters produced on the PAM50 data coincide well with the PAM50 subgroups. Furthermore, 87.5% of the genes were selected, which is more than we expect given our prior, telling us this was a highly informative set of genes.

The clusterings shown in Figure 4 and Figure 5 demonstrate that the variables we use for clustering are critically important. The two different pre-filtering choices led to results of varying quality and biological meaning. This is strong evidence in support of model-based variable selection rather than ad-hoc preprocessing.

PCA plots of the TCGA breast cancer data, where clusters produced by SUGSVarSel are indicated by shape and subtypes by colour. The left PCA plot demonstrates smaller and tighter clusters using only the variables that remained after variable selection. In the right-hand plot all variables were used to produce the plot.

5. Pan-cancer proteomic characterisation

In this section we apply our method to The Cancer Proteome Atlas (TCPA) datasets (Li et al., 2013; Akbani et al., 2014; Städler et al., 2017). The dataset contains a large number of tumours and cell line samples with protein expression levels generated using reverse-phase protein arrays (RPPAs). Our method allows us to perform a number of tasks on this data; in particular, for each cancer we can detect possible subgroups and the relevant proteins which discriminate these subgroups. We can also perform a pan-cancer analysis to explore the differences and similarities between cancers. Pan-cancer studies can unravel inter-cancer relationships which are important for developing new clinical targets (Weinstein et al., 2013; Uhlen et al., 2017; Berger et al., 2018; Hoadley et al., 2018). Recent pan-cancer analyses have suggested that cancers should be classified based on their molecular signatures rather than tissue of origin (Berger et al., 2018; Hoadley et al., 2018) and this motivates our analysis.

As is usual with this type of data, there are irrelevant variables, so methods that do not perform variable selection, such as mclust and SUGS, are ill-suited. Furthermore, there is little a priori knowledge about the number of clusters, so methods such as VarSelLCM and clustvarsel, which require an exhaustive search over the number of clusters, are inappropriate. Performing the analysis on all cancer datasets would be prohibitively slow with the slowest of these methods.

The TCPA datasets contain data on 19 cancer types and the description of these cancers can be found in Supplementary Material. The total dataset consists of over 5000 tumour samples with only a few samples for some cancers and hundreds of samples for others and several hundreds of proteins. The merged PAN-Can 19 level 4 dataset is used in the following analysis, since it is appropriate for multiple disease analysis. More information about the data can be found here http://tcpaportal.org/tcpa/, where the data itself can also be downloaded. In addition, we standardise the expression levels for each protein so that they are zero-centred with unit variance.

Table 8 shows the number of cases for each cancer type.

Table 8. A table indicating the different cancer types and the number of observations from each of those cancers.

ACC	BLCA	BRCA	COAD	GBM	HNSC	KIRC	KIRP	LGG	LUAD
46	127	820	327	205	203	445	208	257	234
LUSC	OV	PAAD	PRAD	READ	SKCM	STAD	THCA	UCEC
192	411	105	164	129	207	299	374	404


We only keep proteins which have been measured on all cancers, of which there are 217, so our dataset has a total of 5157 tumour samples with 217 variables. We apply SUGSVarSel to this data by first sub-sampling 10% of the variables 43 times (a fifth of the total number of variables). Using the same priors as in the previous analyses, we analyse this data using the SUGSVarSel algorithm, running the algorithm for 50 random orderings and thus exploring a total of 2150 models. We summarise the BMA clustering using hierarchical clustering with average linkage. The summarised clustering contains 60 clusters; however, many of these clusters contain only a few observations. Reassuringly, there are 18 clusters with more than 20 observations, and we focus on these for our analysis. A table summarising the clusters, along with results from hierarchical clustering, can be found in the Supplementary Material. Figure 6 shows a heatmap of the clusterings.

In addition, we plot a heatmap of the data with the clustering produced by SUGSVarSel using only the proteins selected by the algorithm (Figure 7).

A heatmap of the expression data using the clustering produced by the SUGSVarSel algorithm applied to the pan-cancer TCPA dataset. The annotation bars at the top of the plot indicate the different cancers and clusters.

It is rare that a cancer associates with a single cluster; however, there are evident relationships between cancers and clusters. Cluster A contains predominantly women's cancers (OV, UCEC, BRCA), while cluster B contains a large spread of cancers. Clusters C, E and F contain the cancers of the digestive tract (STAD, COAD and READ). Cluster D contains a subgroup of breast cancers (BRCA), while cluster G contains solely kidney cancer (KIRC). Clusters H and I contain cancers of the brain (LGG, GBM). Clusters J and P contain aero-digestive cancers (HNSC, LUAD, LUSC). Thyroid cancer (THCA) is spread across clusters K, L and B, whilst KIRP is predominantly found in cluster M. Pancreatic cancer (PAAD) is split across clusters N and B. Cluster O contains the majority of breast cancer patients. Prostate cancer (PRAD) is predominantly found in cluster Q, while R forms a small cluster of stomach cancers. This is in line with other analyses performed on these data (Akbani et al., 2014; Hoadley et al., 2014; Şenbabaoğlu et al., 2016). A total of 147 proteins were selected as relevant for clustering.

We now consider an illustrative example. Figure 6 shows us that clusters K and L contain only thyroid cancers. It is of biological interest to see what drives the differences between these clusters, as they could define clinically relevant thyroid subgroups. Considering only the 147 selected proteins, we plot the expression profiles of the 20 proteins with the smallest p-values among those that differ significantly between clusters K and L (Figure 8; t-test (Welch, 1947), p < 0.00001, using Benjamini-Hochberg correction (Benjamini & Hochberg, 1995)).
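The multiple-testing correction mentioned above can be sketched as follows. This is a generic implementation of the Benjamini-Hochberg step-up adjustment, not the code used for the analysis; the function name `benjamini_hochberg` is our own, and the per-protein p-values are assumed to have been computed beforehand (for example with Welch's t-test).

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])

    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Proteins whose adjusted p-value falls below the chosen threshold are declared significantly different between the two clusters, and the 20 with the smallest p-values would then be retained for plotting.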

A heatmap of the expression TCPA data for the thyroid subgroups. We have plotted the expression for only the top 20 proteins which are significantly different between clusters K and L.

We do not observe an over-representation of any of the thyroid cancer subtypes within either of these clusters (see Table 9).

Table 9. A table showing the distribution of 3 different THCA subtypes across the clusters K and L produced by the SUGSVarSel algorithm.

	K	L
Thyroid papillary carcinoma – classical/usual	31	72
Thyroid papillary carcinoma – follicular (>= 99% follicular patterned)	17	25
Thyroid papillary carcinoma – tall cell (>= 50% tall cell features)	2	6


Note that this information was not available for all patients.

6. Conclusion

In this article we presented SUGSVarSel, an extension to the SUGS algorithm of Wang and Dunson (2011) to allow variable selection. We demonstrated that when irrelevant variables are present the quality of the clustering can be degraded and clusters become more challenging to interpret. SUGSVarSel allows the flexibility of a Bayesian nonparametric approach but inference is considerably faster than using MCMC. Indeed, the SUGSVarSel algorithm infers the number of clusters automatically and performs inference for the Dirichlet process hyperparameter. This is in contrast to most clustering with variable selection methods which require a systematic search over the number of clusters.

Whilst our method is approximate, it performs competitively with other commonly used approaches. Furthermore, we take advantage of exploring many models by performing Bayesian model averaging, which is important for exploring uncertainty in our clustering. We remark that model uncertainty and the application of BMA are rarely explored in clustering tasks. We have provided an R package to facilitate dissemination of our method, utilising C++ to accelerate intensive computations and parallel processing features to make further computational gains.

Application to two cancer transcriptomic datasets shows the clear benefit of simultaneously performing variable selection and clustering. We demonstrate that variable selection improves interpretation of these datasets, providing the genes that drive the clustering structure of the data, as well as identifying those that are irrelevant for clustering. We further applied our method to a pan-cancer proteomic dataset for which none of the current model-based clustering and variable selection methods are suitable. SUGSVarSel is able to provide a characterisation of 5157 tumour samples, demonstrating clustering relationships across cancer types based on their molecular signature rather than the tissue of origin.

There are a number of ways in which our proposed method could be extended. Firstly, our assumption that variables are conditionally independent given the cluster allocations might be unrealistic for some datasets. In such cases, more elaborate variable selection methods might be desirable, although this is likely to come at increased computational cost. Furthermore, we have assumed conjugacy throughout, so that the marginal likelihood in Equation (5) may be evaluated analytically. As noted in the original SUGS paper of Wang and Dunson (2011), one possible way to extend to non-conjugate cases would be to approximate this marginal likelihood, e.g. using a Laplace approximation.

Supplementary Material

Supplementary File

EMS158447-supplement-Supplementary_File.zip (27.3 MB, zip)

Supplementary File 1

EMS158447-supplement-Supplementary_File_1.zip (27.3 MB, zip)

Table 2. High-dimensional simulation example where 100 observations are simulated from a Gaussian mixture distribution with 3 components and 200 variables, in which 25% of variables are relevant.

Method	Time, secs	Correct relevant variables	Correct irrelevant variables	ARI
mclust	<1	1	0	1 [1, 1]
DP-means	<1	1	0	0.74 [0.70, 0.79]
clustvarsel	1852.3 [1185.2, 5880.8]	0.02 [0.02, 0.02]	0.847 [0.812, 0.945]	0.01 [0.00, 0.04]
SUGS	2.07 [1.89, 2.16]	1	0	0.78 [0.72, 0.84]
SUGSVarSel	21.9 [21.9, 22.1]	1 [1, 1]	1 [1, 1]	1 [1, 1]
VarSelLCM	487.7 [481.3, 494.1]	1 [1, 1]	1 [1, 1]	1 [1, 1]


Table 3. High-dimensional simulation example where 100 observations are simulated from a Gaussian mixture distribution with 3 components and 200 variables, in which 10% of variables are relevant.

Method	Time, secs	Correct relevant variables	Correct irrelevant variables	ARI
mclust	<1	1	0	0 [0, 0]
DP-means	<1	1	0	0 [0, 0]
clustvarsel	3095.8 [2377.3, 3302.7]	0.05 [0.05, 0.10]	0.803 [0.778, 0.854]	0 [0, 0]
SUGS	5.02 [4.76, 5.23]	1	0	0.18 [0.13, 0.21]
SUGSVarSel	19.7 [19.5, 19.9]	1 [1, 1]	1 [1, 1]	1 [1, 1]
VarSelLCM	523.5 [521.6, 532.0]	1 [1, 1]	1 [1, 1]	1 [1, 1]


Table 6. Simulation example where 1000 observations are simulated from a Gaussian mixture distribution with 3 components and 100 variables, in which 10% of variables are relevant.

Method	Time, secs	Correct relevant variables	Correct irrelevant variables	ARI
mclust	11.0 [10.7, 11.4]	1	0	0 [0, 0]
DP-means	21.4 [21.0, 21.8]	1	0	0.11 [0.02, 0.22]
SUGS	5.1 [4.9, 5.3]	1	0	0.01 [0.01, 0.04]
SUGSVarSel	33.3 [33.0, 33.8]	1 [1, 1]	1 [1, 1]	0.90 [0.80, 0.97]
VarSelLCM	1938.5 [1852.3, 1973.9]	1 [1, 1]	1 [1, 1]	0.997 [0.994, 0.997]


Funding

Medical Research Council, Funder Id: http://dx.doi.org/10.13039/501100000265, Wellcome Trust Mathematical Genomics and Medicine student supported financially by the School of Clinical Medicine, University of Cambridge. Grant Number: MC_UU_00002/10, MC_UU_00002/13.

References

Akbani R, Ng PKS, Werner HMJ, Shahmoradgoli M, Zhang F, Ju Z, Liu W, Yang J-Y, Yoshihara K, Li J, Ling S, et al. A pan-cancer proteomic perspective on The Cancer Genome Atlas. Nat Commun. 2014;5:3887. doi: 10.1038/ncomms4887. [DOI] [PMC free article] [PubMed] [Google Scholar]
Antoniak CE. Mixtures of dirichlet processes with applications to Bayesian nonparametric problems. Ann Statist. 1974;2:1152–1174. [Google Scholar]
Attias H. Inferring parameters and structure of latent variable models by variational bayes; Proc 15th Conf on Uncertainty in Artificial Intelligence; San Francisco, CA, USA. 1999. pp. 21–30. [Google Scholar]
Attias H. In: Advances in Neural Information Processing Systems12. Solla SA, Leen TK, Müller K, editors. MIT Press; Denver, USA: 2000. A variational Bayesian framework for graphical models; pp. 209–215. [Google Scholar]
Benjamini Y, Hochberg Y. Controlling the false discovery rate: apractical and powerful approach to multiple testing. J Roy Stat Soc B Met. 1995;57:289–300. [Google Scholar]
Berger ACA, Korkut RS, Kanchi AM, Hegde W, Lenoir W, Liu Y, Liu H, Fan H, Shen V, Ravikumar A, Rao A, Schultz X, et al. A comprehensive pan-cancer molecular study of gynecologic and breast cancers.”. Cancer Cell. 2018;33:690–705.:e9. doi: 10.1016/j.ccell.2018.03.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
Blackwell D, MacQueen JB. Ferguson distributions via polya urn schemes. Ann Statist. 1973;1:353–355. [Google Scholar]
Blei DM, Jordan MI. Variational inference for Dirichlet process mixtures. Bayesian Anal. 2006;1:121–143. [Google Scholar]
Blei DMA, Kucukelbirand J, McAuliffe D. Variational inference: a review for statisticians. J Am Stat Assoc. 2016;112:859–877. [Google Scholar]
Chen AH, Tsauand Y-W, Lin C-H. Novel methods to identify biologically relevant genes for leukemia and prostate cancer from gene expression profiles. BMC Genomics. 2010;11:274. doi: 10.1186/1471-2164-11-274. [DOI] [PMC free article] [PubMed] [Google Scholar]
Constantinopoulos C, Titsias MK, Likas A. Bayesian feature and model selection for Gaussian mixture models. IEEE Trans Pattern Anal Mach Intell. 2006;28:1013–1018. doi: 10.1109/TPAMI.2006.111. [DOI] [PubMed] [Google Scholar]
Cooke EJ, Savage RS, Kirk PDW, Darkinsand R, Wild DL. Bayesian hierarchical clusteringfor microarray time series data with replicates and outlier measurements. BMC Bioinformatics. 2011;12:399. doi: 10.1186/1471-2105-12-399. [DOI] [PMC free article] [PubMed] [Google Scholar]
Darkins R, Cooke EJ, Ghahramani Z, Kirk PDW, Wild DL, Savage RS. Accelerating Bayesian hierarchical clustering of time series data with a randomised algorithm. PLoS One. 2013;8:e59795. doi: 10.1371/journal.pone.0059795. [DOI] [PMC free article] [PubMed] [Google Scholar]
Daumé H., III . In: AISTATS. Meila M, Shen X, editors. Puerto Rico; San Juan: 2007. Fast search for Dirichlet process mixture models; p. 8390. [Google Scholar]
Dudoit S, Fridlyand J, Speed TP. Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc. 2002;97:77–87. [Google Scholar]
Escobar MD. Estimating normal means with a dirichlet process prior. J Am Stat Assoc. 1994;89:268–277. [Google Scholar]
Escobar MD, West M. Bayesian density estimation and inference using mixtures. J Am Stat Assoc. 1995;90:577–588. [Google Scholar]
Ferguson TS. A Bayesian analysis of some nonparametric problems. Ann Statist. 1973;1:209–230. [Google Scholar]
Ferguson TS. Priordistributionson spaces of probability measures. Ann Statist. 1974;2:615–629. [Google Scholar]
Fop M, Murphy TB. Variable selection methods for model-based clustering. Stat Surv. 2018;12:1–48. [Google Scholar]
Fraley C, Raftery AE. Model-based clustering, discriminant analysis and density estimation. J Am Stat Assoc. 2002;97:611–631. [Google Scholar]
Fraley C, Raftery AE, Murphy TB, Scrucca L. mclust Version 4 for R: normal mixture modelingfor model-based clustering. classification, and density estimation. 2012 [Google Scholar]
Fritsch A, Ickstadt K. Improved criteria for clustering based on the posterior similarity matrix. Bayesian Anal. 2009;4:367–391. [Google Scholar]
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999;286:531–537. doi: 10.1126/science.286.5439.531. [DOI] [PubMed] [Google Scholar]
Heller K, Ghahramani Z. Bayesian hierarchical clustering; Proceedings of the 22nd International Conference on Machine Learning; Bonn, Germany. 2005. [Google Scholar]
Hoadley KAC, Yau DM, Wolf AD, Cherniack D, Tamborero S, Ng MD, Leiserson B, Niu MD, McLellan V, Uzunangelov J, Zhang C, et al. Multiplatform analysis of 12 cancer types reveals molecular classification within and across tissues of origin. Cell. 2014;158:929–944. doi: 10.1016/j.cell.2014.06.049. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hoadley KA, Yau C, Hinoue T, Wolf DM, Lazar AJ, Drill E, Shen R, Taylor AM, Cherniack AD, Thorsson V, Akbani R, Bowlby R, et al. Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer. Cell. 2018;173:291–304. doi: 10.1016/j.cell.2018.03.022. [DOI] [PMC free article] [PubMed] [Google Scholar]
Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: a tutorial. Statist Sci. 1999;14:382–417. [Google Scholar]
Hubert L, Arabie P. Comparing partitions. Journal of Classification. 1985;2:193–218. [Google Scholar]
Jain S, Neal RM. A split-merge markov chain monte carlo procedure for the dirichlet process mixture model. J Comput Graph Stat. 2004;13:158–182. [Google Scholar]
Jiang K, Kulis B, Jordan MI. Advances in Neural Information Processing Systems 25. Lake Tahoe, Nevada: 2012. Small-variance asymptotics for exponential family dirichlet process mixture models. [Google Scholar]
Jiang L, Dong Y, Chen N, Chen T. DACE: a scalable DP-means algorithm for clustering extremely large sequence data. Bioinformatics. 2016;33:834–842. doi: 10.1093/bioinformatics/btw722. [DOI] [PubMed] [Google Scholar]
Kim S, Tadesse MG, Vannucci M. Variable selection inclustering via dirichlet process mixture models. Biometrika. 2006;93:877–893. [Google Scholar]
Kuett A, Rieger C, Perathoner D, Herold T, Wagner M, Sironi S, Sotlar K, Horny H-P, Deniffel C, Drolle H, Fiegl M. Il-8as mediator in the microenvironment-leukaemia network in acute myeloid leukaemia. Sci Rep. 2015;5:18411. doi: 10.1038/srep18411. [DOI] [PMC free article] [PubMed] [Google Scholar]
Kulis B, Jordan MI. Revisiting k-means: new algorithms via Bayesian nonparametrics; International Conference on Machine Learning; 2012. [Google Scholar]
Law MHC, Figueiredo MAT, Jain AK. Simultaneous feature selection and clustering using mixture models. IEEE Trans Pattern Anal Mach Intell. 2004;26:1154–1166. doi: 10.1109/TPAMI.2004.71. [DOI] [PubMed] [Google Scholar]
Li J, Lu Y, Akbani R, Ju Z, Roebuck PL, Liu W, Yang J-Y, Broom BM, Verhaak RG, Kane DW, Wakefield C, Weinstein JN, Mills GB, Liang H. TCPA: a resource for cancer functional proteomics data. Nat Methods. 2013;10:1046–1047. doi: 10.1038/nmeth.2650. [DOI] [PMC free article] [PubMed] [Google Scholar]
Liverani S, Hastie DI, Azizi L, Papathomas M, Richardson S. PReMiuM: An R package for profile regression mixture models using Dirichlet processes. J Stat Softw. 2015;64(1) doi: 10.18637/jss.v064.i07. [DOI] [PMC free article] [PubMed] [Google Scholar]
Lo AY. On a class of Bayesian nonparametric estimates: i. density estimates. Ann Statist. 1984;12:351–357. [Google Scholar]
Lock EF, Dunson DB. Bayesian consensus clustering. Bioinformatics. 2013;29:2610–2616. doi: 10.1093/bioinformatics/btt425. [DOI] [PMC free article] [PubMed] [Google Scholar]
Madigan D, Raftery AE. Model selection and accounting for model uncertainty in graphical models using Occam’s window. J Am Stat Assoc. 1994;89:1535–1546. [Google Scholar]
Marbac M, Sedki M. Variable selection for model-based clustering using the integrated complete-data likelihood. Stat Comput. 2017;27:1049–1063. [Google Scholar]
Marbac M, Sedki M. VarSelLCM: an R/C++ package for variable selection in model-based clustering of mixed-data with missing values. Bioinformatics. 2018;35:1255–1257. doi: 10.1093/bioinformatics/bty786.
Maugis C, Celeux G, Martin-Magniette M-L. Variable selection for clustering with Gaussian mixture models. Biometrics. 2009;65:701–709. doi: 10.1111/j.1541-0420.2008.01160.x.
Medvedovic M, Yeung KY, Bumgarner RE. Bayesian mixture model based clustering of replicated microarray data. Bioinformatics. 2004;20:1222–1232. doi: 10.1093/bioinformatics/bth068.
Natsuka S, Akira S, Nishio Y, Hashimoto S, Sugita T, Isshiki H, Kishimoto T. Macrophage differentiation-specific expression of NF-IL6, a transcription factor for interleukin-6. Blood. 1992;79:460–466.
Neal RM. Markov chain sampling methods for Dirichlet process mixture models. J Comput Graph Stat. 2000;9:249–265.
Cancer Genome Atlas Network. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490:61–70. doi: 10.1038/nature11412.
Parker JS, Mullins M, Cheang MC, Leung S, Voduc D, Vickery T, Davies S, Fauron C, He X, Hu Z, Quackenbush JF, et al. Supervised risk predictor of breast cancer based on intrinsic subtypes. J Clin Oncol. 2009;27:1160–1167. doi: 10.1200/JCO.2008.18.1370.
Pekarsky Y, Hallas C, Croce CM. The role of TCL1 in human T-cell leukemia. Oncogene. 2001;20:5638. doi: 10.1038/sj.onc.1204596.
Raftery AE, Dean N. Variable selection for model-based clustering. J Am Stat Assoc. 2006;101:168–178.
Rand WM. Objective criteria for the evaluation of clustering methods. J Am Stat Assoc. 1971;66:846–850.
Rasmussen CE. The infinite Gaussian mixture model. Advances in Neural Information Processing Systems 12, Denver, USA. 2000;12:554–560.
Raykov YP, Boukouvalas A, Little MA. Simple approximate MAP inference for Dirichlet processes mixtures. Electron J Statist. 2016a;10:3548–3578.
Raykov Y, Boukouvalas A, Baig F, Little MA. What to do when k-means clustering fails: a simple yet principled alternative algorithm. PLoS One. 2016b;11:e0162259. doi: 10.1371/journal.pone.0162259.
Russell N, Murphy TB, Raftery AE. Bayesian model averaging in model-based clustering and density estimation. arXiv preprint arXiv:1506.09035. 2015.
Savage RS, Heller K, Xu Y, Ghahramani Z, Truman WM, Grant M, Denby KJ, Wild DL. R/BHC: fast Bayesian hierarchical clustering for microarray data. BMC Bioinformatics. 2009;10:242. doi: 10.1186/1471-2105-10-242.
Schwarz G. Estimating the dimension of a model. Ann Statist. 1978;6:461–464.
Scrucca L, Raftery AE. clustvarsel: a package implementing variable selection for model-based clustering in R. J Stat Softw. 2014;84:1–28. doi: 10.18637/jss.v084.i01.
Scrucca L, Fop M, Murphy TB, Raftery AE. mclust 5: clustering, classification and density estimation using Gaussian finite mixture models. R J. 2016;8:205–233.
Şenbabaoğlu Y, Sümer SO, Sánchez-Vega F, Bemis D, Ciriello G, Schultz N, Sander C. A multi-method approach for proteomic network inference in 11 human cancers. PLoS Comput Biol. 2016;12:e1004765. doi: 10.1371/journal.pcbi.1004765.
Shochat C, Tal N, Bandapalli OR, Palmi C, Ganmore I, Te Kronnie G, Cario G, Cazzaniga G, Kulozik AE, Stanulla M, Schrappe M, Biondi A, Basso G, et al. Gain-of-function mutations in interleukin-7 receptor-α (IL7R) in childhood acute lymphoblastic leukemias. J Exp Med. 2011;208:901–908. doi: 10.1084/jem.20110580.
Städler N, Dondelinger F, Hill SM, Akbani R, Lu Y, Mills GB, Mukherjee S. Molecular heterogeneity at the network level: high-dimensional testing, clustering and a TCGA case study. Bioinformatics. 2017;33:2890–2896. doi: 10.1093/bioinformatics/btx322.
Tadesse MG, Sha N, Vannucci M. Bayesian variable selection in clustering high-dimensional data. J Am Stat Assoc. 2005;100:602–617.
Teh YW, Jordan MI, Beal MJ, Blei DM. Hierarchical Dirichlet processes. J Am Stat Assoc. 2006;101:1566–1581.
Uhlen M, Zhang C, Lee S, Sjöstedt E, Fagerberg L, Bidkhori G, Benfeitas R, Arif M, Liu Z, Edfors F, Sanli K, von Feilitzen K, et al. A pathology atlas of the human cancer transcriptome. Science. 2017;357:eaan2507. doi: 10.1126/science.aan2507.
Van der Velden V, Brüggemann M, Hoogeveen P, de Bie M, Hart P, Raff T, Pfeifer H, Lüschen S, Szczepański TE, Van Wering E, Kneba M, van Dongen JJ. TCRB gene rearrangements in childhood and adult precursor-B-ALL: frequency, applicability as MRD-PCR target, and stability between diagnosis and relapse. Leukemia. 2004;18:1971. doi: 10.1038/sj.leu.2403505.
Wang L, Dunson DB. Fast Bayesian inference in Dirichlet process mixture models. J Comput Graph Stat. 2011;20:196–216. doi: 10.1198/jcgs.2010.07081.
Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, Shmulevich I, Sander C, Stuart JM; Cancer Genome Atlas Research Network. The Cancer Genome Atlas pan-cancer analysis project. Nat Genet. 2013;45:1113–1120. doi: 10.1038/ng.2764.
Welch BL. The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika. 1947;34:28–35. doi: 10.1093/biomet/34.1-2.28.
Witten DM, Tibshirani R. A framework for feature selection in clustering. J Am Stat Assoc. 2010;105:713–726. doi: 10.1198/jasa.2010.tm09415.
Zhang X, Nott DJ, Yau C, Jasra A. A sequential algorithm for fast fitting of Dirichlet process mixture models. J Comput Graph Stat. 2014;23:1143–1162.

Associated Data

Supplementary Materials

Supplementary File

EMS158447-supplement-Supplementary_File.zip (27.3 MB, zip)

Supplementary File 1

EMS158447-supplement-Supplementary_File_1.zip (27.3 MB, zip)
