Published in final edited form as: J Stat Softw. 2018 Apr 17;84:1. doi: 10.18637/jss.v084.i01

clustvarsel: A Package Implementing Variable Selection for Gaussian Model-Based Clustering in R

Luca Scrucca 1, Adrian E Raftery 2

Abstract

Finite mixture modeling provides a framework for cluster analysis based on parsimonious Gaussian mixture models. Variable or feature selection is of particular importance in situations where only a subset of the available variables provide clustering information. This enables the selection of a more parsimonious model, yielding more efficient estimates, a clearer interpretation and, often, improved clustering partitions. This paper describes the R package clustvarsel which performs subset selection for model-based clustering. An improved version of the Raftery and Dean (2006) methodology is implemented in the new release of the package to find the (locally) optimal subset of variables with group/cluster information in a dataset. Search over the solution space is performed using either a step-wise greedy search or a headlong algorithm. Adjustments for speeding up these algorithms are discussed, as well as a parallel implementation of the stepwise search. Usage of the package is presented through the discussion of several data examples.

Keywords: BIC, model-based clustering, R, subset selection

1. Introduction

Cluster analysis is the search for a priori unknown group structure in data. Model-based clustering is increasingly becoming one of the most popular cluster analysis methods. Model-based clustering is based on finite mixture models (McLachlan and Peel 2000), with each component density usually representing a cluster. For continuous data, Gaussian components are usually used to model clusters. Model-based clustering as implemented in the R package mclust (Fraley, Raftery, Murphy, and Scrucca 2012; Scrucca, Fop, Murphy, and Raftery 2016) allows for automatic selection of the number of components, and selection of parsimonious covariance structures.

In cluster analysis, as in classification or other supervised learning tasks, the inclusion of noise variables, i.e., features without useful group information, can severely degrade the final results. In fact, the presence of noise variables can negatively impact both the estimation of the number of clusters in the data and the recovery of those groups. The new release of R package clustvarsel (version ≥ 2.0; Dean, Raftery, and Scrucca 2017) implements a wrapper method for automatic variable selection in model-based clustering (as implemented in the mclust package). Thus, the addition of the clustvarsel package allows for automatic variable selection to be included in the estimation process.

Raftery and Dean (2006) introduced a stepwise variable selection methodology tailored to model-based clustering. Variables designated as noise variables in this process were not required to be independent of the clustering variables. Rather, noise variables were taken to be conditionally independent of the cluster membership, while possibly remaining linearly dependent on the clustering variables. This linear dependency was modeled using linear regression. An earlier version of clustvarsel (version 1) implemented this methodology. Dean (2006) is a vignette describing use of this earlier version.

Maugis, Celeux, and Martin-Magniette (2009a, b) extended the framework of Raftery and Dean (2006) by allowing the noise variables to depend on a (possibly null) subset of the clustering variables via stepwise variable selection in the linear regression. This allows for a more parsimonious modeling of the relationship between the noise variables and the clustering variables. For more details on the variable selection framework see Section 2.

Software packages related to subset selection in clustering are SelvarClust (Dia, Martin-Magniette, and Maugis 2009a) and SelvarClustIndep (Dia, Martin-Magniette, and Maugis 2009b), which implement in C++ the above mentioned approaches. The R package SelvarMix (Sedki, Celeux, and Maugis-Rabusseau 2017) provides a method based on the Maugis et al. (2009b) approach preceded by a step in which the variables are ranked using a lasso-like procedure. The R package vscc (Andrews and McNicholas 2013) implements the methodology proposed by Andrews and McNicholas (2014) which aims at finding the variables that simultaneously minimize the within-group variance and maximize the between-group variance. Finally, sparse hierarchical clustering and sparse k-means clustering are included in the R package sparcl (Witten and Tibshirani 2013) according to the proposal of Witten and Tibshirani (2010).

The paper is organized as follows: Section 3 introduces the main function in the clustvarsel package, and discusses the options for the available arguments. In Section 4, several examples are presented by applying the methodology to both synthetic and real world datasets. Algorithmic speedups are discussed in Section 5, including a description of a parallel implementation of the stepwise greedy search. The paper concludes with some discussion and final remarks in Section 6.

2. Methodology

Model-based clustering assumes that the observed data are generated from a mixture of G components, each representing the probability distribution for a different group or cluster (McLachlan and Peel 2000; Fraley and Raftery 2002). For continuous data, the density of each mixture component is often described by the multivariate Gaussian distribution. Thus, the general form of a Gaussian finite mixture model is

f(x) = ∑_{g=1}^{G} π_g ϕ(x | μ_g, Σ_g),

where the π_g are the mixing probabilities, with π_g > 0 and ∑_{g=1}^{G} π_g = 1, and ϕ(·) is the multivariate Gaussian density with parameters (μ_g, Σ_g) (g = 1, …, G). Clusters are ellipsoidal, centered at the mean vector μ_g, with other geometric features, such as volume, shape and orientation, determined by Σ_g. Parsimonious parameterization of the covariance matrices is available through the eigenvalue decomposition Σ_g = λ_g D_g A_g D_g⊤, where λ_g is a scalar controlling the volume of the ellipsoid, A_g is a diagonal matrix specifying the shape of the density contours, and D_g is an orthogonal matrix which determines the orientation of the corresponding ellipsoid (Banfield and Raftery 1993; Celeux and Govaert 1995). Fraley et al. (2012, Table 1) report the parameterizations of the within-group covariance matrices available in the R package mclust, together with the corresponding geometric characteristics.
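To make the parameterizations concrete, the following brief sketch shows how a specific covariance structure can be requested when fitting a mixture model with mclust; it is illustrative only, and the iris measurements are used purely as example data.

library(mclust)
X <- iris[, 1:4]
mclust.options("emModelNames")                    # available covariance parameterizations
mod_vvv <- Mclust(X, G = 3, modelNames = "VVV")   # unconstrained ellipsoidal model
mod_eii <- Mclust(X, G = 3, modelNames = "EII")   # spherical, equal volume
c(VVV = mod_vvv$bic, EII = mod_eii$bic)           # BIC comparison (larger is better)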

Table 1.

Parameter settings for the scenarios used to generate the synthetic data: β defines the regression of the irrelevant variables on the clustering variables, whereas Ω is the covariance matrix of the error term ε. 0_p indicates the (2 × p) matrix of zeroes, and I_p the (p × p) identity matrix.

Scenario   Parameters
Model 1    β = 0_8;   Ω = I_8
Model 4    β = ( 0.5 0 ; 0 1 | 0_6 );   Ω = I_8
Model 5    β = ( 0.5 0 2 0 ; 0 1 0 3 | 0_4 );   Ω = diag(I_2, 0.5 I_2, I_4)
Model 7    β = ( 0.5 0 2 0 2 0.5 2 0 ; 0 1 0 3 0.5 1 0 3 );   Ω = diag(I_2, 0.5 I_4, I_2)

(Rows of β are separated by semicolons, and the trailing 0_p block completes the 2 × 8 matrix.)

Raftery and Dean (2006) discussed the problem of variable selection for model-based clustering by recasting the problem as a model selection procedure. Their proposal is based on the use of the Bayesian information criterion (BIC) to approximate Bayes factors to compare mixture models fitted on nested subsets of variables. A generalization of their approach was later discussed by Maugis et al. (2009a, b).

Let us suppose that the set of available variables, χ, is partitioned into three disjoint parts: the set of previously selected clustering variables, χ_clust; the variable under consideration for inclusion into or exclusion from the active set, X_i; and the set of remaining variables, χ_other = χ \ (χ_clust ∪ X_i).

Raftery and Dean (2006) showed that the inclusion (or exclusion) of variables can be assessed using the following BIC difference:

BIC_diff = BIC_clust(χ_clust, X_i) − BIC_not clust(χ_clust, X_i),   (1)

where BIC_clust(χ_clust, X_i) is the BIC value for the “best” clustering mixture model (i.e., assuming G ≥ 2) fitted using the feature set {χ_clust, X_i}, whereas BIC_not clust(χ_clust, X_i) is the BIC value for no clustering for the same set of variables. The latter can be written as

BIC_not clust(χ_clust, X_i) = BIC_clust(χ_clust) + BIC_reg(X_i | χ_clust),   (2)

i.e., the BIC value for the “best” clustering model fitted using the set χ_clust, plus the BIC value for the regression of the candidate variable X_i on the variables included in the set χ_clust. The difference in BIC scores in Equation 1 is an approximation of the log of the Bayes factor comparing the model in which the variable under consideration, X_i, is a clustering variable with the model in which it is conditionally independent of the clustering. Large positive values of BIC_diff can be taken as evidence that variable X_i is useful for clustering.

In all clustering models, the “best” model is identified with respect to the number of mixture components (assuming G ≥ 2) and to the model parameterization. In the linear regression term, X_i may depend on all the variables in χ_clust, on a subset of them, or on none of them (complete independence). Thus, following the proposal of Maugis et al. (2009a), the regression on all the previously selected clustering variables is replaced by a regression on a subset of them, chosen by a stepwise method. Finally, note that in both Equations 1 and 2 the set of remaining variables, χ_other, plays no role.
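To make the criterion concrete, the following sketch computes the BIC difference of Equations 1 and 2 for a single candidate variable using mclust directly. It is purely illustrative and not the package's internal implementation: clustvarsel also performs subset selection in the regression term and handles several additional details. The crabs measurements are used only as example data, and the variables chosen here are arbitrary.

library(mclust)
data("crabs", package = "MASS")
Xclust <- crabs[, c("CW", "RW")]   # variables assumed to be already selected
Xi     <- crabs[, "FL"]            # candidate variable under consideration

# BIC for the "best" clustering model (G >= 2) on {Xclust, Xi}
bic_clust_with <- max(mclustBIC(cbind(Xclust, FL = Xi), G = 2:5), na.rm = TRUE)

# BIC for clustering on Xclust alone, plus BIC for regressing Xi on Xclust.
# mclust's BIC is 2*logLik - df*log(n), so stats::BIC must be negated.
bic_clust_only <- max(mclustBIC(Xclust, G = 2:5), na.rm = TRUE)
bic_reg <- -BIC(lm(Xi ~ ., data = Xclust))

bic_diff <- bic_clust_with - (bic_clust_only + bic_reg)
bic_diff   # large positive values favor retaining FL as a clustering variable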

As described by Raftery and Dean (2006) and Dean (2006), practical implementation of the above methodology requires an algorithm for checking single variables for inclusion in, or exclusion from, the set of selected clustering variables. The package clustvarsel implements two different algorithms: a stepwise greedy search algorithm and a headlong algorithm. Both are based on the concept of a current set of selected clustering variables that expands or contracts at each step of the algorithm. They both alternate between inclusion and exclusion steps, stopping when neither changes the current set. They thus terminate exactly, rather than approximately to within a tolerance.

At each inclusion step, the stepwise greedy search algorithm considers each variable not in the current set of selected clustering variables in turn, and assesses the evidence in favor of adding it to the current set. If the evidence is against inclusion for all the variables not in the current set, no change is made. Otherwise, the variable for which the evidence in favor of inclusion is highest is added to the current set. Similarly, at each exclusion step, the algorithm assesses the evidence for removal of each variable in the current set, and removes the variable for which the evidence of removal is highest, provided that this evidence is positive.

The stepwise algorithm can be implemented in a forward/backward fashion, i.e., starting from the empty set of clustering variables and then continuing to add or remove features until there is no evidence of further clustering variables. It can also be implemented in a backward/forward fashion, i.e., starting from the full set of features as clustering variables and then continuing to remove or add features until there is no evidence of further clustering variables.

The basic idea is similar to stepwise regression and could in principle suffer from the instabilities of stepwise regression discussed by Miller (2002). However, we have not observed this in any of the numerous simulations we have carried out and examples we have analyzed with the method.

Unlike the stepwise greedy algorithm, at each inclusion step the headlong search algorithm does not check all the variables not in the current set. Instead it checks the variables one at a time until a variable is found for which the evidence in favor of inclusion exceeds a prespecified threshold. The default threshold is 0, in which case the inclusion step ends whenever a variable is found with positive evidence for inclusion. Any variable for which the evidence for inclusion is below a prespecified level (the default is a BIC difference of −10) is removed from consideration for the rest of the algorithm. This allows variables that are likely to be irrelevant to be removed early in the algorithm.

Similarly, at each exclusion step, variables are checked one at a time until a variable is found for which the evidence for clustering versus not clustering is below the threshold, at which point that variable is removed and the exclusion step ends. See Badsberg (1992) for further details about the headlong algorithm.
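As an illustration, the inclusion step described above can be summarized by the following schematic function. This is a simplified sketch of the logic, not the code used inside clustvarsel; the bic_diff argument stands for a hypothetical routine evaluating Equation 1 for a candidate variable given the current set of clustering variables.

headlong_inclusion <- function(candidates, bic_diff, upper = 0, lower = -10) {
  dropped <- character(0)
  for (v in candidates) {
    d <- bic_diff(v)
    if (d > upper)                        # first variable with enough evidence is added
      return(list(added = v, dropped = dropped))
    if (d < lower)                        # discard from further consideration
      dropped <- c(dropped, v)
  }
  list(added = NULL, dropped = dropped)   # no variable added in this step
}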

The headlong algorithm involves fewer calculations than the stepwise greedy algorithm, and so is faster, sometimes much faster. The flip side is that it is a greedier algorithm, and so it may find a less good solution.

3. The R package clustvarsel

The clustvarsel package can be used to find the (locally) optimal subset of variables with group/cluster information in a dataset with continuous variables. In this section, usage of the main function clustvarsel and its arguments is described.

The clustvarsel package depends on other packages available on CRAN for model fitting (mclust, Fraley, Raftery, and Scrucca 2017), and on packages providing some facilities, such as parallelization (parallel, R Core Team 2017; doParallel, Microsoft Corporation and Weston 2017; foreach, Revolution Analytics and Weston 2015a; iterators, Revolution Analytics and Weston 2015b) and subset selection in regression models (BMA, Raftery, Hoeting, Volinsky, Painter, and Yeung 2017). By loading the package as usual with library("clustvarsel"), all the other packages are also made available for the current session.

Once the clustvarsel package has been loaded, the main function a user needs to invoke is the following:

clustvarsel(data, G = 1:9, search = c("greedy", "headlong"),
  direction = c("forward", "backward"), emModels1 = c("E", "V"),
  emModels2 = mclust.options("emModelNames"), samp = FALSE,
  sampsize = round(nrow(data) / 2), hcModel = "VVV", allow.EEE = TRUE,
  forcetwo = TRUE, BIC.diff = 0, BIC.upper = 0, BIC.lower = -10,
  itermax = 100, parallel = FALSE, verbose = interactive())

The available arguments are:

data A numeric matrix or data frame where rows correspond to observations and columns correspond to variables. Categorical variables are not allowed.

G An integer vector specifying the numbers of mixture components (clusters) for which the BIC is to be calculated. The default is G = 1:9.

search A character vector indicating whether a “greedy” or, potentially quicker but less optimal, “headlong” algorithm is to be used in the search for clustering variables.

direction A character vector indicating the type of search: “forward” starts from the empty model and at each step of the algorithm adds/removes a variable until the stopping criterion is satisfied; “backward” starts from the model with all the available variables and at each step of the algorithm removes/adds a variable until the stopping criterion is satisfied. For the “headlong” search only the “forward” algorithm is available.

emModels1 A vector of character strings indicating the models to be fitted in the expectation maximization (EM) phase of univariate clustering. Possible models are “E” and “V”, described in help(mclustModelNames).

emModels2 A vector of character strings indicating the models to be fitted in the EM phase of multivariate clustering. Possible models are described in help(mclustModelNames).

samp A logical value indicating whether or not a subset of observations should be used in the hierarchical clustering phase used to get starting values for the EM algorithm.

sampsize The number of observations to be used in the hierarchical clustering subset. By default, a random sample of approximately half of the sample size is used.

hcModel A character string specifying the model to be used in hierarchical clustering for choosing the starting values used by the EM algorithm. By default, the unconstrained “VVV” covariance structure is used.

allow.EEE A logical value indicating whether a new clustering will be run with equal within-cluster covariance for hierarchical clustering to get starting values, if the clusterings with variable within-cluster covariance for hierarchical clustering do not produce any viable BIC values.

forcetwo A logical value indicating whether at least two variables will be forced to be selected initially, regardless of whether BIC evidence suggests bivariate clustering or not.

BIC.diff A numerical value indicating the minimum BIC difference between clustering and no clustering used to accept the inclusion of a variable in the set of clustering variables in a forward step of the greedy search algorithm. Furthermore, minus BIC.diff is used to accept the exclusion of a selected variable from the set of clustering variables in a backward step of the greedy search algorithm. Default is 0.

BIC.upper A numerical value indicating the minimum BIC difference between clustering and no clustering used to select a clustering variable in the headlong search. Default is 0.

BIC.lower A numerical value indicating the level of BIC difference between clustering and no clustering below which a variable will be removed from consideration in the headlong algorithm. Default is −10.

itermax An integer value giving the maximum number of iterations (of addition and removal steps) the selected algorithm is allowed to run for.

parallel This argument allows the user to specify whether the selected “greedy” algorithm should be run sequentially or in parallel. For a single machine with multiple cores, the possible values are:

  (i) a logical value specifying whether parallel computing should be used (TRUE) or not (FALSE, the default) for running the algorithm;

  (ii) a numerical value giving the number of cores to employ (by default, this is obtained from the function detectCores in the parallel package);

  (iii) a character string specifying the type of parallelization to use. The latter depends on the operating system: on Windows only “snow” type functionality is available, whereas on Unix/Linux/Mac OS X both “snow” and “multicore” (the default) functionalities are available.

Options (ii) and (iii) imply that the search is performed in parallel, and at the end of the search the cluster is automatically stopped by shutting down the workers. If a cluster of multiple machines is available, the algorithm can be run in parallel using all, or a subset of, the cores available to the machines belonging to the cluster. However, this option requires more work from the user, who needs to set up and register a parallel back end. In this case, the cluster must be explicitly stopped using the function stopCluster from the parallel package. By default the algorithm is run sequentially.

verbose A logical value indicating whether information should be provided at each step of the algorithm. By default it is set to TRUE during interactive sessions, and FALSE otherwise.

A basic clustvarsel function call needs to input a matrix or data frame containing the data to analyze. Fine tuning is possible by specifying the arguments described above. The following section presents some examples of its usage in practice.

We conclude this section by noting that the initialization of EM is often crucial in fitting mixture models because the likelihood surface tends to have multiple modes. In mclust the EM algorithm is initialized using the partitions obtained from model-based hierarchical agglomerative clustering (MBHAC, Banfield and Raftery 1993). MBHAC is convenient because the underlying probabilistic model is shared by the initialization step and the model fitting step. Furthermore, MBHAC is computationally advantageous because a single run provides the basis for initializing the EM algorithm for any number of components and covariance decompositions. However, problems may arise in the presence of coarse data (i.e., discrete data or rounded continuous data) due to the presence of ties. In certain circumstances the final EM solution may even depend on the ordering of the variables. Scrucca and Raftery (2015) have recently proposed a simple approach to overcome these drawbacks, based on projecting the data via a suitable transformation before applying MBHAC. Once an initial partition is obtained, the EM algorithm is run as usual on the original variables. Thus, by default, clustvarsel employs a scaled singular value decomposition (SVD) transformation by setting mclust.options(hcUse = "SVD"). This often yields an improved model likelihood and, of particular importance in subset selection, removes any variable-ordering effect.
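For reference, the transformation used by the MBHAC initialization can be inspected or changed through mclust.options(); a brief sketch follows (values other than "SVD" are shown only for illustration):

library(mclust)
mclust.options("hcUse")          # transformation currently used by MBHAC
mclust.options(hcUse = "VARS")   # use the raw variables for the initialization
## ... fit models ...
mclust.options(hcUse = "SVD")    # restore the scaled SVD transformation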

4. Examples

In this section we present some data analysis examples based on simulated data and on well-known real datasets. All calculations, and in particular the system times reported, have been carried out on an iMac with a 4-core Intel i5 CPU running at 2.8 GHz and with 16GB of RAM, unless explicitly indicated otherwise.

4.1. Simulated data

We consider some of the synthetic data examples described in Maugis et al. (2009a). Samples were simulated for a 10-dimensional feature vector in which only the first two variables provide clustering information. These were generated from a mixture of four Gaussian distributions, X[1:2] ~ N(μ_k, I_2), with μ_1 = (−2, −2), μ_2 = (−2, 2), μ_3 = −μ_2, μ_4 = −μ_1, and mixing probabilities π = (0.3, 0.2, 0.3, 0.2). The remaining eight variables were simulated according to the model X[3:10] = X[1:2] β + ε, where ε ~ N(0, Ω). Different settings for β and Ω define seven different scenarios (see Table 1 in Maugis et al. 2009a). These range from independence between the clustering variables and the other features (models 1 and 2) to an increasing degree of dependence of the irrelevant variables on the clustering ones (models 3 to 7). In the following, we focus only on some of the scenarios. For ease of reading, the values of the parameters for these scenarios are reported in Table 1.
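A sketch of the data-generating mechanism is given below for scenario 1 (β = 0_8, Ω = I_8); the other scenarios only change β and Ω as in Table 1. The function name and defaults are ours, introduced purely for illustration.

library(MASS)
sim_scenario <- function(n, beta = matrix(0, 2, 8), Omega = diag(8)) {
  pro <- c(0.3, 0.2, 0.3, 0.2)
  mu  <- rbind(c(-2, -2), c(-2, 2), c(2, -2), c(2, 2))
  cl  <- sample(1:4, n, replace = TRUE, prob = pro)
  X12 <- mu[cl, ] + matrix(rnorm(2 * n), n, 2)       # the two clustering variables
  eps <- mvrnorm(n, mu = rep(0, 8), Sigma = Omega)   # error term
  X   <- cbind(X12, X12 %*% beta + eps)              # remaining eight variables
  colnames(X) <- paste0("X", 1:10)
  list(X = X, class = cl)
}
set.seed(1)
dat <- sim_scenario(n = 200)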

The simulation results were evaluated using the following criteria:

  • Variable selection error rate (VSER) to assess variable selection performance. VSER is defined as the number of errors in selecting (or not selecting) variables divided by the total number of variables in the set. A perfect recovery of the clustering variables gives VSER = 0, and VSER can be no greater than 1 (a small helper computing it is sketched after this list).

  • Adjusted Rand index (ARI, Hubert and Arabie 1985) to measure classification accuracy. A perfect classification gives ARI = 1, whereas ARI = 0 for a random classification.
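The following minimal helper shows how the VSER can be computed from the selected and true variable subsets; it is an illustrative sketch with arbitrary variable names. The ARI is available directly as adjustedRandIndex() in mclust.

vser <- function(selected, truth, all_vars) {
  mean((all_vars %in% selected) != (all_vars %in% truth))
}
# Example: one wrongly included and one missed variable out of ten
vser(selected = c("X1", "X3"), truth = c("X1", "X2"), all_vars = paste0("X", 1:10))
# [1] 0.2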

Table 2 shows the results from a simulation study for the above synthetic data using sample sizes n = 200 and n = 1000. The methods compared are MCLUST, the best Gaussian mixture model (GMM) using the full set of variables; CLUSTVARSEL[fwd] and CLUSTVARSEL[bkw], the best GMM using the subset of relevant clustering variables selected by, respectively, the forward/backward and backward/forward greedy search; and SPARSEKMEANS, a sparse version of the k-means algorithm proposed by Witten and Tibshirani (2010) and implemented in the R package sparcl (Witten and Tibshirani 2013). Because the last method needs the number of clusters to be fixed in advance, we also included in the comparison versions of the methods based on GMMs with the number of components fixed at the true number of clusters, i.e., G = 4. Finally, note that the true subset size is 2, so the optimal VSER is 0, and the best average ARI value attainable, using the true clustering variables and a fixed G = 4 components, is about 0.88.

Table 2.

Average values based on 100 simulation runs for some of the models in Maugis et al. (2009a), where only two (out of ten) variables are relevant for clustering. The first four methods listed for each scenario use the fixed number of clusters G = 4. CLUSTVARSEL[fwd] uses the forward/backward greedy search, whereas CLUSTVARSEL[bkw] employs the backward/forward greedy search. The true subset size is 2. Smaller values of VSER and larger values of ARI are better. Values within one standard error of the best are shown in bold for each experiment and each criterion. Standard errors for VSER and ARI are all ≤ 0.030.

Model Subset size VSER ARI Subset size VSER ARI
Scenario 1 n = 200 n = 1000


MCLUST,G=4 10.00 0.800 0.845 10.00 0.800 0.869
SPARSEKMEANS,G=4 9.92 0.792 0.881 9.27 0.727 0.884
CLUSTVARSEL[fwd],G=4 2.01 0.001 0.882 2.02 0.002 0.887
CLUSTVARSEL[bkw],G=4 2.10 0.010 0.882 2.25 0.025 0.887
MCLUST 10.00 0.800 0.780 10.00 0.800 0.883
CLUSTVARSEL[fwd] 2.01 0.001 0.882 2.01 0.002 0.887
CLUSTVARSEL[bkw] 4.03 0.311 0.658 2.25 0.025 0.887

Scenario 4 n = 200 n = 1000


MCLUST,G=4 10.00 0.800 0.799 10.00 0.800 0.813
SPARSEKMEANS,G=4 9.93 0.793 0.829 9.28 0.728 0.843
CLUSTVARSEL[fwd],G=4 2.03 0.005 0.872 2.00 0.000 0.888
CLUSTVARSEL[bkw],G=4 2.18 0.018 0.877 2.22 0.022 0.888
MCLUST 10.00 0.800 0.640 10.00 0.800 0.864
CLUSTVARSEL[fwd] 2.03 0.005 0.872 2.00 0.000 0.888
CLUSTVARSEL[bkw] 3.13 0.185 0.732 2.25 0.025 0.888

Scenario 5 n = 200 n = 1000


MCLUST,G=4 10.00 0.800 0.811 10.00 0.809 0.879
SPARSEKMEANS,G=4 9.35 0.735 0.815 7.00 0.500 0.855
CLUSTVARSEL[fwd],G=4 2.00 0.010 0.888 2.00 0.000 0.884
CLUSTVARSEL[bkw],G=4 2.00 0.014 0.887 2.00 0.000 0.884
MCLUST 10.00 0.800 0.455 10.00 0.800 0.879
CLUSTVARSEL[fwd] 1.99 0.013 0.884 2.00 0.000 0.884
CLUSTVARSEL[bkw] 2.04 0.014 0.888 2.04 0.004 0.884

Scenario 7 n = 200 n = 1000


MCLUST,G=4 10.00 0.800 0.848 10.00 0.800 0.878
SPARSEKMEANS,G=4 9.00 0.704 0.857 9.00 0.700 0.865
CLUSTVARSEL[fwd],G=4 2.00 0.032 0.878 2.00 0.000 0.883
CLUSTVARSEL[bkw],G=4 2.00 0.032 0.878 2.00 0.000 0.883
MCLUST 10.00 0.800 0.469 10.00 0.800 0.878
CLUSTVARSEL[fwd] 2.00 0.032 0.878 2.00 0.000 0.883
CLUSTVARSEL[bkw] 2.00 0.014 0.879 2.00 0.000 0.883

Compared to the performance of CLUSTVARSEL reported in Table 1 of Maugis et al. (2009a), the new version of the algorithm is able to correctly discard irrelevant variables, both when they are independent of the clustering ones and when they are correlated.

When G is fixed at the true number of clusters, MCLUST gives slightly less accurate results for n = 200, except in the case of complete independence (scenario 1). CLUSTVARSEL provides equivalent accuracy whether the forward/backward or the backward/forward search is used. SPARSEKMEANS shows accuracy equivalent to the greedy search, but it tends to select (i.e., to assign non-zero weights to) too many variables. Consequently, the VSER of SPARSEKMEANS is always worse than that of CLUSTVARSEL.

When G is unknown, MCLUST often provides inaccurate clustering, in particular when n = 200. In contrast, CLUSTVARSEL is generally able to select the true clustering variables (i.e., the VSER is at or near zero), and it also provides very accurate clustering (i.e., the ARI is close to 0.88). The only exceptions are scenarios 1 and 4 with the backward/forward search when n = 200. In these cases the number of selected variables is slightly larger, which in turn causes a small degradation in clustering accuracy. For n = 1000, however, the forward/backward and backward/forward greedy searches are equivalent.

4.2. Crabs data

The crabs dataset in the MASS package contains five morphological measurements on 200 specimens of Leptograpsus variegatus crabs recorded on the shore in Western Australia (Campbell and Mahon 1974). Crabs are classified according to their color (blue and orange) and sex, giving four groups. Fifty specimens are available for each combination of color and sex.

R> data("crabs", package = "MASS")
R> X <- crabs[, 4:8]
R> Class <- with(crabs, paste(sp, sex, sep = "|"))
R> table(Class)
Class
B|F B|M O|F O|M
50 50 50 50

First we look at the result obtained using the function Mclust from the mclust package, with the best model selected by BIC for clustering on all the variables, allowing all possible parameterizations and the number of groups to range over 1 to 5:

R> mod1 <- Mclust(X, G = 1:5)
R> summary(mod1)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust EEV (ellipsoidal, equal volume and shape) model with 4 components:
log.likelihood n df BIC ICL
-1241.006 200 68 -2842.298 -2854.29
Clustering table:
1 2 3 4
60 55 39 46

The estimated maximum a posteriori (MAP) classification is obtained from mod1$classification, so a table comparing the estimated and the true classifications, the corresponding misclassification error rate and the adjusted Rand index (ARI), can be obtained as follows:

R> table(Class, mod1$classification)
Class 1 2 3 4
B|F 49 0 0 1
B|M 11 0 39 0
O|F 0 5 0 45
O|M 0 50 0 0
R> classError(Class, mod1$classification)$errorRate
[1] 0.085
R> adjustedRandIndex(Class, mod1$classification)
[1] 0.793786

The algorithm for selecting the variables that are useful for clustering can be run with the following code:

R> (out <- clustvarsel(X, G = 1:5))
------------------------------------------------------
Variable selection for Gaussian model-based clustering
Stepwise (forward/backward) greedy search
------------------------------------------------------
Variable proposed Type of step BICclust Model G BICdiff Decision
CW Add -1408.710 E 2 -6.21775 Accepted
RW Add -1908.964 EEV 2 127.38583 Accepted
FL Add -2357.252 EEV 4 81.24626 Accepted
FL Remove -1908.964 EEV 2 81.24626 Rejected
BD Add -2609.777 EEV 4 56.08094 Accepted
BD Remove -2357.252 EEV 4 56.08094 Rejected
CL Add -2842.298 EEV 4 -31.07119 Rejected
BD Remove -2357.252 EEV 4 56.08094 Rejected
Selected subset: CW, RW, FL, BD

By default, a greedy forward/backward search is used. The printed output shows the trace of the algorithm: at each step the most important variable is considered (Variable proposed) for addition or deletion (Type of step) from the set of active clustering variables. The column BICclust contains the BIC values for the “best” clustering model (i.e., the first term on the right-hand side of Equation 1), followed by the corresponding model abbreviation (Model) and number of mixture components (G). The last two columns report, respectively, the criterion in Equation 1 (BICdiff) and the final decision (Decision), which can be either to accept or to reject the proposal.

In this example, the final subset contains four out of five morphological features:

R> out$subset
CW RW FL BD
4 2 1 5

The same subset is also obtained by using a backward/forward greedy search:

R> clustvarsel(X, G = 1:5, direction = "backward")
------------------------------------------------------
Variable selection for Gaussian model-based clustering
Stepwise (backward/forward) greedy search
------------------------------------------------------
Variable proposed Type of step BICclust Model G BICdiff Decision
CL Remove -2609.777 EEV 4 -31.07119 Accepted
BD Remove -2357.252 EEV 4 56.08094 Rejected
Selected subset: FL, RW, CW, BD

The identified subset can be used for fitting the final clustering model as follows:

R> Xs <- X[, out$subset]
R> mod2 <- Mclust(Xs, G = 1:5)
R> summary(mod2)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust EEV (ellipsoidal, equal volume and shape) model with 4 components:
log.likelihood n df BIC ICL
-1180.378 200 47 -2609.777 -2624.892
Clustering table:
1 2 3 4
53 60 40 47

The accuracy of the clustering obtained on the selected subset of variables is obtained as:

R> table(Class, mod2$classification)
Class 1 2 3 4
B|F 0 50 0 0
B|M 0 10 40 0
O|F 3 0 0 47
O|M 50 0 0 0
R> classError(Class, mod2$classification)$errorRate
[1] 0.065
R> adjustedRandIndex(Class, mod2$classification)
[1] 0.8399679

Finally, to get an idea of system times required by the subset selection procedure, the following code can be used (note that elapsed times are expressed in seconds):

R> library("rbenchmark")
R> benchmark(clustvarsel(X, G = 1:5, verbose = FALSE),
+   clustvarsel(X, G = 1:5, direction = "backward", verbose = FALSE),
+   columns = c("test", "elapsed"), order = NULL, replications = 1)
test elapsed
1 clustvarsel(X, G = 1:5, verbose = FALSE) 2.998
2 clustvarsel(X, G = 1:5, direction = "backward", verbose = FALSE) 1.551

4.3. Coffee data

Data on twelve chemical constituents of coffee for 43 samples were collected from 29 countries around the world (Streuli 1973). Each coffee sample is either of the Arabica or Robusta variety. The dataset is available in the R package pgmm (McNicholas, ElSherbiny, McDaid, and Murphy 2015).

R> data("coffee", package = "pgmm")
R> X <- as.matrix(coffee[, 3:14])
R> Class <- factor(coffee$Variety, levels = 1:2,
+   labels = c("Arabica", "Robusta"))
R> table(Class)
Class
Arabica Robusta
36 7
R> mod1 <- Mclust(X)
R> summary(mod1)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VEI (diagonal, equal shape) model with 3 components:
log.likelihood n df BIC ICL
-392.9397 43 52 -981.4619 -981.6379
Clustering table:
1 2 3
22 14 7

Model-based clustering applied to this dataset selects the VEI model with 3 components. The clustering table and the corresponding adjusted Rand index (ARI) are the following:

R> table(Class, mod1$classification)
Class 1 2 3
 Arabica 22 14 0
 Robusta 0 0 7
R> adjustedRandIndex(Class, mod1$classification)
[1] 0.3833116

The Arabica variety appears to be split into two sub-varieties, whereas the Robusta is correctly identified as a single cluster. As a result, a small value of ARI is obtained.

Now, we may try variable selection to drop irrelevant features, and see if we can improve upon the above model. The following code uses the backward/forward greedy search for variable selection, which by default is performed over all the covariance decomposition models and numbers of mixture components from 1 up to 9:

R> (out <- clustvarsel(X, direction = "backward"))
------------------------------------------------------
Variable selection for Gaussian model-based clustering
Stepwise (backward/forward) greedy search
------------------------------------------------------
Variable proposed Type of step BICclust Model G BICdiff Decision
Extract Yield Remove -788.3021 VEI 3 -10.930431 Accepted
Neochlorogenic Acid Remove -852.5413 VEI 3 -9.982637 Accepted
Chlorogenic Acid Remove -805.6227 VEI 3 -11.065351 Accepted
Extract Yield Add -999.1101 VEI 3 -9.315106 Rejected
Isochlorogenic Acid Remove -816.8139 VEV 8 -13.958685 Accepted
Extract Yield Add -936.2985 VEV 6 66.325955 Accepted
Extract Yield Remove -816.8139 VEV 8 66.325955 Rejected
Isochlorogenic Acid Add -999.1101 VEI 3 -90.028182 Rejected
Selected subset: Water, Bean Weight, ph Value, Free Acid, Mineral Content, Fat, Caffine, Trigonelline, Extract Yield

Then, the clustering model estimated on the selected subset of variables is:

R> mod2 <- Mclust(X[, out$subset])
R> summary(mod2)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust EEI (diagonal, equal volume and shape) model with 3 components:
log.likelihood n df BIC ICL
-443.2269 43 38 -1029.379 -1030.937
Clustering table:
1 2 3
22 14 7
R> table(Class, Cluster = mod2$class)
Cluster
Class 1 2 3
 Arabica 22 14 0
 Robusta 0 0 7

The number of mixture components (3) selected using the subset of features agrees with that from the model using all the variables, while the covariance parameterization changes from VEI to the closely related diagonal model EEI. The final clustering confirms the structure we already discussed, in particular the two sub-varieties of Arabica coffee.

To show graphically these findings, we may project the data onto a dimension reduced subspace by using the methodology described in Scrucca (2010):

R> mod2dr <- MclustDR(mod2)
R> plot(mod2dr, what = "scatterplot", symbols = c("A", "a", "R"))

Figure 1 shows an evident separation between the Arabica and Robusta coffee samples along the first direction. Moreover, it seems to confirm that the Arabica samples form a non-homogeneous group, which splits into two sub-varieties along the second direction.

Figure 1.


Projection of the coffee data samples marked according to the clustering obtained from the variables selected using the backward/forward greedy search. The symbol R indicates Robusta coffees, while A and a indicate the two sub-varieties of Arabica coffees.

The system times (in seconds) required by the two stepwise greedy searches can be shown using the following code:

R> library("rbenchmark")
R> benchmark(clustvarsel(X, verbose = FALSE),
+   clustvarsel(X, direction = "backward", verbose = FALSE),
+   columns = c("test", "elapsed"), order = NULL, replications = 1)
test elapsed
1 clustvarsel(X, verbose = FALSE) 8.19
2 clustvarsel(X, direction = "backward", verbose = FALSE) 7.74

4.4. Simulated high-dimensional data

Witten and Tibshirani (2010, Section 3.3.2) discussed an example where five clustering variables are conditionally independent given the cluster memberships, whereas the remaining twenty features are simply independent standard normal variables, also independent of the clustering ones. The first five variables are distributed according to a spherical Gaussian distribution with cluster means μ1 = (μ, μ, …, μ), μ2 = 0 and μ3 = −μ1, where μ = 1.7, and common unit standard deviation. We replicated this experiment (denoted by WT) with varying sample sizes (ng cases for each group) and for a set of different techniques.

The algorithms we consider in the comparison are SPARSEKMEANS, and CLUSTVARSEL using both the forward/backward and the backward/forward greedy search. Furthermore, K-MEANS, and MCLUST using either the first five variables or all the variables, have been included as benchmarks. For the MCLUST and CLUSTVARSEL models the EII parameterization is used both at the hierarchical initialization step and for the mixture modeling.

Table 3 reports the variable selection error rate (VSER) and the classification error rate (CER). As already mentioned in Section 4.1, the VSER is defined as the ratio of the number of errors in selecting (or not selecting) variables with respect to the total number of variables considered. The CER between two partitions, which is equivalent to one minus the Rand index (Rand 1971), is equal to 0 in the case of perfect agreement, and becomes larger for increasing disagreement (for a formal definition see Witten and Tibshirani 2010). Smaller values of both VSER and CER are better. These two measures have been chosen for the purpose of comparison with the results in Witten and Tibshirani (2010, Table 4) and Celeux, Martin-Magniette, Maugis-Rabusseau, and Raftery (2014, Section 3.1).

Table 3.

Average values based on 100 simulation runs for the WT simulation scheme with group sample size ng. All models assume the number of clusters G = 3 to be known. For all the mclust models, including those fitted by the variable subset selection algorithms, the EII parameterization was used for both the hierarchical initialization and the mixture modeling. CLUSTVARSEL[fwd] uses the forward/backward greedy search, whereas CLUSTVARSEL[bkw] uses the backward/forward greedy search. Smaller values of both VSER and CER are better. Values within one standard error of the best are shown in bold for each experiment and each criterion. Standard errors are all < 0.1.

Model                 VSER                           CER
                      ng = 10  ng = 20  ng = 50      ng = 10  ng = 20  ng = 50
K-MEANS                                              0.262    0.242    0.218
MCLUST[X1, …, X5]                                    0.073    0.055    0.053
MCLUST[X1, …, X25]                                   0.218    0.128    0.063
SPARSEKMEANS          0.138    0.331    0.799        0.127    0.063    0.054
CLUSTVARSEL[fwd]      0.264    0.274    0.114        0.343    0.347    0.151
CLUSTVARSEL[bkw]      0.222    0.086    0.033        0.181    0.086    0.054

(VSER is not reported for K-MEANS and MCLUST because these methods do not perform variable selection.)

The model with uniformly better performance in terms of classification error is the MCLUST model using only the first five variables, i.e., the model which most closely resembles the data-generating mechanism. However, this model is not available when the clustering variables are unknown, as we are assuming here. MCLUST using all the variables has quite good accuracy, which improves as the group sample size increases. In contrast, K-MEANS on all the variables shows a consistently larger classification error. SPARSEKMEANS performs well in terms of accuracy, but the number of irrelevant variables selected increases with the group sample size, and nearly all the features tend to be selected for ng = 50 (VSER is almost equal to 20/25 = 0.8). CLUSTVARSEL using the backward/forward greedy search also shows good accuracy, which improves as the group sample size gets larger; its variable selection error shrinks as ng grows, so for increasing sample size it converges to the true subset size. The forward/backward greedy search is clearly less accurate than the backward/forward search, and its VSER is also larger, so its overall performance is worse than that of the backward/forward search.

5. Adjustments for speeding up the algorithm

5.1. Sub-sampling at hierarchical initialization step

As mentioned in Section 3, the EM algorithm is initialized in mclust using the partitions obtained from model-based agglomerative hierarchical clustering. Efficient numerical algorithms for approximately maximizing the classification likelihood with multivariate normal models have been discussed by Fraley (1998). However, for datasets having a large number of observations this step can be computationally expensive.

When the number of observations is large, we may allow clustvarsel to use only a subset of the observations at the model-based hierarchical stage of clustering, to speed up the algorithm. This is easily done by setting the argument samp = TRUE, and by specifying the number of observations to be used in the hierarchical clustering subset with sampsize.

Consider the following simulation scheme, which constructs a medium-sized dataset in five dimensions. Only the first two variables contain clustering information; the third is highly correlated with the first, whereas the remaining features are simply noise variables.

R> set.seed(5)
R> library("MASS")
R> library("rbenchmark")
R> n <- 1000
R> pro <- 0.5
R> mu1 <- c(0, 0)
R> mu2 <- c(3, 3)
R> sigma1 <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
R> sigma2 <- matrix(c(1.5, -0.7, -0.7, 1.5), 2, 2)
R> X <- matrix(0, n, 5, dimnames = list(NULL, paste0("X", 1:5)))
R> u <- runif(n)
R> Class <- ifelse(u < pro, 1, 2)
R> X[u < pro, 1:2] <- mvrnorm(sum(u < pro), mu = mu1, Sigma = sigma1)
R> X[u >= pro, 1:2] <- mvrnorm(sum(u >= pro), mu = mu2, Sigma = sigma2)
R> X[, 3] <- X[, 1] + rnorm(n)
R> X[, 4] <- rnorm(n, mean = 1.5, sd = 2)
R> X[, 5] <- rnorm(n, mean = 2, sd = 1)
R> clPairs(X, Class, gap = 0.2)

We may compare the procedure which uses sampling at the hierarchical stage with the default call to clustvarsel, both in terms of computing time, using the benchmark function from the rbenchmark package, and in terms of clustering accuracy.

R> benchmark(out1 <- clustvarsel(X, G = 1:5, verbose = FALSE),
+out2 <- clustvarsel(X, G = 1:5, samp = TRUE, sampsize = 200,
+verbose = FALSE), columns = c("test", "elapsed", "relative"),
+order = NULL, replications = 1)
test elapsed relative
1 out1 <- clustvarsel(X, G = 1:5) 5.560 1.798
2 out2 <- clustvarsel(X, G = 1:5, samp = TRUE, sampsize = 200) 3.093 1.000

Thus, by using sub-sampling for the initial hierarchical clustering we obtain roughly a 1.8-fold speedup over the default call, with the same accuracy:

R> out1$subset
X2 X1
2 1
R> out2$subset
X2 X1
2 1

To investigate the effect of sampling as the number of observations increases, we conducted a small simulation study by replicating the above simulation setting with different sample sizes and a fixed subsample of 200 observations for choosing the initial starting points. Figure 3 shows the results averaged over 10 replications. Panel (a) reports the computing time required as the sample size grows, whereas panel (b) shows the relative gain from using a subset of observations at the initial hierarchical stage. As can be seen, the relative efficiency gain grows roughly exponentially as the number of observations increases, with sampling being about 50 times faster at 10,000 cases: the system time required grows approximately linearly with sample size when sub-sampling is used, whereas with no sampling at the initial stage it increases approximately exponentially. Furthermore, in all the replications the first two variables were selected by both methods. Hence, the improvement in computational efficiency has not caused any deterioration in accuracy.

Figure 3.


Comparison of computing time vs sample size. Panel (a) shows the average over 10 replications for clustvarsel using sub-sampling with a fixed subsample of 200 observations, and with no sampling. Panel (b) shows the speedup factor, calculated as the ratio of execution time for the greedy search with no sampling to the system times for each examined strategy. All axes are on logarithmic scale.

5.2. Headlong search

When a dataset contains a large number of variables, using the headlong search algorithm (search = “headlong”) may be faster than the default greedy search. To show an example, we simulated a dataset analogous to the previous one for the clustering variables, together with eight irrelevant variables: some correlated with the clustering ones, some independent, and some correlated among themselves.

R> set.seed(7)
R> library("MASS")
R> library("rbenchmark")
R> n <- 400
R> pro <- 0.5
R> mu1 <- c(0, 0)
R> mu2 <- c(3, 3)
R> sigma1 <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
R> sigma2 <- matrix(c(1.5, -0.7, -0.7, 1.5), 2, 2)
R> X <- matrix(0, n, 10, dimnames = list(NULL, paste0("X", 1:10)))
R> u <- runif(n)
R> Class <- ifelse(u < pro, 1, 2)
R> X[u < pro, 1:2] <- mvrnorm(sum(u < pro), mu = mu1, Sigma = sigma1)
R> X[u >= pro, 1:2] <- mvrnorm(sum(u >= pro), mu = mu2, Sigma = sigma2)
R> X[, 3] <- X[, 1] + rnorm(n)
R> X[, 4] <- X[, 2] + rnorm(n)
R> X[, 5] <- rnorm(n, mean = 1.5, sd = 2)
R> X[, 6] <- rnorm(n, mean = 2, sd = 1)
R> X[, 7:8] <- mvrnorm(n, mu = mu1, Sigma = sigma1)
R> X[, 9:10] <- mvrnorm(n, mu = mu2, Sigma = sigma2)

Then, we may compare the time required by the greedy search with that required by the headlong search:

R> benchmark(out1 <- clustvarsel(X, G = 1:5, verbose = FALSE),
+out2 <- clustvarsel(X, G = 1:5, search = "headlong", verbose = FALSE),
+columns = c("test", "elapsed", "relative"),
+order = NULL, replications = 1)
test elapsed relative
1 out1 <- clustvarsel(X, G = 1:5) 4.913 1.707
2 out2 <- clustvarsel(X, G = 1:5, search = "headlong") 2.878

In situations where there are many observations and a large number of variables, sub-sampling at the hierarchical initialization step and the headlong search can be used concurrently to improve computational efficiency. A simulation study was conducted by replicating the previous simulation scheme with different sample sizes. The methods compared are the greedy and headlong searches, without and with sampling using sampsize = 200. The results averaged over 10 replications are shown in Figure 4. Without sampling, the headlong search is faster than the greedy search, with a constant speedup factor of about 1.7. The use of sampling at the initial hierarchical stage enables us to achieve an exponential relative gain as the sample size increases for both types of search. Note that in this case too, the headlong search maintains an approximate 1.7-fold advantage over the greedy search.
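For instance, both options can be combined in a single call; the following is an illustrative call reusing the simulated data X from above.

R> out <- clustvarsel(X, G = 1:5, search = "headlong", samp = TRUE,
+    sampsize = 200)
R> out$subset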

Figure 4.


Comparison of computing time vs sample size. Panel (a) shows the average over 10 replications for clustvarsel using search = "greedy" with and without sampling, and search = "headlong" with and without sampling. A fixed value sampsize = 200 is used throughout. Panel (b) shows the speedup factor, calculated as the ratio of execution time for the greedy search with no sampling to the system times for each examined strategy. All axes are on logarithmic scale.

The speed/optimality tradeoff of the headlong search can be adjusted by increasing or decreasing the two threshold levels (arguments BIC.upper and BIC.lower). For example, setting the upper level to 10 instead of 0 requires a variable to show stronger evidence of clustering before it is included, whereas setting the lower level to 0 removes a variable as soon as, at any stage, its evidence for clustering falls below the evidence against clustering.
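For example, the following illustrative call (threshold values chosen only for demonstration) requires stronger evidence before inclusion and drops variables as soon as the evidence against clustering prevails:

R> clustvarsel(X, G = 1:5, search = "headlong", BIC.upper = 10, BIC.lower = 0)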

5.3. Parallel computing

Parallel computing is a form of computation in which the required calculations are performed simultaneously, either on a single multi-core machine or on a cluster of multiple computers.

Direct support for parallelism in R has been available since version 2.14.0 (released in October 2011) through the package parallel (R Core Team 2017). This is essentially a merger of the multicore package (Urbanek 2011) and the snow package (Tierney, Rossini, Li, and Sevcikova 2016). The multicore functionality supports parallelism via forking, a concept from POSIX operating systems, so it is available on all R platforms except Windows. In contrast, snow supports different transport mechanisms (e.g., socket connections) to communicate between the master and the workers, and it is available on all operating systems. Other approaches to parallel computing in R are described in McCallum and Weston (2011). For an extensive list of packages see the CRAN task view on High-Performance and Parallel Computing with R (Eddelbuettel 2017).

The greedy search discussed in Section 2 constitutes an embarrassingly parallel problem, i.e., one for which little or no effort is required to separate the problem into a number of parallel tasks. Essentially, the sequential evaluation of candidate variables for inclusion or exclusion, which is the most time consuming task, can be done in parallel. For the actual implementation in clustvarsel we used the doParallel package (Microsoft Corporation and Weston 2017), a “parallel backend” which acts as an interface between the foreach package (Revolution Analytics and Weston 2015a; Kane, Emerson, and Weston 2013) and the parallel package. Essentially, it provides a mechanism needed to execute for-loops in parallel.

To specify whether parallel computing should be used in the evaluation of the BICdiff criterion in Equation 1, the optional argument parallel must be set to TRUE in the clustvarsel function call. In this case all the available cores, as returned by the detectCores function, are used. A numeric value specifying the number of cores to employ can also be supplied in the optional argument parallel. Finally, the parallelization functionality depends on the operating system: on Windows, only snow type functionality is available, whereas on Unix/Linux/Mac OS X, both snow and multicore (the default) functionalities are available.
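As noted in Section 3, on a cluster of multiple machines the user must set up and register the parallel backend explicitly before calling clustvarsel, and stop it afterwards. A possible workflow is sketched below; the host names are placeholders and X denotes a generic data matrix.

library(parallel)
library(doParallel)
cl <- makeCluster(c("node1", "node2"))   # e.g., a PSOCK cluster of two machines
registerDoParallel(cl)                   # register the backend used by foreach
out <- clustvarsel(X, G = 1:5, parallel = TRUE)
stopCluster(cl)                          # the cluster must be stopped explicitly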

As an example, consider a sample of n = 200 observations generated according to the simulation scheme described in Section 5.2. We may compare the sequential greedy backward/forward search with a parallel version of the algorithm using, by default, all the available cores, and with a parallel version using 2 cores:

R> benchmark(clustvarsel(X, G = 1:9, direction = "backward", verbose = FALSE),
+   clustvarsel(X, G = 1:9, direction = "backward", parallel = TRUE,
+     verbose = FALSE),
+   clustvarsel(X, G = 1:9, direction = "backward", parallel = 2,
+     verbose = FALSE),
+   columns = c("test", "elapsed"), order = NULL, replications = 1)
test elapsed
1 clustvarsel(X, G = 1:9, direction = "backward") 52.12
2 clustvarsel(X, G = 1:9, direction = "backward", parallel = TRUE) 17.46
3 clustvarsel(X, G = 1:9, direction = "backward", parallel = 2) 28.64

In this case, the execution time is reduced to about a third (52.12/17.46 = 2.985) using 4 cores, whereas with 2 cores a speedup factor of 52.12/28.64 = 1.82 is obtained.

By using a machine with P processors instead of just one, we would like to obtain an increase in calculation speed of P times. As shown above, this is not the case, because in the implementation of a parallel algorithm there are some inherently non-parallelizable parts and communication costs between tasks (Nakano 2012). Amdahl's Law (Amdahl 1967) is often used in parallel computing to predict the theoretical maximum speedup when using multiple processors. If f is the fraction of the task that cannot be parallelized, i.e., the part of the algorithm that is strictly serial, and P is the number of processors in use, then the maximum speedup achievable on a parallel computing platform is given by

S_P = 1 / (f + (1 − f)/P).   (3)

In the limit, the above ratio converges to Smax = 1/f, which represents the maximum increase of speed achievable in theory, i.e., by a machine with an infinite number of processors.

To investigate the performance of our parallel algorithm implementation, we conducted a small simulation study using the above simulation setting for increasing numbers of cores. The study was performed on an 18-core Intel Xeon E5-2697 v4 running at 2.30 GHz with 128GB of RAM.

Figure 5 shows the results averaged over 10 replications. The points represent the observed speedup factor (obtained as s_P = t_1/t_P, where t_P is the execution time using P cores) for running the greedy backward/forward algorithm with up to 10 cores. The curve represents Amdahl's Law (Equation 3) with f estimated by nonlinear least squares. It turns out that the estimated fraction of the strictly sequential part of the backward/forward search for variable selection is f = 0.13, which yields a maximum speedup of about S_max = 7.6.
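As a quick back-of-the-envelope check of Equation 3 with this estimate:

amdahl <- function(P, f) 1 / (f + (1 - f) / P)
amdahl(P = 4, f = 0.13)    # ~ 2.9
amdahl(P = 10, f = 0.13)   # ~ 4.6
1 / 0.13                   # theoretical maximum speedup S_max = 1/f, roughly 7.6-7.7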

Figure 5.


Graph of speedup factor vs the number of cores employed in the parallel algorithm for greedy backward/forward subset selection in model-based clustering. The fraction of non-parallelizable task is estimated as f = 0.13, and, by Amdahl's Law, this gives a maximum speedup achievable by parallelization of around 7.6× the sequential algorithm (dashed horizontal line).

6. Conclusions and future work

This paper has presented the R package clustvarsel, which provides a convenient set of tools for variable selection in model-based clustering using finite mixtures of Gaussian densities. A stepwise greedy search and a headlong algorithm are implemented in order to find the (locally) optimal subset of variables containing cluster information. The computational burden of these algorithms can be decreased by some ad hoc adjustments to the algorithms, or via the parallel computing facilities implemented in the package. Examples illustrating the use of the package in practical applications have been presented.

Given the vast solution space, other optimization techniques could be usefully employed. For instance, the use of genetic algorithms as described in Scrucca (2016) will be included in a future release of the package.

The present methodology is restricted to continuous data modeled by mixtures of Gaussian distributions. However, an analogous method was proposed by Dean and Raftery (2010) for multivariate discrete data in the context of latent class analysis models, with improvements recently proposed by Fop, Smart, and Murphy (2017). This methodology could also be included in this or another package.

Figure 2.


Scatterplot matrix of simulated data with points marked according to the known groups.

Acknowledgments

The authors acknowledge the CINECA award under the ISCRA initiative (http://www.hpc.cineca.it/services/iscra/) for the availability of high performance computing resources and support. Adrian E. Raftery and Luca Scrucca were supported by NIH grants R01 HD054511, R01 HD070936 and U54 HL127624.

Contributor Information

Luca Scrucca, Department of Economics, Università degli Studi di Perugia, Via A. Pascoli, 20, 06123 Perugia, Italy, URL: http://www.stat.unipg.it/luca.

Adrian E. Raftery, Department of Statistics, University of Washington, Box 354320, Seattle, WA 98195-4320, United States of America, URL: http://www.stat.washington.edu/raftery/

References

1. Amdahl GM. Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities. AFIPS Conference Proceedings. 1967;30:483–485. doi: 10.1145/1465482.1465560.
2. Andrews JL, McNicholas PD. vscc: Variable Selection for Clustering and Classification. R package version 0.2. 2013. URL https://CRAN.R-project.org/package=vscc.
3. Andrews JL, McNicholas PD. Variable Selection for Clustering and Classification. Journal of Classification. 2014;31(2):136–153. doi: 10.1007/s00357-013-9139-2.
4. Badsberg JH. Model Search in Contingency Tables by CoCo. In: Dodge Y, Whittaker J, editors. Computational Statistics. Vol. 1. 1992. pp. 251–256.
5. Banfield JD, Raftery AE. Model-Based Gaussian and Non-Gaussian Clustering. Biometrics. 1993;49(3):803–821. doi: 10.2307/2532201.
6. Campbell NA, Mahon RJ. A Multivariate Study of Variation in Two Species of Rock Crab of Genus Leptograpsus. Australian Journal of Zoology. 1974;22(3):417–425. doi: 10.1071/zo9740417.
7. Celeux G, Govaert G. Gaussian Parsimonious Clustering Models. Pattern Recognition. 1995;28(5):781–793. doi: 10.1016/0031-3203(94)00125-6.
8. Celeux G, Martin-Magniette ML, Maugis-Rabusseau C, Raftery AE. Comparing Model Selection and Regularization Approaches to Variable Selection in Model-Based Clustering. Journal de la Société Française de Statistique. 2014;155(2):57–71.
9. Dean N. The Variable Selection for Model-Based Clustering (clustvarsel) Package. Vignette for clustvarsel version 1.0. 2006. URL https://CRAN.R-project.org/src/contrib/Archive/clustvarsel/clustvarsel_1.0.tar.gz.
10. Dean N, Raftery AE. Latent Class Analysis Variable Selection. Annals of the Institute of Statistical Mathematics. 2010;62:11–35. doi: 10.1007/s10463-009-0258-9.
11. Dean N, Raftery AE, Scrucca L. clustvarsel: Variable Selection for Model-Based Clustering. R package version 2.3.1. 2017. doi: 10.18637/jss.v084.i01. URL https://CRAN.R-project.org/package=clustvarsel.
12. Dia F, Martin-Magniette ML, Maugis C. SelvarClust software. 2009a. URL http://www.math.univ-toulouse.fr/∼maugis/SelvarClustHomepage.html.
13. Dia F, Martin-Magniette ML, Maugis C. SelvarClustIndep software. 2009b. URL http://www.math.univ-toulouse.fr/∼maugis/SelvarClustIndepHomepage.html.
14. Eddelbuettel D. CRAN Task View: High-Performance and Parallel Computing with R. Version 2017-11-10. 2017. URL https://CRAN.R-project.org/view=HighPerformanceComputing.
15. Fop M, Smart K, Murphy TB. Variable Selection for Latent Class Analysis with Application to Low Back Pain Diagnosis. The Annals of Applied Statistics. 2017;11(4):2080–2110. doi: 10.1214/17-AOAS1061.
16. Fraley C. Algorithms for Model-Based Gaussian Hierarchical Clustering. SIAM Journal on Scientific Computing. 1998;20(1):270–281. doi: 10.1137/s1064827596311451.
17. Fraley C, Raftery AE. Model-Based Clustering, Discriminant Analysis, and Density Estimation. Journal of the American Statistical Association. 2002;97:611–631. doi: 10.1198/016214502760047131.
18. Fraley C, Raftery AE, Murphy TB, Scrucca L. mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. Technical Report 597. Department of Statistics, University of Washington; 2012. URL http://www.stat.washington.edu/research/reports/2012/tr597.pdf.
19. Fraley C, Raftery AE, Scrucca L. mclust: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. R package version 5.4. 2017. URL https://CRAN.R-project.org/package=mclust.
20. Hubert L, Arabie P. Comparing Partitions. Journal of Classification. 1985;2(1):193–218. doi: 10.1007/bf01908075.
21. Kane MJ, Emerson J, Weston S. Scalable Strategies for Computing with Massive Data. Journal of Statistical Software. 2013;55(14):1–19. doi: 10.18637/jss.v055.i14.
22. Kusnierczyk W. rbenchmark: Benchmarking Routine for R. R package version 1.0.0. 2012. URL https://CRAN.R-project.org/package=rbenchmark.
23. Maugis C, Celeux G, Martin-Magniette ML. Variable Selection for Clustering with Gaussian Mixture Models. Biometrics. 2009a;65(3):701–709. doi: 10.1111/j.1541-0420.2008.01160.x.
24. Maugis C, Celeux G, Martin-Magniette ML. Variable Selection in Model-Based Clustering: A General Variable Role Modeling. Computational Statistics & Data Analysis. 2009b;53(11):3872–3882. doi: 10.1016/j.csda.2009.04.013.
25. McCallum E, Weston S. Parallel R. O'Reilly Media; Sebastopol: 2011.
26. McLachlan GJ, Peel D. Finite Mixture Models. John Wiley & Sons; New York: 2000.
27. McNicholas PD, ElSherbiny A, McDaid AF, Murphy TB. pgmm: Parsimonious Gaussian Mixture Models. R package version 1.2. 2015. URL https://CRAN.R-project.org/package=pgmm.
28. Microsoft Corporation, Weston S. doParallel: Foreach Parallel Adaptor for the parallel Package. R package version 1.0.11. 2017. URL https://CRAN.R-project.org/package=doParallel.
29. Miller AJ. Subset Selection in Regression. 2nd edition. Chapman & Hall/CRC; London: 2002.
30. Nakano J. Parallel Computing Techniques. In: Gentle JE, Härdle WK, Mori Y, editors. Handbook of Computational Statistics. 2nd edition. Springer-Verlag; Berlin: 2012. pp. 243–271.
31. Raftery A, Hoeting J, Volinsky C, Painter I, Yeung KY. BMA: Bayesian Model Averaging. R package version 3.18.7. 2017. URL https://CRAN.R-project.org/package=BMA.
32. Raftery AE, Dean N. Variable Selection for Model-Based Clustering. Journal of the American Statistical Association. 2006;101(473):168–178. doi: 10.1198/016214506000000113.
33. Rand W. Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association. 1971;66(336):846–850. doi: 10.1080/01621459.1971.10482356.
34. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing; Vienna, Austria: 2017. URL https://www.R-project.org/.
35. Revolution Analytics, Weston S. foreach: Provides Foreach Looping Construct for R. R package version 1.4.3. 2015a. URL https://CRAN.R-project.org/package=foreach.
36. Revolution Analytics, Weston S. iterators: Provides Iterator Construct for R. R package version 1.0.8. 2015b. URL https://CRAN.R-project.org/package=iterators.
37. Scrucca L. Dimension Reduction for Model-Based Clustering. Statistics and Computing. 2010;20(4):471–484. doi: 10.1007/s11222-009-9138-7.
38. Scrucca L. Genetic Algorithms for Subset Selection in Model-Based Clustering. In: Celebi ME, Aydin K, editors. Unsupervised Learning Algorithms. Springer-Verlag; Cham: 2016. pp. 55–70.
39. Scrucca L, Fop M, Murphy TB, Raftery AE. mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models. The R Journal. 2016;8(1):289–317.
40. Scrucca L, Raftery AE. Improved Initialisation of Model-Based Clustering Using Gaussian Hierarchical Partitions. Advances in Data Analysis and Classification. 2015;9(4):447–460. doi: 10.1007/s11634-015-0220-z.
41. Sedki M, Celeux G, Maugis-Rabusseau C. SelvarMix: Regularization for Variable Selection in Model-Based Clustering and Discriminant Analysis. R package version 1.2.1. 2017. URL https://CRAN.R-project.org/package=SelvarMix.
42. Streuli H. Der Heutige Stand Der Kaffeechemie. In: Association Scientifique International Du Cafe, 6th International Colloquium on Coffee Chemistry; Bogotá, Colombia: 1973. pp. 61–72.
43. Tierney L, Rossini AJ, Li N, Sevcikova H. snow: Simple Network of Workstations. R package version 0.4-2. 2016. URL https://CRAN.R-project.org/package=snow.
44. Urbanek S. multicore: Parallel Processing of R Code on Machines with Multiple Cores or CPUs. R package version 0.1-7. 2011. URL https://CRAN.R-project.org/package=multicore.
45. Witten DM, Tibshirani R. A Framework for Feature Selection in Clustering. Journal of the American Statistical Association. 2010;105(490):713–726. doi: 10.1198/jasa.2010.tm09415.
46. Witten DM, Tibshirani R. sparcl: Perform Sparse Hierarchical Clustering and Sparse K-Means Clustering. R package version 1.0.3. 2013. URL https://CRAN.R-project.org/package=sparcl.
