Author manuscript; available in PMC: 2015 Jan 1.
Published in final edited form as: Stat. 2014 Apr 24;3(1):109–125. doi: 10.1002/sta4.49

Bayesian sparse graphical models and their mixtures

Rajesh Talluri a,*, Veerabhadran Baladandayuthapani a, Bani K Mallick b
PMCID: PMC4059614  NIHMSID: NIHMS581456  PMID: 24948842

Abstract

We propose Bayesian methods for Gaussian graphical models that lead to sparse and adaptively shrunk estimators of the precision (inverse covariance) matrix. Our methods are based on lasso-type regularization priors leading to parsimonious parameterization of the precision matrix, which is essential in several applications involving learning relationships among the variables. In this context, we introduce a novel type of selection prior that develops a sparse structure on the precision matrix by making most of the elements exactly zero, in addition to ensuring positive definiteness – thus conducting model selection and estimation simultaneously. More importantly, we extend these methods to analyze clustered data using finite mixtures of Gaussian graphical models and infinite mixtures of Gaussian graphical models. We discuss appropriate posterior simulation schemes to implement posterior inference in the proposed models, including the evaluation of normalizing constants that are functions of parameters of interest, which result from the restriction of positive definiteness on the correlation matrix. We evaluate the operating characteristics of our method via several simulations and demonstrate the application to real data examples in genomics.

Keywords: Bayesian, covariance selection, finite mixtures, Gaussian graphical models, infinite mixtures, sparse modeling

1. Introduction

Consider the p-dimensional random vector Y = (Y^(1), …, Y^(p)), which follows a multivariate normal distribution Np(μ, Σ), where both the mean μ and the variance-covariance matrix Σ are unknown. Flexible modeling of the covariance matrix Σ, or equivalently the precision matrix Ω = Σ−1, is one of the most important tasks in analyzing Gaussian multivariate data. Furthermore, it has a direct relationship to constructing Gaussian graphical models (GGMs) by identifying the significant edges. Of particular interest in this structure is the identification of zero entries in the precision matrix Ω. An off-diagonal zero entry Ωij = 0 indicates conditional independence between the two random variables Y^(i) and Y^(j), given all other variables. This is the covariance selection problem, or the model selection problem in Gaussian graphical models (Dempster, 1972; Speed & Kiiveri, 1986; Wong et al., 2003), which provides a framework for the exploration of multivariate dependence patterns. GGMs are tools for modeling conditional independence relationships. Among the practical advantages of using GGMs in high-dimensional problems are their ability to (i) make computations more efficient by alleviating the need to handle large matrices, (ii) yield better predictions by fitting sparser models, and (iii) aid scientific understanding by breaking down a global model into a collection of local models that are easier to search.

There have been many approaches to Gaussian graphical modeling. In a Bayesian setting, modeling is based on hierarchical specifications for the covariance matrix (or precision matrix) using global conjugate priors on the space of positive-definite matrices, such as inverse Wishart priors or their equivalents. Dawid & Lauritzen (1993) introduced an equivalent form as the hyper-inverse Wishart (HIW) distribution. Although that construction enjoys many advantages, such as computational efficiency due to its conjugate formulation and exact calculation of marginal likelihoods (Scott & Carvalho, 2008), it is sometimes inflexible due to its restrictive form. Unrestricted graphical model determination is challenging unless the search space is restricted to decomposable graphs, where the marginal likelihoods are available up to the overall normalizing constants (Giudici, 1996; Roverato, 2000). The marginal likelihoods are used to calculate the posterior probability of each graph, which gives an exact solution for small examples, but the number of graphs becomes prohibitively large for even moderately large p. Moreover, extension to nondecomposable graphs is nontrivial and computationally expensive using reversible-jump algorithms (Giudici & Green, 1999; Brooks et al., 2003). G-Wishart prior distributions have been proposed as a generalization of HIW priors for nondecomposable graphs (Roverato, 2002; Atay-Kayis & Massam, 2005). Although it is convenient to use HIW priors due to their conjugate nature, they have several limitations. First, due to the global nature of the hyper-prior specifications, HIW priors are rather restrictive. Second, sampling from the non-decomposable HIW distribution is challenging due to the presence of an unknown normalizing constant, and computationally intensive Monte Carlo based techniques have been proposed to address this (Roverato, 2002; Dellaportas et al., 2003; Jones et al., 2004; Carvalho & Scott, 2009; Lenkoski & Dobra, 2011; Wang et al., 2010; Mitsakakis et al., 2011; Dobra et al., 2011; Wang et al., 2012). This computational burden makes extending these models to more complex frameworks (such as mixtures of graphical models) a daunting task.

Alternate approaches for more adaptive estimation and/or selection in graphical models are based on methods that enforce sparsity either via variable (edge) selection or via regularization/penalization. In a regression context, such priors have been proposed for variable selection problems by George & McCulloch (1993, 1997), Kuo & Mallick (1998), Dellaportas et al. (2000, 2002), Fan & Li (2001), Hans (2009, 2010), Tibshirani et al. (2005) and Griffin et al. (2013). However, the context of covariance selection in graphical models is inherently a different problem, with additional complexity arising from the constraint of positive definiteness and from the number of parameters to estimate being of the order of p^2 instead of p. Regularization-based covariance estimation has been carried out in a frequentist framework (Rothman et al., 2009; Peng et al., 2009; Levina et al., 2008). An alternate class of penalties that has received considerable attention in recent times is the lasso-type penalty (Tibshirani, 1996), which has the ability to promote sparseness and has been used for variable selection in regression problems. In a graphical model context, in a frequentist setting, Meinshausen & Bühlmann (2006), Yuan & Lin (2007), Friedman et al. (2008) and Anjum et al. (2009) proposed methods to estimate the precision or covariance matrix based on lasso-type penalties, which yield only point estimates of the precision matrix. Lasso-based penalties are equivalent to Laplace priors in a Bayesian setting (Figueiredo, 2003; Bae & Mallick, 2004; Park & Casella, 2008). However, in a Bayesian setting, lasso penalties do not produce exact zeros as estimates of the precision matrix elements, and thus cannot be used to conduct model selection simultaneously in such settings (Wang, 2012).

There have been several attempts to shrink the covariance/precision matrix via matrix factorizations that allow an unrestricted search over the space of both decomposable and nondecomposable graphs. Barnard et al. (2000) factorized the covariance matrix in terms of standard deviations and correlations, proposed several shrinkage estimators and discussed suitable priors. Wong et al. (2003) expressed the inverse covariance matrix as a product of the inverse partial variances and the matrix of partial correlations, then used reversible-jump-based Markov chain Monte Carlo (MCMC) algorithms to identify the zeros among the off-diagonal elements. Liechty et al. (2004) proposed flexible modeling schemes using decompositions of the correlation matrix. We adopt this strategy in this paper and propose Bayesian methods for GGMs that allow for simultaneous model selection and parameter estimation. We introduce a novel type of prior in Section 2 that can be decomposed into selection and shrinkage components, in which lasso-type priors are used to accomplish shrinkage and variable selection priors are used for selection. We allow for local exploration of graphical dependencies, which leads to a sparse structure of the precision matrix by forcing most of the non-required elements to be exactly zero with positive probability, while ensuring that the estimate of the precision matrix is positive definite.

More importantly, we extend these methods to mixtures of GGMs for clustered data, with each mixture component assumed to be Gaussian with an adaptive covariance structure. For certain types of correlated data, it is reasonable to assume that the variables can be clustered or grouped based on sharing similar connectivity or graphs. Our motivation for this model arises from a high-throughput gene expression data set, for which it is of interest not only to cluster the patients (samples) into the correct subtype of cancer but also to learn about the underlying characteristics of the cancer subtypes. The main scientific question of interest is differentiating the structure of the gene networks in the cancer subtypes as a means of identifying biologically significant differences that explain the variations between the subtypes. The modeling and inferential challenges relate to determining the number of components, as well as estimating the underlying graph for each component. In Section 3 we extend the Bayesian lasso selection model to finite mixtures of Gaussian graphical models. In Section 4 we further generalize the finite mixture model to infinite mixtures using Dirichlet process priors, where the number of mixture components is unknown. In Section 5 the practical applicability of the proposed model is demonstrated by analyzing a gene expression dataset, and the models are validated using appropriate simulations. We provide a discussion and conclusions in Section 6. Detailed supplementary material is provided, which includes posterior inference and sampling schemes for all the models along with additional validation of the models through a variety of simulation scenarios.

2. Bayesian graphical lasso selection model

To formalize notation, let Yp×n = (Y1, …, Yn) be a p × n matrix with n independent samples and p variates, where each sample Yi = (Yi^(1), …, Yi^(p)) is a p-dimensional vector corresponding to the p variates. Each sample Yi comes from a multivariate normal distribution with mean μ, the covariance matrix between the p variates is Σ, and the n samples are independent with common scale σ², i.e., their covariance matrix is σ²In. This can be formalized as follows: Y follows a matrix normal distribution N(μ, Σ, σ²In) with mean μ, nonsingular covariance matrix Σ between the p variates (Y^(1), …, Y^(p)), and variance σ² for the samples. This is identical to the multivariate normal distribution MVN(μv, Σ ⊗ σ²In), where μv is the vectorized form of the matrix μ and ⊗ is the Kronecker product. Given a random sample Y1, …, Yn, we wish to estimate the precision matrix Ω = Σ−1. The maximum likelihood estimator of (μ, Σ) is (Ȳ, V̄), where V̄ = (1/n) ∑_{i=1}^{n} (Yi − Ȳ)(Yi − Ȳ)^T. The commonly used sample covariance matrix is Ŝ = nV̄/(n − 1). The precision matrix Ω can be estimated by V̄−1 or Ŝ−1. However, if the dimension is p, we need to estimate p(p + 1)/2 unknown parameters, which even for a moderate p might lead to unstable estimates of Ω. In addition, given that our main aim is to explore the conditional relationships among the variables, our main interest is the identification of zero entries in the precision matrix, because a zero entry Ωij = 0 indicates conditional independence between the two covariates Y^(i) and Y^(j), given all other covariates. We propose different priors over Ω to explore these zero entries. Here and throughout the paper we follow the notation that θ1|θ2 represents the conditional distribution of the random variable θ1 given θ2, so that the likelihood of the Gaussian graphical model is written as Y | G ~ (2πσ²)^(−np/2) |Ω|^(n/2) exp{−tr(ΩYY^T)/(2σ²)}. Instead of modeling the entire p × p precision matrix Ω, we explore local dependencies by breaking the model down into components. In our modeling framework, we work directly with standard deviations and a correlation matrix following the separation strategy of Barnard et al. (2000), which does not correspond to any particular type of parameterization (e.g., the Cholesky decomposition). Specification of a reasonable prior for the entire precision matrix is complicated because Wishart priors (or equivalents) are not fully general, but restrict the degrees of freedom of the partial standard deviations. Using this decomposition, prior beliefs on the partial standard deviations and correlations can be easily accommodated.

To this end, we can parameterize the precision matrix as Ω = SCS, where S is the diagonal matrix of standard deviations and C is the correlation matrix. The partial correlation coefficients ρij are related to Cij as ρij = −Ωij/(ΩiiΩjj)^(1/2) = −Cij. In this setting, our primary construct of interest is C, which we wish to model in an adaptive manner. One could model the elements of C directly using shrinkage priors such as Laplace priors, which gives the formulation of the graphical lasso models explored by Meinshausen & Bühlmann (2006), Yuan & Lin (2007), Friedman et al. (2008) and Wang (2012). However, in the Bayesian framework the Laplace prior leads to shrinkage of the partial correlations but does not set them exactly to zero, i.e., it does not explicitly carry out selection, which is of essence in graphical models. To this end, we decompose the correlation matrix C as C = A ⊙ R, where ⊙ is the Hadamard operator that conducts element-wise multiplication. This parameterization helps us divide the problem into two parts, shrinkage and selection, using the respective elements of R and A, with the following properties: (a) both A and R are symmetric matrices; (b) the diagonal elements of both matrices are ones; and (c) the off-diagonal elements of the selection matrix A consist of binary random variables (0 or 1), whereas the off-diagonal elements of the shrinkage matrix R model the partial correlations between the variables with elements that lie in [−1, 1]. Moreover, R can be thought of as a pre-correlation matrix, which need not itself be positive definite, but we constrain the resulting correlation matrix C = A ⊙ R to be positive definite. We do so by jointly modeling A and R as detailed below.
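To make the decomposition concrete, the following is a minimal sketch (our illustration in Python with arbitrary placeholder values, not the authors' implementation) of assembling Ω = S(A ⊙ R)S from the three components and verifying that the implied correlation matrix C = A ⊙ R satisfies the positive definiteness constraint.

import numpy as np

p = 4
S = np.diag([1.2, 0.8, 1.0, 1.5])            # diagonal matrix of standard deviations
R = np.array([[1.0, 0.4, -0.3, 0.0],          # shrinkage matrix: pre-correlations in [-1, 1]
              [0.4, 1.0, 0.2, 0.1],
              [-0.3, 0.2, 1.0, 0.5],
              [0.0, 0.1, 0.5, 1.0]])
A = np.array([[1, 1, 0, 0],                    # selection matrix: binary off-diagonals, unit diagonal
              [1, 1, 1, 0],
              [0, 1, 1, 1],
              [0, 0, 1, 1]])

C = A * R                                      # Hadamard (element-wise) product A ⊙ R
# The joint prior only admits (A, R) pairs for which C is a valid correlation
# matrix, i.e. positive definite with a unit diagonal; here we simply verify it.
is_pd = bool(np.all(np.linalg.eigvalsh(C) > 0))
Omega = S @ C @ S                              # precision matrix Ω = S (A ⊙ R) S
print("C positive definite:", is_pd)
print("Zero pattern of Omega (1 = edge kept):")
print((Omega != 0).astype(int))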

2.1. Joint prior specification of shrinkage and selection matrices

We propose joint prior specifications on the shrinkage matrix R and the selection matrix A as follows. We use a Laplace prior on the elements of the shrinkage matrix, defined as f(Rij | τij) = (1/(2τij)) exp(−|Rij|/τij), with each individual element ij having its own scale parameter τij that controls the level of sparsity. As discussed previously, Laplace priors have been widely used for shrinkage applications. Since A is the selection matrix that performs the variable selection on the elements of the matrix R, it consists of only binary variables, with the off-diagonal elements being either zeros or ones. The most general prior is an exchangeable Bernoulli prior on the off-diagonal elements of A, given as Aij | qij ~ Bernoulli(qij), i < j, where qij is the probability that the ijth element will be selected as 1.

To specify a joint prior for A and R, we have to satisfy the constraint that C = A ⊙ R is positive definite. Hence, the joint prior can be expressed as

Rij, Aij | τij, qij ~ Laplace(0, τij) Bernoulli(qij) I(C ∈ 𝒞p),

where −1 ≤ Rij ≤ 1, 0 ≤ qij ≤ 1 and I(C ∈ 𝒞p) = 1 if C is a (positive definite) correlation matrix and 0 otherwise. Therefore, we ensure the positive definiteness of C by admitting only plausible values of A and R. In this way, Rij and Aij are dependent through the indicator function. The joint prior of A and R can be specified as

A, R | τ, q ~ ∏_{i<j} Laplace(0, τij) Bernoulli(qij) I(C ∈ 𝒞p)

where τ and q are the vectors containing the τij and qij values, respectively. The full specification of the constraints on the Cij's that ensures positive definiteness is discussed in Section S1 (Supplementary Material).

The Rij's are not independent under this prior specification because of the positive definiteness constraint on C imposed through the indicator function. To observe the marginal prior distribution of Rij, we simulate from this joint prior distribution by fixing Aij = 1 for all i, j, so that C is identical to R. Figure 1 shows marginal distributions of some of the Rij's for different values of τij when p = 3, 10 and 20. In addition, we have also plotted the marginal prior distribution arising from a constrained uniform distribution as suggested by Barnard et al. (2000). It is clear from the figures that the marginal prior on individual correlations under the joint Laplace prior shrinks more tightly towards zero than under the joint uniform prior. Furthermore, the effect of the dimension p on the Laplace prior distribution is negligible for small τij values. In this setting, the shrinkage parameter τij controls the degree of sparsity, i.e., it determines the degree to which the ijth element of R is shrunk towards zero. Accordingly, we treat the τij as unknown parameters and estimate them adaptively from the data. We assign an exchangeable inverse gamma prior, τij ~ IG(e, f), i < j, where (e, f) are the shape and scale parameters, respectively. Note that if we set τij = τ for all i, j, along with A = 1 (i.e., a matrix of all 1's), we obtain as a special case the Bayesian version of the graphical lasso of Friedman et al. (2008) and Yuan & Lin (2007), where a single penalty parameter (τ) controls the sparsity of the graph and is estimated via cross-validation or a criterion similar to the Bayesian information criterion (BIC). By allowing the penalty parameter to vary locally for each node, we allow for additional flexibility, which has been shown to result in better properties than those of the lasso prior and which also satisfies the oracle property (consistent model selection), as shown by Griffin & Brown (2011) in the variable selection context.

Figure 1. Marginal priors on Rij for different values of τ = 1, 0.5, 0.1, compared with the uniform prior.
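For completeness, the following is a rough sketch (ours, not the authors' code) of the constrained prior simulation described above: off-diagonal entries are drawn from independent Laplace(0, τ) distributions, A is fixed to a matrix of ones so that C = R, and only draws for which C is a valid positive definite correlation matrix are retained. Histograms of the retained Rij values give marginal priors of the kind shown in Figure 1.

import numpy as np

def sample_constrained_prior(p, tau, n_keep=500, seed=1):
    """Rejection sampler for the joint prior with A fixed to all ones (C = R):
    draw off-diagonal entries from Laplace(0, tau) and keep the matrix only if
    all entries lie in [-1, 1] and C is positive definite."""
    rng = np.random.default_rng(seed)
    iu = np.triu_indices(p, k=1)
    kept = []
    while len(kept) < n_keep:
        r = rng.laplace(0.0, tau, size=len(iu[0]))
        if np.any(np.abs(r) > 1.0):
            continue                            # outside the [-1, 1] support
        R = np.eye(p)
        R[iu] = r
        R[(iu[1], iu[0])] = r                   # symmetrize
        if np.all(np.linalg.eigvalsh(R) > 0):   # positive definiteness constraint
            kept.append(R[0, 1])                # store one element, e.g. R_12
    return np.array(kept)

# Marginal prior of R_12 under the constraint, for p = 10 and tau = 0.1
draws = sample_constrained_prior(p=10, tau=0.1)
print(draws.mean(), draws.std())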

Furthermore, qij is assigned a beta prior, qij ~ Beta(a, b), i < j. In this construction, the qij control the probability that the ijth element is selected as a non-zero element. To favor a highly sparse model, the hyper-parameters (a, b) should be specified such that the beta distribution is skewed towards zero; for a dense model, they should be specified such that the beta distribution is skewed towards one. Furthermore, prior beliefs about the existence of edges can be incorporated at this stage of the hierarchy by giving greater weight to important edges while down-weighting redundant edges.

In summary, the joint specification of A and R above gives us the graphical lasso selection model, which performs simultaneous shrinkage and selection. To complete the hierarchical specification, we use an inverse gamma prior on the inverse of the partial standard deviations, Si ~ IG(g, h), i = 1, 2, …, p. We use MCMC techniques for estimation and inference in the models specified above, including hybrid Gibbs and Metropolis-Hastings strategies to explore the graph space. A critical step in our moves is the evaluation of normalizing constants that result from the positive-definite constraint on the correlation matrices. Posterior sampling schemes are detailed in Section S1 of the supplementary material.

3. Mixtures of Gaussian graphical models

3.1. Motivating example

A strength of the proposed method is that it can be employed within more complex modeling frameworks in a hierarchical manner. In this section we use the proposed method to develop finite mixture graphical models, where each mixture component is assumed to follow a GGM with an adaptive covariance structure. Our motivation for this model arises from the analysis of a high-throughput genomics data set. Suppose we have a gene expression data set with n samples and g genes, and we are interested in detecting K subtypes of cancer among the n samples. We assume a different network (graph) structure of these g genes for each cancer subtype and have a primary goal of using this information efficiently to cluster the samples into the correct subtype of cancer. In addition, we wish to identify biologically significant differences among the networks to explain the variations between the cancer subtypes. To this end, we extend the Bayesian lasso selection model outlined in the previous sections within a finite mixture set-up. This framework allows for additional flexibility over that of regular GGMs by allowing a heterogeneous population to be divided into groups that are more homogeneous and which share similar connectivity or graphs.

3.2. The hierarchical finite mixture GGM model

Let Yp×n = (Y1, …, Yn) be a p × n matrix with n samples and p covariates. Here each of the n samples belongs to one of K hidden groups or strata. Each sample Yi follows a multivariate normal distribution N(θj, Σj) if it belongs to the jth group. Given a random sample Y1, …, Yn, we wish to estimate the number of mixtures K as well as the precision matrices Ωj = Σj^−1, j = 1, …, K. We introduce a latent indicator variable Li ∈ {1, 2, …, K} corresponding to each observation Yi, which indicates the component of the mixture associated with Yi, i.e., Li = j if Yi belongs to the jth group. A priori, we assume P(Li = j) = pj such that p1 + p2 + … + pK = 1. We can then write the likelihood of the data conditional on the latent variables as

Yi | Li = j, θ, Ω ~ N(θj, Ωj^−1).

The latent indicator variables are allowed a priori to follow a multinomial distribution with probabilities p1, …, pK as

Li ~ Multinomial(n, [p1, p2, …, pK]),

and the associated class probabilities follow a Dirichlet distribution as

p1, p2, …, pK | α ~ Dirichlet(α1, α2, …, αK).

We allow the individual means of each group to follow a Normal distribution as

θj | B ~ N(0, B).

We assign a common inverse Wishart prior for covariance matrix B across groups as B ~ IW(ν0, B0), where ν0 is the shape parameter and B0 is the scale matrix. The hierarchical specification of the GGM structure for each group Ωj parallels the development of the previous section, with each GGM indexed by its own mixture-specific parameters to allow the sparsity to vary within each cluster component. The hierarchical model for finite mixture GGMs can be summarized as follows:

Yi | Li = j, θ, Ω ~ N(θj, Ωj^−1)
Li ~ Multinomial(n, [p1, p2, …, pK])
p1, p2, …, pK | α ~ Dirichlet(α1, α2, …, αK)
θj | B ~ N(0, B)
B ~ IW(ν0, B0)
Ωj = Sj(Aj ⊙ Rj)Sj
Aj(lm) | qj(lm) ~ Bernoulli(qj(lm)), l < m
Rj | Aj ~ ∏_{l<m} Laplace(0, τj(lm)) I(Cj ∈ 𝒞p)
τj(lm) ~ IG(e, f)
Sj(l) ~ IG(g, h),

where i denotes the sample and j denotes the mixture component, i = 1, 2, …, n and j = 1, 2, …, K. In addition, Aj(lm) and τj(lm) denote the lmth components of Aj and τj, l = 1, 2, …, p, m = l, …, p. Posterior inference and conditional distributions are detailed in Section S2 of the supplementary material. Conditional on the number of mixtures (K) we fit a finite mixture model; we then vary the number of mixtures and select the optimal number using BIC, as explained in Section S2 (Supplementary Material). An alternative approach to model determination is to compute Bayes factors from the MCMC samples (Chib & Jeliazkov, 2001).
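As a hedged illustration of the BIC-based choice of K (the exact likelihood evaluation used in the paper is given in Section S2; the sketch below simply plugs posterior point estimates into a Gaussian mixture log-likelihood, and fit_finite_mixture_ggm is a hypothetical stand-in for the MCMC fit at a given K):

import numpy as np
from scipy.stats import multivariate_normal

def mixture_bic(Y, weights, means, precisions):
    """BIC for a fitted K-component Gaussian mixture.
    Y: (n, p) data; weights: (K,); means: list of (p,) vectors; precisions: list of
    (p, p) matrices, all taken as posterior point estimates from the MCMC output."""
    n, p = Y.shape
    K = len(weights)
    # Mixture log-likelihood at the point estimates (log-sum-exp would be safer for large p)
    dens = np.column_stack([
        w * multivariate_normal.pdf(Y, mean=m, cov=np.linalg.inv(P))
        for w, m, P in zip(weights, means, precisions)
    ])
    loglik = np.sum(np.log(dens.sum(axis=1)))
    # Crude parameter count: mixture weights, means, and for each component the
    # diagonal plus the selected (nonzero) upper-triangular entries of the precision matrix
    n_params = (K - 1) + K * p
    for P in precisions:
        n_params += p + np.count_nonzero(np.triu(P, k=1))
    return -2.0 * loglik + n_params * np.log(n)

# Hypothetical usage: fit the model for each candidate K and pick the BIC minimizer.
# best_K = min(candidate_Ks, key=lambda K: mixture_bic(Y, *fit_finite_mixture_ggm(Y, K)))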

4. Infinite mixtures of Gaussian graphical models

We now extend our modeling framework to infinite mixtures of Gaussian graphical models for the case where the number of mixture components is unknown. The added advantage of this approach is that the number of mixture components is treated as a model unknown and is determined adaptively from the data rather than via BIC. Infinite mixtures of Gaussian graphical models using hyper-inverse Wishart priors were previously proposed by Rodriguez et al. (2011). The proposed model improves upon this by incorporating shrinkage and selection priors to estimate mixtures of sparse graphical models.

Following the previous section, Yp×n = (Y1, …, Yn) is a p × n matrix with n samples and p covariates, and the likelihood function is

Yi | θi, Ωi ~ N(θi, Ωi^−1).

Let γi = (θi, Ωi^−1). We propose a Dirichlet process prior (Ferguson, 1973; Dey et al., 1998; Müller & Quintana, 2004) on γ = (γ1, γ2, …, γn), which can be written as

γ = (γ1, γ2, …, γn) ~ DP(α, Hϕ),

where α is the precision parameter and Hϕ is the base distribution of the Dirichlet process (DP) prior. In a sequence of draws γ1, γ2, … from the Polya urn representation of the Dirichlet process (Blackwell & MacQueen, 1973), the nth sample is either distinct with a small probability α/(α + n − 1) or is tied to a previous sample with positive probability, forming clusters. Let γ−n = {γ1, …, γn} − {γn} and let dn−1 denote the number of preexisting clusters of tied samples in γ−n at the nth draw; then we have

f(γn | γ−n, α, ϕ) = [α/(α + n − 1)] Hϕ + ∑_{j=1}^{dn−1} [nj/(α + n − 1)] δγ̄j,   (1)

where Hϕ is the base prior and the jth cluster has nj tied samples, commonly represented by γ̄j, subject to ∑_{j=1}^{dn−1} nj = n − 1. After n sequential draws from the Polya urn, there are several ties in the sampled values, and we denote the set of distinct samples by {γ̄1, …, γ̄dn}, where dn is essentially the number of clusters. We induce sparsity into the model through the base distribution Hϕ, which defines the cluster configuration. We allow the individual means of each group to follow a Normal distribution as

θj | Ωj ~ N(0, Ωj^−1).

The hierarchical specification of the GGM structure for each group Ωj parallels the development of the previous section, with each GGM indexed by its own mixture-specific parameters to allow the sparsity to vary within each cluster component. Hence, Hϕ = Nθ|Ω(·) FΩ, where FΩ is the baseline prior for the precision matrix. The hierarchical model for the base prior can be summarized as follows:

Hϕ = Nθ|Ω(·) FΩ
Ω = S(A ⊙ R)S
FΩ = FA(·) FR(·) FS(·)
A, R | τ, Q ~ ∏_{i<j} Laplace(0, τij) Bernoulli(qij) I(C ∈ 𝒞p)
τ ~ IG(νe, νf)
Q ~ Beta(νc, νd)
S ~ IG(να, νβ).

The parameterization is similar to that of the models detailed in the previous sections. The base prior is not conjugate, so it cannot be integrated against the likelihood to draw from the posterior within a Gibbs sampling framework (Escobar & West, 1995); instead, we use a Metropolis-Hastings algorithm to handle the non-conjugacy (Neal, 2000). The posterior inference and conditional distributions for this model are detailed in Section S3 of the supplementary material.
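The Polya urn weights in equation (1) are what drive the cluster assignment step of the sampler. Below is a small generic sketch of those prior weights (a new cluster is opened with weight proportional to α, an existing cluster is joined with weight proportional to its size). It is not the authors' sampler, which combines these weights with likelihood terms and Metropolis-Hastings updates to handle the non-conjugate base prior (Neal, 2000).

import numpy as np

def polya_urn_prior_probs(cluster_sizes, alpha):
    """Prior probabilities for the next draw under DP(alpha, H):
    proportional to n_j for each existing cluster j, and to alpha for a new cluster."""
    sizes = np.asarray(cluster_sizes, dtype=float)
    weights = np.append(sizes, alpha)          # existing clusters, then "new cluster"
    return weights / weights.sum()             # normalizing constant is alpha + n - 1

# Example: 3 existing clusters of sizes 5, 2, 1 and concentration alpha = 1.0
probs = polya_urn_prior_probs([5, 2, 1], alpha=1.0)
print(probs)   # [0.556, 0.222, 0.111, 0.111]; the last entry is the new-cluster probability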

5. Simulations and real data analysis

5.1. Application to a genomics dataset

For the real data analysis, we used the leukaemia data from Golub et al. (1999) as a case study to illustrate our mixtures of graphical models. In this study, the authors measured human gene expression signatures of acute leukaemia. They used supervised learning to predict the type of leukaemia and unsupervised learning to discover new classes of leukaemia. The motivation for this work was to improve cancer treatment by distinguishing between subclasses of cancers or tumors. The data are available from http://www.genome.wi.mit.edu/MPR. The data set includes 6817 genes and 72 patient samples. We selected the 50 most relevant genes, identified using a Bayesian gene selection algorithm (Lee et al., 2003). The heat map of the top 50 genes in the data set is shown in Figure 2; the expression profiles of these genes form distinct groups of genes that behave concordantly, hence warranting a more thorough investigation to explicitly explore the dependence patterns that vary by group.

Figure 2. Heat map of the top 50 genes in the leukaemia data set.

We fit the finite mixtures of Gaussian graphical models to this data set to find the top graphs for the data. We ran the Markov chain Monte Carlo simulation for 100,000 samples and removed the first 20,000 samples as burn-in. We obtained two clusters, corresponding to the two subtypes of leukaemia: (1) acute lymphoblastic leukaemia (ALL) and (2) acute myelogenous leukaemia (AML). The networks corresponding to the two clusters are shown in Figures 3 and 4. As shown in the figures, the networks for these two clusters are quite different, which suggests possible interactions between genes that differ depending on the subtype of leukaemia.

Figure 3. Significant edges for the genes in the ALL cluster. The red (green) lines indicate a negative (positive) correlation between the proteins.

Figure 4. Significant edges for the genes in the AML cluster. The red (green) lines indicate a negative (positive) correlation between the proteins.

We further explored the biological ramifications of our findings using the gene annotations also used by Golub et al. (1999). Most of the genes active in the ALL network are inactive in the AML network and vice versa. It is known that ITGAX and CD33 play a role in encoding cell surface proteins that are useful in distinguishing lymphoid from myeloid lineage cells. We can see in the networks of the clusters that CD33 and ITGAX are active in the AML network but inactive in the ALL network. The zyxin gene plays a role in producing a protein important for cell adhesion. Zyxin is also active in the AML network but not in the ALL network. In general, the genes most useful in AML vs. ALL class prediction are markers of haematopoietic lineage, which are not necessarily related to cancer pathogenesis. However, many of these genes encode proteins critical for S-phase cell cycle progression (CCND3, STMN1, and MCM3), chromatin remodeling (RBBP4 and SMARC4), transcription (GTF2E2), and cell adhesion (zyxin and ITGAX), or are known oncogenes (MYB, TCF3 and HOXA9) (Golub et al., 1999). The genes encoding proteins for S-phase cell cycle progression (CCND3, STMN1, and MCM3) were all found to be active in the ALL network but inactive in the AML network. This suggests a connection between ALL and the S-phase cell cycle. Genes responsible for chromatin remodeling and transcription factors were present in both networks, indicating that they are common to both types of cancer. This information could be used to discover a common drug for both types of leukaemia. Among the oncogenes, MYB was related to the ALL network, whereas TCF3 and HOXA9 were related to the AML network. HOXA9 overexpression is responsible for transformation in myeloid cells and causes leukaemia in animal models. A general role for HOXA9 expression in predicting AML outcomes has been suggested by Golub et al. (1999). We also confirmed that HOXA9 is active in the AML network, but not in the ALL network.

5.2. Simulations

For validation of the Bayesian lasso selection model, we compared our method with the glasso (Friedman et al., 2008) and MB (Meinshausen & Bühlmann, 2006) methods, because both use L1 regularization and are closest to our approach using Laplace priors. We assessed the performance of these methods in terms of the Kullback-Leibler (K-L) loss, defined as ΔKL(Ω̂, Ω) = trace(ΩΩ̂^−1) − log|ΩΩ̂^−1| − p. The simulations were designed using sparse, identity, banded, block and dense structures for the precision matrix. In terms of the K-L loss, the Bayesian method outperforms the glasso and MB methods for the sparse unstructured and banded diagonal matrices. All the methods performed equivalently for the identity matrix, the block diagonal, and the dense structure. The complete simulation details are presented in Section S4 of the supplementary material.
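For reference, a direct transcription of the K-L loss formula above into code (a minimal sketch; the matrices in the check are placeholders):

import numpy as np

def kl_loss(Omega_hat, Omega):
    """Kullback-Leibler loss: trace(Omega Omega_hat^{-1}) - log|Omega Omega_hat^{-1}| - p."""
    p = Omega.shape[0]
    M = Omega @ np.linalg.inv(Omega_hat)
    sign, logdet = np.linalg.slogdet(M)
    return np.trace(M) - logdet - p

# Sanity check: the loss is zero when the estimate equals the truth
Omega = np.array([[2.0, 0.5], [0.5, 1.0]])
print(kl_loss(Omega, Omega))   # ~0.0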

For validation of the mixture model, we performed a posterior predictive simulation study to evaluate the operating characteristics of our methodology for mixtures of graphical models. We simulated data from our fitted model of the leukaemia data set using the estimated precision matrices for the two groups, ALL and AML. The simulation was conducted as follows. Let μ̂j and Ω̂j denote the estimated mean and precision matrix corresponding to the ALL (j = 1) and AML (j = 2) groups, respectively. We generated data from the multivariate normal likelihood Yj ~ N(μ̂j, Ω̂j^−1) for each group, with 100 samples and 50 covariates in total.
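A sketch of the data-generating step of this posterior predictive simulation, with placeholder values standing in for the fitted ALL/AML means and precision matrices (μ̂j, Ω̂j):

import numpy as np

def generate_group_data(mu_hat, Omega_hat, n, rng):
    """Draw n samples from N(mu_hat, Omega_hat^{-1})."""
    cov = np.linalg.inv(Omega_hat)
    return rng.multivariate_normal(mu_hat, cov, size=n)

rng = np.random.default_rng(7)
p = 50
# Placeholder estimates; in the study these are the fitted ALL / AML means and precisions.
mu_hat = [np.zeros(p), np.ones(p)]
Omega_hat = [np.eye(p), np.eye(p) * 2.0]
Y = np.vstack([generate_group_data(mu_hat[j], Omega_hat[j], n=50, rng=rng) for j in range(2)])
print(Y.shape)   # (100, 50): 100 samples on 50 covariates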

We (re-)fitted our models to the simulated data and compared the resulting estimates of the covariance matrices with those obtained from the non-adaptive finite mixture model (MCLUST) of Fraley & Raftery (2007). We used the "VVV" setting, which implies the use of an unconstrained covariance estimation method in their procedure. We completed 100,000 runs of the Markov chain Monte Carlo simulation and removed the first 10,000 runs as burn-in. The true and corresponding estimated precision matrices from the two methods are shown in Figure 5, where the absolute values of the precision matrices, excluding the diagonal, are plotted.

Figure 5. Simulation study (p = 50). True and estimated precision matrices for the two subtypes of leukaemia: (a) ALL and (b) AML. The top row of images shows the true data generating precision matrices; the middle row shows the estimates from our adaptive Bayesian model; and the bottom row shows the estimates from a non-adaptive fit. The absolute values of the partial correlations are plotted without the diagonal. The colorbars are shown to the right of each image.

As shown in the figure, fitting our adaptive model to the data (middle row of images) yields estimates that are similar to the true data generating precision matrices, whereas fitting the non-adaptive model to the data (bottom row of images) yields noisier estimates, with less local shrinkage of the off-diagonal elements. In addition to a visual inspection, we compared the performance of both methods using the K-L distance. For the ALL cluster, the K-L distances were 3.5592 and 7.2210 for the adaptive and non-adaptive model fits, respectively; for the AML cluster, the corresponding distances were 4.0836 and 7.7881. In addition, we compared the false positive and false negative rates for finding true edges using each method. It should be noted that the purpose of the MCLUST approach is not covariance selection; hence we imposed selection on the elements of its estimated precision matrices by thresholding coefficients to zero if they were less than a defined constant. We chose a fairly generous thresholding constant, 0.15, so that the false negatives and false positives were minimized, and applied it to the coefficients of the precision matrices estimated for the two clusters. For the AML cluster, we found false positive rates of (0.0049, 0.0645) and false negative rates of (0.0106, 0) for our adaptive model and the MCLUST approach, respectively. For the ALL cluster, we found false positive rates of (0.0041, 0.0661) and false negative rates of (0.0131, 0) for the adaptive and non-adaptive model fits, respectively. In summary, our adaptive method performs better in recovering the true sparse precision matrix than the simple (non-adaptive) clustering approach.
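As an illustration of how such edge-recovery error rates can be computed (our sketch, not the exact evaluation code), thresholding is applied to the non-selecting estimate before comparing its zero pattern with the truth:

import numpy as np

def edge_error_rates(Omega_true, Omega_est, threshold=0.0):
    """False positive and false negative rates for off-diagonal edge recovery.
    Entries of the estimate with absolute value <= threshold are treated as zero
    (threshold = 0 for selection-based estimates; e.g. 0.15 for MCLUST here)."""
    iu = np.triu_indices_from(Omega_true, k=1)
    true_edges = Omega_true[iu] != 0
    est_edges = np.abs(Omega_est[iu]) > threshold
    fp = np.sum(est_edges & ~true_edges) / max(np.sum(~true_edges), 1)
    fn = np.sum(~est_edges & true_edges) / max(np.sum(true_edges), 1)
    return fp, fn

# Toy example: one true edge (1,2); the estimate adds a small spurious entry at (1,3)
Omega_true = np.array([[1.0, 0.6, 0.0], [0.6, 1.0, 0.0], [0.0, 0.0, 1.0]])
Omega_est  = np.array([[1.0, 0.5, 0.1], [0.5, 1.0, 0.0], [0.1, 0.0, 1.0]])
print(edge_error_rates(Omega_true, Omega_est, threshold=0.15))  # (0.0, 0.0): 0.1 is thresholded away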

To explore how the method scales with the number of covariates, we ran another simulation with 100 covariates and 200 samples. The results are plotted in Figure 6. We find a similar pattern of performance: fitting our adaptive model to the data (middle row of images) yields estimates that are closer to the true data generating precision matrices, whereas fitting the non-adaptive model (bottom row of images) yields noisier estimates, with less local shrinkage of the off-diagonal elements. Again we compared the performance of both methods using the K-L distance: for the ALL cluster, the corresponding distances were 10.1241 and 25.3378 for the adaptive and non-adaptive model fits, respectively, and for the AML cluster they were 12.1244 and 27.4851. We chose a thresholding constant of 0.15 and applied it to the coefficients of the precision matrices estimated for the two clusters using MCLUST. For the AML cluster we found false positive rates of (0.0063, 0.2822) and false negative rates of (0.0222, 0.0081) for our adaptive model and the MCLUST approach, respectively. For the ALL cluster, we found false positive rates of (0.0044, 0.2497) and false negative rates of (0.101, 0.0372) for the adaptive and non-adaptive fits, respectively. Thus, compared to the non-adaptive approach, our adaptive method performed better in recovering the true sparse precision matrix. We found that our methods scale reasonably up to around 300 covariates, but above that level the computational complexity does not allow for a reasonable computation time. The computational time for the above simulation is approximately 20 hours for 100,000 MCMC runs. Parallel computation on cluster machines can be used to speed up the process when the number of covariates is extremely high. Alternatively, we plan to explore faster deployments of our algorithm through variational or other approximations.

Figure 6. Simulation study (p = 100). True and estimated precision matrices for the two subtypes of leukaemia: (a) ALL and (b) AML. The top row of images shows the true data generating precision matrices; the middle row shows the estimates from our adaptive Bayesian model; and the bottom row shows the estimates from a non-adaptive fit. The absolute values of the partial correlations are plotted without the diagonal. The colorbars are shown to the right of each image.

6. Discussion and conclusions

In this article we develop a Bayesian framework for adaptive estimation of precision matrices in Gaussian graphical models. We propose sparse estimators using L1 regularization and use lasso-based selection priors to obtain sparse and adaptively shrunk estimators of the precision matrix that conduct simultaneous model selection and estimation. We extend these methods to mixtures of Gaussian graphical models for clustered data, with each mixture component assumed to be Gaussian with an adaptive covariance structure. We discuss appropriate posterior simulation schemes for implementing posterior inference in the proposed models, including the evaluation of normalizing constants that are functions of the parameters of interest and result from constraints on the correlation matrix. We compared our methods with several existing methods from the literature using both real and simulated examples, and found our methods to be very competitive, in some cases outperforming the existing methods.

Our simulations and analysis suggest that it is feasible to implement adaptive GGMs and mixtures of GGMs using Markov chain Monte Carlo for a reasonable number of variables (p up to a few hundred). Applications to higher-dimensional settings may require more refined sampling algorithms and/or parallelized computation for our method to run in a reasonable time. One nice feature of our modeling framework is that it can be generalized to other contexts in a straightforward manner. As opposed to the unsupervised setting we considered, another context would be that of supervised learning or classification using GGMs. Another interesting setting would be to extend our methods to situations in which the variables are observed over time, using our models to develop time-dependent sparse dynamic graphs. We leave these tasks for future consideration.

Supplementary Material

Supp Material

Acknowledgments

RT's research was supported in part by a cancer prevention fellowship funded by a grant from the National Institute on Drug Abuse (NIH R25 DA026120). VB's research was partially supported by NIH grant R01 CA160736 and the Cancer Center Support Grant (CCSG) (P30 CA016672). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Cancer Institute or the National Institutes of Health. We also thank Lee Ann Chastain for editing the manuscript.

Footnotes

Supporting Information

Additional information for this article is available:

Section S1: Posterior sampling schemes for the Bayesian lasso selection model

Section S2: Posterior inference and conditional distributions for finite mixture of Gaussian graphical models

Section S3: Posterior inference and conditional distributions for infinite mixture of Gaussian graphical models

Section S4: Simulations for Bayesian lasso selection model

References

1. Anjum S, Doucet A, Holmes CC. A boosting approach to structure learning of graphs with and without prior knowledge. Bioinformatics. 2009;25(22):2929–2936. doi: 10.1093/bioinformatics/btp485.
2. Atay-Kayis A, Massam H. A Monte Carlo method for computing the marginal likelihood in nondecomposable Gaussian graphical models. Biometrika. 2005;92(2):317.
3. Bae K, Mallick B. Gene selection using a two-level hierarchical Bayesian model. Bioinformatics. 2004;20(18):3423–3430. doi: 10.1093/bioinformatics/bth419.
4. Barnard J, McCulloch R, Meng X. Modeling covariance matrices in terms of standard deviations and correlations, with application to shrinkage. Statistica Sinica. 2000;10(4):1281–1312.
5. Blackwell D, MacQueen JB. Ferguson distributions via Pólya urn schemes. The Annals of Statistics. 1973;1(2):353–355.
6. Brooks S, Giudici P, Roberts G. Efficient construction of reversible jump Markov chain Monte Carlo proposal distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2003;65(1):3–39.
7. Carvalho CM, Scott JG. Objective Bayesian model selection in Gaussian graphical models. Biometrika. 2009;96(3):497–512.
8. Chib S, Jeliazkov I. Marginal likelihood from the Metropolis-Hastings output. Journal of the American Statistical Association. 2001;96(453):270–281.
9. Dawid AP, Lauritzen SL. Hyper Markov laws in the statistical analysis of decomposable graphical models. The Annals of Statistics. 1993;21(3):1272–1317.
10. Dellaportas P, Forster J, Ntzoufras I. Bayesian variable selection using the Gibbs sampler. In: Dey D, Ghosh S, Mallick B, editors. Generalized Linear Models: A Bayesian Perspective. Chemical Rubber Company Press; New York: 2000.
11. Dellaportas P, Forster JJ, Ntzoufras I. On Bayesian model and variable selection using MCMC. Statistics and Computing. 2002;12:27–36.
12. Dellaportas P, Giudici P, Roberts G. Bayesian inference for nondecomposable graphical Gaussian models. Sankhya, Series A. 2003;65(1):43–55.
13. Dempster AP. Covariance selection. Biometrics. 1972;28(1):157–175.
14. Dey D, Müller P, Sinha D. Practical nonparametric and semiparametric Bayesian statistics. Springer; New York: 1998.
15. Dobra A, Lenkoski A, Rodriguez A. Bayesian inference for general Gaussian graphical models with application to multivariate lattice data. Journal of the American Statistical Association. 2011;106(496). doi: 10.1198/jasa.2011.tm10465.
16. Escobar MD, West M. Bayesian density estimation and inference using mixtures. Journal of the American Statistical Association. 1995;90(430):577–588.
17. Fan J, Li R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association. 2001;96(456):1348–1360.
18. Ferguson TS. A Bayesian analysis of some nonparametric problems. The Annals of Statistics. 1973;1(2):209–230.
19. Figueiredo MAT. Adaptive sparseness for supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2003;25(9):1150–1159.
20. Fraley C, Raftery AE. Bayesian regularization for normal mixture estimation and model-based clustering. Journal of Classification. 2007;24(2):155–181.
21. Friedman J, Hastie T, Tibshirani R. Sparse inverse covariance estimation with the graphical lasso. Biostatistics. 2008;9(3):432–441. doi: 10.1093/biostatistics/kxm045.
22. George EI, McCulloch RE. Variable selection via Gibbs sampling. Journal of the American Statistical Association. 1993;88(423):881–889.
23. George EI, McCulloch RE. Approaches for Bayesian variable selection. Statistica Sinica. 1997;7:339–373.
24. Giudici P. Learning in graphical Gaussian models. Oxford University Press; London: 1996.
25. Giudici P, Green PJ. Decomposable graphical Gaussian model determination. Biometrika. 1999;86(4):785–801.
26. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science. 1999;286(5439):531–537. doi: 10.1126/science.286.5439.531.
27. Griffin JE, Brown PJ. Bayesian hyper-lassos with non-convex penalization. Australian & New Zealand Journal of Statistics. 2011;53(4):423–442.
28. Griffin JE, Brown PJ, et al. Some priors for sparse regression modelling. Bayesian Analysis. 2013;8(3):691–702.
29. Hans C. Bayesian lasso regression. Biometrika. 2009;96(4):835–845.
30. Hans C. Model uncertainty and variable selection in Bayesian lasso regression. Statistics and Computing. 2010;20(2):221–229.
31. Jones B, Carvalho C, Dobra A, Hans C, Carter C, West M. Experiments in stochastic computation for high-dimensional graphical models. Statistical Science. 2004;20:388–400.
32. Kuo L, Mallick B. Variable selection for regression models. Sankhya, Series B. 1998;60(1):65–81.
33. Lee KE, Sha N, Dougherty ER, Vannucci M, Mallick BK. Gene selection: A Bayesian variable selection approach. Bioinformatics. 2003;19(1):90–97. doi: 10.1093/bioinformatics/19.1.90.
34. Lenkoski A, Dobra A. Computational aspects related to inference in Gaussian graphical models with the G-Wishart prior. Journal of Computational and Graphical Statistics. 2011;20(1):140–157.
35. Levina E, Rothman A, Zhu J. Sparse estimation of large covariance matrices via a nested lasso penalty. The Annals of Applied Statistics. 2008;2(1):245–263.
36. Liechty JC, Liechty MW, Muller P. Bayesian correlation estimation. Biometrika. 2004;91(1):1–14.
37. Meinshausen N, Bühlmann P. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics. 2006;34(3):1436–1462.
38. Mitsakakis N, Massam H, Escobar MD, et al. A Metropolis-Hastings based method for sampling from the G-Wishart distribution in Gaussian graphical models. Electronic Journal of Statistics. 2011;5:18–30.
39. Müller P, Quintana FA. Nonparametric Bayesian data analysis. Statistical Science. 2004;19(1):95–110.
40. Neal RM. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics. 2000;9(2):249–265.
41. Park T, Casella G. The Bayesian lasso. Journal of the American Statistical Association. 2008;103(482):681–686.
42. Peng J, Wang P, Zhou N, Zhu J. Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association. 2009;104(486):735–746. doi: 10.1198/jasa.2009.0126.
43. Rodriguez A, Lenkoski A, Dobra A, et al. Sparse covariance estimation in heterogeneous samples. Electronic Journal of Statistics. 2011;5:981–1014. doi: 10.1214/11-EJS634.
44. Rothman A, Levina E, Zhu J. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association. 2009;104(485):177–186.
45. Roverato A. Cholesky decomposition of a hyper inverse Wishart matrix. Biometrika. 2000;87(1):99–112.
46. Roverato A. Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics. 2002;29(3):391–411.
47. Scott JG, Carvalho CM. Feature-inclusion stochastic search for Gaussian graphical models. Journal of Computational and Graphical Statistics. 2008;17(4):790–808.
48. Speed TP, Kiiveri HT. Gaussian Markov distributions over finite graphs. The Annals of Statistics. 1986;14(1):138–150.
49. Tibshirani R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological). 1996;58(1):267–288.
50. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(1):91–108.
51. Wang H. Bayesian graphical lasso models and efficient posterior computation. Bayesian Analysis. 2012;7(4):867–886.
52. Wang H, Carvalho CM, et al. Simulation of hyper-inverse Wishart distributions for non-decomposable graphs. Electronic Journal of Statistics. 2010;4:1470–1475.
53. Wang H, Li SZ, et al. Efficient Gaussian graphical model determination under G-Wishart prior distributions. Electronic Journal of Statistics. 2012;6:168–198.
54. Wong F, Carter CK, Kohn R. Efficient estimation of covariance selection models. Biometrika. 2003;90(4):809–830.
55. Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94(1):19–35.
