Biometrics. 2019 May 6;75(4):1288–1298. doi: 10.1111/biom.13064

Fast Bayesian inference in large Gaussian graphical models

Gwenaël G. R. Leday and Sylvia Richardson

Abstract

Despite major methodological developments, Bayesian inference in Gaussian graphical models remains challenging in high dimension due to the tremendous size of the model space. This article proposes a method to infer the marginal and conditional independence structures between variables by multiple testing, which bypasses the exploration of the model space. Specifically, we introduce closed‐form Bayes factors under the Gaussian conjugate model to evaluate the null hypotheses of marginal and conditional independence between variables. Their computation for all pairs of variables is shown to be extremely efficient, thereby allowing us to address large problems with thousands of nodes as required by modern applications. Moreover, we derive exact tail probabilities from the null distributions of the Bayes factors. These allow the use of any multiplicity correction procedure to control error rates for incorrect edge inclusion. We demonstrate the proposed approach on various simulated examples as well as on a large gene expression data set from The Cancer Genome Atlas.

Keywords: Bayes factor, correlation, Gaussian graphical model, high‐dimensional data, inverse‐Wishart distribution

1. INTRODUCTION

Identifying the complex relationships between molecular entities is central to the understanding of disease biology. The advent of high-throughput biotechnologies has provided the opportunity to study this interplay and has considerably stimulated research in this direction. Many studies now exploit high-throughput molecular data to describe the functional relationships between molecular entities such as genes, proteins, or metabolites.

Graphical models provide a natural basis for the statistical description and analysis of relationships between variables. In applications, interest often lies in the undirected graph that describes the conditional dependence structure among variables. When the joint distribution of the variables is assumed to be Gaussian, this structure is known to be fully coded in the inverse-covariance matrix $\Omega=\{\omega_{ij}\}$ (Dempster, 1972). Precisely, a pair $(i,j)$ of variables, with $1\le i<j\le p$, is conditionally independent given all the remaining variables when $\omega_{ij}=0$. The present article treats inference of the undirected graph in the context of the Gaussian model when the number of variables $p$ is potentially larger than the sample size $n$.

Despite major methodological developments, Bayesian inference for Gaussian graphical models remains challenging. The standard approach casts the problem as one of model selection and first requires specification of prior distributions over all possible graphical models and their parameter spaces. Such specification is not straightforward, as it is desirable to favor parsimonious models and to address the compatibility of priors across models (Carvalho and Scott, 2009; Consonni and La Rocca, 2012). Next, the inference procedure is hindered by the search over a very high-dimensional model space in which the number of possible graphical models grows superexponentially with the number of variables. Full exploration of the model space is, therefore, only possible when the number of variables is very small (say $p\le 10$). In moderate- and high-dimensional settings, where $p$ is in the tens, hundreds, or thousands, the model space must generally be searched stochastically (Wang and Li, 2012; Mohammadi and Wit, 2015). However, due to the tremendous size of the model space in such settings, it may be difficult (or impossible) to identify with confidence the graphical model that is best supported by the data; indeed, many models may be almost equally supported. Accordingly, it is preferable to account for model uncertainty by performing Bayesian model averaging and to infer the graphical structure by selecting edges with the highest marginal posterior probabilities, for example, by exploiting their connection to a Bayesian version of the false discovery rate (Mitra et al., 2013; Baladandayuthapani et al., 2014; Peterson et al., 2015).

To bypass the difficulties associated with the standard approach, this article proposes an alternative framework based on directly selecting edges by multiple testing of hypotheses about pairwise conditional independence, using closed-form Bayes factors. These are obtained using the conditional approach of Dickey (1971), in which the prior under the null hypothesis is derived from that under the alternative by conditioning on the null hypothesis. This approach was also adopted by Giudici (1995) to derive a closed-form Bayes factor for conditional independence. However, the latter relies on elements of the inverse of the sample covariance matrix, which is singular when the number of variables is large relative to the sample size. We circumvent this issue and introduce new closed-form Bayes factors for marginal and conditional independence that are suitable in such settings. Moreover, we show the consistency of the Bayes factors and derive exact tail probabilities from their null distributions to help address the multiplicity problem and control error rates for incorrect edge inclusion. The proposed procedure, available via the R package beam on the CRAN website, is shown to be computationally very efficient, addressing problems with thousands of nodes in just a few seconds.
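As a quick orientation, a minimal usage sketch of the package is given below; the function and argument names reflect our reading of the beam manual at the time of writing and should be verified against the current documentation.

```r
# Hypothetical usage sketch of the beam package (check the package manual
# for the exact function signatures and defaults); Y is an n x p data matrix.
library(beam)
fit <- beam(X = Y, type = "conditional")               # Bayes factors and tail probabilities
sel <- beam.select(fit, thres = 0.1, method = "holm")  # multiplicity-adjusted edge selection
```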

The next section introduces notation and the Gaussian conjugate (GC) model. Section 3 presents a closed-form Bayes factor to evaluate the null hypothesis of conditional independence between any two variables and studies its consistency (all results about marginal independence are provided in Appendix S2). Section 4 details graph inference and discusses the multiple testing problem and error control. The performance of the proposed approach is compared with Bayesian and non-Bayesian methods on simulated data in Section 5. Section 6 illustrates our method on a large gene expression data set from The Cancer Genome Atlas.

2. BACKGROUND

2.1. Notation

We write $x\mid\mu,\Sigma\sim\mathcal{N}_p(\mu,\Sigma)$ to indicate that $x\in\mathbb{R}^p$ has a multivariate normal distribution with mean $\mu$ and positive-definite covariance matrix $\Sigma$; $\Omega\mid A,\alpha\sim\mathcal{IW}_d(A,\alpha)$ to indicate that $\Omega$ has an inverse-Wishart distribution with scale matrix $A$ and degrees of freedom $\alpha>d+1$; and $\beta\sim\beta(b_1,b_2)$ to indicate that $\beta$ has a beta distribution with shape parameters $b_1$ and $b_2$. $\Gamma_d(x)$ is the $d$-dimensional gamma function, the operator $\operatorname{vec}$ denotes the linear transformation that stacks the columns of a matrix into a vector, and $\otimes$ denotes the Kronecker product. We use the subscripts $aa$, $bb$, $ab$, and $ba$ to refer to the submatrices $\Sigma_{aa}$, $\Sigma_{bb}$, $\Sigma_{ab}$, and $\Sigma_{ba}$ of a $p\times p$ symmetric matrix $\Sigma$ whose block-wise decomposition is implied by a partition of its rows and columns into two disjoint subsets indexed by $a\subset\{1,\dots,p\}$ and $b=\{1,\dots,p\}\setminus a$.

2.2. The GC model

Given an $n\times p$ observation matrix $Y=(Y_1,\dots,Y_p)$, the GC model is defined by

$$\operatorname{vec}(Y)\mid\Sigma \sim \mathcal{N}_{np}(0,\ \Sigma\otimes I_n),\qquad \Sigma\mid D,\delta \sim \mathcal{IW}_p\bigl((\delta-p-1)D,\ \delta\bigr), \tag{1}$$

with $D$ positive definite, $I_n$ the $n$-dimensional identity matrix, and $\delta>p+1$. Here, the covariance matrix with Kronecker product structure makes explicit the assumption of independence for the rows of $Y$ and the dependence of its columns via the covariance $\Sigma$.

Due to conjugacy, model (1) offers closed-form Bayesian estimators of the covariance matrix $\Sigma$ and its inverse $\Omega=\Sigma^{-1}$. The posterior expectation of $\Sigma$ is

$$E(\Sigma\mid Y)=\frac{(\delta-p-1)D+S}{\delta+n-p-1}, \tag{2}$$

where $S=Y^TY$, and that of its inverse is

$$E(\Omega\mid Y)=(\delta+n)\{(\delta-p-1)D+S\}^{-1}. \tag{3}$$

It is important to note that estimator (2) is a linear shrinkage estimator: a convex combination of the maximum likelihood estimator $\hat\Sigma_{\mathrm{mle}}=n^{-1}S$ of $\Sigma$ and $E(\Sigma)=D$, with weight $\alpha=(\delta-p-1)/(\delta+n-p-1)\in(0,1)$ (Chen, 1979; Hannart and Naveau, 2014). Likewise, estimator (3) is recognized as a ridge-type estimator of the precision matrix (Kubokawa and Srivastava, 2008; van Wieringen and Peeters, 2016). The next proposition presents some properties of these two estimators. All proofs are presented in Appendix S4.
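For concreteness, a minimal R sketch of these two posterior expectations follows; the function name and interface are ours, not the beam package API.

```r
# Closed-form posterior expectations (2) and (3) of the GC model.
# Illustrative sketch; names and interface are not the beam package API.
gc_estimators <- function(Y, delta, D = diag(ncol(Y))) {
  n <- nrow(Y); p <- ncol(Y)
  stopifnot(delta > p + 1)
  S <- crossprod(Y)                              # S = Y'Y
  M <- (delta - p - 1) * D + S
  list(Sigma = M / (delta + n - p - 1),          # E(Sigma | Y), Equation (2)
       Omega = (delta + n) * chol2inv(chol(M)))  # E(Omega | Y), Equation (3)
}
```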

Proposition 1

Let estimators (2) and (3) depend on $\delta$ with $D$, $n$, and $p$ fixed, and denote them by $\hat\Sigma_\delta$ and $\hat\Omega_\delta$, respectively. Then the following properties hold:

  • (1) $\lim_{\delta\to\infty}\hat\Sigma_\delta=D$;

  • (2) $\lim_{\delta\to\infty}\hat\Omega_\delta=D^{-1}$;

  • (3) $\lim_{\delta\to p+1}\hat\Sigma_\delta=\hat\Sigma_{\mathrm{mle}}$;

  • (4) $\lim_{\delta\to p+1}\hat\Omega_\delta=\{(n+p+1)/n\}\,\hat\Sigma_{\mathrm{mle}}^{-1}$, if $n>p$;

  • (5) $\hat\Sigma_\delta$ and $\hat\Omega_\delta$ are positive definite.

Additionally, the asymptotic properties of estimators (2) and (3) as $n\to\infty$ are the same as those of the maximum likelihood estimators $\hat\Sigma_{\mathrm{mle}}$ and $\hat\Sigma_{\mathrm{mle}}^{-1}$ of $\Sigma$ and $\Omega$. Proposition 2 summarizes.

Proposition 2

Let estimators (2) and (3) depend on $n$ with $D$, $\delta$, and $p$ fixed, and denote them by $\hat\Sigma_n$ and $\hat\Omega_n$, respectively. Then the following properties hold:

  • (1) $\lim_{n\to\infty}\hat\Sigma_n=\hat\Sigma_{\mathrm{mle}}$;

  • (2) $\lim_{n\to\infty}\hat\Omega_n=\hat\Sigma_{\mathrm{mle}}^{-1}$.

2.3. Choice of hyperparameters

In model (1), the prior matrix $D$ represents the prior expectation of $\Sigma$. It may also be interpreted as the shrinkage target toward which the maximum likelihood estimator of the covariance matrix is shrunk, since the posterior expectation of $\Sigma$ is a linear shrinkage estimator. For these reasons, $D$ can be chosen to encourage estimator (2) to have specific structures (eg, autoregressive or low rank). Ideally, in such cases the matrix $D$ should be parameterized by a low-dimensional vector of hyperparameters that are interpretable and for which prior knowledge exists. As this knowledge is often absent, it is common to choose $D=I_p$. Throughout this paper, we use $D=I_p$ and standardize the $n\times p$ observation matrix $Y$ so that, for $1\le j\le p$, $Y_j^T1_n=0$ and $Y_j^TY_j/n=1$, where $1_n$ is an $n\times 1$ vector whose elements are all equal to 1.
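In R, this standardization amounts to the following (a sketch; note that scale() divides by the sample standard deviation, hence the correction factor):

```r
# Center each column and rescale so that Y_j' 1_n = 0 and Y_j' Y_j / n = 1.
# scale() divides by the sample SD (denominator n - 1), hence the factor.
n <- nrow(Y)
Y <- scale(Y) * sqrt(n / (n - 1))
```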

The other hyperparameter, $\delta$, clearly acts as a regularization parameter (see Equations (2) and (3)), so its value must be chosen carefully. Following Chen (1979) and Hannart and Naveau (2014), we use empirical Bayes and estimate $\delta$ by the value $\hat\delta$ maximizing the marginal (or integrated) likelihood of the model (see Appendix S2). We refer the reader to Hannart and Naveau (2014, Section 2.3) for the proof that the asymptotic properties of estimators (2) and (3) (Proposition 1) hold when $\delta=\hat\delta$.
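The sketch below illustrates this empirical-Bayes step for $D=I_p$. It uses the textbook log-marginal likelihood of the Gaussian/inverse-Wishart model (1), which we believe matches the expression in Appendix S2; the search interval is an arbitrary choice of ours.

```r
# Empirical-Bayes estimation of delta under D = I_p (sketch). Uses the
# standard log-marginal likelihood of model (1) with F = (delta - p - 1) I_p;
# Y is assumed standardized as above.
log_ml <- function(delta, Y) {
  n <- nrow(Y); p <- ncol(Y)
  S <- crossprod(Y)
  lgammap <- function(a)                        # log of the p-dimensional gamma function
    p * (p - 1) / 4 * log(pi) + sum(lgamma(a + (1 - seq_len(p)) / 2))
  ldet <- as.numeric(determinant(diag(delta - p - 1, p) + S)$modulus)  # log|F + S|
  -n * p / 2 * log(pi) + lgammap((delta + n) / 2) - lgammap(delta / 2) +
    delta * p / 2 * log(delta - p - 1) - (delta + n) / 2 * ldet
}
# The upper bound of the search interval is an arbitrary choice.
p <- ncol(Y)
delta_hat <- optimize(log_ml, interval = c(p + 2, p + 1e4), Y = Y, maximum = TRUE)$maximum
```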

3. BAYES FACTORS

3.1. Bayes factor for conditional independence

In this section we derive an analytic expression for the Bayes factor evaluating the null hypothesis of conditional independence between two variables in the context of model (1). For ease of notation, we define $F=(\delta-p-1)D$ and $T=F+S$. We wish to evaluate the null hypothesis of conditional independence, denoted $H_{0,ij}$, between two coordinates $i$ and $j$, $1\le i<j\le p$. We test $H_{0,ij}:\omega_{ij}=0$ against the alternative hypothesis $H_{1,ij}:\omega_{ij}\ne 0$, where $\omega_{ij}$ is the $(i,j)$th element of $\Omega$. The Bayes factor evaluating evidence in favor of $H_{1,ij}$ is

$$\mathrm{BF}_{ij}=\frac{\int p_1(Y\mid\Sigma)\,p_1(\Sigma)\,d\Sigma}{\int p_0(Y\mid\Sigma^0)\,p_0(\Sigma^0)\,d\Sigma^0}, \tag{4}$$

where, by definition, $\Sigma^0$ is such that $\omega_{ij}=0$.

Giudici (1995) showed that (4) can be obtained in closed form by reparameterizing the GC model and defining a compatible prior under the null hypothesis using the approach of Dickey (1971). However, the resulting Bayes factor does not exist in high-dimensional settings where $p>n$ because it depends on elements of $S^{-1}$. This problem is here circumvented by factorizing the joint likelihood of the observed data as $p(Y\mid\Sigma)=p(Y_b\mid\Sigma_{bb})\,p(Y_a\mid Y_b,B_{ab},\Sigma_{aa.b})$, the product of a marginal and a conditional likelihood. This factorization arises from the partition of $Y=[Y_a,Y_b]$ into two disjoint subsets indexed by $a=\{i,j\}$ and $b=V\setminus a$. The quantity $B_{ab}=\Sigma_{bb}^{-1}\Sigma_{ba}$ represents the matrix of regression coefficients obtained when regressing the variables indexed by $a$ onto the variables indexed by $b$, whereas $\Sigma_{aa.b}=\Sigma_{aa}-\Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}$ denotes the residual covariance matrix.

The factorization of the likelihood conveniently simplifies (4). Using the change of variables from $(\Sigma_{aa},\Sigma_{ab},\Sigma_{bb})$ to $(\Sigma_{aa.b},B_{ab},\Sigma_{bb})$, together with the fact that $\Sigma_{bb}$ is independent of $(B_{ab},\Sigma_{aa.b})$, most nuisance parameters are integrated out and (4) becomes

$$\mathrm{BF}_{ij}=\frac{\int p_1(Y_a\mid Y_b,B_{ab},\Sigma_{aa.b})\,p_1(B_{ab},\Sigma_{aa.b})\,dB_{ab}\,d\Sigma_{aa.b}}{\int p_0(Y_a\mid Y_b,B_{ab},\Sigma^0_{aa.b})\,p_0(B_{ab},\Sigma^0_{aa.b})\,dB_{ab}\,d\Sigma^0_{aa.b}}. \tag{5}$$

Note that, by the standard properties of the multivariate normal and inverse-Wishart distributions (Gupta and Nagar, 2000, Theorems 2.3.12 and 3.3.9), the densities under the alternative model are

$$\begin{aligned}\operatorname{vec}(Y_a)\mid Y_b,B_{ab},\Sigma_{aa.b} &\sim \mathcal{N}_{n\times 2}\bigl(\operatorname{vec}(Y_bB_{ab}),\ \Sigma_{aa.b}\otimes I_n\bigr),\\ \operatorname{vec}(B_{ab})\mid\Sigma_{aa.b} &\sim \mathcal{N}_{(p-2)\times 2}\bigl(\operatorname{vec}(\bar F_{ab}),\ \Sigma_{aa.b}\otimes F_{bb}^{-1}\bigr),\\ \Sigma_{aa.b} &\sim \mathcal{IW}_2(F_{aa.b},\ \delta),\end{aligned} \tag{6}$$

where $\bar F_{ab}=F_{bb}^{-1}F_{ba}$ and $F_{aa.b}=F_{aa}-F_{ab}F_{bb}^{-1}F_{ba}$. Therefore, the simplification of Bayes factor (4) intuitively tells us that evaluating the conditional independence between any two coordinates within the $p$-dimensional GC model (1) is equivalent to evaluating the diagonality of the residual covariance matrix in a bivariate regression model.

To obtain (5) in closed form we define, similar to Giudici (1995), a compatible prior for $(B_{ab},\Sigma_{aa.b})$ under the null hypothesis $H_{0,ij}$ using the conditional approach of Dickey (1971). Precisely, the prior density under $H_{0,ij}$ is derived from that under $H_{1,ij}$ by conditioning on $H_{0,ij}$. The densities under the null model are therefore

$$\begin{aligned}\operatorname{vec}(Y_a)\mid Y_b,B_{ab},\Sigma^0_{aa.b} &\sim \mathcal{N}_{n\times 2}\bigl(\operatorname{vec}(Y_bB_{ab}),\ \Sigma^0_{aa.b}\otimes I_n\bigr),\\ \operatorname{vec}(B_{ab})\mid\Sigma^0_{aa.b} &\sim \mathcal{N}_{(p-2)\times 2}\bigl(\operatorname{vec}(\bar F_{ab}),\ \Sigma^0_{aa.b}\otimes F_{bb}^{-1}\bigr),\\ p_0(B_{ab},\Sigma^0_{aa.b}) &= p_1(B_{ab},\Sigma_{aa.b}\mid H_{0,ij})=\frac{p_1(B_{ab},\Sigma_{aa.b},H_{0,ij})}{\int p_1(B_{ab},\Sigma_{aa.b},H_{0,ij})\,dB_{ab}\,d\Sigma_{aa.b}},\end{aligned} \tag{7}$$

where $\Sigma^0_{aa.b}$ is such that $\omega_{ij}=0$.

We now state the main result of this section.

Lemma 1

Assume (5) holds with densities defined by (6) and (7). Then the Bayes factor in favor of $H_{1,ij}$ is

$$\mathrm{BF}_{ij}=k_1(\delta,n)\,\frac{\bigl(1-r_{g,ij}^2\bigr)^{\delta/2}}{\bigl(1-r_{q,ij}^2\bigr)^{(\delta+n)/2}}\left(\frac{g_{ii}\,g_{jj}}{q_{ii}\,q_{jj}}\right)^{1/2},$$

with

$$k_1(\delta,n)=\frac{\Gamma((\delta+n)/2)\,\Gamma((\delta+n-1)/2)\,\Gamma^2((\delta+1)/2)}{\Gamma(\delta/2)\,\Gamma((\delta-1)/2)\,\Gamma^2((\delta+n+1)/2)},$$

$$r_{g,ij}=\frac{g_{ij}}{\sqrt{g_{ii}g_{jj}}},\qquad r_{q,ij}=\frac{q_{ij}}{\sqrt{q_{ii}q_{jj}}},\qquad F_{aa.b}=\begin{bmatrix}g_{ii}&g_{ij}\\ g_{ij}&g_{jj}\end{bmatrix},\qquad T_{aa.b}=\begin{bmatrix}q_{ii}&q_{ij}\\ q_{ij}&q_{jj}\end{bmatrix}.$$

Remark 1

In Lemma 1, the quantities $g_{ii}$ and $q_{ii}$ (resp., $g_{jj}$ and $q_{jj}$) can be thought of as prior and posterior partial variances for coordinate $i$ (resp., $j$), whereas $r_{g,ij}$ and $r_{q,ij}$ can be thought of as prior and posterior partial correlations.

Remark 2

The Bayes factor proposed by Giudici (1995, Lemma 3), in contrast to Lemma 1, defines the quantities $g_{ij}$ and $q_{ij}$ such that $F_{aa.b}=\{g_{ij}\}$ and $F_{aa.b}+S_{aa.b}=\{q_{ij}\}$, with $S_{aa.b}=S_{aa}-S_{ab}S_{bb}^{-1}S_{ba}$. Note that $S_{aa.b}$ only exists when $S_{bb}$ is invertible (ie, when $n$ is large relative to $p$), whereas $T_{aa.b}=T_{aa}-T_{ab}T_{bb}^{-1}T_{ba}$ defined in Lemma 1 exists even when $p>n$ because $T$ is always positive definite (a consequence of Proposition 1).

Remark 3

Standard matrix algebra (Gupta and Nagar, 2000, Theorem 1.2.3.(v)) tells us that $F_{aa.b}=\{(F^{-1})_{aa}\}^{-1}$ and $T_{aa.b}=\{(T^{-1})_{aa}\}^{-1}$. This means that the elements of the $2\times 2$ matrices $F_{aa.b}$ and $T_{aa.b}$ can, respectively, be obtained from the elements of $F^{-1}$ and $T^{-1}$. The computation of the Bayes factor in Lemma 1 for all pairs of variables $(i,j)$ hence boils down to computing $F^{-1}$ and $T^{-1}$.
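The following R sketch makes this concrete for the special case $D=I_p$, in which $F=cI_p$ with $c=\delta-p-1$, so that $F_{aa.b}=cI_2$ and the prior partial correlations $r_{g,ij}$ vanish. The code is illustrative and is not the beam package implementation.

```r
# Log-Bayes factors of Lemma 1 for all pairs at once, assuming D = I_p.
# Only T^{-1} is needed (see the remark above); each (T^{-1})_aa is
# inverted in closed form as a 2x2 matrix.
log_bf_all_pairs <- function(Y, delta) {
  n <- nrow(Y); p <- ncol(Y); cst <- delta - p - 1
  stopifnot(cst > 0)
  Y <- scale(Y) * sqrt(n / (n - 1))                    # standardization of Section 2.3
  Tinv <- chol2inv(chol(diag(cst, p) + crossprod(Y)))  # T = F + S is positive definite
  A <- diag(Tinv)
  # For a = {i, j}: T_aa.b = {(T^{-1})_aa}^{-1}, so
  # q_ii q_jj = Tinv_ii Tinv_jj / det^2, det = Tinv_ii Tinv_jj - Tinv_ij^2,
  # and r_q,ij^2 = Tinv_ij^2 / (Tinv_ii Tinv_jj).
  det2 <- tcrossprod(A) - Tinv^2
  rq2  <- Tinv^2 / tcrossprod(A)
  lk1  <- lgamma((delta + n) / 2) + lgamma((delta + n - 1) / 2) +
          2 * lgamma((delta + 1) / 2) - lgamma(delta / 2) -
          lgamma((delta - 1) / 2) - 2 * lgamma((delta + n + 1) / 2)
  # (g_ii g_jj / (q_ii q_jj))^{1/2} on the log scale, with g_ii = g_jj = cst:
  lscale <- 0.5 * (2 * log(cst) - log(tcrossprod(A)) + 2 * log(det2))
  lbf <- lk1 - (delta + n) / 2 * log1p(-rq2) + lscale
  lbf[upper.tri(lbf)]                                  # one value per pair (i, j), i < j
}
```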

3.2. Consistency

In this section, we consider the selection consistency of the Bayes factor defined in Lemma 1. A Bayes factor is said to be consistent when $\lim_{n\to\infty}\mathrm{BF}_{ij}=0$ if $H_{0,ij}$ is true and $\lim_{n\to\infty}\mathrm{BF}_{ij}=\infty$ if $H_{1,ij}$ is true (Wang and Maruyama, 2016). In other words, the consistency property means that the true hypothesis will be selected when enough data are provided. We now state the following result.

Lemma 2

If the sample correlation matrix has a positive-definite limit as $n\to\infty$, then the Bayes factor $\mathrm{BF}_{ij}$ is consistent in selection.

4. GRAPH STRUCTURE RECOVERY

4.1. Inference by multiple testing

We propose to infer the conditional independence graph by multiple testing of hypotheses using the Bayes factor introduced in the previous section. Precisely, we propose to infer the edge set $E_U=\{(i,j):\omega_{ij}\ne 0\}$ of the undirected graph $U=(V,E_U)$ with vertex set $V$ by evaluating $H_{0,ij}:\omega_{ij}=0$ (absence of an edge) vs. $H_{1,ij}:\omega_{ij}\ne 0$ (presence of an edge) separately for each pair $(i,j)$ of variables.

On the whole, the multiple testing approach consists in translating the pattern of rejected hypotheses into a graph. The approach is justified by the fact that, for the undirected graph, the conditioning sets in the pairwise independence statements do not depend on the structure of the graph (Drton and Perlman, 2007). This means that these statements can be evaluated individually by hypothesis testing. Here, these tests are carried out separately using model (1), which encodes the complete undirected graph where no independence structure is imposed.

4.2. Scaled Bayes factors

To infer the graph structure it is necessary to compare Bayes factors across all $p(p-1)/2$ pairs of variables. However, the Bayes factor defined in Lemma 1 is not scale-invariant (due to its last term) and, hence, not comparable between different pairs of variables. In light of this, we define a scaled version of this Bayes factor that can more appropriately rank the edges of graph $U$. Corollary 1 summarizes.

Corollary 1

The scaled Bayes factor in favor of $H_{1,ij}$ is

$$\mathrm{sBF}_{ij}=k_1(\delta,n)\,\frac{\bigl(1-r_{g,ij}^2\bigr)^{\delta/2}}{\bigl(1-r_{q,ij}^2\bigr)^{(\delta+n)/2}},$$

with quantities defined as in Lemma 1.

Remark 4

When the prior matrix $D=I_p$ (absence of prior knowledge), then $r_{g,ij}=0$ and the ordering provided by the scaled Bayes factor in Corollary 1 over all pairs $(i,j)$ is identical to the ordering provided by the square of the posterior partial correlation $r_{q,ij}$. This means that the graph selected when using a thresholding rule on the Bayes factors is the same as that obtained using the equivalent thresholding rule on the posterior partial correlations.

4.3. Multiplicity adjustment and error control

To address the multiplicity problem, we propose to use the tail (or error) probability associated with the null distribution of each scaled Bayes factor. The tail probability is closely related to the notion of a $P$-value: the Bayes factor is treated as a random variable and its distribution, induced by that of the random data, is used to make a probability statement about its observed value. Then, to recover the structure of the graph, the tail probabilities obtained from all $p(p-1)/2$ comparisons are adjusted using standard multiplicity correction procedures to control, say, the family-wise error or false discovery rates (Goeman and Solari, 2014).

In the following, we study the conditional null distribution of the Bayes factor defined in Corollary 1. The conditional null distribution here refers to the distribution that would be obtained by shuffling or permuting the labels of the observations (Jiang et al., 2017). Under this scheme, we define $\Pr(\mathrm{sBF}_{ij}>b)$ as the probability of observing a value of the scaled Bayes factor larger than $b$. Next, we show that this tail probability can be obtained analytically, without the need for a permutation algorithm, thus providing a computational advantage. First, we state three results that will be used in our argument.

Proposition 3

Suppose $\Phi\sim\mathcal{W}_2(\Sigma,d)$, where

$$\Phi=\begin{pmatrix}\phi_1^2 & \phi_1\phi_2\varphi\\ \phi_1\phi_2\varphi & \phi_2^2\end{pmatrix}\qquad\text{and}\qquad \Sigma=\begin{pmatrix}\sigma_1^2 & \sigma_1\sigma_2\rho\\ \sigma_1\sigma_2\rho & \sigma_2^2\end{pmatrix}$$

are parametrized in terms of their correlations $-1\le\varphi\le 1$ and $-1\le\rho\le 1$. Then,

$$\bigl(\varphi^2\mid\rho=0\bigr)\sim\beta\!\left(\tfrac{1}{2},\ \tfrac{d-1}{2}\right).$$

Proposition 4

The following equality holds:

$$Y_a^TY_a-\bar B_{ab}^T\bigl(Y_b^TY_b+F_{bb}\bigr)\bar B_{ab}+F_{ab}F_{bb}^{-1}F_{ba}=\bigl(Y_a-Y_b\bar F_{ab}\bigr)^T\bigl(I_n+Y_bF_{bb}^{-1}Y_b^T\bigr)^{-1}\bigl(Y_a-Y_b\bar F_{ab}\bigr),$$

where $\bar B_{ab}=(Y_b^TY_b+F_{bb})^{-1}(Y_b^TY_a+F_{ba})$.

Proposition 5

Let $\Sigma_{aa.b}$ be fixed. Then, according to model (6), we have

$$\bigl(Y_a-Y_b\bar F_{ab}\bigr)^T\bigl(I_n+Y_bF_{bb}^{-1}Y_b^T\bigr)^{-1}\bigl(Y_a-Y_b\bar F_{ab}\bigr)\sim\mathcal{W}_2(\Sigma_{aa.b},\ n).$$

The only term of the Bayes factor that depends on the data is $r_{q,ij}=q_{ij}(q_{ii}q_{jj})^{-1/2}$, where, we recall, $q_{ij}$ is such that $T_{aa.b}=\{q_{ij}\}$. Proposition 4 shows that $T_{aa.b}=F_{aa.b}+Z$, with $Z=(Y_a-Y_b\bar F_{ab})^T(I_n+Y_bF_{bb}^{-1}Y_b^T)^{-1}(Y_a-Y_b\bar F_{ab})$. Hence,

$$r_{q,ij}=\frac{(g_{ii}g_{jj})^{1/2}\,r_{g,ij}+(z_{ii}z_{jj})^{1/2}\,r_{z,ij}}{(g_{ii}+z_{ii})^{1/2}(g_{jj}+z_{jj})^{1/2}},$$

where $Z=\{z_{ij}\}$ and $r_{z,ij}=z_{ij}(z_{ii}z_{jj})^{-1/2}$. This means that $\Pr\{\mathrm{sBF}_{ij}>b\}=\Pr\{r_{z,ij}^2>c\}$, where $c$ is a quantity that depends on $\{\delta,n,g_{ii},g_{jj},r_{g,ij},z_{ii},z_{jj}\}$. Propositions 3 and 5 imply that $Z\sim\mathcal{W}_2(\Sigma_{aa.b},n)$ and $r_{z,ij}^2\mid H_{0,ij}\sim\beta\{1/2,(n-1)/2\}$. Therefore, the tail probability of the Bayes factor can be computed exactly using the $\beta\{1/2,(n-1)/2\}$ distribution. We remark that this definition of the type I error conditions on $\{\delta,n,g_{ii},g_{jj},r_{g,ij},z_{ii},z_{jj}\}$.
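Under $D=I_p$, these tail probabilities can be computed for all pairs at once, as in the sketch below (again ours, not the beam implementation); standard corrections can then be applied with p.adjust.

```r
# Exact tail probabilities under H0, assuming D = I_p so that F_aa.b = c I_2
# and Z = T_aa.b - c I_2; the observed r_z,ij^2 is referred to its
# Beta(1/2, (n-1)/2) null distribution.
tail_probs <- function(Y, delta) {
  n <- nrow(Y); p <- ncol(Y); cst <- delta - p - 1
  Y <- scale(Y) * sqrt(n / (n - 1))
  Tinv <- chol2inv(chol(diag(cst, p) + crossprod(Y)))
  A <- diag(Tinv)
  det2 <- tcrossprod(A) - Tinv^2                # determinant of (T^{-1})_aa per pair
  qii  <- matrix(A, p, p, byrow = TRUE) / det2  # q_ii uses the other diagonal entry
  qjj  <- matrix(A, p, p) / det2
  qij  <- -Tinv / det2
  zii <- qii - cst; zjj <- qjj - cst            # Z = T_aa.b - F_aa.b
  rz2 <- qij^2 / (zii * zjj)                    # z_ij = q_ij because g_ij = 0
  pmat <- pbeta(rz2, 1 / 2, (n - 1) / 2, lower.tail = FALSE)
  pmat[upper.tri(pmat)]                         # one tail probability per pair
}
# Multiplicity adjustment across all p(p-1)/2 tests, eg Benjamini-Hochberg:
# p.adjust(tail_probs(Y, delta), method = "BH")
```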

5. NUMERICAL EXPERIMENTS

5.1. Comparison to Bayesian methods

In this section, we compare the performance of our approach with that of other Bayesian methods. For computational reasons, we consider a moderate-dimensional problem. We generate 50 datasets of size $n\in\{25,50,100\}$ from a multivariate Gaussian distribution with mean vector 0 and $50\times 50$ inverse-covariance matrix $\Phi$. The matrix $\Phi$ is a sparse matrix which we generate from a G-Wishart distribution with scale matrix equal to the identity and $b=4$ degrees of freedom (using the function bdgraph.sim of the R package BDgraph). Four different graph structures are considered, namely the band, cluster, hub, and random structures, which we illustrate in Figure S1.
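A call of the following form generates one such dataset; the exact graph-structure option names should be checked against the BDgraph manual.

```r
library(BDgraph)
# One simulated dataset: sparse precision matrix drawn from a G-Wishart
# distribution with identity scale and b = 4 degrees of freedom. The
# "cluster" option is shown; other structure names follow the package manual.
sim <- bdgraph.sim(p = 50, n = 100, graph = "cluster", b = 4, D = diag(50))
Y <- sim$data   # n x p Gaussian data matrix
G <- sim$G      # adjacency matrix of the true graph
```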

We compare our method to two sampling‐based approaches based on the birth‐death and reversible jump Markov chain Monte Carlo (MCMC) algorithms, developed by Mohammadi and Wit (2015; 2017), using 100 000 sweeps and a burn‐in period of 50 000 updates. We also consider the method of Schwaller et al. (2017) that offers closed‐form inference within the class of tree‐structured graphical models. For each method we obtain the marginal posterior probabilities of edge inclusion, either via the sampling algorithm or exactly.

To evaluate performance we report the area under the receiver operating characteristic (ROC) curve, which depicts the true positive rate TP/(TP + FN) as a function of the false positive rate FP/(FP + TN), over all possible thresholds on the marginal posterior probabilities of edge inclusion (or tail probabilities in the case of our method). Here, TP, FP, TN, and FN denote the numbers of true positives, false positives, true negatives, and false negatives, respectively. We also report the area under the precision-recall (PR) curve, which depicts the precision TP/(TP + FP) as a function of the true positive rate (also named recall).

Table 1 summarizes the simulation results. It shows that our method performs well compared with other Bayesian methods in recovering the different graph structures. For instance, our method often achieves the largest areas under the ROC and PR curves across graph structures and sample sizes. Moreover, a marked improvement is observed in cases where the sample size is small ($n=25$) with respect to $p$. The results also show nonnegligible differences in performance between the birth-death and reversible-jump MCMC algorithms, which suggests that performance can be affected by the choice of sampling algorithm.

Table 1.

Average and SD (in parentheses) of areas under the ROC and PR curves over the simulated datasets, as a function of the true graph structure and sample size n

                    Band structure            Cluster structure
n    Method         AUC-ROC      AUC-PR       AUC-ROC      AUC-PR
100 beam 0.89 (0.02) 0.65 (0.03) 0.80 (0.02) 0.54 (0.03)
100 bdmcmc 0.89 (0.03) 0.67 (0.03) 0.79 (0.02) 0.51 (0.04)
100 rjmcmc 0.88 (0.03) 0.63 (0.05) 0.78 (0.03) 0.50 (0.04)
100 saturnin 0.89 (0.02) 0.61 (0.04) 0.77 (0.02) 0.53 (0.04)
50 beam 0.84 (0.03) 0.53 (0.04) 0.73 (0.02) 0.39 (0.04)
50 bdmcmc 0.82 (0.03) 0.51 (0.06) 0.72 (0.03) 0.37 (0.04)
50 rjmcmc 0.81 (0.03) 0.47 (0.05) 0.72 (0.02) 0.35 (0.04)
50 saturnin 0.82 (0.02) 0.44 (0.04) 0.68 (0.02) 0.33 (0.04)
25 beam 0.78 (0.04) 0.39 (0.05) 0.66 (0.03) 0.24 (0.04)
25 bdmcmc 0.75 (0.04) 0.32 (0.05) 0.65 (0.03) 0.23 (0.03)
25 rjmcmc 0.75 (0.04) 0.27 (0.05) 0.64 (0.03) 0.22 (0.03)
25 saturnin 0.73 (0.03) 0.28 (0.05) 0.58 (0.02) 0.15 (0.02)
                    Hub structure             Random structure
n    Method         AUC-ROC      AUC-PR       AUC-ROC      AUC-PR
100 beam 0.88 (0.03) 0.62 (0.03) 0.87 (0.03) 0.65 (0.03)
100 bdmcmc 0.89 (0.02) 0.67 (0.04) 0.86 (0.03) 0.66 (0.03)
100 rjmcmc 0.89 (0.02) 0.65 (0.05) 0.85 (0.03) 0.65 (0.04)
100 saturnin 0.92 (0.01) 0.63 (0.02) 0.86 (0.02) 0.59 (0.02)
50 beam 0.84 (0.03) 0.53 (0.03) 0.83 (0.03) 0.56 (0.04)
50 bdmcmc 0.84 (0.03) 0.52 (0.05) 0.81 (0.03) 0.53 (0.05)
50 rjmcmc 0.84 (0.03) 0.48 (0.06) 0.80 (0.03) 0.49 (0.06)
50 saturnin 0.86 (0.02) 0.48 (0.03) 0.83 (0.02) 0.47 (0.03)
25 beam 0.80 (0.03) 0.42 (0.04) 0.79 (0.03) 0.43 (0.05)
25 bdmcmc 0.79 (0.04) 0.32 (0.05) 0.75 (0.02) 0.33 (0.05)
25 rjmcmc 0.77 (0.04) 0.27 (0.04) 0.74 (0.03) 0.30 (0.05)
25 saturnin 0.80 (0.03) 0.35 (0.04) 0.77 (0.02) 0.35 (0.04)

Abbreviations: AUC-ROC, area under the ROC curve; AUC-PR, area under the PR curve; PR, precision-recall; ROC, receiver operating characteristic.

beam, our method; bdmcmc and rjmcmc, methods of Mohammadi and Wit (2015); saturnin, method of Schwaller et al. (2017).

Overall, the simulation results demonstrate that our method can recover various graph structures at least as accurately as other Bayesian approaches, at a very low computational cost (see Figure S2). Our method generally achieves a greater area under the PR curve than the others. The present results also confirm those obtained by Schwaller et al. (2017), namely the relatively good performance of tree-structured graphical models compared with sampling-based approaches, despite stronger restrictions on the class of graphical models. However, the performance of that approach can degrade in some cases (eg, cluster structures).

5.2. Comparison to non‐Bayesian methods

The performance of the proposed method is compared in higher-dimensional settings to non-Bayesian approaches that carry out graphical model selection via multiple testing. We generate 50 datasets of size $n=100$ from a $p$-dimensional Gaussian distribution with mean vector 0 and inverse-covariance matrix $\Psi$. Throughout the simulation, we fix the sample size $n=100$ and vary the dimensionality $p\in\{200,500,1000\}$. We consider four different sparse precision matrices corresponding to different graph structures (similar to those illustrated in Figure S1): (a) $\Psi_p^{\mathrm{band}}$ is a tridiagonal matrix; (b) $\Psi_p^{\mathrm{cluster}}$ is a block-diagonal matrix whose blocks are sparse matrices of size 20 in which off-diagonal entries are nonzero with probability 0.1; (c) $\Psi_p^{\mathrm{hub}}$ is a block-diagonal matrix whose blocks are sparse matrices of size 20 in which only the off-diagonal entries in the first row and column are nonzero; and (d) $\Psi_p^{\mathrm{random}}$ is obtained by randomly permuting the rows and columns of $\Psi_p^{\mathrm{band}}$. For all matrices, nonzero entries are generated independently from a uniform distribution on $[-1,1]$, and positive definiteness is ensured by adding a constant to the diagonal so that the minimum eigenvalue equals 0.1.
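As an illustration, the band structure can be generated as follows (a sketch; the function name is ours):

```r
# Tridiagonal precision matrix Psi_band: off-diagonal entries ~ U[-1, 1],
# diagonal shifted so that the minimum eigenvalue equals 0.1.
make_psi_band <- function(p) {
  Psi <- matrix(0, p, p)
  Psi[cbind(1:(p - 1), 2:p)] <- runif(p - 1, -1, 1)
  Psi <- Psi + t(Psi)
  shift <- 0.1 - min(eigen(Psi, symmetric = TRUE, only.values = TRUE)$values)
  Psi + diag(shift, p)
}
```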

We compare our method to that of Schäfer and Strimmer (2005), which is based on a linear shrinkage estimator of the covariance matrix (Ledoit and Wolf, 2004) and a mixture model for false discovery rate estimation (Strimmer, 2008). We also consider the asymptotic normal thresholding method of Ren et al. (2015). For both methods we obtain $P$-values associated with the estimated partial correlations, whereas for our method we use the tail probabilities associated with the Bayes factor defined in Corollary 1 for all pairs of variables.

As in Section 5.1, we evaluate performance using the areas under the ROC and PR curves.

Table 2 shows that the proposed method performs well in recovering large graphical structures compared with non-Bayesian methods. It achieves areas under the ROC and PR curves comparable to the other methods across problem sizes. In the case of hub structures, however, the proposed method performs better.

Table 2.

Average and SD (in parentheses) of areas under the ROC and PR curves over the simulated datasets, as a function of the true graph structure and dimension p

                    Band structure            Cluster structure
p    Method         AUC-ROC      AUC-PR       AUC-ROC      AUC-PR
200 beam 0.88 (0.01) 0.55 (0.02) 0.91 (0.01) 0.58 (0.01)
200 genenet 0.89 (0.01) 0.57 (0.02) 0.91 (0.01) 0.59 (0.01)
200 fastggm 0.87 (0.01) 0.57 (0.02) 0.89 (0.01) 0.60 (0.02)
500 beam 0.91 (0.01) 0.58 (0.01) 0.89 (0.01) 0.50 (0.01)
500 genenet 0.91 (0.01) 0.60 (0.01) 0.89 (0.01) 0.52 (0.01)
500 fastggm 0.90 (0.01) 0.61 (0.01) 0.85 (0.01) 0.49 (0.01)
1000 beam 0.88 (0.01) 0.49 (0.01) 0.90 (0.00) 0.48 (0.01)
1000 genenet 0.88 (0.01) 0.49 (0.01) 0.90 (0.00) 0.49 (0.01)
1000 fastggm 0.87 (0.01) 0.51 (0.01) 0.87 (0.00) 0.48 (0.01)
                    Hub structure             Random structure
p    Method         AUC-ROC      AUC-PR       AUC-ROC      AUC-PR
200 beam 0.90 (0.01) 0.56 (0.01) 0.86 (0.01) 0.43 (0.02)
200 genenet 0.85 (0.01) 0.21 (0.03) 0.86 (0.01) 0.47 (0.02)
200 fastggm 0.87 (0.01) 0.46 (0.02) 0.85 (0.01) 0.47 (0.02)
500 beam 0.92 (0.01) 0.54 (0.01) 0.82 (0.01) 0.35 (0.01)
500 genenet 0.90 (0.00) 0.43 (0.01) 0.82 (0.01) 0.34 (0.01)
500 fastggm 0.88 (0.01) 0.44 (0.01) 0.81 (0.00) 0.34 (0.01)
1000 beam 0.93 (0.00) 0.54 (0.01) 0.77 (0.00) 0.22 (0.01)
1000 genenet 0.92 (0.00) 0.49 (0.01) 0.77 (0.00) 0.21 (0.01)
1000 fastggm 0.89 (0.00) 0.44 (0.01) 0.77 (0.00) 0.22 (0.01)

Abbreviations: AUC-ROC, area under the ROC curve; AUC-PR, area under the PR curve; PR, precision-recall; ROC, receiver operating characteristic.

beam, our method; genenet, method of Schäfer and Strimmer (2005); fastggm, method of Ren et al. (2015).

Besides recovering the different graphical structures accurately, the proposed method is also the fastest, as Figure 1 shows. When $p=1000$, its average computational time is less than a second, whereas contenders are 5 to 20 times slower.

Figure 1. Running time in seconds (assessed on a 3.40 GHz Intel Core i7-3770 CPU) for each method when $p=1000$.

5.3. Robustness

We here carry out simulations to assess the robustness of the proposed method to model misspecification, as compared with the Bayesian and non-Bayesian contenders of Sections 5.1 and 5.2. We explore three scenarios in which the data are (a) multivariate-$t$ distributed, (b) Gaussian contaminated, and (c) log-Gaussian distributed. Scenarios (a) and (b) are as in Lin et al. (2016), whereas scenario (c) introduces more skewness. For each scenario, we fix $p=50$ and generate 50 datasets of size $n\in\{25,50,100\}$ using the same four graphical structures (and inverse-covariance matrices) considered in Section 5.1.
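The three scenarios can be simulated along the following lines; the degrees of freedom and contamination settings below are placeholders, the exact values being those of Lin et al. (2016).

```r
library(mvtnorm)
# Phi is the true precision matrix; the df and contamination settings are
# placeholders (see Lin et al., 2016, for the exact specifications).
Sigma <- solve(Phi)
Y_t    <- rmvt(n, sigma = Sigma, df = 3)            # (a) multivariate-t
ind    <- rbinom(n, 1, 0.1)                         # (b) contaminated Gaussian rows
Y_cont <- rmvnorm(n, sigma = Sigma) + ind * rmvnorm(n, sigma = 25 * Sigma)
Y_ln   <- exp(rmvnorm(n, sigma = Sigma))            # (c) log-Gaussian
```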

Results are provided in Appendices S6 to S8. ROC and PR curves show that the proposed method is fairly robust to model misspecification. All methods under consideration suffer from model misspecification, as expected; however, the proposed method keeps an edge over its contenders. The results also suggest that the performance of sampling-based Bayesian methods, which explore the model space, is the most affected by model misspecification.

6. GENE NETWORK IN GLIOBLASTOMA MULTIFORME

We illustrate our method on a large gene expression data set on glioblastoma multiforme from The Cancer Genome Atlas. Glioblastoma multiforme is an aggressive form of brain tumor in adults that is associated with poor prognosis. The data comprise measurements (level 3 normalized; Agilent 244K platform) of 14 827 genes on 532 patients. A small subset of these data was analyzed in Leday et al. (2017); here, instead, we characterize globally the conditional independence structure between all 14 827 genes.

Figure 2A displays the log-marginal likelihood of model (1) as a function of the prior parameter $\alpha$ when $D=I_p$. Using the empirical Bayes estimate of $\alpha$, we computed the Bayes factors and their associated tail probabilities for all pairs of variables. These computations took 90 seconds overall on a 3.40 GHz Intel Core i7-3770 CPU without any parallel scheme, which is remarkable for a graph with a total of 109 912 551 possible edges.

Figure 2. A, Log-marginal likelihood of the GC model as a function of $\alpha=(\delta-p-1)/(\delta+n-p-1)$. The vertical and horizontal dotted lines indicate the location of the optimum. B, Degree distribution of the conditional independence graph. GC, Gaussian conjugate.

The conditional independence graph identified by controlling the family-wise error rate at 10% using the conservative Bonferroni procedure consists of 46 071 edges (0.042% of the total number of possible edges). Edge degree varies from 0 to 127, with 9675 genes having nonzero degree. The degree distribution seems to follow an exponential distribution (see Figure 2B), thereby indicating that a relatively small number of genes have a large number of links.

Because it is difficult to visualize the graph in its entirety, we identify groups of densely connected nodes using the algorithm of Blondel et al. (2008) implemented in the R package igraph (Csardi and Nepusz, 2006). The algorithm identifies a partition that yields an overall modularity score of 0.91. The modularity score measures the quality of a division of a graph into subgraphs; its maximal value being 1, the identified partition presents a high modularity and suggests the presence of densely interconnected groups of nodes in the conditional independence graph. To illustrate this, we report in Figure 3 a subgraph identified by the clustering algorithm that corresponds to the HOXA gene family. The HOX gene family is known to be involved in the development of human cancers (Bhatlekar et al., 2014), including glioblastoma. The HOXA13 gene, for instance, has been advanced as a potential diagnostic marker for glioblastoma (Duan et al., 2015), and the role of the HOXA9 gene in cell proliferation, apoptosis, and drug resistance is under active research (Costa et al., 2010; Gonçalves et al., 2016).
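This step uses standard igraph functionality; a sketch, with A denoting the 0/1 adjacency matrix of the selected graph, is:

```r
library(igraph)
# Community detection with the Louvain algorithm of Blondel et al. (2008);
# A is the 0/1 adjacency matrix of the inferred conditional independence graph.
g <- graph_from_adjacency_matrix(A, mode = "undirected", diag = FALSE)
communities <- cluster_louvain(g)
modularity(communities)   # quality of the partition (0.91 in our analysis)
```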

Figure 3. Example of a densely connected gene subgraph identified by the clustering algorithm of Blondel et al. (2008).

7. DISCUSSION

This article introduced a Bayesian method to infer the conditional (and marginal) independence structure between variables by multiple testing, which bypasses the exploration of the model space and can easily tackle very large problems with thousands of variables. In extensive simulations, the proposed method was shown to perform at least as well as Bayesian and non-Bayesian contenders while being orders of magnitude faster. The method was illustrated on a large gene expression data set comprising 14 827 genes.

The proposed method has the advantage of being extremely fast and providing explicit control of the type I error. Moreover, it facilitates the incorporation of different types of prior information, which is more difficult in a non-Bayesian setting. For example, the proposed method can incorporate prior marginal and partial correlations via the hyperparameter $D$, prior probabilities or odds via the Bayes factors, as well as prior group information (eg, pathways) via the multiple testing procedure (Ramdas et al., 2018).

The main limitation of the proposed method relates to estimation. The proposed approach is based on a simple linear shrinkage estimator that does not perform as well as sparse estimators in sparse settings, unless prior knowledge is used (see Appendix S9). Moreover, the multiple testing procedure identifies the most important edges but does not necessarily yield a graphical model that fits the data well (Drton and Perlman, 2007), because the emphasis is on type I error control rather than goodness of fit.

We foresee several promising extensions of the proposed approach. The Bayes factors proposed in this paper can be used for differential network analysis, in which the goal is to identify edges that are common or specific to predefined groups of samples. Provided that samples are independent across groups, the Bayes factors can simply be multiplied across groups so as to obtain new Bayes factors that provide evidence toward the presence or absence of a common edge. Being symmetric, the Bayes factors can also be inverted before being multiplied, so as to evaluate more complex hypotheses, for example, edge losses or gains in a two-group comparison, as sketched below. Last, it would be interesting to derive the Bayes factors in a regression framework so as to compare them with those of Zhou and Guan (2018).
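For instance, with two independent groups and group-specific Bayes factors $\mathrm{BF}^{(1)}_{ij}$ and $\mathrm{BF}^{(2)}_{ij}$, evidence for an edge present in both groups, or present in group 1 but absent in group 2, could be assessed by (our notation)

$$\mathrm{BF}^{\mathrm{common}}_{ij}=\mathrm{BF}^{(1)}_{ij}\times\mathrm{BF}^{(2)}_{ij},\qquad \mathrm{BF}^{\mathrm{diff}}_{ij}=\mathrm{BF}^{(1)}_{ij}\times\bigl(\mathrm{BF}^{(2)}_{ij}\bigr)^{-1}.$$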

ACKNOWLEDGMENTS

This research was supported by the Medical Research Council grant number MR/M004421 and core funding number MRC_MC_UP_0801/1. The authors wish to thank Ilaria Speranza for helpful comments on the manuscript and for greatly improving the software. The authors also wish to thank Catalina Vallejos and Leonardo Bottolo for helpful discussions.

Supporting information

Supplementary Information


Leday GGR, Richardson S. Fast Bayesian inference in large Gaussian graphical models. Biometrics. 2019;75:1288–1298. 10.1111/biom.13064

REFERENCES

  1. Baladandayuthapani, V., Talluri, R., Ji, Y., Coombes, K.R., Lu, Y., Hennessy, B.T., Davies, M.A. and Mallick, B.K. (2014). Bayesian sparse graphical models for classification with application to protein expression data. The Annals of Applied Statistics, 8, 1443–1468.
  2. Bhatlekar, S., Fields, J.Z. and Boman, B.M. (2014). HOX genes and their role in the development of human cancers. Journal of Molecular Medicine, 92, 811–823.
  3. Blondel, V.D., Guillaume, J.-L., Lambiotte, R. and Lefebvre, E. (2008). Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008, P10008.
  4. Carvalho, C.M. and Scott, J.G. (2009). Objective Bayesian model selection in Gaussian graphical models. Biometrika, 96, 497–512.
  5. Chen, C.F. (1979). Bayesian inference for a normal dispersion matrix and its application to stochastic multiple regression analysis. Journal of the Royal Statistical Society: Series B, 41, 235–248.
  6. Consonni, G. and La Rocca, L. (2012). Objective Bayes factors for Gaussian directed acyclic graphical models. Scandinavian Journal of Statistics, 39, 743–756.
  7. Costa, B.M., Smith, J.S., Chen, Y., Chen, J., Phillips, H.S., Aldape, K.D., Zardo, G., Nigro, J., James, C.D., Fridlyand, J. and Reis, R.M. (2010). Reversing HOXA9 oncogene activation by PI3K inhibition: epigenetic mechanism and prognostic significance in human glioblastoma. Cancer Research, 70, 453–462.
  8. Csardi, G. and Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695.
  9. Dempster, A.P. (1972). Covariance selection. Biometrics, 28, 157–175.
  10. Dickey, J.M. (1971). The weighted likelihood ratio, linear hypotheses on normal location parameters. The Annals of Mathematical Statistics, 42, 204–223.
  11. Drton, M. and Perlman, M.D. (2007). Multiple testing and error control in Gaussian graphical model selection. Statistical Science, 22, 430–449.
  12. Duan, R., Han, L., Wang, Q., Wei, J., Chen, L., Zhang, J., Kang, C. and Wang, L. (2015). HOXA13 is a potential GBM diagnostic marker and promotes glioma invasion by activating the WNT and TGF-β pathways. Oncotarget, 6, 27778.
  13. Giudici, P. (1995). Bayes factors for zero partial covariances. Journal of Statistical Planning and Inference, 46, 161–174.
  14. Goeman, J.J. and Solari, A. (2014). Multiple hypothesis testing in genomics. Statistics in Medicine, 33, 1946–1978.
  15. Gonçalves, C., Pojo, M., Xavier-Magalhães, A., de Castro, J.V., Pinto, A., Taipa, R., Pardal, F., Reis, R.M., Sousa, N. and Costa, B.M. (2016). Regulation of WNT6 by HOXA9 in glioblastoma: functional and clinical relevance. European Journal of Cancer, 61, S45–S46.
  16. Gupta, A.K. and Nagar, D.K. (2000). Matrix Variate Distributions, Vol. 104 of Chapman & Hall/CRC Monographs and Surveys in Pure and Applied Mathematics. Boca Raton, FL: Chapman & Hall/CRC.
  17. Hannart, A. and Naveau, P. (2014). Estimating high dimensional covariance matrices: a new look at the Gaussian conjugate framework. Journal of Multivariate Analysis, 131, 149–162.
  18. Jiang, B., Ye, C. and Liu, J.S. (2017). Bayesian nonparametric tests via sliced inverse modeling. Bayesian Analysis, 12, 89–112.
  19. Kubokawa, T. and Srivastava, M.S. (2008). Estimation of the precision matrix of a singular Wishart distribution and its application in high-dimensional data. Journal of Multivariate Analysis, 99, 1906–1928.
  20. Leday, G.G.R., de Gunst, M.C.M., Kpogbezan, G.B., van der Vaart, A.W., van Wieringen, W.N. and van de Wiel, M.A. (2017). Gene network reconstruction using global-local shrinkage priors. The Annals of Applied Statistics, 11, 41–68.
  21. Ledoit, O. and Wolf, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88, 365–411.
  22. Lin, L., Drton, M. and Shojaie, A. (2016). Estimation of high-dimensional graphical models using regularized score matching. Electronic Journal of Statistics, 10, 806–854.
  23. Mitra, R., Müller, P., Liang, S., Yue, L. and Ji, Y. (2013). A Bayesian graphical model for ChIP-Seq data on histone modifications. Journal of the American Statistical Association, 108, 69–80.
  24. Mohammadi, A. and Wit, E.C. (2015). Bayesian structure learning in sparse Gaussian graphical models. Bayesian Analysis, 10, 109–138.
  25. Mohammadi, A. and Wit, E.C. (2017). BDgraph: an R package for Bayesian structure learning in graphical models. [Preprint] arXiv e-prints. https://arxiv.org/abs/1501.05108. Accessed July 1, 2018.
  26. Peterson, C., Stingo, F.C. and Vannucci, M. (2015). Bayesian inference of multiple Gaussian graphical models. Journal of the American Statistical Association, 110, 159–174.
  27. Ramdas, A., Foygel Barber, R., Wainwright, M.J. and Jordan, M.I. (2018). A unified treatment of multiple testing with prior knowledge using the p-filter. [Preprint] arXiv e-prints. https://arxiv.org/abs/1703.06222. Accessed February 1, 2018.
  28. Ren, Z., Sun, T., Zhang, C.-H. and Zhou, H.H. (2015). Asymptotic normality and optimalities in estimation of large Gaussian graphical models. Annals of Statistics, 43, 991–1026.
  29. Schäfer, J. and Strimmer, K. (2005). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4, Article 32.
  30. Schwaller, L., Robin, S. and Stumpf, M. (2017). A closed-form approach to Bayesian inference in tree-structured graphical models. [Preprint] arXiv e-prints. https://arxiv.org/abs/1504.02723. Accessed July 1, 2018.
  31. Strimmer, K. (2008). A unified approach to false discovery rate estimation. BMC Bioinformatics, 9, 303.
  32. van Wieringen, W.N. and Peeters, C.F.W. (2016). Ridge estimation of inverse covariance matrices from high-dimensional data. Computational Statistics and Data Analysis, 103, 284–303.
  33. Wang, H. and Li, S.Z. (2012). Efficient Gaussian graphical model determination under G-Wishart prior distributions. Electronic Journal of Statistics, 6, 168–198.
  34. Wang, M. and Maruyama, Y. (2016). Consistency of Bayes factor for nonnested model selection when the model dimension grows. Bernoulli, 22, 2080–2100.
  35. Zhou, Q. and Guan, Y. (2018). On the null distribution of Bayes factors in linear regression. Journal of the American Statistical Association, 113, 1362–1371.
