Published in final edited form as: J Mach Learn Res. 2022;23(242). https://www.jmlr.org/papers/v23/21-0102.html

Bayesian Covariate-Dependent Gaussian Graphical Models with Varying Structure

Yang Ni 1, Francesco C Stingo 2, Veerabhadran Baladandayuthapani 3

Abstract

We introduce Bayesian Gaussian graphical models with covariates (GGMx), a class of multivariate Gaussian distributions with a covariate-dependent sparse precision matrix. We propose a general construction of a functional mapping from the covariate space to the cone of sparse positive definite matrices, which encompasses many existing graphical models for heterogeneous settings. Our methodology is based on a novel mixture prior for precision matrices with a non-local component that admits attractive theoretical and empirical properties. The flexible formulation of GGMx allows both the strength and the sparsity pattern of the precision matrix (hence the graph structure) to change with the covariates. Posterior inference is carried out with a carefully designed Markov chain Monte Carlo algorithm, which ensures the positive definiteness of sparse precision matrices at any given covariate values. Extensive simulations and a case study in cancer genomics demonstrate the utility of the proposed model.

Keywords: Covariate-dependent graphs, Markov random fields, Random thresholding, Subject-level inference, Undirected graphs

1. Introduction

Undirected Gaussian graphical models (GGMs), also known as Gaussian Markov random fields, are one of the common tools to analyze multivariate data with complex structure and find many useful applications across biomedicine, finance, and public health. A GGM can simply be expressed as a multivariate Gaussian distribution with a sparse precision (inverse-covariance) matrix. The zero entries of the precision matrix have a probabilistic interpretation as conditional independence between the Gaussian random variables (nodes of a graph). Moreover, all the conditional independence relationships can be directly read from the accompanying undirected graph, in which a zero entry in the precision matrix corresponds to a missing edge. This equivalence essentially reduces the problem of graph structure learning in GGMs to finding zeros in the precision matrix.

Many existing GGM approaches (Dobra et al., 2004; Sudderth et al., 2004; Meinshausen and Bühlmann, 2006; Yuan and Lin, 2007; Friedman et al., 2008; Scott and Carvalho, 2008; Dobra et al., 2011; Green and Thomas, 2013; Drton and Maathuis, 2017; Khare et al., 2018; Massam, 2018; Gan et al., 2019) assume an independent and identically distributed (i.i.d.) sampling scheme $y_i = (y_{i1}, \ldots, y_{ip}) \sim N(0, \Omega^{-1})$ for $i = 1, \ldots, n$, where $\Omega$ is the precision matrix. However, the independence assumption does not hold in many applications. For example, observations in multivariate time series exhibit temporal correlations, and similarly spatial data exhibit spatial correlation. In addition, the assumption of identical distribution implies homogeneity across observations and is often violated as well. For instance, tumor heterogeneity is a well-known characteristic of cancer: patients with the same cancer type can be rather different in their genetic/genomic architecture. Forcing the same GGM (i.e., the same precision matrix $\Omega$) onto every patient is a restrictive assertion when modeling cancer genomic networks.

Attempts have been made to extend GGMs or other types of graphical models beyond i.i.d. data. If there is a natural grouping of the observations, multiple graphical models (Guo et al., 2011; Danaher et al., 2014; Oates et al., 2014; Peterson et al., 2015; Yajima et al., 2015; Xie et al., 2016; Ni et al., 2018; Shaddox et al., 2018) can be applied to learn group-specific graphs, assuming observations within each group are i.i.d. Another line of work incorporates additional covariates $x_i$ in estimating graphs. Conditional Gaussian graphical models (Rothman et al., 2010; Yin and Li, 2011; Bhadra and Mallick, 2013) are multivariate linear regression models whose error terms follow an i.i.d. GGM (they can be viewed as chain graphical models). While graph estimation is conditional on the covariates, the covariates only enter the model via the mean structure. As a consequence, the graph topology and the precision matrix stay the same across observations. In this paper, we take a more direct approach, in the sense that the latent graph, and hence the sparse precision matrix, are explicit functions of covariates.

There are a few recent works in this direction. Liu et al. (2010a) proposed a tree-based method that partitions the covariate space into a finite number of subspaces by classification and regression trees and fits GGMs separately to subsets of data. However, the estimated graphs may be unstable and lack similarity for similar covariates due to the separate graph estimation, as reported by Cheng et al. (2014). Kolar et al. (2010) proposed a penalized kernel smoothing approach that allows the precision matrix to vary with covariates. Cheng et al. (2014) developed a conditional Ising model for binary data where the dependencies are linear functions of covariates. Although the methods of Kolar et al. (2010) and Cheng et al. (2014) allow edge strength to vary with covariates, the graph structure is assumed to be constant across all observations. Recently, Ni et al. (2019) proposed a graphical regression framework that allows both edge strength and graph structure to vary with covariates in (directed) Bayesian networks. They assumed there exists a natural ordering of the nodes; given this assumption, Bayesian networks can be written as systems of recursive linear regressions, and a conditional independence function was introduced to connect regression coefficients with covariates.

In this paper, we consider the general problem of estimating undirected GGMs conditional on covariates (GGMx). GGMx allows not only the edge strength (i.e., the off-diagonal elements of the precision matrix) but also the graph structure (i.e., the sparsity pattern of the precision matrix) to vary as functions of covariates, which is illustrated in Figure 1 using graphs with four nodes and two covariates. Figure 1 also illustrates the generative mechanism underlying GGMx: covariates $x_i$ generate sparse precision matrices $\Omega_i$ (hence the graphs $G_i$), which in turn generate responses $y_i$. The major challenge in this context is the positive definiteness constraint on precision matrices – a sine qua non for GGMs – in the presence of covariates. We propose a simple strategy: we specify a matrix-valued function f(·) such that $\Omega_i = f(x_i)$ is a positive definite matrix for any $x_i$ almost surely, where f(·) includes a random thresholding component that encourages sparse precision matrix estimation, specifically enforcing the zero pattern that corresponds to missing edges. The sparse functional relationship between $\Omega_i$ and $x_i$ allows for novel graph interpolation for an unseen observation with covariates $x^*$. We show that the random thresholding gives rise to a discrete mixture of non-local priors (Johnson and Rossell, 2010) for precision matrices. We also carefully design a Markov chain Monte Carlo (MCMC) algorithm for posterior inference, which is guaranteed to propose positive definite precision matrices for any $x_i$. GGMx allows for subject-level inference on unknown graphs. Moreover, GGMx is a general class of graphical models, which subsumes at least five special cases: standard GGMs, group-specific GGMs (Guo et al., 2011; Danaher et al., 2014; Peterson et al., 2015), time-varying GGMs (Zhou et al., 2010), covariate-dependent GGMs (Kolar et al., 2010), and context-specific GGMs (Nyman et al., 2017). Extensive simulation studies show strong and robust performance of GGMx compared with competing methods. Using a cancer genomics case study, we demonstrate how GGMx can be used to infer subject-specific gene networks, which can facilitate deeper investigations into the genomic foundation of precision medicine.

Figure 1: Illustration of GGMx. Subject-level sparse precision matrices $\Omega_i$ and graphs $G_i$ of $Y$ vary with covariates $X$. The edge thickness is proportional to the strength of association $\omega_{ijk}$. Both edge strength and graph structure change with $X$. GGMx can also be viewed as a generative model: $X$ generates graphs, which in turn generate $Y$.

The rest of this article is organized as follows. We introduce the background and notations in Section 2. We present the proposed GGMx in Section 3 and discuss the link between the random thresholding prior and non-local priors in Section 4. We summarize the posterior inference and graph interpolation in Section 5. We demonstrate the utility and robustness of GGMx with extensive simulation studies in Section 6. GGMx is illustrated by a real data application in Section 7. Section 8 provides our closing discussion.

2. Background and notation

A GGM is a multivariate Gaussian distribution with a sparse precision matrix. Let $Y = (Y_1, \ldots, Y_p) \sim N(0, \Omega^{-1})$ be multivariate Gaussian random variables with mean zero and precision matrix $\Omega = [\omega_{jk}]$. Since the off-diagonal elements of $\Omega$ are proportional to partial correlations, a zero entry $\omega_{jk} = 0$ indicates that $Y_j$ and $Y_k$ are conditionally independent given all other variables. A GGM graphically represents the zero patterns of $\Omega$ by an undirected graph. An undirected graph $G = (V, E)$ consists of a set of nodes $V = \{1, \ldots, p\}$ and a set of undirected edges $E \subseteq \{\{j,k\} \mid j, k \in V\}$. The nodes $V$ represent the variables $Y$, and an edge $\{j,k\}$ is present in the graph if and only if $\omega_{jk} \neq 0$. This is not an arbitrary way of drawing a graph: the conditional independence relationships encoded in the multivariate Gaussian distribution can be directly read off from $G$ using the notion of graph separation. Importantly, learning the graph structure is equivalent to finding the zero patterns of $\Omega$.
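To make the correspondence concrete, here is a minimal sketch (not from the paper; the matrix values are illustrative assumptions) of reading an undirected graph off a sparse precision matrix and converting its entries to partial correlations:

```python
import numpy as np

# A sparse 4x4 precision matrix; zeros encode conditional independence.
Omega = np.array([[2.0, 0.6, 0.0, 0.0],
                  [0.6, 2.0, 0.5, 0.0],
                  [0.0, 0.5, 2.0, 0.4],
                  [0.0, 0.0, 0.4, 2.0]])
assert np.all(np.linalg.eigvalsh(Omega) > 0)  # positive definite

# Edge {j,k} is present iff omega_jk != 0.
p = Omega.shape[0]
edges = [(j, k) for j in range(p) for k in range(j + 1, p)
         if Omega[j, k] != 0]
print("edges:", edges)  # [(0, 1), (1, 2), (2, 3)]

# Partial correlations: rho_jk = -omega_jk / sqrt(omega_jj * omega_kk).
d = np.sqrt(np.diag(Omega))
print(np.round(-Omega / np.outer(d, d), 2))
```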

Under the Bayesian paradigm, several prior distributions (Roverato, 2002; Wang et al., 2012; Wang, 2015) for sparse precision matrices have been developed, which all take the same general form,

$$\pi(\Omega) = \frac{\tilde{\pi}(\Omega)\, I(\Omega \in M^+)}{\int \tilde{\pi}(\Omega)\, I(\Omega \in M^+)\, d\Omega} \propto \tilde{\pi}(\Omega)\, I(\Omega \in M^+), \tag{1}$$

with $M^+$ being the collection of positive definite matrices (PDMs). For example, the G-Wishart prior (Roverato, 2002) takes $\tilde{\pi}(\Omega)$ to be a Wishart distribution $\mathrm{Wishart}(b, \Omega_0)$ and $M^+ \equiv M_G^+$ to be the PDMs consistent with a graph $G$, which leads to $\pi(\Omega \mid G, b, \Omega_0) \propto \mathrm{Wishart}(\Omega \mid b, \Omega_0)\, I(\Omega \in M_G^+)$. The Bayesian graphical lasso (Wang et al., 2012) takes $\tilde{\pi}(\Omega)$ to be a product of independent exponential priors $\mathrm{Exp}(\cdot \mid \lambda)$ on the diagonal elements and double-exponential priors $\mathrm{DE}(\cdot \mid \lambda)$ on the off-diagonal elements of $\Omega$: $\pi(\Omega \mid \lambda) \propto \prod_{j<k} \mathrm{DE}(\omega_{jk} \mid \lambda) \prod_j \mathrm{Exp}(\omega_{jj} \mid \lambda/2)\, I(\Omega \in M^+)$. The graphical spike-and-slab prior (Wang, 2015) replaces the double-exponential priors in the Bayesian graphical lasso with spike-and-slab priors, $\pi(\Omega \mid G, v_1, v_0, \lambda) \propto \prod_{\{j,k\} \in E} N(\omega_{jk} \mid 0, v_1) \prod_{\{j,k\} \notin E} N(\omega_{jk} \mid 0, v_0) \prod_j \mathrm{Exp}(\omega_{jj} \mid \lambda/2)\, I(\Omega \in M^+)$, where $v_1 \gg v_0$. Priors on $\Omega$ can be defined either conditionally on the graph $G$ or marginally; in what follows we do not use a model indicator parameter $G$ but infer the graph structure directly from the zero patterns in the precision matrices.

3. Gaussian graphical models with covariates

Let $y_1, \ldots, y_n$ be $n$ realizations of a random vector $Y = (Y_1, \ldots, Y_p)$. We assume an independent multivariate Gaussian distribution for each observation, $y_i \sim p(y_i \mid \Omega_i) = N(0, \Omega_i^{-1})$, with the precision matrix $\Omega_i = [\omega_{ijk}]$, importantly, indexed by $i = 1, \ldots, n$. A subject-level graph $G_i = (V, E_i)$ is embedded in the subject-level precision matrix $\Omega_i$: $\{j,k\} \in E_i$ if and only if $\omega_{ijk} \neq 0$.

Without further modeling assumptions, $\Omega_i$ cannot be estimated from a single observation $i$. Let $x_1, \ldots, x_n$ be $n$ realizations of covariates $X = (1, X_1, \ldots, X_q)$. Note that when $X = 1$ (i.e., there are no covariates), the proposed GGMx reduces to a standard GGM; more discussion of special cases of GGMx will be given later. We model $\Omega_i \equiv f(x_i)$ through a symmetric matrix-valued function $f(\cdot)$, which is estimable as a population-level parameter shared across all observations.

General construction of covariate-dependent priors.

The key is the construction of the function $f(\cdot) = [f_{jk}(\cdot)]$ such that $\Omega_i = f(x_i)$ is a PDM for any $x_i$, $i = 1, \ldots, n$. Let $\mathcal{F}^+$ denote the collection of all such functions. This can be achieved by specifying a prior $f \sim \Pi$ that assigns positive mass only to functions satisfying this requirement, $\Pi(\mathcal{F}^+) = 1$. We consider the following generalization of the prior density in (1),

$$\pi(f) = \frac{\tilde{\pi}(f)\, I(f \in \mathcal{F}^+)}{\int \tilde{\pi}(f)\, I(f \in \mathcal{F}^+)\, df}, \tag{2}$$

where $\tilde{\pi}$ is a distribution on matrix-valued functions. Note that the support of $\tilde{\pi}$ is not limited to $\mathcal{F}^+$, offering great flexibility in the choice of $\tilde{\pi}$. For example, we can start from independent distributions a priori such that $\tilde{\pi}(f) = \prod_{j \le k} \tilde{\pi}(f_{jk})$; with this construction, the marginal distribution $\tilde{\pi}(f_{jk})$ need not be defined with a constrained range. Because of the deterministic relationship $\Omega_i = f(x_i)$, the prior $\pi(f)$ induces a conditional prior on $\Omega_i$ given $x_i$.

Two additional critical properties are desired for f(·). (i) Smoothness: similar inputs should give rise to similar PDMs. Without smoothness, similar subjects may have vastly different networks, which is difficult to interpret in many applications including ours. (ii) Sparsity: $\pi(f)$ should place positive probability on sparse PDMs. Sparsity is a common assumption in high-dimensional models including GGMs, and it improves statistical efficiency and interpretability compared to dense models. In order to encourage sparsity of $\Omega_i$, positive mass has to be placed on sparse PDMs a priori, because otherwise there would be zero mass on sparse PDMs a posteriori even if the data strongly favored sparse PDMs. To equip f(·) with these two properties, we decompose each off-diagonal element $f_{jk}(\cdot)$ of f(·) into two components,

$$f_{jk}(x_i) = g_{jk}(x_i)\, I\big(|g_{jk}(x_i)| > t_{jk}\big), \quad \text{for } j < k, \tag{3}$$

where $g_{jk}(\cdot)$ is some smooth function, the hard thresholding $I(|g_{jk}(x_i)| > t_{jk})$ promotes sparsity in $f_{jk}(\cdot)$, and $t_{jk}$ is a random threshold, which can be interpreted as a minimum effect size of $\omega_{ijk}$. Specifically, whenever $g_{jk}(x_i)$ is less than $t_{jk}$ in magnitude, the hard thresholding truncates $f_{jk}(x_i)$ to zero and hence induces a missing edge between nodes $j$ and $k$ for subject $i$. Our use of a thresholding function to induce sparsity in the precision matrices $\Omega_i = f(x_i)$ is novel and crucially different from conventional GGM priors, including the G-Wishart prior and the graphical spike-and-slab prior (Wang, 2015): in order to construct observation-specific graphs, conventional priors would require a latent indicator for each potential edge and each observation, which would greatly increase the model complexity. For example, in our application to the multiple myeloma dataset, conventional priors would need $n \cdot p \cdot (p-1)/2 = 79{,}728$ latent indicators, whereas the proposed GGMx needs far fewer, $p \cdot (p-1)/2 = 528$, thresholding parameters. Moreover, as will be introduced later, GGMx enables undirected graph interpolation for unseen covariates, a new feature that is difficult to obtain with conventional priors. Other choices of thresholding functions are possible, such as soft thresholding and nonnegative garrote thresholding. The main motivation for choosing hard thresholding over the alternatives is its theoretical connection with mixtures of non-local priors; see Section 4.

For the diagonal elements (inverse partial variances, Whittaker 2009) $f_{jj}(\cdot)$ of f(·), we assume the following model to ensure their positivity,

$$f_{jj}(x_i) = \exp\{g_{jj}(x_i)\}. \tag{4}$$

Note that, unlike the off-diagonal elements in (3), the diagonal element $f_{jj}(\cdot)$ is not subject to thresholding.
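The construction in (3)-(4) is easy to prototype. The sketch below, with an assumed linear $g_{jk}(x) = \beta_{jk}^T x$ and illustrative parameter values, builds $\Omega_i = f(x_i)$ and checks whether the result lands in $M^+$; the prior (2) restricts $f$ to functions for which this holds for every $x_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
p, q = 4, 2
x = np.concatenate([[1.0], rng.uniform(-1, 1, q)])  # X = (1, X1, ..., Xq)

beta = rng.normal(0, 1, size=(p, p, q + 1))   # beta_jk, symmetrized below
beta = (beta + beta.transpose(1, 0, 2)) / 2
t = np.full((p, p), 0.5)                      # thresholds t_jk

Omega = np.zeros((p, p))
for j in range(p):
    for k in range(p):
        g = beta[j, k] @ x
        if j == k:
            Omega[j, j] = np.exp(g)               # f_jj(x) = exp{g_jj(x)}, eq (4)
        else:
            Omega[j, k] = g * (abs(g) > t[j, k])  # hard thresholding, eq (3)

# f(x) is only a valid precision matrix if it is positive definite;
# the prior (2) assigns mass only to functions with this property.
print("positive definite:", np.all(np.linalg.eigvalsh(Omega) > 0))
```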

Remark 1 Our formulation encompasses covariate-dependent priors on both the off-diagonal (inverse-covariance) and diagonal (inverse-partial-variance) elements, thus conducting both graphical and inverse-partial-variance regression, simultaneously.

Remark 2 The proposed prior has two advantages over the more commonly used G-Wishart prior: (i) the induced prior on $\Omega_i$ from (2) explicitly incorporates covariates $x_i$, and (ii) the normalizing constant of the G-Wishart is not constant with respect to the graph $G$, so comparing two graphs requires explicit evaluation of an intractable normalizing constant, whereas $\pi(f)$, due to the thresholding function, does not have this complication.

Given f(·) and X, the proposed GGMx satisfies functional Markov properties, e.g., the pairwise functional Markov property, which is stated formally in the following lemma.

Lemma 1 If $f_{jk}(X) = 0$, then $Y_j \perp Y_k \mid Y_{\mathrm{rest}}, X$, where $Y_{\mathrm{rest}}$ is the subvector of $Y$ without $Y_j$ and $Y_k$.

The proof of Lemma 1 follows directly from the fact that $f_{jk}(X) = 0$ implies a missing edge between nodes $j$ and $k$ given covariates $X$, which in turn implies $Y_j \perp Y_k \mid Y_{\mathrm{rest}}, X$ by standard GGM theory.

A natural choice of $g_{jk}(\cdot)$ is a linear function $g_{jk}(x_i) = \beta_{jk}^T x_i$, although, in general, $g_{jk}(\cdot)$ can be any smooth function. Given the limited sample size of the case study, we consider $g_{jk}(\cdot)$ to be linear for parsimony (see Section 8 for a brief discussion on modeling a nonlinear $g_{jk}$) and interpretability ($\beta_{jk}$ are the rates of change of $\omega_{ijk}$ with respect to $x_i$). If the focus is on learning the graph structure and strength, i.e., the off-diagonal elements of $\Omega_i$, one can further simplify the model by taking the diagonal elements $g_{jj}(x_i)$ to be constant with respect to the covariates.

GGMx is a fairly flexible class of models and has at least five special cases (see Table 1). (i) If X contains only the intercept, then GGMx reduces to the standard GGM because the graph is a function of a constant and hence is constant. (ii) If X is categorical, then GGMx is a multiple graphical model (also known as a group-specific GGM), as the categorical covariate defines the groups. (iii) If X is a univariate time point, then GGMx can be used for modeling time-varying GGMs (Zhou et al., 2010) by treating time as a covariate¹. (iv) If the thresholds $t_{jk}$ are fixed to 0, then GGMx is a covariate-dependent GGM in which the strength of the graph varies continuously with the covariates but the structure is constant, because a non-zero linear function is non-zero almost everywhere. (v) If X is a subset of Y, then GGMx can be interpreted as a context-specific GGM (Nyman et al., 2017) in which the graph structure varies with (discretized) X.

Table 1:

Five special cases of GGMx.

| Special case of GGMx | Conditions | Mapping $X \mapsto \Omega$ |
|---|---|---|
| Standard GGM | $x_i = 1$ | $g_{jk}(x_i) = \beta_{jk}$, $\omega_{ijk} = \beta_{jk}\, I(|\beta_{jk}| > t_{jk})$ |
| Group-specific GGM | $x_i = c$, $c \in \{1, \ldots, C\}$ | $g_{jk}(x_i) = \beta_{jkc}$, $\omega_{ijk} = \beta_{jkc}\, I(|\beta_{jkc}| > t_{jk})$ |
| Time-varying GGM | $x_i = x$, time $x$ | $g_{jk}(x) = x\beta_{jk}$, $\omega_{ijk} = x\beta_{jk}\, I(|x\beta_{jk}| > t_{jk})$ |
| Covariate-dependent GGM | $t_{jk} = 0$ | $g_{jk}(x_i) = \beta_{jk}^T x_i$, $\omega_{ijk} = \beta_{jk}^T x_i$ |
| Context-specific GGM | $x_i = y_{i1}$ | $g_{jk}(x_i) = y_{i1}\beta_{jk}$, $\omega_{ijk} = y_{i1}\beta_{jk}\, I(|y_{i1}\beta_{jk}| > t_{jk})$ |

Priors. We assign priors to $\beta_{jk}$ and $t_{jk}$, which in turn define $\tilde{\pi}(f)$. We assume an independent multivariate Gaussian prior $\beta_{jk} \sim \pi(\beta_{jk}) = N(\beta_{jk} \mid 0, \tau_{jk} I_q)$. The thresholding parameter $t_{jk}$ can be interpreted as the minimum size of the off-diagonal elements of $\Omega_i$. Since its value is usually unknown in practice, we assign a truncated normal prior $t_{jk} \sim \pi(t_{jk}) = N(\mu_t, \sigma_t^2)\, I(t_{jk} > 0)$ to reflect this uncertainty. As we will show in the next section, the priors of $\beta_{jk}$ and $t_{jk}$ induce a mixture of non-local priors on $\Omega_i$.

To complete the prior formulation, for the hyperparameter $\tau_{jk}$, we assign the hyperprior

$$\tau = \{\tau_{jk}\}_{j \le k} \sim \pi(\tau) = \frac{C_\tau \prod_{j \le k} \mathrm{IG}(\tau_{jk} \mid a_\tau, b_\tau)}{\int C_\tau \prod_{j \le k} \mathrm{IG}(\tau_{jk} \mid a_\tau, b_\tau)\, d\tau},$$

where IG(a, b) denotes an inverse-gamma density with shape a and scale b, and Cτ is the normalizing constant in (2),

$$C_\tau = \int \tilde{\pi}(f)\, I(f \in \mathcal{F}^+)\, df.$$

Including $C_\tau$ in the prior $\pi(\tau)$ serves to cancel out $C_\tau^{-1}$ in (2) so that the full conditional of $\tau_{jk}$ is inverse-gamma. A similar cancellation trick has been used and thoroughly investigated for the Bayesian graphical lasso (Wang et al., 2012).

A schematic representation of the proposed GGMx is provided in Figure 2.

Figure 2: A schematic representation of GGMx.

4. Theoretical Properties

We establish a general result connecting the proposed prior on precision matrices induced by (2) and (3) with non-local alternative priors in GGMs. A non-local prior assigns a vanishing density (under the alternative hypothesis) to the neighborhood of the null hypothesis. In variable selection contexts, this density vanishes around 0 and therefore shrinks small effects to zero, which is appealing because we are interested in a parsimonious estimate of the graph (i.e., a sparse network). Non-local priors have been shown, both theoretically and empirically, to outperform local priors in various applications including hypothesis testing, high-dimensional sparse regression, and Bayesian networks (Johnson and Rossell, 2010, 2012; Altomare et al., 2013; Rossell and Telesca, 2017; Shin et al., 2018; Ni et al., 2019). However, to the best of our knowledge, all existing priors for sparse precision matrices in GGMs (G-Wishart, Bayesian graphical lasso, and the stochastic search structure learning prior) are local, i.e., $\pi(\Omega)$ does not approach 0 as $\omega_{jk} \to 0$ for $\{j,k\} \in E$. Conceptually, local priors have a seemingly "contradictory" representation of one's prior belief. On the one hand, $\{j,k\} \in E$ suggests $\omega_{jk}$ is non-zero. On the other hand, local priors fail to assign zero mass at $\omega_{jk} = 0$; in fact, local priors often assign the maximum mass at zero. The practical implication of this "contradiction" is that local priors tend to favor denser models and be more susceptible to false discoveries than non-local priors, especially for high-dimensional models like GGMx.

Let $\pi_\theta$ and $\pi_t$ generically denote the priors for $\theta_{jk}$ and $t_{jk}$: $\theta_{jk} \sim \pi_\theta(\theta_{jk})$ and $t_{jk} \sim \pi_t(t_{jk})$. Let $T = [t_{jk}]$. We now show the connection between non-local priors and the proposed prior of the following general form,

$$\pi(\Omega \mid T) = \frac{\tilde{\pi}(\Omega \mid T)\, I(\Omega \in M^+)}{\int \tilde{\pi}(\Omega \mid T)\, I(\Omega \in M^+)\, d\Omega},$$

and

$$\tilde{\pi}(\Omega \mid T) = \prod_{j=1}^p \pi_d(\omega_{jj}) \prod_{j<k} \pi_{\omega \mid t}(\omega_{jk} \mid t_{jk}),$$

where

$$\omega_{jk} = \theta_{jk}\, I(|\theta_{jk}| > t_{jk}), \quad \text{for } j < k.$$

Note that the equations above make no reference to covariates. We deliberately do so for clarity and generality; all the following theoretical results apply to the marginal distribution $\pi(\Omega_i)$ in GGMx by letting $\theta_{jk} = g_{jk}(x_i) = \beta_{jk}^T x_i$ and $\omega_{jj} = \exp\{g_{jj}(x_i)\}$. Conditional on $t_{jk}$, the prior $\pi_\theta$ induces a spike-and-slab mixture distribution,

$$\pi_{\omega \mid t}(\omega_{jk} \mid t_{jk}) = \rho\, \delta_0(\omega_{jk}) + (1 - \rho)\, \tilde{\pi}_{\omega \mid t}(\omega_{jk} \mid t_{jk}),$$

where the mixture weight $\rho = \Pr(|\omega_{jk}| < t_{jk} \mid t_{jk})$ is computed under the conditional distribution of $\omega_{jk}$ induced by $\pi_\theta(\cdot)$, and hence is a function of $t_{jk}$ (not $\omega_{jk}$), and the slab is a truncated distribution,

$$\tilde{\pi}_{\omega \mid t}(\omega_{jk} \mid t_{jk}) = \frac{\pi_\theta(\omega_{jk})\, I(|\omega_{jk}| > t_{jk})}{\Pr(|\omega_{jk}| > t_{jk} \mid t_{jk})}.$$

Slightly abusing notation, let $\omega = (\omega_1, \ldots, \omega_M) = (\omega_{12}, \ldots, \omega_{1p}, \omega_{23}, \ldots, \omega_{2p}, \ldots, \omega_{p-1,p})$ be an $M$-dimensional vector containing the upper-triangular elements of $\Omega$ with $M = \binom{p}{2}$. Let $S \subseteq \{1, \ldots, M\}$ denote the indices of the non-zero elements of $\Omega$ (or equivalently of $\omega$), i.e., $\omega_m = 0$ if and only if $m \in S^c$. Then the conditional prior of $\Omega$ given $T$ can be written as a mixture over all possible subsets $S$,

$$\pi(\Omega \mid T) = \frac{1}{g(T)}\, I(\Omega \in M^+) \prod_{j=1}^p \pi_d(\omega_{jj}) \tag{5}$$
$$\times \sum_{S \in 2^{\{1,\ldots,M\}}} \prod_{m \in S} \pi_\theta(\omega_m)\, I(|\omega_m| > t_m) \prod_{m \in S^c} \Pr(|\omega_m| < t_m \mid t_m)\, \delta_0(\omega_m), \tag{6}$$

where $g(T) = \int \tilde{\pi}(\Omega \mid T)\, I(\Omega \in M^+)\, d\Omega$ is the normalizing constant and $2^{\{1,\ldots,M\}}$ is the power set of $\{1, \ldots, M\}$. Our main theorem shows that under very mild conditions, the marginal prior $\pi(\Omega)$ is a discrete mixture of non-local priors. Before presenting the main theorem, we first state a lemma that is useful in proving it.

Lemma 2 $E[1/g(T)] < \infty$ if the distribution $\pi_\theta(\cdot)$ of $\theta_{jk}$ has positive mass around zero, i.e., there exists $\delta^* > 0$ such that for any $0 < \delta < \delta^*$, $\int_{-\delta}^{\delta} \pi_\theta(\theta)\, d\theta > 0$, and the distribution $\pi_d(\cdot)$ of $\omega_{jj}$ is not a point mass at zero, i.e., $\pi_d(\cdot) \neq \delta_0(\cdot)$.

Proof Consider

$$\begin{aligned}
g(T) &= \int \tilde{\pi}(\Omega \mid T)\, I(\Omega \in M^+)\, d\Omega = \Pr(\Omega \in M^+ \mid T) \\
&\ge \Pr\big(\{\omega_{jj} > (p-1)\lambda\}_{j=1}^p,\ \{|\omega_{jk}| \le \lambda\}_{j<k} \mid T\big), \quad \forall\, \lambda \ge 0 \\
&\ge \Pr\big(\{\omega_{jj} > (p-1)\lambda\}_{j=1}^p,\ \{|\theta_{jk}| \le \lambda\}_{j<k}\big) \\
&= \prod_{j=1}^p \Pr\big(\omega_{jj} > (p-1)\lambda\big) \prod_{j<k} \Pr\big(|\theta_{jk}| \le \lambda\big) \overset{\mathrm{def}}{=} L(\lambda).
\end{aligned}$$

The first inequality holds because a diagonally dominant symmetric matrix with positive diagonal entries is positive definite, and the second inequality holds because $|\theta_{jk}| \le \lambda$ implies $|\omega_{jk}| \le \lambda$ by design and $T$ is independent of $\omega_{jj}$ and $\theta_{jk}$. If $\pi_\theta$ has positive mass around zero and $\pi_d$ is not a point mass at zero, we can pick a sufficiently small (but positive) $\lambda^* > 0$ such that the lower bound $L(\lambda^*)$ of $g(T)$ is positive. It then follows that $E[1/g(T)] < \infty$.
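As a quick numerical sanity check of the bound (our illustration, not part of the proof): a symmetric matrix with $\omega_{jj} > (p-1)\lambda$ and $|\omega_{jk}| \le \lambda$ is diagonally dominant and hence positive definite.

```python
import numpy as np

rng = np.random.default_rng(1)
p, lam = 6, 0.3
Omega = rng.uniform(-lam, lam, size=(p, p))
Omega = (Omega + Omega.T) / 2                  # symmetric, |omega_jk| <= lam
np.fill_diagonal(Omega, (p - 1) * lam + 0.01)  # omega_jj > (p-1)*lambda
print(np.linalg.eigvalsh(Omega).min() > 0)     # True: diagonally dominant => PD
```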

Theorem 1 The marginal prior π(Ω) is given by

$$\pi(\Omega) = \sum_{S \in 2^{\{1,\ldots,M\}}} \rho_S\, \pi_S(\Omega),$$

where $\pi_S(\Omega)$ is the prior under the hypothesis $H_S: \omega_m \neq 0$ for $m \in S$ and $\omega_m = 0$ for $m \in S^c$. Moreover, $\pi_S(\Omega)$ is a non-local prior for any $S \in 2^{\{1,\ldots,M\}} \setminus \emptyset$, that is, $\pi_S(\Omega) \to 0$ as $\omega_m \to 0$ for $m \in S$, provided (i) $\Pr(t = 0) = 0$, (ii) $\pi_\theta(\cdot)$ is bounded and has positive mass near 0, and (iii) $\pi_d(\cdot) \neq \delta_0(\cdot)$.

Proof The marginal distribution of Ω is given by

$$\pi(\Omega) = \int \pi(\Omega \mid T)\, \pi_t(T)\, dT = \int I(\Omega \in M^+) \prod_{j=1}^p \pi_d(\omega_{jj})\, \frac{1}{g(T)} \prod_{j<k} \pi_{\omega \mid t}(\omega_{jk} \mid t_{jk})\, \pi_t(T)\, dT.$$

Let $m(\Omega) = I(\Omega \in M^+) \prod_{j=1}^p \pi_d(\omega_{jj})$; then

$$\begin{aligned}
\pi(\Omega) &= m(\Omega) \int \frac{1}{g(T)} \prod_{j<k} \pi_{\omega \mid t}(\omega_{jk} \mid t_{jk})\, \pi_t(T)\, dT \\
&= m(\Omega) \int \frac{1}{g(T)} \prod_{j<k} \left\{\Pr(|\omega_{jk}| < t_{jk} \mid t_{jk})\, \delta_0(\omega_{jk}) + \pi_\theta(\omega_{jk})\, I(|\omega_{jk}| > t_{jk})\right\} \pi_t(t_{jk})\, dT \\
&= m(\Omega) \int \frac{1}{g(T)} \sum_{S \in 2^{\{1,\ldots,M\}}} \prod_{m \in S} \pi_\theta(\omega_m)\, I(|\omega_m| > t_m)\, \pi_t(t_m) \prod_{m \in S^c} \Pr(|\omega_m| < t_m \mid t_m)\, \delta_0(\omega_m)\, \pi_t(t_m)\, dT \\
&= \sum_{S \in 2^{\{1,\ldots,M\}}} m(\Omega)\, E_T\!\left[\frac{1}{g(T)} \prod_{m \in S} I(|\omega_m| > t_m) \prod_{m \in S^c} \Pr(|\omega_m| < t_m \mid t_m)\right] \prod_{m \in S} \pi_\theta(\omega_m) \prod_{m \in S^c} \delta_0(\omega_m) \\
&\overset{\mathrm{def}}{=} \sum_{S \in 2^{\{1,\ldots,M\}}} h_S(\Omega) = \sum_{S \in 2^{\{1,\ldots,M\}}} \int h_S(\Omega)\, d\Omega \times \frac{h_S(\Omega)}{\int h_S(\Omega)\, d\Omega} \overset{\mathrm{def}}{=} \sum_{S \in 2^{\{1,\ldots,M\}}} \rho_S \times \pi_S(\Omega).
\end{aligned}$$

We will show that for any $m^* \in S$ and any sequence $\omega_{m^*}^{(n)} \to 0$ as $n \to \infty$, $\pi_S(\Omega^{(n)}) \to 0$ as $n \to \infty$, where $\Omega^{(n)}$ contains $\omega_{m^*}^{(n)}$ as an element. Note that

$$\frac{1}{g(T)} \prod_{m \in S} I(|\omega_m| > t_m) \prod_{m \in S^c} \Pr(|\omega_m| < t_m \mid t_m) \le \frac{1}{g(T)}.$$

Since $E[1/g(T)] < \infty$ by conditions (ii)-(iii) and Lemma 2, and $\lim_{n \to \infty} I(|\omega_{m^*}^{(n)}| > t_{m^*}) = 0$ almost surely by condition (i), the dominated convergence theorem gives

$$\begin{aligned}
&\lim_{n\to\infty} E_T\!\left[\frac{1}{g(T)}\, I(|\omega_{m^*}^{(n)}| > t_{m^*}) \prod_{m \in S,\, m \neq m^*} I(|\omega_m| > t_m) \prod_{m \in S^c} \Pr(|\omega_m| < t_m \mid t_m)\right] \\
&\quad = E_T\!\left[\frac{1}{g(T)} \left\{\lim_{n\to\infty} I(|\omega_{m^*}^{(n)}| > t_{m^*})\right\} \prod_{m \in S,\, m \neq m^*} I(|\omega_m| > t_m) \prod_{m \in S^c} \Pr(|\omega_m| < t_m \mid t_m)\right] = 0.
\end{aligned}$$

Finally, condition (ii) renders $\pi_S(\Omega^{(n)}) \to 0$.

Conditions (i)-(iii) in Theorem 1 are very mild and are satisfied by a wide range of $\pi_t$, $\pi_\theta$, and $\pi_d$. Condition (i) is trivially satisfied if $\pi_t$ is continuous (e.g., gamma, inverse-gamma, log-normal, and truncated normal distributions). Condition (ii) holds for the Cauchy, normal, and most scale mixtures of normal distributions such as the Laplace, normal-gamma, and t distributions. Condition (iii) only excludes a point mass at zero, $\delta_0(\cdot)$, from the possible choices of $\pi_d(\cdot)$.

A simple illustrative example

As a concrete example, $\pi_S(\Omega)$ is non-local under the prior distributions specified in Section 3, namely $\pi_\theta(\theta_{jk}) = N(0, \tau)$, $\pi_d(\omega_{jj}) = \text{log-normal}(0, \tau)$, and $\pi_t(t_{jk}) = N(\mu_t, \sigma_t^2)\, I(t_{jk} > 0)$. To visualize the proposed non-local prior, we consider a small precision matrix with $p = 3$ and perform a prior simulation to generate $\Omega$ from $\pi(\Omega)$; the procedure is a special case of the posterior simulation procedure (ignoring the likelihood) described in Section 5. We visualize $\pi_S(\Omega)$ for $S = \{1, \ldots, M\}$, i.e., a complete graph. The marginal densities of pairs of off-diagonal elements of $\Omega$ (normalized to partial correlations) are depicted in the top panel of Figure 3, which shows vanishing density as $\omega_{jk}$ approaches 0. By contrast, a local prior on $\Omega$ (simulated by fixing $t_{jk} = 0$) has increasing density as $\omega_{jk}$ approaches 0, as shown in the bottom panel of Figure 3.
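The prior simulation can be sketched as follows (an assumed rejection-sampling implementation consistent with (2); hyperparameter values are illustrative): draw the diagonal from the log-normal, draw $\theta_{jk}$ and $t_{jk}$, threshold, and keep draws that are positive definite.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
p, tau, mu_t, sigma_t = 3, 1.0, 1.0, 0.2
draws = []
while len(draws) < 5000:
    # Diagonal: log-normal(0, tau); off-diagonals: thresholded normals.
    Omega = np.diag(stats.lognorm.rvs(s=np.sqrt(tau), size=p, random_state=rng))
    for j in range(p):
        for k in range(j + 1, p):
            theta = rng.normal(0, np.sqrt(tau))
            t = stats.truncnorm.rvs(-mu_t / sigma_t, np.inf, loc=mu_t,
                                    scale=sigma_t, random_state=rng)
            Omega[j, k] = Omega[k, j] = theta * (abs(theta) > t)
    if np.all(np.linalg.eigvalsh(Omega) > 0):  # I(Omega in M+), eq (2)
        draws.append(Omega)

# Nonzero off-diagonals are bounded away from 0 (non-local behavior).
w01 = np.array([d[0, 1] for d in draws])
print("min |omega_01| among nonzeros:", np.abs(w01[w01 != 0]).min())
```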

Figure 3: Non-local (top) and local (bottom) prior distributions of Ω.

Remark The connection between non-local priors and random thresholding has been investigated in the regression context (Rossell and Telesca, 2017; Ni et al., 2019). We make a nontrivial extension to precision matrix estimation for undirected GGMs. One major difference between our theory and those of Rossell and Telesca (2017) and Ni et al. (2019) is the complexity of the intractable prior normalizing constant $g(T)$ in (5). Intractable prior normalizing constants are a common challenge in standard Bayesian GGMs (Dobra et al., 2011; Wang et al., 2012; Wang, 2015), both theoretically and computationally. In order to show the equivalence between non-local priors and random thresholding for GGMs, we make extra assumptions, namely that $\pi_\theta(\cdot)$ has positive mass around zero and $\pi_d(\cdot) \neq \delta_0(\cdot)$, in order to bound $E[1/g(T)]$. These mild assumptions are not required in previous works. Also note that Rossell and Telesca (2017) truncate the probability density, whereas we threshold the random variables. Consequently, the resulting marginal prior of Rossell and Telesca (2017) is a non-local prior, while ours is a discrete mixture of non-local priors and point masses at 0. Computationally, the issue of the intractable normalizing constant is resolved by a carefully designed MCMC algorithm, which is discussed in the next section.

5. Posterior Inference

The proposed GGMx is parameterized by three sets of parameters, $\{\beta_{jk}\}_{j \le k}$, $\{t_{jk}\}_{j<k}$, and $\{\tau_{jk}\}_{j \le k}$. The joint posterior distribution of these parameters is given by

$$p\big(\{\beta_{jk}\}_{j \le k}, \{t_{jk}\}_{j<k}, \{\tau_{jk}\}_{j \le k} \mid \{y_i, x_i\}_{i=1}^n\big) \propto \prod_{i=1}^n N(y_i \mid 0, \Omega_i^{-1}) \prod_{j<k} N(t_{jk} \mid \mu_t, \sigma_t^2)\, I(t_{jk} > 0) \prod_{j \le k} N(\beta_{jk} \mid 0, \tau_{jk} I_q)\, \mathrm{IG}(\tau_{jk} \mid a_\tau, b_\tau),$$

where the right-hand side depends on $x_i$ through $\Omega_i = f(x_i)$, and $f(\cdot)$ is defined by $\{\beta_{jk}\}_{j \le k}$ and $\{t_{jk}\}_{j<k}$. Posterior inference on the model parameters is carried out by MCMC. We need to carefully choose a proposal distribution that can propose $f \in \mathcal{F}^+$ efficiently. This is not a trivial task because the probability of generating $f \in \mathcal{F}^+$ is practically zero if we propose $\beta_{jk}$ and $t_{jk}$ from naive proposals such as standard random walks. Here, we introduce a proposal that always proposes $f \in \mathcal{F}^+$.

For illustration, suppose we are currently updating the $(j,k)$th element of $\Omega_i$. Let $\omega_{i,-k,k}$ denote the $k$th column of $\Omega_i$ without the $k$th row and let $\Omega_{i,-k,-k}$ denote the submatrix of $\Omega_i$ without the $k$th row and column. Let $\phi_{ik} = \omega_{ikk} - u_{ik}$ with $u_{ik} = \omega_{i,-k,k}^T \Omega_{i,-k,-k}^{-1} \omega_{i,-k,k}$. We first propose new $\beta_{jk\ell}^*$ and $t_{jk}^*$ from some proposal densities $q_\beta(\beta_{jk\ell}^* \mid \beta_{jk\ell})$ and $q_t(t_{jk}^* \mid t_{jk})$, such as random walks, for $\ell = 1, \ldots, q+1$. The resulting new values of $\omega_{i,-k,k}$ and $u_{ik}$ are denoted by $\omega_{i,-k,k}^*$ and $u_{ik}^*$. Notice that $\Omega_i$ is positive definite if and only if $\phi_{ik} > 0$ for $k = 1, \ldots, p$. This is due to Sylvester's criterion: a symmetric matrix is positive definite if and only if all of its leading principal minors are positive. Without loss of generality, assume $k$ is the last column, all previous leading principal minors are positive, and the covariates $x_i$ are positive. Then the last leading principal minor $\det(\Omega_i^*) = (\omega_{ikk}^* - u_{ik}^*) \det(\Omega_{i,-k,-k}^*)$ is positive if and only if $\omega_{ikk}^* > u_{ik}^*$. Therefore, in order to ensure positive definiteness of $\Omega_i$ for all $i$ when updating its $(j,k)$th element, we additionally propose a new $\beta_{kk\ell}^*$ such that $\omega_{ikk}^* = \exp\{g_{kk}^*(x_i)\} > u_{ik}^*$, where $g_{kk}^*(x_i) = \sum_{\ell' \neq \ell} x_{i\ell'} \beta_{kk\ell'} + x_{i\ell} \beta_{kk\ell}^*$. The solution to this inequality for all $i$ is the constraint that the proposal of $\beta_{kk\ell}^*$ needs to respect. Specifically, we propose $\beta_{kk\ell}^* \sim q_\beta(\beta_{kk\ell}^* \mid \beta_{kk\ell})\, I(\beta_{kk\ell}^* \in S_{k\ell})$ where

$$S_{k\ell} = \left\{\beta : \beta > \max_i \frac{\log(u_{ik}^*) - \sum_{\ell' \neq \ell} x_{i\ell'} \beta_{kk\ell'}}{x_{i\ell}}\right\}.$$
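A sketch of the constraint computation is given below; `proposal_lower_bound` is a hypothetical helper (not the authors' code) that evaluates the lower endpoint of $S_{k\ell}$, assuming positive covariates as stated above.

```python
import numpy as np

def proposal_lower_bound(u_star, X, beta_kk, ell):
    """Lower bound of S_{k,ell} over observations (assumes x_{i,ell} > 0).

    u_star  : (n,) proposed u*_ik per observation
    X       : (n, q+1) covariates (first column is the intercept)
    beta_kk : (q+1,) current coefficients of g_kk
    ell     : index of the coefficient being updated
    """
    rest = X @ beta_kk - X[:, ell] * beta_kk[ell]   # g_kk without the ell term
    return np.max((np.log(u_star) - rest) / X[:, ell])

# Usage: propose beta*_{kk,ell} from a random walk truncated below at this
# bound (e.g., via rejection), which guarantees omega*_{ikk} > u*_{ik} for all i.
```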

We summarize this property of the proposal density in the following proposition, the proof of which is given by the construction above.

Proposition 1 The proposal density $q(\beta_{jk\ell}^*, t_{jk}^*, \beta_{kk\ell}^* \mid \beta_{jk\ell}, t_{jk}, \beta_{kk\ell}) = q_\beta(\beta_{jk\ell}^* \mid \beta_{jk\ell})\, q_t(t_{jk}^* \mid t_{jk})\, q_\beta(\beta_{kk\ell}^* \mid \beta_{kk\ell})\, I(\beta_{kk\ell}^* \in S_{k\ell})$ and the full conditional density $p(\beta_{jk\ell}, t_{jk}, \beta_{kk\ell} \mid \text{rest})$ have the same support.

We now provide the MCMC (Metropolis-within-Gibbs) algorithm below; its validity is guaranteed by Proposition 1 and standard MCMC theory.

The MCMC Algorithm.

Initialize model parameters. Repeat the following steps until practical convergence.

(I) Update the precision matrices $\Omega_i$. Scanning through each column $k = 1, \ldots, p$, each row $j \le k$, and each covariate $\ell = 1, \ldots, q+1$, we propose $\beta_{jk\ell}^*$, $t_{jk}^*$, and $\beta_{kk\ell}^*$ from $q_\beta(\beta_{jk\ell}^* \mid \beta_{jk\ell})$, $q_t(\log t_{jk}^* \mid \log t_{jk})$, and $q_\beta(\beta_{kk\ell}^* \mid \beta_{kk\ell})\, I(\beta_{kk\ell}^* \in S_{k\ell})$, where $q_t(\log t_{jk}^* \mid \log t_{jk}) = N(\log t_{jk}^* \mid \log t_{jk}, \eta_t^2)$, $q_\beta(\beta_{jk\ell}^* \mid \beta_{jk\ell}) = N(\beta_{jk\ell}^* \mid \beta_{jk\ell}, \eta_\beta^2)$, and $q_\beta(\beta_{kk\ell}^* \mid \beta_{kk\ell}) = N(\beta_{kk\ell}^* \mid \beta_{kk\ell}, \eta_\beta^2)$. We accept the proposal with probability $\min(1, \alpha)$ where

$$\alpha = \frac{\prod_{i=1}^n p(y_i \mid \Omega_i^*)\, \pi(\beta_{jk\ell}^*)\, \pi(t_{jk}^*)\, \pi(\beta_{kk\ell}^*)\, q_\beta(\beta_{jk\ell} \mid \beta_{jk\ell}^*)\, q_t(t_{jk} \mid t_{jk}^*)\, q_\beta(\beta_{kk\ell} \mid \beta_{kk\ell}^*)\, I(\beta_{kk\ell} \in S_{k\ell})}{\prod_{i=1}^n p(y_i \mid \Omega_i)\, \pi(\beta_{jk\ell})\, \pi(t_{jk})\, \pi(\beta_{kk\ell})\, q_\beta(\beta_{jk\ell}^* \mid \beta_{jk\ell})\, q_t(t_{jk}^* \mid t_{jk})\, q_\beta(\beta_{kk\ell}^* \mid \beta_{kk\ell})\, I(\beta_{kk\ell}^* \in S_{k\ell})}.$$

The proposal standard deviations $\eta_t$ and $\eta_\beta$ can be tuned to achieve a desired acceptance rate (say, 20%-40%).

(II) Update the hypervariances $\tau_{jk}$ from the inverse-gamma full conditional, $\tau_{jk} \sim \mathrm{IG}(a_\tau + 1/2,\ b_\tau + \beta_{jk}^2/2)$.

Graph estimation.

A point estimate of $G_i$ can be obtained by thresholding the posterior probability of inclusion. Specifically, we select $\{j,k\} \in E_i$ if $\Pr(\{j,k\} \in E_i \mid y_i, x_i) > c$, where $c \in [0,1]$ is the probability cutoff². The posterior probability of inclusion can be approximated from the MCMC samples,

$$\Pr(\{j,k\} \in E_i \mid y_i, x_i) = \Pr(\omega_{ijk} \neq 0 \mid y_i, x_i) \approx \frac{1}{R} \sum_{r=1}^R I\{\omega_{ijk}^{(r)} \neq 0\},$$

where the superscript (r) indexes the posterior samples.
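A sketch of this computation (assuming the posterior draws of $\Omega_i$ are stored in an array `omega_samples` of shape $R \times p \times p$):

```python
import numpy as np

def edge_inclusion_prob(omega_samples):
    """Approximate Pr(omega_ijk != 0 | y_i, x_i) by the Monte Carlo average."""
    return (omega_samples != 0).mean(axis=0)   # (p, p) inclusion frequencies

def point_estimate_graph(omega_samples, cutoff=0.5):
    """Adjacency of the estimated graph: include edge if PPI > cutoff c."""
    return edge_inclusion_prob(omega_samples) > cutoff
```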

Graph interpolation.

Since the precision matrix $\Omega_i = f(x_i)$ is modeled as a function of $x_i$, we can interpolate a graph $G^* = (V, E^*)$ for an unseen observation with covariates $x^*$. This is achieved through the posterior predictive distribution of $f(\cdot)$, which can be approximated from the MCMC samples,

$$\Pr(\{j,k\} \in E^* \mid y, x, x^*) = \Pr\{f_{jk}(x^*) \neq 0 \mid y, x\} \approx \frac{1}{R} \sum_{r=1}^R I\{f_{jk}^{(r)}(x^*) \neq 0\}.$$

Graph interpolation requires only the covariates $x^*$, since the right-hand side of the equation above does not depend on $y^*$. In practice, this is a desirable property. For example, one can predict the gene network for new patients without sequencing the whole genome; measuring the covariates (e.g., blood biomarkers) suffices.
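A sketch of graph interpolation from stored draws of $(\beta, t)$ (the array shapes are assumptions for illustration):

```python
import numpy as np

def interpolate_graph(beta_samples, t_samples, x_star, cutoff=0.5):
    """beta_samples: (R, p, p, q+1) draws; t_samples: (R, p, p) draws."""
    g = np.einsum('rjkl,l->rjk', beta_samples, x_star)  # g_jk(x*) per draw
    nonzero = np.abs(g) > t_samples                     # thresholded edges
    ppi = nonzero.mean(axis=0)                          # inclusion probability
    return ppi > cutoff                                 # interpolated adjacency
```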

6. Simulations

6.1. Simulation Setup

We assessed the utility and operating characteristics of GGMx in seven simulation scenarios with different levels of sparsity and types of covariates. The same dataset size as the application was used: $n = 151$, $p = 33$, and $q = 2$ ($q$ was set to 1 for the last scenario). Note that even with a moderate dataset, the number of parameters $(\beta_{jk\ell}, t_{jk}, \tau_{jk})$ that need to be estimated is $\frac{p(p+1)(q+2)}{2} + \frac{p(p-1)}{2} = 2{,}772$, which is substantially larger than the sample size. We focused on graph structure learning in the first five scenarios by assuming constant diagonal elements $g_{jj}(\cdot)$ for simplicity; the non-constant case (i.e., simultaneous inverse-partial-variance and graphical regression) is considered in the last two scenarios. We fixed the probability cutoff $c$ at 0.5 in all scenarios.

Scenario I.

We generated the simulated data from our model. We randomly set 2% of the $\beta_{jk\ell}$ for $j < k$ to $\pm 1$ with equal probability. We set $t_{jk} = 0.5$ and all the diagonal elements of $\Omega_i$ to 1. The covariate $x_{ij}$ was generated from a uniform distribution, $x_{ij} \overset{iid}{\sim} U(-1, 1)$. The resulting precision matrix $\Omega_i$ might not be positive definite for all observations $i = 1, \ldots, n$; we repeated the process until $\Omega_i > 0$ for all $i$. Then the observation $y_i$ was drawn from a normal distribution, $y_i \overset{ind}{\sim} N(0, \Omega_i^{-1})$. Using the same procedure, we generated a similar independent dataset with sample size 50 for testing the graph interpolation of GGMx.

Scenario II.

The procedure in Scenario I was inefficient for generating a denser network. In addition, it may not mimic the data in the application well. In this scenario, we used one posterior draw from GGMx applied to the multiple myeloma data as the simulation truth. The true $\beta_{jk\ell}$'s are shown as heatmaps in Figure 4a, where $\ell = 1$ corresponds to the intercept and $\ell = 2, 3$ correspond to the two covariates. Since the heatmap of $\beta_{jk1}$ is denser than those of $\beta_{jk2}$ and $\beta_{jk3}$, there were more nearly constant edges than highly varying edges. The true $t_{jk}$'s are shown in Figure 4b. The covariates $x_i$ of the multiple myeloma dataset were used, and $y_i$ was drawn from the model $y_i \overset{ind}{\sim} N(0, \Omega_i^{-1})$ with $\Omega_i = f(x_i)$.

Figure 4: Simulation truths for Scenario II. Heatmaps of (a) the true $\beta_{jk\ell}$'s and (b) the true $t_{jk}$'s, taken from one posterior draw of GGMx applied to the multiple myeloma data.

Scenario III.

This scenario considered a simulation truth from an ordinary GGM, i.e., $\Omega_i = \Omega$ for all $i$. We generated a true $\Omega$ as follows; a code sketch of this procedure follows the list.

  1. Generate an Erdős–Rényi graph G with connection probability 5%.

  2. Set the diagonal entries of Ω to 1. For each edge {j, k} in G, draw the corresponding off-diagonal entry $\omega_{jk}$ uniformly from $[-1, -0.5] \cup [0.5, 1]$.

  3. Since Ω might not be positive definite, we kept adding $0.1 I$ to Ω until it became positive definite. The resulting partial correlations were less than 0.4 in magnitude. We then simulated $y_i \overset{iid}{\sim} N(0, \Omega^{-1})$ and $x_{ij} \overset{iid}{\sim} U(-1, 1)$. GGMx took the independently generated $x_{ij}$ as covariates, which were pure "noise" for constructing the graph of $y_i$.
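A sketch of this generation procedure (the seed is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 151, 33
G = np.triu(rng.random((p, p)) < 0.05, k=1)      # Erdos-Renyi graph, 5% edges
Omega = np.eye(p)
m = G.sum()
Omega[G] = rng.uniform(0.5, 1.0, m) * rng.choice([-1, 1], m)
Omega = np.triu(Omega) + np.triu(Omega, k=1).T   # symmetrize
while np.linalg.eigvalsh(Omega).min() <= 0:      # add 0.1*I until PD
    Omega += 0.1 * np.eye(p)
y = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega), size=n)
x = rng.uniform(-1, 1, size=(n, 2))              # pure-noise covariates
```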

Scenario IV.

We extended Scenario III to multiple graphs with $C = 3$ groups. The sample sizes of the groups were $n_1 = 50$, $n_2 = 50$, and $n_3 = 51$. Graph $G_1$ was generated as an Erdős–Rényi graph with connection probability 10%, which led to 63 edges. We randomly turned 3 edges on and 3 edges off from $G_1$ to obtain $G_2$, and similarly constructed $G_3$ from $G_2$. As a result, each of the pairs $(G_1, G_2)$ and $(G_2, G_3)$ shares about 90% of its edges, whereas $(G_1, G_3)$ shares about 80%. Given the graphs, the precision matrices and observations $y_i$ were generated in the same way as in Scenario III. To apply GGMx in this setting, we let $x_{ij}$ be a binary indicator such that $x_{ij} = 1$ if observation $i$ belongs to group $j$ for $j = 1, 2$, and $x_{ij} = 0$ for $j = 1, 2$ if observation $i$ belongs to group 3.

Scenario V.

We have considered continuous covariates (Scenarios I-II), a discrete covariate (Scenario IV), and no relevant covariates (Scenario III). Here, we included a scenario with one continuous covariate and one discrete covariate. We generated the data following Scenario I with one covariate replaced by a Bernoulli(0.5) variable and the corresponding coefficients $\beta_{jk\ell}$ set to $\pm 0.5$ with equal probability.

Scenario VI.

We considered a scenario without assuming $\omega_{ijj}$ to be constant; instead we set $g_{jj}(x_i) = 0.1 + 0.2 x_{i1} + 0.2 x_{i2}$ and $\omega_{ijj} = \exp\{g_{jj}(x_i)\}$. For the off-diagonal elements, we randomly included 2% of the edges, and the corresponding $\beta_{jk\ell}$ for $j < k$ was set to 0.7. The covariate $x_{i\ell}$ was generated as $x_{i\ell} \overset{iid}{\sim} 2\,\mathrm{Beta}(2, 1)$. The resulting precision matrix $\Omega_i$ might not be positive definite for all observations $i = 1, \ldots, n$; we repeated the process until $\Omega_i > 0$ for all $i$. Then the observation $y_i$ was drawn as $y_i \overset{ind}{\sim} N(0, \Omega_i^{-1})$.

Scenario VII.

To illustrate that GGMx can be used to recover time-varying GGMs, we reduced the number of covariates from Scenario VI to $q = 1$.

6.2. Methods under Consideration

We compared the proposed GGMx with six competing methods: Bayesian Gaussian graphical models (Mohammadi et al., 2015), graphical lasso (Friedman et al., 2008), kernel graphical lasso (Liu et al., 2010a), fused graphical lasso, group graphical lasso (Danaher et al., 2014), and Bayesian multiple Gaussian graphical model (Shaddox et al., 2018).

Bayesian Gaussian graphical models (BGGMs) assume an i.i.d. multivariate Gaussian likelihood, the G-Wishart prior on the precision matrix, $\Omega \sim W_G(b, D)$, and a uniform prior on the graph $G$. The G-Wishart prior is conjugate to the multivariate Gaussian likelihood. However, due to the intractable prior normalizing constant of the G-Wishart prior, a non-trivial MCMC algorithm is required for posterior inference. We use an efficient trans-dimensional MCMC algorithm proposed by Mohammadi et al. (2015) based on a continuous-time birth-death process.

Graphical lasso (glasso) is a penalized likelihood approach that maximizes the objective function $\log|\Omega| - \mathrm{tr}(S\Omega) - \lambda \|\Omega\|_1$, where $S$ is the sample covariance matrix. The first two terms are the Gaussian log-likelihood and the last term is an $\ell_1$ penalty, which induces sparsity in $\Omega$. The optimization is solved using a coordinate descent algorithm.

Both BGGM and glasso assume i.i.d. sampling and are designed to infer networks that do not change with covariates. For a fairer comparison, we implemented the kernel graphical lasso (k-glasso) approach outlined in Liu et al. (2010a). K-glasso is a modification of glasso with the sample covariance matrix $S$ replaced by a covariate-dependent covariance matrix obtained via kernel smoothing. Specifically, let

$$S(x) = \frac{\sum_{i=1}^n K\!\left(\frac{\|x - x_i\|}{h}\right)(y_i - \mu(x))(y_i - \mu(x))^T}{\sum_{i=1}^n K\!\left(\frac{\|x - x_i\|}{h}\right)},$$

with

$$\mu(x) = \frac{\sum_{i=1}^n K\!\left(\frac{\|x - x_i\|}{h}\right) y_i}{\sum_{i=1}^n K\!\left(\frac{\|x - x_i\|}{h}\right)},$$

where $\|\cdot\|$ is the Euclidean norm, $h > 0$ is the bandwidth, and $K(\cdot)$ is a Gaussian kernel. A sparse estimate of $\Omega_i$ is then obtained by applying glasso with $S = S(x_i)$: $\hat{\Omega}_i = \arg\max_\Omega \{\log|\Omega| - \mathrm{tr}(S(x_i)\Omega) - \lambda_i \|\Omega\|_1\}$.
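A sketch of the kernel-weighted covariance computation (with a Gaussian kernel as above; glasso would then be applied to $S(x_i)$ for each observation):

```python
import numpy as np

def kernel_cov(x, X, Y, h=0.5):
    """Kernel-smoothed covariance S(x). X: (n, q) covariates, Y: (n, p) data."""
    w = np.exp(-0.5 * (np.linalg.norm(X - x, axis=1) / h) ** 2)  # Gaussian kernel
    w = w / w.sum()
    mu = w @ Y                          # kernel-smoothed mean mu(x)
    Yc = Y - mu
    return (w[:, None] * Yc).T @ Yc     # weighted covariance S(x)
```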

As pointed out in Section 3, the proposed GGMx is a multiple graphical model when the covariates are categorical. Multiple graphical models assume that observations are divided into $C$ groups; the goal is to jointly estimate group-specific sparse precision matrices $\Omega^{(c)}$, $c = 1, \ldots, C$. Since the grouping of observations can be represented by a categorical variable, GGMx is able to learn group-specific graphs. For comparison, we consider three alternative multiple graphical model approaches: the two penalized approaches proposed in Danaher et al. (2014), fused graphical lasso (FGL) and group graphical lasso (GGL), and the Bayesian multiple Gaussian graphical model (MGGM) proposed by Shaddox et al. (2018). Both penalized algorithms maximize the following objective with respect to positive definite matrices $\{\Omega^{(c)}\}_{c=1}^C$,

$$\sum_{c=1}^C n_c \left\{\log|\Omega^{(c)}| - \mathrm{tr}(S^{(c)} \Omega^{(c)})\right\} - P\big(\{\Omega^{(c)}\}_{c=1}^C\big),$$

where $n_c$ is the sample size of group $c$, $S^{(c)}$ is the sample covariance matrix of group $c$, and $P(\cdot)$ is a penalty that encourages sparsity and similarity of $\{\Omega^{(c)}\}_{c=1}^C$. The penalty is $\lambda_1 \sum_{c=1}^C \sum_{j \neq k} |\omega_{jk}^{(c)}| + \lambda_2 \sum_{c < c'} \sum_{j,k} |\omega_{jk}^{(c)} - \omega_{jk}^{(c')}|$ for FGL and $\lambda_1 \sum_{c=1}^C \sum_{j \neq k} |\omega_{jk}^{(c)}| + \lambda_2 \sum_{j \neq k} \big(\sum_{c=1}^C \omega_{jk}^{(c)2}\big)^{1/2}$ for GGL.

Finally, MGGM uses local priors on sparse precision matrices (Wang, 2015) and can be thought of as the local-prior counterpart of the proposed method in the multiple graphs setting; comparisons with this method only pertain to Scenario IV.

For GGMx, we set the hyperparameters $a_\tau = b_\tau = 10^{-1}$, $\mu_t = 1$, and $\sigma_t = 0.2$; these choices are tested in sensitivity analyses at the end of this section. Both GGMx and BGGM were run for 10,000 iterations with 5,000 burn-in. The regularization parameter of glasso was selected by the stability approach (Liu et al., 2010b) implemented in the R package huge. The tuning parameters $\lambda_1$ and $\lambda_2$ of FGL and GGL were selected based on the approximated Akaike Information Criterion (AIC) as suggested by Danaher et al. (2014); a 20 × 20 grid evenly spaced between 0.05 and 0.5 for $\lambda_1$, and between 0.001 and 0.01 for $\lambda_2$, was used. Likewise, the tuning parameters $\lambda_i$ and $h_i$ of k-glasso were selected based on AIC on a 20 × 20 grid over $[0.1, 1] \times [0.1, 1]$ for each observation $i = 1, \ldots, n$. All results are based on 50 repeated simulations.

6.3. Simulation Results

To assess the graph recovery performance, we computed true positive rate (TPR), false discovery rate (FDR), and Matthews correlation coefficient (MCC),

$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \quad \mathrm{FDR} = \frac{\mathrm{FP}}{\mathrm{TP} + \mathrm{FP}}, \quad \mathrm{MCC} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FP} \times \mathrm{FN}}{\sqrt{(\mathrm{TP}+\mathrm{FP})(\mathrm{TP}+\mathrm{FN})(\mathrm{TN}+\mathrm{FP})(\mathrm{TN}+\mathrm{FN})}},$$

where TP, FP, TN, and FN stand for true positives, false positives, true negatives, and false negatives. MCC takes values between −1 and 1, with 1 indicating perfect graph recovery and 0 indicating random guessing. In addition, we scrutinized the edges whose inclusion probability is considerably affected by the covariates' values. Hence, we introduced three further measures: partial TPR (pTPR), partial FDR (pFDR), and partial MCC (pMCC), which are simply TPR, FDR, and MCC restricted to the edges whose true frequency of inclusion across observations lies between 0.1 and 0.9. We report all the metrics in Figure 5. Overall, GGMx had robust, superior performance (with high true positive and low false discovery rates) across all scenarios.
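A sketch of these metrics computed from estimated and true adjacency matrices (upper triangles only, since the graphs are undirected):

```python
import numpy as np

def graph_metrics(est, truth):
    """TPR, FDR, and MCC for edge recovery from boolean adjacency matrices."""
    iu = np.triu_indices_from(truth, k=1)
    e, t = est[iu].astype(bool), truth[iu].astype(bool)
    TP, FP = (e & t).sum(), (e & ~t).sum()
    TN, FN = (~e & ~t).sum(), (~e & t).sum()
    tpr = TP / (TP + FN)
    fdr = FP / max(TP + FP, 1)
    denom = np.sqrt(float((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)))
    mcc = (TP * TN - FP * FN) / denom if denom > 0 else 0.0
    return tpr, fdr, mcc
```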

Figure 5: Simulations. Operating characteristics averaged over 50 repeated simulations under seven scenarios. Graph interpolation is not shown as it is similar to graph estimation. pMCC is 0 when no missing edge is detected. pTPR, pFDR, and pMCC are not available for Scenarios III and IV.

In Scenario I, GGMx clearly outperformed BGGM and glasso in all six measures. This was expected because the data were generated from the proposed model and all edges were associated with covariates. BGGM and glasso assume i.i.d. sampling and therefore did not perform well. Although k-glasso was much better than BGGM and glasso, it is clear that GGMx performed significantly better than k-glasso in all metrics. In addition, GGMx can interpolate graph structure given new covariates. The results of graph interpolation (not shown) were very similar to those of graph estimation.

In Scenario II, BGGM appeared comparable to GGMx in TPR, FDR, and MCC. This is because, in the simulation truth, there were many more nearly constant edges than highly varying edges. In many cases, including the application, it is interesting to focus on highly varying edges as they are the most differential across observations. Not surprisingly, GGMx had favorable performance compared to BGGM and glasso in terms of pTPR, pFDR, and pMCC. K-glasso was not able to pick up the signals in this scenario, which mimicked the real data.

In Scenario III where there was no relationship between graph and covariates, BGGM outperformed GGMx, glasso, and k-glasso. But GGMx still had a reasonably good performance with the lowest FDR and substantially better overall performance than glasso and k-glasso.

In Scenario IV (multiple graphical models), GGL, FGL, and MGGM had higher TPR compared to GGMx, however, at the price of higher FDR. Consequently, GGMx outperformed GGL, FGL, and MGGM in terms of FDR and the overall measure MCC.

In Scenario V, GGL and FGL were applied ignoring the continuous covariate whereas k-glasso was applied ignoring the discrete covariate. GGMx was able to simultaneously incorporate both continuous and discrete covariates in estimating graphs and therefore as expected it had the best performance compared to BGGM, glasso, k-glasso, GGL, and FGL in practically all measures.

In Scenario VI, where the diagonal elements of $\Omega_i$ were not constrained to be constant, the results were consistent with those in Scenarios I-V. BGGM and glasso outperformed k-glasso overall, but k-glasso was much better with respect to the selection of edges with substantial variability (measured by pTPR, pFDR, and pMCC). The proposed GGMx was clearly the best in both the overall and the partial measures. For example, GGMx had considerably higher MCC as well as pMCC than all the competing methods. In addition, we also evaluated the estimation accuracy of $\Omega_i$ by computing the mean squared error (MSE), again focusing on edges with true frequency of inclusion across observations between 0.1 and 0.9. The resulting MSEs were 0.10, 1.24, 0.44, and 0.82 for GGMx, BGGM, glasso, and k-glasso, respectively, which demonstrates the capability of the proposed GGMx in capturing the heterogeneity in $\Omega_i$.

In Scenario VII, the main conclusions stayed the same as in Scenario VI, although k-glasso had significantly reduced FDR, at the price of significantly reduced TPR. GGMx, on the other hand, demonstrated stable performance across all scenarios and all measures.

Lastly, we assessed the sensitivity of GGMx to the choice of the hyperparameters $(a_\tau, b_\tau)$ and $(\mu_t, \sigma_t)$. We picked Scenario VII and varied the hyperparameters over the following ranges: $(a_\tau, b_\tau) \in \{(10^{-2}, 10^{-2}), (10^{-3}, 10^{-3}), (10^{-4}, 10^{-4})\}$ and $(\mu_t, \sigma_t) \in \{(1.0, 0.5), (1.0, 1.0), (1.5, 1.0)\}$³. The performance of GGMx with different hyperparameters is reported in Table 2, which shows that GGMx is robust within the considered range.

Table 2:

Sensitivity Analysis. Operating characteristics for simulations under six alternative hyperparameter settings. The numbers are calculated on the basis of 50 repetitions; standard deviations are within parentheses. The first row shows the performance of GGMx in Scenario VII with the default hyperparameter setting $(a_\tau, b_\tau) = (10^{-1}, 10^{-1})$ and $(\mu_t, \sigma_t) = (1, 0.2)$.

| Hyperparameters | TPR | FDR | MCC | pTPR | pFDR | pMCC |
|---|---|---|---|---|---|---|
| Default setting | 0.94 (0.04) | 0.20 (0.12) | 0.86 (0.07) | 0.94 (0.04) | 0.01 (0.01) | 0.81 (0.09) |
| $(a_\tau, b_\tau) = (10^{-2}, 10^{-2})$ | 0.93 (0.04) | 0.15 (0.10) | 0.89 (0.06) | 0.93 (0.04) | 0.01 (0.01) | 0.81 (0.09) |
| $(a_\tau, b_\tau) = (10^{-3}, 10^{-3})$ | 0.93 (0.04) | 0.10 (0.08) | 0.91 (0.05) | 0.93 (0.04) | 0.01 (0.01) | 0.80 (0.08) |
| $(a_\tau, b_\tau) = (10^{-4}, 10^{-4})$ | 0.93 (0.04) | 0.07 (0.07) | 0.93 (0.04) | 0.93 (0.04) | 0.01 (0.01) | 0.80 (0.08) |
| $(\mu_t, \sigma_t) = (1.0, 0.5)$ | 0.97 (0.03) | 0.33 (0.12) | 0.80 (0.08) | 0.97 (0.03) | 0.02 (0.02) | 0.82 (0.07) |
| $(\mu_t, \sigma_t) = (1.0, 1.0)$ | 0.98 (0.02) | 0.30 (0.13) | 0.82 (0.08) | 0.98 (0.02) | 0.03 (0.02) | 0.81 (0.08) |
| $(\mu_t, \sigma_t) = (1.5, 1.0)$ | 0.97 (0.03) | 0.19 (0.11) | 0.88 (0.07) | 0.97 (0.03) | 0.03 (0.02) | 0.81 (0.08) |

7. Application in Multiple Myeloma

We present an application of GGMx to modeling transcriptomic regulation in multiple myeloma (MM), a late-stage malignancy of plasma cells. Recent research has shifted the focus from traditional "one size fits all" therapies to precision medicine strategies because MM is a highly heterogeneous genetic disease at the individual level (Hervé et al., 2011). To find better personalized treatments and more accurate prescriptive recommendations for MM patients, a better understanding of the heterogeneity based on genomically defined pathways is needed (Lohr et al., 2014). We use data generated by the Multiple Myeloma Research Consortium, a multi-institutional collaborative research effort that collected data on, among other things, gene expression and clinical parameters from MM patients (Chapman et al., 2011).

We focus our analyses on the genes mapped to one of the most important pathways in MM, the NF-κB signaling pathway. Activation of the NF-κB pathway has been implicated in MM, but the genomic foundation of this activation is only partially understood (Demchenko et al., 2010; Roy et al., 2018). The clinical information includes measurements of two important prognostic factors, serum beta-2 microglobulin (Sβ2M) and serum albumin. The International Staging System (Greipp et al., 2005) uses these two prognostic factors to stage MM: stage I, Sβ2M < 3.5 mg/L and serum albumin ≥ 3.5 g/dL; stage II, neither stage I nor III; and stage III, Sβ2M ≥ 5.5 mg/L. The observed values of Sβ2M and serum albumin, and the staging partition, are depicted in Figure 6. We use these two prognostic factors as covariates (q = 2).

Figure 6: Observed prognostic factors are shown as crosses and dots. Dots are chosen as representative cases for network visualization in Figure 8. Triangles are used to interpolate networks for unseen patients, shown in Figure 9. The prognostic covariate space is partitioned into Stages I, II, and III according to the International Staging System for multiple myeloma.

The goal of this study was to infer subject-level gene expression networks whose structures are modified by the prognostic factors. After removing outliers and samples with missing gene expression or clinical information, we had n = 151 samples and p = 33 genes. We ran two separate MCMC chains, each with 50,000 iterations, discarded the first 50% as burn-in, and saved every 50th sample after burn-in. To check MCMC convergence, we calculated the potential scale reduction factor (PSRF, Gelman et al. 1992) for each entry in $\Omega_i$, $i = 1, \ldots, n$. The median PSRF was 1.00 with an interquartile range of 0.01, which indicated no lack of convergence. We then concatenated the two chains, and all subsequent inference was based on the combined Monte Carlo samples. The probability cutoff c was chosen to control the posterior expected FDR at 1%.

Population-level inference

The estimated graphs had 30 edges per subject on average, with a minimum of 20 edges (from a stage III patient) and a maximum of 37 edges (from a stage I patient). We summarized a population-level gene expression network $G = (V, E)$ as the union of all networks across subjects, $E = \cup_{i=1}^n E_i$. There were $|E| = 42$ edges in $G$. To visualize the graph variability, we computed the variance of edge inclusion. Specifically, let $e_{jk} = (e_{1jk}, \ldots, e_{njk})$ be a binary vector such that $e_{ijk} = 1$ if $\{j,k\} \in E_i$. Then, for edge $\{j,k\}$, the variance of edge inclusion was defined as the sample variance of $e_{jk}$. The population-level network is reported in Figure 7, with the edge width proportional to the edge inclusion variability. We found 14 out of 42 edges with variance greater than 0.2 (note that the maximum variance is 0.25 for a Bernoulli random variable). These 14 edges appeared in about 30%-70% of the patients. In line with our simulation studies, traditional GGMs are unlikely to accurately capture these differential edges.
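A sketch of the edge-inclusion variance summary (assuming the subject-level adjacency matrices are stacked in an $n \times p \times p$ array):

```python
import numpy as np

def edge_inclusion_variance(adjacencies):
    """adjacencies: (n, p, p) boolean array, adjacencies[i] = graph of subject i.

    Returns the (p, p) matrix of sample variances of e_jk across subjects;
    the value is at most 0.25 for a binary indicator.
    """
    e = adjacencies.astype(float)
    return e.var(axis=0, ddof=1)
```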

Figure 7: Population-level summary of the gene expression network. The network is the union of all networks across subjects. The edge width is proportional to the edge inclusion variability.

Subject-level inference

Next, we focus on subject-level inference. We chose 6 representative patients, 2 from each stage, and show their respective networks in Figure 8. The values of their prognostic factors are represented by the dots in Figure 6. We set the edge width proportional to the absolute value of the partial correlation $\rho_{ijk} = \omega_{ijk} / \sqrt{\omega_{ijj}\, \omega_{ikk}}$, and use solid lines to represent positive partial correlations and dashed lines negative partial correlations.

Figure 8: Subject-level networks for six representative patients, represented as dots in Figure 6. The edge width is proportional to the absolute value of the partial correlation. The sign of the partial correlation is represented by the line type: solid for positive and dashed for negative.

We highlight several interesting biological findings. RELB was found to be a highly connected gene across all patients (Figures 7 and 8). RELB is a core member of the NF-κB family, so it is not surprising that RELB played an important role in the NF-κB pathway. In fact, many MM patients have abnormal NF-κB target gene expression associated with genetic aberrations of NFKB1 and NFKB2 (Annunziata et al., 2007), which is consistent with our finding that RELB was consistently positively associated with NFKB1 and NFKB2. In addition, NFKBIA is an inhibitor of NF-κB, which is consistent with our finding that NFKBIA was negatively associated with RELB across patients. It is also known that genes in the same family tend to be positively associated with each other; our study found positive links such as BIRC2—BIRC3 and NFKBIA—NFKBIZ. As the disease progresses, some paths get blocked and some new connections are acquired. Among others, the link between LTB and TNFRSF13B was found in stage III patients but not in stage I patients, whereas the link between NFKBIL2 and MAP3K7IP2 was lost in stage III patients. While some of these links are well documented in the biological literature (Liu et al., 2017), their gain and loss mechanisms need further validation and investigation.

Finally, as new patients come into the clinic, GGMx can be used to quickly predict the individualized gene network based only on blood test results for Sβ2M and serum albumin, without costly and time-consuming whole-genome sequencing. For illustration, we picked two sets of covariates that were unobserved in our collected data; they are represented by triangles in Figure 6. The estimated gene expression networks of the two hypothetical patients are shown in Figure 9, enabled by the unique graph interpolation feature of the proposed GGMx.

Figure 9: Network interpolation for two sets of unseen prognostic factors, represented as triangles in Figure 6.

8. Discussion

In this article, we introduced a general regression framework for (undirected) Gaussian graphical models with covariates (GGMx). This generalization of regular GGMs beyond i.i.d. data allows the graph structure and strength to change with covariates and is particularly challenging in the undirected graph context due to the positive definiteness constraint on the precision matrix. We addressed this challenge through a novel prior that is theoretically connected to non-local priors for precision matrices, paired with a carefully designed MCMC algorithm for efficient posterior inference. GGMx includes at least five special cases: standard GGMs, group-specific GGMs, time-varying GGMs, covariate-dependent GGMs, and context-specific GGMs. We demonstrated the utility and robustness of GGMx through extensive simulations and an application in precision oncology. Our GGMx framework is broadly applicable to many other scientific domains. For example, in brain functional magnetic resonance imaging data, GGMx could be used to study how brain connectivity networks change with covariates such as time and stimuli.

We remark that covariance regression (Hoff and Niu, 2012; Fox and Dunson, 2015) is a closely related model. It is, however, fundamentally different from the proposed GGMx in at least two ways. First, covariance regression assumes the covariance matrix, rather than the precision matrix, to be a function of $x_i$, which takes the specific form $\Sigma_i = \Omega_i^{-1} = \Psi + \Lambda(x_i)\Lambda(x_i)^T$ for some PDM $\Psi$ and matrix-valued function $\Lambda(x_i)$. Second, covariance regression assumes a dense $\Sigma_i$, whereas GGMx allows $\Omega_i$ to be sparse and, moreover, allows the sparsity pattern to change with covariates. Note that zeros in $\Lambda(x_i)$ generally do not translate to zeros in $\Sigma_i$ or $\Omega_i$; therefore, it is not straightforward to extend the covariance regression framework to allow sparsity.

While our work is a useful first step for undirected graphical regression, several extensions and refinements are possible. We have chosen the smooth covariate-dependent functions $g_{jk}(\cdot)$ to be linear for simplicity and parsimony; the same choice has been made in similar papers (Cheng et al., 2014). In general, however, they can be replaced by nonlinear functions. For example, letting $\tilde{x}$ denote some basis expansion of $x$, such as splines or wavelets, we can model $g_{jk}(x) = \beta_{jk}^T \tilde{x}$, and the same inference procedure as with linear functions applies. We plan to incorporate nonlinearity in future work. Furthermore, we have worked with a moderate number of variables for several reasons. First, the number of parameters that need to be estimated in GGMx is on the order of $\frac{p(p+1)(q+2)}{2} + \frac{p(p-1)}{2}$, which can be large even for a moderate number of variables and covariates. Second, from an application perspective, we focus on a specific signaling pathway in multiple myeloma, NF-κB, for deeper scientific interpretation. The small sample size (relative to the number of parameters) does not allow reliable inference for a much larger number of variables (e.g., the entire transcriptomic profile). Finally, the scalability of the proposed GGMx also limits the number of variables under consideration. The scalability can potentially be improved by adopting more efficient MCMC algorithms such as the Metropolis-adjusted Langevin algorithm (Roberts et al., 1996) or Hamiltonian Monte Carlo (Duane et al., 1987). Both algorithms take advantage of gradient information of the target distribution; however, the hard thresholding function in (3) is discontinuous. This difficulty can potentially be overcome by considering a continuous relaxation of the hard thresholding function (Cai et al., 2018). Another potential solution is to resort to variational Bayes algorithms, which approximate the posterior distribution by simpler variational distributions through minimizing the Kullback–Leibler divergence between them. We hope to address the scalability issue in future work.

Acknowledgement

YN was partially supported by NSF DMS-2112943. VB was partially supported by NIH grants R01CA244845-01A1 and P30 CA-046592 and start-up funds from the U-M Rogel Cancer Center and School of Public Health.

Footnotes

1. This is a conceptual statement. Note that existing time-varying GGM methods typically assume that graphs vary non-linearly with time, whereas this paper considers linearly varying graphs.

2. Note that the probability cutoff $c$ is different from the random threshold $t_{jk}$. The random threshold is a model parameter used to induce sparsity, whereas the probability cutoff is introduced to obtain a posterior point estimate of the graph.

3. The resulting prior means (variances) of the minimum effect size $t_{jk}$ are 1.0 (0.22), 1.3 (0.63), and 1.6 (0.77).

Contributor Information

Yang Ni, Department of Statistics, Texas A&M University, College Station, TX 77843, USA.

Francesco C. Stingo, Department of Statistics, Computer Science, Applications “G. Parenti”, University of Florence, Florence, Italy.

Veerabhadran Baladandayuthapani, Department of Biostatistics, University of Michigan, Ann Arbor, MI 48109, USA.

References

  1. Altomare Davide, Consonni Guido, and La Rocca Luca. Objective Bayesian search of Gaussian directed acyclic graphical models for ordered variables with non-local priors. Biometrics, 69(2):478–487, 2013.
  2. Annunziata Christina M, Davis R Eric, Demchenko Yulia, Bellamy William, Gabrea Ana, Zhan Fenghuang, Lenz Georg, Hanamura Ichiro, Wright George, Xiao Wenming, et al. Frequent engagement of the classical and alternative NF-κB pathways by diverse genetic abnormalities in multiple myeloma. Cancer Cell, 12(2):115–130, 2007.
  3. Bhadra Anindya and Mallick Bani K. Joint high-dimensional Bayesian variable and covariance selection with an application to eQTL analysis. Biometrics, 69(2):447–457, 2013.
  4. Cai Qingpo, Kang Jian, Yu Tianwei, et al. Bayesian network marker selection via the thresholded graph Laplacian Gaussian prior. Bayesian Analysis, 2018.
  5. Chapman Michael A, Lawrence Michael S, Keats Jonathan J, Cibulskis Kristian, Sougnez Carrie, Schinzel Anna C, Harview Christina L, Brunet Jean-Philippe, Ahmann Gregory J, Adli Mazhar, et al. Initial genome sequencing and analysis of multiple myeloma. Nature, 471(7339):467–472, 2011.
  6. Cheng Jie, Levina Elizaveta, Wang Pei, and Zhu Ji. A sparse Ising model with covariates. Biometrics, 2014.
  7. Danaher Patrick, Wang Pei, and Witten Daniela M. The joint graphical lasso for inverse covariance estimation across multiple classes. Journal of the Royal Statistical Society: Series B, 76(2):373–397, 2014.
  8. Demchenko Yulia N, Glebov Oleg K, Zingone Adriana, Keats Jonathan J, Bergsagel P Leif, and Kuehl W Michael. Classical and/or alternative NF-κB pathway activation in multiple myeloma. Blood, 115(17):3541–3552, 2010.
  9. Dobra Adrian, Hans Chris, Jones Beatrix, Nevins Joseph R, Yao Guang, and West Mike. Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis, 90(1):196–212, 2004.
  10. Dobra Adrian, Lenkoski Alex, and Rodriguez Abel. Bayesian inference for general Gaussian graphical models with application to multivariate lattice data. Journal of the American Statistical Association, 106(496):1418–1433, 2011.
  11. Drton Mathias and Maathuis Marloes H. Structure learning in graphical modeling. Annual Review of Statistics and Its Application, 4:365–393, 2017.
  12. Duane Simon, Kennedy Anthony D, Pendleton Brian J, and Roweth Duncan. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
  13. Fox Emily B and Dunson David B. Bayesian nonparametric covariance regression. The Journal of Machine Learning Research, 16(1):2501–2542, 2015.
  14. Friedman Jerome, Hastie Trevor, and Tibshirani Robert. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.
  15. Gan Lingrui, Narisetty Naveen N, and Liang Feng. Bayesian regularization for graphical models with unequal shrinkage. Journal of the American Statistical Association, 114(527):1218–1231, 2019.
  16. Gelman Andrew and Rubin Donald B. Inference from iterative simulation using multiple sequences. Statistical Science, 7(4):457–472, 1992.
  17. Green Peter J and Thomas Alun. Sampling decomposable graphs using a Markov chain on junction trees. Biometrika, 100(1):91, 2013.
  18. Greipp Philip R, San Miguel Jesus, Durie Brian GM, et al. International staging system for multiple myeloma. Journal of Clinical Oncology, 23(15):3412–3420, 2005.
  19. Guo Jian, Levina Elizaveta, Michailidis George, and Zhu Ji. Joint estimation of multiple graphical models. Biometrika, 98(1):1–15, 2011.
  20. Avet-Loiseau Hervé, Magrangeas Florence, Moreau Philippe, Attal Michel, Facon Thierry, Anderson Kenneth, Harousseau Jean-Luc, Munshi Nikhil, and Minvielle Stéphane. Molecular heterogeneity of multiple myeloma: pathogenesis, prognosis, and therapeutic implications. Journal of Clinical Oncology, 29(14):1893–1897, 2011.
  21. Hoff Peter D and Niu Xiaoyue. A covariance regression model. Statistica Sinica, pages 729–753, 2012.
  22. Johnson Valen E and Rossell David. On the use of non-local prior densities in Bayesian hypothesis tests. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(2):143–170, 2010.
  23. Johnson Valen E and Rossell David. Bayesian model selection in high-dimensional settings. Journal of the American Statistical Association, 107(498):649–660, 2012.
  24. Khare Kshitij, Rajaratnam Bala, and Saha Abhishek. Bayesian inference for Gaussian graphical models beyond decomposable graphs. Journal of the Royal Statistical Society: Series B, 80(4):727–747, 2018.
  25. Kolar Mladen, Parikh Ankur P, and Xing Eric P. On sparse nonparametric conditional covariance selection. In ICML-10, pages 559–566, 2010.
  26. Liu Han, Chen Xi, Wasserman Larry, and Lafferty John D. Graph-valued regression. In Advances in Neural Information Processing Systems, pages 1423–1431, 2010a.
  27. Liu Han, Roeder Kathryn, and Wasserman Larry. Stability approach to regularization selection (StARS) for high dimensional graphical models. In Advances in Neural Information Processing Systems, pages 1432–1440, 2010b.
  28. Liu Ting, Zhang Lingyun, Joo Donghyun, and Sun Shao-Cong. NF-κB signaling in inflammation. Signal Transduction and Targeted Therapy, 2:17023, 2017.
  29. Lohr Jens G, Stojanov Petar, Carter Scott L, et al. Widespread genetic heterogeneity in multiple myeloma: implications for targeted therapy. Cancer Cell, 25(1):91–101, 2014.
  30. Massam Hélène. Bayesian inference in graphical Gaussian models. Handbook of Graphical Models, pages 257–282, 2018.
  31. Meinshausen Nicolai and Bühlmann Peter. High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
  32. Mohammadi Abdolreza, Wit Ernst C, et al. Bayesian structure learning in sparse Gaussian graphical models. Bayesian Analysis, 10(1):109–138, 2015.
  33. Ni Yang, Müller Peter, Zhu Yitan, and Ji Yuan. Heterogeneous reciprocal graphical models. Biometrics, 74(2):606–615, 2018.
  34. Ni Yang, Stingo Francesco C, and Baladandayuthapani Veerabhadran. Bayesian graphical regression. Journal of the American Statistical Association, 114(525):184–197, 2019.
  35. Nyman Henrik, Pensar Johan, and Corander Jukka. Stratified Gaussian graphical models. Communications in Statistics-Theory and Methods, 46(11):5556–5578, 2017.
  36. Oates Chris J, Korkola Jim, Gray Joe W, Mukherjee Sach, et al. Joint estimation of multiple related biological networks. The Annals of Applied Statistics, 8(3):1892–1919, 2014.
  37. Peterson Christine, Stingo Francesco C, and Vannucci Marina. Bayesian inference of multiple Gaussian graphical models. Journal of the American Statistical Association, 110(509):159–174, 2015.
  38. Roberts Gareth O, Tweedie Richard L, et al. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
  39. Rossell David and Telesca Donatello. Nonlocal priors for high-dimensional estimation. Journal of the American Statistical Association, 112(517):254–265, 2017.
  40. Rothman Adam J, Levina Elizaveta, and Zhu Ji. Sparse multivariate regression with covariance estimation. Journal of Computational and Graphical Statistics, 19(4):947–962, 2010.
  41. Roverato Alberto. Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics, 29(3):391–411, 2002.
  42. Roy Payel, Sarkar Uday Aditya, and Basak Soumen. The NF-κB activating pathways in multiple myeloma. Biomedicines, 6(2):59, 2018.
  43. Scott James G and Carvalho Carlos M. Feature-inclusion stochastic search for Gaussian graphical models. Journal of Computational and Graphical Statistics, 17(4):790–808, 2008.
  44. Shaddox Elin, Stingo Francesco C, Peterson Christine B, Jacobson Sean, Cruickshank-Quinn Charmion, Kechris Katerina, Bowler Russell, and Vannucci Marina. A Bayesian approach for learning gene networks underlying disease severity in COPD. Statistics in Biosciences, 10(1):59–85, 2018.
  45. Shin Minsuk, Bhattacharya Anirban, and Johnson Valen E. Scalable Bayesian variable selection using nonlocal prior densities in ultrahigh-dimensional settings. Statistica Sinica, 28(2):1053, 2018.
  46. Sudderth Erik B, Wainwright Martin J, and Willsky Alan S. Embedded trees: Estimation of Gaussian processes on graphs with cycles. IEEE Transactions on Signal Processing, 52(11):3136–3150, 2004.
  47. Wang Hao. Scaling it up: Stochastic search structure learning in graphical models. Bayesian Analysis, 10(2):351–377, 2015.
  48. Wang Hao et al. Bayesian graphical lasso models and efficient posterior computation. Bayesian Analysis, 7(4):867–886, 2012.
  49. Whittaker Joe. Graphical models in applied multivariate statistics. Wiley Publishing, 2009.
  50. Xie Yuying, Liu Yufeng, and Valdar William. Joint estimation of multiple dependent Gaussian graphical models with applications to mouse genomics. Biometrika, 103(3):493–511, 2016.
  51. Yajima Masanao, Telesca Donatello, Ji Yuan, and Müller Peter. Detecting differential patterns of interaction in molecular pathways. Biostatistics, 16(2):240–251, 2015.
  52. Yin Jianxin and Li Hongzhe. A sparse conditional Gaussian graphical model for analysis of genetical genomics data. The Annals of Applied Statistics, 5(4):2630, 2011.
  53. Yuan Ming and Lin Yi. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
  54. Zhou Shuheng, Lafferty John, and Wasserman Larry. Time varying undirected graphs. Machine Learning, 80(2-3):295–319, 2010.
