Author manuscript; available in PMC 2016 Feb 25. Published in final edited form as: J Am Stat Assoc. 2012 Dec 24; 106(496): 1418–1433. doi: 10.1198/jasa.2011.tm10465

Bayesian Inference for General Gaussian Graphical Models With Application to Multivariate Lattice Data

Adrian Dobra 1, Alex Lenkoski 2, Abel Rodriguez 3
PMCID: PMC4767185  NIHMSID: NIHMS745474  PMID: 26924867

Abstract

We introduce efficient Markov chain Monte Carlo methods for inference and model determination in multivariate and matrix-variate Gaussian graphical models. Our framework is based on the G-Wishart prior for the precision matrix associated with graphs that can be decomposable or non-decomposable. We extend our sampling algorithms to a novel class of conditionally autoregressive models for sparse estimation in multivariate lattice data, with a special emphasis on the analysis of spatial data. These models embed a great deal of flexibility in estimating both the correlation structure across outcomes and the spatial correlation structure, thereby allowing for adaptive smoothing and spatial autocorrelation parameters. Our methods are illustrated using a simulated example and a real-world application which concerns cancer mortality surveillance. Supplementary materials with computer code and the datasets needed to replicate our numerical results together with additional tables of results are available online.

Keywords: CAR model, G-Wishart distribution, Markov chain Monte Carlo (MCMC) simulation, Spatial statistics

1. INTRODUCTION

Graphical models (Lauritzen 1996), which encode the conditional independence among variables using a graph, have become a popular tool for sparse estimation in both the statistics and machine learning literatures (Dobra et al. 2004; Meinshausen and Bühlmann 2006; Yuan and Lin 2007; Banerjee, El Ghaoui, and D’Aspremont 2008; Drton and Perlman 2008; Friedman, Hastie, and Tibshirani 2008; Ravikumar, Wainwright, and Lafferty 2010). Implementing model selection approaches in the context of graphical models typically allows a dramatic reduction in the number of parameters under consideration, preventing overfitting and improving predictive capability. In particular, Bayesian approaches to inference in graphical models generate regularized estimators that incorporate model structure uncertainty.

The focus of this article is Bayesian inference in Gaussian graphical models (Dempster 1972) using the G-Wishart prior (Roverato 2002; Atay-Kayis and Massam 2005; Letac and Massam 2007). This class of distributions is extremely attractive since it represents the conjugate family for the precision matrix whose elements associated with edges not in the underlying graph are constrained to be equal to zero. Many recent articles have described various stochastic search methods for Gaussian graphical models (GGMs) with the G-Wishart prior based on marginal likelihoods which, in this case, are given by the ratio of the normalizing constants of the posterior and prior G-Wishart distributions—see the works of Atay-Kayis and Massam (2005), Jones et al. (2005), Carvalho and Scott (2009), Armstrong et al. (2009), Lenkoski and Dobra (2011) and the references therein. Wang and West (2009) proposed a MCMC algorithm for model determination and estimation in matrix-variate GGMs that also involves marginal likelihoods. Although the computation of marginal likelihoods for decomposable graphs is straightforward, similar computations for non-decomposable graphs or matrix-variate GGMs raise significant numerical challenges. This leads to the idea of devising Bayesian model determination methods that avoid the computation of marginal likelihoods.

The contributions of this article are threefold. First, we develop a new Metropolis–Hastings method for sampling from the G-Wishart distribution associated with an arbitrary graph. We discuss our algorithm in the context of the related sampling methods of Wang and Carvalho (2010) and Mitsakakis, Massam, and Escobar (2011), and empirically show that it scales better to graphs with many vertices. Second, we propose novel reversible jump MCMC samplers (Green 1995) for model determination and estimation in multivariate and matrix-variate GGMs. We contrast our approaches with the algorithms of Wong, Carter, and Kohn (2003) and Wang and West (2009) which focus on arbitrary GGMs and matrix-variate GGMs, respectively. Third, we devise a new flexible class of conditionally autoregressive models (CAR) (Besag 1974) for lattice data that rely on our novel sampling algorithms. The link between conditionally autoregressive models and GGMs was originally pointed out by Besag and Kooperberg (1995). However, since typical neighborhood graphs are non-decomposable, fully exploiting this connection within a Bayesian framework requires that we are able to estimate Gaussian graphical models based on general graphs. Our main focus is on applications to multivariate lattice data, where our approach based on matrix-variate GGMs provides a natural approach to create sparse multivariate CAR models.

The organization of the article is as follows. Section 2 formally introduces GGMs and the G-Wishart distribution, along with a description of our novel sampling algorithm for the G-Wishart distribution. Section 3 describes our reversible jump MCMC algorithm for GGMs. This algorithm represents a generalization of the work of Giudici and Green (1999) and is applicable beyond decomposable GGMs. Section 4 discusses inference and model determination in matrix-variate GGMs. Unlike the related framework developed by Wang and West (2009) which involves exclusively decomposable graphs, our sampler operates on the joint distribution of the row and column precision matrices, the row and column conditional independence graphs, and the auxiliary variable that needs to be introduced to solve the underlying non-identifiability problem associated with matrix-variate normal distributions. Section 5 reviews conditional autoregressive priors for lattice data and their connection to GGMs. This section discusses both univariate and multivariate models for continuous and discrete data based on generalized linear models. Section 6 presents two illustrations of our methodology: a simulation study and a cancer mortality mapping model. Finally, Section 7 concludes the article by discussing future directions for our work.

2. GAUSSIAN GRAPHICAL MODELS AND THE G–WISHART DISTRIBUTION

We let X = XVp, Vp = {1, 2, …, p}, be a random vector with a p-dimensional multivariate normal distribution Np(0, K−1). We consider a graph G = (Vp, E), where each vertex i ∈ Vp corresponds with a random variable Xi and E ⊆ Vp × Vp is the set of undirected edges. Here “undirected” means that (i, j) ∈ E if and only if (j, i) ∈ E. We denote by 𝒢p the set of all 2^{p(p−1)/2} undirected graphs with p vertices. A Gaussian graphical model with conditional independence graph G is constructed by constraining to zero the off-diagonal elements of K that do not correspond with edges in G (Dempster 1972). If (i, j) ∉ E, Xi and Xj are conditionally independent given the remaining variables. The precision matrix K = (Kij)_{1 ≤ i, j ≤ p} is constrained to the cone PG of symmetric positive definite matrices with off-diagonal entries Kij = 0 for all (i, j) ∉ E.

We consider the G-Wishart distribution WisG(δ, D) with density

p(K | G, δ, D) = IG(δ, D)⁻¹ (det K)^{(δ−2)/2} exp{ −(1/2) ⟨K, D⟩ },   (1)

with respect to the Lebesgue measure on PG (Roverato 2002; Atay-Kayis and Massam 2005; Letac and Massam 2007). Here ⟨A, B⟩ = tr(A^T B) denotes the trace inner product. The normalizing constant IG(δ, D) is finite if δ > 2 and D is positive definite (Diaconis and Ylvisaker 1979). If G is the full graph (E = Vp × Vp), WisG(δ, D) is the Wishart distribution Wisp(δ, D) (Muirhead 2005).

Since our sampling methods rely on perturbing the Cholesky decompositions of the matrix K, we review some key results. We write K ∈ PG as K = Q^T(Ψ^TΨ)Q, where Q = (Qij)_{1 ≤ i ≤ j ≤ p} and Ψ = (Ψij)_{1 ≤ i ≤ j ≤ p} are upper triangular, while D−1 = Q^TQ is the Cholesky decomposition of D−1. We see that K = Φ^TΦ, where Φ = ΨQ is the Cholesky decomposition of K. The zero constraints on the off-diagonal elements of K associated with G induce well-defined sets of free elements Φν(G) = {Φij : (i, j) ∈ ν(G)} and Ψν(G) = {Ψij : (i, j) ∈ ν(G)} of the matrices Φ and Ψ—see proposition 2, page 320, and lemma 2, page 326, of the article by Atay-Kayis and Massam (2005). Here ν(G) = ν=(G) ∪ ν<(G), with ν=(G) = {(i, i) : i ∈ Vp} and ν<(G) = {(i, j) : i < j and (i, j) ∈ E}.

We denote by Mν(G) the set of incomplete triangular matrices whose elements are indexed by ν(G) and whose diagonal elements are strictly positive. Both Φν(G) and Ψν(G) must belong to Mν(G). The non-free elements of Ψ are determined through the completion operation (Atay-Kayis and Massam 2005, lemma 2) as a function of the free elements Ψν(G). Each element Ψij with i < j and (i, j) ∉ E is a function of the other elements of Ψ that precede it in lexicographical order. Roverato (2002) proved that the Jacobian of the transformation that maps K ∈ PG to Φν(G) ∈ Mν(G) is J(K → Φν(G)) = 2^p ∏_{i=1}^{p} Φii^{νiG + 1}, where νiG = |{j : j > i and (i, j) ∈ E}|. Here |A| denotes the number of elements of the set A. Atay-Kayis and Massam (2005) showed that the Jacobian of the transformation that maps Φν(G) to Ψν(G) is given by J(Φν(G) → Ψν(G)) = ∏_{i=1}^{p} Qii^{diG + 1}, where diG = |{j : j < i and (i, j) ∈ E}|. We have det K = ∏_{i=1}^{p} Φii² and Φii = ΨiiQii. It follows that the density of Ψν(G) with respect to the Lebesgue measure on Mν(G) is

p(Ψν(G) | δ, D) = (2^p / IG(δ, D)) ∏_{i=1}^{p} Qii^{νiG + diG + δ} ∏_{i=1}^{p} Ψii^{νiG + δ − 1} × exp( −(1/2) ∑_{i=1}^{p} ∑_{j=i}^{p} Ψij² ).   (2)

We note that νiG + diG represents the number of neighbors of vertex i in the graph G.
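
To fix ideas, the following small R sketch (ours, not the authors' code) computes νiG, diG, and the free index set ν(G) for the 4-cycle graph, and checks that νiG + diG equals the degree of vertex i.

```r
# Illustrative R sketch (ours, not the authors' code): index sets of Section 2 for
# the 4-cycle graph 1-2-3-4-1, encoded through its 0/1 adjacency matrix.
p <- 4
adj <- matrix(0, p, p)
adj[cbind(c(1, 2, 3, 1), c(2, 3, 4, 4))] <- 1
adj <- adj + t(adj)

nu <- sapply(1:p, function(i) sum(adj[i, -(1:i)]))         # nu_i^G: neighbors j > i
d  <- sapply(1:p, function(i) sum(adj[i, seq_len(i - 1)])) # d_i^G: neighbors j < i
stopifnot(all(nu + d == rowSums(adj)))                     # nu_i^G + d_i^G = degree of i

# free positions nu(G): the diagonal plus the upper-triangular edge positions
free <- rbind(cbind(1:p, 1:p), which(upper.tri(adj) & adj == 1, arr.ind = TRUE))
```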

2.1 Existing Methods for Sampling From the G-Wishart Distribution

The problem of sampling from the G-Wishart distribution has received considerable attention in the recent literature. Piccioni (2000) followed by Asci and Piccioni (2007) exploited the theory of regular exponential families with cuts and proposed the block Gibbs sampler algorithm. Their iterative method generates one sample from WisG(δ, D) by performing sequential adjustments with respect to submatrices associated with each clique of G. The adjustment with respect to a clique C involves the inversion of a (p − |C|) × (p − |C|) matrix. If G has several cliques involving a small number of vertices but has a large number of vertices p, repeated inversions of high-dimensional matrices must be performed at each iteration. This makes the block Gibbs sampler algorithm impractical for being used in the context of large graphs.

Carvalho, Massam, and West (2007) gave a direct sampling method for WisG(δ, D) that works well for decomposable graphs. This algorithm was extended by Wang and Carvalho (2010) to non-decomposable graphs by introducing an accept–reject step as follows. Based on theorem 1, page 328, of the article of Atay-Kayis and Massam (2005), Wang and Carvalho (2010) and Mitsakakis, Massam, and Escobar (2011) wrote the density of Ψν(G) given in (2) as

p(Ψν(G) | δ, D) = (1 / Eh[f(Ψν(G))]) f(Ψν(G)) h(Ψν(G)),   (3)

where

log f(Ψν(G)) = −(1/2) ∑_{(i,j) ∉ ν(G), i < j} Ψij²

is a function of the non-free elements of Ψ which, in turn, are uniquely determined from the free elements in Ψν(G). Furthermore, h(Ψν(G)) is the product of mutually independent chi-squared and standard normal distributions, that is, Ψii² ~ χ²_{δ + νiG} for i ∈ Vp and Ψij ~ N(0, 1) for (i, j) ∈ ν<(G). The expectation Eh[f(Ψν(G))] is calculated with respect to h(·). Since f(Ψν(G)) ≤ 1 for any Ψν(G) ∈ Mν(G), we have

p(Ψν(G) | δ, D) ≤ M h(Ψν(G)),   (4)

where M = Eh⁻¹[f(Ψν(G))] ≥ 1. From algorithm A.4 of the book by Robert and Casella (2004), an accept–reject method for sampling from WisG(δ, D) proceeds by sampling Ψν(G) ~ h(·) and U ~ Uni(0, 1) until U ≤ f(Ψν(G)). The probability of acceptance of a random sample from h(·) as a random sample from WisG(δ, D) is

Pr(U ≤ f(Ψν(G))) = Eh[f(Ψν(G))].   (5)

Wang and Carvalho (2010) further decomposed G in its maximal prime subgraphs (Tarjan 1985), applied the accept–reject method for marginal or conditional distributions of WisG(δ, D) corresponding with each maximal prime subgraph, and generated a sample from WisG(δ, D) by putting together the resulting lower-dimensional sampled matrices.
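
To make the construction above concrete, here is a minimal R sketch of one accept–reject draw in the special case D = Ip, so that Q = Ip and Φ = Ψ; in this case the completion operation of lemma 2 in Atay-Kayis and Massam (2005) reduces to the simple recursion shown in the comments. This is an illustrative sketch of the basic step without graph decompositions, not the WC implementation, and the function names are ours.

```r
# Minimal R sketch of one accept-reject draw from Wis_G(delta, I_p), assuming D = I_p
# (so Q = I_p and Phi = Psi).  'adj' is the 0/1 adjacency matrix of G.
rgwish_ar <- function(adj, delta) {
  p  <- nrow(adj)
  nu <- sapply(1:p, function(i) sum(adj[i, -(1:i)]))        # nu_i^G: neighbors j > i
  repeat {
    Psi <- matrix(0, p, p)
    # free elements sampled from h(.): Psi_ii^2 ~ chi^2_{delta + nu_i}, Psi_ij ~ N(0, 1)
    diag(Psi) <- sqrt(rchisq(p, df = delta + nu))
    for (i in 1:(p - 1)) for (j in (i + 1):p)
      if (adj[i, j] == 1) Psi[i, j] <- rnorm(1)
    # completion (D = I_p case): K_ij = 0 for (i, j) not in E forces
    #   Psi_1j = 0  and  Psi_ij = -sum_{r < i} Psi_ri Psi_rj / Psi_ii  for i > 1
    log_f <- 0
    for (i in 1:(p - 1)) for (j in (i + 1):p)
      if (adj[i, j] == 0) {
        Psi[i, j] <- if (i == 1) 0 else
          -sum(Psi[1:(i - 1), i] * Psi[1:(i - 1), j]) / Psi[i, i]
        log_f <- log_f - 0.5 * Psi[i, j]^2                  # accumulates log f(Psi)
      }
    if (log(runif(1)) <= log_f) return(t(Psi) %*% Psi)      # accept: K = Psi^T Psi
  }
}

# example: one draw for the 4-cycle graph with delta = 3
adj <- matrix(0, 4, 4); adj[cbind(c(1, 2, 3, 1), c(2, 3, 4, 4))] <- 1; adj <- adj + t(adj)
K <- rgwish_ar(adj, delta = 3)
```

With D = Ip and a sparse graph the acceptance probability Eh[f(Ψν(G))] is high, but the WC column of Table 1 below shows how quickly it decays for the informative matrix Dp used in the scalability comparison.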

Mitsakakis, Massam, and Escobar (2011) proposed a Metropolis–Hastings method for sampling from the G-Wishart WisG(δ, D) distribution. We denote by K[s] = QT(Ψ[s])TΨ[s]Q the current state of their Markov chain, where (Ψ[s])ν(G) ∈ Mν(G). Mitsakakis, Massam, and Escobar (2011) generated a candidate K′ = QT(Ψ′)TΨ′Q for the next state by sampling (Ψ′)ν(G) ~ h(·) and determining the non-free elements of Ψ′ from the free sampled elements (Ψ′)ν(G). The chain moves to K′ with probability

min{ f((Ψ′)ν(G)) / f((Ψ[s])ν(G)), 1 }.   (6)

The inequality (4) implies that this independent chain is uniformly ergodic (Mengersen and Tweedie 1996) and its expected acceptance probability is greater than or equal to Eh[f(Ψν(G))]—see lemma 7.9 in the book by Robert and Casella (2004). Hence, the Markov chain of Mitsakakis, Massam, and Escobar (2011) is more efficient than the method of Wang and Carvalho (2010) when the latter algorithm is employed without graph decompositions or when the graph G has only one maximal prime subgraph. On the other hand, the method of Mitsakakis, Massam, and Escobar (2011) involves changing the values of all free elements Ψν(G) in a single step. As a result, if their Markov chain is currently in a region of high probability, it could potentially have to generate many candidates from h(·) before moving to a new state with comparable or larger probability.

2.2 Our Algorithm for Sampling From the G-Wishart Distribution

We introduce a new Metropolis–Hastings algorithm for sampling from the G-Wishart WisG(δ, D) distribution. In contrast to the work of Mitsakakis, Massam, and Escobar (2011), our approach makes use of a proposal distribution that depends on the current state of the chain and leaves all but one of the free elements in Ψν(G) unchanged. The distance between the current and the proposed state is controlled through a Gaussian kernel with a precision parameter σm.

We denote by K[s] = QT(Ψ[s])TΨ[s]Q the current state of the chain with (Ψ[s])ν(G) ∈ Mν(G). The next state K[s+1] = QT(Ψ[s+1])TΨ[s+1]Q is obtained by sequentially perturbing the free elements (Ψ[s])ν(G). A diagonal element Ψi0i0[s] > 0 is updated by sampling a value γ from a N(Ψi0i0[s], σm²) distribution truncated below at zero. We define the upper triangular matrix Ψ′ such that Ψ′ij = Ψij[s] for (i, j) ∈ ν(G) \ {(i0, i0)} and Ψ′i0i0 = γ. The non-free elements of Ψ′ are obtained through the completion operation (Atay-Kayis and Massam 2005, lemma 2) from (Ψ′)ν(G). The Markov chain moves to K′ = QT(Ψ′)TΨ′Q with probability min{Rm, 1}, where

Rm = [p((Ψ′)ν(G) | δ, D) / p((Ψ[s])ν(G) | δ, D)] × [p(Ψi0i0[s] | Ψ′i0i0) / p(Ψ′i0i0 | Ψi0i0[s])] = [φ(Ψi0i0[s]/σm) / φ(Ψ′i0i0/σm)] (Ψ′i0i0 / Ψi0i0[s])^{νi0G + δ − 1} R′m.   (7)

Here φ(·) represents the CDF of the standard normal distribution and

R′m = exp{ −(1/2) ∑_{i=1}^{p} ∑_{j=i}^{p} [ (Ψ′ij)² − (Ψij[s])² ] }.   (8)

A free off-diagonal element Ψi0j0[s] is updated by sampling a value γ ~ N(Ψi0j0[s], σm²). We define the upper triangular matrix Ψ′ such that Ψ′ij = Ψij[s] for (i, j) ∈ ν(G) \ {(i0, j0)} and Ψ′i0j0 = γ. The remaining elements of Ψ′ are determined from (Ψ′)ν(G) through the completion operation of lemma 2 in Atay-Kayis and Massam (2005). The proposal distribution is symmetric, p(Ψ′i0j0 | Ψi0j0[s]) = p(Ψi0j0[s] | Ψ′i0j0), and the diagonal elements are unchanged, thus we accept the transition of the chain from K[s] to K′ = QT(Ψ′)TΨ′Q with probability min{R′m, 1}, where R′m is given in Equation (8) and the remaining factors of Equation (7) equal one. Since (Ψ[s])ν(G) ∈ Mν(G), we have (Ψ′)ν(G) ∈ Mν(G), which implies K′ ∈ PG. We denote by K[s+1] the precision matrix obtained after completing all the updates associated with the free elements indexed by ν(G).
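
The following R sketch, again assuming D = Ip so that Q = Ip, and with names and structure that are ours rather than the authors' R/C++ code, illustrates one such update for a free diagonal element using the acceptance ratio of Equations (7)-(8).

```r
# Illustrative R sketch (assuming D = I_p, so Q = I_p) of one Metropolis-Hastings
# update of a free diagonal element Psi[i0, i0], with the acceptance ratio of
# Equations (7)-(8).  complete_psi() repeats the completion step from the earlier sketch.
complete_psi <- function(Psi, adj) {
  p <- nrow(adj)
  for (i in 1:(p - 1)) for (j in (i + 1):p)
    if (adj[i, j] == 0)
      Psi[i, j] <- if (i == 1) 0 else
        -sum(Psi[1:(i - 1), i] * Psi[1:(i - 1), j]) / Psi[i, i]
  Psi
}

update_diag <- function(Psi, adj, delta, i0, sigma_m) {
  nu_i0 <- sum(adj[i0, -(1:i0)])                       # nu_{i0}^G: neighbors j > i0
  repeat {                                             # N(Psi[i0,i0], sigma_m^2) truncated at 0
    gam <- rnorm(1, Psi[i0, i0], sigma_m)
    if (gam > 0) break
  }
  Psi_new <- Psi
  Psi_new[i0, i0] <- gam
  Psi_new <- complete_psi(Psi_new, adj)                # recompute the non-free elements
  # log R_m: truncation correction x (gamma/Psi[i0,i0])^{nu + delta - 1} x R'_m of (8)
  log_Rm <- log(pnorm(Psi[i0, i0] / sigma_m)) - log(pnorm(gam / sigma_m)) +
    (nu_i0 + delta - 1) * (log(gam) - log(Psi[i0, i0])) -
    0.5 * (sum(Psi_new[upper.tri(Psi_new, diag = TRUE)]^2) -
             sum(Psi[upper.tri(Psi, diag = TRUE)]^2))
  if (log(runif(1)) <= log_Rm) Psi_new else Psi        # accept or keep the current state
}
```

A full sweep applies this update, together with the analogous symmetric update for the free off-diagonal elements, to every position in ν(G).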

A key computational aspect is related to the dependence of Cholesky decompositions on a particular ordering of the variables involved. Empirically we noticed that the mixing times of Markov chains that make use of our sampling approach can be improved by changing the ordering of the variables in Vp at each iteration. More specifically, a permutation υ is uniformly drawn from the set of all possible permutations ϒp of Vp. The rows and columns of D are reordered according to υ and a new Cholesky decomposition of D−1 is determined. The set ν(G) and the counts {diG : i ∈ Vp} are recalculated given the ordering of the vertices induced by υ. Although the random shuffling of the ordering of the indices worked well for the particular applications we have considered in this article, we do not have any theoretical justification that explains why it improves the computational efficiency of our sampling approaches.

Our later developments from Section 4 involve sampling K ~ WisG(δ, D) subject to the constraint K11 = 1. We have K11[s] = 1 ⟺ (Ψ[s]Q)11 = 1 ⟺ Ψ11[s] = 1/Q11. We subsequently obtain the next state K[s+1] of the Markov chain by perturbing the free elements (Ψ[s])ν(G)\{(1,1)}. When defining the triangular matrix Ψ′ we set Ψ′11 = 1/Q11, which implies that the corresponding candidate matrix K′ has K′11 = 1. Thus K[s+1] also obeys the constraint K11[s+1] = 1. The random orderings of the variables need to be drawn from the set ϒp(1,1) of permutations υ ∈ ϒp such that υ(1) = 1. This way the (1, 1) element of K always occupies the same position.

2.3 The Scalability of Sampling Methods From the G-Wishart Distribution

Practical applications of our novel framework for analyzing multivariate lattice data from Section 5 involve sampling from G-Wishart distributions associated with arbitrary graphs with tens or possibly hundreds of vertices—see Section 6.2 as well as relevant examples from the works of Elliott et al. (2001), Banerjee, Carlin, and Gelfand (2004), Rue and Held (2005), Lawson (2009), Gelfand et al. (2010). We perform a simulation study to empirically compare the scalability of our approach (DLR) for simulating from G-Wishart distributions with respect to the algorithms of Wang and Carvalho (2010) (WC) and Mitsakakis, Massam, and Escobar (2011) (MME). We consider the cycle graph Cp ∈ 𝒢p with edges {(i, i + 1) : 1 ≤ i ≤ p − 1} ∪ {(p, 1)} and the matrix Ap ∈ PCp such that (Ap)ii = 1 (1 ≤ i ≤ p), (Ap)i,i−1 = (Ap)i−1,i = 0.5 (2 ≤ i ≤ p), and (Ap)1,p = (Ap)p,1 = 0.4. We chose Cp because it is the sparsest graph with p vertices and only one maximal prime subgraph, hence no graph decompositions can be performed in the context of the WC algorithm.

We employ the three sampling methods to sample from the G-Wishart WisCp(103, Dp) distribution, where Dp = Ip + 100Ap⁻¹ and Ip is the p-dimensional identity matrix. This is representative of a G-Wishart posterior distribution corresponding with 100 samples from a Np(0, Ap⁻¹) and a G-Wishart prior WisCp(3, Ip). The acceptance probability for generating one sample with the DLR algorithm is defined as the average acceptance probabilities of the updates corresponding with diagonal and off-diagonal free elements in Ψν(G). We calculate Monte Carlo estimates and their standard errors of the acceptance probabilities for our method by running 100 independent chains of length 2500 for each combination (σm, p) ∈ {0.1, 0.5, 1, 2} × {4, 6, …, 20}. The same number of chains of the same length have been run with the MME algorithm for p = 4, 6, …, 20. We calculate Monte Carlo estimates of the acceptance probabilities (5) associated with the WC algorithm, that is,

Êh[f(Ψν(Cp))] = (1/2500) ∑_{s=1}^{2500} f((Ψ[s])ν(Cp)),   with (Ψ[s])ν(Cp) ~ h(·), s = 1, …, 2500.

The corresponding standard errors were determined by calculating 100 such estimates for each p = 4, 6, …, 20. The results are summarized in Table 1. As we would expect, the acceptance rates for the DLR algorithm decrease as σm increases since the proposed jumps in PCp become larger. More importantly, for each value of σm, the acceptance rates for the DLR algorithm decrease very slowly as the number of vertices p grows. This shows that our Metropolis–Hastings method is likely to retain its efficiency for graphs that involve a large number of vertices. On the other hand, the acceptance rates for the WC algorithm are extremely small even for p = 6 vertices. This implies that a large number of samples from the instrumental distribution h(·) would need to be generated before one sample from WisCp (103, Dp) is obtained. The MME algorithm gives slightly better acceptance rates, but they are still small and are indicative of a simulation method that constantly attempts to make large jumps in areas of low probability of WisCp (103,Dp). Therefore the WC and MME methods might not scale well despite being perfectly valid in theory, hence they cannot be used in the context of the multivariate lattice data models from Section 5. The acceptance rates of the DLR algorithm are a function of the precision parameter σm which can be easily adjusted to yield jumps that are not too short but also not too long in the target cone of matrices. This flexibility is key for successfully employing our algorithm for graphs with many vertices.

Table 1.

Monte Carlo estimates and their standard errors (in parentheses) of the acceptance probabilities for our proposed algorithm (DLR) for sampling from the G-Wishart distribution, the Wang and Carvalho (2010) algorithm (WC), and the Mitsakakis, Massam, and Escobar (2011) algorithm (MME)

p | DLR, σm = 0.1 | DLR, σm = 0.5 | DLR, σm = 1 | DLR, σm = 2 | WC | MME
4 | 0.953 (2.0e–3) | 0.776 (3.0e–3) | 0.600 (3.2e–3) | 0.389 (3.8e–3) | 0.340 (6.4e–3) | 0.473 (1.0e–2)
6 | 0.947 (2.0e–3) | 0.751 (2.4e–3) | 0.565 (2.9e–3) | 0.356 (2.3e–3) | 5.08e–2 (2.2e–3) | 0.185 (1.0e–2)
8 | 0.944 (1.8e–3) | 0.740 (2.1e–3) | 0.551 (2.3e–3) | 0.343 (2.6e–3) | 7.94e–3 (7.1e–4) | 0.078 (0.9e–2)
10 | 0.943 (1.7e–3) | 0.734 (2.0e–3) | 0.543 (2.1e–3) | 0.336 (2.1e–3) | 1.18e–3 (1.7e–4) | 0.035 (8.0e–3)
12 | 0.942 (1.4e–3) | 0.729 (1.8e–3) | 0.537 (2.1e–3) | 0.331 (2.2e–3) | 1.56e–4 (5.1e–5) | 0.013 (6.0e–3)
14 | 0.940 (1.5e–3) | 0.725 (1.6e–3) | 0.532 (1.8e–3) | 0.327 (1.9e–3) | 2.27e–5 (1.4e–5) | 0.005 (4.0e–3)
16 | 0.940 (1.3e–3) | 0.721 (1.6e–3) | 0.527 (1.7e–3) | 0.324 (1.7e–3) | 2.19e–6 (1.8e–6) | 0.002 (3.0e–3)
18 | 0.939 (1.2e–3) | 0.717 (1.4e–3) | 0.523 (1.5e–3) | 0.321 (1.5e–3) | 3.87e–7 (8.0e–7) | 0.001 (2.0e–3)
20 | 0.938 (1.0e–3) | 0.714 (1.4e–3) | 0.520 (1.6e–3) | 0.318 (1.5e–3) | 2.08e–8 (3.5e–8) | 5.4e–4 (9.0e–4)

It may initially appear that the inability to handle graphs having a maximal prime subgraph with 20 vertices is not a major shortcoming of an algorithm that relies on graph decompositions. However, the example we consider in Section 6.2 involves a fixed graphical model whose underlying graph has a maximal prime subgraph with 36 vertices (see the Supplementary Materials). Since this graph is constructed from the neighborhood structure of the United States, we can see the importance of the ability to scale when applying the GGM framework to spatial statistical problems.

3. REVERSIBLE JUMP MCMC SAMPLERS FOR GGMS

The previous section was concerned with sampling from the G-Wishart distribution when the underlying conditional independence graph is fixed. In contrast, this section is concerned with performing Bayesian inference in GGMs, which involves determination of the graph G. Giudici and Green (1999) proposed a reversible jump Markov chain algorithm that is restricted to decomposable graphs and employs a hyper inverse Wishart prior (Dawid and Lauritzen 1993) for the covariance matrix Σ = K−1 = (Σij)1≤i,jp. The efficiency of their approach comes from representing decomposable graphs through their junction trees and from a specification of Σ as an incomplete matrix Γ = (Γij)1≤i,jp such that Σij = Γij if i = j or if (i, j) is an edge in the current graph (i.e., Kij is not constrained to 0); the remaining elements of Γ are left unspecified but can be uniquely determined using the iterative proportional scaling algorithm (Dempster 1972; Speed and Kiiveri 1986). Brooks, Giudici, and Roberts (2003) presented techniques for calibrating the precision parameter of the normal kernel used by Giudici and Green (1999) to sequentially update the elements of Γ and the edges of the underlying decomposable graph which lead to improved mixing times of the resulting Markov chain. Related work (Scott and Carvalho 2008; Carvalho and Scott 2009; Armstrong et al. 2009) has also focused on decomposable graphs. Graphs of this type have received attention mainly because their special structure is convenient from a computational standpoint. However, decomposability is a serious constraint as it drastically reduces 𝒢p to a much smaller subset of graphs: the ratio between the number of decomposable graphs and the total number of graphs decreases from 0.95 for p = 4 to 0.12 for p = 8 (Armstrong 2005).

Roverato (2002), Atay-Kayis and Massam (2005), Jones et al. (2005), Dellaportas, Giudici, and Roberts (2003), Moghaddam et al. (2009), Lenkoski and Dobra (2011) explored various methods for numerically calculating marginal likelihoods for non-decomposable graphs. Besides being computationally expensive, stochastic search algorithms that traverse 𝒢p based on marginal likelihoods output a set of graphs that have the highest posterior probabilities from all the graphs that have been visited, but do not produce estimates of the precision matrix K unless a direct sampling algorithm from the posterior distribution of K given a graph is available. As we have seen in Section 2.3, current direct sampling algorithms (e.g., Wang and Carvalho 2010) might not scale well to graphs with many vertices. Since our key motivation comes from modeling multivariate lattice data (see Section 5), we want to develop a Bayesian method that samples from the joint posterior distribution of precision matrices K ∈ PG and graphs G ∈ 𝒢p, thereby performing inference for both K and G. Wong, Carter, and Kohn (2003) developed such a reversible jump Metropolis–Hastings algorithm by decomposing K = TΔT, where T = diag{K11^{1/2}, …, Kpp^{1/2}} and Δ = (Δij)_{1 ≤ i, j ≤ p} is a correlation matrix with Δii = 1 and Δij = Kij/(KiiKjj)^{1/2} for i < j. Their prior specification for K involves independent priors for Tii and a joint prior for the off-diagonal elements of Δ whose normalizing constant is associated with all the graphs with the same number of edges.

We propose a new reversible jump Markov chain algorithm that is based on a G-Wishart prior for K. While we do not empirically explore the efficiency of our approach with respect to the Wong, Carter, and Kohn (2003) algorithm, the key advantage of our framework lies in its generalization to matrix-variate data—see Section 4. We are unaware of any similar extension of the Wong, Carter, and Kohn (2003) approach. Another benefit of our method with respect to that of Wong, Carter, and Kohn (2003) is related to its flexibility with respect to prior specifications on 𝒢p—see the discussion below.

We let 𝒟 = {x(1), …, x(n)} be the observed data of n independent samples from Np(0, K−1). Given a graph G ∈ 𝒢p, we assume a G-Wishart prior WisG(δ0, D0) for the precision matrix K ∈ PG. We take δ0 = 3 > 2 and D0 = Ip. With this choice the prior for K is equivalent to one observed sample, while the observed variables are assumed to be a priori independent of each other. Since the G-Wishart prior for K is conjugate to the likelihood p(𝒟 | K), the posterior of K given G is WisG(n + δ0, U + D0), where U = ∑_{j=1}^{n} x(j)(x(j))T. We also assume a prior Pr(G) on 𝒢p. We develop a MCMC algorithm for sampling from the joint posterior distribution

p(K, G | 𝒟) ∝ p(𝒟 | K) p(K | G, δ0, D0) Pr(G),

that is well-defined if and only if K ∈ PG. We sequentially update the precision matrix given the current graph and the edges of the graph given the current precision matrix—see Appendix A. The addition or deletion of an edge involves a change in the dimensionality of the parameter space since the corresponding element of K becomes free or constrained to zero, hence we make use of the reversible jump MCMC methodology of Green (1995). The graph update step also requires the calculation of the normalizing constants of the G-Wishart priors corresponding with the current and the candidate graph. To this end, we make use of the Monte Carlo method of Atay-Kayis and Massam (2005) which converges very fast when computing IG(δ, D) [see Equation (1)] even for large graphs when δ is small and D is set to the identity matrix (Lenkoski and Dobra 2011).

Our framework accommodates any prior probabilities Pr(G) on the set of graphs 𝒢p, which is a significant advantage with respect to the covariance selection prior from Wong, Carter, and Kohn (2003). Indeed, the prior in the work of Wong, Carter, and Kohn (2003) induces fixed probabilities for each graph and does not allow the possibility of further modifying these probabilities according to prior beliefs. A usual choice is the uniform prior Pr(G) = 2^{−m} with m = p(p − 1)/2, but this prior is biased toward middle-size graphs and gives small probabilities to sparse graphs and to graphs that are almost complete. Here the size of a graph G is defined as the number of edges in G and is denoted by size(G) ∈ {0, 1, …, m}. Dobra et al. (2004) and Jones et al. (2005) assumed that the probability of inclusion of any edge in G is constant and equal to ψ ∈ (0, 1), which leads to the prior

Pr(G) ∝ ψ^{size(G)} (1 − ψ)^{m − size(G)}.   (9)

Sparser graphs can be favored with prior (9) by choosing a small value for ψ. Since ψ could be difficult to elicit in some applications, Carvalho and Scott (2009) integrated out ψ from (9) by assuming a conjugate beta distribution Beta(a, b), which leads to the prior

Pr(G) ∝ B(a + size(G), b + m − size(G)) / B(a, b),   (10)

where B(·, ·) is the beta function. From the work of Scott and Berger (2006) it follows that prior (10) has an automatic multiplicity correction for testing the inclusion of spurious edges. Armstrong et al. (2009) suggested a hierarchical prior on 𝒢p (the size-based prior) that gives equal probability to the size of a graph and equal probability to graphs of each size, that is,

Pr(G) = Pr(size(G)) Pr(G | size(G)) = (m + 1)⁻¹ (m choose size(G))⁻¹,   (11)

which is also obtained by setting a = b = 1 in (10). We note that the expected size of a graph under the size-based prior is m/2, which is also the expected size of a graph under the uniform prior on 𝒢p.
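
For reference, a short R sketch (ours) of the log prior probabilities, up to normalizing constants, implied by Equations (9)-(11).

```r
# Illustrative R sketch (ours) of the log prior probabilities, up to constants, for a
# graph with 'size' edges out of m = p(p-1)/2 possible edges.
log_prior_bernoulli <- function(size, m, psi)      # Equation (9)
  size * log(psi) + (m - size) * log(1 - psi)

log_prior_beta <- function(size, m, a, b)          # Equation (10)
  lbeta(a + size, b + m - size) - lbeta(a, b)

log_prior_sizebased <- function(size, m)           # Equation (11), i.e., a = b = 1 in (10)
  -log(m + 1) - lchoose(m, size)

# example with p = 5 variables: prior (9) with small psi favors sparse graphs
m <- choose(5, 2)
log_prior_bernoulli(size = 2, m = m, psi = 0.1)
log_prior_sizebased(size = 2, m = m)
```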

4. REVERSIBLE JUMP MCMC SAMPLER FOR MATRIX–VARIATE GGMS

We extend our framework to the case when the observed data 𝒟 = {x(1), …, x(n)} are associated with a pR × pC random matrix X = (Xij) that follows a matrix-variate normal distribution

vec(X) | KR, KC ~ NpRpC(0, [KC ⊗ KR]⁻¹),

with probability distribution function (Gupta and Nagar 2000):

p(X | KR, KC) = (2π)^{−pRpC/2} (det KR)^{pC/2} (det KC)^{pR/2} × exp{ −(1/2) tr[KR X KC X^T] }.   (12)

Here KR is a pR × pR row precision matrix and KC is a pC × pC column precision matrix. Furthermore, we assume that KR ∈ PGR and KC ∈ PGC, where GR = (VpR, ER) and GC = (VpC, EC) are two graphs with pR and pC vertices, respectively. We consider the rows X1*, …, XpR* and the columns X*1, …, X*pC of the random matrix X. From theorem 2.3.12 of the book by Gupta and Nagar (2000) we have Xi*T ~ NpC(0, (KR⁻¹)ii KC⁻¹) and X*j ~ NpR(0, (KC⁻¹)jj KR⁻¹). The graphs GR and GC define graphical models for the rows and columns of X (Wang and West 2009):

Xi1* ⊥ Xi2* | X(VpR\{i1,i2})* ⟺ (KR)i1i2 = (KR)i2i1 = 0 ⟺ (i1, i2) ∉ ER,   and   X*j1 ⊥ X*j2 | X*(VpC\{j1,j2}) ⟺ (KC)j1j2 = (KC)j2j1 = 0 ⟺ (j1, j2) ∉ EC.   (13)

Any prior specification for KR and KC must take into account the fact that the two precision matrices are not uniquely identified from their Kronecker product which means that, for any z > 0, (z⁻¹KR) ⊗ (zKC) = KR ⊗ KC represents the same precision matrix for vec(X)—see Equation (12). We follow the idea laid out in the article of Wang and West (2009) and impose the constraint (KC)11 = 1. We define a prior for KC through parameter expansion by assuming a G-Wishart prior WisGC(δC, DC) for the matrix zKC with z > 0, δC > 2, and DC ∈ PGC. It is immediate to see that the Jacobian of the transformation from zKC to (z, KC) is J((zKC) → (z, KC)) = z^{|ν(GC)|−1}. It follows that our joint prior for (z, KC) is given by

p(z, KC | GC, δC, DC) = (1/IGC(δC, DC)) (det KC)^{(δC−2)/2} × exp{ −(1/2) ⟨KC, zDC⟩ } z^{pC(δC−2)/2 + |ν(GC)| − 1}.

The elements of KR ∈ PGR are not subject to any additional constraints, hence we assume a G-Wishart prior WisGR (δR, DR) for KR. We take δC = δR = 3, DC = IpC, and DR = IpR. We complete our prior specification by choosing two priors Pr(GC) and Pr(GR) for the row and column graphs, where GR ∈ 𝒢pR and GC ∈ 𝒢pC—see our discussion from Section 3.

We perform Bayesian inference for matrix-variate GGMs by developing a MCMC algorithm for sampling from the joint posterior distribution of the row and column precision matrices, the row and column graphs, and the auxiliary variable z:

p(KR, z, KC, GR, GC | 𝒟) ∝ p(𝒟 | KR, KC) p(KR | GR, δR, DR) × p(z, KC | GC, δC, DC) Pr(GC) Pr(GR),   (14)

that is defined for KR ∈ PGR, KC ∈ PGC with (KC)11 = 1 and z > 0. Equation (14) is written as

p(KR, z, KC, GR, GC | 𝒟) ∝ z^{pC(δC−2)/2 + |ν(GC)| − 1} × (det KR)^{(npC + δR − 2)/2} (det KC)^{(npR + δC − 2)/2} × exp{ −(1/2) tr[ ∑_{j=1}^{n} KR x(j) KC (x(j))T + KR DR + KC(zDC) ] } × Pr(GC) Pr(GR).   (15)

The details of our sampling scheme are presented in Appendix B. Updating the row and column precision matrices involves sampling from their corresponding G-Wishart conditional distributions using the Metropolis–Hastings approach from Section 2.2, while updating z involves sampling from its gamma full conditional. The updates of the row and column graphs involve changes in the dimension of the parameter space and require a reversible jump step (Green 1995).
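
As an illustration of the z update, collecting the factors of Equation (15) that involve z gives a Gamma full conditional; the following R fragment is our own sketch of that step under this reading of (15), not code from Appendix B.

```r
# Our sketch of the z update: the factors of Equation (15) that involve z are
#   z^{p_C (delta_C - 2)/2 + |nu(G_C)| - 1} * exp{ -z * tr(K_C D_C) / 2 },
# i.e., a Gamma(shape, rate) full conditional.  |nu(G_C)| = p_C + number of edges of G_C.
sample_z <- function(K_C, D_C, delta_C, nu_GC_size) {
  shape <- ncol(K_C) * (delta_C - 2) / 2 + nu_GC_size
  rate  <- 0.5 * sum(diag(K_C %*% D_C))
  rgamma(1, shape = shape, rate = rate)
}
```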

It is relevant to discuss how our MCMC sampler from Appendix B is different from the methodology proposed by Wang and West (2009). First of all, Wang and West (2009) allowed only decomposable row and column graphs which makes their framework inappropriate for modeling multivariate lattice data where the graph representing the neighborhood structure of the regions involved is typically not decomposable. Second of all, they proposed a Markov chain sampler for the marginal distribution associated with (GR, GC) of the joint posterior distribution (15):

Pr(GR, GC | 𝒟) ∝ p(𝒟 | GR, GC) Pr(GC) Pr(GR),   (16)

which involves the marginal likelihood

p(𝒟 | GR, GC) = ∫ p(𝒟 | KR, KC) p(KR | GR, δR, DR) × p(z, KC | GC, δC, DC) dKR dKC dz.   (17)

Wang and West (2009) employed the candidate’s formula (Besag 1989; Chib 1990) to approximate the marginal likelihood (17). They sampled from (16) by sequentially updating the row and column graphs. The update of the row graph proceeds by sampling a candidate G′R from an instrumental distribution π(G′R | GR) and accepting it with Metropolis–Hastings probability

min{ [p(𝒟 | G′R, GC) Pr(G′R) π(GR | G′R)] / [p(𝒟 | GR, GC) Pr(GR) π(G′R | GR)], 1 }.   (18)

The update of the column graph is done in a similar manner. Therefore resampling GR or GC requires the computation of a new marginal likelihood (17) which entails a considerable computational effort even for graphs with a small number of vertices. We avoid the computation of marginal likelihoods by sampling from the joint distribution (15) in which the row and column precision matrices have not been integrated out. Sampling from the joint marginal distribution (16) seems appealing because it involves the reduced space of row and column graphs. However, the numerical difficulties associated with repeated calculations of marginal likelihoods (17) outweigh the benefits of working in a smaller space. We empirically compare the efficiency of our inference approach for matrix-variate GGMs with the work of Wang and West (2009) in Section 6.1.

5. BAYESIAN HIERARCHICAL MODELS FOR MULTIVARIATE LATTICE DATA

Conditional autoregressive (CAR) models (Besag 1974; Mardia 1988) are routinely used in spatial statistics to model lattice data. In the case of a single observed outcome in each region, the data are associated with a vector X = (X1, …, XpR)T where Xi corresponds to region i. The zero-centered CAR model is implicitly defined by the set of full conditional distributions

Xi | {Xj : j ≠ i} ~ N( ∑_{j≠i} bij Xj, λi² ),   i = 1, …, pR.   (19)

Therefore, CAR models are just two-dimensional Gaussian Markov random fields. According to Brook’s (1964) theorem, this set of full-conditional distributions implies that the joint distribution for X satisfies

p(X | λ) ∝ exp{ −(1/2) X^T Λ⁻¹(I − B) X },

where Λ = diag{λ1², …, λpR²} and B is a pR × pR matrix such that B = (bij) and bii = 0. In order for Λ⁻¹(I − B) to be a symmetric matrix we require that bij λj² = bji λi² for i ≠ j; therefore the matrix B and vector λ must be carefully chosen. A popular approach is to begin by constructing a symmetric proximity matrix W = (wij), and then set bij = wij/wi+ and λi² = τ²/wi+, where wi+ = Σj wij and τ² > 0. In that case, Λ⁻¹(I − B) = τ⁻²(EW − W), where EW = diag{w1+, …, wpR+}. The proximity matrix W is often constructed by first specifying a neighborhood structure for the geographical areas under study; for example, when modeling state or county level data it is often assumed that two geographical units are neighbors if they share a common border. This neighborhood structure can be summarized in a graph GR ∈ 𝒢pR whose vertices correspond to geographical areas, while its edges are associated with areas that are considered neighbors of each other. The proximity matrix W is subsequently specified as

wij = 1 if j ∈ ∂GR(i), and wij = 0 otherwise,   (20)

where ∂GR(i) denotes the set of vertices that are linked by an edge in the graph GR with the vertex i.

Specifying the joint precision matrix for X using the proximity matrix derived from the neighborhood structure is very natural; essentially, it implies that observations collected on regions that are not neighbors are conditionally independent from each other given the rest. However, note that the specification in (20) implies that (EW − W)1pR = 0 and therefore the joint distribution on X is improper. Here 1l is the column vector of length l with all elements equal to 1. Proper CAR models (Cressie 1973; Sun et al. 2000; Gelfand and Vounatsou 2003) can be obtained by including a spatial autocorrelation parameter ρ, so that

Xi | {Xj : j ≠ i} ~ N( ρ ∑_{j≠i} (wij/wi+) Xj, τ²/wi+ ).   (21)

The joint distribution on X is then multivariate normal NpR(0, VW⁻¹(τ⁻², ρ)), where VW(τ⁻², ρ) = τ⁻²(EW − ρW) ∈ PGR. This distribution is proper as long as ρ is between the reciprocals of the minimum and maximum eigenvalues of W. In particular, note that taking ρ = 0 leads to independent random effects.
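
A small R sketch (ours) showing how the proper-CAR joint precision VW(τ⁻², ρ) = τ⁻²(EW − ρW) can be assembled from a binary proximity matrix W; positive definiteness is checked numerically here rather than through the eigenvalue bound quoted above.

```r
# Illustrative R sketch (ours): the proper-CAR joint precision built from a binary
# proximity matrix W, here the 4-cycle neighborhood, with a numerical propriety check.
W <- matrix(0, 4, 4)
W[cbind(c(1, 2, 3, 1), c(2, 3, 4, 4))] <- 1
W <- W + t(W)
E_W <- diag(rowSums(W))                                   # E_W = diag{w_{1+}, ..., w_{p_R+}}

car_precision <- function(W, E_W, tau2, rho) (E_W - rho * W) / tau2

V <- car_precision(W, E_W, tau2 = 1, rho = 0.9)
min(eigen(V, symmetric = TRUE, only.values = TRUE)$values) > 0   # proper if TRUE
```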

In the spirit of Besag and Kooperberg (1995), an alternative but related approach to the construction of models for lattice data is to let X ~ NpR(0, K−1) and assign K a G-Wishart prior WisGR(δR, (δR − 2)DR), where DR = VW⁻¹(τ⁻², ρ). The mode of WisGR(δR, (δR − 2)DR) is the unique matrix K = (Kij) ∈ PGR that satisfies the relations

(K⁻¹)ij = (DR)ij if j ∈ ∂GR(i) ∪ {i},   and   Kij = 0 if j ∉ ∂GR(i) ∪ {i}.   (22)

The matrix DR⁻¹ = VW(τ⁻², ρ) verifies the system (22), hence it is the mode of WisGR(δR, (δR − 2)DR). As such, the mode of the prior for KR induces the same prior specification for X as (21). It is easy to see that, conditional on KR ∈ PGR, we have

Xi | {Xj : j ≠ i} ~ N( −∑_{j ∈ ∂GR(i)} (Kij/Kii) Xj, 1/Kii ).   (23)

Hence, by modeling X using a Gaussian graphical model and restricting the precision matrix K to belong to the cone PGR, we are inducing a mixture of CAR priors on X where the priors on

bij = −Kij/Kii if j ∈ ∂GR(i), and bij = 0 otherwise,

and λi² = 1/Kii [see Equation (19)] are induced by the G-Wishart prior WisGR(δR, (δR − 2)DR).

The specification of CAR models through G-Wishart priors solves the impropriety problem of intrinsic CAR models and preserves the computational advantages derived from standard CAR specifications while providing greater flexibility. Indeed, the prior is trivially proper because the matrix K ∈ PGR is invertible by construction. The computational advantages are preserved because the full conditional distributions for each Xi can be easily computed for any matrix K without the need to perform matrix inversion, and they depend only on a small subset of neighbors {Xj:j ∈ ∂GR(i)}. Additional flexibility is provided because the weights bij for j ∈ ∂GR (i) and smoothing parameters λi2 are being estimated from the data rather than being assumed fixed, allowing for adaptive spatial smoothing. Our approach provides what can be considered as a nonparametric alternative to the parametric estimates of the proximity matrix of Cressie and Chan (1989).
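
For completeness, a two-line R sketch (ours) that recovers the induced weights bij and smoothing parameters λi² from a precision matrix K ∈ PGR, for instance a posterior draw produced by the sampler of Section 2.2.

```r
# Illustrative R sketch (ours): adaptive CAR weights and smoothing parameters
# implied by a precision matrix K in P_{G_R}, e.g., a posterior draw of K_R.
car_weights <- function(K) {
  B <- -sweep(K, 1, diag(K), "/")   # b_ij = -K_ij / K_ii (row i divided by K_ii)
  diag(B) <- 0
  list(B = B, lambda2 = 1 / diag(K))
}
```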

A similar approach can be used to construct proper multivariate conditional autoregressive (MCAR) models (Mardia 1988; Gelfand and Vounatsou 2003). In this case, we are interested in modeling a pR × pC matrix X = (Xij) where Xij denotes the value of the jth outcome in region i. We let X follow a matrix-variate normal distribution with row precision matrix KR capturing the spatial structure in the data (which, as in univariate CAR models, is restricted to the cone PGR defined by the neighborhood graph GR), and column precision matrix KC, which controls the multivariate dependencies across outcomes. It can be easily shown that the row vector Xi* of X depends only on the row vectors associated with those regions that are neighbors with i—see also Equation (13):

Xi*T | {Xj*T : j ≠ i} ~ NpC( −∑_{j ∈ ∂GR(i)} ((KR)ij/(KR)ii) Xj*T, (1/(KR)ii) KC⁻¹ ).

Given the matrix-variate GGM framework from Section 4, we can model the conditional independence relationships across outcomes through a column graph GC ∈ 𝒢pC and require KC ∈ PGC. As opposed to the neighborhood graph GR, which is known and considered fixed, the graph GC is typically unknown and needs to be inferred from the data.

The matrix-variate GGM formulation for spatial models can also be used as part of more complex hierarchical models. Indeed, CAR and MCAR models are most often used as a prior for the random effects of a generalized linear model (GLM) to account for residual spatial structure not accounted for by covariates. When no covariates are available, the model can be interpreted as a spatial smoother where the spatial covariance matrix controls the level of spatial smoothing in the underlying (latent) surface. Similarly, MCAR models can be used to construct multivariate spatial GLMs. More specifically, consider the pR × pC matrix Y = (Yij) of discrete or continuous outcomes, and let Yij ~ hj(·|ηij) where hj is a probability mass or probability density function that belongs to the exponential family with location parameter ηij. The spatial GLM is then defined through the linear predictor

g⁻¹(ηij) = μj + Xij + Zij βj,   (24)

where g(·) is the link function, μj is an outcome-specific intercept, Xij is a zero-centered spatial random effect associated with location i, Zij is a row vector of observed covariates for outcome j at location i, and βj is the vector of fixed effects associated with outcome j. As an example, by choosing Yij ~ Poi(ηij) and g⁻¹(·) = log(·) we obtain a multivariate spatial log-linear model for count data, which is often used for disease mapping in epidemiology (see Section 6.2). We further assign a matrix-variate normal distribution for X with independent G-Wishart priors for KR and zKC, where z > 0 is an auxiliary variable needed to impose the identifiability constraint (KC)11 = 1:

KR | δR, DR ~ WisGR(δR, (δR − 2)DR),   (zKC) | δC, DC ~ WisGC(δC, DC).   (25)

Prior elicitation for this class of spatial models is relatively straightforward. Indeed, elicitation of the matrix DR = (EWρW)−1 requires only the elicitation of the neighborhood matrix W which also defines the neighborhood graph GR, along with reasonable values for the spatial autocorrelation parameter ρ. In particular, in the application we discuss in Section 6.2 we assume that, a priori, there is a strong degree of positive spatial association, and choose a prior for ρ that gives higher probabilities to values close to 1 (Gelfand and Vounatsou 2003):

ρ~Uni({0,0.05,0.1,,0.8,0.82,,0.90,0.91,,0.99}). (26)

In the case of MCAR models, it is common to assume that the prior value for the conditional covariance between variables is zero, which leads to choosing DC from the G-Wishart prior for zKC to be a diagonal matrix. We note that the scale parameter τ2 is no longer identifiable in the context of the joint prior (25), hence it is then sensible to pick DC = IpC .

At this point, a word of caution about the interpretation of the models seems appropriate. The graphs GR = (VpR, ER) and GC = (VpC, EC) induce conditional independence relationships associated with the rows and columns of the random-effects matrix X—see Equation (13). Similar conditional independence relationships hold for the rows and columns of the matrix of location parameters η = (ηij), but these extend to the observed outcomes Y only if its entries are continuous (for a more thorough discussion, see the Supplementary Materials). Hence, the reader must be careful when interpreting the results of these models when binary or count data are involved; in these cases, any statement about conditional independence needs to be circumscribed to the location parameters η.

6. ILLUSTRATIONS

6.1 Simulation Study

We empirically compare the reversible jump MCMC sampler for matrix-variate GGMs proposed in Section 4 (RJ-DLR) with the methodology of Wang and West (2009) (WW). We consider the row graph ḠR = (VpR, ĒR) with pR = 5 vertices and with edges ĒR = {(1, i) : 1 < i ≤ 5} ∪ {(2, 3), (3, 4), (4, 5), (2, 5)}. We take the column graph ḠC to be the cycle graph with pC = 10 vertices (see Section 2.3). We note that both the row and column graphs are non-decomposable, with ḠR being relatively dense and with ḠC being relatively sparse. We define the row and column precision matrices K̄R ∈ PḠR and K̄C ∈ PḠC to have diagonal elements equal to 1 and nonzero off-diagonal elements equal to 0.4. We generate 100 datasets each comprising n = 100 observations sampled from the 5 × 10 matrix-variate normal distribution p(·| K̄R, K̄C)—see Equation (12).
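
Under our reading of this design, the following R sketch generates one such dataset; the matrix-variate normal draw uses the standard Cholesky construction, and all names are illustrative rather than taken from the Supplemental Material code.

```r
# Illustrative R sketch of the simulation design: true precision matrices with 1 on the
# diagonal and 0.4 on the edges of the stated graphs, and n = 100 draws from the
# 5 x 10 matrix-variate normal of Equation (12).
p_R <- 5; p_C <- 10; n <- 100

adj_R <- matrix(0, p_R, p_R)                              # row graph: hub 1 plus cycle 2-3-4-5-2
adj_R[rbind(c(1, 2), c(1, 3), c(1, 4), c(1, 5), c(2, 3), c(3, 4), c(4, 5), c(2, 5))] <- 1
adj_R <- adj_R + t(adj_R)

adj_C <- matrix(0, p_C, p_C)                              # column graph: cycle C_10
for (i in 1:(p_C - 1)) adj_C[i, i + 1] <- 1
adj_C[1, p_C] <- 1
adj_C <- adj_C + t(adj_C)

K_R <- diag(p_R) + 0.4 * adj_R
K_C <- diag(p_C) + 0.4 * adj_C

# vec(X) ~ N(0, (K_C %x% K_R)^{-1}): X = A Z B^T with A A^T = K_R^{-1} and B B^T = K_C^{-1}
A <- t(chol(solve(K_R)))
B <- t(chol(solve(K_C)))
data <- replicate(n, A %*% matrix(rnorm(p_R * p_C), p_R, p_C) %*% t(B), simplify = FALSE)
```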

For each dataset we attempted to recover the edges of R and C with the RJ-DLR and the WW algorithms. The two MCMC samplers were run for 10,000 iterations with a burn-in of 1000 iterations. We implemented the RJ-DLR algorithm in R and C++ (code available as Supplemental Material). We used the code developed by Wang and West (2009) for their method. In the RJ-DLR algorithm we set σm,R = σm,C = σg,R = σg,C = 0.5, which yields average rejection rates on Metropolis–Hastings updates of about 0.3 for both C and R. We assumed independent uniform priors for the row and column graphs.

Tables 2 and 3 show the average posterior edge inclusion probabilities for ḠR and ḠC. Our RJ-DLR method recovers the structure of both graphs very well: the edges that belong to each graph receive posterior inclusion probabilities of 1, while the edges that are absent from each graph receive posterior inclusion probabilities below 0.1. The ability of the RJ-DLR algorithm to recover the structures of a dense row graph and of a sparse column graph is quite encouraging since the uniform priors for the row and column graphs favored middle-size graphs.

Table 2.

Average estimates (left panel) and their standard errors (right panel) of posterior edge inclusion probabilities for the row graph ḠR obtained using the RJ-DLR algorithm (below diagonal) and the WW algorithm (above diagonal). The boxes identify the edges that are in ḠR

(The table is available only as an image in the original article.)

Table 3.

Average estimates (upper panel) and their standard errors (lower panel) of posterior edge inclusion probabilities for the column graph ḠC obtained using the RJ-DLR algorithm (below diagonal) and the WW algorithm (above diagonal). The boxes identify the edges that are in ḠC

(The table is available only as an image in the original article.)

The WW method shows diminished performance in recovering graphical structures. Edges (1, 4), (1, 5), and (2, 5), which are actually in ḠR, are given average inclusion probabilities of only 0.7. Edge (1, 3), which is in ḠR, receives a very low inclusion probability, while edge (2, 4), which is not in ḠR, is given a high inclusion probability. The WW algorithm includes edges that are in the column graph ḠC with relatively high probabilities (typically above 0.9). However, edges that do not belong to ḠC also appear to be included quite often, with an average inclusion probability between 0.3 and 0.4. These tables also show a large standard deviation of edge inclusion probabilities across the 100 datasets using the WW method. The decreased performance of the WW algorithm versus our own RJ-DLR algorithm is likely attributable to the fact that Wang and West (2009) accounted only for decomposable graphs in their framework, which may be poor at approximating the non-decomposable graphs considered in this example.

6.2 Mapping Cancer Mortality in the United States

Accurate and timely counts of cancer mortality are very useful in the cancer surveillance community for purposes of efficient resource allocation and planning. Estimation of current and future cancer mortality broken down by geographic area (state) and tumor has been discussed in a number of recent articles, including those by Tiwari et al. (2004), Ghosh and Tiwari (2007), Ghosh et al. (2007), and Ghosh, Ghosh, and Tiwari (2008). This section considers a multivariate spatial model on state-level cancer mortality rates in the United States for 2000. These mortality data are based on death certificates that are filed by certifying physicians. They are collected and maintained by the National Center for Health Statistics (http://www.cdc.gov/nchs) as part of the National Vital Statistics System. The data are available from the Surveillance, Epidemiology, and End Results (SEER) program of the National Cancer Institute (http://seer.cancer.gov/seerstat).

The data we analyze consist of mortality counts for 11 types of tumors recorded in the 48 continental states plus the District of Columbia. Since mortality counts below 25 are often deemed unreliable in the cancer surveillance community, we treated them as missing. Along with mortality counts, we obtained population counts in order to model death risk. Figure 1 shows raw mortality rates for four of the most common types of tumors (colon, lung, breast, and prostate). Although the pattern is slightly different for each of these cancers, a number of striking similarities are present; for example, Colorado appears to be a state with low mortality for all of these common cancers, while West Virginia, Pennsylvania, and Florida present relatively high mortality rates.

Figure 1.


Mortality rates (per 10,000 inhabitants) in the 48 continental states and the D.C. area corresponding to four common cancers during 2000.

We consider modeling these data using Poisson multivariate log-linear models with spatial random effects. More concretely, we let Yij be the number of deaths in state i = 1, …, pR = 49 for tumor type j = 1, …, pC = 11. We set

Yij | ηij ~ Poi(ηij),   log(ηij) = log(mi) + μj + Xij.   (27)

Here mi is the population of state i, μj is the intercept for tumor type j, and Xij is a zero-mean spatial random effect associated with location i and tumor j. We denote X̃ij = μj + Xij. We further model X̃ = (X̃ij) with a matrix-variate normal prior:

vec(X̃) | μ, KR, KC ~ NpRpC( vec(1pR μ^T), [KC ⊗ KR]⁻¹ ),   (28)

where μ = (μ1, …, μpC)T.
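
To illustrate the observation model (27), the following R sketch simulates counts given hypothetical populations mi, intercepts μj, and a placeholder matrix X of spatial random effects; the specific numbers are invented for illustration only.

```r
# Illustrative R sketch of the observation model (27) with invented inputs: state
# populations m_i, tumor-specific intercepts mu_j, and a placeholder random-effects
# matrix X (in the actual model X has the matrix-variate prior of Equation (28)).
p_R <- 49; p_C <- 11
m  <- round(runif(p_R, 5e5, 3e7))                         # hypothetical state populations
mu <- rnorm(p_C, mean = -9, sd = 0.5)                     # hypothetical log baseline rates
X  <- matrix(rnorm(p_R * p_C, sd = 0.2), p_R, p_C)        # placeholder spatial random effects

eta <- exp(log(m) + outer(rep(1, p_R), mu) + X)           # eta_ij = m_i exp(mu_j + X_ij)
Y   <- matrix(rpois(p_R * p_C, lambda = eta), p_R, p_C)   # simulated mortality counts
```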

We consider four models in total. The first two models we consider (GGM-U and GGM-S) are based on the sparse models for multivariate areal data described in Section 5, and differ only in terms of the prior assigned on the space of graphs [GGM-U uses a uniform prior, while GGM-S uses the size-based prior from Equation (11)]. Hence, the joint prior for the row and column precision matrices is given in (25) for both models. The column graph GC is unknown and allowed to vary in 𝒢11. The row graph GR is fixed and derived from the incidence matrix W corresponding with the neighborhood structure of the U.S. states. More explicitly, each state is associated with a vertex in GR. Two states are linked by an edge in GR if they share a common border. The Supplementary Materials describe the neighborhood structure of GR and provide its decomposition into maximal prime subgraphs. What is interesting about the graph GR, and relevant in light of the results of Section 2.3, is that its decomposition yields 13 maximal prime subgraphs. Most of these prime components are complete and only contain two or three vertices; however, one maximal prime subgraph has 36 vertices (states), making GR non-decomposable with a sizable maximal prime subgraph. The degrees of freedom for the G-Wishart priors are set as δR = δC = 3, while the centering matrices are chosen as DC = IpC and DR = (EW − ρW)⁻¹.

The third model, which we call model FULL, is obtained by keeping the column graph GC fixed to the full graph. This is equivalent to replacing the G-Wishart prior for (zKC) from Equation (25) with a Wishart prior WispC (δC, DC). Finally, model MCAR is obtained from model FULL by substituting the G-Wishart prior for KR from Equation (25) with KR = EWρW. The resulting prior distribution for the spatial random effects in model MCAR is precisely the MCAR(ρ, Σ) prior of Gelfand and Vounatsou (2003). In all cases, we complete the model specification by choosing the prior from Equation (26) for the spatial autocorrelation parameter ρ and a multivariate normal prior for the mean rates vector μ ~ NpC (μ0, Ω−1) where μ0 = μ01pC and Ω = ω−2IpC. We set μ0 to be the median log incidence rate across all cancers and regions, and ω to be twice the interquartile range in raw log incidence rates.

Posterior inferences for models GGM-U and GGM-S are obtained by extending the sampling algorithm from Appendix B; details are presented in the Supplemental Materials. MCMC samplers for models FULL and MCAR are similarly derived in a straightforward manner. Missing counts were sampled from their corresponding predictive distributions. We monitored the chains to ensure convergence for each model. We set σm,R = σg,R = 0.2 and σm,C = σg,C = 0.5, which achieved rejection rates of roughly 0.3 when updating elements of both KR or KC. Furthermore, the average acceptance rate of a reversible jump move for the column graph GC was around 0.22 for both GGM-U and GGM-S.

To assess the out-of-sample predictive performance of the four models, we performed a 10-fold cross-validation exercise. The exercise was run by randomly dividing the non-missing counts (those above 25) in ten bins. For each bin j, we used the samples from the other nine bins as data and treated the samples from bin j as missing. In this comparison, the MCMC sampler for each of the four models was run ten times for 160,000 iterations and the first half of the run was discarded as burn-in. In the sequel, we denote the predicted counts for model ℳ ∈ {GGM-U, GGM-S, FULL, MCAR} by Yijcv(M). We employ the goodness-of-fit (mean squared error of the posterior predictive mean or MSE) and the variability/penalty (mean predictive error or VAR) terms of Gelfand and Ghosh (1998), that is,

MSE(ℳ) = (1/|{(i, j) : Yij ≥ 25}|) ∑_{(i, j) : Yij ≥ 25} (Eℳ[Yijcv(ℳ)] − Yij)²,   VAR(ℳ) = (1/|{(i, j) : Yij ≥ 25}|) ∑_{(i, j) : Yij ≥ 25} Varℳ[Yijcv(ℳ)],

to compare the predictive mean and variance of the four models. We also calculated the ranked probability score (RPS) (discussed by Czado, Gneiting, and Held 2009 in the context of count data), which measures the accuracy of the entire predictive distribution. The results, which are summarized in Table 4, reveal an interesting progression in the out-of-sample predictive performance of the four methods. Under either the MSE, VAR, or the RPS criteria, model MCAR performs considerably worse than model FULL, which in turn is outperformed by both models GGM-U and GGM-S. Also, the effect of the choice of prior on the column graph space appears negligible as GGM-U and GGM-S have roughly the same predictive performance. It is noteworthy to mention that the improvement in the RPS in the sequence of models MCAR, FULL, GGM-U, and GGM-S comes both because of a better prediction of the means as well as because the predictive distributions become sharper—see the article by Gneiting and Raftery (2007) for a discussion of the trade-off between sharpness and calibration in the formation of predictive distributions. The dramatic improvement in predictive performance that results from moving from model MCAR to model FULL is the result of using the G-Wishart distribution to allow greater flexibility in the spatial interactions over the CAR specification suggested by Gelfand and Vounatsou (2003). On the other hand, the additional gain yielded by the use of models GGM-U or GGM-S versus model FULL would seem to come from the increased parsimony of the GGM models.
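
As a reference for these criteria, the following R sketch (ours) computes MSE, VAR, and a draw-based estimate of the RPS from a matrix of posterior predictive samples; the RPS estimator uses the standard identity E|Y* − y| − 0.5 E|Y* − Y**| for integer-valued predictive distributions.

```r
# Illustrative R sketch (ours): cross-validation scores from an S x N matrix 'pred' of
# posterior predictive draws for the N held-out counts 'y'.  The RPS is estimated from
# the draws via E|Y* - y| - 0.5 E|Y* - Y**| (thin the draws first if S is very large).
predictive_scores <- function(pred, y) {
  mse <- mean((colMeans(pred) - y)^2)                     # squared error of predictive means
  var_pen <- mean(apply(pred, 2, var))                    # average predictive variance
  rps <- mean(sapply(seq_along(y), function(j) {
    draws <- pred[, j]
    mean(abs(draws - y[j])) - 0.5 * mean(abs(outer(draws, draws, "-")))
  }))
  c(MSE = mse, VAR = var_pen, RPS = rps)
}
```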

Table 4.

Ten-fold cross-validation predictive scores in the U.S. cancer mortality example. Model GGM-S is best with respect to fit (MSE) and ranked probability score (RPS), while model GGM-U is best with respect to variability (VAR) of the out-of-sample predicted counts

Model MSE VAR RPS
GGM-U 17,379.9 23,685.2 62.1
GGM-S 16,979.8 24,361.1 61.8
FULL 18,959.6 24,530.4 63.2
MCAR 19,211.1 47,568.7 76.7

We also conducted an in-sample comparison of the four models. For each model, we ran ten Markov chains for 160,000 iterations and discarded the first half as burn-in; Figure 2 is indicative of the performance of the MCMC sampler we employed. These runs used the entire dataset (with the exception of counts below 25, which are still treated as missing), rather than the cross-validation datasets. The Supplemental Materials give detailed tables of fitted values and 95% credible intervals for all counts in the dataset obtained using each of the four models. Table 5 summarizes these results by presenting the empirical coverage rates and the average lengths of these in-sample 95% credible intervals for each of the four models. We see that the methods based on the G-Wishart priors for the spatial component (GGM-U, GGM-S, FULL) have very similar coverage probabilities that are close to the nominal 95%. These three models also tend to have relatively short credible intervals, with GGM-U and GGM-S presenting the smallest (and almost identical) values. In contrast, model MCAR returns credible intervals that are too wide, containing 99% of all observed values. As with the out-of-sample exercise, these results suggest that both the additional flexibility in modeling the spatial dependence and the sharper estimates of the dependence across cancer types provided by the matrix-variate GGMs are key to prevent overfitting.

Figure 2.

Convergence plot of the average size of the column graph GC by log iteration for model GGM-U (left panel) and model GGM-S (right panel).

Table 5.

Empirical coverage rates and mean lengths of the in-sample 95% credible intervals for the U.S. cancer mortality example obtained using the four Poisson multivariate log-linear models

Model Coverage rate Mean length
GGM-U 0.960 65.464
GGM-S 0.964 65.474
FULL 0.964 65.885
MCAR 0.990 69.239

Finally, we compare the estimates of the column graph GC under GGM-U and GGM-S to assess the sensitivity of the model to the prior on graph space. Figure 3 shows a heatmap of the posterior edge inclusion probabilities in the column graph GC, with the lower triangle corresponding to GGM-U and the upper triangle to GGM-S. Overall, the uniform and size-based priors on 𝒢11 have very similar effects on the frequency with which edges are added to or deleted from the column graph. From the left panel of Figure 2 we see that the average size of the column graph GC in GGM-U is 24.8 edges, while the right panel shows that the average graph size in model GGM-S is 22.5 edges. Both of these numbers are quite close to 27.5, the average graph size under both the uniform prior and the size-based prior on 𝒢11. This suggests that graph selection shows little sensitivity to the choice of prior.
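For concreteness, edge inclusion probabilities and average graph sizes of this kind can be estimated from the retained graph samples as in the sketch below; the adjacency-array representation and the function name edge_summaries are assumptions made for this illustration.

    import numpy as np

    def edge_summaries(adjacency_draws):
        """Posterior edge-inclusion probabilities and average graph size from
        retained MCMC samples of a graph.

        adjacency_draws : array (S, p, p) of symmetric 0/1 adjacency matrices
        """
        incl_prob = adjacency_draws.mean(axis=0)          # entrywise Monte Carlo average
        p = incl_prob.shape[0]
        iu = np.triu_indices(p, k=1)                      # indices of the upper triangle
        avg_size = adjacency_draws[:, iu[0], iu[1]].sum(axis=1).mean()
        return incl_prob, avg_size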

Figure 3.

Edge inclusion probabilities for model GGM-U (lower triangle) and model GGM-S (upper triangle) in the U.S. cancer mortality example. The acronyms used are explained in the Supplementary Materials.

7. DISCUSSION

In this article we have developed and illustrated a surprisingly powerful computational approach for multivariate and matrix-variate GGMs that scales well to problems with a moderate number of variables. Convergence seems to be achieved quickly (usually within the first 5000 iterations) and acceptance rates are within reasonable ranges and can be easily controlled. In the context of the sparse multivariate models for areal data discussed in Section 5, the running times, although longer than for conventional MCAR models, are still short enough to make routine implementation feasible (e.g., in the cancer surveillance example from Section 6.2, about 22 hours on a dual-core 2.8 GHz computer running Linux). However, computations can still be quite challenging when the number of units in the lattice is very large. We plan to explore more efficient implementations that exploit the sparse structure of the Cholesky decompositions of precision matrices induced by GGMs.

We believe that our application to the construction of spatial models for lattice data makes for a particularly appealing illustration. On one hand, the U.S. cancer mortality example we have considered suggests that the additional flexibility in the spatial correlation structure provided by our approach is necessary to accurately model some spatial datasets. Indeed, our approach allows for "nonstationary" CAR models, where the spatial autocorrelation and spatial smoothing parameters vary spatially. On the other hand, to the best of our knowledge, ours is the first approach in the literature for constructing and estimating potentially sparse MCAR models, particularly within a Bayesian framework. In addition to providing insights into the mechanisms underlying the data-generation process, the model offers drastically improved predictive performance and sharper estimation of model parameters.

SUPPLEMENTARY MATERIALS

Code

Computer code to run all routines discussed in this article, as well as the U.S. cancer mortality data. (DLR-archive.tar.gz.zip, GNU zipped tar file)

Results

Theoretical results related to conditional independence relationships in sparse latent Gaussian processes; the description of the MCMC algorithm used to fit the sparse multivariate spatial Poisson count model from Section 6.2; tables describing the U.S. cancer mortality data, the graphical model constructed from the neighborhood structure of the United States, as well as fitted values and 95% credible intervals for the models considered in the article in Section 6.2. (DLR-supplement.pdf, PDF file)

Acknowledgments

The first author was supported in part by the National Science Foundation (DMS 1120255). The second author gratefully acknowledges support by the joint research project “Spatio/Temporal Graphical Models and Applications in Image Analysis” funded by the German Science Foundation (DFG), grant GRK 1653, as well as the MAThematics Centre Heidelberg (MATCH). The work of the third author was supported in part by the National Science Foundation (DMS 0915272) and the National Institutes of Health (R01GM090201-01). The authors thank the editor, the associate editor, and three reviewers for helpful comments.

APPENDIX A: DETAILS OF THE REVERSIBLE JUMP MCMC SAMPLER FOR GGMS

The MCMC algorithm from Section 3 sequentially updates the elements of the precision matrix and the edges of the underlying graph as follows. We denote the current state of the chain by (K[s], G[s]), K[s] ∈ P_{G[s]}. Its next state (K[s+1], G[s+1]), K[s+1] ∈ P_{G[s+1]}, is generated by sequentially performing the following two steps. We make use of two strictly positive precision parameters σ_m and σ_g that remain fixed at some suitable small values. We assume that the ordering of the variables has been changed according to a permutation υ selected at random from the uniform distribution on ϒ_p. We denote by (U + D0)^{-1} = (Q*)^T Q* the Cholesky decomposition of (U + D0)^{-1}, where the rows and columns of this matrix have been permuted according to υ.

We denote by nbd_p^+(G) the set of graphs that can be obtained by adding an edge to a graph G ∈ 𝒢p and by nbd_p^-(G) the set of graphs obtained by deleting an edge from G. We call the one-edge-away set of graphs nbd_p(G) = nbd_p^+(G) ∪ nbd_p^-(G) the neighborhood of G in 𝒢p. These neighborhoods connect any two graphs in 𝒢p through a sequence of graphs such that any two consecutive graphs in the sequence are neighbors of each other.
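To make these neighborhoods concrete, the sketch below enumerates nbd_p^+(G) and nbd_p^-(G) for a graph stored as a set of edges; the edge-set representation and the function name graph_neighborhoods are assumptions of this illustration.

    from itertools import combinations

    def graph_neighborhoods(edges, p):
        """One-edge-away neighborhoods of an undirected graph on vertices 0..p-1.

        edges : set of frozenset pairs {i, j} giving the current edge set.
        Returns (nbd_plus, nbd_minus): the graphs obtained by adding one missing
        edge and by deleting one existing edge, each given by its edge set.
        """
        all_pairs = {frozenset(e) for e in combinations(range(p), 2)}
        nbd_plus = [edges | {e} for e in all_pairs - edges]   # add one absent edge
        nbd_minus = [edges - {e} for e in edges]              # delete one present edge
        return nbd_plus, nbd_minus

    # Example: the path 0-1-2 on p = 3 vertices has 1 add-move and 2 delete-moves
    path = {frozenset({0, 1}), frozenset({1, 2})}
    plus, minus = graph_neighborhoods(path, 3)
    print(len(plus), len(minus))    # prints: 1 2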

Step 1: Resample the graph

We sample a candidate graph G′ ∈ nbdp(G[s]) from the proposal

\[
q(G' \mid G^{[s]}) = \tfrac{1}{2}\,\mathrm{Uni}\!\left(\mathrm{nbd}_p^{+}(G^{[s]})\right) + \tfrac{1}{2}\,\mathrm{Uni}\!\left(\mathrm{nbd}_p^{-}(G^{[s]})\right), \tag{A.1}
\]

where Uni(A) represents the uniform distribution on the discrete set A. The distribution (A.1) gives an equal probability of proposing to delete an edge from the current graph and of proposing to add an edge to the current graph. We favor (A.1) over the more usual proposal distribution Uni(nbdp(G[s])) that is employed, for example, by Madigan and York (1995). If G[s] contains a very large or a very small number of edges, the probability of proposing a move that adds or, respectively, deletes an edge from G[s] is extremely small when sampling from Uni(nbdp(G[s])), which could lead to poor mixing in the resulting Markov chain.
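A direct way to sample from (A.1) is sketched below; the function name propose_graph and the edge-set representation are assumptions of this illustration, and at the boundary cases of an empty or complete graph the sketch simply falls back to the only available move type.

    import random
    from itertools import combinations

    def propose_graph(edges, p):
        """Sample a candidate from proposal (A.1): flip a fair coin to choose
        between the add-move and delete-move sets, then pick uniformly within
        the chosen set."""
        all_pairs = {frozenset(e) for e in combinations(range(p), 2)}
        absent, present = list(all_pairs - edges), list(edges)
        if (random.random() < 0.5 and absent) or not present:
            return edges | {random.choice(absent)}, "add"
        return edges - {random.choice(present)}, "delete"

    # Example: start from a 4-vertex graph with edges {0,1} and {2,3}
    G = {frozenset({0, 1}), frozenset({2, 3})}
    print(propose_graph(G, 4))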

We assume that the candidate G′ sampled from (A.1) is obtained by adding the edge (i0, j0), i0 < j0, to G[s]. Since G′ ∈ nbd_p^+(G[s]) we have G[s] ∈ nbd_p^-(G′). We consider the decomposition of the current precision matrix K[s] = (Q*)^T ((Ψ[s])^T Ψ[s])Q* with (Ψ[s])^{ν(G[s])} ∈ M^{ν(G[s])}. Since the vertex i0 has one additional neighbor in G′, we have d_{i0}^{G′} = d_{i0}^{G[s]}, d_{j0}^{G′} = d_{j0}^{G[s]} + 1, v_{i0}^{G′} = v_{i0}^{G[s]} + 1, v_{j0}^{G′} = v_{j0}^{G[s]}, and ν(G′) = ν(G[s]) ∪ {(i0, j0)}. We define an upper triangular matrix Ψ′ such that Ψ′_{ij} = Ψ^{[s]}_{ij} for (i, j) ∈ ν(G[s]). We sample γ ~ N(Ψ^{[s]}_{i0 j0}, σ_g^2) and set Ψ′_{i0 j0} = γ. The rest of the elements of Ψ′ are determined from (Ψ′)^{ν(G′)} through the completion operation. The value of the free element Ψ′_{i0 j0} was set by perturbing the non-free element Ψ^{[s]}_{i0 j0}. The other free elements of Ψ′ and Ψ[s] coincide.

We take K′ = (Q*)^T ((Ψ′)^T Ψ′)Q*. Since (Ψ′)^{ν(G′)} ∈ M^{ν(G′)}, we have K′ ∈ P_{G′}. The dimensionality of the parameter space increases by 1 as we move from (K[s], G[s]) to (K′, G′). Since (Ψ′)^{ν(G[s])} = (Ψ[s])^{ν(G[s])}, the Jacobian of the transformation from ((Ψ[s])^{ν(G[s])}, γ) to (Ψ′)^{ν(G′)} is equal to 1. Moreover, Ψ[s] and Ψ′ have the same diagonal elements, hence det K[s] = ∏_{i=1}^{p} (Q*_{ii} Ψ^{[s]}_{ii})^2 = det K′. Following Green (1995), the Markov chain moves to (K′, G′) with probability min{R_g^+, 1}, where

\[
R_g^{+} = \sigma_g \sqrt{2\pi}\, Q^*_{i_0 i_0} Q^*_{j_0 j_0} \Psi^{[s]}_{i_0 i_0}\, \frac{I_{G^{[s]}}(\delta_0, D_0)}{I_{G'}(\delta_0, D_0)}\, \frac{\Pr(G')}{\Pr(G^{[s]})}\, \frac{\left|\mathrm{nbd}_p^{+}(G^{[s]})\right|}{\left|\mathrm{nbd}_p^{-}(G')\right|} \exp\!\left\{ -\frac{1}{2}\left[ \left\langle K' - K^{[s]}, U + D_0 \right\rangle - \frac{\left(\Psi'_{i_0 j_0} - \Psi^{[s]}_{i_0 j_0}\right)^2}{\sigma_g^2} \right] \right\}. \tag{A.2}
\]

Otherwise the chain stays at (K[s], G[s]).
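To make the bookkeeping concrete, the sketch below evaluates (A.2) on the log scale. It is only a transcription of the formula: the ratio of G-Wishart normalizing constants and the graph-prior ratio must be supplied by the caller (the former could, for instance, be estimated with the Monte Carlo method of Atay-Kayis and Massam 2005), and the function and argument names are assumptions of this sketch.

    import numpy as np

    def log_Rg_plus(sigma_g, Q, Psi_s, K_prime, K_s, U, D0,
                    log_I_ratio, log_prior_ratio, n_plus, n_minus_prime,
                    i0, j0, psi_new):
        """log of the add-edge acceptance ratio R_g^+ in (A.2).

        Q               : upper-triangular Cholesky factor Q* of (U + D0)^{-1}
        Psi_s           : current upper-triangular matrix Psi^[s]
        K_prime, K_s    : candidate and current precision matrices
        log_I_ratio     : log I_{G^[s]}(delta0, D0) - log I_{G'}(delta0, D0),
                          estimated separately
        log_prior_ratio : log Pr(G') - log Pr(G^[s])
        n_plus          : |nbd_p^+(G^[s])|; n_minus_prime : |nbd_p^-(G')|
        psi_new         : the sampled free element Psi'_{i0 j0}
        """
        inner = np.sum((K_prime - K_s) * (U + D0))             # <K' - K^[s], U + D0>
        quad = (psi_new - Psi_s[i0, j0]) ** 2 / sigma_g ** 2
        return (np.log(sigma_g) + 0.5 * np.log(2.0 * np.pi)
                + np.log(Q[i0, i0]) + np.log(Q[j0, j0]) + np.log(Psi_s[i0, i0])
                + log_I_ratio + log_prior_ratio
                + np.log(n_plus) - np.log(n_minus_prime)
                - 0.5 * (inner - quad))

The move is then accepted with probability min{exp(log_Rg_plus(...)), 1}; the delete-edge ratio (A.3) follows analogously with the leading factor inverted and the sign of the quadratic term flipped.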

Next we assume that the candidate G′ is obtained by deleting the edge (i0, j0) from G[s]. We have d_{i0}^{G′} = d_{i0}^{G[s]}, d_{j0}^{G′} = d_{j0}^{G[s]} − 1, v_{i0}^{G′} = v_{i0}^{G[s]} − 1, v_{j0}^{G′} = v_{j0}^{G[s]}, and ν(G′) = ν(G[s]) \ {(i0, j0)}. We define an upper triangular matrix Ψ′ such that Ψ′_{ij} = Ψ^{[s]}_{ij} for (i, j) ∈ ν(G′). The rest of the elements of Ψ′ are determined through completion. The free element Ψ^{[s]}_{i0 j0} becomes non-free in Ψ′, hence the dimensionality of the parameter space decreases by 1 as we move from (Ψ[s])^{ν(G[s])} to (Ψ′)^{ν(G′)} ∈ M^{ν(G′)}. As before, we take K′ = (Q*)^T ((Ψ′)^T Ψ′)Q*. The acceptance probability of the transition from (K[s], G[s]) to (K′, G′) is min{R_g^-, 1}, where

\[
R_g^{-} = \left( \sigma_g \sqrt{2\pi}\, Q^*_{i_0 i_0} Q^*_{j_0 j_0} \Psi^{[s]}_{i_0 i_0} \right)^{-1} \frac{I_{G^{[s]}}(\delta_0, D_0)}{I_{G'}(\delta_0, D_0)}\, \frac{\Pr(G')}{\Pr(G^{[s]})}\, \frac{\left|\mathrm{nbd}_p^{-}(G^{[s]})\right|}{\left|\mathrm{nbd}_p^{+}(G')\right|} \exp\!\left\{ -\frac{1}{2}\left[ \left\langle K' - K^{[s]}, U + D_0 \right\rangle + \frac{\left(\Psi'_{i_0 j_0} - \Psi^{[s]}_{i_0 j_0}\right)^2}{\sigma_g^2} \right] \right\}. \tag{A.3}
\]

We denote by (K[s+1/2], G[s+1]), K[s+1/2] ∈ P_{G[s+1]}, the state of the chain at the end of this step.

Step 2: Resample the precision matrix

Given the updated graph G[s+1], we update the precision matrix K[s+1/2] = (Q*)^T ((Ψ[s+1/2])^T Ψ[s+1/2])Q* by sequentially perturbing the free elements (Ψ[s+1/2])^{ν(G[s+1])}. For each such element, we perform one iteration of the Metropolis–Hastings algorithm from Section 2.2 with δ = n + δ0, D = U + D0, and Q = Q*. The standard deviation of the normal proposals is σ_m. We denote by K[s+1] ∈ P_{G[s+1]} the precision matrix obtained after all the updates have been performed.

APPENDIX B: DETAILS OF THE REVERSIBLE JUMP MCMC SAMPLER FOR MATRIX-VARIATE GGMS

Our sampling scheme from the joint posterior distribution (15) is composed of the following five steps that describe the transition of the Markov chain from its current state (K_R^{[s]}, G_R^{[s]}, K_C^{[s]}, G_C^{[s]}, z^{[s]}) to its next state (K_R^{[s+1]}, G_R^{[s+1]}, K_C^{[s+1]}, G_C^{[s+1]}, z^{[s+1]}). We use four strictly positive precision parameters σ_{m,R}, σ_{m,C}, σ_{g,R}, and σ_{g,C}.

Step 1: Resample the row graph

We denote n_R = n p_C and U_R = Σ_{j=1}^{n} x^{(j)} K_C^{[s]} (x^{(j)})^T. We generate a random permutation υ_R ∈ ϒ_{p_R} of the row indices V_{p_R} and reorder the rows and columns of the matrix U_R + D_R according to υ_R. We determine the Cholesky decomposition (U_R + D_R)^{-1} = (Q_R^*)^T Q_R^*. We proceed as described in Step 1 of Appendix A. Given the notations from Appendix A, we take p = p_R, n = n_R, U = U_R, δ0 = δ_R, D0 = D_R, and σ_g = σ_{g,R}. We denote the updated row precision matrix and graph by (K_R^{[s+1/2]}, G_R^{[s+1]}), K_R^{[s+1/2]} ∈ P_{G_R^{[s+1]}}.

Step 2: Resample the row precision matrix

We denote n_R = n p_C and U_R = Σ_{j=1}^{n} x^{(j)} K_C^{[s]} (x^{(j)})^T. We determine the Cholesky decomposition (U_R + D_R)^{-1} = (Q_R^*)^T Q_R^* after permuting the rows and columns of U_R + D_R according to a random ordering in ϒ_{p_R}. The conditional distribution of K_R ∈ P_{G_R^{[s+1]}} is G-Wishart Wis_{G_R^{[s+1]}}(n_R + δ_R, U_R + D_R). We make the transition from K_R^{[s+1/2]} to K_R^{[s+1]} ∈ P_{G_R^{[s+1]}} using the Metropolis–Hastings updates described in Section 2.2. Given the notations we used in that section, we take p = p_R, δ = n_R + δ_R, D = U_R + D_R, Q = Q_R^*, and σ_m = σ_{m,R}.

Step 3: Resample the column graph

We denote n_C = n p_R and U_C = Σ_{j=1}^{n} (x^{(j)})^T K_R^{[s+1]} x^{(j)}. We sample a candidate column graph G_C′ ∈ nbd_{p_C}(G_C^{[s]}) from the proposal

\[
q(G_C' \mid G_C^{[s]}, z^{[s]}) = \frac{1}{2} \frac{(z^{[s]})^{v(G_C')}}{\sum_{G_C'' \in \mathrm{nbd}_{p_C}^{+}(G_C^{[s]})} (z^{[s]})^{v(G_C'')}} \mathbf{1}\!\left\{ G_C' \in \mathrm{nbd}_{p_C}^{+}(G_C^{[s]}) \right\} + \frac{1}{2} \frac{(z^{[s]})^{v(G_C')}}{\sum_{G_C'' \in \mathrm{nbd}_{p_C}^{-}(G_C^{[s]})} (z^{[s]})^{v(G_C'')}} \mathbf{1}\!\left\{ G_C' \in \mathrm{nbd}_{p_C}^{-}(G_C^{[s]}) \right\}, \tag{B.1}
\]

where 1{A} is equal to 1 if A holds and is 0 otherwise. The proposal (B.1) gives an equal probability that the candidate graph is obtained by adding or deleting an edge from the current graph.
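A sketch of a sampler for (B.1) follows; the function name propose_column_graph and the edge-set representation are assumptions of this illustration. Note that every candidate within a given one-edge-away move set has the same number of edges, so the z-weights are equal within the set and the within-set draw reduces to a uniform one; the general weighted form is nevertheless kept here for clarity.

    import numpy as np
    from itertools import combinations

    def propose_column_graph(edges, p_C, z, rng=None):
        """Sample a candidate from proposal (B.1): choose the add or delete
        move set with probability 1/2 each, then draw a graph within the set
        with probability proportional to z^{v(G')}."""
        rng = rng or np.random.default_rng()
        all_pairs = {frozenset(e) for e in combinations(range(p_C), 2)}
        nbd_plus = [edges | {e} for e in all_pairs - edges]
        nbd_minus = [edges - {e} for e in edges]
        moves = nbd_plus if (rng.random() < 0.5 and nbd_plus) or not nbd_minus else nbd_minus
        weights = np.array([float(z) ** len(g) for g in moves])    # z^{v(G')}
        probs = weights / weights.sum()
        return moves[rng.choice(len(moves), p=probs)]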

We assume that G_C′ is obtained by adding the edge (i0, j0) to G_C^{[s]}. We generate a random permutation υ_C ∈ ϒ_{p_C}(1,1) of the row indices V_{p_C} and reorder the rows and columns of the matrix U_C + z^{[s]} D_C according to υ_C. The permutation υ_C is such that υ_C(1) = 1, hence the (1, 1) element of K_C^{[s]} remains in the same position. We determine the Cholesky decomposition (U_C + z^{[s]} D_C)^{-1} = (Q_C^*)^T Q_C^*. We consider the decomposition of the column precision matrix K_C^{[s]} = (Q_C^*)^T (Ψ_C^{[s]})^T Ψ_C^{[s]} Q_C^* with (Ψ_C^{[s]})^{v(G_C^{[s]})} ∈ M^{v(G_C^{[s]})}. We define an upper triangular matrix Ψ_C′ such that (Ψ_C′)_{ij} = (Ψ_C^{[s]})_{ij} for (i, j) ∈ v(G_C^{[s]}). We sample γ ~ N((Ψ_C^{[s]})_{i0 j0}, σ_{g,C}^2) and set (Ψ_C′)_{i0 j0} = γ. The rest of the elements of Ψ_C′ are determined from (Ψ_C′)^{v(G_C′)} through the completion operation (Atay-Kayis and Massam 2005, lemma 2). We consider the candidate column precision matrix

\[
K_C' = (Q_C^*)^T (\Psi_C')^T \Psi_C' Q_C^*. \tag{B.2}
\]

We know that K_C^{[s]} ∈ P_{G_C^{[s]}} must satisfy (K_C^{[s]})_{11} = 1. The last equality implies (Ψ_C^{[s]})_{11} = 1/(Q_C^*)_{11}, hence (Ψ_C′)_{11} = 1/(Q_C^*)_{11}. Therefore we have K_C′ ∈ P_{G_C′} and (K_C′)_{11} = 1.

We make the transition from (K_C^{[s]}, G_C^{[s]}) to (K_C′, G_C′) with probability min{R_C^+, 1}, where

\[
R_C^{+} = \sigma_{g,C} \sqrt{2\pi}\, (Q_C^*)_{i_0 i_0} (Q_C^*)_{j_0 j_0} (\Psi_C^{[s]})_{i_0 i_0}\, \frac{I_{G_C^{[s]}}(\delta_C, D_C)}{I_{G_C'}(\delta_C, D_C)}\, \frac{\Pr(G_C')}{\Pr(G_C^{[s]})}\, \frac{\sum_{G_C'' \in \mathrm{nbd}_{p_C}^{+}(G_C^{[s]})} (z^{[s]})^{v(G_C'')}}{\sum_{G_C'' \in \mathrm{nbd}_{p_C}^{-}(G_C')} (z^{[s]})^{v(G_C'')}} \exp\!\left\{ -\frac{1}{2}\left[ \left\langle K_C' - K_C^{[s]}, U_C + z^{[s]} D_C \right\rangle - \frac{\left((\Psi_C')_{i_0 j_0} - (\Psi_C^{[s]})_{i_0 j_0}\right)^2}{\sigma_{g,C}^2} \right] \right\}. \tag{B.3}
\]

Next we assume that G_C′ is obtained by deleting the edge (i0, j0) from G_C^{[s]}. We define an upper triangular matrix Ψ_C′ such that (Ψ_C′)_{ij} = (Ψ_C^{[s]})_{ij} for (i, j) ∈ v(G_C′). The candidate K_C′ is obtained from Ψ_C′ as in Equation (B.2). We make the transition from (K_C^{[s]}, G_C^{[s]}) to (K_C′, G_C′) with probability min{R_C^-, 1}, where

\[
R_C^{-} = \left\{ \sigma_{g,C} \sqrt{2\pi}\, (Q_C^*)_{i_0 i_0} (Q_C^*)_{j_0 j_0} (\Psi_C^{[s]})_{i_0 i_0} \right\}^{-1} \frac{I_{G_C^{[s]}}(\delta_C, D_C)}{I_{G_C'}(\delta_C, D_C)}\, \frac{\Pr(G_C')}{\Pr(G_C^{[s]})}\, \frac{\sum_{G_C'' \in \mathrm{nbd}_{p_C}^{-}(G_C^{[s]})} (z^{[s]})^{v(G_C'')}}{\sum_{G_C'' \in \mathrm{nbd}_{p_C}^{+}(G_C')} (z^{[s]})^{v(G_C'')}} \exp\!\left\{ -\frac{1}{2}\left[ \left\langle K_C' - K_C^{[s]}, U_C + z^{[s]} D_C \right\rangle + \frac{\left((\Psi_C')_{i_0 j_0} - (\Psi_C^{[s]})_{i_0 j_0}\right)^2}{\sigma_{g,C}^2} \right] \right\}. \tag{B.4}
\]

We denote the updated column precision matrix and graph by (K_C^{[s+1/2]}, G_C^{[s+1]}).

Step 4: Resample the column precision matrix

We denote n_C = n p_R and U_C = Σ_{j=1}^{n} (x^{(j)})^T K_R^{[s+1]} x^{(j)}. We determine the Cholesky decomposition (U_C + z^{[s]} D_C)^{-1} = (Q_C^*)^T Q_C^* after permuting the rows and columns of U_C + z^{[s]} D_C according to a random ordering in ϒ_{p_C}(1,1). The conditional distribution of K_C ∈ P_{G_C^{[s+1]}} with (K_C)_{11} = 1 is G-Wishart Wis_{G_C^{[s+1]}}(n_C + δ_C, U_C + z^{[s]} D_C). We make the transition from K_C^{[s+1/2]} to K_C^{[s+1]} ∈ P_{G_C^{[s+1]}} using the Metropolis–Hastings updates from Section 2.2. Given the notations we used in that section, we take p = p_C, δ = n_C + δ_C, D = U_C + z^{[s]} D_C, Q = Q_C^*, and σ_m = σ_{m,C}. The constraint (K_C)_{11} = 1 is accommodated as described at the end of Section 2.2.

Step 5: Resample the auxiliary variable

The conditional distribution of z > 0 is

\[
\mathrm{Gamma}\!\left( \frac{p_C(\delta_C - 2)}{2} + v\!\left(G_C^{[s+1]}\right),\; \frac{1}{2}\,\mathrm{tr}\!\left(K_C^{[s+1]} D_C\right) \right). \tag{B.5}
\]

Here Gamma(α, β) denotes the distribution with density f(x | α, β) ∝ β^α x^{α−1} exp(−βx). We sample z^{[s+1]} from (B.5).
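Drawing from (B.5) is straightforward; the short sketch below does so with numpy, the only point of care being that the text's Gamma(α, β) is parameterized by shape and rate, whereas numpy's sampler takes shape and scale. The function name and argument layout are assumptions of this sketch.

    import numpy as np

    def sample_z(K_C, D_C, delta_C, n_edges, rng=None):
        """Draw the auxiliary variable z from its Gamma full conditional (B.5).
        n_edges is v(G_C^[s+1]), the number of edges in the current column graph."""
        rng = rng or np.random.default_rng()
        p_C = K_C.shape[0]
        shape = p_C * (delta_C - 2) / 2.0 + n_edges        # alpha in (B.5)
        rate = 0.5 * np.trace(K_C @ D_C)                   # beta in (B.5)
        return rng.gamma(shape, scale=1.0 / rate)          # numpy uses scale = 1/rate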

Contributor Information

Adrian Dobra, Email: adobra@uw.edu, Assistant Professor, Departments of Statistics, Biobehavioral Nursing, and Health Systems and the Center for Statistics and the Social Sciences, Box 354322, University of Washington, Seattle, WA 98195.

Alex Lenkoski, Email: alex.lenkoski@uni-heidelberg.de, Postdoctoral Research Fellow, Institut für Angewandte Mathematik, Universität Heidelberg, 69115 Heidelberg, Germany.

Abel Rodriguez, Email: abel@soe.ucsc.edu, Assistant Professor, Department of Applied Mathematics and Statistics, University of California, Santa Cruz, CA 95064.

References

  1. Armstrong H. Bayesian Estimation of Decomposable Gaussian Graphical Models. PhD thesis, The University of New South Wales; 2005.
  2. Armstrong H, Carter CK, Wong KF, Kohn R. Bayesian Covariance Matrix Estimation Using a Mixture of Decomposable Graphical Models. Statistics and Computing. 2009;19:303–316.
  3. Asci C, Piccioni M. Functionally Compatible Local Characteristics for the Local Specification of Priors in Graphical Models. Scandinavian Journal of Statistics. 2007;34:829–840.
  4. Atay-Kayis A, Massam H. A Monte Carlo Method for Computing the Marginal Likelihood in Nondecomposable Gaussian Graphical Models. Biometrika. 2005;92:317–335.
  5. Banerjee O, El Ghaoui L, D’Aspremont A. Model Selection Through Sparse Maximum Likelihood Estimation for Multivariate Gaussian or Binary Data. Journal of Machine Learning Research. 2008;9:485–516.
  6. Banerjee S, Carlin BP, Gelfand AE. Hierarchical Modeling and Analysis for Spatial Data. Boca Raton, FL: Chapman & Hall/CRC; 2004.
  7. Besag J. Spatial Interaction and the Statistical Analysis of Lattice Systems (with discussion). Journal of the Royal Statistical Society, Ser. B. 1974;36:192–236.
  8. Besag J. A Candidate’s Formula: A Curious Result in Bayesian Prediction. Biometrika. 1989;76:183.
  9. Besag J, Kooperberg C. On Conditional and Intrinsic Autoregressions. Biometrika. 1995;82:733–746.
  10. Brook D. On the Distinction Between the Conditional Probability and the Joint Probability Approaches in the Specification of the Nearest Neighbour Systems. Biometrika. 1964;51:481–489.
  11. Brooks SP, Giudici P, Roberts GO. Efficient Construction of Reversible Jump Markov Chain Monte Carlo Proposal Distributions. Journal of the Royal Statistical Society, Ser. B. 2003;65:3–55.
  12. Carvalho CM, Scott JG. Objective Bayesian Model Selection in Gaussian Graphical Models. Biometrika. 2009;96:1–16.
  13. Carvalho CM, Massam H, West M. Simulation of Hyper-Inverse Wishart Distributions in Graphical Models. Biometrika. 2007;94:647–659.
  14. Chib S. Marginal Likelihood From the Gibbs Output. Journal of the American Statistical Association. 1995;90:1313–1321.
  15. Cressie NAC. Statistics for Spatial Data. New York: Wiley; 1993.
  16. Cressie NAC, Chan NH. Spatial Modeling of Regional Variables. Journal of the American Statistical Association. 1989;84:393–401.
  17. Czado C, Gneiting T, Held L. Predictive Model Assessment for Count Data. Biometrics. 2009;65:1254–1261. doi: 10.1111/j.1541-0420.2009.01191.x.
  18. Dawid AP, Lauritzen SL. Hyper Markov Laws in the Statistical Analysis of Decomposable Graphical Models. The Annals of Statistics. 1993;21:1272–1317.
  19. Dellaportas P, Giudici P, Roberts G. Bayesian Inference for Nondecomposable Graphical Gaussian Models. Sankhyā: The Indian Journal of Statistics. 2003;65:43–55.
  20. Dempster AP. Covariance Selection. Biometrics. 1972;28:157–175.
  21. Diaconis P, Ylvisaker D. Conjugate Priors for Exponential Families. The Annals of Statistics. 1979;7:269–281.
  22. Dobra A, Hans C, Jones B, Nevins JR, Yao G, West M. Sparse Graphical Models for Exploring Gene Expression Data. Journal of Multivariate Analysis. 2004;90:196–212.
  23. Drton M, Perlman MD. A SINful Approach to Gaussian Graphical Model Selection. Journal of Statistical Planning and Inference. 2008;138:1179–1200.
  24. Elliott P, Wakefield J, Best N, Briggs D. Spatial Epidemiology: Methods and Applications. New York: Oxford University Press; 2001.
  25. Friedman J, Hastie T, Tibshirani R. Sparse Inverse Covariance Estimation With the Graphical Lasso. Biostatistics. 2008;9:432–441. doi: 10.1093/biostatistics/kxm045.
  26. Gelfand AE, Ghosh SK. Model Choice: A Minimum Posterior Predictive Loss Approach. Biometrika. 1998;85:1–11.
  27. Gelfand AE, Vounatsou P. Proper Multivariate Conditional Autoregressive Models for Spatial Data Analysis. Biostatistics. 2003;4:11–25. doi: 10.1093/biostatistics/4.1.11.
  28. Gelfand AE, Diggle PJ, Fuentes M, Guttorp P. Handbook of Spatial Statistics. Boca Raton, FL: Chapman & Hall/CRC; 2010.
  29. Ghosh K, Tiwari RC. Prediction of U.S. Cancer Mortality Counts Using Semiparametric Bayesian Techniques. Journal of the American Statistical Association. 2007;102:7–15.
  30. Ghosh K, Ghosh P, Tiwari RC. Comment on “The Nested Dirichlet Process,” by A. Rodriguez, D. B. Dunson, and A. E. Gelfand. Journal of the American Statistical Association. 2008;103(483):1147–1149.
  31. Ghosh K, Tiwari RC, Feuer EJ, Cronin K, Jemal A. Predicting U.S. Cancer Mortality Counts Using State Space Models. In: Khattree R, Naik DN, editors. Computational Methods in Biomedical Research. Boca Raton, FL: Chapman & Hall/CRC; 2007. pp. 131–151.
  32. Giudici P, Green PJ. Decomposable Graphical Gaussian Model Determination. Biometrika. 1999;86:785–801.
  33. Gneiting T, Raftery AE. Strictly Proper Scoring Rules, Prediction, and Estimation. Journal of the American Statistical Association. 2007;102:359–378.
  34. Green PJ. Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination. Biometrika. 1995;82:711–732.
  35. Gupta AK, Nagar DK. Matrix Variate Distributions. Monographs and Surveys in Pure and Applied Mathematics, Vol. 104. London: Chapman & Hall/CRC; 2000.
  36. Jones B, Carvalho C, Dobra A, Hans C, Carter C, West M. Experiments in Stochastic Computation for High-Dimensional Graphical Models. Statistical Science. 2005;20:388–400.
  37. Lauritzen SL. Graphical Models. New York: Oxford University Press; 1996.
  38. Lawson AB. Bayesian Disease Mapping: Hierarchical Modeling in Spatial Epidemiology. Boca Raton, FL: Chapman & Hall/CRC; 2009.
  39. Lenkoski A, Dobra A. Computational Aspects Related to Inference in Gaussian Graphical Models With the G-Wishart Prior. Journal of Computational and Graphical Statistics. 2011;20:140–157.
  40. Letac G, Massam H. Wishart Distributions for Decomposable Graphs. The Annals of Statistics. 2007;35:1278–1323.
  41. Madigan D, York J. Bayesian Graphical Models for Discrete Data. International Statistical Review. 1995;63:215–232.
  42. Mardia KV. Multi-Dimensional Multivariate Gaussian Markov Random Fields With Application to Image Processing. Journal of Multivariate Analysis. 1988;24:265–284.
  43. Meinshausen N, Bühlmann P. High-Dimensional Graphs and Variable Selection With the Lasso. The Annals of Statistics. 2006;34:1436–1462.
  44. Mengersen KL, Tweedie RL. Rates of Convergence of the Hastings and Metropolis Algorithms. The Annals of Statistics. 1996;24:101–121.
  45. Mitsakakis N, Massam H, Escobar MD. A Metropolis–Hastings Based Method for Sampling From the G-Wishart Distribution in Gaussian Graphical Models. Electronic Journal of Statistics. 2011;5:18–30.
  46. Moghaddam B, Marlin B, Khan E, Murphy K. Accelerating Bayesian Structural Inference for Non-Decomposable Gaussian Graphical Models. In: Bengio Y, Schuurmans D, Lafferty J, Williams CKI, Culotta A, editors. Advances in Neural Information Processing Systems, Vol. 22. San Mateo, CA: Morgan Kaufmann; 2009. pp. 1285–1293.
  47. Muirhead RJ. Aspects of Multivariate Statistical Theory. New York: Wiley; 2005.
  48. Piccioni M. Independence Structure of Natural Conjugate Densities to Exponential Families and the Gibbs Sampler. Scandinavian Journal of Statistics. 2000;27:111–127.
  49. Ravikumar P, Wainwright MJ, Lafferty JD. High-Dimensional Ising Model Selection Using l1-Regularized Logistic Regression. The Annals of Statistics. 2010;38:1287–1319.
  50. Robert C, Casella G. Monte Carlo Statistical Methods. 2nd ed. New York: Springer-Verlag; 2004.
  51. Roverato A. Hyper Inverse Wishart Distribution for Non-Decomposable Graphs and Its Application to Bayesian Inference for Gaussian Graphical Models. Scandinavian Journal of Statistics. 2002;29:391–411.
  52. Rue H, Held L. Gaussian Markov Random Fields: Theory and Applications. Boca Raton, FL: Chapman & Hall/CRC; 2005.
  53. Scott JG, Berger JO. An Exploration of Aspects of Bayesian Multiple Testing. Journal of Statistical Planning and Inference. 2006;136:2144–2162.
  54. Scott JG, Carvalho CM. Feature-Inclusion Stochastic Search for Gaussian Graphical Models. Journal of Computational and Graphical Statistics. 2008;17:790–808.
  55. Speed TP, Kiiveri HT. Gaussian Markov Distributions Over Finite Graphs. The Annals of Statistics. 1986;14:138–150.
  56. Sun D, Tsutakawa RK, Kim H, He Z. Bayesian Analysis of Mortality Rates With Disease Maps. Statistics in Medicine. 2000;19:2015–2035. doi: 10.1002/1097-0258(20000815)19:15<2015::aid-sim422>3.0.co;2-e.
  57. Tarjan RE. Decomposition by Clique Separators. Discrete Mathematics. 1985;55:221–232.
  58. Tiwari RC, Ghosh K, Jemal A, Hachey M, Ward E, Thun MJ, Feuer EJ. A New Method for Predicting U.S. and State-Level Cancer Mortality Counts for the Current Calendar Year. CA: A Cancer Journal for Clinicians. 2004;54(1):30–40. doi: 10.3322/canjclin.54.1.30.
  59. Wang H, Carvalho CM. Simulation of Hyper-Inverse Wishart Distributions for Non-Decomposable Graphs. Electronic Journal of Statistics. 2010;4:1470–1475.
  60. Wang H, West M. Bayesian Analysis of Matrix Normal Graphical Models. Biometrika. 2009;96:821–834. doi: 10.1093/biomet/asp049.
  61. Wong F, Carter CK, Kohn R. Efficient Estimation of Covariance Selection Models. Biometrika. 2003;90:809–830.
  62. Yuan M, Lin Y. Model Selection and Estimation in the Gaussian Graphical Model. Biometrika. 2007;94:19–35.
