Author manuscript; available in PMC: 2023 Dec 14.
Published in final edited form as: Stat Med. 2022 Apr 6;41(16):3057–3075. doi: 10.1002/sim.9404

Hierarchical multivariate directed acyclic graph autoregressive models for spatial diseases mapping

Leiwen Gao 1, Abhirup Datta 2, Sudipto Banerjee 1
PMCID: PMC10719081  NIHMSID: NIHMS1906091  PMID: 35708210

Abstract

Disease mapping is an important statistical tool used by epidemiologists to assess geographic variation in disease rates and identify lurking environmental risk factors from spatial patterns. Such maps rely upon spatial models for regionally aggregated data, where neighboring regions tend to exhibit more similar outcomes than those farther apart. We contribute to the literature on multivariate disease mapping, which deals with measurements on multiple (two or more) diseases in each region. We aim to disentangle associations among the multiple diseases from spatial autocorrelation in each disease. We develop multivariate directed acyclic graphical autoregression models to accommodate spatial and inter-disease dependence. The hierarchical construction imparts flexibility and richness, interpretability of spatial autocorrelation and inter-disease relationships, and computational ease, but depends upon the order in which the diseases are modeled. To obviate this, we demonstrate how Bayesian model selection and averaging across orders are easily achieved using bridge sampling. We compare our method with a competitor using simulation studies and present an application to multiple cancer mapping using data from the Surveillance, Epidemiology, and End Results program.

Keywords: areal data analysis, Bayesian hierarchical models, directed acyclic graphical autoregression, multiple disease mapping, multivariate areal data models

1 |. INTRODUCTION

Spatially referenced data comprising regional aggregates of health outcomes over delineated administrative units such as counties or zip codes are widely used by epidemiologists to map mortality or morbidity rates and better understand their geographic variation. Disease mapping, as this exercise is customarily called, employs statistical models to present smoothed maps of rates or counts of a disease. Such maps can assist investigators in identifying lurking risk factors1 and in detecting “hot-spots” or spatial clusters emerging from common environmental and socio-demographic effects shared by neighboring regions. By interpolating estimates of health outcome from areal data onto a continuous surface, disease mapping also generates smoothed maps for the small-area scale, adjusting for the sparsity of data or low population size.2,3

For a single disease, there has been a long tradition of employing Markov random fields (MRFs)4 to introduce conditional dependence for the outcome in a region given its neighbors. Two conspicuous examples are the conditional autoregression (CAR)5,6 and simultaneous autoregression (SAR) models7 that build dependence using undirected graphs to model geographic maps. More recently, a class of directed acyclic graphical autoregressive (DAGAR) models was proposed as a preferred alternative to CAR or SAR models in allowing better identifiability and interpretation of spatial autocorrelation parameters.8

Multivariate disease mapping is concerned with the analysis of multiple diseases that are associated among themselves and across space. It is not uncommon to find substantial associations among different diseases sharing genetic and environmental risk factors. Quantification of genetic correlations among multiple cancers has revealed associations among several cancers including lung, breast, colorectal, ovarian, and pancreatic cancers.9 Disease mapping exercises with lung and esophageal cancers have also evinced associations among them.10 When the diseases are inherently related so that the prevalence of one encourages (or inhibits) occurrence of the other, there can be substantial inferential benefits in jointly modeling the diseases rather than fitting independent univariate models for each disease.10–21

The existence of multivariate MRFs can be demonstrated using a multivariate extension of the so-called “Brook’s lemma,” which attempts to derive a joint distribution from specified full conditionals.13,22,23 McNab, in a series of papers, has delivered substantial insights into the construction, computation, and properties of different classes of multivariate CAR models.21,24–26 Rather than work with full conditionals, an alternate approach builds joint distributions using linear transformations of a set of univariate CAR models.14,16,19,27,28 A different class emerges from hierarchical constructions10,29 where each disease enters the model in a given sequence (or order) of conditional probability models. This produces simple yet flexible and interpretable association structures, but every ordering produces a different model, resulting in an explosion of models even for a modest number of diseases (say, more than 2 or 3). While multivariate MRF models constructed from undirected graphs are invariant to ordering, and hence obviate the issue of order dependence, they impose restrictions to ensure positive-definiteness of covariance matrices, can be computationally onerous, and render covariance structures that are challenging to interpret.

We introduce a class of multivariate DAGAR (MDAGAR) models for multiple disease mapping by building the joint distribution hierarchically using univariate DAGAR models. This approach is analogous to generalized MCAR (GMCAR) models.10 The objective here is to retain the interpretation of spatial autocorrelation offered by the DAGAR, which is challenging for the CAR30 and order-free MCAR models. Our methodological innovation is devising a hierarchical MDAGAR model in conjunction with a bridge sampling algorithm31,32 for choosing among differently ordered hierarchical models and, more importantly, offering Bayesian model averaged (BMA) inference to neutralize the effect of order-dependent inference. The idea is to begin with a fixed ordered set of cancers, posited to be associated with each other and across space, and build a hierarchical model. The DAGAR specification produces a comprehensible association structure, while bridge sampling allows us to rank differently ordered models using their marginal posterior probabilities. Since each model corresponds to an assumed conditional dependence, the marginal posterior probabilities will indicate the tenability of such assumptions given the data. Epidemiologists, then, will be able to use this information to establish relationships among the diseases and spatial autocorrelation for each disease.

The article proceeds as follows. Section 2 develops the hierarchical MDAGAR model and introduces a bridge sampling method to select the MDAGAR with the best hierarchical order. Section 3 presents simulation studies comparing MDAGAR with GMCAR and order-free MCAR models and also illustrates model averaged inference from the bridge sampling. Section 4 applies our MDAGAR to age-adjusted incidence rates of four cancers from the Surveillance, Epidemiology, and End Results (SEER) database and discusses different cases with respect to predictors. Finally, in Section 5, we summarize with some concluding remarks and pointers for future research.

2 |. METHODS

2.1 |. Overview of univariate DAGAR modeling

Let $G=\{V,E\}$ be a graph corresponding to a geographic map, where the vertices $V=\{1,2,\ldots,k\}$ represent clearly delineated regions on the map and $E=\{(i,j): i\sim j\}$ is the collection of edges between the vertices representing neighboring pairs of regions. We denote two neighboring regions $i$ and $j$ by $i\sim j$. We assume that the vertices in $V$ are ordered in a fixed sequence according to their number labels. The DAGAR model builds a spatial autocorrelation model for a single outcome on $G$ using the ordered set of vertices in $V$.8 Let $N(1)$ be the empty set and let $j\in V\setminus\{1\}$ be the index for any region except 1. We define $N(j)$ to be the set of labels of geographic neighbors of $j$ that precede $j$ in $V$, that is, $N(j)=\{l\in V: l<j;\ l\sim j\}$.8 Let $\{w_i: i\in V\}$ be a collection of $k$ random variables defined over the map. DAGAR specifies the following autoregression,

$$w_1=\epsilon_1;\qquad w_j=\sum_{l\in N(j)} b_{jl}\,w_l+\epsilon_j,\quad j=2,\ldots,k, \tag{1}$$

where $\epsilon_j\overset{ind}{\sim}N(0,\lambda_j)$ with the precision $\lambda_j$, and $b_{jl}=0$ if $l\notin N(j)$. This implies that $w\sim N(0,\tau Q(\rho))$, where $Q(\rho)$ is a spatial precision matrix that depends only upon a spatial autocorrelation parameter $\rho$ and $\tau$ is a positive scale parameter. The precision matrix is $Q(\rho)=(I-B)^{\top}F(I-B)$, where $B$ is a $k\times k$ strictly lower-triangular matrix and $F$ is a $k\times k$ diagonal matrix. The elements of $B$ and $F$ are denoted by $b_{jl}$ and $\lambda_j$, respectively, where

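$$b_{jl}=\begin{cases}0, & \text{if } l\notin N(j),\\[4pt] \dfrac{\rho}{1+(n_{<j}-1)\rho^{2}}, & \text{if } j=2,3,\ldots,k,\ l\in N(j),\end{cases}\qquad\text{and}\qquad \lambda_j=\frac{1+(n_{<j}-1)\rho^{2}}{1-\rho^{2}},\quad j=1,2,\ldots,k. \tag{2}$$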

Here $n_{<j}$ is the number of members in $N(j)$ and $n_{<1}=0$. The above definition of $b_{jl}$ is consistent with the lower-triangular structure of $B$ because $l\notin N(j)$ for any $l\ge j$. The derivation of $B$ and $F$ as functions of a spatial correlation parameter $\rho$ is based upon forming local autoregressive models on embedded spanning trees of subgraphs of $G$.8
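As an illustration of (1) and (2), the following is a minimal sketch, not the authors' implementation, that assembles $B$, the diagonal of $F$, and $Q(\rho)=(I-B)^{\top}F(I-B)$ in plain Python. The function name `dagar_precision` and the input format (a list whose $j$th entry holds the 0-based labels in $N(j)$) are assumptions made for this example. On a three-region path graph the result reduces to the familiar AR(1) precision matrix.

```python
def dagar_precision(directed_neighbors, rho):
    """Build B, diag(F) and Q(rho) = (I - B)^T F (I - B) from Equation (2).

    directed_neighbors[j] lists the 0-based labels l < j with l ~ j, i.e. N(j).
    """
    k = len(directed_neighbors)
    B = [[0.0] * k for _ in range(k)]
    lam = [0.0] * k
    for j, nbrs in enumerate(directed_neighbors):
        n_prev = len(nbrs)                       # n_{<j}
        denom = 1.0 + (n_prev - 1) * rho ** 2
        lam[j] = denom / (1.0 - rho ** 2)        # lambda_j
        for l in nbrs:
            B[j][l] = rho / denom                # b_{jl}
    # (I - B), then Q[r][c] = sum_m (I-B)[m][r] * lambda_m * (I-B)[m][c]
    IB = [[(1.0 if r == c else 0.0) - B[r][c] for c in range(k)] for r in range(k)]
    Q = [[sum(IB[m][r] * lam[m] * IB[m][c] for m in range(k)) for c in range(k)]
         for r in range(k)]
    return B, lam, Q


# Path graph 1 - 2 - 3: N(1) = {}, N(2) = {1}, N(3) = {2} (0-based below).
path_neighbors = [[], [0], [1]]
B, lam, Q = dagar_precision(path_neighbors, rho=0.5)
```

Because $B$ is strictly lower triangular and $F$ is diagonal, $Q(\rho)$ never needs to be inverted to evaluate the density; the dense matrix is formed here only to make the structure visible.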

DAGAR and CAR are both examples of MRFs.4 They are similar in that both models use a graph to model geographic neighbors, but they are different in how they model spatial dependencies. DAGAR, as the name suggests, builds dependencies using a directed acyclic graph (DAG). This produces a joint likelihood using sequential construction of the partial conditional distributions $w_i\mid w_{<i}$. CAR builds a joint model by specifying Gaussian full conditional distributions $w_i\mid w_{-i}$ by treating the underlying map as an undirected graph, where absence of an edge between two regions denotes conditional independence of their spatial effects given other geographic neighbors. These two approaches yield different structures for the precision matrix $Q(\rho)$ with different interpretations for the parameter $\rho$. DAGAR retains the interpretation of $\rho$ as an autocorrelation parameter,8 while interpreting spatial autocorrelation in CAR is challenging.30

2.2 |. Motivating multivariate disease mapping

There is a substantial literature on joint modeling of multiple spatially oriented outcomes, some of which have been cited in Section 1. While it is possible to model each disease separately using a univariate DAGAR, hence independent of each other, the resulting inference will ignore the association among the diseases. This will be manifested in model assessment because the less dependence among diseases that a model accommodates, the farther away it will be from the joint model in the sense of Kullback-Leibler divergence.

More formally, suppose we have two mutually exclusive sets $A$ and $B$ that contain labels for diseases. Let $y_A$ and $y_B$ be the vectors of spatial outcomes over all regions corresponding to the diseases in set $A$ and set $B$, respectively. A full joint model $p(y)$, where $y=(y_A^{\top},y_B^{\top})^{\top}$, can be written as $p(y)=p(y_A)\times p(y_B\mid y_A)$. Let $C_1$ and $C_2$ be two nested subsets of diseases in $A$ such that $C_2\subset C_1\subset A$. Consider two competing models, $p_1(y)=p(y_A)\times p(y_B\mid y_{C_1})$ and $p_2(y)=p(y_A)\times p(y_B\mid y_{C_2})$, where $p_1(\cdot)$ and $p_2(\cdot)$ are probability densities constructed from the joint probability measure $p(\cdot)$ by imposing conditional independence such that $p(y_B\mid y_A)=p(y_B\mid y_{C_1})$ and $p(y_B\mid y_A)=p(y_B\mid y_{C_2})$, respectively. Both $p_1(\cdot)$ and $p_2(\cdot)$ suppress dependence by shrinking the conditional set $A$, but $p_2(\cdot)$ suppresses more than $p_1(\cdot)$. We show below that $p_2(\cdot)$ is farther away from $p(\cdot)$ than $p_1(\cdot)$.

A straightforward application of Jensen’s inequality yields $E_{B\mid C_1}\left[\log\dfrac{p(y_B\mid y_{C_1})}{p(y_B\mid y_{C_2})}\right]\ge 0$, where $E_{B\mid C_1}[\cdot]$ denotes the conditional expectation with respect to $p(y_B\mid y_{C_1})$. Therefore,

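$$\begin{aligned}
\mathrm{KL}(p\,\|\,p_2)-\mathrm{KL}(p\,\|\,p_1) &= E_{A,B}\left[\log\frac{p(y)}{p_2(y)}-\log\frac{p(y)}{p_1(y)}\right]\\
&= E_{A,B}\left[\log\frac{p_1(y)}{p_2(y)}\right] = E_{A,B}\left[\log\frac{p(y_B\mid y_{C_1})}{p(y_B\mid y_{C_2})}\right]\\
&= E_{B,C_1}\left[\log\frac{p(y_B\mid y_{C_1})}{p(y_B\mid y_{C_2})}\right] = E_{C_1}\left\{E_{B\mid C_1}\left[\log\frac{p(y_B\mid y_{C_1})}{p(y_B\mid y_{C_2})}\right]\right\}\ge 0.
\end{aligned} \tag{3}$$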

The equality $E_{A,B}[\cdot]=E_{B,C_1}[\cdot]$ in the last row follows from the fact that the argument is a function of diseases in $B$, $C_1$, and $C_2$ and, hence, in $B$ and $C_1$ because $C_2\subset C_1$. The argument given in (3) is free of distributional assumptions and is linked to the submodularity of entropy and the “information never hurts” principle.33,34 Equation (3) shows that models built upon hierarchical dependence structures depend upon the order in which the diseases enter the model. While this is a disadvantage, hierarchical dependencies are easier to interpret, easier to compute using currently available Bayesian modeling software such as BUGS or JAGS, and have been shown to be very competitive in inferential performance.35 Hence, we develop and implement Bayesian model averaging over different ordered models in a computationally efficient manner.
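The inequality in (3) can be checked numerically on a small discrete example. The sketch below is an illustration under stated assumptions rather than part of the paper: it uses three binary variables, with $A=\{a_1,a_2\}$, $B=\{b\}$, $C_1=\{a_1\}$, and $C_2=\emptyset$, and an arbitrary strictly positive joint distribution. The helper names `marginal`, `restricted_model`, and `kl` are hypothetical.

```python
import itertools
import math


def marginal(p, keep):
    """Marginal distribution of the coordinates listed in `keep`."""
    out = {}
    for y, pr in p.items():
        key = tuple(y[i] for i in keep)
        out[key] = out.get(key, 0.0) + pr
    return out


def restricted_model(p, a_idx, b_idx, c_idx):
    """Return the density p(y_A) * p(y_B | y_C) for C a subset of A."""
    p_a = marginal(p, a_idx)
    p_bc = marginal(p, b_idx + c_idx)
    p_c = marginal(p, c_idx)
    model = {}
    for y in p:
        a = tuple(y[i] for i in a_idx)
        bc = tuple(y[i] for i in b_idx + c_idx)
        c = tuple(y[i] for i in c_idx)
        model[y] = p_a[a] * p_bc[bc] / p_c[c]
    return model


def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for distributions on a finite set."""
    return sum(pr * math.log(pr / q[y]) for y, pr in p.items() if pr > 0)


# Arbitrary positive joint distribution on (a1, a2, b), each binary.
states = list(itertools.product([0, 1], repeat=3))
weights = [1, 2, 3, 4, 5, 6, 7, 8]
p = {y: wgt / sum(weights) for y, wgt in zip(states, weights)}

A, B = [0, 1], [2]
p1 = restricted_model(p, A, B, c_idx=[0])   # conditions y_B on C1 = {a1}
p2 = restricted_model(p, A, B, c_idx=[])    # conditions y_B on C2 = {} (independence)
kl_p1, kl_p2 = kl(p, p1), kl(p, p2)
```

Under these assumptions `kl_p2` is at least as large as `kl_p1`, and conditioning on all of $A$ recovers $p$ exactly, so the corresponding divergence is zero.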

2.3 |. Multivariate DAGAR model

Modeling multiple diseases will introduce associations among the diseases and spatial dependence for each disease. Let $y_{ij}$ be a disease outcome of interest for disease $i$ in region $j$. For sake of clarity, we assume that $y_{ij}$ is a continuous variable (eg, incidence rates) related to a set of explanatory variables through the regression model,

$$y_{ij}=x_{ij}^{\top}\beta_i+w_{ij}+e_{ij}, \tag{4}$$

where $x_{ij}$ is a $p_i\times 1$ vector of explanatory variables specific to disease $i$ within region $j$, $\beta_i$ are the slopes corresponding to disease $i$, $w_{ij}$ is a random effect for disease $i$ in region $j$, and $e_{ij}\overset{ind}{\sim}N(0,(\sigma_i^{2})^{-1})$ is the random noise arising from uncontrolled imperfections in the data.

Part of the residual from the explanatory variables is captured by the spatial-temporal effect $w_{ij}$. Let $w_i=(w_{i1},w_{i2},\ldots,w_{ik})^{\top}$ for $i=1,2,\ldots,q$. We adopt a hierarchical approach,10 where we specify the joint distribution of $w=(w_1^{\top},w_2^{\top},\ldots,w_q^{\top})^{\top}$ as $p(w)=p(w_1)\prod_{i=2}^{q}p(w_i\mid w_{<i})$. We model $p(w_1)$ and each of the conditional densities $p(w_i\mid w_{<i})$ with $w_{<i}=(w_1^{\top},\ldots,w_{i-1}^{\top})^{\top}$ for $i\ge 2$ as univariate spatial models. The merits of this approach include simplicity and computational efficiency while ensuring that richness in structure is accommodated through the $p(w_i\mid w_{<i})$'s.

We point out two important distinctions from the GMCAR model10: (i) instead of using CAR for the spatial dependence, we use DAGAR; and (ii) we apply a computationally efficient bridge sampling algorithm32 to compute the marginal posterior probabilities for each ordered model. The first distinction allows better interpretation of spatial autocorrelation than the CAR models. The second distinction is of immense practical value and makes this approach feasible for a much larger number of outcomes. Without this distinction, analysts would be dealing with $q!$ models for $q$ diseases and would have to choose among them based upon a model-selection metric. That would be overly burdensome for more than 2 or 3 diseases.

2.4 |. A conditional multivariate DAGAR model

The multivariate DAGAR (or MDAGAR) model is constructed as

$$w_1=\epsilon_1;\qquad w_i=A_{i1}w_1+A_{i2}w_2+\cdots+A_{i,i-1}w_{i-1}+\epsilon_i\quad\text{for } i=2,3,\ldots,q, \tag{5}$$

where $\epsilon_i\sim N(0,\tau_iQ(\rho_i))$ and $\tau_iQ(\rho_i)$ are univariate DAGAR precision matrices with $B$ and $F$ as in (2). In (5), we model $w_1$ as a univariate DAGAR and, progressively, the conditional density of each $w_i$ given $w_1,\ldots,w_{i-1}$ is also modeled as a DAGAR for $i=2,3,\ldots,q$.

Each disease has its own distribution with its own spatial autocorrelation parameter. There are $q$ spatial autocorrelation parameters, $\{\rho_1,\rho_2,\ldots,\rho_q\}$, corresponding to the $q$ diseases. Given the differences in the geographic variation of different diseases, this flexibility is desirable. Each matrix $A_{ii'}$ in (5) with $i'=1,\ldots,i-1$ models the association between diseases $i$ and $i'$. We specify $A_{ii'}=\eta_{0ii'}I_k+\eta_{1ii'}M$, where $M$ is the binary adjacency matrix for the map, that is, $m_{jj'}=1$ if $j\sim j'$ and 0 otherwise. Coefficients $\eta_{0ii'}$ and $\eta_{1ii'}$ associate $w_{ij}$ with $w_{i'j}$ and $w_{i'j'}$. In other words, $\eta_{0ii'}$ is the diagonal element in $A_{ii'}$, while $\eta_{1ii'}$ is the element in the $j$th row and $j'$th column if $j\sim j'$. Therefore, for the joint distribution of $w$, if $A$ is the $kq\times kq$ strictly block-lower triangular matrix with $(i,i')$th block being $A_{ii'}=O$ whenever $i'\ge i$, and $\epsilon=(\epsilon_1^{\top},\ldots,\epsilon_q^{\top})^{\top}$, then (5) renders $w=Aw+\epsilon$.

Since $I-A$ is still lower triangular with 1s on the diagonal, it is nonsingular with $\det(I-A)=1$. Writing $w=(I-A)^{-1}\epsilon$, where $\epsilon\sim N(0,\Lambda)$ and the block diagonal matrix $\Lambda$ has $\tau_1Q(\rho_1),\ldots,\tau_qQ(\rho_q)$ on the diagonal, we obtain $w\sim N(0,Q_w)$ for $\rho=(\rho_1,\ldots,\rho_q)^{\top}$ with

$$Q_w=(I-A)^{\top}\Lambda(I-A). \tag{6}$$

We say that $w$ follows MDAGAR if $w\sim N(0,Q_w)$.
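To make (5) and (6) concrete, the sketch below assembles $Q_w$ for $q=2$ diseases in plain Python. It is an illustration only: the function name `mdagar_precision_q2`, the dense list-of-lists representation, and the toy inputs (identity matrices standing in for $Q(\rho_1)$ and $Q(\rho_2)$ on a two-region map) are assumptions chosen so the result can be verified by hand.

```python
def matmul(X, Y):
    """Dense matrix product for lists of lists."""
    return [[sum(X[i][m] * Y[m][j] for m in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]


def transpose(X):
    return [list(col) for col in zip(*X)]


def mdagar_precision_q2(Q1, Q2, M, tau1, tau2, eta0, eta1):
    """Q_w = (I - A)^T Lambda (I - A) for two diseases, as in Equation (6).

    Q1, Q2: k x k precision matrices Q(rho_1), Q(rho_2); M: binary adjacency matrix.
    """
    k = len(M)
    n = 2 * k
    # A_21 = eta0 * I_k + eta1 * M
    A21 = [[eta0 * (i == j) + eta1 * M[i][j] for j in range(k)] for i in range(k)]
    # I - A, where A has A_21 in its lower-left block and zeros elsewhere
    I_minus_A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    # Lambda = blockdiag(tau1 * Q1, tau2 * Q2)
    Lam = [[0.0] * n for _ in range(n)]
    for i in range(k):
        for j in range(k):
            I_minus_A[k + i][j] = -A21[i][j]
            Lam[i][j] = tau1 * Q1[i][j]
            Lam[k + i][k + j] = tau2 * Q2[i][j]
    Qw = matmul(matmul(transpose(I_minus_A), Lam), I_minus_A)
    return Qw, A21


# Toy check: two neighboring regions, identity precisions, unit scales.
M = [[0, 1], [1, 0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
Qw, A21 = mdagar_precision_q2(identity, identity, M, 1.0, 1.0, eta0=0.5, eta1=0.3)
```

With these toy inputs the top-left block equals $I+A_{21}^{\top}A_{21}$, the off-diagonal blocks equal $-A_{21}^{\top}$ and $-A_{21}$, and the bottom-right block is the identity.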

Interpretation of $\rho_1,\ldots,\rho_q$ is clear: $\rho_1$ measures the spatial association for the first disease, while $\rho_i$, $i\ge 2$, is the residual spatial correlation in disease $i$ after accounting for the first $i-1$ diseases. Similarly, $\tau_1$ is the spatial precision for the first disease, while $\tau_i$, $i\ge 2$, is the residual spatial precision for disease $i$ after accounting for the first $i-1$ diseases.

2.4.1 |. Model implementation

We extend (4) to the following Bayesian hierarchical framework with the posterior distribution

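$$\begin{aligned}
p(\beta,w,\eta,\rho,\tau,\sigma\mid y)\propto{}& p(\rho)\times p(\eta)\times\prod_{i=1}^{q}\left\{IG(1/\tau_i\mid a_\tau,b_\tau)\times IG(\sigma_i^{2}\mid a_\sigma,b_\sigma)\times N(\beta_i\mid\mu_\beta,V_\beta^{-1})\right\}\\
&\times N(w\mid 0,Q_w)\times\prod_{i=1}^{q}\prod_{j=1}^{k}N(y_{ij}\mid x_{ij}^{\top}\beta_i+w_{ij},1/\sigma_i^{2}),
\end{aligned} \tag{7}$$

where $\beta=(\beta_1^{\top},\beta_2^{\top},\ldots,\beta_q^{\top})^{\top}$, $\tau=\{\tau_1,\tau_2,\ldots,\tau_q\}$, $\sigma=\{\sigma_1^{2},\sigma_2^{2},\ldots,\sigma_q^{2}\}$ and $\eta=\{\eta_2,\eta_3,\ldots,\eta_q\}$ with $\eta_i=(\eta_{i1}^{\top},\eta_{i2}^{\top},\ldots,\eta_{i,i-1}^{\top})^{\top}$ and $\eta_{ii'}=(\eta_{0ii'},\eta_{1ii'})^{\top}$ for $i=2,\ldots,q$ and $i'=1,\ldots,i-1$. For variance parameters $1/\tau_i$ and $\sigma_i^{2}$, $IG(\cdot\mid a,b)$ is the inverse-gamma distribution with shape and rate parameters $a$ and $b$, respectively. For each element in $\eta_i$ we choose a normal prior $N(\mu_{ij},\sigma_{\eta ij}^{2})$, while the prior $N(w\mid 0,Q_w)$ can also be written as

$$\begin{aligned}
p(w\mid\tau,\eta_2,\ldots,\eta_q,\rho)\propto{}& \tau_1^{k/2}\,|Q(\rho_1)|^{1/2}\exp\left\{-\frac{\tau_1}{2}w_1^{\top}Q(\rho_1)w_1\right\}\\
&\times\prod_{i=2}^{q}\tau_i^{k/2}\,|Q(\rho_i)|^{1/2}\exp\left\{-\frac{\tau_i}{2}\Big(w_i-\sum_{i'=1}^{i-1}A_{ii'}w_{i'}\Big)^{\top}Q(\rho_i)\Big(w_i-\sum_{i'=1}^{i-1}A_{ii'}w_{i'}\Big)\right\},
\end{aligned} \tag{8}$$

where $\det(Q(\rho_i))=\prod_{j=1}^{k}\lambda_{ij}$, and $w_i^{\top}Q(\rho_i)w_i=\lambda_{i1}w_{i1}^{2}+\sum_{j=2}^{k}\lambda_{ij}\Big(w_{ij}-\sum_{j'\in N(j)}b_{ijj'}w_{ij'}\Big)^{2}$.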
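The closed forms for the determinant and the quadratic form mean that each factor in (8) can be evaluated in a single pass over the regions, without forming $Q(\rho_i)$. The following sketch shows this under stated assumptions: the function names are hypothetical, `directed_neighbors[j]` is assumed to hold the 0-based labels in $N(j)$, and the returned log density omits the additive constant $-\tfrac{k}{2}\log(2\pi)$.

```python
import math


def dagar_logdet_and_quadform(w, directed_neighbors, rho):
    """Return (log det Q(rho), w^T Q(rho) w) using the sums given after Equation (8)."""
    logdet, quad = 0.0, 0.0
    for j, nbrs in enumerate(directed_neighbors):
        denom = 1.0 + (len(nbrs) - 1) * rho ** 2
        lam_j = denom / (1.0 - rho ** 2)          # lambda_j from Equation (2)
        b_j = rho / denom                         # common value of b_{jl}, l in N(j)
        resid = w[j] - b_j * sum(w[l] for l in nbrs)
        logdet += math.log(lam_j)
        quad += lam_j * resid ** 2
    return logdet, quad


def dagar_log_density(w, directed_neighbors, rho, tau):
    """Log of one factor in (8), up to an additive constant, for w ~ N(0, tau * Q(rho))."""
    k = len(w)
    logdet, quad = dagar_logdet_and_quadform(w, directed_neighbors, rho)
    return 0.5 * k * math.log(tau) + 0.5 * logdet - 0.5 * tau * quad


# Path graph with three regions and rho = 0.5.
logdet, quad = dagar_logdet_and_quadform([1.0, 2.0, 3.0], [[], [0], [1]], 0.5)
```

For the conditional factors with $i\ge 2$, the same function can be applied to the residual vector $w_i-\sum_{i'<i}A_{ii'}w_{i'}$ in place of `w`.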

We sample the parameters from the posterior distribution in (7) using Markov chain Monte Carlo (MCMC) with Gibbs sampling and random walk Metropolis36 as implemented in the rjags package within the R statistical computing environment. Web Appendix B S.2.1 presents details on the MCMC updating scheme.

2.5 |. Model selection via bridge sampling

It is clear from (5) that each ordering of diseases in MDAGAR will produce a different model. For the bivariate situation, it is convenient to compare only two models (orders) by the significance of parameter estimates as well as model performance. However, when there are more than two diseases involved in the model, at least six models (for three diseases) will be fitted, and comparing all models becomes cumbersome or even impracticable.

Instead, we pursue model averaging of MDAGAR models. Given a set of T=q! candidate models, say M1,,MT, Bayesian model selection and model averaging calculates

$$p(M=M_t\mid y)=\frac{p(y\mid M=M_t)\,p(M=M_t)}{\sum_{j=1}^{T}p(y\mid M=M_j)\,p(M=M_j)}, \tag{9}$$

for $t=1,\ldots,T$.37 Computing the marginal likelihood $p(y\mid M_t)$ in (9) is challenging. Methods such as importance sampling38 and generalized harmonic mean39 have been proposed as stable estimators with finite variance, but finding the required importance density with strong constraints on the tail behavior relative to the posterior distribution is often challenging. Bridge sampling estimates the marginal likelihood (ie, the normalizing constant) by combining samples from two distributions, the posterior and a proposal distribution $g(\cdot)$, through a bridge function $h(\cdot)$.40 Let $\theta_t=\{\beta_t,\sigma_t,\rho_t,\tau_t,\eta_{2,t},\ldots,\eta_{q,t}\}$ be the set of parameters in model $M_t$ with prior $p(\theta_t\mid M_t)$ as defined in the first row of (7). Based on the identity,

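$$1=\frac{\int p(y\mid\theta_t,M_t)\,p(\theta_t\mid M_t)\,h(\theta_t\mid M_t)\,g(\theta_t\mid M_t)\,d\theta_t}{\int p(y\mid\theta_t,M_t)\,p(\theta_t\mid M_t)\,h(\theta_t\mid M_t)\,g(\theta_t\mid M_t)\,d\theta_t},$$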

a current version of the bridge sampling estimator is

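$$p(y\mid M=M_t)=\frac{E_{g(\theta_t\mid M_t)}\left[p(y\mid\theta_t,M_t)\,p(\theta_t\mid M_t)\,h(\theta_t\mid M_t)\right]}{E_{p(\theta_t\mid y,M_t)}\left[h(\theta_t\mid M_t)\,g(\theta_t\mid M_t)\right]}\approx\frac{\dfrac{1}{N_2}\sum_{i=1}^{N_2}p(y\mid\tilde\theta_{t,i},M_t)\,p(\tilde\theta_{t,i}\mid M_t)\,h(\tilde\theta_{t,i}\mid M_t)}{\dfrac{1}{N_1}\sum_{j=1}^{N_1}h(\theta_{t,j}\mid M_t)\,g(\theta_{t,j}\mid M_t)}, \tag{10}$$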

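where $\theta_{t,j}\sim p(\theta_t\mid y,M_t)$, $j=1,\ldots,N_1$, are $N_1$ posterior samples and $\tilde\theta_{t,i}\sim g(\theta_t\mid M_t)$, $i=1,\ldots,N_2$, are $N_2$ samples drawn from the proposal distribution.32 The likelihood $p(y\mid\theta_t,M=M_t)$ is obtained by integrating out $w$ from (7) as

$$N\left(y\mid X\beta,\left[Q_w^{-1}(\rho_t,\tau_t,\eta_{2,t},\ldots,\eta_{q,t})+\mathrm{diag}(\sigma_t)\otimes I_k\right]^{-1}\right), \tag{11}$$

given that $y=(y_1^{\top},\ldots,y_q^{\top})^{\top}$ with $y_i=(y_{i1},y_{i2},\ldots,y_{ik})^{\top}$, $\mathrm{diag}(\sigma)$ is a diagonal matrix with $\sigma_i^{2}$, $i=1,\ldots,q$, on the diagonal, and $X$ is the block diagonal design matrix with blocks $X_i=(x_{i1},x_{i2},\ldots,x_{ik})^{\top}$. The bridge function $h(\theta_t\mid M_t)$ is specified by the optimal choice,31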

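$$h(\theta_t\mid M_t)=C\cdot\frac{1}{s_1\,p(y\mid\theta_t,M_t)\,p(\theta_t\mid M_t)+s_2\,p(y\mid M_t)\,g(\theta_t\mid M_t)}, \tag{12}$$

where $C$ is a constant. Inserting (12) in (10) yields the estimate of $p(y\mid M=M_t)$ after convergence of an iterative scheme31 as

$$\hat p(y\mid M_t)^{(t+1)}=\frac{\dfrac{1}{N_2}\sum_{i=1}^{N_2}\dfrac{l_{2,i}}{s_1l_{2,i}+s_2\,\hat p(y\mid M_t)^{(t)}}}{\dfrac{1}{N_1}\sum_{j=1}^{N_1}\dfrac{1}{s_1l_{1,j}+s_2\,\hat p(y\mid M_t)^{(t)}}}, \tag{13}$$

where $l_{1,j}=\dfrac{p(y\mid\theta_{t,j},M_t)\,p(\theta_{t,j}\mid M_t)}{g(\theta_{t,j}\mid M_t)}$, $l_{2,i}=\dfrac{p(y\mid\tilde\theta_{t,i},M_t)\,p(\tilde\theta_{t,i}\mid M_t)}{g(\tilde\theta_{t,i}\mid M_t)}$, $s_1=\dfrac{N_1}{N_1+N_2}$, and $s_2=\dfrac{N_2}{N_1+N_2}$.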
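The iteration in (13) can be illustrated on a toy problem whose normalizing constant is known. The sketch below is not the `bridgesampling` package and not the authors' code; it is a minimal standard-library illustration under stated assumptions. The unnormalized "posterior" is $\exp(-\theta^{2}/2)$, whose normalizing constant is $\sqrt{2\pi}$, exact draws from $N(0,1)$ stand in for MCMC output, and the proposal $g$ is a $N(0,1.5^{2})$ density. The function name `bridge_sampling` and all settings are assumptions for this example.

```python
import math
import random


def bridge_sampling(log_q, log_g, post_draws, prop_draws, n_iter=100, start=1.0):
    """Iterate Equation (13).

    log_q: log of the unnormalized posterior, log[p(y | theta) p(theta)].
    log_g: log density of the (normalized) proposal distribution g.
    """
    l1 = [math.exp(log_q(t) - log_g(t)) for t in post_draws]   # l_{1,j}
    l2 = [math.exp(log_q(t) - log_g(t)) for t in prop_draws]   # l_{2,i}
    n1, n2 = len(l1), len(l2)
    s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
    est = start
    for _ in range(n_iter):
        numer = sum(l / (s1 * l + s2 * est) for l in l2) / n2
        denom = sum(1.0 / (s1 * l + s2 * est) for l in l1) / n1
        est = numer / denom
    return est


random.seed(1)
SD_G = 1.5  # proposal standard deviation (assumed for this toy example)


def log_q(theta):
    return -0.5 * theta * theta


def log_g(theta):
    return -0.5 * (theta / SD_G) ** 2 - math.log(SD_G * math.sqrt(2.0 * math.pi))


posterior_draws = [random.gauss(0.0, 1.0) for _ in range(4000)]
proposal_draws = [random.gauss(0.0, SD_G) for _ in range(4000)]
estimate = bridge_sampling(log_q, log_g, posterior_draws, proposal_draws)
```

In this toy setting `estimate` should lie close to $\sqrt{2\pi}\approx 2.507$. For realistic models the ratios $l_{1,j}$ and $l_{2,i}$ are usually handled on the log scale to avoid overflow, a refinement this sketch omits.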

Given the log marginal likelihood estimates from bridgesampling, the posterior model probability for each model is calculated from (9) after setting the prior probability $p(M=M_t)$ of each model. For Bayesian model averaging (BMA), the model averaged posterior distribution of a quantity of interest $\Delta$ is obtained as $p(\Delta\mid y)=\sum_{t=1}^{T}p(\Delta\mid M=M_t,y)\,p(M=M_t\mid y)$,37 and the posterior mean is

$$E(\Delta\mid y)=\sum_{t=1}^{T}E(\Delta\mid M=M_t,y)\,p(M=M_t\mid y). \tag{14}$$

Setting $\Delta=\{\beta,w\}$ yields the model averaged posterior estimates for spatial random effects and allows us to calculate the posterior mean incidence rates as discussed in Section 4.
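Equations (9) and (14) translate directly into a few lines of code once the log marginal likelihoods are available. The sketch below is illustrative: the function names are hypothetical, a uniform prior over models is assumed when none is supplied, and the log-sum-exp shift is an implementation choice added here for numerical stability.

```python
import math


def posterior_model_probs(log_marglik, prior=None):
    """Posterior model probabilities from Equation (9), computed on the log scale."""
    n_models = len(log_marglik)
    if prior is None:
        prior = [1.0 / n_models] * n_models       # assumed uniform prior over models
    logs = [lm + math.log(pr) for lm, pr in zip(log_marglik, prior)]
    shift = max(logs)                             # log-sum-exp shift to avoid underflow
    weights = [math.exp(v - shift) for v in logs]
    total = sum(weights)
    return [wgt / total for wgt in weights]


def bma_mean(model_means, probs):
    """Model averaged posterior mean of a scalar quantity, Equation (14)."""
    return sum(m * p for m, p in zip(model_means, probs))


# Two hypothetical models whose marginal likelihoods differ by a factor of 3.
probs = posterior_model_probs([-1000.0, -1000.0 + math.log(3.0)])
averaged = bma_mean([1.0, 2.0], probs)
```

For vector-valued $\Delta$, such as the spatial random effects, the same weighted sum is applied element by element.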

3 |. SIMULATION

We simulate three different experiments. The first is designed to evaluate MDAGAR’s inferential performance against GMCAR. The second compares MDAGAR, GMCAR and order-free MCAR for data generated from the latter. The third experiment illustrates the effectiveness of bridge sampling (Section 2.5) in preferring models with a correct “ordering” of the diseases.

3.1 |. Data generation

We compare MDAGAR’s inferential performance with GMCAR10 (Section 3.2) and order-free MCAR16 (Section 3.3). We choose the 48 states of the contiguous United States as our underlying map, where two states are treated as neighbors if they share a common geographic boundary. We generated our outcomes $y_{ij}$ using the model in (4) with $q=2$, that is, two outcomes, and two covariates, $x_{1j}$ and $x_{2j}$, with $p_1=2$ and $p_2=3$. We fixed the values of the covariates after generating them from $N(0,I_{p_i})$, $i=1,2$, independent across regions. The regression slopes were set to $\beta_1=(1,5)^{\top}$ and $\beta_2=(2,4,5)^{\top}$.

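Turning to the spatial random effects, we generated values of $w=(w_1^{\top},w_2^{\top})^{\top}$ from a $N(0,Q_w)$ distribution, where the precision matrix is

$$Q_w=\begin{bmatrix}\tau_1Q(\rho_1)+\tau_2A_{21}^{\top}Q(\rho_2)A_{21} & -\tau_2A_{21}^{\top}Q(\rho_2)\\ -\tau_2Q(\rho_2)A_{21} & \tau_2Q(\rho_2)\end{bmatrix}. \tag{15}$$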

We set $\tau_1=\tau_2=0.25$, $\rho_1=0.2$, and $\rho_2=0.8$ in (15) and take $Q(\rho_i)=D(\rho_i)^{-1}$, where $D(\rho_i)=\exp(-\phi_i\,d(j,j'))$, $\phi_i=-\log(\rho_i)$ is the spatial decay for disease $i$, and $d(j,j')$ refers to the distance between the embedding of the $j$th and $j'$th vertex. The vertices are embedded on the Euclidean plane and the centroid of each state is used to create the distance matrix. Using this exponential covariance matrix to generate the data offers a “neutral” ground to compare the performance of MDAGAR with GMCAR. We specified $A_{21}$ using fixed values of $\eta=\{\eta_{021},\eta_{121}\}$. Here, we considered three sets of values for $\eta$ to correspond to low, medium, and high correlation among diseases. We fixed $\eta=\{0.05,0.1\}$ to ensure an average correlation of 0.15 (range 0.072–0.31); $\eta=\{0.5,0.3\}$ with an average correlation of 0.55 (range 0.45–0.74); and $\eta=\{2.5,0.5\}$ with a mean correlation of 0.89 (range 0.84–0.94). We generated $w_{ij}$'s for each of the above specifications for $\eta$ and, with the values of $w_{ij}$ generated as above, we generated the outcome $y_{ij}\sim N(x_{ij}^{\top}\beta_i+w_{ij},1/\sigma_i^{2})$, where $\sigma_1^{2}=\sigma_2^{2}=0.4$. We repeat the above procedure to replicate 85 datasets for each of the three specifications of $\eta$.
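Because $\phi_i=-\log(\rho_i)$, each entry of $D(\rho_i)$ equals $\rho_i^{\,d(j,j')}$, so $\rho_i$ is the correlation between two regions one distance unit apart. The sketch below builds such a matrix from planar coordinates; the function name `exponential_covariance`, the toy coordinates, and the distance units are assumptions, since the text does not state how centroid distances were scaled.

```python
import math


def exponential_covariance(coords, rho):
    """D(rho) with entries exp(-phi * d(j, j')), where phi = -log(rho)."""
    phi = -math.log(rho)
    return [[math.exp(-phi * math.dist(a, b)) for b in coords] for a in coords]


# Two hypothetical centroids five distance units apart.
D = exponential_covariance([(0.0, 0.0), (3.0, 4.0)], rho=0.8)
```

Here `D[0][1]` equals $0.8^{5}$, and the diagonal entries equal 1.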

For our third experiment (Sections 3.4 and 3.5), we generate a dataset with $q=3$ cancers. We extend the above setup to include one more disease. We generate $y_{ij}$'s from (4) with the value of $x_{3j}$ fixed after being generated from $N(0,I_3)$, $\beta_3=(5,3,6)^{\top}$, and $\sigma_3^{2}=0.4$. Let $[i,j,k]$ denote the model $p(w_i)\times p(w_j\mid w_i)\times p(w_k\mid w_j,w_i)$. For three diseases the six resulting models are denoted as $M_1=[1,2,3]$, $M_2=[1,3,2]$, $M_3=[2,1,3]$, $M_4=[2,3,1]$, $M_5=[3,1,2]$, and $M_6=[3,2,1]$.
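The six orderings are simply the permutations of the disease labels, and enumerating them in lexicographic order reproduces the labels $M_1,\ldots,M_6$ used here. A short sketch follows; the dictionary layout is an assumption made for illustration.

```python
from itertools import permutations

diseases = [1, 2, 3]
# permutations() yields orderings in lexicographic order for sorted input,
# which matches the labelling M1 = [1, 2, 3], ..., M6 = [3, 2, 1].
models = {f"M{t}": list(order)
          for t, order in enumerate(permutations(diseases), start=1)}
```

The same enumeration gives the $q!$ candidate models for any number of diseases, which makes the factorial growth in the number of models explicit.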

Each of the six models implies a corresponding joint distribution $w\sim N(0,Q_w)$, which is used to generate the $w_{ij}$'s. Let the parenthesized suffix $(i)$ denote the disease in the $i$th order. For example, in $M_2=[1,3,2]$, we write $w$ in the form of (5) as

$$w_1=\epsilon_{(1)};\qquad w_3=A_{(21)}w_1+\epsilon_{(2)};\qquad w_2=A_{(31)}w_1+A_{(32)}w_3+\epsilon_{(3)},$$

where $\epsilon_{(i)}\sim N(0,\tau_{(i)}Q(\rho_{(i)}))$ with $Q(\rho_{(i)})=D(\rho_{(i)})^{-1}$ as in the first experiment, and $A_{(ii')}=\eta_{0(ii')}I+\eta_{1(ii')}M$ is the coefficient matrix associating random effects for diseases in the $i$th and $i'$th order. We set $\tau_{(1)}=\tau_{(2)}=\tau_{(3)}=0.25$, $\rho_{(1)}=0.2$, $\rho_{(2)}=0.8$, $\rho_{(3)}=0.5$, $\eta_{0(21)}=0.5$, $\eta_{1(21)}=0.3$, $\eta_{0(31)}=1$, $\eta_{1(31)}=0.6$, $\eta_{0(32)}=1.5$, and $\eta_{1(32)}=0.9$ to completely specify $Q_w$ for each of the 6 models. For each $M_i$, we generate 50 datasets by first generating $w\sim N(0,Q_w)$ and then generating $y_{ij}$'s from (4) using the above specifications. Details on the algorithms and the computing environments for each model are provided in Section S.2.1.

3.2 |. Comparisons between MDAGAR and GMCAR

In our first experiment, we analyzed the 85 replicated datasets using (7) with

$$p(\rho)\times p(\eta)=\prod_{i=1}^{q=2}\left\{\mathrm{Unif}(\rho_i\mid 0,1)\right\}\times N(\eta_{21}\mid 0,0.01I_2), \tag{16}$$

where $\eta_{21}=(\eta_{021},\eta_{121})^{\top}$ and Unif is the uniform density. Prior specifications are completed by setting $a_\tau=2$, $b_\tau=0.4$, $a_\sigma=2$, $b_\sigma=0.4$, $\mu_\beta=0$, $V_\beta=1000I$ in (7). The same set of priors was used for both MDAGAR and GMCAR as they have the same number of parameters with similar interpretations. Both models are fast to compute; MDAGAR reported an average running time of 3.87 minutes for each dataset in the bivariate disease analysis, while that for GMCAR was 6.25 minutes.

We compare models using the widely applicable information criterion (WAIC)41,42 and a model comparison score D based on a balanced loss function for replicated data.43 Both WAIC and D reward goodness of fit and penalize model complexity. Details on how these metrics are computed are provided in Web Appendix B S.2.2. In addition, we also computed the average mean squared error (AMSE) of the spatial random effects estimated from each of the 85 datasets. We found the mean (standard deviation) of the AMSEs to be 1.69 (0.034) from the 85 low-correlation datasets, 1.47 (0.030) from the 85 medium-correlation datasets, and 2.35 (0.059) from the 85 high-correlation datasets. The corresponding numbers for GMCAR were 1.83 (0.033), 1.59 (0.031), and 2.14 (0.050), respectively. The MDAGAR tends to have smaller AMSE for low and medium correlations, while GMCAR’s AMSE tends to be pronouncedly lower than MDAGAR’s when the correlations are high. We also compute the mean values of WAICs and D scores for each simulated dataset. Figure 1 plots the values of WAICs (A-C) and D scores (D-F) for the 85 datasets corresponding to each of the three correlation settings. Here, MDAGAR outperforms GMCAR in all three correlation settings with respect to both WAICs and D scores. While MDAGAR outperforms GMCAR in overall model fitting scores for most correlation settings, GMCAR can yield better estimates of spatial effects in high correlation settings.

FIGURE 1.


Density plots for WAICs and D scores over 85 datasets. (A-C) Density plots of WAIC for MDAGAR (blue) and GMCAR (red) models with low, medium, and high correlation, respectively, (D-F) the corresponding density plots for D scores. The dotted vertical line shows the mean for WAIC and D in each plot

Figure 2 presents scatter plots for the true values (x-axis) of spatial random effects against their posterior estimates (y-axis). To be precise, each panel plots 85×48×2=8160 true values of the elements of the 96×1 vector w for 85 datasets against their corresponding posterior estimates. We see strong agreements between the true values and their estimates for both MDAGAR and GMCAR. The agreement is more pronounced for the datasets corresponding to medium and high correlations. For the low-correlation datasets, MDAGAR still exhibits strong agreement which is better than GMCAR.

FIGURE 2.


Scatter plots for estimates of spatial random effects (y-axis) against the true values (x-axis) with 45° lines over 85 datasets: (A-C) Estimates from MDAGAR model with low, medium, and high correlation, (D-F) the corresponding estimates from GMCAR. Pearson’s correlation coefficient for each plot is indicated as “r”

We compute $D_{KL}(N(0,Q_{true})\,\|\,N(0,Q_w))=\frac{1}{2}\left[\log\left(\frac{\det(Q_{true})}{\det(Q_w)}\right)+\mathrm{tr}(Q_wQ_{true}^{-1})-qk\right]$, which is the Kullback-Leibler divergence between the model for $w$ with the true generative precision matrix ($Q_{true}$) and those with MDAGAR and GMCAR precisions ($Q_w$). Using the posterior samples of the precision matrix, we evaluate the posterior probability that $D_{KL}(N(0,Q_{true})\,\|\,N(0,Q_{MDAGAR}))$ is smaller than $D_{KL}(N(0,Q_{true})\,\|\,N(0,Q_{GMCAR}))$. Figure 3 depicts a density plot of these probabilities over the 85 datasets. For low and medium correlations, the MDAGAR has a mean probability of around 69% to be closer to the true model than the GMCAR, while for high correlations GMCAR excels with an average probability of 72% to be closer to the true model. These findings are consistent with the AMSEs, where GMCAR tended to perform better when correlations were high. Additional comparative diagnostics from MDAGAR and GMCAR, such as coverage probabilities for parameters and correlations between random effects for two diseases in the same state, are presented in Web Appendix B S.2.2.2.
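The divergence above depends only on the two precision matrices, so it can be evaluated with a Cholesky factorization and triangular solves. The following standard-library sketch is one way to do this and is not the authors' implementation; the helper names `cholesky`, `logdet`, `solve_spd`, and `kl_zero_mean_gaussians` are hypothetical, and both inputs are assumed to be symmetric positive definite.

```python
import math


def cholesky(a):
    """Lower-triangular L with a = L L^T, for a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][m] * low[j][m] for m in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def logdet(a):
    return 2.0 * sum(math.log(row[i]) for i, row in enumerate(cholesky(a)))


def solve_spd(a, b):
    """Solve a x = b by forward and backward substitution on the Cholesky factor."""
    low, n = cholesky(a), len(a)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(low[i][m] * z[m] for m in range(i))) / low[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(low[m][i] * x[m] for m in range(i + 1, n))) / low[i][i]
    return x


def kl_zero_mean_gaussians(q_true, q_w):
    """D_KL(N(0, Q_true) || N(0, Q_w)) for precision matrices Q_true and Q_w."""
    n = len(q_true)
    # tr(Q_w Q_true^{-1}) = tr(Q_true^{-1} Q_w): solve against each column of Q_w.
    trace = 0.0
    for j in range(n):
        column = [q_w[i][j] for i in range(n)]
        trace += solve_spd(q_true, column)[j]
    return 0.5 * (logdet(q_true) - logdet(q_w) + trace - n)


identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
kl_value = kl_zero_mean_gaussians(identity, doubled)
```

In the text the dimension `n` is $qk$. Evaluating the function at each posterior draw of the MDAGAR and GMCAR precision matrices and recording how often the MDAGAR value is smaller gives the kind of posterior probability described above.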

FIGURE 3.


Density plots for probability that the KL-divergence between the MDAGAR and the true model is smaller than that between GMCAR and the true model with three levels of correlation for two diseases: Low (purple), medium (green), and high (red)

3.3 |. Comparisons between MDAGAR and order-free MCAR

We also generated data using an order-free MCAR model16 to evaluate MDAGAR and GMCAR when the underlying structure is different from the proposed conditional scheme. For the MCAR model, we specified the joint covariance matrix of w as

$$Q_w^{-1}=(A\otimes I_{k\times k})\,\Gamma^{-1}\,(A\otimes I_{k\times k})^{\top}, \tag{18}$$

where $\Sigma=AA^{\top}$ is a $q\times q$ matrix corresponding to disease dependence, $A$ is the upper triangular Cholesky decomposition of $\Sigma$, and $\Gamma$ is a $kq\times kq$ block diagonal matrix with $\Gamma_{ii}^{-1}=\tau_i^{2}(D-\rho_iW)$ ($k\times k$ precision matrix for a proper CAR) for each $i=1,\ldots,q$. This corresponds to the MCAR generated from $w=(A\otimes I)v$, where $v=(v_1^{\top},\ldots,v_q^{\top})^{\top}$ and $v_i\overset{ind}{\sim}N(0,D-\rho_iW)$ for $i=1,\ldots,q$, $D$ is the diagonal matrix with the number of neighbors along the diagonal, and $W$ is a binary adjacency matrix. Therefore, $w$ is generated from independent but not identically distributed latent proper CAR distributions (see Reference 16, Section 3.2).

Keeping other model specifications the same as in Section 3.1 (so $q=2$ and the $\rho_i$’s are as in Section 3.1), we fixed $A=\begin{bmatrix}1 & 0\\ 0.7 & 1\end{bmatrix}$. Computing (18) with these specifications yields a mean correlation of 0.52 among the entries of the matrix (range: 0.48–0.54). The above procedure is replicated for 50 datasets for each model. We estimated the MDAGAR and GMCAR models in two opposite orders, denoted MDAGAR1, MDAGAR2, GMCAR1, and GMCAR2, and compared with the order-free MCAR. We estimate (7) with the respective specifications for $Q_w$ for each model. For the MDAGAR and GMCAR models, we used the priors specified in the previous section using (16). For the MCAR, we assigned $\log a_{ii}$, $i=1,2$, and $a_{21}$ normal priors with variances 0.0625 and 100, respectively. The order-free MCAR is also fast to compute and reported an average running time of 5.89 minutes for each dataset in this experiment.

Figure 4 plots values of (A) WAICs, (B) D scores, and (C) the posterior mean of the Kullback-Leibler divergence between a given model and the true density, $D_{KL}(p(y_{true})\,\|\,p(y))$, for each of the 50 datasets (indexed in the x-axis) computed for each of the five models. For model fitting, GMCAR1 exhibits better performance with smaller values for WAIC and D, while GMCAR2, MDAGAR1, and MDAGAR2 are all comparable with MCAR. GMCAR1 and MDAGAR1 exhibit slightly better performance in D scores compared with GMCAR2 and MDAGAR2, respectively, but the two orders produce similar WAICs. In terms of the posterior means of $D_{KL}(p(y_{true})\,\|\,p(y))$, MCAR is expectedly closer to the true model (having the same data generating structure), but MDAGAR still delivers very competitive performance in spite of being a misspecified model. The variability in the posterior means of $D_{KL}(p(y_{true})\,\|\,p(y))$ for the different models reveals substantial overlap, so conditional models have the ability to compete with order-free MCAR even when data are generated from the latter.

FIGURE 4.


Density plots for (A) WAICs, (B) D scores, and (C) the posterior mean of $D_{KL}(p(y_{true})\,\|\,p(y))$ over 50 datasets, respectively, using MDAGAR1, MDAGAR2, GMCAR1, GMCAR2, and MCAR. The dotted vertical line shows the mean for each plot

3.4 |. Analyses using different orderings for spatial units

The MDAGAR model in Section 3.2 is analyzed using an ordering of spatial units (counties) from the southwest to the northeast. Here, we repeat the analysis for the MDAGAR model using three other orderings that start in the southeast, northwest, and northeast, respectively. We present results from these differently ordered DAGAR models using the 85 low-correlation simulated datasets. For the random effects, the mean (standard deviation) of the AMSEs for the three different orderings (southeast, northwest, and northeast) are 1.61 (0.029), 1.28 (0.026), and 1.43 (0.027), respectively, which do not differ significantly from the result for the original ordering in Section 3.2.

Figure 5 plots the densities of mean WAICs, D scores, and $D_{KL}(p(y_{true}) \,\|\, p(y))$ over the 85 datasets for the MDAGAR model using three different orderings and the original ordering in Section 3.2. In computing $D_{KL}(p(y_{true}) \,\|\, p(y))$, we specify $p(y_{true}) = N(X\beta_{true} + w_{true}, \mathrm{diag}(\sigma_{true}) \otimes I_k)$, which is the density of the true y, and $p(y) = N(X\beta + w, \mathrm{diag}(\sigma) \otimes I_k)$, which is the density for y from MDAGAR. The ordering of the spatial units does not appear to have a significant impact on model fitting, as the density plots for the four orderings almost overlap with each other, although (3) suggests that some order dependence may be expected.

FIGURE 5.


Density plots for WAICs, D scores, and $D_{KL}(p(y_{true}) \,\|\, p(y))$ over 85 datasets for the MDAGAR model using four different orderings: Northeast (red), northwest (green), southeast (blue), and southwest (purple). The dotted vertical line shows the mean for each plot
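Because both the true density and the fitted density in Section 3.4 are Gaussian with diagonal covariance, the Kullback-Leibler divergence has a closed form. The sketch below evaluates that closed form and averages it over posterior draws to obtain a posterior mean. The function names and the assumption that each draw supplies a mean vector (playing the role of Xβ + w) and a vector of variances are illustrative choices, not a description of the authors' code.

```python
import numpy as np

def kl_diag_gaussians(mean_p, var_p, mean_q, var_q):
    """KL(p || q) for Gaussians with diagonal covariances; sums over the last axis."""
    mean_p, var_p, mean_q, var_q = (
        np.asarray(a, dtype=float) for a in (mean_p, var_p, mean_q, var_q)
    )
    terms = var_p / var_q + (mean_q - mean_p) ** 2 / var_q - 1.0 + np.log(var_q / var_p)
    return 0.5 * terms.sum(axis=-1)

def posterior_mean_kl(mean_true, var_true, mean_draws, var_draws):
    """Average KL(p_true || p_draw) over posterior draws (rows of mean_draws, var_draws)."""
    return float(np.mean(kl_diag_gaussians(mean_true, var_true, mean_draws, var_draws)))

# Hypothetical example: two observations, two posterior draws (made-up numbers).
mean_true = [0.0, 0.0]
var_true = [1.0, 1.0]
mean_draws = [[0.0, 0.0], [1.0, 1.0]]
var_draws = [[1.0, 1.0], [1.0, 1.0]]
print(posterior_mean_kl(mean_true, var_true, mean_draws, var_draws))
```

The first argument pair plays the role of the true density and the second the fitted density, matching the direction of the divergence in the text.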

3.5 |. Model selection for different disease orders

We now evaluate the effectiveness of the method in Section 2.5 at selecting the model with the correct ordering of diseases. We used the bridgesampling package in R to compute the posterior model probabilities and selected the model $M_i$ satisfying $p(M_i \mid y^{(n)}) = \max_{t=1,\ldots,6} p(M_t \mid y^{(n)})$ for each of the $n = 50 \times 6$ datasets generated as described in Section 3.1. Table 1 presents the probability of each model being selected for different true model scenarios. The probability of selecting the true model is shown in bold along the diagonal. Our experiment reveals that bridge sampling is extremely effective at choosing the correct order. It identified the correct order in 78% to 90% of the datasets, which is substantially more often than it chose any of the misspecified models.

TABLE 1.

Proportion of times (π(Mi)) bridge sampling chose the model with the correct order out of the 50 datasets with that order

True model  π(M1)  π(M2)  π(M3)  π(M4)  π(M5)  π(M6)
M1          0.90   0.00   0.10   0.00   0.00   0.00
M2          0.00   0.86   0.00   0.00   0.14   0.00
M3          0.14   0.00   0.86   0.00   0.00   0.00
M4          0.00   0.00   0.00   0.90   0.00   0.10
M5          0.00   0.22   0.00   0.00   0.78   0.00
M6          0.00   0.00   0.00   0.16   0.00   0.84
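The selection rule in this experiment can be sketched as follows: given one log marginal likelihood per candidate ordering (for example, as estimated by bridge sampling), convert them to posterior model probabilities and pick the largest. The function name and the numerical inputs below are hypothetical; only the rule of choosing the model with maximal posterior probability comes from the text.

```python
import numpy as np

def posterior_model_probs(log_marginals, prior=None):
    """Posterior model probabilities from log marginal likelihoods.

    Uses a uniform prior over models unless `prior` is given.
    """
    log_marginals = np.asarray(log_marginals, dtype=float)
    if prior is None:
        prior = np.full(log_marginals.shape, 1.0 / log_marginals.size)
    log_post = log_marginals + np.log(np.asarray(prior, dtype=float))
    # Subtract the maximum before exponentiating to avoid underflow.
    log_post -= log_post.max()
    weights = np.exp(log_post)
    return weights / weights.sum()

# Hypothetical log marginal likelihoods for six disease orderings (made-up values).
log_ml = [-1502.3, -1510.8, -1504.1, -1515.0, -1511.2, -1516.7]
probs = posterior_model_probs(log_ml)
selected = int(np.argmax(probs)) + 1  # 1-based model index
print(probs.round(3), "selected: M%d" % selected)
```

Working on the log scale matters here because marginal likelihoods of this magnitude underflow to zero if exponentiated directly.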

4 |. MULTIPLE CANCER ANALYSIS FROM SEER

We now turn to analyzing an areal dataset using the MDAGAR model for four different cancers: lung, esophagus, larynx, and colorectal. Adenocarcinoma of the lung and esophageal cancer have been found to share common risk factors44 and metabolic mechanisms.45 Lung cancer appears to be among the most common second primary cancers in patients with colon cancer.46 Meanwhile, patients with laryngeal cancer have also been reported to possess high risks of developing second primary lung cancer.47 The dataset is extracted from the SEER*Stat database using the SEER*Stat statistical software.48 It covers the four cancers (lung, esophagus, larynx, and colorectal), and the outcome is the 5-year average age-adjusted incidence rate (age-adjusted to the 2000 U.S. Standard Population) per 100 000 population for the years 2012 to 2016 across the 58 counties of California, USA, as mapped in Figure 6. The maps exhibit preliminary evidence of correlation across space and among cancers. Cutoffs for the different levels of incidence rates are quantiles for each cancer. For all four cancers, incidence rates are relatively higher in counties concentrated in the middle northern areas, including Shasta, Tehama, Glenn, Butte, and Yuba, than in other areas. In general, northern areas have higher incidence rates than southern areas. This is especially pronounced for lung cancer and esophageal cancer. For larynx cancer, while the highest incidence rates are in the northwest (Del Norte and Siskiyou counties), the incidence rates in the south are also at somewhat higher levels. For colorectal cancer, the areas along the southern edge of the state also exhibit high incidence rates.

FIGURE 6.


Maps of 5-year average age-adjusted incidence rates per 100 000 population for lung, esophagus, larynx, and colorectal cancer in California, 2012 to 2016

As an exploratory tool to assess associations among the cancers, we calculate Pearson's correlation for each pair of cancers by regarding incidence rates in different counties as independent samples and find Pearson's correlation coefficient between the incidence of lung cancer and those of esophageal, larynx, and colorectal cancers to be 0.55, 0.46, and 0.46, respectively. Meanwhile, the correlation between esophageal and larynx cancer is 0.27. Next, to explore the spatial association for each disease, we calculate Moran's I based upon rth order neighbors for each cancer and plot the areal correlogram.49 Defining distance intervals $(0, d_1], (d_1, d_2], (d_2, d_3], \ldots$, the rth order neighbors refer to units with distance in $(d_{r-1}, d_r]$, that is, within distance $d_r$ but separated by more than $d_{r-1}$. The distance is the Euclidean distance from an Albers map projection of California. As shown in Figure 7, lung, esophageal, and colorectal cancers all present spatial patterns that initially diminish with increasing r and eventually flatten close to 0. Overall, counties with similar levels of incidence rates tend to exhibit some spatial clustering.

FIGURE 7.


Moran’s I of rth order neighbors for lung, esophageal, larynx, and colorectum cancer
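The areal correlogram described above can be sketched with a small function that computes Moran's I separately for each distance band, using binary weights that equal 1 when two units are rth order neighbors. The function name, the use of one projected coordinate pair per county, and the toy inputs are assumptions for illustration; the text specifies only that distances are Euclidean distances on an Albers projection.

```python
import numpy as np

def morans_i_by_band(values, coords, breaks):
    """Moran's I for each distance band (breaks[r-1], breaks[r]] with binary weights.

    values: length-n rates; coords: n x 2 projected coordinates;
    breaks: increasing cutoffs starting at 0, e.g. [0, d1, d2, ...].
    """
    x = np.asarray(values, dtype=float)
    xy = np.asarray(coords, dtype=float)
    z = x - x.mean()
    # Pairwise Euclidean distances between units.
    dist = np.sqrt(((xy[:, None, :] - xy[None, :, :]) ** 2).sum(axis=-1))
    out = []
    for lo, hi in zip(breaks[:-1], breaks[1:]):
        w = ((dist > lo) & (dist <= hi)).astype(float)
        np.fill_diagonal(w, 0.0)  # a unit is never its own neighbor
        s0 = w.sum()
        out.append(np.nan if s0 == 0 else len(x) / s0 * (z @ w @ z) / (z @ z))
    return np.array(out)

# Toy example: four units on a line with increasing rates (made-up data).
toy_values = [1.0, 2.0, 3.0, 4.0]
toy_coords = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
print(morans_i_by_band(toy_values, toy_coords, [0.0, 1.5, 2.5, 3.5]))
```

Plotting the returned values against r gives the correlogram; bands that contain no pairs are returned as NaN rather than zero so they are not mistaken for absence of correlation.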

We turn to model-based inference using (7). We return to the MDAGAR, GMCAR, and MCAR models, where neighbors are defined using shared borders. We analyze this dataset and separate the spatial correlation for each cancer from association among cancers with the following prior specifications,

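$$p(\eta, \rho, \tau, \sigma, w) = \prod_{i=1}^{q} \mathrm{Unif}(\rho_i \mid 0, 1) \times \prod_{i=2}^{q} \prod_{j=1}^{i-1} N(\eta_{ij} \mid 0, 0.01 I_2) \times \prod_{i=1}^{q} N(\beta_i \mid 0, 0.001 I) \times \prod_{i=1}^{q} IG(1/\tau_i \mid 2, 0.1) \times \prod_{i=1}^{q} IG(\sigma_i^2 \mid 2, 1) \times N(w \mid 0, Q_w). \quad (19)$$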

We also discuss a case excluding the risk factor (see Web Appendix B Section S.2.2.3).

For covariates, we include county attributes that possibly affect the incidence rates, including percentages of residents younger than 18 years old (youngij), older than 65 years old (oldij), with education level below high school (eduij), percentages of unemployed residents (unempij), black residents (blackij), male residents (maleij), uninsured residents (uninsureij), and percentages of families below the poverty threshold (povertyij). All covariates are common for different cancers and extracted from the SEER*Stat database48 for the same period, 2012 to 2016.

Since cigarette smoking is a common risk factor for cancers, adult smoking rates (smokeij) for 2014 to 2016 were obtained from the California Tobacco Facts and Figures 2018 database.50 Spatial patterns in the map of adult cigarette smoking rates, shown in Figure 8, are similar to those of cancer incidence, especially for lung and esophageal cancers: the highest smoking rates are concentrated in the north. Some central California counties (eg, Stanislaus, Tuolumne, Merced, Mariposa, Fresno, and Tulare) also exhibit high rates, although there is clearly less spatial clustering of the high rates than in the north.

FIGURE 8.


Important county-level covariates with significant effects: Adult cigarette smoking rates (left), percentage of black residents (middle), and uninsured residents (right)

Since the order of cancers in the DAG specifies the model, we fit all 4!=24 models using (7) and compute the marginal likelihoods using bridge sampling (Section 2.5). By setting the prior model probabilities as $p(M = M_t) = 1/24$ for $t = 1, 2, \ldots, 24$, we compute the posterior model probabilities using (9). These are presented in Table 2. We obtain BMA estimates using (14) with the weights in Table 2. Among all models, model M10 is selected as the best model with the largest posterior probability, 0.577, and the corresponding conditional structure is [esophageal] × [larynx | esophageal] × [colorectal | esophageal, larynx] × [lung | esophageal, larynx, colorectal].

TABLE 2.

The posterior model probabilities for 24 models

p(M1 | y)   p(M2 | y)   p(M3 | y)   p(M4 | y)   p(M5 | y)   p(M6 | y)   p(M7 | y)   p(M8 | y)
0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.000
p(M9 | y)   p(M10 | y)  p(M11 | y)  p(M12 | y)  p(M13 | y)  p(M14 | y)  p(M15 | y)  p(M16 | y)
0.000       0.577       0.000       0.000       0.000       0.000       0.342       0.079
p(M17 | y)  p(M18 | y)  p(M19 | y)  p(M20 | y)  p(M21 | y)  p(M22 | y)  p(M23 | y)  p(M24 | y)
0.000       0.000       0.000       0.000       0.000       0.000       0.000       0.002

Note: Bold values signify that the 95% credible intervals exclude 0.
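As a rough illustration of how the posterior model probabilities in Table 2 act as weights in model averaging, the sketch below forms a weighted average of model-specific posterior means and, alternatively, draws from the mixture of model-specific posteriors. This is a generic Bayesian model averaging sketch and is not claimed to reproduce equation (14); the function names and the model-specific posterior means and draws are hypothetical, while the four nonzero weights are taken from Table 2.

```python
import numpy as np

def bma_mean(model_means, weights):
    """Weighted average of model-specific posterior means."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return float(np.dot(w, np.asarray(model_means, dtype=float)))

def bma_draws(draws_by_model, weights, size, seed=0):
    """Sample from the mixture of model-specific posteriors.

    Each draw first picks a model with probability equal to its weight,
    then picks one stored posterior draw from that model.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    picks = rng.choice(len(w), size=size, p=w)
    return np.array([rng.choice(draws_by_model[m]) for m in picks])

# Nonzero posterior model probabilities from Table 2 (M10, M15, M16, M24).
table2_weights = [0.577, 0.342, 0.079, 0.002]
# Hypothetical model-specific posterior means of one coefficient (made-up values).
hypothetical_means = [0.25, 0.20, 0.22, 0.30]
print(bma_mean(hypothetical_means, table2_weights))
```

Models with posterior probability reported as 0.000 contribute negligibly and are omitted from the weights here.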

Table 3 summarizes the parameter estimates, including regression coefficients, spatial autocorrelation (ρi), spatial precision (τi), and noise variance (σi2) for each cancer. From M10 and BMA, we find the regression slopes for the percentage of smokers and uninsured residents are significantly positive and negative, respectively, for esophageal cancer. The negative association between percentage of uninsured and esophageal cancer may seem surprising, but is likely a consequence of counties exhibiting low incidence rates for esophageal cancer having a relatively large number of uninsured residents (see top right in Figure 6 and the rightmost map in Figure 8). Since esophageal cancer has low incidence rates, this association could well be spurious due to spatial confounding. Percentage of smokers is, unsurprisingly, found to be a significant risk factor for lung cancer, while the percentage of black residents seems to be significantly associated with elevated incidence of larynx cancer. In addition, we tend to see that percentage of population below the poverty level has a pronounced association with higher rates of lung and esophageal cancer.

TABLE 3.

Posterior means (95% credible intervals) for parameters estimated from M10 and BMA estimates for regression coefficients only for the SEER four cancer dataset

Parameters Model Esophageal Larynx Colorectal Lung
Intercept M10 16.76 (4.06, 29.56) 6.37 (−1.16, 13.89) 19.16 (−11.94, 49.72) 28.68 (−18.3, 74.93)
BMA 15.87 (2.92, 28.63) 6.85 (−0.71, 14.38) 18.21 (−14.03, 49.07) 28.25 (−18.12, 74.52)
Smokers(%) M10 0.25 (0.12, 0.37) 0.04 (−0.03, 0.12) 0.23 (−0.12, 0.57) 0.81 (0.08, 1.62)
BMA 0.23 (0.10, 0.36) 0.05 (−0.03, 0.12) 0.22 (−0.13, 0.58) 0.80 (0.08, 1.59)
Young(%) M10 −0.12 (−0.31, 0.07) −0.07 (−0.18, 0.04) 0.27 (−0.2, 0.76) −0.08 (−0.90, 0.74)
BMA −0.11 (−0.3, 0.08) −0.08 (−0.19, 0.03) 0.29 (−0.18, 0.78) −0.01 (−0.86, 0.82)
Old (%) M10 −0.11 (−0.25, 0.04) −0.05 (−0.14, 0.03) 0.10 (−0.28, 0.48) −0.09 (−0.81, 0.67)
BMA −0.10 (−0.25, 0.05) −0.05 (−0.14, 0.03) 0.10 (−0.29, 0.49) −0.08 (−0.79, 0.66)
Edu (%) M10 0.02 (−0.08, 0.12) −0.02 (−0.08, 0.04) 0.16 (−0.12, 0.43) −0.20 (−0.75, 0.31)
BMA 0.02 (−0.09, 0.12) −0.02 (−0.07, 0.04) 0.15 (−0.14, 0.42) −0.24 (−0.79, 0.27)
Unemp (%) M10 −0.13 (−0.29, 0.03) 0.01 (−0.08, 0.10) −0.09 (−0.54, 0.37) 0.60 (−0.47, 1.55)
BMA −0.12 (−0.28, 0.05) 0.01 (−0.08, 0.1) −0.08 (−0.54, 0.38) 0.61 (−0.43, 1.56)
Black (%) M10 0.14 (−0.06, 0.34) 0.14 (0.03, 0.26) −0.16 (−0.73, 0.39) 0.15 (−1.06, 1.29)
BMA 0.13 (−0.07, 0.33) 0.15 (0.03, 0.27) −0.18 (−0.75, 0.39) 0.14 (−1.02, 1.25)
Male (%) M10 −0.04 (−0.17, 0.09) 0.00 (−0.07, 0.08) 0.24 (−0.12, 0.60) 0.14 (−0.57, 0.79)
BMA −0.04 (−0.17, 0.09) 0 (−0.07, 0.08) 0.24 (−0.12, 0.62) 0.14 (−0.55, 0.82)
Uninsured (%) M10 −0.24 (−0.44, −0.04) −0.08 (−0.20, 0.04) 0.07 (−0.44, 0.58) 0.01 (−0.82, 0.86)
BMA −0.23 (−0.42, −0.02) −0.08 (−0.2, 0.04) 0.09 (−0.42, 0.61) 0 (−0.81, 0.82)
Poverty (%) M10 0.30 (−0.24, 0.84) 0.20 (−0.12, 0.51) −0.06 (−1.51, 1.45) 0.85 (−2.15, 3.85)
BMA 0.32 (−0.23, 0.87) 0.2 (−0.12, 0.51) −0.08 (−1.54, 1.42) 0.8 (−2.14, 3.75)
ρ_cancer M10 0.25 (0.01, 1.00) 0.33 (0.01, 0.96) 0.50 (0.03, 0.97) 0.52 (0.03, 0.99)
τ_cancer M10 25.27 (5.08, 61.57) 27.60 (8.05, 60.42) 19.97 (3.06, 55.61) 20.31 (1.77, 55.92)
σ²_cancer M10 1.67 (1.11, 2.47) 0.49 (0.28, 0.75) 8.22 (1.09, 14.23) 1.19 (0.18, 5.21)

Note: Bold values signify that the 95% credible intervals exclude 0.

Recall from Section 2.4 that ρ1 is the residual spatial autocorrelation for esophageal cancer after accounting for the explanatory variables, while ρi for i=2,3,4 are residual spatial autocorrelations after accounting for the explanatory variables and the preceding cancers in the model M10. From Table 3, we see that esophageal cancer exhibits relatively weaker spatial autocorrelation, while the residual spatial autocorrelations for colorectal and lung cancers after accounting for preceding cancers are both at moderate levels of around 0.5. Similarly, for the spatial precision τi, larynx appears to have the smallest conditional variability, while the conditional variability for colorectal and lung is slightly larger.

For the posterior mean incidence rates and spatial random effects wij, we present estimates from model M10 and BMA. Figures 9A and 9B show maps of posterior mean spatial random effects and model fitted incidence rates for the four cancers obtained from BMA, while Figures 10A and 10B show the corresponding maps from model M10. The posterior mean incidence rates from BMA and M10 are in accord with each other, and both present DAGAR-smoothed versions of the original patterns in Figure 6. For posterior means of spatial random effects, in general, the estimates from M10 are similar to model averaged estimates, especially for lung and colorectal cancers, exhibiting relatively large positive values in the northern counties, where the incidence rates are high. However, for esophageal and larynx cancers we see slight discrepancies between M10 and BMA in the north. The BMA estimates produce larger positive random effects, ranging between 0.1 and 0.5, in most counties, while M10 produces estimates between 0 and 0.1 for esophageal cancer. More counties with random effects larger than 0.1 are estimated from M10 for larynx cancer. We believe this is attributable, at least in part, to another competitive model, M15 = [larynx] × [esophagus | larynx] × [lung | larynx, esophagus] × [colorectal | larynx, esophagus, lung] (posterior probability 0.342), which contributes to the BMA. On the other hand, the effects of some important county-level covariates play an essential role in the discrepancy between the estimates of random effects and model fitted incidence rates for each cancer.

FIGURE 9.


Maps of posterior results using BMA for lung, esophagus, larynx, and colorectal cancer in California including (A) posterior mean spatial random effects and (B) posterior mean incidence rates

FIGURE 10.


Maps of posterior results using the highest probability model M10 for lung, esophagus, larynx, and colorectal cancer in California including (A) posterior mean spatial random effects and (B) posterior mean incidence rates

Recall from Section 2 that $\eta_{0ii'}$ and $\eta_{1ii'}$ reflect the associations among cancers that can be attributed to spatial structure. Specifically, larger values of $\eta_{0ii'}$ will indicate inherent associations unrelated to spatial structure, while the magnitude of $\eta_{1ii'}$ reflects associations due to spatial structure. Figure S.2 presents posterior distributions of η for all pairs of cancers. We see from the distribution of $\eta_{0,43}$ that there is a pronounced nonspatial component in the association between lung and colorectal cancers. Similar, albeit somewhat less pronounced, nonspatial associations are seen between larynx and esophageal cancers and between lung and larynx cancers. Analogously, the posterior distributions for $\eta_{1,43}$ and $\eta_{1,32}$ tend to have substantial positive support, suggesting substantial spatial cross-correlations between lung and colorectal cancers and between colorectal and larynx cancers. Interestingly, we find negative support in the posterior distributions for $\eta_{1,21}$ and $\eta_{1,42}$. The negative mass implies that the covariance among cancers within a region is suppressed by strong dependence with neighboring regions. This seems to be the case for associations between lung and esophageal cancers and between lung and larynx cancers.

Web Appendix B also presents supplementary analysis that excludes adult smoking rates from the covariates, which we refer to as “Case 2.” Figure S.3 shows estimated correlations between pairwise cancers in each of the 58 counties. The top row presents the correlations including smoking rates (“Case 1”) as has been analyzed here. The bottom row presents the corresponding maps for “Case 2.” Interestingly, accounting for smoking rates substantially diminishes the associations among esophageal, colorectal and lung cancers. These are significantly associated in “Case 2” but only lung and colorectal retain their significance after accounting for smoking rates.

We also implemented the order-free MCAR model (as described in Section 3.3) and present the estimates of posterior mean incidence rates and spatial random effects in Figure 11. Compared with MDAGAR, the MCAR fits colorectal cancer better, since the posterior incidence rates in Figure 11B are closer to those in the raw map (Figure 6), while MDAGAR seems to outperform MCAR for larynx cancer. Overall, the model fitting is comparable between MDAGAR and MCAR.

FIGURE 11.


Maps of posterior results (Case 1) using MCAR for lung, esophagus, larynx, and colorectal cancer in California including (A) posterior mean spatial random effects and (B) posterior mean incidence rates

5 |. DISCUSSION

We have developed a multivariate “MDAGAR” model in conjunction with a bridge sampling method to estimate spatial correlations for multiple correlated diseases. The MDAGAR is constructed hierarchically over areal units based on univariate DAGAR models. We demonstrate that MDAGAR tends to outperform GMCAR when association between spatial random effects for different diseases is weak or moderate. MDAGAR retains the interpretability of spatial autocorrelations, as in univariate DAGAR, separating the spatial correlation for each disease from any inherent or endemic association among diseases. While MDAGAR, like all DAG based models, is specified according to a fixed order of the diseases, we show that bridge sampling can effectively choose among the different orders and also provide BMA inference in a computationally efficient manner.

Our data analysis elicits how correlations between incidence rates for different cancers are impacted by risk factors. For example, eliminating adult cigarette smoking rates produces similar spatial patterns for the incidence rates of esophageal, lung and colorectal cancer. In addition, the significant correlation between lung and esophageal cancer, even after accounting for smoking rates, implies other inherent or endemic association such as latent risk factors and metabolic mechanisms. We also see that the MDAGAR based posterior estimates of the latent spatial effects in Figures 9A and 10A resemble those from MDAGAR without covariates (Figure 12), while the maps for the estimated incidence rates in Figures 9B and 10B account for the spatial variability of the covariates.

FIGURE 12.


Maps of posterior mean spatial random effects (with no covariates) using the same order as M10

Future research will look at different constructions of graphical models for areal data. Examples can include defining rth order neighbors using distance metrics, as in Figure 7, and deriving alternate precision matrices. We also intend to address scalability with a very large number of diseases. Here, common spatial factor models for areal data51 can be adapted to model the factors as DAGAR, thereby yielding classes of DAGAR-based factor models. A very different approach will be to build scalable graphical models using two different graphs: one for areal units (CAR or DAGAR) and another undirected graph representing conditional independence among cancers. Multidimensional MRFs as well as developments analogous to recently introduced graphical Gaussian processes52 can be pursued for high-dimensional disease mapping. Finally, spatial confounding in multivariate disease mapping53–55 will be explored in the context of MDAGAR.

DATA AVAILABILITY STATEMENT

All computer programs implementing the examples in this paper can be found in the public domain and downloaded from https://github.com/LeiwenG/Multivariate_DAGAR.

Supplementary Material

Supplementary Information

ACKNOWLEDGEMENTS

The work of the first and third authors has been supported in part by the Division of Mathematical Sciences (DMS) of the National Science Foundation (NSF) under grant 1916349 and by the National Institute of Environmental Health Sciences (NIEHS) under grants R01ES030210 and 5R01ES027027. The work of the second author was supported by the Division of Mathematical Sciences (DMS) of the National Science Foundation (NSF) under grant 1915803.

Funding information

Division of Mathematical Sciences, Grant/Award Numbers: 1915803, 1916349; National Institute of Environmental Health Sciences, Grant/Award Numbers: 5R01ES027027, R01ES030210

Footnotes

CONFLICT OF INTEREST

The authors declare no potential conflict of interest.

SUPPORTING INFORMATION

Additional supporting information may be found online in the Supporting Information section at the end of this article.

REFERENCES

  • 1. Koch T. Cartographies of Disease: Maps, Mapping, and Medicine. Redlands, CA: Esri Press; 2005.
  • 2. Berke O. Exploratory disease mapping: kriging the spatial risk function from regional count data. Int J Health Geogr. 2004;3(1):18.
  • 3. Richardson S, Thomson A, Best N, Elliott P. Interpreting posterior relative risk estimates in disease-mapping studies. Environ Health Perspect. 2004;112(9):1016–1025.
  • 4. Rue H, Held L. Gaussian Markov Random Fields: Theory and Applications. Monographs on Statistics and Applied Probability. Boca Raton, FL: Chapman and Hall/CRC Press; 2005.
  • 5. Besag J. Spatial interaction and the statistical analysis of lattice systems. J Royal Stat Soc Ser B (Methodol). 1974;36(2):192–225.
  • 6. Besag J, York J, Mollié A. Bayesian image restoration, with two applications in spatial statistics. Ann Inst Stat Math. 1991;43(1):1–20.
  • 7. Kissling WD, Carl G. Spatial autocorrelation and the selection of simultaneous autoregressive models. Glob Ecol Biogeogr. 2008;17(1):59–71.
  • 8. Datta A, Banerjee S, Hodges JS, Gao L. Spatial disease mapping using directed acyclic graph auto-regressive (DAGAR) models. Bayesian Anal. 2019;14(4):1221–1244. doi: 10.1214/19-BA1177
  • 9. Lindström S, Finucane H, Bulik-Sullivan B, et al. Quantifying the genetic correlation between multiple cancer types. Cancer Epidemiol Prev Biomark. 2017;26(9):1427–1435.
  • 10. Jin X, Carlin BP, Banerjee S. Generalized hierarchical multivariate CAR models for areal data. Biometrics. 2005;61(4):950–961.
  • 11. Knorr-Held L, Best NG. A shared component model for detecting joint and selective clustering of two diseases. J R Stat Soc A Stat Soc. 2001;164(1):73–85.
  • 12. Kim H, Sun D, Tsutakawa RK. A bivariate Bayes method for improving the estimates of mortality rates with a twofold conditional autoregressive model. J Am Stat Assoc. 2001;96(456):1506–1521.
  • 13. Gelfand AE, Vounatsou P. Proper multivariate conditional autoregressive models for spatial data analysis. Biostatistics. 2003;4(1):11–15.
  • 14. Carlin BP, Banerjee S. Hierarchical multivariate CAR models for spatio-temporally correlated survival data. Bayesian Stat. 2003;7(7):45–63.
  • 15. Held L, Natário I, Fenton SE, Rue H, Becker N. Towards joint disease mapping. Stat Methods Med Res. 2005;14(1):61–82.
  • 16. Jin X, Banerjee S, Carlin BP. Order-free co-regionalized areal data models with application to multiple-disease mapping. J Royal Stat Soc Ser B (Stat Methodol). 2007;69(5):817–838.
  • 17. Zhang Y, Hodges JS, Banerjee S. Smoothed ANOVA with spatial effects as a competitor to MCAR in multivariate spatial smoothing. Ann Appl Stat. 2009;3(4):1805.
  • 18. Diva U, Dey DK, Banerjee S. Parametric models for spatially correlated survival data for individuals with multiple cancers. Stat Med. 2008;27(12):2127–2144.
  • 19. Martinez-Beneito MA. A general modelling framework for multivariate disease mapping. Biometrika. 2013;100(3):539–553.
  • 20. Marí-Dell’Olmo M, Martinez-Beneito MA, Gotsens M, Palència L. A smoothed ANOVA model for multivariate ecological regression. Stoch Env Res Risk A. 2014;28(3):695–706.
  • 21. Lawson B, Banerjee S, Haining R, Ugarte D. Handbook of Spatial Epidemiology. Boca Raton, FL: CRC Press; 2016.
  • 22. Mardia K. Multi-dimensional multivariate Gaussian Markov random fields with application to image processing. J Multivar Anal. 1988;24(2):265–284.
  • 23. Sain SR, Furrer R, Cressie N. A spatial analysis of multivariate output from regional climate models. Ann Appl Stat. 2011;5(1):150–175. doi: 10.1214/10-AOAS369
  • 24. MacNab YC. Linear models of coregionalization for multivariate lattice data: a general framework for coregionalized multivariate CAR models. Stat Med. 2016;35(21):3827–3850. doi: 10.1002/sim.6955
  • 25. MacNab YC. Some recent work on multivariate Gaussian Markov random fields (with discussion). TEST Offic J Spanish Soc Stat Oper Res. 2018;27(3):497–541. doi: 10.1007/s11749-018-0605-3
  • 26. MacNab YC. Bayesian estimation of multivariate Gaussian Markov random fields with constraint. Stat Med. 2020;39(30):4767–4788. doi: 10.1002/sim.8752
  • 27. Zhu J, Eickhoff JC, Yan P. Generalized linear latent variable models for repeated measures of spatially correlated multivariate data. Biometrics. 2005;61(3):674–683. doi: 10.1111/j.1541-0420.2005.00343.x
  • 28. Bradley JR, Holan SH, Wikle CK. Multivariate spatio-temporal models for high-dimensional areal data with application to longitudinal employer-household dynamics. Ann Appl Stat. 2015;9(4):1761–1791. doi: 10.1214/15-AOAS862
  • 29. Daniels MJ, Zhou Z, Zou H. Conditionally specified space-time models for multivariate processes. J Comput Graph Stat. 2006;15(1):157–177. doi: 10.1198/106186006X100434
  • 30. Wall MM. A close look at the spatial structure implied by the CAR and SAR models. J Stat Plan Infer. 2004;121(2):311–324.
  • 31. Meng XL, Wong WH. Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Stat Sin. 1996;6:831–860.
  • 32. Gronau QF, Sarafoglou A, Matzke D, et al. A tutorial on bridge sampling. J Math Psychol. 2017;81:80–97.
  • 33. Cover TM, Thomas JA. Elements of Information Theory. Wiley Series in Telecommunications and Signal Processing. Hoboken, NJ: Wiley Interscience; 1991.
  • 34. Banerjee S. Modeling massive spatial datasets using a conjugate Bayesian linear modeling framework. Spat Stat. 2020;37:100417.
  • 35. Cressie N, Zammit-Mangion A. Multivariate spatial covariance models: a conditional approach. Biometrika. 2016;103(4):915–935. doi: 10.1093/biomet/asw045
  • 36. Gamerman D, Lopes HF. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Boca Raton, FL: Chapman & Hall/CRC Press; 2006.
  • 37. Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: a tutorial. Stat Sci. 1999;14:382–401.
  • 38. Perrakis K, Ntzoufras I, Tsionas EG. On the use of marginal posteriors in marginal likelihood estimation via importance sampling. Comput Stat Data Anal. 2014;77:54–69.
  • 39. Gelfand AE, Dey DK. Bayesian model choice: asymptotics and exact calculations. J Royal Stat Soc Ser B (Methodol). 1994;56(3):501–514.
  • 40. Gronau QF, Singmann H, Wagenmakers EJ. Bridgesampling: an R package for estimating normalizing constants; 2017. arXiv preprint arXiv:1710.08162.
  • 41. Watanabe S. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J Mach Learn Res. 2010;11(Dec):3571–3594.
  • 42. Gelman A, Hwang J, Vehtari A. Understanding predictive information criteria for Bayesian models. Stat Comput. 2014;24(6):997–1016.
  • 43. Gelfand AE, Ghosh SK. Model choice: a minimum posterior predictive loss approach. Biometrika. 1998;85(1):1–11.
  • 44. Agrawal K, Markert RJ, Agrawal S. Risk factors for adenocarcinoma and squamous cell carcinoma of the esophagus and lung. Hypertension. 2018;61(46):0–09.
  • 45. Shi WX, Chen SQ. Frequencies of poor metabolizers of cytochrome P450 2C19 in esophagus cancer, stomach cancer, lung cancer and bladder cancer in Chinese population. World J Gastroenterol. 2004;10(13):1961.
  • 46. Kurishima K, Miyazaki K, Watanabe H, et al. Lung cancer patients with synchronous colon cancer. Mol Clin Oncol. 2018;8(1):137–140.
  • 47. Akhtar J, Bhargava R, Shameem M, et al. Second primary lung cancer with glottic laryngeal cancer as index tumor–A case report. Case Rep Oncol. 2010;3(1):35–39.
  • 48. Surveillance Research Program. National Cancer Institute. SEER*Stat software. Surveillance Res Program. 2019. https://seer.cancer.gov/seerstat/
  • 49. Banerjee S, Carlin BP, Gelfand AE. Hierarchical Modeling and Analysis for Spatial Data. Boca Raton, FL: CRC Press; 2014.
  • 50. California Department of Public Health. California Tobacco control program California tobacco facts and figures; 2018.
  • 51. Wang F, Wall MM. Generalized common spatial factor model. Biostatistics. 2003;4(4):569–582. doi: 10.1093/biostatistics/4.4.569
  • 52. Dey D, Datta A, Banerjee S. Graphical Gaussian processes for highly multivariate spatial data. Biometrika. 2020. doi: 10.1093/biomet/asab061
  • 53. Azevedo D, Prates M, Bandyopadhyay D. MSPOCK: alleviating spatial confounding in multivariate disease mapping models. J Agric Biol Environ Stat. 2021;26:464–491.
  • 54. Khan K, Calder CA. Restricted spatial regression methods: implications for inference. J Am Stat Assoc. 2020;1–13. doi: 10.1080/01621459.2020.1788949
  • 55. Zimmerman DL, Hoef JMV. On deconfounding spatial confounding in linear models. Am Stat. 2021;1–9. doi: 10.1080/00031305.2021.1946149
