Abstract
Disease mapping is an important statistical tool used by epidemiologists to assess geographic variation in disease rates and identify lurking environmental risk factors from spatial patterns. Such maps rely upon spatial models for regionally aggregated data, where neighboring regions tend to exhibit more similar outcomes than those farther apart. We contribute to the literature on multivariate disease mapping, which deals with measurements on multiple (two or more) diseases in each region. We aim to disentangle associations among the multiple diseases from spatial autocorrelation in each disease. We develop multivariate directed acyclic graphical autoregression models to accommodate spatial and inter-disease dependence. The hierarchical construction imparts flexibility and richness, interpretability of spatial autocorrelation and inter-disease relationships, and computational ease, but depends upon the order in which the cancers are modeled. To obviate this, we demonstrate how Bayesian model selection and averaging across orders are easily achieved using bridge sampling. We compare our method with a competitor using simulation studies and present an application to multiple cancer mapping using data from the Surveillance, Epidemiology, and End Results program.
Keywords: areal data analysis, Bayesian hierarchical models, directed acyclic graphical autoregression, multiple disease mapping, multivariate areal data models
1 |. INTRODUCTION
Spatially referenced data comprising regional aggregates of health outcomes over delineated administrative units such as counties or zip codes are widely used by epidemiologists to map mortality or morbidity rates and better understand their geographic variation. Disease mapping, as this exercise is customarily called, employs statistical models to present smoothed maps of rates or counts of a disease. Such maps can assist investigators in identifying lurking risk factors1 and in detecting “hot-spots” or spatial clusters emerging from common environmental and socio-demographic effects shared by neighboring regions. By interpolating estimates of health outcome from areal data onto a continuous surface, disease mapping also generates smoothed maps for the small-area scale, adjusting for the sparsity of data or low population size.2,3
For a single disease, there has been a long tradition of employing Markov random fields (MRFs)4 to introduce conditional dependence for the outcome in a region given its neighbors. Two conspicuous examples are the conditional autoregression (CAR)5,6 and simultaneous autoregression (SAR) models7 that build dependence using undirected graphs to model geographic maps. More recently, a class of directed acyclic graphical autoregressive (DAGAR) models was proposed as a preferred alternative to CAR or SAR models in allowing better identifiability and interpretation of spatial autocorrelation parameters.8
Multivariate disease mapping is concerned with the analysis of multiple diseases that are associated among themselves and across space. It is not uncommon to find substantial associations among different diseases sharing genetic and environmental risk factors. Quantification of genetic correlations among multiple cancers has revealed associations among several cancers including lung, breast, colorectal, ovarian, and pancreatic cancers.9 Disease mapping exercises with lung and esophageal cancers have also evinced associations among them.10 When the diseases are inherently related so that the prevalence of one encourages (or inhibits) occurrence of the other, there can be substantial inferential benefits in jointly modeling the diseases rather than fitting independent univariate models for each disease.10–21
The existence of multivariate MRFs can be demonstrated using a multivariate extension of the so-called “Brook’s lemma,” which attempts to derive a joint distribution from specified full conditionals.13,22,23 McNab, in a series of papers, has delivered substantial insights into the construction, computation, and properties of different classes of multivariate CAR models.21,24–26 Rather than work with full conditionals, an alternate approach builds joint distributions using linear transformations of a set of univariate CAR models.14,16,19,27,28 A different class emerges from hierarchical constructions10,29 where each disease enters the model in a given sequence (or order) of conditional probability models. This produces simple yet flexible and interpretable association structures, but every ordering produces a different model, resulting in an explosion of models even for a modest number of cancers (say, more than 2 or 3 diseases). While multivariate MRF models constructed from undirected graphs are invariant to ordering, and hence obviate the issue of order dependence, they impose restrictions to ensure positive-definiteness of covariance matrices, can be computationally onerous, and render covariance structures that are challenging to interpret.
We introduce a class of multivariate DAGAR (MDAGAR) models for multiple disease mapping by building the joint distribution hierarchically using univariate DAGAR models. This approach is analogous to generalized MCAR (GMCAR) models.10 The objective here is to retain the interpretation of spatial autocorrelation offered by the DAGAR, which is challenging for the CAR30 and order-free MCAR models. Our methodological innovation is devising a hierarchical MDAGAR model in conjunction with a bridge sampling algorithm31,32 for choosing among differently ordered hierarchical models and, more importantly, offering Bayesian model averaged (BMA) inference to neutralize the effect of order-dependent inference. The idea is to begin with a fixed ordered set of cancers, posited to be associated with each other and across space, and build a hierarchical model. The DAGAR specification produces a comprehensible association structure, while bridge sampling allows us to rank differently ordered models using their marginal posterior probabilities. Since each model corresponds to an assumed conditional dependence, the marginal posterior probabilities will indicate the tenability of such assumptions given the data. Epidemiologists, then, will be able to use this information to establish relationships among the diseases and spatial autocorrelation for each disease.
The article proceeds as follows. Section 2 develops the hierarchical MDAGAR model and introduces a bridge sampling method to select the MDAGAR with the best hierarchical order. Section 3 presents simulation studies comparing MDAGAR with GMCAR and order-free MCAR models and also illustrates model averaged inference from the bridge sampling. Section 4 applies our MDAGAR to age-adjusted incidence rates of four cancers from the Surveillance, Epidemiology, and End Results (SEER) database and discusses different cases with respect to predictors. Finally, in Section 5, we summarize with some concluding remarks and pointers for future research.
2 |. METHODS
2.1 |. Overview of univariate DAGAR modeling
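Let $\mathcal{G} = \{\mathcal{V}, \mathcal{E}\}$ be a graph corresponding to a geographic map, where the vertices $\mathcal{V} = \{1, 2, \ldots, k\}$ represent clearly delineated regions on the map and $\mathcal{E}$ is the collection of edges between the vertices representing neighboring pairs of regions. We denote two neighboring regions i and j by $i \sim j$. We assume that the vertices in $\mathcal{V}$ are ordered in a fixed sequence according to their number labels. The DAGAR model builds a spatial autocorrelation model for a single outcome on $\mathcal{G}$ using the ordered set of vertices in $\mathcal{V}$.8 Let $N(1)$ be the empty set and let $j \in \{2, \ldots, k\}$ be the index for any region except 1. We define $N(j)$ to be the set of labels of geographic neighbors of j that precede j in $\mathcal{V}$, that is, $N(j) = \{j' < j : j' \sim j\}$.8 Let $w = (w_1, \ldots, w_k)^{\top}$ be a collection of k random variables defined over the map. DAGAR specifies the following autoregression,
| $w_1 = \epsilon_1; \quad w_j = \sum_{j' \in N(j)} b_{jj'} w_{j'} + \epsilon_j, \quad j = 2, \ldots, k,$ | (1) |
where the $\epsilon_j$ are independent zero-mean Gaussian errors with the precision $\tau\lambda_j$, and $b_{jj'} = 0$ if $j' \notin N(j)$. This implies that $w$ is zero-mean Gaussian with precision matrix $\tau Q(\rho)$, where $Q(\rho)$ is a spatial precision matrix that depends only upon a spatial autocorrelation parameter $\rho$ and $\tau$ is a positive scale parameter. The precision matrix is $Q(\rho) = (I - B)^{\top} F (I - B)$, where B is a $k \times k$ strictly lower-triangular matrix and F is a $k \times k$ diagonal matrix. The elements of B and F are denoted by $b_{jj'}$ and $\lambda_j$, respectively, where
| $b_{jj'} = \begin{cases} 0 & \text{if } j' \notin N(j), \\ \dfrac{\rho}{1 + (n_{<j} - 1)\rho^2} & \text{if } j' \in N(j), \end{cases} \qquad \lambda_j = \dfrac{1 + (n_{<j} - 1)\rho^2}{1 - \rho^2},$ | (2) |
$n_{<j}$ is the number of members in $N(j)$ and $n_{<1} = 0$. The above definition of $b_{jj'}$ is consistent with the lower-triangular structure of B because $j' \notin N(j)$ for any $j' \geq j$. The derivation of B and F as functions of a spatial correlation parameter $\rho$ is based upon forming local autoregressive models on embedded spanning trees of subgraphs of $\mathcal{G}$.8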
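To make the construction concrete, the following is a minimal sketch (not the authors' implementation) of how the DAGAR precision matrix could be assembled from a binary adjacency matrix. It assumes the standard DAGAR weights of Reference 8, namely $b_{jj'} = \rho / \{1 + (n_{<j} - 1)\rho^2\}$ for preceding neighbors and $\lambda_j = \{1 + (n_{<j} - 1)\rho^2\} / (1 - \rho^2)$, and it assumes regions are ordered as the rows of the adjacency matrix. The function name `dagar_precision` and the path-graph example are illustrative only.

```python
import numpy as np

def dagar_precision(adj, rho):
    """Sketch of Q(rho) = (I - B)^T F (I - B) for a symmetric 0/1 adjacency matrix.

    Assumption: regions are ordered as the rows of `adj`, so the directed
    neighbor set N(j) holds the neighbors of j with a smaller index.
    """
    adj = np.asarray(adj)
    k = adj.shape[0]
    B = np.zeros((k, k))
    F = np.zeros((k, k))
    for j in range(k):
        prev = np.flatnonzero(adj[j, :j])      # N(j): preceding neighbors of j
        n_prev = prev.size                     # n_{<j}
        denom = 1.0 + (n_prev - 1) * rho**2
        B[j, prev] = rho / denom               # b_{jj'} for j' in N(j)
        F[j, j] = denom / (1.0 - rho**2)       # lambda_j
    eye = np.eye(k)
    return (eye - B).T @ F @ (eye - B)

# Illustrative map: a path graph 0 - 1 - 2 - 3 (hypothetical, not from the paper)
path = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)
Q = dagar_precision(path, rho=0.6)
cov = np.linalg.inv(Q)
print(np.round(cov, 3))
```

On a path graph, which is a tree, the implied covariance reduces to an AR(1) structure with unit variances and correlation $\rho^{d}$ at graph distance $d$, which is one way to see why $\rho$ retains its interpretation as an autocorrelation parameter.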
DAGAR and CAR are both examples of MRFs.4 They are similar in that both models use a graph to model geographic neighbors, but they are different in how they model spatial dependencies. DAGAR, as the name suggests, builds dependencies using a directed acyclic graph (DAG). This produces a joint likelihood using sequential construction of the partial conditional distributions of each region's effect given the effects of its preceding neighbors. CAR builds a joint model by specifying Gaussian full conditional distributions by treating the underlying map as an undirected graph, where absence of an edge between two regions denotes conditional independence of their spatial effects given other geographic neighbors. These two approaches yield different structures for the precision matrix with different interpretations for the parameter $\rho$. DAGAR retains the interpretation of $\rho$ as an autocorrelation parameter,8 while interpreting spatial autocorrelation in CAR is challenging.30
2.2 |. Motivating multivariate disease mapping
There is a substantial literature on joint modeling of multiple spatially oriented outcomes, some of which have been cited in Section 1. While it is possible to model each disease separately using a univariate DAGAR, hence independent of each other, the resulting inference will ignore the association among the diseases. This will be manifested in model assessment because the less dependence among diseases that a model accommodates, the farther away it will be from the joint model in the sense of Kullback-Leibler divergence.
More formally, suppose we have two mutually exclusive sets A and B that contain labels for diseases. Let and be the vectors of spatial outcomes over all regions corresponding to the diseases in set A and set B, respectively. A full joint model , where , can be written as . Let and be two nested subsets of diseases in A such that . Consider two competing models, and , where and are probability densities constructed from the joint probability measure by imposing conditional independence such that , respectively. Both and suppress dependence by shrinking the conditional set A, but suppresses more than . We show below that is farther away from than .
A straightforward application of Jensen’s inequality yields , where denotes the conditional expectation with respect to . Therefore,
| (3) |
The equality in the last row follows from the fact that the argument is a function of diseases in and and, hence, in B and because . The argument given in (3) is free of distributional assumptions and is linked to the submodularity of entropy and the “information never hurts” principle.33,34 Equation (3) shows that models built upon hierarchical dependence structures depend upon the order in which the diseases enter the model. While this is a disadvantage, hierarchical dependencies are easier to interpret, easier to compute using currently available Bayesian modeling software such as BUGS or JAGS and have been shown to be very competitive in inferential performance.35 Hence, we develop and implement Bayesian model averaging over different ordered models in a computationally efficient manner.
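The ordering of divergences in (3) can be checked numerically in the Gaussian case. The sketch below is an illustration under our own assumptions, not part of the original derivation: it takes a zero-mean Gaussian for four outcomes, treats the last outcome as the set B and the first three as the set A, and uses the closed form $\mathrm{KL} = \tfrac{1}{2}\log(v_{\text{reduced}} / v_{\text{full}})$, where $v$ denotes the conditional variance of the outcome in B given the conditioning set. The equicorrelated covariance and the function names are hypothetical.

```python
import numpy as np

def conditional_variance(cov, target, given):
    """Variance of y[target] given y[given] under a zero-mean Gaussian."""
    if len(given) == 0:
        return cov[target, target]
    s_tg = cov[target, given]
    s_gg = cov[np.ix_(given, given)]
    return cov[target, target] - s_tg @ np.linalg.solve(s_gg, s_tg)

def kl_from_full(cov, target, full_set, reduced_set):
    """KL(p || p_reduced), where p_reduced conditions y[target] on reduced_set only."""
    v_full = conditional_variance(cov, target, full_set)
    v_reduced = conditional_variance(cov, target, reduced_set)
    return 0.5 * np.log(v_reduced / v_full)

# Hypothetical example: four equicorrelated outcomes (correlation 0.5)
cov = 0.5 * np.eye(4) + 0.5 * np.ones((4, 4))
A, B = [0, 1, 2], 3
kl_1 = kl_from_full(cov, B, A, [0, 1])   # conditioning set shrunk to {0, 1}
kl_2 = kl_from_full(cov, B, A, [0])      # conditioning set shrunk further to {0}
print(kl_1, kl_2)
```

In this example the model that discards more of the conditioning set sits farther from the full joint model, in line with the "information never hurts" principle cited above.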
2.3 |. Multivariate DAGAR model
Modeling multiple diseases will introduce associations among the diseases and spatial dependence for each disease. Let $y_{ij}$ be a disease outcome of interest for disease i in region j. For sake of clarity, we assume that $y_{ij}$ is a continuous variable (eg, incidence rates) related to a set of explanatory variables through the regression model,
| $y_{ij} = x_{ij}^{\top}\beta_i + w_{ij} + e_{ij}, \quad i = 1, \ldots, q, \; j = 1, \ldots, k,$ | (4) |
where $x_{ij}$ is a vector of explanatory variables specific to disease i within region j, $\beta_i$ are the slopes corresponding to disease i, $w_{ij}$ is a random effect for disease i in region j, and $e_{ij}$ is the random noise arising from uncontrolled imperfections in the data.
Part of the residual from the explanatory variables is captured by the spatial random effect $w_{ij}$. Let $w_i = (w_{i1}, \ldots, w_{ik})^{\top}$ for $i = 1, \ldots, q$. We adopt a hierarchical approach,10 where we specify the joint distribution of $w = (w_1^{\top}, \ldots, w_q^{\top})^{\top}$ as $p(w) = p(w_1) \prod_{i=2}^{q} p(w_i \mid w_{<i})$. We model $p(w_1)$ and each of the conditional densities $p(w_i \mid w_{<i})$ with $w_{<i} = (w_1^{\top}, \ldots, w_{i-1}^{\top})^{\top}$ for $i \geq 2$ as univariate spatial models. The merits of this approach include simplicity and computational efficiency while ensuring that richness in structure is accommodated through the $p(w_i \mid w_{<i})$.
We point out two important distinctions from the GMCAR model10: (i) instead of using CAR for the spatial dependence, we use DAGAR; and (ii) we apply a computationally efficient bridge sampling algorithm32 to compute the marginal posterior probabilities for each ordered model. The first distinction allows better interpretation of spatial autocorrelation than the CAR models. The second distinction is of immense practical value and makes this approach feasible for a much larger number of outcomes. Without this distinction, analysts would be dealing with q! models for q diseases and choosing among them based upon a model-selection metric. That would be overly burdensome for more than 2 or 3 diseases.
2.4 |. A conditional multivariate DAGAR model
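The multivariate DAGAR (or MDAGAR) model is constructed as
| $w_1 = \epsilon_1; \quad w_i = A_{i1} w_1 + A_{i2} w_2 + \cdots + A_{i,i-1} w_{i-1} + \epsilon_i, \quad i = 2, \ldots, q,$ | (5) |
where the $\epsilon_i$ are independent zero-mean Gaussian vectors with precision matrices $\tau_i Q(\rho_i)$, and $Q(\rho_i)$ are univariate DAGAR precision matrices with B and F as in (2). In (5), we model $w_1$ as a univariate DAGAR and, progressively, the conditional density of each $w_i$ given $w_1, \ldots, w_{i-1}$ is also modeled as a DAGAR for $i = 2, \ldots, q$.
Each disease has its own distribution with its own spatial autocorrelation parameter. There are q spatial autocorrelation parameters, $\rho_1, \ldots, \rho_q$, corresponding to the q diseases. Given the differences in the geographic variation of different diseases, this flexibility is desirable. Each matrix $A_{ii'}$ in (5) with $i' < i$ models the association between diseases i and $i'$. We specify $A_{ii'} = \eta_{0ii'} I_k + \eta_{1ii'} M$, where M is the binary adjacency matrix for the map, that is, $m_{jj'} = 1$ if $j' \sim j$ and 0 otherwise. Coefficients $\eta_{0ii'}$ and $\eta_{1ii'}$ associate $w_{ij}$ with $w_{i'j}$ and $w_{i'j'}$. In other words, $\eta_{0ii'}$ is the diagonal element in $A_{ii'}$, while $\eta_{1ii'}$ is the element in the jth row and j′th column if $j' \sim j$. Therefore, for the joint distribution of w, if A is the strictly block-lower triangular matrix with (i, i′)th block being $A_{ii'}$ whenever $i' < i$, and $\epsilon = (\epsilon_1^{\top}, \ldots, \epsilon_q^{\top})^{\top}$, then (5) renders $w = Aw + \epsilon$.
Since $I - A$ is still lower triangular with 1s on the diagonal, it is nonsingular with $\det(I - A) = 1$. Writing $w = (I - A)^{-1}\epsilon$, where $\epsilon$ is zero-mean Gaussian with precision $\Lambda$ and the block diagonal matrix $\Lambda$ has $\tau_i Q(\rho_i)$, $i = 1, \ldots, q$, on the diagonal, we obtain that $w$ is zero-mean Gaussian with precision matrix $Q_w$, where
| $Q_w = (I - A)^{\top} \Lambda (I - A).$ | (6) |
We say that w follows MDAGAR if $w$ is zero-mean Gaussian with precision matrix $Q_w$ as in (6).
Interpretation of $\rho_1, \ldots, \rho_q$ is clear: $\rho_1$ measures the spatial association for the first disease, while $\rho_i$, $i \geq 2$, is the residual spatial correlation in the disease i after accounting for the first $i - 1$ diseases. Similarly, $\tau_1$ is the spatial precision for the first disease, while $\tau_i$, $i \geq 2$, is the residual spatial precision for disease i after accounting for the first $i - 1$ diseases.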
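The joint precision in (6) can be assembled block by block. The sketch below is illustrative only and rests on assumptions that we state explicitly: each cross-disease matrix has the form $\eta_0 I + \eta_1 M$ as described above, the univariate DAGAR weights follow Reference 8, and the function names, the four-region path map, and all parameter values are hypothetical.

```python
import numpy as np

def dagar_precision(adj, rho):
    """Univariate DAGAR precision (I - B)^T F (I - B); regions ordered as rows of adj."""
    k = adj.shape[0]
    B, F = np.zeros((k, k)), np.zeros((k, k))
    for j in range(k):
        prev = np.flatnonzero(adj[j, :j])          # preceding neighbors N(j)
        denom = 1.0 + (prev.size - 1) * rho**2
        B[j, prev] = rho / denom
        F[j, j] = denom / (1.0 - rho**2)
    eye = np.eye(k)
    return (eye - B).T @ F @ (eye - B)

def mdagar_precision(adj, rho, tau, eta0, eta1):
    """Sketch of Q_w = (I - A)^T Lambda (I - A) for q diseases.

    rho, tau : length-q sequences (one autocorrelation and one scale per disease).
    eta0, eta1 : q x q arrays; only entries [i, i'] with i' < i are used,
                 giving the block A_{ii'} = eta0[i, i'] * I + eta1[i, i'] * M.
    """
    adj = np.asarray(adj, dtype=float)
    k, q = adj.shape[0], len(rho)
    A = np.zeros((q * k, q * k))
    Lam = np.zeros((q * k, q * k))
    for i in range(q):
        rows = slice(i * k, (i + 1) * k)
        Lam[rows, rows] = tau[i] * dagar_precision(adj, rho[i])
        for ip in range(i):                        # strictly block-lower triangular
            cols = slice(ip * k, (ip + 1) * k)
            A[rows, cols] = eta0[i][ip] * np.eye(k) + eta1[i][ip] * adj
    IA = np.eye(q * k) - A
    return IA.T @ Lam @ IA

# Hypothetical two-disease example on a four-region path map
k = 4
adj = np.diag(np.ones(k - 1), 1) + np.diag(np.ones(k - 1), -1)
rho, tau = [0.3, 0.7], [1.0, 2.0]
eta0 = np.array([[0.0, 0.0], [0.5, 0.0]])
eta1 = np.array([[0.0, 0.0], [0.2, 0.0]])
Qw = mdagar_precision(adj, rho, tau, eta0, eta1)
print(Qw.shape)
```

Because $I - A$ is unit lower triangular, the block of $Q_w$ for the last disease equals $\tau_q Q(\rho_q)$, which matches the reading of $\rho_i$ and $\tau_i$ as residual spatial parameters.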
2.4.1 |. Model implementation
We extend (4) to the following Bayesian hierarchical framework with the posterior distribution
| (7) |
where and with and for and . For variance parameters and is the inverse-gamma distribution with shape and rate parameters a and b, respectively. For each element in we choose a normal prior , while the prior can also be written as
| (8) |
where , and .
We sample the parameters from the posterior distribution in (7) using Markov chain Monte Carlo (MCMC) with Gibbs sampling and random walk Metropolis36 as implemented in the rjags package within the R statistical computing environment. Web Appendix B S.2.1 presents details on the MCMC updating scheme.
2.5 |. Model selection via bridge sampling
It is clear from (5) that each ordering of diseases in MDAGAR will produce a different model. For the bivariate situation, it is convenient to compare only two models (orders) by the significance of parameter estimates as well as model performance. However, when there are more than two diseases involved in the model, at least six models (for three diseases) must be fitted, and comparing all models becomes cumbersome or even impracticable.
Instead, we pursue model averaging of MDAGAR models. Given a set of candidate models, say , Bayesian model selection and model averaging calculates
| (9) |
for .37 Computing the marginal likelihood in (9) is challenging. Methods such as importance sampling38 and generalized harmonic mean39 have been proposed as stable estimators with finite variance, but finding the required importance density with strong constraints on the tail behavior relative to the posterior distribution is often challenging. Bridge sampling estimates the marginal likelihood (ie, the normalizing constant) by combining samples from two distributions: a bridge function and a proposal distribution .40 Let be the set of parameters in model with prior as defined in the first row of (7). Based on the identity,
a current version of the bridge sampling estimator is
| (10) |
where , are posterior samples and , are samples drawn from the proposal distribution.32 The likelihood is obtained by integrating out w from (7) as
| (11) |
given that with , diag(𝝈) is a diagonal matrix with , on the diagonal, and X is the design matrix with, as block diagonal where . The bridge function is specified by the optimal choice31,
| (12) |
where C is a constant. Inserting (12) in (10) yields the estimate of after convergence of an iterative scheme31 as
| (13) |
where , and .
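The iterative scheme can be illustrated on a toy problem where the marginal likelihood is available in closed form. The sketch below is not the bridgesampling package and not the MDAGAR likelihood; it assumes a conjugate model, $y_i \sim N(\theta, 1)$ with $\theta \sim N(0, 1)$, a normal proposal fitted to half of the posterior draws, and the standard optimal-bridge fixed-point update with weights $s_1 = N_1/(N_1 + N_2)$ and $s_2 = N_2/(N_1 + N_2)$. All function and variable names are hypothetical.

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def log_unnormalized_posterior(theta, y):
    """log p(y | theta) + log p(theta) for y_i ~ N(theta, 1), theta ~ N(0, 1)."""
    theta = np.atleast_1d(theta)
    loglik = log_normal_pdf(y[None, :], theta[:, None], 1.0).sum(axis=1)
    return loglik + log_normal_pdf(theta, 0.0, 1.0)

def bridge_log_marginal(post_draws, y, rng, n_iter=200, tol=1e-12):
    """Iterative bridge sampling estimate of log p(y) with a fitted normal proposal."""
    half = len(post_draws) // 2
    fit, use = post_draws[:half], post_draws[half:]   # fit proposal on first half
    m, v = fit.mean(), fit.var(ddof=1)
    prop = rng.normal(m, np.sqrt(v), size=len(use))   # draws from the proposal
    log_l1 = log_unnormalized_posterior(use, y) - log_normal_pdf(use, m, v)
    log_l2 = log_unnormalized_posterior(prop, y) - log_normal_pdf(prop, m, v)
    n1, n2 = len(use), len(prop)
    s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
    c = np.median(log_l1)                             # shift for numerical stability
    l1, l2 = np.exp(log_l1 - c), np.exp(log_l2 - c)
    r = 1.0                                           # initial guess on shifted scale
    for _ in range(n_iter):
        r_new = np.mean(l2 / (s1 * l2 + s2 * r)) / np.mean(1.0 / (s1 * l1 + s2 * r))
        converged = abs(r_new - r) < tol * r
        r = r_new
        if converged:
            break
    return np.log(r) + c

rng = np.random.default_rng(0)
n = 20
y = rng.normal(0.5, 1.0, size=n)
post_mean, post_var = y.sum() / (n + 1), 1.0 / (n + 1)   # exact conjugate posterior
draws = rng.normal(post_mean, np.sqrt(post_var), size=4000)
estimate = bridge_log_marginal(draws, y, rng)

# Closed-form log marginal likelihood: y ~ N(0, I + 11^T)
cov = np.eye(n) + np.ones((n, n))
_, logdet = np.linalg.slogdet(cov)
exact = -0.5 * (n * np.log(2.0 * np.pi) + logdet + y @ np.linalg.solve(cov, y))
print(estimate, exact)
```

In practice the posterior draws would come from the MCMC output and the unnormalized posterior would use the marginalized likelihood in (11); the toy model is used here only because its exact answer allows the estimator to be checked.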
Given the log marginal likelihood estimates from bridgesampling, the posterior model probability for each model is calculated from (9) by setting prior probability of each model . For Bayesian model averaging (BMA), the model averaged posterior distribution of a quantity of interest Δ is obtained as ,37 and the posterior mean is
| (14) |
Setting fetches us the model averaged posterior estimates for spatial random effects as well as calculating the posterior mean incidence rates as discussed in Section 4.
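Once log marginal likelihoods are available, posterior model probabilities and model averaged posterior means follow from simple arithmetic. The sketch below assumes equal prior model probabilities unless others are supplied, and it works on the log scale to avoid underflow; the function names and numbers are illustrative and are not results from the paper.

```python
import numpy as np

def posterior_model_probs(log_ml, prior=None):
    """Posterior model probabilities from log marginal likelihoods (log-sum-exp trick)."""
    log_ml = np.asarray(log_ml, dtype=float)
    if prior is None:
        prior = np.full(log_ml.size, 1.0 / log_ml.size)   # assumed uniform prior
    log_w = log_ml + np.log(prior)
    log_w -= log_w.max()                                  # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

def bma_mean(model_means, probs):
    """Model averaged posterior mean: sum over models of E[Delta | model, y] * P(model | y)."""
    return np.tensordot(probs, np.asarray(model_means, dtype=float), axes=1)

# Hypothetical log marginal likelihoods for two orderings
log_ml = np.array([-1000.0, -1000.0 + np.log(3.0)])
probs = posterior_model_probs(log_ml)
# Hypothetical per-model posterior means of a two-dimensional quantity
averaged = bma_mean([[1.0, 2.0], [3.0, 4.0]], probs)
print(probs, averaged)
```

With these inputs the second model is three times as likely as the first a posteriori, so the averaged estimate sits three quarters of the way toward the second model's posterior mean.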
3 |. SIMULATION
We simulate three different experiments. The first is designed to evaluate MDAGAR’s inferential performance against GMCAR. The second compares MDAGAR, GMCAR and order-free MCAR for data generated from the latter. The third experiment illustrates the effectiveness of bridge sampling (Section 2.5) in preferring models with a correct “ordering” of the diseases.
3.1 |. Data generation
We compare MDAGAR’s inferential performance with GMCAR10 (Section 3.2) and order-free MCAR16 (Section 3.3). We choose the 48 states of the contiguous United States as our underlying map, where two states are treated as neighbors if they share a common geographic boundary. We generated our outcomes using the model in (4) with , that is, two outcomes, and two covariates, and , with and . We fixed the values of the covariates after generating them from , independent across regions. The regression slopes were set to and .
Turning to the spatial random effects, we generated values of from a distribution, where the precision matrix is
| (15) |
We set , and in (15) and take , where is the spatial decay for disease i and refers to the distance between the embedding of the jth and j′th vertex. The vertices are embedded on the Euclidean plane and the centroid of each state is used to create the distance matrix. Using this exponential covariance matrix to generate the data offers a “neutral” ground to compare the performance of MDAGAR with GMCAR. We specified A12 using fixed values of . Here, we considered three sets of values for to correspond to low, medium and high correlation among diseases. We fixed to ensure an average correlation of 0.15 (range 0.072–0.31); with an average correlation of 0.55 (range 0.45–0.74); and with a mean correlation of 0.89 (range 0.84–0.94). We generated for each of the above specifications for and, with the values of generated as above, we generated the outcome , where . We repeat the above procedure to replicate 85 datasets for each of the three specifications of .
For our third experiment (Sections 3.4 and 3.5), we generate a dataset with cancers. We extend the above setup to include one more disease. We generate from (4) with the value of fixed after being generated from , and . Let denote the model . For three diseases the six resulting models are denoted as , and .
Each of the six models implies a corresponding joint distribution which is used to generate the . Let the parenthesized suffix (i) denote the disease in the ith order. For example, in , we write w in the form of (5) as
where with as in the first experiment, and is the coefficient matrix associating random effects for diseases in the ith and i′th order. We set and to completely specify for each of the 6 models. For each , we generate 50 datasets by first generating and then generating from (4) using the above specifications. Details on the algorithms and the computing environments for each model are provided in Section S.2.1.
3.2 |. Comparisons between MDAGAR and GMCAR
In our first experiment, we analyzed the 85 replicated datasets using (7) with
| (16) |
where and Unif is the uniform density. Prior specifications are completed by setting , in (7). The same set of priors was used for both MDAGAR and GMCAR as they have the same number of parameters with similar interpretations. Both models are fast to compute; MDAGAR reported an average running time of 3.87 minutes for each dataset in the bivariate disease analysis, while that for GMCAR was 6.25 minutes.
We compare models using the widely applicable information criterion (WAIC)41,42 and a model comparison score D based on a balanced loss function for replicated data.43 Both WAIC and D reward goodness of fit and penalize model complexity. Details on how these metrics are computed are provided in Web Appendix B S.2.2. In addition, we also computed the average mean squared error (AMSE) of the spatial random effects estimated from each of the 85 datasets. For MDAGAR, we found the mean (standard deviation) of the AMSEs to be 1.69 (0.034) from the 85 low-correlation datasets, 1.47 (0.030) from the 85 medium-correlation datasets, and 2.35 (0.059) from the 85 high-correlation datasets. The corresponding numbers for GMCAR were 1.83 (0.033), 1.59 (0.031), and 2.14 (0.050), respectively. The MDAGAR tends to have smaller AMSE for low and medium correlations, while GMCAR’s AMSE tends to be markedly lower than MDAGAR’s when the correlations are high. We also compute the mean values of WAICs and D scores for each simulated dataset. Figure 1 plots the values of WAICs (A-C) and D scores (D-F) for the 85 datasets corresponding to each of the three correlation settings. Here, MDAGAR outperforms GMCAR in all three correlation settings with respect to both WAICs and D scores. While MDAGAR outperforms GMCAR in overall model fitting scores for most correlation settings, GMCAR can yield better estimates of spatial effects in high correlation settings.
FIGURE 1.
Density plots for WAICs and D scores over 85 datasets. (A-C) Density plots of WAIC for MDAGAR (blue) and GMCAR (red) models with low, medium, and high correlation, respectively, (D-F) the corresponding density plots for D scores. The dotted vertical line shows the mean for WAIC and D in each plot
Figure 2 presents scatter plots for the true values (x-axis) of spatial random effects against their posterior estimates (y-axis). To be precise, each panel plots true values of the elements of the vector w for 85 datasets against their corresponding posterior estimates. We see strong agreements between the true values and their estimates for both MDAGAR and GMCAR. The agreement is more pronounced for the datasets corresponding to medium and high correlations. For the low-correlation datasets, MDAGAR still exhibits strong agreement which is better than GMCAR.
FIGURE 2.
Scatter plots for estimates of spatial random effects (y-axis) against the true values (x-axis) with 45° lines over 85 datasets: (A-C) Estimates from MDAGAR model with low, medium, and high correlation, (D-F) the corresponding estimates from GMCAR. Pearson’s correlation coefficient for each plot is indicated as “r”
We compute , which is the Kullback-Leibler divergence between the model for w with the true generative precision matrix () and those with MDAGAR and GMCAR precisions . Using the posterior samples in the precision matrix, we evaluate the posterior probability that is smaller than . Figure 3 depicts a density plot of these probabilities over the 85 datasets. For low and medium correlations, the MDAGAR has a mean probability of around 69% of being closer to the true model than the GMCAR, while for high correlations GMCAR excels with an average probability of 72% of being closer to the true model. These findings are consistent with the AMSEs, where GMCAR tended to perform better when correlations were high. Additional comparative diagnostics from MDAGAR and GMCAR, such as coverage probabilities for parameters and correlations between random effects for two diseases in the same state, are presented in Web Appendix B S.2.2.2.
FIGURE 3.
Density plots for probability that the KL-divergence between the MDAGAR and the true model is smaller than that between GMCAR and the true model with three levels of correlation for two diseases: Low (purple), medium (green), and high (red)
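A comparison of this kind can be sketched with the closed-form Kullback-Leibler divergence between two zero-mean Gaussians expressed through their precision matrices. The direction of the divergence (true model first) and the pairing of posterior draws across the two fitted models are assumptions made for illustration, and the function names and toy matrices are hypothetical.

```python
import numpy as np

def kl_zero_mean_gaussians(q_true, q_model):
    """KL( N(0, q_true^{-1}) || N(0, q_model^{-1}) ) computed from precision matrices."""
    q_true = np.atleast_2d(q_true)
    q_model = np.atleast_2d(q_model)
    n = q_true.shape[0]
    trace_term = np.trace(np.linalg.solve(q_true, q_model))   # tr(q_true^{-1} q_model)
    _, logdet_true = np.linalg.slogdet(q_true)
    _, logdet_model = np.linalg.slogdet(q_model)
    return 0.5 * (trace_term - n - logdet_model + logdet_true)

def prob_first_closer(q_true, draws_a, draws_b):
    """Share of paired posterior draws in which model a is closer to the truth than model b."""
    kl_a = np.array([kl_zero_mean_gaussians(q_true, q) for q in draws_a])
    kl_b = np.array([kl_zero_mean_gaussians(q_true, q) for q in draws_b])
    return float(np.mean(kl_a < kl_b))

# Toy check with hand-picked "posterior draws" of 2 x 2 precision matrices
q_true = np.eye(2)
draws_a = [1.1 * np.eye(2), 0.9 * np.eye(2)]
draws_b = [2.0 * np.eye(2), 3.0 * np.eye(2)]
p_closer = prob_first_closer(q_true, draws_a, draws_b)
kl_scalar = kl_zero_mean_gaussians(1.0, 2.0)
print(p_closer, kl_scalar)
```

In an actual analysis the draws would be the precision matrices evaluated at each MCMC sample of the spatial parameters under the two models.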
3.3 |. Comparisons between MDAGAR and order-free MCAR
We also generated data using an order-free MCAR model16 to evaluate MDAGAR and GMCAR when the underlying structure is different from the proposed conditional scheme. For the MCAR model, we specified the joint covariance matrix of w as
| (18) |
where is a matrix corresponding to disease dependence, A is the upper triangular Cholesky decomposition of and is a block diagonal matrix with ( precision matrix for a proper CAR) for each i = 1, …, q. This corresponds to the MCAR generated from , where and for , is the diagonal matrix with number of neighbors along the diagonal and W is a binary adjacency matrix. Therefore, w is generated from independent but not identically distributed latent proper CAR distributions (see Reference 16, Section 3.2).
Keeping other model specifications same as in Section 3.1 (so and 𝜌i’s are as in Section 3.1), we fixed . Computing (17) with these specifications yields a mean correlation of 0.52 among the entries of the matrix (range: 0.48–0.54). The above procedure is replicated for 50 datasets for each model. We estimated the MDAGAR and GMCAR models in two opposite orders, denoted MDAGAR1, MDAGAR2, GMCAR1, and GMCAR2, and compared with the order-free MCAR. We estimate (7) with the respective specifications for for each model. For the MDAGAR and GMCAR models, we used the priors specified in the previous section using (16). For the MCAR, we assigned , and a21 with normal priors with variances 0.0625 and 100, respectively. The order-free MCAR is also fast to compute and reported an average running time of 5.89 minutes for each dataset in this experiment.
Figure 4 plots values of (A) WAICs, (B) D scores, and (C) the posterior mean of the Kullback-Leibler divergence between a given model and the true density, , for each of the 50 datasets (indexed in the x-axis) computed for each of the five models. For model fitting, GMCAR1 exhibits better performance with smaller values for WAIC and D, while GMCAR2, MDAGAR1, and MDAGAR2 are all comparable with MCAR. GMCAR1 and MDAGAR1 exhibit slightly better performance in D scores compared with GMCAR2 and MDAGAR2, respectively, but the two orders produce similar WAICs. In terms of the posterior means of , MCAR is expectedly closer to the true model (having the same data generating structure), but MDAGAR still delivers very competitive performance in spite of being a misspecified model. The variability in the posterior means of for the different models reveals substantial overlap, so conditional models have the ability to compete with order-free MCAR even when data are generated from the latter.
FIGURE 4.
Density plots for (A) WAICs, (B) D scores, and (C) the posterior mean of over 50 datasets, respectively, using MDAGAR1, MDAGAR2, GMCAR1, GMCAR2, and MCAR. The dotted vertical line shows the mean for each plot
3.4 |. Analyses using different orderings for spatial units
The MDAGAR model in Section 3.2 is analyzed using an ordering of spatial units (states) from the southwest to the northeast. Here, we repeat the analysis for the MDAGAR model using three other orderings that start in the southeast, northwest, and northeast, respectively. We present results from these differently ordered DAGAR models using the 85 low-correlation simulated datasets. For the random effects, the mean (standard deviation) of the AMSEs for three different orderings (southeast, northwest, and northeast) are 1.61 (0.029), 1.28 (0.026), and 1.43 (0.027), respectively, without significantly differing from the original ordering in Section 3.2.
Figure 5 plots the densities of mean WAICs, D scores, and over the 85 datasets for the MDAGAR model using three different orderings and the original ordering in Section 3.2. In computing , we specify , which is the density of the true y and is the density for y from MDAGAR. While the ordering of the diseases does not appear to have a significant impact on model fitting as the density plots for the four orderings almost overlap with each other, (3) suggests that some order dependence may be expected.
FIGURE 5.
Density plots for WAICs, D scores, and over 85 datasets for the MDAGAR model using four different orderings: Northeast (red), northwest (green), southeast (blue), and southwest (purple). The dotted vertical line shows the mean for each plot
3.5 |. Model selection for different disease orders
We now evaluate the effectiveness of the method in Section 2.5 at selecting the model with the correct ordering of diseases. We used the bridgesampling package in R to compute for each of datasets generated as described in Section 3.1. Table 1 presents the probability of each model being selected for different true model scenarios. The probability of selecting the true model is shown in bold along the diagonal. Our experiment reveals that bridge sampling is extremely effective at choosing the correct order. It was able to identify the correct order between 78% and 90% of the time, which is substantially larger than the probability of choosing any of the misspecified models.
TABLE 1.
Proportion of times bridge sampling chose the model with the correct order out of the 50 datasets with that order
| True model (rows) / selected model (columns) | M 1 | M 2 | M 3 | M 4 | M 5 | M 6 |
|---|---|---|---|---|---|---|
| M 1 | **0.90** | 0.00 | 0.10 | 0.00 | 0.00 | 0.00 |
| M 2 | 0.00 | **0.86** | 0.00 | 0.00 | 0.14 | 0.00 |
| M 3 | 0.14 | 0.00 | **0.86** | 0.00 | 0.00 | 0.00 |
| M 4 | 0.00 | 0.00 | 0.00 | **0.90** | 0.00 | 0.10 |
| M 5 | 0.00 | 0.22 | 0.00 | 0.00 | **0.78** | 0.00 |
| M 6 | 0.00 | 0.00 | 0.00 | 0.16 | 0.00 | **0.84** |
4 |. MULTIPLE CANCER ANALYSIS FROM SEER
We now turn to analyzing an areal dataset using the MDAGAR model for four different cancers: lung, esophagus, larynx, and colorectal. Adenocarcinoma of the lung and esophageal cancer have been found to share common risk factors44 and metabolic mechanisms.45 Lung cancer appears to be among the most common second primary cancers in patients with colon cancer.46 Meanwhile, patients with laryngeal cancer have also been reported to possess high risks of developing second primary lung cancer.47 The dataset is extracted from the SEER∗Stat database using the SEER∗Stat statistical software.48 The dataset consists of the four cancers: lung, esophagus, larynx, and colorectal, where the outcome is the 5-year average age-adjusted incidence rates (age-adjusted to the 2000 U.S. Standard Population) per 100 000 population in the years from 2012 to 2016 across 58 counties in California, USA, as mapped in Figure 6. The maps exhibit preliminary evidence of correlation across space and among cancers. Cutoffs for the different levels of incidence rates are quantiles for each cancer. For all four cancers, incidence rates are relatively higher in counties concentrated in the middle northern areas including Shasta, Tehama, Glenn, Butte, and Yuba than in other areas. In general, northern areas have higher incidence rates than southern areas. This is especially pronounced for lung cancer and esophagus cancer. For larynx cancer, while the highest incidence rates are in the northwest (Del Norte and Siskiyou counties), the incidence rates in the south are also at somewhat higher levels. For colorectal cancer, the edge areas at the bottom also exhibit high incidence rates.
FIGURE 6.
Maps of 5-year average age-adjusted incidence rates per 100 000 population for lung, esophagus, larynx, and colorectal cancer in California, 2012 to 2016
As an exploratory tool to assess associations among the cancers, we calculate Pearson’s correlation for each pair of cancers by regarding incidence rates in different counties as independent samples. We find Pearson’s correlation coefficient between the incidence of lung cancer and those of esophageal, larynx, and colorectal cancers to be 0.55, 0.46, and 0.46, respectively. Meanwhile, the correlation between esophageal and larynx cancer is 0.27. Next, to explore the spatial association for each disease, we calculate Moran’s I based upon rth order neighbors for each cancer and plot the areal correlogram.49 Defining distance intervals $(0, d_1], (d_1, d_2], (d_2, d_3], \ldots$, the rth order neighbors refer to units with distance in $(d_{r-1}, d_r]$, that is, within distance $d_r$ but separated by more than $d_{r-1}$. The distance is the Euclidean distance from an Albers map projection of California. As shown in Figure 7, lung, esophageal, and colorectal cancers all present spatial patterns that initially diminish with increasing r and eventually flatten close to 0. Overall, counties with similar levels of incidence rates tend to exhibit some spatial clustering.
FIGURE 7.
Moran’s I of rth order neighbors for lung, esophageal, larynx, and colorectum cancer
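The areal correlogram described above can be sketched in a few lines. The following is a minimal illustration, assuming binary weights (1 for a pair of units whose distance falls in the rth band, 0 otherwise) and projected planar coordinates for each county centroid. The function names `morans_i` and `areal_correlogram`, the cutoffs, and the toy data are hypothetical and are not taken from the authors' code.

```python
import math

def morans_i(values, weights):
    """Moran's I for one variable given a symmetric weight matrix (list of lists)."""
    n = len(values)
    mean = sum(values) / n
    z = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    if w_sum == 0:
        return float("nan")  # no pair of units falls in this distance band
    cross = sum(weights[i][j] * z[i] * z[j] for i in range(n) for j in range(n))
    return (n / w_sum) * cross / sum(zi * zi for zi in z)

def areal_correlogram(coords, values, cutoffs):
    """Moran's I for r-th order neighbors, r = 1, ..., len(cutoffs).

    Units i and j are r-th order neighbors when their Euclidean distance
    lies in (d_{r-1}, d_r], with d_0 = 0 and d_r = cutoffs[r - 1].
    """
    n = len(coords)
    dist = [[math.dist(coords[i], coords[j]) for j in range(n)] for i in range(n)]
    result = []
    lower = 0.0
    for upper in cutoffs:
        band = [[1.0 if lower < dist[i][j] <= upper else 0.0 for j in range(n)]
                for i in range(n)]
        result.append(morans_i(values, band))
        lower = upper
    return result

# Toy example (made-up data): four units on a line with a linear trend.
toy_coords = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
toy_values = [1.0, 2.0, 3.0, 4.0]
toy_i = areal_correlogram(toy_coords, toy_values, cutoffs=[1.0, 2.0, 3.0])
```

In the toy example, Moran's I is positive for first order neighbors and negative for higher orders, which is the decaying pattern one would look for in a plot such as Figure 7.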
We now turn to model-based inference using (7). We return to the MDAGAR, GMCAR, and MCAR models, where neighbors are defined using shared borders. We analyze this dataset and separate the spatial correlation for each cancer from the association among cancers with the following prior specifications,
| (19) |
We also discuss a case excluding the risk factor (see Web Appendix B Section S.2.2.3).
For covariates, we include county attributes that possibly affect the incidence rates: the percentages of residents younger than 18 years old ($\mathrm{young}_{ij}$), older than 65 years old ($\mathrm{old}_{ij}$), and with education level below high school ($\mathrm{edu}_{ij}$), the percentages of unemployed residents ($\mathrm{unemp}_{ij}$), Black residents ($\mathrm{black}_{ij}$), male residents ($\mathrm{male}_{ij}$), and uninsured residents ($\mathrm{uninsure}_{ij}$), and the percentage of families below the poverty threshold ($\mathrm{poverty}_{ij}$). All covariates are common to the different cancers and are extracted from the SEER∗Stat database48 for the same period, 2012 to 2016.
Since cigarette smoking is a common risk factor for cancers, adult smoking rates ($\mathrm{smoke}_{ij}$) for 2014 to 2016 were obtained from the California Tobacco Facts and Figures 2018 database.50 Spatial patterns in the map of adult cigarette smoking rates, shown in Figure 8, are similar to those of cancer incidence, especially for lung and esophageal cancers: the highest smoking rates are concentrated in the north. Some central California counties (eg, Stanislaus, Tuolumne, Merced, Mariposa, Fresno, and Tulare) also exhibit high rates, although there is clearly less spatial clustering of the high rates there than in the north.
FIGURE 8.
Important county-level covariates with significant effects: Adult cigarette smoking rates (left), percentage of black residents (middle), and uninsured residents (right)
Since the order of cancers in the DAG specifies the model, we fit all 24 models using (7) and compute the marginal likelihoods using bridge sampling (Section 2.5). By setting the prior model probabilities to $1/24$ for each of the 24 models, we compute the posterior model probabilities using (9). These are presented in Table 2. We obtain BMA estimates using (14) with the weights in Table 2. Among all models, the one with the largest posterior probability, 0.577, is selected as the best model, and the corresponding conditional structure is [esophageal] × [larynx | esophageal] × [colorectal | esophageal, larynx] × [lung | esophageal, larynx, colorectal].
TABLE 2.
The posterior model probabilities for 24 models
| 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 0.000 | 0.577 | 0.000 | 0.000 | 0.000 | 0.000 | 0.342 | 0.079 |
| 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.002 |
Note: Bold values signify that the 95% credible intervals exclude 0.
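The step from marginal likelihoods to posterior model probabilities is Bayes' rule over the candidate orderings, which is how we read the role of (9). The sketch below is a minimal illustration rather than the authors' implementation: it assumes the log marginal likelihoods have already been estimated (for example by bridge sampling), defaults to equal prior probabilities, and works on the log scale for numerical stability. The helper names and the toy log marginal likelihoods are hypothetical.

```python
import math
from itertools import permutations

def posterior_model_probs(log_marginals, priors=None):
    """Posterior model probabilities from log marginal likelihoods.

    Uses the log-sum-exp shift so that very negative log marginal
    likelihoods do not underflow. Priors default to uniform (an assumption).
    """
    m = len(log_marginals)
    if priors is None:
        priors = [1.0 / m] * m
    log_post = [lm + math.log(p) for lm, p in zip(log_marginals, priors)]
    shift = max(log_post)
    unnorm = [math.exp(lp - shift) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

def conditional_structure(order):
    """Write an ordering of diseases as a product of conditional terms."""
    terms = []
    for k, name in enumerate(order):
        parents = ", ".join(order[:k])
        terms.append(f"[{name} | {parents}]" if parents else f"[{name}]")
    return " × ".join(terms)

# Each ordering of the four cancers defines one candidate model.
cancers = ["esophageal", "larynx", "colorectal", "lung"]
orders = list(permutations(cancers))  # 4! = 24 orderings

# Toy check with two models whose marginal likelihoods differ by a factor of 3
# (made-up numbers, not estimates from the SEER analysis).
toy_probs = posterior_model_probs([-5000.0, -5000.0 + math.log(3.0)])
```

With equal priors, the posterior probabilities are simply the normalized marginal likelihoods, so the toy pair of models receives probabilities of about 0.25 and 0.75.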
Table 3 summarizes the parameter estimates, including regression coefficients, spatial autocorrelation, spatial precision, and noise variance for each cancer. From the highest probability model and BMA, we find the regression slopes for the percentage of smokers and uninsured residents are significantly positive and negative, respectively, for esophageal cancer. The negative association between percentage of uninsured and esophageal cancer may seem surprising, but is likely a consequence of counties exhibiting low incidence rates for esophageal cancer having a relatively large number of uninsured residents (see top right in Figure 6 and the rightmost map in Figure 8). Since esophageal cancer has low incidence rates, this association could well be spurious due to spatial confounding. Percentage of smokers is, unsurprisingly, found to be a significant risk factor for lung cancer, while the percentage of Black residents seems to be significantly associated with elevated incidence of larynx cancer. In addition, we tend to see that the percentage of the population below the poverty level has a pronounced association with higher rates of lung and esophageal cancer.
TABLE 3.
Posterior means (95% credible intervals) for parameters estimated from the highest probability model, and BMA estimates for regression coefficients only, for the SEER four-cancer dataset
| Parameters | Model | Esophageal | Larynx | Colorectal | Lung |
|---|---|---|---|---|---|
| Intercept | Best model | 16.76 (4.06, 29.56) | 6.37 (−1.16, 13.89) | 19.16 (−11.94, 49.72) | 28.68 (−18.3, 74.93) |
| | BMA | 15.87 (2.92, 28.63) | 6.85 (−0.71, 14.38) | 18.21 (−14.03, 49.07) | 28.25 (−18.12, 74.52) |
| Smokers (%) | Best model | 0.25 (0.12, 0.37) | 0.04 (−0.03, 0.12) | 0.23 (−0.12, 0.57) | 0.81 (0.08, 1.62) |
| | BMA | 0.23 (0.10, 0.36) | 0.05 (−0.03, 0.12) | 0.22 (−0.13, 0.58) | 0.80 (0.08, 1.59) |
| Young (%) | Best model | −0.12 (−0.31, 0.07) | −0.07 (−0.18, 0.04) | 0.27 (−0.2, 0.76) | −0.08 (−0.90, 0.74) |
| | BMA | −0.11 (−0.3, 0.08) | −0.08 (−0.19, 0.03) | 0.29 (−0.18, 0.78) | −0.01 (−0.86, 0.82) |
| Old (%) | Best model | −0.11 (−0.25, 0.04) | −0.05 (−0.14, 0.03) | 0.10 (−0.28, 0.48) | −0.09 (−0.81, 0.67) |
| | BMA | −0.10 (−0.25, 0.05) | −0.05 (−0.14, 0.03) | 0.10 (−0.29, 0.49) | −0.08 (−0.79, 0.66) |
| Edu (%) | Best model | 0.02 (−0.08, 0.12) | −0.02 (−0.08, 0.04) | 0.16 (−0.12, 0.43) | −0.20 (−0.75, 0.31) |
| | BMA | 0.02 (−0.09, 0.12) | −0.02 (−0.07, 0.04) | 0.15 (−0.14, 0.42) | −0.24 (−0.79, 0.27) |
| Unemp (%) | Best model | −0.13 (−0.29, 0.03) | 0.01 (−0.08, 0.10) | −0.09 (−0.54, 0.37) | 0.60 (−0.47, 1.55) |
| | BMA | −0.12 (−0.28, 0.05) | 0.01 (−0.08, 0.1) | −0.08 (−0.54, 0.38) | 0.61 (−0.43, 1.56) |
| Black (%) | Best model | 0.14 (−0.06, 0.34) | 0.14 (0.03, 0.26) | −0.16 (−0.73, 0.39) | 0.15 (−1.06, 1.29) |
| | BMA | 0.13 (−0.07, 0.33) | 0.15 (0.03, 0.27) | −0.18 (−0.75, 0.39) | 0.14 (−1.02, 1.25) |
| Male (%) | Best model | −0.04 (−0.17, 0.09) | 0.00 (−0.07, 0.08) | 0.24 (−0.12, 0.60) | 0.14 (−0.57, 0.79) |
| | BMA | −0.04 (−0.17, 0.09) | 0 (−0.07, 0.08) | 0.24 (−0.12, 0.62) | 0.14 (−0.55, 0.82) |
| Uninsured (%) | Best model | −0.24 (−0.44, −0.04) | −0.08 (−0.20, 0.04) | 0.07 (−0.44, 0.58) | 0.01 (−0.82, 0.86) |
| | BMA | −0.23 (−0.42, −0.02) | −0.08 (−0.2, 0.04) | 0.09 (−0.42, 0.61) | 0 (−0.81, 0.82) |
| Poverty (%) | Best model | 0.30 (−0.24, 0.84) | 0.20 (−0.12, 0.51) | −0.06 (−1.51, 1.45) | 0.85 (−2.15, 3.85) |
| | BMA | 0.32 (−0.23, 0.87) | 0.2 (−0.12, 0.51) | −0.08 (−1.54, 1.42) | 0.8 (−2.14, 3.75) |
| Spatial autocorrelation | Best model | 0.25 (0.01, 1.00) | 0.33 (0.01, 0.96) | 0.50 (0.03, 0.97) | 0.52 (0.03, 0.99) |
| Spatial precision | Best model | 25.27 (5.08, 61.57) | 27.60 (8.05, 60.42) | 19.97 (3.06, 55.61) | 20.31 (1.77, 55.92) |
| Noise variance | Best model | 1.67 (1.11, 2.47) | 0.49 (0.28, 0.75) | 8.22 (1.09, 14.23) | 1.19 (0.18, 5.21) |
Note: Bold values signify that the 95% credible intervals exclude 0.
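The BMA rows in Table 3 combine inference across orderings. One common way to form such summaries, sketched here under the assumption that the BMA posterior is the mixture of the model-specific posteriors weighted by the posterior model probabilities, is to pool the MCMC draws from each model with weight proportional to that model's probability and then read off the weighted mean and quantiles. The function name `bma_summary` and the draws below are hypothetical; only the four weights are taken from the nonzero entries of Table 2.

```python
def bma_summary(draws_by_model, weights, level=0.95):
    """Weighted-mixture summary of one scalar parameter across models.

    draws_by_model: list of lists of posterior draws, one list per model.
    weights: posterior model probabilities, in the same order.
    Returns (mean, lower, upper, excludes_zero).
    """
    pooled = []
    for draws, w in zip(draws_by_model, weights):
        if w == 0 or not draws:
            continue
        # Each draw carries an equal share of its model's weight.
        pooled.extend((x, w / len(draws)) for x in draws)
    pooled.sort(key=lambda pair: pair[0])
    total = sum(w for _, w in pooled)
    mean = sum(x * w for x, w in pooled) / total

    def quantile(q):
        # Smallest draw whose cumulative weight reaches q of the total.
        acc = 0.0
        for x, w in pooled:
            acc += w
            if acc >= q * total:
                return x
        return pooled[-1][0]

    alpha = (1.0 - level) / 2.0
    lower, upper = quantile(alpha), quantile(1.0 - alpha)
    return mean, lower, upper, (lower > 0 or upper < 0)

# Weights: the four nonzero posterior model probabilities in Table 2.
table2_weights = [0.577, 0.342, 0.079, 0.002]
# Made-up draws of one regression slope under each of the four models.
toy_draws = [[0.20, 0.25, 0.30], [0.15, 0.20, 0.25], [0.30, 0.35], [0.10]]
toy_mean, toy_lower, toy_upper, toy_excludes_zero = bma_summary(toy_draws, table2_weights)
```

The final flag mirrors the table note: an interval is marked as excluding 0 when both endpoints share the same sign.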
Recall from Section 2.4 that the spatial autocorrelation parameter for the first cancer in the ordering (here esophageal) is the residual spatial autocorrelation after accounting for the explanatory variables, while those for the subsequent cancers are residual spatial autocorrelations after accounting for the explanatory variables and the preceding cancers in the selected model. From Table 3, we see that esophageal cancer exhibits relatively weaker spatial autocorrelation, while the residual spatial autocorrelations for larynx and colorectal cancers after accounting for preceding cancers are both at moderate levels of around 0.5. Similarly, for the spatial precision, larynx appears to have the smallest conditional variability, while that for colorectal and lung is slightly larger.
For the posterior mean incidence rates and spatial random effects, we present estimates from the highest probability model and from BMA. Figure 9A,B maps the posterior mean spatial random effects and model-fitted incidence rates for the four cancers obtained from BMA, while Figure 10A,B shows the corresponding maps from the highest probability model. The posterior mean incidence rates from BMA and from the highest probability model are in accord with each other, and both present DAGAR-smoothed versions of the original patterns in Figure 6. For posterior means of spatial random effects, the estimates from the highest probability model are in general similar to the model-averaged estimates, especially for lung and colorectal cancers, exhibiting relatively large positive values in the northern counties, where the incidence rates are high. However, for esophageal and larynx cancers we see slight discrepancies between the highest probability model and BMA in the north. For esophageal cancer, the BMA estimates produce larger positive random effects, ranging between 0.1 and 0.5, in most counties, while the highest probability model produces estimates between 0 and 0.1. For larynx cancer, the highest probability model estimates more counties with random effects larger than 0.1. We believe this is attributable, at least in part, to another competitive model (posterior probability 0.342), which contributes to the BMA. On the other hand, the effects of some important county-level covariates play an essential role in the discrepancy between the estimates of random effects and model-fitted incidence rates for each cancer.
FIGURE 9.
Maps of posterior results using BMA for lung, esophagus, larynx, and colorectal cancer in California including (A) posterior mean spatial random effects and (B) posterior mean incidence rates
FIGURE 10.
Maps of posterior results using the highest probability model for lung, esophagus, larynx, and colorectal cancer in California including (A) posterior mean spatial random effects and (B) posterior mean incidence rates
Recall from Section 2 that each pair of cancers is linked by two coefficients that together reflect the associations among cancers, including those that can be attributed to spatial structure. Specifically, larger values of the nonspatial coefficient indicate inherent associations unrelated to spatial structure, while the magnitude of the spatial coefficient reflects associations due to spatial structure. Figure S.2 presents posterior distributions of these coefficients for all pairs of cancers. We see from the distribution of the nonspatial coefficient that there is a pronounced nonspatial component in the association between lung and colorectal cancers. Similar, albeit somewhat less pronounced, nonspatial associations are seen between larynx and esophageal cancers and between lung and larynx cancers. Analogously, the posterior distributions of the spatial coefficients for two pairs tend to have substantial positive support, suggesting substantial spatial cross-correlations between lung and colorectal cancers and between colorectal and larynx cancers. Interestingly, we find negative support in the posterior distributions of the coefficients for two other pairs. The negative mass implies that the covariance among cancers within a region is suppressed by strong dependence with neighboring regions. This seems to be the case for associations between lung and esophageal cancers and between lung and larynx cancers.
Web Appendix B also presents a supplementary analysis that excludes adult smoking rates from the covariates, which we refer to as “Case 2.” Figure S.3 shows estimated pairwise correlations between cancers in each of the 58 counties. The top row presents the correlations when smoking rates are included (“Case 1”), as analyzed here. The bottom row presents the corresponding maps for “Case 2.” Interestingly, accounting for smoking rates substantially diminishes the associations among esophageal, colorectal, and lung cancers. These are significantly associated in “Case 2,” but only lung and colorectal retain their significance after accounting for smoking rates.
We also implemented the order-free MCAR model (as described in Section 3.3) and present the estimates of posterior mean incidence rates and spatial random effects in Figure 11. Compared with MDAGAR, the MCAR fits colorectal cancer better, since the posterior incidence rates in Figure 11B are closer to those in the raw map (Figure 6), while MDAGAR seems to outperform MCAR for larynx cancer. Overall, the model fit is comparable between MDAGAR and MCAR.
FIGURE 11.
Maps of posterior results (Case 1) using MCAR for lung, esophagus, larynx, and colorectal cancer in California including (A) posterior mean spatial random effects and (B) posterior mean incidence rates
5 |. DISCUSSION
We have developed a multivariate “MDAGAR” model in conjunction with a bridge sampling method to estimate spatial correlations for multiple correlated diseases. The MDAGAR is constructed hierarchically over areal units based on univariate DAGAR models. We demonstrate that MDAGAR tends to outperform GMCAR when association between spatial random effects for different diseases is weak or moderate. MDAGAR retains the interpretability of spatial autocorrelations, as in univariate DAGAR, separating the spatial correlation for each disease from any inherent or endemic association among diseases. While MDAGAR, like all DAG based models, is specified according to a fixed order of the diseases, we show that bridge sampling can effectively choose among the different orders and also provide BMA inference in a computationally efficient manner.
Our data analysis reveals how correlations between incidence rates for different cancers are impacted by risk factors. For example, eliminating adult cigarette smoking rates produces similar spatial patterns for the incidence rates of esophageal, lung, and colorectal cancer. In addition, the significant correlation between lung and esophageal cancer, even after accounting for smoking rates, implies other inherent or endemic associations, such as latent risk factors and metabolic mechanisms. We also see that the MDAGAR-based posterior estimates of the latent spatial effects in Figures 9A and 10A resemble those from MDAGAR without covariates (Figure 12), while the maps for the estimated incidence rates in Figures 9B and 10B account for the spatial variability of the covariates.
FIGURE 12.
Maps of posterior mean spatial random effects (with no covariates) using the same order as the highest probability model
Future research will look at different constructions of graphical models for areal data. Examples include defining rth order neighbors using distance metrics, as in Figure 7, and deriving alternate precision matrices. We also intend to address scalability with a very large number of diseases. Here, common spatial factor models for areal data51 can be adapted to model the factors as DAGAR, thereby yielding classes of DAGAR-based factor models. A very different approach will be to build scalable graphical models using two different graphs: one for areal units (CAR or DAGAR) and another undirected graph representing conditional independence among cancers. Multidimensional MRFs as well as developments analogous to recently introduced graphical Gaussian processes52 can be pursued for high-dimensional disease mapping. Finally, spatial confounding in multivariate disease mapping53–55 will be explored in the context of MDAGAR.
DATA AVAILABILITY STATEMENT
All computer programs implementing the examples in this paper can be found in the public domain and downloaded from https://github.com/LeiwenG/Multivariate_DAGAR.
ACKNOWLEDGEMENTS
The work of the first and third authors has been supported in part by the Division of Mathematical Sciences (DMS) of the National Science Foundation (NSF) under grant 1916349 and by the National Institute of Environmental Health Sciences (NIEHS) under grants R01ES030210 and 5R01ES027027. The work of the second author was supported by the Division of Mathematical Sciences (DMS) of the National Science Foundation (NSF) under grant 1915803.
Funding information
Division of Mathematical Sciences, Grant/Award Numbers: 1915803, 1916349; National Institute of Environmental Health Sciences, Grant/Award Numbers: 5R01ES027027, R01ES030210
Footnotes
CONFLICT OF INTEREST
The authors declare no potential conflict of interest.
SUPPORTING INFORMATION
Additional supporting information may be found online in the Supporting Information section at the end of this article.
REFERENCES
- 1. Koch T. Cartographies of Disease: Maps, Mapping, and Medicine. Redlands, CA: Esri Press; 2005.
- 2. Berke O. Exploratory disease mapping: kriging the spatial risk function from regional count data. Int J Health Geogr. 2004;3(1):18.
- 3. Richardson S, Thomson A, Best N, Elliott P. Interpreting posterior relative risk estimates in disease-mapping studies. Environ Health Perspect. 2004;112(9):1016–1025.
- 4. Rue H, Held L. Gaussian Markov Random Fields: Theory and Applications. Monographs on Statistics and Applied Probability. Boca Raton, FL: Chapman and Hall/CRC Press; 2005.
- 5. Besag J. Spatial interaction and the statistical analysis of lattice systems. J Royal Stat Soc Ser B (Methodol). 1974;36(2):192–225.
- 6. Besag J, York J, Mollié A. Bayesian image restoration, with two applications in spatial statistics. Ann Inst Stat Math. 1991;43(1):1–20.
- 7. Kissling WD, Carl G. Spatial autocorrelation and the selection of simultaneous autoregressive models. Glob Ecol Biogeogr. 2008;17(1):59–71.
- 8. Datta A, Banerjee S, Hodges JS, Gao L. Spatial disease mapping using directed acyclic graph auto-regressive (DAGAR) models. Bayesian Anal. 2019;14(4):1221–1244. doi: 10.1214/19-BA1177
- 9. Lindström S, Finucane H, Bulik-Sullivan B, et al. Quantifying the genetic correlation between multiple cancer types. Cancer Epidemiol Prev Biomark. 2017;26(9):1427–1435.
- 10. Jin X, Carlin BP, Banerjee S. Generalized hierarchical multivariate CAR models for areal data. Biometrics. 2005;61(4):950–961.
- 11. Knorr-Held L, Best NG. A shared component model for detecting joint and selective clustering of two diseases. J R Stat Soc A Stat Soc. 2001;164(1):73–85.
- 12. Kim H, Sun D, Tsutakawa RK. A bivariate Bayes method for improving the estimates of mortality rates with a twofold conditional autoregressive model. J Am Stat Assoc. 2001;96(456):1506–1521.
- 13. Gelfand AE, Vounatsou P. Proper multivariate conditional autoregressive models for spatial data analysis. Biostatistics. 2003;4(1):11–15.
- 14. Carlin BP, Banerjee S. Hierarchical multivariate CAR models for spatio-temporally correlated survival data. Bayesian Stat. 2003;7(7):45–63.
- 15. Held L, Natário I, Fenton SE, Rue H, Becker N. Towards joint disease mapping. Stat Methods Med Res. 2005;14(1):61–82.
- 16. Jin X, Banerjee S, Carlin BP. Order-free co-regionalized areal data models with application to multiple-disease mapping. J Royal Stat Soc Ser B (Stat Methodol). 2007;69(5):817–838.
- 17. Zhang Y, Hodges JS, Banerjee S. Smoothed ANOVA with spatial effects as a competitor to MCAR in multivariate spatial smoothing. Ann Appl Stat. 2009;3(4):1805.
- 18. Diva U, Dey DK, Banerjee S. Parametric models for spatially correlated survival data for individuals with multiple cancers. Stat Med. 2008;27(12):2127–2144.
- 19. Martinez-Beneito MA. A general modelling framework for multivariate disease mapping. Biometrika. 2013;100(3):539–553.
- 20. Marí-Dell’Olmo M, Martinez-Beneito MA, Gotsens M, Palència L. A smoothed ANOVA model for multivariate ecological regression. Stoch Env Res Risk A. 2014;28(3):695–706.
- 21. Lawson B, Banerjee S, Haining R, Ugarte D. Handbook of Spatial Epidemiology. Boca Raton, FL: CRC press; 2016.
- 22. Mardia K. Multi-dimensional multivariate Gaussian Markov random fields with application to image processing. J Multivar Anal. 1988;24(2):265–284.
- 23. Sain SR, Furrer R, Cressie N. A spatial analysis of multivariate output from regional climate models. Ann Appl Stat. 2011;5(1):150–175. doi: 10.1214/10-AOAS369
- 24. MacNab YC. Linear models of coregionalization for multivariate lattice data: a general framework for coregionalized multivariate CAR models. Stat Med. 2016;35(21):3827–3850. doi: 10.1002/sim.6955
- 25. MacNab YC. Some recent work on multivariate Gaussian Markov random fields (with discussion). TEST Offic J Spanish Soc Stat Oper Res. 2018;27(3):497–541. doi: 10.1007/s11749-018-0605-3
- 26. MacNab YC. Bayesian estimation of multivariate Gaussian Markov random fields with constraint. Stat Med. 2020;39(30):4767–4788. doi: 10.1002/sim.8752
- 27. Zhu J, Eickhoff JC, Yan P. Generalized linear latent variable models for repeated measures of spatially correlated multivariate data. Biometrics. 2005;61(3):674–683. doi: 10.1111/j.1541-0420.2005.00343.x
- 28. Bradley JR, Holan SH, Wikle CK. Multivariate spatio-temporal models for high-dimensional areal data with application to longitudinal employer-household dynamics. Ann Appl Stat. 2015;9(4):1761–1791. doi: 10.1214/15-AOAS862
- 29. Daniels MJ, Zhou Z, Zou H. Conditionally specified space-time models for multivariate processes. J Comput Graph Stat. 2006;15(1):157–177. doi: 10.1198/106186006X100434
- 30. Wall MM. A close look at the spatial structure implied by the CAR and SAR models. J Stat Plan Infer. 2004;121(2):311–324.
- 31. Meng XL, Wong WH. Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Stat Sin. 1996;6:831–860.
- 32. Gronau QF, Sarafoglou A, Matzke D, et al. A tutorial on bridge sampling. J Math Psychol. 2017;81:80–97.
- 33. Cover TM, Thomas JA. Elements of Information Theory. Wiley Series in Telecommunications and Signal Processing. Hoboken, NJ: Wiley Interscience; 1991.
- 34. Banerjee S. Modeling massive spatial datasets using a conjugate Bayesian linear modeling framework. Spat Stat. 2020;37:100417.
- 35. Cressie N, Zammit-Mangion A. Multivariate spatial covariance models: a conditional approach. Biometrika. 2016;103(4):915–935. doi: 10.1093/biomet/asw045
- 36. Gamerman D, Lopes HF. Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference. Boca Raton, FL: Chapman & Hall/CRC Press; 2006.
- 37. Hoeting JA, Madigan D, Raftery AE, Volinsky CT. Bayesian model averaging: a tutorial. Stat Sci. 1999;14:382–401.
- 38. Perrakis K, Ntzoufras I, Tsionas EG. On the use of marginal posteriors in marginal likelihood estimation via importance sampling. Comput Stat Data Anal. 2014;77:54–69.
- 39. Gelfand AE, Dey DK. Bayesian model choice: asymptotics and exact calculations. J Royal Stat Soc Ser B (Methodol). 1994;56(3):501–514.
- 40. Gronau QF, Singmann H, Wagenmakers EJ. Bridgesampling: an R package for estimating normalizing constants; 2017. arXiv preprint arXiv:1710.08162.
- 41. Watanabe S. Asymptotic equivalence of Bayes cross validation and widely applicable information criterion in singular learning theory. J Mach Learn Res. 2010;11(Dec):3571–3594.
- 42. Gelman A, Hwang J, Vehtari A. Understanding predictive information criteria for Bayesian models. Stat Comput. 2014;24(6):997–1016.
- 43. Gelfand AE, Ghosh SK. Model choice: a minimum posterior predictive loss approach. Biometrika. 1998;85(1):1–11.
- 44. Agrawal K, Markert RJ, Agrawal S. Risk factors for adenocarcinoma and squamous cell carcinoma of the esophagus and lung. Hypertension. 2018;61(46):0–09.
- 45. Shi WX, Chen SQ. Frequencies of poor metabolizers of cytochrome P450 2C19 in esophagus cancer, stomach cancer, lung cancer and bladder cancer in Chinese population. World J Gastroenterol. 2004;10(13):1961.
- 46. Kurishima K, Miyazaki K, Watanabe H, et al. Lung cancer patients with synchronous colon cancer. Mol Clin Oncol. 2018;8(1):137–140.
- 47. Akhtar J, Bhargava R, Shameem M, et al. Second primary lung cancer with glottic laryngeal cancer as index tumor–A case report. Case Rep Oncol. 2010;3(1):35–39.
- 48. Surveillance Research Program. National Cancer Institute. SEER*Stat software. Surveillance Res Program. 2019. https://seer.cancer.gov/seerstat/
- 49. Banerjee S, Carlin BP, Gelfand AE. Hierarchical Modeling and Analysis for Spatial Data. Boca Raton, FL: CRC Press; 2014.
- 50. California Department of Public Health. California Tobacco control program California tobacco facts and figures; 2018.
- 51. Wang F, Wall MM. Generalized common spatial factor model. Biostatistics. 2003;4(4):569–582. doi: 10.1093/biostatistics/4.4.569
- 52. Dey D, Datta A, Banerjee S. Graphical Gaussian processes for highly multivariate spatial data. Biometrika. 2020. doi: 10.1093/biomet/asab061
- 53. Azevedo D, Prates M, Bandyopadhyay D. MSPOCK: alleviating spatial confounding in multivariate disease mapping models. J Agric Biol Environ Stat. 2021;26:464–491.
- 54. Khan K, Calder CA. Restricted spatial regression methods: implications for inference. J Am Stat Assoc. 2020;1–13. doi: 10.1080/01621459.2020.1788949
- 55. Zimmerman DL, Hoef JMV. On deconfounding spatial confounding in linear models. Am Stat. 2021;1–9. doi: 10.1080/00031305.2021.1946149