Author manuscript; available in PMC: 2021 Apr 14.
Published in final edited form as: Bayesian Anal. 2019 Oct 3;14(4):1221–1244. doi: 10.1214/19-ba1177

Spatial disease mapping using directed acyclic graph auto-regressive (DAGAR) models

Abhirup Datta 1, Sudipto Banerjee 2, James S Hodges 3, Leiwen Gao 4
PMCID: PMC8046356  NIHMSID: NIHMS1051582  PMID: 33859772

Abstract

Hierarchical models for regionally aggregated disease incidence data commonly involve region specific latent random effects that are modeled jointly as having a multivariate Gaussian distribution. The covariance or precision matrix incorporates the spatial dependence between the regions. Common choices for the precision matrix include the widely used ICAR model, which is singular, and its nonsingular extension which lacks interpretability. We propose a new parametric model for the precision matrix based on a directed acyclic graph (DAG) representation of the spatial dependence. Our model guarantees positive definiteness and, hence, in addition to being a valid prior for regional spatially correlated random effects, can also directly model the outcome from dependent data like images and networks. Theoretical results establish a link between the parameters in our model and the variance and covariances of the random effects. Substantive simulation studies demonstrate that the improved interpretability of our model reaps benefits in terms of accurately recovering the latent spatial random effects as well as for inference on the spatial covariance parameters. Under modest spatial correlation, our model far outperforms the CAR models, while the performances are similar when the spatial correlation is strong. We also assess sensitivity to the choice of the ordering in the DAG construction using theoretical and empirical results which testify to the robustness of our model. We also present a large-scale public health application demonstrating the competitive performance of the model.

Keywords: Areal data, Bayesian inference, Directed acyclic graphs, Disease mapping, Spatial autoregression

1. Introduction

Epidemiological data for disease rates are often presented as aggregated disease counts over entire geographical regions like states or counties. Such areal or areally-referenced data are ubiquitous in public health applications. Accurate identification of trends and factors associated with the disease requires accounting for the spatial dependence among the regions. A common approach to analyze areal datasets envisions the geographic domain as an undirected graph with the regions constituting the vertices and an edge between two vertices if the corresponding regions share a geographical border. This creates well defined neighbors for each region which are used to specify the joint or conditional distributions of region-specific latent Gaussian random effects in a hierarchical setup. For example, the popular conditional autoregressive (CAR) model (Besag 1974; Clayton and Bernardinelli 1992) incorporates the underlying neighborhood structure in specifying the full conditional distribution for each observation. If wi denotes the random effect representing the ith region for i = 1, …, k and i ~ j indicates that regions i and j are neighbors, then the CAR model specifies the full conditional distributions

$$w_i \mid w_{-i} \sim N\Big(\sum_{j \sim i} w_j / n_i,\; \tau_w n_i\Big), \tag{1}$$

where $w_{-i}$ denotes the vector of observations leaving out the $i$th one, $n_i$ denotes the number of neighbors of the $i$th region, and throughout the text we adopt the convention that $N(\alpha, \Delta)$ denotes the normal distribution with mean $\alpha$ and precision $\Delta$, in both univariate and multivariate contexts. Hence, in (1) above, $\tau_w n_i$ is the conditional precision of $w_i \mid w_{-i}$.

The joint distribution of $w = (w_1, \ldots, w_k)^T$ can be derived from (1) as $w \sim N(0, \tau_w(D - A))$, where $A = (a_{ij})$ is the adjacency matrix of the neighborhood graph, i.e., $a_{ij} = 1$ if and only if $i \sim j$, and $D$ is a diagonal matrix with $n_1, \ldots, n_k$ on the diagonal. As $D - A$ is singular, this construction yields an improper joint distribution of the $w_i$'s, referred to as the intrinsic or improper CAR (ICAR) model. This impropriety renders the model ineligible for directly modeling the response or for generating data, although both can proceed by using contrasts as demonstrated in Besag and Kooperberg (1995). Also, the distribution can still be used as a prior for latent spatial random effects $w$, and the posterior of $w$ usually remains valid.

The impropriety of the ICAR model can be rectified by generalizing the full conditional mean to $E(w_i \mid w_{-i}) = \rho \sum_{j \sim i} w_j / n_i$, yielding the joint distribution $w \sim N(0, \tau_w(D - \rho A))$, which is proper for a certain range of $\rho$. Although the introduction of $\rho$ imparts more flexibility than the parameter-free improper analogue, $\rho$ is difficult to interpret, as even very high values of $\rho$ induce only modest spatial correlation among the observations (see Banerjee et al. 2014 for a discussion). Furthermore, Wall (2004) shows that even negative values of $\rho$ may lead to positive correlation among neighboring regions. Assuncao and Krainski (2009) found that these oddities are a general feature of CAR models.

The second popular approach is the simultaneous autoregressive (SAR) model (Whittle 1954) which proceeds by simultaneously modeling the random effects as

$$w_i = \sum_{j \neq i} b_{ij} w_j + \epsilon_i \quad \text{for } i = 1, 2, \ldots, k, \tag{2}$$

where $\epsilon_i \overset{\text{ind}}{\sim} N(0, \tau_i)$ are errors independent of $w$. Defining $B = (b_{ij})$ and $F$ to be a diagonal matrix with entries $\tau_1, \ldots, \tau_k$, the set of equations in (2) yields the joint distribution $w \sim N(0, (I - B)^T F (I - B))$. However, the common choice of defining $b_{ij} = \rho I(i \sim j)/n_i$, where $I(\cdot)$ denotes the indicator function, leads to similar problems with respect to interpretation of the parameter $\rho$ (Wall 2004).

Beyond these two approaches, the inventory of covariance models for areal datasets is very limited. Leroux et al. (2000) and MacNab and Dean (2000) extended the CAR model by accommodating overdispersion alongside spatial information. They proposed using the precision matrix $\lambda\tau_w(D - A) + (1 - \lambda)\tau_w I$, where $\lambda \in [0, 1]$ controls the degree of dependence among the regions. For a regular graph where all vertices have the same number of neighbors $d$, $D = dI$. In this case, $\lambda\tau_w(D - A) + (1 - \lambda)\tau_w I$ can be rewritten as $\frac{1 + (d - 1)\lambda}{d}\tau_w(D - \rho^* A)$ where $\rho^* = \frac{d\lambda}{1 + (d - 1)\lambda}$. Thus, if the numbers of neighbors for the vertices do not vary greatly, this approach is somewhat similar to the proper CAR model and is encumbered by the same interpretability concerns.
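The rewriting for a regular graph follows in two steps. The worked derivation below is a sketch that uses only the quantities defined in the paragraph above, namely $D = dI$ and $\rho^* = d\lambda / \{1 + (d-1)\lambda\}$:

```latex
\begin{aligned}
\lambda\tau_w(dI - A) + (1-\lambda)\tau_w I
  &= \tau_w\big[\{1 + (d-1)\lambda\}\, I - \lambda A\big] \\
  &= \frac{1 + (d-1)\lambda}{d}\,\tau_w
     \Big[dI - \frac{d\lambda}{1 + (d-1)\lambda}\, A\Big]
   = \frac{1 + (d-1)\lambda}{d}\,\tau_w\,(D - \rho^* A).
\end{aligned}
```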

For lattice based applications, there is a richer class of parametric intrinsic autoregression models (Besag and Kooperberg 1995; Besag and Higdon 1999). However, all such intrinsic models rely heavily on the lattice structure and cannot be used directly for arbitrary graphs. Applications to irregular areal data can proceed by breaking up the region into a fine lattice, using the intrinsic model on the lattice and aggregating over each area. Besag and Mondal (2005) demonstrated that certain classes of intrinsic autoregressive models can be interpreted as average of a fine scale Gaussian Process over the entire domain. In disease mapping contexts, where the data are often observed over fixed politically delineated regions, such a latent fine scale spatial process may be difficult to interpret. In this manuscript, we only focus on models that can be formulated directly on the areal units.

We propose a new way of constructing precision matrices for areal models using a directed acyclic graph derived from the original undirected graph. Directed acyclic graphs or DAGs have been used in the spatial literature for modeling large spatial datasets (Datta et al. 2016a) and for generating image textures (Cressie and Davidson 1998). Instead of modeling the precision matrix directly, we model its Cholesky factor, which for any multivariate Gaussian distribution is determined by the conditional distributions of the wi’s. We specify these conditional distributions using autoregressive covariance models on a sequence of local trees created from this directed acyclic graph. The resulting Cholesky factor and the precision matrix are sparse. We refer to this model as the directed acyclic graph autoregressive or DAGAR model.

Unlike the ICAR model, our model’s covariance matrix is guaranteed to be positive definite. This opens up a new avenue to generate or directly model multivariate Gaussian data with dependence structure derived from a graph. Common examples of such data, besides aggregated regional data, include images or social network data. We establish, both theoretically and empirically, that our model endows ρ with a clear interpretation as a spatial autocorrelation parameter, which, in fact, resolves an important conundrum in the conditional and simultaneous autoregressive models (Wall 2004). Also, the Cholesky factor has the same level of sparsity as the undirected graph ensuring scalability for analyzing very large areal datasets.

Cholesky factors inherit the dependence of directed acyclic graphs on ordering of the regions, thereby making our model order-dependent. As spatial regions generally do not have any natural ordering, to understand the impact of ordering, we propose a novel order free model by averaging over all k! possible orderings. We show that the resulting precision matrix, which is order-free, can be evaluated in closed form and we use it to present some theoretical results suggesting that the DAGAR precision matrix with a reasonably chosen ordering is often similar to the order-free matrix. The theoretical results complemented by simulation exercises reveal that the choice of ordering does not significantly affect the results. Simulation experiments also show that when the spatial correlation is weak or moderate, the DAGAR model outperforms CAR models in their ability to correctly estimate a latent spatial surface while the performances are similar for data with stronger spatial dependence.

2. Model

2.1. Cholesky Factors

We first review a general approach to modeling Gaussian covariance matrices using sparse Cholesky factors and discuss how this relates to CAR models and general covariance estimation. This helps motivate the subsequent construction of our model in Section 2.2. We assume that the graph of the regions is connected. Disconnected graphs with multiple islands will entail a simple extension with block diagonal covariance structures, where each block represents an island. Let G=(V,E) denote the connected graph with the regions as vertices V and edges E between neighbors. We denote the ith region simply by i and let A = (aij) denote the adjacency matrix for this undirected graph. To model Cholesky factors we specify distributions of the wi’s as

$$w_1 = \epsilon_1,\quad w_2 = b_{21} w_1 + \epsilon_2,\quad \ldots,\quad w_k = b_{k1} w_1 + \cdots + b_{k,k-1} w_{k-1} + \epsilon_k, \tag{3}$$

where the $\epsilon_i$'s are independent $N(0, \tau_i)$ errors. Throughout this section and Section 2.2, we assume $\tau_w$, the scale parameter for the $w_i$'s, is one. $B = (b_{ij})$ in (3) is a strictly lower triangular matrix. Let $F$ be the diagonal matrix with $\tau_1, \ldots, \tau_k$ on the diagonal. Then $w \sim N(0, L^T F L)$ where $L = I - B$. Switching from the specification in (2) to a strictly lower triangular matrix $B$ is not restrictive because of the following result.

Theorem 1.

Let $w \sim N(0, Q)$ where $Q$ is the (possibly singular) precision matrix. Then there exist a permutation matrix $P$, a strictly lower triangular matrix $B$ and a diagonal matrix $F$ with non-negative entries such that $P Q P^T = (I - B)^T F (I - B)$.

While this result is trivial if $Q$ is non-singular, for rank-deficient $Q$ it relies on the algorithm for obtaining the Cholesky factor. All proofs are presented in the Supplement. Hence, any multivariate normal distribution can be expressed as in (3) under certain orderings of the areal units. For low-rank distributions this is equivalent to setting some of the $\tau_i$'s to zero. In fact, switching to the lower triangular $B$ has several advantages. First, $L$ is lower triangular with ones on the diagonal, guaranteeing that $L^T F L$ is positive definite as long as all $\tau_i$'s are positive. Also, $\det(L^T F L)$ is simply $\prod_{i=1}^{k} \tau_i$, and the quadratic form $w^T L^T F L w$ can be expressed as $\tau_1 w_1^2 + \sum_{i=2}^{k} \tau_i \big(w_i - \sum_{j < i} w_j b_{ij}\big)^2$, which requires $O(k + s)$ floating point operations (FLOPs) to evaluate, where $s$ is the sparsity, i.e., the number of non-zero entries of $B$. Hence, if $B$ is sparse, the joint density of $w$ can be evaluated in an extremely scalable manner.
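The $O(k + s)$ evaluation described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `quad_form_and_logdet`, the list-of-lists storage for the non-zero $b_{ij}$, and the toy values are all assumptions made for the example.

```python
import numpy as np

def quad_form_and_logdet(w, tau, nbrs, coefs):
    """Return w^T L^T F L w and log det(L^T F L) for L = I - B.

    nbrs[i] lists the indices j < i with b_ij != 0 and coefs[i] holds the
    matching b_ij values, so the loop touches each non-zero entry once.
    """
    quad = 0.0
    for i in range(len(w)):
        resid = w[i] - sum(b * w[j] for j, b in zip(nbrs[i], coefs[i]))
        quad += tau[i] * resid ** 2
    return quad, float(np.sum(np.log(tau)))

# Toy example (hypothetical values): k = 4, strictly lower triangular B.
rng = np.random.default_rng(0)
k = 4
nbrs = [[], [0], [0, 1], [2]]
coefs = [[], [0.5], [0.2, -0.3], [0.7]]
tau = np.array([1.0, 2.0, 0.5, 1.5])
w = rng.normal(size=k)

# Dense reference, built only to check the sparse evaluation.
B = np.zeros((k, k))
for i in range(k):
    for j, b in zip(nbrs[i], coefs[i]):
        B[i, j] = b
L = np.eye(k) - B
Q = L.T @ np.diag(tau) @ L

quad, logdet = quad_form_and_logdet(w, tau, nbrs, coefs)
```

The dense matrix `Q` is formed here only for comparison; in a large problem one would keep just `nbrs`, `coefs` and `tau`.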

To complete the specification in (3), we need to fully specify the matrices $B$ and $F$. The parameters $\{b_{ij}\}$ and $\{\tau_i\}$ are identifiable up to a marginal precision parameter because the factorization $(I - B)^T F (I - B)$ is the $LDL^T$ factorization (a variant of the Cholesky decomposition with ones on the diagonal of $L$), which is unique. If multiple observations have been made for each region, we can estimate $B$ and $F$ without imposing simplifying assumptions. In fact, since a sparse representation of the Cholesky factor $I - B$ is desired, the problem reduces to high-dimensional covariance or precision matrix estimation, for which there exists a vast inventory of statistical methods including banding (Wu and Pourahmadi 2003; Bickel and Levina 2008b), tapering (Cai et al. 2010), thresholding (Bickel and Levina 2008a; El Karoui 2008; Rothman et al. 2009) and penalization (Meinshausen and Buhlmann 2006; Friedman et al. 2007; Xue et al. 2012), among others.

On the other hand, for large point-referenced spatial datasets, Datta et al. (2016a), Finley et al. (2017) and Datta et al. (2016b) construct sparse Cholesky factor approximations of the precision matrix from a Matérn covariance function (Stein 1999). These approximations are hence derived from an original joint distribution of the spatial random effects.

However, most areal datasets lack replication that would permit use of fully data-driven learning methods to estimate B and F. Also, unlike well defined Matérn Gaussian processes on continuous domains, there is no well defined covariance matrix on arbitrary graphs from which one can derive sparse Cholesky factors. In fact, our goal here is the opposite, that is to construct a multivariate Gaussian distribution on graphs starting from the sparse Cholesky factor. Consequently, we will make parametric assumptions that will lead to an interpretable covariance model.

2.2. Directed acyclic graph autoregressive models

To achieve sparsity, we adopt the strategy of defining neighbor sets $N(i)$ such that $b_{ij} = 0$ for all $j \notin N(i)$. The choice and size of the neighbor sets for areal datasets can be predicated upon the underlying neighborhood graph $G$. For $i > 1$, we define $N(i) = \{j : j < i,\ j \sim i\}$. The constraint $j < i$ is necessary to endow $B$ with a lower triangular structure. This reduces (3) to

$$w_1 = \epsilon_1, \qquad w_i = \sum_{j \in N(i)} w_j b_{ij} + \epsilon_i \quad (i = 2, \ldots, k). \tag{4}$$

This specification is analogous to autoregressive models for time series. In fact, if $w_i$ denotes the response at time $i$, $N(i)$ includes all time points less than $i$ up to a lag of $r$, and $b_{ij} = b_{i-j}$, then (4) simply denotes the autoregressive model of order $r$. In a time-series context, where $i$ and $j$ denote time points, assigning the weights $b_{ij}$ based on the temporal lag seems natural, but for irregular areal datasets, enumeration of the areal units does not have any physical interpretation. In the context of image texture analysis, Cressie and Davidson (1998) used different coefficients for $w_j$ in (4) based on the direction of neighbors on a regular lattice, to generate images with a wide range of textures. In general, vertices of irregular graphs based on areal datasets do not share such commonality in terms of spatial orientation of their neighbors. Hence, it is intuitive to assign equal weights to all the neighbors, i.e., letting $b_{ij} = b_i$ for all $j \in N(i)$. A natural choice would be to let $b_i = 1/n_{<i}$ and $\tau_i \propto n_{<i}$, where $n_{<i} = |N(i)|$ denotes the cardinality of the neighbor set. This specification is similar to (1) except that we are only using the directed neighbors $N(i)$ instead of all neighbors. Since $n_{<1} = 0$, this choice of $b_{ij}$ leads to the conundrum of how to specify $\tau_1$. Either we need to define $\tau_1$ in a manner inconsistent with the definition of $\tau_i$ for $i > 1$, or we define $\tau_1 = 0$, which yields an improper distribution for $w$. We circumvent this using a more general specification described below that includes the degenerate prior with $b_i = 1/n_{<i}$ and $\tau_i \propto n_{<i}$ as a limiting case.

Let $d_{ij}$ denote the length of the shortest path on $G$ between nodes $i$ and $j$. If $G$ is a tree, i.e., an acyclic graph, then for any $0 \le \rho < 1$, the matrix with elements $\rho^{d_{ij}}$ is positive definite and can be used to model the covariance of $w$. This extends the AR(1) model for time series to any tree graph (Basseville et al. 2006). However, graphs corresponding to areal datasets are rarely acyclic, and for loopy graphs such results generally do not hold. A spanning tree of a graph is a subgraph that is a tree and includes all the vertices of the original graph. Spanning trees have been used to iteratively approximate parametric covariance matrices over loopy graphs (Sudderth 2002). Borrowing these ideas, a potential solution would be to define the covariance matrix for $w$ as the AR(1) covariance matrix on a spanning tree of $G$. However, for large graphs, strategies for deciding upon the best spanning tree are unclear and computationally expensive. Furthermore, as demonstrated in Sudderth (2002), ignoring certain edges when pruning $G$ to a spanning tree can lead to large errors. Instead, we will use local spanning tree embeddings of small subgraphs of $G$ to construct the lower dimensional conditional densities specified in (4). This method will not ignore any edge and yet produces a computationally convenient precision matrix.

Let $G_i$ be the subgraph of $G$ comprising vertices $\{i\} \cup N(i)$ and the edges among them. We intend to construct the conditional density of $w_i \mid w_{N(i)}$ using an embedded spanning tree $T_i$ of $G_i$. The natural candidate for $T_i$ is the tree graph $(\{i\} \cup N(i), \{i \sim j \mid j \in N(i)\})$, as it contains all edges between $i$ and $N(i)$. We specify the conditional density of $w_i \mid w_{N(i)}$ using the AR(1) model on $T_i$ with parameter $\rho$. To be precise, for any $0 \le \rho < 1$, an autoregressive AR(1) covariance matrix with parameter $\rho$ on $T_i$ is given by

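$$\begin{pmatrix} 1 & \rho & \rho & \cdots & \rho \\ \rho & 1 & \rho^2 & \cdots & \rho^2 \\ \rho & \rho^2 & 1 & \cdots & \rho^2 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho & \rho^2 & \rho^2 & \cdots & 1 \end{pmatrix} = \begin{pmatrix} 1 & v_i^T \\ v_i & \Sigma_i \end{pmatrix}. \tag{5}$$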

This helps us define $E(w_i \mid w_{N(i)}) = v_i^T \Sigma_i^{-1} w_{N(i)}$ and $\mathrm{var}(w_i \mid w_{N(i)}) = 1 - v_i^T \Sigma_i^{-1} v_i$, where $v_i$ is the $n_{<i} \times 1$ vector of covariances between $w_i$ and $w_{N(i)}$, and $\Sigma_i$ is the $n_{<i} \times n_{<i}$ covariance matrix of $w_{N(i)}$ assuming an AR(1) model on $T_i$. From equation (5) it is clear that $v_i = \rho \mathbf{1}$, where $\mathbf{1}$ denotes the vector of ones, and $\Sigma_i$ is the matrix with ones on the diagonal and $\rho^2$ on the off-diagonals. Equating this with (4), we have

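$$b_{ij} = \frac{\rho}{1 + (n_{<i} - 1)\rho^2} \quad (i = 2, \ldots, k;\ j \in N(i)), \qquad \tau_i = \frac{1 + (n_{<i} - 1)\rho^2}{1 - \rho^2} \quad (i = 1, \ldots, k). \tag{6}$$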

The specifications in (6) reveal some desirable intuitive features. First, as discussed earlier, $b_{ij} = b_i$ for all $j \in N(i)$, thereby assigning equal weights to all the directed neighbors. Also, the conditional precision $\tau_i$ for $w_i$ increases with the number of directed neighbors. The formulation of $T_i$ also ensures that any edge between $i$ and $j$ is incorporated in the conditional specification of $w_i$ or $w_j$, depending on which comes later in the ordering. So, unlike approximating the entire graph with a spanning tree, the local spanning tree approach ensures that no edge of $G$ is ignored. Furthermore, for any $0 \le \rho < 1$ all $\tau_i$'s are positive, thereby ensuring a proper probability distribution $w \sim N(0, L^T F L)$. The limiting case of $\rho = 1$ is equivalent to the improper prior with $b_i = 1/n_{<i}$ and $\tau_i \propto n_{<i}$.
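The closed forms in (6) can be checked against a direct inversion of $\Sigma_i$ from (5). The sketch below does this for a few neighbor counts; the function names `dagar_coefficients` and `direct_conditional` and the value $\rho = 0.4$ are illustrative assumptions.

```python
import numpy as np

def dagar_coefficients(n_dir, rho):
    """Return (b_i, tau_i) from (6) for a region with n_dir directed neighbors."""
    denom = 1.0 + (n_dir - 1) * rho ** 2
    return rho / denom, denom / (1.0 - rho ** 2)

def direct_conditional(n_dir, rho):
    """Same quantities from v_i and Sigma_i in (5), by solving a linear system."""
    v = np.full(n_dir, rho)                      # v_i = rho * 1
    sigma = np.full((n_dir, n_dir), rho ** 2)    # rho^2 off the diagonal
    np.fill_diagonal(sigma, 1.0)                 # ones on the diagonal
    weights = np.linalg.solve(sigma, v)          # Sigma_i^{-1} v_i
    cond_var = 1.0 - v @ weights                 # 1 - v_i^T Sigma_i^{-1} v_i
    return weights, 1.0 / cond_var

rho = 0.4
checks = []
for n_dir in range(1, 6):
    b, tau = dagar_coefficients(n_dir, rho)
    weights, precision = direct_conditional(n_dir, rho)
    checks.append(np.allclose(weights, b) and np.isclose(precision, tau))
```

With no directed neighbors, `dagar_coefficients(0, rho)` returns a precision of one, which matches the unit marginal variance of the first region in the ordering.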

The constructions in (3) and (6) assume a specific ordering, which we now generalize to any other ordering. Let $\pi = \{\pi(1), \ldots, \pi(k)\}$ be any predetermined ordering of the regions and $\pi^{-1}$ denote its corresponding inverse permutation. Under this ordering, for any $i \neq \pi(1)$, we define its past observations $w_{<i,\pi}$ as the collection $\{w_j \mid \pi^{-1}(j) < \pi^{-1}(i)\}$ and its set of directed neighbors $N_\pi(i) = \{j \mid i \sim j,\ \pi^{-1}(j) < \pi^{-1}(i)\}$. Let $E_\pi$ denote the collection of directed edges from all members of $N_\pi(i)$ to $i$ for every $i \neq \pi(1)$. We now have a directed acyclic graph $D_\pi = (V, E_\pi)$. Let $n_\pi(i) = |N_\pi(i)|$. The generalization of (6) based on $D_\pi$ is:

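$$w_i \mid w_{<i,\pi} \sim N\left(\frac{\rho}{1 + (n_\pi(i) - 1)\rho^2} \sum_{j \in N_\pi(i)} w_j,\ \frac{1 + (n_\pi(i) - 1)\rho^2}{1 - \rho^2}\right), \tag{7}$$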

where for any $i$ that has no directed neighbors under $\pi$, $n_\pi(i) = 0$ and the conditional mean in (7) is zero. If $w_\pi = (w_{\pi(1)}, \ldots, w_{\pi(k)})^T$, and $L_\pi$ and $F_\pi$ denote the analogous matrices corresponding to this ordering $\pi$, then we have

$$w_\pi \sim N(0, L_\pi^T F_\pi L_\pi). \tag{8}$$

This completes the specification of a new class of covariance models for areal datasets. Since the construction is predicated upon a directed acyclic graph derived from an original graph $G$, we refer to this as the directed acyclic graph autoregressive or DAGAR model. Since $L_\pi$ is lower triangular with $e_\pi = |E_\pi|$ non-zero sub-diagonal entries and $F_\pi$ is diagonal, for any $\rho$, the determinant of $\mathrm{cov}(w_\pi)$ is simply $(1 - \rho^2)^k / \prod_{i=1}^{k} \big(1 + (n_\pi(i) - 1)\rho^2\big)$, and the likelihood for the model in (8) can be evaluated using $O(k + e_\pi)$ FLOPs. This ensures that our model is scalable and can be used to analyze massive areal datasets.
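As an illustration of (7) and (8), the sketch below assembles the DAGAR precision matrix for a small loopy graph and compares the determinant of the implied covariance with the closed form above. The function name `dagar_precision`, the dense storage and the example graph are assumptions made for readability; a large-scale implementation would store only the non-zero entries.

```python
import numpy as np

def dagar_precision(adj, rho, order=None):
    """DAGAR precision matrix, returned in the original region labelling.

    adj   : (k, k) symmetric 0/1 adjacency matrix of the undirected graph
    rho   : autocorrelation parameter with 0 <= rho < 1
    order : sequence giving the ordering pi (defaults to 0, 1, ..., k-1)
    Also returns the number of directed neighbors n_pi(i) of each region.
    """
    k = adj.shape[0]
    order = np.arange(k) if order is None else np.asarray(order)
    pos = np.empty(k, dtype=int)
    pos[order] = np.arange(k)            # pos[i] = rank of region i under pi
    B = np.zeros((k, k))
    tau = np.empty(k)
    n_dir = np.zeros(k, dtype=int)
    for i in range(k):
        parents = [j for j in np.flatnonzero(adj[i]) if pos[j] < pos[i]]
        n_dir[i] = len(parents)
        denom = 1.0 + (n_dir[i] - 1) * rho ** 2
        tau[i] = denom / (1.0 - rho ** 2)
        for j in parents:
            B[i, j] = rho / denom
    L = np.eye(k) - B
    return L.T @ np.diag(tau) @ L, n_dir

# Hypothetical example: a 4-cycle 0-1-2-3 with a pendant vertex 4 attached to 2.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (2, 4)]
k, rho = 5, 0.6
adj = np.zeros((k, k), dtype=int)
for a, b in edges:
    adj[a, b] = adj[b, a] = 1

Q, n_dir = dagar_precision(adj, rho)
det_cov_formula = (1 - rho ** 2) ** k / np.prod(1 + (n_dir - 1) * rho ** 2)
det_cov_numeric = 1.0 / np.linalg.det(Q)
```

Every undirected edge appears exactly once as a directed edge, so `n_dir.sum()` equals the number of edges of the graph.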

2.3. Interpretation of ρ

While scalability of the DAGAR model is an important aspect in the analysis of large spatial datasets, our current emphasis is on offering a class of areal models with an interpretable parameterization. In this regard, we resolve the lack of a meaningful relationship between $\rho$ and spatial correlation in the proper CAR models. We now offer insight into the interpretability of $\rho$ in the DAGAR model for certain special graphs.

Theorem 2.

Let $T$ denote a tree with vertices $V = \{1, \ldots, k\}$ and $\pi$ denote any ordering such that for any $i \neq \pi(1)$, $n_\pi(i) = 1$. Then the covariance matrix in (8) defines the autoregressive Gaussian process on $T$, i.e., $(L_\pi^T F_\pi L_\pi)^{-1} = (\rho^{d_{ij}})$, where $d_{ij}$ denotes the length of the shortest path on $T$ between $i$ and $j$.

Theorem 2 shows that if $G$ is acyclic, then, under certain orderings including breadth-first and pre-order tree traversals, this model is equivalent to the stationary AR(1) model on trees, with $\rho$ being the correlation between neighboring areas. Here, for any two vertices separated by a distance $d$, the correlation is $\rho^d$. The result also shows that while we require an ordering of the locations to construct the DAGAR model, the resulting matrix in this case is order-free and stationary and simply a function of the underlying undirected graph. We now present a result on the interpretation of $\rho$ for an $m \times n$ regular grid graph.
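Theorem 2 can be checked numerically on a small tree. The sketch below uses a hypothetical six-vertex tree labelled in breadth-first order, so that every vertex other than the root has exactly one directed neighbor; the helper names `dagar_precision` and `shortest_paths` are assumptions of this example.

```python
import numpy as np

def dagar_precision(adj, rho):
    """DAGAR precision for the ordering 0, 1, ..., k-1 (directed neighbors j < i)."""
    k = adj.shape[0]
    B, tau = np.zeros((k, k)), np.empty(k)
    for i in range(k):
        parents = [j for j in np.flatnonzero(adj[i]) if j < i]
        denom = 1.0 + (len(parents) - 1) * rho ** 2
        tau[i] = denom / (1.0 - rho ** 2)
        for j in parents:
            B[i, j] = rho / denom
    L = np.eye(k) - B
    return L.T @ np.diag(tau) @ L

def shortest_paths(adj):
    """All-pairs shortest path lengths by the Floyd-Warshall recursion."""
    k = adj.shape[0]
    d = np.where(adj > 0, 1.0, np.inf)
    np.fill_diagonal(d, 0.0)
    for m in range(k):
        d = np.minimum(d, d[:, [m]] + d[[m], :])
    return d

# Hypothetical tree, labelled so that each vertex's parent has a smaller index.
edges = [(0, 1), (0, 2), (1, 3), (1, 4), (2, 5)]
k, rho = 6, 0.7
adj = np.zeros((k, k), dtype=int)
for a, b in edges:
    adj[a, b] = adj[b, a] = 1

cov = np.linalg.inv(dagar_precision(adj, rho))
target = rho ** shortest_paths(adj)      # matrix with entries rho^{d_ij}
max_abs_diff = np.max(np.abs(cov - target))
```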

Theorem 3.

Let $G$ denote the $m \times n$ grid graph with vertices $V = \{(i, j) \mid i = 1, \ldots, m;\ j = 1, \ldots, n\}$ and neighbors to the north, south, east, and west, and let $\pi$ denote any diagonal ordering of the vertices corresponding to non-decreasing or non-increasing order of $i + j$ or $i - j$. Then $\mathrm{var}(w_{(i,j)}) = 1$ for all $i$ and $j$, and for any neighboring pair of vertices $(i, j)$ and $(i', j')$, $\mathrm{cov}(w_{(i,j)}, w_{(i',j')}) = \rho$.

Hence, although for a grid the DAGAR precision matrix is a function of the ordering, for all orderings of Theorem 3 the model yields unit variances and a correlation of $\rho$ for all neighbor pairs. Hence, $\rho$ is still interpretable. This result for a grid graph is quite promising, as graphs arising from areal data, like the grid graph, are loopy. Also note that when a CAR model is specified as $w \sim N(0, \tau_w(D - \rho A))$, it may seem that $1/\tau_w$ is the marginal variance of each spatial random effect. Unfortunately, this is not true, as the specification of the precision matrix for the CAR model effectuates a heteroskedastic distribution. Consequently, $\tau_w$ can only be interpreted as a common scale factor for the marginal variances. This is remedied in the DAGAR specification, as we see from Theorems 2 and 3 that the resulting model $N(0, \tau_w Q(\rho))$, where $Q(\rho)$ is the DAGAR precision matrix, is homoskedastic, and hence $1/\tau_w$ is the marginal variance. So, the DAGAR model ensures interpretability for both $\rho$ and $\tau_w$. We will see that this interpretability empowers the DAGAR model to deliver significantly improved inference about the spatial parameters in areal data analysis.
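The statement of Theorem 3 can be illustrated on a small grid. The sketch below builds a 4 × 5 grid, orders the vertices by non-decreasing $i + j$, and inspects the implied covariance matrix; the helper `dagar_precision`, the zero-based vertex indexing and the chosen grid size and $\rho$ are assumptions of this example.

```python
import numpy as np

def dagar_precision(adj, rho, order):
    """DAGAR precision in the original labelling for the ordering `order`."""
    k = adj.shape[0]
    pos = np.empty(k, dtype=int)
    pos[np.asarray(order)] = np.arange(k)   # rank of each vertex under pi
    B, tau = np.zeros((k, k)), np.empty(k)
    for i in range(k):
        parents = [j for j in np.flatnonzero(adj[i]) if pos[j] < pos[i]]
        denom = 1.0 + (len(parents) - 1) * rho ** 2
        tau[i] = denom / (1.0 - rho ** 2)
        for j in parents:
            B[i, j] = rho / denom
    L = np.eye(k) - B
    return L.T @ np.diag(tau) @ L

m, n, rho = 4, 5, 0.7
k = m * n
adj = np.zeros((k, k), dtype=int)
for i in range(m):
    for j in range(n):
        a = i * n + j                        # vertex (i, j) stored at index i*n + j
        if i + 1 < m:
            adj[a, a + n] = adj[a + n, a] = 1    # vertical neighbor
        if j + 1 < n:
            adj[a, a + 1] = adj[a + 1, a] = 1    # horizontal neighbor

# Diagonal ordering: sort vertices by i + j (ties do not change the DAG).
diag_sum = np.array([i + j for i in range(m) for j in range(n)])
order = np.argsort(diag_sum, kind="stable")

cov = np.linalg.inv(dagar_precision(adj, rho, order))
marginal_variances = np.diag(cov)
neighbor_covariances = cov[adj == 1]
```

Under the theorem, `marginal_variances` should all equal one and `neighbor_covariances` should all equal `rho`, up to floating point error.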

Of course, it is difficult to generalize these theoretical interpretability results for irregular graphs. Hence, we also conducted numerical experiments to corroborate the results in Theorems 2 and 3, and also gain insight into the relationship between ρ and the neighbor-pair correlations for the proper CAR model and the DAGAR model using an irregular graph. So we used three different graphs: a simple path graph with 100 vertices which is analogous to a time-series, a two-dimensional 10 × 10 lattice or grid graph with edges between vertically or horizontally adjacent vertices, and the state map of the contiguous United States, where two states are said to have an edge if they share a common geographical boundary.

We generated covariance matrices corresponding to the two models for $\rho \in \{i/10 \mid i = 1, \ldots, 9\}$. Figure 1 plots the average neighbor-pair correlation, given by $c(\rho) = \Big(\sum_{i \sim j} \mathrm{cov}(w_i, w_j) / \sqrt{\mathrm{var}(w_i)\,\mathrm{var}(w_j)}\Big) \Big/ \Big(\sum_{i} n_i\Big)$, where the sum in the numerator runs over ordered neighbor pairs, as a function of $\rho$, for the proper CAR and DAGAR models. For the path and grid graphs, we find the average neighbor-pair correlation $c(\rho)$ for our model is exactly $\rho$, as guaranteed by Theorems 2 and 3. For the highly irregular United States graph, the theoretical results, of course, do not hold. Nevertheless, $c(\rho)$ for the DAGAR model is much closer to $\rho$ than for the proper CAR model. For the CAR model, even when $\rho$ is close to one, $c(\rho)$ is less than 0.4. In fact, for all three graphs, the average neighbor-pair correlation for the proper CAR model remains modest. This seems to be true even for very high values of $\rho$ and is consistent with findings elsewhere (see, e.g., Banerjee et al. 2014).
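A comparison of this kind can be sketched for the path graph with 100 vertices. The code below is an illustration only: it assumes the average is taken over ordered neighbor pairs, uses the proper CAR precision $D - \rho A$ with $\tau_w = 1$, and the helper names are hypothetical. It does not reproduce the United States graph.

```python
import numpy as np

def dagar_precision(adj, rho):
    """DAGAR precision for the ordering 0, 1, ..., k-1 (directed neighbors j < i)."""
    k = adj.shape[0]
    B, tau = np.zeros((k, k)), np.empty(k)
    for i in range(k):
        parents = [j for j in np.flatnonzero(adj[i]) if j < i]
        denom = 1.0 + (len(parents) - 1) * rho ** 2
        tau[i] = denom / (1.0 - rho ** 2)
        for j in parents:
            B[i, j] = rho / denom
    L = np.eye(k) - B
    return L.T @ np.diag(tau) @ L

def avg_neighbor_corr(precision, adj):
    """Average correlation over ordered neighbor pairs (there are sum_i n_i of them)."""
    cov = np.linalg.inv(precision)
    sd = np.sqrt(np.diag(cov))
    corr = cov / np.outer(sd, sd)
    return corr[adj == 1].sum() / adj.sum()

k = 100
adj = np.zeros((k, k), dtype=int)
idx = np.arange(k - 1)
adj[idx, idx + 1] = adj[idx + 1, idx] = 1     # path graph

results = {}
for rho in (0.3, 0.6, 0.9):
    car = np.diag(adj.sum(axis=1)) - rho * adj    # proper CAR precision D - rho*A
    dag = dagar_precision(adj, rho)
    results[rho] = (avg_neighbor_corr(car, adj), avg_neighbor_corr(dag, adj))
```

Each entry of `results` holds the pair (CAR, DAGAR); on the path graph the DAGAR value coincides with `rho`, while the CAR value falls below it.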

Figure 1: Average neighbor-pair correlations as a function of ρ for the proper CAR and DAGAR models. The solid gray line represents the line x = y.

2.4. Impact of Ordering

Unlike covariance or precision matrices, which remain invariant up to a permutation factor under different orderings of the multivariate vector, Cholesky factors depend on the ordering of the observations. Our model in (8) assumes a predetermined ordering π. We have already seen that for tree and grid graphs, Theorems 2 and 3 guarantee that under many different orderings, the DAGAR model retains desirable properties like homoskedasticity and a neighbor-pair correlation of ρ.

To understand the impact of ordering beyond the variances and neighbor-pair correlations, we consider an order-free model using a product-of-experts type construction (Hinton 2002). Let Pπ denote the permutation matrix corresponding to π, i.e., Pπ(x1, …, xk)T = (xπ(1), …, xπ(k))T for any k-dimensional vector x, and let Q denote the average of the DAGAR precision matrices in (8) over all permutations π, i.e.,

$$Q(\rho) = \frac{1}{k!} \sum_{\pi} P_\pi^T L_\pi^T F_\pi L_\pi P_\pi. \tag{9}$$

It is clear that Q = Q(ρ) is free of any ordering and is only a function of the undirected graph G. Also, since it is the average of positive definite matrices, it is also positive definite. We will use the order-free model Q to understand how the DAGAR precision matrices Qπ differ under different choices of the ordering π. In order to do this, we first note that Q can be expressed in closed form.

Theorem 4.

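Let $i \approx j$ mean that $i$ and $j$ share at least one common neighbor. There exist functions $f(\rho, r)$ and $g(\rho, r)$ for any positive integer $r$ and $0 \le \rho < 1$ such that

$$Q_{ii} = 1 + \frac{n_i \rho^2}{2(1 - \rho^2)} + \frac{\rho^2}{1 - \rho^2} \sum_{j \sim i} f(\rho, n_j),$$
$$Q_{ij} = -\frac{\rho}{1 - \rho^2} I(i \sim j) + \frac{1}{1 - \rho^2} I(i \approx j) \sum_{k:\ k \sim i,\ k \sim j} g(\rho, n_k).$$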

Here $I(\cdot)$ denotes the indicator function and the sum in $Q_{ij}$ runs over the common neighbors of $i$ and $j$. Explicit expressions for $f(\rho, r)$ and $g(\rho, r)$ are provided in the proof of Theorem 4. Let $\mathcal{K}$ denote a set of 'reasonable' orderings that one can consider for a given areal dataset. We note that for two orderings $\pi_1$ and $\pi_2$ in $\mathcal{K}$, the relative difference $\|Q_{\pi_1}(\rho) - Q_{\pi_2}(\rho)\|_F / \|Q(\rho)\|_F$, where $\|\cdot\|_F$ denotes the Frobenius norm, is bounded above by $2 \max_{\pi \in \mathcal{K}} \|Q_\pi(\rho) - Q(\rho)\|_F / \|Q(\rho)\|_F$. We now investigate the asymptotic behavior of this quantity for the path and grid graphs.

Theorem 5.

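Consider the path graph with $k$ nodes, and let $\pi$ denote the left-to-right or right-to-left ordering. Then the relative difference satisfies

$$\lim_{k \to \infty} \frac{\|Q_\pi(\rho) - Q(\rho)\|_F}{\|Q(\rho)\|_F} = \sqrt{\frac{4\rho^8 + 2\rho^4}{(3 + 6\rho^2 + \rho^4)^2 + 18\rho^2(1 + \rho^2)^2 + 2\rho^4}}. \tag{10}$$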

Theorem 5 quantifies asymptotically the relative difference between the DAGAR model and the order free version. Figure 2 plots the quantity on the right hand side of (10) as a function of ρ ∈ [0, 1]. We see that it is a monotonically increasing function of ρ. For small values of ρ the difference is extremely small (0.02 for ρ = 0.25) and even for moderate values of ρ (0.5 − 0.75) the difference is around or less than 10%. Below we also provide the analogous result for the two-dimensional grid graph.
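The values quoted above can be reproduced by evaluating the right-hand side of (10) directly. The sketch below reads that expression as the square root of a ratio of two polynomials in $\rho$, which is the reading consistent with the quoted value of 0.02 at $\rho = 0.25$; the function name `path_limit` is an assumption of this example.

```python
import math

def path_limit(rho):
    """Right-hand side of (10): limiting relative Frobenius difference on a path."""
    num = 4 * rho ** 8 + 2 * rho ** 4
    den = ((3 + 6 * rho ** 2 + rho ** 4) ** 2
           + 18 * rho ** 2 * (1 + rho ** 2) ** 2
           + 2 * rho ** 4)
    return math.sqrt(num / den)

# Evaluate at the five values of rho marked in Figure 2.
grid = (0.0, 0.25, 0.5, 0.75, 1.0)
values = [path_limit(rho) for rho in grid]
```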

Figure 2: Asymptotic relative difference between the DAGAR model and the order-free DAGAR model in terms of the Frobenius norm for the path graph (left) and the grid graph (right). The five numbers on each curve correspond to the values of the difference at ρ = 0, 0.25, 0.5, 0.75 and 1, respectively.

Theorem 6.

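Consider an $m \times m$ grid graph and let $\pi$ denote any of the orderings used in Theorem 3. Then

$$\lim_{m \to \infty} \frac{\|Q_\pi(\rho) - Q(\rho)\|_F}{\|Q(\rho)\|_F} = \sqrt{\frac{\rho^4\Big(\frac{s(\rho)}{5} - \frac{2}{1 + \rho^2}\Big)^2 + 2\Big(\frac{1}{3} - \frac{s(\rho)}{30} - \frac{\rho^2}{1 + \rho^2}\Big)^2 + 12\Big(\frac{1}{6} - \frac{s(\rho)}{60}\Big)^2}{\Big(1 + \rho^2 + \frac{\rho^2 s(\rho)}{5}\Big)^2 + 4\rho^2 + 20\Big(\frac{1}{6} - \frac{s(\rho)}{60}\Big)^2}}, \tag{11}$$

where $s(\rho) = \sum_{r=1}^{4} \frac{r}{1 + (r - 1)\rho^2}$.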

The limit in Theorem 6 looks more complicated than the analogous result for the path graph. However, things are simplified by noting that $1 - s(\rho)/10$ is $O(\rho^2)$ and, hence, so is the numerator on the right-hand side of (11). Figure 2 (b) plots this quantity as a function of $\rho$. We see that once again this is monotonic in $\rho$ and the difference is negligible for small $\rho$. Theorems 5 and 6 show that, at least for small $\rho$, the impact of ordering is negligible, though for larger $\rho$ the DAGAR precision matrices for different orderings will theoretically be somewhat different.
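The behavior of the grid limit can be explored by evaluating the right-hand side of (11) numerically. As with the path graph, the sketch below reads the expression as the square root of a ratio; the function names `s` and `grid_limit` are assumptions of this example.

```python
import math

def s(rho):
    """s(rho) = sum over r = 1..4 of r / (1 + (r - 1) rho^2)."""
    return sum(r / (1 + (r - 1) * rho ** 2) for r in range(1, 5))

def grid_limit(rho):
    """Right-hand side of (11): limiting relative Frobenius difference on a grid."""
    sr = s(rho)
    g = 1 / 6 - sr / 60
    num = (rho ** 4 * (sr / 5 - 2 / (1 + rho ** 2)) ** 2
           + 2 * (1 / 3 - sr / 30 - rho ** 2 / (1 + rho ** 2)) ** 2
           + 12 * g ** 2)
    den = (1 + rho ** 2 + rho ** 2 * sr / 5) ** 2 + 4 * rho ** 2 + 20 * g ** 2
    return math.sqrt(num / den)

# Evaluate at the five values of rho marked in Figure 2.
grid = (0.0, 0.25, 0.5, 0.75, 1.0)
values = [grid_limit(rho) for rho in grid]
```

At $\rho = 0$ we have $s(0) = 10$, so every term in the numerator vanishes, in line with the remark that $1 - s(\rho)/10$ is $O(\rho^2)$.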

While these results are restricted to the case of special graphs, for an arbitrary areal dataset, one approach would be to use simple intuitive orderings based on the coordinates representing the nodes in some Euclidean embedding of G. Similar strategies have been used in Cholesky factor based approaches in Datta et al. (2016a), Stein et al. (2004) and Vecchia (1988), who observed empirically that the joint distribution seemed to be less sensitive to the ordering of the regions. Our own set of simulations, detailed in Section 3.2, will confirm these findings, as we observe that results corresponding to different orderings are similar for both regular and irregular graphs.

The order-free model, owing to the availability of the precision matrix in closed form, can be deemed a worthy candidate for analyzing areal data given its liberation from a synthetic ordering. Our simulation analyses, detailed in Section 3.1, show that for a wide range of scenarios, the performances of the DAGAR model and its order-free version were very similar. However, the order-free model has certain disadvantages. From Theorem 4, it is clear that for $i \neq j$, $Q_{ij} \neq 0$ if and only if either $i \sim j$ or $i \approx j$. Hence, the sparsity of $Q$ is $e_2$, where $e_2$ is the number of edges in the second-order graph created from $G$. As $e_2 > e$, $Q$ is less sparse than the precision matrices for the original DAGAR model in Section 2.2 or the CAR models. Furthermore, unlike the DAGAR precision matrix $Q_\pi$, $Q$ does not have a closed form expression for the determinant, which invokes expensive computations. These computational roadblocks limit the possibility of using the order-free model for larger datasets. The results of Theorems 2 and 3 about interpretability of the parameter $\rho$ also do not carry over to the order-free model.
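For a very small graph, the average in (9) can be computed by brute force over all $k!$ orderings, which makes the sparsity pattern discussed above easy to inspect. The sketch below does this for a hypothetical five-vertex graph; it is not the closed form of Theorem 4, and the helper names are assumptions of this example.

```python
import itertools
import numpy as np

def dagar_precision(adj, rho, order):
    """DAGAR precision in the original labelling for the ordering `order`."""
    k = adj.shape[0]
    pos = np.empty(k, dtype=int)
    pos[np.asarray(order)] = np.arange(k)   # rank of each vertex under pi
    B, tau = np.zeros((k, k)), np.empty(k)
    for i in range(k):
        parents = [j for j in np.flatnonzero(adj[i]) if pos[j] < pos[i]]
        denom = 1.0 + (len(parents) - 1) * rho ** 2
        tau[i] = denom / (1.0 - rho ** 2)
        for j in parents:
            B[i, j] = rho / denom
    L = np.eye(k) - B
    return L.T @ np.diag(tau) @ L

def order_free_precision(adj, rho):
    """Average of the DAGAR precision matrices over all k! orderings."""
    perms = list(itertools.permutations(range(adj.shape[0])))
    return sum(dagar_precision(adj, rho, p) for p in perms) / len(perms)

# Hypothetical graph: path 0-1-2-3-4 with an extra edge between 1 and 3.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 3)]
k, rho = 5, 0.5
adj = np.zeros((k, k), dtype=int)
for a, b in edges:
    adj[a, b] = adj[b, a] = 1

Q_free = order_free_precision(adj, rho)

# Off-diagonal pattern predicted above: neighbors, or pairs with a common neighbor.
off_diag = ~np.eye(k, dtype=bool)
expected_pattern = ((adj + adj @ adj) > 0) & off_diag
observed_pattern = (np.abs(Q_free) > 1e-12) & off_diag
```

In this graph vertices 0 and 4 are neither neighbors nor share a neighbor, so `Q_free[0, 4]` is zero while every other off-diagonal entry is non-zero.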

3. Data analysis

3.1. Data generated using an exponential Gaussian Process

Models for areal datasets are typically used as priors for areal random effects in a hierarchical setup. For example, let $y_i$ denote the response observed at region $i$ and $x_i$ denote the corresponding set of covariates. A spatial generalized linear mixed model framework assumes $h(E(y_i)) = x_i^T \beta + w_i$, where $h(\cdot)$ denotes a suitable link function. Subsequently, a hierarchical areal model is customarily specified as

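$$\prod_{i=1}^{k} \mathrm{pr}(y_i \mid x_i^T \beta + w_i, \theta) \times N(w \mid 0, \tau_w Q(\rho)) \times \mathrm{pr}(\beta, \tau_w, \rho, \theta), \tag{12}$$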

where $Q(\rho)$ denotes the precision matrix of the areal model, $\mathrm{pr}(y_i \mid x_i^T \beta + w_i, \theta)$ denotes the density corresponding to the link $h(\cdot)$, and $\mathrm{pr}(\beta, \tau_w, \rho, \theta)$ is the prior for the parameters. If $h(\cdot)$ is the identity link, e.g., the responses are Gaussian, then we can exploit conjugacy for generating $w$ in a sampler. However, for non-Gaussian responses, we have to sample $w$ and the other parameters using a Metropolis random walk sampler from the joint density in (12).

We conducted simulation experiments assessing the performance of the areal data models using the three graph structures described in Section 2.3: path, grid and US states. For each graph, we embed the vertices on the Euclidean plane and generate the spatial random effect vector w from a Gaussian process, i.e., w ~ N(0, τwM) where M⁻¹ is the covariance matrix corresponding to an exponential (Matérn1/2) GP, i.e., (M⁻¹)ij = exp(−ϕ d(i, j)) with d(i, j) denoting the distance between the embeddings of the ith and jth vertices. The path graph has a distance preserving embedding in the Euclidean plane such that d(i, j) = |i − j|. We embed the grid graph within a 10 × 10 grid in the Euclidean plane. Although the resulting distance matrix is not identical to the shortest distance (or geodesic) matrix on the graph, the distance between each neighbor-pair remains one. For the United States graph, we use the centroid of each state to create the distance matrix. We scale the distance matrix so that the average neighbor-pair distance is one. To generate w, we use τw = 0.25, τe = 2.5 and ϕ = −log(j/10) for j = 1, …, 9. This implies that for the exponential GP the average neighbor-pair correlation ρ = exp(−ϕ) varies between 0.1 and 0.9, thereby covering a wide spectrum of scenarios. Subsequently, we generate the response y comprising independent yi = xi⊤β + wi + ϵi, where xi is a 2 × 1 vector comprising two independent standard normal variables, β = (1, 5)⊤, and the ϵi's are independent N(0, τe).
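The data-generating mechanism described above can be sketched in Python for the path graph. This is a minimal illustration rather than the authors' code. It assumes that τw and τe act as precisions (so that cov(w) = M⁻¹/τw and var(ϵi) = 1/τe, in line with the N(w | 0, τwQ(ρ)) notation in (12)); the function name `simulate_path_data`, the number of vertices `k=100` and the seed are hypothetical choices.

```python
import numpy as np

def simulate_path_data(k=100, j=5, tau_w=0.25, tau_e=2.5, beta=(1.0, 5.0), seed=0):
    """Simulate (y, X, w) on a path graph with an exponential GP random effect.

    Assumption: tau_w and tau_e are precisions; k and seed are illustrative.
    """
    rng = np.random.default_rng(seed)
    phi = -np.log(j / 10.0)                      # decay, so that rho = exp(-phi) = j / 10
    idx = np.arange(k)
    dist = np.abs(idx[:, None] - idx[None, :])   # d(i, j) = |i - j| on the path
    corr = np.exp(-phi * dist)                   # exponential GP correlation matrix
    chol = np.linalg.cholesky(corr / tau_w)      # factor of cov(w) = corr / tau_w
    w = chol @ rng.standard_normal(k)            # spatial random effects
    X = rng.standard_normal((k, 2))              # two independent N(0, 1) covariates
    eps = rng.standard_normal(k) / np.sqrt(tau_e)
    y = X @ np.asarray(beta) + w + eps
    return y, X, w, corr

y, X, w, corr = simulate_path_data(j=3)
print(corr[0, 1])   # correlation between neighboring vertices, equal to j / 10 up to rounding
```

Looping `j` over 1, …, 9 and over replicate seeds would reproduce the structure of the scenarios described in the text, under the stated assumptions.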

We fitted all models using the hierarchical setup in (12) with six different choices of Q(ρ): 1) proper CAR, 2) ICAR, 3) a scaled ICAR model (as suggested by one reviewer) of Sørbye and Rue (2014), which specifies a prior for τw in the ICAR model as τw ~ Gamma(2, σ²ref) where

$$\sigma^2_{\mathrm{ref}} = \exp\left(\frac{1}{k}\sum_{i=1}^{k}\log\big((Q^{+})_{ii}\big)\right),$$

Q⁺ denoting the Moore-Penrose inverse of Q, 4) the dimension-reducing sparse GLMM (Hughes and Haran 2013), and the 5) ordered (DAGAR) and 6) order-free (DAGAROF) directed acyclic graph autoregressive models. To create the directed acyclic graph for our model, we used the ordering based on the sum of the co-ordinates of the mappings of the vertices. The sparse GLMM was implemented directly using the ngspatial R-package (Hughes and Cui 2018). For the remaining models, we used conjugate priors β ~ N(0, 0.001I), τw ~ Gamma(2, 1) (except for the scaled ICAR, for which the prior for τw is specified above) and τe ~ Gamma(2, 0.1). For the proper CAR and the two DAGAR models, the spatial correlation parameter ρ was assigned a Unif(0, 1) prior. For each combination of parameter values we used 100 replicate datasets.
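The reference variance used in the scaled ICAR prior is the geometric mean of the diagonal of the Moore-Penrose inverse of Q. The sketch below computes it with NumPy. It assumes that Q is the usual ICAR precision D − A built from a binary adjacency matrix A (D being the diagonal matrix of neighbor counts); the function name `icar_reference_variance` and the tolerance passed to `pinv` are illustrative choices, not part of the original analysis.

```python
import numpy as np

def icar_reference_variance(adj):
    """Geometric mean of diag(Q^+), with Q = D - A assumed to be the ICAR precision."""
    adj = np.asarray(adj, dtype=float)
    Q = np.diag(adj.sum(axis=1)) - adj                       # singular ICAR precision
    Q_plus = np.linalg.pinv(Q, rcond=1e-10, hermitian=True)  # Moore-Penrose inverse
    return float(np.exp(np.mean(np.log(np.diag(Q_plus)))))

# Two connected regions: Q = [[1, -1], [-1, 1]] and Q^+ = Q / 4.
two_regions = np.array([[0, 1], [1, 0]])
print(icar_reference_variance(two_regions))
```

For the two-region example the diagonal of Q⁺ is (0.25, 0.25), so the reference variance is 0.25 up to floating point error.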

Figure 3 plots the mean square error (MSE) between the true and estimated w averaged over 100 replicates for each scenario. We first observe that the mean square errors for the ICAR-based models (ICAR, scaled ICAR and sparse GLMM) are significantly higher than the other three models for all three graphs. The scaled ICAR was the best among these three, producing substantially lower MSEs than the original ICAR and the sparse GLMM. The sparse GLMM also produced lower MSEs than ICAR for path and USA graphs, but had slightly higher MSE for the grid graph. However, the three models involving the additional ρ parameters, i.e., the proper CAR and the two DAGAR models, consistently produced lower MSEs. For the path graph there is no significant difference in MSE among these three models. However, for the grid and USA graphs when ρ is small, the DAGAR models yielded substantially lower errors. We also noticed that, in terms of MSE, there was very little difference between the performance of the ordered DAGAR model (8) and its order-free analogue (9) for most of the scenarios. This result is encouraging as the scalability of the former is a pragmatic solution for analyzing very large areal datasets or networks.

Figure 3:

MSE as a function of the true ρ (x-axis) for the simulation data analysis using data generated from an exponential GP

Next, we consider estimation of ρ, as ensuring interpretability of ρ is the motivation driving the construction of the DAGAR model. As the data was generated using an exponential GP, ρ is the unit-distance spatial correlation. We look at the estimates and credible bands of ρ in Figure 4 for the three models involving ρ (the ICAR-based models do not involve ρ and, hence, are not included). We see that for all three graphs, estimates of ρ from the proper CAR model are considerably higher than the true value. The bias is especially stark when the true ρ is small. The DAGAR models generally perform much better in this respect, with much less estimation bias, particularly for small ρ. For larger ρ, the order-free model performs slightly better, with the ordered model demonstrating some downward bias. The 95% confidence bands for both the DAGAR models cover the true value of ρ in most scenarios for all three graphs, while the bands for the proper CAR clearly miss many of the true ρ values.

Figure 12:

Estimate and confidence bands of ρ as a function of the true ρ (x-axis) for the simulation data analysis using Poisson responses

Finally, in Figure 5 we plot the coverage probabilities (CP), defined as the proportion of the 100 replications in which the 95% confidence interval for a parameter covers its true value. The regression coefficients β1, β2 and the error variance σe² = 1/τe are common to all the models, and hence we compare the CPs for all six models for these parameters. For ρ, we only compare the three models using ρ. We do not compare σw² = 1/τw, as it has a different interpretation for different models. For example, it is the homoskedastic spatial variance for the ordered DAGAR model, but simply a scale factor for the heteroskedastic spatial variances in the CAR models. We see that for ρ, the two DAGAR models offer significantly improved coverage over the proper CAR model. This trend was already reflected in the confidence bands in Figure 4 and is confirmed here. For smaller ρ the coverages of the two DAGAR models are nearly identical to the nominal level of 95%. While the coverages decline for larger values of ρ, they are still almost uniformly and substantially better than the coverage provided by the proper CAR model.

Figure 5:

Coverage probabilities of the parameters as a function of the true ρ (x-axis) for the simulation data analysis using data generated from an exponential GP
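The coverage probability used in these comparisons is simply the fraction of replicates whose interval contains the true parameter value. A minimal sketch follows; the function name and the toy interval endpoints are hypothetical and are not taken from the simulation output.

```python
import numpy as np

def coverage_probability(lower, upper, truth):
    """Fraction of replicate intervals [lower, upper] that contain the true value."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    covered = (lower <= truth) & (truth <= upper)   # one indicator per replicate
    return float(np.mean(covered))

# Three toy replicate intervals for a parameter whose true value is 0.45.
cp = coverage_probability([0.10, 0.40, 0.60], [0.50, 0.90, 0.80], truth=0.45)
print(cp)   # two of the three intervals contain 0.45
```

In the simulation study each call would receive the 100 lower and upper interval endpoints obtained for one parameter under one scenario.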

For the regression coefficients, β1 and β2, we see that all models except the sparse GLMM offered satisfactory coverage, close to 95%. This is not surprising as estimates of regression coefficients are typically robust to variance misspecification. The under-coverage of the sparse GLMM is also expected as it tries to adjust for spatial confounding based on an underlying model assumption, which can lead to worse estimates if that assumption is violated. If we generated data in a way such that the eigenvectors corresponding to non-zero eigenvalues of the covariance matrix are uncorrelated with the covariates, then it is likely that the sparse GLMM will produce the most accurate estimates, as the DAGAR does not adjust for spatial confounding. However, our focus in this manuscript is not on spatial confounding and the data generation scheme we used is extremely common for geo-spatial settings. The dimension reduction approach used by the sparse GLMM, with the ICAR model as its baseline, can possibly also be adopted for the DAGAR models to yield versions that guard against confounding. However, care has to be taken to avoid eigen decompositions of the covariance matrix at every step of the MCMC, as the DAGAR models, unlike the ICAR, involve ρ whose value will be updated at every iteration. We identify this as one of the future research directions.

Turning to the error variance σe² = 1/τe, we first note that the ICAR-based models perform surprisingly poorly, offering almost zero coverage for small ρ for all three graphs. Only the scaled ICAR offers somewhat improved coverage for larger values of ρ. The CPs from the DAGAR models are once again close to 95% for smaller values of ρ but decline for larger ρ. The proper CAR generally offers coverage worse than the DAGAR models and better than the ICAR-based models.

Reliable estimation and inference for spatial covariance parameters is a notoriously difficult problem. These results for the coverage probabilities of the spatial parameters clearly demonstrate the value of our interpretable model in delivering more accurate inference about the parameters for areal data. The results present strong evidence for the superiority of the DAGAR model both in terms of effectively recovering the latent spatial surface and the ability to assess hypotheses related to parameters describing the spatial structure in the data.

3.2. Analyses using different orderings

The DAGAR model used in the analyses in Section 3.1 for the USA graph was constructed by ordering the nodes (states) from the southwest to the northeast. In this section, we repeat the analysis for the DAGAR model using three other orderings which start at southeast, northwest and northeast respectively and go approximately diagonally to the opposite end of the map.

Figure 6 plots the average MSE (left) and the estimates and confidence bands for ρ (right) over 100 replicated datasets for the DAGAR model using these three orderings and the original ordering used in Section 3.1. We see that the ordering has little impact on the results, as the MSE as well as the estimates and confidence bands for ρ for the four different orderings are nearly indistinguishable.

Figure 6:

MSE (left) and estimates and confidence bands of ρ (right) as a function of the true ρ (x-axis) for four different orderings of the DAGAR model

3.3. Data generated using CAR and DAGAR

In Sections 3.1 and A.1, we generated the data using Gaussian Processes to ensure that the data generating mechanism is different from all the models we are fitting to the data (except for the path graph for data generated using an exponential GP). In this Section, we considered simulation schemes where the data were generated using the DAGAR or the proper CAR model for all three graphs. All the parameter choices were kept the same as in Section 3.1 and 100 replicates were used for each setting.

Figures 7 and 8 plot the average MSE when the data is generated using a DAGAR model and a proper CAR model, respectively. We see that when the data is generated using a DAGAR covariance (Figure 7), the DAGAR model substantially outperforms all the ICAR-based models with significantly lower MSE for all three graphs and all values of ρ. For the path graph, the proper CAR also produces MSEs similar to the DAGAR model, whereas for the grid and USA graphs its MSE is higher than those of the DAGAR models for smaller values of ρ. The trends in MSE are broadly similar to what was observed in Sections 3.1 and A.1. When data is generated using a proper CAR, Figure 8 reveals that the DAGAR models, along with the proper CAR model (which is the true model), once again produce MSEs substantially lower than the ICAR-based models for all three graphs.

Figure 7:

MSE as a function of the true ρ (x-axis) for the simulation data analysis using data generated from a DAGAR model

Figure 8:

MSE as a function of the true ρ (x-axis) for the simulation data analysis using data generated from a proper CAR model

3.4. Additional analyses

The Supplemental file contains additional analyses for a) simulated data generated with spatial random effects coming from a smoother Matérn3/2 GP instead of the exponential GP (Section A.1), and b) non-Gaussian (Poisson) areal data (Section A.2). Overall, the findings from these analyses concur with the results presented here. Across all the simulation scenarios, we found the performance of the DAGAR model to be remarkably robust, uniformly producing the lowest MSEs and accurately estimating the regression coefficients, error variances, and, most remarkably, the spatial correlation ρ, which is considered to be notoriously difficult to estimate. With the exception of data generated on a path graph using an exponential GP, all the other settings effectively correspond to misspecified models, and the estimates of the regression coefficients from the DAGAR model were quite robust to this. The proper CAR model also performed quite well, often producing MSEs close to those from the DAGAR models except for cases when the true spatial correlation was weak, i.e., ρ was small. However, both estimation and inference for ρ from the proper CAR model were generally much less accurate than from the DAGAR model. We note that across all scenarios, the ICAR-based models generally performed quite poorly, producing large MSEs, with the scaled ICAR being the best in this class. Finally, the order-free DAGAR model produced results very similar to the ordered DAGAR model for all scenarios. However, it was much slower, especially due to determinant calculations. While using state-of-the-art sparse matrix algorithms would help reduce the computing times for the order-free model, that is not the focus of the current manuscript and hence we do not consider the order-free model for the real data analyses.

4. County-level US infant mortality data

We now analyze a large areal dataset using the DAGAR model. The dataset consists of counts of infant births Bi and deaths Di for each of 3071 US counties. County-specific covariates, which possibly affect infant death rates, were available and include number of births with low weight (lowi), percentages of black residents (blacki) and Hispanic residents (Hispi), a Gini index measuring income disparity (ginii), social affluence (affi), and a measure of residential stability (stabi). The dataset is publicly available in the ‘ngspatial’ package in R and was analyzed in Hughes and Haran (2013), where more description of the data is available.

We analyzed this dataset using a Poisson spatial regression model where each Di is modeled as an independent Poisson random variable with mean Bi exp(α + β1 lowi + β2 blacki + β3 Hispi + β4 ginii + β5 affi + β6 stabi + wi), where the wi's are the spatial random effects. We assign α and the βi's independent N(0, 10⁻⁴) priors. We present the results for w ~ N(0, τwQ) where Q is either the DAGAR or the ICAR model. We could not implement the proper CAR model for such a large dataset. However, we also add the results of the sparse spatial GLMM model. In addition to presenting the parameter estimates and confidence intervals, we also use model comparison metrics to evaluate the three covariance models. We used the Deviance Information Criterion (DIC, Spiegelhalter et al. 2002) to compare the posterior distributions. Table 1 presents the results for the three models. Among the seven regression coefficients, estimates for six of them were similar across the three models, with each of the credible intervals yielding the same inference. The exception to this was β4, whose estimates differed substantially between the sparse GLMM and the other two models. The sparse GLMM was the only one yielding a credible interval that does not cover zero.
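To make the likelihood in this model concrete, the following sketch evaluates the Poisson log-likelihood with the births Bi entering as a multiplicative offset on the mean. It is only an illustration of the model equation stated above: the function name, argument layout and toy inputs are hypothetical, and the sketch does not reproduce the sampler used for the analysis.

```python
import math
import numpy as np

def poisson_loglik(deaths, births, X, alpha, beta, w):
    """Log-likelihood of D_i ~ Poisson(B_i * exp(alpha + x_i' beta + w_i))."""
    deaths = np.asarray(deaths, dtype=float)
    births = np.asarray(births, dtype=float)
    X = np.asarray(X, dtype=float)
    eta = alpha + X @ np.asarray(beta, dtype=float) + np.asarray(w, dtype=float)
    mu = births * np.exp(eta)                                     # Poisson means
    log_fact = np.array([math.lgamma(d + 1.0) for d in deaths])   # log(D_i!)
    return float(np.sum(deaths * np.log(mu) - mu - log_fact))

# Toy county: 100 births, 2 deaths, six covariates set to zero, so mu = 100 * exp(alpha).
ll = poisson_loglik([2], [100], np.zeros((1, 6)), math.log(0.02), np.zeros(6), [0.0])
print(ll)
```

In a full analysis this quantity would be combined with the prior on w and on the remaining parameters, as in (12).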

Table 1:

Parameter estimates (posterior medians) and model comparison metrics for the US infant mortality data. The numbers inside parentheses indicate the lower and upper bounds of the 95% credible intervals

| Parameter | DAGAR | ICAR | sparse SGLMM |
| --- | --- | --- | --- |
| α | −5.623 (−5.944, −5.353) | −5.641 (−5.871, −5.413) | −5.430 (−5.616, −5.246) |
| β1 | 7.803 (6.438, 9.172) | 7.716 (3.924, 9.166) | 8.777 (7.540, 10.032) |
| β2 | 0.00376 (0.00208, 0.00543) | 0.00364 (0.00182, 0.00915) | 0.00423 (0.00288, 0.00556) |
| β3 | −0.00347 (−0.00501, −0.00189) | −0.00286 (−0.00859, −0.00262) | −0.00379 (−0.00488, −0.00270) |
| β4 | −0.0616 (−0.570, 0.480) | 0.103 (−0.425, 0.631) | −0.555 (−0.977, −0.125) |
| β5 | −0.0770 (−0.0911, −0.0632) | −0.0778 (−0.0935, −0.0616) | −0.0757 (−0.0877, −0.0638) |
| β6 | −0.0413 (−0.0590, −0.0234) | −0.0448 (−0.0643, −0.0249) | −0.0285 (−0.0433, −0.0138) |
| τw | 7.544 (3.615, 12.866) | 32.080 (14.11, 39.87) | 9.450 (3.870, 16.459) |
| ρ | 0.987 (0.974, 0.995) | | |
| DIC | 10145.8 | 9902.0 | 10110.0 |

We do not know if the difference for β4 was due to the sparse GLMM accounting for spatial confounding, as this can only be answered depending on what we believe the true data generation process was. We have seen consistently in the simulation analyses using the usual data generation paradigm how the sparse GLMM offered higher MSEs and poor inference on the regression coefficients. Also, while the credible intervals for the sparse GLMM excluded zero for all 7 regression coefficients (compared to 6 for the other two models), it also produced a slightly higher DIC than the ICAR model despite being a dimension reduction approach with fewer parameters.

DAGAR was the only model to accommodate ρ, and the estimate and confidence intervals suggest strong spatial correlation. We have seen consistently from the simulation exercises that when the underlying spatial correlation is strong, the DAGAR model performs similarly to the CAR models. This is consistent with what we observe here. In fact, the DIC values of the three models are within approximately 1% of each other, demonstrating the competitive performance of the DAGAR model even for large datasets in a non-Gaussian setup under strong spatial dependence. Moreover, through the estimation of ρ, DAGAR provides additional insight about the spatial dependence that is not offered by the ICAR model or the sparse SGLMM.

5. Discussion

The existing repertoire of covariance models used for analyzing areal datasets is extremely limited. In this manuscript, we have developed an alternative parametric model for areal datasets that promises to be a significant addition to this inventory. The parametric DAGAR models we have proposed in Section 2 are novel, offer a greater degree of interpretability than the CAR models, and will be scalable for large datasets. We observe that when spatial dependence is weak or modest, the DAGAR model excels over both variants of the CAR model, while the results are similar when there is strong spatial correlation. Since the magnitude of the underlying spatial correlation is unknown a priori in most real life applications, we believe the DAGAR model will be a useful alternative to the CAR models. While the ordering of the locations for the DAGAR model is artificial, the theoretical results and extensive simulations strongly suggest that substantive inference from the DAGAR model is expected to be robust to the ordering.

Analyzing the prevalence of many diseases simultaneously in a multivariate setup is becoming increasingly important to accommodate the correlations among different disease prevalences. Many of the popular approaches rely on Cholesky factors of conditionally autoregressive precision matrices (Gelfand and Vounatsou 2003; Martinez-Beneito 2013; Martinez-Beneito et al. 2017) which can be computationally prohibitive for large k. Our ordered model lends itself naturally to these settings due to the readily available Cholesky decomposition and, hence, promises to broaden the inventory of multivariate disease mapping models. The ordered model also offers a coherent way of modeling on arbitrary graphs or networks of growing size, i.e., if a new point is added to the graph, the nested distributions remain the same, unlike any of the other three models considered here.

A. Additional simulation analyses

A.1. Data generated using a smoother Gaussian Process

As pointed out by one reviewer, in the simulation settings of Section 3.1, the data generation model using an exponential GP becomes the same as the DAGAR model for the path graph. While this is not true for the grid and USA graphs, and the results were generally consistent across the choice of the graphs, in this section we tried a different data generation model to assess the performance of the areal models. Keeping all other model specifications the same, we generated the spatial random effects wi using a smoother Matérn3/2 GP instead of an exponential GP. This ensures that the data generation model does not correspond to any of the six models fitted to the data for any of the three graphs.

We first look at the mean square error in terms of estimating the latent spatial random effects in Figure 9. We see similar trends as in the case of exponential GP. The MSEs from the ICAR models are much higher, with the scaled ICAR, once again, producing lower MSE than the original ICAR and sparse GLMM. The sparse GLMM was better than the ICAR for path and grid graph but was worse for the USA graph. The proper CAR and the two DAGAR models produced lower MSEs than these ICAR-based models for all three graphs, with the improvement more prominent for smaller ρ. For smaller ρ, we also see that the DAGAR models produce lowest MSE among all the six models, whereas for larger ρ, the MSEs for most of the models are similar.

Figure 9:

MSE as a function of the true ρ (x-axis) for the simulation data analysis using data generated from a Matérn3/2 GP

We also briefly summarize the comparison of the models based on inference (CP) on the parameters involved. We only look at the common parameters β1, β2 and σe². We do not consider ρ as, unlike the exponential GP, the spatial decay parameter in the Matérn3/2 GP does not have a simple relationship with ρ. Figure 10 provides the coverage probabilities of the three parameters as a function of ρ. We see once again that the trends observed for the exponential GP data analysis in Section 3.1 carry over to here. The coverages for the regression coefficients are close to 95% for all the models except the sparse GLMM. For σe², all models produce under-coverage for larger values of ρ. For smaller values of ρ, however, the coverages of the proper CAR and the two DAGAR models are close to 95%.

Figure 10:

Coverage probabilities of the parameters as a function of the true ρ (x-axis) for the simulation data analysis using Gaussian data generated from a Matérn3/2 GP

A.2. A non-Gaussian example

In this Section, we conduct a simulation study using a non-Gaussian response. We generate independent yi ~ Poisson(exp(xi⊤β + wi)) where the spatial random effect vector w = (w1, w2, …, wk) is generated as a realization from an exponential GP, akin to Section 3.1. All other parameter and covariate choices remain unchanged from the previous simulations, and the same set of six candidate models is assessed.

We first compare the MSEs (Figure 11), which are quite close for all the models except for the sparse GLMM and ICAR (for the path graph), which produce significantly higher MSEs. The DAGAR models produced the lowest MSEs for the USA graph, and the joint lowest MSEs along with the proper CAR model for the path graph. We then compare the estimation of ρ for the DAGAR and proper CAR models, as for an exponential GP, ρ corresponds to the correlation at unit distance and the data generation ensured that on average the neighboring units are separated by unit distance (see Section 3.1). The estimates and confidence bands in Figure 12 demonstrate how the DAGAR model accurately estimates the spatial correlation between neighbors even when the data is non-Gaussian, whereas the estimates from the CAR model are far off, akin to the Gaussian case. Similarly, the coverage probabilities of the parameters in Figure 13 repeat the trends observed in Figure 5 for the Gaussian case, with all models except the sparse GLMM offering close to 95% coverage for the regression coefficients, and the DAGAR models offering substantially better coverage for ρ than the proper CAR model.

Figure 13:

Coverage probabilities of the parameters as a function of the true ρ (x-axis) for the simulation data analysis using Poisson responses

B. Proofs

B.1. Proof of Theorem 1

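Let $r = \mathrm{rank}(Q)$ and let $Q^{+}$ denote the Moore-Penrose inverse of $Q$. Then, by Theorem 1 part (b) of Higham (1990), there exists a permutation matrix $P$ such that

$$P^\top Q^{+} P = R^\top R \quad \text{where} \quad R = \begin{bmatrix} R_1 & R_2 \\ 0 & 0 \end{bmatrix}.$$

Here $R_1$ is an $r \times r$ upper triangular matrix with positive diagonal elements. Let $D_1 = \mathrm{diag}(R_1)$ and $R_1^{*} = D_1^{-1} R_1$, which has ones on the diagonal. We can now write

$$R = DU \quad \text{where} \quad U = \begin{bmatrix} R_1^{*} & D_1^{-1} R_2 \\ 0 & I \end{bmatrix} \quad \text{and} \quad D = \begin{bmatrix} D_1 & 0 \\ 0 & 0 \end{bmatrix}.$$

Since $U$ is an upper triangular matrix with ones on the diagonal, $L = U^{-\top}$ is a lower triangular matrix with ones on the diagonal. Hence, $P^\top Q P = (I - B)^\top F (I - B)$ where $F = (D^{+})^2$ and $B = I - L$ is a strictly lower triangular matrix.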

B.2. Proof of Theorem 2

First of all, as $T$ is a tree, it is always possible to have an ordering $\pi$ such that $n_\pi(i) = 1$ for any $i \neq \pi(1)$. For example, the orderings corresponding to any pre-order or breadth-first tree traversal of $T$ will satisfy this. Without loss of generality we rename the nodes such that $\pi = \{1, \ldots, k\}$ and, for $i > 1$, $p(i)$ denotes the directed neighbor of $i$ in $\pi$, implying $p(i) < i$. Letting $w_0 = 0$, $p(1) = 0$ and $n_\pi(1) = 0$, the model in (7) reduces to $w_i = \rho\, w_{p(i)} + (1 - \rho^2)^{0.5} \epsilon_i$ where the $\epsilon_i$ are independent standard normal variables. We shall show that for any positive integers $j \le i \le k$, $\mathrm{cov}(w_i, w_j) = \rho^{d_{ij}}$. We prove this using the strong form of mathematical induction. Since $p(2) = 1$, it is easy to verify this for $i = 2$. We assume that this is true for all indices $2, \ldots, i - 1$. It immediately follows that $\mathrm{var}(w_i) = \rho^2\, \mathrm{var}(w_{p(i)}) + (1 - \rho^2)\, \mathrm{var}(\epsilon_i) = 1$. For any $j < i$, $\mathrm{cov}(w_i, w_j) = \rho\, \mathrm{cov}(w_{p(i)}, w_j) = \rho^{1 + d_{p(i)j}}$ (by induction). Since $T$ is acyclic and $n_\pi(i) = 1$ for all $i > 1$, the shortest path from $j$ to $i$ runs through $p(i)$. Hence, $d_{ij} = d_{p(i)j} + 1$ and the result follows.
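The conclusion of this proof can be checked numerically on a small tree. The sketch below builds the precision matrix $(I - B)^\top F (I - B)$ implied by the recursion above (the root is taken to have unit variance, and every other node has weight $\rho$ on its parent and conditional precision $1/(1 - \rho^2)$), inverts it, and compares the result with $\rho^{d_{ij}}$. The parent vector describing the tree, the function names and the value of `rho` are hypothetical choices made for illustration.

```python
import numpy as np

def tree_dagar_precision(parent, rho):
    """Precision (I - B)^T F (I - B) for a tree given by parent[i] (None for the root)."""
    k = len(parent)
    B = np.zeros((k, k))
    F = np.ones(k)                                # the root has unit marginal precision
    for i, p in enumerate(parent):
        if p is not None:
            B[i, p] = rho                         # w_i = rho * w_p(i) + noise
            F[i] = 1.0 / (1.0 - rho**2)           # conditional precision of w_i
    IB = np.eye(k) - B
    return IB.T @ (F[:, None] * IB)

def tree_distances(parent):
    """Graph distances on the tree, found through the lowest common ancestor."""
    def path_to_root(i):
        path = [i]
        while parent[path[-1]] is not None:
            path.append(parent[path[-1]])
        return path
    k = len(parent)
    D = np.zeros((k, k), dtype=int)
    for i in range(k):
        depth_from_i = {v: d for d, v in enumerate(path_to_root(i))}
        for j in range(k):
            for d_j, v in enumerate(path_to_root(j)):
                if v in depth_from_i:             # first shared ancestor of i and j
                    D[i, j] = depth_from_i[v] + d_j
                    break
    return D

parent = [None, 0, 0, 1, 1, 2, 5]                 # hypothetical tree with p(i) < i
rho = 0.7
cov = np.linalg.inv(tree_dagar_precision(parent, rho))
print(np.allclose(cov, rho ** tree_distances(parent)))
```

The comparison should hold for any tree whose nodes are numbered so that each parent precedes its children, which is the ordering assumed in the proof.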

B.3. Proof of Theorem 3

If $i + j = i' + j'$, then $(i, j)$ and $(i', j')$ are never neighbors. Hence, without loss of generality, we prove the result for $\pi = (S_2, \ldots, S_{m+n})^\top$ where $S_r = \{(i, j) \mid i + j = r\}$. Let $d((i, j), (i', j')) = |i - i'| + |j - j'|$ denote the Manhattan distance on $G$, and $w_{S_r}$ denote the sub-vector of $w$ corresponding to the indices in $S_r$. It is enough to show by induction on $r$ that $w_{S_r} \sim N(0, (\rho^{D_r})^{-1})$ where $D_r$ denotes the distance matrix on $S_r$. This holds trivially for $r = 2$. Let us assume that it holds true for $r - 1$. If $(1, r - 1) \in S_r$, we define $w_{(0, r-1)} = \rho\, w_{(1, r-1)} + \epsilon_{(0, r-1)}$ where $\epsilon_{(0, r-1)} \sim N(0, 1/(1 - \rho^2))$ is independent of $w$. If $(r - 1, 1) \in S_r$ we define $w_{(r-1, 0)}$ similarly. Let $w^{*}_{S_{r-1}}$ be the augmented vector which includes $w_{(0, r-1)}$ or $w_{(r-1, 0)}$ or both, along with $w_{S_{r-1}}$. From the construction, $w_{(0, r-1)} = \rho^2 w_{(1, r-2)} + \rho\, \epsilon_{(1, r-1)} + \epsilon_{(0, r-1)}$, implying $\mathrm{var}(w_{(0, r-1)}) = 1$ and $\mathrm{cov}(w_{(0, r-1)}, w_{(1, r-2)}) = \rho^2$. Hence $\mathrm{cov}(w^{*}_{S_{r-1}}) = \rho^{D_r^{*}}$ where $D_r^{*}$ is the augmented distance matrix corresponding to $S^{*}_{r-1}$. Letting $\rho^2 = u$, we have for any $(i, j)$ and $(i', j')$ in $S_r$,

$$\begin{aligned}
\mathrm{cov}(w_{(i,j)}, w_{(i',j')}) &= \frac{u}{(1+u)^2}\Big(\mathrm{cov}(w_{(i-1,j)}, w_{(i'-1,j')}) + \mathrm{cov}(w_{(i,j-1)}, w_{(i'-1,j')}) \\
&\qquad\qquad + \mathrm{cov}(w_{(i-1,j)}, w_{(i',j'-1)}) + \mathrm{cov}(w_{(i,j-1)}, w_{(i',j'-1)})\Big) + I(i = i')\,\frac{1-u}{1+u} \\
&= \frac{u}{(1+u)^2}\Big(\rho^{|i-i'+1| + |j-j'-1|} + 2\rho^{|i-i'| + |j-j'|} + \rho^{|i-i'-1| + |j-j'+1|}\Big) + I(i = i')\,\frac{1-u}{1+u}.
\end{aligned}$$

If $i = i'$ then $j = j'$ and the expression above equals 1. If $i < i'$, then $j > j'$ and $|i - i' + 1| + |j - j' - 1| = (i' - i - 1) + (j - j' - 1) = |i - i'| + |j - j'| - 2$. Similarly, $|i - i' - 1| + |j - j' + 1| = |i - i'| + |j - j'| + 2$. So, $\rho^{|i-i'+1| + |j-j'-1|} + 2\rho^{|i-i'| + |j-j'|} + \rho^{|i-i'-1| + |j-j'+1|} = \rho^{|i-i'| + |j-j'|}(1/u + 2 + u)$. Hence, the result follows.
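The induction claim, namely that nodes sharing the same coordinate sum have covariance $\rho$ raised to their Manhattan distance, can be checked numerically on a small grid. The sketch below assumes the DAGAR weights $\rho/(1 + (n - 1)\rho^2)$ and conditional precisions $(1 + (n - 1)\rho^2)/(1 - \rho^2)$ for a node with $n$ directed neighbors, as they appear in the expressions used in the proof of Theorem 4. Function names, grid size and `rho` are illustrative choices.

```python
import numpy as np

def grid_dagar_covariance(m, n, rho):
    """Covariance implied by a DAGAR model on an m x n grid ordered by i + j."""
    nodes = sorted(((i, j) for i in range(1, m + 1) for j in range(1, n + 1)),
                   key=lambda v: (v[0] + v[1], v[0]))
    index = {v: t for t, v in enumerate(nodes)}
    k = len(nodes)
    B, F = np.zeros((k, k)), np.zeros(k)
    for (i, j), t in index.items():
        # Directed neighbors are the grid neighbors with a smaller coordinate sum.
        parents = [index[p] for p in ((i - 1, j), (i, j - 1)) if p in index]
        denom = 1.0 + (len(parents) - 1) * rho**2
        F[t] = denom / (1.0 - rho**2)             # conditional precision
        if parents:
            B[t, parents] = rho / denom           # autoregressive weights
    IB = np.eye(k) - B
    return nodes, np.linalg.inv(IB.T @ (F[:, None] * IB))

def max_antidiagonal_error(m, n, rho):
    """Largest gap between cov and rho^(Manhattan distance) within each set S_r."""
    nodes, cov = grid_dagar_covariance(m, n, rho)
    err = 0.0
    for a, (i, j) in enumerate(nodes):
        for b, (i2, j2) in enumerate(nodes):
            if i + j == i2 + j2:                  # both nodes lie in the same S_r
                d = abs(i - i2) + abs(j - j2)
                err = max(err, abs(cov[a, b] - rho**d))
    return err

print(max_antidiagonal_error(4, 5, 0.5) < 1e-10)
```

Because each node is paired with itself in the loop, the check also covers the unit marginal variances.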

B.4. Proof of Theorem 4

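For any vertex $i$ with $n_i$ neighbors, let $\pi_{ir}$ denote the set of all permutations $\pi$ such that $n_\pi(i) = r$. By symmetry, $|\pi_{ir}| = k!/(n_i + 1)$ for $r = 0, 1, \ldots, n_i$. Also, for any $i \sim j$ and $r = 0, \ldots, n_i$, let $\pi_{ijr}$ denote the set of all permutations $\pi$ such that $n_\pi(i) = r$ and $j \in N_\pi(i)$. Then, $|\pi_{ijr}| = k!/(n_i + 1) \times \mathrm{pr}(j \text{ is among the } r \text{ directed neighbors of } i) = r\,k!/(n_i(n_i + 1))$. We now have

$$\begin{aligned}
Q[i,i] &= \frac{1}{k!\,(1-\rho^2)} \sum_{\pi} \left(1 + (n_\pi(i) - 1)\rho^2 + \sum_{j \sim i} I(i \in N_\pi(j))\, \frac{\rho^2}{1 + (n_\pi(j) - 1)\rho^2}\right) \\
&= 1 + \frac{\rho^2}{k!\,(1-\rho^2)} \left(\sum_{r=0}^{n_i} r\,|\pi_{ir}| + \sum_{j \sim i} \sum_{r=0}^{n_j} \frac{|\pi_{jir}|}{1 + (r-1)\rho^2}\right) \\
&= 1 + \frac{n_i \rho^2}{2(1-\rho^2)} + \frac{\rho^2}{1-\rho^2} \sum_{j \sim i} \frac{1}{n_j(n_j+1)} \sum_{r=1}^{n_j} \frac{r}{1 + (r-1)\rho^2} \\
&= 1 + \frac{n_i \rho^2}{2(1-\rho^2)} + \frac{\rho^2}{1-\rho^2} \sum_{j \sim i} \frac{1}{n_j(n_j+1)}\, f(\rho, n_j).
\end{aligned}$$

To evaluate the non-diagonal entries of $Q$, we additionally define $\pi_{ijkr}$ to be the set of all permutations $\pi$ such that $n_\pi(i) = r$ and $\{j, k\} \subseteq N_\pi(i)$. Applying the combinatorial argument used earlier, we see that $|\pi_{ijkr}| = r(r-1)\,k!/((n_i - 1)n_i(n_i + 1))$. Let $i \approx j$ denote that there exists at least one node $k$ such that $i \sim k$ and $j \sim k$.

$$\begin{aligned}
Q[i,j] &= \frac{1}{k!\,(1-\rho^2)} \sum_{\pi} \left(-\rho\, I(i \sim j) + I(i \approx j) \sum_{k:\{i,j\} \subseteq N_\pi(k)} \frac{\rho^2}{1 + (n_\pi(k) - 1)\rho^2}\right) \\
&= -\frac{\rho}{1-\rho^2}\, I(i \sim j) + \frac{\rho^2}{k!\,(1-\rho^2)}\, I(i \approx j) \sum_{k \in N(i) \cap N(j)} \sum_{r=0}^{n_k} \frac{|\pi_{kijr}|}{1 + (r-1)\rho^2} \\
&= -\frac{\rho}{1-\rho^2}\, I(i \sim j) + \frac{\rho^2}{1-\rho^2}\, I(i \approx j) \sum_{k \in N(i) \cap N(j)} \frac{1}{(n_k - 1)n_k(n_k + 1)} \sum_{r=1}^{n_k} \frac{r(r-1)}{1 + (r-1)\rho^2} \\
&= -\frac{\rho}{1-\rho^2}\, I(i \sim j) + \frac{1}{1-\rho^2}\, I(i \approx j) \sum_{k \in N(i) \cap N(j)} \left(\frac{1}{2(n_k - 1)} - \frac{1}{(n_k - 1)n_k(n_k + 1)}\, f(\rho, n_k)\right).
\end{aligned}$$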
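The two closed forms derived in this proof can be compared against a brute-force average of the ordered DAGAR precision matrices over all $k!$ orderings of a small graph. The sketch below does this for a path graph with six vertices. It takes $f(\rho, n) = \sum_{r=1}^{n} r/(1 + (r-1)\rho^2)$, as read off from the derivation of $Q[i,i]$; all function names, the graph and the value of `rho` are hypothetical.

```python
import itertools
import numpy as np

def dagar_precision(adj, order, rho):
    """Ordered DAGAR precision (I - B)^T F (I - B) for the DAG induced by `order`."""
    k = adj.shape[0]
    pos = np.empty(k, dtype=int)
    pos[list(order)] = np.arange(k)               # pos[v] = rank of vertex v
    B, F = np.zeros((k, k)), np.zeros(k)
    for i in range(k):
        parents = [j for j in range(k) if adj[i, j] and pos[j] < pos[i]]
        denom = 1.0 + (len(parents) - 1) * rho**2
        F[i] = denom / (1.0 - rho**2)
        if parents:
            B[i, parents] = rho / denom
    IB = np.eye(k) - B
    return IB.T @ (F[:, None] * IB)

def f(rho, n):
    r = np.arange(1, n + 1)
    return float(np.sum(r / (1.0 + (r - 1) * rho**2)))

def order_free_closed_form(adj, rho):
    """Entries of the order-free precision from the closed forms derived above."""
    k, deg, c = adj.shape[0], adj.sum(axis=1), 1.0 - rho**2
    Q = np.zeros((k, k))
    for i in range(k):
        Q[i, i] = 1 + deg[i] * rho**2 / (2 * c) + (rho**2 / c) * sum(
            f(rho, deg[j]) / (deg[j] * (deg[j] + 1)) for j in np.flatnonzero(adj[i]))
        for j in range(k):
            if j != i:
                common = np.flatnonzero(adj[i] & adj[j])   # shared neighbors of i and j
                second = sum(1 / (2 * (deg[m] - 1))
                             - f(rho, deg[m]) / ((deg[m] - 1) * deg[m] * (deg[m] + 1))
                             for m in common)
                Q[i, j] = (-rho * adj[i, j] + second) / c
    return Q

def order_free_brute_force(adj, rho):
    """Average of the ordered precisions over all k! orderings (small k only)."""
    perms = list(itertools.permutations(range(adj.shape[0])))
    return sum(dagar_precision(adj, p, rho) for p in perms) / len(perms)

def path_adjacency(k):
    adj = np.zeros((k, k), dtype=int)
    for i in range(k - 1):
        adj[i, i + 1] = adj[i + 1, i] = 1
    return adj

adj = path_adjacency(6)
print(np.allclose(order_free_brute_force(adj, 0.6), order_free_closed_form(adj, 0.6)))
```

The brute-force average is only feasible for very small graphs, since the number of orderings grows as $k!$; it is meant as a sanity check of the algebra and not as a way of computing $Q$ in practice.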

B.5. Proof of Theorem 5

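We write $Q_\pi(\rho)$ and $Q(\rho)$ as $Q_\pi$ and $Q$, hiding the dependence on $\rho$ except when necessary. From Theorem 4 we have, for $i = 3, \ldots, k - 2$, $Q_{ii} = \frac{3 + 6\rho^2 + \rho^4}{3(1+\rho^2)(1-\rho^2)}$, $Q_{i,i+1} = -\frac{\rho}{1-\rho^2}$ and $Q_{i,i+2} = \frac{\rho^2}{3(1+\rho^2)(1-\rho^2)}$. Hence,

$$\|Q\|_F^2 = \frac{k}{9(1+\rho^2)^2(1-\rho^2)^2}\left((3 + 6\rho^2 + \rho^4)^2 + 18\rho^2(1+\rho^2)^2 + 2\rho^4\right) + o(k)$$

where the $o(k)$ term arises from rows and columns corresponding to the nodes at the extreme right or left.

Now using left-to-right or right-to-left ordering, from (6), a typical term in the quadratic form $w^\top Q_\pi w$ will be of the form $(w_i - \rho w_{i-1})^2/(1 - \rho^2)$. Hence, for $i = 1, 2, \ldots, k - 2$, $Q_{\pi:ii} = \frac{1+\rho^2}{1-\rho^2}$, $Q_{\pi:i,i+1} = -\frac{\rho}{1-\rho^2}$ and $Q_{\pi:i,i+2} = 0$. So,

$$\begin{aligned}
\|Q - Q_\pi\|_F^2 &= \frac{k}{9(1+\rho^2)^2(1-\rho^2)^2}\left(\left(3 + 6\rho^2 + \rho^4 - 3(1+\rho^2)^2\right)^2 + 2\rho^4\right) + o(k) \\
&= \frac{k}{9(1+\rho^2)^2(1-\rho^2)^2}\left(4\rho^8 + 2\rho^4\right) + o(k).
\end{aligned}$$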

Hence, the result follows.

B.6. Proof of Theorem 6

We index the nodes of the grid as $(i, j)$, and entries of $Q_\pi$ and $Q$ as $Q_{\pi:(ij),(i'j')}$ and $Q_{(ij),(i'j')}$ respectively for $1 \le i, i', j, j' \le m$. As in the proof of Theorem 5, it suffices to evaluate the Frobenius norms for the interior points of the grid (having 4 neighbors, each of which also has 4 neighbors), as the contribution from the remaining terms will be $o(m^2)$. Hence from Theorem 4 we have for $3 \le i, i', j, j' \le m - 2$,

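$$\begin{aligned}
Q_{(ij),(ij)} &= \frac{1 + \rho^2 + \rho^2 s(\rho)/5}{1-\rho^2} \\
Q_{(ij),(i+1,j)} = Q_{(ij),(i,j+1)} &= -\frac{\rho}{1-\rho^2} \\
Q_{(ij),(i+2,j)} = Q_{(ij),(i,j+2)} &= \frac{1}{1-\rho^2}\left(1/6 - s(\rho)/60\right) \\
Q_{(ij),(i+1,j+1)} = Q_{(ij),(i+1,j-1)} &= \frac{2}{1-\rho^2}\left(1/6 - s(\rho)/60\right)
\end{aligned}$$

Summing up, we have

$$\begin{aligned}
\|Q\|_F^2 &= m^2\left(Q_{(ij),(ij)}^2 + 4\left(Q_{(ij),(i+1,j)}^2 + Q_{(ij),(i+2,j)}^2 + Q_{(ij),(i+1,j+1)}^2\right) + o(1)\right) \\
&= \frac{m^2}{(1-\rho^2)^2}\left(\left(1 + \rho^2 + \rho^2 s(\rho)/5\right)^2 + 4\rho^2 + 20\left(1/6 - s(\rho)/60\right)^2\right) + o(m^2)
\end{aligned}$$

Now, without loss of generality we assume that the DAGAR precision matrix $Q_\pi(\rho)$ was constructed by ordering the nodes in increasing order of $(i + j)$. Then, a typical term in the quadratic form $w^\top Q_\pi w$ will be of the form $\frac{1+\rho^2}{1-\rho^2}\left(w_{ij} - \frac{\rho}{1+\rho^2}(w_{i,j-1} + w_{i-1,j})\right)^2$. Hence, we will have

$$\begin{aligned}
Q_{\pi:(ij),(ij)} &= \frac{1 + \rho^2 + 2\rho^2/(1+\rho^2)}{1-\rho^2} \\
Q_{\pi:(ij),(i+1,j)} = Q_{\pi:(ij),(i,j+1)} &= -\frac{\rho}{1-\rho^2} \\
Q_{\pi:(ij),(i+2,j)} = Q_{\pi:(ij),(i,j+2)} = Q_{\pi:(ij),(i+1,j+1)} &= 0 \\
Q_{\pi:(ij),(i+1,j-1)} &= \frac{\rho^2}{(1+\rho^2)(1-\rho^2)}
\end{aligned}$$

Subtracting, we have

$$\|Q_\pi - Q\|_F^2 = m^2\left(\left(Q_{\pi:(ij),(ij)} - Q_{(ij),(ij)}\right)^2 + 4Q_{(ij),(i+2,j)}^2 + 2\left(Q_{\pi:(ij),(i+1,j-1)} - Q_{(ij),(i+1,j-1)}\right)^2 + 2Q_{(ij),(i+1,j+1)}^2 + o(1)\right)$$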

Hence, the result follows.

Supplementary Material

Figure 4:

Estimate and confidence bands of ρ as a function of the true ρ (x-axis) for the simulation data analysis using data generated from an exponential GP

Figure 11:

MSE as a function of the true ρ (x-axis) for the simulation data analysis using Poisson responses

Contributor Information

Abhirup Datta, Johns Hopkins University.

Sudipto Banerjee, University of California Los Angeles.

James S. Hodges, University of Minnesota.

Leiwen Gao, University of California Los Angeles.

