Highly Scalable Bayesian Geostatistical Modeling via Meshed Gaussian Processes on Partitioned Domains

Michele Peruzzi; Sudipto Banerjee; Andrew O Finley

doi:10.1080/01621459.2020.1833889

Author manuscript; available in PMC: 2022 Aug 5.

Published in final edited form as: J Am Stat Assoc. 2020 Nov 24;117(538):969–982. doi: 10.1080/01621459.2020.1833889

PMCID: PMC9354857 NIHMSID: NIHMS1639134 PMID: 35935897

Abstract

We introduce a class of scalable Bayesian hierarchical models for the analysis of massive geostatistical datasets. The approach combines ideas from high-dimensional geostatistics by partitioning the spatial domain and modeling the regions in the partition using a sparsity-inducing directed acyclic graph (DAG). We extend the model over the DAG to a well-defined spatial process, which we call the Meshed Gaussian Process (MGP). A major contribution is the development of MGPs on tessellated domains, accompanied by a Gibbs sampler for the efficient recovery of spatial random effects. In particular, the cubic MGP (Q-MGP) can harness high-performance computing resources by executing all large-scale operations in parallel within the Gibbs sampler, improving mixing and computing time compared to sequential updating schemes. Unlike some existing models for large spatial data, a Q-MGP facilitates massive caching of expensive matrix operations, making it particularly apt for dealing with spatiotemporal remote-sensing data. We compare Q-MGPs against state-of-the-art methods on large synthetic and real-world data. We also illustrate using Normalized Difference Vegetation Index (NDVI) data from the Serengeti park region to recover latent multivariate spatiotemporal random effects at millions of locations. The source code is available at github.com/mkln/meshgp.

Keywords: Bayesian, spatial, large n, graphical models, domain partitioning, sparsity

1. Introduction

Collecting large quantities of spatial and spatiotemporal data is now commonplace in many fields. In ecology and forestry, massive datasets are collected using satellite imaging and other remote sensing instruments such as LiDAR that periodically record high-resolution images. Unfortunately, clouds frequently obstruct the view resulting in large regions with missing information. Figure 1 shows this phenomenon in Normalized Difference Vegetation Index (NDVI) data from the Serengeti region. Filling such gaps in the data is an important goal as is quantifying uncertainty in predictions. This goal is achieved through stochastic modeling of the underlying phenomenon, which involves the specification of a spatial or spatiotemporal process characterizing dependence from a finite realization. Gaussian processes (GP) are a customary choice to characterize spatial dependence, but their implementation is notoriously burdened by their O(n³) computational complexity. Consequently, intense research has been devoted in recent years to developing scalable models for large spatial datasets – see detailed reviews by Sun et al. (2011) and Banerjee (2017).

Figure 1: Left: NDVI in the Serengeti region on 2016-12-17. White areas correspond to missing data due to cloud cover. Right: elevation data for the same region.

Computational complexity can be reduced by considering low-rank models; among these, knot-based methods motivated by “kriging” ideas enjoy some optimality properties but over-smooth the estimates of spatial random effects unless the number of knots is large, and require corrections to avoid overestimation of the nugget (Banerjee et al., 2008; Cressie and Johannesson, 2008; Banerjee et al., 2010; Guhaniyogi et al., 2011; Finley et al., 2012). Other methods reduce the computational burden by introducing sparsity in the covariance matrix; strategies include tapering (Furrer et al., 2006; Kaufman et al., 2008) or partitioning of the spatial domain into regions with a typical assumption of independence across regions (Sang and Huang, 2012; Stein, 2014). These can be improved by considering a recursive partitioning scheme, resulting in a multi-resolution approximation (MRA; Katzfuss 2017). Conditional independence assumptions also have a good track record in terms of scalability to large spatial datasets: Gaussian Markov random fields (GMRF; Rue and Held, 2005), composite likelihood methods (Eidsvik et al., 2014), and neighbor-based likelihood approximations (Vecchia, 1988) belong to this family.

The recent literature has witnessed substantial activity surrounding the so-called Vecchia approximation (Vecchia, 1988). This approximation can be regarded as a special case of the GMRF approximations with a simplified neighborhood structure motivated by a directed acyclic graphical (DAG) representation of a Gaussian process likelihood. Extensions that lead to well-defined spatial processes, and thereby accommodate inference at arbitrary locations by extending the DAG representation to the entire domain, include nearest-neighbor Gaussian processes (NNGP; Datta et al. 2016a,b) and further generalizations that construct DAGs over the augmented space of outcomes and spatial effects (Katzfuss and Guinness, 2017). These approaches achieve computational scalability by introducing sparsity in the precision matrix. The DAG relies upon a specific topological ordering of the locations, which also determines the construction of neighborhood sets, and certain orderings tend to deliver improved performance of such models (Katzfuss and Guinness, 2017; Guinness, 2018).

When inference on the latent process is sought, Bayesian inference has the benefit of providing direct probability statements based upon the posterior distribution of the process. Inference based on asymptotic approximations is avoided, but challenges remain in computing the posterior distribution because inference is sought on a very high-dimensional parameter space (including the realizations of the latent process). One possibility, available for Gaussian first-stage likelihoods, is to work with a collapsed or marginalized likelihood by integrating out the spatial random effects. However, Gibbs samplers and other MCMC algorithms for the collapsed models can be inexorably slow and are impractical when data are in the millions. A sequential Gibbs sampler that updates the latent spatial effects (Datta et al., 2016a) is faster in updating the parameters but suffers from high autocorrelation and slow mixing. Another possibility emerges when interest lies in prediction or imputation of the outcome variable only and not the latent process. Here, a so-called “response” model that models the outcome itself using an NNGP can be constructed. This model is much faster and enjoys superior convergence properties, but we lose inference on the latent process and its predictive performance tends to be inferior to the latent process model. Furthermore, these options are unavailable in non-Gaussian first-stage hierarchical models or when the focus is not uniquely on prediction. A detailed comparison of different approaches for computing Bayesian NNGP models is presented in Finley et al. (2019).

Our current contribution introduces a class of Meshed Gaussian Process (MGP) models for Bayesian hierarchical modeling of large spatial datasets. This class builds upon the aforementioned extensions of Vecchia (1988) and other DAG-based models. The inferential focus remains within the context of massive spatial datasets over very large domains. We exploit the demonstrated benefits of the DAG-based models, but we now adapt them to partitioned domains. We describe dependence across regions of a partitioned domain using a small, patterned DAG which we refer to as a mesh. Within each region, some locations are selected as reference and collectively mapped to a single node in the DAG. Relationships among nodes are governed by kriging ideas. In the resulting MGP, regions in the spatial domain depend on each other through the reference locations. Realizations at all other locations are assumed independent, conditional upon the reference locations. This construction leads to a valid standalone spatial process.

As a particular subclass of MGPs, we propose a novel partitioning and graph design based on domain tessellations. Unlike methods that build sparse DAGs by limiting dependence to m nearest neighbors, our approach shapes the underlying DAG with a known, repeating pattern corresponding to the chosen tessellation geometry. The underlying sparse DAG enables scaling computations to large data settings and its known pattern guarantees the availability of block-parallel sampling schemes; furthermore, large computational savings can be achieved at no additional approximation cost if data are collected on patterned lattices. Finally, extensions to spatiotemporal and/or multivariate data are straightforward once a suitable covariance function has been defined. We use axis-parallel domain partitioning and the corresponding cubic DAG – resulting in cubic MGPs or Q-MGPs – to show substantial improvements in computational time and inferential performance relative to other models with data sizes ranging from the thousands to the several millions, for both spatial and spatiotemporal data and using multivariate spatial processes.

The present work may appear to share similarities with the block-NNGP model of Quiroz et al. (2019), who advocate building sparse DAGs on grouped locations based on their ordering and subsequent identification of m “past” neighbors. Unlike block-NNGPs, our tessellated GPs consider the domain tessellation as generating the DAG; the number of parents of any node is thus fixed and depends on the geometry of the chosen tessellation rather than on a user-defined parameter. Inclusion of more distant locations in the parent set of any location will, therefore, not proceed by increasing the number of neighbors m, but rather by increasing the regions’ size and/or modifying their shape. Central to tessellated GPs is the idea of forcing a DAG with known coloring on the data, resulting in guaranteed efficiencies when recovering the latent spatial effects. This strategy is analogous in spirit to multi-resolution approximations (Katzfuss, 2017; Gramacy and Lee, 2008), which also force a DAG on the data, resulting in conditional independence patterns that are known in advance and that can be used to improve computations. However, while multi-resolution approximations are defined by branching graphs associated to recursive domain partitioning, tessellated GPs use a single domain partition, with each region connected in the DAG only to its immediate neighbors. Compared to treed graphs, tessellated GPs are associated to DAGs with fewer conditionally-independent groups and whose repeated patterns facilitate the identification of redundant matrix operations arising when one or more coordinate margins are gridded. We also note that while the idea of partitioning domains to create approximations is not new, construction of the DAG-based approximation over partitioned domains has received considerably less attention. 
Finally, our focus here is on developing tessellated GPs as a methodology that enables the efficient recovery of the latent spatial random effects and the Bayesian estimation of covariance parameters via MCMC; we are thus not focusing on alternative computational algorithms (see e.g. Finley et al. 2019), which have been developed for NNGPs but can nonetheless all be adapted to general MGP models.

The balance of this paper proceeds as follows. Section 2 introduces our general framework for hierarchical Bayesian modeling of spatial processes using networks of grouped spatial locations. The MGP is outlined in Section 3, where we provide a general, scalable computing algorithm in Section 3.1. Tessellation-based schemes and the specific case of Q-MGPs are outlined in Section 4, which highlights their properties and computational advantages. We illustrate the performance of our proposed approach in Section 5 using simulation experiments and an application on a massive dataset with millions of spatiotemporal locations. We conclude the paper with a discussion and pointers to further research. Supplementary material accompanying this manuscript as an Appendix is available online and contains further comparisons of Q-MGPs with several state-of-the-art methods for spatial data.

2. Spatial processes on partitioned domains

A q × 1 spatial process assigns a probability law on $\{w (ℓ) : ℓ \in D\}$, where w(ℓ) is a q × 1 random vector with elements w_i(ℓ) for i = 1,2,…,q. In the following general discussion we will not distinguish between spatial $(D \subset ℜ^{d})$ and spatiotemporal domains $(D \subset ℜ^{d + 1})$, and denote spatial or spatiotemporal locations as ℓ, s, or u.

For any finite set of spatial locations $\{ℓ_{1}, ℓ_{2}, \dots, ℓ_{n_{L}}\} = L \subset D$ of size $n_{L}$, let P(·) denote the probability law of the $n_{L} q \times 1$ random vector $w_{L} = (w (ℓ_{1})^{⊤}, w (ℓ_{2})^{⊤}, \dots, w (ℓ_{n_{L}})^{⊤})^{⊤}$ with probability density p(·). The joint density of $w_{L}$ can be expressed as a DAG (or a Bayesian network model) with respect to the ordered set of locations $L$ as

p (w_{L}) = \prod_{i = 1}^{n_{L}} p (w (ℓ_{i}) ∣ w (ℓ_{1}), \dots, w (ℓ_{i - 1})),

(1)

where the conditional set for each w(ℓ_i) can be interpreted as the set of its parents in a large, dense Bayesian network. Defining a simplified valid joint density on $L$ by reducing the size of the conditioning sets is a popular strategy for fast likelihood approximations in the context of large spatial datasets. One typically limits dependence to “past” neighboring locations with respect to the ordering in (1) (Vecchia, 1988; Gramacy and Apley, 2015; Stein et al., 2004; Datta et al., 2016a; Katzfuss and Guinness, 2017). The neighbors are defined and fixed and model performance may benefit from the addition of some distant locations (Stein et al., 2004). The ordering in $L$ is also fixed and inferential performance may benefit from the use of some fixed permutations (Guinness, 2018). The result of shrinking the conditional sets to a smaller set of neighbors from the past yields a sparse DAG or Bayesian network, which yields potentially massive computational gains.
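The effect of shrinking the conditioning sets in (1) can be sketched numerically. The following is a minimal illustration, not the authors' implementation: it assumes a univariate process, an exponential covariance with hypothetical parameters `phi` and `sigma2`, and conditioning sets made of the (at most) `m` nearest locations that precede each location in the ordering.

```python
import numpy as np

def exp_cov(a, b, phi=3.0, sigma2=1.0):
    """Exponential covariance between two sets of locations (one per row)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d)

def vecchia_loglik(w, locs, m):
    """Log-density of w when each conditioning set in (1) is limited
    to at most m nearest 'past' locations in the given ordering."""
    total = 0.0
    for i in range(len(w)):
        past = np.arange(i)
        if len(past) > m:
            d = np.linalg.norm(locs[past] - locs[i], axis=1)
            past = past[np.argsort(d)[:m]]
        c_ii = exp_cov(locs[i:i + 1], locs[i:i + 1])[0, 0]
        if len(past) == 0:
            mean, var = 0.0, c_ii
        else:
            c_ip = exp_cov(locs[i:i + 1], locs[past])   # 1 x k
            c_pp = exp_cov(locs[past], locs[past])      # k x k
            h = np.linalg.solve(c_pp, c_ip.T).ravel()   # kriging weights
            mean = h @ w[past]
            var = c_ii - c_ip.ravel() @ h
        total += -0.5 * (np.log(2 * np.pi * var) + (w[i] - mean) ** 2 / var)
    return total

rng = np.random.default_rng(0)
locs = rng.uniform(size=(30, 2))
C = exp_cov(locs, locs)
w = np.linalg.cholesky(C) @ rng.standard_normal(30)

full = vecchia_loglik(w, locs, m=len(w))   # dense DAG: the exact density in (1)
sparse = vecchia_loglik(w, locs, m=5)      # sparse DAG: an approximation
```

With `m` equal to the sample size the function reproduces the exact Gaussian log-density; smaller `m` replaces each solve of size up to n with one of size at most `m`.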

We proceed in a similar manner, but instead of defining a sparse DAG at the level of each individual location, we map entire groups of locations to nodes in a much smaller graph; the same graph will be used to model the dependence between any location in the spatial domain and, therefore, to define a spatial process. Let $P = \{D_{1}, \dots, D_{M}\}$ be a partition of $D$ into M mutually exclusive subsets so that $D = \cup_{i = 1}^{M} D_{i}$ and $D_{i} \cap D_{j} = \emptyset$ whenever i ≠ j. Similar to the nomenclature in the NNGP, we fix a reference set $S = \{s_{1}, \dots, s_{n_{S}}\} \subset D$, which itself is partitioned using $P$ by letting $S_{j} = D_{j} \cap S$. The set of non-reference locations is similarly partitioned with $U_{j} = D_{j} \setminus S_{j}$ so that $D_{j} = S_{j} \cup U_{j}$ for each j = 1,2,…,M. We now construct a DAG to model dependence within and between $S$ and $U$. Let $G = \{V, E\}$ be a graph with nodes V = A ∪ B, where we refer to A = {a₁,…,a_M} as the reference nodes and to B = {b₁,…,b_M} as the non-reference, or simply “other”, nodes. Let A ∩ B = ∅. We introduce a map $η : D \to V$ such that

η (ℓ) = \begin{cases} a_{j} \in A & \text{if } ℓ \in S_{j} \\ b_{j} \in B & \text{if } ℓ \in U_{j} \end{cases}

(2)

This surjective many-to-one map links each location in $S_{j}$ and $U_{j}$ to a node in $G$. The edges connecting nodes in $G$ are E = {Pa[v₁],…,Pa[v_2M]} where Pa[v] ⊂ V denotes the set of parents of any v ∈ V and, hence, identifies the directed edges pointing to v. We let $G$ be acyclic, i.e., there is no chain $\{v_{i_{1}} \to v_{i_{2}} \to \dots \to v_{i_{t}}\}$ of elements of V such that $v_{i_{j}} \in Pa [v_{i_{j + 1}}]$ for each j < t and $v_{i_{t}} \in Pa [v_{i_{1}}]$. Crucially, we assume that Pa[v] ⊂ A for all v ∈ V, i.e., that only reference nodes have children, to distinguish the reference nodes A from the other nodes B. Apart from the assumption that a_j ∈ Pa[b_j], we refrain from defining the parents of a node, thereby retaining flexibility. In general, however, all locations in $U_{j}$ will share the same parent set. In Section 4 we will consider meshes associated to domain tessellations.
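To make the partition, the map η in (2) and the constraints on parent sets concrete, here is a small sketch on the unit square. It assumes an axis-parallel partition into `n_side × n_side` regions, and the parent rule (reference nodes of the regions immediately below and to the left) is a hypothetical choice made only for illustration; the text above deliberately leaves the parent sets unspecified. The function names are likewise illustrative.

```python
import numpy as np

def block_index(locs, n_side):
    """Index j of the axis-parallel region D_j containing each location in [0, 1]^2."""
    cell = np.minimum((locs * n_side).astype(int), n_side - 1)
    return cell[:, 0] * n_side + cell[:, 1]

def eta(locs, is_reference, n_side):
    """Map each location to a node: ('a', j) if it is in S_j, ('b', j) if in U_j."""
    j = block_index(locs, n_side)
    return [("a" if r else "b", int(k)) for r, k in zip(is_reference, j)]

def parent_sets(n_side):
    """Hypothetical parent sets satisfying Pa[v] ⊂ A and a_j ∈ Pa[b_j]."""
    pa = {}
    for r in range(n_side):
        for c in range(n_side):
            j = r * n_side + c
            neighbors = []
            if r > 0:
                neighbors.append(("a", (r - 1) * n_side + c))
            if c > 0:
                neighbors.append(("a", r * n_side + c - 1))
            pa[("a", j)] = neighbors                  # parents among earlier reference nodes
            pa[("b", j)] = [("a", j)] + neighbors     # b_j always has a_j as a parent
    return pa

locs = np.array([[0.1, 0.1], [0.9, 0.2], [0.55, 0.95]])
nodes = eta(locs, is_reference=[True, False, True], n_side=2)
pa = parent_sets(2)
```

Because every parent of a reference node a_j has a smaller index than j, the resulting graph has no directed cycles.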

Consider the enumeration $S_{i} = {s_{i_{1}}, \dots, s_{i_{n_{i}}}}$ , where ${i_{1}, i_{2}, \dots, i_{n_{i}}} \subset {1, 2, \dots, n_{S}}$ , and let $w_{i} = {(w {(s_{i_{1}})}^{⊤}, w {(s_{i_{2}})}^{⊤}, \dots, w {(s_{i_{n_{i}}})}^{⊤})}^{⊤}$ be the n_iq × 1 random vector listing elements of w(s) for each $s \in S_{i}$ . We now rewrite (1) as a product of M conditional densities

p (w_{S}) = p (w_{1}, w_{2}, \dots, w_{M}) = \prod_{i = 1}^{M} p (w_{i} ∣ w_{1}, \dots, w_{i - 1}) .

(3)

The conditioning sets are then reduced based on the graph $G$ :

\tilde{p} (w_{S}) = \prod_{i = 1}^{M} p (w_{i} ∣ w_{[i]}),

(4)

where we denote w_[i] = {w_j : a_j ∈ Pa[a_i]}, and Pa[a_i] ⊂ {a₁,…,a_i−1} ⊂ A. This is a proper multivariate joint density since the graph is acyclic (Lauritzen, 1996). It is instructive to note how the above approximation behaves when the size of the parent set shrinks, for a given domain partitioning scheme. To this end, we adapt a result in Banerjee (2020) and show that sparser DAGs correspond to a larger Kullback-Leibler (KL) divergence from the base density p. This result has been proved earlier for Gaussian likelihoods by Guinness (2018), but the argument given below is free of distributional assumptions and is linked to the submodularity of entropy and the “information never hurts” principle (see e.g. Cover and Thomas, 1991).

Consider random vector w and some partition of the domain $P$ corresponding to nodes V = {v₁,…,v_M} via map η. Let the base process correspond to graph $G_{0} = {V, E_{0}}$ where E₀ = {Pa₀[v₁],…,Pa₀[v_M]}. Then, let $G_{1} = {V, E_{1}}$ where E₁ = {Pa₁[v₁],…,Pa₁[v_M]} and Pa₁[v_i] ⊆ Pa₀[v_i] for all i ∈ {1,…,M}. Finally construct $G_{2} = {V, E_{2}}$ by letting Pa₂[v_i*] = Pa₁[v_i*] \ {v*} for some v* ∈ Pa₁[v_i*]. In other words, graph $G_{2}$ is obtained by removing the directed edge v* → v_i* from $G_{1}$ . We approximate p using densities p₁ and p₂ based on $G_{1}$ and $G_{2}$ , respectively, obtaining

\frac{p_{1} (w)}{p_{2} (w)} = \prod_{i = 1}^{M} \frac{p (w_{i} ∣ w_{{[i]}_{1}})}{p (w_{i} ∣ w_{{[i]}_{2}})} = \frac{p (w_{i^{*}} ∣ w_{{[i^{*}]}_{1}})}{p (w_{i^{*}} ∣ w_{{[i^{*}]}_{2}})} .

(5)

Considering the Kullback-Leibler divergence of each density from p, and denoting V * = V \ {{i*} ∪ Pa₁[i*]}, we find

\begin{aligned}
KL (p_{2} ‖ p) - KL (p_{1} ‖ p) &= \int \left\{ \log \left( \frac{p (w)}{p_{2} (w)} \right) - \log \left( \frac{p (w)}{p_{1} (w)} \right) \right\} p (w) \, d w = \int \log \left( \frac{p_{1} (w)}{p_{2} (w)} \right) p (w) \, d w \\
&= \int \log \left( \frac{p (w_{i^{*}} ∣ w_{{[i^{*}]}_{1}})}{p (w_{i^{*}} ∣ w_{{[i^{*}]}_{2}})} \right) p (w) \, d w \\
&= \int \log \left( \frac{p (w_{i^{*}} ∣ w_{{[i^{*}]}_{1}})}{p (w_{i^{*}} ∣ w_{{[i^{*}]}_{2}})} \right) p (w_{i^{*}}, w_{{[i^{*}]}_{1}}) \, d w_{i^{*}} \, d w_{{[i^{*}]}_{1}} \\
&= \int \left\{ \int \log \left( \frac{p (w_{i^{*}} ∣ w_{{[i^{*}]}_{1}})}{p (w_{i^{*}} ∣ w_{{[i^{*}]}_{2}})} \right) p (w_{i^{*}} ∣ w_{{[i^{*}]}_{1}}) \, d w_{i^{*}} \right\} p (w_{{[i^{*}]}_{1}}) \, d w_{{[i^{*}]}_{1}} \geq 0,
\end{aligned}

(6)

where we use (5), the fact that V * and {i*} ∪ Pa₁[i*] are disjoint, and Jensen’s inequality. This result implies that larger parent sets are preferable as they correspond to better approximations to the full model; the choice of sparser graphs will be driven by computational considerations – see Section 3.2.
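The ordering in (6) can be checked numerically in the Gaussian case. The sketch below is only an illustration under stated assumptions: nodes are scalar, the base covariance is the (valid but arbitrary) choice C_ij = 1/(1 + |i − j|), and the helper names `dag_precision` and `kl_from_base` are hypothetical. The divergence is computed as the integral of log(p / p̃) under p, matching the convention used in (6).

```python
import numpy as np

def dag_precision(C, parents):
    """Precision of prod_i p(w_i | w_[i]) for scalar nodes, with each
    conditional taken from the base density N(0, C)."""
    n = C.shape[0]
    H = np.zeros((n, n))
    r = np.zeros(n)
    for i in range(n):
        pa = list(parents[i])
        if pa:
            h = np.linalg.solve(C[np.ix_(pa, pa)], C[pa, i])
            H[i, pa] = h
            r[i] = C[i, i] - C[i, pa] @ h    # conditional variance
        else:
            r[i] = C[i, i]
    A = np.eye(n) - H
    return A.T @ np.diag(1.0 / r) @ A

def kl_from_base(C, Q):
    """Integral of log(p / p_tilde) under p, with p = N(0, C), p_tilde = N(0, Q^{-1})."""
    n = C.shape[0]
    logdet_C = np.linalg.slogdet(C)[1]
    logdet_Q = np.linalg.slogdet(Q)[1]
    return 0.5 * (np.trace(Q @ C) - n - logdet_Q - logdet_C)

idx = np.arange(5)
C = 1.0 / (1.0 + np.abs(idx[:, None] - idx[None, :]))

pa0 = {0: [], 1: [0], 2: [0, 1], 3: [0, 1, 2], 4: [0, 1, 2, 3]}  # dense graph G_0
pa1 = {0: [], 1: [0], 2: [0, 1], 3: [1, 2], 4: [2, 3]}           # sparser graph G_1
pa2 = {0: [], 1: [0], 2: [0, 1], 3: [1, 2], 4: [3]}              # G_1 minus edge 2 -> 4

kl0, kl1, kl2 = (kl_from_base(C, dag_precision(C, pa)) for pa in (pa0, pa1, pa2))
```

Here `kl0` is zero up to rounding because the dense graph recovers the base density, while removing the edge 2 → 4 cannot decrease the divergence, so `kl1 <= kl2`.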

We construct the spatial process over arbitrary locations by enumerating other locations as $U = \{u_{1}, \dots, u_{n_{U}}\} \subset D \setminus S$ and extending (4) to the non-reference locations. Given the partition of $U$ defined earlier with components $U_{j}$ for j = 1,2,…,M, for each $u \in U_{j}$ we set η(u) = b_j and recall that Pa[b_j] ⊂ A by construction. For each $i = 1, \dots, n_{U}$, we denote $w_{[u_{i}]} = \{w_{j} : a_{j} \in Pa [η (u_{i})]\} \subset w_{S}$ and define the conditional density of $w_{U}$ given $w_{S}$ as

\tilde{p} (w_{U} ∣ w_{S}) = \prod_{u_{i} \in U} p (w (u_{i}) ∣ w_{[u_{i}]}) = \prod_{j = 1}^{M} p (w_{U_{j}} ∣ w_{[b_{j}]}) .

(7)

Therefore, for any finite subset of spatial locations $L \subset D$ we can let $U = L \setminus S$ and obtain

\tilde{p} (w_{L}) = \int \tilde{p} (w_{U} ∣ w_{S}) \tilde{p} (w_{S}) \prod_{s_{i} \in S \setminus L} d (w (s_{i})) .

We show (see Appendix A, available online) that this is a well-defined process by verifying the Kolmogorov consistency conditions. This new process can be built starting from a base process, a fixed reference set, domain partition $P$ and a graph $G$ . Next, we elucidate with Gaussian processes.

3. Meshed Gaussian Processes

Let $\{w (ℓ) : ℓ \in D\}$ be a q-variate Gaussian process, denoted as w(ℓ) ~ GP(0, C(·, · | θ)). The cross-covariance C(·,· | θ) indexed by parameters θ is a function $C : D \times D \to M_{q \times q}$, where $M_{q \times q}$ is a subset of $ℜ^{q \times q}$ (the space of all q × q real matrices) such that the (i, j)-th entry of C(ℓ, ℓ′ | θ) evaluates the covariance between the i-th and j-th elements of w(ℓ) at ℓ and ℓ′, respectively, i.e., cov(w_i(ℓ), w_j(ℓ′)). We omit dependence on θ to simplify notation. The cross-covariance function itself needs to be neither symmetric nor positive-definite, but must satisfy the following two properties: (i) C(ℓ, ℓ′) = C(ℓ′, ℓ)^⊤; and (ii) $\sum_{i = 1}^{n} \sum_{j = 1}^{n} z_{i}^{⊤} C (ℓ_{i}, ℓ_{j}) z_{j} > 0$ for any integer n and any finite collection of points {ℓ₁, ℓ₂, …, ℓ_n} and for all $z_{i} \in ℜ^{q} \setminus \{0\}$. See Genton and Kleiber (2015) for a review of cross-covariance functions for multivariate processes. The (partial) realization of the multivariate process over any finite set $L$ has a multivariate normal distribution $w_{L} \sim N (0, C_{L})$ where $w_{L}$ is the $q n_{L} \times 1$ column vector and $C_{L}$ is the $q n_{L} \times q n_{L}$ block matrix with the q × q matrix C(ℓ_i, ℓ_j) as its (i, j) block for $i, j = 1, \dots, n_{L}$.

We construct the MGP from a base, or parent, multivariate GP for w(ℓ) and then, using the graph $G$ defined in Section 2, represent the joint density at the reference set $S$ as

\tilde{p} (w_{S}) = \prod_{j = 1}^{M} N (w_{j} ∣ H_{j} w_{[j]}, R_{j}),

(8)

where $H_{1} = O_{n_{1} \times 1}$, $R_{1} = C_{S_{1}}$ and for j > 1, $H_{j} = C_{S_{j}, S_{[j]}} C_{S_{[j]}}^{- 1}$ and $R_{j} = C_{S_{j}} - C_{S_{j}, S_{[j]}} C_{S_{[j]}}^{- 1} C_{S_{[j]}, S_{j}}$. The resulting joint density $\tilde{p} (w_{S})$ is multivariate normal with covariance ${\tilde{C}}_{S}$ and precision matrix ${\tilde{C}}_{S}^{- 1}$. The precision matrix for Gaussian graphical models is easily derived using customary linear model representations for each conditional regression. Consider the DAG in (4). Each w_i is n_iq × 1 and let J_i = |Pa[a_i]| be the number of parents for a_i in the graph $G$. Furthermore, let C_i,j be the n_iq × n_jq covariance matrix between w_i and w_j, C_i,[i] be the n_iq × J_iq covariance matrix between w_i and w_[i], and C_[i],[i] be the J_iq × J_iq covariance matrix between w_[i] and itself. Representing each conditional density in (4) as a linear regression on w_i, we get

w_{1} = ω_{1} \sim N (0, R_{1}); \quad w_{i} = \sum_{\{j : a_{j} \in Pa [a_{i}]\}} H_{i j} w_{j} + ω_{i}, \quad i = 2, 3, \dots, M,

(9)

where each H_ij is an n_iq × n_jq coefficient matrix representing the multivariate regression of w_i given w_[i], $ω_{i} \overset{ind}{\sim} N (0, R_{i})$ for i = 1,2,…,M, and each R_i is an n_iq × n_iq residual covariance matrix. We set H_ii = O and H_ij = O, where O is the matrix of zeros, whenever a_j ∉ Pa[a_i]. For j ∈ {j : a_j ∈ Pa[a_i]}, let $\{j_{1}, j_{2}, \dots, j_{J_{i}}\}$ be the indices in Pa[a_i] and let $H_{i, [i]} = [H_{i, j_{1}}, H_{i, j_{2}}, \dots, H_{i, j_{J_{i}}}]$ be the $n_{i} q \times (\sum_{k = 1}^{J_{i}} n_{j_{k}}) q$ block matrix formed by stacking $H_{i, j_{k}}$ side by side for each $a_{j_{k}} \in Pa [a_{i}]$. Since $E [w_{i} ∣ w_{[i]}] = H_{i, [i]} w_{[i]} = C_{i, [i]} C_{[i] [i]}^{- 1} w_{[i]}$, we obtain $H_{i, [i]} = C_{i, [i]} C_{[i] [i]}^{- 1}$ and each $H_{i, j_{k}}$ can be obtained from the respective submatrix of H_i,[i]. We also obtain $R_{i} = var \{w_{i} ∣ w_{[i]}\} = C_{i, i} - C_{i, [i]} C_{[i] [i]}^{- 1} C_{[i], i}$. Therefore, all the H_ij’s and R_i’s can be computed from the base cross-covariance function.

The distribution of $w = {[w_{1}^{⊤}, w_{2}^{⊤}, \dots, w_{M}^{⊤}]}^{⊤}$ can be obtained by noting that w = Hw+ ω, where H = {H_ij} is the $(\sum_{i = 1}^{M} n_{i} q) \times (\sum_{i = 1}^{M} n_{i} q)$ block matrix with {H_ij} as (i, j)-th block. Therefore, ${\tilde{C}}_{S} = var (w) = {(I - H)}^{- 1} R {(I - H)}^{- ⊤}$ , where R is block-diagonal with R_i as the (i, i)-th block. Note that I − H is block lower-triangular with 1’s on the diagonal, hence non-singular. Also, the precision matrix ${\tilde{C}}_{S}^{- 1} = {(I - H)}^{⊤} R^{- 1} (I - H)$ is sparse because of H_ij = O whenever a_j ∉ Pa[a_i]. Block-sparsity of ${\tilde{C}}_{S}^{- 1}$ can be induced by building $G$ with few, carefully placed directed edges among nodes in A; Appendix B, available online, contains a more in-depth treatment. We extend (8) to the collection of non-reference locations $U \subset D \ S$ :

\tilde{p} (w_{U} ∣ w_{S}) = \prod_{j = 1}^{M} N (w_{U_{j}} ∣ H_{U_{j}} w_{[b_{j}]}, R_{U_{j}}) = N (w_{U} ∣ H_{U} w_{S}, R_{U}),

(10)

where $H_{U_{j}} = C_{U_{j}, S_{[b_{j}]}} C_{S_{[b_{j}]}}^{- 1}$ and $R_{U_{j}} = C_{U_{j}} - C_{U_{j}, S_{[b_{j}]}} C_{S_{[b_{j}]}}^{- 1} C_{S_{[b_{j}]}, U_{j}}$ , analogously to (8), while $H_{U}$ and $R_{U}$ are analogous to $H_{S}$ and $R_{S}$ . Clearly, given that all the $\tilde{p}$ densities are Gaussian, all finite dimensional distributions will also be Gaussian. We have constructed a Gaussian process with the following cross-covariance function for any two locations $ℓ_{1}, ℓ_{2} \in D$

Cov_{\tilde{p}} (w (ℓ_{1}), w (ℓ_{2})) = \begin{cases} {\tilde{C}}_{s_{i}, s_{j}} & \text{if } ℓ_{1} = s_{i}, ℓ_{2} = s_{j} \text{ and } s_{i}, s_{j} \in S \\ H_{ℓ_{1}} {\tilde{C}}_{S_{[ℓ_{1}]}, s_{j}} & \text{if } ℓ_{1} \in D \setminus S, ℓ_{2} = s_{j} \text{ and } s_{j} \in S \\ δ_{(ℓ_{1} = ℓ_{2})} R_{ℓ_{1}} + H_{ℓ_{1}} {\tilde{C}}_{S_{[ℓ_{1}]}, S_{[ℓ_{2}]}} H_{ℓ_{2}}^{⊤} & \text{otherwise.} \end{cases}

For a given base Gaussian covariance function C, domain partitioning $P$ , mesh $G$ , and reference set $S$ , we denote the corresponding meshed Gaussian process as MGP( $G$ , $P$ , $S$ ,C).
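The construction of the reference-set precision matrix from w = Hw + ω can be sketched as follows. This is a small dense illustration rather than a scalable implementation: it assumes q = 1, a hypothetical base covariance C_ij = 1/(1 + |i − j|) on six locations, three blocks of two locations each, and a chain-shaped mesh in which block j has block j − 1 as its only parent. The function name is illustrative.

```python
import numpy as np

def mgp_reference_precision(C, blocks, parents):
    """Build H, R and the precision (I - H)^T R^{-1} (I - H) at the reference set."""
    n = C.shape[0]
    H = np.zeros((n, n))
    R = np.zeros((n, n))
    for j, idx in enumerate(blocks):
        pa = [k for p in parents[j] for k in blocks[p]]   # locations in S_[j]
        C_jj = C[np.ix_(idx, idx)]
        if pa:
            C_jp = C[np.ix_(idx, pa)]
            # H_j = C_{j,[j]} C_{[j]}^{-1}, obtained by a solve instead of an explicit inverse
            H_j = np.linalg.solve(C[np.ix_(pa, pa)], C_jp.T).T
            H[np.ix_(idx, pa)] = H_j
            R[np.ix_(idx, idx)] = C_jj - H_j @ C_jp.T
        else:
            R[np.ix_(idx, idx)] = C_jj
    A = np.eye(n) - H
    precision = A.T @ np.linalg.solve(R, A)
    return H, R, precision

idx = np.arange(6)
C = 1.0 / (1.0 + np.abs(idx[:, None] - idx[None, :]))
blocks = [[0, 1], [2, 3], [4, 5]]
parents = {0: [], 1: [0], 2: [1]}

H, R, precision = mgp_reference_precision(C, blocks, parents)
C_tilde = np.linalg.inv(precision)   # covariance implied by the mesh
```

In this example the block of `precision` linking the first and third groups is zero, since neither is a parent of the other and they share no child, while the within-block covariances of `C_tilde` agree with those of the base covariance.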

3.1. Bayesian hierarchical model and Gibbs sampler

Meshed GPs produce block-sparse precision matrices that are constructed cheaply from their block-sparse Cholesky factors by solving small linear systems. General purpose sparse-Cholesky algorithms (Davis, 2006; Chen et al., 2008) can then be used to obtain collapsed samplers as in Finley et al. (2019). Unfortunately, these algorithms can only be used on Gaussian first stage models and are computationally impracticable for data in the millions. Hence, we develop a more general scalable Gibbs sampler for the recovery of spatial random effects in hierarchical MGP models that entirely circumvents large matrix computations.

Consider a multivariate spatiotemporally-varying regression model at $ℓ \in D \subset ℜ^{d + 1}$ ,

y (ℓ) = X {(ℓ)}^{⊤} β + Z {(ℓ)}^{⊤} w (ℓ) + ε (ℓ),

(11)

where $y (ℓ) \in ℜ^{l}$ is the multivariate point-referenced outcome, $X {(ℓ)}^{⊤} = blockdiag \{x_{i} {(ℓ)}^{⊤}\}_{i = 1}^{l}$ is an l × p matrix, with p = ∑p_i, of spatially referenced predictors linked to constant coefficients β, w(ℓ) is the spatial process, Z(ℓ)^⊤ is an l × q design matrix, and ε(ℓ) is measurement error such that $ε (ℓ) \overset{iid}{\sim} N (0, D)$ and $D = diag (τ_{1}^{2}, \dots, τ_{l}^{2})$. A simple univariate regression model with a spatially-varying intercept can be obtained with l = 1, Z(ℓ) = 1. For observed locations $T = \{ℓ_{1}, \dots, ℓ_{n}\}$, we write the above model compactly as y = Xβ + Zw + ε, where y = (y(ℓ₁)^⊤, …, y(ℓ_n)^⊤)^⊤, w and ε are similarly defined, X = [X(ℓ₁) : ⋯ : X(ℓ_n)]^⊤, $Z = blockdiag (\{Z {(ℓ_{i})}^{⊤}\}_{i = 1}^{n})$, and $D_{n} = blockdiag (\{D\}_{i = 1}^{n})$.

For subsets $\{ℓ_{1}, \dots, ℓ_{n_{A}}\} = A \subset T$, let $y (A) = {(y {(ℓ_{1})}^{⊤}, \dots, y {(ℓ_{n_{A}})}^{⊤})}^{⊤}$, with analogous definitions for $w (A)$ and $ε (A)$, $X (A) = {[X (ℓ_{1}) : \dots : X (ℓ_{n_{A}})]}^{⊤}$, $Z_{A} = blockdiag (\{Z {(ℓ_{i})}^{⊤}\}_{i = 1}^{n_{A}})$ and $D_{A} = blockdiag (\{D\}_{i = 1}^{n_{A}})$. After fixing a reference set $S$, we obtain $S^{*} = T \cap S$ and $U = T \setminus S$. We partition the domain as above to obtain $S_{j}$, $S_{j}^{*}$, $U_{j}$ for j = 1, …, M and model w(ℓ) using the MGP, which yields $w_{S} \sim N (0, {\tilde{C}}_{S})$. We complete the model specification by assigning β ~ N(β | μ_β, Σ_β), $τ_{j}^{2} \sim \text{Inv.Gamma} (τ_{j}^{2} ∣ a_{τ_{j}}, b_{τ_{j}})$, θ ~ p(θ).

The resulting full conditional distribution for β is $N (Σ_{β}^{*} μ_{β}^{*}, Σ_{β}^{*})$, where $Σ_{β}^{*} = {(Σ_{β}^{- 1} + X^{⊤} D_{n}^{- 1} X)}^{- 1}$ and $μ_{β}^{*} = Σ_{β}^{- 1} μ_{β} + X^{⊤} D_{n}^{- 1} (y - Z w)$. For $τ_{r}^{2}$, r = 1, …, q, the full conditional is Inverse-Gamma with parameters $a_{τ_{r}} + n / 2$ and $b_{τ_{r}} + \frac{1}{2} E_{r}^{⊤} E_{r}$, where E_r = y_·r − X_·rβ − Z_·rw and y_·r, X_·r, Z_·r are the subsets of y, X, Z corresponding to outcome r (out of q).
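The two conjugate updates above translate directly into code. The sketch below assumes a univariate outcome, so that D_n = τ²I_n, and takes the current value of Zw as an input vector; the function names and the simulated data are illustrative only.

```python
import numpy as np

def sample_beta(y, X, Zw, tau2, mu_beta, Sigma_beta, rng):
    """One draw from N(Sigma* mu*, Sigma*), assuming D_n = tau2 * I."""
    prior_prec = np.linalg.inv(Sigma_beta)
    post_prec = prior_prec + X.T @ X / tau2                 # (Sigma*)^{-1}
    mu_star = prior_prec @ mu_beta + X.T @ (y - Zw) / tau2
    mean = np.linalg.solve(post_prec, mu_star)              # Sigma* mu*
    L = np.linalg.cholesky(post_prec)                       # post_prec = L L^T
    # If z ~ N(0, I), then L^{-T} z has covariance (L L^T)^{-1} = Sigma*.
    draw = mean + np.linalg.solve(L.T, rng.standard_normal(len(mean)))
    return draw, mean

def sample_tau2(resid, a, b, rng):
    """One draw from Inverse-Gamma(a + n/2, b + resid'resid / 2)."""
    shape = a + resid.size / 2.0
    rate = b + 0.5 * resid @ resid
    return 1.0 / rng.gamma(shape, 1.0 / rate)               # reciprocal of a Gamma draw

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
beta_true = np.array([1.0, -2.0])
Zw = rng.standard_normal(200)
y = X @ beta_true + Zw                                      # noise-free, for checking only

beta_draw, beta_mean = sample_beta(y, X, Zw, 0.5, np.zeros(2), 1e6 * np.eye(2), rng)
tau2_draw = sample_tau2(y - X @ beta_draw - Zw, a=2.0, b=1.0, rng=rng)
```

With a vague prior and noise-free data the conditional mean `beta_mean` coincides with the generating coefficients up to the negligible shrinkage from the prior.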

The Gibbs update of the $w_{U}$ components can proceed simultaneously as all blocks in $U$ have no children and their parents are in $S$ . The full conditional for $w_{U_{j}}$ for j = 1,…,M is thus $N (Σ_{U_{j}}^{*} μ_{U_{j}}^{*}, Σ_{U_{j}}^{*})$ where $Σ_{U_{j}}^{*} = {(Z (U_{j}) D^{- 1} Z {(U_{j})}^{⊤} + R_{U_{j}}^{- 1})}^{- 1}$ and $μ_{U_{j}}^{*} = Z (U_{j}) D^{- 1} (y (U_{j}) - X {(U_{j})}^{⊤} β) + R_{U_{j}}^{- 1} H_{U_{j}} w_{[b_{j}]}$ , where $w_{[b_{j}]}$ is the spatial process at locations corresponding to the parents of b_j ∈ B ⊂ V.

We update $w_{S_{j}} = w_{j}$ for j = 1,…,M via its full conditional $N (Σ_{j}^{*} μ_{j}^{*}, Σ_{j}^{*})$. Let $1_{j} = {(\mathbb{1} (s_{1} \in S_{j}^{*}), \dots, \mathbb{1} (s_{n_{j}} \in S_{j}^{*}))}^{⊤}$ be the vector of indicators that identify locations with non-missing outputs, and let a_j ∈ V be the node in $G$ corresponding to $S_{j}$. Then,

\begin{aligned}
Σ_{j}^{* - 1} &= Z_{j}^{⊤} {\tilde{D}}_{n_{j}}^{- 1} Z_{j} + R_{j}^{- 1} + \sum_{i = 1}^{| Ch [a_{j}] |} H_{i}^{[j] ⊤} R_{i}^{[j] - 1} H_{i}^{[j]} \\
μ_{j}^{*} &= R_{j}^{- 1} H_{j} w_{[j]} + Z_{j}^{⊤} {\tilde{D}}_{n_{j}}^{- 1} {\tilde{y}}_{j} + \sum_{i = 1}^{| Ch [a_{j}] |} H_{i}^{[j] ⊤} R_{i}^{[j] - 1} w_{i}^{[j]},
\end{aligned}

(12)

where ${\tilde{D}}_{n_{j}}^{- 1} = I_{j} ⊙ D_{n_{j}}^{- 1}$ with $I_{j} = 1_{j} 1_{j}^{⊤}$ , and ${\tilde{y}}_{j} = 1_{j} ⊙ (y_{j} - X_{j} β)$ and ⊙ denotes the Hadamard or Schur (element-by-element) product. Finally, θ is updated via a Metropolis step with target density $p (θ) N (w_{S} ∣ 0, {\tilde{C}}_{S}) N (w_{U} ∣ H_{U} w_{S}, R_{U})$ using (8) and (10). The Gibbs sampling algorithm will iterate across the above steps and, upon convergence, will produce samples from $p (β, {τ_{j}^{2}}_{j = 1}^{q}, w ∣ y)$ .

We obtain posterior predictive inference at arbitrary $ℓ \in D$ by evaluating p(y(ℓ) | y). If $ℓ \in S \cup U$ , then we draw one sample of y(ℓ) ~ N(X(ℓ)^⊤β + Z(ℓ)^⊤w(ℓ), D) for each draw of the parameters from $p (β, {τ_{j}^{2}}_{j = 1}^{q}, w ∣ y)$ . Otherwise, considering that $ℓ \in D_{j}$ for some j and thus η(ℓ) = b_j, with parent nodes Pa[b_j] and children Ch[b_j] = ∅, we sample w(ℓ) from the full conditional $N (Σ_{ℓ}^{*} μ_{ℓ}^{*}, Σ_{ℓ}^{*})$ , where $Σ_{ℓ}^{*} = {(Z (ℓ) D^{- 1} Z {(ℓ)}^{⊤} + R_{ℓ}^{- 1})}^{- 1}$ and $μ_{ℓ}^{*} = Z (ℓ) D^{- 1} (y (ℓ) - X {(ℓ)}^{⊤} β) + R_{ℓ}^{- 1} H_{ℓ} w_{[b_{j}]}$ , then draw y(ℓ) ~ N(X(ℓ)^⊤β + Z(ℓ)^⊤w(ℓ), D).

3.2. Non-separable multivariate spatiotemporal covariances

We provide an account of the computational cost of general MGPs as a starting point to motivate the introduction of more efficient tessellated MGPs, and specifically Q-MGPs, in Section 4. We consider (11) and take l = 1 to simplify our exposition. In the resulting model, β is the regression coefficient on the p point-referenced regressors with a static effect on the outcome, whereas the q-variate spatiotemporal process w(·) captures the dynamic effect of the Z regressors. Typically in geostatistical modeling p and q are small, hence sampling β and τ² carries a negligible computational cost. The cost of each Gibbs iteration is dominated by updates of θ and w. Let us assume, solely for expository purposes, that each of the M blocks comprises the same number of locations, i.e. $| S_{j} | = | U_{j} | = m$ , for all j = 1,…,M, so that $m = \frac{n}{2 M}$ . We also assume that the graph nodes have J or fewer parents and L or fewer children.

The evaluation of $N (w_{S} ∣ 0, {\tilde{C}}_{S}) = \prod_{j = 1}^{M} N (w_{j} ∣ H_{j} w_{[j]}, R_{j})$ and $N (w_{U} ∣ H_{U} w_{S}, R_{U}) = \prod_{j = 1}^{M} N (w_{U_{j}} ∣ H_{U_{j}} w_{[b_{j}]}, R_{U_{j}})$ dominates the computation. Each term in the product entails $R_{j}^{- 1}$ and $R_{U_{j}}^{- 1}$ , both of size qm×qm, and their determinants. These require $C_{[j]}^{- 1}$ of size Jqm×Jqm or less, resulting in $O (2 M (q^{3} m^{3} + J^{3} q^{3} m^{3})) = O (2 M q^{3} m^{3} (J^{3} + 1)) \approx O (2 M q^{3} m^{3} J^{3}) = O (\frac{n^{3} q^{3} J^{3}}{M^{2}})$ flops via Cholesky decomposition. Reasonably, J and m are fixed so M may grow linearly with sample size and the cost is O(nq³J³) considering M ∝ n. The total computing time is $~ O (\frac{n q^{3} J^{3}}{K})$ with K processors for computing the 2M densities. Sampling $w_{S}$ and $w_{U}$ from their full conditional distributions requires O(2Mq³m³ +MLq²m² +Mq²m²) flops, assuming $R_{j}^{- 1}$ and $R_{U_{j}}^{- 1}$ are stored in the previous step. The first term in the complexity order is due to the Cholesky decomposition of covariance matrices, the second is due to sampling the reference nodes, and the third comes from sampling other nodes. Without further assumptions, parallelization reduces complexity to $O (\frac{2 M q^{3} m^{3}}{K} + \frac{M q^{2} m^{2}}{K} + M L q^{2} m^{2})$ , since the covariances can be computed beforehand and the M components of $w_{U}$ are independent given $w_{S}$ . With fixed block size m, the overall complexity for a Gibbs iteration is $O (\frac{2}{K} M q^{3} m^{3} (J^{3} + 1) + \frac{1}{K} 2 M q^{3} m^{3} + \frac{1}{K} M q^{2} m^{2} + M L q^{2} m^{2}) \approx O (\frac{1}{K} J^{3} q^{3} n + q^{2} n) \approx O (n)$ , linear in the sample size and cubic in J, highlighting the computational speedup of sparse graphs (J small), the negative impact of large q, and the serial sampling of $w_{S}$ .
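The linear scaling can be made concrete by evaluating the leading-order flop counts quoted above. This is a minimal sketch with hypothetical function names and sizes; constants and lower-order terms are ignored, exactly as in the O(·) expressions.

```python
def density_flops(n, q, J, M):
    """Leading-order flops for the 2M densities: 2 M q^3 m^3 (J^3 + 1), m = n / (2M)."""
    m = n / (2 * M)
    return 2 * M * q**3 * m**3 * (J**3 + 1)

def sampling_flops(n, q, L, M):
    """Leading-order flops for sampling w_S and w_U:
    2 M q^3 m^3 + M L q^2 m^2 + M q^2 m^2, with m = n / (2M)."""
    m = n / (2 * M)
    return 2 * M * q**3 * m**3 + M * L * q**2 * m**2 + M * q**2 * m**2

# Hypothetical sizes: the block size m = 50 is held fixed while n grows.
for n in (100_000, 200_000, 400_000):
    M = n // 100
    print(n, density_flops(n, q=2, J=4, M=M), sampling_flops(n, q=2, L=4, M=M))
```

With m fixed, doubling n doubles M and therefore doubles both counts, which is the O(n) behavior described in the text.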

In terms of storage, H_j and R_j correspond to a storage requirement of O(4Mq²m²) = O(q²n). The matrix Z of size qn×qn can be represented as a list of 2M block-diagonal (hence sparse) Z_j matrices. Furthermore, computing Zw (dimension n×1) can be vectorized as the row-wise sum of Z* ⊙ w* where Z* and w* are n×q matrices with jth column representing the jth space-time varying predictor. The cost of storing Z is thus O(2qn).
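The vectorized product can be sketched as follows. The layout is an assumption made for illustration: the sparse form of Z is built with one row per location, and w is stacked location by location. The array names `Z_star` and `w_star` mirror Z* and w* in the text.

```python
import numpy as np

rng = np.random.default_rng(2)
n, q = 5, 3
Z_star = rng.normal(size=(n, q))  # row i holds the q predictors at location i
w_star = rng.normal(size=(n, q))  # row i holds the q latent effects at location i

# Vectorized form: row-wise sum of the Hadamard product.
zw_fast = np.sum(Z_star * w_star, axis=1)

# Reference form: Z with one row per location and q nonzeros per row,
# multiplied by the stacked vector w of length q * n.
Z_dense = np.zeros((n, q * n))
for i in range(n):
    Z_dense[i, i * q:(i + 1) * q] = Z_star[i]
w_stacked = w_star.reshape(-1)    # location-major stacking
zw_dense = Z_dense @ w_stacked
```

Only the two n×q arrays need to be stored in the vectorized form, in line with the O(2qn) storage figure above.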

Complexity is further reduced by considering a graph with small J or a finer partition resulting in large M and small m, whereas the overall time can be reduced by distributing computations on K processors. Possible choices for $G$ include nearest-neighbor graphs and multiresolution trees. In settings with large q, adjusting J and M may be insufficient to reduce the computational burden. Covariance functions that are separable in the variables (but perhaps non-separable in space and time) bring the cost of Cholesky factorizations of Jqm × Jqm matrices from O(J³q³m³) to O(J³m³ + q³) because $C^{- 1} = {(C_{h, u} \otimes C_{v})}^{- 1} = C_{h, u}^{- 1} \otimes C_{v}^{- 1}$ , where C_h,u is the Jm×Jm space-time component of the cross-covariance, and C_v the q × q variable component. Savings accrue when evaluating the likelihood and in sampling from the full conditionals, at the cost of realism in describing the spatial process.
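The Kronecker identity behind this saving is easy to verify numerically. In the sketch below the two matrices are random positive definite stand-ins (hypothetical sizes) for the space-time component C_h,u and the variable component C_v.

```python
import numpy as np

def random_spd(k, rng):
    """Return a random k x k symmetric positive definite matrix."""
    a = rng.normal(size=(k, k))
    return a @ a.T + k * np.eye(k)

rng = np.random.default_rng(3)
C_hu = random_spd(6, rng)  # stands in for the Jm x Jm space-time component
C_v = random_spd(2, rng)   # stands in for the q x q variable component

# One inversion of the full Jqm x Jqm matrix.
full_inverse = np.linalg.inv(np.kron(C_hu, C_v))
# Two small inversions followed by a Kronecker product.
kron_inverse = np.kron(np.linalg.inv(C_hu), np.linalg.inv(C_v))
```

In practice the Kronecker product of the inverses need not be formed explicitly; the two small factors can be stored and applied separately.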

The next section develops a novel MGP design based on domain tessellations or tiling – i.e. partitions of the domain forming repeated patterns – to which we associate similarly patterned meshes. If observations are also located in patterns, the bulk of the largest linear solvers will be redundant, resulting in a significant reduction in computational time. In either scenario, sampling $w_{S}$ will also proceed in parallel with improved mixing.

4. MGPs based on domain tessellation or tiling

We construct MGPs based on a tessellation or tiling of the domain. For spatial domains (d = 2, Figure 2), regular tiling results in triangular, quadratic, or hexagonal tessellations; mixed designs are also possible. These partition schemes can be linked to a DAG $G$ by drawing directed outgoing edges starting from an originating node/tile. The same fixed pattern can be repeated over a surface of any size. In dimensions d > 2, which may include time, space-filling tessellations or honeycombs can be constructed analogously, along with their corresponding meshes. Constructing MGPs based on these ideas simply requires partitioning the locations $S$ into subsets based on the chosen tessellation.

This subclass of MGP models corresponds by design to graphs with known coloring, with each color linked to a subgraph conditionally independent of all nodes of other colors, regardless of the dimension of the domain. This feature enables large-scale parallel sampling of $w_{S}$ and improves mixing without the need to implement heuristic graph-coloring algorithms. Furthermore, regions in a tessellated domain are typically translations and/or rotations of a single geometric shape. By carefully choosing $S$ , it is possible to avoid computing the bulk of linear solvers, resulting in substantial computational gains. Subsequently, we focus on axis-parallel partitioning (quadratic or cubic tessellation) and cubic meshes, but analogous constructions and the same properties hold with other tessellation schemes.

A cubic MGP (Q-MGP) is constructed by partitioning each coordinate axis into intervals. In d + 1 dimensions, splitting each axis into L intervals results in $L^{d + 1}$ regions. Consider a spatiotemporal domain $D = \times_{r = 1}^{d + 1} D^{(r)}$ , where $D^{(d + 1)}$ is the time dimension. We partition each coordinate axis into L_r disjoint sets: $D^{(r)} = I_{r, 1} \cup \dots \cup I_{r, L_{r}}$ , where $I_{r, j} \cap I_{r, k} = \emptyset$ if j ≠ k and $I_{r, s}$ denotes the sth interval in the rth coordinate axis. Solely for exposition, and without loss of generality, assume that $D^{(r)} = I = [0, 1]$ and L_r = L for r = 1,…,d + 1. Any location $ℓ = (ℓ_{1}, \dots, ℓ_{d + 1}) \in D$ will be such that $ℓ \in I_{1, i_{1}} \times \dots \times I_{d + 1, i_{d + 1}} = D_{j}$ for some $i_{1}, \dots, i_{d + 1}$ and with j = 1,…,M, where $M = L^{d + 1}$ . We refer to this axis-parallel partition scheme as a cubic tessellation and denote it by $T = {I_{r, s}}_{r = 1, \dots, d + 1}^{s = 1, \dots, L}$ . We use T to partition the reference set $S$ as $S_{j} = D_{j} \cap S$ for $j = 1, \dots, L^{d + 1}$ .

Next, we define $η (ℓ) = (η_{1} (ℓ), \dots, η_{L} (ℓ)) \in \{1, \dots, L\}^{d + 1}$ , where η_j = η_j(ℓ) = r if $ℓ_{j} \in I_{j, r}$ . Then, let $Q = (V, E)$ be a directed acyclic graph with V = A∪B and reference nodes $A = {a_{1}, \dots, a_{L^{d + 1}}}$ . Therefore, for any $j = 1, \dots, L^{d + 1}$ if $s \in S_{j}$ then η(s) = a_j ∈ A ⊂ V. We write each node v ∈ V as $v = (v_{η_{1}}, \dots, v_{η_{L}}) \in {1, \dots, L}^{d + 1}$ . The directed edges are constructed using a “line-of-sight” strategy. Suppose Pa[v] = {x⁽¹⁾,…,x^(d+1)}. The hth parent of v is defined as $x^{(h)} = (a_{η_{1}}, \dots, a_{η_{h}} - k, \dots, a_{η_{L}}) \cap {1, \dots, d + 1}^{d + 1}$ , where k ≥ 1 is the smallest integer such that x^(h) ∈ A. Consequently x^(h) = ∅ if a_h = 1. Thus, the parents of node v = η(ℓ) are the ones that precede it along each of the d + 1 coordinates. If $ℓ \in D_{j} \setminus S_{j}$ , then η(ℓ) = b_j ∈ B and Pa[b_j] = {a_j} ∪ Pa[a_j] where a_j ∈ A is a reference node. To avoid Pa[b_j] = ∅ we set $Pa [b_{j}] = {x_{1}^{(1)}, x_{2}^{(1)}, \dots, x_{1}^{(d + 1)}, x_{2}^{(d + 1)}}$ . The two parents along the hth dimension are $x_{1}^{(h)} = a_{η_{h}} + k_{1}, x_{2}^{(h)} = a_{η_{h}} - k_{2}$ where k_i is the smallest positive integer such that $x_{i}^{(h)} \in A$ , i = 1,2. In this setting J = 2(d + 1). The construction is finalized by fixing the cross-covariance function C(ℓ, ℓ′); Figure 3 shows that the same basic structure can be immediately extended to higher dimensions, including time.
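The partition map and the line-of-sight parents of reference nodes can be sketched in a few lines. Two simplifying assumptions are made here and are not part of the text: the L intervals on each axis have equal width, and every region contains reference locations, so that the smallest admissible step is always k = 1. The function names are hypothetical.

```python
import math

def region_index(loc, L):
    """Return eta(loc): the 1-based interval index of each coordinate of loc in [0, 1]^(d+1)."""
    return tuple(min(int(math.floor(x * L)) + 1, L) for x in loc)

def reference_parents(v):
    """Parents of reference node v: the preceding node along each coordinate (k = 1)."""
    parents = []
    for h, coord in enumerate(v):
        if coord > 1:  # no parent along axis h on the lower boundary
            parents.append(v[:h] + (coord - 1,) + v[h + 1:])
    return parents

node = region_index((0.05, 0.95, 1.0), L=10)  # a location in d + 1 = 3 dimensions
print(node, reference_parents(node))
```

A node in the interior of the mesh has d + 1 parents under these assumptions, one per coordinate axis, and a node at the origin corner has none.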

Figure 3: Q-MGP meshes used for spatial data on d = 2 (left) can be extended for use on spatiotemporal data d = 3 (right). Node colors correspond to Gibbs sampler blocks.

4.1. Caching redundant expensive matrix operations

The key computational bottleneck for the Gibbs sampler in Section 3.2 is the calculation, for j = 1,…,2M, of (i) $C_{[j]}^{- 1}$ (2MJ³q³m³ flops) and (ii) $R_{j}^{- 1}$ , $Σ_{j}^{* - 1}$ (4Mq³m³ flops). The former is costlier than the latter by a factor of J³/2. Q-MGPs are designed to greatly reduce this cost. We start with an axis-parallel tessellation of the domain in equally-sized regions $D_{1}, \dots, D_{M}$ , storing observed locations in $U$ to create $U_{1}, \dots, U_{M}$ , which we assume, for simplicity, to be no larger than m in size. A stationary base covariance function C satisfies $C (L_{1}, L_{2}) = C (L_{1} + h, L_{2} + h)$ , where $h \in ℜ^{d + 1}$ is used to shift all locations in the sets. Recall that the reference set $S$ of MGPs can include unobserved locations. Hence, we can build $S$ on a lattice of regularly spaced locations. Since domain partitions have the same size, we have $S_{j} = S^{*} + h_{j}$ for j = 1,…,M, where $S^{*}$ is a single “prototype set” from which one can locate all other reference subsets. Also, since Pa[a_j] ⊂ Pa[b_j], there will be 4(d + 1) prototype sets for parents, i.e. $S_{Pa [v_{j}]} = S_{r}^{*} + h_{j}$ for some r ∈ {1,…,4(d + 1)} and j = 1,…,2M. Then, we can build maps $ξ_{S} : {1, \dots, M} \to {1, \dots, 4 (d + 1)}$ and $ξ_{U} : {1, \dots, M} \to {1, \dots, 4 (d + 1)}$ linking each of $S_{j}$ and $U_{j}$ to a parent prototype. This ensures that $C_{[j]}^{- 1} = C_{S_{r}^{*}}^{- 1}$ for each j = 1,…,2M. One only needs to build the maps $ξ_{S}$ and $ξ_{U}$ , cache the unique inverses (one per prototype), and reuse them. The same method applies to cache $R_{S_{j}}^{- 1} = R_{S_{r}^{*}}^{- 1}$ on reference sets, but not on other locations since no redundancy arises in $C_{U_{j}}$ for j = 1,…,M. See Figure 4 for an illustration. Compared to general MGPs (see Table 1), the number of large linear system solvers is now constant in the sample size; since (d + 1) ≪ M, this significantly reduces computational cost.
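The redundancy exploited by caching follows from translation invariance and can be illustrated with a small sketch. The exponential covariance, the 3 × 3 prototype lattice, the 4 × 4 layout of regions, and the dictionary-based cache are all illustrative assumptions; they are not the covariance in (13) or the authors' implementation.

```python
import numpy as np

def exp_cov(a, b, phi=1.0):
    """Stationary exponential covariance between location sets a (k, d) and b (l, d)."""
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-phi * dist)

# Prototype set S*: a 3 x 3 lattice inside one region of side 1.
g = np.linspace(0.0, 1.0, 3, endpoint=False)
prototype = np.array([(x, y) for x in g for y in g])

cache = {}

def cached_inverse(prototype_id, locations):
    """Invert the covariance once per prototype and reuse it afterwards."""
    if prototype_id not in cache:
        cache[prototype_id] = np.linalg.inv(exp_cov(locations, locations))
    return cache[prototype_id]

# Sixteen regions, each a translation S* + h_j of the same prototype.
shifts = [np.array([i, j], dtype=float) for i in range(4) for j in range(4)]
inverses = [cached_inverse(0, prototype + h) for h in shifts]
```

All sixteen regions map to the same prototype identifier here, so a single inversion is performed; in a Q-MGP the maps $ξ_{S}$ and $ξ_{U}$ play the role of the identifier.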

Figure 4: Visualizing redundancies: a spatial domain is partitioned in M = 25 regions and linked to a quadratic mesh. The reference set $S$ is fixed on a regular grid, with m = 9. Parent locations of the orange (resp. purple) are in green (resp. blue). Using a stationary covariance, C_blue,blue = C_green,green. Therefore only one inversion is necessary; this can be replicated at no cost across 9 of the 16 regions.

Table 1:

Summary of computational cost of general MGPs and Q-MGPs. Rows are sorted from most expensive (top) to least expensive (bottom).

	$C_{[j], [j]}^{- 1}$	$R_{S_{j}}^{- 1}$	$R_{U_{j}}^{- 1}$	$Σ_{S_{j}}^{* - 1}$	$Σ_{U_{j}}^{* - 1}$	Sampling $w_{S}$ , $w_{U}$
MGPs (all cases)	2MJ³q³m³	Mq³m³	Mq³m³	Mq³m³	Mq³m³	MLq²m² + Mq²m²
Q-MGPs
- Irregular locations	4(d + 1)J³q³m³	4(d + 1)q³m³	Mq³m³	Mq³m³	Mq³m³	MLq²m² + Mq²m²
- Pattern lattice w/missing	2M*J³q³m³	2M*q³m³		Mq³m³		MLq²m²
- Lattice w/missing	4(d + 1)J³q³m³	4(d + 1)q³m³		Mq³m³		MLq²m²
- Full lattice and Z(ℓ) = I_q	4(d + 1)J³q³m³	4(d + 1)q³m³		2^{(d + 2)}(d + 1)q³m³		MLq²m²

Furthermore, Q-MGPs automatically adjust to settings where observed locations $T$ are on partly regular lattices, i.e., they are located at patterns repeating in space or time which emerge after initial inspections of the data. Appendix G, available online, outlines a simple algorithm to identify such patterns and create maps $ξ_{S}$ and $ξ_{U}$ . In such cases, we fix $S \supseteq T$ and $U = \emptyset$ . In addition to the above mentioned savings, we now do not have to compute $R_{U_{j}}^{- 1}$ and $Σ_{U_{j}}^{* - 1}$ . If $T$ is not a regular lattice over the whole domain, 4(d + 1) is a lower bound and in general there are M* ≪ M inverses to compute. If $T$ is a fully observed regular lattice and if Z(ℓ) = I (a varying intercept model), then we save in computing the full conditional covariances as well, since all D_j = I. See Appendix C, available online, for details on choosing $S$ and $U$ .

4.2. Improved mixing via parallel sampling

With caching, a much larger proportion of time is spent on sampling; parallelization may in general be achieved via appropriate node coloring algorithms (see e.g. Molloy and Reed, 2002; Gonzalez et al., 2011; Lewis, 2016), but this step is unnecessary in Q-MGPs as the colors in $Q$ are set in advance independently of the data and result in efficient parallel sampling of the latent effects. Reference nodes A of $Q$ are colored to achieve independence conditional on realizations of nodes of all other colors. For example, we partition spatial domains (d = 2) into M₁ × M₂ regions and link each region to a reference node in a quadratic mesh. A “central” reference node v₊ will have two parents and two children, i.e. Pa[v₊] = {v_l, v_b} and Ch[v₊] = {v_r, v_t}, with l, b, r, t respectively denoting left, bottom, right, top – refer to Figure 3 (left). We have Pa[v_t] = {v₊, v_tl} and Pa[v_r] = {v₊, v_br}. The Markov blanket of v₊, denoted as mb(v₊), is the set of neighbors of v₊ in the undirected “moral” graph $Q^{M}$ , hence mb(v₊) = Pa[v₊] ∪ Ch[v₊] ∪ {v_tl, v_br}. The corresponding spatial process is such that $p (w_{+} ∣ w \ w_{+}) = p (w_{+} ∣ w_{mb (v_{+})})$ . Denoting v_bl = Pa[v_l] ∩ Pa[v_b] and v_tr = Ch[v_r] ∩ Ch[v_t], we note that {v_bl, v_tr} ∩ mb(v₊) = ∅. We partition reference nodes A into four groups {A⁽¹⁾,A⁽²⁾,A⁽³⁾,A⁽⁴⁾}, such that {v₊} ⊂ A⁽¹⁾, {v_b, v_t} ⊂ A⁽²⁾, {v_l, v_r} ⊂ A⁽³⁾, and {v_tl,v_tr, v_bl, v_br} ⊂ A⁽⁴⁾. This 3 × 3 pattern is repeated over the whole graph. Then, if v ∈ A^(j), mb(v) ∩ A^(j) = ∅. Denoting by $D$ the other variables in the Gibbs sampler, we get:

p (w_{j} ∣ w_{- j}, D) = p (w_{j} ∣ w_{m b (v_{j})}, D) = \prod_{v_{i} \in A^{(j)}} p (w_{i} ∣ w_{A^{(- j)}}, D) .

Since parallelization is possible within each of the groups, only four serial steps are needed; time savings are due to M/4 typically being orders of magnitude larger than the number of available processors. Extensions to other tessellation schemes and higher dimensional domains and the associated graphs follow analogously.
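The four-group coloring can be verified on a small quadratic mesh. The sketch below is an illustration under stated assumptions: nodes are indexed by (column, row), each node's parents are its left and bottom neighbors, and the color of a node is taken to be the parity of its two indices, which is one way of obtaining the four groups described above. The check confirms that no node shares a color with any member of its Markov blanket.

```python
def parents(node):
    """Left and bottom neighbors of a node in the quadratic mesh."""
    i, j = node
    return {p for p in ((i - 1, j), (i, j - 1)) if p[0] >= 0 and p[1] >= 0}

def markov_blanket(node, nodes):
    """Parents, children, and the children's other parents."""
    children = {v for v in nodes if node in parents(v)}
    co_parents = set().union(*(parents(c) for c in children))
    return (parents(node) | children | co_parents) - {node}

def color(node):
    """Parity of each index gives 2 x 2 = 4 groups."""
    return (node[0] % 2, node[1] % 2)

M1, M2 = 6, 5  # hypothetical number of regions along each axis
nodes = {(i, j) for i in range(M1) for j in range(M2)}
conflicts = [(v, u) for v in nodes for u in markov_blanket(v, nodes)
             if color(u) == color(v)]
groups = {c: sorted(v for v in nodes if color(v) == c)
          for c in {color(v) for v in nodes}}
```

An empty `conflicts` list means that all nodes within a group can be updated simultaneously, leaving four serial steps per sweep over the reference nodes.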

5. Data analysis

Satellite imaging and remote sensing data are nowadays frequently collected in large quantities and processed to be used in geology, ecology, forestry, and other fields, but clouds and atmospheric conditions obstruct aerial views and corrupt the data, creating gaps. Recovery of the underlying signal and quantification of the associated uncertainty are thus the major goals to enable practitioners in the natural sciences to fully exploit these data sources. Several scalable geostatistical models based on Gaussian processes have been implemented on tens or hundreds of thousands of data points, with few exceptions. In considering larger data sizes, one must either have a large time budget (usually several days) or reduce model flexibility and richness. Scalability concerns become the single most important issue in multivariate spatiotemporal settings. In fact, repeated collection of aerial images and multiple spatially-referenced predictors modeled to have a variable effect on the outcome have a multiplicative effect on data size. With no separability assumptions, the dimension of the latent spatial random effects that one may wish to recover will be huge even when each image would be manageable on its own.

The lack of software to implement scalable models for spatiotemporal data makes it difficult to compare our proposed approach with others in these settings. On the other hand, a recent article (Heaton et al., 2019) pits many state-of-the-art models against each other in a spatial (d = 2) prediction contest. On the same data, we show in Appendix E, available online, that Q-MGPs can outperform all competitors in terms of predictive performance and coverage while using a similar computational budget.

5.1. Non-separable multivariate spatiotemporal base covariance

In our analyses, we choose a class of multivariate space-time cross-covariances that models the covariance between variables i and j at the $(h, u) \in ℜ^{d + 1}$ space-time lags as:

C_{i j} (h, u) = \frac{σ^{2}}{{(ψ_{1} (\frac{| u |^{2}}{ψ_{2} (δ_{i j}^{2})}))}^{d / 2} {(ψ_{2} (δ_{i j}^{2}))}^{1 / 2}} ϕ_{1} (\frac{‖ h ‖^{2}}{ψ_{1} (\frac{| u |^{2}}{ψ_{2} (δ_{i j}^{2})})}),

(13)

where δ_ij > 0 (and with δ_ij = δ_ji) is the latent dissimilarity between variables i and j. In the resulting cross-covariance function C(h, u, v) in $ℜ^{d + 1 + k}$ , each component of the q-variate spatial process is represented by a point in a k-dimensional latent space, k ≤ q. Refer to Apanasovich and Genton (2010) for a more in-depth discussion. We set ϕ₁(x) = exp(−cx) and $ψ_{j} (x) = {(a_{j} x^{α_{j}} + 1)}^{β_{j}}$ , j = 1,2; see Gneiting (2002) for alternatives. We also fix $α_{1} = α_{2} = \frac{1}{2}$ , and seek to estimate $θ = (σ^{2}, c, a_{1}, β_{1}, a_{2}, β_{2}, \{δ_{i j}\}_{i < j, j = 1, \dots, q})$ a posteriori. The usual exponential covariance arises in univariate spatial settings.
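For readers who wish to experiment with the cross-covariance, the following is a literal transcription of (13) as printed here, with ϕ₁(x) = exp(−cx) and ψ_j(x) = (a_j x^{α_j} + 1)^{β_j}. The function names and argument order are hypothetical, and the parameter values in the final line are arbitrary. The sketch takes the spatial lag through its norm and allows δ = 0 as a limiting case for the covariance of a variable with itself; both choices are assumptions of this illustration.

```python
import math

def psi(x, a, beta, alpha=0.5):
    """psi_j(x) = (a x^alpha + 1)^beta, with alpha fixed at 1/2 by default."""
    return (a * x**alpha + 1.0) ** beta

def cross_cov(h_norm, u, delta, sigma2, c, a1, beta1, a2, beta2, d=2):
    """Transcription of (13): covariance at spatial lag norm h_norm and time lag u
    between two variables with latent dissimilarity delta."""
    psi2 = psi(delta**2, a2, beta2)
    psi1 = psi(u**2 / psi2, a1, beta1)
    scale = sigma2 / (psi1 ** (d / 2) * math.sqrt(psi2))
    return scale * math.exp(-c * h_norm**2 / psi1)

value = cross_cov(h_norm=0.5, u=1.0, delta=0.3, sigma2=1.0, c=2.0,
                  a1=1.0, beta1=0.5, a2=1.0, beta2=0.5)
```

At zero lags and δ = 0 the function returns σ², and it decreases as either the spatial lag, the temporal lag, or the dissimilarity grows.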

5.2. Synthetic data

We mimic real world satellite imaging data analyzed later in Section 5.3 at a much smaller scale by generating 81 datasets from the model y(ℓ) = Z(ℓ)^⊤w(ℓ) + ε(ℓ), where ε(ℓ) ~ N(0, τ²) with $ℓ \in T$ and $T$ is a regular grid of size 40×40×10, resulting in n_all = 16,000 total locations. We take w ~ GP(0,C) where C is as in (13), ψ₂ ≡ 1 and σ² = 1. We generate one dataset for each combination of τ² ∈ {1/1000,1/20,1/10}, temporal range α ∈ {5,50,500}, space-time separability $β \in {1 / 20, 1 / 2, 1 - \frac{1}{20}}$ , and spatial range c ∈ {1,5,25}.

We compare Q-MGPs with the similarly-targeted Gapfill method of Gerber et al. (2018) as implemented in the R package gapfill. We create “synthetic clouds” of radius $\sqrt{0.1}$ and with center (c_1,t, c_2,t) ∈ [0,1/20]² where $c_{1, t}, c_{2, t} \overset{iid}{\sim} U [0, 1]$ to cover the outcomes at six randomly selected times for each of the 81 datasets. Outcomes at two of the remaining four time periods were then randomly selected to be completely unobserved at all but 10 locations in order to avoid errors from gapfill. Refer to Figure 5 for an illustration.

Figure 5: Artificial cloud covering in synthetic data.

A Q-MGP model with M = 500 was fit by partitioning each spatial axis into 10 intervals and the time axis into 5 intervals. The priors were τ² ~ Inv.G.(2,1), σ² ~ Inv.G.(2,1), β ~ U(0,1), α ~ U(0,10⁴), c ~ U(0,10⁴). We ran 7000 iterations of Gibbs sampling, discarding the first 5000 as burn-in and thinning the remaining 2000 to obtain a posterior sample of size 1000. For each of the 81 datasets we calculate the mean absolute prediction error (MAE) and the root mean squared prediction error (RMSE). Figure 6 compares Gapfill’s 90% intervals with 90% posterior equal-tailed credible intervals for the Q-MGP predictions obtained from 1000 posterior samples. In terms of MAE, the Q-MGP model outperformed Gapfill in all datasets; in terms of RMSE, it outperformed Gapfill in all but one dataset. The average MAE of Q-MGP was 0.4094 against Gapfill’s 0.5366; the average RMSE was 0.5308 against Gapfill’s 0.6820. The Q-MGP also yielded improved coverage of the prediction intervals, although some under-coverage was observed, possibly due to the large M. This comparison may favor Q-MGPs as the data were generated from a Gaussian process. Appendix K, available online, confirms similar findings on non-Gaussian data (a GIF image).
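The comparison metrics are standard and can be sketched as below. The function names are hypothetical and the toy numbers serve only to exercise the functions; they are not results from the paper.

```python
import math

def mae(truth, pred):
    """Mean absolute prediction error."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)

def rmse(truth, pred):
    """Root mean squared prediction error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth))

def coverage(truth, lower, upper):
    """Share of held-out values that fall inside their prediction interval."""
    inside = sum(lo <= t <= hi for t, lo, hi in zip(truth, lower, upper))
    return inside / len(truth)

# Toy held-out values, point predictions, and interval bounds.
truth = [0.0, 0.0, 1.0]
pred = [3.0, 4.0, 1.0]
lower = [-1.0, 1.0, 0.5]
upper = [1.0, 5.0, 1.5]
print(mae(truth, pred), rmse(truth, pred), coverage(truth, lower, upper))
```

For the Q-MGP, the interval bounds would be the 5% and 95% quantiles of the posterior predictive draws at each held-out location, matching the equal-tailed 90% intervals described above.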

Figure 6: Performance of Q-MGP and Gapfill in out-of-sample predictions in 81 spatiotemporal datasets, at the three tested levels of noise variance τ².

5.3. NDVI data from the Serengeti ecosystem

Time series of Normalized Difference Vegetation Index (NDVI) derived from satellite imagery are used to understand spatial patterns in vegetation phenology. For such studies, image pixel-level NDVI values are observed over time to assess seasonal trends in vegetation green-up, growing season length and peak, and senescence. These analyses typically require NDVI values for all pixels over the region and time period of interest. As noted in the beginning of this section, atmospheric conditions, e.g., cloud cover, and sensor malfunction cause missing NDVI pixel values and hence predicting such values, i.e., gap-filling, is of key interest to the remote sensing community. Here, we consider NDVI data derived from the Landsat 8 sensor (which provides a ~30×30 m spatial resolution pixel) taken over Serengeti National Park, Tanzania, Africa. These data were part of a larger study conducted by Desanker et al. (2019) that looked at environmental drivers in vegetation phenology change. The data cover an area of 30×30km and 34 months, and correspond to 64 images of size 1000×1000 collected at 16-day intervals. Data on NDVI are complemented with elevation and soil moisture data, for a total of three spatially referenced predictors. We are thus interested in understanding their varying effect in space and time, in addition to predicting NDVI output at missing locations. We achieve both these goals by implementing model (11), where Z(ℓ) = X(ℓ) includes the intercept and three predictors; their varying effect will be represented by w(ℓ), which we recover by implementing Q-MGP models. Storing posterior samples of the multivariate spatially-varying coefficients for the full data with q = 4 is impossible using our computing resources as each sample would be of size 1000×1000×64×4 = 2.56e+8. For this reason, we consider two feasible setups. Denote by n_all the number of observed and missing locations.
In model (1), we subsample each image to obtain 64 frames of size 500 × 500, and fit a regression model with Z(ℓ) = 1 resulting in a spatially-varying intercept model on n_obs = 12,582,484 observed locations, a total of n_all = 16,000,000 locations for prediction, and a latent spatial random effect w of the same size. The Q-MGP was fit using M = 328,125 space-time regions of size ~48.

The base covariance of (13) becomes a univariate non-separable spatiotemporal covariance as in Gneiting (2002). In model (2), we aim to estimate the varying effect of elevation on NDVI. We subsample each image to obtain 64 frames of size 278 × 278, each covering an area of 25×25km, and take Z(ℓ) = (1 X_elev(ℓ)) resulting in q = 2 and targeting the recovery of latent effects of size 9,892,352. Considering the additional computational burden of the multivariate latent effects, in this case we used M = 156,800, corresponding to smaller space-time regions of average size ~31. In this model there is a single unknown δ_ij in (13) which corresponds to the latent dissimilarity between the intercept and elevation. We thus consider $ψ_{2} = {(a_{2} δ_{i j} + 1)}^{β_{2}}$ as the unknown parameter. We assign priors β_r ~ N(0,100) for r = 1,…,q, σ² ~ Inv.G.(2,1), τ² ~ Inv.G.(2,1), and uniform priors to other covariance parameters (their support is reported in Table 2).
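The problem sizes quoted for the two setups follow from simple arithmetic on the frame dimensions, the number of images, and q. The variable names in this short check are hypothetical.

```python
frames = 64

n_all_1 = 500 * 500 * frames   # model (1): observed plus missing locations
n_all_2 = 278 * 278 * frames   # model (2): observed plus missing locations
latent_2 = n_all_2 * 2         # model (2): q = 2 latent effects per location

region_size_1 = n_all_1 / 328_125  # average locations per region, model (1)
region_size_2 = n_all_2 / 156_800  # average locations per region, model (2)

print(n_all_1, n_all_2, latent_2)
print(round(region_size_1, 1), round(region_size_2, 1))
```

The two averages come out slightly under 49 and slightly under 32 locations per space-time region, in line with the approximate sizes given in the text.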

Table 2:

Posterior summaries of Q-MGP models implemented on the Serengeti data.

	Q-MGP Model (1)	Q-MGP Model (2)
n_all	16,000,000	4,946,176
n_obs	12,755,856	3,961,715
M	328,125	156,800
q	1	2
β_elevation	0.0017_{(0.0014,0.0021)}	0.0415_{(0.0398,0.0432)}
β_topoindex	5.54e-4_{(4.72e-4,6.30e-4)}	−0.0011_{(−0.0012,−0.0008)}
β_accum	−4.84e-4_{(−5.66e-4,−4.02e-4)}	7.88e-4_{(6.94e-4,9.06e-4)}
σ²	0.0585_{(0.0583,0.0587)}	0.0728_{(0.0711,0.0749)}
τ²	1.05e-4_{(1.05e-4,1.05e-4)}	1.27e-4_{(1.21e-4,1.32e-4)}
c ~ U(0, 1e+6)	7.0331_{(7.0146,7.0519)}	3.0710_{(3.0562,3.0846)}
a₁ ~ U(0, 1e+6)	433.98_{(429.67,439.50)}	3857.6_{(3492.6,4154.7)}
β₁ ~ U(0,1)	0.0694_{(0.0690,0.0697)}	0.1058_{(0.1043,0.1080)}
ψ₂ ~ U(0, 1e+6)	–	221.36_{(198.09,240.57)}
95% coverage	94.96	95.66
RMSE	0.0175	0.0253
Time/it. (s)	6.18	4.53
Time (hours)	42.9	31.5

In both cases, approximate posterior samples of the latent random effects and the other unknown parameters were obtained by running the proposed Gibbs sampler for a total of 25,000 iterations. A posterior sample of size 500 was obtained by using the first 22,000 iterations as burn-in, and thinning the remaining 3,000 by a factor of 6. Additional computational details are in Appendix F, available online. Posterior summaries of the unknown parameters for these models are reported in Table 2, along with RMSE in predicting NDVI at 10,000 left-out locations, 95% posterior coverage at those locations, and run times. Both models achieved similar out-of-sample predictive performance and coverage. Figure 7 shows the NDVI predictions of model (2) at one of the 64 time points. This reveals that the varying effect of elevation on NDVI output (see e.g. Figure 8) is credibly different from zero at 42.54% of the space-time locations (95% credible intervals). In particular, it highlights the extent to which higher elevation reduces vegetation. The spatial range is approximately 4km; the time range is about 8 days. The large estimated ψ₂ indicates that the correlation between the two covariates of the latent random process is very small at all spatial and temporal lags. The predicted NDVI and latent spatiotemporal effects are supplied as animated GIF images in the supplement.
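The burn-in and thinning schedule, together with the run times in Table 2, can be checked with a short calculation. The indexing convention (0-based iterations, keeping every sixth draw after burn-in) is an assumption of this sketch.

```python
total, burn_in, thin = 25_000, 22_000, 6

# 0-based indices of the iterations retained after burn-in and thinning.
kept = list(range(burn_in, total, thin))

# Total run time implied by the seconds per iteration reported in Table 2.
hours_model_1 = total * 6.18 / 3600
hours_model_2 = total * 4.53 / 3600

print(len(kept), round(hours_model_1, 1), round(hours_model_2, 1))
```

The retained sample has 500 draws, and the implied totals agree with the hours reported in Table 2 after rounding to one decimal place.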

Figure 7: NDVI predictions from Q-MGP model (2) at time 60 (2016-12-17).

Figure 8: Top: the effect of elevation on NDVI output, evolving over five time periods. Middle left: effect on NDVI not explained by elevation; right: effect on NDVI attributable to elevation. Bottom left: estimated covariance at different space-time lags; right, in blue: locations with credibly nonzero effect of elevation on NDVI output.

6. Discussion

We have developed a class of Bayesian hierarchical models for large spatial and spatiotemporal datasets based on linking domain partitions to directed acyclic graphs. These models can be tailored for specific algorithmic needs, and we have demonstrated the advantages of using a cubic tessellation scheme (Q-MGP) when targeting the efficient recovery of spatial random effects in Bayesian hierarchical models using Gibbs samplers.

When considering alternative computational strategies, the proposed Q-MGP may not be optimal. For example, Gaussian first stage models enable marginalization of the latent spatial effects; posterior sampling of unknown covariance parameters via MCMC is then typically associated with better mixing. Future research may thus focus on identifying “optimal” DAGs for collapsed samplers. Furthermore, the blocked conditional independence structure of Q-MGPs may be suboptimal as it corresponds to possibly restrictive conditional independence assumptions in neighboring locations. While we have not focused on the effect of different tessellations or partitioning choices in this article, alternative tessellation schemes (e.g. hexagonal) may be associated with less stringent assumptions and possibly better performance, while retaining all the desirable features of Q-MGP.

Other natural extensions to high-dimensional spatiotemporal statistics include settings where there are a very large number of spatiotemporal outcomes in addition to a large number of spatial and temporal locations. Here there are a few different avenues. One approach is in the same spirit as the joint modeling pursued here, but instead of modeling the cross-covariance functions explicitly, as has been done here, one pursues dimension reduction using factor models (see, e.g., Christensen and Amemiya, 2003; Lopes et al., 2008; Ren and Banerjee, 2013; Taylor-Rodriguez et al., 2019). The aforementioned references model the latent factors using spatial processes, some of which rely on scalable low-rank predictive processes or the NNGP. We believe that modeling latent factors using spatiotemporal MGPs will impart some of the computational and inferential benefits seen here. However, this will need further development, especially with regard to identifiability of loading matrices (Lopes et al., 2008; Ren and Banerjee, 2013) and process parameters.

A different approach to multivariate spatial modeling has relied upon conditional or hierarchical specifications. This has been well articulated in the text by Cressie and Wikle (2011); see also Royle and Berliner (1999) and the recent developments in Cressie and Zammit-Mangion (2016). An advantage of the hierarchical approach is that the multivariate processes are valid stochastic processes, essentially by construction and without requiring spectral representations, and can also impart considerable computational benefits. It will be interesting to extend the ideas in Cressie and Zammit-Mangion (2016) to augmented spaces of DAGs to further introduce conditional independence, and therefore sparsity, in MGP models with high-dimensional outcomes.

Finally, it is worth pointing out that alternate computational algorithms, particularly tuned for high-dimensional Bayesian models, should also be explored. Recent developments on algorithms based upon classes of piecewise deterministic Markov processes (see e.g., Fearnhead et al., 2018; Bierkens et al., 2019, and references therein) that avoid Gibbs samplers and even reversible MCMC algorithms are being shown to be increasingly effective for high-dimensional Bayesian inference. Adapting such algorithms to MGP and Q-MGP for scalable Bayesian spatial process models will constitute natural extensions of our current offering.

Acknowledgements

Banerjee was supported by the NSF grants DMS-1513654, IIS-1562303, and DMS-1916349; and by the National Institute of Health grants NIEHS-R01ES027027 and NIEHS-R01ES030210. Finley and Peruzzi were supported by National Science Foundation (NSF) EF-1253225 and DMS-1916395, and National Aeronautics and Space Administration’s Carbon Monitoring System project. Peruzzi was supported in part by 1R01ES028804 of the National Institute of Environmental Health Sciences of the National Institutes of Health and European Union project 856506.

References

Apanasovich TV and Genton MG (2010). Cross-covariance functions for multivariate random fields based on latent dimensions. Biometrika, 97:15–30. doi: 10.1093/biomet/asp078. [DOI] [Google Scholar]
Banerjee S (2017). High-dimensional Bayesian geostatistics. Bayesian Analysis, 12(2):583–614. doi: 10.1214/17-BA1056R. [DOI] [PMC free article] [PubMed] [Google Scholar]
Banerjee S (2020). Modeling massive spatial datasets using a conjugate Bayesian linear modeling framework. Spatial Statistics, in press. doi: 10.1016/j.spasta.2020.100417. [DOI] [PMC free article] [PubMed] [Google Scholar]
Banerjee S, Finley AO, Waldmann P, and Ericsson T (2010). Hierarchical spatial process models for multiple traits in large genetic trials. Journal of American Statistical Association, 105(490):506–521. doi: 10.1198/jasa.2009.ap09068. [DOI] [PMC free article] [PubMed] [Google Scholar]
Banerjee S, Gelfand AE, Finley AO, and Sang H (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society, Series B, 70:825–848. doi: 10.1111/j.1467-9868.2008.00663.x.
Bierkens J, Fearnhead P, and Roberts G (2019). The zig-zag process and super-efficient sampling for Bayesian analysis of big data. The Annals of Statistics, 47(3):1288–1320. doi: 10.1214/18-AOS1715.
Chen Y, Davis TA, Hager WW, and Rajamanickam S (2008). Algorithm 887: CHOLMOD, Supernodal Sparse Cholesky Factorization and Update/Downdate. ACM Transactions on Mathematical Software, 35(3). doi: 10.1145/1391989.1391995.
Christensen WF and Amemiya Y (2003). Modeling and prediction for multivariate spatial factor analysis. Journal of Statistical Planning and Inference, 115(2):543–564. doi: 10.1016/S0378-3758(02)00173-8.
Cover TM and Thomas JA (1991). Elements of information theory. Wiley Series in Telecommunications and Signal Processing. Wiley Interscience.
Cressie N and Johannesson G (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society, Series B, 70:209–226. doi: 10.1111/j.1467-9868.2007.00633.x.
Cressie N and Zammit-Mangion A (2016). Multivariate spatial covariance models: a conditional approach. Biometrika, 103(4):915–935. doi: 10.1093/biomet/asw045.
Cressie NAC and Wikle CK (2011). Statistics for spatio-temporal data. Wiley Series in Probability and Statistics. Wiley, Hoboken, NJ.
Datta A, Banerjee S, Finley AO, and Gelfand AE (2016a). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111:800–812. doi: 10.1080/01621459.2015.1044091.
Datta A, Banerjee S, Finley AO, Hamm NAS, and Schaap M (2016b). Non-separable dynamic nearest neighbor Gaussian process models for large spatio-temporal data with an application to particulate matter analysis. The Annals of Applied Statistics, 10:1286–1316. doi: 10.1214/16-AOAS931.
Davis TA (2006). Direct Methods for Sparse Linear Systems. SIAM, Philadelphia, PA. doi: 10.1137/1.9780898718881.
Desanker G, Dahlin KM, and Finley AO (2019). Environmental controls on Landsat-derived phenoregions across an East African megatransect. Ecosphere, in press.
Eidsvik J, Shaby BA, Reich BJ, Wheeler M, and Niemi J (2014). Estimation and prediction in spatial models with block composite likelihoods. Journal of Computational and Graphical Statistics, 23:295–315. doi: 10.1080/10618600.2012.760460.
Fearnhead P, Bierkens J, Pollock M, and Roberts GO (2018). Piecewise deterministic Markov processes for continuous-time Monte Carlo. Statistical Science, 33(3):386–412. doi: 10.1214/18-STS648.
Finley AO, Banerjee S, and Gelfand AE (2012). Bayesian dynamic modeling for large space-time datasets using Gaussian predictive processes. Journal of Geographical Systems, 14:29–47. doi: 10.1007/s10109-011-0154-8.
Finley AO, Datta A, Cook BD, Morton DC, Andersen HE, and Banerjee S (2019). Efficient algorithms for Bayesian nearest neighbor Gaussian processes. Journal of Computational and Graphical Statistics, 28:401–414. doi: 10.1080/10618600.2018.1537924.
Furrer R, Genton MG, and Nychka D (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics, 15:502–523. doi: 10.1198/106186006X132178.
Genton MG and Kleiber W (2015). Cross-covariance functions for multivariate geostatistics. Statistical Science, 30:147–163. doi: 10.1214/14-STS487.
Gerber F, Furrer R, Schaepman-Strub G, de Jong R, and Schaepman ME (2018). Predicting missing values in spatio-temporal remote sensing data. IEEE Transactions on Geoscience and Remote Sensing, 56(5):2841–2853. doi: 10.1109/TGRS.2017.2785240.
Gneiting T (2002). Nonseparable, stationary covariance functions for space-time data. Journal of the American Statistical Association, 97:590–600. doi: 10.1198/016214502760047113.
Gonzalez J, Low Y, Gretton A, and Guestrin C (2011). Parallel Gibbs sampling: From colored fields to thin junction trees. In Gordon G, Dunson D, and Dudík M, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 324–332, Fort Lauderdale, FL, USA. PMLR.
Gramacy RB and Apley DW (2015). Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24:561–578. doi: 10.1080/10618600.2014.914442.
Gramacy RB and Lee HKH (2008). Bayesian treed Gaussian process models with an application to computer modeling. Journal of the American Statistical Association, 103:1119–1130. doi: 10.1198/016214508000000689.
Guhaniyogi R, Finley AO, Banerjee S, and Gelfand AE (2011). Adaptive Gaussian predictive process models for large spatial datasets. Environmetrics, 22:997–1007. doi: 10.1002/env.1131.
Guinness J (2018). Permutation and grouping methods for sharpening Gaussian process approximations. Technometrics, 60(4):415–429. doi: 10.1080/00401706.2018.1437476.
Heaton MJ, Datta A, Finley AO, Furrer R, Guinness J, Guhaniyogi R, Gerber F, Gramacy RB, Hammerling D, Katzfuss M, Lindgren F, Nychka DW, Sun F, and Zammit-Mangion A (2019). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics, 24(3):398–425. doi: 10.1007/s13253-018-00348-w.
Katzfuss M (2017). A multi-resolution approximation for massive spatial datasets. Journal of the American Statistical Association, 112:201–214. doi: 10.1080/01621459.2015.1123632.
Katzfuss M and Guinness J (2017). A general framework for Vecchia approximations of Gaussian processes. arXiv:1708.06302.
Kaufman CG, Schervish MJ, and Nychka DW (2008). Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association, 103:1545–1555. doi: 10.1198/016214508000000959.
Lauritzen SL (1996). Graphical Models. Clarendon Press, Oxford, UK.
Lewis R (2016). A guide to graph colouring. Springer International Publishing. doi: 10.1007/978-3-319-25730-3.
Lopes HF, Salazar E, and Gamerman D (2008). Spatial dynamic factor analysis. Bayesian Analysis, 3(4):759–792. doi: 10.1214/08-BA329.
Molloy M and Reed B (2002). Graph colouring and the probabilistic method. Springer-Verlag, Berlin Heidelberg. doi: 10.1007/978-3-642-04016-0.
Quiroz ZC, Prates MO, and Dey DK (2019). Block nearest neighboor Gaussian processes for large datasets. arXiv:1908.06437.
Ren Q and Banerjee S (2013). Hierarchical factor models for large spatially misaligned data: A low-rank predictive process approach. Biometrics, 69(1):19–30. doi: 10.1111/j.1541-0420.2012.01832.x.
Royle JA and Berliner LM (1999). A hierarchical approach to multivariate spatial modeling and prediction. Journal of Agricultural, Biological, and Environmental Statistics, 4(1):29–56. doi: 10.2307/1400420.
Rue H and Held L (2005). Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall/CRC. doi: 10.1201/9780203492024.
Sang H and Huang JZ (2012). A full scale approximation of covariance functions for large spatial data sets. Journal of the Royal Statistical Society, Series B, 74:111–132. doi: 10.1111/j.1467-9868.2011.01007.x.
Stein ML (2014). Limitations on low rank approximations for covariance matrices of spatial data. Spatial Statistics, 8:1–19. doi: 10.1016/j.spasta.2013.06.003.
Stein ML, Chi Z, and Welty LJ (2004). Approximating likelihoods for large spatial data sets. Journal of the Royal Statistical Society, Series B, 66:275–296. doi: 10.1046/j.1369-7412.2003.05512.x.
Sun Y, Li B, and Genton M (2011). Geostatistics for large datasets. In Montero J, Porcu E, and Schlather M, editors, Advances and Challenges in Space-time Modelling of Natural Events, pages 55–77. Springer-Verlag, Berlin Heidelberg. doi: 10.1007/978-3-642-17086-7.
Taylor-Rodriguez D, Finley AO, Datta A, Babcock C, Andersen HE, Cook BD, Morton DC, and Banerjee S (2019). Spatial factor models for high-dimensional and large spatial data: An application in forest variable mapping. Statistica Sinica, 29(3):1155–1180. doi: 10.5705/ss.202018.0005.
Vecchia AV (1988). Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society, Series B, 50:297–312. doi: 10.1111/j.2517-6161.1988.tb01729.x.
