Published in final edited form as: J Mach Learn Res. 2024 Mar;25:87.

Spatial meshing for general Bayesian multivariate models

Michele Peruzzi a,*, David B Dunson b

Abstract

Quantifying spatial and/or temporal associations in multivariate geolocated data of different types is achievable via spatial random effects in a Bayesian hierarchical model, but severe computational bottlenecks arise when spatial dependence is encoded as a latent Gaussian process (GP) in the increasingly common large scale data settings on which we focus. The scenario worsens in non-Gaussian models because the reduced analytical tractability leads to additional hurdles to computational efficiency. In this article, we introduce Bayesian models of spatially referenced data in which the likelihood or the latent process (or both) are not Gaussian. First, we exploit the advantages of spatial processes built via directed acyclic graphs, in which case the spatial nodes enter the Bayesian hierarchy and lead to posterior sampling via routine Markov chain Monte Carlo (MCMC) methods. Second, motivated by the possible inefficiencies of popular gradient-based sampling approaches in the multivariate contexts on which we focus, we introduce the simplified manifold preconditioner adaptation (SiMPA) algorithm which uses second order information about the target but avoids expensive matrix operations. We demonstrate the performance and efficiency improvements of our methods relative to alternatives in extensive synthetic and real world remote sensing and community ecology applications with large scale data at up to hundreds of thousands of spatial locations and up to tens of outcomes. Software for the proposed methods is part of R package meshed, available on CRAN.

Keywords: multivariate spatial models, directed acyclic graphs, domain partitioning, latent Gaussian processes

1. Introduction

Geolocated data are routinely collected in many fields and motivate the development of geostatistical models based on Gaussian processes (GPs). GPs are appealing due to their analytical tractability, their flexibility via a multitude of covariance or kernel choices, and their ability to effectively represent and quantify uncertainty. When Gaussian distributional assumptions are appropriate, GPs may be used directly as correlation models for the multivariate response. Otherwise, flexible models of multivariate spatial association can in principle be built via assumptions of conditional independence of the outcomes on a latent GP encoding space- and/or time-variability, regardless of data type. The poor scalability of naïve implementations of GPs to large scale data is addressed in a growing body of literature. Sun et al. (2011), Banerjee (2017) and Heaton et al. (2019) review and compare methods for big data geostatistics. Methods include low-rank approaches (Banerjee et al., 2008; Cressie and Johannesson, 2008), covariance tapering (Furrer et al., 2006; Kaufman et al., 2008), domain partitioning (Sang and Huang, 2012; Stein, 2014), local approximations (Gramacy and Apley, 2015), and composite likelihood approximations (Stein et al., 2004). In particular, a popular strategy is to assume sparsity in the Gaussian precision matrix via Gaussian Markov random fields (GMRF; Rue and Held, 2005) which can be represented as sparse undirected graphical models. Proper joint densities are a result of using directed acyclic graphs (DAG), leading to Vecchia’s approximation (Vecchia, 1988), nearest-neighbor GPs (NNGPs; Datta et al., 2016a), and generalizations (see e.g. Katzfuss, 2017; Katzfuss and Guinness, 2021). DAGs can be designed by taking a small number of “past” neighbors after choosing an arbitrary ordering of the data. In models of the response and in the conditionally-conjugate latent Gaussian case, posterior computations rely on sparse-matrix routines for scalability (Finley et al., 2019; Jurek and Katzfuss, 2020), enabling fast cross-validation (Shirota et al., 2019; Banerjee, 2020). Alternatives to sparse-matrix algorithms involve Gibbs samplers whose efficiency improves by prespecifying a DAG defined on domain partitions, resulting in spatially meshed GPs (MGPs; Peruzzi et al., 2022). These perspectives are reinforced when considering multivariate outcomes (see e.g. Zhang and Banerjee 2022; Dey et al. 2021; Peruzzi and Dunson 2022).

The literature on scalable GPs predominantly relies on Gaussian assumptions on the outcomes, but in many applied contexts these assumptions are restrictive, inflexible, or inappropriate. For example, vegetation phenologists may wish to characterize the life cycle of plants in mountainous regions using remotely sensed Leaf Area Index (LAI, a count variable) and relate it to snow cover during 8 day periods (SC, a discrete variable whose values range from 0 to 8—see e.g., Figure 1). Similarly, community ecologists are faced with spatial patterns when considering counts or dichotomous presence/absence data of several animal species (Figure 2). In this article, we address this key gap in the literature, which is how to construct arbitrary Bayesian multivariate geostatistical models which (1) may include non-Gaussian components, (2) lead to efficient computation for massive datasets.

Figure 1: Snow cover (left) and Leaf Area Index, as measured by the MODIS-TERRA satellite. Missing data are in orange. Bottom maps detail the extents of cloud cover and other phenomena negatively impacting data quality.

Figure 2: An extract of dichotomized North American Breeding Bird Survey data. Orange points correspond to locations at which at least 1 individual has been observed.

There are considerable challenges in these contexts for efficient Bayesian computation when avoiding Gaussian distributional assumptions on the outcomes. General purpose Markov chain Monte Carlo (MCMC) methods can in principle be used to draw samples from the posterior distribution of the latent process by making local proposals within accept/reject schemes. However, due to the huge dimensionality of the parameter space, poor mixing and slow convergence are likely. For instance, random-walk Metropolis proposals are cheaply computed but lack in efficiency as they overlook the local geometry of the high dimensional posterior. Alternatively, one may consider gradient-based MCMC methods such as the Metropolis-adjusted Langevin algorithm (MALA; Roberts and Stramer 2002), Hamiltonian Monte Carlo (HMC; Duane et al. 1987; Neal 2011; Betancourt 2018) and others such as MALA and HMC on the Riemannian manifold (Girolami and Calderhead, 2011) or the no-U-turn sampler (NUTS; Hoffman and Gelman, 2014) used in the Stan probabilistic programming language (Carpenter et al., 2017). These methods are appealing because they modulate proposal step sizes using local gradient and/or higher order information of the target density. Unfortunately, their performance very rapidly drops with parameter dimension (Dunson and Johndrow, 2020). Although it is common in other contexts to rely on subsamples to cheaply approximate gradients, Johndrow et al. (2020) show that such approximate MCMC algorithms are either slow or have large approximation error. Such issues can be tackled by considering low-rank models, which facilitate the design of more efficient proposals as they involve parameters of greatly reduced dimension. Certain low-rank models endowed with conjugate full conditional distributions (Bradley et al., 2018, 2019) lead to always-accepted Gibbs proposals. However, excessive dimension reduction—which may be necessary for acceptable MCMC performance—may lead to oversmoothing of the spatial surface, overlooking the small-range variability that frequently occurs in big spatial data (Banerjee et al., 2010). Alternative dimension reduction strategies via divide-and-conquer methods that combine posterior samples obtained via MCMC from data subsets typically rely on assumptions of independence that are inappropriate in the highly correlated data settings in which we are interested (Neiswanger et al., 2014; Wang and Dunson, 2014; Wang et al., 2015b; Nemeth and Sherlock, 2018; Blomstedt et al., 2019; Mesquita et al., 2020) or have only considered univariate Gaussian likelihoods (Guhaniyogi and Banerjee, 2018).

The poor practical performance of MCMC in high dimensional settings has motivated the development of MCMC-free methods for posterior computation that take advantage of Laplace approximations (Sengupta and Cressie, 2013; Zilber and Katzfuss, 2020). In particular, the integrated nested Laplace approximation (INLA; Rue et al., 2009) iterates between Gaussian approximations of the conditional posterior of the latent effects, and numerical integrations over the hyperparameters. INLAs are accurate because of the non-negligible impact the Gaussian prior on the latent process has on its posterior; they achieve scalability to big spatial data by forcing sparsity on the Gaussian precision matrix via a GMRF assumption (Lindgren et al., 2011). INLAs are reliable alternatives to MCMC methods in several settings, but may be outperformed by carefully-designed MCMC methods in terms of accuracy or uncertainty quantification (Taylor and Diggle, 2014). Furthermore, the practical reliance of INLAs on Matérn covariance models with small dimensional hyperparameters for fast numerical integration makes them less flexible than MCMC methods in multivariate contexts or whenever special-purpose parametric covariance functions are required.

In this article, we introduce methodological and computational innovations for scalable posterior computations for general non-Gaussian spatial models. Our contributions include a class of Bayesian hierarchical models of multivariate outcomes of possibly different types based on spatial meshing of a latent multivariate process. In our treatment, outcomes can be misaligned—i.e., not all measured at all spatial locations—and relatively large in number, and there is no Gaussian assumption on the latent process. We maintain this perspective when developing posterior sampling methods. In particular, we develop a new Langevin algorithm which, based on ideas related to manifold MALA, adaptively builds a preconditioner but also avoids cubic-cost operations, leading to efficiency improvements in the contexts on which we focus. Our methods enable computations on data of size $10^5$ or more. Unlike low-rank methods, we do not require restrictive dimensionality reduction at the level of the latent process. Unlike INLA, our computational methods are exact (upon convergence) for a class of valid spatial processes which is not restricted to latent GPs with Matérn covariances; furthermore, our methods are hit by a smaller computational penalty in higher-dimensional multivariate settings. Our methods are generally applicable to models of spatially referenced data, but we highlight the connections between Langevin methods and the Gibbs sampler available for Gaussian outcomes, and we develop new results for latent coregionalization models using MGPs. In applications, we consider Student-t processes, HMC and NUTS, and other cross-covariance models as methodological and computational alternatives to latent GPs, Langevin algorithms, and coregionalization models, respectively. Software for the proposed methods and the related posterior sampling algorithms is available as part of the meshed package for R, available on CRAN.

The article proceeds as follows. Section 2 outlines our model for spatially-referenced multivariate outcomes of different types and introduces general purpose methods and algorithms for scaling computations to high dimensional spatial data. Section 3 outlines Langevin methods for posterior sampling of the latent process and introduces a novel algorithm for multivariate spatial models. Section 4 translates the proposed methodologies for the latent Gaussian model of coregionalization. The remaining sections highlight algorithmic efficiency in applications on large synthetic and real world datasets motivated by remote sensing and spatial community ecology. The supplementary material includes alternative constructions of our proposed methods based on latent grids, Student-t processes, and NUTS for posterior computations, in addition to proofs, practical guidelines, additional simulations, and a real world application of our methods in the context of spatial multi-species N-mixture models.

2. Meshed Bayesian multivariate models for non-Gaussian data

We introduce our model for multivariate outcomes of possibly different types (e.g. continuous and counts) which also allows for misalignment. Let $\mathcal{G} = \{\mathcal{A}, \mathcal{E}\}$ be a DAG with nodes $\mathcal{A} = \{a_1, \dots, a_M\}$ and edges $\mathcal{E} = \{\text{Par}(a) : a \in \mathcal{A}\}$, where $\text{Par}(a) \subset \mathcal{A}$ is referred to as the parent set of $a$. Let $\mathcal{D}$ be the input domain and $\mathcal{S} \subset \mathcal{D}$ denote a user-specified set of "knots" or "reference locations." We partition $\mathcal{S}$ into subsets $\mathcal{S}_i \subset \mathcal{S}$ such that $\mathcal{S}_i \cap \mathcal{S}_j = \emptyset$ if $i \neq j$ and $\bigcup_{i=1}^M \mathcal{S}_i = \mathcal{S}$. Then, we set up our hierarchical model for multivariate outcomes as:

$$y_j(\ell) \mid \eta_j(x_j(\ell), w_j(\ell)), \gamma_j \sim F_j\big(\eta_j(x_j(\ell), w_j(\ell)), \gamma_j\big), \qquad \beta_j, \gamma_j \sim \pi(\beta_j, \gamma_j), \qquad \theta \sim \pi(\theta), \qquad w(\cdot) \sim \Pi_{\mathcal{G}, \theta} \tag{1}$$

where $F_j$ is the probability distribution of the $j$th outcome, parametrized by an unknown constant $\gamma_j$ and a spatially-varying function $\eta_j(x_j(\ell), w_j(\ell))$, which includes a $p_j$-dimensional vector of covariates specific to the $j$th outcome, denoted by $x_j(\ell)$, whereas $w_j(\ell)$ is the $j$th element of the random vector $w(\ell)$, for $j = 1, \dots, q$. A common linear assumption leads to $\eta_j(\ell) = x_j(\ell)^\top \beta_j + w_j(\ell)$. Given a set of locations $\mathcal{L} \subset \mathcal{D}$ of size $n$, we denote $w = (w(\ell_1)^\top, w(\ell_2)^\top, \dots, w(\ell_n)^\top)^\top$. We assume $w$ is the finite realization at $\mathcal{L}$ of an infinite-dimensional latent process $w(\cdot)$, with law $\Pi_{\mathcal{G}}$ and density $\pi_{\mathcal{G}}$, which characterizes spatial/temporal dependence between outcomes. We construct such a process by enforcing the conditional independence assumptions encoded in $\mathcal{G}$ onto the law $\Pi$ of a $q$-variate spatial process (also referred to as the base or parent process). For locations $\ell \in \mathcal{S}$, we assume that $\pi_{\mathcal{G}}$ factorizes according to $\mathcal{G}$. This means $\pi_{\mathcal{G}}(w_{\mathcal{S}} \mid \theta) = \prod_{a_i \in \mathcal{A}} \pi(w_i \mid w_{[i]}, \theta)$, where we denote $w_i = w_{\mathcal{S}_i}$ and $w_{[i]}$ is the vector of $w(\cdot)$ at locations $\bigcup_{a_j \in \text{Par}(a_i)} \mathcal{S}_j$, i.e. the set of locations mapped to parents of $a_i$. For locations $\ell \in \mathcal{U} = \mathcal{D} \setminus \mathcal{S}$, we assume conditional independence given a set of parents $[\ell] \subset \mathcal{A}$, which means $\pi_{\mathcal{G}}(w_{\mathcal{U}} \mid w_{\mathcal{S}}, \theta) = \prod_{\ell \in \mathcal{U}} \pi(w(\ell) \mid w_{[\ell]}, \theta)$, where $w_{[\ell]}$ is a vector collecting realizations of $w(\cdot)$ at locations $\mathcal{S}_{[\ell]} = \bigcup_{a_i \in [\ell]} \mathcal{S}_i$.

2.1. DAG and partition choice

We refer to the method of building spatial processes via sparse DAGs associated to domain partitioning as spatial meshing. Several options for constructing $\mathcal{G}$ and for populating and partitioning $\mathcal{S}$ are available, but sparsity assumptions on $\mathcal{G}$ are necessary to avoid computational bottlenecks in using $\Pi_{\mathcal{G}}$. Specifically, we restrict our focus to sparse DAGs such that $|\text{mb}(a)| \leq m$ for all $a \in \mathcal{A}$, where $\text{mb}(a)$ is the Markov blanket of $a$, and $m$ is a small number. The Markov blanket of a node in a DAG is the set $\text{mb}(a) = \text{Par}(a) \cup \text{Chi}(a) \cup \text{Copar}(a)$, which collects the parents of $a$ along with the set of children of $a$, $\text{Chi}(a) = \{b \in \mathcal{A} : a \in \text{Par}(b)\}$, and the set of co-parents of $a$, $\text{Copar}(a) = \{c \in \mathcal{A} : c \neq a \text{ and } a, c \in \text{Par}(b) \text{ for some } b \in \text{Chi}(a)\}$, i.e. the set of $a$'s children's other parents. We additionally assume that the undirected moral graph $\bar{\mathcal{G}}$ obtained by adding pairwise edges between co-parents has a small number of colors; if node $a$ has color $c$, then no elements of $\text{mb}(a)$ have the same color. Because our assumptions on the size of the Markov blanket lead to large scale conditional independence, the spatially meshed process $\Pi_{\mathcal{G}}$ has a simpler dependence structure than the parent process $\Pi$ from which it originates. The "screening" effect (Stein, 2002) makes these assumptions appealing in geostatistical contexts. Furthermore, if the Markov blanket of nodes in $\mathcal{G}$ can be built to cover their spatial neighbors, then $\Pi_{\mathcal{G}}$ can provably approximate $\Pi$ accurately in some settings (Zhu et al., 2022). If $\Pi$ is a GP, the $(i,j)$ entry of the resulting precision matrix is nonzero if the corresponding nodes are in each other's Markov blanket. In the context of Gibbs-like samplers that visit each node of $\mathcal{G}$, a small Markov blanket bounds the compute time for each step of the algorithm; we take advantage of our assumptions in step 4 of Algorithm 1. Refer to Algorithm 3 and the supplement for an account of computational complexity in the coregionalized GP setting.
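To make the Markov blanket computation concrete, the following minimal R sketch (not part of the meshed package; node labels and the list structure are illustrative assumptions) recovers $\text{mb}(a)$ from a DAG stored as a list of parent sets.

```r
# Sketch: Markov blanket of node `a` in a DAG stored as a named list of parent sets.
markov_blanket <- function(a, parents) {
  nodes     <- names(parents)
  children  <- nodes[vapply(parents, function(p) a %in% p, logical(1))]
  coparents <- setdiff(unique(unlist(parents[children])), a)
  setdiff(union(parents[[a]], union(children, coparents)), a)
}

# Toy DAG: a1 -> {a2, a3}, {a2, a3} -> a4
parents <- list(a1 = character(0), a2 = "a1", a3 = "a1", a4 = c("a2", "a3"))
markov_blanket("a2", parents)   # "a1" (parent), "a4" (child), "a3" (co-parent)
```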

Figure 3 visualizes (1) when implemented on a "cubic" spatial DAG using row-column indexing of the nodes, resulting in $M = M_{\text{row}} M_{\text{col}}$ and $\mathcal{S} = \bigcup_{i=1}^{M_{\text{row}}} \bigcup_{j=1}^{M_{\text{col}}} \mathcal{S}_{ij}$. Even though DAGs are abstract representations of conditional independence assumptions, nodes of the DAG in Figure 3 conform to a single pattern (i.e., edges from left and bottom nodes, and to right and top nodes). As a consequence, the moral graph $\bar{\mathcal{G}}$ only adds undirected edges between $a_{i+1,j}$ and $a_{i,j+1}$ for all $i = 1, \dots, M_{\text{row}}-1$ and $j = 1, \dots, M_{\text{col}}-1$, leading to cliques of size 3 and 3 colors, irrespective of the input data. We refer to this kind of DAG as a cubic DAG as it naturally extends to a hypercube structure in $d > 2$ dimensions.
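The following R sketch illustrates one valid 3-coloring of the moral graph of such a cubic DAG; the specific formula $(i + 2j) \bmod 3$ is an illustrative assumption (one of several valid colorings), not a prescription from the paper.

```r
# Sketch: a 3-coloring of the moral graph of a "cubic" DAG on an Mrow x Mcol grid.
# Moral-graph edges connect (i,j)-(i+1,j), (i,j)-(i,j+1), and (i+1,j)-(i,j+1);
# the coloring (i + 2j) mod 3 assigns different colors to all three pairs.
cubic_dag_colors <- function(Mrow, Mcol) {
  grid <- expand.grid(i = seq_len(Mrow), j = seq_len(Mcol))
  grid$color <- (grid$i + 2 * grid$j) %% 3 + 1
  grid
}
head(cubic_dag_colors(4, 4))
```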

Figure 3: Directed acyclic graph representing a special case of model (1). For simplicity, we omit the directed edges from $\beta_j, \gamma_j$ to each $y_j(\ell)$, $\ell \in \mathcal{T}$. If $y_j(\ell)$ is unobserved and therefore $\ell \notin \mathcal{T}_j$, the corresponding node is missing.

Once a sparse DAG has been set, one needs to associate each node to a partition $\mathcal{S}_i$ of $\mathcal{S}$. With cubic DAGs, the $i$th node of $\mathcal{G}$ can be associated to the $i$th domain partition found via axis-parallel tiling, or via Voronoi tessellations using a grid of centroids. These two partitioning strategies are equivalent when data have no gaps; otherwise, the latter strategy simplifies the proposal in Peruzzi et al. (2022) and can be used to guarantee that every domain partition includes observations, see e.g. Figure 4. Suppose $\mathcal{D}_i$, $i = 1, \dots, M$, is the chosen domain tessellation. Then, the parent set $[\ell]$ for a location $\ell \in \mathcal{U}$ can be as simple as letting $[\ell] = \{a_i\}$, i.e. $\mathcal{S}_{[\ell]} = \mathcal{S}_i$, if $\ell \in \mathcal{U}_i = \mathcal{D}_i \setminus \mathcal{S}_i$.
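A minimal R sketch of the axis-parallel tiling option follows, assuming for illustration that the domain is the unit square; this is not package code.

```r
# Sketch: assign 2-D locations in [0,1]^2 to axis-parallel tiles (Mrow x Mcol partition).
axis_parallel_partition <- function(coords, Mrow, Mcol) {
  row_id <- cut(coords[, 2], breaks = seq(0, 1, length.out = Mrow + 1),
                include.lowest = TRUE, labels = FALSE)
  col_id <- cut(coords[, 1], breaks = seq(0, 1, length.out = Mcol + 1),
                include.lowest = TRUE, labels = FALSE)
  (row_id - 1) * Mcol + col_id   # partition index in 1, ..., Mrow * Mcol
}

set.seed(1)
coords <- matrix(runif(200), ncol = 2)
table(axis_parallel_partition(coords, 5, 5))   # counts of locations per tile
```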

Figure 4: Visualizing a cubic DAG and the associated domain partitioning. Left: scatter of the locations in $\mathcal{S}$. Right: $\mathcal{G}$ overlaid on the domain partitions, with colors matching those of $\bar{\mathcal{G}}$.

This general methodology can be used to construct other processes. For instance, dropping the sparsity assumptions on 𝒢, one can recover the base process itself.

Proposition 2.1.

If $\mathcal{G}$ is such that $\text{Par}(a_i) = \{a_1, \dots, a_{i-1}\}$ for all $a_i \in \mathcal{A}$, then $\Pi_{\mathcal{G}} = \Pi$ at $\mathcal{S}$, i.e. $\pi_{\mathcal{G}}(w_{\mathcal{S}}) = \pi(w_{\mathcal{S}})$. The same result holds if $M = 1$.

Proof. Omitting $\theta$ for clarity, $\pi_{\mathcal{G}}(w_{\mathcal{S}}) = \prod_{a_i \in \mathcal{A}} \pi(w_i \mid w_{[i]}) = \pi(w_1) \prod_{i=2}^{M} \pi(w_i \mid w_1, \dots, w_{i-1}) = \pi(w_1, \dots, w_M) = \pi(w_{\mathcal{S}})$. If $M = 1$ then $\mathcal{A} = \{a_1\}$, $\mathcal{S} = \mathcal{S}_1$, $\mathcal{E} = \{\}$, and the result is immediate.

Several other spatial process models based on Vecchia’s approximation can be derived similarly (Vecchia, 1988; Banerjee et al., 2008; Datta et al., 2016a; Katzfuss, 2017; Katzfuss and Guinness, 2021; Peruzzi and Dunson, 2022, and others) and any of these can be used in place of $\Pi_{\mathcal{G}}$. For example, a Vecchia approximation can be obtained by partitioning $\mathcal{S} = \{\ell_1, \dots, \ell_{n_{\mathcal{S}}}\}$ into sets of size 1; the sparse DAG is then generated by finding the $m$ nearest neighbors of $\ell_i$ among $\{\ell_1, \dots, \ell_{i-1}\}$ (see the sketch below). Heuristic graph coloring algorithms can be used to ensure a degree of parallelization in Algorithm 1. Unlike in the cubic DAG setting, the number of colors cannot be determined in advance because it is bounded below by the clique size, which depends on the ordering of the elements of $\mathcal{S}$, their values, and $m$. A larger number of colors corresponds to smaller sampling blocks and may correspond to lower MCMC efficiency when sampling latent surfaces with strong spatial correlations.
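The following is a minimal R sketch of the nearest-neighbor DAG construction just described, written for clarity rather than speed and assuming 2-D coordinates in an already fixed ordering.

```r
# Sketch: build the sparse DAG of a Vecchia-type approximation by taking,
# for each location i (in a fixed ordering), its m nearest neighbors among 1, ..., i-1.
vecchia_parents <- function(coords, m) {
  n <- nrow(coords)
  parents <- vector("list", n)
  for (i in 2:n) {
    d <- sqrt(rowSums((coords[seq_len(i - 1), , drop = FALSE] -
                         matrix(coords[i, ], i - 1, 2, byrow = TRUE))^2))
    parents[[i]] <- order(d)[seq_len(min(m, i - 1))]
  }
  parents
}

set.seed(2)
coords <- matrix(runif(100), ncol = 2)
vecchia_parents(coords, m = 5)[[10]]   # parent indices of the 10th location
```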

DAG and partition choice both relate to the restrictiveness of spatial conditional independence assumptions. Relative to the same partition, adding edges to a DAG brings Π𝒢 closer to Π in a Kullback-Leibler (KL) sense (Peruzzi et al., 2022, Section 2), and similar reasoning informs placement of knots in recursive treed DAGs (Peruzzi and Dunson, 2022). Here, we consider a cubic DAG and alternative nested partitions. Proposition 2.2 shows that coarser partitions lead to smaller KL divergence of Π𝒢 from the base process Π.

Proposition 2.2.

Consider a $2 \times 1$ domain partition $w = (w_1, w_2)$ and suppose $\mathcal{G}_1$ is a DAG with nodes $\mathcal{A}_1 = \{a_1, a_2\}$ and the edge $a_1 \to a_2$. Take a finer $3 \times 1$ partition nested in the first, i.e. write $w_2 = (w_{21}, w_{22})$, and a DAG $\mathcal{G}_2$ such that $\mathcal{A}_2 = \{a_1, a_{21}, a_{22}\}$, with edges $a_1 \to a_{21}$ and $a_{21} \to a_{22}$. Then, $KL(\pi \,\|\, \pi_{\mathcal{G}_1}) \leq KL(\pi \,\|\, \pi_{\mathcal{G}_2})$.

Proof. Since $\pi_{\mathcal{G}_1} = \pi(w_1)\pi(w_2 \mid w_1) = \pi(w_1)\pi(w_{21} \mid w_1)\pi(w_{22} \mid w_{21}, w_1)$, the coarser partition model can be equivalently written in terms of the finer partition using the DAG $\mathcal{G}_1^*$ with nodes $\mathcal{A}_1^* = \mathcal{A}_2$ and the additional edge $a_1 \to a_{22}$. Then, $\mathcal{G}_2$ is sparser than $\mathcal{G}_1^*$ and therefore $KL(\pi \,\|\, \pi_{\mathcal{G}_1}) \leq KL(\pi \,\|\, \pi_{\mathcal{G}_2})$.

We provide a discussion in the supplement relating to KL comparisons between non-nested partitioning schemes.

2.2. Posterior distribution and sampling

After introducing the sets $\mathcal{T}_j = \{\ell \in \mathcal{T} : y_j(\ell) \text{ is observed}\}$, we obtain $\mathcal{T}_1 \cup \cdots \cup \mathcal{T}_q = \mathcal{T} = \{\ell_1, \dots, \ell_n\}$ as the set of locations at which at least one outcome is observed. Then, we denote by $\bar{\mathcal{T}} = \mathcal{T} \setminus \mathcal{S}$ the set of non-reference locations with at least one observed outcome. The posterior distribution for (1) is

$$\pi\big(\{\beta_j, \gamma_j\}_{j=1}^q, w_{\mathcal{S}}, w_{\bar{\mathcal{T}}}, \theta \mid y_{\mathcal{T}}\big) \propto \pi(\theta)\, \pi_{\mathcal{G}}(w_{\mathcal{S}} \mid \theta)\, \pi_{\mathcal{G}}(w_{\bar{\mathcal{T}}} \mid w_{\mathcal{S}}, \theta) \prod_{j=1}^q \Big( \pi(\beta_j, \gamma_j) \prod_{\ell \in \mathcal{T}_j} dF_j\big(y_j(\ell) \mid w_j(\ell), \beta_j, \gamma_j\big) \Big). \tag{2}$$

Sampling from (2) may proceed via Algorithm 1, where we denote by $y_i$ the vector of observed outcomes at $\mathcal{S}_i$ and by $w_{\text{mb}(i)}$ the vector of latent effects at the Markov blanket of $w_i$, which includes parents, children, and co-parents of $a_i \in \mathcal{A}$, and all locations $\ell \in \mathcal{U}$ such that $w_i$ is part of $w_{[\ell]}$. Algorithm 1 has the structure of a Gibbs sampler, as the Bayesian hierarchy is expanded to include the spatial DAG $\mathcal{G}$: at each step of the MCMC loop, the goal is to sample from the full conditional distribution of one random component, conditioning on the most recent value of all the others. Upon convergence, one obtains correlated samples from the target joint posterior density. The lack of conditional conjugacy at steps 1–5, which is expected given our avoidance of simplifying assumptions on the $F_j$'s and the base process $\Pi$, implies that steps 1–5 will require accept/reject steps in which updating a parameter $z$ proceeds by generating a move to $z^*$ via a proposal distribution $q(\cdot \mid z)$ and then accepting such a move with probability $\min\big\{1, \frac{p(z^* \mid -)\, q(z \mid z^*)}{p(z \mid -)\, q(z^* \mid z)}\big\}$, where $p(z \mid -)$ is the target full conditional distribution. Steps 1 and 2 are generally not a concern in the settings on which we focus, due to the conditional independence of $(\beta_j, \gamma_j)$ from $(\beta_i, \gamma_i)$ for $i \neq j$ given the latent process, and the fact that the number of covariates for each outcome is typically small relative to the data size.

It is also typical in these settings to choose a reference set $\mathcal{S}$ which includes all locations with at least one observed outcome, implying that $\bar{\mathcal{T}} = \emptyset$; when this is the case, step 5 is not performed in Algorithm 1. We consider alternative strategies to restore flexibility in choosing $\mathcal{S}$ in the supplementary material. Our sparsity assumptions, encoded in $\Pi_{\mathcal{G}}$ via $\mathcal{G}$, facilitate computations at steps 3 and 4, which would otherwise be the two major computational bottlenecks. Specifically, in step 3 and assuming $\bar{\mathcal{T}} = \emptyset$, a proposal $\theta^*$ generated from a distribution $q(\cdot \mid \theta)$ is accepted with probability $\alpha$

$$\alpha = \min\left\{1, \frac{\pi(\theta^*) \prod_{i=1}^M \pi(w_i \mid w_{[i]}, \theta^*)\; q(\theta \mid \theta^*)}{\pi(\theta) \prod_{i=1}^M \pi(w_i \mid w_{[i]}, \theta)\; q(\theta^* \mid \theta)}\right\}, \tag{3}$$

whose computation is likely expensive when $w_i$ and $w_{[i]}$ are high dimensional because the

Algorithm 1: Posterior sampling of the spatially meshed model (1) and predictions.

Initialize $\beta_j^{(0)}$ and $\gamma_j^{(0)}$ for $j = 1, \dots, q$, $w_{\mathcal{S}}^{(0)}$, $w_{\bar{\mathcal{T}}}^{(0)}$, and $\theta^{(0)}$
for $t \in \{1, \dots, T^*, T^*+1, \dots, T^*+T\}$ do ⊳ sequential MCMC loop
1:  for $j = 1, \dots, q$, sample $\beta_j^{(t)} \mid y_{\mathcal{T}}, w_{\mathcal{T}}^{(t-1)}, \gamma_j^{(t-1)}$
2:  for $j = 1, \dots, q$, sample $\gamma_j^{(t)} \mid y_{\mathcal{T}}, w_{\mathcal{T}}^{(t-1)}, \beta_j^{(t)}$
3:  sample $\theta^{(t)} \mid w_{\bar{\mathcal{T}}}^{(t-1)}, w_{\mathcal{S}}^{(t-1)}$
for $c \in \text{Colors}(\mathcal{G})$ do ⊳ sequential
  for $i \in \{i : \text{Color}(a_i) = c\}$ do in parallel
4:    sample $w_i^{(t)} \mid w_{\text{mb}(i)}^{(t)}, y_i, \theta^{(t)}, \{\beta_j^{(t)}, \gamma_j^{(t)}\}_{j=1}^q$ ⊳ reference sampling
for $\ell \in \bar{\mathcal{T}}$ do in parallel
5:   sample $w(\ell)^{(t)} \mid w_{[\ell]}^{(t)}, y(\ell), \theta^{(t)}, \{\beta_j^{(t)}, \gamma_j^{(t)}\}_{j=1}^q$ ⊳ non-reference sampling
Assuming convergence has been attained after $T^*$ iterations:
discard $\{\beta_j^{(t)}, \gamma_j^{(t)}\}_{j=1}^q, w_{\mathcal{S}}^{(t)}, w_{\bar{\mathcal{T}}}^{(t)}, \theta^{(t)}$ for $t = 1, \dots, T^*$
Output: correlated sample of size $T$ with density
$\{\beta_j^{(t)}, \gamma_j^{(t)}\}_{j=1}^q, w_{\mathcal{S}}^{(t)}, w_{\bar{\mathcal{T}}}^{(t)}, \theta^{(t)} \sim \pi\big(\{\beta_j, \gamma_j\}_{j=1}^q, w_{\mathcal{S}}, w_{\bar{\mathcal{T}}}, \theta \mid y_{\mathcal{T}}\big)$.
Predict at $\ell^* \in \mathcal{U}$: for $t = 1, \dots, T$ and $j = 1, \dots, q$, sample from $\pi\big(w(\ell^*)^{(t)} \mid w_{[\ell^*]}^{(t)}, \theta^{(t)}\big)$, then from $F_j\big(\cdot \mid w_j(\ell^*)^{(t)}, \beta_j^{(t)}, \gamma_j^{(t)}\big)$

base law $\Pi$ models pairwise dependence of the elements of $w_i$ based on their spatial distance. As an example, a GP assumption on $\Pi$ leads to $\pi(w_i \mid w_{[i]}, \theta) = N(w_i; H_i w_{[i]}, R_i)$ where $H_i = C_{i,[i]} C_{[i]}^{-1}$ and $R_i = C_i - H_i C_{[i],i}$, whose computation has complexity $O(\min\{n_i^3 q^3, n_{[i]}^3 q^3\})$. If $n_i$ or the number of parent locations $n_{[i]}$ are large, such density evaluation is computationally prohibitive. Partitioning of $\mathcal{S}$ ensures that $n_i$ is small for all $i$, and the assumed small Markov blankets of nodes in $\mathcal{G}$ ensure that the number of parents, and thus $n_{[i]}$, is small.
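For reference, a minimal R sketch of the $H_i$, $R_i$ computation follows, assuming for simplicity a univariate ($q = 1$) exponential covariance with a hypothetical decay parameter phi; the cubic cost in the parent-block size is visible in the solve() call.

```r
# Sketch: conditional mean matrix H_i and covariance R_i for a GP block given its parents.
expcov <- function(A, B, phi) {
  D2 <- pmax(outer(rowSums(A^2), rowSums(B^2), "+") - 2 * tcrossprod(A, B), 0)
  exp(-phi * sqrt(D2))                      # exponential covariance on pairwise distances
}

conditional_block <- function(S_i, S_pa, phi) {
  C_i   <- expcov(S_i,  S_i,  phi)
  C_pa  <- expcov(S_pa, S_pa, phi)
  C_ipa <- expcov(S_i,  S_pa, phi)
  H_i <- C_ipa %*% solve(C_pa)              # C_{i,[i]} C_{[i]}^{-1}  (cubic in n_[i])
  R_i <- C_i - H_i %*% t(C_ipa)             # C_i - H_i C_{[i],i}
  list(H = H_i, R = R_i)
}

set.seed(3)
out <- conditional_block(matrix(runif(10), ncol = 2), matrix(runif(12), ncol = 2), phi = 2)
dim(out$H)   # n_i x n_[i]
```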

Step 4 updates the latent process at each partition and is performed in two loops. The outer loop is sequential, with a number of sequential steps equal to the number of colors of $\bar{\mathcal{G}}$, which is small by construction. The inner loop can be performed in parallel or, equivalently, all partitions of the same color can be updated as a single block. In step 4, the lack of conditional conjugacy implies that proposals $w_i^*$ for all $i = 1, \dots, M$ need to be designed and then accepted with probability $\alpha_i$

$$\alpha_i = \min\left\{1, \frac{\pi(w_i^* \mid -)\, dF(y_i \mid w_i^*, -)\, q(w_i \mid w_i^*)}{\pi(w_i \mid -)\, dF(y_i \mid w_i, -)\, q(w_i^* \mid w_i)}\right\}, \tag{4}$$

where we denote the full conditional distribution of $w_i$ by $\pi(w_i \mid -)$ and the outcome densities by $dF(y_i \mid w_i, -) = \prod_{j=1}^q \prod_{\ell \in \mathcal{S}_i \cap \mathcal{T}_j} dF_j(y_j(\ell) \mid w_j(\ell), \beta_j, \gamma_j)$. Here, it is desirable to increase the size of each $w_i$: in Proposition 2.2 we showed that a coarser partitioning of $\mathcal{S}$ leads to less restrictive spatial conditional independence assumptions. Furthermore, we may expect a smaller number of larger blocks to lead to improved sampling efficiency at step 4. However, several roadblocks appear when $w_i$ is high dimensional. First, evaluating $\pi(w_i^* \mid -)/\pi(w_i \mid -)$ becomes expensive. Second, it is difficult to design an efficient proposal distribution $q(\cdot \mid w_i)$ in high dimensions. A random-walk Metropolis (RWM) proposal proceeds by letting $w_i^* = w_i + g_i$ where $g_i \sim N(0, G_i)$, but the $n_i q \times n_i q$ matrix $G_i$ must be specified by the user for all $i$, making a RWM proposal unlikely to achieve acceptable performance in practice if $n_i$ is large, especially if one were to take the $G_i$'s as diagonal matrices. Manual specification of the $G_i$'s can be circumvented via Adaptive Metropolis (AM) methods, which build $G_i$ dynamically based on past acceptances and rejections (see e.g., Haario et al., 2001; Andrieu and Thoms, 2008; Vihola, 2012), or via gradient-based schemes such as HMC, which use local information about the target distribution. However, when the dimension of $w_i$ is large, the Markov chain will only make small steps, negatively impacting overall efficiency and convergence regardless of the proposal scheme. These issues worsen when $q$ is larger, because spatial meshing via partitioning and a sparse DAG only operates at the level of the spatial domain.

Finally, while it is easier to specify smaller dimensional proposals, reducing the size of each $w_i$ will lead to more restrictive spatial conditional independence assumptions and poorer sampling performance due to high posterior correlations in the spatial nodes. Therefore, proposal mechanisms for updating $w_i$ should (1) be inexpensive to compute and allow for the number of outcomes to increase without overly restrictive spatial conditional independence assumptions, and (2) use local target information with minimal or no user input or tuning.

We begin detailing novel computational approaches in the next section, maintaining a general perspective. We implement our proposals on Gaussian coregionalized meshed process models and detail Algorithm 3 with an account of computational cost in terms of flops and clock time.

3. Gradient-based sampling of spatially meshed models

Algorithm 1 is essentially a Metropolis-within-Gibbs sampler for updating the latent effects $w_{\mathcal{T}}$ in $M + |\bar{\mathcal{T}}|$ small dimensional substeps. The setup and tuning of efficient proposals for updating $w_i$ remains a challenge and we consider several update schemes below. Given our assumption that $\bar{\mathcal{T}} = \emptyset$, we only need to sample all $w_i$'s conditional on their Markov blankets (step 4). The target full conditional density, for $i = 1, \dots, M$, is

$$p(w_i \mid -) \propto \pi(w_i \mid w_{[i]}, \theta) \prod_{\{j : a_i \in \text{Par}(a_j)\}} \pi(w_j \mid w_i, w_{[j] \setminus \{i\}}, \theta) \prod_{\substack{j = 1, \dots, q \\ \ell \in \mathcal{S}_i :\ y_j(\ell) \text{ is observed}}} dF_j\big(y_j(\ell) \mid w_j(\ell), \beta_j, \gamma_j\big), \tag{5}$$

which takes the form $p(w_i \mid -) \propto [i\text{'s parents}] \times [i\text{'s children}] \times [\text{data at } i]$, and where the last term is a product of one-dimensional densities due to conditional independence of the outcomes given the latent process. The update of $w_i$ proceeds by proposing a move $w_i \to w_i^*$ using density $q(\cdot \mid w_i)$; then, $w_i^*$ is accepted with probability $\min\{1, \alpha\}$ where $\alpha = \frac{p(w_i^* \mid -)\, q(w_i \mid w_i^*)}{p(w_i \mid -)\, q(w_i^* \mid w_i)}$. We consider gradient-based update schemes that are accessible due to the sparsity of $\mathcal{G}$ and the low dimensional terms in (5).

3.1. Langevin methods for meshed models

Updating $w_{\mathcal{S}}$ in spatial models via a Metropolis-adjusted Langevin algorithm proceeds in general by proposing a move to $w_i^*$ for each $i = 1, \dots, M$ via

$$q(w_i^* \mid w_i) = N\Big(w_i + \tfrac{\varepsilon_i^2}{2} M \nabla_{w_i} \log p(w_i \mid -),\; \varepsilon_i^2 M\Big), \quad \text{i.e.} \quad w_i^* = w_i + \frac{\varepsilon_i^2}{2} M \nabla_{w_i} \log p(w_i \mid -) + \varepsilon_i M^{\frac{1}{2}} u, \tag{6}$$

where $u \sim N(0, I_{n_i q})$, $I_{n_i q}$ is the identity matrix of dimension $n_i q$, $\nabla_{w_i} \log p(w_i \mid -)$ denotes the gradient of the full conditional log-density with respect to $w_i$, and $\varepsilon_i$ is a step size specific to node $i$ which can be chosen adaptively via dual averaging (see, e.g., the discussion in Hoffman and Gelman, 2014). With (5) as the target, let $f_i$ be the $n_i q \times 1$ vector that stacks $n_i$ blocks of size $q \times 1$; each block has $\frac{\partial}{\partial w_j(\ell)} \log dF_j(y_j(\ell) \mid w_j(\ell), \beta_j, \gamma_j)$ as its $j$th element, for $\ell \in \mathcal{S}_i$, and zeros if $y_j(\ell)$ is unobserved. Then, we obtain

$$\nabla_{w_i} \log p(w_i \mid -) = f_i + \frac{\partial}{\partial w_i} \log p(w_i \mid w_{[i]}, \theta) + \sum_{\{j : a_i \in \text{Par}(a_j)\}} \frac{\partial}{\partial w_i} \log p(w_j \mid w_i, w_{[j] \setminus \{i\}}, \theta). \tag{7}$$
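A minimal R sketch of the proposal in (6) follows, written for a generic block given a user-supplied gradient function; setting the preconditioner to the identity recovers the plain MALA proposal discussed next, and all names here are illustrative.

```r
# Sketch of the preconditioned Langevin proposal (6) for a generic block w,
# given a function returning the gradient of the full-conditional log-density.
langevin_propose <- function(w, grad_logp, eps, M = diag(length(w))) {
  L  <- t(chol(M))                                       # M^{1/2} (lower Cholesky factor)
  mu <- w + 0.5 * eps^2 * as.vector(M %*% grad_logp(w))  # proposal mean
  list(proposal = mu + eps * as.vector(L %*% rnorm(length(w))), mean = mu)
}

# Example with a standard Gaussian full conditional, for which grad log p(w) = -w:
set.seed(10)
langevin_propose(rep(0, 3), function(w) -w, eps = 0.5)
```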

The matrix $M$ in (6) is a preconditioner, also referred to as the mass matrix (Neal, 2011). In the simplest setting, one sets $M = I_{n_i q}$ to obtain a MALA update (Roberts and Tweedie, 1996). If we assume that gradients can be computed with linear cost, MALA iterations run very cheaply in $O(q n_i)$ flops. However, we may conjecture that taking into account the geometry of the target beyond its gradient might be advantageous when seeking to formulate efficient updates. Complex update schemes that achieve this goal may operate on the Riemannian manifold (Girolami and Calderhead, 2011), but lead to an increase in the computational burden relative to simpler schemes. A special case of manifold MALA with relatively small added complexity uses a position-dependent preconditioner $M(w_i) = G(w_i) = \big(-E\big[\frac{\partial^2}{\partial w_i^2} \log p(w_i \mid -)\big]\big)^{-1}$. Let $F_i$ be the $n_i q \times n_i q$ diagonal matrix whose diagonal $\text{diag}(F_i)$ is an $n_i q \times 1$ vector that stacks $n_i$ blocks of size $q \times 1$; each block has $-E\big[\frac{\partial^2}{\partial w_j(\ell)^2} \log dF_j(y_j(\ell) \mid w_j(\ell), \beta_j, \gamma_j)\big]$ as its $j$th element, for $\ell \in \mathcal{S}_i$, and zeros if $y_j(\ell)$ is unobserved. For a target taking the form of (5) we find

$$G(w_i)^{-1} = F_i - \frac{\partial^2}{\partial w_i^2} \log p(w_i \mid w_{[i]}, \theta) - \sum_{\{j : a_i \in \text{Par}(a_j)\}} \frac{\partial^2}{\partial w_i^2} \log p(w_j \mid w_i, w_{[j] \setminus \{i\}}, \theta); \tag{8}$$

this choice leads to an interpretation of (6) as a simplified manifold MALA proposal (SM-MALA) in which the curvature of the target $p(w_i \mid -)$ is assumed constant. We make a connection between a modified SM-MALA update and the Gibbs sampler available when the latent process and all outcomes are Gaussian.

Proposition 3.1.

In the hierarchical model $\alpha \sim N_k(\alpha; m_\alpha, V_\alpha)$, $x \mid \alpha, S \sim N_n(x; A\alpha, S)$, consider the following proposal for updating $\alpha \mid x, S$:

$$\alpha^* = \alpha + \frac{\varepsilon_1^2}{2} G(\alpha) \nabla_\alpha \log p(\alpha \mid -) + \varepsilon_2 G(\alpha)^{\frac{1}{2}} u,$$

where $u \sim N_k(0, I_k)$, and we set $\varepsilon_1 = \sqrt{2}$, $\varepsilon_2 = 1$. Then, $q(\alpha^* \mid \alpha) = p(\alpha^* \mid x, S)$, i.e. this modified SM-MALA proposal leads to always-accepted Gibbs updates.

The proof is in the supplement. A corollary of this proposition in the context of spatially meshed models is that when $F_j(y_j(\ell); w_j(\ell), \beta_j, \gamma_j) = N(y_j(\ell); w_j(\ell) + x_j(\ell)^\top \beta_j, \gamma_j^2)$ for all $j = 1, \dots, q$, an algorithm based on the modified SM-MALA proposal with unequal step sizes for updating $w_i$ is a Gibbs sampler. In other words, SM-MALA updates are related to a generalization of Gibbs samplers that have been shown to scale to big spatial data analyses (Datta et al., 2016a,b; Finley et al., 2019; Peruzzi et al., 2022; Peruzzi and Dunson, 2022; Peruzzi et al., 2021). With non-Gaussian outcomes, the probability of accepting the proposed $w_i^*$ depends on the ratio $q(w_i \mid w_i^*)/q(w_i^* \mid w_i)$. Computing this ratio requires $O(2 q^3 n_i^3)$ floating point operations, since the dimension of $w_i$ and $w_i^*$ is $q n_i$ and one needs to compute both $G(w_i)^{-\frac{1}{2}}$ and $G(w_i^*)^{-\frac{1}{2}}$, e.g. via Cholesky or QR factorizations. For these reasons, SM-MALA proposals may lead to unsatisfactory performance with larger $q$ due to their steeper compute costs relative to simpler MALA updates. We propose a novel adaptive algorithm below to overcome these issues.

3.2. Simplified Manifold Preconditioner Adaptation

Using a dense, constant preconditioner $M$ in (6) rather than the identity matrix leads to a computational cost of $O(q^2 n_i^2)$ per MCMC iteration; this cost is larger than for MALA updates, but "good" choices of $M$ might improve overall efficiency. Relative to position-dependent SM-MALA updates, a constant $M$ might be convenient if $q$ and/or $n_i$ are large, but it is unclear how $M$ can be fixed from the outset in the context of Algorithm 1. In the context of model (1), we cannot take $M^{-1}$ as the expected Fisher information evaluated at the mode due to the high dimensionality of the latent variables and their dependence on unknown hyperparameters. Adaptive methods may build a preconditioner (or its inverse) by starting from an initial guess $M^{(0)}$, then applying smaller and smaller changes to $M^{(m)}$ at iteration $m$ to get $M^{(m+1)}$. Past values of $w_i$ can be used to build a preconditioner: see, e.g., Haario et al. (2001), Andrieu and Thoms (2008), Marshall and Roberts (2012) for adaptive Metropolis, and Atchadé (2006) for MALA. These methods are not immediately advantageous because adaptation using past acceptances may be slow and lead to poor performance, especially in the within-Gibbs contexts in which we operate. Because $O(q^3 n_i^3)$ operations must be performed each time $M^{(m)}$ or its inverse is updated, due to the need to compute a matrix square root (e.g., Cholesky), slow adaptation methods become increasingly unappealing compared to simpler methods, like MALA, or methods that systematically construct a position-dependent preconditioner, like RM-MALA.

Algorithm 2: The $m$th iteration of Simplified Manifold Preconditioner Adaptation.

Setup and inputs: a $d$-dimensional random vector $X \in \mathcal{X} \subset \mathbb{R}^d$, $X \sim P$, whose density $p(\cdot) > 0$ is continuous with respect to the Lebesgue measure; assume $K$ is a compact subset of $\mathcal{X}$; fix the constants $D > 0$, $\kappa > 0$, $0 < T_{\text{adapt}} < \infty$, and step size $0 < \varepsilon < D$; denote $g(x) = \nabla_x \log p(x)$, $\tilde{g}(x) = g(x) \min\{D / \max_i |g(x)_{[i]}|, 1\}$, $G(x)^{-1} = -\frac{\partial^2}{\partial x^2} \log p(x)$, $\tilde{G}(x)^{-1} = G(x)^{-1} \min\{D / \max_i |G(x)^{-1}_{[i,i]}|, 1\}$; let the sequence $\{\gamma_m\}_{m \in \mathbb{N}}$ be such that $\gamma_m > 0$ and $\gamma_m \to 0$.
function SiMPA:
1:  Sample $z \sim U(0,1)$, $v \sim U(0,1)$, $u \sim N(0, I_d)$.
2:  Let $\mu^{(\text{new})} = x^{(m-1)} + \frac{\varepsilon^2}{2} M^{(m-1)} \tilde{g}(x^{(m-1)})$ and propose $x^{(\text{new})} = \mu^{(\text{new})} + \varepsilon M^{(m-1)\frac{1}{2}} u$.
3:  Let $\mu^{(\text{back})} = x^{(\text{new})} + \frac{\varepsilon^2}{2} M^{(m-1)} \tilde{g}(x^{(\text{new})})$.
4:  Compute
$$\alpha = \frac{p(x^{(\text{new})})}{p(x^{(m-1)})} \cdot \frac{N\big(x^{(m-1)}; \mu^{(\text{back})}, \varepsilon^2 M^{(m-1)}\big)}{N\big(x^{(\text{new})}; \mu^{(\text{new})}, \varepsilon^2 M^{(m-1)}\big)}.$$
5: if $v < \alpha$ and $\|x^{(\text{new})} - x^{(m-1)}\| < D$: ⊳ proposal accepted
6:   Set $x^{(m)} = x^{(\text{new})}$.
7: else: set $x^{(m)} = x^{(m-1)}$. ⊳ proposal rejected
if ($z < \gamma_m$ and $x^{(m)} \in K$) or $m < T_{\text{adapt}}$: ⊳ adapting
8:   Set $M^{(m)-1} = M^{(m-1)-1} + \kappa\big(\tilde{G}(x^{(m)})^{-1} - M^{(m-1)-1}\big)$ and compute $M^{(m)\frac{1}{2}}$.
else: ⊳ not adapting
9:   Set $M^{(m)} = M^{(m-1)}$.

To resolve these issues, we outline our Simplified Manifold Preconditioner Adaptation (SiMPA) as Algorithm 2. We present SiMPA in general terms as it operates independently of spatial meshing. The main feature of our algorithm is that it uses the negative Hessian matrix $G(x)^{-1}$ to adaptively build a (position-independent) preconditioner. In spatially meshed models and corresponding within-Gibbs posterior sampling algorithms, $G(x)$ can be computed easily using (8); also refer to the supplement for additional details. Comparatively, an adaptive algorithm similar to Atchadé (2006), which we label YA-MALA, replaces step 8 in Algorithm 2 with $\bar{x}^{(m)} = \bar{x}^{(m-1)} + \kappa(x^{(m)} - \bar{x}^{(m-1)})$ and $M^{(m)} = M^{(m-1)} + \kappa(\Gamma_m - M^{(m-1)})$, where $\Gamma_m = (x^{(m)} - \bar{x}^{(m-1)})(x^{(m)} - \bar{x}^{(m-1)})^\top + 10^{-6} I_d$, and leaves everything else the same. We show the benefits of adapting via SiMPA compared to YA-MALA in the supplement.

In SiMPA, we reduce the number of iterations with $O(q^3 n_i^3)$ complexity by applying fixed changes to $M^{(m)}$ with probability $\gamma_m \to 0$ as $m \to \infty$. As a consequence, the (expected) cost at iteration $m$ is $O(q^2 n_i^2 + \gamma_m q^3 n_i^3)$ rather than $O(q^3 n_i^3)$. In the context of spatially meshed models, $n_i$ is small, and the quadratic cost in $q$ can be further reduced via coregionalization (we do so in Section 4). In our applications, we use $\gamma_m = 1(m \leq T) + 1(m > T)(m - T)^{-a}$, where $1(A)$ is the indicator of the event $A$, $T < \infty$ is the number of initial iterations during which adaptation always occurs, and $a > 0$ is the rate at which the probability of adaptation decays after $T$. Small values of the parameter $\kappa$ lead to $M^{(m)}$ having long memory of the past.
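A one-line R sketch of this adaptation probability follows, using the default values chosen below ($T = 500$, $a = 1/3$) purely for illustration.

```r
# Sketch of the adaptation probability gamma_m = 1(m <= T) + 1(m > T) (m - T)^(-a).
adapt_prob <- function(m, T = 500, a = 1/3) ifelse(m <= T, 1, (m - T)^(-a))
adapt_prob(c(100, 500, 501, 1500, 30000))
```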

We conservatively choose T=500,a=1/3,κ=1/100 as these values allowed ample burn-in time for all spatial nodes in all our applications. Preliminary analyses with T=1000 led to an increase in compute time with no advantage in estimation, prediction, or efficiency. On the other hand, T=100 or a=1/2 resulted in lower compute times at the cost of overall performance: letting γm decay too quickly may lead to an inability to capture the appropriate geometry of the target density.

Because its update does not result in any increase in computational complexity, the step size $\varepsilon$ can be changed at each step, for example via dual averaging (DA) as in Algorithms 5 and 6 of Hoffman and Gelman (2014). We use the same DA scheme when comparing SiMPA to other gradient-based sampling methods. DA involves updates to $\varepsilon$ at each iteration $m < T_{\text{adapt}}$ and none afterwards. Because $T_{\text{adapt}} < \infty$, DA has no impact on the eventual convergence of the chain. Finally, the constant $D$ is used to limit the jump size of the proposals as well as bound the index set for adaptation. We need $D$, as well as additional conditions on the algorithmic behavior near the boundary of $K$, to satisfy the containment or bounded convergence condition (Roberts and Rosenthal, 2007, 2009) that allows SiMPA to provably converge in total variation distance to the target distribution $P$ even when the state space is not compact. Intuitively, outside of the compact $K$ we stop adapting after iteration $T_{\text{adapt}}$, whereas we perform an infinite (diminishing) adaptation inside it, in order to satisfy the conditions of Theorem 21 of Craiu et al. (2015).
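As a concrete illustration of Algorithm 2, the following is a minimal, self-contained R sketch of SiMPA on a toy low-dimensional target (a Poisson log-intensity with a Gaussian prior). The target, dimensions, and tuning values are illustrative assumptions; the compact-set safeguard near the boundary of $K$ and the truncation of the negative Hessian are omitted for brevity.

```r
## Self-contained sketch of SiMPA (Algorithm 2) on a toy d-dimensional target:
## log p(x) = sum(y*x - exp(x)) - x'Qx/2 + const (Poisson log-intensity, Gaussian prior).
set.seed(11)
d <- 5
Q <- 0.5 + diag(d) * 0.5                 # prior precision (illustrative): 1 on, 0.5 off diagonal
y <- rpois(d, lambda = 3)

logp    <- function(x) sum(y * x - exp(x)) - 0.5 * sum(x * (Q %*% x))
grad    <- function(x) y - exp(x) - as.vector(Q %*% x)   # g(x)
neghess <- function(x) diag(exp(x)) + Q                  # G(x)^{-1}: negative Hessian of log p

simpa <- function(n_iter = 5000, eps = 0.4, kappa = 0.1, Tadapt = 500, a = 1/3, D = 10) {
  x <- rep(0, d)
  Minv <- diag(d); M <- diag(d); Mhalf <- diag(d)
  gtrunc <- function(x) { g <- grad(x); g * min(D / max(abs(g)), 1) }  # truncated gradient
  draws <- matrix(NA_real_, n_iter, d)
  for (m in seq_len(n_iter)) {
    z <- runif(1); v <- runif(1); u <- rnorm(d)
    mu_new  <- x + 0.5 * eps^2 * as.vector(M %*% gtrunc(x))
    x_new   <- mu_new + eps * as.vector(Mhalf %*% u)
    mu_back <- x_new + 0.5 * eps^2 * as.vector(M %*% gtrunc(x_new))
    # log acceptance ratio; normalizing constants of the two proposal densities cancel
    qdiff <- -0.5 / eps^2 * (crossprod(x - mu_back, Minv %*% (x - mu_back)) -
                               crossprod(x_new - mu_new, Minv %*% (x_new - mu_new)))
    log_alpha <- logp(x_new) - logp(x) + as.numeric(qdiff)
    if (log(v) < log_alpha && sqrt(sum((x_new - x)^2)) < D) x <- x_new
    gamma_m <- if (m <= Tadapt) 1 else (m - Tadapt)^(-a)   # diminishing adaptation probability
    if (z < gamma_m || m < Tadapt) {
      Minv  <- Minv + kappa * (neghess(x) - Minv)          # step 8 of Algorithm 2
      M     <- solve(Minv)
      Mhalf <- t(chol(M))
    }
    draws[m, ] <- x
  }
  draws
}

draws <- simpa()
colMeans(draws[-(1:1000), ])   # posterior mean estimate for the toy target
```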

Proposition 3.2.

Suppose the target density is everywhere non-zero and twice differentiable, so that $g(x)$ and $G(x)$ are well defined. Let $\varepsilon > 0$, $K \subset \mathbb{R}^d$, $D > 0$. Additionally assume that if $x^{(m)} \in K$ with $\text{dist}(x^{(m)}, K^c) = u$ and $0 \leq u \leq 1$, then the proposal is changed to $x^{(\text{new})} \sim N\big(x^{(m)} + \frac{\varepsilon^2}{2} M^{(T_{\text{adapt}})} \tilde{g}(x^{(m)}),\; \varepsilon^2 M^{(T_{\text{adapt}})}\big)$. Then, Algorithm 2 converges in distribution to $P$.

The proof is in the supplement. The containment condition would hold without introducing K and without specifying the behavior of the algorithm near and outside K by assuming that 𝒳 itself is compact, which is in principle a restrictive assumption. In practice, K can be fixed large enough so that the chain essentially never leaves it. The SiMPA preconditioner will not in general correspond to the negative Hessian computed at the mode of the target density; rather, by a law of large numbers argument it will converge to the expectation of the negative Hessian of the target density.

4. Gaussian coregionalization of multi-type outcomes

We have so far outlined general methods and sampling algorithms for big data Bayesian models of multivariate multi-type outcomes. In this section, we remain agnostic about the outcome distributions, but specify a Gaussian model of latent dependence based on coregionalization. GPs are a convenient and common modeling option for characterizing latent cross-variability. We now assume the base process law $\Pi_\theta$ is a $q$-variate GP, i.e. $w(\cdot) \sim GP(0, C_\theta(\cdot, \cdot))$. The matrix-valued cross-covariance function $C_\theta(\cdot, \cdot)$ is parametrized by $\theta$ and is such that $C_\theta(\ell, \ell') = [\text{cov}\{w_i(\ell), w_j(\ell')\}]_{i,j=1}^q$, the $q \times q$ matrix whose $(i,j)$th element is the covariance between $w_i(\ell)$ and $w_j(\ell')$. $C_\theta(\cdot, \cdot)$ must be such that $C_\theta(\ell, \ell') = C_\theta(\ell', \ell)^\top$ and $\sum_{i=1}^n \sum_{j=1}^n z_i^\top C_\theta(\ell_i, \ell_j) z_j > 0$ for any integer $n$, any finite collection of points $\{\ell_1, \ell_2, \dots, \ell_n\}$, and all $z_i \in \mathbb{R}^q \setminus \{0\}$ (see, e.g., Genton and Kleiber, 2015).

4.1. Coregionalized cross-covariance functions

The challenges in constructing valid cross-covariance functions can be overcome by considering a linear model of coregionalization (LMC; Matheron, 1982; Wackernagel, 2003; Schmidt and Gelfand, 2003). A stationary LMC builds $q$-variate processes via linear combinations of $k$ univariate processes, $w(\ell) = \sum_{h=1}^k \lambda_h v_h(\ell) = \Lambda v(\ell)$, where $\Lambda = [\lambda_1, \dots, \lambda_k]$ is a $q \times k$ full (column) rank matrix with $(i,j)$th entry $\lambda_{ij}$ and $i$th row $\lambda_{[i,:]}$, and each $v_j(\cdot)$ is a univariate spatial process with correlation function $\rho_j(\ell, \ell') = \rho(\ell, \ell'; \phi_j)$; therefore $\theta = (\text{vec}(\Lambda), \Phi)$ where $\Phi = (\phi_1, \dots, \phi_k)$. Independence across the $k \leq q$ components of $v(\cdot)$ implies $\text{cov}(v_j(\ell), v_h(\ell')) = 0$ whenever $h \neq j$, and therefore $v(\cdot)$ is a multivariate process with diagonal cross-correlation $\rho(\ell, \ell'; \Phi)$. As a consequence, the cross-covariance of the $q$-variate process $w(\cdot)$ is $C_\theta(\ell, \ell') = \Lambda \rho(\ell, \ell'; \Phi) \Lambda^\top = \sum_{h=1}^k \lambda_h \lambda_h^\top \rho(\ell, \ell'; \phi_h)$. If $\ell - \ell' = 0$, then $C_\theta(0) = \Lambda \rho(0; \Phi) \Lambda^\top = \Lambda \Lambda^\top$ since $\rho(0; \Phi) = I_k$. Therefore, when $k = q$, $\Lambda$ is identifiable e.g. as a lower-triangular matrix with positive diagonal entries corresponding to the Cholesky factorization of $C_\theta(0)$ (see e.g., Finley et al., 2008; Zhang and Banerjee, 2022, and references therein for Bayesian LMC models). When $k < q$, a coregionalization model is interpretable as a latent spatial factor model. For a set $\mathcal{L} = \{\ell_1, \dots, \ell_n\}$ of locations, we let $\rho_\Phi(\mathcal{L}, \mathcal{L})$ be the $kn \times kn$ block matrix whose $(i,j)$ block is $\rho(\ell_i, \ell_j; \Phi)$ (which has zero off-diagonal elements), and thus $C_\theta(\mathcal{L}, \mathcal{L}) = (I_n \otimes \Lambda)\, \rho_\Phi(\mathcal{L}, \mathcal{L})\, (I_n \otimes \Lambda)^\top$. Notice that the $qn \times 1$ vector $w$ can be represented by an $n \times q$ matrix $W$ whose $j$th column includes realizations of the $j$th margin of the $q$-variate process. Assuming a GP, we find $w = \text{vec}(W^\top) \sim N(0, C_\theta(\mathcal{L}, \mathcal{L}))$. We can also equivalently represent process realizations by outcome rather than by location: if we let $\tilde{w} = \text{vec}(W)$ then $\tilde{w} \sim N(0, Q C_\theta(\mathcal{L}, \mathcal{L}) Q^\top)$ where $Q$ is a permutation matrix that appropriately reorders the rows of $C_\theta(\mathcal{L}, \mathcal{L})$ (and $Q^\top$ reorders its columns). We can write $Q C_\theta(\mathcal{L}, \mathcal{L}) Q^\top = \tilde{C}_\theta(\mathcal{L}, \mathcal{L}) = (\Lambda \otimes I_n)\, \tilde{\rho}_\Phi(\mathcal{L}, \mathcal{L})\, (\Lambda \otimes I_n)^\top = (\Lambda \otimes I_n) J \rho_\Phi(\mathcal{L}, \mathcal{L}) J^\top (\Lambda \otimes I_n)^\top$, where $J$ is an $nk \times nk$ permutation matrix that operates similarly to $Q$ but on the $k$ components of the LMC. Here, $\tilde{\rho}_\Phi(\mathcal{L}, \mathcal{L})$ is a block-diagonal matrix whose $j$th diagonal block is $\rho_j(\mathcal{L}, \mathcal{L})$, i.e. the $j$th LMC component correlation matrix at all locations. This latter representation clarifies that prior independence (i.e., a block diagonal $\tilde{\rho}_\Phi(\mathcal{L}, \mathcal{L})$) does not translate to independence along the $q$ outcome margins once the loadings $\Lambda$ are taken into account (in fact, $C_\theta(\mathcal{L}, \mathcal{L})$ is dense).
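The following R sketch assembles the dense LMC cross-covariance $(I_n \otimes \Lambda)\, \rho_\Phi\, (I_n \otimes \Lambda)^\top$ at a handful of locations; all numerical values and the exponential component correlations are illustrative assumptions.

```r
# Sketch: assembling the LMC cross-covariance C(L, L) = (I_n x Lambda) rho_Phi (I_n x Lambda)'
# for q = 3 outcomes and k = 2 factors with exponential component correlations.
set.seed(12)
n <- 6; q <- 3; k <- 2
coords <- matrix(runif(2 * n), ncol = 2)
Lambda <- matrix(rnorm(q * k), q, k)
phi    <- c(1.5, 2.5)
D      <- as.matrix(dist(coords))
# rho_Phi: kn x kn block matrix whose (i, j) block is diag(exp(-phi_h * d_ij), h = 1, ..., k)
rho <- matrix(0, n * k, n * k)
for (i in 1:n) for (j in 1:n)
  rho[(i - 1) * k + 1:k, (j - 1) * k + 1:k] <- diag(exp(-phi * D[i, j]), k)
C <- (diag(n) %x% Lambda) %*% rho %*% t(diag(n) %x% Lambda)    # qn x qn, dense
min(eigen(C, symmetric = TRUE, only.values = TRUE)$values)     # nonnegative up to rounding
```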

4.2. Latent GP hierarchical model

In practice, LMCs are advantageous in allowing one to represent dependence across $q$ outcomes via $k \leq q$ latent spatial factors. We build a multi-type outcome spatially meshed model by specifying $\Pi$ in (1) as a latent Gaussian LMC model with MGP factors

$$y_j(\ell) \mid \eta_j(\ell), \gamma_j \sim F_j(\eta_j(\ell), \gamma_j), \qquad \eta_j(\ell) = x_j(\ell)^\top \beta_j + \lambda_{[j,:]} v(\ell), \qquad v_h(\cdot) \sim MGP_{\mathcal{G}}(0, \rho_h(\cdot, \cdot)), \quad h = 1, \dots, k, \tag{9}$$

whose posterior distribution is

$$\pi\big(\{\beta_j, \gamma_j\}_{j=1}^q, v_{\mathcal{T}}, \Phi, \Lambda \mid y_{\mathcal{T}}\big) \propto \pi(\Phi) \prod_{h=1}^k \prod_{i=1}^M \pi(v_{h,i} \mid v_{h,[i]}, \phi_h) \prod_{j=1}^q \Big( \pi(\beta_j, \gamma_j) \prod_{\ell \in \mathcal{T}_j} dF_j\big(y_j(\ell) \mid v(\ell), \lambda_{[j,:]}, \beta_j, \gamma_j\big) \Big). \tag{10}$$

The LMC assumption on $w(\cdot)$ using MGP margins leads to computational simplifications in evaluating the density of the latent factors. For each of the $M$ partitions, we now have a product of $k$ independent Gaussian densities of dimension $n_i$ rather than a single density of dimension $q n_i$.

4.3. Spatial meshing of Gaussian LMCs

When seeking to achieve scalability of LMCs to large scale data via spatial meshing, it is unclear whether one should act directly on the q-variate spatial process w(·) obtained via coregionalization, or independently on each of the k LMC component processes. We now show that the two routes are equivalent with MGPs if a single DAG and a single domain partitioning scheme are used.

If the base process $\Pi$ is a $q$-variate coregionalized GP, then for $i = 1, \dots, M$ the conditional distributions are $\pi(w_i \mid w_{[i]}, \theta) = N(w_i; H_i w_{[i]}, R_i)$ where $H_i = C_{i,[i]} C_{[i]}^{-1}$, $R_i = C_i - H_i C_{[i],i}$, and $C(\ell, \ell') = \Lambda \rho(\ell, \ell') \Lambda^\top$ (we omit the $\theta$ and $\Phi$ subscripts for simplicity). When sampling, (5) simplifies to

$$p(w_i \mid -) \propto N(w_i; H_i w_{[i]}, R_i) \prod_{\{j : a_i \in \text{Par}(a_j)\}} N\big(w_j; H_{i \to j} w_i + H_{[j] \setminus \{i\}} w_{[j] \setminus \{i\}}, R_j\big) \prod_{\substack{j=1,\dots,q \\ \ell \in \mathcal{S}_i :\ y_j(\ell) \text{ is observed}}} dF_j\big(y_j(\ell) \mid w_j(\ell), \beta_j, \gamma_j\big), \tag{11}$$

where the notation $i \to j$ and $[j] \setminus \{i\}$ refers to the partitioning of $H_j$ by column into $H_j = [H_{i \to j}\;\; H_{[j] \setminus \{i\}}]$, and thus $w_{[j] \setminus \{i\}}$ corresponds to the blocks of $w_{[j]}$ excluding $w_i$ (i.e. the co-parents of $i$ relative to node $j$). $H_i$ and $R_i$ have dimension $q n_i \times q n_{[i]}$ and $q n_i \times q n_i$, respectively. Their dimension depends on $q$, and the following proposition uncovers their structure.

Algorithm 3: Posterior sampling and prediction of the LMC model (1) with MGP priors.

Initialize $\beta_j^{(0)}$, $\Lambda^{(0)}$ and $\gamma_j^{(0)}$ for $j = 1, \dots, q$, $v_{\mathcal{S}}^{(0)}$, and $\Phi^{(0)}$
for $t \in \{1, \dots, T^*, T^*+1, \dots, T^*+T\}$ do ⊳ sequential MCMC loop
for $j = 1, \dots, q$ do in parallel
1:  use SiMPA to update $\beta_j^{(t)}, \lambda_{[j,:]}^{(t)} \mid y_{\mathcal{T}}, v_{\mathcal{S}}^{(t-1)}, \gamma_j^{(t-1)}$ ⊳ $O(nq(p+k)^2)$
for $j = 1, \dots, q$ do in parallel
2:  use Metropolis-Hastings to update $\gamma_j^{(t)} \mid y_{\mathcal{T}}, v_{\mathcal{S}}^{(t-1)}, \beta_j^{(t)}, \lambda_{[j,:]}^{(t)}$ ⊳ $O(nq)$
3: use Metropolis-Hastings to update $\Phi^{(t)} \mid v_{\mathcal{S}}^{(t-1)}$ ⊳ $O(nkd^3m^2)$
for $c \in \text{Colors}(\mathcal{G})$ do ⊳ sequential
  for $i \in \{i : \text{Color}(a_i) = c\}$ do in parallel
4:    use SiMPA to update $v_i^{(t)} \mid v_{\text{mb}(i)}^{(t)}, y_i, \Lambda^{(t)}, \Phi^{(t)}, \{\beta_j^{(t)}, \gamma_j^{(t)}\}_{j=1}^q$ ⊳ $O(nmk^2)$
Assuming convergence has been attained after $T^*$ iterations:
discard $\{\beta_j^{(t)}, \gamma_j^{(t)}\}_{j=1}^q, v_{\mathcal{S}}^{(t)}, \Lambda^{(t)}, \Phi^{(t)}$ for $t = 1, \dots, T^*$
Output: correlated sample of size $T$ with density
$\{\beta_j^{(t)}, \gamma_j^{(t)}\}_{j=1}^q, v_{\mathcal{S}}^{(t)}, \Lambda^{(t)}, \Phi^{(t)} \sim \pi_{\mathcal{G}}\big(\{\beta_j, \gamma_j\}_{j=1}^q, v_{\mathcal{S}}, \Lambda, \Phi \mid y_{\mathcal{T}}\big)$.
Predict at $\ell^* \in \mathcal{U}$: for $t = 1, \dots, T$ and $j = 1, \dots, q$, sample from $\pi\big(v(\ell^*)^{(t)} \mid v_{[\ell^*]}^{(t)}, \Phi^{(t)}\big)$, then from $F_j\big(\cdot \mid w_j(\ell^*)^{(t)}, \beta_j^{(t)}, \lambda_{[j,:]}^{(t)}, \gamma_j^{(t)}\big)$

Proposition 4.1.

A $q$-variate MGP on a fixed DAG $\mathcal{G}$, a domain partition $T$, and a LMC cross-covariance function $C_\theta$ is equal in distribution to a LMC model built upon $k$ independent univariate MGPs, each of which is defined on the same $\mathcal{G}$ and $T$.

The proof proceeds by showing that if $w_i = (I_{n_i} \otimes \Lambda) v_i$ then $\pi(w_i \mid w_{[i]}) = \pi(v_i \mid v_{[i]})$ and that for all $i = 1, \dots, M$ we can write $\pi(v_i \mid v_{[i]}) = \prod_{h=1}^k \pi(v_i^{(h)} \mid v_{[i]}^{(h)})$, concluding that $\pi_{\mathcal{G}}(w_{\mathcal{S}}) = \prod_{i=1}^M \pi(w_i \mid w_{[i]}) = \prod_{i=1}^M \prod_{h=1}^k \pi(v_i^{(h)} \mid v_{[i]}^{(h)}) = \prod_{h=1}^k \pi_{\mathcal{G}}^{(h)}(v_{\mathcal{S}}^{(h)})$, where $\pi_{\mathcal{G}}^{(h)}$ is the density of the $h$th independent univariate MGP using $\mathcal{G}$, $T$, and correlation function $\rho_h(\cdot, \cdot)$. The complete derivation is available in the supplement. A corollary of Proposition 4.1 is that a different spatially meshed GP can be constructed via unequal spatial meshing (i.e., different graphs and partitions) along the $k$ margins; this result intuitively says that an MGP behaves like a standard GP with respect to the construction of multivariate processes via LMCs. In other words, there is no loss in flexibility when using MGPs compared to the full GP. The supplementary material provides details on $\nabla_{v_i} \log p(v_i \mid -)$ and $G(v_i)$ for posterior sampling of latent meshed Gaussian LMC models via Algorithm 1.

5. Applications on bivariate non-Gaussian data

We concentrate here on a scenario in which two possibly misaligned non-Gaussian outcomes are measured at a large number of spatial locations and we aim to jointly model them. We will consider a larger number of outcomes in Section 6, in the context of community ecology. In addition to the analysis presented here, the supplement includes (1) a comparison of methods across 750 multivariate synthetic datasets, and (2) performance assessments of multiple sampling schemes in multivariate multi-type models using latent coregionalized QMGPs.

5.1. Illustration: bivariate log-Gaussian Cox processes

When modeling spatial point patterns via log-Gaussian Cox processes with the goal of estimating the intensity surface, one typically proceeds by counting occurrences within cells of a regular grid over the spatial domain. We simulate this scenario by generating a bivariate Poisson outcome at each location of a $120 \times 120$ regular grid, for a data dimension of $qn = 28{,}800$. In model (1), we let $F_j$ be a Poisson distribution with intensity $\exp\{\eta_j(\ell)\}$ at $\ell \in [0,1]^2$, where $\eta_j(\ell) = x(\ell)^\top \beta_j + w_j(\ell)$ is the log-intensity for count outcome $j$. We sample 3 correlated covariates at each location independently as $x(\ell) \sim N_3(0, \Sigma_x)$ where $\Sigma_x$ is a correlation matrix with off-diagonal elements $\sigma_{12} = 0.9$, $\sigma_{13} = -0.3$, $\sigma_{23} = -0.6$, and we let $\beta_1 = (-0.5, -1, 0)^\top$, $\beta_2 = (-1, -0.5, 0.5)^\top$. We fix the latent process $\Pi$ in one scenario as a coregionalized GP and in another as a coregionalized NNGP. In both cases, $w_j(\ell) = \lambda_{[j,:]} v(\ell)$ and $\Lambda \Lambda^\top = [\sigma_{ij}]_{i,j=1,2}$ where $\sigma_{11} = 4$, $\sigma_{12} = \sigma_{21} = -1.3$, $\sigma_{22} = 1$, which yields a latent cross-correlation between the two outcomes of $\rho = -0.65$; the two spatial correlations used in the LMC model are $\rho_h(\ell - \ell') = \exp\{-\phi_h \|\ell - \ell'\|\}$ and we let $\phi_1 = 1.5$, $\phi_2 = 2.5$. We use the R package GpGp to generate an NNGP using maxmin ordering of the spatial locations and 10 neighbors. We depict the latent NNGP process along with the full data in Figure 5. We introduce missing values at 20% of the spatial locations, independently for each outcome. As a result, our training data are misaligned.
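For reference, the following R sketch reproduces the data-generating mechanism just described at a reduced size, using a $30 \times 30$ grid and an exact GP in place of the $120 \times 120$ grid and the GpGp-generated NNGP; the full-size settings and the maxmin ordering step are not reproduced here.

```r
## Sketch of the bivariate Poisson data-generating mechanism at a reduced size.
set.seed(6)
gs <- 30; n <- gs^2
coords <- as.matrix(expand.grid(seq(0, 1, length.out = gs), seq(0, 1, length.out = gs)))
D <- as.matrix(dist(coords))
phi <- c(1.5, 2.5)
v <- sapply(phi, function(p) t(chol(exp(-p * D) + 1e-8 * diag(n))) %*% rnorm(n))  # k = 2 GP factors
Lambda <- t(chol(matrix(c(4, -1.3, -1.3, 1), 2, 2)))   # lower-triangular, Lambda Lambda' = Sigma
w <- v %*% t(Lambda)                                   # n x 2 latent process, cross-corr -0.65
Sigma_x <- matrix(c(1, .9, -.3, .9, 1, -.6, -.3, -.6, 1), 3, 3)
x <- matrix(rnorm(3 * n), n, 3) %*% chol(Sigma_x)      # 3 correlated covariates
beta <- cbind(c(-0.5, -1, 0), c(-1, -0.5, 0.5))
eta <- x %*% beta + w                                  # log-intensities
y <- matrix(rpois(2 * n, exp(eta)), n, 2)              # bivariate Poisson counts
```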

Figure 5: Latent NNGP process realization and corresponding synthetic count data at 14,400 spatial locations for correlated spatial outcomes. We omit the plots corresponding to the unrestricted GP scenario as they are visually indistinguishable.

We investigate the comparative performance of several coregionalized QMGP variants computed via MALA, SM-MALA, SiMPA and NUTS. We also consider latent multivariate Student-t processes (Chen et al. 2020; Shah et al. 2014) using an alternative cross-covariance specification based on Apanasovich and Genton (2010)—in short “AG10”—and previously used in Peruzzi et al. (2022), which we also implement in the meshed Gaussian case. We detail the specifics of spatial meshing and gradient-based sampling for Student-t processes in the supplement. To the best of our knowledge, ours is the first implementation of a scalable Student-t process using DAGs. We also compare with a data transformation method based on NNGPs: for each outcome, we use y*=log(1+y), then fit NNGP models of the response on each outcome independently. All MCMC-based results are based on chains of length 30,000. All gradient-based methods share the dual averaging setup for adapting the step size ε and are thus allowed the same burn-in period. Finally, we implement an MCMC-free stochastic partial differential equations method (SPDE; Lindgren et al., 2011) fit via INLA. The SiMPA-estimated posterior means for the latent process, predictions across the spatial domain, as well as the width of 95% CIs about the linear predictors are reported in Figure 6, where we also highlight that the lack of visible spatial patterns in the linear predictor residuals is evidence of the ability of SiMPA to capture the spatial correlation in the data.
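For orientation, a hedged sketch of how such a coregionalized QMGP fit could be set up with the meshed R package follows. The spmeshed() function is available on CRAN, but the argument names, sampler settings, and output element shown here are assumptions based on our reading of the package documentation rather than a verified call; consult ?spmeshed before use.

```r
## Hedged sketch (not a verified call): argument and output names are assumptions.
# library(meshed)
# fit <- spmeshed(y = y, x = x, coords = coords,
#                 k = 2,                      # number of latent LMC factors
#                 family = "poisson",         # outcome family
#                 block_size = 36,            # approximate domain-partition block size
#                 n_samples = 20000, n_burn = 10000, n_thin = 20,
#                 n_threads = 8)
# str(fit)                                    # inspect posterior draws of w, Lambda, beta, theta
```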

Figure 6: Output from fitting a coregionalized QMGP via SiMPA to simulated data in the latent NNGP scenario. Top row: estimated posterior mean of the spatial latent process and predictions for both outcomes. Bottom row: width of posterior credible intervals about the log-intensity, and residual log-intensity.

A summary of results from all implemented methods is available in Table 1, which reports root mean square prediction error (RMSPE) and mean absolute error in prediction (MAEP) when predicting the log-intensity ηj,test and the outcomes yj,test,j=1,2 on the test set of 5740 locations, and the empirical coverage of 95% credible intervals (CI) about the log-intensity, in both scenarios. We observe that SiMPA offers excellent predictive performance and well calibrated credible intervals at a fraction of the compute cost, relative to state-of-the-art posterior sampling methods in this context. In the NNGP scenario, there is a disconnect between the fitted DAG (arising from a QMGP) and the data-generating DAG. This disconnect may explain why QMGP methods implementing the flexible AG10 cross-covariance function perform relatively better than in the GP scenario. Even in the NNGP setting, SiMPA retains excellent performance at a small compute cost.

Table 1.

Summary of out-of-sample results for all implemented models. Bolded values correspond to best performance.

Unrestricted GP scenario:

| Spatial model | Covariance | Algorithm | j | RMSPE (y) | MAEP (y) | RMSPE (η) | MAEP (η) | Covg 95% (η) | Time (s) |
|---|---|---|---|---|---|---|---|---|---|
| SPDE | LMC | INLA | 1 | 3.01 | 1.39 | 0.32 | 0.25 | 69.48 | 333 |
| | | | 2 | 6.14 | 1.90 | 0.33 | 0.26 | 63.44 | |
| QMGP | LMC | MALA | 1 | 2.33 | 1.27 | 0.21 | 0.17 | 99.51 | 90 |
| | | | 2 | 4.08 | 1.57 | 0.20 | 0.16 | 94.27 | |
| QMGP | LMC | YA-MALA | 1 | 7.27 | 2.56 | 0.95 | 0.76 | 5.42 | 111 |
| | | | 2 | 18.98 | 4.92 | 1.28 | 1.01 | 3.92 | |
| QMGP | LMC | SiMPA | 1 | 2.18 | 1.22 | 0.17 | 0.14 | 95.83 | 117 |
| | | | 2 | 4.03 | 1.57 | 0.19 | 0.15 | 94.55 | |
| QMGP | LMC | RM-MALA | 1 | 2.19 | 1.22 | 0.18 | 0.15 | 96.28 | 187 |
| | | | 2 | 4.20 | 1.56 | 0.23 | 0.18 | 93.37 | |
| QMGP | LMC | HMC | 1 | 2.18 | 1.23 | 0.17 | 0.14 | 95.73 | 246 |
| | | | 2 | 3.96 | 1.56 | 0.19 | 0.15 | 94.76 | |
| QMGP | LMC | NUTS | 1 | 2.16 | 1.22 | 0.18 | 0.14 | 93.89 | 620 |
| | | | 2 | 4.07 | 1.57 | 0.20 | 0.16 | 90.66 | |
| QMGP | AG10 | | 1 | 2.22 | 1.23 | 0.18 | 0.14 | 92.64 | 501 |
| | | | 2 | 4.01 | 1.56 | 0.20 | 0.16 | 91.60 | |
| QMTP | | | 1 | 2.19 | 1.23 | 0.18 | 0.14 | 91.77 | 857 |
| | | | 2 | 4.02 | 1.57 | 0.20 | 0.16 | 90.21 | |
| NNGP | Exp | Transform & Response | 1 | 5.71 | 2.01 | 1.19 | 0.98 | 59.41 | 166 |
| | | | 2 | 15.55 | 3.63 | 1.36 | 1.12 | 58.09 | |

Nearest-neighbor GP scenario (NN = 10):

| Spatial model | Covariance | Algorithm | j | RMSPE (y) | MAEP (y) | RMSPE (η) | MAEP (η) | Covg 95% (η) | Time (s) |
|---|---|---|---|---|---|---|---|---|---|
| SPDE | LMC | INLA | 1 | 2.87 | 1.36 | 0.32 | 0.26 | 69.27 | 334 |
| | | | 2 | 5.79 | 1.85 | 0.32 | 0.25 | 65.52 | |
| QMGP | LMC | MALA | 1 | 2.26 | 1.22 | 0.21 | 0.17 | 99.44 | 89 |
| | | | 2 | 3.77 | 1.53 | 0.19 | 0.15 | 93.58 | |
| QMGP | LMC | YA-MALA | 1 | 6.96 | 2.48 | 0.93 | 0.75 | 5.56 | 108 |
| | | | 2 | 18.85 | 4.89 | 1.30 | 1.03 | 4.20 | |
| QMGP | LMC | SiMPA | 1 | 2.16 | 1.19 | 0.17 | 0.14 | 95.21 | 116 |
| | | | 2 | 3.60 | 1.51 | 0.19 | 0.15 | 94.48 | |
| QMGP | LMC | RM-MALA | 1 | 2.17 | 1.20 | 0.18 | 0.15 | 94.93 | 183 |
| | | | 2 | 3.97 | 1.56 | 0.23 | 0.18 | 92.57 | |
| QMGP | LMC | HMC | 1 | 2.16 | 1.20 | 0.17 | 0.14 | 94.65 | 359 |
| | | | 2 | 3.54 | 1.51 | 0.19 | 0.15 | 93.82 | |
| QMGP | LMC | NUTS | 1 | 2.21 | 1.20 | 0.18 | 0.14 | 92.57 | 633 |
| | | | 2 | 3.56 | 1.51 | 0.19 | 0.15 | 90.97 | |
| QMGP | AG10 | | 1 | 2.13 | 1.19 | 0.17 | 0.14 | 91.39 | 480 |
| | | | 2 | 3.56 | 1.50 | 0.19 | 0.15 | 92.01 | |
| QMTP | | | 1 | 2.16 | 1.20 | 0.18 | 0.14 | 90.97 | 841 |
| | | | 2 | 3.50 | 1.50 | 0.19 | 0.15 | 91.28 | |
| NNGP | Exp | Transform & Response | 1 | 5.52 | 1.96 | 1.19 | 0.99 | 59.10 | 171 |
| | | | 2 | 15.44 | 3.56 | 1.34 | 1.10 | 58.68 | |
Because the only difference between SiMPA and YA-MALA is in how the preconditioner is adapted, the relatively poor performance of YA-MALA can be attributed to its requiring a much longer burn-in period in practice. We attribute the poor performance of the implemented NNGP methods to the fact that they are unable to capture cross-variable dependence, as well as to their being limited to Gaussian outcomes in the R package spNNGP.

Figure 7 expands on the analysis of empirical coverage of CIs by reporting the performance of all models at additional quantiles, relative to the oracle coverage, i.e., the empirical coverage of the model in which all unknowns are set to their true values. A value of relative coverage near 1 implies that the empirical coverage of the Q% CI is close to the coverage of the true data generating model. From Figure 7 we observe that SiMPA outperforms other methods at this task.

Figure 7: Top row: empirical coverage of uncertainty intervals at different quantiles, relative to the oracle model (values under 1 imply undercoverage of the interval), in the NNGP scenario. Bottom row: detailed comparison of the relative coverage of SiMPA and HMC for the linear predictor of each outcome.

5.2. MODIS data: leaf area and snow cover

The dynamics of vegetation greenness are important drivers of ecosystem processes; in alpine regions, they are influenced by seasonal snow cover. Predictive models for vegetation greenup and senescence in these settings are crucial for understanding how local biological communities respond to global change (Walker et al., 1993; Jönsson et al., 2010; Wang et al., 2015a; Xie et al., 2020). We consider remotely sensed leaf area and snow cover data from the MODerate resolution Imaging Spectroradiometer (MODIS) on the Terra satellite operated by NASA (v. 6.1) at 122,500 total locations (a $350 \times 350$ grid where each cell covers a 0.25 km$^2$ area) over a region spanning northern Italy, Switzerland, and Austria, during the 8-day period starting on December 3rd, 2019 (Figure 1). Leaf area index (LAI; the number of equivalent layers of leaves relative to a unit of ground area, available as level 4 product MOD15A2H) is our primary interest and is stored as a positive integer value, but is missing or unavailable at 38.2% of all spatial locations due to cloud cover or poor measurement quality. Snow cover (SC; the number of days within an 8-day period during which a location is covered by snow, obtained from level 3 product MOD10A2) is measured with error and missing at 7.3% of the domain locations.

We create a test set by introducing missingness in LAI at 10,000 spatial locations, of which 5030 are chosen uniformly at random across the whole domain and 4970 belong to a contiguous rectangular region as displayed in the bottom left subplot of Figure 8a. We attempt to explain LAI based on SC by fitting (9) on the bivariate outcome $y(\ell) = (y_{SC}(\ell), y_{LAI}(\ell))^\top$, where we assume a Binomial distribution with 8 trials and a logit link for SC, i.e. $E[y_{SC}(\ell) \mid \mu(\ell)] = 8\mu(\ell) = 8\,(1 + \exp\{-\eta_{SC}(\ell)\})^{-1}$, and a Poisson or Negative Binomial distribution for LAI. In both cases, $E[y_{LAI}(\ell) \mid \eta_{LAI}(\ell)] = \mu_{LAI}(\ell) = \exp\{\eta_{LAI}(\ell)\}$; for the Poisson model, $\text{Var}[y_{LAI}(\ell) \mid \eta_{LAI}(\ell)] = \mu_{LAI}(\ell)$, whereas for the Negative Binomial model $\text{Var}[y_{LAI}(\ell) \mid \eta_{LAI}(\ell)] = \mu_{LAI}(\ell) + \tau \mu_{LAI}^2(\ell)$, where $\tau$ is an unknown scale parameter. We fit model (9) using latent coregionalized QMGPs with $k = 2$ on a $50 \times 50$ axis-parallel domain partition and run SiMPA for 30,000 iterations, of which 10,000 are discarded as burn-in, thinning the remaining ones with a 20:1 ratio, leading to a posterior sample of size 1,000. We compare our approaches in terms of prediction and uncertainty quantification for $y_{LAI}$ on the test set to an SPDE-INLA approach implemented on a $60 \times 60$ mesh, which led to similar compute times. As shown in Table 2, QMGP-SiMPA is competitive with or outperforms the SPDE-INLA method across all measured quantities. Figure 8b reports predictive maps of the tested models (prediction values are censored at 100 for visualization clarity), along with a visualization of 75% one-sided credible intervals which shows the SPDE-INLA method exhibiting undesirable spatial patterns, unlike QMGP-SiMPA.

Figure 8:

Performance of QMGP-SiMPA and SPDE-INLA in the MODIS data application.

Table 2.

Root mean square prediction error (RMSPE), median absolute error (MedAE), continuous ranked probability score (CRPS; mean and median), and empirical coverage of one-sided intervals (CIq), over the out-of-sample test set of 6,998 locations.

Method | F_LAI | RMSPE | MedAE | CRPS (mean) | CRPS (median) | CI 75 | CI 95 | CI 99 | Time (minutes)
QMGP-SiMPA | Poisson | 16.543 | 1.322 | 3.916 | 1.199 | 0.867 | 0.974 | 0.989 | 25.4
QMGP-SiMPA | Neg. Binom. | 11.726 | 2.155 | 4.462 | 2.241 | 0.809 | 0.980 | 0.994 | 32
SPDE-INLA | Poisson | 27.839 | 2.154 | 4.695 | 1.214 | 0.835 | 0.938 | 0.961 | 25.8
SPDE-INLA | Neg. Binom. | 27019.470 | 2.444 | 54.986 | 1.720 | 0.875 | 0.975 | 0.987 | 86.5

6. Applications: spatial community ecology

Ecologists seek to jointly model the spatial occurrence of multiple species, while inferring the impact of phylogeny and environmental covariates (see, e.g., Dorazio and Royle 2005; Doser et al. 2022). In order to realistically model such a scenario, we consider cases in which a relatively large number of georeferenced outcomes is observed, with the goal of predicting their realization at unobserved locations and estimating their latent correlation structure after accounting for spatial and/or temporal variability. Presence/absence information for different species is encoded as a multivariate binary outcome. Our model for multivariate binary outcomes lets $F_j(y_j(\ell); \eta_j(\ell)) = \mathrm{Bern}(\mu_j(\ell))$, where $\mu_j(\ell) = \{1+\exp(-\eta_j(\ell))\}^{-1}$, and $v_h(\cdot) \sim QMGP(0, \rho_h(\cdot,\cdot))$, $h = 1, \dots, k$, in model (9), leading to coregionalized $k$-factor QMGPs which we fit via several Langevin methods, all of which use domain partitioning with blocks of size ≈ 36 and independent standard Gaussian priors on the lower-triangular elements of the factor loadings $\Lambda$, unless otherwise noted.

We compare QMGP methods fit via our proposed Langevin algorithms to joint species distribution models (JSDMs) implemented in the R package Hmsc (Tikhonov et al., 2020), a popular software package for community ecology. Hmsc uses a probit link for binary outcomes, i.e. $\mu_j(\ell) = \Phi(\eta_j(\ell))$ where $\Phi(\cdot)$ is the standard Gaussian distribution function; non-spatial JSDMs are then implemented by letting $v_h(\ell) \sim N(0,1)$ independently for all $\ell$ and $h = 1, \dots, k$, whereas NNGP-based JSDMs assume $v_h(\cdot) \sim NNGP(0, \rho_h(\cdot,\cdot))$, $h = 1, \dots, k$. We set the number of neighbors to $m = 20$ in the NNGP specification. Hmsc assumes a cumulative shrinkage prior on the factor loadings (Bhattacharya and Dunson, 2011), which we set up with minimal shrinkage ($a_1 = 2$, $a_2 = 2$) unless otherwise noted.
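For reference, a sketch of how a spatial JSDM of this kind might be specified in Hmsc is shown below. The functions Hmsc(), HmscRandomLevel(), and sampleMcmc() are part of the Hmsc package, but the specific argument values (e.g., sMethod = "NNGP", nNeighbours = 20) reflect our reading of its interface, and the data objects and covariate names are placeholders; consult the package documentation before use.

```r
# Sketch only: Y (n x q binary matrix), XData (data.frame with covariates x1, x2),
# and coords (n x 2 coordinate matrix with rownames matching the study design)
# are placeholders.
library(Hmsc)

studyDesign <- data.frame(site = factor(rownames(coords)))
rL <- HmscRandomLevel(sData = coords, sMethod = "NNGP", nNeighbours = 20)  # NNGP spatial level

m <- Hmsc(Y = Y, XData = XData, XFormula = ~ x1 + x2,
          distr = "probit",                 # probit link for presence/absence
          studyDesign = studyDesign,
          ranLevels = list(site = rL))

m <- sampleMcmc(m, samples = 5000, transient = 5000, thin = 1, nChains = 1)
```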

In the supplement, we compare our methods with alternative posterior sampling algorithms for fitting a multi-species N-mixture model to abundance data in community ecology.

6.1. Synthetic occupancy data

We generate 30 datasets by sampling $q = 10$ binary outcomes at $n = 900$ locations scattered uniformly in the domain $[0,1]^2$: after sampling $k = 3$ independent univariate GPs $v_j(\cdot) \sim GP(0, C_{\varphi_j})$, where $C_{\varphi_j}(\ell, \ell') = \exp\{-\varphi_j \|\ell - \ell'\|\}$ is the exponential covariance function with decay parameter $\varphi_j$, we construct a $q$-variate GP via coregionalization by letting $w(\ell) = \Lambda v(\ell)$, where $\Lambda$ is a $q \times k$ lower-triangular matrix. We then sample the binary outcomes using a probit link, i.e. $y_j(\ell) \sim \mathrm{Bern}(\mu_j(\ell))$ where $\mu_j(\ell) = \Phi(x(\ell)^{\top}\beta_j + w_j(\ell))$ for each $j = 1, \dots, q$, and where $x(\ell)$ is a column vector of $p = 2$ covariates. For each of the 30 datasets, we randomly set $\varphi_j \sim U(1/2, 10)$ and $\Lambda_{jj} \sim U(3/2, 2)$ for $j = 1, \dots, k$, $\Lambda_{ij} \sim U(-2, 2)$ for $i > j$, and $\beta_j \sim N(0, I_2/5)$. These choices lead to a wide range of latent pairwise correlations induced on the outcomes via $w(\cdot)$: letting $\Omega = (\omega_{ij})_{i,j=1,\dots,q} = \Lambda\Lambda^{\top}$ denote the cross-covariance at zero spatial distance, we obtain the cross-correlations as $\Omega_{corr} = \mathrm{diag}(\omega_{jj}^{-1/2})\, \Omega\, \mathrm{diag}(\omega_{jj}^{-1/2})$. We realistically model long-range spatial dependence by choosing small values for $\varphi_j$, $j = 1, \dots, k$. Lastly, we create a test set using 20% of the locations for each outcome (missing data locations differ across outcomes).
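A minimal R sketch of this data-generating process, using dense Cholesky factorizations (adequate at n = 900) and base R only, is given below; it mirrors the description above but is not the simulation code used in the paper, and the choice of covariates is illustrative.

```r
# Minimal sketch of the synthetic occupancy generator described above (not the paper's code).
set.seed(42)
n <- 900; q <- 10; k <- 3; p <- 2

coords <- matrix(runif(n * 2), ncol = 2)                 # locations in [0,1]^2
D <- as.matrix(dist(coords))                             # pairwise distances

phi <- runif(k, 1/2, 10)                                 # decay parameters
v <- sapply(1:k, function(j) {                           # k independent exponential-covariance GPs
  C <- exp(-phi[j] * D)
  drop(crossprod(chol(C + 1e-8 * diag(n)), rnorm(n)))    # t(chol(C)) %*% z has covariance C
})

Lambda <- matrix(0, q, k)                                # lower-triangular loadings
for (j in 1:k) Lambda[j, j] <- runif(1, 3/2, 2)
Lambda[lower.tri(Lambda)] <- runif(sum(lower.tri(Lambda)), -2, 2)

w <- v %*% t(Lambda)                                     # n x q coregionalized latent effects
x <- cbind(1, runif(n))                                  # p = 2 covariates (illustrative)
beta <- matrix(rnorm(p * q, sd = sqrt(1/5)), p, q)       # beta_j ~ N(0, I_2 / 5)

mu <- pnorm(x %*% beta + w)                              # probit link
y <- matrix(rbinom(n * q, 1, mu), n, q)                  # binary outcomes
```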

We use the setup of QMGPs and Hmsc outlined above, noting that the link function used to generate the data is correctly specified for Hmsc but not for our QMGP-based models, owing to our current software implementation in the R package meshed. MCMC for all methods was run for 10,000 iterations, of which the first 5,000 are discarded as burn-in. We compare all models based on out-of-sample classification performance on each of the 10 outcomes, as measured via the area under the receiver operating characteristic curve (AUC). Since a primary interest in these settings is to estimate latent correlations across outcomes, we also compare models based on $\|\hat{\Omega}_{corr} - \Omega_{corr}\|_F$, i.e. the Frobenius distance between the Monte Carlo estimate of the cross-correlation matrix and its true value; smaller values are therefore desirable. Figure 10 shows box-plots summarising the results, whereas Table 3 reports averages along with compute times. In these settings, the non-spatial model unsurprisingly performed worst. The Langevin methods for the spatial models proposed in this article, and SiMPA in particular, lead to improved classification performance, smaller errors in estimating latent correlations, and a 30-fold reduction in compute time relative to the coregionalized NNGP method implemented via MCMC in Hmsc.
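Both comparison metrics are straightforward to compute from posterior output. The sketch below shows one way to do so; Lambda_samples (a list of posterior draws of $\Lambda$), y, and score are placeholder objects, and the AUC uses a rank-based (Mann-Whitney) form to avoid external dependencies.

```r
# Cross-correlation matrix implied by a loadings matrix Lambda (q x k)
cross_correlation <- function(Lambda) {
  Omega <- Lambda %*% t(Lambda)                      # cross-covariance at zero distance
  d <- diag(1 / sqrt(diag(Omega)))
  d %*% Omega %*% d
}

# Frobenius distance between the posterior-mean cross-correlation and the truth.
# Lambda_samples is a placeholder: a list of posterior draws of Lambda.
frobenius_error <- function(Lambda_samples, Lambda_true) {
  Ohat <- Reduce(`+`, lapply(Lambda_samples, cross_correlation)) / length(Lambda_samples)
  sqrt(sum((Ohat - cross_correlation(Lambda_true))^2))
}

# Rank-based AUC; y in {0,1}, score = predicted occurrence probability
auc <- function(y, score) {
  r <- rank(score)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
```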

Figure 10:

Box-plot summaries of estimation and classification performance over 30 datasets. Left: $\|\hat{\Omega}_{corr} - \Omega_{corr}\|_F$ for the competing methods. Right: AUC for each outcome.

Table 3.

Performance in classification, estimation, and compute time, over 30 synthetic datasets.

Method | Hmsc | Hmsc | MALA | SM-MALA | SiMPA
Prior on rand. eff. | non-spatial | NNGP | QMGP | QMGP | QMGP
Avg. AUC | 0.827 | 0.885 | 0.882 | 0.874 | 0.885
Min. AUC | 0.573 | 0.608 | 0.392 | 0.530 | 0.609
Max. AUC | 0.969 | 0.983 | 0.986 | 0.987 | 0.984
$\|\hat{\Omega}_{corr} - \Omega_{corr}\|_F$ | 1.66 | 1.43 | 1.46 | 1.91 | 1.14
Avg. time (minutes) | 5.15 | 17.4 | 0.44 | 0.73 | 0.53

6.2. North American breeding bird survey data

The North American Breeding Bird Survey dataset contains avian point count data for more than 700 North American bird taxa (species, races, and unidentified species groupings). These data are collected annually during the breeding season, primarily June, along thousands of randomly established roadside survey routes in the United States and Canada.

We consider a dataset of $n = 4{,}118$ locations spanning the continental U.S. and $q = 27$ bird species. The species we consider belong to the order Passeriformes and are each observed at between 40% and 60% of the available locations; Figure 2 shows a subset of the data. We dichotomize the observed counts to obtain presence/absence data, so the effective data size is $nq = 111{,}186$. We implement Langevin methods using coregionalized QMGPs with $k = 2, 4, 6, 8, 10$ spatial factors, using exponential correlations with decay $\varphi \sim U[0.1, 10]$ a priori. We also test sensitivity to the domain partitioning scheme by considering 8 × 4 (coarse), 16 × 8 (medium), and 32 × 16 (fine) axis-parallel domain partitions; finer partitioning implies more restrictive spatial conditional independence assumptions. In implementing the shrinkage prior of Bhattacharya and Dunson (2011), Hmsc dynamically chooses the number of factors up to a maximum $k_{max}$: in the non-spatial Hmsc model, setting $k_{max} = 10$ ultimately leads to 6 or fewer factors being used during MCMC. In the spatial Hmsc models using NNGPs, we set $k_{max} = 2$ or $k_{max} = 5$ to restrict run times. Figure 11 reports average classification performance and run times. QMGP-MALA run time scales only linearly with the number of factors, but its performance is strongly negatively impacted by partition size. QMGP-SM-MALA exhibits large improvements in classification performance, but these improvements come at a large run-time cost. QMGP-SiMPA outperforms all other models while providing large time savings relative to SM-MALA and being less sensitive to the choice of partition. A QMGP-SiMPA model on the 32 × 16 partition with $k = 4$ outperforms a spatial NNGP-Hmsc model in classifying the 27 bird species, with a reduction in run time of over three orders of magnitude (4.1 minutes versus 70.7 hours, respectively). We summarize the efficiency in sampling the elements of $\Omega_{corr}$ in Table 4, where we report ESS/s relative to the non-spatial Hmsc model. While efficient estimation of $\Omega_{corr}$ remains challenging, QMGP-SiMPA models show marked improvements relative to a state-of-the-art alternative.
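Relative sampling efficiency as reported in Table 4 can be computed from the MCMC draws of the elements of $\Omega_{corr}$. The sketch below uses coda::effectiveSize() from the coda package; the omega_draws_* and timing objects are placeholders.

```r
# Placeholders: omega_draws_method and omega_draws_baseline are S x M matrices whose
# columns are MCMC draws of the lower-triangular elements of Omega_corr; time_* are
# total run times in seconds for each sampler.
library(coda)

relative_ess_per_sec <- function(omega_draws_method, time_method,
                                 omega_draws_baseline, time_baseline) {
  ess_m <- effectiveSize(as.mcmc(omega_draws_method)) / time_method      # ESS/s, candidate
  ess_b <- effectiveSize(as.mcmc(omega_draws_baseline)) / time_baseline  # ESS/s, baseline
  ratio <- ess_m / ess_b                                                 # relative ESS/s per element
  c(min = min(ratio), median = median(ratio), mean = mean(ratio), max = max(ratio))
}
```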

Figure 11:

Left: mean AUC across the 27 bird species for different choices of k. Center: run times in hours. Right: run time of QMGP models as a proportion of the run time when choosing k = 2.

Table 4.

Out-of-sample performance in classification of the 27 bird species, compute time, and efficiency in estimation of $\Omega_{corr}$, relative to a non-spatial JSDM.

Method | Hmsc | Hmsc | SiMPA | SiMPA | SiMPA | SiMPA
Prior | non-spatial | NNGP | QMGP | QMGP | QMGP | QMGP
k | ≤ 10 | 5 | 4 | 4 | 10 | 10
Setting | (none) | m = 20 | 32 × 16 | 8 × 4 | 32 × 16 | 8 × 4
Avg. AUC | 0.9349 | 0.9293 | 0.9565 | 0.9565 | 0.9728 | 0.9732
Time (minutes) | 87.45 | 4245.02 | 4.08 | 43.10 | 9.27 | 187.24

ESS/s for elements of $\Omega_{corr}$ (relative to non-spatial Hmsc):
min | 1 | $10^{-4}$ | 0.57 | 0.02 | 0.05 | 0.01
median | 1 | 0.012 | 2.12 | 0.33 | 0.86 | 0.05
mean | 1 | 0.015 | 3.69 | 0.45 | 1.20 | 0.07
max | 1 | 0.102 | 42.47 | 3.95 | 11.15 | 0.56

7. Discussion

We have introduced Bayesian hierarchical models based on DAG constructions of latent spatial processes for large scale non-Gaussian multivariate multi-type data, which may be misaligned, along with computational tools for adaptive posterior sampling. We illustrated our methods using applications with data sizes in the tens to hundreds of thousands, with compute times ranging from a few seconds to under 30 minutes on a single workstation. The compute time for a single SiMPA iteration for a univariate Poisson outcome observed on gridded coordinates with $n = 10^6$ is under 0.2 seconds after burn-in; in other words, our methods enable running MCMC for hundreds of thousands of iterations on massive spatial data under a total time budget of 12 hours (for instance, 200,000 iterations at 0.2 seconds each take roughly 11 hours).

We have applied our methodologies using practical cross-covariance choices such as models of coregionalization built on independent stationary covariances. However, nonstationary models are desirable in many applied settings. Recent work (Jin et al., 2021) highlights that the DAG must be chosen carefully when considering explicit models of nonstationarity, as spatial process models based on sparse DAGs induce nonstationarities even when built from stationary covariances. Our work in this article will enable new research into nonstationary models of large scale non-Gaussian data. Furthermore, our methods can be applied for posterior sampling of Bayesian hierarchies based on more complex conditional independence models of multivariate dependence (Dey et al., 2021).

In our work, we have assumed a common DAG and partitioning for all spatial variables. In some settings, these assumptions may lead to inflexibility in modeling variables with fundamentally different dependence structures and/or spatial domain constraints (see, e.g., Jin et al. 2022). In the multivariate setting, one potentially useful direction towards building a highly flexible class of models is to infer different DAGs for different factors within a spatial factor model by extending the methods of Jin et al. (2021). Understanding how to generally build flexible and scalable spatial factor models using different DAGs and accounting for unequal domain constraints is an interesting direction for future research.

Our methodologies rely on the ability to embed the assumed spatial DAG within the larger Bayesian hierarchy and lead to drastic reductions in wall clock time compared to models based on unrestricted GPs. Nevertheless, high posterior correlations of high dimensional model parameters may still negatively impact overall sampling efficiency in certain cases. Motivated by recent progress in improving sampling efficiency of multivariate Gaussian models (Peruzzi et al., 2021), future research will consider generalized strategies for improving MCMC performance in spatial factor models of highly multivariate non-Gaussian data. Finally, optimizing DAG choice for MCMC performance is another interesting path, and recent work on the theory of Bayesian computation for hierarchical models (Zanella and Roberts, 2021) might motivate further development for spatial process models based on DAGs.

Supplementary Material

Supplement fix

Figure 9:

Synthetic dichotomous occurrence data (top row), and the spatial latent effects used to generate them (bottom row). Here, we show 5 (of 10) outcomes in 1 (of 30) simulated datasets.

Figure 12:

Lower-triangular portion of $\Omega_{corr}$, the estimated correlation among the 27 bird species.

Acknowledgements

The authors have received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No 856506), and grant R01ES028804 of the United States National Institutes of Health (NIH).

References

1. Andrieu C and Thoms J (2008). A tutorial on adaptive MCMC. Statistics and Computing, 18:343–373. doi: 10.1007/s11222-008-9110-y.
2. Apanasovich TV and Genton MG (2010). Cross-covariance functions for multivariate random fields based on latent dimensions. Biometrika, 97:15–30. doi: 10.1093/biomet/asp078.
3. Atchadé YF (2006). An adaptive version for the Metropolis adjusted Langevin algorithm with a truncated drift. Methodology and Computing in Applied Probability, 8:235–254. doi: 10.1007/s11009-006-8550-0.
4. Banerjee S (2017). High-dimensional Bayesian geostatistics. Bayesian Analysis, 12(2):583–614. doi: 10.1214/17-BA1056R.
5. Banerjee S (2020). Modeling massive spatial datasets using a conjugate Bayesian linear modeling framework. Spatial Statistics, 37:100417. doi: 10.1016/j.spasta.2020.100417.
6. Banerjee S, Finley AO, Waldmann P, and Ericsson T (2010). Hierarchical spatial process models for multiple traits in large genetic trials. Journal of the American Statistical Association, 105(490):506–521. doi: 10.1198/jasa.2009.ap09068.
7. Banerjee S, Gelfand AE, Finley AO, and Sang H (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society, Series B, 70:825–848. doi: 10.1111/j.1467-9868.2008.00663.x.
8. Betancourt M (2018). A conceptual introduction to Hamiltonian Monte Carlo. arXiv:1701.02434.
9. Bhattacharya A and Dunson DB (2011). Sparse Bayesian infinite factor models. Biometrika, 98(2):291–306. doi: 10.1093/biomet/asr013.
10. Blomstedt P, Parente Paiva Mesquita D, Lintusaari J, Sivula T, Corander J, and Kaski S (2019). Meta-analysis of Bayesian analyses. arXiv:1904.04484.
11. Bradley JR, Holan SH, and Wikle CK (2018). Computationally efficient multivariate spatio-temporal models for high-dimensional count-valued data (with discussion). Bayesian Analysis, 13(1):253–310. doi: 10.1214/17-BA1069.
12. Bradley JR, Holan SH, and Wikle CK (2019). Bayesian hierarchical models with conjugate full-conditional distributions for dependent data from the natural exponential family. Journal of the American Statistical Association. doi: 10.1080/01621459.2019.1677471.
13. Carpenter B, Gelman A, Hoffman MD, Lee D, Goodrich B, Betancourt M, Brubaker M, Guo J, Li P, and Riddell A (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1). doi: 10.18637/jss.v076.i01.
14. Chen Z, Wang B, and Gorban AN (2020). Multivariate Gaussian and Student-t process regression for multi-output prediction. Neural Computing and Applications, 32:3005–3028. doi: 10.1007/s00521-019-04687-8.
15. Craiu RV, Gray L, Łatuszyński K, Madras N, Roberts GO, and Rosenthal JS (2015). Stability of adversarial Markov chains, with an application to adaptive MCMC algorithms. The Annals of Applied Probability, 25(6):3592–3623. doi: 10.1214/14-AAP1083.
16. Cressie N and Johannesson G (2008). Fixed rank kriging for very large spatial data sets. Journal of the Royal Statistical Society, Series B, 70:209–226. doi: 10.1111/j.1467-9868.2007.00633.x.
17. Datta A, Banerjee S, Finley AO, and Gelfand AE (2016a). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association, 111:800–812. doi: 10.1080/01621459.2015.1044091.
18. Datta A, Banerjee S, Finley AO, Hamm NAS, and Schaap M (2016b). Nonseparable dynamic nearest neighbor Gaussian process models for large spatio-temporal data with an application to particulate matter analysis. The Annals of Applied Statistics, 10:1286–1316. doi: 10.1214/16-AOAS931.
19. Dey D, Datta A, and Banerjee S (2021). Graphical Gaussian process models for highly multivariate spatial data. Biometrika, in press. doi: 10.1093/biomet/asab061.
20. Dorazio RM and Royle JA (2005). Estimating size and composition of biological communities by modeling the occurrence of species. Journal of the American Statistical Association, 100(470):389–398. doi: 10.1198/016214505000000015.
21. Doser JW, Finley AO, Kéry M, and Zipkin EF (2022). spOccupancy: An R package for single-species, multi-species, and integrated spatial occupancy models. Methods in Ecology and Evolution, 13(8):1670–1678. doi: 10.1111/2041-210X.13897.
22. Duane S, Kennedy AD, Pendleton BJ, and Roweth D (1987). Hybrid Monte Carlo. Physics Letters B, 195:216–222.
23. Dunson D and Johndrow JE (2020). The Hastings algorithm at fifty. Biometrika, 107(1):1–23. doi: 10.1093/biomet/asz066.
24. Finley AO, Banerjee S, Ek AR, and McRoberts RE (2008). Bayesian multivariate process modeling for prediction of forest attributes. Journal of Agricultural, Biological, and Environmental Statistics, 13:60. doi: 10.1198/108571108X273160.
25. Finley AO, Datta A, Cook BD, Morton DC, Andersen HE, and Banerjee S (2019). Efficient algorithms for Bayesian nearest neighbor Gaussian processes. Journal of Computational and Graphical Statistics, 28:401–414. doi: 10.1080/10618600.2018.1537924.
26. Furrer R, Genton MG, and Nychka D (2006). Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics, 15:502–523. doi: 10.1198/106186006X132178.
27. Genton MG and Kleiber W (2015). Cross-covariance functions for multivariate geostatistics. Statistical Science, 30:147–163. doi: 10.1214/14-STS487.
28. Girolami M and Calderhead B (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 73(2):123–214. doi: 10.1111/j.1467-9868.2010.00765.x.
29. Gramacy RB and Apley DW (2015). Local Gaussian process approximation for large computer experiments. Journal of Computational and Graphical Statistics, 24(2):561–578. doi: 10.1080/10618600.2014.914442.
30. Guhaniyogi R and Banerjee S (2018). Meta-kriging: Scalable Bayesian modeling and inference for massive spatial datasets. Technometrics, 60(4):430–444. doi: 10.1080/00401706.2018.1437474.
31. Haario H, Saksman E, and Tamminen J (2001). An adaptive Metropolis algorithm. Bernoulli, 7(2):223–242. doi: 10.2307/3318737.
32. Heaton MJ, Datta A, Finley AO, Furrer R, Guinness J, Guhaniyogi R, Gerber F, Gramacy RB, Hammerling D, Katzfuss M, Lindgren F, Nychka DW, Sun F, and Zammit-Mangion A (2019). A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics, 24(3):398–425. doi: 10.1007/s13253-018-00348-w.
33. Hoffman MD and Gelman A (2014). The no-U-turn sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(47):1593–1623. https://www.jmlr.org/papers/v15/hoffman14a.html.
34. Jin B, Herring AH, and Dunson DB (2022). Spatial predictions on physically constrained domains: Applications to arctic sea salinity data. arXiv:2210.03913.
35. Jin B, Peruzzi M, and Dunson DB (2021). Bag of DAGs: Flexible & scalable modeling of spatiotemporal dependence. arXiv:2112.11870.
36. Johndrow JE, Pillai NS, and Smith A (2020). No free lunch for approximate MCMC. arXiv:2010.12514.
37. Jönsson AM, Eklundh L, Hellström M, Bärring L, and Jönsson P (2010). Annual changes in MODIS vegetation indices of Swedish coniferous forests in relation to snow dynamics and tree phenology. Remote Sensing of Environment, 114:2719–2730. doi: 10.1016/j.rse.2010.06.005.
38. Jurek M and Katzfuss M (2020). Hierarchical sparse Cholesky decomposition with applications to high-dimensional spatio-temporal filtering. arXiv:2006.16901.
39. Katzfuss M (2017). A multi-resolution approximation for massive spatial datasets. Journal of the American Statistical Association, 112:201–214. doi: 10.1080/01621459.2015.1123632.
40. Katzfuss M and Guinness J (2021). A general framework for Vecchia approximations of Gaussian processes. Statistical Science, 36(1):124–141. doi: 10.1214/19-STS755.
41. Kaufman CG, Schervish MJ, and Nychka DW (2008). Covariance tapering for likelihood-based estimation in large spatial data sets. Journal of the American Statistical Association, 103:1545–1555. doi: 10.1198/016214508000000959.
42. Lindgren F, Rue H, and Lindström J (2011). An explicit link between Gaussian fields and Gaussian Markov random fields: the stochastic partial differential equation approach. Journal of the Royal Statistical Society: Series B, 73:423–498. doi: 10.1111/j.1467-9868.2011.00777.x.
43. Marshall T and Roberts G (2012). An adaptive approach to Langevin MCMC. Statistics and Computing, 22:1041–1057. doi: 10.1007/s11222-011-9276-6.
44. Matheron G (1982). Pour une analyse krigeante des données régionalisées. Technical report N.732, Centre de Géostatistique.
45. Mesquita D, Blomstedt P, and Kaski S (2020). Embarrassingly parallel MCMC using deep invertible transformations. In Adams RP and Gogate V, editors, Proceedings of Machine Learning Research, volume 115, pages 1244–1252, Tel Aviv, Israel. PMLR. http://proceedings.mlr.press/v115/mesquita20a.html.
46. Neal RM (2011). MCMC using Hamiltonian dynamics. In Brooks S, Gelman A, Jones GL, and Meng X-L, editors, Handbook of Markov Chain Monte Carlo. CRC Press, New York. doi: 10.1201/b10905.
47. Neiswanger W, Wang C, and Xing EP (2014). Asymptotically exact, embarrassingly parallel MCMC. In Proceedings of the Thirtieth Conference on Uncertainty in Artificial Intelligence, UAI'14, pages 623–632, Arlington, Virginia, USA. AUAI Press.
48. Nemeth C and Sherlock C (2018). Merging MCMC subposteriors through Gaussian-process approximations. Bayesian Analysis, 13(2):507–530. doi: 10.1214/17-BA1063.
49. Peruzzi M, Banerjee S, Dunson DB, and Finley AO (2021). Grid-Parametrize-Split (GriPS) for improved scalable inference in spatial big data analysis. arXiv:2101.03579.
50. Peruzzi M, Banerjee S, and Finley AO (2022). Highly scalable Bayesian geostatistical modeling via meshed Gaussian processes on partitioned domains. Journal of the American Statistical Association, 117(538):969–982. doi: 10.1080/01621459.2020.1833889.
51. Peruzzi M and Dunson DB (2022). Spatial multivariate trees for big data Bayesian regression. Journal of Machine Learning Research, 23(17):1–40. http://jmlr.org/papers/v23/20-1361.html.
52. Roberts GO and Rosenthal JS (2007). Coupling and ergodicity of adaptive Markov chain Monte Carlo algorithms. Journal of Applied Probability, 44:458–475. doi: 10.1239/jap/1183667414.
53. Roberts GO and Rosenthal JS (2009). Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18(2):349–367. doi: 10.1198/jcgs.2009.06134.
54. Roberts GO and Stramer O (2002). Langevin diffusions and Metropolis-Hastings algorithms. Methodology and Computing in Applied Probability, 4:337–357. doi: 10.1023/A:1023562417138.
55. Roberts GO and Tweedie RL (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363.
56. Rue H and Held L (2005). Gaussian Markov Random Fields: Theory and Applications. Chapman & Hall/CRC. doi: 10.1007/978-3-642-20192-9.
57. Rue H, Martino S, and Chopin N (2009). Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B, 71:319–392. doi: 10.1111/j.1467-9868.2008.00700.x.
58. Sang H and Huang JZ (2012). A full scale approximation of covariance functions for large spatial data sets. Journal of the Royal Statistical Society, Series B, 74:111–132. doi: 10.1111/j.1467-9868.2011.01007.x.
59. Schmidt AM and Gelfand AE (2003). A Bayesian coregionalization approach for multivariate pollutant data. Journal of Geophysical Research, 108:D24. doi: 10.1029/2002JD002905.
60. Sengupta A and Cressie N (2013). Hierarchical statistical modeling of big spatial datasets using the exponential family of distributions. Spatial Statistics. doi: 10.1016/j.spasta.2013.02.002.
61. Shah A, Wilson AG, and Ghahramani Z (2014). Student-t processes as alternatives to Gaussian processes. In Proceedings of the 17th International Conference on Artificial Intelligence and Statistics (AISTATS).
62. Shirota S, Finley AO, Cook BD, and Banerjee S (2019). Conjugate nearest neighbor Gaussian process models for efficient statistical interpolation of large spatial data. arXiv:1907.10109.
63. Stein ML (2002). The screening effect in kriging. The Annals of Statistics, 30(1):298–323. doi: 10.1214/aos/1015362194.
64. Stein ML (2014). Limitations on low rank approximations for covariance matrices of spatial data. Spatial Statistics, 8:1–19. doi: 10.1016/j.spasta.2013.06.003.
65. Stein ML, Chi Z, and Welty LJ (2004). Approximating likelihoods for large spatial data sets. Journal of the Royal Statistical Society, Series B, 66:275–296. doi: 10.1046/j.1369-7412.2003.05512.x.
66. Sun Y, Li B, and Genton M (2011). Geostatistics for large datasets. In Montero J, Porcu E, and Schlather M, editors, Advances and Challenges in Space-time Modelling of Natural Events, pages 55–77. Springer-Verlag, Berlin Heidelberg. doi: 10.1007/978-3-642-17086-7.
67. Taylor BM and Diggle PJ (2014). INLA or MCMC? A tutorial and comparative evaluation for spatial prediction in log-Gaussian Cox processes. Journal of Statistical Computation and Simulation, 84(10):2266–2284. doi: 10.1080/00949655.2013.788653.
68. Tikhonov G, Opedal OH, Abrego N, Lehikoinen A, de Jonge MMJ, Oksanen J, and Ovaskainen O (2020). Joint species distribution modelling with the R-package Hmsc. Methods in Ecology and Evolution, 11(3):442–447. doi: 10.1111/2041-210X.13345.
69. Vecchia AV (1988). Estimation and model identification for continuous spatial processes. Journal of the Royal Statistical Society, Series B, 50:297–312. doi: 10.1111/j.2517-6161.1988.tb01729.x.
70. Vihola M (2012). Robust adaptive Metropolis algorithm with coerced acceptance rate. Statistics and Computing, 22:997–1008. doi: 10.1007/s11222-011-9269-5.
71. Wackernagel H (2003). Multivariate Geostatistics: An Introduction with Applications. Springer, Berlin. doi: 10.1007/978-3-662-05294-5.
72. Walker DA, Halfpenny JC, Walker MD, and Wessman CA (1993). Long-term studies of snow-vegetation interactions. BioScience, 43(5):287–301. doi: 10.2307/1312061.
73. Wang K, Zhang L, Qiu Y, Ji L, Tian F, Wang C, and Wang Z (2015a). Snow effects on alpine vegetation in the Qinghai-Tibetan plateau. International Journal of Digital Earth, 8(1):58–75. doi: 10.1080/17538947.2013.848946.
74. Wang X and Dunson DB (2014). Parallelizing MCMC via Weierstrass sampler. arXiv:1312.4605.
75. Wang X, Guo F, Heller KA, and Dunson DB (2015b). Parallelizing MCMC with random partition trees. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS'15, pages 451–459, Cambridge, MA, USA. MIT Press. arXiv:1506.03164.
76. Xie J, Jonas T, Rixen C, de Jong R, Garonna I, Notarnicola C, Asam S, Schaepman ME, and Kneubühler M (2020). Land surface phenology and greenness in Alpine grasslands driven by seasonal snow and meteorological factors. Science of The Total Environment, 725:138380. doi: 10.1016/j.scitotenv.2020.138380.
77. Zanella G and Roberts G (2021). Multilevel linear models, Gibbs samplers and multigrid decompositions. Bayesian Analysis. doi: 10.1214/20-BA1242.
78. Zhang L and Banerjee S (2022). Spatial factor modeling: A Bayesian matrix-normal approach for misaligned data. Biometrics, 78(2):560–573. doi: 10.1111/biom.13452.
79. Zhu Y, Peruzzi M, Li C, and Dunson DB (2022). Radial neighbors for provably accurate scalable approximations of Gaussian processes. arXiv:2211.14692.
80. Zilber D and Katzfuss M (2020). Vecchia-Laplace approximations of generalized Gaussian processes for big non-Gaussian spatial data. arXiv:1906.07828.
