Neural networks for geospatial data

Wentao Zhan; Abhirup Datta

doi:10.1080/01621459.2024.2356293

. Author manuscript; available in PMC: 2026 Jan 1.

Published in final edited form as: J Am Stat Assoc. 2024 Jun 24;120(549):535–547. doi: 10.1080/01621459.2024.2356293

Neural networks for geospatial data

Wentao Zhan ¹, Abhirup Datta ¹

PMCID: PMC12007808 NIHMSID: NIHMS2003252 PMID: 40255678

Abstract

Analysis of geospatial data has traditionally been model-based, with a mean model, customarily specified as a linear regression on the covariates, and a Gaussian process covariance model, encoding the spatial dependence. While nonlinear machine learning algorithms like neural networks are increasingly being used for spatial analysis, current approaches depart from the model-based setup and cannot explicitly incorporate spatial covariance. We propose NN-GLS, embedding neural networks directly within the traditional Gaussian process (GP) geostatistical model to accommodate non-linear mean functions while retaining all other advantages of GP, like explicit modeling of the spatial covariance and predicting at new locations via kriging. In NN-GLS, estimation of the neural network parameters for the non-linear mean of the Gaussian Process explicitly accounts for the spatial covariance through use of the generalized least squares (GLS) loss, thus extending the linear case. We show that NN-GLS admits a representation as a special type of graph neural network (GNN). This connection facilitates the use of standard neural network computational techniques for irregular geospatial data, enabling novel and scalable mini-batching, backpropagation, and kriging schemes. We provide methodology to obtain uncertainty bounds for estimation and predictions from NN-GLS. Theoretically, we show that NN-GLS will be consistent for irregularly observed spatially correlated data processes. We also provide a finite sample concentration rate, which quantifies the need to accurately model the spatial covariance in neural networks for dependent data. To our knowledge, these are the first large-sample results for any neural network algorithm for irregular spatial data. We demonstrate the methodology through numerous simulations and an application to air pollution modeling. We develop a software implementation of NN-GLS in the Python package geospaNN.

Keywords: geostatistics, Gaussian process, neural networks, graph neural networks, machine learning, kriging, consistency

1. Introduction

Geostatistics, the analysis of geocoded data, is traditionally based on stochastic process models, which offer a coherent way to model data at any finite collection of locations while ensuring the generalizability of inference to the entire region. Gaussian processes (GP), with a mean function capturing effects of covariates and a covariance function encoding the spatial dependence, is a staple for geostatistical analysis (Stein, 1999; Banerjee et al., 2014; Cressie and Wikle, 2015). The mean function of a Gaussian process is often modeled as a linear regression on the covariates. The growing popularity and accessibility of machine learning algorithms such as random forest (Breiman, 2001), gradient boosting (Freund et al., 1999), and neural networks (Goodfellow et al., 2016), capable of modeling complex non-linear relationships has heralded a paradigm shift. Practitioners are increasingly shunning models with parametric assumptions like linearity in favor of machine-learning approaches that can capture non-linearity and high-order interactions. In particular, deep neural networks have seen considerable recent adoption and adaptation for geospatial data (see Wikle and Zammit-Mangion, 2023, for a comprehensive review).

Many of the machine-learning based regression approaches assume independent observations, implicit in the choice of an additive loss (e.g., ordinary least squares or OLS loss) as the objective function used in estimating the algorithm parameters. Explicit encoding of spatial dependency via a covariance function, as is common in process-based geospatial models, is challenging within these algorithms. Current renditions of neural networks for spatial data circumvent this by using spatial co-ordinates or some transformations (like distances or basis functions) as additional covariates (Gray et al., 2022; Chen et al., 2024; Wang et al., 2019). These added-spatial-features neural networks incorporate all spatial information into the mean, thus assuming that the error terms are independent. Consequently, they leave the GP model framework, abandoning its many advantages. They can suffer from the curse of dimensionality on account of adding many basis functions and cannot provide a separate estimate of the covariate effect $E (Y ∣ X)$ (see Section 2.2).

We propose a novel algorithm to estimate non-linear means with neural networks within the traditional Gaussian process models while explicitly accounting for the spatial dependence encoded in the GP covariance matrix. The core motivation comes from the extension of the ordinary least squares (OLS) to generalized least squares (GLS) for linear models with dependent errors. For correlated data, GLS is more efficient than OLS according to the Gauss-Markov theorem. For a neural network embedded as the mean of a GP, the GLS loss naturally arises as the negative log-likelihood. We thus refer to our algorithm as NN-GLS. We retain all advantages of the model-based framework, including separation of the covariate and spatial effects thereby allowing inference on both, parsimonious modeling of the spatial effect through the GP covariance function circumventing the need to create and curate spatial features, and seamlessly offering predictions at new locations via kriging. NN-GLS is compatible with any neural network architecture for the mean function and with any family of GP covariance functions.

We note that the philosophy of GLS has been adopted for non-linear spatial analysis before. Nandy et al. (2017) propose GAM-GLS, using GLS loss based on Gaussian Process covariance for estimating parameters of generalized additive models (GAM). GAM-GLS improves over GAM for non-linear additive function estimation by explicitly accounting for the spatial covariance. However, like GAM, GAM-GLS does not account for interactions among covariates. Recently, tree-based machine learning algorithms like Random Forests (RF-GLS, Saha et al., 2023) and boosted trees (GP-boost, Sigrist, 2022; Iranzad et al., 2022) has been extended to use GLS loss. Forest and tree estimators use a brute force search to iteratively grow the regression trees, requiring multiple evaluations of the GLS loss within each step. This severely reduces the scalability of these approaches. RF-GLS also requires pre-estimation of the spatial parameters, which are then kept fixed during the random forest estimation.

NN-GLS avoids both the issues of lack of scalability and prefixed spatial covariance parameters of existing GLS-based non-linear methods by offering a representation of the algorithm as a special type of graph neural network (GNN). We show that NN-GLS using any neural network architecture for the mean and a Nearest Neighbor Gaussian Process (NNGP, Datta et al., 2016) for the covariance is a GNN with two graph-convolution layers based on the nearest-neighbor graph where the graph-convolution weights are naturally derived as kriging weights. Leveraging this representation of the model as a GNN, we can exploit the various computing techniques used to expedite neural networks. This includes using mini-batching or stochastic gradients to run each iteration of the estimation on only a subset of the data and representing kriging predictions using GNN convolution and deconvolution. The spatial covariance parameters are now simply weights parameters in the network and are updated throughout the training. Finally, we provide a spatial bootstrap-based approach to construct interval estimates for the non-linear mean function.

We provide a comprehensive theoretical study of neural networks when the observations have spatially correlated errors arising from a stochastic process. Our theoretical contribution distinguishes itself from the current body of theoretical work on neural networks in two main ways. The existing asymptotic theory of neural networks (see Section S3 for a brief review) does not consider spatial dependence in either the error process generating the data (often being restricted to i.i.d. errors) or introduce any procedural modifications to the neural network algorithm to explicitly encode spatial covariance.

Our theory of NN-GLS accounts for spatial dependence in both the data generation and the estimation algorithm. We prove general results on the existence and the consistency of NN-GLS which subsumes special cases of interest including irregular spatial data designs, popular GP models like the Matérn Gaussian process, and using NNGP for the GLS loss. We also provide finite-sample error rates for the mean function estimation from NN-GLS. The error bound provides novel insights on the necessity of modeling the spatial covariance in neural networks, showing that the use of OLS loss in NN, ignoring spatial dependence, will lead to larger error rates. To our knowledge, these are the first theoretical results for neural networks for irregularly sampled, spatially dependent data.

The rest of the manuscript is organized as follows. In Section 2, we review process-based geostatistical models and existing spatial neural networks. Section 3 describes the idea of combining the two perspectives. Section 4 formally proposes the algorithm NN-GLS, depicts its connection with graph neural networks (GNN), and provides scalable estimation, prediction and inference algorithms. Section 5 presents the theoretical results. Section 6 and 7 respectively demonstrate the performance of NN-GLS in simulated and real data.

2. Preliminaries

2.1. Process-based geostatistical modeling

Consider spatial data collected at locations $s_{i}$ , $i = 1, \dots, n$ , comprising a covariate vector $X_{i} ≔ X (s_{i}) \in R^{d}$ and a response $Y_{i} ≔ Y (s_{i}) \in R$ . Defining $Y = {(Y_{1}, \dots, Y_{n})}^{'}$ and the $n \times d$ covariate matrix $X$ similarly, the spatial linear model is given by

Y \sim 𝒩 (X β, Σ (θ))

(1)

where $Σ ≔ Σ (θ)$ is a $n \times n$ covariance matrix. A central objective in spatial analysis is to extrapolate inference beyond just the data locations to the entire continuous spatial domain. Stochastic processes are natural candidates for such domain-wide models. In particular, (1) can be viewed as a finite sample realization of a Gaussian process (GP)

Y (s) = X {(s)}^{T} β + ϵ (s), ϵ (\cdot) \sim G P (0, Σ (\cdot, \cdot)) .

(2)

where $ϵ (s)$ is a zero-mean Gaussian process modeling the spatial dependence via the covariance function $Σ (\cdot, \cdot)$ such that $Σ (s_{i}, s_{j}) = C o v (Y (s_{i}), Y (s_{j}))$ . often, $ϵ (\cdot)$ can be decomposed into a latent spatial GP and a non-spatial (random noise) process. This results in the variance decomposition $Σ = C + τ^{2} I$ where $C$ is the covariance matrix corresponding to the latent spatial GP and $τ^{2}$ is the variance for the noise process. One can impose plausible modeling assumptions on the nature of spatial dependence like stationarity ( $C (s_{i}, s_{j}) = C (s_{i} - s_{j})$ ) or isotropy ( $C (s_{i}, s_{j}) = C (‖ s_{i} - s_{j} ‖)$ ) to induce parsimony, thereby requiring a very-low dimensional parameter $θ$ to specify the covariance function.

Nearest Neighbor Gaussian Processes (NNGP, Datta et al., 2016) are used to model the spatial error process when the number of spatial locations $n$ is large as evaluation of the traditional GP likelihood (1), requiring $O (n^{3})$ computations, becomes infeasible. NNGP provides a sparse approximation $Σ^{- 1}$ to the dense full GP precision matrix $Σ^{- 1}$ . This is constructed using a directed acyclic graph (DAG) based on pairwise distances among the $n$ data locations, such that each node (location) has almost $m ≪ n$ directed (nearest) neighbors. Letting $N (i)$ be the set of neighbors of location $s_{i}$ in the DAG, we have

Σ^{- 1} = {(I - B)}^{T} F^{- 1} (I - B), where B_{i, N (i)} = Σ (i, N (i)) Σ {(N (i), N (i))}^{- 1}, B_{i j} = 0 elsewhere, and F_{i i} = Σ_{i i} - Σ (i, N (i)) Σ {(N (i), N (i))}^{- 1} Σ (N (i), i) .

(3)

Here $B$ is a strictly lower triangular matrix and $F$ is a diagonal matrix. NNGP precision matrices only require inversion of $n$ small matrices of size $m \times m$ . As $m ≪ n$ , evaluation of NNGP likelihood requires total $O (n)$ time and storage.

2.2. Neural networks

Artificial neural networks (ANN) or, simply, neural networks (NN) are widely used to model non-parametric regression $f$ where $E (Y_{i}) = f (X_{i})$ . Mathematically, an $L$ -layer feed-forward neural networks or multi-layer perceptron (MLP) can be described as,

A_{i}^{(0)} = X_{i}, Z_{i}^{(l)} = W_{(l)}^{T} A_{i}^{(l - 1)}, A_{i}^{(l)} = g_{l} (Z_{i}^{(l)}), l = 1, \dots, L O_{i} = W_{(L + 1)}^{T} A_{i}^{(L)}, f (X_{i}) = O_{i}, i = 1, \dots, n

(4)

where for layer $l$ , $A^{(l)}$ represents the $d_{l}$ nodes, $W_{(l)}$ are the weight matrix, $Z_{i}^{(l)}$ ’s are a linear combination of nodes with unknown weights $W_{(l)}$ ’s, and $g_{(l)} (Z_{i}^{(l)})$ denotes the known non-linear activation (link) functions $g_{l} (\cdot)$ (e.g., sigmoid function, ReLU function) applied to each component of $Z_{i}^{(l)}$ . The final layer $O_{i}$ is called the output layer and gives the modeled mean of the response, i.e., $O_{i} = f (X_{i}) = E (Y_{i})$ . For regression, the unknown weights are estimated using backpropagation based on the ordinary least squares (OLS) loss

\sum_{i = 1}^{n} {(Y_{i} - f (X_{i}))}^{2} .

(5)

Estimation is expedited by mini-batching where the data are split into smaller and disjoint mini-batches and at each iteration the loss (5) is approximated by restricting to one of the mini-batches, and cycling among the mini-batches over iterations.

Current extensions of neural networks for spatial data have mostly adopted the added-spatiai-features strategy, adding spatial features like the (geographical coordinates, spatial distance or basis functions) as extra covariates in the neural network (Wang et al., 2019; Chen et al., 2024). Formally, they model $Y_{i} = g (X, B (s_{i}))$ where $B (s)$ denotes a set of spatial basis functions, and $g$ denotes a neural network on the joint space of the covariates $X$ and the basis functions $B (s)$ . These methods thus depart the GP model framework, and do not explicitly model spatial covariance as all the spatial structure is put into the mean, implicitly assuming that there is no residual spatial dependence. They cannot provide a separate estimate of the covariate effect $E (Y_{i} ∣ X_{i})$ as it models $Y_{i}$ jointly as a function of $X_{i}$ and $B (s_{i})$ . Also, the number of added basis functions carries a tradeoff, with more basis functions improving accuracy at the cost of increasing parameter dimensionality.

3. Neural networks for Gaussian process models

The two paradigms reviewed in Section 2 are complimentary in their scope. The popularity of the geospatial linear models (Section 2.1) is owed to their simplicity, interpretability, and parsimony – separating the covariate effect and the spatial effects, modeling the former through a linear mean, and the latter parsimoniously through the GP covariance, specified typically using only 2-3 interpretable parameters by encoding stationarity or isotropy. However, it relies on the strong assumption of a linear covariate effect. Neural networks can estimate arbitrary non-linear covariate effects. However, implicit in the usage of the OLS loss (5) for NN is the assumption that the data units are independent. This is violated for spatial data, where the error process is a dependent stochastic process as in (2). To our knowledge, there has been no previous work in directly incorporating spatial covariances in the neural network estimation process, with existing spatial neural networks mostly adopting the added-spatial-features strategy reviewed in Section 2.2.

We bridge the paradigms of traditional geospatial modeling and neural networks by embedding neural networks directly into the GP model, allowing the mean of the GP to be non-linear and modeling it with NN. We propose estimating the NN parameters using a novel GLS loss with the GP covariance matrix, which arises naturally from the log-likelihood of the GP model. We retain all advantages of GP based modeling framework, including separating the covariate and spatial effects into the mean and covariance structures, and obtaining predictions at new locations simply through kriging. In Section 4, we show that when using Nearest neighbor Gaussian processes for the spatial covariance, one can make a novel and principled connection between graph neural networks and geospatial models for irregular data, facilitating a scalable algorithm for estimation and prediction.

3.1. NN-GLS: Neural networks using GLS loss

We extend (2) to

Y (s) = f (X (s)) + ϵ (s); ϵ (\cdot) \sim G P (0, Σ (\cdot, \cdot)) .

(6)

Estimating $f$ in (6) using neural networks (4) needs to account for the spatial dependence modeled via the GP covariance. Using the OLS loss (5) ignores this covariance. We now extend NN to explicitly accommodate the spatial covariance in its estimation. Let $f (X) = {(f (X_{1}), f (X_{2}), \dots, f (X_{n}))}^{T}$ . Then the data likelihood for the model (6) is

Y \sim 𝒩 (f (X), Σ) .

(7)

This is a non-linear generalization of the spatial linear model (1). For parameter estimation in linear models like (1) for dependent data, the OLS loss is replaced by a generalized least squares (GLS) loss using the covariance matrix $Σ$ , as it is more efficient according to the Gauss-Markov theorem. Similarly, to estimate $f$ using NN, we propose using the GLS loss

ℒ_{n} (f) = \frac{1}{n} {(Y - f (X))}^{T} Q (Y - f (X)),

(8)

which accounts for the spatial dependency via the working precision matrix $Q$ , which equals $Σ^{- 1}$ or, more practically, an estimate of it.

We refer to the neural network estimation using the GLS loss (8) as NN-GLS. Conceptually, generalizing NN to NN-GLS is well-principled as minimizing the GLS loss (8) with $Q = Σ^{- 1}$ is equivalent to obtaining a maximum likelihood estimate of $f$ in (7). In practice, however, for spatial dependence modeled using GP, the GLS loss ushers in multiple computational issues for both mini-batching and backpropagation, techniques fundamental to the success of NN. As the GLS loss (8) is not additive over the data units, minibatching cannot be deployed as for the OLS loss, and back-propagation will involve computing an inverse of the dense $n \times n$ matrix $Σ$ which requires $O (n^{2})$ storage and $O (n^{3})$ time for each iteration. These computing needs are infeasible for even moderate $n$ . Next, we develop an algorithm NN-GLS with a specific class of GLS loss that mitigates these issues and offers a pragmatic approach to using NN for GP models.

4. NN-GLS as Graph Neural Network

We offer a representation of NN-GLS as a special graph neural network (GNN). This connection allows use of OLS loss with transformed (graph-convoluted) data, leading to scalable mini-batching and backpropagation algorithms for NN-GLS. We propose choosing $Q$ as the precision matrix from a Nearest Neighbor Gaussian Process (see Section 2.1), i.e., we optimize the objective function (8) with $Q = {Sigma}^{- 1}$ , as provided in (3). Basis functions derived from NNGP have been used as added-spatial-features in neural networks (Wang et al., 2019). This differs from our approach of using NNGP to directly model the spatial covariance, which is akin to the practice in spatial linear models.

A GLS loss can be viewed as an OLS loss for the decorrelated response $Y^{*} = Q^{\frac{1}{2}} Y$ , where $Q^{\frac{1}{2}}$ is the Cholesky factor of $Q = Q^{\frac{T}{2}} Q^{\frac{1}{2}}$ . Hence, decorrelation is simply a linear operation. A convenience of choosing $Q = Σ$ , the NNGP precision matrix, is that decorrelation becomes a convolution in the nearest neighbor DAG. To elucidate, note that $B_{i, N (i)}$ defined in (3) denotes the kriging weights for predicting $Y_{i}$ , based on its directed nearest neighbors $Y_{N (i)}$ using a GP with covariance $Σ (\cdot, \cdot)$ . Similarly, $F_{i i}$ in (3) is the corresponding nearest neighbor kriging variance. Letting $N^{*} [i] = N (i) \cup {i}$ denote the graph neighborhood for the $i^{t h}$ node and defining weights

v_{i}^{T} = \frac{1}{\sqrt{F_{i i}}} (1, - B_{i, N (i)}),

(9)

we can write $Y_{i}^{*} = v_{i}^{T} Y_{N^{*} [i]}$ . In this paper, index sets $N (i)$ or $N^{*} [i]$ are used in the subscript to subset a matrix or a vector. Thus, the decorrelated responses $Y_{i}^{*}$ are simply convolution over the DAG used in NNGP with the graph convolution weights $v_{i}$ defined using kriging. Similarly, one can define the decorrelated output layer $O_{i}^{*} = v_{i}^{T} O_{N^{*} [i]}$ using the same graph convolution, where $O_{i}$ is the output layer of the neural network $f$ (see (4)).

The decorrelation step makes NN-GLS a special type of graph neural network (GNN) as depicted in Figure 1. In GNN, typically, the input observations are represented on a graph and the locality information is aggregated using convolution layers based on the graph structure (graph convolution). For NN-GLS, both the inputs $X_{i}$ and the responses $Y_{i}$ are graph-valued objects as they both correspond to the locations $s_{i}$ , which are the nodes of the nearest neighbor DAG. First, the input $X_{i}$ is passed through the feed-forward NN (or multi-layer perceptron) to produce the respective output layer of $O_{i} = f (X_{i})$ . This is a within-node operation, and any architecture can be used (number of layers, number of nodes within each layer, sparsity of connections, choice of activation functions). Subsequently, the output layer $O_{i} ’ s$ from the MLP is passed through an additional graph-convolution to create the decorrelated output layer of $O_{i}^{*} ’ s$ using the weights $v_{i}$ (9). This layer is matched, using the OLS loss, to the decorrelated response layer of $Y_{i}^{*} ’ s$ , created from the $Y_{i} ’ s$ using the same graph convolution. So the objective function can be expressed as

\sum_{i} {(Y_{i}^{*} - O_{i}^{*})}^{2} where Y_{i}^{*} = v_{i}^{T} Y_{N^{*} [i]} and O_{i}^{*} = v_{i}^{T} O_{N^{*} [i]} .

(10)

Fig. 1 — NN-GLS as a graph neural network with two graph convolution layers

Thus fitting a GLS loss to an NN is simply fitting an OLS loss to a new NN with two additional decorrelation layers at the end of the MLP. From the form of $v_{i}$ in (9) and of $B_{i, N (i)}$ and $F_{i i}$ in (3), it is clear that $v_{i} ’ s$ can be calculated using matrices of dimension at −1 most $m$ , The NNGP precision matrix $Σ^{- 1}$ in (3) or the covariance matrix $Σ$ is never actually computed. Thus, the loss function (10) is evaluated without any large matrix computation.

To summarize, there are two ingredients of NN-GLS: a feed-forward NN or MLP (using any architecture) for intra-node operations; and a sparse DAG among the locations for incorporation of spatial correlation via inter-node graph convolutions. Information about the mutual distances among the irregular set of locations is naturally incorporated in the kriging-based convolution weights. We now discuss how this formulation of NN-GLS as a GNN with OLS loss helps leverage the traditional strategies to scale computing for NN.

4.1. Mini-batching

Leveraging the additivity of the OLS loss (10), we can write $ℒ_{n} = \sum_{b = 1}^{B} ℒ_{b, n}$ where $ℒ_{b, n} = \sum_{i \in S_{b}} {(Y_{i}^{*} - O_{i}^{*})}^{2}, S_{1}, \dots, S_{B}$ being a partition of the data-locations each of size $K$ . The $Y_{i}^{*} ’ s$ are uncorrelated and identically distributed (exactly under NNGP distribution, and approximately if the true distribution is a full GP), so the loss $L_{b, n}$ corresponding to the mini-batch $S_{b}$ are approximately i.i.d. for $b = 1, \dots, B$ . Hence, parameter estimation via mini-batching can proceed like the i.i.d. case. The only additional computation we incur is during the graph convolution, as obtaining all $O_{i}^{*}$ for the mini-batch $S_{b}$ involves calculating $O_{i} ’ s$ for the neighbors of all units $i$ included in $S_{b}$ .

4.2. Back-propagation

Gradient descent or back-propagation steps for NN-GLS can be obtained without any large matrix inversions. We provide the back-propagation equations for a single-layer network with $O_{i} = A_{i}^{T} β, δ_{i} = - 2 (Y_{i} - O_{i})$ , where $β$ and $A_{i} = g (Z_{i})$ are $d_{1}$ dimensional, $d_{1}$ being the number of hidden nodes, $g$ is the known activation (link) function. Here $Z_{i} = W^{T} X_{i}$ is the hidden layer created using $d \times d_{1}$ weight-matrix $W$ and $d \times 1$ input vector $X_{i}$ . $O_{i}$ , $δ_{i}$ are concatenated into vectors $O$ , $δ$ , and $X_{i}$ , $Z_{i}$ , $A_{i}$ are concatenated into matrices $X$ , $Z$ , $A$ . With $v_{i}$ defined in (9), we introduce scalars $O_{i}^{*} = v_{i}^{T} O_{N^{*} [i]} = v_{i}^{T} A_{\cdot, N^{*} [i]}^{T} β, a_{i r}^{*} = v_{i}^{T} A_{r, N^{*} [i]}^{T}$ and $δ_{i}^{*} = v_{i}^{T} δ_{N^{*} [i]}$ . Using the loss function (10) we derive the following customized back-propagation updates:

β_{r}^{(t + 1)} = β_{r}^{(t)} - γ_{t} \sum_{i \in S_{b (t)}} δ_{i}^{*} a_{i r}^{*} w_{r j}^{(t + 1)} = w_{r j}^{(t)} - γ_{t} β_{r} \sum_{i \in S_{b (t)}} δ_{i}^{*} (v_{i}^{T} (g^{'} (Z_{N^{*} [i], r}) ⊙ X_{N^{*} [i], j})), θ^{(t + 1)} = θ^{(t)} + \frac{γ_{t}}{2} \sum_{i \in S_{b (t)}} δ_{i}^{*} ({(\frac{\partial v_{i}}{\partial θ})}^{T} δ_{N^{*} [i]}) .

(11)

where $θ$ are parameters for the graph convolution weights $v_{i}$ . Here $γ_{t}$ is the learning rate, and $S_{b (t)}$ is the mini-batch for the $i^{t h}$ iteration and $g^{'}$ is the derivative of $g$ . Similar equations can be established for networks with more than one-layer, although the derivations will be tedious. Instead, for multi-layer networks, scalable gradient descent steps for NN-GLS can be conducted using off-the-shelf software. One can evaluate the mini-batch loss $ℒ_{b, n}$ scalably, as the $v_{i} ’ s$ can be computed without any large matrix computation, and obtain scalable gradient descent updates using numerical, automatic, or symbolic differentiation.

Note that in this GNN representation, the convolution weights $v_{i}$ are parametrized by spatial covariance parameter $θ$ (an example of $θ$ is the one for Matérn covariance family (Definition S4.2) where $θ = (σ^{2}, ϕ, ν)$ ). Hence, the updates for $θ$ can be just absorbed as a back-propagation step, as shown in (11). Alternatively, the spatial parameters $θ$ can also be updated, given the current estimate of $f$ , by writing $Y_{i}^{*} (θ) ≔ Y_{i}^{*}$ and $O_{i}^{*} (θ) ≔ O_{i}^{*}$ to note the dependence on $θ$ and maximizing the NNGP log-likelihood

\sum_{i} ({(Y_{i}^{*} (θ) - O_{i}^{*} (θ))}^{2} + \log F_{i i} (θ)) .

(12)

We only use full optimization of (12) to update the spatial parameters $θ$ , replacing its gradient descent update. The network parameters are always updated using gradient descent.

4.3. Kriging

Subsequent to estimating $\hat{f} (\cdot)$ , NN-GLS leverages the GP model-based formulation to provide both point and interval predictions at a new location simply via kriging. Exploiting the GNN representation of NN-GLS with a NNGP working covariance matrix, kriging can be conducted entirely with the neural architecture, as we show below.

Given a new location $s_{0}$ and covariates $X_{0} ≔ X (s_{0})$ , first $X_{0}$ is passed through the trained feed-forward part (MLP) of the GNN to obtain the output $O_{0} = \hat{f} (X_{0})$ (see Figure 1). Next, the new location $s_{0}$ is added as a new node to the nearest-neighbor DAG. Let $N (0)$ be its set of $m$ neighbors of $s_{0}$ on the DAG, $N^{*} [0] = {s_{0}} \cup N (0)$ and define the graph weights $v_{0}$ similar to (9). Then, using the graph convolution step, we obtain the decorrelated output $O_{0}^{*} = v_{0}^{T} O_{N^{*} [0]}$ . As the decorrelated output layer $O^{*}$ is the model for the decorrelated response layer $Y^{*}$ in the GNN, we have ${\hat{Y}}_{0}^{*} = O_{0}^{*}$ . Finally, as $Y_{0}^{*} = v_{0}^{T} Y_{N^{*} [0]}$ is the graph-convoluted version of $Y_{0}$ , we need to deconvolute ${\hat{Y}}_{0}^{*}$ over the DAG to obtain ${\hat{Y}}_{0}$ . This leads to the final prediction equation

{\hat{Y}}_{0} = \sqrt{F_{00}} {\hat{Y}}_{0}^{*} + B_{0, N (0)}^{T} Y_{N (0)} .

(13)

It is easy to verify that the prediction (13) is exactly the same as the $m$ -nearest neighbor kriging predictor for the spatial non-linear model (7), i.e.,

{\hat{Y}}_{0} = \hat{f} (X_{0}) + Σ (s_{0}, N (0)) {Σ (N (0), N (0))}^{- 1} (Y_{N (0)} - f_{N (0)}) .

(14)

Additionally, prediction variance is simply $σ_{0}^{2} = F_{00}$ . As NN-GLS is embedded within the Gaussian Process framework, predictive distributions conditional on the data and parameters are Gaussian. So the prediction interval (PI) can be obtained as $[\hat{f} (X_{0}) + Z_{2.5} σ_{0}, \hat{f} (X_{0}) + Z_{97.5} σ_{0}]$ , where $Z_{q}$ is the $q$ -quantile of a standard normal distribution.

Thus, the GNN connection for NN-GLS offers a simple and coherent way to obtain kriging predictions entirely within the neural architecture. We summarize the estimation and prediction from NN-GLS in Algorithm S1 in Section S1 of the Supplement. Section S2 presents a spatial bootstrap for obtaining pointwise confidence intervals of the mean.

5. Theory

We review some relevant theory of neural network regression in Section S3. Our theoretical contributions have two main differences from the existing theory. First, the overwhelming majority of the theoretical developments focus on the setting of i.i.d. data units. To our knowledge, asymptotic theory for neural networks under spatial dependence among observations from an irregular set of locations has not been developed.

Second, the existing theory has almost exclusively considered neural network methods like the vanilla NN (using the OLS loss) that do not explicitly accommodate spatial covariance among the data units. We provide general results on existence (Theorem 1) and asymptotic consistency (Theorem 2) of NN-GLS that explicitly encodes the spatial covariance via the GLS loss. In Propositions 1 and 2, we show that consistency of NN-GLS holds for spatial data generated from the popular Matérn GP under arbitrarily irregular designs. We also derive the finite-sample error rate of NN-GLS (Theorem 3) which shows how a poor choice of the working covariance matrix in the GLS loss will lead to large error rates, highlighting the necessity to accurately model spatial covariance in neural networks.

5.1. Notations and Assumptions

Let $R$ and $N$ denote the set of real numbers and natural numbers respectively. ${‖ \cdot ‖}_{p}$ denotes the $ℓ_{p}$ norm for vectors or matrices, $0 < p \leq \infty$ . Given the covariates $X_{1}, X_{2}, \dots, X_{n}$ , for any function $f$ , we define the norm ${‖ f ‖}_{n}^{2} = \frac{1}{n} \sum_{i = 1}^{n} f^{2} (X_{i})$ . Given a $n \times n$ matrix $A, λ (A)$ denotes its eigenspace, $λ_{\max} = sup {λ (A)}$ , and $λ_{min} = inf {λ (A)}$ . A sequence of numbers ${a_{n}}_{n \in N}$ is $O (b_{n}) (o (b_{n}))$ if the sequence ${∣ a_{n} ∕ b_{n} ∣}_{n \in N}$ is bounded from above (goes to zero) as $n \to \infty$ . Random variables (distributions) $X \sim Y$ means $X$ and $Y$ have the same distribution. We first specify the assumption of the data generation process.

Assumption 1 (Data generation process). The data $Y_{i} ≔ Y (s_{i}), i = 1, \dots, n$ is generated from a GP with a non-linear mean, i.e., $Y_{i} = f_{0} (X_{i}) + ϵ_{i}$ , where $f_{0} (\cdot)$ is a continuous function, ${X_{i}}_{i = 1, \dots, n}$ are fixed covariate vectors in a compact subset in $R^{d}$ , and the error process ${ϵ_{i}}$ is a GP such that the maximum (minimum) eigenvalue of the covariance matrix $Σ = C o v (Y)$ is uniformly (in $n$ ) upper-bounded (lower-bounded) by $Ψ_{h i g h} (Ψ_{l o w})$ .

Assumption 1 imposes minimal restrictions on the data generation process. The mean $f_{0}$ is allowed to be any continuous function. The restriction on the eigenvalues of the GP covariance matrix is tied to the spatial design. We show in Propositions 1 and 2 that this is satisfied for common GP covariance choices for any irregular set of locations in $R^{2}$ under the increasing domain asymptotics where locations are separated by a minimum distance.

We next state assumptions on the analysis model, i.e., the neural network family and the GLS working precision matrix. We consider a one-layer neural network class:

ℱ_{0} ≔ {α_{0} + \sum_{i = 1}^{n} α_{i} σ (W_{i}^{T} X + w_{0 i}), W_{i} \in R^{d \times p}, w_{0 i} \in R^{d}, α_{i} \in R} .

(15)

This formulation is equivalent to setting $L = 1$ , $g_{1} (\cdot)$ as the sigmoid function $σ (\cdot)$ and $g_{2} (\cdot)$ as the identity function in (4). It is easy to see that $ℱ_{0}$ can control the approximation error, as one-layer NN are universal approximators. However, this class of functions $ℱ_{0}$ can be too rich to control the estimation error. A common strategy to circumvent this is to construct a sequence of increasing function classes, also known as seive, to approximate $ℱ_{0}$ , i.e.,

ℱ_{1} \subseteq ℱ_{2} \subseteq \dots \subseteq ℱ_{n} \subseteq ℱ_{n + 1} \subseteq \dots \subseteq ℱ_{0} .

With careful tradeoff of the complexity of the function classes, it’s possible to control the estimation error (in terms of the covering number of $ℱ_{n}$ ) using some suitable uniform-law-of-large-number (ULLN) while still being able to keep the approximation error in check. Following Shen et al. (2023) we consider the seive given below.

Assumption 2 (Function class). The mean function $f$ is modeled to be in the NN class

ℱ_{n} = {α_{0} + \sum_{j = 1}^{r_{n}} α_{j} σ (γ_{j}^{T} X + γ_{0, j}) : γ_{j} \in R^{d}, α_{j}, γ_{0, j} \in R, \sum_{j = 0}^{r_{n}} ∣ α_{j} ∣ \leq V_{n} for some V_{n} > \frac{1}{L_{l i p}}, \max_{1 \leq j \leq r_{n}} \sum_{i, j = 0}^{d} ∣ γ_{i, j} ∣ \leq M_{n} for some M_{n} > 0},

(16)

where $r_{n}, V_{n}, M_{n} \to \infty$ as $n \to \infty$ and satisfies the scaling

r_{n} V_{n}^{2} \log V_{n} r_{n} = o (n) .

(17)

Here the activation function $σ (\cdot)$ is a Lipschitz function on $R$ with range $[- r_{b}, r_{b}]$ and Lipschitz constant $L_{l i p}$ . (For the sigmoid function, $r_{b} = 1$ and $L_{l i p} = 1 ∕ 4$ ).

Hornik et al. (1989) has shown that $\cup_{n} ℱ_{n}$ is dense in the continuous function class $ℱ_{0}$ which will control the approximation error. The estimation error will depend on the covering number for this class which can be controlled under the scaling rate (17).

Finally, to guarantee the regularity of the GLS loss (8) used for estimating the NN function $f$ , we require conditions on the working precision matrix $Q$ . Instead of directly imposing these conditions on $Q$ , we look at the discrepancy matrix

E = Σ^{\frac{T}{2}} Q Σ^{\frac{1}{2}}

(18)

which measures the discrepancy between the true covariance matrix $Σ$ and the working covariance matrix $Q^{- 1}$ . This is because we will see later in Theorem 3 that the finite sample error rates of NN-GLS will depend on $Q$ through the spectral interval of $E$ . First, without loss of generality, we can assume $λ_{min} (E) \leq 1 \leq λ_{\max} (E)$ . This is because the optimization in NN-GLS only depends on $Q$ up to a scalar multiplier. The following assumption states that additionally the spectral interval of $E$ needs to be uniformly bounded in $n$ .

Assumption 3 (Spectral interval of the discrepancy matrix.). For all $n$ , all eigenvalues $λ$ of $E = Σ^{\frac{T}{2}} Q Σ^{\frac{1}{2}}$ lie in ( $Λ_{l o w}, Λ_{h i g h}$ ) for universal constants $0 < Λ_{l o w} \leq 1 \leq Λ_{h i g h} < \infty$ .

Uniform upper and lower bounds of $E$ are used for ensuring the continuity of the GLS loss function and the consistency of the loss function (Lemma S2) using empirical process results. We show in Propositions 1 and 2 how this Assumption is satisfied when $Q$ is either the true GP precision matrix $Σ^{- 1}$ , its NNGP approximation. Of course, $Q = I$ , which can be viewed as NNGP precision matrix with $m = 0$ , trivially satisfies Assumption 3 if Assumption 1 holds as $E$ is simply $Σ$ .

5.2. Main results

We provide general results on the existence and consistency of neural network estimators minimizing a GLS loss for dependent data. The expected value of the GLS loss (8) is:

L_{n} (f) = E [ℒ_{n} (f)] = E [\frac{1}{n} {(y - f (x))}^{T} Q (y - f (x))] = \frac{1}{n} E [ϵ^{T} Q ϵ] + \frac{1}{n} {(f_{0} (x) - f (x))}^{T} Q (f_{0} (x) - f (x))

It is evident from above that $f_{0}$ naturally minimizes $L_{n} (f)$ , while NN-GLS tries to minimize $ℒ_{n} (f)$ . We first show that such a minimizer exists in the seive class $ℱ_{n}$ .

Theorem 1 (Existence of seive estimator). Given data $(Y_{i}, X_{i}, s_{i}), i = 1, \dots, n$ generated from (6) under Assumption 1, and a working precision matrix $Q$ satisfying Assumption 3, with the function classes $ℱ_{n}$ defined in (16), there exists a seive estimator ${\hat{f}}_{n}$ such that

{\hat{f}}_{n} = argmin {ℒ_{n} (f) : f \in ℱ_{n}} .

(19)

All proofs are in Section S4 of Supplementary materials. The existence result ensures that a seive estimator in the class of neural networks that minimizes the GLS loss is well-defined. It is then natural to study its asymptotic consistency, as we do in the next result.

Theorem 2 (Consistency). Under Assumptions 1, 2, and 3, the NN-GLS estimate ${\hat{f}}_{n}$ (19) minimizing the GLS loss $ℒ_{n} (f)$ in (8) is consistent in the sense ${‖ {\hat{f}}_{n} - f_{0} ‖}_{n} \overset{p}{\to} 0$ .

To our knowledge, this is the first result on the consistency of neural networks for estimating the mean function of a spatially dependent process observed at an irregular set of locations. We do not impose any assumption on the true mean function $f_{0}$ beyond continuity and rely on very mild assumptions on the function class, and the covariance matrices of the data generation and analysis models. In Section 5.3 we show that these assumptions are satisfied for typical GP covariances and irregular spatial data designs. Also, note that this general result does not restrict the nature of the dependence to be spatial. Hence, while spatial applications are the focus of this manuscript, Theorem 2 can be used to establish consistency of neural networks for time-series, spatio-temporal, network, or other types of structured dependence. Section S6.7 presents a simulation study empirically demonstrating this consistency.

Theorem 2 generalizes the analogous consistency result of Shen et al. (2023) from i.i.d. data and OLS loss to dependent error processes and the use of a GLS loss. Consequently, the proof addresses challenges that do not arise in the i.i.d. case. The spatial dependency makes the standard Rademacher randomization fail and prevents using the standard Uniform law of large number (ULLN) result. We overcome this via construction of a different normed functional space equipped with a new Orlicz norm to adjust for data dependence and use of the GLS loss. This enables applying a ULLN for our dependent setting by showing that the empirical process is well-behaved with respect to this Orlicz norm.

5.3. Matérn Gaussian processes

We now establish consistency of NN-GLS for common GP covariance families, spatial data designs, and choice of working precision matrices. The main task in applying the general consistency result (Theorem 2) for these specific settings is verifying compliance to the regularity assumptions – i.e., the spectral bounds on the true Gaussian process covariance (Assumption 1) and on the working precision matrix (Assumptions 3).

We provide consistency results of NN-GLS for spatial data generated from the Matérn Gaussian process (see S4.2). This is the predominant choice of covariance family in geostatistical models due to the interpretability of its parameters with a marginal spatial variance $σ^{2}$ , decay of spatial correlation $ϕ$ , smoothness $v$ of the underlying process (Stein, 1999). Our first result considers data generated from a class of GP that contains the Matérn family and where the working precision matrix is the true GP precision matrix.

Proposition 1. Consider data generated from a spatial process $Y_{i} = f_{0} (X_{i}) + ϵ (s_{i})$ at locations $s_{1}, \dots, s_{n}$ in $R^{2}$ , where $f_{0}$ is continuous, $ϵ (s_{i})$ is a Gaussian process with covariance function $Σ (s_{i}, s_{j}) = C (s_{i}, s_{j}) + τ^{2} I (s_{i} = s_{j})$ , and $C (s_{i}, s_{j}) = C (‖ s_{i} - s_{j} ‖)$ is a covariance of a stationary spatial GP. Suppose the data locations are separated by a minimum distance $h > 0$ , i.e., $‖ s_{i} - s_{j} ‖ \geq h, \forall i \neq j$ . Let $Σ = C + τ^{2} I$ denote the covariance matrix of $Y = {(Y (s_{1}), \dots, Y (s_{n}))}^{T}$ , where $C = (C (s_{i} - s_{j}))$ . Then NN-GLS using $Q = Σ_{n}^{- 1}$ is consistent, i.e., ${‖ {\hat{f}}_{n} - f_{0} ‖}_{n} \overset{p}{\to} 0$ , if either (a) $τ^{2} > 0$ and $C (u) = o (u^{- (2 + κ)})$ for some $κ > 0$ , or (b) $τ^{2} \geq 0$ and $C (u)$ is a Matérn covariance function with parameters $θ = (σ^{2}, ϕ, ν)$ .

The decay rate $C (u) = o (u^{- (2 + κ)})$ is satisfied by the Matérn family (Abramowitz and Stegun, 1948), so Proposition 1 proves the consistency of NN-GLS for Matérn GP, both with and without a nugget. The result holds for any irregular spatial design in $R^{2}$ being separated by a minimum distance. As the sample size $n$ grows, this is equivalent to considering the increasing domain paradigm that is commonly adopted as Matérn GP parameters are not identifiable if data are collected densely in a fixed spatial domain (Zhang, 2004).

Propositions 1 describes the case where true covariance structure is known. In that case, it’s possible to directly use the inverse of the covariance matrix as the working precision matrix in the GLS loss. However, this is often infeasible for multiple reasons. First, the true covariance parameters are usually unknown, and the working covariance matrix will typically use different (estimated) parameter values. Computationally, GLS loss using the full Matérn GP covariance matrix will require $O (n^{3})$ time and $O (n^{2})$ storage, which is not available even for moderate $n$ . The next proposition introduces a more pragmatic result proving the consistency of NN-GLS for data generated from Matérn GP but when using a working precision matrix derived from NNGP (as described in Section 4) and with parameter values different from the truth.

Proposition 2. Consider data generated as in Proposition 1 from a Matérn GP with parameters $θ_{0} = (σ_{0}^{2}, ϕ_{0}, ν_{0}, τ_{0}^{2})$ at locations separated by a minimum distance $h > 0$ . Let $Q$ be the NNGP precision matrix based on a Matérn covariance with parameters $θ = (σ^{2}, ϕ, ν, τ^{2})$ , i.e. $C (h ∣ σ^{2}, ϕ, ν) + τ^{2} I (h = 0)$ , and using neighbor sets of maximum size $m$ with each location appearing in at most $M$ many neighbor sets. Then NN-GLS using $Q$ is consistent, i.e., ${‖ {\hat{f}}_{n} - f_{0} ‖}_{n} \overset{p}{\to} 0$ for any choice of $ϕ > \frac{1}{h} C^{* - 1} (min (1, \frac{σ^{2} + τ^{2}}{(2 m + M \sqrt{m}) σ^{2}}))$ , where $C^{*}$ is the Matérn covariance function with unit variance, unit spatial decay, and smoothness $v$ , and $C^{* - 1}$ is its inverse function.

Proposition 2 provides consistency of NN-GLS for Matérn GP when using NNGP working precision matrices. This is the actual choice of $Q$ used in our algorithm as it can be represented as a GNN, thereby facilitating a scalable implementation, as described in Section 4. The result allows the spatial parameter used in the working covariance to be different from the truth. The spatial decay parameter $ϕ$ for the working precision matrix $Q$ needs to be sufficiently large to ensure $Q$ is not close to being singular, which will lead to numerical instability. We note that this is not a restriction on the data generation process, but just on the working precision matrix and can be chosen by the user. This is not enforced in practice, as these parameters are estimated as outlined in Section 4.2. In Section S5.1, we provide the theoretical intuition and empirical evidence that NN-GLS estimates the spatial parameters consistently. The restriction on each location appearing in at most a fixed number of neighbor sets of NNGP is usually satisfied in all but very pathological designs.

To our knowledge, Propositions 1 and 2 on the consistency of NN-GLS are the first examples of consistency of any machine-learning-based approach to estimating a nonlinear mean of a Matérn GP for irregular spatial designs. The only similar result in the literature is the consistency of RF-GLS, the GLS-based random forest approach of Saha et al. (2023). However, their result relies on a one-dimensional regular lattice design and restricts the true Matérn process smoothness to be half-integers. Our result is valid for any irregular spatial designs in the two-dimensional space – the most typical setting of spatial data collection. The result also holds for any true parameter values of the Matérn process.

5.4. Finite-sample error rates

We also obtain the finite-sample error rate for NN-GLS that informs about the importance of using the GLS loss in neural networks for spatial data.

Theorem 3 (Convergence rate). Let $π_{r_{n}}$ be the projection of $f_{0}$ in $ℱ_{n}$ . Under Assumptions 1, 2, and 3, the NN-GLS estimate ${\hat{f}}_{n}$ (19) satisfies

{‖ {\hat{f}}_{n} - f_{0} ‖}_{n} = O_{p} (\frac{Λ_{h i g h}}{Λ_{l o w}} {‖ π_{r_{n}} f_{0} - f_{0} ‖}_{n} \lor \frac{Λ_{h i g h}}{Λ_{l o w}} \sqrt{\frac{r_{n} \log n}{n}}) .

There are two main takeaways from this rate result. First, the rate for NN-GLS is, up to a scaling factor, the same as that for OLS neural networks in the i.i.d. case (Shen et al., 2023). We do not make any assumption on the class of the regression function $f_{0}$ beyond continuity and refer the readers to (Shen et al., 2023) for discussion on specific choices of classes of the true function $f_{0}$ , $r_{n}$ and $d$ where this rate is sharp up to logarithmic terms.

The second and novel insight is the scaling of the error rate by a factor $λ = Λ_{h i g h} ∕ Λ_{l o w} \geq 1$ — the ratio of the largest and smallest eigenvalues of the discrepancy matrix $E$ (18). In general, $λ$ is close to 1 if the discrepancy matrix is closer to $I$ . Proposition S2 in Appendix S5.2 shows that the Kullback-Leibler distance $KLD (I, E (m))$ between the identity matrix $I$ and the discrepancy matrix $E (m)$ when using an NNGP working covariance matrix with $m$ -nearest neighbors is a decreasing function of $m$ . The best case scenario is when using the true covariance as the working covariance, i.e., $Q = Σ^{- 1}$ yielding $E = I$ and $λ = 1$ . This corresponds to using NNGP with $m = n$ . At the other extreme, is the vanilla neural network using the OLS loss, corresponds to using a working covariance matrix of $I$ in NN-GLS. This corresponds to NNGP with $m = 0$ and is thus the worst approximation according to Proposition S2. We verify this empirically, plotting in Figure S2 both the KLD and the spectral width of $E$ as a function of $m$ . We see a very large spectral width for $m = 0$ , i.e., using OLS loss, implying a large error rate and inefficient estimation when using OLS loss in neural networks for spatial data. This is manifested in our results detailed later. The spectral width for NNGP approximation with $m \geq 20$ is really narrow, yielding a near-optimal scaling factor $λ$ close to 1. This demonstrates how modeling the spatial covariance in the neural network estimation via the GLS loss improves the error rates and guides the choice of the number of nearest neighbor $m$ in NN-GLS (see Section S5.3).

6. Simulation study

We conduct extensive simulation experiments to study the advantages of NN-GLS over existing methods in terms of both prediction and estimation. The data are simulated from the spatial GP model (6) with two choices for the non-linear mean function $f_{1} (x) = 10 \sin (π x)$ and $f_{2} (x) = \frac{1}{6} (10 \sin (π x_{1} x_{2}) + 20 {(x_{3} - 0.5)}^{2} + 10 x_{4} + 5 x_{5})$ (Friedman function, Friedman, 1991). Section S6.1 provides all parameter values and implementation details.

We consider a total of 10 methods that represent both linear and nonlinear, spatial and non-spatial, statistical and machine learning paradigms. We include 3 candidate neural network approaches for comparison to NN-GLS: NN without any spatial information (NN-nonspatial), NN using the spatial coordinates as two additional inputs (NN-latlon), and NN using spline basis as additional inputs (NN-splines). The NN-splines method is a rendition of Deepkriging (Chen et al., 2024). It uses the same basis functions as Deepkriging but to study the effect of additional covariates terms, we keep the numbers of layers and nodes the same for all the neural network based methods. Among these, only NN-GLS and NN-nonspatial offer both estimation and prediction, NN-latlon, and NN-splines only offer prediction and do not estimate of the covariate effect (see Section 2.2).

Other than neural networks, we also include 5 other popular non-linear methods, including Generalized Additive Models (GAM), Random Forests (RF), and their counterparts under spatial settings, GAM using the spatial coordinates as two additional inputs (GAM-latlon), GAM-GLS (Nandy et al., 2017), and RF-GLS (Saha et al., 2023). RF-GLS offers both estimation and spatial predictions. GAM-latlon is used for predictions, and the other three methods are only considered for estimation. We also include the spatial linear GP model, implemented using NNGP through the BRISC R-package (Saha and Datta, 2018). Section S6.1 provides details of all the methods, and Table S1 summarizes their estimation and prediction capabilities and scalability for large datasets.

A snapshot of the results from the different experiments is provided in Figure 2. We first evaluate the estimation performance of the methods that offer an estimate of the mean function — linear-spatial model, GAM, GAM-GLS, RF, RF-GLS, NN-nonspatial, and NN-GLS. We use the Mean Integrated Square Error (MISE) for the estimate $\hat{f}$ . Figures S3(a) and S4(a) in Sections S6.2 and S6.3 respectively present the estimation results for all scenarios. A representative result is provided in Figure 2 (a) which presents the MISE for all methods when $f_{0} = f_{2}$ and for 3 choices of the nugget variance $τ^{2}$ . NN-GLS consistently outperforms the other methods. The non-spatial neural network, which uses OLS loss, performs poorly for estimation even though it uses the same function class as NN-GLS for the mean. This shows the importance of explicitly accounting for the spatial covariance in the neural network estimation process. We further compare the two in Figure S5, which shows that NN-GLS has a more significant advantage over non-spatial neural networks when the spatial decay $ϕ$ is small. This is expected as for small $ϕ$ , there is a strong spatial correlation in the data, so the performance of NN-nonspatial suffers on account of ignoring this spatial information. The deterioration in the performance of NN-nonspatial over NN-GLS is lesser for large $ϕ$ , as there is only a weak spatial correlation in the data.

Fig. 2 — (a): The estimation performance comparison with $f_{0} = f_{2}$ ; (b): (Section S6.4) Estimation performance comparison among GAM, GAM-GLS and NN-GLS against *rho*, i.e., the interaction strength; (c): The prediction performance comparison with $f_{0} = f_{2}$ (d): (Section S6.9) Prediction performance comparison among non-spatial NN, NN-GLS and NN-splines against sample size; (e): (Section S6.7) The consistency of estimation. (f): (Section S6.7) The running time for estimation.

In Figure 2(b) we present focussed comparisons between NN-GLS and methods from the GAM family. We consider variants of the Friedman function $f_{2}$ for the mean, controlling the weight of the interaction term with a parameter $ρ \in [0, 1]$ . Neither GAM nor GAM-GLS can model interaction, and this is reflected in their performance. When $ρ$ is small (weak interaction effect), their MISE are close to that from NN-GLS. However, when $ρ$ is large (strong interaction effect), the MISE of the GAM methods are considerably worse. This shows the advantage of the neural network family over GAM for non-linear regression in the presence of interaction terms (see Section S6.4 for details).

We compare prediction performances using the Relative Mean Squared Error (RMSE) on the test set, obtained by standardizing the MSE by the variance of the response so that the quantity can be compared across different experiments. Figures S3(b) and S4(b) in Sections S6.2 and S6.3 present the prediction results from all scenarios. When $f_{0} = f_{1}$ , i.e., the sine function (Figure S3(b)), the spatial linear model, unsurprisingly, offers the worst prediction performance by missing the strong non-linearity. NN-GLS is consistently the best, on account of using spatial information both in estimation and prediction, and the performances of other methods lie between NN-GLS and the linear model. In Figure 2(c), we present the more interesting scenario when $f_{0} = f_{2}$ . It’s surprising to notice that the linear-spatial model outperforms all methods except NN-GLS in terms of prediction performance. The reason is the partial-linearity of Friedman function and the power of the GP in the linear model, NN-GLS offers the lowest RMSE. In fact, all the GP-based approaches (spatial linear model, RF-GLS, NN-GLS) outperform the added-spatial-features approaches (GAM-latlon, NN-latlon, NN-splines). This demonstrates the limitation of the added-spatial-features approaches compared to NN-GLS which embeds NN in the GP model. The choice and the number of the added basis functions may need to be tuned carefully to optimize performance for specific applications. Also, the sample sizes may need to be much larger to fully unlock the non-parametric ability of the added-spatial-features approaches. We see this in Figure 2(d) which shows the prediction performance against sample size n under fixed-domain sampling. For large $n$ , the performance of NN-splines becomes similar to NN-GLS, although NN-GLS still has lower RMSE. NN-GLS circumvents basis functions by parsimoniously modeling the spatial dependence directly through the GP covariance, and performs well for both small and large sample sizes and fixed- and increasing-domain sampling paradigms (see Section S6.9 for all the scenarios).

Figures 2(e) and (f) present the estimation performance and running times for different methods up to sample sizes of 500, 000. The large sample runs demonstrate how the approximation error for NN-GLS goes to zero while for others like the GAM family and linear models stay away from zero as they cannot approximate the Friedman function. This empirically verifies the consistency result. NN-GLS also scales linearly in $n$ due to the innovations of Section 4 and is much faster than the other non-linear GLS approaches like RF-GLS and GAM-GLS. Details are in Section S6.7.

Section S6 of the Supplementary materials presents all the other simulation results. The confidence interval for $\hat{f}$ and the prediction interval for $\hat{y}$ are evaluated in Sections S6.5 and S6.6 respectively. Both types of intervals provided by NN-GLS provide near-nominal coverage as well as superior interval scores against the competing methods. To study the impact of sampling design, a denser sampling is considered in Section S6.8 keeping the domain fixed. This is to supplement the theoretical results, which are only for the increasing domain. In section S6.10 we investigate if the estimation of the spatial parameters has an impact on the performance of NN-GLS by comparing it to NN-GLS (oracle) which uses the true spatial parameters. We observe that NN-GLS’s performance is quite similar to that of NN-GLS (oracle) since it provides an accurate estimation of spatial parameters. NN-GLS also performs well for a higher dimensional mean function (of 15 covariates) (Section S6.11). We finally assess the robustness of NN-GLS to model misspecification, including misspecification of the GP covariance (Section S6.12), and complete misspecification of the spatial dependence (Section S6.13). For both estimation and prediction, NN-GLS performs the best or comparably with the best method for both cases of misspecification.

7. Real data example

The real data example considered here is the spatial analysis of PM_2.5 (fine particulate matter) data in the continental United States. Following a similar analysis in Chen et al. (2024), we consider the daily PM_2.5 data from 719 locations on June 18th, 2022 and regress PM_2.5 on six meteorological variables — precipitation accumulation, air temperature, pressure, relative humidity, west wind (U-wind) speed, and north wind (V-wind) speed. Section S7.1 has more details about the data.

Figure 3 (a) demonstrates the PM_2.5 distribution on the date (smoothed by inverse distance weighting interpolation), as well as the nationwide EPA monitor’s distribution (the orange spots). The spatial nature of PM_2.5 is evident from the map. We consider the following 5 methods previously introduced in Section 6: GAM-latlon, RF-GLS, NN-latlon, NN-splines, and NN-GLS. To evaluate the prediction performance, we randomly take 83% (5/6^th) of the data as a training set and the rest part as a testing set using block-random data splitting strategy, which is closer to a real-world scenario (see Appendix S7.2 for details), train the model and calculate the RMSE of the prediction on the testing set. The procedure is repeated for 100 times. The performance of each method is shown in Figure 3(b). We find that NN-GLS and RF-GLS have comparably the lowest average RMSE, while GAM-latlon, NN-latlon, and NN-splines are outperformed. This trend is consistent for other choices of dates and for different ways of data splitting, except for some cases where the spatial pattern is not clear and NN-splines may do similarly (see Section S7.3 of the Supplementary materials). Sensitivity to the choice of hold-out data is presented in Section S7.2. Section S7.4 evaluates the performance of prediction intervals on hold-out data. We see that prediction intervals from NN-GLS offers near nominal coverage. Model adequacy checks are performed in Section S7.5 and reveal that data from different days adhere to the modeling assumptions of Gaussianity and exponentially decaying spatial correlation to varying degrees. As NN-GLS performs well on all days, it gives confidence in the applicability of NN-GLS in a wide variety of scenarios.

NN-GLS also provides a direct estimate of the effect of the meteorological covariates on PM_2.5, specified through the mean function $f (X)$ . However, $X$ is six-dimensional in this application, precluding any convenient visualization of the function $f (X)$ . Hence, we use partial dependency plots (PDP) – a common tool in the machine learning world for visualizing the marginal effect of one or two features on the response (see Section S7.6 for a detailed introduction). In Figure 4, we present the PDP of PM_2.5 on temperature and wind. While the PM_2.5 level is linearly affected by temperature, we see clear nonlinear effects in the wind, thereby demonstrating the need to move beyond the traditional linear spatial models and consider geostatistical models with non-linear means, like NN-GLS.

Fig. 4 — Partial dependency plots showing the marginal effects of temperature and west wind (U-wind).

8. Discussion

In this work, we showed how neural networks can be embedded directly within traditional geostatistical Gaussian process models, creating a hybrid machine learning statistical modeling approach that balances parsimony and flexibility. Compared to existing renditions of NN for spatial data which need to create and curate many spatial basis functions as additional features, NN-GLS simply encodes spatial information explicitly and parsimoniously through the GP covariance. It reaps all the benefits of the model-based framework including the separation of the non-spatial and spatial structures into the mean and the covariance, the use of GLS loss to account for spatial dependence while estimating the non-linear mean using neural networks, and the prediction at new locations via kriging.

We show that NN-GLS can be represented as a graph neural network. The resulting GLS loss is equivalent to adding a graph convolution layer to the output layer of an OLS-style neural network. This ensures the computational techniques used in standard NN optimization like mini-batching and backpropagation can be adapted for NN-GLS, resulting in a linear time algorithm. Also, kriging predictions can be obtained entirely using the GNN architecture using graph convolution and deconvolution.

Our connection of NN-GLS to GNN is of independent importance, as it demonstrates the connection of GNN to traditional geostatistics. Current adaptations of GNN to spatial data (Tonks et al., 2022; Fan et al., 2022) mostly use simple graph convolution choices like equal-weighted average of neighboring observations. For irregular spatial data, this disregards the unequal distances among the observations, which dictate the strength of spatial correlations in traditional geospatial models. Our representation result shows how graph-convolution weights arise naturally as nearest neighbor kriging weights which accounts for mutual distances among data locations. Future work will extend this GNN framework and consider more general graph deconvolution layers, and applicable to other types of irregular spatial data where graphical models are already in use, like multivariate spatial outcomes using inter-variable graphs (Dey et al., 2022) or areal data graphs (Datta et al., 2019).

We prove general theory on the existence and consistency of neural networks using GLS loss for spatial processes, including Matérn GP, observed over an irregular set of locations. We establish finite sample error rates of NN-GLS, which improves when the working covariance matrix used in the GLS loss is close to the true data covariance matrix. It proves that ignoring the spatial covariance by simply using the OLS loss leads to large error rates. To the best of our knowledge, we presented the first theoretical results for any neural network approach under spatial dependency.

There is a gap between the theory and the actual implementation of NN-GLS. The theory relies on a restricted class of neural networks, and does not consider steps like mini-batching and backpropagation used in practice. Even in a non-spatial context, this optimization error gap between the practice and theory of NN is yet to be bridged. Our theory is not complexity-adaptive as we do not assume any structure on the true mean function beyond continuity, It will be interesting to pursue adaptive rates for NN-GLS akin to recent works of Schmidt-Hieber (2020) and others in a non-spatial context (see Section S3). Extension to high-dimensional covariates as considered in Fan and Gu (2023) is also a future direction. Theoretically justified methodology for pointwise inference (interval estimates) for the mean function from neural networks remains open, even for i.i.d. case.

We primarily focussed on modeling the mean as a function of the covariates using a rich non-linear family, i.e., the neural network class, while using stationary covariances to model the spatial structure. However, non-stationarity can be easily accommodated in NN-GLS either by including a few basis functions in the mean or using the GLS loss with a non-stationary covariance function. For example, Zammit-Mangion et al. (2021) proposed rich classes of non-stationary covariance functions using transformations of the space modeled with neural networks.

Supplementary Material

Supp 1

NIHMS2003252-supplement-Supp_1.zip^{(9.9MB, zip)}

Acknowledgement and Disclosures

This work is supported by National Institute of Environmental Health Sciences grant R01ES033739. The authors report there are no competing interests to declare.

Footnotes

Supplementary materials and Software

The supplement contains proofs, additional theoretical insights or methodological details, and results from more numerical experiments. A Python implementation of NN-GLS which is publicly available in the geospaNN package at https://pypi.org/project/geospaNN/.

References

Abramowitz M and Stegun IA (1948), Handbook of mathematical functions with formulas, graphs, and mathematical tables, Vol. 55, US Government printing office. [Google Scholar]
Banerjee S, Carlin BP and Gelfand AE (2014), Hierarchical modeling and analysis for spatial data, CRC press. [Google Scholar]
Breiman L. (2001), ‘Random forests’, Machine learning 45(1), 5–32. [Google Scholar]
Chen W, Li Y, Reich BJ and Sun Y (2024), ‘Deepkriging: Spatially dependent deep neural networks for spatial prediction’, Statistica Sinica 34, 291–311. [Google Scholar]
Cressie N and Wikle CK (2015), Statistics for spatio-temporal data, John Wiley & Sons. [Google Scholar]
Datta A, Banerjee S, Finley AO and Gelfand AE (2016), ‘Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets’, Journal of the American Statistical Association 111(514), 800–812. [DOI] [PMC free article] [PubMed] [Google Scholar]
Datta A, Banerjee S, Hodges JS and Gao L (2019), ‘Spatial disease mapping using directed acyclic graph auto-regressive (dagar) models’, Bayesian analysis 14(4), 1221. [DOI] [PMC free article] [PubMed] [Google Scholar]
Dey D, Datta A and Banerjee S (2022), ‘Graphical gaussian process models for highly multivariate spatial data’, Biometrika 109(4), 993–1014. [DOI] [PMC free article] [PubMed] [Google Scholar]
Fan J, Bai J, Li Z, Ortiz-Bobea A and Gomes CP (2022), A gnn-rnn approach for harnessing geospatial and temporal information: application to crop yield prediction, in ‘Proceedings of the AAAI conference on artificial intelligence’, Vol. 36, pp. 11873–11881. [Google Scholar]
Fan J and Gu Y (2023), ‘Factor augmented sparse throughput deep relu neural networks for high dimensional regression’, Journal of the American Statistical Association 0(0), 1–15. URL: 10.1080/01621459.2023.2271605 [DOI] [Google Scholar]
Freund Y, Schapire R and Abe N (1999), ‘A short introduction to boosting’, Journal-Japanese Society For Artificial Intelligence 14(771−780), 1612. [Google Scholar]
Friedman JH (1991), ‘Multivariate adaptive regression splines’, Annals of statistics 19(1), 1–67. [Google Scholar]
Goodfellow I, Bengio Y and Courville A (2016), Deep learning, MIT press. [Google Scholar]
Gray SD, Heaton MJ, Bolintineanu DS and Olson A (2022), ‘On the use of deep neural networks for large-scale spatial prediction.’, Journal of Data Science 20(4), 493–511. [Google Scholar]
Hornik K, Stinchcombe M and White H (1989), ‘Multilayer feedforward networks are universal approximators’, Neural networks 2(5), 359–366. [Google Scholar]
Iranzad R, Liu X, Chaovalitwongse WA, Hippe D, Wang S, Han J, Thammasorn P, Duan C, Zeng J and Bowen S (2022), ‘Gradient boosted trees for spatial data and its application to medical imaging data’, IISE Transactions on Healthcare Systems Engineering 12(3), 165–179. [DOI] [PMC free article] [PubMed] [Google Scholar]
Nandy S, Lim CY and Maiti T (2017), ‘Additive model building for spatial regression’, Journal of the Royal Statistical Society Series B: Statistical Methodology 79(3), 779–800. [Google Scholar]
Saha A, Basu S and Datta A (2023), ‘Random forests for spatially dependent data’, Journal of the American Statistical Association 118(541), 665–683. [Google Scholar]
Saha A and Datta A (2018), ‘Brisc: bootstrap for rapid inference on spatial covariances’, Stat 7(1), e184. [Google Scholar]
Schmidt-Hieber J. (2020), ‘Nonparametric regression using deep neural networks with ReLU activation function’, The Annals of Statistics 48(4), 1875–1897. URL: 10.1214/19-AOS1875 [DOI] [Google Scholar]
Shen X, Jiang C, Sakhanenko L and Lu Q (2023), ‘Asymptotic properties of neural network sieve estimators’, Journal of Nonparametric Statistics 35(4), 839–868. [DOI] [PMC free article] [PubMed] [Google Scholar]
Sigrist F. (2022), ‘Gaussian process boosting’, The Journal of Machine Learning Research 23(1), 10565–10610. [Google Scholar]
Stein ML (1999), Interpolation of spatial data: some theory for kriging, Springer Science & Business Media. [Google Scholar]
Tonks A, Harris T, Li B, Brown W and Smith R (2022), ‘Forecasting west nile virus with graph neural networks: Harnessing spatial dependence in irregularly sampled geospatial data’, arXiv preprint arXiv:2212.11367. [DOI] [PMC free article] [PubMed] [Google Scholar]
Wang H, Guan Y and Reich B (2019), Nearest-neighbor neural networks for geostatistics, in ‘2019 international conference on data mining workshops (ICDMW)’, IEEE, pp. 196–205. [Google Scholar]
Wikle CK and Zammit-Mangion A (2023), ‘Statistical deep learning for spatial and spatiotemporal data’, Annual Review of Statistics and Its Application 10, 247–270. [Google Scholar]
Zammit-Mangion A, Ng TLJ, Vu Q and Filippone M (2021), ‘Deep compositional spatial models’, Journal of the American Statistical Association pp. 1–22.35757777 [Google Scholar]
Zhang H. (2004), ‘Inconsistent estimation and asymptotically equal interpolations in model-based geostatistics’, Journal of the American Statistical Association 99(465), 250–261. [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Supp 1

NIHMS2003252-supplement-Supp_1.zip^{(9.9MB, zip)}

[R1] Abramowitz M and Stegun IA (1948), Handbook of mathematical functions with formulas, graphs, and mathematical tables, Vol. 55, US Government printing office. [Google Scholar]

[R2] Banerjee S, Carlin BP and Gelfand AE (2014), Hierarchical modeling and analysis for spatial data, CRC press. [Google Scholar]

[R3] Breiman L. (2001), ‘Random forests’, Machine learning 45(1), 5–32. [Google Scholar]

[R4] Chen W, Li Y, Reich BJ and Sun Y (2024), ‘Deepkriging: Spatially dependent deep neural networks for spatial prediction’, Statistica Sinica 34, 291–311. [Google Scholar]

[R5] Cressie N and Wikle CK (2015), Statistics for spatio-temporal data, John Wiley & Sons. [Google Scholar]

[R6] Datta A, Banerjee S, Finley AO and Gelfand AE (2016), ‘Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets’, Journal of the American Statistical Association 111(514), 800–812. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R7] Datta A, Banerjee S, Hodges JS and Gao L (2019), ‘Spatial disease mapping using directed acyclic graph auto-regressive (dagar) models’, Bayesian analysis 14(4), 1221. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R8] Dey D, Datta A and Banerjee S (2022), ‘Graphical gaussian process models for highly multivariate spatial data’, Biometrika 109(4), 993–1014. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R9] Fan J, Bai J, Li Z, Ortiz-Bobea A and Gomes CP (2022), A gnn-rnn approach for harnessing geospatial and temporal information: application to crop yield prediction, in ‘Proceedings of the AAAI conference on artificial intelligence’, Vol. 36, pp. 11873–11881. [Google Scholar]

[R10] Fan J and Gu Y (2023), ‘Factor augmented sparse throughput deep relu neural networks for high dimensional regression’, Journal of the American Statistical Association 0(0), 1–15. URL: 10.1080/01621459.2023.2271605 [DOI] [Google Scholar]

[R11] Freund Y, Schapire R and Abe N (1999), ‘A short introduction to boosting’, Journal-Japanese Society For Artificial Intelligence 14(771−780), 1612. [Google Scholar]

[R12] Friedman JH (1991), ‘Multivariate adaptive regression splines’, Annals of statistics 19(1), 1–67. [Google Scholar]

[R13] Goodfellow I, Bengio Y and Courville A (2016), Deep learning, MIT press. [Google Scholar]

[R14] Gray SD, Heaton MJ, Bolintineanu DS and Olson A (2022), ‘On the use of deep neural networks for large-scale spatial prediction.’, Journal of Data Science 20(4), 493–511. [Google Scholar]

[R15] Hornik K, Stinchcombe M and White H (1989), ‘Multilayer feedforward networks are universal approximators’, Neural networks 2(5), 359–366. [Google Scholar]

[R16] Iranzad R, Liu X, Chaovalitwongse WA, Hippe D, Wang S, Han J, Thammasorn P, Duan C, Zeng J and Bowen S (2022), ‘Gradient boosted trees for spatial data and its application to medical imaging data’, IISE Transactions on Healthcare Systems Engineering 12(3), 165–179. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R17] Nandy S, Lim CY and Maiti T (2017), ‘Additive model building for spatial regression’, Journal of the Royal Statistical Society Series B: Statistical Methodology 79(3), 779–800. [Google Scholar]

[R18] Saha A, Basu S and Datta A (2023), ‘Random forests for spatially dependent data’, Journal of the American Statistical Association 118(541), 665–683. [Google Scholar]

[R19] Saha A and Datta A (2018), ‘Brisc: bootstrap for rapid inference on spatial covariances’, Stat 7(1), e184. [Google Scholar]

[R20] Schmidt-Hieber J. (2020), ‘Nonparametric regression using deep neural networks with ReLU activation function’, The Annals of Statistics 48(4), 1875–1897. URL: 10.1214/19-AOS1875 [DOI] [Google Scholar]

[R21] Shen X, Jiang C, Sakhanenko L and Lu Q (2023), ‘Asymptotic properties of neural network sieve estimators’, Journal of Nonparametric Statistics 35(4), 839–868. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R22] Sigrist F. (2022), ‘Gaussian process boosting’, The Journal of Machine Learning Research 23(1), 10565–10610. [Google Scholar]

[R23] Stein ML (1999), Interpolation of spatial data: some theory for kriging, Springer Science & Business Media. [Google Scholar]

[R24] Tonks A, Harris T, Li B, Brown W and Smith R (2022), ‘Forecasting west nile virus with graph neural networks: Harnessing spatial dependence in irregularly sampled geospatial data’, arXiv preprint arXiv:2212.11367. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R25] Wang H, Guan Y and Reich B (2019), Nearest-neighbor neural networks for geostatistics, in ‘2019 international conference on data mining workshops (ICDMW)’, IEEE, pp. 196–205. [Google Scholar]

[R26] Wikle CK and Zammit-Mangion A (2023), ‘Statistical deep learning for spatial and spatiotemporal data’, Annual Review of Statistics and Its Application 10, 247–270. [Google Scholar]

[R27] Zammit-Mangion A, Ng TLJ, Vu Q and Filippone M (2021), ‘Deep compositional spatial models’, Journal of the American Statistical Association pp. 1–22.35757777 [Google Scholar]

[R28] Zhang H. (2004), ‘Inconsistent estimation and asymptotically equal interpolations in model-based geostatistics’, Journal of the American Statistical Association 99(465), 250–261. [Google Scholar]

PERMALINK

Neural networks for geospatial data

Wentao Zhan

Abhirup Datta

Abstract

1. Introduction

2. Preliminaries