Spatial Factor Models for High-Dimensional and Large Spatial Data: An Application in Forest Variable Mapping

Daniel Taylor-Rodriguez; Andrew O Finley; Abhirup Datta; Chad Babcock; Hans-Erik Andersen; Bruce D Cook; Douglas C Morton; Sudipto Banerjee

doi:10.5705/ss.202018.0005

Author manuscript; available in PMC: 2020 Dec 11.

Published in final edited form as: Stat Sin. 2019;29:1155–1180. doi: 10.5705/ss.202018.0005


PMCID: PMC7731981 NIHMSID: NIHMS996762 PMID: 33311955

Abstract

Gathering information about forest variables is an expensive and arduous activity. As such, directly collecting the data required to produce high-resolution maps over large spatial domains is infeasible. Next generation collection initiatives of remotely sensed Light Detection and Ranging (LiDAR) data are specifically aimed at producing complete-coverage maps over large spatial domains. Given that LiDAR data and forest characteristics are often strongly correlated, it is possible to make use of the former to model, predict, and map forest variables over regions of interest. This entails dealing with the high-dimensional (~10²) spatially dependent LiDAR outcomes over a large number of locations (~10⁵–10⁶). With this in mind, we develop the Spatial Factor Nearest Neighbor Gaussian Process (SF-NNGP) model, and embed it in a two-stage approach that connects the spatial structure found in LiDAR signals with forest variables. We provide a simulation experiment that demonstrates inferential and predictive performance of the SF-NNGP, and use the two-stage modeling strategy to generate complete-coverage maps of forest variables with associated uncertainty over a large region of boreal forests in interior Alaska.

Keywords: LiDAR data, forest outcomes, nearest neighbor Gaussian processes, spatial prediction

1. Introduction

Strong relationships between remotely sensed Light Detection and Ranging (LiDAR) data and forest variables have been documented in the literature (Asner et al., 2009; Babcock et al., 2013; Næsset, 2011). When used in forested settings, LiDAR data provide a high-dimensional signal that characterizes the vertical structure of the forest canopy at point-referenced locations. Traditionally LiDAR data acquisition campaigns have sought complete coverage at a high spatial resolution over relatively small spatial domains—resulting in a fine grid of point-referenced LiDAR signals. In such settings, the link between LiDAR data and forest variable measurements on sparsely sampled forest inventory plots has been exploited to create high resolution complete-coverage predictive maps of the forest variables. Commonly this link is established by first extracting relevant features of the high-dimensional LiDAR signals through a dimension reduction step (Babcock et al., 2015; Junttila and Laine, 2017), then using the LiDAR features as predictors in a regression model to explain variability in spatially coinciding forest variable outcomes. The model is then applied to predict the forest outcomes at all locations across the domain where LiDAR signals have been observed.

Considerably more ambitious next generation LiDAR collection initiatives, such as ICESat-2 (ICESat-2, 2015), Global Ecosystem Dynamics Investigation LiDAR (GEDI) (GEDI, 2014), and NASA Goddard’s LiDAR, Hyper-Spectral, and Thermal imager (G-LiHT) (G-LiHT, 2016), seek to quantify and map forest variables over vast spatial extents. To fulfill their goals in a cost-effective manner, these data gathering programs do not collect LiDAR data over the entire domain, but rather sparsely sample locations across the domain extent and over forest inventory plots (i.e., where forest variables have been measured). While generating complete-coverage high resolution maps of forest outcomes remains the primary intended use for these data, there is also interest in creating maps of LiDAR data over non-sampled locations, and assessing spatial dependence within and among LiDAR signals.

Our motivating application focuses on forest variable prediction and mapping in the boreal forests of interior Alaska using sparsely sampled LiDAR and forest variable measurements. Within these regions, acquiring complete coverage LiDAR is cost prohibitive (Andersen et al., 2011; Bolton et al., 2013; Nelson et al., 2012). Because complete coverage maps of forest variables (and perhaps LiDAR signals) are still the goal, the information in the sparsely sampled LiDAR must be leveraged to inform forest variable prediction. One attractive solution is to move the LiDAR predictor variables to the left hand side of the regression and model them jointly with the forest outcomes. When the number of LiDAR and forest variables is small, such joint models are possible via linear models of coregionalization; see, e.g., Babcock et al. (2017) and Finley et al. (2014a). Alternatively, if the LiDAR signal is high-dimensional but observed at a small number of locations, reduced rank models can be employed. For example, Banerjee et al. (2008), Ren and Banerjee (2013), and Finley et al. (2017a) applied a reduced rank predictive process modeling strategy to analyze similar high-dimensional data. However, approaches that employ a reduced rank representation of the desired spatial process cannot scale to datasets with tens of thousands of locations and can yield poor predictive performance (Stein, 2014).

Models able to handle high-dimensional signals observed over a large number of locations and capable of estimating within and among location dependence structures are needed. Recent modeling developments reviewed in Heaton et al. (2017) and Banerjee (2017) highlight several options for robust and practical approximation of univariate Gaussian Process (GP) models. A subset of these models can be easily extended to accommodate relatively small multivariate response vectors (five or fewer outcomes; see, e.g., Datta et al., 2016a). Nevertheless, for our particular application we require an approach that can cope with the high-dimensional LiDAR measurements (~50 outcomes at a location) while making use of the large collection of observed locations.

The Nearest Neighbor Gaussian Process (NNGP) developed in Datta et al. (2016a), Datta et al. (2016b), and Datta et al. (2016c) can be used with a massive number of locations because its scalability is not mediated by the number of observed locations, but rather by the size of the nearest neighbor sets considered, a quality that yields minimal storage and computational requirements. These models belong to the class of methods that induce sparsity on the spatial precision matrix, and exploit the natural representation of sparsity provided by graphical models (Lauritzen, 1996; Murphy, 2012) to build a sparse GP that accurately approximates the original dense GP.

To tackle the high-dimensional LiDAR dataset, we develop a Bayesian NNGP spatial factor model (SFM), referred to as the SF-NNGP. Following Christensen and Amemiya (2002), Hogan and Tchernis (2004), and Ren and Banerjee (2013), the SFM structure enables approximating the dependence between multivariate (spatially dependent) outcomes through a lower-dimensional set of spatial factors, alleviating the difficulty of dealing directly with high-dimensional outcomes. The SF-NNGP allows us to model and map the LiDAR signals on both observed and unobserved locations, and, conditioning on the LiDAR spatial signatures, we can likewise map the forest variables over the entire spatial domain of interest. Furthermore, using a Bayesian approach for model fitting enables us to equip the derived estimates and predictions with associated measures of uncertainty, an essential requirement of many high-profile initiatives. Our methods are fully implemented in C++, using BLAS (Blackford et al., 2001; Zhang, 2016) to leverage efficient multi-processor matrix operations and OpenMP (Dagum and Menon, 1998) to improve key steps of the algorithm through parallelization. Code and reproducible results will be provided via a GitHub site prior to publication.

The remainder of the document is structured as follows. Section 2 introduces the Bonanza Creek dataset. In Section 3 we formulate the proposed hierarchical Bayesian modeling strategy. Section 4 presents the analysis of a synthetic dataset to validate the performance of the SF-NNGP model. Using the available LiDAR and forest inventory data, in Section 5 we develop and validate a predictive model for forest variables. We close by providing some insights, recommendations and future directions in Section 6.

2. Data Description

The Bonanza Creek Experimental Forest (BCEF) is a Long-Term Ecological Research (LTER) site consisting of vegetation and landforms typical of interior Alaska. The BCEF is 21,000 ha and includes a section of the Tanana River floodplain along the southeastern borders (Bonanza Creek LTER, 2016). Figure 1 shows the location and extent of the BCEF data detailed in this section.

Figure 1: Bonanza Creek Experimental Forest extent with color enhanced Landsat image and locations where the LiDAR signals were measured (*LiDAR* in the legend) and locations where both LiDAR signals and forest variables were measured (*LiDAR & inventory* in the legend).

Forest variables were collected on 197 plots in 2014 using the USDA Forest Service Forest Inventory and Analysis Program protocol (Bechtold and Patterson, 2005). We consider three forest variables commonly used by forest professionals to make management decisions: above-ground biomass (AGB); tree density (TD); basal area (BA). AGB for individual trees was estimated using the Component Ratio Method described in Woodall et al. (2015). TD for a plot is expressed in thousands of trees per ha. BA for a plot is the sum of individual trees’ cross-sectional areas in m² at breast height scaled to a per ha basis.

In the summer of 2014, LiDAR data were collected using a flight-line strip sampling approach with NASA Goddard’s G-LiHT sensor (Cook et al., 2013), which is a portable multi-sensor system that accurately characterizes complex terrain and vertical distribution of canopy elements (Jakubowski et al., 2013; White et al., 2013). Point cloud information was summarized to a 13×13 m grid cell size to approximate field plot areas. Over each grid cell, pseudo-waveforms were generated by calculating LiDAR return count densities for 0.5 m height bins between 0 and 28.5 m (i.e., 57 LiDAR outcomes per location). LiDAR return count density for height bin l is defined as the number of returns in height bin l divided by the total number of LiDAR returns over the grid cell. Identical LiDAR pseudo-waveforms were obtained using point clouds extracted over each field plot. G-LiHT data for the study area are available online at https://gliht.gsfc.nasa.gov. For this analysis, 50,197 LiDAR observations were used for model-fitting.
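The return count density calculation described above can be sketched in a few lines. This is a minimal illustration, not the processing code used for the G-LiHT data: the function name `pseudo_waveform`, the toy return heights, and the choice to keep out-of-range returns in the denominator are all assumptions.

```python
import numpy as np

def pseudo_waveform(heights, bin_width=0.5, max_height=28.5):
    """Return count density per height bin for one grid cell (hypothetical helper).

    Each density is the number of returns in a bin divided by the total
    number of returns recorded over the cell.
    """
    heights = np.asarray(heights, dtype=float)
    n_bins = int(round(max_height / bin_width))        # 57 bins for the defaults
    edges = np.linspace(0.0, max_height, n_bins + 1)   # 0.0, 0.5, ..., 28.5
    counts, _ = np.histogram(heights, bins=edges)
    total = heights.size
    if total == 0:
        return np.zeros(n_bins)
    return counts / total

# Toy return heights (in meters) for a single 13 x 13 m cell; values are made up.
demo_heights = np.array([0.2, 0.3, 1.1, 5.6, 5.9, 12.0, 27.9, 3.4])
z_demo = pseudo_waveform(demo_heights)   # length-57 vector, one entry per height bin
```

When every return falls between 0 and 28.5 m, the 57 densities for a cell sum to one, which is one reason the elements of z(s) are strongly dependent on each other.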

A Landsat 8 top of atmosphere (TOA) reflectance product was procured for the BCEF area for June of 2015. The June 2015 image was preferred to the June 2014 image due to excessive cloud cover in the 2014 image. A tasseled cap transformation was applied to the raw Landsat 8 TOA reflectance bands to obtain brightness, greenness, and wetness tasseled cap indices (Baig et al., 2014). These tasseled cap indices are used as covariates in the subsequent analysis.

Further details regarding the dataset and the ensuing analysis are provided in Section 5.

3. Modeling Strategy

As stated before, our goal is to model and generate uncertainty-equipped predictions of forest variables, making use of information contained in LiDAR signals. Consider a LiDAR signal, z(·), observed at a finite collection of locations $T_{z} = \{s_{1}, \dots, s_{n_{z}}\}$, and a set of forest outcomes, y(·), observed at locations in the set $T_{y} = \{r_{1}, \dots, r_{n_{y}}\} \subset T_{z}$. Furthermore, let $T_{\emptyset} = \{t_{1}, \dots, t_{n_{\emptyset}}\}$ denote a set of locations where neither LiDAR signals nor forest outcomes are available but where prediction is of interest. Thus, the set of locations where both LiDAR and forest outcomes are to be mapped corresponds to $T = (T_{z} \cup T_{\emptyset})$, with $T \subset D \subset ℝ^{2}$, where $D$ is the spatial domain of interest. Note that although we mention above that z(·) and y(·) are “observed” at locations in $T_{z}$ and $T_{y}$, respectively, we allow for missing values that are to be imputed in these sets. We make this distinction because locations where imputation is performed are part of the model fitting, whereas for locations in $T_{\emptyset}$ predictions are drawn ex post facto from the posterior predictive distribution; more detail is provided in Section 3.4.

The LiDAR signals are high-dimensional vectors of measurements in $ℝ^{h_{z}}$, whereas the forest outcomes are relatively small-dimensional vectors (i.e., h_y << h_z), assumed to have support on $ℝ^{h_{y}}$. Forest outcomes and LiDAR signals are strongly dependent on each other; LiDAR signals vary with the composition of a forest, and, as a plethora of examples in the literature have demonstrated (Ene et al., 2018; Finley et al., 2014b; Nelson et al., 2017), variability in forest outcome variables can be partially explained by LiDAR characteristics.

3.1. Linking LiDAR and forest inventory data

We seek to connect forest outcomes and LiDAR signals as a two-step process. First, we formulate a generative model to extract the spatial signature from the LiDAR data at locations in $T_{z}$ , which can also be used to interpolate LiDAR signals in $T_{\emptyset}$ . Along with other spatially referenced predictors, the LiDAR spatial signatures for locations in $T_{y}$ are used as predictors to build the model for forest outcomes. Moreover, a component that captures spatial variation exclusive to the forest outcomes can also be specified if required. For $s \in D$ this two stage model is given by

$$\text{Stage 1}: \quad z(s) = X_{z}(s)' \beta_{z} + w^{\star}(s) + \varepsilon_{z}(s), \tag{3.1}$$

$$\text{Stage 2}: \quad y(s) = X_{y}(s)' \beta_{y} + \Upsilon w^{\star}(s) + v^{\star}(s) + \varepsilon_{y}(s). \tag{3.2}$$

Note that the influence of z(s) over y(s) in (3.2) is exerted solely through its spatial component w*(s). There are several arguments in favor of this approach, as opposed to plugging in z(s) or μ_z(s) = X_z(s)′β_z + w*(s) directly as covariates into (3.2). Most importantly for our setting, z(s), μ_z(s) and w*(s) are all high-dimensional objects, and using w*(s) facilitates reducing the dimensionality of the problem by casting it under the factor model structure, as shown in Section 3.2. Additionally, the elements within z(s) are strongly correlated, and hence multicollinearity issues would arise if z(s) were included directly in (3.2).

In (3.1) and (3.2) the terms X_z(s)′β_z and X_y(s)′β_y capture large scale variation. For κ ∈ {z,y}, X_κ(s)′ represents a fixed h_κ × p_κ block-diagonal matrix of spatially referenced predictors, where $p_{κ} = \sum_{j = 1}^{h_{κ}} p_{κ, j}$, having as its jth diagonal block the length p_κ,j vector $x_{j}^{κ} {(s)}^{'}$. The length p_κ vector β_κ corresponds to the regression coefficients associated with X_κ(s)′. The vectors w*(s) and v*(s) are h_z and h_y dimensional zero-centered stochastic processes over $D$, respectively. The process w*(s) captures the spatial variation of z(s), while v*(s) synthesizes additional spatial variation in the forest outcomes. The h_y × h_z matrix ϒ connects the spatial information extracted from the LiDAR model into the forest outcomes model. The vectors $ε_{z} (s) \sim N_{h_{z}} (0, Ψ_{z})$ and $ε_{y} (s) \sim N_{h_{y}} (0, Ψ_{y})$ represent uncorrelated random errors (i.e., Ψ_z and Ψ_y are diagonal) at finer scales.
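To make the block-diagonal structure of X_κ(s)′ concrete, the following sketch assembles the matrix for a single location and evaluates the large scale mean X_κ(s)′β_κ. The helper name `block_design` and the numeric values are illustrative assumptions; only the h_κ × p_κ layout comes from the text.

```python
import numpy as np

def block_design(x_list):
    """Build the h x p block-diagonal matrix X(s)' from per-outcome predictor vectors.

    x_list[j] is the length p_j predictor vector for outcome j at one location,
    so the result has h = len(x_list) rows and p = sum_j p_j columns.
    """
    h = len(x_list)
    p = sum(len(x) for x in x_list)
    X = np.zeros((h, p))
    col = 0
    for j, x in enumerate(x_list):
        x = np.asarray(x, dtype=float)
        X[j, col:col + x.size] = x     # row j carries only outcome j's predictors
        col += x.size
    return X

# Four outcomes sharing the same three predictors at location s (toy values).
x_s = [1.0, 0.3, -1.2]
X_s = block_design([x_s] * 4)          # 4 x 12
beta = np.arange(12.0)                 # stacked coefficients, three per outcome
mean_s = X_s @ beta                    # large scale mean, one entry per outcome
```

Because each row of the matrix touches only its own block of β_κ, every outcome gets its own regression coefficients even when the predictors are shared.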

Implementing the modeling strategy above directly is challenging: the high dimensionality of the LiDAR signals (h_z ~ 50), combined with the massive number of spatially dependent observations (n ~ 10⁵), makes it infeasible with common computing resources. In the following section, we formulate a viable alternative to model (3.1) and (3.2).

3.2. The Spatial Factor NNGP Model

To make models (3.1) and (3.2) tractable with limited computing power, we combine a dimension reduction approach and a sparsity inducing technique. In particular, we introduce the spatial factor NNGP model (SF-NNGP), which brings together the spatial factor model (SFM) structure (Schmidt and Gelfand, 2003; Finley et al., 2008; Zhang, 2007; Ren and Banerjee, 2013) with Nearest Neighbor Gaussian processes (NNGPs) (Datta et al., 2016a,b,c).

While the SFM structure enables the analysis of high-dimensional response vectors by using linear combinations of a relatively small number of independent stochastic processes, NNGPs make it possible to fit spatial process models when the number of spatial observations is forbiddingly large. NNGPs approximate the parent (dense) GP by using the natural representation of sparsity provided by graphical models (Lauritzen, 1996; Murphy, 2012). They do so by assuming that, conditional on its nearest neighbors, each location is independent of the locations outside of its neighbor set. The result is a proper (but sparse) GP that accurately approximates the original dense GP. In contrast to other sparsity inducing approaches, NNGPs allow for interpolation at unobserved locations, and can be used to make full inference on model parameters, including the latent processes. Combining the SFM structure with NNGPs provides a methodology capable of coping simultaneously with high-dimensional response vectors and a large number of spatially dependent observations.

Under the traditional SFM structure, the spatial dependence is introduced by defining the spatial process as $w^{⋆} (s) = Λ w (s) \sim GP (0, H (\cdot | ϕ))$, where Λ is a factor loadings matrix (commonly tall and skinny) and w(s) is a small-dimensional vector of independent spatial GPs, providing the non-separable multivariate cross-covariance function given by

$$H(h \mid \phi) = \mathrm{cov}(\Lambda w(s), \Lambda w(s + h)) = \sum_{k=1}^{q_{w}} C_{k}(h \mid \phi_{k}) \, \lambda_{k} \lambda_{k}', \tag{3.3}$$

for locations $s, s + h \in D$. Here, the $C_{k} (h | ϕ_{k})$’s are univariate parametric correlation functions, and λ_k is the kth column of Λ. This cross-covariance matrix is induced by q_w-variate (q_w ≤ h_z) spatial factors w(s) with independent components $w_{k} (s) \sim GP (0, C_{k} (\cdot | ϕ_{k}))$.
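The cross-covariance in (3.3) can be evaluated directly for a given lag. The sketch below assumes exponential correlation functions $C_{k}(h | ϕ_{k}) = \exp(-ϕ_{k} \|h\|)$ (the family used later in the simulation study); the function name `cross_cov` and the loadings and decay values are made up for illustration.

```python
import numpy as np

def cross_cov(h, Lambda, phi):
    """H(h | phi) = sum_k C_k(h | phi_k) lambda_k lambda_k'.

    Assumes exponential correlations C_k(h | phi_k) = exp(-phi_k * ||h||).
    """
    dist = np.linalg.norm(np.asarray(h, dtype=float))
    corr = np.exp(-np.asarray(phi, dtype=float) * dist)   # one correlation per factor
    # Scaling column k of Lambda by corr[k] gives Lambda diag(corr) Lambda'.
    return (Lambda * corr) @ Lambda.T

# Three outcomes, two spatial factors (toy values).
Lambda = np.array([[1.0, 0.0],
                   [0.5, 1.0],
                   [-0.3, 0.8]])
phi = np.array([0.5, 2.0])

H0 = cross_cov([0.0, 0.0], Lambda, phi)   # lag zero: Lambda Lambda'
H1 = cross_cov([1.0, 0.0], Lambda, phi)   # unit lag: factors decay at different rates
```

Because each factor has its own decay parameter, H(h | ϕ) is not a single correlation function times a fixed matrix, which is what makes the cross-covariance non-separable.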

As such, models (3.1) and (3.2) can be reformulated as SF-NNGPs by characterizing the spatial processes w*(s) and v*(s) as

$$w^{\star}(s) = \Lambda_{z} w(s) \quad \text{and} \quad v^{\star}(s) = \Gamma v(s), \tag{3.4}$$

where the matrices $Λ_{z} = {((λ_{h k}^{(z)}))}_{h_{z} \times q_{w}}$ and $Γ = {((γ_{l r}))}_{h_{y} \times q_{v}}$ correspond to the factor loadings matrices, and the new spatial factors for $s \in D$ are given by

$$w(s) \sim \prod_{k=1}^{q_{w}} \mathrm{NNGP}(0, \tilde{C}(\cdot \mid \phi_{k}^{w})), \quad \text{and} \quad v(s) \sim \prod_{r=1}^{q_{v}} \mathrm{NNGP}(0, \tilde{C}(\cdot \mid \phi_{r}^{v})).$$

The notation $NNGP (0, \tilde{C} (\cdot | ϕ_{k}^{w}))$ and $NNGP (0, \tilde{C} (\cdot | ϕ_{r}^{v}))$ denotes the Nearest Neighbor Gaussian Processes derived from the parent processes $GP (0, C (\cdot | ϕ_{k}^{w}))$ and $GP (0, C (\cdot | ϕ_{r}^{v}))$, respectively. Here, $C (\cdot | ϕ)$ represents the spatial correlation function with spatial decay parameter ϕ. The factor model representation in (3.4) leads to a massive reduction in the dimensionality of the problem since the spatial factors w(s) = (w_k (s): 1 ≤ k ≤ q_w) and v(s) = (v_r (s): 1 ≤ r ≤ q_v) have dimensions q_w << h_z and q_v ≤ h_y.

Bringing these elements together, and letting $Λ_{y} = ϒ Λ_{z} = {((λ_{l k}^{(y)}))}_{h_{y} \times q_{w}}$ , a computationally viable version of (3.1) and (3.2) is

$$\text{Stage 1}: \quad z(s) = X_{z}(s)' \beta_{z} + \Lambda_{z} w(s) + \varepsilon_{z}(s), \tag{3.5}$$

$$\text{Stage 2}: \quad y(s) = X_{y}(s)' \beta_{y} + \Lambda_{y} w(s) + \Gamma v(s) + \varepsilon_{y}(s). \tag{3.6}$$

In general, additional constraints are required for factor models to be identifiable (Anderson, 2003). Identifiability for spatial factor models can be achieved either by making the upper triangle of the loadings matrix equal to 0 and its diagonal elements all equal to 1 (Geweke and Zhou, 1996; Lopes and West, 2004; Aguilar and West, 2010), or as in Ren and Banerjee (2013), by fixing the sign of one element in each column of the factor loadings matrix, while enforcing an ordering constraint among the spatial decay parameters of the univariate correlation functions. We choose to ensure rotation and scale identifiability by using the former approach.
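The first of these constraints (zeros in the upper triangle and ones on the diagonal) is easy to impose on an unconstrained h × q matrix. The sketch below is only an illustration of the constraint; the helper name `constrain_loadings` and the random starting values are assumptions.

```python
import numpy as np

def constrain_loadings(free):
    """Zero the upper triangle and set the diagonal to 1 for an h x q matrix (h >= q)."""
    L = np.tril(np.asarray(free, dtype=float), k=-1)   # keep entries strictly below the diagonal
    q = L.shape[1]
    L[np.arange(q), np.arange(q)] = 1.0                # unit diagonal fixes the scale of each factor
    return L

rng = np.random.default_rng(0)
Lam = constrain_loadings(rng.standard_normal((5, 3)))

# Entries strictly below the diagonal are the only free parameters:
# h*q - q*(q+1)/2 of them for an h x q loadings matrix.
n_free = 5 * 3 - 3 * (3 + 1) // 2
```

For the 5 × 3 example, 9 of the 15 entries remain free, and these are the entries that receive priors in Section 3.3.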

With the SFM structure in place, introducing the NNGP replaces the expensive calculations (of order $n_{z}^{3} q_{w}$ and $n_{y}^{3} q_{v}$) required to invert the dense covariance matrices of the parent GPs with n_z q_w and n_y q_v parallel operations, each of order m³. Here, m is the number of neighbors considered for the NNGP, with m << n_y ≤ n_z. In simulations, Datta et al. (2016b) found that in most cases 10 ≤ m ≤ 20 provides an excellent approximation to the parent process; thus, the number of operations required is nearly linear in n.
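As a rough illustration of this scaling, consider the settings of the simulation study in Section 4 ($n_{z} = 10{,}000$ locations, $q_{w} = 8$ factors, $m = 10$ neighbors). Ignoring constants, the two operation counts compare as follows; this is an order-of-magnitude sketch, not a measured cost.

$$n_{z}^{3} q_{w} = (10^{4})^{3} \times 8 = 8 \times 10^{12}, \qquad n_{z} q_{w} m^{3} = 10^{4} \times 8 \times 10^{3} = 8 \times 10^{7}.$$

Under these assumptions the neighbor-based computation is smaller by a factor of about $n_{z}^{2} / m^{3} = 10^{5}$, and the $n_{z} q_{w}$ small solves can additionally be distributed across processors.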

For completeness, additional details regarding SFMs and NNGPs, as well as the sampling algorithm, are included in the online supplement. For a more thorough treatment of SFMs we refer the reader to Ren and Banerjee (2013) and Genton and Kleiber (2015), and for NNGPs to Datta et al. (2016c).

3.3. Prior Specification and Hierarchical Formulation

Importantly, models (3.5) and (3.6) are fitted separately so that the w(s)’s exclusively capture the spatial signal present in the LiDAR data. However, using plug-in estimates for w(s) (e.g., the posterior means) in (3.6) disregards the uncertainty present in the LiDAR spatial signal. Thus, to propagate this uncertainty through the forest outcome predictions, at each iteration of the Markov chain Monte Carlo (MCMC) algorithm for y(s), we draw a sample of w(s) ($s \in T_{y}$) from the MCMC samples obtained when fitting model (3.5).

As mentioned in the previous section, the stochastic processes that capture the spatial structure are assumed to follow NNGPs. Given that the NNGP is a proper Gaussian Process, at a finite collection of locations the NNGPs considered induce zero-centered multivariate normal priors with covariance matrices given by ${\tilde{C}}^{(w)}$ and ${\tilde{C}}^{(v)}$ , respectively. Additionally, we use suitably noninformative priors for all other parameters, which make for a direct sampling strategy.

In particular, we assume that β is either flat or conjugate normal. The matrices Γ and Λ_z are constrained as described above, with elements below the diagonal assumed to be standard normal. All elements in Λ_y are also assumed to follow a standard normal distribution. The diagonal entries in Ψ_z and Ψ_y are assigned half-t priors. Lastly, we assume uniform priors for the elements of the spatial decay vectors ϕ_w = (ϕ_w,k: 1 ≤ k ≤ q_w) and ϕ_v = (ϕ_v,r: 1 ≤ r ≤ q_v), in the interval (−log 0.05/ζ_max, −log 0.01/ζ_min), where ζ_min and ζ_max are the minimum and maximum distances across all locations. Given that ϕ_w and ϕ_v are not conjugate with their corresponding likelihood, these are sampled with random walk Metropolis steps.
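The support of the uniform prior on the decay parameters can be computed from the observed locations. The sketch below assumes an exponential correlation function exp(−ϕd), under which the lower bound places a correlation of 0.05 at the maximum distance and the upper bound places a correlation of 0.01 at the minimum distance. The helper name `decay_bounds` and the coordinates are illustrative, and the dense pairwise-distance computation is only practical for a modest number of distinct locations.

```python
import numpy as np

def decay_bounds(coords):
    """Bounds (-log 0.05 / zeta_max, -log 0.01 / zeta_min) for the decay prior.

    zeta_min and zeta_max are the minimum and maximum pairwise distances;
    locations are assumed to be distinct so that zeta_min > 0.
    """
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    d = dist[np.triu_indices(coords.shape[0], k=1)]   # each pair counted once
    zeta_min, zeta_max = d.min(), d.max()
    lower = -np.log(0.05) / zeta_max
    upper = -np.log(0.01) / zeta_min
    return lower, upper

coords = np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 1.0]])   # toy locations
lo, hi = decay_bounds(coords)   # pairwise distances here range from 1 to 5
```

With these bounds the prior rules out processes that are still strongly correlated across the whole domain as well as processes that are essentially uncorrelated even between the closest pair of locations.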

The joint posterior densities for the first and second stages of the algorithm are proportional to

$$
\begin{aligned}
\text{Stage 1}: \quad & \pi(\phi_{w}) \, N_{n_{z} q_{w}}(w_{T_{z}} \mid 0, \tilde{C}^{(w)}) \left( \prod_{k=1}^{q_{w}} \prod_{j>k}^{h_{z}} N(\lambda_{jk}^{(z)} \mid 0, 1) \right) \\
& \times \pi(\beta_{z}) \left( \prod_{j=1}^{h_{z}} IG(\psi_{j}^{z} \mid \nu/2, \nu/a_{z,j}) \, IG(a_{z,j} \mid 1/2, 1/A^{2}) \right) \\
& \times \left( \prod_{s_{i} \in T_{z}} N_{h_{z}}(z(s_{i}) \mid X_{z}(s_{i})' \beta_{z} + \Lambda_{z} w(s_{i}), \Psi_{z}) \right),
\end{aligned} \tag{3.7}
$$

$$
\begin{aligned}
\text{Stage 2}: \quad & \pi(\phi_{v}) \, N_{n_{y} q_{v}}(v_{T_{y}} \mid 0, \tilde{C}^{(v)}) \left( \prod_{k=1}^{q_{w}} \prod_{j=1}^{h_{y}} N(\lambda_{jk}^{(y)} \mid 0, 1) \right) \left( \prod_{r=1}^{q_{v}} \prod_{j>r}^{h_{y}} N(\gamma_{jr} \mid 0, 1) \right) \\
& \times \pi(\beta_{y}) \left( \prod_{j=1}^{h_{y}} IG(\psi_{j}^{y} \mid \nu/2, \nu/a_{y,j}) \, IG(a_{y,j} \mid 1/2, 1/A^{2}) \right) \\
& \times \left( \prod_{s_{i} \in T_{y}} N_{h_{y}}(y(s_{i}) \mid X_{y}(s_{i})' \beta_{y} + \Lambda_{y} w(s_{i}) + \Gamma v(s_{i}), \Psi_{y}) \right),
\end{aligned} \tag{3.8}
$$

where the vectors $w_{T_{z}} = {(w {(s_{i})}^{'} : s_{i} \in T_{z})}^{'}$ and $v_{T_{y}} = {(v {(s_{i})}^{'} : s_{i} \in T_{y})}^{'}$ are such that

$$N_{n_{z} q_{w}}(w_{T_{z}} \mid 0, \tilde{C}^{(w)}) = \prod_{s_{i} \in T_{z}} N_{q_{w}}(w(s_{i}) \mid B_{i}^{(w)} w_{N(i)}, F_{i}^{(w)}), \quad \text{and} \quad N_{n_{y} q_{v}}(v_{T_{y}} \mid 0, \tilde{C}^{(v)}) = \prod_{s_{i} \in T_{y}} N_{q_{v}}(v(s_{i}) \mid B_{i}^{(v)} v_{N(i)}, F_{i}^{(v)}). \tag{3.9}$$

The expressions on the right hand side of (3.9) result from the construction of the NNGP (see online supplement). For an m-neighbor NNGP, denote by m_i = min {m,i − 1} the number of neighbors for location s_i. The index set N(i) for location $s_{i} \in T_{z}$ contains its m_i nearest neighbors; thus w_N(i) corresponds to the vector ${(w {(s_{j})}^{'} : s_{j} \in N (i) \subset T_{z})}^{'}$. The neighbor sets are defined analogously for the v(s_i)’s. Letting u ∈ {w,v}, $B_{i}^{(u)}$ denotes the q_u × m_i q_u block matrix, with q_u × q_u diagonal blocks containing the kriging weights for the q_u spatial factors for each neighbor. Also, $F_{i}^{(u)}$ corresponds to the q_u × q_u diagonal matrix with the variances for the q_u spatial factors conditioned on the neighbor set N(i) (see Section A.2 in the supplement for details regarding $B_{i}^{(u)}$ and $F_{i}^{(u)}$). Lastly, the parameters $\{a_{y, j}\}_{j = 1}^{h_{y}}$ and $\{a_{z, k}\}_{k = 1}^{h_{z}}$ complete the hierarchical representation of the half-t prior distribution assumed for $ψ_{j}^{y}$ and $ψ_{k}^{z}$, respectively, and the hyperparameter A is simply chosen to be some large value (say, 100).
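The factorization in (3.9) can be illustrated for a single spatial factor, for which $B_{i}$ reduces to a vector of kriging weights and $F_{i}$ to a scalar conditional variance. The sketch below is not the authors' C++ implementation: it assumes an exponential correlation function, takes the row order of `coords` as the location ordering, and uses hypothetical function names.

```python
import numpy as np

def exp_corr(a, b, phi):
    """Exponential correlation between two sets of locations (assumed family)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return np.exp(-phi * d)

def nngp_factors(coords, phi, m):
    """Neighbor sets, kriging weights and conditional variances for one factor."""
    n = coords.shape[0]
    neighbors, weights, cond_var = [], [], np.ones(n)
    for i in range(n):
        m_i = min(m, i)                        # 0-based analog of min{m, i - 1}
        if m_i == 0:                           # first location has no neighbors
            neighbors.append(np.array([], dtype=int))
            weights.append(np.array([]))
            continue
        d = np.linalg.norm(coords[:i] - coords[i], axis=1)
        nb = np.argsort(d)[:m_i]               # nearest among earlier locations
        C_nn = exp_corr(coords[nb], coords[nb], phi)
        c_in = exp_corr(coords[i:i + 1], coords[nb], phi).ravel()
        b = np.linalg.solve(C_nn, c_in)        # kriging weights (entries of B_i)
        neighbors.append(nb)
        weights.append(b)
        cond_var[i] = 1.0 - c_in @ b           # conditional variance (F_i)
    return neighbors, weights, cond_var

def nngp_logdensity(w, neighbors, weights, cond_var):
    """Log density of w as a product of univariate conditional normals."""
    ll = 0.0
    for i in range(len(w)):
        mu = weights[i] @ w[neighbors[i]] if neighbors[i].size else 0.0
        ll += -0.5 * (np.log(2.0 * np.pi * cond_var[i]) + (w[i] - mu) ** 2 / cond_var[i])
    return ll

rng = np.random.default_rng(2)
coords = rng.uniform(size=(30, 2))
w = rng.standard_normal(30)
nb, B, F = nngp_factors(coords, phi=3.0, m=5)
ll_sparse = nngp_logdensity(w, nb, B, F)
```

Each location requires only one m_i × m_i solve, and when m is set to n − 1 the product of conditionals reproduces the dense GP density exactly, which is a convenient check on an implementation.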

Due to prior conjugacy, the full conditional densities for all parameters, except for those of ϕ_w and ϕ_v, can be sampled using simple Gibbs steps. Further details on the sampling algorithm are deferred to the online supplement.

3.4. Imputation and Prediction

As mentioned before, LiDAR signals are collected over the large spatial region $T_{z}$, whereas forest outcome observations are confined to the smaller subset of locations in $T_{y}$. Additionally, there are relevant out-of-sample locations where neither LiDAR nor forest outcomes are observed, $T_{\emptyset}$. Finally, there are some locations within the corresponding reference sets $T_{z}$ and $T_{y}$ that have some or all missing outcomes. It is thus essential for this modeling effort to provide the means to accurately impute the missing values in $T_{z}$ or $T_{y}$, and generate LiDAR predictions in $T_{\emptyset}$ and forest outcome predictions within $T_{\emptyset} \cup (T_{z} \setminus T_{y})$. Given the NNGP formulation, both imputation and out-of-sample prediction are remarkably inexpensive.

Imputation is straightforward. Let $s_{•} \in T_{z}$ be a location where z(s_•) is missing. Then z(s_•) is drawn as part of the sampling algorithm from $N_{h_{z}} (X_{z} {(s_{•})}^{'} β_{z} + Λ_{z} w (s_{•}), Ψ_{z})$ , where w(s_•) is sampled from the full conditional posterior density in Equation (B.13) of the online supplement. For a missing value y(s_•), where $s_{•} \in T_{y}$ , the procedure is analogous using the full conditional posterior for v(s_•) and the likelihood for y(s_•).

The procedure to predict a new LiDAR observation z(s_∘), $s_{\circ} \in T_{\emptyset}$, begins by sampling the spatial factor w(s_∘) from $N_{q_{w}} (B_{\circ}^{(w)} w_{N (s_{\circ})}, F_{\circ}^{(w)})$, with $B_{\circ}^{(w)}$ and $F_{\circ}^{(w)}$ defined as before. Note that the nearest neighbor set N(s_∘) is assumed to be in $T_{z}$. Then, we draw $z (s_{\circ}) | z_{T_{z}}$ from $N_{h_{z}} (X_{z} {(s_{\circ})}^{'} β_{z} + Λ_{z} w (s_{\circ}), Ψ_{z})$. This is done by conditioning on the posterior samples of {β_z, Λ_z, Ψ_z, ϕ_w} obtained from the fitting algorithm.
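The two sampling steps above can be sketched for one posterior sample of the parameters. This is a hedged illustration rather than the authors' implementation: it assumes exponential correlation functions, the function name `predict_z` and all numeric inputs are hypothetical, and `w_obs` merely stands in for one posterior draw of the spatial factors at the observed locations.

```python
import numpy as np

def predict_z(s_new, coords, w_obs, X_new, beta, Lam, psi, phi, m, rng):
    """One posterior-predictive draw of z(s_new) and w(s_new) given one posterior sample."""
    d = np.linalg.norm(coords - s_new, axis=1)
    nb = np.argsort(d)[:m]                       # m nearest observed locations
    D_nn = np.linalg.norm(coords[nb][:, None, :] - coords[nb][None, :, :], axis=-1)
    q = Lam.shape[1]
    w_new = np.empty(q)
    for k in range(q):                           # factors are independent a priori
        C_nn = np.exp(-phi[k] * D_nn)
        c = np.exp(-phi[k] * d[nb])
        b = np.linalg.solve(C_nn, c)             # kriging weights for factor k
        f = max(1.0 - c @ b, 0.0)                # conditional variance, guarded against round-off
        w_new[k] = rng.normal(b @ w_obs[nb, k], np.sqrt(f))
    mean = X_new @ beta + Lam @ w_new
    z_new = rng.normal(mean, np.sqrt(psi))       # psi holds the diagonal of Psi_z
    return z_new, w_new

rng = np.random.default_rng(5)
n, h, q, p = 40, 6, 2, 3
coords = rng.uniform(size=(n, 2))
w_obs = rng.standard_normal((n, q))              # placeholder for a posterior draw of w
Lam = np.tril(rng.standard_normal((h, q)), k=-1)
Lam[np.arange(q), np.arange(q)] = 1.0
beta = rng.standard_normal(h * p)
X_new = np.kron(np.eye(h), np.array([[1.0, 0.2, -0.4]]))   # h x (h p) block-diagonal design
psi = np.full(h, 0.1)
phi = np.array([2.0, 6.0])

z_new, w_new = predict_z(np.array([0.5, 0.5]), coords, w_obs, X_new,
                         beta, Lam, psi, phi, m=10, rng=rng)
```

Repeating the call once per retained posterior sample yields draws from the posterior predictive distribution of z(s_∘), and the stored w(s_∘) draws can be reused when predicting the forest outcomes at the same location.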

To predict the forest outcomes y(s_∘) at $s_{\circ} \in T_{\emptyset} \cup (T_{z} \setminus T_{y})$, first we generate samples of $v (s_{\circ}) \sim N_{q_{v}} (B_{\circ}^{(v)} v_{N (s_{\circ})}, F_{\circ}^{(v)})$. Given that y(s_∘) depends on w(s_∘), we combine the posterior draws of {β_y, Λ_y, Γ, Ψ_y, ϕ_v} with those of w(s_∘), obtained when predicting z(s_∘), and draw predicted values for $y (s_{\circ}) | y_{T_{y}}$ from $N_{h_{y}} (X_{y} {(s_{\circ})}^{'} β_{y} + Λ_{y} w (s_{\circ}) + Γ v (s_{\circ}), Ψ_{y})$.

4. Simulation: Recovering Low-dimensional Structure

In the following simulation exercise we focus exclusively on the high-dimensional component (i.e., the first stage) of the model described above. The simulation below was devised to illustrate the ability of our approach to recover true low-dimensional structure when data is generated from a low-dimensional SFM with dense spatial factors.

We generate a synthetic dataset for h_z = 50 outcomes in n_z = 10,000 locations from the spatial factor model $z (s) = X_{z} {(s)}^{'} {\tilde{β}}_{z} + {\tilde{Λ}}_{z} \tilde{w} (s) + {\tilde{ε}}_{z} (s)$. Here, X_z (s)′ is a 50 × 150 block-diagonal matrix of predictors, and ${\tilde{β}}_{z}$ is the vector of regression coefficients, both defined as before. We consider the same three predictors for all outcomes. The spatial factors are $\tilde{w} (s) \sim \prod_{k = 1}^{8} GP (0, C (\cdot | {\tilde{ϕ}}_{k}^{z}))$, where $C (\cdot | {\tilde{ϕ}}_{k}^{z})$ is an exponential correlation function with decay parameter ${\tilde{ϕ}}_{k}^{z}$. Additionally, for identifiability we assume that the 50 × 8 factor loadings matrix ${\tilde{Λ}}_{z}$ has zeros in the upper triangle and ones along the diagonal. Finally, ${\tilde{ε}}_{z} \sim N_{h_{z}} (0, {\tilde{Ψ}}_{z})$, with ${\tilde{Ψ}}_{z} = diag ({\tilde{ψ}}_{k}^{z} : k = 1, \dots, h_{z})$.
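A scaled-down version of this data generating process can be simulated as follows. The sketch keeps h_z = 50, eight factors and three shared predictors, but reduces the number of locations to 200 so that the dense parent GPs can be sampled by Cholesky factorization. The parameter values, the inclusion of an intercept among the three predictors, and all variable names are assumptions, since the true values used in the study are not given in this section.

```python
import numpy as np

rng = np.random.default_rng(3)
n, h, q, p = 200, 50, 8, 3            # n reduced from 10,000 to keep the dense GP cheap
coords = rng.uniform(size=(n, 2))
D = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

# Independent spatial factors with exponential correlation (illustrative decay values).
phi = rng.uniform(3.0, 30.0, size=q)
W = np.empty((n, q))
for k in range(q):
    C = np.exp(-phi[k] * D) + 1e-10 * np.eye(n)     # small jitter for numerical stability
    W[:, k] = np.linalg.cholesky(C) @ rng.standard_normal(n)

# Loadings with zeros in the upper triangle and ones on the diagonal.
Lam = np.tril(rng.standard_normal((h, q)), k=-1)
Lam[np.arange(q), np.arange(q)] = 1.0

# Same three predictors for every outcome; column j of Beta holds outcome j's coefficients.
x = np.column_stack([np.ones(n), rng.standard_normal((n, p - 1))])
Beta = rng.standard_normal((p, h))
psi = rng.uniform(0.1, 0.5, size=h)   # diagonal of the error covariance, one entry per outcome

# Row i of Z is z(s_i)'; x @ Beta equals X_z(s)' beta_z when predictors are shared.
Z = x @ Beta + W @ Lam.T + rng.standard_normal((n, h)) * np.sqrt(psi)
```

The resulting `Z` has one row per location and one column per outcome; holding out rows or masking entries reproduces the prediction and imputation settings described next.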

We assess the ability of model (3.5) to recover model parameters from the true data generating process, impute missing outcomes, and predict at out-of-sample locations. The SF-NNGP model was fitted for q_w ∈ {3, 5, 8, 10} spatial factors and assuming m = 10 neighbors. Out of the 10,000 locations, we assume all 50 outcomes to be missing in 200 locations chosen at random, and impute them. Additionally we hold out n₀ = 500 locations for out-of-sample prediction and model validation.

The first result worth highlighting is the gain in computational efficiency provided by the SF-NNGP. For this particular simulation exercise, a relatively computationally challenging problem, fitting the largest model considered (i.e., q_w = 10) with 50,000 MCMC iterations on a Linux server with an Intel i7 processor (two 8-core) and 16 GB of memory took 4.88 hours. As shown below, the proposed approach is able to recover the true model parameters, accurately impute missing data and generate precise predictions, all of these equipped with suitable uncertainty estimates.

For all values of q_w the SF-NNGP accurately recovered the regression coefficients ${\tilde{β}}_{z}$ for all predictors and responses (Figure 7 in the online supplement). In contrast, the quality of the estimates for the small-scale variance components ${\tilde{ψ}}_{k}^{z}$ was compromised when q_w was lower than the true number of spatial factors. This behavior is expected: for lower values of q_w the $ψ_{k}^{z}$’s attempt to compensate for the additional signal that the spatial component with too few spatial factors is unable to capture (Figure 8 in the online supplement). For q_w = 8 and q_w = 10, the coverage for ${\tilde{ψ}}_{z}$ was 88% and 84%, respectively, with all $ψ_{k}^{z}$ close to ${\tilde{ψ}}_{k}^{z}$ with tight 95% credible sets.

When q_w ≠ 8, the dimensions of the fitted Λ_z, ϕ_w, and w(s) do not match those of their analogs in the true model. Therefore, to assess the quality of fit for the spatial signal for all values of q_w considered, we instead compare the fitted spatial component w*(s) = Λ_zw(s), for $s \in T_{z}$ , to that of the true model, given by ${\tilde{w}}^{⋆} (s) = {\tilde{Λ}}_{z} \tilde{w} (s)$ .

For all locations in $T_{z}$ we calculate $Δ (s) = w^{⋆} (s) - {\tilde{w}}^{⋆} (s)$ (fitted minus true spatial signal) for each MCMC draw of the parameters, and obtain the median and 95% credible set for Δ(s). To facilitate visualization, in Figure 2 we show the results for only three responses selected at random from the 50 considered. The columns of each panel map quantiles 2.5, 50 and 97.5 for Δ(s) with 3 locations (13, 23 and 48) plotted by row. The fitted spatial signal when q_w ∈ {3, 5} recovers the true signal only partially, with coverages of 26.13% and 42.06% for q_w = 3 and q_w = 5, respectively. When q_w ∈ {8, 10} the spatial signal is recovered accurately, with coverage over all responses of 94.78% for q_w = 8 and 94.18% for q_w = 10.

Figure 2: Fitted minus true spatial signal, $Δ (s) = w^{⋆} (s) - {\tilde{w}}^{⋆} (s)$ , for locations s₁₃, s₂₃, s₄₈. From left to right the columns in each panel show percentiles 2.5, 50 and 97.5 for Δ(s), respectively.
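A minimal sketch of how Δ(s) and its coverage might be computed from stored MCMC draws is given here. The array layouts, function names, and synthetic inputs are hypothetical, and coverage is taken to mean the percentage of location-response pairs whose 95% credible set for Δ(s) contains zero, which is an assumption about how the reported figures were obtained.

```python
import numpy as np

def spatial_signal(lam_draws, w_draws):
    """Form w*(s) = Lambda w(s) for every MCMC draw.

    lam_draws: (n_draws, h, q) loading matrices.
    w_draws:   (n_draws, n_loc, q) spatial factors.
    Returns an array of shape (n_draws, n_loc, h).
    """
    return np.einsum("dhq,dnq->dnh", lam_draws, w_draws)

def delta_summary(w_star_draws, w_star_true, level=0.95):
    """Quantiles of Delta(s) per location/response and percent coverage."""
    delta = w_star_draws - w_star_true[None, :, :]
    alpha = (1.0 - level) / 2.0
    lower, median, upper = np.quantile(delta, [alpha, 0.5, 1.0 - alpha], axis=0)
    coverage = 100.0 * np.mean((lower <= 0.0) & (upper >= 0.0))
    return lower, median, upper, coverage

# Synthetic stand-in for sampler output (not the paper's data).
rng = np.random.default_rng(0)
n_draws, n_loc, h, q = 400, 30, 4, 2
lam_true = rng.normal(size=(h, q))
w_true = rng.normal(size=(n_loc, q))
w_star_true = w_true @ lam_true.T
lam_draws = lam_true + 0.05 * rng.normal(size=(n_draws, h, q))
w_draws = w_true + 0.05 * rng.normal(size=(n_draws, n_loc, q))

w_star_draws = spatial_signal(lam_draws, w_draws)
lower, median, upper, coverage = delta_summary(w_star_draws, w_star_true)
```

Because the comparison is made on the product Λ_z w(s), the same code applies whatever number of factors was fitted.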

In addition to the previous results, it is also encouraging to find that when the dimension of the SF-NNGP model matches that of the true model, both the factor loadings ( ${\tilde{Λ}}_{z}$ ) and the spatial decay parameters ( ${\tilde{ϕ}}_{z}$ ) from the true spatial process can be recovered accurately (Figures 3 and 4).

Figure 3: Fitted vs true factor loadings matrix parameters (95% credible sets and medians) for q_w = 8.

Figure 4: Fitted vs true spatial decay parameters (95% credible sets and medians) for q_w = 8.

Model performance in terms of the accuracy of both imputation and prediction improves drastically as the number of factors approaches the truth – see Figures 11 and 12 in the online supplement.

Table 1 provides a comparison as q_w varies in the SF-NNGP using different measures of out-of-sample predictive performance. In particular, the continuous ranked probability score (CRPS) (Equation (21) in Gneiting and Raftery, 2007) and the root mean squared prediction error (RMSPE) (Yeniay and Goktas, 2002) favor the model with q_w = 8. The coverage of the 95% credible intervals of the predictions was close to the nominal value for all q_w; however, the width of the interval rapidly decreases as q_w approaches the true number of spatial factors.

Table 1:

Out-of-sample prediction comparison across models with different number of spatial factors.

q_w	CRPS	RMSPE	95% Coverage	95% CI Width
3	0.85	1.61	95.82	6.14
5	0.67	1.28	95.43	4.79
8	0.45	0.83	94.78	3.10
10	0.45	0.83	94.84	3.10
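The four summaries in Table 1 can be estimated from posterior predictive draws along the following lines. This is a sketch under stated assumptions: the CRPS uses the sample form E|X − y| − ½E|X − X′| in negative orientation (smaller is better), the posterior mean is used as the point prediction for RMSPE, and the function names and synthetic inputs are hypothetical.

```python
import numpy as np

def crps_from_draws(draws, y):
    """Sample-based CRPS for one observation (smaller is better)."""
    draws = np.asarray(draws, dtype=float)
    term_obs = np.mean(np.abs(draws - y))
    term_spread = 0.5 * np.mean(np.abs(draws[:, None] - draws[None, :]))
    return term_obs - term_spread

def prediction_summary(pred_draws, y_true, level=0.95):
    """pred_draws: (n_draws, n_obs) predictive draws; y_true: (n_obs,)."""
    pred_draws = np.asarray(pred_draws, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(pred_draws, [alpha, 1.0 - alpha], axis=0)
    point = pred_draws.mean(axis=0)  # posterior mean (assumption)
    crps = np.mean([crps_from_draws(pred_draws[:, j], y_true[j])
                    for j in range(y_true.size)])
    return {
        "CRPS": float(crps),
        "RMSPE": float(np.sqrt(np.mean((point - y_true) ** 2))),
        "coverage": float(100.0 * np.mean((lower <= y_true) & (y_true <= upper))),
        "width": float(np.mean(upper - lower)),
    }

# Synthetic example, not the paper's data.
rng = np.random.default_rng(0)
y_true = rng.normal(size=40)
pred_draws = rng.normal(size=(300, 40))
summary = prediction_summary(pred_draws, y_true)
```

The pairwise term costs memory quadratic in the number of draws, so thinning the chain before scoring may be needed for long runs.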


Both the fitted values for the spatial signals and the out-of-sample predictions with q_w = 8 and q_w = 10 are practically indistinguishable from each other. Furthermore, the model with q_w = 8 accurately recovers all the true factor loadings (Figure 3). Interestingly, with q_w = 10, visual inspection of the estimates for columns 1 through 6 in Λ_z indicates that this model accurately estimates the corresponding true parameter values (see Figure 10 in the online supplement). However, in this same model the estimated parameter values in columns 7 and 8 of Λ_z display departures from their true values, and the 95% credible sets for all the unconstrained elements in the 9th and 10th columns of Λ_z contain zero (see Figure 9 in the online supplement). These results provide guidance regarding the selection of the number of factors q_w to use. As there is no gain in using the model with q_w = 10 over the one with q_w = 8 in terms of predictive accuracy or parameter fit, the results favor the more parsimonious model of the two.
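The column-wise check described here, whether the credible sets of a column's unconstrained loadings all contain zero, could be automated roughly as follows. The function name and array layout are hypothetical, and the default mask assumes that the unconstrained elements are those strictly below the diagonal; a different mask should be passed if the identifiability constraint differs.

```python
import numpy as np

def fraction_covering_zero(lam_draws, free=None, level=0.95):
    """Per column, the share of free loadings whose credible set contains 0.

    lam_draws: (n_draws, h, q) posterior draws of the loading matrix.
    free:      (h, q) boolean mask of unconstrained elements; defaults to
               the strictly lower triangle (an assumption of this sketch).
    """
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(lam_draws, [alpha, 1.0 - alpha], axis=0)
    contains_zero = (lower <= 0.0) & (upper >= 0.0)
    if free is None:
        free = np.tril(np.ones(contains_zero.shape, dtype=bool), k=-1)
    return (contains_zero & free).sum(axis=0) / free.sum(axis=0)

# Synthetic draws: two informative columns and one redundant column.
rng = np.random.default_rng(0)
h, q, n_draws = 6, 3, 1000
lam_mean = np.zeros((h, q))
lam_mean[:, :2] = 2.0
lam_draws = lam_mean + 0.1 * rng.normal(size=(n_draws, h, q))

frac = fraction_covering_zero(lam_draws)
```

A column with a fraction of 1 behaves like the 9th and 10th columns in the q_w = 10 fit and is a candidate for removal.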

5. Modeling LiDAR Signals and Forest Structure

Our focus in the subsequent analysis is to assess and interpret the utility of SF-NNGP spatial factors to explain variability in the three forest outcomes defined in Section 2, measured on the BCEF. Following the two stage model developed in Section 3.2, we fit (3.5) using q_w ∈ {1, 2, 3, 4, 5, 6, 7, 8} spatial factors and m = 10 neighbors to the BCEF LiDAR data comprising n_z = 50,197 signals, each of length h_z = 57. The model mean included only an intercept. Prior specification followed Section 3.3, with the support for elements in ϕ_w adjusted to match the BCEF spatial extent.

The n_y = 197 locations with h_y=3 forest outcomes were used in the second stage model (3.6). To more clearly interpret the spatial factors’ ability to explain variability in forest outcomes, we decided to avoid potential issues with spatial confounding (Hanks et al., 2015) and set v(s) to zero. In practice, however, if our main objective is to maximize predictive performance then this residual spatial random effect should likely be included in the model. In addition to the spatial factors, the second stage model was informed by the three Landsat 8 tasseled cap predictor variables defined in Section 2 which, along with an intercept, were included in X_y(s). Importantly, these predictor variables are available across the entire BCEF, hence, given predicted values of the spatial factors at unobserved locations, we can create complete-coverage forest outcome maps.
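To make the structure of the second stage concrete, the sketch below regresses three outcomes on an intercept, three predictors, and q_w = 5 factor values by ordinary least squares. This is a simplified stand-in and not the Bayesian fit of (3.6): it assumes that with v(s) set to zero the mean is linear in X_y(s) and w(s), it plugs in point values of the factors rather than propagating their uncertainty, and all data are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
n_y, q_w, n_pred, n_out = 197, 5, 3, 3

# Placeholder design: intercept plus three predictors standing in for the
# tasseled cap variables, and point values of the spatial factors.
X = np.column_stack([np.ones(n_y), rng.normal(size=(n_y, n_pred))])
W = rng.normal(size=(n_y, q_w))
D = np.hstack([X, W])

# Synthetic outcomes generated from known coefficients.
coef_true = rng.normal(size=(D.shape[1], n_out))
Y = D @ coef_true + 0.1 * rng.normal(size=(n_y, n_out))

# Least-squares fit for all three outcomes at once.
coef_hat, *_ = np.linalg.lstsq(D, Y, rcond=None)
beta_hat = coef_hat[:X.shape[1]]        # (4, 3) predictor coefficients
lambda_y_hat = coef_hat[X.shape[1]:].T  # (3, 5) factor coefficients
```

In this layout each row of `lambda_y_hat` corresponds to one outcome and each column to one spatial factor.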

Posterior inference for all candidate models was based on three chains of 50,000 post burn-in MCMC samples. Chains converged by 20,000 MCMC iterations. Using the same computer configuration detailed in Section 4, total runtime for the most demanding model, i.e., q_w = 8, was ~36 hours.

The eight candidate models, specified by q_w, were assessed based on their ability to inform forest outcome prediction. This was done by fitting each of the first stage models, then fitting their corresponding second stage models using data from 99 of the 197 available locations in $T_{z}$ . The three forest outcomes were then predicted for the remaining 98 out-of-sample locations. Scoring rules and other summaries of the posterior predictive distributions for the 98 out-of-sample locations are presented in Table 2.

Table 2:

Cross-validation prediction summary for forest outcomes given increasing number of spatial factors q_w. Bold values identify lowest CRPS and RMSPE.

Outcome	q_w	CRPS	RMSPE	95% Coverage	95% CI Width
AGB	1	26.21	51.37	91.88	161.24
AGB	2	26.36	52.02	92.39	162.14
AGB	3	23.64	46.95	95.94	155.71
AGB	4	**23.53**	**46.93**	93.91	155.66
AGB	5	24	47.54	96.45	157.75
AGB	6	24.47	47.8	94.92	172.64
AGB	7	24.75	47.84	95.43	174.44
AGB	8	24.76	48.02	96.45	182.12
TD	1	1017.7	1980.62	92.39	6010.6
TD	2	1006.02	1957.54	93.4	5944.81
TD	3	1007.72	1954.87	93.4	6068.29
TD	4	997.32	1955.2	93.4	6040.06
TD	5	**989.31**	**1930.76**	94.92	6182.2
TD	6	998.3	1944.22	94.42	6223.73
TD	7	1005.26	1965.81	95.43	6450.5
TD	8	1004.36	1955.08	96.95	6503.17
BA	1	5.53	10.29	91.88	36.34
BA	2	5.4	10.01	94.42	36.85
BA	3	**5.13**	**9.54**	93.91	35.16
BA	4	5.17	9.62	93.4	36.21
BA	5	5.16	9.58	93.4	36.51
BA	6	5.2	9.59	96.45	38.62
BA	7	5.24	9.73	95.43	38.34
BA	8	5.27	9.72	94.42	37.93


Increasing the number of spatial factors improves CRPS and RMSPE for each forest outcome (Table 2). Exploratory analysis showed gains in predictive performance were negligible beyond q_w = 4 for AGB and q_w = 5 for TD and BA. Given that the q_w = 5 model generally yielded the “best” predictions, it was selected for exposition below.

Table 3 provides estimates for the second stage model’s spatial factor regression coefficients, i.e., elements in Λ_y. These results show that several of the spatial factors explain a substantial portion of variability in the forest outcomes. It is, however, difficult to interpret the λ^(y)’s without a sense of what characteristic of z(s) the spatial factors are capturing. When considered with estimates in Table 3, Figure 5 provides some biological interpretation of the spatial factors. Specifically, each panel in Figure 5 represents a spatial factor. The 50 lines in each panel are observed LiDAR signals, colored according to whether they correspond to the 25 largest (blue lines) or 25 smallest (red lines) estimated spatial factor values.

Table 3:

Elements of Λ_y median and 95% credible intervals for the q_w = 5 model. Bold entries indicate where the 95% credible interval excludes zero.

Parameter	50% (2.5%, 97.5%)
$λ_{AGB, 1}^{(y)}$	**−6.65 (−8.89, −4.23)**
$λ_{AGB, 2}^{(y)}$	27.20 (−14.11, 65.14)
$λ_{AGB, 3}^{(y)}$	**−278.29 (−324.52, −232.28)**
$λ_{AGB, 4}^{(y)}$	−46.15 (−162.56, 75.91)
$λ_{AGB, 5}^{(y)}$	**−308.81 (−524.42, −90.45)**
$λ_{TD, 1}^{(y)}$	−1.77 (−21.35, 17.60)
$λ_{TD, 2}^{(y)}$	**−357.49 (−718.82, −7.86)**
$λ_{TD, 3}^{(y)}$	269.03 (−137.51, 667.62)
$λ_{TD, 4}^{(y)}$	**−1777.21 (−2696.67, −708.08)**
$λ_{TD, 5}^{(y)}$	**2457.52 (681.18, 4337.97)**
$λ_{BA, 1}^{(y)}$	**−2.93 (−3.94, −1.75)**
$λ_{BA, 2}^{(y)}$	−2.07 (−19.02, 15.79)
$λ_{BA, 3}^{(y)}$	**−98.64 (−119.79, −76.24)**
$λ_{BA, 4}^{(y)}$	**−72.00 (−120.60, −23.00)**
$λ_{BA, 5}^{(y)}$	−80.55 (−177.44, 20.51)


Figure 5: Observed LiDAR signals with the 25 largest (*High* in the legend) and 25 smallest (*Low* in the legend) values of w(s)’s elements from the q_w = 5 model.
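Selecting the signals shown in each panel amounts to ranking locations by one factor's estimated value and keeping both tails. A minimal sketch, with hypothetical names and synthetic arrays standing in for the factor estimates and the observed signals:

```python
import numpy as np

def extreme_locations(factor_values, k=25):
    """Indices of the k smallest and k largest values of one spatial factor."""
    order = np.argsort(factor_values)
    return order[:k], order[-k:]

# Synthetic stand-ins: factor estimates (n_loc, q_w) and signals (n_loc, h_z).
rng = np.random.default_rng(0)
w_hat = rng.normal(size=(1000, 5))
signals = rng.random(size=(1000, 57))

# Signals for the first factor's panel.
low_idx, high_idx = extreme_locations(w_hat[:, 0])
low_signals = signals[low_idx]    # "Low" group
high_signals = signals[high_idx]  # "High" group
```

Repeating the call for each column of `w_hat` gives the groups for the remaining panels.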

There are some general biological relationships between forest canopy structure and AGB, TD, and BA. Very low maximum canopy height is indicative of a young regenerating forest (e.g., regrowth after a fire) that would be characterized by low AGB, high TD, and low BA. If the majority of trees in a forest have a high canopy height then we expect high AGB, low TD, and high BA (i.e., few large diameter mature trees dominate the area). When the forest is characterized by trees of many different heights (i.e., tree crowns in several vertical strata) then we might expect moderate/high AGB, moderate TD, and moderate/high BA. Some of these expected relationships are observed when comparing Table 3 and Figure 5. For example, the top left panel in Figure 5 differentiates between regenerating forests and all other forest structure, i.e., blue lines show a spike of energy returned at or near ground level, whereas red lines show that the majority of the energy is returned at or above several meters; hence the negative regression coefficients $λ_{AGB, 1}^{(y)}$ and $λ_{BA, 1}^{(y)}$ in Table 3. The LiDAR signals shown in the top right panel in Figure 5 differentiate between young and old single cohort forests (i.e., all trees were regenerated around the same time and there is little vertical variation in canopy height); hence the negative $λ_{AGB, 3}^{(y)}$ and $λ_{BA, 3}^{(y)}$ in Table 3. The top middle and bottom left panels in Figure 5 generally separate mature forest with 20+ and ~20 meter canopy height (blue signals), respectively, from lower stature forest with ~10 meter canopy height. Consistent with the biological expectation, the negative $λ_{TD, 2}^{(y)}$ and $λ_{TD, 4}^{(y)}$ suggest forests associated with red LiDAR signals have higher tree density relative to the older taller forests.

As detailed in Section 1, complete coverage maps of the forest outcomes with associated uncertainty estimates are important data products that can be delivered by the proposed two stage model. Following Section 3.4 and using the full data set depicted in Figure 1, we predicted the forest outcomes on a 30 × 30 m grid over the BCEF. Figure 6 provides median and 95% credible interval width maps for each outcome. Non-forested areas were omitted (white regions on the maps). Posterior predictive point estimates match well with the distribution of the forest outcomes across the BCEF and are clearly informed by the LiDAR factors, which capture key forest structure characteristics. Most importantly, the prediction uncertainty maps, displayed in the right column of Figure 6, accurately reflect our lack of information for prediction units that are far from the flight lines where LiDAR data are available, i.e., we achieve more precise posterior predictive distributions along and adjacent to locations where LiDAR data are available. Far from the LiDAR flight lines, prediction is informed only by the Landsat 8 tasseled cap predictor variables, which in this study explained very little variability in the forest outcomes.

Figure 6: Model q_w = 5 posterior predictive distribution median and 95% CI width for AGB, TD, and BA forest variables over Bonanza Creek Experimental Forest.
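Maps of this kind can be assembled from per-cell posterior predictive draws as sketched here. The function name, grid shape, and forest mask are hypothetical, and the draws are synthetic; cells outside the mask are left as NaN so that they render blank.

```python
import numpy as np

def summary_maps(pred_draws, rows, cols, shape, level=0.95):
    """Rasterize the predictive median and credible-interval width.

    pred_draws: (n_draws, n_cells) draws for the forested cells only.
    rows, cols: grid indices of those cells.
    shape:      (n_rows, n_cols) of the output rasters.
    """
    alpha = (1.0 - level) / 2.0
    lower, median, upper = np.quantile(pred_draws, [alpha, 0.5, 1.0 - alpha], axis=0)
    median_map = np.full(shape, np.nan)
    width_map = np.full(shape, np.nan)
    median_map[rows, cols] = median
    width_map[rows, cols] = upper - lower
    return median_map, width_map

# Tiny synthetic grid with a made-up forest mask.
rng = np.random.default_rng(0)
shape = (4, 5)
forest = rng.random(shape) > 0.3
rows, cols = np.nonzero(forest)
pred_draws = rng.normal(loc=100.0, scale=10.0, size=(500, rows.size))

median_map, width_map = summary_maps(pred_draws, rows, cols, shape)
```

For a grid the size of the BCEF the draws would likely need to be processed in blocks of cells to keep memory use bounded.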

6. Concluding Remarks

We formulated an approach to model high-dimensional spatial data over a large set of locations, and developed an efficient implementation in C++. The SF-NNGP enables the analysis of multivariate spatially referenced datasets that, due to their magnitude, could not have been rigorously explored before. It does so by combining the ability of SFMs to compress the signal from high-dimensional structures into a few dimensions with the computational scalability of NNGPs.

The algorithm was used to exploit the information from the high-dimensional LiDAR signals to jointly model and generate LiDAR based maps of multiple forest variables. Importantly, the proposed two stage model provides a viable approach to producing spatially continuous maps from sparsely sampled LiDAR and forest measurements, and delivers spatially explicit uncertainty quantification that captures the irregular distribution of information across the domain of interest. Such frameworks will become increasingly important as sampling LiDAR systems, such as GEDI, come on-line in the near future. These approaches can also be extended to help guide LiDAR and field data acquisition to minimize prediction uncertainty.

Importantly, when fitting a spatial factor model one must choose the number of factors q_w to be used in the model; there are different strategies to address this issue. The approach we adopt here (looking at the out-of-sample evaluation metrics for different choices of q_w and selecting the one where the curves flatten out) is a pragmatic solution and is similar in spirit to cross-validation approaches commonly used to tune hyper-parameters in richly parametrized models. Like any other cross-validation approach, this leads to additional computation, but parallel computing opens the possibility of conducting simultaneous MCMC runs for different values of q_w. As shown both in the simulation experiment and in the BCEF data analysis, this heuristic provides sufficiently good results. Other automated rank selection schemes are available in the literature, such as those proposed in Lopes and West (2004) and in Ren and Banerjee (2013); however, these drastically increase the computational burden of an already computationally costly problem.
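One way to make the "curves flatten out" rule explicit is to pick the smallest q_w whose score is within a small relative tolerance of the best score. The sketch below applies this to the CRPS and RMSPE values reported in Table 1; the function name and the 2% tolerance are hypothetical choices, not part of the analysis, which relied on inspection of the metrics.

```python
def smallest_adequate_q(scores, rel_tol=0.02):
    """Smallest q_w whose score is within rel_tol of the best (lowest) score.

    scores: dict mapping q_w to a negatively oriented score such as CRPS.
    """
    best = min(scores.values())
    for q in sorted(scores):
        if scores[q] <= best * (1.0 + rel_tol):
            return q

# Simulation results from Table 1.
crps_sim = {3: 0.85, 5: 0.67, 8: 0.45, 10: 0.45}
rmspe_sim = {3: 1.61, 5: 1.28, 8: 0.83, 10: 0.83}

q_crps = smallest_adequate_q(crps_sim)
q_rmspe = smallest_adequate_q(rmspe_sim)
```

Both metrics select q_w = 8 here, matching the more parsimonious of the two best models; with noisier scores the result can be sensitive to the tolerance.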

A research direction we are keen on exploring is an extension for spatio-temporal data. For this type of data it is necessary to posit a strategy to select the neighbors in the spatio-temporal domain, following the discussion presented in Datta et al. (2016a).

Although our method presents a substantial improvement in terms of scalability over existing approaches, further efforts are required to scale multivariate spatial methods to truly massive datasets. For instance, the ultimate goal for forest variable mapping assisted by sampled LiDAR in interior Alaska is a complete-coverage map of the entire domain (e.g., 46 million ha), which could easily require models capable of assimilating LiDAR signals in more than 10⁸ locations.
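A rough count supports this order of magnitude. The calculation below assumes, purely for illustration, that the 46 million ha domain is tiled at the 30 × 30 m resolution used for the BCEF maps; the number of LiDAR signal locations in a real acquisition would depend on the sampling design.

```python
# Back-of-the-envelope count of 30 m cells in a 46 million ha domain.
area_ha = 46e6            # domain area in hectares (from the text)
m2_per_ha = 10_000        # square meters per hectare
cell_side_m = 30.0        # grid resolution (assumption, taken from Section 5)

n_cells = area_ha * m2_per_ha / cell_side_m**2
print(f"{n_cells:.2e} cells")
```

Under these assumptions the count is roughly 5 × 10⁸ cells, consistent with the stated need to handle more than 10⁸ locations.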

Supplementary Material

NIHMS996762-supplement-1.pdf (2.9 MB, pdf)

Acknowledgments

The research presented in this study was partially supported by NASA’s Arctic-Boreal Vulnerability Experiment (ABoVE) and Carbon Monitoring System (CMS) programs. Additional support was provided by the United States Forest Service Pacific Northwest Research Station. Finley was supported by National Science Foundation (NSF) DMS-1513481, EF-1137309, EF-1241874, and Finley and Taylor-Rodriguez were supported on EF-1253225. Banerjee was supported by NSF DMS-1513654, NSF IIS-1562303, and NIH/NIEHS 1R01ES027027-01.

Footnotes

Supplementary Materials

The supplementary materials include (1) background information on NNGPs and spatial factor models, (2) the sampling algorithm for the SF-NNGP, and (3) additional simulation results.

Contributor Information

Daniel Taylor-Rodriguez, Email: dantayrod@pdx.edu.

Andrew O. Finley, Email: finleya@msu.edu.

Abhirup Datta, Email: abhidatta@jhu.edu.

Chad Babcock, Email: babcoc76@uw.edu.

Hans-Erik Andersen, Email: handersen@fs.us.

Bruce D. Cook, Email: bruce.cook@nasa.gov.

Douglas C. Morton, Email: douglas.morton@nasa.gov.

Sudipto Banerjee, Email: sudipto@ucla.edu.

References

Aguilar O and West M (2010). Bayesian Dynamic Factor Models and Portfolio Allocation. Journal of Business & Economic Statistics, 18(3):338–357.
Andersen H-E, Strunk J, and Temesgen H (2011). Using airborne light detection and ranging as a sampling tool for estimating forest biomass resources in the upper Tanana Valley of interior Alaska. Western Journal of Applied Forestry, 26(4):157–164.
Anderson T (2003). An Introduction to Multivariate Statistical Analysis. 3rd edition. Wiley Series in Probability and Statistics, Hoboken, NJ.
Asner G, Hughes R, Varga T, Knapp D, and Kennedy-Bowdoin T (2009). Environmental and biotic controls over aboveground biomass throughout a tropical rain forest. Ecosystems, 12(2):261–278.
Babcock C, Finley AO, Andersen H-E, Pattison R, Cook BD, Morton DC, Alonzo M, Nelson R, Gregoire T, Ene L, Gobakken T, and Næsset E (2017). Geostatistical estimation of forest biomass in interior alaska combining landsat-derived tree cover, sampled airborne lidar and field observations. ArXiv e-prints. https://arxiv.org/pdf/1705.03534.pdf.
Babcock C, Finley AO, Bradford JB, Kolka R, Birdsey R, and Ryan MG (2015). Lidar based prediction of forest biomass using hierarchical models with spatially varying coefficients. Remote Sensing of Environment, 169:113–127.
Babcock C, Matney J, Finley A, Weiskittel A, and Cook B (2013). Multivariate spatial regression models for predicting individual tree structure variables using lidar data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 6(1, SI):6–14.
Baig MHA, Zhang L, Shuai T, and Tong Q (2014). Derivation of a tasselled cap transformation based on landsat 8 at-satellite reflectance. Remote Sensing Letters, 5(5):423–431.
Banerjee S (2017). High-dimensional bayesian geostatistics. Bayesian Anal, 12(2):583–614.
Banerjee S, Gelfand AE, Finley AO, and Sang H (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):825–848.
Bechtold WA and Patterson PL (2005). The Enhanced Forest Inventory and Analysis Program: National Sampling Design and Estimation Procedures. US Department of Agriculture Forest Service, Southern Research Station Asheville, North Carolina.
Blackford LS, Demmel J, Dongarra J, Duff I, Hammarling S, Henry G, Heroux M, Kaufman L, Lumsdaine A, Petitet A, Pozo R, Remington K, and Whaley RC (2001). An updated set of basic linear algebra subprograms (blas). ACM Transactions on Mathematical Software, 28:135–151.
Bolton DK, Coops NC, and Wulder MA (2013). Measuring forest structure along productivity gradients in the Canadian boreal with small-footprint lidar. Environmental monitoring and assessment, 185(8):6617–6634.
Bonanza Creek LTER (2016). Bonanza Creek Experimental Forest. http://www.lter.uaf.edu/research/study-sites-bcef. Accessed: 12-16-2017.
Chiles J-P and Delfiner P (2009). Geostatistics: modeling spatial uncertainty, volume 497. John Wiley & Sons.
Christensen WF and Amemiya Y (2002). Latent variable analysis of multivariate spatial data. Journal of the American Statistical Association, 97(457):302–317.
Cook B, Corp L, Nelson R, Middleton E, Morton D, McCorkel J, Masek J, Ranson K, Ly V, and Montesano P (2013). NASA Goddard’s lidar, hyperspectral and thermal (G-LiHT) airborne imager. Remote Sensing, 5(8):4045–4066.
Dagum L and Menon R (1998). Openmp: an industry standard api for shared-memory programming. Computational Science & Engineering, IEEE, 5(1):46–55.
Datta A, Banerjee S, Finley A, Hamm NA, and Schaap M (2016a). Non-Separable Dynamic Nearest-Neighbor Gaussian Process Models for Large Spatio-Temporal Data With an Application to Particulate Matter Analysis. Annals of Applied Statistics, 44(2):629–659.
Datta A, Banerjee S, Finley AO, and Gelfand AE (2016b). Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets. Journal of the American Statistical Association, 111(514):800–812.
Datta A, Banerjee S, Finley AO, and Gelfand AE (2016c). On nearest-neighbor Gaussian process models for massive spatial data. Wiley Interdisciplinary Reviews: Computational Statistics, 8(5):162–171.
Ene LT, Gobakken T, Andersen H-E, Næsset E, Cook BD, Morton DC, Babcock C, and Nelson R (2018). Large-area hybrid estimation of aboveground biomass in interior alaska using airborne laser scanning data. Remote Sensing of Environment, 204(Supplement C):741–755.
Finley AO, Banerjee S, and Cook BD (2014a). Bayesian hierarchical models for spatially misaligned data in R. Methods in Ecology and Evolution, 5(6):514–523.
Finley AO, Banerjee S, Ek AR, and McRoberts RE (2008). Bayesian multivariate process modeling for prediction of forest attributes. Journal of Agricultural, Biological, and Environmental Statistics, 13(1):60.
Finley AO, Banerjee S, Weiskittel AR, Babcock C, and Cook BD (2014b). Dynamic spatial regression models for space-varying forest stand tables. Environmetrics, 25(8):596–609.
Finley AO, Banerjee S, Zhou Y, Cook BD, and Babcock C (2017a). Joint hierarchical models for sparsely sampled high-dimensional LiDAR and forest variables. Remote Sensing of Environment, 190:149–161.
Finley AO, Datta A, Cook BC, Morton DC, Andersen HE, and Banerjee S (2017b). Efficient algorithms for bayesian nearest neighbor gaussian processes. ArXiv e-prints. https://arxiv.org/abs/1702.00434.
G-LiHT (2016). Goddard’s lidar hyperspectral and thermal (G-LiHT) imager. http://www.gliht.gsfc.nasa.gov. Accessed: 8-11-2017.
GEDI (2014). Global ecosystem dynamics investigation lidar. http://science.nasa.gov/missions/gedi/. Accessed: 8-11-2017.
Genton MG and Kleiber W (2015). Cross-Covariance Functions for Multivariate Geostatistics. Statistical Science, 30(2):147–163.
Geweke J and Zhou G (1996). Measuring the pricing error of the arbitrage pricing theory.
Gneiting T and Raftery AE (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378.
Hanks EM, Schliep EM, Hooten MB, and Hoeting JA (2015). Restricted spatial regression in practice: geostatistical models, confounding, and robustness under model misspecification. Environmetrics, 26(4):243–254.
Heaton MJ, Datta A, Finley AO, Furrer R, Guhaniyogi R, Gerber F, Gramacy RB, Hammerling D, Katzfuss M, Lindgren F, Nychka DW, Sun F, and Zammit-Mangion A (2017). Methods for analyzing large spatial data: A review and comparison. ArXiv e-prints. https://arxiv.org/abs/1710.05013.
Hogan JW and Tchernis R (2004). Bayesian factor analysis for spatially correlated data, with application to summarizing area-level material deprivation from census data. Journal of the American Statistical Association, 99(466):314–324.
ICESat-2 (2015). Ice, cloud, and land elevation satellite-2. http://icesat.gsfc.nasa.gov/icesat2. Accessed: 8-11-2017.
Jakubowski MK, Guo Q, and Kelly M (2013). Tradeoffs between lidar pulse density and forest measurement accuracy. Remote Sensing of Environment, 130(Supplement C):245–253.
Junttila V and Laine M (2017). Bayesian principal component regression model with spatial effects for forest inventory variables under small field sample size. Remote Sensing of Environment, 192(Supplement C):45–57.
Lauritzen SL (1996). Graphical Models. Clarendon Press, Oxford, United Kingdom.
Lopes HF and West M (2004). Bayesian model assessment in factor analysis. Statistica Sinica, 14:41–67.
Murphy K (2012). Machine Learning: A probabilistic perspective. The MIT Press, Cambridge, MA.
Næsset E (2011). Estimating above-ground biomass in young forests with airborne laser scanning. International Journal of Remote Sensing, 32(2):473–501.
Nelson R, Gobakken T, Næsset E, Gregoire T, Ståhl G, Holm S, and Flewelling J (2012). Lidar sampling – using an airborne profiler to estimate forest biomass in Hedmark County, Norway. Remote Sensing of Environment, 123:563–578.
Nelson R, Margolis H, Montesano P, Sun G, Cook B, Corp L, Andersen H-E, deJong B, Pellat FP, Fickel T, Kauffman J, and Prisley S (2017). Lidar-based estimates of aboveground biomass in the continental us and mexico using ground, airborne, and satellite observations. Remote Sensing of Environment, 188(Supplement C):127–140.
Ren Q and Banerjee S (2013). Hierarchical Factor Models for Large Spatially Misaligned Data: A Low-Rank Predictive Process Approach. Biometrics, 69(1):19–30.
Schmidt AM and Gelfand AE (2003). A bayesian coregionalization approach for multivariate pollutant data. Journal of Geophysical Research: Atmospheres, 108(D24).
Stein ML (2014). Limitations on low rank approximations for covariance matrices of spatial data. Spatial Statistics, 8:1–19.
Ver Hoef J and Barry R (1998). Modeling crossvariograms for cokriging and multivariable spatial prediction. Journal of Statistical Planning and Inference, 69:275–294.
White JC, Wulder MA, Varhola A, Vastaranta M, Coops Nicholas C, Cook BD, Pitt D, and Woods M (2013). A best practices guide for generating forest inventory attributes from airborne laser scanning data using an area-based approach. The Forestry Chronicle, 89(06):722–723.
Woodall CW, Coulston JW, Domke GM, Walters BF, Wear DN, Smith JE, Andersen H-E, Clough BJ, Cohen WB, Griffith DM, et al. (2015). The US forest carbon accounting framework: Stocks and stock change, 1990–2016.
Yeniay O and Goktas A (2002). A comparison of partial least squares regression with other prediction methods. Hacettepe Journal of Mathematics and Statistics, 31(99):99–101.
Zhang H (2007). Maximum-likelihood estimation for multivariate spatial linear coregionalization models. Environmetrics, 18(2):125–139.
Zhang X (2016). An optimized blas library based on gotoblas2. https://github.com/xianyi/OpenBLAS/. Accessed 2015-06-01.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

NIHMS996762-supplement-1.pdf^{(2.9MB, pdf)}

[R1] Aguilar O and West M (2010). Bayesian Dynamic Factor Models and Portfolio Allocation. Journal of Business & Economic Statistics, 18(3):338–357. [Google Scholar]

[R2] Andersen H-E, Strunk J, and Temesgen H (2011). Using airborne light detection and ranging as a sampling tool for estimating forest biomass resources in the upper Tanana Valley of interior Alaska. Western Journal of Applied Forestry, 26(4):157–164. [Google Scholar]

[R3] Anderson T (2003). An Introduction to Multivariate Statistical Analysis. 3rd edition Wiley Series in Probability and Statistics, Hoboken, NJ. [Google Scholar]

[R4] Asner G, Hughes R, Varga T, Knapp D, and Kennedy-Bowdoin T (2009). Environmental and biotic controls over aboveground biomass throughout a tropical rain forest. Ecosystems, 12(2):261–278. [Google Scholar]

[R5] Babcock C, Finley AO, Andersen H-E, Pattison R, Cook BD, Morton DC, Alonzo M, Nelson R, Gregoire T, Ene L, Gobakken T, and Næsset E (2017). Geostatistical estimation of forest biomass in interior alaska combining landsat-derived tree cover, sampled airborne lidar and field observations. ArXiv e-prints. https://arxiv.org/pdf/1705.03534.pdf. [Google Scholar]

[R6] Babcock C, Finley AO, Bradford JB, Kolka R, Birdsey R, and Ryan MG (2015). Lidar based prediction of forest biomass using hierarchical models with spatially varying coefficients. Remote Sensing of Environment, 169:113–127. [Google Scholar]

[R7] Babcock C, Matney J, Finley A, Weiskittel A, and Cook B (2013). Multivariate spatial regression models for predicting individual tree structure variables using lidar data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 6(1, SI):6–14. [Google Scholar]

[R8] Baig MHA, Zhang L, Shuai T, and Tong Q (2014). Derivation of a tasselled cap transformation based on landsat 8 at-satellite reflectance. Remote Sensing Letters, 5(5):423–431. [Google Scholar]

[R9] Banerjee S (2017). High-dimensional bayesian geostatistics. Bayesian Anal, 12(2):583–614. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R10] Banerjee S, Gelfand AE, Finley AO, and Sang H (2008). Gaussian predictive process models for large spatial data sets. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):825–848. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R11] Bechtold WA and Patterson PL (2005). The Enhanced Forest Inventory and Analysis Program: National Sampling Design and Estimation Procedures. US Department of Agriculture Forest Service, Southern Research Station Asheville, North Carolina. [Google Scholar]

[R12] Blackford LS, Demmel J, Dongarra J, Duff I, Hammarling S, Henry G, Heroux M, Kaufman L, Lumsdaine A, Petitet A, Pozo R, Remington K, and Whaley RC (2001). An updated set of basic linear algebra subprograms (blas). ACM Transactions on Mathematical Software, 28:135–151. [Google Scholar]

[R13] Bolton DK, Coops NC, and Wulder MA (2013). Measuring forest structure along productivity gradients in the Canadian boreal with small-footprint lidar. Environmental monitoring and assessment, 185(8):6617–6634. [DOI] [PubMed] [Google Scholar]

[R14] Bonanza Creek LTER (2016). Bonanza Creek Experimental Forest. http://www.lter.uaf.edu/research/study-sites-bcef. Accessed: 12-16-2017. [Google Scholar]

[R15] Chiles J-P and Delfiner P (2009). Geostatistics: modeling spatial uncertainty, volume 497 John Wiley & Sons. [Google Scholar]

[R16] Christensen WF and Amemiya Y (2002). Latent variable analysis of multivariate spatial data. Journal of the American Statistical Association, 97(457):302–317. [Google Scholar]

[R17] Cook B, Corp L, Nelson R, Middleton E, Morton D, McCorkel J, Masek J, Ranson K, Ly V, and Montesano P (2013). NASA Goddard’s lidar, hyperspectral and thermal (G-LiHT) airborne imager. Remote Sensing, 5(8):4045–4066. [Google Scholar]

[R18] Dagum L and Menon R (1998). Openmp: an industry standard api for shared-memory programming. Computational Science & Engineering, IEEE, 5(1):46–55. [Google Scholar]

[R19] Datta A, Banerjee S, Finley A, Hamm NA, and Schaap M (2016a). Non-Separable Dynamic Nearest-Neighbor Gaussian Process Models for Large Spatio-Temporal Data With an Application to Particulate Matter Analysis. Annals of Applied Statistics Statistics, 44(2):629–659. [DOI] [PMC free article] [PubMed] [Google Scholar]

[R20] Datta A, Banerjee S, Finley AO, and Gelfand AE (2016b). Hierarchical Nearest-Neighbor Gaussian Process Models for Large Geostatistical Datasets. Journal of the American Statistical Association, 111(514):800–812.

[R21] Datta A, Banerjee S, Finley AO, and Gelfand AE (2016c). On nearest-neighbor Gaussian process models for massive spatial data. Wiley Interdisciplinary Reviews: Computational Statistics, 8(5):162–171.

[R22] Ene LT, Gobakken T, Andersen H-E, Næsset E, Cook BD, Morton DC, Babcock C, and Nelson R (2018). Large-area hybrid estimation of aboveground biomass in interior Alaska using airborne laser scanning data. Remote Sensing of Environment, 204(Supplement C):741–755.

[R23] Finley AO, Banerjee S, and Cook BD (2014a). Bayesian hierarchical models for spatially misaligned data in R. Methods in Ecology and Evolution, 5(6):514–523.

[R24] Finley AO, Banerjee S, Ek AR, and McRoberts RE (2008). Bayesian multivariate process modeling for prediction of forest attributes. Journal of Agricultural, Biological, and Environmental Statistics, 13(1):60.

[R25] Finley AO, Banerjee S, Weiskittel AR, Babcock C, and Cook BD (2014b). Dynamic spatial regression models for space-varying forest stand tables. Environmetrics, 25(8):596–609.

[R26] Finley AO, Banerjee S, Zhou Y, Cook BD, and Babcock C (2017a). Joint hierarchical models for sparsely sampled high-dimensional LiDAR and forest variables. Remote Sensing of Environment, 190:149–161.

[R27] Finley AO, Datta A, Cook BD, Morton DC, Andersen HE, and Banerjee S (2017b). Efficient algorithms for Bayesian nearest neighbor Gaussian processes. ArXiv e-prints. https://arxiv.org/abs/1702.00434.

[R28] G-LiHT (2016). Goddard’s lidar hyperspectral and thermal (G-LiHT) imager. http://www.gliht.gsfc.nasa.gov. Accessed: 8-11-2017.

[R29] GEDI (2014). Global ecosystem dynamics investigation lidar. http://science.nasa.gov/missions/gedi/. Accessed: 8-11-2017.

[R30] Genton MG and Kleiber W (2015). Cross-Covariance Functions for Multivariate Geostatistics. Statistical Science, 30(2):147–163.

[R31] Geweke J and Zhou G (1996). Measuring the pricing error of the arbitrage pricing theory.

[R32] Gneiting T and Raftery AE (2007). Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378.

[R33] Hanks EM, Schliep EM, Hooten MB, and Hoeting JA (2015). Restricted spatial regression in practice: geostatistical models, confounding, and robustness under model misspecification. Environmetrics, 26(4):243–254.

[R34] Heaton MJ, Datta A, Finley AO, Furrer R, Guhaniyogi R, Gerber F, Gramacy RB, Hammerling D, Katzfuss M, Lindgren F, Nychka DW, Sun F, and Zammit-Mangion A (2017). Methods for analyzing large spatial data: A review and comparison. ArXiv e-prints. https://arxiv.org/abs/1710.05013.

[R35] Hogan JW and Tchernis R (2004). Bayesian factor analysis for spatially correlated data, with application to summarizing area-level material deprivation from census data. Journal of the American Statistical Association, 99(466):314–324.

[R36] ICESat-2 (2015). Ice, cloud, and land elevation satellite-2. http://icesat.gsfc.nasa.gov/icesat2. Accessed: 8-11-2017.

[R37] Jakubowski MK, Guo Q, and Kelly M (2013). Tradeoffs between lidar pulse density and forest measurement accuracy. Remote Sensing of Environment, 130(Supplement C):245–253.

[R38] Junttila V and Laine M (2017). Bayesian principal component regression model with spatial effects for forest inventory variables under small field sample size. Remote Sensing of Environment, 192(Supplement C):45–57.

[R39] Lauritzen SL (1996). Graphical Models. Clarendon Press, Oxford, United Kingdom.

[R40] Lopes HF and West M (2004). Bayesian model assessment in factor analysis. Statistica Sinica, 14:41–67.

[R41] Murphy K (2012). Machine Learning: A probabilistic perspective. The MIT Press, Cambridge, MA.

[R42] Næsset E (2011). Estimating above-ground biomass in young forests with airborne laser scanning. International Journal of Remote Sensing, 32(2):473–501.

[R43] Nelson R, Gobakken T, Næsset E, Gregoire T, Ståhl G, Holm S, and Flewelling J (2012). Lidar sampling – using an airborne profiler to estimate forest biomass in Hedmark County, Norway. Remote Sensing of Environment, 123:563–578.

[R44] Nelson R, Margolis H, Montesano P, Sun G, Cook B, Corp L, Andersen H-E, deJong B, Pellat FP, Fickel T, Kauffman J, and Prisley S (2017). Lidar-based estimates of aboveground biomass in the continental US and Mexico using ground, airborne, and satellite observations. Remote Sensing of Environment, 188(Supplement C):127–140.

[R45] Ren Q and Banerjee S (2013). Hierarchical Factor Models for Large Spatially Misaligned Data: A Low-Rank Predictive Process Approach. Biometrics, 69(1):19–30.

[R46] Schmidt AM and Gelfand AE (2003). A Bayesian coregionalization approach for multivariate pollutant data. Journal of Geophysical Research: Atmospheres, 108(D24).

[R47] Stein ML (2014). Limitations on low rank approximations for covariance matrices of spatial data. Spatial Statistics, 8:1–19.

[R48] Ver Hoef J and Barry R (1998). Modeling crossvariograms for cokriging and multivariable spatial prediction. Journal of Statistical Planning and Inference, 69:275–294.

[R49] White JC, Wulder MA, Varhola A, Vastaranta M, Coops NC, Cook BD, Pitt D, and Woods M (2013). A best practices guide for generating forest inventory attributes from airborne laser scanning data using an area-based approach. The Forestry Chronicle, 89(06):722–723.

[R50] Woodall CW, Coulston JW, Domke GM, Walters BF, Wear DN, Smith JE, Andersen H-E, Clough BJ, Cohen WB, Griffith DM, et al. (2015). The US forest carbon accounting framework: Stocks and stock change, 1990–2016.

[R51] Yeniay O and Goktas A (2002). A comparison of partial least squares regression with other prediction methods. Hacettepe Journal of Mathematics and Statistics, 31(99):99–101.

[R52] Zhang H (2007). Maximum-likelihood estimation for multivariate spatial linear coregionalization models. Environmetrics, 18(2):125–139.

[R53] Zhang X (2016). An optimized BLAS library based on GotoBLAS2. https://github.com/xianyi/OpenBLAS/. Accessed: 2015-06-01.
