Author manuscript; available in PMC: 2025 Mar 1.
Published in final edited form as: Spat Stat. 2024 Mar;59:100808. doi: 10.1016/j.spasta.2023.100808

Modeling lake conductivity in the contiguous United States using spatial indexing for big spatial data

Michael Dumelle a,*, Jay M Ver Hoef b, Amalia Handler a, Ryan A Hill a, Matt Higham c, Anthony R Olsen a
PMCID: PMC11694821  NIHMSID: NIHMS2029939  PMID: 39758934

Abstract

Conductivity is an important indicator of the health of aquatic ecosystems. We model large amounts of lake conductivity data collected as part of the United States Environmental Protection Agency’s National Lakes Assessment using spatial indexing, a flexible and efficient approach to fitting spatial statistical models to big data sets. Spatial indexing is capable of accommodating various spatial covariance structures as well as features like random effects, geometric anisotropy, partition factors, and non-Euclidean topologies. We use spatial indexing to compare lake conductivity models and show that calcium oxide rock content, crop production, human development, precipitation, and temperature are strongly related to lake conductivity. We use this model to predict lake conductivity at hundreds of thousands of lakes distributed throughout the contiguous United States. We find that lake conductivity models fit using spatial indexing are nearly identical to lake conductivity models fit using traditional methods but are nearly 50 times faster (sample size 3,311). Spatial indexing is readily available in the spmodel R package.

Keywords: Spatial correlation, Model selection, Prediction (Kriging), Restricted maximum likelihood estimation, Salinization, United States National Aquatic Resource Surveys

1. Introduction

Conductivity is an integrated measure of total dissolved constituents in water and is heavily influenced by the interaction between climate change and human activities. It is a fundamental metric of the chemical and biological state of an aquatic ecosystem. Recent research has documented the widespread phenomenon of freshwater salinization through rising conductivity measurements as a result of human-mediated pollution through land use and climate change (Kaushal et al., 2021; Dugan et al., 2017), and there is a need to identify which waterbodies are more likely to be affected by salinization (Solomon et al., 2023).

Lake conductivity data are collected regularly as part of the United States Environmental Protection Agency’s (USEPA) National Lakes Assessment (NLA), a national-scale monitoring survey designed to assess the physical, chemical, and biological condition of waterbodies across the contiguous United States on a five-year cycle (Shapiro et al., 2008; Peck et al., 2013; Pollard et al., 2018). As part of the NLA, approximately 1,100 lake conductivity samples were collected during each of the summers of 2007, 2012, and 2017, totaling 3,311 samples across the three cycles (i.e., years). Some lakes are sampled for conductivity twice during the same cycle, and some lakes are sampled for conductivity in multiple cycles. The lakes sampled in each cycle are meant to be well-spread (in space) and representative of the broader population of United States lakes. They are selected using the Generalized Random Tessellation Stratified (GRTS) algorithm for spatially balanced sampling (Stevens Jr and Olsen, 2004) via the spsurvey package (Dumelle et al., 2023b) for the R language (R Core Team, 2023). The 2007 NLA sampled lakes with area greater than four hectares, while the 2012 and 2017 NLAs sampled lakes with area greater than one hectare. For consistency across NLA cycles, we restrict our study here to lakes with area greater than four hectares.

Spatial models for lake conductivity using the NLA data can benefit from incorporating relevant explanatory variables, a (potentially anisotropic) spatial dependence structure, lake-specific correlation from repeated measurements at a particular lake, and yearly effects. Fitting such a spatial model to the lake conductivity data can be computationally challenging due to its size (several thousand observations). The “big-data” spatial problem has inspired decades worth of important research (for a review see Heaton et al., 2019). In this study, we review an approach to the “big-data” spatial problem called spatial indexing (SPIN, Ver Hoef et al., 2023) and show how to extend SPIN to accommodate random effects, (geometric) anisotropy, and partition factors while modeling the lake conductivity data. We use SPIN to answer the study’s two main questions:

  1. What are the benefits of modeling using SPIN compared to modeling using traditional methods?

  2. What explanatory variables are useful for explaining patterns of lake conductivity and making predictions at unobserved lakes?

The rest of this study is organized as follows. In Section 2, we provide a review of spatial statistical models, the “big-data” spatial problem, and SPIN. In Section 3, we model the lake conductivity data from the NLA. We use SPIN and 10-fold cross validation to compare a rich set of models having varying fixed and random structures. We then compare the best predictive model fit using SPIN to the same model fit using traditional methods. We end in Section 4 with a discussion of the benefits of SPIN compared to traditional methods and consider the ecological implications of the lake conductivity model.

2. A review of spatial indexing

The general linear model, which includes regression and analysis of variance, is an indispensable tool used to describe ecological phenomena. The general linear model relates a response variable to a set of explanatory (i.e., predictor; covariate) variables while accounting for random errors. Often, these random errors are assumed to follow from a white-noise process. Random errors from a white-noise process are independent and identically distributed with zero mean and a common variance. Unfortunately, general linear models of this type are inadequate for data that follow from a spatial process and exhibit spatial dependence.

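Following Cressie (1993), let a spatial process $Y(\mathbf{s})$ for $\mathbf{s}$ in $D$ be defined over point-referenced locations $\mathbf{s}$ in a two-dimensional spatial domain $D$. The spatial process $Y(\mathbf{s}_i)$ (i.e., the response variable) can be modeled as

$$Y(\mathbf{s}_i) = \mathbf{x}(\mathbf{s}_i)\boldsymbol{\beta} + \eta(\mathbf{s}_i) + \epsilon(\mathbf{s}_i), \tag{1}$$

where $\mathbf{x}(\mathbf{s}_i)$ is a $1 \times p$ vector of explanatory variables; $\boldsymbol{\beta}$ is a $p \times 1$ vector of fixed effects that control the impact of $\mathbf{x}(\mathbf{s}_i)$ on $Y(\mathbf{s}_i)$; $\eta(\cdot)$ is a zero-mean, second-order (i.e., weakly) stationary spatial process with covariance function $C_\eta(\cdot;\boldsymbol{\theta})$ that depends on variance $\sigma^2_{de}$ and $\boldsymbol{\theta}$, a vector of other parameters that influence the behavior of the spatial covariance; and $\epsilon(\cdot)$ is a white-noise process with variance $\sigma^2_{ie}$. Together, the parameters $\sigma^2_{de}$, $\boldsymbol{\theta}$, and $\sigma^2_{ie}$ are called the spatial covariance parameters. The parameter $\sigma^2_{de}$ captures spatially dependent variability and is often called the “partial sill,” while the parameter $\sigma^2_{ie}$ captures spatially independent variability and is often called the “nugget.” For any pair of locations $\mathbf{s}_i, \mathbf{s}_j$ in $D$, let $\mathbf{d}_{ij} = \mathbf{s}_i - \mathbf{s}_j$ be a “separation vector” with “separation distance” $d_{ij} = \|\mathbf{d}_{ij}\| = \|\mathbf{s}_i - \mathbf{s}_j\| \ge 0$ and let $C_\eta(d_{ij};\boldsymbol{\theta})$ be the covariance between $\eta(\mathbf{s}_i)$ and $\eta(\mathbf{s}_j)$, denoted $\text{cov}[\eta(\mathbf{s}_i), \eta(\mathbf{s}_j);\boldsymbol{\theta}]$. If $\eta(\cdot)$ is assumed to be isotropic (i.e., independent of direction), then $d_{ij} = \|\mathbf{d}_{ij}\|_2$, where $\|\cdot\|_2$ is the Euclidean norm, and $C_\eta(d_{ij};\boldsymbol{\theta}) = \text{cov}[\eta(\mathbf{s}_i), \eta(\mathbf{s}_j);\boldsymbol{\theta}]$. If $\eta(\cdot)$ is assumed to be geometrically anisotropic (i.e., dependent on direction), then $d_{ij} = \|\mathbf{S}\mathbf{R}\mathbf{d}_{ij}\|_2$ where

$$\mathbf{S} = \begin{bmatrix} 1 & 0 \\ 0 & 1/\lambda \end{bmatrix} \quad \text{and} \quad \mathbf{R} = \begin{bmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{bmatrix},$$

for $0 \le \lambda \le 1$ and $0 \le \alpha \le \pi$. Here, $\lambda$ and $\alpha$ are parameters that scale and rotate, respectively, the original coordinates such that the transformed coordinates yield a covariance that is isotropic (Schabenberger and Gotway, 2017; Dumelle et al., 2023a). One example of a family of covariance functions is the powered exponential family:

$$C_\eta(d_{ij};\boldsymbol{\theta}) = \sigma^2_{de}\exp\{-(d_{ij}/\phi)^{\gamma}\}, \tag{2}$$

where $\boldsymbol{\theta} = \{\phi, \gamma\}$, $\phi$ is the range parameter that controls the distance-decay rate of $C_\eta(d_{ij};\boldsymbol{\theta})$, and $\gamma$ is a power parameter. When $\gamma = 1$, (2) is called the exponential covariance function, and when $\gamma = 2$, (2) is called the Gaussian covariance function. Many other examples of spatial covariance functions are given in Chiles and Delfiner (1999).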
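As an illustration of the separation distance and the powered exponential family (2), the following is a minimal sketch in plain Python. The function names (`separation_distance`, `powered_exponential`) and the numeric parameter values are hypothetical choices for illustration only; they are not taken from the spmodel package.

```python
import math

def separation_distance(si, sj, scale=1.0, rotate=0.0):
    """Distance ||S R (si - sj)||_2; scale=1 and rotate=0 give the isotropic case."""
    dx, dy = si[0] - sj[0], si[1] - sj[1]
    # R rotates the separation vector by the angle `rotate` (alpha).
    rx = math.cos(rotate) * dx + math.sin(rotate) * dy
    ry = -math.sin(rotate) * dx + math.cos(rotate) * dy
    # S leaves the first axis unchanged and divides the second by `scale` (lambda).
    return math.hypot(rx, ry / scale)

def powered_exponential(d, sigma2_de, phi, gamma):
    """Equation (2): sigma2_de * exp(-(d / phi) ** gamma)."""
    return sigma2_de * math.exp(-((d / phi) ** gamma))

# Illustrative (made-up) locations and parameter values.
d_iso = separation_distance((0.0, 0.0), (3.0, 4.0))
d_aniso = separation_distance((0.0, 0.0), (3.0, 4.0), scale=0.5, rotate=0.0)
cov_exp = powered_exponential(d_iso, sigma2_de=2.0, phi=10.0, gamma=1.0)  # exponential
cov_gau = powered_exponential(d_iso, sigma2_de=2.0, phi=10.0, gamma=2.0)  # Gaussian
```

With a scale below one, the second (rotated) axis is stretched, so two locations separated along that axis are treated as farther apart and therefore less correlated than in the isotropic case.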

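The full dependence structure of the spatial process $Y(\cdot)$ is given by

$$C_Y(d_{ij};\boldsymbol{\theta}) = \text{cov}[Y(\mathbf{s}_i), Y(\mathbf{s}_j)] = C_\eta(d_{ij};\boldsymbol{\theta}) + \sigma^2_{ie}\,\mathcal{I}\{d_{ij} = 0\}, \tag{3}$$

where $\mathcal{I}\{d_{ij} = 0\}$ is an indicator function that equals one when $d_{ij} = 0$ and zero otherwise. The function $C_\eta(d_{ij};\boldsymbol{\theta})$ can be decomposed into a product of the variance of $\eta(\cdot)$, $\sigma^2_{de}$, and a correlation function, $\rho_\eta(\cdot)$, in which case (3) can be written instead as

$$C_Y(d_{ij};\boldsymbol{\theta}) = \text{cov}[Y(\mathbf{s}_i), Y(\mathbf{s}_j)] = \sigma^2_{de}\,\rho_\eta(d_{ij};\boldsymbol{\theta}) + \sigma^2_{ie}\,\mathcal{I}\{d_{ij} = 0\}.$$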

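Building from the spatial process formulation, let $\mathbf{y} = [Y(\mathbf{s}_1), \ldots, Y(\mathbf{s}_n)]^\top$ be a spatial process observed at the locations $\mathbf{s}_1, \ldots, \mathbf{s}_n$ in $D$. Then we define the spatial linear model as

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\eta} + \boldsymbol{\epsilon}, \tag{4}$$

where $\mathbf{X} = [\mathbf{x}(\mathbf{s}_1)^\top, \ldots, \mathbf{x}(\mathbf{s}_n)^\top]^\top$, $\boldsymbol{\eta} = [\eta(\mathbf{s}_1), \ldots, \eta(\mathbf{s}_n)]^\top$, and $\boldsymbol{\epsilon} = [\epsilon(\mathbf{s}_1), \ldots, \epsilon(\mathbf{s}_n)]^\top$. The expectation (i.e., average) of $\mathbf{y}$ is $\mathbf{X}\boldsymbol{\beta}$ and the covariance of $\mathbf{y}$ is the matrix $\mathbf{V}$, whose $ij$th element is given by $C_Y(d_{ij};\boldsymbol{\theta})$. The spatial linear model may also be used to make predictions of the spatial process at unobserved locations (i.e., Kriging). Let $\mathbf{s}_0$ in $D$ be an unobserved location at which a prediction of $Y(\mathbf{s}_0)$ is desired. Building from (1) and Cressie (1993),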

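$$Y(\mathbf{s}_0) = \mathbf{x}(\mathbf{s}_0)\boldsymbol{\beta} + \eta(\mathbf{s}_0) + \epsilon(\mathbf{s}_0).$$

The vector of covariances between $Y(\mathbf{s}_0)$ and $\mathbf{y}$ is

$$\mathbf{c}(\mathbf{s}_0;\boldsymbol{\theta}) = [C_Y(d_{01};\boldsymbol{\theta}), \ldots, C_Y(d_{0n};\boldsymbol{\theta})]^\top,$$

where $d_{0i} = \|\mathbf{d}_{0i}\| = \|\mathbf{s}_0 - \mathbf{s}_i\|$ is the separation distance between $\mathbf{s}_0$ and $\mathbf{s}_i$. Let $y_0 = Y(\mathbf{s}_0)$, $\mathbf{x}_0 = \mathbf{x}(\mathbf{s}_0)$, and $\mathbf{c}_0 = \mathbf{c}(\mathbf{s}_0;\boldsymbol{\theta})$. The best linear unbiased predictor, or universal Kriging predictor (Cressie, 1993), of $y_0$ is given by

$$\hat{y}_0 = \mathbf{x}_0\tilde{\boldsymbol{\beta}} + \mathbf{c}_0^\top\mathbf{V}^{-1}(\mathbf{y} - \mathbf{X}\tilde{\boldsymbol{\beta}}), \tag{5}$$

where $\tilde{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{y}$. The variance of $\hat{y}_0$ is called the prediction variance and is given by

$$\text{var}(\hat{y}_0) = \sigma^2 - \mathbf{c}_0^\top\mathbf{V}^{-1}\mathbf{c}_0 + \mathbf{Q}(\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{Q}^\top,$$

where $\sigma^2 = C_Y(0) = \sigma^2_{de} + \sigma^2_{ie}$ and $\mathbf{Q} = \mathbf{x}_0 - \mathbf{c}_0^\top\mathbf{V}^{-1}\mathbf{X}$. The square root of $\text{var}(\hat{y}_0)$ is called the prediction standard error. Typically the spatial covariance parameters are assumed unknown and require estimation. When this occurs, we evaluate $\mathbf{V}$ at estimates of the spatial covariance parameters, yielding $\hat{\mathbf{V}}$ (which is also known as the plug-in estimator of $\mathbf{V}$).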
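The predictor (5) and its prediction variance can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: it uses an exponential covariance with known (made-up) parameters, a tiny toy data set, and hypothetical helper names (`solve`, `matmul`, `transpose`, `cov_y`, `krige`); it is not the spmodel implementation, which also estimates the covariance parameters.

```python
import math

def solve(A, B):
    """Solve A X = B (B may have several columns) by Gauss-Jordan elimination."""
    n = len(A)
    M = [list(A[i]) + list(B[i]) for i in range(n)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]
        for r in range(n):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [row[n:] for row in M]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def cov_y(d, sigma2_de, phi, sigma2_ie):
    """Equation (3) with an exponential C_eta: spatial part plus nugget at d = 0."""
    return sigma2_de * math.exp(-d / phi) + (sigma2_ie if d == 0 else 0.0)

def krige(coords, y, X, s0, x0, sigma2_de, phi, sigma2_ie):
    """Return the universal Kriging prediction (5) and its prediction variance."""
    def c(a, b):
        return cov_y(math.dist(a, b), sigma2_de, phi, sigma2_ie)
    V = [[c(si, sj) for sj in coords] for si in coords]
    c0 = [[c(s0, si)] for si in coords]
    Xt = transpose(X)
    XtVinvX = matmul(Xt, solve(V, X))
    beta = solve(XtVinvX, matmul(Xt, solve(V, [[v] for v in y])))
    resid = [yi - sum(xij * b[0] for xij, b in zip(xi, beta)) for yi, xi in zip(y, X)]
    w = [row[0] for row in solve(V, c0)]  # V^{-1} c0 (V is symmetric)
    pred = sum(xj * b[0] for xj, b in zip(x0, beta)) + sum(wi * ri for wi, ri in zip(w, resid))
    Q = [xj - sum(wi * xi[j] for wi, xi in zip(w, X)) for j, xj in enumerate(x0)]
    QA = solve(XtVinvX, [[q] for q in Q])
    var = (sigma2_de + sigma2_ie) - sum(wi * ci[0] for wi, ci in zip(w, c0))
    var += sum(q * a[0] for q, a in zip(Q, QA))
    return pred, var

# Toy data (made up): four locations, intercept-only design matrix.
coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
y = [1.0, 2.0, 1.5, 3.0]
X = [[1.0] for _ in coords]
pred_new, var_new = krige(coords, y, X, (0.5, 0.5), [1.0], 1.0, 1.0, 0.1)
pred_obs, var_obs = krige(coords, y, X, (1.0, 0.0), [1.0], 1.0, 1.0, 0.1)
```

In this toy setup, predicting at an already observed location returns that observation with essentially zero prediction variance, because the covariance vector then coincides with a column of the covariance matrix. Every call solves linear systems in the full $n \times n$ matrix, which is the step that becomes expensive as $n$ grows.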

One of the main problems with using the spatial linear model is that the $n \times n$ covariance matrix $\mathbf{V}$ requires inversion when estimating spatial covariance parameters, estimating fixed effects, or making predictions. For example, estimating the spatial covariance parameters using restricted maximum likelihood (REML) requires iteratively minimizing the equation

$$-2\ell(\sigma^2_{de}, \boldsymbol{\theta}, \sigma^2_{ie} \mid \mathbf{y}) = \log|\mathbf{V}| + \mathbf{r}^\top\mathbf{V}^{-1}\mathbf{r} + \log|\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X}| + a, \tag{6}$$

where $\mathbf{r} = \mathbf{y} - \mathbf{X}\tilde{\boldsymbol{\beta}}$, $\log|\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X}|$ is the log (base-e) determinant of $\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X}$, and $a$ is a constant that does not depend on $\boldsymbol{\theta}$ (Patterson and Thompson, 1971; Harville, 1977; Wolfinger et al., 1994). Additionally, regardless of how $\boldsymbol{\theta}$ is estimated, the (unknown) fixed effects are estimated using generalized least squares (GLS):

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^\top\hat{\mathbf{V}}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\hat{\mathbf{V}}^{-1}\mathbf{y}. \tag{7}$$

Finally, note the inverse required for prediction (5). The computational cost of inverting $\mathbf{V}$ grows cubically with the sample size $n$ (i.e., $\mathcal{O}(n^3)$). This means that if the sample size doubles, inverting $\mathbf{V}$ takes approximately $2^3 = 8$ times longer. This computational challenge is the root of the “big-data” spatial problem that has motivated active research on efficient computational methods for big spatial data. We will feature our approach to tackling the “big-data” spatial problem using spatial indexing (SPIN, Ver Hoef et al., 2023), which has been implemented in the spmodel R package (Dumelle et al., 2023a).

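To estimate covariance parameters using SPIN, the data are first indexed to create a covariance matrix with $P$ partitions based on the indexes $\{i; i = 1, \ldots, P\}$,

$$\mathbf{V} = \begin{bmatrix} \mathbf{V}_{1,1} & \mathbf{V}_{1,2} & \cdots & \mathbf{V}_{1,P} \\ \mathbf{V}_{2,1} & \mathbf{V}_{2,2} & \cdots & \mathbf{V}_{2,P} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{V}_{P,1} & \mathbf{V}_{P,2} & \cdots & \mathbf{V}_{P,P} \end{bmatrix}, \tag{8}$$

and a corresponding indexing and partitioning of the spatial linear model,

$$\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \\ \vdots \\ \mathbf{y}_P \end{bmatrix} = \begin{bmatrix} \mathbf{X}_1 \\ \mathbf{X}_2 \\ \vdots \\ \mathbf{X}_P \end{bmatrix}\boldsymbol{\beta} + \begin{bmatrix} \boldsymbol{\epsilon}_1 \\ \boldsymbol{\epsilon}_2 \\ \vdots \\ \boldsymbol{\epsilon}_P \end{bmatrix}. \tag{9}$$

For the purposes of estimating covariance parameters, we minimize the REML equations (6) based on a covariance matrix,

$$\mathbf{V}_{part} = \begin{bmatrix} \mathbf{V}_{1,1} & \mathbf{0} & \cdots & \mathbf{0} \\ \mathbf{0} & \mathbf{V}_{2,2} & \cdots & \mathbf{0} \\ \vdots & \vdots & \ddots & \vdots \\ \mathbf{0} & \mathbf{0} & \cdots & \mathbf{V}_{P,P} \end{bmatrix}, \tag{10}$$

rather than (8). Using (10) in (6) only requires inverting the matrices $\mathbf{V}_{i,i}$, which each have dimension $n_i \times n_i$ and $n_i \ll n$ for all $i$. Because $\mathbf{V}$ is positive definite by construction, each $\mathbf{V}_{i,i}$ is positive definite (all principal submatrices of a positive definite matrix are positive definite). As $\mathbf{V}_{part}$ is a block diagonal matrix of positive definite matrices, it is positive definite.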
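To make the cubic-cost argument concrete, the sketch below compares an idealized operation count for inverting the full matrix with the count for inverting the diagonal blocks of (10). The partition size of roughly 50 observations is a hypothetical choice for illustration, and the counts ignore everything except the inversions, so the resulting ratio is not a prediction of wall-clock speedup.

```python
def dense_cost(n):
    """Idealized operation count for inverting an n x n matrix (cubic growth)."""
    return n ** 3

def partition_sizes(n, P):
    """Split n observations into P partitions whose sizes differ by at most one."""
    base, extra = divmod(n, P)
    return [base + 1] * extra + [base] * (P - extra)

n = 3311        # NLA sample size reported in the text
P = n // 50     # hypothetical choice: roughly 50 observations per partition
sizes = partition_sizes(n, P)

full = dense_cost(n)                              # one n x n inversion
blocked = sum(dense_cost(ni) for ni in sizes)     # P small inversions
doubling_factor = dense_cost(2 * n) / dense_cost(n)
idealized_ratio = full / blocked
```

The doubling factor reproduces the factor of eight mentioned above, while the idealized ratio shows why replacing one large inversion with many small ones is attractive. Observed timings depend on many other costs (distance calculations, optimizer iterations, and the cross-partition terms used for the fixed effect covariance), so measured speedups are typically much smaller than this ratio.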

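Partitioning the covariance matrix (10) for use in (6) is similar to quasi-likelihood (Besag, 1975), composite likelihood (Curriero and Lele, 1999) and divide and conquer (Guha et al., 2012). A key difference lies in (6) with the term $\log|\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X}|$. Using composite likelihood, $\sum_{i=1}^{P}\ell(\sigma^2_{de}, \boldsymbol{\theta}, \sigma^2_{ie} \mid \mathbf{y}_i)$ results in $\sum_{i=1}^{P}\log|\mathbf{X}_i^\top\mathbf{V}_{i,i}^{-1}\mathbf{X}_i|$. Using spatial indexing, $\mathbf{V}_{part}$ results in $\log|\sum_{i=1}^{P}\mathbf{X}_i^\top\mathbf{V}_{i,i}^{-1}\mathbf{X}_i|$. A problem for composite likelihood occurs when $\mathbf{X}$ contains columns with many zeros, which may occur for certain categorical explanatory variables. If the partitioning of $\mathbf{X}$ yields any $\mathbf{X}_i$ that have a column with all zeros, then $\mathbf{X}_i$ does not have full rank, $\mathbf{X}_i^\top\mathbf{V}_{i,i}^{-1}\mathbf{X}_i$ is singular, and $\log|\mathbf{X}_i^\top\mathbf{V}_{i,i}^{-1}\mathbf{X}_i|$ is undefined. No such problem occurs with $\mathbf{V}_{part}$.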

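Equation (7) is the generalized least squares estimate for $\boldsymbol{\beta}$. Here $\mathbf{V}^{-1}$ only occurs once (optimization for REML requires repeated inverses) but may still take a very long time for big $n$. For the partitioned model (9) with covariance matrix (10), (7) becomes

$$\hat{\boldsymbol{\beta}}_{bd} = \mathbf{T}_{xx}^{-1}\mathbf{t}_{xy}, \tag{11}$$

where $\mathbf{T}_{xx} = \sum_{i=1}^{P}\mathbf{X}_i^\top\hat{\mathbf{V}}_{i,i}^{-1}\mathbf{X}_i$ and $\mathbf{t}_{xy} = \sum_{i=1}^{P}\mathbf{X}_i^\top\hat{\mathbf{V}}_{i,i}^{-1}\mathbf{y}_i$. For covariance parameter and fixed effect estimation, we set the off-diagonal blocks of $\mathbf{V}$ to zero as in (10), but for the covariance matrix of (11) we use the full covariance matrix (8),

$$\widehat{\text{var}}(\hat{\boldsymbol{\beta}}_{bd}) = \mathbf{T}_{xx}^{-1} + \mathbf{T}_{xx}^{-1}\mathbf{W}_{xx}\mathbf{T}_{xx}^{-1},$$

where $\mathbf{W}_{xx} = \sum_{i=1}^{P-1}\sum_{j=i+1}^{P}\left[\mathbf{X}_i^\top\hat{\mathbf{V}}_{i,i}^{-1}\hat{\mathbf{V}}_{i,j}\hat{\mathbf{V}}_{j,j}^{-1}\mathbf{X}_j + (\mathbf{X}_i^\top\hat{\mathbf{V}}_{i,i}^{-1}\hat{\mathbf{V}}_{i,j}\hat{\mathbf{V}}_{j,j}^{-1}\mathbf{X}_j)^\top\right]$ (Ver Hoef et al., 2023).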

There are many ways to allocate each observation to a spatial index for the SPIN method (10), and several approaches are given by Ver Hoef et al. (2023). We considered a spatially compact allocation, where indices are clustered in space, determined via k-means clustering on the spatial coordinates with P clusters, where P is the number of partitions. We also considered a random allocation, where observations are randomly allocated to one of P partitions. Ver Hoef et al. (2023) showed that as long as the number of observations within each partition was at least 50, both the compact and random allocations yielded models with appropriate fixed effect inference (i.e., 90% confidence intervals for each fixed effect had approximately 90% coverage). However, spatially compact allocation yielded fixed effect estimates with lower (better) root-mean-squared error than random allocation.
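The two allocation strategies described above can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the k-means step is a plain Lloyd's algorithm with randomly chosen starting centers, and it is not claimed to match how spmodel or Ver Hoef et al. (2023) implement the allocation.

```python
import random

def random_allocation(n, P, seed=0):
    """Randomly assign n observations to P partitions of nearly equal size."""
    rng = random.Random(seed)
    index = [i % P for i in range(n)]
    rng.shuffle(index)
    return index

def kmeans_allocation(coords, P, iters=25, seed=0):
    """Spatially compact allocation: Lloyd's k-means on the (x, y) coordinates."""
    rng = random.Random(seed)
    centers = rng.sample(coords, P)
    index = [0] * len(coords)
    for _ in range(iters):
        # Assign every observation to its nearest current center.
        index = [
            min(range(P), key=lambda k: (x - centers[k][0]) ** 2 + (y - centers[k][1]) ** 2)
            for x, y in coords
        ]
        # Move each center to the mean of the observations assigned to it.
        for k in range(P):
            members = [c for c, idx in zip(coords, index) if idx == k]
            if members:
                centers[k] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return index

# Toy coordinates (made up): two well-separated clumps of four locations each.
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
compact_index = kmeans_allocation(coords, P=2)
random_index = random_allocation(len(coords), P=2)
```

Note that k-means does not control partition sizes, so in practice one would want to check that each partition contains enough observations (for example, the threshold of about 50 mentioned above).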

An attractive feature of spatial indexing is that it easily accommodates any valid covariance structure. For example, Ver Hoef et al. (2023) showed how spatial indexing can be applied to non-Euclidean geometries like those imposed by stream networks (which depend on flow-connected and flow-unconnected distances). Here, we will create complex covariance structures that include random effects, geometric anisotropy, and partition factors to model the lake conductivity data, which we discuss in more detail later. Because spatial indexing is computationally efficient, we can use it to explore model selection for explanatory variables and various covariance features.

A version of SPIN also exists for prediction. Ver Hoef et al. (2023) note, however, that it does not perform as well as a “local neighborhood” predictor, in which only a small subset of $\mathbf{y}$ nearest to the prediction location are used to inform the prediction. Similar to SPIN for prediction, the “local neighborhood” approach is very computationally efficient. This is because for each prediction, $\mathbf{V}$ is subset to dimension $n_l \times n_l$, where $n_l \ll n$, which is much faster to invert. Ver Hoef et al. (2023) provide further details regarding the SPIN and “local neighborhood” approaches to prediction.

3. Modeling lake conductivity

Freshwater salinization is commonly measured by electrical conductance, or conductivity, a proxy for dissolved ionic solutes in the water (Thorslund and van Vliet, 2020). Conductivity is a metric responsive to surrounding regional landscape drivers including physical, anthropogenic, and climatic features (Das et al., 2006; Olson and Hawkins, 2012; Read et al., 2015). As a result, these features are excellent candidates for describing patterns in lake conductivity at large scales and predicting which lakes are potentially at risk for excess conductivity.

Anthropogenic activities are creating widespread saltier conditions in freshwater ecosystems as measured by increasing conductivity across the United States and the globe (Cañedo-Argüelles et al., 2016; Kaushal et al., 2018). Lake ecosystems, given their landscape position and water residence time, are integrators of their surrounding environment. As a result, lakes can accumulate external inputs including pollutants (Schindler, 2009; Hintz et al., 2022). Conductivity of freshwater ecosystems has increased due to a combination of salt pollution, accelerated weathering and soil cation exchange, mining and resource extraction, and the presence of easily weathered minerals used in agricultural operations and urban infrastructure (Kaushal et al., 2018). The phenomenon has documented adverse effects on the chemical and biological integrity of freshwater ecosystems (Corsi et al., 2010; Van Meter and Swan, 2014; Hintz et al., 2022). Salinization occurs either through input of excess solutes, high evaporation relative to solute inputs, or a combination of the two (Kaushal et al., 2021). Given rising needs for freshwater extraction to support human communities and higher evaporation rates with climate warming, widespread increases in conductivity are expected to continue (Dugan et al., 2017; Thorslund and van Vliet, 2020). Predicting lake conductivity across the country is a step toward understanding the extent of the salinization phenomena.

3.1. Data sources

We used data from the 2007, 2012, and 2017 NLAs to study conductivity of United States lakes that are at least four hectares in surface area (henceforth called “US Lakes”; Figure 1). In each NLA, approximately 1,000 lakes distributed across the country are visited once during the summer months (June, July, August, or September). Approximately 100 lakes are resampled at least two weeks after the initial sampling date to evaluate variability inherent in repeated measurements. Thus, there are approximately 1,100 total samples per NLA. Sometimes lakes sampled in a previous NLA are also sampled in a subsequent NLA. Of all unique lakes sampled in at least one NLA (2007, 2012, or 2017), approximately 30% were sampled in more than one NLA.

Figure 1:

Histograms (left) and spatial maps (right) of log (base-e) conductivity (LogCn) in the NLA data during 2007, 2012, and 2017.

To measure conductivity at each lake, a depth-integrated photic zone (maximum depth of 2 meters) water sample was collected for water chemistry (EPA, 2017a). Samples were collected from the deepest point in the lake and generally excluded shoreline conditions. The samples were stored on ice in the field until transferred to the lab, where they were stored at 4°C until analysis. Conductivity was measured within 7 days of sample collection (EPA, 2017b).

We selected explanatory variables that represent some of the major sources of solutes to waterbodies and the physical and climatic variables that can modify the concentration of solutes and help explain broad patterns in conductivity across the contiguous US. We opted for variables that encompass the sources and modifiers of conductivity rather than more specific variables (e.g., road density, road salt application rates) for two reasons. First, we required nationally consistent data that could be processed at the watershed scale for all US lakes. Second, a common challenge in population scale analyses of waterbodies is that there is often a lack of targeted explanatory variables, but these variables can be correlated in space. Thus, spatial models are useful for more accurately capturing ecological patterns and making predictions at unobserved waterbodies.

Using the LakeCat data (Hill et al., 2018), we collected data for each explanatory variable summarized for the watershed (all area draining to the lake) of each lake (i.e., explanatory variables are calculated at the lake level). The LakeCat data are a collection of watershed metrics derived for the almost 400,000 lakes of the Medium Resolution National Hydrography Dataset Plus Version 2 (McKay et al., 2012). The LakeCat metrics represent watershed percentage of several land use and land cover types (Homer et al., 2007, 2020), climate (Daly et al., 2008), underlying geochemistry (Olson and Hawkins, 2014), and more. We used these metrics because they are based on nationally consistent geospatial layers which makes them available for not only the model fitting data (i.e., the NLA), but also for prediction to all US lakes in LakeCat. Among the several hundred available metrics in LakeCat, we selected a subset of explanatory variables that previous models of stream electrical conductivity have identified as important (Olson and Hawkins, 2012; Olson and Cormier, 2019). These explanatory variables are summarized in Table 1 and represent physical, anthropogenic, or climate patterns.

Table 1:

Explanatory variables used to model lake conductivity (at the lake level).

| Explanatory Variable (Abbreviation) | Source |
| --- | --- |
| Lake Area (in ha) | Physical |
| Calcium Oxide | Physical |
| Sulfur | Physical |
| Presence of Crop Production (Pres-Crop) | Anthropogenic |
| Proportion of Crop Production (Prop-Crop) | Anthropogenic |
| Presence of Human Development (Pres-HDev) | Anthropogenic |
| Proportion of Human Development (Prop-HDev) | Anthropogenic |
| Year-2012 | Climate |
| Year-2017 | Climate |
| Precipitation (in cm) | Climate |
| Temperature (in degC) | Climate |

Physical explanatory variables that we considered include lake area (hectares) and the geologic sources of calcium oxide and sulfur. Calcium oxide and sulfur provide the baseline source of solutes to waterbodies (Olson and Hawkins, 2012; Olson and Cormier, 2019; Griffith, 2014) and are found in sedimentary rocks with calcium being widespread in areas with carbonate geology. They are commonly identified as important sources of conductivity under non-anthropogenic conditions (Olson and Cormier, 2019). We summarize calcium and sulfur content as the mean percent of each watershed’s surface (or near surface) geology.

Anthropogenic explanatory variables that we considered included the proportion of lake watershed in crop production and the proportion of lake watershed in human development. These variables were obtained from the National Land Cover Database (NLCD) for 2006, 2011, and 2016 and paired with the corresponding NLA observations separately for 2007, 2012, and 2017, respectively. Agricultural practices associated with crop production such as tillage practices, lime and fertilizer application, and irrigation practices can all increase conductivity (Hintz et al., 2022; Kaushal et al., 2018). Human development encompasses areas where concrete, road salt application, and stormwater and wastewater discharge can all raise solute concentrations in waterbodies (Dugan et al., 2017; Kaushal et al., 2005; Novotny et al., 2008; Beibei et al., 2023; Solomon et al., 2023). Road salt is a major source of salt pollution to streams and states often use this method of road de-icing; however, there is a lack of nationally consistent data on application rates (Solomon et al., 2023) and the NLA samples lakes during the summer months when road de-icing is less common. In addition, human development includes mining operations that contribute solutes to aquatic systems through the extraction process and irrigation (Palmer et al., 2010). We combined the low, medium, and high intensity development categories in the NLCD to create a single human development variable that would encompass many types of human activities. Together, these measures of crop production and human development capture many landscape modifications and human activities that can result in higher lake water conductivity.

Climate explanatory variables that we considered were the NLA year (2007, 2012, or 2017), precipitation (in cm), and temperature (in degC). The NLA year represents large-scale, general climatic patterns that change throughout the years. Precipitation is an important driver of atmospheric deposition, rock weathering, and transport of solutes to waterbodies (Chae et al., 2004; Stoddard, 1991). In addition, areas with higher precipitation can have a dilution effect on waterbody solutes. Temperature is an important driver of evaporation and transpiration that can concentrate solutes in waterbodies. The 30-year averages of annual temperature and annual precipitation are included in the model to capture the typical climate conditions for each lake.

3.2. Spatial Linear Model for NLA data

In Section 3.3, we fit many models to the NLA data and compare them. These models supplement the spatial linear model (4) with a lake-specific random effect, yielding the following form:

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \boldsymbol{\eta} + \boldsymbol{\epsilon},$$

where $\mathbf{y}$ is the log (base-e) of lake conductivity, $\mathbf{X}$ contains the explanatory variables (in Table 1), $\mathbf{Z}$ is a design matrix that indicates which lake each observation came from, and $\mathbf{u}$ is a vector of independent, lake-specific random intercepts. Random effects capture sources of variability separate from the explicit spatial (i.e., partial sill) and independent (i.e., nugget) variance components and build additional correlation into the model when observations share the same level of a random effect (Henderson, 1975). We assume that the variance of $\mathbf{u}$ is $\sigma^2_{lake}\mathbf{I}$, which implies $\text{cov}(\mathbf{Z}\mathbf{u}) = \sigma^2_{lake}\mathbf{Z}\mathbf{Z}^\top$. The spatial process covariance is modeled using the powered exponential covariance function (2). Specifically, we study the case where $\gamma$ is fixed at one (yielding the exponential covariance function) or two (yielding the Gaussian covariance function), and make both isotropic and (geometrically) anisotropic assumptions. We also study the case of no spatial covariance ($\boldsymbol{\eta} = \mathbf{0}$). Euclidean distances were calculated using a NAD83 projection. One additional modification to the covariance matrix we consider is due to a partition factor. We define this as a factor (i.e., categorical) variable with several levels such that observations without the same partition factor level are assumed completely uncorrelated in the covariance matrix. Let $\mathbf{Z}_p$ be a design matrix for a categorical (partition factor) variable, and let $\mathbf{P} = \mathbf{Z}_p\mathbf{Z}_p^\top$. When allowing for the partition factor, we create a new covariance matrix that equals $\mathbf{V} \odot \mathbf{P}$, where $\odot$ is the Hadamard (direct) product. When we treat year as a partition factor, we assume observations from different years are uncorrelated. When we have no partition factor, we allow observations across years to be correlated.
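The lake random effect and the partition factor both act on the covariance matrix through simple indicator structures, which the following sketch illustrates. The helper names, the four toy observations, and all numeric values are hypothetical and chosen only to show the mechanics.

```python
def same_level_matrix(levels):
    """Z Z^T for a one-hot design matrix Z: entry (i, j) is 1 when levels match."""
    return [[1.0 if a == b else 0.0 for b in levels] for a in levels]

def hadamard(A, B):
    """Element-wise (Hadamard) product of two equally sized matrices."""
    return [[a * b for a, b in zip(row_a, row_b)] for row_a, row_b in zip(A, B)]

# Toy data (made up): four observations, two of them from the same lake.
lake_id = ["A", "A", "B", "C"]
year = [2007, 2012, 2012, 2017]
sigma2_lake = 0.3

# Spatial-plus-nugget covariance for the four observations (illustrative values;
# observations 0 and 1 share a location, so their spatial covariance is high).
V_spatial = [
    [2.0, 1.8, 0.8, 0.5],
    [1.8, 2.0, 0.8, 0.5],
    [0.8, 0.8, 2.0, 0.7],
    [0.5, 0.5, 0.7, 2.0],
]

# cov(Zu) = sigma2_lake * Z Z^T adds correlation for repeat visits to a lake.
lake_cov = [[sigma2_lake * m for m in row] for row in same_level_matrix(lake_id)]
V = [[s + l for s, l in zip(rs, rl)] for rs, rl in zip(V_spatial, lake_cov)]

# Treating year as a partition factor zeroes covariance between different years.
V_year_partition = hadamard(V, same_level_matrix(year))
```

In this toy example, the two visits to lake A gain extra covariance from the random effect, but that covariance (like every other cross-year entry) is removed once year is used as a partition factor.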

3.3. Model selection using spatial indexing

We designed an empirical study to compare the fit of 40 lake conductivity models having several different explanatory variable (fixed effect) and covariance structures. The several explanatory variable and covariance structures were determined by the following crossing (Table 2): two explanatory variable structures (no explanatory variables or all explanatory variables); three spatial covariance types (exponential, Gaussian, none); two anisotropy structures (no anisotropy or anisotropy); two partition factor structures (no partition factor or year as a partition factor); and two spatial indexing allocation methods (k-means or random allocation). The crossing described actually yields 48 models, but anisotropy has no meaning when there is no spatial covariance, so 40 models remain. For each of the 40 models, we also included a random effect for lake, which builds into the model additional correlation for repeat observations at the same lake. We fit all models using restricted maximum likelihood (6). We explored adding a random effect for year; this did not improve performance but did make models slower so we did not consider it further. As evidenced by the different types of models described, SPIN accommodates a wide range of possible model types. Moreover, its computational efficiency enables the use of a model comparison tool that relies on repeated model fitting like cross validation.

Table 2:

10-fold cross validation structures using spatial indexing. For the explanatory variable structure, we considered intercept-only models (No) and models with all explanatory variables (Yes). For the spatial covariance (SP-Cov) structure, we considered models with an exponential spatial covariance, a Gaussian spatial covariance, and no spatial covariance (None). For the anisotropy structure, we considered models with and without anisotropy. For the partition factor structure, we considered models that treated year as a partition factor and models that did not treat year as a partition factor. For the allocation structure, we considered models that used k-means or random allocation.

| Structure | Value 1 | Value 2 | Value 3 |
| --- | --- | --- | --- |
| Explanatory Variables | No | Yes | |
| SP-Cov | Exponential | Gaussian | None |
| Anisotropy | No | Yes | |
| Partition Factor | No | Yes | |
| Allocation | k-means | random | |
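The count of 40 models follows directly from the crossing in Table 2. A short sketch of the arithmetic, using hypothetical variable names, enumerates the full crossing and drops the combinations that pair anisotropy with no spatial covariance:

```python
from itertools import product

explanatory = ["No", "Yes"]
sp_cov = ["Exponential", "Gaussian", "None"]
anisotropy = ["No", "Yes"]
partition_factor = ["No", "Yes"]
allocation = ["k-means", "random"]

# Full crossing of the five structures in Table 2.
all_combos = list(product(explanatory, sp_cov, anisotropy, partition_factor, allocation))

# Anisotropy has no meaning without a spatial covariance, so drop those combinations.
models = [m for m in all_combos if not (m[1] == "None" and m[2] == "Yes")]

# Subset with k-means allocation and no partition factor.
kmeans_no_partition = [m for m in models if m[3] == "No" and m[4] == "k-means"]
```

The full crossing has 48 combinations, 8 of which are dropped, leaving 40 models; 10 of these use k-means allocation without a partition factor.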

We compared each of the 40 models using 10-fold cross validation, a measure of predictive performance. To carry out 10-fold cross validation, first the data are randomly split into 10 approximately equally-sized subsets called folds. Then, fold one is held out and the remaining nine folds are aggregated and used to fit a model. This fitted model is then used to make predictions and estimate standard errors at the locations in fold one. Next, fold two is held out and the remaining nine folds are aggregated and used to fit a model. This fitted model is then used to make predictions and estimate standard errors at the locations in fold two. This process is repeated for the remaining eight folds, at which point each observation $y_i$ has a corresponding cross validation prediction $\hat{y}_i$ and standard error estimate, where $i$ represents the $i$th of $n$ observations. Then we compute several model performance statistics: mean bias (MB), mean-squared-prediction error (MSPE), R2, and 95% prediction interval coverage. MB measures the average deviation between the true and predicted values and is formally defined as

$$\text{MB} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i).$$

A well-fitting predictive model should have an average bias close to zero. MSPE measures the average squared deviation between the true and predicted values and is formally defined as

$$\text{MSPE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2.$$

Lower values of MSPE indicate better model fit. R2 (as used here) measures the squared correlation between the true and predicted values and is formally defined as

$$R^2 = \text{Cor}(\mathbf{y}, \hat{\mathbf{y}})^2.$$

Higher values of R2 indicate better model fit. 95% prediction interval coverage measures the proportion of 95% prediction intervals for the true values that actually contain the true values. It is formally defined as

$$\text{COVER95} = \frac{1}{n}\sum_{i=1}^{n}\mathcal{I}\{LB(\hat{y}_i) \le y_i \le UB(\hat{y}_i)\},$$

where $\mathcal{I}\{LB(\hat{y}_i) \le y_i \le UB(\hat{y}_i)\}$ is an indicator function equal to one if $y_i$ is contained in the interval spanned by the 95% prediction interval for $y_i$ with lower bound $LB(\hat{y}_i)$ and upper bound $UB(\hat{y}_i)$. A well-fitting statistical predictive model should have nearly appropriate prediction interval coverage. Our last evaluation metric was the time (in seconds) it took to fit each model to all of the data using SPIN.
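The fold assignment and the four performance statistics defined above can be sketched in plain Python. The function names and toy values are hypothetical, and the coverage function assumes normal-based intervals of the form prediction plus or minus 1.96 standard errors, which is an assumption for illustration rather than something stated in the text.

```python
import random

def assign_folds(n, k=10, seed=0):
    """Randomly assign n observations to k folds of approximately equal size."""
    rng = random.Random(seed)
    folds = [i % k for i in range(n)]
    rng.shuffle(folds)
    return folds

def mean_bias(y, yhat):
    return sum(a - b for a, b in zip(y, yhat)) / len(y)

def mspe(y, yhat):
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def r_squared(y, yhat):
    """Squared Pearson correlation between observed and predicted values."""
    my, mh = sum(y) / len(y), sum(yhat) / len(yhat)
    sxy = sum((a - my) * (b - mh) for a, b in zip(y, yhat))
    sxx = sum((a - my) ** 2 for a in y)
    syy = sum((b - mh) ** 2 for b in yhat)
    return sxy ** 2 / (sxx * syy)

def cover95(y, yhat, se, z=1.959964):
    """Share of intervals yhat +/- z * se containing y (normal-based: an assumption)."""
    hits = sum(1 for a, b, s in zip(y, yhat, se) if b - z * s <= a <= b + z * s)
    return hits / len(y)

# Fold labels for the 3,311 NLA observations; model fitting itself is omitted here.
folds = assign_folds(3311)
fold_sizes = [folds.count(f) for f in range(10)]

# Toy observed values, cross validation predictions, and standard errors (made up).
y = [1.0, 2.0, 3.0, 4.0]
yhat = [1.5, 1.5, 3.5, 3.5]
se = [0.3, 0.3, 0.3, 0.3]
stats = (mean_bias(y, yhat), mspe(y, yhat), r_squared(y, yhat), cover95(y, yhat, se))
```

In a full cross validation, each fold would be held out in turn, a model would be fit to the remaining nine folds, and the held-out predictions and standard errors would be collected before computing these statistics once over all observations.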

All 40 models had mean bias close to zero and 95% prediction interval coverage close to 95% (ranging from 91% to 95%). The models fit using k-means allocation outperformed the equivalent model fit using random allocation, which suggests that incorporating spatial structure into the allocation method can be beneficial. Moreover, the models fit without a partition factor outperformed the equivalent model fit using year as a partition factor, which implies that the models improved when they borrowed covariance across years. Table 3 shows cross validation statistics and model fit times for the 10 models that use k-means allocation and no partition factor. With respect to MSPE and R2 (Table 3), the models with the explanatory variables outperformed the equivalent model with only an intercept, the models with spatial covariance (exponential or Gaussian) outperformed the equivalent model without spatial covariance, the models with an exponential spatial covariance outperformed the equivalent model with a Gaussian spatial covariance, and the models with anisotropy slightly outperformed the equivalent model without anisotropy. With respect to model-fitting time (Table 3), the spatial models were much slower than the non-spatial models, and the spatial models with anisotropy were much slower than equivalent models without anisotropy. The best fitting model with respect to MSPE and R2 was the model with all explanatory variables, the exponential spatial covariance, anisotropy, no partition factor for year, and k-means allocation. This model took 49.88 seconds to fit to all 3,311 observations.

Table 3:

Cross validation performance for spatial indexing models without a partition factor for year and using k-means allocation. Models varied by whether or not explanatory variables (EXPL) were used, the spatial covariance model (MOD), where Exp was for exponential, Gau was for Gaussian, or None, and whether or not anisotropy (ANIS) was used. Performance measures were mean bias (MB), mean-squared-prediction error (MSPE), correlation between true and predicted values (R2), 95% prediction interval coverage (COVER95), and fit time in seconds (TIME). MSPE and R2 ranked from best (1) to worst (10).

EXPL MOD ANIS MB MSPE R2 COVER95 TIME
Yes Exp Yes 0.005 0.282 (1) 0.860 (1) 0.940 49.88
Yes Exp No 0.004 0.283 (2) 0.860 (2) 0.940 12.88
Yes Gau Yes 0.005 0.310 (3) 0.846 (3) 0.945 116.21
Yes Gau No 0.000 0.311 (4) 0.846 (4) 0.946 11.68
No Exp No 0.000 0.338 (5) 0.833 (5) 0.936 9.23
No Exp Yes 0.000 0.341 (6) 0.832 (6) 0.936 70.84
No Gau No −0.002 0.380 (7) 0.812 (7) 0.936 12.75
No Gau Yes 0.002 0.394 (8) 0.805 (8) 0.939 63.48
Yes None No 0.009 0.514 (9) 0.745 (9) 0.940 4.10
No None No −0.008 1.008 (10) 0.499 (10) 0.950 3.79

3.4. Comparing spatial indexing to traditional modeling

In this section, we compare the best model described in Section 3.3 fit using spatial indexing to an equivalent model fit using traditional methods (i.e., fit without using spatial indexing and using the full covariance matrix). We call the spatial indexing model SPIN-MOD and the traditional model TRAD-MOD. More specifically, we compare SPIN-MOD and TRAD-MOD using 10-fold cross validation (using equivalent folds), covariance parameter estimation, fixed effect parameter estimation, and prediction. We show that generally, SPIN-MOD and TRAD-MOD perform nearly identically but SPIN-MOD is substantially faster.

3.4.1. 10-fold cross validation

We performed 10-fold cross validation using SPIN-MOD and TRAD-MOD and compared mean bias (MB), mean-squared-prediction error (MSPE), R2, 95% prediction interval coverage (COVER95), and computational times (Table 4). Both SPIN-MOD and TRAD-MOD were unbiased and had nearly appropriate prediction interval coverage. TRAD-MOD had 0.02% lower (better) MSPE and 0.01% higher (better) R2 than SPIN-MOD. 10-fold cross validation took 16.35 minutes for SPIN-MOD and 303.82 minutes (approximately 5 hours) for TRAD-MOD. Fitting a single model to all the data took 0.83 minutes for SPIN-MOD and 40.16 minutes for TRAD-MOD. Given the cubic nature of matrix inversion, the gap in computational times between a model fit using spatial indexing and a model fit using traditional methods will continue to grow with the sample size.
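The speedups implied by these times can be checked with a few lines of arithmetic. The sketch below copies the times from Table 4; the helper `cubic_growth` is a hypothetical name, and the cubic rule is only a rule of thumb for dense matrix inversion, not a measured cost.

```r
# Times in minutes, copied from Table 4
time_cv  <- c(SPIN = 16.35, TRAD = 303.82)   # 10-fold cross validation
time_fit <- c(SPIN = 0.83,  TRAD = 40.16)    # single fit to all data

# Ratio of traditional time to spatial indexing time
speedup_cv  <- time_cv[["TRAD"]] / time_cv[["SPIN"]]
speedup_fit <- time_fit[["TRAD"]] / time_fit[["SPIN"]]
print(round(c(cv = speedup_cv, fit = speedup_fit), 1))

# Rule of thumb: dense inversion cost grows like n^3, so doubling the
# sample size multiplies that cost by 2^3 = 8
cubic_growth <- function(n_new, n_old) (n_new / n_old)^3
print(cubic_growth(2 * 3311, 3311))
```

The single-fit ratio is the source of the "nearly 50 times faster" figure quoted in the abstract.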

Table 4:

Cross validation performance for the spatial indexing model (SPIN-MOD) and the traditional model (TRAD-MOD). Performance measures were mean bias (MB), mean-squared-prediction error (MSPE), correlation between true and predicted values (R2), and 95% prediction interval coverage (COVER95). Time1 is the time to perform 10-fold cross validation (in minutes) and Time2 is the time to fit a single model to all the data (in minutes).

Model MB MSPE R2 COVER95 Time1 Time2
SPIN-MOD 0.0049 0.2818 (2) 0.8603 (2) 0.9400 16.35 0.83
TRAD-MOD 0.0047 0.2816 (1) 0.8604 (1) 0.9393 303.82 40.16

3.4.2. Covariance parameter estimation

Covariance parameter estimates were very similar for both SPIN-MOD and TRAD-MOD (Table 5). For both models, roughly 20% of random variability is explained by repeated observations at the same lake ($\sigma^2_{lake}$), roughly 75% of random variability is explained by the spatially dependent random error ($\sigma^2_{de}$), and the remaining few percentage points of random variability are explained by spatially independent random error ($\sigma^2_{ie}$). For both models, the anisotropy rotation parameter represents a nearly 90° rotation counter-clockwise from the origin and the scale parameter indicates that the correlation is approximately three times shorter in the east-west direction compared to the north-south direction. For both models, the effective range (the distance at which two observations are nearly uncorrelated) was about 1,000 km (the effective range of the exponential covariance is three times the range), which implies that in the east-west direction, spatial covariance is approximately zero when two lakes are roughly 333 km apart, but in the north-south direction, spatial covariance is approximately zero when two lakes are roughly 1,000 km apart. The direction and magnitude of anisotropy are reasonable given the north-south orientation of many mountainous regions in the contiguous United States that have similar underlying geologies and climate and exert strong effects on the surrounding ecosystem areas. We show this anisotropic behavior as a function of distance and as a contoured level curve of points with equal correlation in Figure 2.
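To make the directional effective ranges concrete, the following base-R sketch evaluates an anisotropic exponential correlation using the SPIN-MOD estimates reported in Table 5. It assumes a common geometric anisotropy convention: coordinates are rotated clockwise by α (so the major axis of correlation sits α counter-clockwise from the east-west axis) and the minor axis is then stretched by 1/λ. The function names are hypothetical, and the exact convention used in the fitted models may differ.

```r
# Covariance parameter estimates for SPIN-MOD, copied from Table 5
alpha  <- 1.51     # anisotropy rotation (radians)
lambda <- 0.325    # anisotropy scale
phi    <- 327.30   # range (km)

# Assumed convention: rotate the separation vector clockwise by alpha,
# then stretch the second (minor) axis by 1 / lambda
aniso_dist <- function(dx, dy, alpha, lambda) {
  x_rot <- cos(alpha) * dx + sin(alpha) * dy
  y_rot <- -sin(alpha) * dx + cos(alpha) * dy
  sqrt(x_rot^2 + (y_rot / lambda)^2)
}

# Exponential correlation of the spatially dependent error
exp_cor <- function(dx, dy) exp(-aniso_dist(dx, dy, alpha, lambda) / phi)

# Effective range along a unit direction (ux, uy): the Euclidean distance
# at which the correlation has dropped to exp(-3), roughly 0.05
eff_range <- function(ux, uy) 3 * phi / aniso_dist(ux, uy, alpha, lambda)

ranges <- c(EW = eff_range(1, 0), NS = eff_range(0, 1))
print(round(ranges))
```

Under these assumptions the north-south effective range is about three times the east-west effective range, in line with the approximate 1,000 km and 333 km values described in the text.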

Table 5:

Fixed effect and covariance parameter output for SPIN-MOD and TRAD-MOD. For the fixed effects, estimates are presented with their standard errors in parentheses and * indicating significance: *** indicates p-values less than 0.001, ** indicates p-values greater than 0.001 but less than 0.01, * indicates p-values greater than 0.01 but less than 0.1, and no * markings indicate p-values greater than 0.1. For the covariance parameters, only estimates are presented. Fixed effects and covariance parameters are defined in Table 1 and (2), respectively.

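Parameter Type | Term | SPIN-MOD | TRAD-MOD
Explanatory (Fixed) | Intercept | 5.469 (0.255)*** | 4.717 (0.277)***
Explanatory (Fixed) | Lake Area (ha) | 3.51E-06 (3.05E-06) | 3.51E-06 (2.99E-06)
Explanatory (Fixed) | Calcium Oxide | 0.015 (0.003)*** | 0.013 (0.003)***
Explanatory (Fixed) | Sulfur | 0.050 (0.032) | 0.045 (0.032)
Explanatory (Fixed) | Pres-Crop | 0.182 (0.038)*** | 0.167 (0.037)***
Explanatory (Fixed) | Prop-Crop | 0.004 (0.001)*** | 0.004 (0.001)***
Explanatory (Fixed) | Pres-HDev | 0.167 (0.048)*** | 0.161 (0.046)***
Explanatory (Fixed) | Prop-HDev | 0.012 (0.002)*** | 0.012 (0.002)***
Explanatory (Fixed) | Year-2012 | −0.015 (0.011) | −0.015 (0.011)
Explanatory (Fixed) | Year-2017 | −0.037 (0.012)** | −0.037 (0.012)**
Explanatory (Fixed) | Precipitation (cm) | −0.016 (0.001)*** | −0.013 (0.001)***
Explanatory (Fixed) | Temperature (degC) | 0.090 (0.014)*** | 0.135 (0.014)***
Covariance (Random) | $\sigma^2_{de}$ | 0.915 (74.6%) | 1.122 (78.4%)
Covariance (Random) | $\sigma^2_{ie}$ | 0.032 (2.6%) | 0.032 (2.2%)
Covariance (Random) | $\sigma^2_{lake}$ | 0.280 (22.8%) | 0.277 (19.4%)
Covariance (Random) | α | 1.51 (radians) | 1.54 (radians)
Covariance (Random) | λ | 0.325 | 0.409
Covariance (Random) | ϕ | 327.30 (km) | 352.84 (km)

Figure 2: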

Left: Approximate correlation of SPIN-MOD as a function of distance (from zero to the range parameter) in east-west (E-W), north-south (N-S), and northeast-southwest (NE-SW) directions. Right: Anisotropic correlation level curve as a function of distance (from zero to the range parameter) for SPIN-MOD. All points on a single level curve have equal correlation. The three level curves represent correlations at different distances.

3.4.3. Fixed effect estimation

Fixed effect estimates and their corresponding standard errors were nearly identical for both SPIN-MOD and TRAD-MOD (Table 5). Both models suggest that 1) there is statistically significant evidence (p-value < 0.01) that crop production, human development, calcium oxide (in surface or near-surface geology), and temperature are associated with increases in average log conductivity, 2) there is statistically significant evidence (p-value < 0.01) that precipitation and NLA cycle year (2017) are associated with decreases in average log conductivity, and 3) there is not statistically significant evidence (p-value > 0.01) that sulfur (in surface or near-surface geology), NLA cycle year (2012), or lake area impact average log conductivity.
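Because the response is log (base-e) conductivity, a coefficient β can be read as a multiplicative change of exp(β) on the original conductivity scale for a one-unit increase in the explanatory variable, holding the others fixed. The short sketch below applies this to three SPIN-MOD estimates copied from Table 5. The percent-change reading applies to the median (geometric mean) of conductivity and is offered as an interpretive aid rather than a result reported by the analysis.

```r
# Selected fixed effect estimates for SPIN-MOD, copied from Table 5
beta <- c(
  "Temperature (degC)" = 0.090,
  "Precipitation (cm)" = -0.016,
  "Year-2017" = -0.037
)

# Multiplicative change on the original scale per one-unit increase
mult <- exp(beta)

# Same quantity expressed as a percent change
pct_change <- 100 * (mult - 1)
print(round(pct_change, 1))
```

For example, a one degree Celsius increase in temperature corresponds to roughly a 9% increase on the original scale under this reading, while the 2017 intercept shift corresponds to a decrease of under 4%.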

3.4.4. Prediction (Kriging)

Separately for SPIN-MOD and TRAD-MOD, we predicted log conductivity and estimated standard errors at 150,839 prediction sites for each year (2007, 2012, and 2017), totaling 452,517 prediction sites. To make these predictions (and standard error estimates), both SPIN-MOD and TRAD-MOD used the full covariance matrix constructed from their respective covariance parameter estimates. In other words, while SPIN was used for covariance parameter estimation with SPIN-MOD, we did not use SPIN (or local neighborhood) prediction for either SPIN-MOD or TRAD-MOD (recall that SPIN for covariance parameter estimation can be used separately from SPIN for prediction). Using the full covariance matrix is feasible here because the observed data are not too big (n=3,311), and for prediction we only need to invert this matrix once (in contrast to covariance parameter estimation, where we need to invert this matrix many times). This highlights a key fact that even when predicting for big data sets (here, npred=452,517), the main computational burden comes from inverting a covariance matrix that corresponds to the observed data (here, n=3,311).
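The point that the observed-data covariance matrix is inverted only once, regardless of the number of prediction sites, can be sketched in base R with simulated data. This is a minimal universal Kriging illustration assuming known covariance parameters, an exponential covariance, and an intercept-only mean; the sample sizes, parameter values, and all object names are hypothetical and it is not the implementation used by spmodel.

```r
set.seed(1)
n <- 200         # observed sites (3,311 in the application)
n_pred <- 1000   # prediction sites (452,517 in the application)
obs  <- cbind(runif(n), runif(n))
pred <- cbind(runif(n_pred), runif(n_pred))

# Assumed covariance parameters: dependent variance, independent variance, range
de <- 1; ie <- 0.1; phi <- 0.3
cross_dist <- function(a, b) {
  sqrt(outer(a[, 1], b[, 1], "-")^2 + outer(a[, 2], b[, 2], "-")^2)
}
Sigma <- de * exp(-cross_dist(obs, obs) / phi) + diag(ie, n)

# Simulated response with an intercept-only mean
X  <- matrix(1, n, 1)
X0 <- matrix(1, n_pred, 1)
y  <- 2 + drop(t(chol(Sigma)) %*% rnorm(n))

# The only O(n^3) step: factor and invert the n x n matrix once
Sigma_inv <- chol2inv(chol(Sigma))

# Generalized least squares estimate of the fixed effects
SiX <- Sigma_inv %*% X
XtSiX_inv <- solve(crossprod(X, SiX))
beta_hat <- XtSiX_inv %*% crossprod(SiX, y)

# Covariance between observed and prediction sites (n x n_pred)
C0   <- de * exp(-cross_dist(obs, pred) / phi)
SiC0 <- Sigma_inv %*% C0

# Predictions and prediction variances for all sites at once
fit <- drop(X0 %*% beta_hat +
              crossprod(C0, Sigma_inv %*% (y - X %*% beta_hat)))
H <- t(X0) - crossprod(X, SiC0)
pred_var <- (de + ie) - colSums(C0 * SiC0) + colSums(H * (XtSiX_inv %*% H))
se <- sqrt(pred_var)
```

After the single inversion, every remaining operation is a matrix product whose cost grows linearly in the number of prediction sites, which is why prediction remains feasible when the number of prediction sites is far larger than n.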

Predictions and standard error estimates of log conductivity were nearly identical for both SPIN-MOD and TRAD-MOD. Figure 3 shows the predictions and standard errors for SPIN-MOD. These predictions and standard errors have similar patterns in each year: predictions are highest in the central parts of the country; predictions are lowest along the coastlines and in locations where precipitation is higher; the north-south dependence suggested by the anisotropy parameters seems apparent; standard errors are highest in places where observed data are sparse and along the domain boundaries; and standard errors are lowest in locations near many observed data points. Because both SPIN-MOD and TRAD-MOD used the full covariance matrix to make predictions (i.e., SPIN was only used for covariance parameter estimation), they took the same amount of time to compute all 452,517 predictions and standard errors: approximately 155 minutes (using parallel processing).

Figure 3:

Predictions (left; Preds) and standard errors (right; StdErs) of log (base-e) conductivity at each of the 150,839 lakes in LakeCat during 2007, 2012, and 2017 using SPIN-MOD.

3.5. spmodel function calls

spmodel is an R package that contains many useful tools for spatial statistical modeling of point and areal (i.e., lattice) data. We used the spmodel package to fit all lake conductivity models and make predictions. For model fitting, we used the following function call:

splm(
    formula = formula,
    data = data.frame or sf object,
    spcov_type = character vector,
    anisotropy = logical vector,
    random = formula,
    partition_factor = formula,
    local = logical vector or list
)

The splm() function in spmodel fits spatial linear models and is very similar in structure to the lm function in base-R used to fit non-spatial linear models. The formula argument takes a formula that specifies the relationship between the response variable and the explanatory variables.

The data argument takes a data frame or sf object (Pebesma, 2018) that holds the variables in formula, random, and partition_factor. An sf object is a special data frame that contains spatial geometries. The spcov_type argument takes a character vector that indicates the spatial covariance type desired. The anisotropy argument is a logical vector that indicates whether anisotropy should be modeled or not. The random argument takes a formula that specifies the random effect structure and is similar to the way one specifies random effects in the lme4 (Bates et al., 2015) and nlme (Pinheiro and Bates, 2000) R packages. The partition_factor argument takes a formula that specifies the partition factor. The local argument is a logical vector or a list. If a logical vector, it indicates whether spatial indexing should be implemented with default settings or not. If a list, the user can customize the details of the spatial indexing implementation, changing things like the allocation used (k-means or random), the number of observations assigned to each spatial index, and whether to use parallel processing.

For prediction, we used the following function call:

predict(
    object = splm object,
    newdata = data.frame or sf object,
    interval = character vector
)

The predict() function in spmodel was used to make all 452,517 predictions (150,839 per year). The object argument takes the spatial linear model fit from the splm() function. The newdata argument is a data frame or sf object that contains the explanatory variables and locations for observations requiring prediction. The interval argument is a character vector that indicates the type of interval to return; for our purposes, this was always set to “prediction” to return prediction intervals.
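A concrete version of these two calls might look like the following sketch. It does not use the lake conductivity data; instead it assumes the `sulfate` and `sulfate_preds` example data sets distributed with spmodel, and it assumes that the `local` list accepts `method`, `size`, and `parallel` elements as described in the package documentation. The group size of 50 is an arbitrary choice for this small example.

```r
library(spmodel)

# Fit an intercept-only spatial linear model with an exponential covariance
# and geometric anisotropy, using spatial indexing with k-means allocation
mod <- splm(
  formula = sulfate ~ 1,
  data = sulfate,
  spcov_type = "exponential",
  anisotropy = TRUE,
  local = list(method = "kmeans", size = 50, parallel = FALSE)
)

# random and partition_factor are omitted here; they take one-sided
# formulas, for example random = ~ lake_id or partition_factor = ~ year
# (hypothetical column names that do not exist in the sulfate data)

# Predict at new locations and return 95% prediction intervals
preds <- predict(
  object = mod,
  newdata = sulfate_preds,
  interval = "prediction"
)
head(preds)
```

The returned object is expected to contain the prediction along with lower and upper interval bounds for each row of `newdata`.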

4. Discussion

Conductivity is an important measure of a lake’s chemical and biological state. USEPA’s NLA provides a rich source of lake conductivity data throughout the contiguous United States. Spatial statistical models are useful tools to understand such data, but fitting models to data of this size using traditional methods can be computationally cumbersome. We showed that spatial indexing (SPIN) is a crucial tool for handling such data, yielding models very similar to those obtained with traditional methods in a fraction of the time. Using spatial indexing, we were able to perform model selection on a wide variety of models and evaluate them using predictive metrics. The best-fitting predictive model was motivated by ecologically-relevant explanatory variables, making it also useful for inference. We found that increases in calcium oxide, crop production, human development, and temperature were associated with statistically and practically significant increases in lake conductivity, while increases in precipitation were associated with statistically and practically significant decreases in lake conductivity. These effects are practically significant because their estimated magnitudes are notable, especially when considering their units of measurement. While there was a statistically significant decrease in lake conductivity during 2017, this decrease is not very practically significant, being small in magnitude and unitless (i.e., it represents a single intercept-shift for all sites in 2017). Moreover, this model is practically useful because it does not systematically over-predict or under-predict (i.e., it has mean bias approximately zero), has low prediction error (mean-squared-prediction error approximately 0.282) relative to the spread in the original data (sample standard deviation of log lake conductivity equal to 1.42), and is primarily motivated by ecologically-relevant drivers of lake conductivity, which we discuss in more detail next.

The findings from this modeling effort agree with previous work on the drivers of conductivity in rivers and streams (Olson and Hawkins, 2012) and lakes (Dugan et al., 2017). The baseline source of conductivity to lakes comes from the surface and near surface geology of the watershed (Olson and Hawkins, 2012; Olson and Cormier, 2019; Griffith, 2014), especially in the form of carbonate rocks rich in calcium (Olson and Cormier, 2019). While other models have found sulfur to be an important source of conductivity (Olson and Hawkins, 2012; Olson and Cormier, 2019), these efforts were focused on studying non-anthropogenic sources of conductivity in streams. The influence of sulfur may be small relative to the influence of anthropogenic sources included in the crop production and human development explanatory variables (Olson, 2019). Agricultural crop production is a proxy for fertilizer application, soil lime treatment, irrigation, and tilling practices, all of which contribute positively to conductivity of surface waters. Human development, encompassing urban development, road salt application, industrial activities (e.g., mining), and stormwater or wastewater discharge, was associated with higher surface water conductivity. Together, crop production and human development contribute substantially to the national and global problem of surface water salinization as seen in rising conductivity (Kaushal et al., 2018, 2021, 2023). We found that precipitation was negatively related to conductivity, which is consistent with other studies finding that precipitation across the country has an overall dilution effect on conductivity (Olson and Hawkins, 2012; Olson and Cormier, 2019). Temperature has the opposite effect, where higher temperatures are associated with higher evaporation rates resulting in higher conductivity. After accounting for the explanatory variables, most of the model’s leftover variability was explained by the spatially dependent random errors, which were strongest in the north-south direction and weakest in the east-west direction. This finding is reflective of large-scale geologic and climatic patterns in the United States that tend to have north-south orientations.

Accurately and efficiently characterizing Earth’s various spatial processes is crucial to advancing the scientific understanding of our climate and environment and providing support for informed decision making. Also of great relevance is quantifying the human impact on these processes and making predictions at unobserved locations. SPIN is a crucial methodological development that is well-positioned to satisfy these goals, helping us answer important questions about our climate and environment now and in the future. Importantly, SPIN is readily available for use in the spmodel R package.

Supplementary Material

Supplement1

Acknowledgments

We would like to thank Karen Blocksom for acquiring the NLA lake conductivity data, Jana Compton, Erin Howard, Paul Mayer, and Patti Meeks for helpful feedback on the manuscript’s initial draft, and two anonymous reviewers for their thoughtful comments that greatly improved the manuscript.

The views expressed in this manuscript are those of the authors and do not necessarily represent the views or policies of the U.S. Environmental Protection Agency or the National Oceanic and Atmospheric Administration. Any mention of trade names, products, or services does not imply an endorsement by the U.S. government, the U.S. Environmental Protection Agency, or the National Oceanic and Atmospheric Administration. The U.S. Environmental Protection Agency and the National Oceanic and Atmospheric Administration do not endorse any commercial products, services or enterprises.

Data availability, code availability, and computing environment

All data and code associated with this work have been made publicly available in a GitHub repository at https://github.com/USEPA/lake-conductivity-spin.manuscript. The spmodel R package is available for download directly from CRAN. More information is available at https://cran.r-project.org/package=spmodel. All figures were made using ggplot2 (Wickham, 2016). All computations were performed on a computer having two Intel(R) Xeon(R) CPU E5-2690 v3 2.60GHz processors, 256 GB of RAM, and 48 threads (i.e., cores available for parallel processing).

References

  1. Bates D, Mächler M, Bolker B, Walker S, 2015. Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67, 1–48. doi: 10.18637/jss.v067.i01.
  2. Beibei E, Zhang S, Driscoll CT, Wen T, 2023. Human and natural impacts on the US freshwater salinization and alkalinization: A machine learning approach. Science of The Total Environment 889, 164138.
  3. Besag J, 1975. Statistical analysis of non-lattice data. Journal of the Royal Statistical Society: Series D (The Statistician) 24, 179–195.
  4. Cañedo-Argüelles M, Hawkins CP, Kefford BJ, Schäfer RB, Dyack BJ, Brucet S, Buchwalter D, Dunlop J, Frör O, Lazorchak J, et al., 2016. Saving freshwater from salts. Science 351, 914–916.
  5. Chae GT, Yun ST, Kim KH, Lee PK, Choi BY, 2004. Atmospheric versus lithogenic contribution to the composition of first- and second-order stream waters in Seoul and its vicinity. Environment International 30, 73–85. doi: 10.1016/S0160-4120(03)00150-8.
  6. Chiles JP, Delfiner P, 1999. Geostatistics: Modeling Spatial Uncertainty. John Wiley & Sons, New York.
  7. Corsi SR, Graczyk DJ, Geis SW, Booth NL, Richards KD, 2010. A fresh look at road salt: Aquatic toxicity and water-quality impacts on local, regional, and national scales. Environmental Science & Technology 44, 7376–7382. doi: 10.1021/es101333u.
  8. Cressie N, 1993. Statistics for Spatial Data (Revised ed.). Wiley, Hoboken, NJ.
  9. Curriero FC, Lele S, 1999. A composite likelihood approach to semivariogram estimation. Journal of Agricultural, Biological, and Environmental Statistics 4, 9–28.
  10. Daly C, Halbleib M, Smith JI, Gibson WP, Doggett MK, Taylor GH, Curtis J, Pasteris PP, 2008. Physiographically sensitive mapping of climatological temperature and precipitation across the conterminous United States. International Journal of Climatology: a Journal of the Royal Meteorological Society 28, 2031–2064.
  11. Das R, Samal NR, Roy PK, Mitra D, 2006. Role of electrical conductivity as an indicator of pollution in shallow lakes. Asian Journal of Water, Environment and Pollution 3, 143–146.
  12. Dugan HA, Bartlett SL, Burke SM, Doubek JP, Krivak-Tetley FE, Skaff NK, Summers JC, Farrell KJ, McCullough IM, Morales-Williams AM, Roberts DC, Ouyang Z, Scordo F, Hanson PC, Weathers KC, 2017. Salting our freshwater lakes. Proceedings of the National Academy of Sciences 114, 4453–4458. doi: 10.1073/pnas.1620211114.
  13. Dumelle M, Higham M, Ver Hoef JM, 2023a. spmodel: Spatial statistical modeling and prediction in R. PLOS ONE 18, e0282524.
  14. Dumelle M, Kincaid T, Olsen AR, Weber M, 2023b. spsurvey: Spatial sampling design and analysis in R. Journal of Statistical Software 105, 1–29.
  15. EPA, 2017a. National Lakes Assessment 2017. Field Operations Manual. U.S. Environmental Protection Agency. Washington, DC, USA. URL: https://www.epa.gov/sites/default/files/2021-01/documents/nla_2017_fom_version_1.1_2017_04_06.pdf.
  16. EPA, 2017b. National Lakes Assessment 2017. Laboratory Operations Manual. U.S. Environmental Protection Agency. Washington, DC, USA. URL: https://www.epa.gov/sites/default/files/2020-03/documents/lom_nla_2017_version_1.1.pdf.
  17. Griffith MB, 2014. Natural variation and current reference for specific conductivity and major ions in wadeable streams of the conterminous USA. Freshwater Science 33, 1–17. doi: 10.1086/674704.
  18. Guha S, Hafen R, Rounds J, Xia J, Li J, Xi B, Cleveland WS, 2012. Large complex data: divide and recombine (D&R) with RHIPE. Stat 1, 53–67.
  19. Harville DA, 1977. Maximum likelihood approaches to variance component estimation and to related problems. Journal of the American Statistical Association 72, 320–338.
  20. Heaton MJ, Datta A, Finley AO, Furrer R, Guinness J, Guhaniyogi R, Gerber F, Gramacy RB, Hammerling D, Katzfuss M, et al., 2019. A case study competition among methods for analyzing large spatial data. Journal of Agricultural, Biological and Environmental Statistics 24, 398–425.
  21. Henderson CR, 1975. Best linear unbiased estimation and prediction under a selection model. Biometrics, 423–447.
  22. Hill RA, Weber MH, Debbout RM, Leibowitz SG, Olsen AR, 2018. The lake-catchment (LakeCat) dataset: characterizing landscape features for lake basins within the conterminous USA. Freshwater Science 37, 208–221.
  23. Hintz WD, Arnott SE, Symons CC, Greco DA, McClymont A, Brentrup JA, Cañedo-Argüelles M, Derry AM, Downing AL, Gray DK, Melles SJ, Relyea RA, Rusak JA, Searle CL, Astorg L, Baker HK, Beisner BE, Cottingham KL, Ersoy Z, Espinosa C, Franceschini J, Giorgio AT, Göbeler N, Hassal E, Hébert MP, Huynh M, Hylander S, Jonasen KL, Kirkwood AE, Langenheder S, Langvall O, Laudon H, Lind L, Lundgren M, Proia L, Schuler MS, Shurin JB, Steiner CF, Striebel M, Thibodeau S, Urrutia-Cordero P, Vendrell-Puigmitja L, Weyhenmeyer GA, 2022. Current water quality guidelines across North America and Europe do not protect lakes from salinization. Proceedings of the National Academy of Sciences 119, e2115033119. doi: 10.1073/pnas.2115033119.
  24. Homer C, Dewitz J, Fry J, Coan M, Hossain N, Larson C, Herold N, McKerrow A, VanDriel JN, Wickham J, et al., 2007. Completion of the 2001 National Land Cover Database for the conterminous United States. Photogrammetric Engineering and Remote Sensing 73, 337.
  25. Homer C, Dewitz J, Jin S, Xian G, Costello C, Danielson P, Gass L, Funk M, Wickham J, Stehman S, et al., 2020. Conterminous United States land cover change patterns 2001–2016 from the 2016 National Land Cover Database. ISPRS Journal of Photogrammetry and Remote Sensing 162, 184–199.
  26. Kaushal SS, Groffman PM, Likens GE, Belt KT, Stack WP, Kelly VR, Band LE, Fisher GT, 2005. Increased salinization of fresh water in the northeastern United States. Proceedings of the National Academy of Sciences 102, 13517–13520. doi: 10.1073/pnas.0506414102.
  27. Kaushal SS, Likens GE, Mayer PM, Shatkay RR, Shelton SA, Grant SB, Utz RM, Yaculak AM, Maas CM, Reimer JE, Bhide SV, Malin JT, Rippy MA, 2023. The anthropogenic salt cycle. Nature Reviews Earth & Environment 4, 770–784. doi: 10.1038/s43017-023-00485-y.
  28. Kaushal SS, Likens GE, Pace ML, Reimer JE, Maas CM, Galella JG, Utz RM, Duan S, Kryger JR, Yaculak AM, Boger WL, Bailey NW, Haq S, Wood KL, Wessel BM, Park CE, Collison DC, Aisin B.Y.a.I., Gedeon TM, Chaudhary SK, Widmer J, Blackwood CR, Bolster CM, Devilbiss ML, Garrison DL, Halevi S, Kese GQ, Quach EK, Rogelio CMP, Tan ML, Wald HJS, Woglo SA, 2021. Freshwater salinization syndrome: from emerging global problem to managing risks. Biogeochemistry 154, 255–292. doi: 10.1007/s10533-021-00784-w.
  29. Kaushal SS, Likens GE, Pace ML, Utz RM, Haq S, Gorman J, Grese M, 2018. Freshwater salinization syndrome on a continental scale. Proceedings of the National Academy of Sciences 115, E574–E583. doi: 10.1073/pnas.1711234115.
  30. McKay L, Bondelid T, Dewald T, Johnston J, Moore R, Reah A, 2012. NHDPlus Version 2: User Guide. URL: http://www.horizon-systems.com/NHDPlus/NHDPlusV2_home.php.
  31. Novotny EV, Murphy D, Stefan HG, 2008. Increase of urban lake salinity by road deicing salt. Science of The Total Environment 406, 131–144. doi: 10.1016/j.scitotenv.2008.07.037.
  32. Olson JR, 2019. Predicting combined effects of land use and climate change on river and stream salinity. Philosophical Transactions of the Royal Society B: Biological Sciences 374, 20180005. doi: 10.1098/rstb.2018.0005.
  33. Olson JR, Cormier SM, 2019. Modeling spatial and temporal variation in natural background specific conductivity. Environmental Science & Technology 53, 4316–4325. doi: 10.1021/acs.est.8b06777.
  34. Olson JR, Hawkins CP, 2012. Predicting natural base-flow stream water chemistry in the western United States. Water Resources Research 48. doi: 10.1029/2011WR011088.
  35. Olson JR, Hawkins CP, 2014. Geochemical characteristics of the conterminous United States. U.S. Geological Survey data release. doi: 10.5066/F7X0653P.
  36. Palmer MA, Bernhardt ES, Schlesinger WH, Eshleman KN, Foufoula-Georgiou E, Hendryx MS, Lemly AD, Likens GE, Loucks OL, Power ME, White PS, Wilcock PR, 2010. Mountaintop mining consequences. Science 327, 148–149. doi: 10.1126/science.1180543.
  37. Patterson D, Thompson R, 1971. Recovery of inter-block information when block sizes are unequal. Biometrika 58, 545–554.
  38. Pebesma E, 2018. Simple Features for R: Standardized Support for Spatial Vector Data. The R Journal 10, 439–446. doi: 10.32614/RJ-2018-009.
  39. Peck DV, Olsen AR, Weber MH, Paulsen SG, Peterson C, Holdsworth SM, 2013. Survey design and extent estimates for the National Lakes Assessment. Freshwater Science 32, 1231–1245.
  40. Pinheiro JC, Bates DM, 2000. Mixed-Effects Models in S and S-PLUS. Springer, New York. doi: 10.1007/b98882.
  41. Pollard AI, Hampton SE, Leech DM, 2018. The promise and potential of continental-scale limnology using the US Environmental Protection Agency’s National Lakes Assessment. Limnology and Oceanography Bulletin 27, 36–41.
  42. R Core Team, 2023. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
  43. Read EK, Patil VP, Oliver SK, Hetherington AL, Brentrup JA, Zwart JA, Winters KM, Corman JR, Nodine ER, Woolway RI, Dugan HA, Jaimes A, Santoso AB, Hong GS, Winslow LA, Hanson PC, Weathers KC, 2015. The importance of lake-specific characteristics for water quality across the continental United States. Ecological Applications 25, 943–955. doi: 10.1890/14-0935.1.
  44. Schabenberger O, Gotway CA, 2017. Statistical methods for spatial data analysis. CRC Press, New York.
  45. Schindler DW, 2009. Lakes as sentinels and integrators for the effects of climate change on watersheds, airsheds, and landscapes. Limnology and Oceanography 54, 2349–2358. doi: 10.4319/lo.2009.54.6_part_2.2349.
  46. Shapiro MH, Holdsworth SM, Paulsen SG, 2008. The need to assess the condition of aquatic resources in the US. Freshwater Science 27, 801–811.
  47. Solomon CT, Dugan HA, Hintz WD, Jones SE, 2023. Upper limits for road salt pollution in lakes. Limnology and Oceanography Letters n/a. doi: 10.1002/lol2.10339.
  48. Stevens DL Jr, Olsen AR, 2004. Spatially balanced sampling of natural resources. Journal of the American Statistical Association 99, 262–278.
  49. Stoddard JL, 1991. Trends in Catskill stream water quality: Evidence from historical data. Water Resources Research 27, 2855–2864. doi: 10.1029/91WR02009.
  50. Thorslund J, van Vliet MTH, 2020. A global dataset of surface water and groundwater salinity measurements from 1980–2019. Scientific Data 7, 231. doi: 10.1038/s41597-020-0562-z.
  51. Van Meter RJ, Swan CM, 2014. Road salts as environmental constraints in urban pond food webs. PLOS ONE 9, e90168. doi: 10.1371/journal.pone.0090168.
  52. Ver Hoef JM, Dumelle M, Higham M, Peterson EE, Isaak DJ, 2023. Indexing and partitioning the spatial linear model for large data sets. PLOS ONE 18, e0291906.
  53. Wickham H, 2016. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag, New York.
  54. Wolfinger R, Tobias R, Sall J, 1994. Computing Gaussian likelihoods and their derivatives for general linear mixed models. SIAM Journal on Scientific Computing 15, 1294–1310.
