Author manuscript; available in PMC: 2024 Dec 1.
Published in final edited form as: Environ Model Softw. 2023 Dec;170:105853. doi: 10.1016/j.envsoft.2023.105853

Nutrient Explorer: An analytical framework to visualize and investigate drivers of surface water quality

Michael J Pennino 1,*, Meridith M Fry 1, Robert D Sabo 1, James N Carleton 1
PMCID: PMC11485643  NIHMSID: NIHMS1996450  PMID: 39430121

Abstract

Excess nutrients (nitrogen and phosphorus) in lakes can lead to eutrophication, hypoxia, and algal blooms that may harm aquatic life and people. Some U.S. states have established numeric water quality criteria for nutrients to protect surface waters. However, monitoring to determine if criteria are being met is limited by resources and time. Using R code and the publicly available lake data, we introduce a downloadable interactive user interface for modeling relationships between watershed land use, climate, and other variables and surface water nutrient concentrations. Random Forest modeling identified watershed agricultural and forest land coverage, fertilizer inputs, and lake depth as the most important predictors of total phosphorus. The analytical framework implemented in this application can be applied to different locations and other surface water types to be leveraged by decision makers to identify the most influential drivers of excess nutrient concentrations and to prioritize watersheds for restoration.

Keywords: Phosphorus, Nitrogen, Nutrients, Lake Water Quality, Random Forest Modeling, Multilinear Regression Modeling, R Shiny

1. Introduction

Surface water quality is often associated with watershed land uses and their intensity. Human activities, such as the application of fertilizer and manure to row crops, can result in widespread, detrimental impacts on lake and stream health. Excess phosphorus (P) and nitrogen (N) concentrations in surface waters can impair their suitability for use as drinking water supplies, diminish recreational uses, and degrade their value as habitats for aquatic organisms and communities, particularly by intensifying eutrophication and hypoxia (Diaz et al. 2017, Dodds et al. 2009, Smith 2016).

Numerous studies worldwide have used empirical approaches to determine the causes of water quality impairment and to investigate connections between surface water concentrations of P and N, and watershed land uses, hydrologic, geomorphic, and other factors (Bremigan et al. 2008, Carney 2009, Cross and Jacobson 2013, Fergus et al. 2011, Nielsen et al. 2012, Zhang et al. 2012). Some methods for explaining spatial variation in surface water nutrient concentrations using landscape features include simple summary statistics, Analysis of Variance (ANOVA), Pearson’s or Spearman Rank correlation, Principal Components Analysis (PCA), or ordinary and geographically weighted regression to show how nutrients are often positively related to the proportion of crop land and negatively related to the proportion of forested land (Galbraith and Burns 2007, Jones et al. 2004, Łaszewski et al. 2022, Park and Lee 2020). One historically common modeling technique in ecological studies has been multilinear regression (MLR). Using MLR, Nielsen et al. (2012) established that total nitrogen (TN), total phosphorus (TP), and chlorophyll a (Chl-a) increased significantly with percent agricultural land, and decreased significantly with percent forested land, in the watersheds of 414 Danish lakes. Amirbahman et al. (2022) established that the watershed agricultural land fraction was among the strongest predictors (along with lake depth) of summer lake P concentrations in 126 Maine (USA) lakes, and Wagner and Schliep (2018) found associations between catchment agricultural land use and nutrient concentrations. For lotic waters, Tufford et al. (1998) found stream TN and TP to be positively related to the portion of agriculture and urban land uses, and negatively related to forest and wetland coverage, in coastal South Carolina.

There have also been a growing number of recent studies that use machine learning approaches like random forest (RF) modeling to predict surface water quality (Basu et al. 2023, Hill et al. 2017, Hollister et al. 2016, Pennino et al. 2020). For example, Martinsen and Sand-Jensen (2022) used RF to model water quality as a function of catchment, land use, and geomorphologic features, also finding lake nutrient and Chl-a concentrations to be positively related to catchment agriculture in Denmark. And the study by Sabo et al. (2023) used RF to explicitly link the magnitude of contemporary and legacy sources of terrestrial phosphorus to observed lake and stream TP concentrations across the United States, rather than relying on land use gradient relationships to explain spatial and temporal variability in TP. Yet, despite RF models being increasingly used in ecological studies, there are still very few studies that have applied this analytical approach to predict lake nutrient concentrations.

To help address water quality impairment, such as from nutrient pollution, states and authorized tribes in the U.S., under the Clean Water Act (CWA), as part of water quality standards (WQS), have the authority to establish designated uses (i.e., goals) for waterbodies (e.g., for recreation and aquatic life support, and for protection and propagation of fish, shellfish, and wildlife) (U.S. EPA 2017). WQS may include criteria for meeting those designated uses, which may be in the form of either constituent concentrations or narrative statements. States can use numeric thresholds to interpret narrative criteria as part of CWA Section 303(d) assessment or permitting for point source emissions. Under Section 304(a) of the CWA, the U.S. Environmental Protection Agency (USEPA) publishes guidance for the development of water quality criteria that reflect the latest scientific knowledge (e.g., U.S. EPA 2023a). The U.S. EPA also provides recommendations (i.e., 304a criteria) that consist of concentrations, levels, and/or qualitative measures of pollutants that, if not exceeded, are expected to generally ensure adequate water quality for protection of designated uses. The U.S. EPA recommends that states and tribes consider the 304a criteria when developing their own water quality criteria but has no legal authority to require adoption of federally recommended criteria. As of this writing, only one state (Hawaii) has adopted complete sets of P and N criteria for all waterbody types within its borders (although certain U.S. territories have also done so). Four states have adopted P and/or N criteria for two or more waterbody types, three have adopted P and/or N criteria for one waterbody type, 16 have adopted P and/or N criteria for some waters, while 26 (plus the District of Columbia) have not adopted P and/or N criteria for any of their waters (U.S. EPA 2023a). 
In states that have adopted total phosphorus (TP) criteria for some or all of their lakes and reservoirs (excluding Great Lakes and other site-specific waters), criterion concentrations range from 12 to 160 μg/L across different regional, hydrologic, or morphologic classes or categories of lakes. Rather than single samples or instantaneous concentrations, these criteria are often defined in terms of “chronic” concentrations, with warm-season means (e.g., June - September) or annual geometric means not to be exceeded.

Furthermore, under the Safe Drinking Water Act (SDWA) in the U.S., the U.S. EPA also sets national health-based standards for drinking water and monitors states, local authorities, and water suppliers who enforce those standards. The National Primary Drinking Water Regulations (NPDWR) are legally enforceable primary standards that apply to public water systems for protecting public health, by limiting concentrations of contaminants in drinking water (U.S. EPA 2023b). These include the Maximum Contaminant Level Goals (MCLGs) and Maximum Contaminant Levels (MCLs) for nitrate of 10 mg/L as N and for nitrite of 1 mg/L as N, for protecting infants below the age of six months from blue-baby syndrome.

The objective of this paper is to present an analytical framework implemented as a downloadable application with a graphical user interface (GUI), called the “Nutrient Explorer.” This framework is built to improve environmental management decisions and benefit watershed planning activities by elucidating temporal and spatial patterns in surface water datasets, identifying potentially impaired waters, and highlighting the main drivers of spatial variation in nutrient concentrations in surface waters. The application was developed in R (R Development Core Team 2022), using RStudio (RStudio Team 2022) with the Shiny package (Chang et al. 2023), and provides an easy way for users to visualize trends, carry out simple summary statistics, and explore and identify correlations between predictor variables (e.g., watershed nutrient inputs, land use, landscape and water body attributes) and nutrient concentrations in surface water. Also, based on the literature reviewed above, we chose to employ two modeling methods (RF and MLR) to generate predictive estimates of nutrient concentrations at both sampled and unsampled locations, similar to previous work (Alnahit et al. 2022, Brooks et al. 2016, Fu et al. 2019, Murphy et al. 2019, Nahkala et al. 2022, Pennino et al. 2020). These models can identify the main drivers of water quality impairment and the waterbodies that may be most likely to experience high nutrient concentrations and potentially exceed water quality criteria.

To illustrate the approach and capabilities of this analysis, we used TP concentration sample data collected by various agencies and organizations from lakes in 17 northeastern and northern midwestern U.S. states across multiple decades, from the LAGOS-NE (Lake Multi-Scaled Geospatial and Temporal Database of the Northeast) database (Soranno et al. 2017), along with watershed land use and land cover, hydrologic and surplus anthropogenic nutrient input data summarized at the Hydrologic Unit Code (HUC)-8 level (Sabo et al. 2019, Sabo et al. 2021b). Although the results presented focus on TP from the LAGOS-NE dataset, the analytical approach can be adapted to explore connections between landscape and water body attributes and other water quality parameters and to generate predictive models for other datasets in other regions for lotic or lentic systems (the User Guide in the Nutrient Explorer GUI provides details on how to format and use new datasets).

2. Materials and methods

2.1. Dataset

To begin exploring relationships between land use, landscape and water body attributes, and surface water quality, we gathered relevant water quality and associated land use and landscape data. We used water monitoring data from LAGOS-NE, which includes information on more than 51,000 lakes of greater than 4-hectare (ha) surface area, located in 17 northern midwestern and northeastern U.S. states. The LAGOS-NE database offers data gathered by more than 80 separate agencies within federal, state, and tribal governments, along with data collected by university researchers and citizen monitoring groups (Soranno et al. 2017, Soranno et al. 2015). Although LAGOS-NE does not report the depths from which its water quality samples were collected, samples included in LAGOS-NE are intended to specifically represent the epilimnetic/surficial waters of the sampled lakes (Soranno et al. 2017), which is therefore the focus of this study. LAGOS-NE lake observations were paired with landscape nutrient surplus data and other spatial data so that lake data were matched with their corresponding watershed data, using methodologies developed in previous efforts (Lin et al. 2021, Sabo et al. 2019, Sabo et al. 2021b).

We used R software (https://www.r-project.org/), a free software environment for statistical computing and graphics that can run on a range of platforms (i.e., Windows, UNIX, MacOS), to process and format the data. This analysis focuses on TP concentrations (μg/L), which were natural log (ln) transformed prior to all analyses, to meet requirements for normally distributed residuals. Subsequent uses of “log” will refer to ln.
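
The transformation and its inverse can be sketched as follows. This is a minimal illustration written in Python rather than the application's R code; the function names and the sample values are hypothetical and are not taken from LAGOS-NE. It also shows that back-transforming the mean of ln-transformed values yields a geometric mean, which is the form in which some criteria are expressed.

```python
import math

def log_transform(tp_ug_per_l):
    """Natural-log transform TP concentrations (ug/L); inputs must be > 0."""
    return [math.log(v) for v in tp_ug_per_l]

def back_transform(log_tp):
    """Convert ln(TP) values back to concentrations in ug/L."""
    return [math.exp(v) for v in log_tp]

def geometric_mean(tp_ug_per_l):
    """exp of the mean of ln values, i.e. the geometric mean in ug/L."""
    logs = log_transform(tp_ug_per_l)
    return math.exp(sum(logs) / len(logs))

# Illustrative concentrations only (not real observations).
tp = [12.0, 25.0, 160.0]
log_tp = log_transform(tp)
round_trip = back_transform(log_tp)
gm = geometric_mean(tp)
```

Because the logarithm is undefined for zero or negative values, any non-detects or zeros would need to be handled before this step; how the application treats them is not described here.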

2.2. Shiny

We developed an analytical framework in R with the Shiny package (https://shiny.rstudio.com/), to provide a downloadable interactive GUI for visualizing and analyzing water quality datasets. Using RStudio and the Shiny GUI, we explored correlations and interdependence among variables, generated RF and MLR models, and created a variety of plots to visualize and predict water quality. We call this application “Nutrient Explorer,” and it can be downloaded from EPA’s science inventory website: https://cfpub.epa.gov/si/si_public_record_report.cfm?dirEntryId=358039.

This general approach allows for the investigation of spatial and temporal patterns in surface water nutrient concentrations (e.g., TP and TN), and the ability to identify correlations of these constituents with various landscape and water body characteristics. The interface can be used by researchers and natural resource managers in the United States at federal, state, and local levels, but it can also be applied more broadly to locations globally, to explore available data and relationships, to model and predict concentrations, and to assist with decisions regarding watershed management.

The Nutrient Explorer Shiny GUI consists of five main sections:

  • The first section is for loading either the default lake dataset or uploading a new dataset provided by the user (the User Guide in the Nutrient Explorer GUI gives instructions on how to format this dataset). This section displays a map of the sampling locations (latitude and longitude information are needed) and can allow the user to view different endpoints (e.g., TP, TN). The user can also produce a table with summary statistics on the raw data.

  • The second section allows the user to “Explore the Data” for the original uploaded dataset. This section includes various plots: histograms of endpoint data, summary statistics, time series plots, box plots for the two-digit hydrologic unit (HUC2) scale, correlation plots, and maps.

  • The third section allows the user to “Create a Subset” of the original, uploaded dataset. The data can be filtered by year, month, watershed area, watershed perimeter, elevation, lake area, lake depth, constituent concentration, data source, program type (e.g., federal agency, state agency, university), or HUC2 location. For purposes of illustration we chose to focus on the period from 2000 to 2013 to match the range of years available for the inventory on nutrient inputs (Sabo et al. 2019, Sabo et al. 2021b), and left all other options as their default (which provides the entire range of values for elevation, lake area, depth, etc.).

  • The fourth section allows the user to explore the created subset of the original dataset using various plots, such as time series, box plot summaries by HUC2, correlation plots for predictor and endpoint variables, and maps of the observed mean endpoint concentrations.

  • The fifth section allows the user to carry out either MLR or RF modeling after setting modeling parameters, to predict where endpoint concentrations are expected to be greatest and determine which predictor variables best explain spatial patterns. These two modeling options were chosen based on reviewing the literature for a common approach used for this type of analysis (e.g., MLR) and for an emerging approach being used more and more in ecological studies (e.g., RF) (see below). This section also allows the user to apply a default dataset for making predictions or for the user to upload a new dataset for predictions (instructions for doing this are in the application’s GUI User Guide). Also, this section provides options to download a table of the predicted N and/or P concentrations for every region and to download a color image of the prediction map.

2.3. Predictive Modeling

We included a predictive modeling capability in the Nutrient Explorer to provide users with an ability to evaluate the potential effects of anthropogenic stressors and environmental factors on surface water nutrient concentrations. This allows users to generate estimates of nutrient concentrations (TP in this case) for water bodies that have little or no sampling data. One of the modeling approaches we included, RF, is a decision tree-based approach that combines multiple trees (a ‘forest’) for predicting either discrete classes or continuous variables. Predictions are made by averaging over the trees, with each tree built from a subset of the entire dataset (Fox et al. 2017). We included RF modeling as part of the Nutrient Explorer because of its well-documented application in ecological, water quality, and spatial prediction studies (Cutler et al. 2007, Pennino et al. 2020, Wheeler et al. 2015), and because it readily handles large numbers of multicollinear variables, is insensitive to overfitting, and is known to work well with non-linear datasets (Hill et al. 2017, Nolan et al. 2014, Read et al. 2015). RF models are also useful for variable selection because they can generate variable importance rankings, which rank variables based on the % increase in mean squared error (MSE): variables that cause the greatest % increase in MSE when removed from the model are ranked as most important in explaining variation in the response variable (Cutler et al. 2007, Pennino et al. 2020).
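
One common way to obtain a % increase in MSE ranking is permutation importance, in which the values of one predictor are shuffled and the resulting loss of accuracy is measured. The sketch below illustrates that idea in Python under stated assumptions: it is not the application's R code, it uses a fixed toy predictor in place of a fitted random forest, it scores against a supplied data set rather than per-tree out-of-bag samples, and all names and values (`pct_farm`, `noise_var`, the coefficients) are hypothetical.

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between observations and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def pct_increase_mse(predict, rows, y, feature, n_repeats=20, seed=0):
    """Average % increase in MSE after shuffling one feature's values."""
    rng = random.Random(seed)
    base = mse(y, [predict(r) for r in rows])
    increases = []
    for _ in range(n_repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        # Copy each row, replacing only the permuted feature.
        permuted = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
        perm_mse = mse(y, [predict(r) for r in permuted])
        increases.append(100.0 * (perm_mse - base) / base)
    return sum(increases) / n_repeats

# Hypothetical stand-in for a fitted model: log TP rises with % farmland
# and ignores noise_var entirely.
def toy_predict(row):
    return 2.0 + 0.02 * row["pct_farm"]

farm = [5, 10, 20, 35, 50, 60, 75, 90]
noise = [3, 1, 4, 1, 5, 9, 2, 6]
resid = [0.05, -0.03, 0.05, -0.03, 0.05, -0.03, 0.05, -0.03]
rows = [{"pct_farm": f, "noise_var": n} for f, n in zip(farm, noise)]
y = [toy_predict(r) + e for r, e in zip(rows, resid)]

imp_farm = pct_increase_mse(toy_predict, rows, y, "pct_farm")
imp_noise = pct_increase_mse(toy_predict, rows, y, "noise_var")
```

In this toy setting the predictor the model depends on receives a large positive score, while the ignored predictor scores zero, which is the behavior the ranking relies on.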

We also included MLR functionality in the Shiny GUI. This portion of the GUI allows the user to select the most important variable or variables identified by the RF model section of the GUI (based on the % increase in MSE as described above). For our illustrative example, we chose the top 20 predictive variables identified by the RF model (see Table S1). Then, based on Pearson’s correlation analysis, when two variables had a correlation coefficient over 0.9, we used expert opinion to remove the one of the two variables deemed less important in the model. We then identified the most parsimonious combination of linear predictors to explain the spatiotemporal variation in TP concentrations by calculating the Bayesian information criterion (BIC) statistic for each model and selecting the model with the smallest delta BIC (Aho et al. 2014, Akaike 1998), similar to Sabo et al. (2023). Multicollinearity among predictors was also assessed using variance inflation factors (VIF) (Mansfield and Helms 1982, Thompson et al. 2017), and we removed variables with a VIF > 10, a common threshold (Marcoulides and Raykov 2019, Salmeron et al. 2018, Tay 2017), one at a time until no variables had a VIF > 10, similar to Sabo et al. (2023). In addition to common regression output, standardized regression coefficients were generated to evaluate relative effects of predictors on TP concentrations.
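
The delta BIC comparison can be illustrated with a short Python sketch (not the application's R code). It assumes Gaussian linear models fit to the same n observations and uses the form BIC = n ln(RSS/n) + k ln(n), which omits an additive constant that cancels when models are differenced. The candidate model names, residual sums of squares (RSS), and parameter counts k below are invented for illustration.

```python
import math

def bic_gaussian(rss, n, k):
    """BIC (up to an additive constant) for a Gaussian linear model.

    rss: residual sum of squares, n: observations, k: estimated parameters.
    """
    return n * math.log(rss / n) + k * math.log(n)

def delta_bic(models, n):
    """Return {model name: BIC minus the smallest BIC among candidates}."""
    bics = {name: bic_gaussian(rss, n, k) for name, (rss, k) in models.items()}
    best = min(bics.values())
    return {name: value - best for name, value in bics.items()}

# Hypothetical candidates: name -> (RSS, number of parameters).
candidates = {
    "model_A": (540.0, 5),
    "model_B": (530.0, 8),
    "model_C": (600.0, 3),
}
deltas = delta_bic(candidates, n=1000)
selected = min(deltas, key=deltas.get)
```

The selected model always has a delta BIC of zero; in this made-up example a slightly better fit (lower RSS) does not compensate for the three extra parameters, so the smaller model is preferred.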

For illustrative purposes, the Shiny GUI highlights where the model predicts concentrations to be above a user-defined threshold, such as an average concentration reflecting a state’s water quality criterion. For example, five states within the LAGOS-NE region have EPA-approved numeric TP criteria for lakes and reservoirs: Minnesota (12–90 μg/L), Wisconsin (15–40 μg/L), New Jersey (50 μg/L), Rhode Island (25 μg/L), and Vermont (12–18 μg/L). Visually, we find that the within-state concentration ranges for these states reflect the different criteria for lakes of different description and/or geographic location within a given ecoregion. In cases where criterion concentration ranges exist within a state, we compared the maximum value (i.e., TP concentration) to the model-predicted concentrations to understand where states may have criteria exceedance concerns. For states lacking lake TP criteria, we compared the modeled predicted concentrations to an average of the maximum criteria values from the five states (i.e., 45 μg/L).
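
The 45 μg/L value follows from the criteria listed above: the maxima are 90, 40, 50, 25, and 18 μg/L, which sum to 223 and average 44.6 μg/L. A minimal Python sketch of that arithmetic and of the threshold comparison is given below; the dictionary layout and the `exceeds` helper are hypothetical and are not part of the application.

```python
import math

# Criterion ranges (ug/L) for the five states, as listed in the text.
state_tp_criteria = {
    "Minnesota": (12, 90),
    "Wisconsin": (15, 40),
    "New Jersey": (50, 50),
    "Rhode Island": (25, 25),
    "Vermont": (12, 18),
}

# Use the upper end of each state's range for comparison.
max_criteria = {state: hi for state, (lo, hi) in state_tp_criteria.items()}
mean_of_maxima = sum(max_criteria.values()) / len(max_criteria)
default_threshold = round(mean_of_maxima)  # fallback for states without criteria

def exceeds(pred_log_tp, state):
    """True if a predicted ln(TP, ug/L) exceeds the applicable threshold."""
    threshold = max_criteria.get(state, default_threshold)
    return math.exp(pred_log_tp) > threshold

flag_mn = exceeds(math.log(100.0), "Minnesota")  # illustrative prediction
flag_ia = exceeds(math.log(30.0), "Iowa")        # illustrative prediction
```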

3. Results

3.1. Data Exploration

The data exploration section of the Shiny interface shows how TP concentrations vary spatially by HUC2 or HUC8 regions (Figure 1a). HUC2 zone 7 has the most data records for TP at about 25,000, followed by HUC2 zone 4 at around 20,000 TP data records (Figure 1b). The largest portion of TP monitoring data in LAGOS-NE comes from state agency programs, followed by citizen monitoring (Figure 1b). Box plots of TP by HUC2 zone show that HUC2 zones 5, 8, and 10 have the highest median log TP (corresponding to ~50 μg/L), while HUC2 zones 1, 2, and 4 have the lowest median log TP (corresponding to ~12 μg/L) (Figure 1c). A time series of box plots for log TP levels each year between 2000 and 2013 shows that median log TP for the full dataset was steady at ~2.8 (16 μg/L) until 2012 and 2013, when it increased to ~3.2 (25 μg/L) (Figure 1d). The time series of mean log TP concentrations for different program types also allows users to select different program types to display. TP was generally higher for data from the federal agency dataset and lower for the non-profit agency dataset (Figure S1).

Figure 1.

a) Total phosphorus (TP) and how it varies spatially by HUC8 region, as LogTP = ln(TP, μg/L), b) the frequency of samples by program type in each HUC2 region, c) box plots by HUC2 zone showing ln TP concentration, and d) box plots showing how ln TP concentration varies by year.

Using the data exploration section, users can assess correlations between response variables (i.e., in this case TP) and various predictor variables (e.g., lake depth, precipitation, % forest land). The correlation matrix produced in this section can provide scatterplots for comparison, and shows which variables have significant correlations (based on Pearson’s correlation coefficients). Figure 2 includes five of the variables used in the modeling demonstration below. For example, we find a negative correlation (below −0.5) between % farmland (NLCD_pct_farm) and % forest land (NLCD_pct_forest) and a positive correlation (above 0.5) between P deposition (P_Deposition.1) and P from farm fertilizer (P_f_fertilizer.1) (Figure 2, Table S2). Also, all predictor variables can be summarized by HUC2 region, showing which regions may have higher or lower values for different variables (Figure S2).
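
A Pearson correlation screen of this kind can be sketched as follows. The example is a Python illustration rather than the application's R code, and the three short columns are made-up values chosen only to show a strong negative farmland-forest relationship; the column names and the `flag_pairs` helper are hypothetical.

```python
import math
from itertools import combinations

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def flag_pairs(columns, cutoff):
    """List (name_a, name_b, r) for every pair with |r| above the cutoff."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = pearson_r(columns[a], columns[b])
        if abs(r) > cutoff:
            flagged.append((a, b, r))
    return flagged

# Illustrative values only (not taken from the dataset).
columns = {
    "pct_farm": [10, 30, 50, 70, 90],
    "pct_forest": [80, 65, 40, 25, 5],
    "max_depth": [3, 12, 5, 20, 8],
}
flagged = flag_pairs(columns, cutoff=0.5)
```

The same helper with a cutoff of 0.9 would correspond to the screening step used before the MLR, where one variable of each highly correlated pair is dropped.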

Figure 2.

Correlation analysis for some of the common variables used in this analysis.

3.2. Random Forest Modeling

The predictor variables in this dataset were used to run the RF model (see Table S2 for a list of all variables and details). The R-squared, based on the hold-out (testing) dataset, was 0.78, prediction error was 0.20, the root mean squared error was 0.44, and mean bias was 0.005 (Table 1). The scatter plot of the observed LogTP vs. predicted LogTP values for the RF model shows a good one-to-one relationship (Figure S3a). The 10 most important variables in the RF model were inter-lake watershed (IWS) canopy (iws_canopy2001_mean), % farmland (NLCD_pct_farmland), % forest land (NLCD_pct_forest), maximum lake depth (MaxDepth), mass of P from farm fertilizer (P_f_fertilizer.1, a subset of agricultural inputs), P from agricultural inputs (P_Ag_Inputs.1), mean IWS slope (IWS_slope_mean), P from anthropogenic inputs (P_anthro_Inputs.1), P from crop removal (P_Crop_removal.1), and net anthropogenic P input (NAPI), which is anthropogenic P inputs minus P losses (Figure 3a). Based on the partial dependence plots, IWS canopy (iws_canopy2001_mean), IWS slope (iws_slope_mean), maximum lake depth (MaxDepth), and % forest land (NLCD_pct_forest) had a negative relationship with log TP, while % farmland (NLCD_pct_farmland) and the other P variables had a positive relationship with log TP (Figure 3b).

Table 1.

Results from the random forest and multilinear regression models.

Model Uncertainty Metrics                         Random Forest Regression Model
r.squared (reported by Ranger function)           0.78
prediction.error (reported by Ranger function)    0.19
R squared (testing set)                           0.77
R squared (training set)                          0.90
Root Mean Squared Error                           0.45
Mean Bias                                         0.0092
Standard deviation of the error                   0.45

Model Uncertainty Metrics                         Multilinear Regression Model
Multiple R-squared                                0.36
Adjusted R-squared                                0.36
Residual standard error                           0.73
Root Mean Squared Error                           0.73
Mean Bias                                         0.0000
Standard deviation of the error                   0.73

Root Mean Squared Error = square root of the mean of the squared differences between model predictions and observations.

Mean Bias = summation of predictions minus observations divided by number of observations.

Standard deviation (SD) of the error = SD of the model predictions minus observed values.
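
The three error metrics defined above can be computed as in the following Python sketch (an illustration, not the application's R code). One assumption is labeled in the code: the standard deviation uses the sample (n − 1) form. The input values are invented for the example.

```python
import math
import statistics

def error_metrics(predictions, observations):
    """Return RMSE, mean bias, and SD of the error (prediction - observation)."""
    errors = [p - o for p, o in zip(predictions, observations)]
    n = len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    mean_bias = sum(errors) / n
    # Assumption: sample standard deviation (n - 1 denominator).
    sd_error = statistics.stdev(errors)
    return {"rmse": rmse, "mean_bias": mean_bias, "sd_error": sd_error}

# Illustrative ln(TP) values only.
pred = [2.0, 3.0, 4.0]
obs = [2.5, 2.5, 4.0]
metrics = error_metrics(pred, obs)
```

Because the mean squared error equals the squared mean bias plus the (population) variance of the error, RMSE and the SD of the error are nearly identical for large samples whenever the mean bias is close to zero, as is the case for both models in Table 1.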

Figure 3.

Results from RF model: a) top 10 most important variables in the model, ranked according to % increase in mean squared error (MSE) when removed from the model, with variables at the top ranked as more important in the model, b) partial dependence plots showing the relationship between each predictor variable and log(TP, μg/L), where the y-axis is the predicted logTP value and the x-axis is the variable value, and c) map of predicted TP values for each HUC8 in the northeast US.

Partial dependence plots showed that there was a threshold for some variables where a small change in the variable results in a large change in log TP. For example, for maximum lake depth (MaxDepth) below 25 meters, there was a dramatic increase in log TP with decreasing depth. For P fertilizer (P_f_fertilizer) above 5 kg P/ha/yr or net anthropogenic P inputs (NAPI.1) above ~50 kg P/ha/yr, there was a greater increase in lake log TP concentration. However, NLCD % farm (NLCD_pct_farm) and % forest (NLCD_pct_forest) had a more linear relationship with log TP (Figure 3b).
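
A partial dependence curve is obtained by fixing one predictor at each value on a grid for every record, predicting, and averaging the predictions. The Python sketch below illustrates the calculation under stated assumptions: it is not the application's R code, `toy_predict` is a hypothetical stand-in for a fitted model with a built-in step at 25 m depth, and the records are invented.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction over all rows with `feature` fixed at each grid value."""
    curve = []
    for value in grid:
        preds = [predict(dict(r, **{feature: value})) for r in rows]
        curve.append((value, sum(preds) / len(preds)))
    return curve

# Hypothetical stand-in for a fitted model: log TP increases with % farmland
# and is higher by a fixed step for lakes shallower than 25 m.
def toy_predict(row):
    shallow_step = 1.0 if row["max_depth_m"] < 25 else 0.0
    return 2.0 + 0.01 * row["pct_farm"] + shallow_step

rows = [
    {"pct_farm": 10, "max_depth_m": 4},
    {"pct_farm": 40, "max_depth_m": 18},
    {"pct_farm": 70, "max_depth_m": 30},
    {"pct_farm": 90, "max_depth_m": 45},
]
depth_curve = partial_dependence(toy_predict, rows, "max_depth_m", [5, 15, 25, 35])
farm_curve = partial_dependence(toy_predict, rows, "pct_farm", [0, 50, 100])
```

In this toy case the depth curve drops abruptly at the step while the farmland curve rises steadily, mirroring the contrast between threshold-like and more linear responses described above.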

The RF model predicted log TP at the HUC8 scale for the upper Midwest and Northeast regions of the U.S. The highest predicted log TP values (corresponding to ~150 μg/L) were found in Minnesota, Iowa, Wisconsin, and Illinois and the lowest were in northeastern states (Pennsylvania through Maine) (Figure 3c).

3.3. Multilinear Regression Modeling

The top 20 variables from the RF modeling (see Table S1) were selected for the MLR and further refined by removing one of each pair of correlated variables. The adjusted R-squared of the final model was 0.36 (Table 1). The scatter plot of the observed LogTP vs. predicted LogTP values for the MLR model shows a semi-one-to-one relationship (Figure S3b). The following variables were removed due to having correlations greater than 0.9 with other predictors that we kept in the model: P from crop removal (P_Crop_removal.1), mean tree canopy cover (iws_canopy2001_mean), P from agriculture inputs (P_Ag_inputs.1), and P from anthropogenic inputs (P_Anthro_Inputs.1). The landscape P surplus variable (P_Ag_Surplus.1), net anthropogenic P inputs (NAPI), was also removed in the next step due to a high VIF.

Next, we selected the MLR model with the lowest delta BIC value (delta BIC = 0). No other models had a delta BIC < 2; the next lowest delta BIC was seven. The variables selected for the final MLR model with a positive relationship with log TP (ln(TP, μg/L)) were: Elevation, IWS area (iws_ha), land surface temperature (LST_YrMean), % developed land (NLCD_pct_developed), % farmland (NLCD_pct_farm), % wetland (NLCD_pct_wetlands), P from farm fertilizer (P_f_fertilizer), P from human waste (P_human_waste_kg.1), and P from livestock waste (P_livestock_Waste.1) (Figure 4a). The variables selected for the final MLR model with a negative relationship with log TP were: road density (iws_roaddensity_density_mperha), mean slope (iws_slope_mean), max lake depth (MaxDepth), P from deposition (P_Deposition.1), and recovered P (Recovered_P.1) (Figure 4a). Based on the standardized coefficients, the most influential variables in the model were: % farmland (NLCD_pct_farm), maximum lake depth (MaxDepth), watershed area (IWS_ha), temperature (LST_YrMean), P from human waste (P_human_waste_kg.1), P from deposition (P_Deposition.1), and recovered P (Recovered_P.1) (Figure 4a). The spatial distribution of log TP predicted by the MLR model was similar to that from the RF model, though the lowest predicted values were 2.5 log(μg/L) for the MLR model compared to 1 log(μg/L) for the RF model (Figure 4b). Based on Table 1, however, the RF model does a better job of matching the observed data. Based on the mean concentration threshold of 45 μg/L, the states where the lake TP concentrations were predicted to exceed this amount were primarily in the upper Midwest: southwestern Minnesota, Iowa, Illinois, Indiana, and northern Missouri (Figure S4).

Figure 4.

Results from MLR model: a) most important variables in the model and b) map of predicted log TP values for each HUC8 in the northeast.

4. Discussion

4.1. Overview

We have demonstrated an analytical approach for exploring complex environmental datasets and running predictive models to estimate water nutrient levels in locations where there are no observations. The R code and Shiny GUI described can be used for the nutrients TP or TN and any number of explanatory variables. Different datasets will produce different correlations and modeling results. However, given the flexibility and transparency of the Shiny GUI, users can decide which variables to explore and subsequently use in predictive modeling.

In this study of TP across the midwestern and northeastern U.S., we found the highest observed lake TP concentrations in Iowa and Illinois, and the lowest in Maine. This is corroborated by the HUC2 zones 5 and 7 having the highest observed TP. Information on the data source (subset by Program Type) can be useful to watershed managers interested in seeing what might be driving higher concentrations. The correlation analysis developed can help users discover the direction of influence (i.e., whether positive or negative) and which predictor variables are most highly correlated with the endpoint or constituent of interest.

4.2. Data Exploration

The Data Exploration section of the Shiny GUI allows for quickly identifying where sampling data are most prevalent, by whom they were collected, and what spatial and temporal patterns in concentration may exist. This information can be used to subset the data, based on user needs and preference, for further exploratory analyses and predictive modeling. Knowing spatial and statistical distributions of observed concentrations as well as relationships between predictor and response variables can help a user better interpret the findings of predictive models.

4.3. Predictive Modeling

The predictive modeling section provides various means of interpreting the results and illuminating relationships among the variables in a dataset. For example, the RF modeling provides a ranking of the most important variables and partial dependence plots can be used to understand the direction of the relationship between dependent and independent variables. The standardized coefficients plot for the MLR results also shows the direction of the relationship between the endpoint and predictor variables. Note that the predictive models are limited to interpolating concentrations within the spatial and temporal bounds of the dataset being explored.

From both the RF and MLR modeling presented in this paper, we found that the main drivers of TP concentration and spatial patterns thereof were percent farmland, P inputs from fertilizer, and maximum lake depth. Not unexpectedly, greater P inputs and smaller lake depths were associated with higher TP concentrations in lakes. Interestingly, the top 10 most influential variables from the RF and MLR models were not the same. Lake canopy cover, P inputs from agriculture, and anthropogenic P inputs were removed from the MLR model due to high correlations with other variables, and % forest was not included in the final MLR model. Other variables in common were watershed slope, % forest, and P from fertilizer. The MLR modeling results showed that P from human waste was in the top 10 list, as well as recovered P and P from deposition. The directions of the relationships for the covariates were the same in both the RF and MLR models.

The prediction maps from RF and MLR are similar, but the accuracy of the two models differs substantially. The RF model produced a much higher R-squared value than the MLR model, indicating that the RF model explained more of the variance in the observed lake TP concentrations. Both models capture broad regional patterns, but RF can capture local variability better by incorporating the unique information afforded by all predictors used in the analysis.
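As a reminder of what the R-squared comparison measures, the minimal Python sketch below computes the coefficient of determination for two sets of predictions against the same observations. All numbers are invented for illustration and are not results from this study.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed TP (ug/L) and predictions from two hypothetical models.
observed = [10.0, 20.0, 30.0, 40.0]
pred_model_a = [12.0, 18.0, 31.0, 39.0]  # tracks lake-to-lake variation closely
pred_model_b = [20.0, 20.0, 30.0, 30.0]  # captures only the broad pattern

r2_a = r_squared(observed, pred_model_a)
r2_b = r_squared(observed, pred_model_b)
print(round(r2_a, 2), round(r2_b, 2))
```

Both hypothetical models reproduce the overall gradient, but the one that follows individual observations more closely leaves a smaller residual sum of squares and therefore has the higher R-squared.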

4.4. Management Relevance

In this study, we used observed lake TP concentrations to predict mean concentrations in lakes that lack sampling data but lie within the general spatial and temporal bounds of the dataset. Based on the prediction map, the RF model predicts mean lake TP concentrations above the 90 μg/L numeric TP criterion in more than half of Minnesota’s HUC8s, above the 40 μg/L TP criterion in some of Wisconsin’s HUC8s, and above the 50 μg/L TP criterion in about one third of New Jersey’s HUC8s, primarily in the southern half of that state. The model, however, predicts TP concentrations in all HUC8s in Rhode Island and Vermont to be below their respective TP criteria of 25 μg/L and 18 μg/L. Assuming a hypothetical 45 μg/L threshold value for the other states in the LAGOS-NE region, we find that approximately 50% of the HUC8s are predicted to have TP concentrations above this threshold, while some states, particularly in the northeast, have few or no HUC8s predicted to be above it. The states with the most lakes predicted to be above a threshold are generally found in the areas where agriculture is most widely and intensively practiced. Irrespective of model power, it is important to note that the model predictions discussed here are not equivalent to observations and therefore cannot be used as sole determinants of waterbody impairment. Note that because the GUI presented in this paper is not limited to P in lakes, a similar analysis can be done with N for lakes or for nutrient concentrations in other surface waters, such as rivers and streams.
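The comparison of predicted HUC8 mean concentrations with state criteria can be sketched in a few lines of Python. The criteria values below are those quoted in the text, and 45 μg/L is the hypothetical threshold assumed for states without a listed criterion; the HUC8 labels and predicted concentrations are invented for illustration and are not model output.

```python
# State numeric TP criteria (ug/L) as quoted in the text.
TP_CRITERIA = {"MN": 90.0, "WI": 40.0, "NJ": 50.0, "RI": 25.0, "VT": 18.0}
# Hypothetical threshold assumed for states without a listed criterion.
DEFAULT_THRESHOLD = 45.0


def fraction_exceeding(predictions, criteria=TP_CRITERIA, default=DEFAULT_THRESHOLD):
    """Return, per state, the fraction of HUC8s whose predicted TP exceeds the threshold."""
    counts = {}  # state -> [number exceeding, total HUC8s]
    for state, _huc8, tp in predictions:
        threshold = criteria.get(state, default)
        exceed_total = counts.setdefault(state, [0, 0])
        exceed_total[0] += tp > threshold
        exceed_total[1] += 1
    return {state: n_exc / n for state, (n_exc, n) in counts.items()}


# Invented (state, HUC8 label, predicted mean TP in ug/L) records.
predicted = [
    ("MN", "huc8_a", 120.0),
    ("MN", "huc8_b", 95.0),
    ("MN", "huc8_c", 60.0),
    ("VT", "huc8_d", 12.0),
    ("OH", "huc8_e", 70.0),
    ("OH", "huc8_f", 30.0),
]

exceedance = fraction_exceeding(predicted)
for state, frac in exceedance.items():
    print(state, round(frac, 2))
```

As noted above, such a tally describes model predictions only and is not a substitute for observations when determining impairment.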

In addition to direct nutrient predictions, model results also can be used to assess indirect effects, such as where chlorophyll-a levels in lakes or other surface waters may be high due to elevated TP and TN. Systematic literature reviews and other studies have found positive relationships between TP and chlorophyll-a concentrations in lotic systems for states within the LAGOS-NE region (Bennett et al. 2021, Rowland et al. 2019). This means that there may be an elevated risk for high chlorophyll-a and harmful algal blooms where TP or TN is predicted to be the greatest (Chaffin et al. 2021).

Better management of nutrients in lakes or other surface waters can also be informed by the RF partial dependence plots, which show how the most influential variables are related to the nutrient concentrations in the water. For example, as shown in this paper, managers could prioritize lakes or other surface waters vulnerable to elevated lake P concentrations, which are predicted for lakes shallower than a certain threshold (e.g., 25 m) or when net anthropogenic P inputs exceed a certain threshold (e.g., 50 kg P/yr/ha). Additionally, surplus P added to the landscape from agricultural fertilizer may be a large factor in driving lake P levels (Kast et al. 2021, Sabo et al. 2021b). However, even after reductions in P from various point and non-point sources, legacy P surplus from historical P releases may still contribute to elevated lake TP levels (Sabo et al. 2021b, Stoddard et al. 2016). Yet, potential ways to reduce P loads from agriculture still exist, such as better land management, use of best management practices (BMPs) like riparian buffers, and improving farmer knowledge and efficient use of fertilizer (Baulch et al. 2019, Wilson et al. 2019).
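A threshold-based screen of the kind described above might look like the following Python sketch. The default thresholds are the example values given in the text (25 m and 50 kg P/yr/ha); the lake names, field names, and values are hypothetical, and real thresholds would need to be read from the partial dependence plots for the dataset at hand.

```python
def flag_priority_lakes(lakes, depth_threshold_m=25.0, p_input_threshold=50.0):
    """Return names of lakes that are shallow or receive high net anthropogenic P."""
    flagged = []
    for lake in lakes:
        shallow = lake["max_depth_m"] < depth_threshold_m
        high_p = lake["net_anthro_p_kg_ha_yr"] > p_input_threshold
        if shallow or high_p:
            flagged.append(lake["name"])
    return flagged


# Invented lakes for illustration only.
example_lakes = [
    {"name": "lake_a", "max_depth_m": 6.0, "net_anthro_p_kg_ha_yr": 12.0},
    {"name": "lake_b", "max_depth_m": 40.0, "net_anthro_p_kg_ha_yr": 65.0},
    {"name": "lake_c", "max_depth_m": 55.0, "net_anthro_p_kg_ha_yr": 8.0},
]

priority = flag_priority_lakes(example_lakes)
print(priority)
```

In this sketch a lake is flagged if either condition holds; a manager could instead require both conditions, or rank lakes by how far they fall beyond each threshold.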

5. Conclusions

In this analysis of TP in lakes across 17 northeastern and northern midwestern U.S. states, we demonstrate how water quality and landscape variables can be used in conjunction to predict water quality in data-poor locations. The R code and Shiny GUI developed have a variety of useful features, including 1) visualization of temporal and spatial patterns of surface water nutrient concentrations; 2) analysis of datasets to identify correlations between independent and dependent variables; and 3) modeling and prediction of nutrient concentrations that may be exceeding levels of concern in surface waters. Additionally, the GUI provides an easy way for users to download prediction maps and tables of prediction results.

The framework presented for analyzing and predicting nutrient concentrations offers a statistically sound approach for assessing water quality and helping managers to prioritize watersheds and waterbodies for mitigative activities using publicly available environmental and nutrient inventory datasets (Sabo et al. 2021a), even with limited data availability or coverage.

Supplementary Material

s1
s2

Highlights.

  • A downloadable interactive graphical user interface quantifies the relationships between surface water nutrient concentrations and landscape variables.

  • To demonstrate this application, total phosphorus (TP) concentration measurements from lakes in 17 northern midwestern and northeastern U.S. states (LAGOS-NE) were combined with watershed-scale landscape metrics to visualize relationships across space and time.

  • Lake TP is correlated with certain anthropogenic, landscape, and climatic metrics.

  • Random forest and multilinear regression models were developed to determine which factors best explain spatial variability in TP and where lake TP is highest across the Upper Midwest and Northeast U.S.

6. Acknowledgements

We thank the Environmental Modeling and Visualization Laboratory (EMVL) members Yadong Xu and Ray Burton and EPA OMS’s Heidi Paulsen for developing the R Shiny UI framework for this paper. We also thank Ben Washington who helped develop the dataset used in this analysis and some of the initial R code for the visualizations and modeling. We thank Sylvia Lee and Micah Bennett who provided helpful comments and review of the manuscript prior to submission. Disclaimer: The views expressed in this article are those of the authors and do not necessarily represent the views or the policies of the U.S. Environmental Protection Agency. Any mention of trade names, manufacturers or products does not imply an endorsement by the United States Government or the U.S. Environmental Protection Agency.

Software and Data Availability

  • Software Name: Nutrient Explorer.

  • Developer: U.S. Environmental Protection Agency.

  • Email: pennino.michael@epa.gov

  • First year available: 2023.

  • Hardware requirements: PC/Mac.

  • Software requirements: R statistical environment and language; RStudio

  • Program language: R.

  • Program size: 334 MB (download size).

  • RAM: can use up to an additional 5.0 GB of RAM.

  • Availability: https://cfpub.epa.gov/si/si_public_record_report.cfm?Lab=CPHEA&dirEntryId=358039

  • Cost: Free; Open Source.

7. References

  1. Aho K, Derryberry D and Peterson T (2014) Model selection for ecologists: the worldviews of AIC and BIC. Ecology 95(3), 631–636.
  2. Akaike H (1998) Information theory and an extension of the maximum likelihood principle. Selected Papers of Hirotugu Akaike, 199–213.
  3. Alnahit AO, Mishra AK and Khan AA (2022) Stream water quality prediction using boosted regression tree and random forest models. Stochastic Environmental Research and Risk Assessment 36(9), 2661–2680.
  4. Amirbahman A, Fitzgibbon KN, Norton SA, Bacon LC and Birkel SD (2022) Controls on the epilimnetic phosphorus concentration in small temperate lakes. Environmental Science: Processes & Impacts 24(1), 89–101.
  5. Basu NB, Dony J, Van Meter K, Johnston SJ and Layton AT (2023) A random forest in the Great Lakes: Stream nutrient concentrations across the transboundary Great Lakes Basin. Earth’s Future 11(4), e2021EF002571.
  6. Baulch HM, Elliott JA, Cordeiro MR, Flaten DN, Lobb DA and Wilson HF (2019) Soil and water management: Opportunities to mitigate nutrient losses to surface waters in the northern Great Plains. Environmental Reviews 27(4), 447–477.
  7. Bennett MG, Lee SS, Schofield KA, Ridley CE, Washington BJ and Gibbs DA (2021) Response of chlorophyll a to total nitrogen and total phosphorus concentrations in lotic ecosystems: a systematic review. Environmental Evidence 10(1), 1–25.
  8. Bremigan MT, Soranno PA, Gonzalez MJ, Bunnell DB, Arend KK, Renwick WH, Stein RA and Vanni MJ (2008) Hydrogeomorphic features mediate the effects of land use/cover on reservoir productivity and food webs. Limnology and Oceanography 53(4), 1420–1433.
  9. Brooks W, Corsi S, Fienen M and Carvin R (2016) Predicting recreational water quality advisories: a comparison of statistical methods. Environmental Modelling & Software 76, 81–94.
  10. Carney E (2009) Relative influence of lake age and watershed land use on trophic state and water quality of artificial lakes in Kansas. Lake and Reservoir Management 25(2), 199–207.
  11. Chaffin JD, Bratton JF, Verhamme EM, Bair HB, Beecher AA, Binding CE, Birbeck JA, Bridgeman TB, Chang X and Crossman J (2021) The Lake Erie HABs Grab: A binational collaboration to characterize the western basin cyanobacterial harmful algal blooms at an unprecedented high-resolution spatial scale. Harmful Algae 108, 102080.
  12. Chang W, Cheng J, Allaire J, Sievert C, Schloerke B, Xie Y, Allen J, McPherson J, Dipert A and Borges B (2023) shiny: Web Application Framework for R. R package version 1.7.2. https://cran.r-project.org/web/packages/shiny/.
  13. Cross TK and Jacobson PC (2013) Landscape factors influencing lake phosphorus concentrations across Minnesota. Lake and Reservoir Management 29(1), 1–12.
  14. Cutler DR, Edwards TC, Beard KH, Cutler A, Hess KT, Gibson J and Lawler JJ (2007) Random forests for classification in ecology. Ecology 88(11), 2783–2792.
  15. Diaz R, Selman M and Chique C (2017) Global eutrophic and hypoxic coastal systems. Eutrophication and hypoxia: nutrient pollution in coastal waters.
  16. Dodds WK, Bouska WW, Eitzmann JL, Pilger TJ, Pitts KL, Riley AJ, Schloesser JT and Thornbrugh DJ (2009) Eutrophication of US Freshwaters: Analysis of Potential Economic Damages. Environmental Science & Technology 43(1), 12–19.
  17. Fergus EC, Soranno PA, Cheruvelil KS and Bremigan MT (2011) Multiscale landscape and wetland drivers of lake total phosphorus and water color. Limnology and Oceanography 56(6), 2127–2146.
  18. Fox EW, Hill RA, Leibowitz SG, Olsen AR, Thornbrugh DJ and Weber MH (2017) Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology. Environmental Monitoring and Assessment 189(7), 316.
  19. Fu B, Merritt WS, Croke BF, Weber TR and Jakeman AJ (2019) A review of catchment-scale water quality and erosion models and a synthesis of future prospects. Environmental Modelling & Software 114, 75–97.
  20. Galbraith LM and Burns CW (2007) Linking land-use, water body type and water quality in southern New Zealand. Landscape Ecology 22, 231–241.
  21. Hill RA, Fox EW, Leibowitz SG, Olsen AR, Thornbrugh DJ and Weber MH (2017) Predictive mapping of the biotic condition of conterminous US rivers and streams. Ecological Applications 27(8), 2397–2415.
  22. Hollister JW, Milstead WB and Kreakie BJ (2016) Modeling lake trophic state: a random forest approach. Ecosphere 7(3), e01321.
  23. Jones J, Knowlton M, Obrecht D and Cook E (2004) Importance of landscape variables and morphology on nutrients in Missouri reservoirs. Canadian Journal of Fisheries and Aquatic Sciences 61(8), 1503–1512.
  24. Kast JB, Apostel AM, Kalcic MM, Muenich RL, Dagnew A, Long CM, Evenson G and Martin JF (2021) Source contribution to phosphorus loads from the Maumee River watershed to Lake Erie. Journal of Environmental Management 279, 111803.
  25. Łaszewski M, Fedorczyk M and Stępniewski K (2022) The Impact of Land Cover on Selected Water Quality Parameters in Polish Lowland Streams during the Non-Vegetative Period. Water 14(20), 3295.
  26. Lin J, Compton JE, Hill RA, Herlihy AT, Sabo RD, Brooks JR, Weber M, Pickard B, Paulsen SG and Stoddard JL (2021) Context is everything: interacting inputs and landscape characteristics control stream nitrogen. Environmental Science & Technology 55(12), 7890–7899.
  27. Mansfield ER and Helms BP (1982) Detecting multicollinearity. The American Statistician 36(3a), 158–160.
  28. Marcoulides KM and Raykov T (2019) Evaluation of variance inflation factors in regression models using latent variable modeling methods. Educational and Psychological Measurement 79(5), 874–882.
  29. Martinsen KT and Sand-Jensen K (2022) Predicting water quality from geospatial lake, catchment, and buffer zone characteristics in temperate lowland lakes. Science of the Total Environment 851, 158090.
  30. Murphy RR, Perry E, Harcum J and Keisman J (2019) A generalized additive model approach to evaluating water quality: Chesapeake Bay case study. Environmental Modelling & Software 118, 1–13.
  31. Nahkala BA, Kaleita AL and Soupir ML (2022) Empirical tool development for prairie pothole management using AnnAGNPS and random forest. Environmental Modelling & Software 147, 105241.
  32. Nielsen A, Trolle D, Søndergaard M, Lauridsen TL, Bjerring R, Olesen JE and Jeppesen E (2012) Watershed land use effects on lake water quality in Denmark. Ecological Applications 22(4), 1187–1200.
  33. Nolan BT, Gronberg JM, Faunt CC, Eberts SM and Belitz K (2014) Modeling nitrate at domestic and public-supply well depths in the Central Valley, California. Environmental Science & Technology 48(10), 5643–5651.
  34. Park S-R and Lee S-W (2020) Spatially varying and scale-dependent relationships of land use types with stream water quality. International Journal of Environmental Research and Public Health 17(5), 1673.
  35. Pennino MJ, Leibowitz SG, Compton JE, Hill RA and Sabo RD (2020) Patterns and predictions of drinking water nitrate violations across the conterminous United States. Science of the Total Environment 722, 137661.
  36. R Development Core Team (2022) R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.
  37. Read EK, Patil VP, Oliver SK, Hetherington AL, Brentrup JA, Zwart JA, Winters KM, Corman JR, Nodine ER and Woolway RI (2015) The importance of lake‐specific characteristics for water quality across the continental United States. Ecological Applications 25(4), 943–955.
  38. Rowland FE, Stow CA, Johengen TH, Burtner AM, Palladino D, Gossiaux DC, Davis TW, Johnson LT and Ruberg S (2019) Recent patterns in Lake Erie phosphorus and chlorophyll a concentrations in response to changing loads. Environmental Science & Technology 54(2), 835–841.
  39. RStudio Team (2022) RStudio: Integrated Development Environment for R. RStudio, PBC, Boston, MA: URL http://www.rstudio.com/.
  40. Sabo RD, Clark CM, Bash J, Sobota D, Cooter E, Dobrowolski JP, Houlton BZ, Rea A, Schwede D and Morford SL (2019) Decadal shift in nitrogen inputs and fluxes across the contiguous United States: 2002–2012. Journal of Geophysical Research: Biogeosciences 124(10), 3104–3124.
  41. Sabo RD, Clark CM and Compton JE (2021a) Considerations when using nutrient inventories to prioritize water quality improvement efforts across the US. Environmental Research Communications 3(4), 045005.
  42. Sabo RD, Clark CM, Gibbs DA, Metson GS, Todd MJ, LeDuc SD, Greiner D, Fry MM, Polinsky R and Yang Q (2021b) Phosphorus inventory for the conterminous United States (2002–2012). Journal of Geophysical Research: Biogeosciences 126(4), e2020JG005684.
  43. Sabo RD, Pickard B, Lin J, Washington B, Clark CM, Compton JE, Pennino M, Bierwagen B, LeDuc S and Carleton JN (2023) Comparing drivers of spatial variability in US lake and stream phosphorus concentrations. Journal of Geophysical Research: Biogeosciences, e2022JG007227.
  44. Salmeron R, García C and García J (2018) Variance inflation factor and condition number in multiple linear regression. Journal of Statistical Computation and Simulation 88(12), 2365–2384.
  45. Smith VH (2016) Effects of eutrophication on maximum algal biomass in lake and river ecosystems. Inland Waters 6(2), 147–154.
  46. Soranno PA, Bacon LC, Beauchene M, Bednar KE, Bissell EG, Boudreau CK, Boyer MG, Bremigan MT, Carpenter SR and Carr JW (2017) LAGOS-NE: a multi-scaled geospatial and temporal database of lake ecological context and water quality for thousands of US lakes. GigaScience 6(12), gix101.
  47. Soranno PA, Cheruvelil KS, Wagner T, Webster KE and Bremigan MT (2015) Effects of land use on lake nutrients: the importance of scale, hydrologic connectivity, and region. Plos One 10(8), e0135454.
  48. Stoddard JL, Van Sickle J, Herlihy AT, Brahney J, Paulsen S, Peck DV, Mitchell R and Pollard AI (2016) Continental-scale increase in lake and stream phosphorus: Are oligotrophic systems disappearing in the United States? Environmental Science & Technology 50(7), 3409–3415.
  49. Tay R (2017) Correlation, variance inflation and multicollinearity in regression model. Journal of the Eastern Asia Society for Transportation Studies 12, 2006–2015.
  50. Thompson CG, Kim RS, Aloe AM and Becker BJ (2017) Extracting the variance inflation factor and other multicollinearity diagnostics from typical regression results. Basic and Applied Social Psychology 39(2), 81–90.
  51. Tufford DL, McKellar HN Jr and Hussey JR (1998) In‐stream nonpoint source nutrient prediction with land‐use proximity and seasonality, Wiley Online Library.
  52. U.S. EPA (2017) Water Quality Standards Handbook: Chapter 3: Water Quality Criteria. EPA-823-B-17-001. EPA Office of Water, Office of Science and Technology, Washington, DC. Accessed February 2023. https://www.epa.gov/sites/production/files/2014-10/documents/handbook-chapter3.pdf.
  53. U.S. EPA (2023a) State Progress Toward Adopting Numeric Nutrient Water Quality Criteria for Nitrogen and Phosphorus. Accessed March 2023. https://www.epa.gov/nutrient-policy-data/state-progress-toward-adopting-numeric-nutrient-water-quality-criteria.
  54. U.S. EPA (2023b) U.S. Environmental Protection Agency. National Primary Drinking Water Regulations. Accessed November 2022. https://www.epa.gov/ground-water-and-drinking-water/national-primary-drinking-water-regulations.
  55. Wagner T and Schliep EM (2018) Combining nutrient, productivity, and landscape‐based regressions improves predictions of lake nutrients and provides insight into nutrient coupling at macroscales. Limnology and Oceanography 63(6), 2372–2383.
  56. Wheeler DC, Nolan BT, Flory AR, DellaValle CT and Ward MH (2015) Modeling groundwater nitrate concentrations in private wells in Iowa. Science of the Total Environment 536, 481–488.
  57. Wilson RS, Beetstra MA, Reutter JM, Hesse G, Fussell KMD, Johnson LT, King KW, LaBarge GA, Martin JF and Winslow C (2019) Commentary: Achieving phosphorus reduction targets for Lake Erie. Journal of Great Lakes Research 45(1), 4–11.
  58. Zhang T, Soranno PA, Cheruvelil KS, Kramer DB, Bremigan MT and Ligmann-Zielinska A (2012) Evaluating the effects of upstream lakes and wetlands on lake phosphorus concentrations using a spatially-explicit model. Landscape Ecology 27(7), 1015–1030.
