PLOS ONE. 2020 Sep 22;15(9):e0238422. doi: 10.1371/journal.pone.0238422

SSNdesign—An R package for pseudo-Bayesian optimal and adaptive sampling designs on stream networks

Alan R Pearse 1,2,*, James M McGree 2,3, Nicholas A Som 4,5, Catherine Leigh 1,2,3,¤, Paul Maxwell 6, Jay M Ver Hoef 7, Erin E Peterson 1,2,3
Editor: Andreas C Bryhn
PMCID: PMC7508409  PMID: 32960894

Abstract

Streams and rivers are biodiverse and provide valuable ecosystem services. Maintaining these ecosystems is an important task, so organisations often monitor the status and trends in stream condition and biodiversity using field sampling and, more recently, autonomous in-situ sensors. However, data collection is often costly, so effective and efficient survey designs are crucial to maximise information while minimising costs. Geostatistics and optimal and adaptive design theory can be used to optimise the placement of sampling sites in freshwater studies and aquatic monitoring programs. Geostatistical modelling and experimental design on stream networks pose statistical challenges due to the branching structure of the network, flow connectivity and directionality, and differences in flow volume. Geostatistical models for stream network data and their unique features already exist. Some basic theory for experimental design in stream environments has also previously been described. However, open source software that makes these design methods available for aquatic scientists does not yet exist. To address this need, we present SSNdesign, an R package for solving optimal and adaptive design problems on stream networks that integrates with existing open-source software. We demonstrate the mathematical foundations of our approach, and illustrate the functionality of SSNdesign using two case studies involving real data from Queensland, Australia. In both case studies we demonstrate that the optimal or adaptive designs outperform random and spatially balanced survey designs implemented in existing open-source software packages. The SSNdesign package has the potential to boost the efficiency of freshwater monitoring efforts and provide much-needed information for freshwater conservation and management.

Introduction

Streams and rivers are highly biodiverse ecosystems supporting both aquatic and terrestrial species [1, 2] and provide important ecosystem services including clean water, food, and energy [3]. The ecological and economic importance of waterways has driven government and non-government organisations worldwide to invest large amounts of time and money into their monitoring, assessment and rehabilitation [4]. However, monitoring data remain relatively sparse [5] because the cost of sampling makes it impossible to gather data everywhere, on every stream, at all times. Thus, it is crucial to select sampling locations that yield as much information as possible about water quality and aquatic ecosystem health, especially when the stream system is large and resources for sampling are limited.

Geostatistical models are commonly used to analyse environmental data collected at different locations and to make predictions, with estimates of uncertainty, at unobserved (i.e. unsampled) sites [6]. These models are a generalisation of the classic linear regression model, which contains a deterministic mean describing the relationship between the response (i.e. dependent variable) and the covariates (i.e. independent variables). In a geostatistical model, the assumption of independence of the errors is relaxed to allow spatial autocorrelation, which is modelled as a function of the distance separating any two locations [7]. This provides a way to extract additional information from the data by modelling local deviations from the mean using the spatial autocorrelation, or covariance, between sites. However, spatial autocorrelation may exist in streams data that is not well described using Euclidean distance, given the branching network structure, stream flow connectivity, direction and volume [8]. In addition, many traditional covariance functions are invalid if an in-stream (i.e. hydrologic) distance measure is substituted for Euclidean distance [4, 9]. The use of covariance functions based on Euclidean distance may produce physically implausible results; for example, implying that two adjacent streams that do not flow into each other and that have separate watersheds are strongly related. This led to the development of covariance functions that are specifically designed to describe the unique spatial relationships found in streams data [4, 10]. Geostatistical models fit to streams data describe a number of in-stream relationships in a way that is scientifically consistent with the hydrological features of natural streams and, as such, are increasingly being used for broad-scale monitoring and modelling of stream networks; see, for example, Isaak et al. [11] and Marsha et al. [12], both of which model temperature in streams, with Marsha et al. [12] further considering questions of site placement and sample size based on their data.

The theoretical properties of geostatistical models can also be exploited in optimal and adaptive experimental designs [13–16], which are used to select sampling locations that maximize information gain and minimize costs. However, the exact locations included in an optimal design will depend on the objectives of the monitoring program. Common objectives include estimating the parameters of the underlying geostatistical model (e.g. fixed effects estimates and/or covariance parameters), making accurate predictions at unsampled locations, or both. Utility functions are mathematical representations of the objectives used to measure the suitability of a design for a specific purpose. Depending on the objective of the sampling, the best design might be one that includes spatially balanced sites distributed across the study area or it could be a design that includes clusters of sites in close proximity to one another [17]. A variety of utility functions are available [15, 16] and are described more specifically in Section 2.4. An adaptive design (i.e. sequential design) is constructed by making a series of optimal decisions about where to put sampling sites as new information becomes available over time [14]. For example, the spatial location of monitoring sites may change through time, with some sites removed due to changes in access, or additional sites added as new funding becomes available. In these situations, the information gained from the data collected up to that point can be used to inform where the optimal sampling locations will be at the next time step. Hence, adaptive designs may provide additional benefits for long-term environmental monitoring programs because one-off optimal designs ignore the evolving nature of environmental processes and do not allow for adjustments as monitoring needs change [17].

Bayesian and pseudo-Bayesian methods can enhance optimal and adaptive designs. Utility functions often depend on the parameters of the geostatistical model that one intends to fit over the design; however, the utility function can only be evaluated when these parameter values are fixed [13]. If the values change, for example, through random variations in field conditions, then the design may no longer be optimal. Bayesian and pseudo-Bayesian optimal design addresses this issue by using simulation to construct more robust designs and to incorporate prior information about the distribution of the model parameters when constructing the design [18]. A drawback is that these methods are computationally intensive [18, 19]. In SSNdesign, we use the pseudo-Bayesian approach. This differs from a fully Bayesian approach because we are committed to performing frequentist inference on the data we collect from an experiment. The pseudo-Bayesian approach also does not take a Bayesian view of uncertainty, particularly with respect to model uncertainty, and as such we do not always have access to Bayesian utilities. Nevertheless, the pseudo-Bayesian approach allows us to incorporate prior information in the design process, which is not possible for purely frequentist designs, and can be more computationally efficient than the fully Bayesian approach.

Although numerous software packages have been developed to implement geostatistical models on streams and to solve experimental design problems, none have done both. The SSN package [20] for the R statistical software [21] is currently the only software available for implementing geostatistical models on stream networks [10]. However, various software packages exist to solve experimental design problems. For example, acebayes provides an implementation of the approximate coordinate exchange algorithm for finding optimal Bayesian designs given a user-specified utility function [22]. For spatial design problems, spsurvey [23] implements a variety of sampling designs including the Generalised Random Tessellation Stratified (GRTS) design for spatially balanced samples [24]. The package geospt [25] focuses on drawing optimal and adaptive spatial samples in the conventional 2-D geostatistical domain, with Euclidean distance used to describe spatial relationships between locations. However, it does not allow for stream-specific distance measures and covariance functions, nor does it calculate design criteria consistent with a Bayesian or pseudo-Bayesian approach. This is important for constructing designs that are robust to changes in the parameter values on which the utility function depends. Som et al. [15] and Falk et al. [16] made use of the geostatistical modelling functions in the SSN package for solving design problems on stream networks, but neither addressed adaptive design problems and both used customised code that was not made publicly available.

To our knowledge, SSNdesign is the first software package that allows users to optimise experimental designs on stream networks using geostatistical models and covariance functions that account for unique stream characteristics within a robust design framework. It combines the statistical modelling functionality found in the SSN package [20] with principles of pseudo-Bayesian design [13] into a generalised toolkit for solving optimal and adaptive design problems on stream networks. In Section 2, we discuss the mathematical principles underpinning these tools and outline the structure of the core functions in the SSNdesign package, along with a summary of the package’s speed and performance. This section is extended by S1 Appendix, which gives a deeper treatment of the required mathematics. In Section 3 we present two case studies using real data from Queensland, Australia. S2 Appendix is the package vignette, and provides the reader with the detailed code required to reproduce the examples. We conclude with a brief discussion of the package and future developments in Section 4. A glossary of terms is also provided in S3 Appendix for those unfamiliar with the language of experimental design.

The SSNdesign package

Software and data availability

The SSNdesign package is publicly available at https://github.com/apear9/SSNdesign, along with the data used in this paper. This R package requires R version 3.5.0 or later, and depends on the packages SSN [20], doParallel [26], doRNG [27], spsurvey [23], and shp2graph [28]. These are downloaded from CRAN during the installation of the package.

Workflow

There are three different workflows in SSNdesign corresponding to the different design types, plus a workflow for importing and manipulating stream network data (Figs 1 and 2). The general process is to import a stream dataset (see Section 2.2) and, if necessary, create potential sampling locations and simulate data at those locations. The next step is to use the main workhorse function optimiseSSNDesign to find optimal or adaptive designs, or drawStreamNetworkSamples to find probability-based designs. Adaptive design problems always involve multiple ‘timesteps’. The first timestep is important because adaptive designs cannot be constructed without a pre-existing design. It is therefore not possible to go straight from data processing to an adaptive design; the first design decision must be based on either an optimal or probability-based design before the adaptive design process is implemented in subsequent timesteps (Fig 2).
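This end-to-end workflow can be sketched in a few lines of R. The function names below are those given in the text and in Figs 1 and 2, but the argument names and values are illustrative assumptions rather than the package's documented interface.

```r
library(SSNdesign)

# Import a stream network with no observed sites from a .ssn folder
streams <- importStreams("my_network.ssn")

# Create candidate sampling locations on the empty network
# ('n' is an assumed argument name)
streams <- generateSites(streams, n = 200)

# Find an optimal design of 50 sites with the main workhorse function
# (the utility-function and argument names here are assumptions)
opt <- optimiseSSNDesign(ssn = streams, n.points = 50,
                         utility.function = KOptimality)

# Or draw a probability-based design such as GRTS
prb <- drawStreamNetworkSamples(streams, sample.method = "GRTS", n = 50)
```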

Fig 1. A flow chart of the function calls used to import and prepare streams data for use in SSNdesign.


Grey boxes with solid outlines indicate a call to a function (N.B. importSSN and SimulateOnSSN belong to SSN, not SSNdesign). Clear boxes with dashed outlines indicate a file, folder or R object that is created as a result of a function call. The one dashed line represents an action that will not always be necessary.

Fig 2. A flow chart of the function calls used to construct optimal (blue), adaptive (red), and probability-based (green) designs for stream networks.


The yellow box is discussed in more detail in Fig 1. Grey boxes with solid outlines represent a call to a function (N.B. glmssn belongs to SSN, not SSNdesign). Clear boxes with dashed outlines indicate a file, folder or R object that is created as a result of a function call. The dash-dot-dot lines represent optional steps or connections between function calls that do not always occur.

Data format, ingestion and manipulation

The SSNdesign package builds on the functionality in the package SSN [20], most notably the S4 SpatialStreamNetwork class for stream network datasets and the glmssn function and class, which fits and stores fitted spatial stream network models. SpatialStreamNetwork objects are an extension of the sp class, but are unique because they contain a set of stream lines (i.e. edges) and a set of observed sites. Prediction sites can also be included, but are optional. These spatial data are imported from .ssn folders, which are created using the Spatial Tools for the Analysis of River Systems (STARS) custom toolset for ArcGIS [29]. The importSSN function in the SSN package is used to ingest data contained in .ssn folders, but it will not work if there are no observed sites in the .ssn folder. Therefore, the SSNdesign package provides the function importStreams for creating a SpatialStreamNetwork object with no observed or prediction sites. Additional functions such as generateSites are provided to add potential sampling sites to these empty networks. Four distinct workflows for importing and preprocessing stream data are shown in Fig 1, but the end result is always a SpatialStreamNetwork object containing streams and observed sites, with collected or simulated data, that can be used for optimal or adaptive design.
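As a concrete illustration of the two ingestion routes, the sketch below contrasts SSN's importer (for .ssn folders that already contain observed sites) with SSNdesign's importStreams (for empty networks). The importSSN call follows the SSN package's documented interface; the generateSites argument name is an assumption.

```r
library(SSN)
library(SSNdesign)

# A .ssn folder with observed (and optionally prediction) sites: use SSN
ssn_obs <- importSSN("lake_eacham.ssn", predpts = "preds")

# A .ssn folder with streams only: use SSNdesign's importer instead,
# then add candidate sampling sites ('n' is an assumed argument name)
ssn_new <- importStreams("new_network.ssn")
ssn_new <- generateSites(ssn_new, n = 100)
```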

The data processing (Fig 1) and design (Fig 2) workflows produce objects of class ssndesign, which are lists containing 1) information about the way the design optimisation function (optimiseSSNDesign) was used, including the SpatialStreamNetwork objects before and after an optimal design is found, 2) data about the optimal or adaptive design, and 3) diagnostic information about the optimisation procedure. This class also has a plot method, plot.ssndesign, which plots the trace of the optimisation algorithm. The method plot.SpatialStreamNetwork from the package SSN can be used to visualise the locations of the selected sites. Further details are provided in the package vignette (S2 Appendix), which is intended as a practical guide to the functionality and workflow of the package and as a reference for managers.

Expected utility estimation and maximisation

There are many ways to configure a fixed number of monitoring sites. Each potential configuration represents a ‘design’ d, and the set of all possible configurations is denoted as D. The goal of optimal design theory as applied to stream networks is to find which configuration of sites is most suitable to achieve a purpose (e.g. precise parameter estimation in a statistical model). We refer to the optimal configuration of sites as d*. Inside SSNdesign, the quality of a design and its suitability for a stated goal is measured using a function called the expected utility U(d) [13]. Larger values of U(d) indicate better designs and the calculation of U(d) is linked to the utility function, U(d, θ, y). The utility function may depend on elements of the geostatistical model fitted over the design, including its parameters θ and either observed or predicted data y [13]. However, in many cases of pseudo-Bayesian utility functions, the utility function does not depend on y and can be written U(d, θ) [16]. This function mathematically encodes the criterion used to compare designs. Examples of utility functions can be found in Section 2.1.2. The utility function, however, cannot be used directly to assess the quality of designs. This is because U(d, θ, y) depends on specific values of θ and y and the relative rankings of designs may change, sometimes dramatically, if there are small variations in these two quantities. Therefore, the parameters θ and the data y must be integrated out such that the values used to rank designs depend only on the designs themselves. We achieve this using Monte-Carlo integration [13]. Further details are provided in S1 Appendix.
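The Monte Carlo step can be written compactly. For a pseudo-Bayesian utility U(d, θ) that does not depend on y, the expected utility is approximated by averaging the utility over M draws from the prior on θ. The sketch below uses hypothetical stand-in functions (draw_prior, utility), not functions exported by SSNdesign.

```r
# Monte Carlo approximation of the expected utility:
#   U(d) ~= (1/M) * sum over m of U(d, theta_m)
expected_utility <- function(d, utility, draw_prior, M = 500) {
  u <- numeric(M)
  for (m in seq_len(M)) {
    theta_m <- draw_prior()       # one draw from the prior on theta
    u[m] <- utility(d, theta_m)   # utility of design d at this draw
  }
  mean(u)                         # Monte Carlo estimate of U(d)
}
```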

The set of possible designs D is usually large and, due to time and computational constraints, we cannot find d* by evaluating U(d) for every d ∈ D. SSNdesign deals with this problem in two ways. Firstly, we do not treat the design problem as a continuous one; that is, we do not allow sites to shift to any place along the stream edges during the search for the best design. The user must first create a set of N candidate points, and a design containing n points is chosen from among them. This ensures that D has a finite size. Secondly, we reduce the computational load of finding d* [19] by applying a coordinate exchange algorithm called the Greedy Exchange Algorithm (S1 Appendix, Algorithm 1). This algorithm rapidly converges on highly efficient designs, although these may not be the best possible designs [30]. Note that the Greedy Exchange Algorithm has previously been used for optimal designs on stream networks [16].
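A minimal sketch of a greedy exchange over a finite candidate set is given below; the package's Greedy Exchange Algorithm is specified in S1 Appendix (Algorithm 1), so this illustrates the general idea rather than the exact implementation. Here eval_design stands in for an expected-utility evaluation such as the expected_utility sketch above.

```r
greedy_exchange <- function(candidates, n, eval_design) {
  d <- sample(candidates, n)              # random initial design
  best <- eval_design(d)
  repeat {
    improved <- FALSE
    for (i in seq_len(n)) {               # coordinate-wise exchanges
      for (s in setdiff(candidates, d)) {
        d_try <- d
        d_try[i] <- s                     # exchange site i for candidate s
        u_try <- eval_design(d_try)
        if (u_try > best) {               # keep the exchange if it improves U(d)
          d <- d_try
          best <- u_try
          improved <- TRUE
        }
      }
    }
    if (!improved) break                  # stop when no exchange improves U(d)
  }
  d
}
```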

Utility functions for optimal and adaptive experimental designs

The utility functions implemented in the SSNdesign package are suitable for solving either static optimal design problems or evolving, adaptive design problems. For optimal design, there are six utility functions for common monitoring objectives including parameter estimation and prediction (Table 1; S1 Appendix). For adaptive design, there are three utility functions for similar monitoring objectives that are appropriate for adaptive decision-making. These are intended to be used with the function optimiseSSNDesign. We also provide two utility functions for finding space-filling designs (Table 1), using the optimisation function constructSpaceFillingDesigns. These designs contain roughly equally spaced and unclustered sets of monitoring sites along the stream network [31, 32].

Table 1. Utility functions implemented in SSNdesign.

Empirical utility functions are utility functions where the covariance parameters are estimated from data simulated using the prior draws. θ = a vector of covariance parameters from a geostatistical model; y = data that are either directly observed from a process or simulated from it. OP = optimal design; AD = adaptive design; n/a = no covariance parameters involved. $I(\theta)$ = the expected Fisher information matrix; $\hat{\beta}_{gls}$ = the estimates of the fixed effects; $\mathrm{Var}(\hat{\beta}_{gls})$ = the covariance matrix for the fixed effects; $s_z$ = a prediction site; $S$ = the set of all prediction sites; $\hat{y}(s_z)$ = the predicted value at prediction site $s_z$; $\mathrm{Var}(\hat{y}(s_z))$ = the kriging variance; $O_t(\theta)$ = a summary statistic from the existing design; $D(x_i, x_j)$ = the distance between two points $x_i$ and $x_j$, measured as Euclidean distance or hydrological distance along the stream network [10]; $D$ = a sorted vector of the non-zero distances in a distance matrix; $J$ = the number of times each distance occurs in one triangle of the matrix; the subscript $w = 1, 2, \ldots, W$ indexes the $W$ unique non-zero entries in the distance matrix; $p$ = a weighting power, with $p \geq 1$. In the Empirical column, × means no and ✓ means yes.

| Name | Purpose | Application | Empirical | Definition of the expected utility | Reference |
| --- | --- | --- | --- | --- | --- |
| CP-optimality | Covariance parameters | OP | × | $\log\det[I(\theta)^{-1}]$ | [16] |
| D-optimality | Fixed effects parameters | OP | × | $\log\det[\mathrm{Var}(\hat{\beta}_{gls})^{-1}]$ | [15, 16] |
| ED-optimality | Fixed effects parameters | OP | ✓ | $\log\det[\widehat{\mathrm{Var}}(\hat{\beta}_{gls})^{-1}]$ | [15, 16] |
| CPD-optimality | Fixed effects and covariance parameters, a mixture of CP- and D-optimality | OP | × | $\log\det[\mathrm{Var}(\hat{\beta}_{gls})^{-1}] + \log\det[I(\theta)^{-1}]$ | |
| K-optimality | Predictions | OP, AD | × | $\left(\sum_{s_z \in S} \mathrm{Var}(\hat{y}(s_z))\right)^{-1}$ | [15, 16] |
| EK-optimality | Predictions | OP, AD | ✓ | $\left(\sum_{s_z \in S} \widehat{\mathrm{Var}}(\hat{y}(s_z))\right)^{-1}$ | [15, 16] |
| Sequential CP-optimality | Covariance parameters | AD | × | $\log\det[(I(\theta) + O_t(\theta))^{-1}]$ | S1 Appendix |
| Sequential D-optimality | Fixed effects parameters | AD | × | $\log\det[(\mathrm{Var}(\hat{\beta}_{gls}) + O_t(\theta))^{-1}]$ | S1 Appendix |
| Sequential ED-optimality | Fixed effects parameters | AD | ✓ | $\log\det[(\widehat{\mathrm{Var}}(\hat{\beta}_{gls}) + O_t(\theta))^{-1}]$ | S1 Appendix |
| Maximin | Space-filling, with an emphasis on increasing the minimum distance between pairs of points | OP | n/a | $\min_{i \neq j} D(x_i, x_j)$ | [32] |
| Morris-Mitchell | Space-filling, with an emphasis on increasing separations larger than the minimum | OP | n/a | $-\left(\sum_{w=1}^{W} J_w D_w^{-p}\right)^{1/p}$ | [31] |

To solve adaptive design problems, we use a myopic design approach; that is, when making an adaptive design decision at a given point in time, we try to find the best decision for the next time step only [33]. An alternative method is backward induction, which involves enumerating all possible future decisions for a set number of time steps and then choosing the best series of decisions from among them [14]. However, this approach is often computationally prohibitive. Here we use an algorithm that assumes an initial design d0, which we seek to improve by adding sites or by rearranging the existing ones. We know a priori that, instead of solving this problem once, we will have to make a further T decisions in the future about the arrangement of the design points (Algorithm 2, S1 Appendix). At each step t = 1, 2, …, T in this iterative process, we fit a model to the existing design dt−1 and summarise the fit of the model using the summary statistic Ot(θ). The summary statistic can be arbitrarily defined, though it should be low-dimensional and preferably fast to compute [14]. In evaluating the designs at each timestep t, we incorporate the summary statistic Ot−1(θ) into the calculation of the expected utility given the previous design decisions, U(d|d0:t−1, y0:t−1). Data simulated from the likelihood at time t, p(y|θt, d), can involve real data collection; however, for most design studies, this step will simply involve generating predictions, with associated errors, from an assumed model. As with optimal design problems, the best design dt* at step t is the one that maximises U(d|d0:t−1, y0:t−1). We also update the priors on the parameters for each new design. The process is repeated until the best design has been found for each of the T time steps. We have defined utility functions specifically for the objectives of covariance parameter estimation and fixed effects parameter estimation in this adaptive setting; the main difference between the adaptive and static utility functions is the use of the summary statistic Ot(θ) from the model fitted to the existing sites (Table 1; S1 Appendix). These utility functions can be used with optimiseSSNDesign, and the structure of the loop is sketched below.
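The myopic loop can be summarised as runnable pseudo-R. Every function body below is a trivial placeholder for the real model-fitting, summary and optimisation steps; only the structure of the loop reflects Algorithm 2 of S1 Appendix.

```r
# Hypothetical stand-ins for the real steps, so the sketch runs as written
fit_model   <- function(d) lm(rnorm(length(d)) ~ 1)  # fit model to design d
summarise_O <- function(fit) vcov(fit)               # summary statistic O_t(theta)
best_design <- function(d, O_t, cand) {
  c(d, sample(setdiff(cand, d), 1))   # placeholder for maximising U(d | d_0:t-1, y_0:t-1)
}

candidates <- 1:100
d <- sample(candidates, 10)     # initial design d0
for (t in 1:3) {                # T = 3 future design decisions
  fit <- fit_model(d)           # fit a model to the existing design d_{t-1}
  O_t <- summarise_O(fit)       # summarise the fit
  d   <- best_design(d, O_t, candidates)  # myopic: optimise the next step only
  # the priors on the parameters would also be updated here
}
```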

Users may also define their own utility functions, since optimiseSSNDesign accepts a utility function as an argument. The utility function must be defined in this format: utility.function(ssn, glmssn, design, prior.parameters, n.draws, extra.arguments). The exact requirements for the input types, and the additional data accessible within optimiseSSNDesign, are described in the function documentation. It is not necessary to use all the arguments inside the function. Ultimately, the only requirement for a working user-defined utility function is that it returns a single number, representing the expected utility.
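For example, a syntactically valid user-defined utility might look like the sketch below. The argument list follows the format quoted above; the body is an illustrative assumption that simply scores a design by its number of unique sites.

```r
my.utility <- function(ssn, glmssn, design, prior.parameters, n.draws,
                       extra.arguments) {
  # Not all arguments need to be used. The only hard requirement is that
  # the function returns a single number: the expected utility.
  length(unique(design))
}

# Passed to the optimiser via (other arguments omitted for brevity):
# optimiseSSNDesign(..., utility.function = my.utility)
```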

Other standard designs

The SSNdesign package focuses on optimal and adaptive design problems, but we also include a number of standard designs, such as simple random sampling and GRTS [24]. We have also included heuristic sampling schemes designed specifically for stream networks, such as designs with sites allocated to headwater streams (i.e. streams at the top of the network), to outlets (i.e. the most downstream location on the network), or in clusters around stream confluences (i.e. junctions where stream segments converge; Som et al. [15]). These are all options for the function drawStreamNetworkSamples (Table 2).
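A hedged usage sketch follows: the design name strings come from Table 2 below, but the other argument names passed to drawStreamNetworkSamples are assumptions for illustration.

```r
# Draw a 50-site GRTS design on an existing SpatialStreamNetwork object
grts_design <- drawStreamNetworkSamples(ssn = streams,
                                        sample.method = "GRTS", n = 50)

# Or a clustered-triplets design (name string from Table 2)
c3_design <- drawStreamNetworkSamples(ssn = streams,
                                      sample.method = "Trib.Sets.Head.Singles.sample",
                                      n = 50)
```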

Table 2. Standard designs from Som et al. [15].

Note that ‘name in package’ is the string argument that must be passed to the drawStreamNetworkSamples function to use the sampling scheme.

| ID | Name in package | Description |
| --- | --- | --- |
| SRS | SRS | Simple random sampling. An unstratified random sample of sites. |
| G1 | GRTS | GRTS. Spatially balanced design [24]. |
| G2 | GRTSmouth | GRTS with one site always assigned to the stream outlet. |
| G4 | GRTSclus | GRTS with clusters of sites around confluences. |
| H1 | Headwater.Clusts.and.Singles | Headwater samples. One site allocated to the outlet and others preferentially allocated to tributaries. |
| C3 | Trib.Sets.Head.Singles.sample | Clustered triplets. As many points as possible are allocated to triplets clustered around confluences, one on each segment meeting at the confluence. All remaining points are assigned to tributaries. |
| C4 | Trib.Sets.Head.Singles.Mouth.sample | C3 with a single point allocated to the outlet segment. |

Computational performance

The optimisation of Bayesian and pseudo-Bayesian experimental designs via simulation is notoriously slow [19]. Other experimental design packages explicitly warn users to expect this (e.g. acebayes [22]). In our case, we have parallelised the functions (compatible with any OS) to increase computational efficiency for large problems. However, in many common situations users can expect run-times of hours to days, or even weeks. The expected computation time in hours is given by K × L × T × n × (N − n)/(3600 × C), where K is the number of random starts used to seed the algorithm, L is the number of times the algorithm must iterate before converging, T is the time in seconds required to calculate U(d) for a single design, n is the number of desired sampling locations, N is the number of potential sampling locations, and C is the number of CPUs allocated to the task. The parameter K is specified by the user in optimiseSSNDesign, but L is more difficult to constrain. The number of iterations until convergence is stochastic but, in our experience, L = 2 is most common, though we have observed L ∈ {3, 4, 5}. Unsurprisingly, the number of potential sampling locations and the number of desired sampling locations strongly influence computing time, and these make the largest contribution when n ≈ N/2. Large problems (N ≥ 300) generally take at least a day, if not more, to compute. Our second case study below, with N = 900 and n ≪ N, required approximately 4 days to complete using 20-32 CPUs. Using complicated or intensive utility functions will also add significantly to computation time. The empirical prediction and estimation utilities are particularly prone to computational inefficiencies because they rely on iteratively fitting and predicting from geostatistical stream network models. This is compounded by the fact that larger n is often associated with larger T, due to the increased size of the matrices (e.g. covariance matrix, design matrix) that must be created and/or inverted.
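The run-time formula translates directly into a small helper for budgeting compute. The numeric inputs in the example call are hypothetical.

```r
# Expected computation time in hours: K * L * T * n * (N - n) / (3600 * C)
expected_hours <- function(K, L, T, n, N, C) {
  K * L * T * n * (N - n) / (3600 * C)
}

# E.g. 5 random starts, 2 iterations to converge, 1 s per evaluation of U(d),
# choosing n = 50 of N = 300 candidate sites on 8 CPUs:
expected_hours(K = 5, L = 2, T = 1, n = 50, N = 300, C = 8)  # ~4.3 hours
```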

Case studies

We present two case studies where we illustrate how SSNdesign can be used to address common monitoring challenges. Field data collection can be expensive, so existing monitoring programs often need to reduce sampling effort due to resource constraints. In the first case study, we demonstrate how to ensure that the remaining survey sites are optimally placed to minimize information loss when sampling effort is reduced. In the second case study, we show how adaptive sampling can be used to design a monitoring program in a previously unsampled area, by gradually adding survey sites year-by-year. The R code used to create these examples is provided in S2 Appendix so that readers can re-create them and apply these methods to their own data.

In a model-based design problem, a ‘true model’ must be specified. Here, a true model refers to the statistical model that most adequately characterises the underlying spatial process, given what we know about the system. If a historical dataset from the study area exists, a standard model-selection process (e.g. using information criteria or cross-validation) can be used to determine which model has the most support in the data. If historical data are not available, simulated data can be used to implement the model-based design. In this case, the general approach is to:

  1. Specify the form of the statistical model;

  2. Identify which potential covariates should be related to the response (e.g. temperature affects the solubility of dissolved oxygen in water) based on prior knowledge of the system, or similar systems; and

  3. Set priors that specify the likely relationship between covariates and the response based on previously collected data, expert opinion and/or a literature review.

The same process must be undertaken to specify the spatial covariance structure and covariance parameters for the model. The SimulateOnSSN function from SSN [20] can then be used to simulate data that are subsequently used to implement the model-based design. Needless to say, the quality of the design will depend strongly on the quality of the prior information.
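Steps 1-3 and the covariance specification feed into SSN::SimulateOnSSN. The sketch below is a hedged illustration: the argument names follow the SSN package, but all values (formula, coefficients, covariance parameters, column names) are hypothetical.

```r
library(SSN)

sim <- SimulateOnSSN(
  ssn.object   = streams,                     # network with candidate sites
  ObsSimDF     = getSSNdata.frame(streams),   # data frame of site covariates
  formula      = ~ rainfall + urban,          # steps 1-2: model form and covariates
  coefficients = c(10, -0.5, 0.2),            # step 3: assumed fixed effects
  CorModels    = c("Exponential.tailup", "Exponential.taildown"),
  CorParms     = c(2, 10, 2, 10, 0.1),        # partial sills, ranges, nugget (assumed)
  addfunccol   = "afvArea"                    # additive function value column
)
```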

Case study 1: Lake Eacham

Water temperature samples were collected at 88 sites along a stream network near Lake Eacham in north Queensland, Australia (Fig 3) [34]. The dataset includes a shapefile of streams, the 88 observed sites with rainfall data and GIS-derived urban and grazing land use percentages in the catchment, and 237 prediction sites with the same covariates. Most survey sites were clustered at stream confluences, with multiple sites located in the upstream and downstream segments. In a similar experiment to that of Falk et al. [16], we used optimal experimental design to reduce the number of survey sites by half, with the least amount of information loss possible. More specifically, we wanted to retain the ability to 1) estimate the effect that land use and rainfall are having on water temperature and 2) accurately predict temperature at unsampled locations. In the former case, we optimised the design using CPD-optimality, which is maximised when uncertainties in both the fixed-effects and covariance parameters are minimised (Table 1). In the latter case, we used K-optimality, which is maximised when the total prediction uncertainty across all sites is minimised. Note that this situation where two designs are built separately using different utility functions is not ideal. There is no way to reconcile the two resulting designs into a single one. Ideally, we would be able to use the EK-optimality function. This utility function aims to maximise prediction accuracy but also involves a parameter estimation step, and therefore serves as a dual purpose utility function that yields designs that are efficient for parameter estimation and prediction. However, it was not practical to use the EK-optimality function because it is extremely computationally expensive, even for this dataset with only 88 potential sites.

Fig 3. The Lake Eacham stream network (blue lines) with all potential sampling sites (red squares).


The width of the stream lines is proportional to the catchment area for each stream segment. The inset shows the spatial distribution of sampling sites around a confluence in the stream network.

We fit a spatial stream-network model to the temperature data, using riparian land use (i.e. percent grazing and urban area) and the total rainfall recorded on the sampling date (mm) as covariates. The covariance structure contained exponential tail-up and tail-down components [10]. Log-normal priors were set on the covariance parameters using the natural logarithm of the estimated covariance parameters as the means and the log-scale standard errors of {0.35, 0.56, 0.63, 0.69, 0.68}, which were also estimated from the existing data using restricted maximum likelihood (REML).
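One way to encode these log-normal priors is as a list of independent sampling functions, one per covariance parameter. The REML point estimates below are hypothetical placeholders; the sdlog values are those quoted above. The exact structure SSNdesign expects for its prior arguments may differ, so treat this as an illustrative assumption.

```r
theta_hat <- c(2.1, 9.5, 1.8, 12.0, 0.3)      # hypothetical REML estimates
sdlogs    <- c(0.35, 0.56, 0.63, 0.69, 0.68)  # log-scale SEs quoted in the text

# A list of log-normal samplers, one per covariance parameter
prior_draws <- lapply(seq_along(theta_hat), function(i) {
  function(n) rlnorm(n, meanlog = log(theta_hat[i]), sdlog = sdlogs[i])
})

# 500 Monte Carlo draws for the first covariance parameter
draws_1 <- prior_draws[[1]](500)
```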

We set about finding CPD- and K-optimal designs with 44 sites by removing one site at a time from the original 88 sites. An alternative approach would be to optimise for a 44 site design with no intermediate steps. However, we chose to remove one site at a time because it helps reveal differences in the decision process between the CPD- and K-optimality utility functions that would otherwise be difficult to identify. Note that this is not an adaptive design because the model is not refit and the priors on the covariance parameters are not updated at each step. The expected CPD- and K-optimal utilities were calculated at each step using 500 Monte Carlo draws.

As expected, the results revealed differences between the intermediate and final designs discovered under CPD- and K-optimality. Both utility functions preserved clusters in groups of at least three sites around confluences (one on each of the upstream segments and one on the downstream segment). However, CPD-optimality appeared to remove single sites that were not part of clusters in preference to sites within clusters. By comparison, K-optimality appeared to reduce clusters around confluences down to three sites much more quickly, while preserving sites that were located away from confluences. This reflects the previously observed tendency of designs constructed for parameter estimation to favour spatial clusters and the tendency of designs that optimise prediction to contain sites spread out in space, with some clusters to characterise spatial dependence at short distances [15, 16]. These results suggest that clusters located around confluences provide valuable information about the covariance structure that is needed to generate precise parameter estimates and accurate predictions.

We tracked the information loss from the design process over time and compared the performance of our final 44-site optimal designs against 20 random and 20 GRTS designs of the same size (Fig 4). We compared them to multiple GRTS and random designs because these designs have many potential configurations, so we needed to characterise the range of their performance under the chosen expected utilities. Information in this context is measured as relative design efficiency: the ratio of the expected utility of a given design to the expected utility of the ‘best’ design, which in this case contained all 88 sites. There was a linear reduction in the information available for parameter estimation as sites were eliminated from the design (Fig 4a). While the optimal design containing 44 sites provided only 20% of the information gained from 88 sites, it provided significantly more information than the random and GRTS designs. These results suggest that reducing sampling effort by 50% would significantly impact parameter estimation. However, the same was not true for prediction. The optimal 44-site design showed only minor reductions in efficiency compared to the full 88-site design, while demonstrating considerable gains in efficiency over the random and GRTS designs (Fig 4b). As such, a 50% reduction in sampling effort would have little impact on the predictive ability of the models.

Fig 4. Information loss in our optimal designs as we remove sites one-by-one using the (a, c) CPD-optimal and (b, d) K-optimal utility functions.


Panels (a) and (b) have efficiency axes fixed between 0 and 1. Panel (c) is zoomed to the y-axis range 0-0.3 and panel (d) to the range 0.7-1.0. Efficiency is the ratio of the expected utility of each design to that of the full 88-site design. The black line indicates the efficiency of the optimal design with a given number of sampling sites. The 20 red dots and 20 blue dots in each panel represent the efficiencies of 44-site GRTS and random designs, respectively. These serve as a baseline for comparison with the optimal design, which should have higher relative efficiency. We only compare the efficiencies of the GRTS, random and optimal designs at 44 sampling sites because the 44-site monitoring program is the final result.

The findings from this case study fit within the framework established by Falk et al. [16] and Som et al. [15], and broadly agree with their findings. However, SSNdesign streamlined the process of producing these results. The same code sufficed for both the CPD- and K-optimality experiments, with only a few lines’ difference to account for the change in utility function (S2 Appendix). If required, we could easily have changed the covariance function underpinning the spatial relationships in the Lake Eacham network, or the priors on the covariance parameters. SSNdesign will enable aquatic scientists and statisticians to construct designs for their own monitoring programs, and to make decisions about them, with ease. Bespoke code will no longer be required, expanding access to the sophisticated methodologies of optimal and adaptive design.

Case study 2: Pine River

In the second case study, we demonstrate how additional sites can be selected to complement the information provided by a set of legacy sites using simulated data. The objective of the adaptive design process is to generate a design that can be used to accurately predict dissolved oxygen over the stream network at unsampled locations.

We did not actually have data at legacy sites. Therefore, for this example, we started by simulating four years of maximum daily dissolved oxygen (DO, mg/L) data at 900 locations throughout the river network. We then selected 200 of these sites using a GRTS design, which were treated as legacy sites. The first two years of data from the legacy sites were treated as historical data and used to estimate the fixed effects and covariance parameters using a spatial statistical stream-network model (S2 Appendix). Five random starts were used to find an adaptive design which maximises the K-optimality utility function (Table 1). We estimated U(d) using M = 500 Monte-Carlo draws from our independent log-normal priors on the covariance parameters. The results were used to select an additional 50 sites in year 3; after which the model was refit to the full dataset and an additional 50 sites were selected in year 4. Thus, the final dataset included 300 sites, giving rise to 950 observed DO measurements collected across four years (Fig 5). The result is shown in Fig 6.

Fig 5. A schematic diagram showing the adaptive design process over four years.


Green rectangles indicate new sites (i.e. sampling locations) that have been added to the monitoring program, and blue rectangles indicate that sites have been retained from previous years.

Fig 6. The Pine River stream network.


The sites on the network represent the evolution of the adaptive design through time.

We validated the adaptive design by computing its relative efficiency compared to 20 GRTS and 20 random designs of the same size. The GRTS designs were sequentially constructed using the ‘master sample’ approach [35]. We compared relative efficiency by computing the sum of the kriging variances for the same 900 prediction sites that were used when optimising the design. Note that the sum of the kriging variances is simply the inverse of the expected utility. For the purpose of validation, this expected utility was computed using 1000 prior draws to ensure our approximations to the expected utility were accurate. We did not include the 200 fixed GRTS sites in the validation procedure because we wanted to assess whether there were any additional benefits gleaned from the adaptive design. Thus, the efficiencies for each design (i.e. adaptive, GRTS, and random) were based on the 150 measurements collected at 100 locations in years 3 and 4.

The results showed that the adaptive design was more efficient than both the GRTS and random designs, which had 74.5-84% relative efficiency. In terms of variance units, using the adaptive design reduces the total variance across 900 prediction sites by approximately 545-1004 variance units compared to GRTS and random designs (Fig 7). This demonstrates that adaptive design represents a far better investment in terms of predictive ability than designs formed without optimisation.

Fig 7. The sum of kriging variances for both the Generalised Random Tessellation Stratified (GRTS) and random designs, computed using 1000 draws from the priors set on the covariance parameters.


The dashed red line represents the sum of the kriging variances for the adaptive design. The performance of the adaptive design is plotted this way as opposed to as a boxplot because there is only one adaptive design.

Conclusions

The SSNdesign package brings together a large body of work on optimal spatial sampling, pseudo-Bayesian experimental design, and the complex spatial data processing and spatial statistical modelling of stream network data. We demonstrate how the package can be used in two contexts which should prove useful for scientists and managers working in stream ecosystems; particularly where monitoring programs lack the resources to comprehensively sample the network, but must nevertheless estimate parameters for complicated spatial processes and accurately predict in-stream conditions across broad areas.

Compared with other packages for spatial sampling, such as geospt [25] and spsurvey [23], SSNdesign represents a significant advance in functionality for stream network data. The SSNdesign package integrates directly with the data structures, modelling and model diagnostic functions from a well-established R package for streams, SSN [20]. As a result, the hydrological distance measures and unique covariance functions for streams data can also be used in the design process. This cannot be accomplished using other packages for spatial design, which are restricted to using Euclidean distance in the conventional 2-dimensional domain of geostatistics. SSNdesign has been written specifically to deal with problems of model-based design (i.e. obtaining the best information from a model given that a spatial statistical model for in-stream processes can be specified). This distinguishes SSNdesign from packages such as spsurvey [23], which selects designs based on factors such as the inclusion probability for a site given characteristics of the underlying stream segments. However, note that we also include tailored functions to simplify the process of generating designs for stream networks as per Som et al. [15], which include designs that can be formed using spsurvey.

In addition to the existing functionality in SSNdesign, there are opportunities for further development. In particular, the authors of the SSN package [20] are working on extending its functionality to include computationally efficient models for big data (Personal Comm., J. Ver Hoef) and we expect there will be major performance boosts for the empirical utility functions, where spatial stream network models must be fit iteratively. Computationally efficient spatial stream network models also open up possibilities for new functionality within the SSNdesign package. For example, users must currently specify a single spatial stream network model as the ‘true’ model underpinning the design problem. This raises important questions about whether a single true model exists, and about which of several plausible, competing models ought to be chosen as the true model. Increased computational efficiency would allow us to implement a model-averaging approach [36], so that users could specify several plausible models as reasonable approximations to the true model. Expected utilities for designs would then be averages of the expected utilities of the design under each plausible model. This approach would make optimal and adaptive designs more robust, because the averaging mitigates the risk that designs are chosen to obtain the best information about the wrong model. Our hope is that managers of freshwater monitoring programs can more efficiently allocate scarce resources using optimal and adaptive designs for stream networks, which we have made accessible through the SSNdesign package.

Supporting information

S1 Appendix. Background on optimal and adaptive pseudo-Bayesian design.

(PDF)

S2 Appendix. SSNdesign—An R package for pseudo-Bayesian optimal and adaptive sampling designs on stream networks.

(PDF)

S3 Appendix. Glossary of terms.

(PDF)

Acknowledgments

Computational resources and services used in this work were provided by the eResearch Office, Queensland University of Technology, Brisbane, Australia.

The findings and conclusions in this article are those of the authors and do not necessarily represent the views of the US Fish and Wildlife Service nor the National Marine Fisheries Service, NOAA. Any use of trade, product, or firm names is for descriptive purposes only and does not imply endorsement by the US Government.

Data Availability

The data referred to in the text are available with the SSNdesign package on GitHub at https://github.com/apear9/SSNdesign.

Funding Statement

This study received funding and support from Healthy Land and Water (https://hlw.org.au/) and was motivated by their monitoring needs and desire to transition their freshwater monitoring program to the use of in-situ sensors at broad spatial scales. E.E.P and J.M.M received the award from Healthy Land and Water. P.M. works for Healthy Land and Water and P.M. assisted in identifying the two motivating examples / case studies for the study and provided the data used in the second case study. P.M. also contributed to the preparation of the manuscript. In addition, JMM was supported by an Australian Research Council Discovery Project (DP200101263).

References

  • 1. Dudgeon D, Arthington AH, Gessner MO, Kawabata Z-I, Knowler DJ, Lévêque C, et al. Freshwater biodiversity: Importance, threats, status and conservation challenges. Biological Reviews. 2006;81(2):163–82. 10.1017/S1464793105006950
  • 2. Poff NL, Olden JD, Strayer DL. Climate change and freshwater fauna extinction risk. In: Hannah L, editor. Saving a million species: Extinction risk from climate change. Washington, DC: Island Press/Center for Resource Economics; 2012. pp. 309–36.
  • 3. Vorosmarty CJ, McIntyre PB, Gessner MO, Dudgeon D, Prusevich A, Green P, et al. Global threats to human water security and river biodiversity. Nature. 2010;467:555–61. 10.1038/nature09440
  • 4. Ver Hoef JM, Peterson EE, Theobald D. Spatial statistical models that use flow and stream distance. Environmental and Ecological Statistics. 2006;13:449–64. 10.1007/s10651-006-0022-8
  • 5. United Nations Water. The right to privacy in the digital age. Geneva, Switzerland; 2016.
  • 6. Cressie N. Statistics for spatial data. New York: Wiley; 1993.
  • 7. Ver Hoef JM, Cressie N, Fisher RN, Case TJ. Uncertainty and spatial linear models for ecological data. In: Spatial uncertainty in ecology. New York, NY: Springer; 2001. pp. 214–37.
  • 8. Peterson EE, Ver Hoef JM, Isaak DJ, Falke JA, Fortin M-J, Jordan CE, et al. Modelling dendritic ecological networks in space: An integrated network perspective. Ecology Letters. 2013;16(5). 10.1111/ele.12084
  • 9. Ver Hoef JM. Kriging models for linear networks and non-Euclidean distances: Cautions and solutions. Methods in Ecology and Evolution. 2018;9:1600–13. 10.1111/2041-210X.12979
  • 10. Ver Hoef JM, Peterson EE. A moving average approach for spatial statistical models of stream networks. Journal of the American Statistical Association. 2010;105(489). 10.1198/jasa.2009.ap08248
  • 11. Isaak DJ, Wenger SJ, Peterson EE, Ver Hoef JM, Nagel DE, Luce CH, et al. The NorWeST summer stream temperature model and scenarios for the western U.S.: A crowd-sourced database and new geospatial tools foster a user community and predict broad climate warming of rivers and streams. Water Resources Research. 2017;53(11):9181–205. Available from: https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1002/2017WR020969
  • 12. Marsha A, Steel EA, Fullerton AH, Sowder C. Monitoring riverine thermal regimes on stream networks: Insights into spatial sampling design from the Snoqualmie River. Ecological Indicators. 2018;84:11–26. 10.1016/j.ecolind.2017.08.028
  • 13. Mueller P. Simulation-based optimal design. Bayesian Statistics. 1999;6:459–74.
  • 14. Mueller P, Berry DA, Grieve AP, Smith M, Krams M. Simulation based sequential Bayesian design. Journal of Statistical Planning and Inference. 2007;137:3140–50. 10.1016/j.jspi.2006.05.021
  • 15. Som NA, Monestiez P, Ver Hoef JM, Zimmerman DL, Peterson EE. Spatial sampling on streams: Principles for inference on aquatic networks. Environmetrics. 2014;25(5):306–23. 10.1002/env.2284
  • 16. Falk MG, McGree JM, Pettitt AN. Sampling designs on stream networks using the pseudo-Bayesian approach. Environmental and Ecological Statistics. 2014;21:751–73. 10.1007/s10651-014-0279-2
  • 17. Kang SY, McGree JM, Drovandi CC, Caley MJ, Mengersen KL. Bayesian adaptive design: Improving the effectiveness of monitoring of the Great Barrier Reef. Ecological Applications. 2016;26(8):2637–48. 10.1002/eap.1409
  • 18. Atkinson AC, Donev AN, Tobias RD. Optimum experimental designs, with SAS. Oxford University Press; 2007.
  • 19. Royle JA. Exchange algorithms for constructing large spatial designs. Journal of Statistical Planning and Inference. 2002;100(2):121–34. 10.1016/S0378-3758(01)00127-6
  • 20. Ver Hoef JM, Peterson EE, Clifford D, Shah R. SSN: An R package for spatial statistical modelling on stream networks. Journal of Statistical Software. 2014;56(3).
  • 21. R Core Team. R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2018. Available from: https://www.R-project.org/
  • 22. Overstall AM, Woods DC, Adamou M. acebayes: Optimal Bayesian experimental design using the ACE algorithm. 2017. Available from: https://CRAN.R-project.org/package=acebayes
  • 23. Kincaid TM, Olsen AR. spsurvey: Spatial survey design and analysis. 2018.
  • 24. Stevens DL, Olsen AR. Spatially balanced sampling of natural resources. Journal of the American Statistical Association. 2004;99(465):262–78. 10.1198/016214504000000250
  • 25. Melo C, Santacruz A, Melo O. geospt: An R package for spatial statistics. 2012. Available from: geospt.r-forge.r-project.org/
  • 26. Microsoft Corporation, Weston S. doParallel: Foreach parallel adaptor for the 'parallel' package. 2019. Available from: https://CRAN.R-project.org/package=doParallel
  • 27. Gaujoux R. doRNG: Generic reproducible parallel backend for 'foreach' loops. 2018. Available from: https://CRAN.R-project.org/package=doRNG
  • 28. Lu B, Sun H, Xu M, Harris P, Charlton M. shp2graph: Tools to convert a spatial network into an igraph graph in R. ISPRS International Journal of Geo-Information. 2018;7(8):293. 10.3390/ijgi7080293
  • 29. Peterson EE, Ver Hoef JM. STARS: An ArcGIS toolset used to calculate the spatial data needed to fit spatial statistical models to stream network data. Journal of Statistical Software. 2014;56(2):1–17. 10.18637/jss.v056.i02
  • 30. Evangelou E, Zhu Z. Optimal predictive design augmentation for spatial generalised linear mixed models. Journal of Statistical Planning and Inference. 2012;142(12):3242–53. 10.1016/j.jspi.2012.05.008
  • 31. Morris MD, Mitchell TJ. Exploratory designs for computational experiments. Journal of Statistical Planning and Inference. 1995;43(3):381–402. 10.1016/0378-3758(94)00035-T
  • 32. Pronzato L, Muller WG. Design of computer experiments: Space filling and beyond. Statistics and Computing. 2012;22. 10.1007/s11222-011-9242-3
  • 33. McGree JM, Drovandi CC, Thompson MH, Eccleston JA, Duffull SB, Mengersen K, et al. Adaptive Bayesian compound designs for dose finding studies. Journal of Statistical Planning and Inference. 2012;142(6):1480–92. 10.1016/j.jspi.2011.12.029
  • 34. Peterson EE. STARS: Spatial Tools for the Analysis of River Systems—a tutorial. Commonwealth Scientific and Industrial Research Organisation (CSIRO); 2011. Available from: http://www.fs.fed.us/rm/boise/AWAE/projects/SSN_STARS/software_data.html#doc
  • 35. Larsen DP, Olsen AR, Stevens DL. Using a master sample to integrate stream monitoring programs. Journal of Agricultural, Biological and Environmental Statistics. 2008;13(3):243–54. 10.1198/108571108X336593
  • 36. Clyde MA. Bayesian model averaging and model search strategies. Bayesian Statistics. 1999;6:157–85.

Decision Letter 0

Andreas C Bryhn

9 Jan 2020

PONE-D-19-26876

SSNdesign – an R package for pseudo-Bayesian optimal and adaptive sampling designs on stream networks

PLOS ONE

Dear Dr Pearse,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by February 20, 2020. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Andreas C. Bryhn

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

Additional Editor Comments (if provided):

Dear authors,

Thank you for a well-written manuscript. Both reviewers recommended a minor revision and I agree with them. Again, I would like to apologise for the long review process, which was due to the difficulty of finding suitable reviewers who agreed to participate. Please make sure to address all of the reviewers' comments in your rebuttal letter.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This paper presents a new R statistical computing package called "SSNdesign" for obtaining statistically optimal (according to a range of different criteria) sampling designs on stream networks. As such, it provides a useful tool for scientists who manage freshwater ecosystems. The package takes advantage of recently (within the last 10-15 years) developed methodology for statistically modeling spatially correlated data taken at points on a stream network.

The paper describes the software package at an appropriate level of detail, and the two case studies are nice and (in my opinion) need no improvement. I do not have expertise on the computational aspects (storage requirements, speed, reliability) of the package so I will limit my comments to those things on which I do have expertise, namely the relevant statistical models and utility functions, as presented in Supplement 1. First, however, here are a few minor comments referring to specific line numbers in the paper itself.

Line 80: Change "pseudo-Baeysian" to "pseudo-Bayesian"

Line 100: Change "Tesselation" to "Tessellation"

Line 200: Change "before" to "then" (so that the sentence reads more smoothly)

Line 279: Delete "based"

Line 298: Delete the period after "distribution"

Line 346: Change "dissovled" to "dissolved"

Now here are some comments on Supplement 1, with references to its line numbers.

Line 32: Remove the comma between "second-order" and "stationary"

Lines 32-33: It is a bit of an oversimplification to say that the moving average models produce covariance functions that depend on "the separation distance between two locations" because tail-down covariance functions generally are functions of two distances (distance of each location to their common confluence) rather than merely the separation distance between locations. So this sentence needs a minor correction.

Line 34: It is somewhat imprecise to refer to a point at which a function "starts." What you mean is that the moving average function is nonzero (positive, in fact) only at and upstream from a point. So you should change this sentence by changing "starts at some location and is nonzero only upstream from that location" to "is nonzero only upstream from a location." Make a similar edit in line 37 when describing the tail-down covariance function.

Line 44: Change "set" to "vector"

Line 67: When two sites are flow-unconnected, a and b, as defined, ALWAYS sum to h (the stream distance between the two sites). Thus, just change this sentence by deleting the phrase "when a+b=h"

Lines 71-78 and expression (5): For consistency with expression (4), use bold type for Y, X, etc.

Line 72: Here it should be noted that X is assumed to have full column rank; otherwise, the matrix in parentheses in line 158 is not invertible.

Line 94: Change "join" to "joint"

Line 106: Change "funciton" to "function"

Lines 120-133: As this is written it is unclear (1) what is the "first" site in d_o, and (2) whether the design with the largest utility (among the N-n designs) or an arbitrary design among the N-n designs is the design that replaces d^*. Please clarify.

Line 164: Because you gave an expression for the information matrix associated with the covariance parameters in expression (9), it would be good to give an expression for the information matrix associated with the fixed effect parameters here as well.

Reviewer #2: Comments for the author have been attached

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

Attachment

Submitted filename: Review_PONE-D-19-26876.pdf

PLoS One. 2020 Sep 22;15(9):e0238422. doi: 10.1371/journal.pone.0238422.r002

Author response to Decision Letter 0


28 Jul 2020

Response to reviewers

Reviewer 1

The reviewer’s comments were as follows:

This paper presents a new R statistical computing package called "SSNdesign" for obtaining statistically optimal (according to a range of different criteria) sampling designs on stream networks. As such, it provides a useful tool for scientists who manage freshwater ecosystems. The package takes advantage of recently (within the last 10-15 years) developed methodology for statistically modeling spatially correlated data taken at points on a stream network.

The paper describes the software package at an appropriate level of detail, and the two case studies are nice and (in my opinion) need no improvement. I do not have expertise on the computational aspects (storage requirements, speed, reliability) of the package so I will limit my comments to those things on which I do have expertise, namely the relevant statistical models and utility functions, as presented in Supplement 1. First, however, here are a few minor comments referring to specific line numbers in the paper itself.

The reviewer then provides a list of line-by-line corrections for small matters (spelling, grammar, etc.).

Firstly, in response, we thank Reviewer 1 for their constructive and positive feedback on our manuscript. Secondly, we list our responses/changes below against the line-by-line commentary of the reviewer. The reviewer’s comments are shown in bold font.

Line 80: Change "pseudo-Baeysian" to "pseudo-Bayesian"

Done.

Line 100: Change “Tesselation” to “Tessellation”

Done.

Line 200: Change “before” to “then” (so that the sentence reads more smoothly)

Done.

Line 279: Delete “based”

Done.

Line 298: Delete the period after “distribution”

Done.

Line 346: Change “dissovled” to “dissolved”

Done.

Now here are some comments on Supplement 1, with references to its line numbers.

Line 32: Remove the comma between "second-order" and "stationary"

Done.

Lines 32-33: It is a bit of an oversimplification to say that the moving average models produce covariance functions that depend on "the separation distance between two locations" because tail-down covariance functions generally are functions of two distances (distance of each location to their common confluence) rather than merely the separation distance between locations. So this sentence needs a minor correction.

We agree with the reviewer. This now reads

…they can be described by a mean function that depends on the location within the network, and a second-order stationary covariance function. Traditional covariance functions parameterise the dependence between observations in terms of the Euclidean distance separating two locations, but this is less straightforward in the context of stream networks. Stream network covariance functions and the distance metrics they use may depend on flow connectivity. Details on these covariance functions are provided below.

Line 34: It is somewhat imprecise to refer to a point at which a function “starts.” What you mean is that the moving average function is nonzero (positive, in fact) only at and upstream from a point. So you should change this sentence by changing “starts at some location and is nonzero only upstream from that location” to “is nonzero only upstream from a location.” Make a similar edit in line 37 when describing the tail-down covariance function.

Done.

Line 44: Change "set" to "vector"

Done.

Line 67: When two sites are flow-unconnected, a and b, as defined, ALWAYS sum to h (the stream distance between the two sites). Thus, just change this sentence by deleting the phrase "when a+b=h"

Done.

Lines 71-78 and expression (5): For consistency with expression (4), use bold type for Y, X, etc.

We have decided to drop the bold type for Y, X in Expression (4). This makes this expression consistent with the rest of the appendix and the main manuscript, where bold font is not used for Y, X.

Line 72: Here it should be noted that X is assumed to have full column rank; otherwise, the matrix in parentheses in line 158 is not invertible.

Done. The sentence now reads “… X is a design matrix with full column rank for the fixed effects”.

Line 94: Change "join" to "joint"

Done.

Line 106: Change "funciton" to "function"

Done.

Lines 120-133: As this is written it is unclear (1) what is the "first" site in d_o, and (2) whether the design with the largest utility (among the N-n designs) or an arbitrary design among the N-n designs is the design that replaces d^*. Please clarify.

In response to (1), the initial randomly-selected design can be thought of as a set of sites d_0 = {s_1, s_2, s_3, ..., s_n}. The “first site” in this design simply refers to s_1 in the set. We agree this could be unclear to a reader. The updated text is given beneath the next paragraph.

As for (2), the design with the largest utility among the N - n designs replaces d^* if and only if its utility is greater than the utility of d^*. That is, if d^b is the “best” design among the N - n designs evaluated at that iteration of the coordinate exchange algorithm, d^* is replaced by d^b if and only if U(d^b) > U(d^*). Sometimes U(d^b) does not satisfy this condition because there is no design among the N - n recently evaluated designs that improves on the current best design. We agree this could be unclear to readers so we have updated the text to be:

In this work, we use a greedy exchange algorithm (Algorithm 2) to locate optimal designs (9,13). The greedy exchange algorithm works by optimising the choice of each of n sites one-by-one. Initially, a random design with n sites is proposed and becomes d_0 = {s_1, s_2, ..., s_n} (the initial design) and d^* (the design which currently has the highest value of U(d)). From this point, we begin the coordinate exchange. Note that there are N - n candidate points not currently in d^*. The first site in d_0 (s_1) is then swapped out for each of the N - n candidate sites. The expected utilities of the resulting designs are recorded. If any designs have an expected utility larger than U(d^*), the design with the highest expected utility replaces d^*. Then we update our pool of candidate sites, and we begin to exchange the next site. Otherwise, the design reverts to d^* and the next site in the design is exchanged for each candidate site.
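For concreteness, the exchange logic described above can be expressed as a short, self-contained R sketch. This is an illustration only, assuming a generic scalar utility function over site indices; it is not the internal implementation of SSNdesign's optimiseSSNDesign function.

    # Illustrative R sketch of the greedy exchange algorithm described above.
    # 'utility' is any function mapping a vector of site indices to a scalar;
    # it stands in for the expected utility U(d) and is not SSNdesign code.
    greedy_exchange <- function(N, n, utility) {
      d_star <- sample(N, n)                       # random initial design d_0, also current best d*
      u_star <- utility(d_star)
      for (i in seq_len(n)) {                      # optimise the n sites one-by-one
        candidates <- setdiff(seq_len(N), d_star)  # the N - n sites not currently in d*
        u_cand <- vapply(candidates, function(s) {
          d_new <- d_star
          d_new[i] <- s                            # swap site i out for candidate s
          utility(d_new)
        }, numeric(1))
        if (max(u_cand) > u_star) {                # accept only an improving exchange
          d_star[i] <- candidates[which.max(u_cand)]
          u_star <- max(u_cand)
        }                                          # otherwise revert to d* and move on
      }
      list(design = sort(d_star), utility = u_star)
    }

    # Toy usage: spread 5 sites over 20 candidate positions (maximin distance).
    set.seed(1)
    greedy_exchange(N = 20, n = 5, utility = function(d) min(dist(d)))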

Line 164: Because you gave an expression for the information matrix associated with the covariance parameters in expression (9), it would be good to give an expression for the information matrix associated with the fixed effect parameters here as well.

We agree with the reviewer here. Thank you for catching this oversight. The expression for I(d, β_gls) is now given inline at Line 167 in the expanded text.

Reviewer 2

Case study 1

In this paper, case study one (Lake Eacham) is based on an existing dataset (i.e. Falk 2014 uses this dataset as well) and the underlying methodology used for the study is not new (optimal experimental design was used in Falk 2014 and Som 2014). I think it's important to state this clearly and cite the above-mentioned prior work.

We agree that we should have more prominently cited and acknowledged these previous works. However, we note the Lake Eacham dataset was originally published on the U.S. Forest Service website (https://www.fs.fed.us/rm/boise/AWAE/projects/SSN_STARS/software_data.html) as an example data set for the software package STARS (Peterson, 2011):

Peterson EE (2011). STARS: Spatial Tools for the Analysis of River Systems – A Tutorial. Technical Report EP111313, Commonwealth Scientific Industrial Research Organisation (CSIRO). URL: http://www.fs.fed.us/rm/boise/AWAE/projects/SSN_STARS/software_data.html#doc.

We have adjusted the text of the first two paragraphs of Case Study 1 to reflect the origins of the data set and to emphasise the previous work done by Som et al. (2014) and Falk et al. (2014).

Furthermore, the case study is currently presented like a vignette (an application of the SSNdesign package to real data), but doesn't make clear how it is relevant in demonstrating the contribution of the SSNdesign package. Given that both the dataset and the methodology in this case study were already presented in prior work, the majority of content of the case study is better suited for an R package vignette or as supplementary material. Instead, I think the paper would be better served by giving (i) a more concise overview of the case study using SSNdesign, with emphasis on how the software improves the ability to do such analysis, and (ii) a discussion of how the SSNdesign package better addresses the experimental design problem compared with alternatively available software like spsurvey or geospt.

Though the reviewer is correct that neither the Lake Eacham dataset nor optimal design problems on stream networks are novel, we must point out that the following elements of the case study differentiate it from previously published examples with these data:

The CPD-optimality utility function was not used in either Falk et al. (2014) or Som et al. (2014). At best those two papers used the C-optimality and D-optimality utility functions separately.

Our method of dropping sites one-by-one from the design was not used in Falk et al. (2014). Consequently, the insights we gain from tracking information loss over the successive designs are novel.

To address the point that the case study reads like a vignette, we would like to first point out that, in some ways, it is meant to do so. This paper documents our R package SSNdesign and showcases its capabilities. We chose the case studies to be of interest to a wide audience of readers; both address common (applied) design challenges that aquatic managers face, and novel methods are also presented in Case Study 2. The code used in Som et al. (2014) and Falk et al. (2014) was not made publicly available and, as a result, optimal and adaptive design methods on stream networks have remained inaccessible to the vast majority of potential users over the last six years. Thus, our main purpose is simply to demonstrate how the SSNdesign package simplifies previously complex tasks that required bespoke code to be developed.

At the same time, we agree with the reviewer that we would benefit from shifting the focus of our discussion of Case Study 1. We have edited the discussion of the first case study to highlight the utility of SSNdesign in producing these results, and we have also highlighted the connections between this case study and Falk et al. (2014).

At Lines 289-290, we have entered:

In a similar experiment to that of Falk et al. (21), we used optimal experimental design to reduce the number of survey sites by half, …

At Lines 344-352, we have added the following paragraph of discussion:

The findings from this case study fit inside the framework established by Falk et al. (21) and Som et al. (20), and broadly agree with their findings. However, for us, SSNdesign streamlined the process of discovering these results. The same code sufficed for both the CPD- and K-optimality experiments, with only a few lines’ difference to account for the change in utility function (S2 Appendix). If required, we could easily have changed the covariance function underpinning the spatial relationships in the Lake Eacham network or the priors on the covariance parameters. SSNdesign will enable aquatic scientists and statisticians to construct designs for their own monitoring programs or make decisions about them with ease. Bespoke code will no longer be required, expanding access to the sophisticated methodologies of optimal and adaptive design.

Case study 2

Case study two (Pine River) is a new dataset (synthetically simulated) and the underlying methodology used for the study is new (the authors state that adaptive experimental design problems have not been used under the SSNM framework). This is a case study that supports the motivations of this paper.

Yes, thank you.

In my experience, this R package will be an invaluable tool to researchers and managers in the freshwater monitoring field. I recommend this manuscript be published pending minor revisions. I look forward to reading a revised version of the manuscript.

Thank you for your insightful and positive feedback!

Abstract Line 17 “and so effective and efficient survey designs…”: delete “and” (double conjunction).

Done.

Abstract Lines 21 - 23 “Thus, unique challenges of geostatistics and…”: This sentence doesn’t feel supported by the sentences which precede it. This sentence suggests that the unique challenges of geostatistics on stream networks motivated the development of SSNdesign. But, it would be more accurate to say that these challenges motivate the development of the methodology implemented in SSNdesign. A better motivation for an R package like SSNdesign would be the lack of available software that provides access to, or application of, these methods. For example, a sentence summarizing the lack of experimental design R packages which account for the unique SSN structure would be stronger motivation (i.e. see 5th paragraph, lines 94 - 109 of the introduction).

We agree with the reviewer’s assessment. We have therefore modified this part of the abstract to read:

Geostatistical models for stream network data and their unique features already exist. Some basic theory for experimental design in stream environments has also previously been described. However, open source software that makes these design methods available for aquatic scientists does not yet exist. To address this need, we present SSNdesign, ...

Lines 58 - 61: Here the authors have demonstrated the increased use of SSNMs in the literature. The end of this paragraph would be a good place to cite other papers which empirically looked at experimental design utilizing SSNMs. I think it would strengthen the paper’s motivation and further highlight the research needs of an R package which tackles these questions. For example, Marsha et al., 2018, Monitoring riverine thermal regimes on stream networks: Insights into spatial sampling design from the Snoqualmie River, WA, Ecological Indicators 84: 11-26. Some of the takeaways from these papers align with those learned from the case studies you have included.

We agree with the reviewer on this point. Being able to point to specific studies focussing on experimental design on stream networks would indeed strengthen our case studies. Thank you for pointing us to this useful reference. We have incorporated and cited this paper on Lines 61 - 62:

see, for example, Isaak et al. (11) and Marsha et al. (12), both of whom model thermal regimes in streams. Marsha et al. (12) further consider questions of site placement and sample size based on their data.

Line 67 “Utility functions are mathematical…”: This sentence feels out of place here. Suggest moving it to line 70 before “A variety of utility functions are available…”

We agree that this sentence reads a bit awkwardly, but we also thought carefully about where to introduce this concept. We believe it is important to keep it in place because it defines what utility functions are in general before the following sentence, which is about differing design objectives affecting the spatial distribution of samples.

Line 81 “which measure the suitability of an experimental design for some purpose”: The authors already defined a utility function in the previous paragraph. I suggest deleting this and continue with “often depend on…”

Done.

Line 87 “In this paper and in SSNdesign…”: Upon first reading I don’t understand the separation between “this paper” and “SSNdesign.” By this paper do you mean application to the case studies? Perhaps make this clearer.

On further reflection, we agree with the reviewer that this is confusing. In fact, separating the two is unnecessary. Therefore, we have cut out the reference to the paper and now the sentence reads “In SSNdesign, we use the pseudo-Bayesian approach.”

Figure 1:

It is not clear where the reader should be starting with this flowchart. My experience with using SSNMs in R indicates to me how to interpret this flowchart, but I can see how a reader who is new to working with SSNMs in R could be confused. A flow chart which is left justified (or column aligned) and which then flows in parallel would be clearer (like Figure 2).

Thank you for this feedback. We agree the intended reading of the flowchart is not clear. We also agree with your suggestions on how to improve it. Please see the response to your comment below on specifically how we went about this.

When interpreting this flowchart it is not clear if the two “.ssn folder” objects are the same object fundamentally. If they are, then the reader wonders why one is required to be run through the importSSN function before SimulateOnSSN, while the other can go straight to SimulateOnSSN. My interpretation is that the bottom path “createSSN → .ssn folder → SimulateOnSSN” represents a continued R session (the scenario where the user starts with nothing); while the other path represents a new R session where the user starts with a “.ssn folder” object. If this interpretation is correct, I would suggest splitting this figure into two sections (one right and one left; or one top and one bottom) illustrating the two scenarios.

Again, thank you for this feedback. To clarify,

The two “.ssn folder” elements do not refer to the same object; they refer to a data structure that the two workflows have in common (i.e. both use .ssn folders to store the stream network data).

The “.ssn folder” appearing in the top workflow refers to a pre-existing dataset where the user has, at a minimum, a set of stream edges in a shapefile. From here, if the user has no observed data in a sites shapefile, then they can simulate observed and prediction locations using generateSites. Once the observed and prediction locations have been generated, the user may then simulate observed response values on those sites using SimulateOnSSN. Alternatively, if the user has a .ssn folder that already contains shapefiles for the observed sites and prediction locations, then they can go straight to simulating response values if necessary.

The createSSN → .ssn folder → SimulateOnSSN path refers to a situation where the user has no data for a stream network. That is, they must artificially generate all parts of the stream dataset, including the stream edges and sites. The createSSN function is responsible for creating a new .ssn folder that contains the completely simulated stream network. These data are then passed on to the SimulateOnSSN function to provide simulated response values for the simulated survey sites.
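To make the two outer workflows concrete, a minimal R sketch using the SSN package is given below. All file paths and parameter values are illustrative only, and the intermediate workflow (simulating site locations with SSNdesign's generateSites) is omitted because its arguments are not reproduced in this document.

    # Sketch of the two outer workflows; paths and parameters are illustrative.
    library(SSN)

    ## Workflow with a pre-existing .ssn folder holding observed/prediction sites.
    ## ("stream.ssn" and "preds" are hypothetical names for the user's own data.)
    # ssn1 <- importSSN("stream.ssn", predpts = "preds")

    ## Workflow with no GIS or survey data at all: simulate everything.
    set.seed(42)
    ssn2 <- createSSN(n = 30,                         # 30 stream segments, one network
                      obsDesign = binomialDesign(50), # roughly 50 observed sites
                      path = "sim.ssn", importToR = TRUE)
    createDistMat(ssn2, o.write = TRUE)               # distance matrices needed for simulation
    obsDF <- getSSNdata.frame(ssn2, "Obs")
    sim <- SimulateOnSSN(ssn2, ObsSimDF = obsDF,
                         formula = ~ 1, coefficients = 1,
                         CorModels = "Exponential.taildown",
                         use.nugget = TRUE,
                         CorParms = c(2, 10, 0.1))    # partial sill, range, nugget
    ssn2_sim <- sim$ssn.object                        # SSN object with simulated response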

We have updated Figure 1, incorporating your suggestions for improving readability. Specifically, we have disentangled each of the three cases from one another. There are now three rectangles stacked vertically, each one representing a different scenario (i.e. whether you have stream edges and survey data, just stream edges, or no GIS or survey data at all). We believe the new Figure 1 is a vast improvement on the original.

The Expected utility estimation and maximisation section: In general these two paragraphs are concise and very well explained.

Thank you!

The Utility functions for optimal and adaptive experimental designs section:

Reading about the adaptive design technique in SSNdesign, I see that maximizing U(d | d0:t-1, y0:t-1) depends upon d0 (an initial model design). This naturally makes the reader wonder about the sensitivity of the results to the initial design chosen. Did the authors look at how the choice of d0 affects future iterations, or even the final outcome, in their adaptive design? Or are there any other studies that looked at this that can be cited?

This is an insightful question. The choice of d0 will affect design choices in future iterations. This is because initially, prior information about the model parameters is formed based on d0. If this is vague, then it is likely that the iteratively chosen designs will become more informative as we become more certain about the parameter values. In contrast, if the prior information is initially quite informative, then the iteratively chosen designs should start out being relatively informative. The effect that the choice of d0 has on subsequent designs was not formally investigated via simulation as typically (at least in the stream modelling scenarios we have encountered) d0 will be quite uninformative, particularly for covariance parameters. Indeed, this really motivates the need to adopt adaptive design methods as, if the parameters are well estimated based on d0, then static (non-adaptive) design approaches would probably be suitable.

I see where the authors explicitly discuss solving adaptive design problems (paragraph 2), but where are optimal design problems discussed in this section?

The main purpose of this section is to provide concise definitions of all the utility functions that we have included in SSNdesign. The main focus of this section is Table 1 where we write out the equations for the utility functions; describe their arguments; and their intended use-cases (e.g. for selecting covariance/fixed effects parameters, or for minimising the average kriging standard errors across prediction sites). However, detailed discussions of both the optimal and adaptive design procedures are actually provided elsewhere (see the section entitled “Expected utility estimation and maximisation”, as well as the S1 Appendix). We explicitly discuss the myopic design approach in paragraph 2 of this section (while also pointing readers to the relevant sections and algorithms in the S1 Appendix for more detailed information) because it is critical information for understanding the utility functions we have included in the package for adaptive design. Utility functions for myopic designs look different to those used in backward induction. Therefore, we believe this brief discussion of the myopic design approach is necessary here.

Line 195 “Space-filling designs are designs which…”: delete “are designs which, ideally” so it reads “Space-filling designs contain roughly…”

Done.

Line 201 “Here we used an algorithm…”: change “used” to “use”

Done.

Line 219 “Users may also define…”: change to “Users may also define their own utility functions since the optimiseSSNDesign function has the flexibility to…” It doesn’t make sense to use a function to define another function.

Done.

Line 258 “Field data collection can be expensive and so existing…”: delete “and” (double conjunction).

Done.

Lines 271 - 275: Just a formatting note to capitalize “Identify” and “Set” as the beginning of each list point.

Done.

Line 274 “Set priors that specify what the…”: I suggest changing this sentence as follows, “Set priors that specify the likely relationship between covariates and the response based on expert opinion and/or a literature review.” This feels more concise and clear.

Done, but we have added “previously collected data” so that the sentence reads:

Set priors that specify the likely relationship between covariates and the response based on previously collected data, expert opinion and/or a literature review.

Line 281: This is a previously published dataset that should be cited.

Yes, we have now cited it correctly as

Peterson EE (2011). STARS: Spatial Tools for the Analysis of River Systems – A Tutorial. Technical Report EP111313, Commonwealth Scientific Industrial Research Organisation (CSIRO). URL: http://www.fs.fed.us/rm/boise/AWAE/projects/SSN_STARS/software_data.html#doc.

Figure 4: This is a good figure which effectively illustrates the author’s conclusions from case study 1.

Thank you!

I suggest the two plots use the same y-axis scale (0.0 – 1.0). It could be misleading to the reader otherwise.

Yes, we have done this for the updated version of Figure 4. However, we have expanded the number of panels to four. The reasons for this are partially explored below and further explained in our response to the reviewer’s next point.

The updated plot is now a four-panel figure (the revised Figure 4; the image is included in the revised manuscript).

The following outlines how this figure works:

Panels (a) and (d) are the panels (a) and (b) respectively from the original Figure 4.

Panels (a) and (b) in this version have their y-axes scaled to 0.0-1.0. The fact that both plots have the same axis scales means that it is possible to make direct comparisons of the two plots side-by-side.

Panels (c) and (d) are zoomed into the regions of the y-axes that allow the variability in the GRTS and random designs to be more clearly displayed. The total range of the y-axis is the same for both panels (0.3) but panel (c) shows the region of the plot with the y-axis fixed between 0.0-0.3 whereas panel (d) has its y-axis fixed between 0.7-1.0.

The feature in this figure that stands out to me the most is the lack of variability in efficiency under the GRTS designs using the CPD-optimal utility function (a) and the larger variability under the GRTS designs using the K-optimal utility function (b). Do the authors have any ideas about why that would be?

The problem that the reviewer has described here is an artifact of the inconsistent y-axis scales in the original Figure 4. The efficiencies of GRTS designs exhibit similar levels of variability under the CPD- and K-optimality utility functions. As it is now clear from panel (c) in the updated Figure 4, the GRTS designs under CPD-optimality exhibit efficiencies approximately in the range of 0.03-0.09 (total range = 0.06). Similarly, under K-optimality, the GRTS designs have efficiencies between 0.90-0.96 (total range = 0.06). We believe our updated Figure 4 no longer risks misleading readers in this way.

Line 382 “Compared to other packages…”: change to “Compared with other packages…” Typical convention is that “compared to” is used for discussing similarities between fundamentally different objects while “compared with” is for discussing differences between two fundamentally similar objects.

Done.

References

Falk MG, McGree JM, Pettitt AN. Sampling designs on stream networks using the pseudo-bayesian approach. Environmental and Ecological Statistics. 2014;21:751–73.

Peterson EE (2011). STARS: Spatial Tools for the Analysis of River Systems – A Tutorial. Technical Report EP111313, Commonwealth Scientific Industrial Research Organisation (CSIRO). URL: http://www.fs.fed.us/rm/boise/AWAE/projects/SSN_STARS/software_data.html#doc.

Som NA, Monestiez P, Ver Hoef JM, Zimmerman DL, Peterson EE. Spatial sampling on streams: Principles for inference on aquatic networks. Environmetrics. 2014;25(5):306–23.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Andreas C Bryhn

18 Aug 2020

SSNdesign – an R package for pseudo-Bayesian optimal and adaptive sampling designs on stream networks

PONE-D-19-26876R1

Dear Dr. Pearse,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Andreas C. Bryhn

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Andreas C Bryhn

24 Aug 2020

PONE-D-19-26876R1

SSNdesign – an R package for pseudo-Bayesian optimal and adaptive sampling designs on stream networks

Dear Dr. Pearse:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Andreas C. Bryhn

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Appendix. Background on optimal and adaptive pseudo-Bayesian design.

    (PDF)

    S2 Appendix. SSNdesign—An R package for pseudo-Bayesian optimal and adaptive sampling designs on stream networks.

    (PDF)

    S3 Appendix. Glossary of terms.

    (PDF)


    Data Availability Statement

    The data referred to in the text are available with the SSNdesign package on GitHub at https://github.com/apear9/SSNdesign.

    The SSNdesign package is publicly available at https://github.com/apear9/SSNdesign, along with the data used in this paper. This R package requires R version 3.5.0 or later, and depends on the packages SSN [20], doParallel [26], doRNG [27], spsurvey [23], and shp2graph [28]. These dependencies are downloaded from CRAN during installation of the package.
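    A minimal sketch of one possible installation route is given below; it assumes the remotes package as the installer, with only the GitHub path taken from the statement above.

        # Assumed installation route ('remotes' is one common way to install from GitHub)
        install.packages("remotes")
        remotes::install_github("apear9/SSNdesign")
        library(SSNdesign)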

