Published in final edited form as: J Stat Softw. 2023 Jan 18;105(3):1–29. doi: 10.18637/jss.v105.i03

spsurvey: Spatial Sampling Design and Analysis in R

Michael Dumelle, Tom Kincaid, Anthony R. Olsen, Marc Weber
PMCID: PMC9926341  NIHMSID: NIHMS1870927  PMID: 36798141

Abstract

spsurvey is an R package for design-based statistical inference, with a focus on spatial data. spsurvey provides the generalized random-tessellation stratified (GRTS) algorithm to select spatially balanced samples via the grts() function. The grts() function flexibly accommodates several sampling design features, including stratification, varying inclusion probabilities, legacy (or historical) sites, minimum distances between sites, and two options for replacement sites. spsurvey also provides a suite of data analysis options, including categorical variable analysis (cat_analysis()), continuous variable analysis (cont_analysis()), relative risk analysis (relrisk_analysis()), attributable risk analysis (attrisk_analysis()), difference in risk analysis (diffrisk_analysis()), change analysis (change_analysis()), and trend analysis (trend_analysis()). In this manuscript, we first provide background for the GRTS algorithm and the analysis approaches and then show how to implement them in spsurvey. We find that the spatially balanced GRTS algorithm yields more precise parameter estimates than simple random sampling, which ignores spatial information.

Keywords: design-based inference, generalized random-tessellation stratified algorithm, Horvitz-Thompson, inclusion probability, spatial balance, variance estimation

1. Introduction

Survey designs are often used to study an environmental resource in a population. These populations are comprised of individual population units, which are often referred to as sites. Each site contains information about the environmental resource, and a complete characterization of the resource can be obtained by studying every site. Unfortunately, studying every site is rarely feasible. Therefore, a sample of sites is collected, and the sample is used to make generalizations about the larger population. Typically sites are selected without replacement, and we make this assumption henceforth. The process by which sites are selected in the sample is known as the sampling design.

In the design-based approach to statistical inference, a sample should be representative of the population, but the term representative is often vague and has multiple interpretations (Kruskal and Mosteller 1979a,b,c). We claim a representative sample should have at least the following two properties. First, the sites must be selected as part of the sample via a random mechanism. The design-based approach to statistical inference relies on a random selection of sites; the random site selection forms the foundation for deriving properties of parameter estimates (Särndal, Swensson, and Wretman 2003; Lohr 2009). Second, the probability each site is selected as part of the sample is greater than zero. This probability of selection is known as an inclusion probability.

There are three types of commonly studied environmental resources: point resources, linear resources, and areal resources. A point resource has a finite number of population units (i.e., a finite population) and represents a collection of point geometries. An example of a point resource is all lakes (viewed as a whole) in the United States, using the centroid of the lake as the site location. A linear resource has an infinite number of population units (i.e., an infinite population) and represents a collection of linestring geometries. An example of a linear resource is all streams in the United States. An areal resource has an infinite number of population units and represents a collection of polygon geometries. An example of an areal resource is the San Francisco Bay Estuary.

These point, linear, and areal resources tend to be spread over geographic space. If a sample is well-spread over geographic space, we call it a spatially balanced sample (we provide a more technical definition of spatial balance in Section 2.2). Spatially balanced samples are desirable because they tend to yield more precise parameter estimates than samples that are not spatially balanced (Stevens and Olsen 2004; Barabesi and Franceschi 2011; Grafström and Lundström 2013; Robertson, Brown, McDonald, and Jaksons 2013; Wang et al. 2013; Benedetti, Piersimoni, and Postiglione 2017).

The spsurvey package (Dumelle, Kincaid, Olsen, and Weber 2023) selects spatially balanced samples using the generalized random-tessellation stratified (GRTS) algorithm (Stevens and Olsen 2004) and is available from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/package=spsurvey. Shortly after the GRTS algorithm emerged, several other spatially balanced sampling algorithms followed. Walvoort, Brus, and De Gruijter (2010) used compact geographical strata to perform stratified sampling; this approach is available in the spcosa R package (Walvoort, Brus, and De Gruijter 2022). Grafström, Lundström, and Schelin (2012) used a local pivot method for finite populations and Grafström and Matei (2018) generalized this approach to infinite populations; these approaches are available in the BalancedSampling R package (Grafström and Lisic 2019). Grafström (2012) used a spatially correlated Poisson approach, also available in BalancedSampling. Benedetti and Piersimoni (2017) used a within-sample distance approach available in the Spbsampling R package (Pantalone, Benedetti, and Piersimoni 2022). Robertson et al. (2013) developed balanced acceptance sampling, and subsequently, Robertson, McDonald, Price, and Brown (2018) used Halton iterative partitioning; these approaches are available in the SDraw R package (McDonald and McDonald 2020). Foster et al. (2020) developed spatially balanced transect sampling; this approach is available in the MBHdesign R package (Foster 2021).

The GRTS algorithm in spsurvey implements many features absent from the aforementioned software packages. The GRTS algorithm in spsurvey can be applied to all three resource types: point, linear, and areal. It accommodates several sampling design features like stratification, unequal selection probabilities, legacy (or historical) sites, minimum distances between sites, and two options for replacement sites (reverse hierarchical ordering and nearest neighbor). The GRTS algorithm is discussed in more detail in Section 2. Section 2 also showcases how spsurvey can be used to summarize and visualize sampling frames and samples as well as measure spatial balance.

Another benefit of spsurvey compared to the aforementioned software packages is that spsurvey can also be used to analyze data and estimate parameters of a population. spsurvey has a suite of analysis functions that enable categorical variable analysis, continuous variable analysis, attributable risk analysis, relative risk analysis, difference in risk analysis, change analysis, and trend analysis. In addition, variances can be estimated using the local neighborhood variance estimator (Stevens and Olsen 2003), which increases precision by using the spatial locations of each observation in variance estimation. The analysis functions in spsurvey are discussed in more detail in Section 3.

The rest of this paper is organized as follows. In Section 2, we review spatially balanced sampling in spsurvey. In Section 3, we describe the analysis approaches available in spsurvey. In Section 4, we compare performance of the GRTS algorithm and local neighborhood variance estimator to simple random sampling using data from the 2012 National Lakes Assessment (U.S. Environmental Protection Agency 2017). Finally, in Section 5, we end with a discussion and explore potential future developments for spsurvey.

To install and load spsurvey, run

R> install.packages("spsurvey")
R> library("spsurvey")

2. Spatially balanced sampling

In Section 1 we introduced the notion of a random sample. Random samples are selected from a collection of sites. This collection of sites is known as the sampling frame. Ideally, the set of sites in the sampling frame is the same as the set of sites in the population. Unfortunately this is not always true, as a sampling frame may contain some sites that are not in the population (overcoverage), may be missing sites from the population (undercoverage), or both. Selecting an appropriate sampling frame is crucial if you want to generalize results from the sample to the population. To understand whether a sampling frame is appropriate for a population, summaries and visualizations of the sampling frame are helpful. Next we demonstrate using spsurvey to summarize and visualize sampling frames. We then give theoretical background for the generalized random-tessellation stratified (GRTS) algorithm and show how to use it in spsurvey to select spatially balanced samples and to summarize, visualize, write, and print these samples. We end the section by showing how to explicitly measure spatial balance using spsurvey and to use GRTS for a variety of resource types.

2.1. Summarizing and visualizing sampling frames

Sampling frames for point, linear, or areal resources are summarized and visualized in spsurvey using the summary() and plot() functions, respectively. The summary() and plot() functions have similar syntax and require at least two arguments: the sampling frame and a formula. The sampling frame must be an ‘sf’ object (Pebesma 2018) or a data frame. The formula specifies the variables in the sampling frame to summarize or visualize and can be one-sided or two-sided. Additional arguments to summary() and plot() are discussed in more detail later.

To demonstrate the use of summary() and plot(), we use the NE_Lakes data in spsurvey. The NE_Lakes data is an ‘sf’ object of 195 lakes in the Northeastern United States. The NE_Lakes data represent a point resource, as there are a finite number of lakes to sample. Later we study linear and areal data in spsurvey. To load NE_Lakes into your global environment, run

R> data("NE_Lakes", package = "spsurvey")

There are four variables in NE_Lakes: AREA, a continuous variable representing lake area (in hectares); AREA_CAT, a categorical variable representing lake area levels small (1 to 10 hectares) and large (greater than 10 hectares); ELEV, a continuous variable representing lake elevation (in meters); and ELEV_CAT, a categorical variable representing lake elevation levels low (0 to 100 meters) and high (greater than 100 meters). We can view the geometry information and first few rows of NE_Lakes by running

R> NE_Lakes

Simple feature collection with 195 features and 4 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: 1834001 ymin: 2225021 xmax: 2127632 ymax: 2449985
Projected CRS: NAD83 / Conus Albers 
First 10 features:
        AREA AREA_CAT   ELEV ELEV_CAT                geometry
1  10.648825    large 264.69     high POINT (1930929 2417191)
2   2.504606    small 557.63     high POINT (1849399 2375085)
3   3.979199    small  28.79      low POINT (2017323 2393723)
4   1.645657    small 212.60     high POINT (1874135 2313865)
5   7.489052    small 239.67     high POINT (1922712 2392868)
6  86.533725    large 195.37     high POINT (1977163 2350744)
7   1.926996    small 158.96     high POINT (1852292 2257784)
8   6.514217    small  29.26      low POINT (1874421 2247388)
9   3.100221    small 204.62     high POINT (1933352 2368181)
10  1.868094    small  78.77      low POINT (1892582 2364213)

Notice that the geometry type of NE_Lakes is POINT, as NE_Lakes represents a point resource.

Before summarizing or visualizing NE_Lakes, store it as an ‘sp_frame’ object by running

R> NE_Lakes <- sp_frame(NE_Lakes)

One-sided formulas are used when the goal is to summarize or visualize variables individually. To summarize the distribution of ELEV_CAT using a one-sided formula, run

R> summary(NE_Lakes, formula = ~ ELEV_CAT) 
      total    ELEV_CAT
 total:195    low :112
              high: 83

The output contains two columns: total and ELEV_CAT. The total column acts as an “intercept” in the formula and returns the total number of observations in the sampling frame; it can be omitted by supplying - 1 to the formula. The ELEV_CAT column returns the number of lakes in the low and high elevation levels. The same syntax is used to visualize the spatial distribution of ELEV_CAT (Figure 1a):

Figure 1: Distribution of the lake elevation categories (a) and the interaction between lake elevation categories and lake area categories (b) in the Northeastern lakes data.

R> plot(NE_Lakes, formula = ~ ELEV_CAT)

By default, the formula supplied to plot() is used as the resulting plot’s title, though this can be changed using the main argument.

Additional variables can be added to the formula when separated by +. Interactions between variables can be added to the formula using :. When additional variables are added, summary() produces a table-like summary of each variable

R> summary(NE_Lakes, formula = ~ ELEV_CAT + ELEV_CAT:AREA_CAT)
      total    ELEV_CAT    ELEV_CAT:AREA_CAT
 total:195    low :112     low:small :82
              high: 83     high:small:53
                           low:large :30
                           high:large:30

Similarly, plot() produces separate visualizations for each variable (Figure 1).

R> plot(NE_Lakes, formula = ~ ELEV_CAT + ELEV_CAT:AREA_CAT)

These separate visualizations are stepped through using <Return>. The summary() and plot() functions also support standard formula syntax shortcuts like . and *. The formula ~ . is shorthand for ~ AREA + AREA_CAT + ELEV + ELEV_CAT and the formula ~ AREA_CAT * ELEV_CAT is shorthand for ~ AREA_CAT + ELEV_CAT + AREA_CAT : ELEV_CAT.

Two-sided formulas are useful when the goal is to summarize or visualize one variable (a left-hand side variable) for each level of other variables (right-hand side variables). When using two-sided formulas, summary() returns table-like summaries of the left-hand side variable for each level of each right-hand side variable:

R> summary(NE_Lakes, formula = ELEV ~ AREA_CAT)
ELEV by total:
Min. 1st Qu. Median Mean 3rd Qu. Max.
total 0 21.925 69.09 127.3862 203.255 561.41
ELEV by AREA_CAT:
Min. 1st Qu. Median Mean 3rd Qu. Max.
small 0.00 19.64 59.660 117.4473 176.1700 561.41
large 0.01 26.75 102.415 149.7487 241.2025 537.84

plot() returns separate visualizations of the left-hand side variable for each level of each right-hand side variable. For example,

R> plot(NE_Lakes, formula = ELEV ~ AREA_CAT)

produces two separate visualizations – one for each level of AREA_CAT (small and large).

The plot() function has additional arguments that allow for flexible customization of graphical parameters. The varlevel_args (short for “variable level arguments”) argument adjusts graphical parameters separately for each level of a categorical variable. The var_args (short for “variable arguments”) argument adjusts graphical parameters for a numeric variable or simultaneously for all levels of a categorical variable. The … argument adjusts graphical parameters for all variables simultaneously. spsurvey’s plot() function is built on top of sf’s plot() function. As a result, it takes the same set of graphical parameters that sf’s plot() function does and uses the same default values.

2.2. The generalized random-tessellation stratified algorithm

Before discussing the GRTS algorithm, it is important to identify two distinct types of spatial balance: spatial balance with respect to the sampling frame and spatial balance with respect to geography. Spatial balance with respect to the sampling frame measures how closely the spatial layout of the sample resembles the spatial layout of the sampling frame. Spatial balance with respect to geography measures the geographic spread of the sample – usually the sites in the sample are spread out over the domain in some equidistant manner but are not meant to resemble the spatial layout of the sampling frame. While spatial balance with respect to geography can be useful, spatial balance with respect to the sampling frame is preferred for design-based inference because this type of spatial balance is closely linked to inclusion probabilities, which we discuss in more detail later. Henceforth, when we refer to spatial balance, we mean spatial balance with respect to the sampling frame.

Stevens and Olsen (2004) created the first widely-used spatially balanced sampling algorithm known as the GRTS algorithm. The GRTS algorithm has several attractive properties we discuss throughout this subsection. Most notably, the GRTS algorithm accommodates all three resource types: point, linear, and areal. It also accommodates a suite of flexible sampling design options like stratification, unequal inclusion probabilities, legacy (historical) sites, a minimum distance between sites, and two options for replacement sites. Next we provide a brief overview of the technical details of the algorithm as described by Stevens and Olsen (2004).

The first step in the GRTS algorithm is to determine the probability that each site is selected in the sample, known as an inclusion probability. For example, if the population size N equals 100, the sample size n equals 10, and each site is equally likely to be selected in the sample, then each site’s inclusion probability is n/N = 10/100 = 0.1. After determining these inclusion probabilities, a square bounding box is superimposed onto the sampling frame. That bounding box is divided into four distinct, equally sized square cells. These cells compose the first level of a hierarchical grid and are called level-one cells. These level-one cells are randomly assigned a level-one address of zero, one, two, or three. The set of level-one cells is denoted by 𝒜1 and defined as 𝒜1 ≡ {a1 : a1 = 0,1,2,3} (Figure 2a). Each level-one cell has an inclusion value that equals the sum of the inclusion probabilities for the sites contained in the level-one cell. If any of the level-one cells’ inclusion values are larger than one, a second level of cells is added by splitting each level-one cell into four distinct, equally sized squares. Together these small squares compose the second level of a hierarchical grid and are called level-two cells. Within each level-one cell, the level-two cells are randomly assigned a level-two address of zero, one, two, or three. The level-one and level-two addresses compose a set that can be used to identify any level-two cell. The set of level-two cells is denoted by 𝒜2 and defined as 𝒜2 ≡ {a1a2 : a1 = 0,1,2,3; a2 = 0,1,2,3} (Figure 2b). If any of the level-two cells’ inclusion values are greater than one, a third level of cells is added. This process continues for k levels, where k is the first level at which all level-k cells have inclusion values no greater than one. Then 𝒜k ≡ {a1⋯ak : a1 = 0,1,2,3; …; ak = 0,1,2,3}. This addressing composes a base-four ordering scheme – Stevens and Olsen (2004) provide further details.
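
To make the base-four addressing concrete, the following illustrative R sketch (not spsurvey's internal code) builds the sixteen level-two addresses from randomly assigned level-one and level-two digits and then places them in hierarchical order:

R> # Illustration only: random level-one digits and, within each level-one
R> # cell, random level-two digits; sorting the two-digit addresses gives
R> # the hierarchical order described above.
R> set.seed(1)
R> level_one <- sample(0:3)
R> level_two <- replicate(4, sample(0:3))
R> addresses <- unlist(lapply(1:4, function(i)
+    paste0(level_one[i], level_two[, i])))
R> sort(addresses)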

Figure 2: A visual description of the generalized random-tessellation stratified algorithm using sites from an illustrative sampling frame in Oregon, USA. In (a), the level-one cells are superimposed onto the sampling frame. In (b), the level-two cells are superimposed onto the sampling frame. In (c), the level-two cells are mapped in hierarchical order from two-dimensional space to a line and a sample is selected. Each cell is represented by brackets with a closed right endpoint, meaning they contain the site at their closed right boundary. In (d), the sites are separated by whether or not they are part of the sample.

Next the elements in 𝒜k are placed in hierarchical order. Hierarchical order is a numeric order that first sorts 𝒜k by the level-one addresses from smallest to largest, then by the level-two addresses from smallest to largest, and so on. For example, 𝒜2 in hierarchical order is the set {00,01,02,03,10,…,13,20,…,23,30,…,33}. Then the level-k grid cells are mapped from two-dimensional space to a line in hierarchical order (Figure 2c). More specifically, mapping a level-k grid cell means placing each site in the level-k grid cell on the line, where each site is represented by a line segment with length equal to its inclusion probability. The hierarchical ordering tends to map nearby sites in two-dimensional space to nearby locations on the line. Because the entire line represents the inclusion probabilities of each site, the line’s total length equals the sum of these inclusion probabilities. This sum equals n, the desired sample size.

After hierarchically ordering the sites and placing them on the line, the sample is selected. To select a sample, Stevens and Olsen (2004) denote a uniform random variable simulated from [0,1] as u1 and place it on the line. The location of u1 on the line falls within some line segment that represents a site, which we denote s1. The site s1 is then the first site selected as part of the sample. Next we define u2 ≡ u1 + 1, which falls within a line segment that represents another site, which we denote s2. The sites s1 and s2 must be distinct because of the requirement that each level-k cell has inclusion value no greater than one. Then u3 ≡ u2 + 1 corresponds to s3 and so on until the set {u1,…,un} corresponds to the set {s1,…,sn}, which are the n sites included in the sample (Figure 2d). Stevens and Olsen (2004) provide further details.
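
The selection along the line can be sketched directly in R. This is an illustration of the systematic step described above (not spsurvey's internal code), assuming 100 sites with equal inclusion probabilities and a desired sample size of 10:

R> # Segments have length equal to the inclusion probabilities and contain
R> # their closed right endpoint; u1, u1 + 1, ..., u1 + n - 1 identify the
R> # segments (sites) selected in the sample.
R> set.seed(1)
R> n <- 10
R> pi_i <- rep(n / 100, 100)
R> upper <- cumsum(pi_i)
R> u <- runif(1) + 0:(n - 1)
R> findInterval(u, upper, left.open = TRUE) + 1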

spsurvey implements the GRTS algorithm using the grts() function. There are two required arguments to grts(): the sampling frame and a base sample size. The first required argument is the sampling frame, which must be an ‘sf’ object. For point resources, the sf geometries must all be POINT or MULTIPOINT; for linear resources, the sf geometries must all be LINESTRING or MULTILINESTRING; and for areal resources, the sf geometries must all be POLYGON or MULTIPOLYGON. The second required argument is the desired sample size for the base sample, n_base. The base sample is a sample that does not include replacement sites (Section 2.2.3). Additional arguments to the grts() function address specific sampling design options, which we discuss later.

The output from the grts() function is a list with five components: sites_legacy, sites_base, sites_over, sites_near, and design. sites_legacy, sites_base, sites_over, and sites_near are ‘sf’ objects containing the legacy sites (discussed in Section 2.2.1), base sites (except for those already included in sites_legacy), replacement sites using reverse hierarchical ordering (Section 2.2.3), and replacement sites using nearest neighbor (Section 2.2.3), respectively. Together, the collection of these sites objects is called the design sites. Each sites object contains all original columns from the sampling frame and some additional columns related to the sampling design. The last component of the grts() function output is a list named design, which contains details regarding the sampling design. Next we give some examples implementing the grts() function.

To select a GRTS sample of size 50 where each site has an equal inclusion probability, run

R> eqprob <- grts(NE_Lakes, n_base = 50)

Instead of sampling from the entire sampling frame simultaneously, it is common to divide a sampling frame into distinct sets of sites known as strata and select samples from each stratum independently of other strata. This approach is known as stratification and yields a stratified sample. Särndal et al. (2003) mentions several practical and statistical benefits of stratified samples compared to unstratified samples. One such practical benefit is that stratification allows for stratum-specific sample sizes and implementation practices (e.g., each stratum may have different sampling protocols). One such statistical benefit is that stratification tends to increase precision of parameter estimates. To select a GRTS sample stratified by the lake elevation categories where all sites within a stratum have equal inclusion probabilities, run

R> n_strata <- c(low = 35, high = 15)
R> eqprob_strat <- grts(NE_Lakes, n_base = n_strata,
+    stratum_var = "ELEV_CAT")

In a stratified sample, n_base must be a named vector whose names (low and high) represent each stratum and whose values represent stratum-specific sample sizes (35 and 15). stratum_var is the name of the column in the sampling frame that represents the stratification variable.

Sometimes the desire is to sample sites that belong to some level of a categorical variable more often than other levels. For example, suppose large lakes are to be sampled more often than small lakes. To select a GRTS sample with unequal inclusion probabilities based on lake area categories, run

R> caty_n <- c(small = 10, large = 40)
R> uneqprob <- grts(NE_Lakes, n_base = 50, caty_n = caty_n,
+    caty_var = "AREA_CAT")

caty_n is a named vector whose names represent the categorical area levels (small and large) and whose values represent the expected within-level sample sizes. caty_var is the name of the column in the sampling frame that represents the unequal probability variable. If the sample is stratified, caty_n must instead be a list whose names match the names of n_base and whose values are named vectors. Each named vector has names that represent the categorical variable levels and values that represent within-strata expected sample sizes.
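
For example, a sketch of a stratified, unequal probability design might look like the following, where the particular within-stratum sample sizes are illustrative and sum to the stratum sample sizes in n_base:

R> # Illustrative stratified, unequal probability design: the names of
R> # caty_n match the strata and each named vector gives the expected
R> # within-stratum sample size for each lake area level.
R> n_strata <- c(low = 35, high = 15)
R> caty_n_strata <- list(low = c(small = 20, large = 15),
+    high = c(small = 5, large = 10))
R> uneqprob_strat <- grts(NE_Lakes, n_base = n_strata,
+    stratum_var = "ELEV_CAT", caty_n = caty_n_strata, caty_var = "AREA_CAT")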

Another approach is to sample sites proportionally to a positive auxiliary variable, which is sometimes referred to as proportional to size (PPS) sampling. PPS sampling can yield more efficient estimators when the response and auxiliary variables are positively correlated (Särndal et al. 2003). To select a GRTS sample with inclusion probabilities proportional to lake area, run

R> propprob <- grts(NE_Lakes, n_base = 50, aux_var = "AREA")

aux_var is the name of the column in the sampling frame that represents the PPS auxiliary variable.

2.2.1. Legacy sites

Often it is desired that some sites selected from an old sample are guaranteed to be selected in a new sample. Foster et al. (2017) discuss two types of sites that can be used to accomplish this goal: legacy (historical) sites and iconic sites. Legacy sites were randomly selected in the old sample, are in the current sampling frame, and must be in the current sample. Together, this implies that the new sample can be viewed as a possible joint realization from solely the current sampling frame. Legacy sites are often used to study behavior through time and can be beneficial to estimation (Urquhart and Kincaid 1999). Iconic sites, however, are not required to be randomly selected in the old sample or to be contained in the current sampling frame. Iconic sites are typically used because they represent sites of particular importance – consider a lake with a historically high level of a dangerous chemical. Because iconic sites are not selected randomly, they are not useful for estimation using the design-based approach.

Suppose the goal is to select a base GRTS sample of size n that includes nl legacy sites. The GRTS algorithm requires a small adjustment to incorporate these legacy sites. Legacy sites are first assigned inclusion probabilities as if they were non-legacy sites. Then the level-k grid cells are hierarchically ordered and mapped to the line (which has length n). The line lengths for the legacy sites are then increased to one. The line lengths of the remaining sites are scaled by (n − nl)/(n − Σi πi,l), where πi,l is the original line length of the ith legacy site. This scaling ensures the total line length remains n. The sample can then be selected using the ui from Section 2.2. Because the legacy sites have line length one, they will always be selected as the ui are systematically spaced by one. This scaling is only used to select the sample – the design weights for data analysis (discussed in Section 3) are based on the pre-scaled inclusion probabilities.
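
A small numeric sketch of this rescaling, assuming an equal probability sample of n = 50 from the 195 NE_Lakes sites with nl = 5 legacy sites:

R> # Each legacy site's line length is increased from 50/195 to one; the
R> # non-legacy line lengths are scaled so the total line length stays n.
R> n <- 50
R> pi_legacy <- rep(50 / 195, 5)
R> scale <- (n - 5) / (n - sum(pi_legacy))
R> 5 * 1 + scale * (n - sum(pi_legacy))   # total line length is still 50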

The grts() function accommodates legacy sites using the legacy_sites argument. legacy_sites is an ‘sf’ object that contains the legacy sites as POINT or MULTIPOINT geometries and uses the same coordinate reference system as the sampling frame. The NE_Lakes_Legacy data in spsurvey contains five legacy sites. To select a sample of size 50 that includes the legacy sites and gives non-legacy sites an equal inclusion probability, run

R> eqprob_legacy <- grts(NE_Lakes, n_base = 50,
+    legacy_sites = NE_Lakes_Legacy)

When accommodating legacy sites, n_base (50) equals the sum of the legacy sites (5) and the number of desired non-legacy sites (45). If the sampling design uses stratification, unequal selection probabilities, or proportional selection probabilities, the names of the columns representing these variables in legacy_sites must be provided using the legacy_stratum_var, legacy_caty_var, or legacy_aux_var arguments, respectively. By default, legacy_stratum_var, legacy_caty_var, and legacy_aux_var are assumed to have the same name as stratum_var, caty_var, and aux_var, respectively.

2.2.2. A minimum distance between sites

Recall that the GRTS algorithm selects sites that are spatially balanced with respect to the sampling frame, not geography. Because of this, the GRTS algorithm may select sites that are closer together in space than a practitioner desires. The GRTS algorithm can sacrifice some spatial balance with respect to the sampling frame to incorporate a minimum distance requirement between sites selected in a sample:

R> min_d <- grts(NE_Lakes, n_base = 50, mindis = 1600)

The units of mindis must match the units of the sampling frame for the minimum distance requirement to be applied properly. The technical details for the GRTS algorithm’s minimum distance adjustment are omitted here, but they involve an iterative component that is controlled by the maxtry argument to the grts() function. If the minimum distance requirement cannot be met for all sites selected in the sample, a warning message is returned. If the sample is stratified, mindis can be a list with stratum-specific minimum distance requirements.

2.2.3. Replacement sites

Sometimes a site is selected in the sample but data are not able to be collected at the site. This commonly occurs due to landowner denial or a lack of funding, among other reasons. When this occurs, it is helpful to have a set of replacement sites so that the desired sample size can still be reached. The grts() function provides two options for replacement sites: reverse hierarchical ordering and nearest neighbor.

Stevens and Olsen (2004) proposed the reverse hierarchical approach for selecting replacement sites. Suppose the desired number of base sites is n and replacement sites is nr. The GRTS algorithm is first used to select a spatially balanced sample of size n + nr. Recall that part of the GRTS algorithm is placing the sites in hierarchical order according to the set {a1⋯ak : a1 = 0,1,2,3; …; ak = 0,1,2,3}. Simply selecting the first n hierarchically ordered sites to be in the base sample is insufficient because nearby sites have nearby hierarchical addresses. Instead, the reverse hierarchical approach reverses the hierarchical address of the n + nr sites, yielding a new ordering according to the set {ak⋯a1 : ak = 0,1,2,3; …; a1 = 0,1,2,3}. Then the first n reverse hierarchically ordered sites compose the base sample and the remaining nr are the replacement sites. If a base site cannot be evaluated, the first of the nr replacement sites is used instead, and so on. This reverse hierarchical ordering ensures the n base sites retain as much spatial balance as possible. Because the GRTS sample is selected for a sample size of n + nr, the larger that nr is relative to n, the less spatially balanced the base sites, so choosing a realistic value for nr is important. To select a GRTS sample of size 50 with 10 reverse hierarchically ordered replacement sites, run

R> eqprob_rho <- grts(NE_Lakes, n_base = 50, n_over = 10)

The value supplied to n_base is n, and the value supplied to n_over is nr. If the sample is stratified, n_over can be a list with stratum-specific reverse hierarchical ordering requirements.

An alternative approach for replacement sites is the nearest neighbor approach. The nearest neighbor approach selects replacement sites after a GRTS sample of size n is selected. For each site in the GRTS sample, the distance is calculated between that site and all other sites in the sampling frame that are not part of the GRTS sample. Then the nearest nn sites are selected as replacement sites. The replacement sites are ordered from smallest distance to the largest distance; for example, the first replacement site is the site closest to the base site. To select a GRTS sample of size 50 with two nearest neighbor replacement sites for each base site, run

R> eqprob_nn <- grts(NE_Lakes, n_base = 50, n_near = 2)

The value supplied to n_base is n, and the value supplied to n_near is nn. If the sample is stratified, n_near can be a list with stratum-specific nearest neighbor requirements.
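
For example, a sketch of a stratified design that requests stratum-specific replacement sites of both types (the particular counts are illustrative):

R> # n_over and n_near are supplied as lists whose names match the strata
R> # in n_base.
R> n_strata <- c(low = 35, high = 15)
R> eqprob_strat_repl <- grts(NE_Lakes, n_base = n_strata,
+    stratum_var = "ELEV_CAT", n_over = list(low = 5, high = 3),
+    n_near = list(low = 1, high = 2))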

2.3. Summarizing, visualizing, and binding design sites

The summary() and plot() functions in spsurvey are also used to summarize and visualize the design sites (all the sites contained in sites_legacy, sites_base, sites_over, and sites_near). summary() and plot() for design sites require the object output from grts() and a formula. The formula is used the same way as it is for summary() and plot() applied to sampling frames, though using summary() and plot() for design sites requires that the formula contain siteuse. siteuse is a categorical variable added to sites_legacy, sites_base, sites_over, and sites_near that indicates the site type (Legacy, Base, Over, or Near). Incorporating siteuse enables breaking up the summaries and visualizations by site type. The default formula when summarizing or visualizing design sites is ~ siteuse.

Recall eqprob_rho is the unstratified, equal probability GRTS sample with reverse hierarchically ordered replacement sites. To visualize the design sites for eqprob_rho (Figure 3a), run

Figure 3: Base and replacement (using reverse hierarchical ordering) sites are shown for an unstratified, equal probability GRTS sample of the Northeastern lakes data. In (a), the base and replacement sites are shown. In (b), only the base sites are shown.

R> plot(eqprob_rho)

By default, plot() will use all non-NULL sites objects. To request particular sites objects, use the siteuse argument (Figure 3b):

R> plot(eqprob_rho, siteuse = "Base")

The design sites can be overlain onto the sampling frame via the sframe argument.

To summarize the design sites for each lake elevation level, run

R> summary(eqprob_rho, formula = siteuse ~ ELEV_CAT) 
siteuse by total:
Base Over
total 50 10
siteuse by ELEV_CAT:
Base Over
low 30 5
high 20 5

Running

R> plot(eqprob_rho, formula = siteuse ~ ELEV_CAT)

produces two separate visualizations: one for each level of ELEV_CAT. To summarize lake area for each site type, run

R> summary(eqprob_rho, formula = AREA ~ siteuse)
AREA by total:
Min. 1st Qu. Median Mean 3rd Qu. Max.
total 1.043181 2.491625 3.833015 13.26145 7.540559 137.8127
AREA by siteuse:
Min. 1st Qu. Median Mean 3rd Qu. Max.
Base 1.043181 2.539218 4.273565 14.52684 11.178641 137.81268
Over 1.767196 2.456281 2.804252 6.93449 5.619522 38.26573

Running

R> plot(eqprob_rho, formula = AREA ~ siteuse)

produces two separate visualizations: one for the Base sites and another for the Over sites. To bind together sites_legacy, sites_base, sites_over, and sites_near (four separate ‘sf’ objects) into a single ‘sf’ object, use sp_rbind():

R> sites_bind <- sp_rbind(eqprob_rho)

sites_bind is then easily written out using a function like sf::write_sf().
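
For example, a minimal sketch that writes the bound sites to a GeoPackage file (the file name is hypothetical):

R> sf::write_sf(sites_bind, "design_sites.gpkg")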

2.4. Printing design sites

Basic summaries of site counts in a design can be easily returned using print(). These summaries represent the crossing of variable type (total, stratification, unequal probability, and stratification and unequal probability) with site type (Legacy, Base, Over, and Near). Only crossings used in the design are returned. Next we print a design stratified by lake elevation category with legacy sites, reverse hierarchically ordered replacement sites, and nearest neighbor replacement sites

R> n_strata <- c(low = 10, high = 10)
R> n_over_strata <- c(low = 2, high = 5)
R> print(grts(NE_Lakes, n_base = n_strata, stratum_var = "ELEV_CAT",
+    legacy_sites = NE_Lakes_Legacy, n_over = n_over_strata, n_near = 1)) 

Summary of Site Counts:

siteuse by total:
Legacy Base Over Near
total 5 15 7 27
siteuse by stratum:
Legacy Base Over Near
high 0 10 5 15
low 5 5 2 12

2.5. Measuring spatial balance

We have discussed the notion of spatial balance but have not yet given a way to measure it. Stevens and Olsen (2004) proposed measuring spatial balance using Voronoi polygons (i.e., Dirichlet Tessellations). A Voronoi polygon for a base design site si contains the region in the sampling frame closer to si than any other design site. Stevens and Olsen (2004) define vi as the sum of the inclusion probabilities for all sites in the sampling frame contained in the ith Voronoi polygon. They show that the expected value of vi is 1 for all i. This framework motivates the use of loss metrics based on Voronoi polygons to measure spatial balance. One loss metric is Pielou’s evenness index (PEI; Shannon 1948; Pielou 1966), which is defined as

\mathrm{PEI} = 1 + \sum_{i=1}^{n} \frac{v_i}{n} \ln\left( \frac{v_i}{n} \right) \Big/ \ln(n),

where n is the sample size. PEI is bounded between zero and one. A PEI of zero indicates perfect spatial balance. As PEI increases, the spatial balance worsens.
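
As a check on the formula, the following illustrative sketch (not sp_balance()'s internal code) computes PEI from a hypothetical vector v of Voronoi-polygon inclusion-probability sums for a sample of size five:

R> v <- c(1.2, 0.8, 1.1, 0.9, 1.0)   # hypothetical v_i values; they sum to n = 5
R> n <- length(v)
R> 1 + sum((v / n) * log(v / n)) / log(n)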

The sp_balance() function in spsurvey measures spatial balance and requires three arguments: a set of design sites, the sampling frame, and a vector of loss metrics. The default loss metric is “pielou” for PEI, though several other metrics are available. To calculate PEI for the unstratified, equal probability GRTS sample with no replacement sites (eqprob), run

R> sp_balance(eqprob$sites_base, NE_Lakes)
stratum metric value
1 None pielou 0.0301533

To highlight the benefit of spatially balanced GRTS sampling, we can select a simple random sample (SRS) using spsurvey’s irs() function and measure its spatial balance (an SRS selects sites with equal probability, independent of spatial location).

R> set.seed(5)
R> eqprob_irs <- irs(NE_Lakes, n_base = 50)
R> sp_balance(eqprob_irs$sites_base, NE_Lakes)
stratum metric value
1 None pielou 0.04589258

The GRTS sample has better spatial balance than the SRS sample because the PEI value is lower in the GRTS sample. For stratified samples, spatial balance metrics can be calculated separately for each stratum using the stratum_var argument. We explore the relationship between spatial balance and estimation in Section 4.
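
For example, stratum-specific balance for the stratified sample eqprob_strat from Section 2.2 can be computed as in the following sketch:

R> # Spatial balance computed separately within each elevation stratum.
R> sp_balance(eqprob_strat$sites_base, NE_Lakes, stratum_var = "ELEV_CAT")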

2.6. Linear and areal sampling frames

The examples in Section 2 have thus far been applied to point resources. Applications to linear and areal resources use the same syntax – all that changes is the geometry type of the ‘sf’ object used as an argument. For example, we select an equal probability GRTS sample of size 25 from Illinois_River, a linear resource of reach segments on the Illinois River, by running

R> set.seed(5)
R> eqprob_linear <- grts(Illinois_River, n_base = 25)

We visualize the sample overlain onto the sampling frame (Figure 4a) by running

Figure 4: Equal probability GRTS samples from the Illinois River data (a) and the Lake Ontario data (b).

R> plot(eqprob_linear, sframe = Illinois_River, pch = 19)

Notice how the sample units are spread throughout the reach segments. The same approach can be used to select a GRTS sample of size 40 from Lake_Ontario, an areal resource of shoreline segments surrounding Lake Ontario, by running

R> set.seed(5)
R> eqprob_areal <- grts(Lake_Ontario, n_base = 40)

We visualize the sample overlain onto the sampling frame (Figure 4b) by running

R> plot(eqprob_areal, sframe = Lake_Ontario, pch = 19)

Notice how the sample units are spread throughout the shoreline.

To learn more about how the GRTS algorithm accommodates each of the three resource types (point, linear, areal), run ?grts and view the package vignettes (vignette(package = "spsurvey")). To learn more about the Illinois_River and Lake_Ontario data in spsurvey, run ?Illinois_River and ?Lake_Ontario, respectively.

3. Analysis

After collecting data at the design sites, population parameters can be estimated. Oftentimes, these parameters are population proportions, means, or totals. Suppose τ represents a population total. Horvitz and Thompson (1952) showed that an unbiased estimator of τ is given by

\hat{\tau} = \sum_{i=1}^{n} \frac{y_i}{\pi_i}, \qquad (1)

where n is the sample size, yi is the response variable measured at si (the ith design site), and πi is the inclusion probability of si. The term 1/πi is the reciprocal of πi and is called a design weight. The design weight quantifies how many sites si represents in the sampling frame. Though Equation 1 was originally derived for finite populations, Cordy (1993) showed it remains unbiased for infinite populations. Other parameters like proportions and means are estimated using similar forms of Equation 1.
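
As a small illustration of Equation 1, with hypothetical responses and inclusion probabilities, the total estimate is simply the weighted sum of the observations:

R> y <- c(2.1, 4.7, 3.3, 5.0)         # hypothetical responses
R> pi_i <- c(0.05, 0.10, 0.05, 0.20)  # hypothetical inclusion probabilities
R> sum(y / pi_i)                      # Horvitz-Thompson total estimate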

Horvitz and Thompson (1952) showed that an unbiased estimator of the variance of τˆ is given by

\widehat{\mathrm{VAR}}(\hat{\tau}) = \sum_{i=1}^{n} \frac{(1 - \pi_i)}{\pi_i^2} y_i^2 + \sum_{i=1}^{n} \sum_{j \neq i} \frac{(\pi_{ij} - \pi_i \pi_j)}{\pi_{ij} \pi_i \pi_j} y_i y_j, \qquad (2)

where πij is the probability both si and sj are included in the sample. In a finite population simple random sample, Equation 2 reduces to the following well-known formula:

\widehat{\mathrm{VAR}}(\hat{\tau}) = \frac{N(N - n)}{n(n - 1)} \sum_{i=1}^{n} \left( y_i - \frac{\hat{\tau}}{N} \right)^2, \qquad (3)

where N equals the number of sites in the sampling frame. Sen (1953) and Yates and Grundy (1953) derived a similar unbiased estimator of the variance of τ^. Both this estimator and Equation 2 rely on knowing the πij for all si and sj. Calculating πij can be very challenging for more complicated designs, so Hartley and Rao (1962), Overton (1987), and Brewer (2002) proposed different approaches to approximating πij when estimating variances (as in Equation 2).

The aforementioned variance estimators and πij approximations do not incorporate the spatial locations of the si. Stevens and Olsen (2003) derived an estimator of the variance of τ^ that does incorporate the spatial locations of the si by conditioning on random properties of the GRTS sample. This variance estimator is called the local neighborhood variance estimator, denoted VAR^_lnb(τ^), and is given by

\widehat{\mathrm{VAR}}_{lnb}(\hat{\tau}) = \sum_{i=1}^{n} \sum_{s_j \in D(s_i)} w_{ij} \left( \frac{y_j}{\pi_j} - \sum_{s_k \in D(s_i)} w_{ik} \frac{y_k}{\pi_k} \right)^2, \qquad (4)

where the wij are weights and D(si) is the set of design sites in si’s local neighborhood. Stevens and Olsen (2003) provide technical details and discuss how to determine the local neighborhoods. Equation 4 is useful for two reasons. First, it does not rely on πij. Second, incorporating the spatial locations of the si tends to reduce the variance of τ^ compared to a variance estimator that ignores spatial locations, which leads to narrower confidence intervals and more powerful hypothesis testing.

spsurvey provides a suite of functions for analyzing data. These functions implement the Horvitz-Thompson estimator (Equation 1) to estimate population parameters like proportions, means, and totals. The default variance estimator is the local neighborhood variance estimator (Equation 4), though the SRS, Horvitz-Thompson, and Yates-Grundy variance estimators as well as the πij approximations are also available. Next we show how to implement some of these analysis functions using the NLA_PNW data in spsurvey. The NLA_PNW data is an ‘sf’ object with several variables measured at 96 lakes (treated as a whole) in the Pacific Northwest Region of the United States. There are five variables in NLA_PNW we will use throughout the rest of this section: WEIGHT, which represents a continuous design weight equaling the reciprocal of the site’s inclusion probability (1/πi); URBAN, which represents a categorical identifier based on whether the site is in an urban or non-urban area; STATE, which represents a categorical state identifier (California, Oregon, Washington); BMMI, which represents a continuous benthic macroinvertebrate multi-metric index; and NITR_COND, which represents a categorical nitrogen condition (Good, Fair, Poor). To load NLA_PNW into your global environment, run

R> data("NLA_PNW", package = "spsurvey")

3.1. Categorical variable analysis

To analyze categorical variables in spsurvey, use the cat_analysis() function. cat_analysis() requires a few arguments: dframe, a data frame or ‘sf’ object that contains the data; vars, the variables to analyze; and weight, the design weights. The cat_analysis() function provides several pieces of output for each level of each variable in vars, including sample sizes, proportion estimates, total estimates, standard error estimates, margins of error (standard errors multiplied by a critical value), and confidence intervals. The proportion estimates are suffixed with a .P while the total estimates are suffixed with a .U (short-hand for unit total). Recall that the default local neighborhood variance estimator requires spatial coordinates. If dframe is a data frame, these are provided via the xcoord and ycoord arguments. If dframe is an ‘sf’ object, these are automatically taken from the ‘sf’ object’s geometry column. Additional variance estimation options are available via the vartype and jointprob arguments.
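
For example, a non-spatial variance estimator can be requested through vartype; a hedged sketch, assuming "SRS" is an accepted value of this argument:

R> # Categorical analysis of nitrogen condition with an SRS variance estimator
R> # instead of the default local neighborhood variance estimator.
R> nitr_srs <- cat_analysis(NLA_PNW, vars = "NITR_COND", weight = "WEIGHT",
+    vartype = "SRS")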

To perform categorical variable analysis of nitrogen condition, run

R> nitr <- cat_analysis(NLA_PNW, vars = "NITR_COND", weight = "WEIGHT")

To view the sample sizes, estimates, and 95% confidence intervals for the proportion of lakes in each nitrogen category, run

R> subset(nitr,
+    select = c(Category, nResp, Estimate.P, LCB95Pct.P, UCB95Pct.P))
Category nResp Estimate.P LCB95Pct.P UCB95Pct.P
1 Fair 24 23.69392 11.55386 35.83399
2 Good 38 51.35111 36.78824 65.91398
3 Poor 34 24.95496 13.35359 36.55634
4 Total 96 100.00000 100.00000 100.00000

The confidence level can be changed using the conf argument. To view the sample sizes, estimates, and 95% confidence intervals for the total number of lakes in each nitrogen category, run

R> subset(nitr,
+    select = c(Category, nResp, Estimate.U, LCB95Pct.U, UCB95Pct.U))
Category nResp Estimate.U LCB95Pct.U UCB95Pct.U
1 Fair 24 2530.428 1171.077 3889.780
2 Good 38 5484.120 3086.357 7881.883
3 Poor 34 2665.103 1375.258 3954.949
4 Total 96 10679.652 7903.812 13455.491

When vars is a vector, all variables are analyzed separately using a single call to cat_analysis(). Sometimes the goal is to estimate parameters for different subsets of the population – these subsets are called subpopulations. For example, to analyze nitrogen condition while treating each state as a separate subpopulation, run

R> nitr_subpop <- cat_analysis(NLA_PNW, vars = "NITR_COND",
+    subpops = "STATE", weight = "WEIGHT")

To view the sample sizes, estimates, and 95% confidence intervals for the total number of Oregon lakes in each nitrogen category, run

R> subset(nitr_subpop,
+    subset = Subpopulation == "Oregon",
+    select = c(Subpopulation, Category, nResp, Estimate.U, LCB95Pct.U,
+      UCB95Pct.U))
Subpopulation Category nResp Estimate.U LCB95Pct.U UCB95Pct.U
5 Oregon Fair 8 1298.8470 266.5980 2331.096
6 Oregon Good 26 2854.3752 1533.3077 4175.443
7 Oregon Poor 13 630.3551 241.3029 1019.407
8 Oregon Total 47 4783.5773 3398.7997 6168.355

When subpops is a vector, all subpopulations are analyzed separately using a single call to cat_analysis(). When vars and subpops are both vectors, all combinations of variables and subpopulations are analyzed separately using a single call to cat_analysis().

Suppose the sampling design was stratified by the URBAN variable. To incorporate stratification by urban category, run

R> nitr_strat <- cat_analysis(NLA_PNW, vars = "NITR_COND",
+    stratumID = "URBAN", weight = "WEIGHT")

To incorporate subpopulations (by state) and stratification (by urban category), run

R> nitr_strat_subpop <- cat_analysis(NLA_PNW, vars = "NITR_COND",
+    subpops = "STATE", stratumID = "URBAN", weight = "WEIGHT")

3.2. Continuous variable analysis

To analyze continuous variables in spsurvey, use the cont_analysis() function. Like cat_analysis(), cont_analysis() requires specifying the dframe, vars, and weight arguments. The cont_analysis() function provides several pieces of output for each variable in vars, including sample sizes, cumulative distribution function (CDF) estimates, percentile estimates, mean estimates, total estimates, standard error estimates, margins of error, and confidence intervals. The CDF, percentile, mean, and total estimates are returned in separate list elements and may be included or omitted using the statistics argument (by default, all quantities are estimated). As with cat_analysis(), the local neighborhood variance estimator is the default variance estimator.

To perform continuous variable analysis of benthic macroinvertebrate multi-metric index (BMMI), run

R> bmmi <- cont_analysis(NLA_PNW, vars = "BMMI", weight = "WEIGHT",
+    siteID = "SITE_ID")

To view sample sizes, estimates, and 95% confidence intervals for the mean, run

R> subset(bmmi$Mean,
+    select = c(Indicator, nResp, Estimate, LCB95Pct, UCB95Pct))
Indicator nResp Estimate LCB95Pct UCB95Pct
1 BMMI 96 56.50929 53.01609 60.00249

To visualize the CDF estimates alongside their 95% confidence intervals, run

R> plot(bmmi$CDF)

The percentile output is contained in bmmi$Pct. By default, a few specific percentiles are estimated, though this can be changed via the pctval argument.
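
For example, a sketch requesting a particular set of percentiles:

R> # Estimate the 10th, 25th, 50th, 75th, and 90th percentiles of BMMI.
R> bmmi_pct <- cont_analysis(NLA_PNW, vars = "BMMI", weight = "WEIGHT",
+    pctval = c(10, 25, 50, 75, 90))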

To analyze BMMI separately for each state, run

R> bmmi_state <- cont_analysis(NLA_PNW, vars = "BMMI", subpops = "STATE",
+    weight = "WEIGHT")

To view the sample sizes, estimates, and 95% confidence intervals for the mean in each state, run

R> subset(bmmi_state$Mean,
+    select = c(Subpopulation, Indicator, nResp, Estimate, LCB95Pct,
+      UCB95Pct))
Subpopulation Indicator nResp Estimate LCB95Pct UCB95Pct
1 California BMMI 19 50.48964 42.55357 58.42572
2 Oregon BMMI 47 61.29675 56.23802 66.35548
3 Washington BMMI 30 54.23036 48.06838 60.39234

To incorporate stratification (by urban category), run

R> bmmi_strat <- cont_analysis(NLA_PNW, vars = "BMMI", stratumID = "URBAN",
+    weight = "WEIGHT")

To incorporate subpopulations (by state) and stratification (by urban category), run

R> bmmi_strat_state <- cont_analysis(NLA_PNW, vars = "BMMI",
+    subpops = "STATE", stratumID = "URBAN", weight = "WEIGHT")

3.3. Additional analysis approaches

Several other analysis options are available in spsurvey: relative risk analysis using relrisk_analysis(); attributable risk analysis using attrisk_analysis(); difference in risk analysis using diffrisk_analysis(); change analysis using change_analysis(); and trend analysis using trend_analysis(). The arguments for these functions are nearly identical to the arguments for cat_analysis() and cont_analysis(), with a few occasional exceptions.

The relative risk of an event (with respect to a stressor) is the ratio of two quantities. The numerator of the ratio is the probability the event occurs given exposure to the stressor. The denominator of the ratio is the probability the event occurs given no exposure to the stressor. Mathematically, the relative risk is defined as

\mathrm{RR} = \frac{P(\text{Event} \mid \text{Stressor})}{P(\text{Event} \mid \text{No Stressor})},

where P(Event | Stressor) is the probability the event occurs given exposure to the stressor and P(Event | No Stressor) is the probability the event occurs given no exposure to the stressor. The attributable risk of an event (with respect to a stressor) is one minus a ratio of two quantities. The numerator of the ratio is the probability the event occurs given no exposure to the stressor. The denominator of the ratio is the overall probability the event occurs. Mathematically, the attributable risk is defined as

\mathrm{AR} = 1 - \frac{P(\text{Event} \mid \text{No Stressor})}{P(\text{Event})},

where P(Event) is the overall probability the event occurs.

Though relative risk and attributable risk are most often discussed in the medical literature, Van Sickle and Paulsen (2008) emphasize the usefulness of relative and attributable risk in the context of aquatic resources and stressors. The final risk metric available in spsurvey is difference in risk (with respect to a stressor). The difference in risk is the difference between the probability the event occurs given exposure to the stressor and the probability the event occurs given no exposure to the stressor. Mathematically, the difference in risk is defined as

\mathrm{RD} = P(\text{Event} \mid \text{Stressor}) - P(\text{Event} \mid \text{No Stressor}).

Because it is not a relative metric, the difference in risk complements the relative and attributable risks. The three risk metrics quantify several different aspects of risk and together help provide a complete characterization of a resource’s risk (with respect to a stressor).

The risk analysis functions in spsurvey require four new arguments: vars_response, which indicates the response variables; vars_stressor, which indicates the stressor variables; response_levels, which indicates the two levels of the response variables (event and no event); and stressor_levels, which indicates the two levels of the stressor variables (stressor present and stressor not present). If the vars_response and vars_stressor arguments are vectors, all combinations of vars_response and vars_stressor are analyzed. Subpopulations and stratification are accommodated via the subpops and stratumID arguments, respectively.
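
A hedged sketch of the argument structure is shown below; the response and stressor variable names are placeholders rather than variables in NLA_PNW, and response_levels and stressor_levels (not shown) can be supplied when the variable levels differ from the function's defaults:

R> # "RESPONSE_COND" and "STRESSOR_COND" are hypothetical two-level condition
R> # variables; see ?relrisk_analysis for the full argument list.
R> rr <- relrisk_analysis(NLA_PNW, vars_response = "RESPONSE_COND",
+    vars_stressor = "STRESSOR_COND", weight = "WEIGHT")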

Change and trend estimation are most commonly used to study the behavior of a resource through time. Change estimation focuses on comparing the resource at two time points. Parameters are estimated at each time point and the difference between the estimates is of interest. The variance of this difference incorporates the variability at each time point and the correlation between sites that are sampled at both time points. In trend estimation, parameters are estimated at each time point and a regression model fits a linear trend in the estimates through time. There are three available regression models: a simple linear regression model, a weighted linear regression model, and the mixed effects linear regression model from Piepho and Ogutu (2002).

The change and trend analysis functions in spsurvey require three new arguments: vars_cat, which indicates the categorical variables to estimate; vars_cont, which indicates the continuous variables to estimate; and a surveyID variable that distinguishes between the time points. The trend_analysis() function also requires the model_cat and model_cont arguments, which indicate the trend models for the categorical and continuous variables, respectively. As with the risk analysis functions, subpopulations and stratification are accommodated via the subpops and stratumID arguments, respectively.
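
A hedged sketch of a change analysis is shown below; NLA_two_surveys is a hypothetical data set containing both time points, and YEAR is a hypothetical column distinguishing them:

R> # Hypothetical data: estimate change in mean BMMI and in the nitrogen
R> # condition category proportions between the two surveys.
R> chg <- change_analysis(NLA_two_surveys, vars_cont = "BMMI",
+    vars_cat = "NITR_COND", surveyID = "YEAR", weight = "WEIGHT")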

4. Application

In this section, we use spsurvey to compare two sampling and analysis approaches: spatial and non-spatial. The spatial approach uses the GRTS algorithm for sampling and the local neighborhood variance estimator (Equation 4) for analysis. The non-spatial approach uses simple random sampling (SRS) and its variance estimator (Equation 3) for analysis. The data studied are from the United States Environmental Protection Agency’s 2012 National Lakes Assessment, a survey designed to monitor the status of lakes in the conterminous United States in 2012 (U.S. Environmental Protection Agency 2017).

We considered two variables in the NLA12 data: Atrazine presence (AP), a binary metric indicating whether Atrazine is present; and a continuous benthic macroinvertebrate multi-metric index (BMMI). Data were recorded at 1028 lakes for AP and 914 lakes for BMMI. By running

R> NLA12 <- sp_frame(NLA12)
R> summary(NLA12, formula = ~ AP + BMMI)
       total    AP          BMMI
 total:1030    N   :694    Min.   : 0.00
               Y   :334    1st Qu.:33.00
               NA's:  2    Median :43.90
                           Mean   :43.22
                           3rd Qu.:54.60
                           Max.   :86.10
                           NA's   :116

we see that the true proportion of lakes containing Atrazine is 0.3249, and the true mean BMMI of lakes is 43.22. By running

R> plot(NLA12, formula = ~ AP + BMMI)

we see that Atrazine presence is concentrated in the Upper Midwest (Figure 6a), while there is no clear spatial pattern for BMMI (Figure 6b). The data for each resource are treated as separate populations for the purposes of this section.

Figure 6: Spatial distributions of Atrazine presence (a) and a benthic macroinvertebrate multi-metric index (b) from the 2012 National Lakes Assessment.

A simulation study was used to compare the spatial and non-spatial approaches. First, unstratified, equal probability samples of size 250 were selected from the Atrazine presence population (Figure 6a) using the GRTS and SRS algorithms. Then several quantities were computed: the sample’s spatial balance measured using Pielou’s evenness index; an estimate, denoted by p^, of the true proportion of Atrazine presence, denoted by p; an estimate of the standard error of p^; and an indicator variable measuring whether a 95% confidence interval for p contains 0.3249. This process was repeated 2000 times, and then the following summary metrics were computed: mean spatial balance; mean bias, measured as the average deviation of p^ from p; root-mean-squared error, measured as the square root of the average squared deviation of p^ from p; the 95% confidence interval coverage rate; and mean margin of error, measured as the average half-width of the 95% confidence interval for p. The same process was used to study BMMI. The spsurvey functions grts(), irs(), sp_balance(), cat_analysis(), and cont_analysis() were used during these simulations.
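
A hedged sketch of a single Atrazine presence trial is shown below; it assumes NLA12 is the sf sampling frame used above (prior to the sp_frame() conversion) and that the design sites returned by grts() store their design weights in a column named wgt (the column name is an assumption):

R> # One simulation trial (illustrative): select a GRTS sample, measure its
R> # spatial balance, and estimate the proportion of lakes with Atrazine.
R> samp <- grts(NLA12, n_base = 250)
R> spb <- sp_balance(samp$sites_base, NLA12)$value
R> est <- cat_analysis(samp$sites_base, vars = "AP", weight = "wgt")
R> phat <- subset(est, Category == "Y")$Estimate.P / 100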

The Atrazine presence summary metrics are presented in Table 1. The mean spatial balance for the GRTS samples is lower than for the SRS samples (lower values indicate better spatial balance). The Atrazine presence estimates from the GRTS and SRS samples both appear to be unbiased (mean bias near zero), but the root-mean-squared error of the SRS estimates is roughly 25% higher than the root-mean-squared error of the GRTS estimates. The spatial approach and the non-spatial approach both have confidence interval coverage near 95%. The mean margin of error for the non-spatial approach, however, is roughly 24% higher than for the spatial approach. Boxplots representing each simulation trial’s spatial balance and margin of error are displayed for both approaches in Figure 7.

Table 1:

Sampling algorithm (Algorithm), mean spatial balance (SPB), mean bias (Bias), root-mean-squared error (RMSE), 95% confidence interval coverage (Coverage), and mean margin of error (MOE) for 2000 simulation trials comparing the spatial and non-spatial approaches for studying Atrazine presence.

Algorithm SPB Bias RMSE Coverage MOE
GRTS 0.0214 −0.0003 0.0206 0.9525 0.0406
SRS 0.0339 −0.0008 0.0258 0.9455 0.0505

Figure 7: Boxplots of spatial balance (a) and margins of error (b) in the 2000 simulation trials comparing the spatial and non-spatial approaches for studying Atrazine presence.

The BMMI summary metrics are presented in Table 2. These results are similar to the Atrazine presence results: GRTS samples tend to be more spatially balanced than SRS samples; the mean bias of estimates from the GRTS and SRS samples is near zero; the root-mean-squared error from the SRS samples is roughly 10% higher than the root-mean-squared error from the GRTS samples; confidence interval coverage is near 95% for both approaches; and the mean margin of error for the non-spatial approach is roughly 9% higher than the mean margin of error for the spatial approach. Boxplots representing each simulation trial’s spatial balance and margin of error are displayed for both approaches in Figure 8.

Table 2:

Sampling algorithm (Algorithm), mean spatial balance (SPB), mean bias (Bias), root-mean-squared error (RMSE), 95% confidence interval coverage (Coverage), and mean margin of error (MOE) for 2000 simulation trials comparing the spatial and non-spatial approaches for studying BMMI.

Algorithm SPB Bias RMSE Coverage MOE
GRTS 0.0213 0.0063 0.7655 0.9520 1.5303
SRS 0.0336 0.0134 0.8421 0.9440 1.6668

Figure 8: Boxplots of spatial balance (a) and margins of error (b) for 2000 simulation trials comparing the spatial and non-spatial approaches for studying BMMI.

The advantages of the spatial approach in this simulation study are clear. The GRTS samples are more spatially balanced than the SRS samples. The estimates from the GRTS samples are unbiased and have lower root-mean-squared error than estimates from the SRS samples. The spatial approach has smaller margins of error than the non-spatial approach (while retaining proper coverage). This implies that confidence intervals from the spatial approach are narrower (more precise) than confidence intervals from the non-spatial approach.

For Atrazine presence, the non-spatial approach has a roughly 25% higher root-mean-squared error than the spatial approach. For BMMI, the non-spatial approach has a roughly 10% higher root-mean-squared error than the spatial approach. The relative increase in root-mean-squared error is larger for Atrazine presence than for BMMI. This is likely because Atrazine presence has a stronger spatial pattern (Figure 6a) than BMMI (Figure 6b), suggesting that the stronger the spatial pattern, the greater the advantage of the spatial approach over the non-spatial approach.

5. Discussion

spsurvey offers a suite of tools for design-based statistical inference, with a focus on spatial data. The summary() and plot() functions summarize and visualize data. The grts() function selects spatially balanced samples from point, linear, and areal resources and flexibly accommodates stratification, varying inclusion probabilities, legacy (historical) sites, minimum distances between sites, and two options for replacement sites (reverse hierarchical ordering and nearest neighbor). The sp_balance() function computes the spatial balance of a sample. The sp_rbind() function binds the design sites together into a single ‘sf’ object. spsurvey’s analysis functions are used for categorical variable analysis (cat_analysis()), continuous variable analysis (cont_analysis()), relative risk analysis (relrisk_analysis()), attributable risk analysis (attrisk_analysis()), difference in risk analysis (diffrisk_analysis()), change analysis (change_analysis()), and trend analysis (trend_analysis()). Aside from these core functions, spsurvey has several other specialized functions that perform cluster sampling and analysis, cumulative distribution function (CDF) hypothesis testing, panel designs, power analysis, design weight adjustments, and more.

We plan to continually update spsurvey so that it is reflective of new research. Because spsurvey depends on sf for sampling and survey (Lumley 2020) for analysis, spsurvey may also change alongside these packages. spsurvey is an open-source project, and we want it to be as helpful and user-friendly as possible. To help us accomplish these goals, we encourage users to give us feedback regarding desired features, bug fixes, and other suggestions for spsurvey.

Figure 5: BMMI cumulative distribution function (CDF) estimates (solid line) and 95% confidence intervals (dashed lines).

Acknowledgments

We thank the editors and anonymous reviewers for their hard work and time spent providing us with thoughtful, valuable feedback which greatly improved the manuscript.

The views expressed in this manuscript are those of the authors and do not necessarily represent the views or policies of the U.S. Environmental Protection Agency. Any mention of trade names, products, or services does not imply an endorsement by the U.S. government or the U.S. Environmental Protection Agency. The U.S. Environmental Protection Agency does not endorse any commercial products, services, or enterprises.

Contributor Information

Michael Dumelle, United States Environmental Protection Agency.

Tom Kincaid, United States Environmental Protection Agency.

Anthony R. Olsen, United States Environmental Protection Agency.

Marc Weber, United States Environmental Protection Agency.

Data and code availability

All writing, code, and data associated with this manuscript are available for viewing and download in a supplementary R package located at the GitHub repository: https://github.com/USEPA/spsurvey.manuscript. Instructions for use are included in the repository’s README. This supplementary R package contains a replication script that can be used to reproduce all results presented in the manuscript. Replicating the simulation study could take 10–60 minutes, but results are provided as .rda files in the supplementary R package. Moreover, all necessary replication code can be found on the journal’s website.

References

  1. Barabesi L, Franceschi S (2011). “Sampling Properties of Spatial Total Estimators under Tessellation Stratified Designs.” Environmetrics, 22(3), 271–278. doi: 10.1002/env.1046.
  2. Benedetti R, Piersimoni F (2017). “A Spatially Balanced Design with Probability Function Proportional to the Within Sample Distance.” Biometrical Journal, 59(5), 1067–1084. doi: 10.1002/bimj.201600194.
  3. Benedetti R, Piersimoni F, Postiglione P (2017). “Spatially Balanced Sampling: A Review and A Reappraisal.” International Statistical Review, 85(3), 439–454. doi: 10.1111/insr.12216.
  4. Brewer K (2002). Combined Survey Sampling Inference: Weighing Basu’s Elephants. Arnold. URL https://books.google.at/books?id=JKw6f61bnMEC.
  5. Cordy CB (1993). “An Extension of the Horvitz-Thompson Theorem to Point Sampling from a Continuous Universe.” Statistics & Probability Letters, 18(5), 353–362. doi: 10.1016/0167-7152(93)90028-H.
  6. Dumelle M, Kincaid TM, Olsen AR, Weber MH (2023). spsurvey: Spatial Sampling Design and Analysis. R package version 5.4.1, URL https://CRAN.R-project.org/package=spsurvey.
  7. Foster SD (2021). “MBHdesign: An R Package for Efficient Spatial Survey Designs.” Methods in Ecology and Evolution, 12(3), 415–420. doi: 10.1111/2041-210x.13535.
  8. Foster SD, Hosack GR, Lawrence E, Przeslawski R, Hedge P, Caley MJ, Barrett NS, Williams A, Li J, Lynch T, Dambacher JM, Sweatman HPA, Hayes KR (2017). “Spatially Balanced Designs that Incorporate Legacy Sites.” Methods in Ecology and Evolution, 8(11), 1433–1442. doi: 10.1111/2041-210X.12782.
  9. Foster SD, Hosack GR, Monk J, Lawrence E, Barrett NS, Williams A, Przeslawski R (2020). “Spatially Balanced Designs for Transect-Based Surveys.” Methods in Ecology and Evolution, 11(1), 95–105. doi: 10.1111/2041-210X.13321.
  10. Grafström A (2012). “Spatially Correlated Poisson Sampling.” Journal of Statistical Planning and Inference, 142(1), 139–147. doi: 10.1016/j.jspi.2011.07.003.
  11. Grafström A, Lisic J (2019). BalancedSampling: Balanced and Spatially Balanced Sampling. R package version 1.5.5, URL https://CRAN.R-project.org/package=BalancedSampling.
  12. Grafström A, Lundström NL, Schelin L (2012). “Spatially Balanced Sampling through the Pivotal Method.” Biometrics, 68(2), 514–520. doi: 10.1111/j.1541-0420.2011.01699.x.
  13. Grafström A, Lundström NLP (2013). “Why Well Spread Probability Samples Are Balanced.” Open Journal of Statistics, 3(1), 36–41. doi: 10.4236/ojs.2013.31005.
  14. Grafström A, Matei A (2018). “Spatially Balanced Sampling of Continuous Populations.” Scandinavian Journal of Statistics, 45(3), 792–805. doi: 10.1111/sjos.12322.
  15. Hartley HO, Rao JNK (1962). “Sampling with Unequal Probabilities and without Replacement.” The Annals of Mathematical Statistics, 33(2), 350–374. doi: 10.1214/aoms/1177704564.
  16. Horvitz DG, Thompson DJ (1952). “A Generalization of Sampling without Replacement from a Finite Universe.” Journal of the American Statistical Association, 47(260), 663–685. doi: 10.2307/2280784.
  17. Kruskal W, Mosteller F (1979a). “Representative Sampling, I: Non-Scientific Literature.” International Statistical Review, pp. 13–24. doi: 10.2307/1403202.
  18. Kruskal W, Mosteller F (1979b). “Representative Sampling, II: Scientific Literature, Excluding Statistics.” International Statistical Review, pp. 111–127. doi: 10.2307/1402564.
  19. Kruskal W, Mosteller F (1979c). “Representative Sampling, III: The Current Statistical Literature.” International Statistical Review, pp. 245–265. doi: 10.2307/1402647.
  20. Lohr SL (2009). Sampling: Design and Analysis. 2nd edition. Cengage Learning.
  21. Lumley T (2020). survey: Analysis of Complex Survey Samples. R package version 4.0, URL https://CRAN.R-project.org/package=survey.
  22. McDonald T, McDonald A (2020). SDraw: Spatially Balanced Samples of Spatial Objects. R package version 2.1.13, URL https://CRAN.R-project.org/package=SDraw.
  23. Overton WS (1987). “A Sampling and Analysis Plan for Streams in the National Surface Water Survey.” Manuscript, Department of Statistics, Oregon State University, Corvallis, Oregon.
  24. Pantalone F, Benedetti R, Piersimoni F (2022). “Spbsampling: An R Package for Spatially Balanced Sampling.” Journal of Statistical Software, 103, 1–22. doi: 10.18637/jss.v103.c02.
  25. Pebesma E (2018). “Simple Features for R: Standardized Support for Spatial Vector Data.” The R Journal, 10(1), 439–446. doi: 10.32614/RJ-2018-009.
  26. Pielou EC (1966). “The Measurement of Diversity in Different Types of Biological Collections.” Journal of Theoretical Biology, 13, 131–144. doi: 10.1016/0022-5193(66)90013-0.
  27. Piepho HP, Ogutu JO (2002). “A Simple Mixed Model for Trend Analysis in Wildlife Populations.” Journal of Agricultural, Biological, and Environmental Statistics, 7(3), 350–360. doi: 10.1198/108571102366.
  28. Robertson B, McDonald T, Price C, Brown J (2018). “Halton Iterative Partitioning: Spatially Balanced Sampling via Partitioning.” Environmental and Ecological Statistics, 25(3), 305–323. doi: 10.1007/s10651-018-0406-6.
  29. Robertson BL, Brown JA, McDonald T, Jaksons P (2013). “BAS: Balanced Acceptance Sampling of Natural Resources.” Biometrics, 69(3), 776–784. doi: 10.1111/biom.12059.
  30. Särndal CE, Swensson B, Wretman J (2003). Model Assisted Survey Sampling. Springer-Verlag.
  31. Sen AR (1953). “On the Estimate of the Variance in Sampling with Varying Probabilities.” Journal of the Indian Society of Agricultural Statistics, 5(2), 119–127.
  32. Shannon CE (1948). “A Mathematical Theory of Communication.” The Bell System Technical Journal, 27(3), 379–423. doi: 10.1002/j.1538-7305.1948.tb01338.x.
  33. Stevens D, Olsen A (2003). “Variance Estimation for Spatially Balanced Samples of Environmental Resources.” Environmetrics, 14(6), 593–610. doi: 10.1002/env.606.
  34. Stevens D, Olsen A (2004). “Spatially Balanced Sampling of Natural Resources.” Journal of the American Statistical Association, 99(465), 262–278. doi: 10.1198/016214504000000250.
  35. Urquhart NS, Kincaid TM (1999). “Designs for Detecting Trend from Repeated Surveys of Ecological Resources.” Journal of Agricultural, Biological, and Environmental Statistics, pp. 404–414. doi: 10.2307/1400498.
  36. US Environmental Protection Agency (2017). “National Lakes Assessment 2012.” Technical Report EPA 841-R-16-114, The Office of Water and The Office of Research and Development, Washington, DC. URL https://www.epa.gov/national-aquatic-resource-surveys/national-lakes-assessment-2012-technical-report.
  37. Van Sickle J, Paulsen SG (2008). “Assessing the Attributable Risks, Relative Risks, and Regional Extents of Aquatic Stressors.” Journal of the North American Benthological Society, 27(4), 920–931. doi: 10.1899/07-152.1.
  38. Walvoort D, Brus D, De Gruijter J (2022). spcosa: Spatial Coverage Sampling and Random Sampling from Compact Geographical Strata. R package version 0.4-1, URL https://CRAN.R-project.org/package=spcosa.
  39. Walvoort DJJ, Brus DJ, De Gruijter JJ (2010). “An R Package for Spatial Coverage Sampling and Random Sampling from Compact Geographical Strata by k-means.” Computers & Geosciences, 36(10), 1261–1267. doi: 10.1016/j.cageo.2010.04.005.
  40. Wang JF, Jiang CS, Hu MG, Cao ZD, Guo YS, Li LF, Liu TJ, Meng B (2013). “Design-Based Spatial Sampling: Theory and Implementation.” Environmental Modelling & Software, 40, 280–288. doi: 10.1016/j.envsoft.2012.09.015.
  41. Yates F, Grundy PM (1953). “Selection without Replacement from within Strata with Probability Proportional to Size.” Journal of the Royal Statistical Society B, 15(2), 253–261. doi: 10.1111/j.2517-6161.1953.tb00140.x.
