PLOS One
. 2020 Apr 30;15(4):e0231863. doi: 10.1371/journal.pone.0231863

Using data derived from cellular phone locations to estimate visitation to natural areas: An application to water recreation in New England, USA

Nathaniel H Merrill 1,*, Sarina F Atkinson 2, Kate K Mulvaney 1, Marisa J Mazzotta 1, Justin Bousquin 3
Editor: Song Gao
PMCID: PMC7192446  PMID: 32352978

Abstract

We introduce and validate the use of commercially available human mobility datasets based on cell phone locations to estimate visitation to natural areas. By combining this data with on-the-ground observations of visitation to water recreation areas in New England, we fit a model to estimate daily visitation for four months to more than 500 sites. The results show the potential for this new big data source of human mobility to overcome limitations in traditional methods of estimating visitation and to provide consistent information at policy-relevant scales. However, the data providers’ opaque and rapidly developing methods for processing locational information required a calibration and validation against data collected by traditional means to confidently reproduce the desired estimates of visitation. We found that with this calibration, the high-resolution information in both space and time provided by cell phone location-derived data creates opportunities for developing next-generation models of human interactions with the natural environment.

Introduction

People visit natural areas to view the scenery and wildlife or to engage in any number of recreational activities they enjoy. These areas are important to society, as shown through the sheer number of people who visit parks, beaches, walking trails, and other natural spaces and their significant contribution to the economy [1]. However, it is difficult to quantify the use of natural areas across the many, diverse locations and the timing of those visits. Observational counts are time consuming and expensive to conduct, and traditional survey approaches have their own sampling complications for estimating visitation. Therefore, it is often unknown how many and what types of people visit natural areas—critical information for managers or researchers to apply in natural resource damage assessments, park and urban planning, economic valuation studies, tourism studies, as well as to inform many other management decisions [2,3].

This paper demonstrates the use of commercially available anonymized and aggregated data on cellular device locations to estimate visitation to natural areas. For brevity, we refer to these cellular device location-based datasets generally as “cell data” in this paper. We investigated how cell data performs in providing the types of visitation information needed in policy applications, information on the temporal and spatial distribution of visits to natural areas. Specifically, we compared the cell data with on-the-ground daily visitation counts that we collected, along with other federal, state and town records for water recreation areas. We found that the cell data contained useful information for estimating visitor use, but it required a correction (calibration) to match the scale of the observations and to be confident with its wider application. We then built and applied a statistical model to estimate daily visitation with cell data to more than 500 water recreation areas in the northeastern United States for the primary four months of recreational use (June-September) of 2017.

Our application demonstrates the potential for this emerging source of big data to provide comprehensive visitation information across many places and time windows, a feat that would be impractical with traditional methods. The high-resolution information in space and time provided by cell data expands opportunities for developing next-generation models of human interactions with the natural environment. In addition, cell data provides the ability to know not just how many people are visiting specific locations, but also where visitors are coming from within aggregated geographies. This information allows for basic calculations of distance traveled to a location and a deeper understanding of the community composition of visitors. While these new data sources may help overcome many logistical barriers to obtaining behavioral information at scale, our work highlights the ongoing need for traditional methods of collection for calibration and validation for these new data sources to be useful in common applications.

Background

Existing visitation information for natural areas is limited and currently comes from many different, often inconsistent, sources. There are visitation estimates derived from entrance fees, parking fees, lifeguard counts, electronic car or people counters, aerial surveys, remote sensing, or tailored observational sampling plans [4–14]. Each data source comes with its own nuances in terms of sampling issues and geographical and temporal coverage. Most ongoing visitation collection efforts only capture paying customers during the hours of fee collections, thereby providing only a subset of daily use. There are also detailed collections that coincide with a specific project or event, such as an oil spill, but the knowledge is often not generalizable past a particular region or time period. Often, the need to obtain visitation information arises after an event has occurred, making the before and after comparison difficult [3]. These event-based data collections are also resource intensive [2].

Efforts to overcome these barriers using other types of digital records include estimates derived from photo-sharing or other types of social media posts [15–21]. These techniques have been useful in estimating visitation to large parks, attractions, and natural areas around the world over long periods of time by providing monthly, seasonal, or yearly visitation estimates. However, these data sources represent only the small fraction of the public that opt to use those specific social media outlets, and they lack adequate temporal and spatial resolution. These factors limit the ability of social media-based methods to inform broader policies or localized environmental management. Use of cellular devices is much more common and the resolution of the information provided is much finer in both time and space.

Cell data come from the digital traces of people’s cell phone use and location. In the past, this was based on the location of the tower a device was communicating with or a triangulation from the device to various cellphone towers. This data provided information based on many people, but with locational and temporal accuracy issues related to connectivity to the cellphone tower network and people’s active use of the phone [22]. With the addition of GPS instrumentation and the growing ubiquity of smartphone-style cellular telephones, these digital traces have become more accurate and more frequent, and represent an increasing proportion of the population [23–28]. The device-level, raw data are collected by cellular telephone providers, GPS-enabled devices, and, increasingly, smartphone applications. Several third-party providers on the commercial market combine this information and sell processed data in a variety of formats [28].

Cell data in various forms have been used most widely in the transportation and urban planning fields to understand use of public infrastructure and commuting [23–26] and to assist in land use classifications [27]. This data has also been used to understand economic trends [29], restaurant choice [30] and in epidemiology and population research in developing countries [31–34]. Despite its promise, there are limited environmental applications of the data to date, with notable applications concerning natural disasters [32] and air quality [35,36].

There are a few recent applications of cell data to understand behavior in and around natural areas [37–39]. Kubo et al. [37] used cell data to calculate the economic value of coastal tourism across Japan, but provided no ground truth to the visitation information. For an island park in Korea [38], Kim et al. applied cell data to analyze tradeoffs between visitation and biodiversity and showed decent correlations between the cell data and monthly estimates of visitation to several specific locations on the island. A study of parks in California, USA, is the closest to the work presented in this paper [39]. The study used a similarly processed cell data product from a third-party vendor to estimate daily park visitation. They calibrated the cell data with just one set of data, vehicle counts on a major nearby road, finding a unit-value correction factor. They then validated their estimates against a single park’s gate traffic and parking information. They found good agreement with their corrected cell data model and daily vehicle counts. From there, they used park-specific vehicle-to-people ratios to extrapolate to the number of visitors to the other twenty-one parks of interest. Our study differs by incorporating multiple visitation records representing counts of people to a wider set of locations: eighteen different water recreation areas. We find similar potential for this data source to provide useful, policy-relevant visitor use information at daily and site-level scales for water recreation areas.

Data description

We identified and purchased a dataset of visitation derived from cell phone locations for a set of geographies and a time period to understand the extent, temporal distribution, and value of water recreation for Cape Cod, Barnstable County, Massachusetts and New England, USA in general. The set of water recreation areas comprises a comprehensive list of all the public beaches and public access points to water (fresh and saltwater beaches, public access points, parks, ways to water, and boat ramps) for Barnstable County, Massachusetts, compiled from federal, state, county, and town GIS information (n = 464), and an additional set of beaches across greater New England (n = 113). The data we used consists of estimates of visitation to these water recreation areas over the summer months of 2017 (June through September).

We purchased data products processed by a third-party provider, Airsage, Inc. This provider creates population-level estimates of human mobility derived from a panel of over 120 million devices using location information from smartphone applications (see S1 File). The data provider processes this device-specific locational information. Before we receive it, the data is anonymized and aggregated to contain no personally identifiable information. We do not obtain any device-level information, nor raw device GPS locations, but instead, we obtain aggregated summaries of visitation by recreation site and estimates of the visitors’ origin census block-group geographies. The data provider translates their sample to population-level estimates using weights based on the share of the population their sample represents by census-tract geographies. The cellular device sample we purchased data from includes about 30% of the U.S. population but varies by tract and month.

To obtain the cell data for the sample geographies of interest, we spatially buffered (added area) around the water-access sites which were designated as line or point features in the original spatial databases. In consultation with the data provider and after attempting a range of spatial buffers, a 100-meter buffer was chosen to balance specificity in capturing water recreation visits (i.e., not capturing ancillary points of interest in geographies, like restaurants or stores, for example) with the accuracy of the locational information. We sent the defined water recreation areas to the data provider as a set of geographic extents, or polygons (see Fig 1 for examples of area definitions), and they returned the aggregated and anonymized processed data in tabular form. We include a sample of the dataset below (Table 1) and include the entirety of this dataset available with the code package associated with this work at https://github.com/USEPA/Recreation_Benefits.git.
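The buffering step can be illustrated with a minimal, self-contained sketch. Our workflow used GIS polygons sent to the data provider; the function below is only a planar approximation of the core idea, deciding whether a device sighting falls within 100 meters of a point-type access site. All coordinates and the function name are hypothetical, for illustration only.

```python
import math

def within_buffer(site, point, radius_m=100.0):
    """Return True if `point` lies within `radius_m` of `site`.

    Both arguments are (lat, lon) in decimal degrees. Uses an
    equirectangular approximation, which is adequate at 100 m scales.
    """
    lat1, lon1 = site
    lat2, lon2 = point
    m_per_deg_lat = 111_320.0                         # metres per degree of latitude
    m_per_deg_lon = m_per_deg_lat * math.cos(math.radians(lat1))
    dy = (lat2 - lat1) * m_per_deg_lat
    dx = (lon2 - lon1) * m_per_deg_lon
    return math.hypot(dx, dy) <= radius_m

# A sighting 50 m east of a hypothetical access point falls inside the
# buffer; one 300 m north of it does not.
site = (41.6362, -70.3670)
near = (41.6362, -70.3670 + 50 / (111_320.0 * math.cos(math.radians(41.6362))))
far = (41.6362 + 300 / 111_320.0, -70.3670)
```

In practice, line features (e.g., ways to water) were buffered as polygons in GIS rather than treated as single points, and the point-in-polygon assignment was performed by the data provider.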

Fig 1. Example definition of water recreation areas.

Fig 1

Dowses Beach, Barnstable, Massachusetts, USA and nearby water access areas. Point and line features representing water recreation access were buffered by 100 meters to capture use at the sites. These geographic areas correspond to the sample of visitation estimates from the cell data.

Table 1. Sample of dataset derived from cell phone locations.

POI DATE HH00 HH01 HH02 HH03 HH04 HH05 HH20 HH21 HH22 HH23 DEVICE_TOTAL
1 20170601 251 96 47 0 171 488 668 848 812 222 21895
1 20170602 245 202 133 148 112 646 594 604 1157 468 21526
1 20170603 299 148 196 243 340 135 922 1283 1100 759 19449
1 20170604 658 332 117 372 414 395 1071 186 404 286 19415
2 20170601 0 0 0 0 0 0 86 0 0 0 1103
2 20170602 114 0 0 0 0 50 138 312 0 48 2746
2 20170603 0 0 0 0 0 0 238 226 122 66 2247
2 20170604 66 66 66 66 0 0 156 33 0 54 2046

POI indexes the water access locations and each column represents estimated visitation in hourly windows, with HH00 being 12AM-1AM. DEVICE_TOTAL refers to the estimate of unique devices seen in any of the 24 hours at that water access location.

The locational accuracy of the device locations underpinning the data ranges depending on the source device and the smartphone application. The accuracy of reported locations from applications varies with ranges of 1–10 meters (GPS), 20–200 meters (Wi-Fi), and 100–2000 meters (cell tower-based) based on the method(s) each application uses to locate each device. We were not able to obtain an average locational accuracy for devices seen in our geographies in our specific dataset since the smartphone applications do not report the exact location methods to the data provider and we do not receive device-specific locational information. Given the potential range in location accuracy, visits attributed to a water recreation area could have actually been to a nearby attraction, or vice-versa. We chose a relatively small buffer around the recreation areas to be conservative in defining the area attributed to use of the site and to minimize any mis-located visits. Given this and other limitations, we relied on the calibration and validation to on-the-ground visitation counts to assess the usefulness and accuracy of the cell data for our application and the choice of spatial definitions and buffers around sites.

In total, the cell dataset includes visitation estimates for 51,511 days across 577 sites. A complete set would be 70,394 days (577 sites x 122 days), but some of the days for some sites are missing due to low visitation and detection limits. A visit was defined as a device location history implying a stay estimated to be longer than five minutes at a geography that was not the home or work location of the device based on the behavior of that device over the month. Home was defined as the census block group where the device is most often seen over the month between 9PM-6AM (see S1 File).
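The provider's home-location rule is stated algorithmically, so it can be sketched in a few lines. We never receive device-level records; the function below only mirrors the stated definition for illustration, with hypothetical sighting data and block-group identifiers.

```python
from collections import Counter

def infer_home_block(sightings):
    """Infer a device's home census block group as the block group most
    often seen during nighttime hours (9PM-6AM), mirroring the data
    provider's stated definition over a month of observations.

    `sightings` is a list of (hour_of_day_0_23, block_group_id) tuples.
    Returns None if the device is never seen at night.
    """
    night = [block for hour, block in sightings if hour >= 21 or hour < 6]
    if not night:
        return None
    return Counter(night).most_common(1)[0][0]

# A hypothetical month of sightings: nights spent in one block group,
# daytime visits to a beach in another.
sightings = [(23, "250010125001"), (2, "250010125001"), (22, "250010125001"),
             (13, "440090501002"), (14, "440090501002"),   # daytime beach visits
             (5, "250010125001"), (12, "440090501002")]
```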

In addition to estimates of visits, the cell data for each area also includes the home location of those visitors, either at the census block-group level or categorized as international if the origin was outside of the United States. For example, the data may show 100 people visited beach x during a time period, with 30 of those people coming from census block group y, 20 from census block group z, and so on. The monthly visitor origin-destination data contains 642,915 data points representing monthly trip totals (577 sites x origin census block groups, which vary by site and month). This total is not inclusive of the data representing zero trips to destinations originating from block groups implied by the full origin-destination matrix. Because of the geographic and temporal scope, collecting this same information with traditional methods would be prohibitively expensive, time consuming, and inconsistent.
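The distance-traveled calculations that this origin-destination structure supports are straightforward. The sketch below computes a visit-weighted mean great-circle distance from origin block-group centroids to a site; the coordinates, trip counts, and function names are hypothetical, for illustration only.

```python
import math

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) pairs."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))   # mean Earth radius 6371 km

def mean_trip_distance(site, origins):
    """Visit-weighted mean distance to `site` from origin block-group
    centroids, given `origins` as {(lat, lon): monthly_trip_count}."""
    total = sum(origins.values())
    return sum(haversine_km(c, site) * n for c, n in origins.items()) / total

# Hypothetical origin-destination row: 30 monthly trips from one block
# group and 20 from another, to a single beach site.
beach = (41.4305, -71.4550)
origins = {(41.49, -71.31): 30, (41.82, -71.41): 20}
```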

Methods

To investigate the cell data’s ability to reproduce daily visitation counts, we compared the cell data to a series of observational counts collected from three different, commonly recorded sources for beach and park visitations. We then calibrated a model to translate the cell data into consistent estimates of visitation across many sites in the region of interest and across many days. We purposely designed the calibration to cover a wide range of access location sizes and visitation totals to test the transferability of the cell data and models built from it. No visitation dataset alone was perfect for use in calibration across a wide range of locations due to differences in counting methods, a common limitation of visitation records for natural spaces generally. Capturing visitation to natural areas is very challenging, and existing approaches all have their own limitations for capturing daily visitation [40]. The observational data that overlaps with the cell data consists of: 1) onsite counts of small access points to an estuary, 2) a town’s visitation estimates for their managed beaches, and 3) entrance fees collected by a town to a major beach.

Visitation observation methods also vary because the contexts for taking them differ. Although the objective may be the same, observing visitation at a major beach requires different methods than those used for a set of small access points around an estuary [41]. By comparing and calibrating our data to a combined set representing the variety of recreational visitation count methods that exist in the real world, we show the ability of cell data to both replicate the types of data that are traditionally used and to bridge the various observational visitation records common around natural resources. Each source of observational data is briefly described below.

  • Small: We quantified use of the Three Bays estuary system on Cape Cod, Massachusetts through observational sampling for eleven public access points within the estuary. The counts were taken as a combination of periodic people and car counts and sunrise-to-sunset counts of visitors and cars. The public access points include beaches, docks, boat ramps, and landings. Observational data from this study include visitation estimates for 11 public access points for seven days from June to August, 2017 [41]. (N = 72)

  • Medium: The diverse set of beaches for Barnstable, Massachusetts, on Cape Cod includes saltwater and freshwater beaches accessible to either the public or to town residents only. The dataset provided by the Town of Barnstable’s Recreation Division includes daily visitation estimates from lifeguard counts for seven of their beaches from Memorial Day (May 29) to Labor Day (September 4), 2017. (N = 234)

  • Large: Narragansett Town Beach in Rhode Island is a popular destination for tourists and residents. Access to this beach is divided into resident only and public entrances. The public entrance requires an entrance fee for each person, providing an accurate dataset of daily visitation to the beach. However, these entrance fees are only collected for the public part of the beach. By missing those visitors with resident permits, the data provided by the town of Narragansett underestimates visitation to the whole beach. Therefore, based on a 0.85:1 ratio between public and resident use obtained through parking lot counts conducted in the public and resident parking lots, the data were adjusted to represent daily use for the whole beach (see S1 File for details). Daily visitation numbers are provided from Memorial Day to Labor Day, 2017. (N = 86)

Fig 2 shows how the cell data-derived counts and observational counts compare. The cell data-derived estimates correspond well, but overestimate the observed visitor counts by about four times. This overestimation is likely due to several confounding assumptions. These assumptions start with choices in how the data provider processes the raw cellular device level information to associate records with geographies in certain time windows and to extrapolate the sample of cellular devices to population level estimates through their estimates of market penetration (see S1 File).

Fig 2. Observational counts compared against cell data.

Fig 2

Observations of visitation (9AM-4PM) plotted against uncalibrated visitation estimated from the cell data for the same hours.

The cell data product did not provide counts for the block of time that corresponded to our visitation counts (9AM-4PM), but rather reported counts by individual hour. Therefore, we had to translate these hourly counts to our time window by making assumptions on the length of stay of visitors, since the same device would be counted multiple times if it were to stay at the site for multiple hours. Following the data provider’s advice to match the cell data-derived information to visitation observations, we used an assumption of a three-hour average stay to match the time window of our observations. We could have picked data on one hour in the window to be representative of the whole three hours, the end hour for instance, but this would discard information in the other hours. Instead, we calculated a moving average (three-hour window) of visitation for each hourly visitation estimate from the cell data for each site. We then summed the moving average of only the central hour of three-hour blocks from 8AM-4PM (9AM, 12PM, 3PM) (see S1 File for more details). This reduces the cell data counts due to multiple sightings, since we summed only one of the three hours in each window, but maintains the information in the hourly distribution of use through the day.

An assumption of a shorter length of stay would have increased the cell data counts and vice-versa. For example, if we assumed a two-hour average length of stay, we would have used two-hour instead of three-hour windows in the daily sums, increasing the daily total. While three hours may be a long average length of stay for recreational visits to all the water access sites, the data reflecting this assumption were inputs to the calibration models. We sought to correct any bias and inaccuracies introduced by this assumption by using the calibration models fit to on-the-ground counts below. Similarly, there is a difference in the relationship across the three sizes of access points that can also be seen in Fig 2. The differences by group are likely due to differences in the observational counting methods and possibly how well cell data performs based on the size of the area. We control for both possible effects in the statistical models used to calibrate the data.
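The moving-average translation described above can be sketched concisely. The function below is an illustrative stdlib-Python implementation of the three-hour centred moving average summed over non-overlapping block centres (9AM, 12PM, 3PM); the hourly counts are hypothetical and the function name is ours.

```python
def daily_visits(hourly, stay_hours=3, centers=(9, 12, 15)):
    """Translate hourly device counts into a daily visit estimate under
    an assumed average length of stay.

    A device staying `stay_hours` hours appears in `stay_hours` hourly
    bins, so summing a centred moving average at non-overlapping block
    centres counts each device roughly once. `hourly` maps hour-of-day
    (0-23) to device counts; `stay_hours` should be odd so the moving
    average is symmetric, and `centers` should tile the observation
    window in `stay_hours`-wide blocks (9AM, 12PM, 3PM for 9AM-4PM).
    """
    half = stay_hours // 2

    def moving_avg(h):
        window = range(h - half, h + half + 1)
        return sum(hourly.get(k, 0) for k in window) / stay_hours

    return sum(moving_avg(h) for h in centers)

# Hypothetical day: a flat 300 devices seen in each hour from 8AM to
# 4PM yields 300 * 9 hours / 3-hour stays = 900 estimated visits.
flat = {h: 300 for h in range(8, 17)}
```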

Models and prediction

Our objective was to develop a model that predicts visitation to a range of water recreation areas using the cell data and other explanatory variables that are easily compiled across many places. These covariates include weather (temperature and precipitation), the month, day of the week, and size of the water access. The model controls for the different counting methods in the observational data. We estimated a varied set of candidate regression models including several functional forms where we defined linear and log-linear relationships between the visitor counts and the cell data and other regressors in R [42]. Since we did not have any preconceived notion of the functional forms of the relationships between the covariates and the dependent variable, we also estimated a more general random forest model. A random forest model is a type of non-parametric model commonly used in the data science and machine learning fields. Random forests have been shown to reproduce many functional forms and to have superior predictive performance over standard multivariate regression models for many applications [43,44].

The candidate model specifications were as follows:

  • Linear
    Yit = α + β1Cit + β2Dt + β3Wt + β4Si + β5Ai + eit (1)
  • Log-Linear
    log(Yit) = α + β1Cit + β2Dt + β3Wt + β4Si + β5Ai + eit (2)
  • Random Forest
    Yit=f(Cit,Dt,Wt,Si,Ai) (3)

Where,

Yit - Observed visits to site i on day t.

Cit - cell data-derived estimate of visitation to site i on day t.

Dt - Matrix of dummy variables for the month, day of the week, weekend, holiday.

Wt - Matrix of weather variables (precipitation, temperature, windspeed) for day t from Barnstable Municipal Airport weather station.

Si - Dummy variables for the source of observed visitation data (Narragansett Beach, Barnstable Town, Three Bays).

Ai - Area of site i.

α - intercept.

eit - error term.

We compared the candidate models based on their predictive performance. To avoid selecting an overfit model and therefore being overconfident in its out-of-sample model performance, we conducted a cross validation by splitting the data into training (in-sample) and test (out-of-sample) sets. We fit the candidate models to the training sets of data, predicted the test sets, and calculated fit statistics on the out-of-sample observations. We did this for 10 random splits of the data using a k-fold cross validation and present the average model performances across the 10 test sets in Table 3 [45,46]. Using this cross-validation technique is a statistical check and a data science best practice. Given the predictive purposes of these classes of models, we suggest that, in the future, a similar cross-validation process should be performed when using proxy-types of data for predicting visitation. Additional regression and random forest outputs, goodness-of-fit metrics, and details can be found in the S1 File as well as in the code package (https://github.com/USEPA/Recreation_Benefits.git).
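Although our models were fit in R, the k-fold procedure itself is simple to sketch. The stdlib-Python example below runs 10-fold cross validation for a single-predictor linear calibration (observed visits against cell counts) on synthetic data standing in for the real observations; the roughly four-to-one overcount is mimicked by the 0.25 slope. All data and names here are illustrative, not the paper's actual inputs.

```python
import random

def ols_fit(x, y):
    """Closed-form simple linear regression: returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def kfold_mae(x, y, k=10, seed=0):
    """Average out-of-sample mean absolute error over k random folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    maes = []
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        a, b = ols_fit([x[i] for i in train], [y[i] for i in train])
        maes.append(sum(abs(y[i] - (a + b * x[i])) for i in test) / len(test))
    return sum(maes) / k

# Synthetic stand-in for the calibration: observed visits are roughly a
# quarter of the raw cell counts, plus noise.
rng = random.Random(1)
cell = [rng.uniform(100, 5000) for _ in range(200)]
visits = [0.25 * c + rng.gauss(0, 30) for c in cell]
```

The out-of-sample MAE reported this way is an honest estimate of predictive error, unlike in-sample fit statistics, which reward overfitting.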

Ethics statement

This work was reviewed and deemed exempt by the Institutional Review Board (Study #17–3334) from the University of North Carolina at Chapel Hill.

Results

Using the cell data product from the data provider resulted in about a four-times overestimation of the type of recreational visitation we were looking to estimate when compared against observations. Despite the scale difference, we found the information contained in the cell data to be valuable to predict visitation across a diverse set of sites after calibration. From our models, there are a few ways to show that it is the information in the cell data that is providing most of the explanatory power as compared to the covariates (weather, area, source of the observational counts). Table 2 shows the regression results using just cell data (columns 1–2), then with additional covariates (columns 3–4). Using cell data alone produces a decent model in linear form. The additional value of the covariates can be seen in the improved statistics between columns 1 and 3 and between columns 2 and 4. In the random forest model, the cell data was by far the most useful variable in modeling visitation, as seen by metrics of variable importance (see S1 File).

Table 2. Regression results.

Visits Log(Visits) Visits Log(Visits)
(1) (2) (3) (4)
Cell data 0.245*** 0.0003*** 0.296*** 0.0002***
(0.005) (0.0000) (0.007) (0.0000)
Area (m2) 0.00003 0.000006***
(0.0002) (0.0000006)
Narragansett -646.796*** 0.182
(72.136) (0.225)
Town of Barnstable -60.612 -0.409***
(42.119) (0.131)
Temperature (°F) 10.398*** 0.066***
(2.567) (0.008)
Precipitation (inches) -26.180 -0.447***
(47.250) (0.147)
Constant 33.128* 4.320*** -334.114 0.539
(19.496) (0.064) (206.779) (0.644)
Observations 392 392 392 392
R2 0.86 NA 0.89 NA
ME 0.27 -430.96 0.43 -74.35
RMSE 318.08 3161.23 272.47 1030.21
MAE 186.69 791.33 174.68 345.93

Dummy variables are included for month and day of the week in columns 3 and 4. Columns 2 and 4 are in log-linear form. See code and S1 File for additional details and candidate models. Goodness of fit statistics are from out-of-sample sets from a 10-fold cross validation. Log-linear model predictions were converted to people terms for goodness of fit statistics. ME = mean error, RMSE = root mean squared error, MAE = mean absolute error.

* p<0.10.

** p<0.05.

***p<0.01.

Of the candidate models, we chose the random forest as the preferred model, given its predictive performance indicated by the lowest RMSE, MAE, and low bias (ME) during cross validation (see Table 3). To create the final comprehensive visitation dataset, we used the preferred random forest model to predict daily visitation to all 577 water-access areas in our sample for the four summer months (June-September) of 2017. Fig 3 plots predicted visitation using the random forest model against observational counts, showing a tight overall in-sample fit. Our calibrated model produced daily visitation estimates with an out-of-sample mean absolute error of 155 people (Table 3) based on the cross-validation.

Table 3. Performance statistics for candidate models.

ME RMSE MAE R-Squared
Linear model 0.43 272.47 174.68 0.89
Log-linear model -74.35 1030.21 345.93 NA
Random forest -3.78 262.48 154.84 0.91

ME = mean error, RMSE = root mean squared error, MAE = mean absolute error. See S1 File for complete set of model output and performance statistics.

Fig 3. Cell data modeled visitation compared against observations.

Fig 3

Predicted daily (9AM-4PM) visits from the cell data model compared to observed visitation at three sizes of water recreation areas in New England, USA.

This model produced comprehensive visitation estimates across the region, but also at each individual site across days. The result of this work, a calibrated dataset of visitation including the code and data for producing the results in this paper can be found at https://github.com/USEPA/Recreation_Benefits.git. As examples of the usefulness of the broad geographical scope of the model and resulting database, Fig 4 shows daily visitation estimates to all public water-access points on Cape Cod for the summer months of 2017. Along with these types of landscape-scale results, the data and model provide focused site-specific information. Fig 5 shows daily visitation to a single beach across the season, information available for all sites and days across the four summer months.

Fig 4. Visitation for Cape Cod, MA, USA for the summer of 2017.

Fig 4

Total predicted visits (9AM-4PM) to water recreation areas for the summer of 2017 (June, July, August, September) for Cape Cod (Barnstable County, MA, USA), using the cell data model.

Fig 5. Narragansett Beach, RI, USA, daily visitation.

Fig 5

Predicted daily (9AM-4PM) visits to Narragansett Town Beach (Narragansett, RI, USA) for the summer of 2017 using the cell data model. These daily predictions are compared to observed visits based on on-the-ground counts.

Since we combined three different sources of visitation data to fit the model, we also ran candidate models on each visitation observation data source separately, and the relationships between the cell data and each visitation dataset remained similar (see S1 File). The in-sample fit of those models varied, with the smaller access point dataset, Three Bays, the least well fit (R2 = 0.36), and better fits for the larger access point, Narragansett Beach (R2 = 0.96). The number of observations varies across the sources, as do the specifics of how those observations were collected, but we suspect the cell data may be better at predicting visitation to larger areas with more daily visitation. There are more cellular devices in a sample of a day at the more popular places to estimate visitation from, likely reducing noise in the estimate.

We used the most accurate and unbiased of the candidate statistical models for prediction. However, there are several sources of potential inaccuracies and biases in estimating visitation in this way that are not incorporated in the metrics of model goodness-of-fit. The observational visitation counts contain their own uncertainties and potential biases based on their sampling design and counting methods. By calibrating and validating to those counts, we may be carrying over those issues to our estimates of visitation. In addition, the cellular data contains uncertainties resulting from the geospatial accuracy of the device locations, our geographic definition of the sites, and the methods of expansion from the device sample to population-level estimates. More applications using cell datasets are needed to understand these limitations combined with additional and more consistent collections of visitation observations for calibration.

The models we fit may also be susceptible to spatial autocorrelation arising from the cell dataset if there are geographic variations in how the data represents visitation. Spatial autocorrelation can inflate goodness-of-fit estimates, bias parameters, and reduce predictive performance. The models include controls for each group of sites, which are geographically clustered, to alleviate some of this potential issue. Similarly, collinearity among the covariates could cause poor predictions by, for example, attributing predictive information to the wrong covariate. We checked for these issues in a few ways, building up the model covariates sequentially starting with the cell data alone; doing so led to little change in the relationships between observations and cell data counts (see S1 File for more model details and variations). We also consistently found good out-of-sample goodness-of-fit metrics in cross-validation, giving us more confidence that spatial autocorrelation and collinearity were not compromising the models' predictive performance.
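The out-of-sample check described above follows the standard k-fold cross-validation pattern: hold out each fold in turn, fit on the rest, and score predictions on the held-out fold. The following is a stripped-down sketch of that evaluation loop using a simple linear calibration and synthetic data; the paper's actual models, covariates, and fold construction differ.

```python
# Illustrative k-fold cross-validation for a simple linear calibration
# (observed visits ~ cell counts). Shows only the out-of-sample evaluation
# pattern; data and model are hypothetical, not the paper's.
import random

def fit_ols(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

def kfold_r2(x, y, k=5, seed=0):
    """Mean out-of-sample R^2 across k folds."""
    idx = list(range(len(x)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    scores = []
    for hold in folds:
        train = [i for i in idx if i not in hold]
        a, b = fit_ols([x[i] for i in train], [y[i] for i in train])
        my = sum(y[i] for i in hold) / len(hold)
        ss_res = sum((y[i] - (a + b * x[i])) ** 2 for i in hold)
        ss_tot = sum((y[i] - my) ** 2 for i in hold)
        scores.append(1 - ss_res / ss_tot)
    return sum(scores) / k

# Hypothetical site-day pairs of cell counts and observed visits
x = [10, 25, 40, 55, 80, 120, 160, 210, 260, 330, 400, 480]
y = [35, 80, 130, 180, 270, 400, 530, 700, 860, 1100, 1330, 1600]
print(f"mean out-of-sample R2: {kfold_r2(x, y):.3f}")
```

Because each fold is scored on data the model never saw, an inflated in-sample fit driven by autocorrelation or collinearity tends to show up as a drop in the out-of-sample score.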

Discussion

We contend that the use of cell data provides a valuable method to quantify visitation across large numbers of areas and over time. When calibrated, the cell data provided an accurate and consistent way to estimate visitation to natural areas. It allowed us to produce a previously unavailable dataset of water visitation at a policy-relevant resolution, spatial extent, and consistency. The information provides visitation details for specific locations at a regional or sub-regional scale, the scale at which most decisions and policies regarding public natural areas are made. Understanding the scale and timing of visitor use of an area allows managers to determine and provide appropriate facilities and safety precautions, and allows researchers to predict impacts of environmental change, measure effects of natural disasters, and study the social and economic value of public access to natural areas. To date, no other method for quantifying use has the same capacity to provide location-specific data across broad geographic areas at such a high temporal resolution.

The data can inform landscape-level analyses of behavior to understand, at a broader scale, the impacts of changes in environmental quality across time and across access geography. The origin data can be used to better understand who visits an access point based on where they come from, providing informative profiles of the people and communities that are, or could be, affected by environmental degradation or improvements, or by other policy and management decisions. The fine temporal resolution of cell data also opens avenues for longitudinal studies that take advantage of variations in behavior and environmental quality across time. For instance, it makes it possible to quantify the number of visitors affected and the economic impacts incurred from beach closures, algal blooms, oil spills, or other events that typically are not captured consistently with current methods or that lack baseline data for measuring impacts.

Given our results, we suggest caution in using visitation estimates out-of-the-box from data providers without calibration. There is useful information in the datasets, but we found that calibration was necessary to use the data confidently for our purposes. In its delivered form, the data overestimated the type of use we were interested in quantifying: recreational visits to water-access areas. We hypothesized the need for this correction based on a few practical factors discussed below.

We could only define recreation visitation by limiting the requested geographies (the GIS site boundaries we submitted to the data providers) to areas where recreation would likely be the primary purpose for visiting, and by restricting counts to time windows of interest. The on-the-ground visitation observations were inherently more restrictive, capturing visits for recreation while excluding ancillary or non-recreational visits to the geographies, such as walking past a site on a nearby road. It is also reasonable to assume some observational counts may be conservative and under-report recreation visitation due to sampling constraints.

Additionally, a cascade of statistical modeling assumptions is made by the third-party providers to take raw device locational information to the anonymized and aggregated form delivered to customers (see S1 File for the publicly available description of Airsage's process). The exact details of each private provider's data-processing workflow are intellectual property and protected as such. This opaqueness motivates methods for judging performance critically, as well as the construction of additional models based on common cell data products for popular environmental applications, such as the one described in this paper. Because the data providers' methods are constantly in development, estimating performance on common observational datasets and with common methods would provide more clarity and confidence (statistical and otherwise) for cell data's use in policy and management applications. Lack of cellular connectivity in some natural areas may also limit the data's usefulness, although GPS-based and application-derived locational methods overcome some of this limitation by passing along information once the device reconnects to a network. It is in a user's best interest to calibrate the product for their application when possible, or to treat the uncalibrated information as a relative metric. We demonstrated a simple calibration method in this paper and provide the data and code for others to work from and improve as more users apply these types of data products.

While cell data-derived information is an exciting development for researchers and managers, we found, counterintuitively, that attempting to use it for a practical application only further motivated the need for more accurate, consistent, and unbiased observations of visitation using traditional methods. Modeling is hindered by the lack of training datasets and would be greatly improved by larger and more uniformly collected observations. This is especially true for machine-learning algorithms [47]. For example, with small and practical tweaks in the way visitation records are collected at water-access areas, such as periodic counts of cars and people at specific times, visitation records could become more harmonized and useful [12,41]. From there, visitation proxies such as cell data or social media-based models can provide a platform for spatial and temporal extrapolation across broad geographies, as we demonstrate here. The need for such models is not confined to water-access visitation; it is relevant to many similar policy contexts, for example national parks or urban green spaces. The differences in how well our models fit depending on the visitation data source, with larger, more popular locations fit better than smaller ones, should serve as a caution for approaches that apply cell data-derived visitation estimates in settings with no similar observational data for comparison.

In this paper, we explored only a few dimensions of the information in the cell data-derived dataset, focusing on estimating visitation. The origin information provided about the visitors to the water-access sites, monthly visitation by site and census block group in our case, provides the type of data needed for a myriad of travel-demand models for recreation [48]. Previous social media-based work [18] pointed out this potential opportunity, but travel-demand models typically require more spatially and temporally resolved information than what can currently be found in data derived from social media. Fig 6 displays origin-destination data for one site, Narragansett Beach, Rhode Island, for August 2017; our dataset includes this information for all sites and months. More work is needed to calibrate and validate the origin-destination information provided by cell data products, for environmental applications and more generally [49]. Validating the origin data and fitting travel-demand models is beyond the scope of this paper and is left for future work. We include the origin-destination data in the data package associated with this paper for other users who wish to tackle this natural next step. As with the visitation estimates, the possibility of using new sources of data for origin-destination travel-demand models encourages the implementation of more commonly worded and formatted general-population and visitor-intercept surveys to provide the necessary corroborating and calibrating datasets.
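Origin-destination records of the kind described, monthly visits by site and origin census block group, lend themselves to straightforward aggregation. The sketch below shows one way such records might be summarized; the field names, block group codes, and counts are hypothetical and do not come from the paper's dataset.

```python
# Illustrative aggregation of origin-destination records like those described:
# monthly visit counts by site and origin census block group.
# Field names, codes, and values are hypothetical.
from collections import defaultdict

records = [
    {"site": "Narragansett Beach", "origin_bg": "440090501001",
     "month": "2017-08", "visits": 1200},
    {"site": "Narragansett Beach", "origin_bg": "440090501002",
     "month": "2017-08", "visits": 800},
    {"site": "Narragansett Beach", "origin_bg": "250010101001",
     "month": "2017-08", "visits": 450},
    {"site": "Three Bays", "origin_bg": "250010101001",
     "month": "2017-08", "visits": 90},
]

def visits_by_origin(records, site, month):
    """Total visits to one site in one month, keyed by origin block group."""
    out = defaultdict(int)
    for r in records:
        if r["site"] == site and r["month"] == month:
            out[r["origin_bg"]] += r["visits"]
    return dict(out)

od = visits_by_origin(records, "Narragansett Beach", "2017-08")
print(od)  # {'440090501001': 1200, '440090501002': 800, '250010101001': 450}
```

Joining a table like this to block group demographics or to travel distances is the step that connects the origin data to the travel-demand models discussed above.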

Fig 6. Visitor origins for Narragansett Beach, RI, USA.

Fig 6

Count of visitors by census block group origins for visitors to Narragansett Town Beach, Narragansett RI, USA, (black star on the map) in August 2017. This monthly origin information exists for each of the 577 access points in our sample.

The visitation dataset resulting from the work in this paper is useful for federal, state and town managers and agencies in the region for any number of applications requiring visitor use information. For example, state managers may use this information to determine the allocation of beach monitoring resources, a town shellfish warden may use it to identify the most actively fished sites, or a tourism board could use it to understand the profile of visitors to popular natural attractions. Of broader interest in natural resource management and research, these types of models using cell data and a similar calibration process could be developed for other areas or for other purposes such as understanding the scale of ecosystem services, human-wildlife conflict, water consumption demand, or emergency response, for example.

Supporting information

S1 File

(DOCX)

Acknowledgments

The views expressed in this article are those of the authors and do not necessarily represent the views or policies of the U.S. Environmental Protection Agency. This contribution is identified by tracking number ORD-028692 of the U.S. Environmental Protection Agency, Office of Research and Development, Center for Environmental Measurement and Modeling, Atlantic Coastal Environmental Sciences Division. We would like to thank Anne Neale, Bryan Milstead and Ryan Furey for their thoughtful reviews.

Product disclaimer: Mention of trade names or manufacturers does not imply U.S. Government endorsement of commercial products.

Data Availability

All relevant data are available at https://github.com/USEPA/Recreation_Benefits.git.

Funding Statement

The author(s) received no specific funding for this work.

References

1. Leeworthy VR, Wiley PC. Current participation patterns in marine recreation. National Oceanic and Atmospheric Administration; 2001.
2. Tourangeau R, English E, McConnell KE, Chapman D, Cervantes IF, Horsch E, et al. The Gulf recreation study: Assessing lost recreational trips from the 2010 Gulf Oil Spill. Journal of Survey Statistics and Methodology. 2017;5(3):281–309.
3. Horsch E, Welsh M, Price J, Domanski A, Meade NF, Murray J. Best practices for collecting onsite data to assess recreational use impacts from an oil spill. 2017. Available from: ftp://ftp.library.noaa.gov/noaa_documents.lib/NOS/ORR/TM_NOS_ORR/TM_NOS-ORR_54.pdf
4. Deacon RT, Kolstad CD. Valuing beach recreation lost in environmental accidents. Journal of Water Resources Planning and Management. 2000;126(6):374–81.
5. Da Silva CP. Beach carrying capacity assessment: how important is it? Journal of Coastal Research. 2002;36(sp1):190–8.
6. Dwight RH, Brinks MV, SharavanaKumar G, Semenza JC. Beach attendance and bathing rates for Southern California beaches. Ocean & Coastal Management. 2007;50(10):847–58.
7. Harada SY, Goto RS, Nathanson AT. Analysis of lifeguard-recorded data at Hanauma Bay, Hawaii. Wilderness & Environmental Medicine. 2011;22(1):72–6.
8. King P, McGregor A. Who's counting: An analysis of beach attendance estimates and methodologies in southern California. Ocean & Coastal Management. 2012;58:17–25.
9. Garcia A, Smith JR. Factors influencing human visitation of southern California rocky intertidal ecosystems. Ocean & Coastal Management. 2013;73:44–53.
10. Kreitler J, Papenfus M, Byrd K, Labiosa W. Interacting coastal based ecosystem services: recreation and water quality in Puget Sound, WA. PLoS One. 2013;8(2):e56670. doi:10.1371/journal.pone.0056670
11. Morgan D. Development of a method to estimate and predict beach visitation. Tourism in Marine Environments. 2016;12(1):69–77.
12. Leggett CG. Sampling strategies for on-site recreation counts. Journal of Survey Statistics and Methodology. 2017;5(3):326–49.
13. Lyon SF, Merrill NH, Mulvaney KK, Mazzotta MJ. Valuing coastal beaches and closures using benefit transfer: An application to Barnstable, Massachusetts. Journal of Ocean and Coastal Economics. 2018;5(1):1. doi:10.15351/2373-8456.1086
14. English DB. Forest Service national visitor use monitoring process: research method documentation. US Forest Service, Southern Research Station; 2002.
15. Fisher DM, Wood SA, White EM, Blahna DJ, Lange S, Weinberg A, et al. Recreational use in dispersed public lands measured using social media data and on-site counts. Journal of Environmental Management. 2018;222:465–74. doi:10.1016/j.jenvman.2018.05.045
16. Van Berkel DB, Tabrizian P, Dorning MA, Smart L, Newcomb D, Mehaffey M, et al. Quantifying the visual-sensory landscape qualities that contribute to cultural ecosystem services using social media and LiDAR. Ecosystem Services. 2018;31:326–35. doi:10.1016/j.ecoser.2018.03.022
17. Keeler BL, Wood SA, Polasky S, Kling C, Filstrup CT, Downing JA. Recreational demand for clean water: evidence from geotagged photographs by visitors to lakes. Frontiers in Ecology and the Environment. 2015;13(2):76–81.
18. Wood SA, Guerry AD, Silver JM, Lacayo M. Using social media to quantify nature-based tourism and recreation. Scientific Reports. 2013;3:2976. doi:10.1038/srep02976
19. Hamstead ZA, Fisher D, Ilieva RT, Wood SA, McPhearson T, Kremer P. Geolocated social media as a rapid indicator of park visitation and equitable park access. Computers, Environment and Urban Systems. 2018;72:38–50.
20. Tenkanen H, Di Minin E, Heikinheimo V, Hausmann A, Herbst M, Kajala L, et al. Instagram, Flickr, or Twitter: Assessing the usability of social media data for visitor monitoring in protected areas. Scientific Reports. 2017;7(1):17615. doi:10.1038/s41598-017-18007-4
21. Sonter LJ, Watson KB, Wood SA, Ricketts TH. Spatial and temporal dynamics and value of nature-based recreation, estimated via social media. PLoS One. 2016;11(9):e0162372. doi:10.1371/journal.pone.0162372
22. Solmaz G, Turgut D. A survey of human mobility models. IEEE Access. 2019;7:125711–31.
23. Toole JL, Colak S, Sturt B, Alexander LP, Evsukoff A, González MC. The path most traveled: Travel demand estimation using big data resources. Transportation Research Part C: Emerging Technologies. 2015;58:162–77.
24. Blondel VD, Decuyper A, Krings G. A survey of results on mobile phone datasets analysis. EPJ Data Science. 2015;4(1):10.
25. Calabrese F, Ferrari L, Blondel VD. Urban sensing using mobile phone network data: a survey of research. ACM Computing Surveys (CSUR). 2014;47(2):1–20.
26. Çolak S, Alexander LP, Alvim BG, Mehndiratta SR, González MC. Analyzing cell phone location data for urban travel: current methods, limitations, and opportunities. Transportation Research Record. 2015;2526(1):126–35.
27. Toole JL, Ulm M, González MC, Bauer D. Inferring land use from mobile phone activity. In: Proceedings of the ACM SIGKDD International Workshop on Urban Computing; 2012. p. 1–8.
28. Toch E, Lerner B, Ben-Zion E, Ben-Gal I. Analyzing large-scale human mobility data: a survey of machine learning methods and applications. Knowledge and Information Systems. 2019;58(3):501–23.
29. Frias-Martinez V, Soguero-Ruiz C, Frias-Martinez E, Josephidou M. Forecasting socioeconomic trends with cell phone records. In: Proceedings of the 3rd ACM Symposium on Computing for Development; 2013. p. 1–10.
30. Athey S, Blei D, Donnelly R, Ruiz F, Schmidt T. Estimating heterogeneous consumer preferences for restaurants and travel time using mobile location data. In: AEA Papers and Proceedings; 2018. Vol. 108, p. 64–67.
31. Kung KS, Greco K, Sobolevsky S, Ratti C. Exploring universal patterns in human home-work commuting from mobile phone data. PLoS One. 2014;9(6):e96180. doi:10.1371/journal.pone.0096180
32. Lu X, Bengtsson L, Holme P. Predictability of population displacement after the 2010 Haiti earthquake. Proceedings of the National Academy of Sciences. 2012;109(29):11576–81.
33. Wesolowski A, Eagle N, Tatem AJ, Smith DL, Noor AM, Snow RW, et al. Quantifying the impact of human mobility on malaria. Science. 2012;338(6104):267–70. doi:10.1126/science.1223467
34. Kraemer MU, Golding N, Bisanzio D, Bhatt S, Pigott DM, Ray SE, et al. Utilizing general human movement models to predict the spread of emerging infectious diseases in resource poor settings. Scientific Reports. 2019;9(1):5151. doi:10.1038/s41598-019-41192-3
35. Yu H, Russell A, Mulholland J, Huang Z. Using cell phone location to assess misclassification errors in air pollution exposure estimation. Environmental Pollution. 2018;233:261–6. doi:10.1016/j.envpol.2017.10.077
36. Li M, Gao S, Lu F, Tong H, Zhang H. Dynamic estimation of individual exposure levels to air pollution using trajectories reconstructed from mobile phone data. International Journal of Environmental Research and Public Health. 2019;16(22):4522.
37. Kubo T, Uryu S, Yamano H, Tsuge T, Yamakita T, Shirayama Y. Mobile phone network data reveal nationwide economic value of coastal tourism under climate change. Tourism Management. 2020;77:104010.
38. Kim YJ, Lee DK, Kim CK. Spatial tradeoff between biodiversity and nature-based tourism: Considering mobile phone-driven visitation pattern. Global Ecology and Conservation. 2020;21:e00899.
39. Monz C, Mitrovich M, D'Antonio A, Sisneros-Kidd A. Using mobile device data to estimate visitation in parks and protected areas: an example from the Nature Reserve of Orange County, California. Journal of Park and Recreation Administration. 2019;37(4).
40. Wallmo K. Assessment of techniques for estimating beach attendance. Beach Sampling Report of NOAA. 2003;10.
41. Mulvaney KK, Atkinson SF, Merrill NH, Twichell JH, Mazzotta MJ. Quantifying recreational use of an estuary: a case study of Three Bays, Cape Cod, USA. Estuaries and Coasts. 2020;43(1):7–22. doi:10.1007/s12237-019-00645-8
42. R Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria; 2016. URL: http://www.R-project.org
43. Ziegler A, König IR. Mining data with random forests: current options for real-world applications. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery. 2014;4(1):55–63.
44. Wright MN, Ziegler A. ranger: A fast implementation of random forests for high dimensional data in C++ and R. arXiv preprint arXiv:1508.04409. 2015.
45. Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: IJCAI; 1995. Vol. 14, No. 2, p. 1137–1145.
46. Rodriguez JD, Perez A, Lozano JA. Sensitivity analysis of k-fold cross validation in prediction error estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2009;32(3):569–75.
47. Witten IH, Frank E, Hall MA, Pal CJ. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann; 2016.
48. Phaneuf DJ, Smith VK. Recreation demand models. Handbook of Environmental Economics. 2005;2:671–761.
49. Vanhoof M, Reis F, Ploetz T, Smoreda Z. Assessing the quality of home detection from mobile phone data for official statistics. Journal of Official Statistics. 2018;34(4):935–60.

Decision Letter 0

Song Gao

14 Jan 2020

PONE-D-19-32712

Using Data Derived from Cellular Phone Locations to Estimate Visitation to Natural Areas: An Application to Water Recreation in New England, USA.

PLOS ONE

Dear Merrill,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

We would appreciate receiving your revised manuscript by Feb 28 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Professor Song Gao, Ph.D.

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

http://www.journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and http://www.journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that Figures 3 and 5 in your submission contain map/satellite images which may be copyrighted.

All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (a) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (b) remove the figures from your submission:

a. You may seek permission from the original copyright holder of Figures 3 and 5 to publish the content specifically under the CC BY 4.0 license. 

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

The following resources for replacing copyrighted map figures may be helpful:

USGS National Map Viewer (public domain): http://viewer.nationalmap.gov/viewer/

The Gateway to Astronaut Photography of Earth (public domain): http://eol.jsc.nasa.gov/sseop/clickmap/

Maps at the CIA (public domain): https://www.cia.gov/library/publications/the-world-factbook/index.html and https://www.cia.gov/library/publications/cia-maps-publications/index.html

NASA Earth Observatory (public domain): http://earthobservatory.nasa.gov/

Landsat: http://landsat.visibleearth.nasa.gov/

USGS EROS (Earth Resources Observatory and Science (EROS) Center) (public domain): http://eros.usgs.gov/#

Natural Earth (public domain): http://www.naturalearthdata.com/

3. Your ethics statement must appear in the Methods section of your manuscript. If your ethics statement is written in any section besides the Methods, please move it to the Methods section and delete it from any other section. Please also ensure that your ethics statement is included in your manuscript, as the ethics section of your online submission will not be published alongside your manuscript.

4. Please include captions for your Supporting Information files at the end of your manuscript, and update any in-text citations to match accordingly. Please see our Supporting Information guidelines for more information: http://journals.plos.org/plosone/s/supporting-information


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: No

Reviewer #2: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This study presents a practical approach to estimate visitation to natural areas. While the overall methodology is sound, I do have some concerns in terms of how input data, which is used to fit the model, is derived:

1) The observation data are collected in very different means, which implies certain degree of inconsistency and systematic bias. It does make sense if the model is fit against a particular source of observational recording. However, I would question the validity if distinct observational sources are combined together. Have you tried to fit models against each individual data set and compare outcomes?

2) According to the supplementary materials, to calibrate the Narragansett Beach observational counts, you calculated the ratio (resident/public) in three 3-hour time windows and adopted the average (0.85) as the calibration ratio. The data suggest that the ratio is very different during different time windows, and the sums are actually very close: 553 (public) vs. 528 (resident). Also, I am interested to know why three different dates (7/13, 7/24, and 7/26) were picked for data collection.

3) A 100-meter spatial buffer was created and a reasonable explanation is provided. What role does locational accuracy play here? According to the data description, location can be collected by GPS, Wi-Fi, or cell tower, and the location provider of each ping is unknown. For locations derived from cell towers, the 100-meter threshold would apparently lead to significant misclassification (i.e., inside or outside). You may want to justify the threshold selection by taking locational accuracy into account.

Some other questions/comments/suggestions:

1) The data description section can be improved. You mentioned "we obtain aggregated summaries of visitation by recreation site": what information does Airsage need (e.g., the name of the site, or the geographic extent of the areas of interest) to generate the aggregated summary, and what do the "aggregated summaries" look like? I strongly recommend including a table of sample data in this section.

2) How would applying a 3-hour moving average to the raw hourly count address double counting (or multiple counting)? This is not very clear to me. Can you provide a better explanation in the text?

Reviewer #2: This manuscript focuses on estimating the number of visitors to natural areas based on cellular data. The authors conducted multi-scale case studies based on cellular data and various field data (e.g., observed visitor data).

There have been abundant studies using cell phone data to study human mobility, but many previous studies were conducted from the perspective of transportation/urban geography. This study takes a different angle and investigates visitation patterns to natural areas based on cell phone data, which can potentially provide useful input to studies of natural resources in physical geography. As such, it has the potential to become something useful in the field. However, several problems should be addressed before the authors move forward.

The structure of the paper can be improved. I would suggest moving the literature review in “data description” to the “background” section and keeping the data description more focused. The literature review is generally inadequate. Please consider adding more details regarding the types of cell phone data (e.g., CDR, assisted GPS data, Erlang data) and how these datasets have been used to model human mobility. Another point that could be discussed is the connection and differences between using mobile phone data in physical geography and in human/urban geography. The research question should also be explicitly stated in the introduction. Currently, it is unclear whether the authors wanted to focus on analyzing the spatial distribution of visitation, the temporal pattern of visitation, or something else (although this information was later provided in the methodology section).

Related papers:

Calabrese, F., Ferrari, L. and Blondel, V. D. (2015). Urban sensing using mobile phone network data: a survey of research. ACM Computing Surveys 47(2), pp. 1–20.

Yuan, Y., & Raubal, M. (2016). Exploring Georeferenced Mobile Phone Datasets – A Survey and Reference Framework. Geography Compass, 10(6), 239-252. doi:10.1111/gec3.12269.

The authors should also discuss how their work is different from the following study:

MONZ, Christopher et al. Using Mobile Device Data to Estimate Visitation in Parks and Protected Areas: An Example from the Nature Reserve of Orange County, California. Journal of Park and Recreation Administration, [S.l.], v. 37, n. 4, oct. 2019. ISSN 2160-6862.

P4 l114, the definition of a “visit” is unclear. What if a user showed up at two nearby locations? For example, if a user sits on a bench for 20 minutes and then uses a public restroom 100 meters away, is this considered two visits? Doesn’t it make more sense to cluster close-by points from the same device?

The mobile phone dataset should also be better explained. It may help to provide a few sample records.

L198 how did you decide on the 100-meter buffer? Please clarify.

L248 Did you consider the collinearity between the explanatory variables in your model? Some of the explanatory variables may not be independent from each other. More importantly, it is highly likely that the observed values are spatially auto-correlated, which can inflate the R2 values and jeopardize the reliability of the models.

Would the results be different if you consider different time periods during the day/week/month, etc.?

Overall, the results are interesting but I feel that the models should be designed more thoroughly and carefully.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Apr 30;15(4):e0231863. doi: 10.1371/journal.pone.0231863.r002

Author response to Decision Letter 0


20 Feb 2020

Reviewer 1:

1) The observation data are collected by very different means, which implies a certain degree of inconsistency and systematic bias. It makes sense if the model is fit against a particular source of observational recording; however, I would question the validity if distinct observational sources are combined. Have you tried fitting models against each individual data set and comparing the outcomes?

The lack of consistent visitation data collection across many places and times is precisely the problem that we hope the cell data-derived estimates will overcome. However, as you point out, we are limited by not having perfect calibration/validation datasets. We note this in a few places in the paper [starting on line 71 and line 182, for example] and in a paragraph in the discussion [starting on line 404]. Since we want to see how well the cell data works across many types and sizes of locations and levels of visitation, we had to use observational records from many different sizes and types of locations. The observation methods vary because the settings for taking them differ. Even if the final objective of the sampling is the same (daily visitation), observing visitation at a big beach with car and people receipts calls for a different method than at a set of small access points around an estuary, for example. As written, the description of the observational data implied that we chose different count methods on purpose; in fact, this was also out of necessity, to cover more types and numbers of places. We edited the description of the visitation datasets to make these points, starting on line 182.

We had to choose between using fewer observations from fewer places to fit our model, or more diverse observations from a larger set of locations and more varied geographic settings. The latter is closer to the objective of the application of the cell data, so we chose that path. We sought to control for some of these differences in sources in the overall (all data sources) models with a dummy variable for the source of data.

With all that said, when fit separately, the observational data sources result in similar models. We now provide fits of models against each data source separately in the Supporting Information and discuss them in the text. The relationship of the observational counts to the cell data is consistent across the visitation data sources. The models do vary, with the coefficient on the cell data ranging from 0.27 to 0.32, but that is to be expected when cutting any dataset into parts and fitting separate models. The fit is best for Narragansett Beach, followed by Barnstable’s town records, then the Three Bays counts. This also happens to be in order from the largest to the smallest visitation locations. This leads us to hypothesize that accuracy may vary with the quantity of visitation, possibly because larger beaches provide a larger sample of devices in the dataset to represent daily visitation on any given day. We note these differences between the observational datasets in the results and discussion:

Line 349:

“Since we combined three different sources of visitation data to fit the model, we also ran candidate models on each visitation observation data source separately, and the relationships between the cell data and each visitation dataset remained similar (see Supporting Information). The in-sample fit of those models varied, from the smaller access point dataset, Three Bays, being the least well fit (R2 = 0.36) to better fits for the larger access point of Narragansett Beach (R2 = 0.96). The number of observations varies with the sources, as do the specifics of how those observations were collected, but we suspect the cell data may be better at predicting visitation to larger areas with more daily visitation. There are likely more cellular devices in a day’s sample at the more popular places from which to estimate visitation, reducing noise in the estimate.”

And line 413:

“The differences in how well our models fit depending on the visitation data source, with larger, more popular locations fit better than small ones, should serve as a caution for approaches applying the cell data-derived visitation estimates to settings where there is no similar observational data for comparison.”

2) According to the Supplementary Information, to calibrate the Narragansett beach observational counts, you calculated the ratio (resident/public) in three 3-hour time windows and adopted the average (0.85) as the calibration ratio. The data suggest that the ratio is very different during different time windows, and the sums are actually very close: 553 (public) vs. 528 (resident). Also, I would be interested to know why three different dates (7/13, 7/24, and 7/26) were picked for data collection.

We chose these time windows because, from our work studying car/people methods on the Three Bays counting project (the small visitation dataset in this paper), the hours of 12-4 were most representative of the visitation for the day. The days were chosen out of convenience in the workweek over the summer. While Narragansett is unique in that it charges people (whether they park or not) to get on the beach, we found those records incomplete and chose this car method to capture resident use. Each counting method comes with its own issues, which is another argument for including more than one observational dataset to compare results and calibrate more general models, such as the one we created.
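As a side note on the arithmetic the reviewer raises, a mean of per-window ratios generally differs from a ratio of summed counts. A minimal illustration with hypothetical counts (not the actual Narragansett data):

```python
# Hypothetical per-window counts (NOT the actual Narragansett data),
# illustrating that a mean of window ratios differs from a ratio of sums.
public = [100, 250, 203]    # public-lot counts in three 3-hour windows
resident = [95, 190, 175]   # resident counts in the same windows

mean_of_ratios = sum(r / p for r, p in zip(resident, public)) / len(public)
ratio_of_sums = sum(resident) / sum(public)

print(round(mean_of_ratios, 2), round(ratio_of_sums, 2))  # → 0.86 0.83
```

The two summaries coincide only when the per-window ratios are equal, so which one to adopt as a calibration factor is a modeling choice.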

3) A 100-meter spatial buffer was created and a reasonable explanation is provided. What role does locational accuracy play here? According to the data description, location can be collected by GPS, Wi-Fi, or cell tower, and the location provider of each ping is unknown. For locations derived from cell towers, the 100-meter threshold would apparently lead to significant misclassification (i.e., inside or outside). You may want to justify the threshold selection by taking locational accuracy into account.

This choice balanced minimizing extraneous visitation from areas near, but not associated with, the access points of interest against having areas big enough to represent the use there. Often, access point databases are in line or point form, so some buffer has to be added to capture use around the area. We now include an example figure of these buffers (Fig 1).

We tried 100-, 200-, and 300-meter buffers and chose 100 meters in consultation with Airsage; we believed this buffer did not capture unwanted nearby attractions but was big enough for their process to work at the individual device level, which we, as a buyer of the data, do not see. Errors could go both ways: a visit defined as outside the area but actually inside it, or defined as inside the area but actually outside it. We decided that defining the area more narrowly rather than more broadly would limit the false attribution of visitation, which we consider a worse direction for an error than being conservative and underestimating visitation.

The approach seemed to work well against observational counts, which is the only way we could assess this issue without access to individual-level locational information. How big the buffer should be for a given application is an interesting question that the data providers, who have access to the individual-level data and locational processing methods, could tackle. For us, the justification is showing the accuracy of our choices against the observational counts we are trying to calibrate to.

We re-worked this discussion now in the paper, line 136:

“To obtain the cell data for the sample geographies of interest, we spatially buffered (added area) around the water-access sites, which were designated as line or point features in the original spatial databases. In consultation with the data provider and after attempting a range of spatial buffers, a 100-meter buffer was chosen to balance specificity in capturing water recreation visits (i.e., not capturing ancillary points of interest in geographies, like restaurants or stores, for example) with the accuracy of the locational information. We sent the defined water recreation areas to the data provider as a set of geographic extents, or polygons (see Fig 1 for examples of area definitions), and they returned the aggregated and anonymized processed data in tabular form. We include a sample of the dataset below (Table 1) and make the entire dataset available with the code package associated with this work at https://github.com/USEPA/Recreation_Benefits.git.

The locational accuracy of the device locations underpinning the data ranges depending on the source device and the smartphone application. The accuracy of reported locations from applications varies, with ranges of 1-10 meters (GPS), 20-200 meters (Wi-Fi), and 100-2000 meters (cell tower-based), based on the method(s) each application uses to locate each device. We were not able to obtain an average locational accuracy for devices seen in our geographies in our specific dataset, since the smartphone applications do not report the exact location methods to the data provider and we do not receive device-specific locational information. Given the potential range in location accuracy, visits attributed to a water recreation area could have actually been to a nearby attraction, or vice versa. We chose a relatively small buffer around the recreation areas to be conservative in defining the area attributed to use of the site and to minimize any mis-located visits. Given this and other limitations, we relied on the calibration and validation against on-the-ground visitation counts to assess the usefulness and accuracy of the cell data for our application and the choice of spatial definitions and buffers around sites.”
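The inside/outside classification that a buffer around a point feature implies can be sketched as a simple distance test. This is an illustration only, with hypothetical coordinates in a projected (metre-based) system; the actual site polygons were built in a GIS from line and point features, and the device-level classification is performed by the data provider:

```python
import math

# Hypothetical projected coordinates (metres). A 100 m buffer around a
# point-feature access site is, geometrically, a circle of radius 100 m.
site = (0.0, 0.0)

def within_buffer(ping, site, buffer_m=100.0):
    """True if a device ping falls inside the circular buffer around a point site."""
    return math.dist(ping, site) <= buffer_m

pings = [(30.0, 40.0),   # 50 m from the site
         (60.0, 80.0),   # exactly 100 m, on the boundary (counted inside)
         (120.0, 0.0)]   # 120 m, outside the buffer

inside = [p for p in pings if within_buffer(p, site)]
```

A ping with a cell-tower-scale location error of a few hundred metres could clearly land on the wrong side of this 100 m boundary, which is the misclassification risk the reviewer notes and the reason for validating against ground counts.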

Some other questions/comments/suggestions:

1) The data description section can be improved. You mentioned "we obtain aggregated summaries of visitation by recreation site": what information does Airsage need (e.g., the name of the site, or the geographic extent of the areas of interest) to generate the aggregated summary, and what do the "aggregated summaries" look like? I strongly recommend including a table of sample data in this section.

Airsage, and other providers, need defined geographic areas (polygons) that represent the points of interest (POIs). We created these polygons ourselves and sent them to Airsage. There are other products where a user might define a large extent and get a rasterized (heatmap-style) summary, but these were not the ones we pursued.

We rearranged and added to the data description to address these questions and other reviewers’ comments, moving the literature review of these cell data applications to the background section and focusing on the specifics of our data purchase in the data description section. We also moved up to the data description those parts of the methods that explained what we provided to Airsage and what they sent back, versus what we did with the data (which remains in the methods).

We added a sample of the data in a table to the data description section as suggested, and the full dataset is available with the code package. We now include an example of the spatial definitions in Fig 1.

2) How would applying a 3-hour moving average to the raw hourly count address double counting (or multiple counting)? This is not very clear to me. Can you provide a better explanation in the text?

The key to understanding this is that we applied the three-hour moving average but only summed one hour of the three (the middle hour) to get the day total. This reduces the sum by roughly one-third but maintains the information in all three hours, instead of picking the end, middle, or start of each three-hour block to represent all three. We re-worked and added to this description starting on line 229:

“The cell data product did not provide counts for the block of time that corresponded to our visitation counts (9AM-4PM), but rather by individual hours. Therefore, we had to translate these hourly counts to our time window by making assumptions on the length of stay, since the same device would be counted multiple times if it were to stay at the site for multiple hours. Following the data provider’s advice to match the cell data-derived information to visitation observations, we used an assumption of a three-hour average stay to match the time-window of our observations. We could have picked data on one hour in the window to be representative of the whole three hours, the end hour for instance, but this would discard information in the other hours. Instead, we calculated a moving average (three-hour window) of visitation for each hourly visitation estimate from the cell data for each site. We then summed the moving average of only the central hour of three-hour blocks from 8AM-4PM (9AM, 12PM, 3PM) (see Supporting Information for more details). This reduces the cell data counts due to multiple sightings, since we summed only one of the three hours in each window, but maintains the information in the hourly distribution of use through the day.

An assumption of a shorter length of stay would have increased the cell data counts and vice-versa. For example, if we assumed a two-hour average length of stay, we would have used two-hour instead of three-hour windows in the daily sums, increasing the daily total. While three hours may be a long average length of stay for recreational visits to all the water access sites, the data reflecting this assumption were inputs to the calibration models. We sought to correct any bias and inaccuracies introduced by this assumption by using the calibration models fit to on-the-ground counts below. Similarly, there is a difference in the relationship across the three sizes of access points that can also be seen in Fig 2. The differences by group are likely due to differences in the observational counting methods and possibly how well cell data performs based on the size of the area. We control for both possible effects in the statistical models used to calibrate the data.”
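The middle-hour summing described above can be sketched in a few lines. The hourly counts here are hypothetical; the real calculation was applied to the provider's hourly device counts for each site:

```python
# Hypothetical hourly device counts for one site, hours 8AM-4PM.
hours = list(range(8, 17))                    # 8, 9, ..., 16
counts = [12, 30, 45, 60, 70, 66, 50, 34, 20]

def centered_mavg(values, i, w=3):
    """Centered moving average of width w at index i (clipped at the edges)."""
    half = w // 2
    window = values[max(0, i - half): i + half + 1]
    return sum(window) / len(window)

# Sum the moving average at only the central hour of each non-overlapping
# three-hour block (9AM, 12PM, 3PM). This keeps the hourly shape of use
# while dividing the raw multi-sighting total by roughly three.
central_hours = [9, 12, 15]
daily_visits = sum(centered_mavg(counts, hours.index(h)) for h in central_hours)

print(daily_visits, sum(counts))  # 129.0 vs. a raw (multi-counted) 387
```

Under a two-hour length-of-stay assumption, the same logic would use two-hour blocks, yielding a larger daily total, which is the sensitivity the response describes.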

Reviewer 2

The structure of the paper can be improved. I would suggest moving the literature review in “data description” to the “background” section and keeping the data description more focused.

We rearranged these sections, moving the literature review of these cell data applications to the background section and focusing on the specifics of our data purchase in the data description section. We also moved up to the data description those parts of the methods that explained what we provided to Airsage and what they sent back, versus what we did with the data (which remains in the methods).

The literature review is generally inadequate. Please consider adding more details regarding the types of cell phone data (e.g., CDR, assisted GPS data, Erlang data) and how these datasets have been used to model human mobility. Another thing can be discussed is the connection and differences between using mobile phone data in physical geography and human/urban geography.

We edited this section and added more references that explain the locational data sources to the background section. The technicalities of the specific data inputs for the whole range of cellular locational data are beyond the scope of this paper and are better explained in the references we provided. We now provide a concise overview of the various sources and explain where the dataset we purchased came from (GPS locations from smartphone applications) in the data description section.

We do not think a discussion about the difference between physical and human/urban geography provides any clarity to the reader in understanding what we did or to understanding the types and goals of past applications of cell data better than the groupings of applications we point out in the literature review. We have added several recent relevant studies to the literature review.

The research question should also be explicitly stated in the introduction. Currently, it is unclear if the authors wanted to focus on analyzing the spatial distribution of visitations, the temporal pattern of visitations, or something else (although this information was later provided in the methodology section).

We edited the statement addressing this point in the introduction:

Line 51:

“We investigated how cell data performs in providing the types of visitation information needed in policy applications, information on the temporal and spatial distribution of visits to natural areas.”

Related papers:

Calabrese, F., Ferrari, L. and Blondel, V. D. (2015). Urban sensing using mobile phone network data: a survey of research. ACM Computing Surveys 47(2), pp. 1–20.

We added this reference to the list of references for use in urban and transportation research in the background section.

Yuan, Y., & Raubal, M. (2016). Exploring Georeferenced Mobile Phone Datasets – A Survey and Reference Framework. Geography Compass, 10(6), 239-252. doi:10.1111/gec3.12269.

This one seemed duplicative of other references we had, so we did not include it.

The authors should also discuss how their work is different from the following study:

Monz, Christopher et al. Using Mobile Device Data to Estimate Visitation in Parks and Protected Areas: An Example from the Nature Reserve of Orange County, California. Journal of Park and Recreation Administration, [S.l.], v. 37, n. 4, oct. 2019. ISSN 2160-6862.

This work was published after we initially submitted our paper, and we have since been in touch with the authors. Our work does differ in important ways, using records from more than one location and counts representing visitation by people instead of cars. We now cover this and other relevant recent work in an extended discussion of park visitation estimates in the background section.

Line 102:

“There are a few recent applications of cell data to understand behavior in and around natural areas [37-39]. Kubo et al. [37] used cell data to calculate the economic value of coastal tourism across Japan, but provided no ground truth to the visitation information provided by the cell data. For an island park in Korea [38], Kim et al. applied cell data to analyze tradeoffs between visitation and biodiversity and showed decent correlations between the cell data and monthly estimates of visitation to several specific locations on the island. A study of parks in California, USA, is the closest to the work presented in this paper [39]. The study used a similarly processed cell data product from a third-party vendor to estimate daily park visitation. They calibrated the cell data with just one set of data, vehicle counts on a major nearby road, finding a unit-value correction factor. They then validated their estimates against a single park’s gate traffic and parking information. They found good agreement with their corrected cell data model and daily vehicle counts. From there, they used park-specific vehicle-to-people ratios to extrapolate to the number of visitors to the other twenty-one parks of interest. Our study differs by incorporating multiple visitation records representing counts of people to a wider set of locations: eighteen different water recreation areas. We find similar potential for this data source to provide useful, policy-relevant visitor use information at daily and site-level scales for water recreation areas.”

P4 l114, the definition of a “visit” is unclear. What if a user showed up at two nearby locations? For example, if a user sits on a bench for 20 minutes and then uses a public restroom 100 meters away, is this considered two visits? Doesn’t it make more sense to cluster close-by points from the same device?

Processing of the individual-level information is done by the data provider, Airsage, which uses its own methods to define a visit based on the individual device location history. They define a visit as a device staying for more than 5 minutes in the area during an hour. They define device totals (a visit is counted only once if a device leaves and re-enters within the hour), as well as activity points (which deal with the same device re-entering, as you describe). We used the device totals to compare to visitation, as this is closest to what we want to count: a unique visit. Airsage uses clustering to define the device’s location. See the Supporting Information for the publicly available information on Airsage’s process. We point out in the paper, line 390:

“Additionally, there is a cascade of statistical modeling assumptions that are made by the third-party providers to take raw device locational information to the anonymized and aggregated form delivered to the customers (see Supporting Information for the publicly available description of Airsage’s process). The exact details of each private provider’s data processing workflows are their intellectual property and protected as such. This opaqueness motivates the use of methods to judge performance critically and the construction of additional models for popular environmental applications for use with common data products, such as the one described in this paper…”

The mobile phone dataset should also be better explained. It may help to provide a few sample records.

We rearranged and added to the data description to address these questions and other reviewers’ comments, moving the literature review of these cell data applications to the background section and focusing on the specifics of our data purchase in the data description section. We also moved up to the data description those parts of the methods that explained what we provided to Airsage and what they sent back, versus what we did with the data (which remains in the methods).

We now provide a sample of the data we used in Table 1.

L198 how did you decide on the 100-meter buffer? Please clarify.

This choice balanced minimizing extraneous visitation from areas near, but not associated with, the access points of interest against having areas big enough to represent the use there. Often, access point databases are in line or point form, so some buffer has to be added to capture use around the area. We now include an example figure of these buffers (Fig 1).

We tried 100-, 200-, and 300-meter buffers and chose 100 meters in consultation with Airsage; we believed this buffer did not capture unwanted nearby attractions but was big enough for their process to work at the individual device level, which we, as a buyer of the data, do not see. Errors could go both ways: a visit defined as outside the area but actually inside it, or defined as inside the area but actually outside it. We decided that defining the area more narrowly rather than more broadly would limit the false attribution of visitation, which we consider a worse direction for an error than being conservative and underestimating visitation.

The approach seemed to work well against observational counts, which is the only way we could assess this issue without access to individual-level locational information. How big the buffer should be for a given application is an interesting question that the data providers, who have access to the individual-level data and locational processing methods, could tackle. For us, the justification is showing the accuracy of our choices against the observational counts we are trying to calibrate to.

We re-worked this discussion now in the paper, line 136:

“To obtain the cell data for the sample geographies of interest, we spatially buffered (added area) around the water-access sites, which were designated as line or point features in the original spatial databases. In consultation with the data provider and after attempting a range of spatial buffers, a 100-meter buffer was chosen to balance specificity in capturing water recreation visits (i.e., not capturing ancillary points of interest in geographies, like restaurants or stores, for example) with the accuracy of the locational information. We sent the defined water recreation areas to the data provider as a set of geographic extents, or polygons (see Fig 1 for examples of area definitions), and they returned the aggregated and anonymized processed data in tabular form. We include a sample of the dataset below (Table 1) and make the entire dataset available with the code package associated with this work at https://github.com/USEPA/Recreation_Benefits.git.

The locational accuracy of the device locations underpinning the data ranges depending on the source device and the smartphone application. The accuracy of reported locations from applications varies, with ranges of 1-10 meters (GPS), 20-200 meters (Wi-Fi), and 100-2000 meters (cell tower-based), based on the method(s) each application uses to locate each device. We were not able to obtain an average locational accuracy for devices seen in our geographies in our specific dataset, since the smartphone applications do not report the exact location methods to the data provider and we do not receive device-specific locational information. Given the potential range in location accuracy, visits attributed to a water recreation area could have actually been to a nearby attraction, or vice versa. We chose a relatively small buffer around the recreation areas to be conservative in defining the area attributed to use of the site and to minimize any mis-located visits. Given this and other limitations, we relied on the calibration and validation against on-the-ground visitation counts to assess the usefulness and accuracy of the cell data for our application and the choice of spatial definitions and buffers around sites.”

L248 Did you consider the collinearity between the explanatory variables in your model? Some of the explanatory variables may not be independent from each other. More importantly, it is highly likely that the observed values are spatially auto-correlated, which can inflate the R2 values and jeopardize the reliability of the models.

We do not think collinearity is an issue with our regression. We are interested in the relationship of the observational counts to the cell data, which varies with temperature, weather, and day of the week, potentially clouding the marginal effect of each variable separately. Collinearity could be an issue in inflating coefficient variances and falsely attributing effects among collinear regressors, and could hinder out-of-sample performance. We can check for collinearity issues with the cell data coefficient, and the models in general, by specifying the regression without other covariates and adding groups of them sequentially. We did this in the Supporting Information model performance tables. We also show the variable importance output for the random forest model, showing that removing the cell data at each node resulted in the largest increase in variance of all the variables. This implies that it is the cell data that carries the information, not an issue of collinearity with combinations of the other covariates, like day of the week and weather. We reference these checks in the text starting on line 301:

“From our models, there are a few ways to show that it is the information in the cell data that provides most of the explanatory power as compared to the covariates (weather, area, source of the observational counts). Table 2 shows the regression results using just cell data (columns 1-2), then with additional covariates (columns 3-4). Just using cell data produces a decent model in linear form. The additional value of the covariates can be seen in the improved statistics between columns 1 and 3 and between columns 2 and 4. From the random forest model, the cell data was by far the most useful in modeling visitation, as seen by metrics of variable importance (see Supporting Information).”
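The variable-importance check described above can be sketched as follows. This is a minimal illustration with synthetic data, not the authors' code or data: the variable names, coefficients, and data-generating process are invented, and scikit-learn's impurity-based importances stand in for whatever implementation the Supporting Information used.

```python
# Illustrative sketch: rank covariates by random forest variable importance,
# expecting the cell-data feature to dominate when it carries the signal.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
cell = rng.gamma(2.0, 50.0, n)                  # device counts at a site (invented)
temp = rng.normal(25.0, 5.0, n)                 # daily high temperature (invented)
weekend = rng.integers(0, 2, n).astype(float)   # day-of-week indicator (invented)
# Synthetic visitation driven mostly by the cell counts, weakly by the rest.
visits = 1.5 * cell + 3.0 * temp + 20.0 * weekend + rng.normal(0.0, 25.0, n)

X = np.column_stack([cell, temp, weekend])
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, visits)

# Impurity-based importances: the cell-count feature should rank first.
for name, imp in zip(["cell", "temp", "weekend"], rf.feature_importances_):
    print(f"{name}: {imp:.2f}")
```

If the cell-data feature did not dominate here, that would suggest the other covariates, alone or in combination, were doing the explanatory work instead.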

Spatial auto-correlation could potentially be an issue, but we have a few controls for it. First, we include a dummy variable for the source of the observational counts; because the counts come from geographically clustered areas, this also acts as a dummy for the geographic areas of the beaches. We also use a cross-validation technique to check for overfitting issues in our model, whatever their cause, and present out-of-sample goodness-of-fit estimates. If severe collinearity or spatial/temporal autocorrelation affected the predictive performance of our model, we would have seen poor out-of-sample performance, which we do not see.
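The two checks above — adding covariate groups sequentially to watch the cell-data coefficient, and reporting out-of-sample fit from cross validation — can be sketched as below. This is a hypothetical illustration on synthetic data, not the authors' model; the variables, coefficients, and fold structure are invented for the example.

```python
# Sketch of (1) sequential covariate addition and (2) k-fold out-of-sample R^2.
import numpy as np

rng = np.random.default_rng(1)
n = 400
cell = rng.gamma(2.0, 50.0, n)
temp = rng.normal(25.0, 5.0, n)
weekend = rng.integers(0, 2, n).astype(float)
visits = 1.5 * cell + 3.0 * temp + 20.0 * weekend + rng.normal(0.0, 25.0, n)

def ols(X, y):
    """OLS with an intercept column appended; returns coefficients (intercept last)."""
    A = np.column_stack([X, np.ones(len(y))])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

# (1) Build the model up sequentially: a stable cell-data coefficient suggests
# collinearity is not distorting the attribution of effect.
for cols, label in [((cell,), "cell only"),
                    ((cell, temp), "+ weather"),
                    ((cell, temp, weekend), "+ day of week")]:
    beta = ols(np.column_stack(cols), visits)
    print(f"{label}: cell coefficient = {beta[0]:.2f}")

# (2) Simple 5-fold cross validation reporting out-of-sample R^2; persistently
# high values argue against overfitting from autocorrelation or collinearity.
X = np.column_stack([cell, temp, weekend])
folds = np.array_split(rng.permutation(n), 5)
r2 = []
for hold in folds:
    train = np.setdiff1d(np.arange(n), hold)
    beta = ols(X[train], visits[train])
    pred = np.column_stack([X[hold], np.ones(len(hold))]) @ beta
    ss_res = np.sum((visits[hold] - pred) ** 2)
    ss_tot = np.sum((visits[hold] - visits[hold].mean()) ** 2)
    r2.append(1.0 - ss_res / ss_tot)
print(f"mean out-of-sample R^2: {np.mean(r2):.2f}")
```

A cell-data coefficient that shifts sharply as covariates are added, or out-of-sample R² far below the in-sample fit, would be the warning signs the checks are designed to catch.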

Would the results be different if you considered different time periods during the day/week/month, etc.?

We looked at a specific time window of visitation within each day (9AM-4PM), since this was the window for which our observational counts were representative. We include all days of the week and the summer months (June-September) in our analysis, again based on the observational datasets. How well the cell data performs in time periods outside the summer season we calibrated to is unknown. However, most water access in New England occurs in the summer months, and these are the most policy-relevant months to study.

The cell data processing from Airsage is consistent year round, so we can assume the relationships would be similar, but we currently have no data to test that. At this point, we are confident in showing how the cell data and our models perform during the summer season in the region at the types of water access points in our sample.

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Song Gao

24 Mar 2020

PONE-D-19-32712R1

Using Data Derived from Cellular Phone Locations to Estimate Visitation to Natural Areas: An Application to Water Recreation in New England, USA.

PLOS ONE

Dear Dr. Merrill,

Thanks for your efforts and for submitting your revised manuscript to PLOS ONE. The revision has improved significantly. Before acceptance for publication, both expert reviewers requested a short discussion about the data quality, model uncertainty, and limitations, among other minor issues. Therefore, we invite you to submit another revised version of the manuscript that addresses the minor points raised during the review process.

We would appreciate receiving your revised manuscript by May 08 2020 11:59PM. When you are ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter.

To enhance the reproducibility of your results, we recommend that if applicable you deposit your laboratory protocols in protocols.io, where a protocol can be assigned its own identifier (DOI) such that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). This letter should be uploaded as a separate file and labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. This file should be uploaded as a separate file and labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. This file should be uploaded as a separate file and labeled 'Manuscript'.

Please note while forming your response, if your article is accepted, you may have the opportunity to make the peer review history publicly available. The record will include editor decision letters (with reviews) and your responses to reviewer comments. If eligible, we will contact you to opt in or out.

We look forward to receiving your revised manuscript.

Kind regards,

Song Gao, Ph.D.

Academic Editor

PLOS ONE

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: I really appreciate the efforts to address my comments for the original submission. The quality of this manuscript is significantly improved. I particularly like how you acknowledge and discuss the limitation of the data/method and how people should use caution when using cell data to address real-world problems.

Some trivial suggestions:

1) Can you name the different types of data in terms of the data collection method, instead of "small", "medium", and "large"?

2) The ethics statement breaks the transition between methods and results. It should be moved to the end of the manuscript.

Reviewer #2: The authors addressed most of my comments in the revision, and the manuscript has been greatly improved. I only have one minor comment- I suggest adding a thorough discussion on the data quality issues (e.g., precision, accuracy, biases) that occurred in the study, as well as the uncertainty caused by modeling fitting. You can also clarify the collinearity and spatial autocorrelation issues in the main text and point the readers to the supplementary materials for more details.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files to be viewed.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email us at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2020 Apr 30;15(4):e0231863. doi: 10.1371/journal.pone.0231863.r004

Author response to Decision Letter 1


30 Mar 2020

Response to Reviewers

Reviewer 1:

Reviewer #1: I really appreciate the efforts to address my comments for the original submission. The quality of this manuscript is significantly improved. I particularly like how you acknowledge and discuss the limitation of the data/method and how people should use caution when using cell data to address real-world problems.

Some trivial suggestions:

1) Can you name the different types of data in terms of the data collection method, instead of "small", "medium", and "large"?

We considered this, but the current naming was the clearest and most concise way to refer to the different groups of coastal sites. We also believe the scale of the site and its visitation is an important distinguishing feature, especially for future use of the data. The size also corresponds to the counting method, for practical observational reasons that we now discuss in depth starting on line 193.

2) The ethics statement breaks the transition between methods and results. It should be moved to the end of the manuscript.

It is an odd place for it, but when submitting our revisions, we were instructed by the editor to put it there:

“Your ethics statement must appear in the Methods section of your manuscript. If your ethics statement is written in any section besides the Methods, please move it to the Methods section and delete it from any other section.”

We will ask if we can move it to the end of the manuscript when we submit this version, but it may be the journal’s format to have it in the Methods.

Reviewer 2

The authors addressed most of my comments in the revision, and the manuscript has been greatly improved. I only have one minor comment- I suggest adding a thorough discussion on the data quality issues (e.g., precision, accuracy, biases) that occurred in the study, as well as the uncertainty caused by modeling fitting. You can also clarify the collinearity and spatial autocorrelation issues in the main text and point the readers to the supplementary materials for more details.

In addition to the discussion of the accuracy issues in the cell data, we added a longer discussion in the results section about the compounding sources of bias and inaccuracies starting on line 359:

“We used the most accurate and unbiased of the candidate statistical models for prediction. However, there are several sources of potential inaccuracies and biases in estimating visitation in this way that are not incorporated in the metrics of model goodness-of-fit. The observational visitation counts contain their own uncertainties and potential biases based on their sampling design and counting methods. By calibrating and validating to those counts, we may be carrying over those issues to our estimates of visitation. In addition, the cellular data contains uncertainties resulting from the geospatial accuracy of the device locations, our geographic definition of the sites, and the methods of expansion from the device sample to population-level estimates. More applications using cell datasets, combined with additional and more consistent collections of visitation observations for calibration, are needed to understand these limitations.

The models we fit may also be susceptible to spatial autocorrelation issues resulting from the cell dataset if there are variations in how the data represents visitation geographically. Spatial autocorrelation potentially inflates goodness-of-fit estimates, can bias parameters, and reduces predictive performance. We include controls in the models for each group of sites, which are geographically clustered, to alleviate some of the potential issues. Similarly, collinearity among the covariates could cause poor predictions, for example by attributing predictive information to the wrong covariate. We checked for this issue in a few ways, building up the model covariates sequentially starting with cell data alone and then adding covariates. This led to little change in the relationships between observations and cell data counts (see Supporting Information for more model details and variations). We also consistently found good out-of-sample goodness-of-fit metrics in cross validation, giving us more confidence that spatial autocorrelation and collinearity were likely not an issue in the models’ predictive performance.”

We also added a discussion point on the need for more, and more consistent, unbiased collection of observational visitation data by traditional means, the lack of which holds back new methods such as those using cell datasets. Starting on line 424:

“While cell data-derived information is an exciting development for researchers and managers, counterintuitively, we found that attempting to use it for a practical application only further motivated the need to take more accurate, consistent, and unbiased observations of visitation using traditional methods. Modeling methods are hindered by the lack of availability of training datasets and would be greatly improved by larger and more uniformly collected observations. This is especially true regarding machine-learning algorithms [47]. For example, with small and practical tweaks in the way visitation records are collected at water-access areas, such as collecting periodic counts of cars and people at specific times, visitation records could become more harmonized and useful [12,41]. From there, visitation proxies like cell data or social media-based models can provide a platform for spatial and temporal extrapolation across broad geographies, as we demonstrate here. The need for such models is not confined to water-access visitation; it is relevant to many other similar policy contexts, for example, at national parks or urban green spaces. The differences in how well our models fit depending on the visitation data source, with larger, more popular locations fitting better than small ones, should be cautionary for approaches applying cell data-derived visitation estimates to settings where there is no similar observational data for comparison.”

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 2

Song Gao

3 Apr 2020

Using Data Derived from Cellular Phone Locations to Estimate Visitation to Natural Areas: An Application to Water Recreation in New England, USA.

PONE-D-19-32712R2

Dear Dr. Merrill,

We are pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it complies with all outstanding technical requirements.

Within one week, you will receive an e-mail containing information on the amendments required prior to publication. When all required modifications have been addressed, you will receive a formal acceptance letter and your manuscript will proceed to our production department and be scheduled for publication.

Shortly after the formal acceptance letter is sent, an invoice for payment will follow. To ensure an efficient production and billing process, please log into Editorial Manager at https://www.editorialmanager.com/pone/, click the "Update My Information" link at the top of the page, and update your user information. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, you must inform our press team as soon as possible and no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

With kind regards,

Song Gao, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Song Gao

7 Apr 2020

PONE-D-19-32712R2

Using Data Derived from Cellular Phone Locations to Estimate Visitation to Natural Areas: An Application to Water Recreation in New England, USA.

Dear Dr. Merrill:

I am pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please notify them about your upcoming paper at this point, to enable them to help maximize its impact. If they will be preparing press materials for this manuscript, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

For any other questions or concerns, please email plosone@plos.org.

Thank you for submitting your work to PLOS ONE.

With kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Song Gao

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 File

    (DOCX)


    Data Availability Statement

    All relevant data are available at https://github.com/USEPA/Recreation_Benefits.git.

