PLOS Digital Health. 2022 Jun 3;1(6):e0000039. doi: 10.1371/journal.pdig.0000039

Spatial aggregation choice in the era of digital and administrative surveillance data

Elizabeth C Lee 1, Ali Arab 2, Vittoria Colizza 3, Shweta Bansal 4,*
Editor: Yuan Lai
PMCID: PMC9931313  PMID: 36812505

Abstract

Traditional disease surveillance is increasingly being complemented by data from non-traditional sources like medical claims, electronic health records, and participatory syndromic data platforms. As non-traditional data are often collected at the individual-level and are convenience samples from a population, choices must be made on the aggregation of these data for epidemiological inference. Our study seeks to understand the influence of spatial aggregation choice on our understanding of disease spread with a case study of influenza-like illness in the United States. Using U.S. medical claims data from 2002 to 2009, we examined the epidemic source location, onset and peak season timing, and epidemic duration of influenza seasons for data aggregated to the county and state scales. We also compared spatial autocorrelation and tested the relative magnitude of spatial aggregation differences between onset and peak measures of disease burden. We found discrepancies in the inferred epidemic source locations and estimated influenza season onsets and peaks when comparing county and state-level data. Spatial autocorrelation was detected across more expansive geographic ranges during the peak season as compared to the early flu season, and there were greater spatial aggregation differences in early season measures as well. Epidemiological inferences are more sensitive to spatial scale early on during U.S. influenza seasons, when there is greater heterogeneity in timing, intensity, and geographic spread of the epidemics. Users of non-traditional disease surveillance should carefully consider how to extract accurate disease signals from finer-scaled data for early use in disease outbreaks.

Author summary

Administrative health records, social media streams like Twitter, and participatory surveillance systems like Influenzanet are increasingly available for infectious disease surveillance, but are often geographically aggregated to preserve data privacy and confidentiality. We explored how an arbitrary choice in the spatial aggregation of non-traditional disease data sources may influence estimates of disease burden and epidemiological understanding of an outbreak. Using influenza-like illness as measured through a medical claims database as our case study, we find substantial variation in influenza season timing and magnitude across spatial scales, such that the choice of spatial aggregation could lead to misleading estimates of epidemiological quantities. In particular, we find that epidemiological inferences are more sensitive to spatial scale early on during U.S. influenza seasons, when there is greater heterogeneity in timing, intensity, and geographic spread of the epidemics. Non-traditional disease surveillance may have distinct advantages in reporting speed and volume, but care is required when aggregating these data for spatial epidemiological analysis.

Introduction

Effective disease surveillance systems seek to capture accurate, representative, and timely disease data in the face of complex logistical challenges and limited human resources [1]. As these data are typically collected at centralized locations like sentinel healthcare facilities and summarized according to political administrative boundaries, there are natural spatial units that may be incorporated into the surveillance system design and reporting. Aggregating surveillance data to administrative boundaries is useful because these units are used in the allocation and distribution of resources and the development of public health guidelines.

While the hope is that spatial and temporal heterogeneity in reported surveillance data corresponds to the true underlying disease burden, biases in measurement may contribute to inaccurate estimates. One potential source of bias when working with aggregated surveillance data, often overlooked, stems from choices in the design and aggregation of the reporting data stream itself. In disease ecology, it is well-documented that ecological processes are sensitive to spatial scale, that differences in scale may explain seemingly-conflicting data, and that disease distributions are the result of hierarchical processes that occur on different scales [e.g. [2–6]]. Parallel concerns arise in spatial statistics, where the ecological and atomistic fallacies warn against the extension of statistical conclusions from populations to individuals and vice versa [e.g. [7–9]]. In epidemiology, statistical methods that account for the hierarchical nature of spatial data have been developed to improve disease mapping for small area aggregated health data [e.g. [10, 11]].

Non-traditional disease data such as digital data streams, syndromic disease reporting, and medical claims were not necessarily generated for the purpose of disease surveillance, but they have the potential to provide information relevant to disease tracking in a timely and cost-efficient way across large geographic scales [12–17]. Traditional surveillance systems are designed to meet pre-determined objectives such as routine surveillance or outbreak detection, for a fixed set of syndromes or diseases in a specific population. Non-traditional data are typically more voluminous and collected at the individual level, but they often capture a convenience sample limited by user biases. For example, medical claims data capture only individuals with health insurance, while Twitter users with a specific geolocation tag may be younger than the general population in that location. Moreover, the collection of non-traditional disease data is often not designed with attention to logistical reporting constraints. Consequently, epidemiologists and policy makers increasingly have new choices in how to aggregate these records spatially and temporally. Noise and random variability may mask epidemiologically-relevant disease signals in data at finer spatial and temporal scales, and we have limited understanding of how these aggregation choices might affect subsequent inference [18–20].

Using U.S. medical claims data for influenza-like illness as a case study, we consider the issue of ‘spatial aggregation choice’ among potentially novel sources of surveillance data. First we characterize influenza season dynamics from 2002–2003 through 2008–2009 across different spatial aggregation scales. We examine defining influenza season features such as the epidemic source location, onset and peak season timing, and epidemic duration with data aggregated to the county and state levels. Finally, we compare spatial autocorrelation for burden between the early and peak influenza seasons, and test the relative magnitude of spatial aggregation differences for seasonal measures related to timing and intensity. This work highlights the scenarios under which spatial aggregation choice is important, particularly when considering the use of alternative surveillance data streams.

Methods

Medical claims data

Weekly visits for influenza-like illness (ILI) and any diagnosis from October 2002 to April 2009 were obtained from a records-level database of U.S. medical claims managed by IMS Health and processed to the county scale. ILI was defined with International Classification of Diseases, Ninth Revision (ICD-9) codes for: direct mention of influenza, fever combined with respiratory symptoms or febrile viral illness, or prescription of oseltamivir, while any diagnosis visits represent all possible medical diagnoses including ILI (also see [21]). We also obtained metadata from IMS Health on the percentage of reporting physicians and the estimated effective physician coverage by visit volume [21]. Over the years in our study period, our medical claims database represented an average of 24% of visits for any diagnosis from 37% of all health care providers across 95% of U.S. counties during influenza season months [21].

We also aggregated visits for ILI and any diagnosis to the U.S. state- and region-levels, where region boundaries were defined by the groupings of states by the U.S. Department of Health and Human Services.

We applied the same data processing procedure, described in detail elsewhere [21], to each county-, state-, and region-level time series of ILI visits per any-diagnosis visits (the ILI ratio). In brief, ILI intensity is calculated as a detrended ILI ratio during the flu period from November through March. The flu period is defined as the maximum consecutive period, of at least two weeks, during which the ILI ratio exceeds an epidemic threshold.

Defining disease burden and spatial aggregation difference

The study considered five measures of influenza disease burden—two measures of timing (onset and peak flu season timing), two measures of intensity (onset and peak intensity), and epidemic duration—at county, state, and region scales. In the definitions below, the intensity of influenza activity in a given location and time refers to the time series of the detrended ILI ratio (see [21] for details of the intensity calculation).

We defined onset timing as the number of weeks from week number 40 (first week of October) until the first week in the epidemic period. We defined peak timing as the number of weeks from week 40 until the week with the maximum epidemic intensity during the epidemic period. The epidemic duration was the number of weeks where the ILI intensity exceeded the epidemic threshold.
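As an illustration, the timing and duration definitions above can be sketched in code. This is a hypothetical reimplementation of the Methods text, not the authors' code; the weekly intensity series and the epidemic threshold are assumed inputs, and the epidemic period is taken as the longest run of at least two weeks above threshold, as described earlier.

```python
import numpy as np

def season_timing(intensity, threshold):
    """Sketch of the timing definitions: `intensity` is a weekly detrended
    ILI-ratio series indexed from week 40; returns (onset, peak, duration)
    in weeks, or None if no epidemic period is found."""
    above = intensity > threshold
    runs, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(above)))
    # epidemic period: longest consecutive run above threshold, >= 2 weeks
    runs = [r for r in runs if r[1] - r[0] >= 2]
    if not runs:
        return None
    s, e = max(runs, key=lambda r: r[1] - r[0])
    onset = s                                  # weeks since week 40
    peak = s + int(np.argmax(intensity[s:e]))  # week of maximum intensity
    duration = e - s                           # weeks above threshold
    return onset, peak, duration
```

For a toy series [0.5, 0.6, 1.2, 1.5, 2.0, 1.8, 1.1, 0.7] with threshold 1.0, this returns an onset in week 2, a peak in week 4, and a 5-week duration.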

Proxies for prevalence during the onset flu season and peak flu season were calculated like relative risks; the onset intensity and peak intensity for a given county, state, or region were defined as its risk relative to a single, national ‘expected’ onset and peak prevalence, respectively. This ‘expected prevalence,’ calculated for each influenza season, was the county population-weighted mean of the associated intensity measure. The onset flu season was identified as a 2–3 week flu season period with the greatest exponential growth rate, while the peak flu season was identified as the week with the maximum ILI intensity.
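A numerical illustration of the ‘expected prevalence’ normalization, using made-up county intensities and populations (the log scale is consistent with the e^-1 interpretation given in the spatial aggregation difference definition):

```python
import numpy as np

# Hypothetical county onset intensities and populations (illustrative only).
intensity = np.array([1.2, 0.8, 2.0])
population = np.array([100_000, 50_000, 25_000])

# National 'expected' prevalence: county population-weighted mean intensity.
expected = np.average(intensity, weights=population)

# Each county's burden as a log relative risk against the expectation.
log_relative_risk = np.log(intensity / expected)
```

Here the expected prevalence works out to 1.2, so the first county sits exactly at the national expectation (log relative risk of zero), while the second falls below it and the third above it.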

We defined spatial aggregation difference as the difference between a given influenza disease burden measure at an aggregated spatial scale (i.e., state or region) and the county spatial scale (e.g., μ_state − μ_county, where μ is a burden measure). As burden measures are normalized, they may be compared across spatial scales, and the scale of the spatial aggregation difference is the same as that of each individual burden measure. A positive spatial aggregation difference indicates that state- or region-level data over-represented disease burden magnitude (onset and peak intensity) or had later epidemic timing (onset or peak timing) relative to county measures. Among timing measures, a spatial aggregation difference of 20 means that state surveillance data presented epidemic onset or epidemic peak 20 weeks after county surveillance data. Among intensity measures, a spatial aggregation difference of -1 means that state surveillance data reported e^-1 ≈ 0.37 times the risk of county surveillance data.
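A minimal worked example of the sign convention, with hypothetical onset weeks for three counties in one state:

```python
# Hypothetical onset timing (weeks since week 40) for three counties and
# for their state's aggregated time series (illustrative values only).
county_onset = {"county_A": 8, "county_B": 10, "county_C": 12}
state_onset = 12

# Spatial aggregation difference: aggregated measure minus county measure.
agg_diff = {c: state_onset - onset for c, onset in county_onset.items()}

# Positive values mean the state-level data reported onset later than the
# county-level data did; zero means the two scales agreed.
```

In this toy case the state-level onset lags counties A and B by 4 and 2 weeks, respectively, and coincides with county C.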

Inferring probable source location

Using seasonal time series of intensity, we identified the top 10% of locations (at the county or state scale) with the earliest epidemic onset for each season as potential source locations and calculated the Euclidean distances between the centroids of potential source locations and all other locations. We then used the Pearson correlation coefficient (H0: correlation equals zero) between distance to a potential source location and onset week to identify probable county or state source locations for a given influenza season; a higher correlation coefficient indicates a higher probability of being the source location.
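The scoring step can be sketched as below. The centroid coordinates, onset weeks, and the `source_scores` helper are hypothetical stand-ins; a real analysis would use projected county or state centroids and test each correlation against H0: r = 0.

```python
import numpy as np

def source_scores(centroids, onset_week, frac=0.10):
    """For each candidate among the earliest `frac` of onsets, correlate
    distance-from-candidate with onset week; a higher r suggests a wave
    spreading outward from that location (illustrative sketch only)."""
    names = sorted(centroids, key=lambda n: onset_week[n])
    candidates = names[: max(1, int(len(names) * frac))]
    scores = {}
    for c in candidates:
        d = np.array([np.hypot(centroids[c][0] - centroids[n][0],
                               centroids[c][1] - centroids[n][1])
                      for n in centroids])
        w = np.array([onset_week[n] for n in centroids])
        scores[c] = float(np.corrcoef(d, w)[0, 1])
    return scores
```

For a perfectly wave-like toy epidemic, where onset week grows linearly with distance from the true source, the source's score approaches a correlation of 1.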

Examining spatial dependence in influenza disease burden

We plotted spatial correlograms to examine the global spatial autocorrelation of the four county-level summary measures of disease burden in the statistical programming language R with the ncf package [22]. A two-sided permutation test with 500 permutations was performed to identify correlations that deviated significantly from zero (H0: correlation equals zero).

Comparing spatial aggregation differences across measures and scales

We tested whether spatial aggregation difference was greater among early season or peak season measures of disease burden, and whether state- or region-level aggregations generated greater differences across all measures of disease burden. To compare onset and peak season measures, we paired the spatial aggregation differences for county-season observations across all influenza seasons within our study period for 1) onset timing and peak timing and 2) onset intensity and peak intensity, and tests were performed for both state- and region-level values. To compare differences among state- or region-level aggregations, we paired state-county and region-county differences by county observation for each of the four disease burden measures.

We compared spatial aggregation difference with Bayesian intercept models (effectively, a Bayesian paired t-test) that accounted for county spatial dependence (See SM Methods). The models were implemented with approximate Bayesian inference in R using Integrated Nested Laplace Approximations (INLA) with the INLA package (www.r-inla.org) [23, 24].

Positive estimates mean that 1) spatial aggregation differences for peak timing are greater than those for onset timing, 2) spatial aggregation differences for peak intensity are greater than those for early intensity, or 3) spatial aggregation differences for region and county are greater than those for state and county, and vice versa for negative values. If the 95% credible intervals for β0 fail to overlap with zero, we interpret that there is a statistically significant difference between the measures contributing to δi. We used relatively non-informative normal priors for β0 and log-gamma priors for the precision term τϕ.
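Dropping the spatial dependence term for illustration, the comparison reduces to a paired test on δi. The following is a minimal non-spatial sketch with made-up inputs; the actual analysis fits the INLA model described above.

```python
import numpy as np

def paired_mean_interval(onset_diff, peak_diff, z=1.96):
    """Approximate 95% interval for the mean of delta_i = peak_i - onset_i,
    a non-spatial stand-in for the Bayesian paired comparison."""
    delta = np.asarray(peak_diff, float) - np.asarray(onset_diff, float)
    m = delta.mean()
    se = delta.std(ddof=1) / np.sqrt(delta.size)
    return m - z * se, m + z * se
```

If the interval excludes zero, the peak and onset aggregation differences are judged to differ. For example, hypothetical onset differences [2, 3, 4, 2, 3] paired with peak differences [1, 1, 2, 1, 1] yield an interval entirely below zero, i.e. smaller aggregation differences at the peak, mirroring the direction of the study's estimates.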

Results

We explore the scales of influenza surveillance using county-level U.S. medical claims data representing 2.5 billion visits from upwards of 120,000 health care providers each year for influenza seasons from 2002–2003 through 2008–2009. There was evident heterogeneity in the intensity and timing of ILI activity between counties and their aggregated state and HHS region scales (Fig 1).

Fig 1. ILI intensity by influenza season from 2002–2003 through 2008–2009 across 10 HHS regions.

Fig 1

ILI intensity is displayed for all available counties and states in a given HHS region in different colors (grey for counties, black for states, and red for region). Some regions (such as Region 1) have fewer counties than others so heterogeneity at the county level may be less apparent.

Probable epidemic source locations rarely overlap between county- and state-level data streams

We inferred the most probable epidemic source counties and states independently for each influenza season. Across all seasons, we found disagreement in the top two most probable source states and the top 50 most probable source counties (Fig 2). Probable source counties partially overlapped with probable source states only in a few influenza seasons and in a small set of locations: four counties representing 41% of the population of Rhode Island overlapped in the 2004–2005 season; nine counties in California (33% of state population) and seven counties in Nevada (21% of state population) overlapped in the 2005–2006 season; eight counties in Alabama (7% of state population) and 28 counties in Georgia (6% of state population) overlapped in the 2006–2007 season.

Fig 2. Most probable influenza season U.S. source locations at state and county scales across all influenza seasons.

Fig 2

We present the two states (pink) and 50 counties (red) that are the most probable source locations for each influenza season from 2002–2003 through 2008–2009. Probable source counties were partially contained within probable source states only in the 2004–2005, 2005–2006, and 2006–2007 influenza seasons. When inferring probable source locations, disagreement between county- and state-level analyses was common. The map base layer is from the US Census Bureau.

A majority of county data streams achieve onset and peak timing milestones before state data streams

To elucidate the discrepancy between county and state epidemic source locations, we compared the influenza season onset and peak week between county and state scales. While ILI spread was sometimes very rapid, with influenza season onset striking almost all counties within a given state at once, these patterns were not consistent across seasons or states (Fig O in S1 Text).

State-level flu season onset and peak timing tended to occur after the majority of counties in the state had already achieved those milestones. Across the 2002–2003 through 2008–2009 influenza seasons, a mean of 62% and 70% of state populations had already experienced the onset and peak of the influenza season, respectively, by the time the aggregated state-level data achieved its influenza season onset and peak (Fig 3). County population size did not appear to be associated with onset or peak timing (Fig A-N in S1 Text).

Fig 3. Comparison of county and state influenza season onset and peak timing.

Fig 3

We present the cumulative percentage of county populations that have experienced (A) influenza season onset and (B) the influenza season peak by the time that these milestones have been achieved by the aggregated state-level data. For each state abbreviation (rows), the point represents the mean across influenza seasons from 2002–2003 through 2008–2009 while the horizontal line indicates the range of one standard deviation on either side of the mean. The red vertical lines indicate the mean of the mean values across states.

Through visual examination of correlograms, we found that spatial autocorrelation remained present at greater distances for peak measures than early season measures of disease burden, suggesting that seasonal dynamics become more spatially synchronized as the influenza season progresses (Fig 4). Autocorrelation declined to zero at 1177 km and 1359 km for onset and peak timing and at 809 km and 1140 km for onset and peak intensity, respectively.

Fig 4. Spatial correlograms for timing and intensity measures across all influenza seasons.

Fig 4

We present spatial autocorrelation among counties within specified distance classes for timing measures (left) and intensity measures (right). Early season measures (onset timing and early season intensity) are represented in blue and peak season measures (peak timing and intensity) are in orange. Points are displayed only if the p-value for a two-sided permutation test to evaluate correlation is less than 0.01. Colored vertical lines indicate the mean distance where county measures are no more similar than that expected by chance in a given region.

While county-level epidemics had greater heterogeneity in epidemic duration, often with longer right-skewed tails, epidemic durations were similar across spatial scales (Fig P in S1 Text). There was greater variability in epidemic duration between influenza seasons than between different spatial scales. Only in the HHS region centered in New York did the distributions in epidemic duration appear to be shifted. However, this region may be particularly subject to discrepancies related to spatial scales because it represents the smallest geographic area in the study region.

County-level maps of burden and spatial aggregation difference for an example influenza season are displayed in the supplement for onset timing, peak timing, onset intensity, and peak intensity (Fig Q in S1 Text).

Spatial aggregation differences are more prevalent at epidemic onset than at peak flu season

We compared spatial aggregation differences between onset and peak timing and between onset and peak intensity using a Bayesian procedure that may be viewed as a paired t-test for spatially correlated data. The estimates indicate that spatial aggregation differences between state and county measures were greater for onset timing than peak timing and for onset intensity than peak intensity (Table 1). This means that there was greater heterogeneity in the timing and intensity of early season measures than in the peak season measures. Region-county differences were also greater for onset intensity than peak intensity (Table A in S1 Text).

Table 1. Comparison of state-county spatial aggregation differences between onset and peak season measures.

Negative values mean that spatial aggregation estimates for peak measures were smaller than spatial aggregation differences for onset measures. Bolded values denote mean estimates that we interpret to have statistical significance; that is, the 95% credible intervals did not overlap with zero.

State-County Comparison Estimate (95%CI)
Peak-Onset Timing -0.23 (-0.29, -0.16)
Peak-Early Intensity -0.31 (-0.33, -0.30)

Region-county differences were larger than state-county ones for timing measures, while state-county differences were larger than region-county ones for measures of disease intensity (Table B in S1 Text).

Discussion

Administrative health records, social media streams like Twitter, and participatory surveillance systems like Influenzanet, Flu Near You, and Facebook COVID-19 Symptom Survey are increasingly available for disease surveillance, but use of these data for epidemiological analysis is subject to ‘spatial aggregation choice’ [15, 17]. In this study, we examined how an arbitrary choice in the spatial aggregation of non-traditional disease data sources may influence estimates of disease burden and epidemiological understanding of an outbreak. First, we describe the dynamics and burden of influenza-like illness across the United States from 2002–2003 through 2008–2009 with medical claims data across the county, state, and HHS region spatial scales. We observed substantial heterogeneity in influenza season timing and magnitude across spatial scales and found that analyses performed with county-level and state-level data could provide contradictory results regarding inference on the most probable epidemic source location. State-level timing measures provided delayed information about the onset and peak season timings, and timing-related measures had greater spatial heterogeneity in disease burden and greater spatial aggregation differences than did intensity-related measures.

We initially hypothesized that influenza epidemics aggregated to larger spatial scales would have longer epidemic duration than county-level data because a state-level or region-level epidemic should represent the set of all lower-level epidemics, which are staggered in time. During our study period, however, state-level onset and peak season timings occurred only after 60–70% of the state’s population had experienced those milestones (as reported by county-level data), and epidemic duration was similar across spatial scales. Our analysis suggests that spatial aggregation reduces the sensitivity of onset and peak timing detection for influenza season outbreaks. The identification of source locations was also highly scale-dependent; probable source counties were only occasionally located within the probable source state, and when there was overlap, those counties represented only a small proportion of the state population. We also highlight that seasons with more geographically synchronized flu epidemics, which can occur in antigenically novel or severe seasons [25, 26], are not any more likely to have overlap in source locations across spatial scales. We thus hypothesize that source locations are independent of peak dynamics.

These results suggest that spatially-aggregated data are less reliable in representing early season dynamics. This may be because the timing and intensity of ILI activity appears to be more heterogeneous in the early season than the peak season [27], and heterogeneity is associated with greater spatial aggregation differences (Fig T-W in S1 Text). Counties were spatially autocorrelated at greater distances for both peak timing and peak intensity as compared to onset timing and early season intensity, respectively (Fig 4), and spatial aggregation differences were smaller for peak measures than early season measures (Table 1 and Table A in S1 Text). Two factors may contribute to these differences between early and peak season: 1) disease signals are less reliable during the early flu season, and 2) epidemic onsets are asynchronous across locations but become more spatially synchronized as the season progresses. Together, these results bolster the hypothesis that seasonal influenza is seeded to many locations and spread primarily through local transmission [28], while prior work suggests that school-holiday-associated contact reductions may play a role in synchronizing influenza outbreaks [29].

Our study suggests that spatial aggregation choice is most critical in early influenza season surveillance (i.e., identifying source locations and early season inference), particularly for assessing season onset (Fig R-S in S1 Text). Delayed detection of season onset and inaccurate estimation of early season intensity may lessen the agility of policymakers and healthcare facilities in anticipating staffing and hospital supply needs as they prepare for peak influenza season activity. Nevertheless, further work should be done to verify the generalizability of our results to different disease syndromes and data sources. Our conclusions about when and how spatial aggregation choice is most important may be conflated with other data reporting processes, such as the expanding geographic coverage of our medical claims data over time, the distribution of reporting healthcare facilities, variability in reporting quality (driven by differences in healthcare and surveillance resources), clustered use of certain ICD-9 diagnosis codes (driven by hospital practices or knowledge-sharing between physicians, for example), and healthcare access, as well as the stochastic variation in influenza season dynamics itself. In addition, changes in mobility over time (e.g., due to family or tourist travel during the winter holidays) associated with changes in contact patterns (e.g., due to school closures during holidays) are known to contribute to influenza diffusion [29, 30] and may therefore affect spatial aggregation differences.

As big data becomes more prevalent and fine-scale targeting and measurement becomes the norm in infectious disease surveillance, spatial aggregation and zoning biases (discrepancies in statistical inference that arise when the boundaries of contiguous spatial units are re-arranged [31, 32]) may become regular concerns for epidemiologists. This case study has a direct link to the modifiable areal unit problem (MAUP), a phenomenon which describes how spatial aggregation of data can yield different statistical results and highlights the need for sensitivity analyses examining spatial scale [32]. While we sought to describe differences in influenza season features across spatial scales, other recent work pursues the identification of epidemiology-driven geographic regions as a potential solution to this problem [25]. At this juncture, where traditional, administrative, and digital data may be used in disease surveillance, it is critical to develop general methodologies that can extract useful disease signals from fine-scaled data early on in an outbreak [16].

Supporting information

S1 Text. Detailed descriptions about data processing, methodological choices, sensitivity analyses, and supporting evidence.

The Supporting Information includes Tables A and B and Figures A to W.

(PDF)

Acknowledgments

This work was made possible by a data agreement between IQVIA and the RAPIDD Program of the Science & Technology Directorate, Department of Homeland Security and the Fogarty International Center, National Institutes of Health.

Data Availability

The medical claims database is not publicly available; it was obtained from IMS Health, now IQVIA, which may be contacted at https://www.iqvia.com/. All other model input data and map base layers are made publicly available by the US Census Bureau. Model output data and code are available at https://github.com/bansallab/spatialaggregation.

Funding Statement

Research reported in this publication was supported by the Jayne Koskinas Ted Giovanis Foundation for Health and Policy (http://jktgfoundation.org/); and the National Institute of General Medical Sciences of the National Institutes of Health under award number R01GM123007. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health, and the funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

References

  • 1. German RR, Lee LM, Horan JM, Milstein RL, Pertowski CA, Waller MN, et al. Updated guidelines for evaluating public health surveillance systems: recommendations from the Guidelines Working Group. MMWR Recommendations and Reports. 2001;50(RR13):1–35.
  • 2. Wiens JA. Spatial scaling in ecology. Functional Ecology. 1989;3(4):385–397. doi: 10.2307/2389612
  • 3. Levin SA. The problem of pattern and scale in ecology. Ecology. 1992;73(6):1943–1967.
  • 4. McGill BJ. Matters of Scale. Science. 2010;328(5978):575–576. doi: 10.1126/science.1188528
  • 5. Pepin KM, Kay S, Golas B, Shriner S, Gilbert A, Miller R, et al. Inferring infection hazard in wildlife populations by linking data across individual and population scales. Ecology Letters. 2017. doi: 10.1111/ele.12732
  • 6. Cohen JM, Civitello DJ, Brace AJ, Feichtinger EM, Ortega CN, Richardson JC, et al. Spatial scale modulates the strength of ecological processes driving disease distributions. Proceedings of the National Academy of Sciences. 2016;113(24):E3359–E3364. doi: 10.1073/pnas.1521657113
  • 7. Robinson W. Ecological Correlations and the Behavior of Individuals. American Sociological Review. 1950;15(3):351–357. doi: 10.2307/2087176
  • 8. Lawson AB. Statistical Methods in Spatial Epidemiology, Second Edition. Wiley Series in Probability and Statistics. West Sussex, England: John Wiley & Sons, Ltd.; 2006. doi: 10.1002/9780470035771
  • 9. Darby S, Deo H, Doll R, Whitley E. A Parallel Analysis of Individual and Ecological Data on Residential Radon and Lung Cancer in South-West England. Journal of the Royal Statistical Society Series A. 2001;164(1):193–203. doi: 10.1111/1467-985X.00196
  • 10. Aregay M, Lawson AB, Faes C, Kirby R. Bayesian multiscale modeling for aggregated disease mapping data. Statistical Methods in Medical Research. 2015. doi: 10.1177/0962280215607546
  • 11. Deeth LE, Deardon R. Spatial Data Aggregation for Spatio-Temporal Individual-Level Models of Infectious Disease Transmission. Spatial and Spatio-temporal Epidemiology. 2016;17:95–104. doi: 10.1016/j.sste.2016.04.013
  • 12. Brownstein JS, Freifeld CC, Madoff LC. Digital Disease Detection—Harnessing the Web for Public Health Surveillance. New England Journal of Medicine. 2009;360(21):2153–2157. doi: 10.1056/NEJMp0900702
  • 13. Salathé M, Bengtsson L, Bodnar TJ, Brewer DD, Brownstein JS, Buckee C, et al. Digital epidemiology. PLoS Computational Biology. 2012;8(7):1–5. doi: 10.1371/journal.pcbi.1002616
  • 14. Hay SI, George DB, Moyes CL, Brownstein JS. Big Data Opportunities for Global Infectious Disease Surveillance. PLoS Medicine. 2013;10(4):2–5. doi: 10.1371/journal.pmed.1001413
  • 15. Bansal S, Chowell G, Simonsen L, Vespignani A, Viboud C. Big Data for Infectious Disease Surveillance and Modeling. Journal of Infectious Diseases. 2016;214(suppl 4):S375–S379. doi: 10.1093/infdis/jiw400
  • 16. Althouse B, Scarpino S, Meyers L, Ayers J, Bargsten M, Baumbach J, et al. Enhancing Disease Surveillance with Novel Data Streams: Challenges and Opportunities. EPJ Data Science. 2015;4(17):1–8. doi: 10.1140/epjds/s13688-015-0054-0
  • 17. Lee EC, Asher JM, Goldlust S, Kraemer JD, Lawson AB, Bansal S. Mind the Scales: Harnessing Spatial Big Data for Infectious Disease Surveillance and Inference. Journal of Infectious Diseases. 2016;214(suppl 4):S409–S413. doi: 10.1093/infdis/jiw344
  • 18. Tildesley MJ, House TA, Bruhn MC, Curry RJ, O’Neil M, Allpress JLE, et al. Impact of spatial clustering on disease transmission and optimal control. Proceedings of the National Academy of Sciences of the United States of America. 2010;107(3):1041–1046. doi: 10.1073/pnas.0909047107
  • 19. Tildesley MJ, Ryan SJ. Disease Prevention versus Data Privacy: Using Landcover Maps to Inform Spatial Epidemic Models. PLoS Computational Biology. 2012;8(11):8–9. doi: 10.1371/journal.pcbi.1002723
  • 20. Lo Iacono G, Robin CA, Newton JR, Gubbins S, Wood JLN. Where are the horses? With the sheep or cows? Uncertain host location, vector-feeding preferences and the risk of African horse sickness transmission in Great Britain. Journal of the Royal Society Interface. 2013;10(83):20130194. doi: 10.1098/rsif.2013.0194
  • 21. Lee EC, Arab A, Goldlust SM, Viboud C, Grenfell BT, Bansal S. Deploying digital health data to optimize influenza surveillance at national and local scales. PLOS Computational Biology. 2018;14(3):e1006020. doi: 10.1371/journal.pcbi.1006020 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Bjornstad ON, Cai J. Package ncf: spatial covariance functions; 2019. Available from: https://cran.r-project.org/package=ncf.
  • 23. Rue H, Martino S, Chopin N. Approximate Bayesian Inference for Latent Gaussian Models Using Integrated Nested Laplace Approximations. Journal of the Royal Statistical Society, Series B. 2009;71(2):319–392. doi: 10.1111/j.1467-9868.2008.00700.x [DOI] [Google Scholar]
  • 24. Martins TG, Simpson D, Lindgren F, Rue H. Bayesian computing with INLA: New features. Computational Statistics and Data Analysis. 2013;67:68–83. doi: 10.1016/j.csda.2013.04.014 [DOI] [Google Scholar]
  • 25. Rosensteel GE, Lee EC, Colizza V, Bansal S. Characterizing an epidemiological geography of the United States: influenza as a case study. American Journal of Epidemiology. in press;. [Google Scholar]
  • 26. Lee EC, Viboud C, Simonsen L, Khan F, Bansal S. Detecting signals of seasonal influenza severity through age dynamics. BMC infectious diseases. 2015;15(1):1–11. doi: 10.1186/s12879-015-1318-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Coletti P, Poletto C, Turbelin C, Blanchon T, Colizza V. Shifting patterns of seasonal influenza epidemics. Scientific Reports. 2018;8(1):1–12. doi: 10.1038/s41598-018-30949-x [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28. Viboud C, Bjørnstad ON, Smith DL, Simonsen L, Miller MA, Grenfell BT. Synchrony, Waves, and Spatial Hierarchies in the Spread of Influenza. Science. 2006;312(April):447–451. doi: 10.1126/science.1125237 [DOI] [PubMed] [Google Scholar]
  • 29. Ewing A, Lee EC, Viboud C, Bansal S. Contact, travel, and transmission: The impact of winter holidays on influenza dynamics in the United States. The Journal of Infectious Diseases. 2016. doi: 10.1093/infdis/jiw642 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30. Luca G, Kerckhove K, Coletti P, Poletto C, Bossuyt N, Hens N, et al. The impact of regular school closure on seasonal influenza epidemics: a data-driven spatial transmission model for Belgium. BMC Infect Dis. 2018;18:29. doi: 10.1186/s12879-017-2934-3 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Waller LA, Gotway CA. Applied spatial statistics for public health data. John Wiley & Sons, Inc.; 2004. [Google Scholar]
  • 32. Gotway CA, Young LJ. Combining Incompatible Spatial Data. Journal of the American Statistical Association. 2002;97(458):632–648. doi: 10.1198/016214502760047140 [DOI] [Google Scholar]
  • 33. Besag J, York J, Mollié A. Bayesian image restoration, with two applications in spatial statistics. Annals of the Institute of Statistical Mathematics. 1991;43(1):1–20. doi: 10.1007/BF00116466 [DOI] [Google Scholar]
PLOS Digit Health. doi: 10.1371/journal.pdig.0000039.r001

Decision Letter 0

Yuan Lai, Laura Sbaffi

28 Jan 2022

PDIG-D-21-00097

Spatial aggregation choice in the era of digital and administrative surveillance data

PLOS Digital Health

Dear Dr. Lee,

Thank you for submitting your manuscript to PLOS Digital Health. After careful consideration, we feel that it has merit but does not fully meet PLOS Digital Health's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Mar 29 2022 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at digitalhealth@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pdig/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Yuan Lai, Ph.D.

Academic Editor

PLOS Digital Health

Journal Requirements:

1. Please update your Competing Interests statement. If you have no competing interests to declare, please state: “The authors have declared that no competing interests exist.”

2. We ask that a manuscript source file is provided at Revision. Please upload your manuscript file as a .doc, .docx, .rtf or .tex. If you are providing a .tex file, please upload it under the item type ‘LaTeX Source File’ and leave your .pdf version as the item type ‘Manuscript’.

3. Please provide separate figure files in .tif or .eps format only and remove any figures embedded in your manuscript file. Please ensure that all files are under our size limit of 20MB.

For more information about how to convert your figure files please see our guidelines: https://journals.plos.org/digitalhealth/s/figures

4. Please provide us with a direct link to the base layer of the map used in Figure 2, Figure S17, Figure S18, Figure S19, and ensure this location is also included in the figure legend.

Please note that, because all PLOS articles are published under a CC BY license (creativecommons.org/licenses/by/4.0/), we cannot publish proprietary maps such as Google Maps, Mapquest or other copyrighted maps. If your map was obtained from a copyrighted source please amend the figure so that the base map used is from an openly available source.

Please note that only the following CC BY licences are compatible with PLOS licence: CC BY 4.0, CC BY 2.0 and CC BY 3.0, meanwhile such licences as CC BY-ND 3.0 and others are not compatible due to additional restrictions. If you are unsure whether you can use a map or not, please do reach out and we will be able to help you.

The following websites are good examples of where you can source open access or public domain maps:

* U.S. Geological Survey (USGS) - All maps are in the public domain. (http://www.usgs.gov)

* PlaniGlobe - All maps are published under a Creative Commons license so please cite “PlaniGlobe, http://www.planiglobe.com, CC BY 2.0” in the image credit after the caption. (http://www.planiglobe.com/?lang=enl)

* Natural Earth - All maps are public domain. (http://www.naturalearthdata.com/about/terms-of-use/)

5. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article's retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments (if provided):

Dear authors, we have received all reviewers' decisions; please refer to their comments below for the revision. In particular, both Reviewer #1 and Reviewer #3 suggest better explanations of some definitions and terminology. Please address this feedback accordingly. Many thanks for your submission, and we apologize for the rather long reviewing process.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

--------------------

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: I don't know

Reviewer #2: Yes

Reviewer #3: Yes

--------------------

3. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No

--------------------

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

--------------------

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This paper describes how spatial aggregation (region, state, and county level) can influence measures of disease onset and peak while using non-traditional surveillance data sources. Using ILI and medical claims data as a use case, this work shows how differences in early season measures are observed across different spatial aggregations. In my opinion, these findings contribute to disease surveillance and epidemiological work, especially regarding non-traditional data sources. To the extent of my knowledge, the methods are appropriate for assessing this research question. The study findings, presentation of the work, and rigor of the methods all contribute to my recommendation to publish with minor revisions.

The overall paper provides a clear narrative about how spatial aggregation choice may impact epidemiological inference from non-traditional data sources. While I have no major concerns regarding the study, there are some places in the paper that could benefit from refined language, clarity, and enrichment. These areas are detailed below.

In the introduction, the wording around “political administrative boundaries” in the first paragraph needs addressing. Perhaps there is a word missing. The introduction also mentions that “methods that account for the hierarchical nature of spatial data have been developed to improve disease mapping and the study of disease dynamics” ahead of a paragraph that details that this does not necessarily apply to non-traditional disease data. I recommend clarifying the statement to show that it is more applicable to traditional epidemiological data, or that this work remains to be done for non-traditional data. I also recommend reviewing the text to avoid using the same descriptive word more than once in a statement, such as the repeated “often” in “Non-traditional data are more voluminous and often collected at the individual level, but they often capture a convenience sample limited by user biases.”

For the methods, I think adding the definition of ILI that was used (IMS Health definition, CSTE, etc.) would be helpful detail. Elaborating on the definitions of weekly ILI visits and diagnoses would also be helpful: are these diagnoses of ILI, flu, etc.? None of these enrichments need to be lengthy or exhaustive, just more informative. I am interested in how expected prevalence was calculated with population weights yet did not vary across counties. Some enrichment around this would be interesting, perhaps in the discussion. Additionally, was the time series of season intensity described in “Inferring probable source location” different from the time series used to define intensity? For the Euclidean distances, this is the only measurement in its section that is county-only, not state and county. How, if at all, were state-to-county distances handled? Lastly, clearly stating the four county-level summary measures of disease burden after the first use of this language would be a helpful detail to readers, providing more clarity as they navigate the paper.

In the results, I think you should specify that the following is across all observed seasons: “We found disagreement in the top two most probable source states and the top 50 most probable source counties (Figure 2)”. In the subsection “A majority of county data streams achieve onset and peak timing milestones before state data streams”, please address the repeated word (achieved) in the sentence beginning with “State-level flu season onset and peak timing tended”. Additionally, I suggest a slight re-phrase so this does not read exactly like the methods section: “We defined spatial aggregation difference as the difference between a given influenza disease burden measure at aggregated spatial scales”. For Figure 2, the county data for HHS Region 1 are more difficult to see than for the other HHS regions. Highlighting the difference in scale for intensity across regions in the caption would be informative. While the figure’s focus is intraregional comparison across spatial scales, it might be good to note the differences. For Table 1, is the text starting with “The two negative estimates indicate” part of the title? Should it be moved to the caption, or explained in the text describing the results in the figure?

For the discussion, I would emphasize that the goal of this study is focused on the influence of spatial aggregation, especially as novel data streams become available for influenza. I would include the goals around spatial aggregation in this sentence, as it was the focus, rather than the current phrasing. Some additional considerations to enrich this section include additional biases specific to medical claims data that could relate to the results seen in spatial aggregation (beyond geographic coverage); these could include access to care, facility aggregation (particularly in rural areas), etc. Similarly, it may be interesting to briefly address how mobility may play into differential results of spatial aggregation in the early and peak seasons. Lastly, detailing why onset and peak location and timing are important to capture accurately would be a good addition.

The findings of this study, especially about the plurality of state populations experiencing ILI onset ahead of state-level data indicating such, are very interesting. I think Figure 2 did a nice job of displaying the onset location differences across county and state level data. ILI is a good use case for this work and it is helpful to see how non-traditional and digital data may be influenced differently by spatial choice than traditional epi data, particularly as more of these data for respiratory illnesses emerge in light of COVID-19.

Reviewer #2: Because this manuscript is about spatial autocorrelation, I think it should also show the cold-spot areas, not only the hot-spot areas. This can be really helpful for readers and also for policy makers in order to compare the areas, and it can boost improvements to this research in the future, perhaps using a multivariate approach to find the major or other causes of disease spread.

I think it will be really important for encouraging similar research in other countries, especially in my country, Indonesia, to strengthen ILI surveillance data and spatial analysis in the public health sector.

On page 7, I found a technical issue in the paragraph.

I think the authors should explain the limitations of the study; they are not explained well in the manuscript.

Reviewer #3: The study makes a good contribution to current knowledge. Spatial aggregation differences are critical to statistical inference from multi-level data. However, the measurement of some key variables in the study is not easy to follow. For example, “the onset intensity and peak intensity metrics were defined as risks relative to the ‘expected’ onset and peak prevalence...”. It would be difficult for readers to learn how the intensity was exactly measured. What are the “risks”? How was “expected” defined? And the key variable in this study, spatial aggregation difference, was defined as “the difference between a given influenza disease burden measure....” Just one sentence describes this measurement. The authors should elaborate more on this measurement, and make it easy to read. How is burden related to spatial aggregation difference?

--------------------

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Dhihram Tenrisau

Reviewer #3: Yes: Ge Zhan

--------------------

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLOS Digit Health. doi: 10.1371/journal.pdig.0000039.r003

Decision Letter 1

Yuan Lai, Laura Sbaffi

11 Apr 2022

Spatial aggregation choice in the era of digital and administrative surveillance data

PDIG-D-21-00097R1

Dear Dr. Lee,

We are pleased to inform you that your manuscript 'Spatial aggregation choice in the era of digital and administrative surveillance data' has been provisionally accepted for publication in PLOS Digital Health.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email from a member of our team. 

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact digitalhealth@plos.org.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Digital Health.

Best regards,

Yuan Lai, Ph.D.

Academic Editor

PLOS Digital Health

***********************************************************

Dear Dr. Lee,

It is my pleasure to inform you that we have received all feedback from the reviewers regarding your revised manuscript, "Spatial aggregation choice in the era of digital and administrative surveillance data".

Please see the reviewers' decisions and comments below.

Best,

Yuan

Reviewer Comments (if any, and for reference):

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

Reviewer #3: All comments have been addressed

**********

2. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors addressed all of my comments from the initial review. Overall, I am pleased with the updated version of the manuscript.

Reviewer #2: Thank you for your answers. I hope this research can have an impact on future surveillance research. If you don't mind, could you also add the GitHub link or an R Markdown file? I think it would be really valuable to readers, so they can use and learn your methods in their own R applications. Thank you

Reviewer #3: Comments have been well addressed in the revision.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: Yes: Ge Zhan

**********

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Text. Detailed descriptions about data processing, methodological choices, sensitivity analyses, and supporting evidence.

    The Supporting Information includes Table A and B and Figures A to W.

    (PDF)

    Attachment

    Submitted filename: Scales MS rebuttal 1.docx

    Data Availability Statement

    The medical claims data are not publicly available; they were obtained from IMS Health, now IQVIA, which may be contacted at https://www.iqvia.com/. All other model input data and map base layers are made publicly available by the US Census Bureau. Model output data and code are available at https://github.com/bansallab/spatialaggregation.

