PLOS One. 2021 Jun 10;16(6):e0252990. doi: 10.1371/journal.pone.0252990

A comparison of prospective space-time scan statistics and spatiotemporal event sequence based clustering for COVID-19 surveillance

Fuyu Xu 1, Kate Beard 1,*
Editor: Agricola Odoi2
PMCID: PMC8191960  PMID: 34111199

Abstract

The outbreak of the COVID-19 disease was first reported in Wuhan, China, in December 2019. Cases in the United States began appearing in late January 2020. On March 11, the World Health Organization (WHO) declared a pandemic. By mid-March COVID-19 cases were spreading across the US, with several hotspots appearing by April. Health officials point to the importance of COVID-19 surveillance to better inform decision makers at various levels and to efficiently manage distribution of human and technical resources to areas of need. The prospective space-time scan statistic has been used to help identify emerging COVID-19 disease clusters, but results from this approach can be limited by constraints of the scanning window. This paper presents a different approach to COVID-19 surveillance based on spatiotemporal event sequence (STES) similarity. In this STES-based approach, adapted for this pandemic context, we compute the similarity of evolving daily COVID-19 incidence rates by county and then cluster these sequences to identify counties with similarly trending COVID-19 case loads. We analyze four study periods and compare the sequence similarity-based clusters to prospective space-time scan statistic-based clusters. The sequence similarity-based clusters provide an alternate surveillance perspective by identifying locations that may not be spatially proximate but share a similar disease progression pattern. Results of the two approaches taken together can help track the progression of the pandemic and support local or regional public health responses and policy actions taken to control or moderate the disease spread.

Introduction

The first reported case of Coronavirus disease 2019 (COVID-19) appeared in the US in Washington State in January 2020. Cases then began to appear around the country, creating an outbreak more severe than that experienced in the city of Wuhan, China, where the initial outbreak occurred [1], as well as in many European countries [2, 3]. By mid-March 2020 the outbreak had spread to many states and by late April over one million confirmed cases had been reported in the US.

To anticipate and detect outbreaks, the World Health Organization (WHO), many national and local health departments, and academic and other non-profit organizations continuously collected information about occurrences of COVID-19. Incident cases were cumulatively added to different online repositories [4–6]. Quick detection of emerging geographical clusters or space-time clusters of COVID-19 can aid public health agencies in prioritizing locations for allocation of medical resources, including testing kits, and in applying efficient and publicly acceptable interventions. Versions of space-time scan statistics have been widely used to identify significant clusters of various diseases [7–11] as well as in the current COVID-19 crisis [12, 13]. Space-time scan statistics use circular or elliptical scanning windows of a series of sizes in combination with varying time intervals to systematically scan a study area to detect clusters of disease cases. The Poisson-based space-time scan statistic evaluates the number of cases in each scan window and tests for locations exceeding the number of cases expected under a Poisson distribution.

The prospective Poisson space-time scan statistic has been successfully used for space-time surveillance of different epidemic diseases. As Kulldorff et al. proposed [9, 10], this method focuses on detecting emerging clusters that start at any time during the study period and remain identifiable at the current time (i.e., active or alive), which is the major difference compared to the retrospective space-time scan statistic. Jones et al. used this method to detect twelve “live” or emerging statistically significant (p-value ≤ 0.05) clusters of shigellosis in the city of Chicago [14], the results of which helped local health departments to prioritize the assignment and investigation of shigellosis cases. The prospective Poisson space-time scan statistic has also been utilized to identify emerging clusters in other diseases such as thyroid cancer among men in New Mexico (1973–1992) [9], syndromic surveillance [15], measles [16], and dengue fever [17]. More recently, it has been used to detect “active” clusters of COVID-19 confirmed cases in the United States [12, 18].

While the prospective space-time scan statistic is a good option for detecting emerging space-time clusters of infectious diseases, some limitations remain. The effectiveness of the circular scan window decreases as the shape of emerging clusters becomes more irregular. Detected clusters may contain locations without confirmed cases or with low relative risk as an artifact of the scanning process [10, 12, 19], although this limitation can be minimized by reporting the individual relative risk for the included locations in each cluster. For the Poisson model, the results depend on accurate data on the population at risk, which may be hard to obtain. Furthermore, the prospective space-time scan statistic, as an exploratory method, should be followed with other surveillance measures and more detailed investigation of transmission dynamics and pathogenic mechanics of COVID-19 to better understand detected emerging clusters [12].

While the prospective space-time scan statistic has demonstrated value for COVID-19 surveillance, the objective of this study was to demonstrate a different but complementary view of COVID-19 outbreak patterns. The space-time scan statistic detects hotspots but does not inform about locations that may be spatially disparate yet exhibit highly similar patterns in the evolution of disease case counts. To capture this dynamic, we employed an event sequence similarity metric on the sequences of daily COVID-19 incidence rates by county. This event sequence similarity metric was then used to cluster counties exhibiting similarly evolving COVID-19 case histories. The resulting identification of locations exhibiting similar evolutionary patterns in the disease provides another aid for public health responses and understanding of disease dynamics. In the remainder of this paper, we describe this event sequence similarity metric as applied to COVID-19 daily incidence rates and compare it with results of the prospective Poisson space-time scan statistic. We use four time periods to illustrate progression of COVID-19 outbreaks through the lens of prospective space-time scan statistic generated clusters and event sequence similarity clusters. The two approaches provide different but complementary aids to COVID-19 surveillance. One identifies emerging spatial hotspots; the other identifies collections of locations that, for whatever reason, have statistically similar evolving COVID-19 incidence patterns.

Materials and methods

Data acquisition and processing

We accessed raw daily global COVID-19 data from the GitHub repository (https://github.com/CSSEGISandData/COVID-19) created and maintained by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) [20]. The specific time series dataset for this research contains FIPS codes, state names, geolocations, and cumulative confirmed cases, starting from January 22, 2020 through selected ending dates. JHU CSSE continues to update the site daily, either semi-automatically or automatically (https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/).

County level population data for the USA were obtained from the national US Census with estimates for 2019. The ESRI ™ shapefiles of US states and counties used for Geographic Information System (GIS) mapping were downloaded from the TIGER geography portal (US Census Bureau) (https://www.census.gov/cgi-bin/geo/shapefiles/index.php).

We focused the analysis on the 48 contiguous states and Washington, D.C. The dataset was cleaned by filtering out the records without “FIPS” codes and names of counties, and with “FIPS” > 8000 (assigned with “Out of AL”, “Out of AK”, …, “Out of WY”). We combined the cleaned COVID-19 dataset with the U.S. census data at the county level through the “FIPS” codes and double checked the correctness of the spatial information (Latitude and Longitude). Because the COVID-19 dataset only contains cumulative case counts, we obtained the daily confirmed cases by subtracting the previous day’s number from the current day’s reported cumulative cases. The daily incidence rate for each county was obtained as daily confirmed cases divided by county population and multiplied by 10,000. We chose the data from the first wave of the COVID-19 pandemic in the US in 2020 for this study. The entire duration of the first wave is further divided into four analysis periods, taking into account the incubation time for the disease (mostly 1–14 days, with an average of 5 days [21]) and the slow increase in cases during January and February 2020. Each of the four analysis periods starts on January 22, 2020, and their end dates are separated by roughly 2–4 weeks: an early period ending 1) March 13, and spiking periods ending 2) March 31, 3) April 19 and 4) May 20.
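The differencing and rate calculation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `daily_cases` and `incidence_rate` are hypothetical, and treating the first day's daily count as equal to its cumulative count is an assumption not stated in the text.

```python
from typing import List


def daily_cases(cumulative: List[int]) -> List[int]:
    """Convert cumulative confirmed counts into daily new counts."""
    if not cumulative:
        return []
    # Assumption: the first day's daily count equals its cumulative count.
    daily = [cumulative[0]]
    # Each later day is the current cumulative count minus the previous one.
    for prev, curr in zip(cumulative, cumulative[1:]):
        daily.append(curr - prev)
    return daily


def incidence_rate(daily: List[int], population: int, per: int = 10_000) -> List[float]:
    """Daily incidence rate: daily cases / county population * 10,000."""
    if population <= 0:
        raise ValueError("population must be positive")
    return [d / population * per for d in daily]


# Made-up cumulative series for one hypothetical county of 20,000 residents.
cumulative = [0, 1, 3, 3, 6]
daily = daily_cases(cumulative)
rates = incidence_rate(daily, population=20_000)
```

Downward corrections in a cumulative series would produce negative daily values here; the text does not say how such cases were handled, so the sketch leaves them untouched.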

Prospective Poisson space-time scan statistic

We used the prospective Poisson space–time scan statistic as implemented in SaTScan (http://www.satscan.org/) to detect clusters of COVID-19 cases that remained active at the end of each study period. The space–time scan statistic (STSS) is briefly introduced here, and more details can be obtained from [9, 10, 12, 22]. With spatial scan statistics we can identify the locations of clusters of cases. A cluster can be defined as a set of points or regions, at a user-defined granularity, with either high or low rates of incidence. For this study, the focus was high rates of COVID-19 incidence. Conceptually the STSS uses a cylinder as the scanning window, where the circular base of the cylinder captures the spatial dimension while the height represents a temporal interval. To identify space-time clusters at the county level, the center of the circular base is co-located with the centroid of each county. As the scan progresses, the radius of the circular base and the height of the cylinder change from lower bounds to spatial and temporal upper limits. Similar to [12], we set the maximum scanning window base to include up to 10 percent of the total population to avoid extremely large clusters (e.g., covering a quarter of the country), which may occur especially at the beginning stage of the epidemic, and set the upper temporal bound to 50% of the entire study period. As each cylinder moves over the study area, it covers a different set of cases for different time intervals, each of which can be considered a potential emerging space-time cluster candidate. We set the cluster’s duration to a minimum of 2 days and required at least 5 incidents or confirmed cases of COVID-19 as described in [12].
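The window constraints above can be illustrated with a small sketch of how candidate cylinders might be enumerated. This is not SaTScan's implementation: the function names are hypothetical, planar Euclidean distance between centroids stands in for whatever distance SaTScan computes, and counties at tied distances are added one at a time for simplicity.

```python
import math
from typing import Dict, FrozenSet, List, Set, Tuple


def candidate_zones(
    pops: Dict[str, int],
    coords: Dict[str, Tuple[float, float]],
    max_pop_frac: float = 0.10,
) -> Set[FrozenSet[str]]:
    """Circular bases centered on each county centroid, capped by population."""
    cap = max_pop_frac * sum(pops.values())
    zones: Set[FrozenSet[str]] = set()
    for center in pops:
        # Grow the circle by adding counties in order of centroid distance.
        ordered = sorted(pops, key=lambda k: (math.dist(coords[center], coords[k]), k))
        members: List[str] = []
        pop = 0
        for county in ordered:
            if pop + pops[county] > cap:
                break  # adding this county would exceed the population cap
            members.append(county)
            pop += pops[county]
            zones.add(frozenset(members))
    return zones


def prospective_windows(n_days: int, min_days: int = 2, max_frac: float = 0.5) -> List[Tuple[int, int]]:
    """Temporal intervals (start_day, end_day), 1-indexed, all ending on the last day."""
    max_days = int(n_days * max_frac)
    return [(n_days - length + 1, n_days) for length in range(min_days, max_days + 1)]


# Toy example with three hypothetical counties.
pops = {"A": 4, "B": 5, "C": 91}
coords = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 0.0)}
zones = candidate_zones(pops, coords)
windows = prospective_windows(10)
```

Every temporal interval ends on the final day of the study period, which is what restricts the prospective analysis to clusters that are still active.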

The age structure of a population will influence the incidence of disease, and deaths from COVID-19 are several times higher in older age groups as noted by others [12]. However, we were unable to access age and sex data at this time for cases in this study, so we could not adjust for age and sex. Assuming that COVID-19 incidence follows a Poisson distribution according to the county population, i.e., the assumed population at risk [9], the likelihood ratio test statistic and the relative risk for each scan cylinder were calculated based on the description in [7–9, 12]. The cylinder with the maximum likelihood ratio identifies the location with the most likely elevated risk for COVID-19. We used 999 standard Monte Carlo replications in the SaTScan settings to calculate the statistical significance of detected clusters, with a p-value less than or equal to 0.05 being considered statistically significant. SaTScan computes the relative risk (RR) for each cluster and for individual counties. The RR for a county within a cluster can be calculated as in [18]:

$$RR_{cty} = \frac{c/e}{(C-c)/(C-e)}$$

where $c$ is the total number of cases in a county, $C$ is the total number of observed cases in the conterminous US, and $e$ is the expected number of cases in a county, calculated as $e = p_{cty} \cdot C / P$ ($p_{cty}$ is the population in a county and $P$ is the total population). We used ESRI ArcGIS 10.6 (www.esri.com) GIS software to create cartographic representations for these detected emerging clusters at the county level.
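As a worked illustration of the quantities in this subsection, the sketch below computes the expected cases and relative risk exactly as defined above. It also includes the Poisson log likelihood ratio and the Monte Carlo rank-based p-value in their standard forms from the scan statistic literature; the text cites these but does not reproduce them, so treat those two functions as an assumption about what SaTScan evaluates. All function names are hypothetical.

```python
import math
from typing import Sequence


def expected_cases(p_cty: float, C: float, P: float) -> float:
    """e = p_cty * C / P: expected cases given a county's population share."""
    return p_cty * C / P


def relative_risk(c: float, e: float, C: float) -> float:
    """RR = (c / e) / ((C - c) / (C - e))."""
    if C == c:
        return math.inf  # every observed case falls inside the county
    return (c / e) / ((C - c) / (C - e))


def poisson_llr(c: float, e: float, C: float) -> float:
    """Log likelihood ratio for a window with c observed and e expected cases.

    Standard Poisson scan form (assumed, not taken from the text); only
    windows with more cases than expected (high rates) score above zero.
    """
    if c <= e:
        return 0.0
    inside = c * math.log(c / e)
    outside = (C - c) * math.log((C - c) / (C - e)) if C > c else 0.0
    return inside + outside


def monte_carlo_p_value(observed: float, simulated: Sequence[float]) -> float:
    """Rank of the observed maximum among simulated maxima, e.g. 999 replications."""
    rank = 1 + sum(1 for s in simulated if s >= observed)
    return rank / (len(simulated) + 1)


# Toy numbers: a county of 1,000 people in a population of 10,000 with 100 total cases.
e = expected_cases(p_cty=1_000, C=100, P=10_000)
rr = relative_risk(c=20, e=e, C=100)
llr = poisson_llr(c=20, e=e, C=100)
p = monte_carlo_p_value(2.5, [1.0, 2.0, 3.0])
```

With 999 replications the smallest attainable p-value is 1/1000, which is consistent with the "<0.001" entries reported in the cluster tables.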

Event sequence similarity-based cluster analysis

Our event sequence similarity approach focuses on the temporal evolution of events occurring at fixed locations. In this study, an event corresponds to the COVID-19 daily incidence rate for a county, and a COVID-19 event sequence for a county is the sequence of daily incidence rates covering a specific study period. We compute the similarity of these county level COVID-19 event sequences using a time ordered Jaccard measure [23–25]. Briefly, this measure uses all co-occurrence time points between two event sequences es1 and es2, and calculates the similarity between two events at the co-occurrence timestamp based on their level of measurement. The similarity between two counties’ COVID-19 event sequences is calculated as below:

$$\mathrm{sim}_{county}(es_1, es_2) = \frac{\sum_{j=1}^{C}\left(1 - \mathrm{Abs}\left(\mathrm{lev}(es_{1j}) - \mathrm{lev}(es_{2j})\right)\right)}{|es_1 \cup es_2|}$$

where,

$\mathrm{sim}_{county}(es_1, es_2)$: similarity between county level event sequences $es_1$ and $es_2$,

$es_{1j}$, $es_{2j}$: the event values for two corresponding co-occurring events in $es_1$ and $es_2$ at timestamp $j$,

$\mathrm{lev}(es_{1j})$, $\mathrm{lev}(es_{2j})$: the relative event levels of two corresponding co-occurring events in $es_1$ and $es_2$ at timestamp $j$, respectively:

$$\mathrm{lev}(es_{1j}) = \frac{es_{1j}}{es_{1j} + es_{2j}} \quad \text{and} \quad \mathrm{lev}(es_{2j}) = \frac{es_{2j}}{es_{1j} + es_{2j}}$$

$C$: the total number of co-occurring timestamps,

$\mathrm{Abs}(\mathrm{lev}(es_{1j}) - \mathrm{lev}(es_{2j}))$: absolute value of the difference between relative event levels of two corresponding co-occurring events in $es_1$ and $es_2$ at timestamp $j$,

$|es_1 \cup es_2|$: cardinality of the union of two event sequences $es_1$ and $es_2$.
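The similarity measure defined above can be sketched in a few lines. The representation is an assumption made for illustration: each event sequence is a mapping from day index to a positive daily incidence rate, days with a zero rate are treated as having no event, and the union is taken over days on which either county has an event. The function names are hypothetical, and the authors' own implementation was in R.

```python
from typing import Dict, List


def to_event_sequence(rates: List[float]) -> Dict[int, float]:
    """Keep only days with a positive incidence rate (assumed definition of an event)."""
    return {day: r for day, r in enumerate(rates) if r > 0}


def sequence_similarity(es1: Dict[int, float], es2: Dict[int, float]) -> float:
    """Time ordered Jaccard-style similarity between two event sequences."""
    union = set(es1) | set(es2)
    if not union:
        return 0.0  # no events in either sequence
    total = 0.0
    # Sum over co-occurring timestamps only.
    for j in set(es1) & set(es2):
        a, b = es1[j], es2[j]
        lev1 = a / (a + b)  # relative event level of es1 at timestamp j
        lev2 = b / (a + b)  # relative event level of es2 at timestamp j
        total += 1.0 - abs(lev1 - lev2)
    # Normalize by the cardinality of the union of timestamps.
    return total / len(union)


# Two made-up county sequences that share events on days 2 and 3.
es1 = {1: 2.0, 2: 1.0, 3: 4.0}
es2 = {2: 1.0, 3: 12.0, 4: 5.0}
sim = sequence_similarity(es1, es2)
```

Under this reading, two counties score 1 only when they have events on exactly the same days with equal rates, and the score falls both when rates diverge and when events occur on non-overlapping days.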

We then used the computed COVID-19 event sequence similarity measures between counties as the metric for hierarchical clustering [26]. All similarity computations and clustering tasks were implemented in R. The hierarchical clustering was performed using the hclust R function with the ward.D2 linkage method. The optimal number of clusters was evaluated using the elbow method [27–29]. This method supports selection of the number of clusters beyond which the total within-cluster sum of squares (WSS) no longer improves appreciably. In a plot of number of clusters versus WSS, the optimal cluster number is visually associated with the point at which the WSS curve flattens.
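To make the WSS quantity behind the elbow plot concrete, the sketch below computes it from a pairwise distance matrix and a set of cluster labels. Two points are assumptions rather than details from the text: similarities are converted to distances as 1 − similarity, and WSS is obtained from pairwise distances through the identity WSS = Σ_k (1 / (2 n_k)) Σ_{i,j ∈ k} d_ij², which is exact for Euclidean distances and only a heuristic otherwise. The clustering itself (hclust with ward.D2 in the authors' R workflow) is not reimplemented here.

```python
from typing import Dict, List, Sequence


def similarity_to_distance(sim: Sequence[Sequence[float]]) -> List[List[float]]:
    """Assumed conversion: distance = 1 - similarity."""
    return [[1.0 - s for s in row] for row in sim]


def within_cluster_ss(dist: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Total within-cluster sum of squares from pairwise distances."""
    members: Dict[int, List[int]] = {}
    for i, lab in enumerate(labels):
        members.setdefault(lab, []).append(i)
    wss = 0.0
    for idx in members.values():
        # Sum of squared distances over all ordered pairs within the cluster.
        pair_sum = sum(dist[i][j] ** 2 for i in idx for j in idx)
        wss += pair_sum / (2 * len(idx))
    return wss


# Toy example: four points on a line forming two tight groups.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(a - b) for b in points] for a in points]
wss_by_k = {
    1: within_cluster_ss(dist, [0, 0, 0, 0]),
    2: within_cluster_ss(dist, [0, 0, 1, 1]),
    4: within_cluster_ss(dist, [0, 1, 2, 3]),
}
```

In this toy case WSS drops sharply from one cluster to two and barely changes afterwards, which is the flattening that the elbow method looks for.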

Comparison of prospective space time scan and event sequence similarity-based clusters

To support comparison of the two methods, we used the counties identified by the prospective space-time scan statistic as having relative risk > 1 as the counties for analysis with the sequence similarity metric. All other counties not included in this set were labeled as OC, meaning outside clusters. We include them in the incidence curve graphs in Figs 3, 6 and 9 for each study period to show their temporal incidence pattern as a baseline.

Fig 3. Sequence similarity-based COVID-19 clusters along with average temporal trends at the county level through 3/13/2020.


This map includes the counties with higher relative risk (RR>1) contained in all the clusters detected by scan statistics in Fig 1. The average temporal trends of cumulative cases for STES clusters 1–8 on the map appear at the bottom right. Notice that the colors of STES clusters match with correspondingly colored dots on the map and with the colors of the STES cluster curves on the graph. OC includes all counties not included in the clusters.

Fig 6. Sequence similarity-based COVID-19 clusters along with average temporal trends at county level during 1/22/2020-3/31/2020.


This map includes the counties with higher relative risk (RR>1) contained in all the clusters detected by scan statistics in Fig 4. The average temporal trends of cumulative cases for STES clusters 1–8 on the map appear at the bottom right. Notice that the colors of STES clusters match with correspondingly colored dots on the map and with the colors of the STES cluster curves on the graph. OC includes all counties not included in the clusters.

Fig 9. Sequence similarity-based COVID-19 emerging clusters along with average temporal trends at county level during 1/22-4/19/2020.


This map includes the counties with higher relative risk (RR>1) contained in all the clusters detected by scan statistics in Fig 7. The average temporal trends of cumulative cases for STES clusters 1–10 on the map appear at the bottom right. Notice that the colors of STES clusters match with correspondingly colored dots on the map and with the colors of the STES cluster curves on the graph. OC includes all counties not included in the clusters.

Results

Space-time clusters and sequence similarity-based clusters at county level: Study period 1 (1/22-3/13/2020)

In this early period, COVID-19 was just appearing in the US, with the first case reported in Snohomish County, Washington, on January 19. For this period, the prospective space-time scan statistic identified 11 statistically significant (p-value < 0.05) clusters, shown graphically in Fig 1 and summarized in Table 1. These clusters, aside from one in California and two in New York, are generally quite large, and the counties within them with RR > 1 are few and generally spatially dispersed. Because these clusters are generally large, they offer limited spatial specificity about where an outbreak is occurring.

Fig 1. COVID-19 space-time scan hotspots in the United States at the county level from 1/22-3/13/2020.


Table 1. Attributes of prospective space-time clusters (hotspots) for COVID-19 from 1/23-3/13/2020 at the county level.

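| Cluster | Start Date | End Date | Duration (Days) | Radius (km) | Observed Cases | Expected Cases | Relative Risk (RR) | p-value | Population at Risk | Counties (total) | Counties (RR>1) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3/10 | 3/13 | 4 | 806.37 | 389 | 38 | 12.28 | <0.001 | 888,297 | 238 | 14 |
| 2 | 3/7 | 3/13 | 7 | 0.00 | 139 | 15 | 10 | <0.001 | 189,707 | 1 | 1 |
| 3 | 3/10 | 3/13 | 4 | 551.69 | 66 | 18 | 3.83 | <0.001 | 167,447 | 404 | 16 |
| 4 | 3/9 | 3/13 | 5 | 364.08 | 42 | 10 | 4.29 | <0.001 | 87,766 | 262 | 16 |
| 5 | 3/12 | 3/13 | 2 | 32.48 | 102 | 47 | 2.21 | <0.001 | 1,267,395 | 9 | 6 |
| 6 | 3/12 | 3/13 | 2 | 91.08 | 10 | 0 | 29.12 | <0.001 | 7,438 | 35 | 3 |
| 7 | 3/5 | 3/13 | 9 | 49.70 | 93 | 42 | 2.25 | <0.001 | 790,544 | 3 | 3 |
| 8 | 3/9 | 3/13 | 5 | 178.04 | 9 | 0 | 26.67 | <0.001 | 2,607 | 94 | 3 |
| 9 | 3/10 | 3/13 | 4 | 224.18 | 12 | 1 | 14.16 | <0.001 | 15,926 | 104 | 3 |
| 10 | 3/10 | 3/13 | 4 | 253.24 | 12 | 1 | 10.51 | <0.001 | 8,832 | 64 | 3 |
| 11 | 3/7 | 3/13 | 7 | 264.34 | 88 | 47 | 1.91 | <0.001 | 824,139 | 36 | 12 |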

Note: Space-time clusters were identified using the spatial scan statistic with a Poisson model.

Based on the elbow evaluation method, 8 event sequence similarity-based clusters were defined for this period (Fig 2). Fig 3 shows the map representation of these clusters along with their temporal profiles. Members of Cluster 3 that include counties in Washington State, California and New York show the earliest onset and the fastest case accumulation. Members of Cluster 5 show an early onset that initially tracks Cluster 3 but then abruptly flattens and then decreases in early March. Members of this cluster include 3 counties in California and one in Minnesota. Cluster 2 members show a delayed occurrence in cases but an extremely fast case accumulation over a few days. The 8 members of this cluster are generally in isolated rural settings in Colorado, Oklahoma, Wyoming, South Dakota, Wisconsin, Louisiana and Indiana. Members of Cluster 6 showed initiation of cases at approximately the same time as Cluster 2 but levelled off quickly at a lower incidence rate. The cluster containing counties in New York suggests initial points of entry and situations conducive to rapid acceleration of cases such as high density or tight knit communities. A pairwise comparison of cluster numbers for the 1st study period from these two approaches can be found in S1 Table.

Fig 2. Elbow method evaluation and hierarchical clustering results for the 1st period.


Notice that the numberings and colors of STES clusters match with those of corresponding clusters on the map and the temporal trend graph in Fig 3.

Space-time clusters and sequence similarity-based clusters at county level: Study period 2 (1/22-3/31/2020)

Results from the prospective space-time scan statistic analysis for the second study period (through March 31) identified twenty-four space-time clusters of COVID-19 as statistically significant (Fig 4 and Table 2). This period shows a growing emergence of spatial clusters across the US, but generally more consolidated clusters as the number of cases grows. The space-time clusters are smaller than in the first period and several detected clusters contain a single county (cluster radius = 0). This period shows a shift toward more clusters appearing in the interior US relative to the coasts.

Fig 4. COVID-19 space-time scan statistic detected hotspots in the United States at county level through 3/31/2020.


Table 2. Attributes of prospective space-time clusters (hotspots) for COVID-19 from 1/23-3/31/2020 at the county level.

| Cluster | Start Date | End Date | Duration (Days) | Radius (km) | Observed Cases | Expected Cases | Relative Risk (RR) | p-value | Population at Risk | Counties (total) | Counties (RR>1) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3/22 | 3/31 | 13 | 89.28 | 82,928 | 10,049 | 14.35 | <0.001 | 6,395,723 | 22 | 22 |
| 2 | 3/22 | 3/31 | 10 | 43.08 | 5,887 | 1,526 | 3.95 | <0.001 | 1,074,213 | 3 | 3 |
| 3 | 3/20 | 3/31 | 12 | 73.70 | 3,152 | 487 | 6.57 | <0.001 | 292,363 | 8 | 8 |
| 4 | 3/27 | 3/31 | 5 | 0.00 | 3,078 | 1,012 | 3.08 | <0.001 | 2,201,911 | 1 | 1 |
| 5 | 3/24 | 3/31 | 8 | 73.96 | 680 | 68 | 9.97 | <0.001 | 39,490 | 20 | 18 |
| 6 | 3/26 | 3/31 | 6 | 60.42 | 2,587 | 1,102 | 2.37 | <0.001 | 1,370,768 | 2 | 2 |
| 7 | 3/24 | 3/31 | 8 | 62.27 | 2,041 | 846 | 2.43 | <0.001 | 1,345,457 | 4 | 4 |
| 8 | 3/19 | 3/31 | 13 | 95.88 | 190 | 11 | 17.17 | <0.001 | 5,083 | 4 | 3 |
| 9 | 3/30 | 3/31 | 2 | 307.75 | 1,528 | 729 | 2.11 | <0.001 | 1,822,585 | 262 | 82 |
| 10 | 3/16 | 3/31 | 16 | 82.42 | 313 | 54 | 5.78 | <0.001 | 28,677 | 5 | 5 |
| 11 | 3/20 | 3/31 | 12 | 146.72 | 214 | 38 | 5.6 | <0.001 | 20,460 | 9 | 4 |
| 12 | 3/29 | 3/31 | 3 | 325.81 | 4,574 | 3,543 | 1.3 | <0.001 | 6,684,959 | 257 | 75 |
| 13 | 3/27 | 3/31 | 5 | 210.38 | 787 | 448 | 1.76 | <0.001 | 647,610 | 43 | 10 |
| 14 | 3/30 | 3/31 | 2 | 0.00 | 1,190 | 789 | 1.51 | <0.001 | 3,855,599 | 1 | 1 |
| 15 | 3/25 | 3/31 | 7 | 50.46 | 206 | 72 | 2.88 | <0.001 | 57,714 | 5 | 2 |
| 16 | 3/23 | 3/31 | 9 | 49.14 | 84 | 14 | 5.86 | <0.001 | 5,999 | 5 | 4 |
| 17 | 3/30 | 3/31 | 2 | 240.79 | 344 | 179 | 1.92 | <0.001 | 528,991 | 11 | 3 |
| 18 | 3/29 | 3/31 | 3 | 0.00 | 27 | 2 | 11.75 | <0.001 | 1,412 | 1 | 1 |
| 19 | 3/14 | 3/31 | 18 | 36.13 | 105 | 44 | 2.4 | <0.001 | 20,986 | 2 | 2 |
| 20 | 3/22 | 3/31 | 10 | 42.64 | 35 | 8 | 4.27 | <0.001 | 3,227 | 4 | 4 |
| 21 | 3/30 | 3/31 | 2 | 0.00 | 244 | 152 | 1.61 | <0.001 | 991,866 | 1 | 1 |
| 22 | 3/24 | 3/31 | 8 | 54.38 | 22 | 4 | 5.76 | <0.001 | 1,899 | 8 | 5 |
| 23 | 3/27 | 3/31 | 5 | 139.67 | 101 | 50 | 2.02 | <0.001 | 49,538 | 2 | 2 |
| 24 | 3/11 | 3/31 | 21 | 188.69 | 48 | 17 | 2.85 | <0.001 | 6,210 | 45 | 16 |

Note: Space-time clusters were identified using the spatial scan statistic with a Poisson model.

For this second study period the sequence similarity clustering resulted in 8 clusters based on the elbow method evaluation (Fig 5). Fig 6 shows the map of these clusters and their temporal signatures. For this period, only three clusters deviate from the pattern of the outside cluster (OC) set. Cluster 7 shows the most rapid increase in cases. Members of this cluster include Miami, San Jose, Los Angeles area counties, Chicago, Detroit, New Orleans and New York metropolitan counties. Members of Cluster 8 show a slower increase in cases. Some of these members appear in a group across New Jersey and Pennsylvania, and around Baltimore, Denver and Seattle. Cluster 4 follows a similar trajectory with some concentrations around New Orleans, Columbus, Georgia, and Indianapolis. Members of this cluster also appear in more isolated rural settings in Arizona, Oklahoma and South Dakota. A pairwise comparison of cluster numbers for the 2nd study period from these two approaches can be found in S2 Table.

Fig 5. Elbow method evaluation and hierarchical clustering results for the 2nd period.


Notice that the numberings and colors of STES clusters match with those of corresponding clusters on the map and the temporal trend graph in Fig 6.

Space-time clusters and sequence similarity-based clusters at county level: Study period 3 (1/22-4/19/2020)

For the third study period, the prospective space-time scan statistic detected 47 statistically significant clusters (p≤0.05) as shown in Fig 7. Associated cluster characteristics are shown in Table 3. In this period more clusters are emerging in the southern US, with additional new pockets in Montana and a cluster covering Nebraska and South Dakota. Metropolitan New York remains an active cluster and a more condensed Mid-Atlantic coast cluster has emerged. We see additional consolidation in the size of clusters with 25 appearing as a single county.

Fig 7. COVID-19 space-time scan statistic detected hotspots in the United States at county level through 4/19/2020.


Table 3. Attributes of prospective space-time clusters (hotspots) for COVID-19 from 1/23-4/19/2020 at the county level.

| Cluster | Start Date | End Date | Duration (Days) | Radius (km) | Observed Cases | Expected Cases | Relative Risk (RR) | p-value | Population at Risk | Counties (total) | Counties (RR>1) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3/21 | 4/19 | 30 | 112.67 | 317,283 | 50,808 | 10.07 | <0.001 | 10,183,190 | 29 | 29 |
| 2 | 3/25 | 4/19 | 26 | 73.70 | 13,048 | 2,223 | 5.96 | <0.001 | 468,407 | 8 | 8 |
| 3 | 3/27 | 4/19 | 24 | 43.08 | 22,215 | 7,189 | 3.15 | <0.001 | 1,680,202 | 3 | 3 |
| 4 | 4/16 | 4/19 | 4 | 0.00 | 1,670 | 20 | 83.28 | <0.001 | 19,232 | 1 | 1 |
| 5 | 4/4 | 4/19 | 16 | 0.00 | 15,161 | 6,360 | 2.41 | <0.001 | 2,838,481 | 1 | 1 |
| 6 | 3/31 | 4/19 | 20 | 77.77 | 2,949 | 441 | 6.72 | <0.001 | 93,100 | 22 | 22 |
| 7 | 4/6 | 4/19 | 14 | 298.19 | 40,502 | 27,081 | 1.52 | <0.001 | 9,421,799 | 226 | 93 |
| 8 | 4/10 | 4/19 | 10 | 263.00 | 1,767 | 341 | 5.2 | <0.001 | 137,317 | 85 | 26 |
| 9 | 3/30 | 4/19 | 21 | 0.00 | 8,162 | 4,404 | 1.86 | <0.001 | 1,173,224 | 1 | 1 |
| 10 | 3/26 | 4/19 | 25 | 0.00 | 435 | 36 | 12.25 | <0.001 | 7,586 | 1 | 1 |
| 11 | 4/17 | 4/19 | 3 | 0.00 | 360 | 29 | 12.63 | <0.001 | 30,783 | 1 | 1 |
| 12 | 4/1 | 4/19 | 19 | 59.89 | 1,270 | 464 | 2.74 | <0.001 | 116,600 | 6 | 5 |
| 13 | 4/9 | 4/19 | 11 | 162.39 | 832 | 271 | 3.07 | <0.001 | 112,063 | 5 | 5 |
| 14 | 3/20 | 4/19 | 31 | 84.21 | 760 | 281 | 2.71 | <0.001 | 52,008 | 6 | 6 |
| 15 | 3/31 | 4/19 | 20 | 218.29 | 10,400 | 8,205 | 1.27 | <0.001 | 1,932,165 | 152 | 77 |
| 16 | 4/5 | 4/19 | 15 | 169.63 | 400 | 104 | 3.84 | <0.001 | 22,025 | 36 | 20 |
| 17 | 4/9 | 4/19 | 11 | 42.71 | 309 | 67 | 4.58 | <0.001 | 24,501 | 3 | 3 |
| 18 | 4/14 | 4/19 | 6 | 36.59 | 428 | 142 | 3.02 | <0.001 | 97,393 | 6 | 6 |
| 19 | 4/13 | 4/19 | 7 | 41.53 | 100 | 6 | 16.58 | <0.001 | 2,434 | 2 | 1 |
| 20 | 4/9 | 4/19 | 11 | 144.34 | 2,800 | 1,943 | 1.44 | <0.001 | 999,773 | 20 | 14 |
| 21 | 4/14 | 4/19 | 6 | 0.00 | 109 | 10 | 10.73 | <0.001 | 5,683 | 1 | 1 |
| 22 | 3/20 | 4/19 | 31 | 0.00 | 299 | 88 | 3.41 | <0.001 | 16,762 | 1 | 1 |
| 23 | 4/2 | 4/19 | 18 | 0.00 | 643 | 349 | 1.85 | <0.001 | 94,077 | 1 | 1 |
| 24 | 4/7 | 4/19 | 13 | 70.67 | 348 | 179 | 1.95 | <0.001 | 41,649 | 17 | 14 |
| 25 | 4/15 | 4/19 | 5 | 0.00 | 123 | 37 | 3.35 | <0.001 | 29,216 | 1 | 1 |
| 26 | 4/17 | 4/19 | 3 | 192.58 | 142 | 51 | 2.8 | <0.001 | 50,741 | 11 | 6 |
| 27 | 4/18 | 4/19 | 2 | 37.48 | 298 | 152 | 1.96 | <0.001 | 386,360 | 2 | 2 |
| 28 | 4/3 | 4/19 | 17 | 92.71 | 301 | 156 | 1.93 | <0.001 | 41,584 | 5 | 3 |
| 29 | 4/11 | 4/19 | 9 | 0.00 | 173 | 77 | 2.25 | <0.001 | 41,981 | 1 | 1 |
| 30 | 4/11 | 4/19 | 9 | 0.00 | 83 | 24 | 3.48 | <0.001 | 14,638 | 1 | 1 |
| 31 | 4/15 | 4/19 | 5 | 0.00 | 41 | 7 | 6.16 | <0.001 | 3,595 | 1 | 1 |
| 32 | 4/15 | 4/19 | 5 | 72.81 | 57 | 13 | 4.29 | <0.001 | 10,680 | 8 | 5 |
| 33 | 4/14 | 4/19 | 6 | 0.00 | 1,019 | 763 | 1.34 | <0.001 | 926,455 | 1 | 1 |
| 34 | 4/13 | 4/19 | 7 | 0.00 | 583 | 410 | 1.42 | <0.001 | 336,507 | 1 | 1 |
| 35 | 3/28 | 4/19 | 23 | 50.34 | 32 | 5 | 6.04 | <0.001 | 888 | 2 | 2 |
| 36 | 4/2 | 4/19 | 18 | 68.61 | 253 | 149 | 1.7 | <0.001 | 28,897 | 10 | 9 |
| 37 | 4/12 | 4/19 | 8 | 0.00 | 59 | 20 | 2.96 | <0.001 | 8,797 | 1 | 1 |
| 38 | 4/18 | 4/19 | 2 | 0.00 | 272 | 174 | 1.56 | <0.001 | 1,139,191 | 1 | 1 |
| 39 | 4/17 | 4/19 | 3 | 0.00 | 37 | 10 | 3.74 | <0.001 | 27,699 | 1 | 1 |
| 40 | 3/29 | 4/19 | 22 | 0.00 | 105 | 52 | 2.02 | <0.001 | 9,587 | 1 | 1 |
| 41 | 4/18 | 4/19 | 2 | 0.00 | 20 | 3 | 6.4 | <0.001 | 7,819 | 1 | 1 |
| 42 | 3/23 | 4/19 | 28 | 44.85 | 112 | 58 | 1.94 | <0.001 | 9,320 | 5 | 5 |
| 43 | 4/11 | 4/19 | 9 | 0.00 | 93 | 46 | 2.02 | <0.001 | 17,771 | 1 | 1 |
| 44 | 4/18 | 4/19 | 2 | 0.00 | 14 | 2 | 8.17 | 0.002 | 3,531 | 1 | 1 |
| 45 | 4/14 | 4/19 | 6 | 0.00 | 22 | 5 | 4.71 | 0.003 | 2,749 | 1 | 1 |
| 46 | 4/18 | 4/19 | 2 | 0.00 | 53 | 21 | 2.49 | 0.003 | 31,371 | 1 | 1 |
| 47 | 3/24 | 4/19 | 27 | 0.00 | 102 | 55 | 1.85 | 0.006 | 10,847 | 1 | 1 |

Note: Space-time clusters were identified using the spatial scan statistic with a Poisson model.

For the third study period, ten sequence similarity-based clusters were selected using the elbow method (Fig 8). Fig 9 shows the map of these clusters and their temporal profiles. Cluster 8 shows a distinct early and more rapid accumulation of cases. Many members of this cluster were members of Cluster 7 in the previous study period. These members include Chicago, the Detroit metropolitan area, Miami, Philadelphia, and metropolitan New York counties. Some significant members of the previous period's Cluster 7 that are missing from Cluster 8 are San Jose, Los Angeles and Las Vegas. Cluster 9 shows a group with the next most rapidly developing number of cases. Within this group, some members appear concentrated around metropolitan New York, Philadelphia, Baltimore and Washington DC, and Denver. Cluster 10, as the third most rapidly emerging cluster for this period, has members in a halo-like pattern around metropolitan New York, Philadelphia and New Orleans. Other members, however, appear in more isolated rural settings in New Mexico, Utah, and Washington State. This group includes the Hopi, Zuni, Navajo and Yakima national reservations. Two other clusters to note in this period are Cluster 7 and Cluster 2, which show later initiation times in terms of case accumulation but appear to be accelerating at the end of the study period. Many of these members show a concentration in southern Indiana and western Kentucky respectively, with another grouping of Cluster 7 members appearing in southwestern Georgia on the border with Alabama. A complete pairwise comparison of cluster numbers for the 3rd study period from these two approaches can be found in S3 Table.

Fig 8. Elbow method evaluation and hierarchical clustering results for the 3rd period.


Notice that the numberings and colors of STES clusters match with those of corresponding clusters on the map and the temporal trend graph in Fig 9.

Space-time clusters and sequence similarity-based clusters at county level: Study period 4 (1/22-5/20/2020)

For the fourth study period ending on May 20, 2020, the prospective space-time scan statistic identified 87 statistically significant clusters. Table 4 provides the characteristics of these 87 space-time clusters that were active as of May 20, 2020. From Fig 10 we can observe that in this period clusters continued to emerge in southern states and more clusters emerged in the mountain west. The previous cluster covering Nebraska and South Dakota has expanded into Iowa, North Dakota and Minneapolis. The metropolitan New York cluster has consolidated and the prior period mid-Atlantic cluster has consolidated to an emerging cluster around Philadelphia.

Table 4. Attributes of prospective space-time clusters (hotspots) for COVID-19 from 1/23-5/20/2020 at the county level.

| Cluster | Start Date | End Date | Duration (Days) | Radius (km) | Observed Cases | Expected Cases | Relative Risk (RR) | p-value | Population at Risk | Counties (total) | Counties (RR>1) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 3/23 | 5/20 | 59 | 126.60 | 516,153 | 128,515 | 5.51 | <0.001 | 15,225,284 | 35 | 35 |
| 2 | 4/7 | 5/20 | 44 | 55.64 | 77,744 | 30,138 | 2.66 | <0.001 | 5,000,478 | 5 | 5 |
| 3 | 4/12 | 5/20 | 39 | 332.91 | 14,779 | 3,116 | 4.78 | <0.001 | 411,108 | 155 | 109 |
| 4 | 4/17 | 5/20 | 34 | 103.56 | 41,285 | 18,966 | 2.21 | <0.001 | 3,575,889 | 42 | 25 |
| 5 | 4/20 | 5/20 | 31 | 215.21 | 7,183 | 749 | 9.63 | <0.001 | 111,251 | 47 | 35 |
| 6 | 3/23 | 5/20 | 59 | 73.70 | 16,614 | 5,499 | 3.04 | <0.001 | 625,641 | 8 | 8 |
| 7 | 3/26 | 5/20 | 56 | 43.08 | 34,409 | 18,624 | 1.87 | <0.001 | 2,253,493 | 3 | 3 |
| 8 | 4/29 | 5/20 | 22 | 0.00 | 1,336 | 16 | 81.34 | <0.001 | 3,508 | 1 | 1 |
| 9 | 4/13 | 5/20 | 38 | 0.00 | 2,487 | 206 | 12.07 | <0.001 | 30,632 | 1 | 1 |
| 10 | 4/9 | 5/20 | 42 | 191.99 | 5,571 | 1,339 | 4.17 | <0.001 | 184,726 | 6 | 6 |
| 11 | 4/15 | 5/20 | 36 | 0.00 | 1,952 | 175 | 11.15 | <0.001 | 25,544 | 1 | 1 |
| 12 | 3/24 | 5/20 | 58 | 77.77 | 4,684 | 1,282 | 3.66 | <0.001 | 134,101 | 22 | 22 |
| 13 | 4/13 | 5/20 | 38 | 0.00 | 955 | 36 | 26.75 | <0.001 | 4,378 | 1 | 1 |
| 14 | 4/15 | 5/20 | 36 | 114.37 | 3,799 | 1,339 | 2.84 | <0.001 | 187,231 | 21 | 21 |
| 15 | 4/23 | 5/20 | 28 | 0.00 | 598 | 21 | 28.96 | <0.001 | 3,038 | 1 | 1 |
| 16 | 5/12 | 5/20 | 9 | 0.00 | 344 | 3 | 114.45 | <0.001 | 1,002 | 1 | 1 |
| 17 | 4/14 | 5/20 | 37 | 36.59 | 2,623 | 962 | 2.73 | <0.001 | 150,923 | 6 | 5 |
| 18 | 4/24 | 5/20 | 27 | 42.39 | 1,579 | 458 | 3.45 | <0.001 | 77,989 | 7 | 7 |
| 19 | 4/30 | 5/20 | 21 | 0.00 | 1,436 | 451 | 3.18 | <0.001 | 134,923 | 1 | 1 |
| 20 | 5/3 | 5/20 | 18 | 0.00 | 191 | 4 | 44.47 | <0.001 | 772 | 1 | 1 |
| 21 | 3/23 | 5/20 | 59 | 47.10 | 519 | 87 | 5.99 | <0.001 | 9,665 | 2 | 2 |
| 22 | 4/28 | 5/20 | 23 | 45.28 | 436 | 77 | 5.66 | <0.001 | 13,095 | 3 | 3 |
| 23 | 5/10 | 5/20 | 11 | 29.09 | 221 | 15 | 14.38 | <0.001 | 3,235 | 3 | 3 |
| 24 | 5/10 | 5/20 | 11 | 0.00 | 257 | 24 | 10.91 | <0.001 | 7,981 | 1 | 1 |
| 25 | 4/30 | 5/20 | 21 | 0.00 | 354 | 56 | 6.27 | <0.001 | 11,332 | 1 | 1 |
| 26 | 5/6 | 5/20 | 15 | 0.00 | 994 | 383 | 2.6 | <0.001 | 202,613 | 1 | 1 |
| 27 | 4/1 | 5/20 | 50 | 136.56 | 5,564 | 3,846 | 1.45 | <0.001 | 449,669 | 30 | 22 |
| 28 | 5/7 | 5/20 | 14 | 0.00 | 566 | 155 | 3.65 | <0.001 | 71,572 | 1 | 1 |
| 29 | 5/2 | 5/20 | 19 | 31.84 | 510 | 133 | 3.83 | <0.001 | 20,764 | 4 | 4 |
| 30 | 4/19 | 5/20 | 32 | 192.58 | 305 | 51 | 6.02 | <0.001 | 40,867 | 11 | 2 |
| 31 | 3/30 | 5/20 | 52 | 0.00 | 14,842 | 12,107 | 1.23 | <0.001 | 1,575,369 | 1 | 1 |
| 32 | 4/21 | 5/20 | 30 | 0.00 | 517 | 144 | 3.6 | <0.001 | 25,141 | 1 | 1 |
| 33 | 5/11 | 5/20 | 10 | 0.00 | 248 | 37 | 6.71 | <0.001 | 24,329 | 1 | 1 |
| 34 | 5/12 | 5/20 | 9 | 45.71 | 262 | 47 | 5.53 | <0.001 | 32,224 | 3 | 1 |
| 35 | 4/27 | 5/20 | 24 | 0.00 | 153 | 16 | 9.6 | <0.001 | 2,345 | 1 | 1 |
| 36 | 4/29 | 5/20 | 22 | 37.68 | 576 | 218 | 2.65 | <0.001 | 48,225 | 2 | 2 |
| 37 | 4/2 | 5/20 | 49 | 42.71 | 704 | 312 | 2.25 | <0.001 | 36,636 | 3 | 3 |
| 38 | 5/8 | 5/20 | 13 | 0.00 | 164 | 24 | 6.95 | <0.001 | 5,473 | 1 | 1 |
| 39 | 5/19 | 5/20 | 2 | 0.00 | 2,437 | 1,721 | 1.42 | <0.001 | 6,453,712 | 1 | 1 |
| 40 | 5/15 | 5/20 | 6 | 0.00 | 60 | 3 | 21 | <0.001 | 841 | 1 | 1 |
| 41 | 5/6 | 5/20 | 15 | 29.36 | 112 | 17 | 6.41 | <0.001 | 4,070 | 2 | 2 |
| 42 | 5/10 | 5/20 | 11 | 45.67 | 150 | 32 | 4.62 | <0.001 | 8,166 | 2 | 2 |
| 43 | 4/6 | 5/20 | 45 | 30.61 | 309 | 116 | 2.67 | <0.001 | 13,014 | 3 | 3 |
| 44 | 4/18 | 5/20 | 33 | 0.00 | 519 | 257 | 2.02 | <0.001 | 42,288 | 1 | 1 |
| 45 | 5/7 | 5/20 | 14 | 0.00 | 105 | 20 | 5.2 | <0.001 | 5,939 | 1 | 1 |
| 46 | 4/25 | 5/20 | 26 | 99.90 | 124 | 29 | 4.23 | <0.001 | 4,072 | 15 | 6 |
| 47 | 4/20 | 5/20 | 31 | 30.03 | 288 | 124 | 2.33 | <0.001 | 22,341 | 3 | 2 |
| 48 | 3/23 | 5/20 | 59 | 77.39 | 581 | 342 | 1.7 | <0.001 | 39,119 | 4 | 2 |
| 49 | 5/13 | 5/20 | 8 | 106.86 | 270 | 121 | 2.24 | <0.001 | 83,127 | 2 | 2 |
| 50 | 3/29 | 5/20 | 53 | 0.00 | 291 | 139 | 2.1 | <0.001 | 15,029 | 1 | 1 |
| 51 | 4/22 | 5/20 | 29 | 26.90 | 155 | 55 | 2.83 | <0.001 | 8,779 | 2 | 2 |
| 52 | 4/7 | 5/20 | 44 | 46.15 | 317 | 165 | 1.92 | <0.001 | 18,980 | 6 | 6 |
| 53 | 5/2 | 5/20 | 19 | 0.00 | 103 | 33 | 3.16 | <0.001 | 7,699 | 1 | 1 |
| 54 | 4/1 | 5/20 | 50 | 53.19 | 83 | 22 | 3.7 | <0.001 | 2,198 | 3 | 3 |
| 55 | 4/14 | 5/20 | 37 | 27.39 | 68 | 16 | 4.25 | <0.001 | 1,791 | 2 | 2 |
| 56 | 4/23 | 5/20 | 28 | 0.00 | 156 | 65 | 2.4 | <0.001 | 10,718 | 1 | 1 |
| 57 | 4/13 | 5/20 | 38 | 21.26 | 248 | 128 | 1.93 | <0.001 | 19,711 | 2 | 2 |
58 4/27 5/20 24 0.00 30 3 10.24 <0.001 323 1 1
59 5/18 5/20 3 0.00 49 9 5.29 <0.001 15,448 1 1
60 4/17 5/20 34 0.00 107 39 2.73 <0.001 8,405 1 1
61 4/18 5/20 33 72.28 534 354 1.51 <0.001 58,978 7 5
62 4/21 5/20 30 140.99 233 125 1.87 <0.001 26,408 6 4
63 4/29 5/20 22 0.00 234 126 1.85 <0.001 30,406 1 1
64 4/22 5/20 29 0.00 115 47 2.43 <0.001 6,538 1 1
65 5/19 5/20 2 0.00 21 2 12.77 <0.001 4,032 1 1
66 5/5 5/20 16 92.43 1,039 796 1.3 <0.001 286,527 2 2
67 4/19 5/20 32 0.00 115 49 2.37 <0.001 10,204 1 1
68 5/8 5/20 13 0.00 192 101 1.9 <0.001 45,852 1 1
69 5/12 5/20 9 0.00 30 4 6.87 <0.001 771 1 1
70 5/17 5/20 4 0.00 123 55 2.23 <0.001 78,471 1 1
71 4/29 5/20 22 0.00 156 79 1.97 <0.001 17,303 1 1
72 3/28 5/20 54 50.34 32 6 5.44 <0.001 656 2 2
73 5/7 5/20 14 80.26 106 46 2.28 <0.001 12,240 5 4
74 4/14 5/20 37 47.62 115 53 2.15 <0.001 7,305 3 3
75 4/9 5/20 42 35.79 123 59 2.09 <0.001 6,343 2 2
76 4/20 5/20 31 0.00 134 68 1.98 <0.001 11,760 1 1
77 4/28 5/20 23 195.74 281 184 1.53 <0.001 48,676 9 4
78 4/16 5/20 35 27.34 243 154 1.57 <0.001 22,877 3 2
79 4/15 5/20 36 0.00 116 59 1.96 <0.001 7,734 1 1
80 4/9 5/20 42 56.31 478 350 1.36 <0.001 49,008 2 1
81 5/18 5/20 3 93.41 130 70 1.86 <0.001 180,113 8 2
82 4/17 5/20 34 0.00 37 11 3.37 <0.001 20,483 1 1
83 4/10 5/20 41 30.49 135 76 1.78 <0.001 9,851 2 2
84 5/14 5/20 7 0.00 125 69 1.82 <0.001 43,779 1 1
85 5/12 5/20 9 27.78 87 44 1.97 0.004 16,827 2 1
86 5/3 5/20 18 0.00 20 4 4.6 0.013 670 1 1
87 5/19 5/20 2 80.43 28 8 3.38 0.019 55,557 12 2

Note: Space-time clusters were identified using the spatial scan statistic with a Poisson model.
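
As a reading aid for the Relative Risk column, the following minimal sketch assumes the relative-risk definition commonly given for the Poisson scan statistic, in which the observed-to-expected ratio inside a cluster is divided by the same ratio outside it. The function name and the toy totals are illustrative assumptions and are not taken from the study.

```python
def relative_risk(observed: float, expected: float, total_cases: float) -> float:
    """Relative risk inside a cluster versus outside (assumed Poisson-scan definition).

    RR = (c / E[c]) / ((C - c) / (C - E[c]))
    """
    inside = observed / expected
    outside = (total_cases - observed) / (total_cases - expected)
    return inside / outside


# Observed-to-expected ratio for Cluster 1 in Table 4 (values read from the table).
cluster1_ratio = 516_153 / 128_515

# Toy numbers (hypothetical): 50 observed, 10 expected, 100 cases overall.
toy_rr = relative_risk(50, 10, 100)
```

Because the denominator describes the area outside the cluster, a reported RR can exceed the plain observed-to-expected ratio, as it does for Cluster 1 (RR 5.51 versus a ratio of roughly 4.0).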

Fig 10. Prospective space-time scan statistic detected clusters of COVID-19 incidents during the study period of 1/22/2020-5/20/2020.

In this fourth period, using the sequence similarity-based clustering, we selected 10 clusters based on the elbow method evaluation (Fig 11). Fig 12 presents a map of these clusters and their temporal signatures. In this period, Cluster 8, which includes Miami, Chicago, Detroit, Los Angeles, Philadelphia and New York metropolitan counties, is the fastest growing in terms of cases. Clusters 7 and 9 start out with similar increases in cases, but Cluster 7 members show a levelling off in early May relative to Cluster 9. Cluster 10 shows a delayed start but a steady increase starting in early April. Cluster 5 follows a different trajectory: it shows a much slower start to case accumulation but then exhibits a sharp increase starting in mid-April, increasing more rapidly than Clusters 10 and 7. Cluster 4 initially falls below the outside cluster “OC” group but then shows a sharp jump and more rapid accumulation. More detailed information on the pairwise comparison of cluster numbers from the two approaches for the 4th study period can be found in S4 Table.

Fig 11. Elbow method evaluation and hierarchical clustering results for the 4th period.

Notice that the numberings and colors of STES clusters match with those of corresponding clusters on the map and the temporal trend graph in Fig 12.
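
To illustrate how an elbow evaluation can suggest a cluster count for hierarchically clustered sequences, the sketch below clusters a few synthetic cumulative-case curves and looks for the sharpest bend in the within-cluster sum of squares. It is only a sketch under stated assumptions: Euclidean distance and average linkage stand in for the STES similarity and linkage used in the study, and all function names and data values are hypothetical.

```python
from itertools import combinations


def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def agglomerate(seqs, k):
    """Average-linkage agglomerative clustering down to k clusters (lists of indices)."""
    clusters = [[i] for i in range(len(seqs))]
    while len(clusters) > k:
        def linkage(pair):
            a, b = pair
            dists = [euclidean(seqs[i], seqs[j]) for i in clusters[a] for j in clusters[b]]
            return sum(dists) / len(dists)
        # Merge the closest pair; a < b, so deleting b leaves index a valid.
        a, b = min(combinations(range(len(clusters)), 2), key=linkage)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


def within_ss(seqs, clusters):
    """Total squared distance of each sequence to its cluster mean."""
    total = 0.0
    for members in clusters:
        mean = [sum(col) / len(members) for col in zip(*(seqs[i] for i in members))]
        total += sum(euclidean(seqs[i], mean) ** 2 for i in members)
    return total


def elbow(wss):
    """Pick k at the largest second difference of the curve (wss[0] is k = 1)."""
    bends = {k: wss[k - 2] - 2 * wss[k - 1] + wss[k] for k in range(2, len(wss))}
    return max(bends, key=bends.get)


# Synthetic cumulative curves (hypothetical): two flat, two late-surge, two early-surge.
seqs = [[0, 1, 2, 3], [0, 1, 3, 4],
        [0, 1, 2, 40], [0, 1, 3, 42],
        [0, 25, 30, 32], [0, 26, 31, 34]]
wss = [within_ss(seqs, agglomerate(seqs, k)) for k in range(1, len(seqs) + 1)]
best_k = elbow(wss)
groups = agglomerate(seqs, best_k)
```

The second-difference rule is one simple way to locate an elbow; in practice the curve is often inspected visually as well, as in Fig 11.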

Fig 12. Sequence similarity-based COVID-19 clusters along with average temporal trends at county level during 1/22-5/20/2020.

This map includes the counties with higher relative risk (RR>1) contained in all the clusters detected by scan statistics in Fig 10. The average temporal trends of cumulative cases for STES clusters 1–10 on the map appear at the bottom right. Notice that the colors of STES clusters match with correspondingly colored dots on the map and with the colors of the STES cluster curves on the graph. OC includes all counties not included in the clusters.

Discussion

For this study we compared two approaches for COVID-19 surveillance. In combination, the two approaches provide complementary views that can offer a more comprehensive picture of surveillance information to further aid public health analysis and monitoring. The space-time scan statistic identifies emerging clusters as locations where the observed number of cases most exceeds the expected number of cases in space-time based on the underlying population. This approach provokes questions of why the disease is emerging at such a location during a period of time. For disease progression, where the temporal pattern is equally important, similarity in the sequence of daily incidence rates adds valuable information as it points to locations where the disease is progressing in a similar fashion. This view provokes questions of why these sometimes spatially dispersed locations are behaving in a similar way.
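
The idea that a cluster is where observed cases most exceed expected cases can be made concrete with the Poisson log-likelihood ratio usually associated with the scan statistic. The sketch below is an assumption about that general form rather than a reproduction of the software used in this study; the function name and the toy candidates are hypothetical.

```python
import math


def poisson_llr(observed: float, expected: float, total_cases: float) -> float:
    """Log-likelihood ratio for one candidate cylinder (assumed Kulldorff Poisson form).

    Returns 0 when the candidate shows no excess of cases, since only
    elevated-risk clusters are of interest here.
    """
    if observed <= expected:
        return 0.0
    inside = observed * math.log(observed / expected)
    rest = total_cases - observed
    # The outside term vanishes when every case falls inside the candidate.
    outside = rest * math.log(rest / (total_cases - expected)) if rest > 0 else 0.0
    return inside + outside


# Hypothetical candidates: (label, observed, expected) out of 1,000 total cases.
candidates = [("A", 120, 100), ("B", 60, 20), ("C", 15, 30)]
scores = {label: poisson_llr(obs, exp, 1000) for label, obs, exp in candidates}
most_likely = max(scores, key=scores.get)
```

Here candidate B ranks first because its excess over expectation is largest in likelihood terms, even though candidate A has more observed cases.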

An initial working hypothesis for the STES sequence similarity metric in an environmental monitoring context was that locations that are spatially close are more likely to exhibit similar event sequences. While this is borne out in some instances in this pandemic context, we found that in all study periods, similar sequence patterns of COVID-19 cases can be quite spatially separated. This result suggests that spatial proximity is not always a driver of sequence similarity. It has been reported that socio-economic or demographic characteristics could explain the different transmission rates or patterns between communities and locations [30]. Because members of these clusters share similar temporal disease progressions, questions arise as to whether they share similar underlying characteristics such as population density, populations at risk, changes in surveillance programs, or possibly intervention strategies at work.

Sequence similarity Cluster 3 in the first study period, which covers the first appearance of COVID-19 in the US, shows the earliest and fastest accumulating number of cases, suggesting initial points of entry. Members of this cluster include Snohomish and King counties in Washington State, several California counties in the San Francisco Bay area, and Bronx, Kings, Queens, Nassau, and New York counties in New York State, and these align with the known entry points on the east and west coasts. Seemingly unusual members in this cluster are Johnson County, Iowa; Kershaw County, South Carolina; Williamson, Tennessee; and Douglas, Nebraska. An interesting question is why this last subgroup of locations shares a similar profile with the coastal points of entry. Sequence similarity-based Cluster 2 in the first period is another interesting collection, one that is very spatially dispersed. Most of the members are rural communities, including Sheridan, Wyoming; Davison, South Dakota; Jackson, Oklahoma; Hancock, Indiana; Pitkin, Colorado; Caddo, Louisiana; and Pierce, Wisconsin. The temporal profile for this group is initially flat until mid-March, at which point it shows a very rapid accumulation of cases. Such spatially dispersed cluster members that exhibit similar behaviours are targets for further investigation of potential contextual similarities. Of particular interest from epidemiological and health policy perspectives are spatially dispersed cluster members that exhibit similar flattening or decreasing patterns, as these would be worth exploring to understand whether they have similar demographic characteristics or shared similar intervention measures.

We note that the sequence similarity clusters suggest some connections which are not conveyed by the scan statistic clusters. For example, in the third study period the scan statistic results indicate several new clusters. An examination of the sequence similarity clusters in this period indicates that several members of Cluster 10 were First Nation or tribal reservations. In other words, several of the spatially dispersed reservations across the west showed a similar onset and progression in COVID-19 cases.

Another difference between the two approaches is that the sequence similarity-based clusters starting in the third period begin to show evidence of a spatial diffusion effect. For example, members of Cluster 8 with the earliest and fastest accumulating sequence similarity often appear to be surrounded by or in close spatial association with the next closest lagging group, Cluster 9. A similar pattern appears between Cluster 8 and Cluster 9 members in the fourth study period.

Recent research has pointed to different continents of origin for the introduction of COVID-19 into the US [31, 32]. Genomic epidemiology research supports the belief that isolates from China primarily seeded the original COVID-19 outbreak on the US West Coast and that European isolates seeded the pandemic in New York (and the US East Coast) [33]. Given some connectivity suggested by the sequence similarity based approach there may exist opportunities for productive combination with phylogenetic tracing and transmission pathway studies [34].

We recognize that both approaches can be impacted by limitations in data collection. Several publications have noted reporting lags, although these are most problematic with respect to death reports rather than daily reported case counts [35–38]. There is clearly the potential for inaccuracies in data collection covering many different jurisdictions. If, for example, reports of new cases from a jurisdiction are delayed by a day or two, this could change the similarity in the sequences of county daily case counts. However, given the length of the study periods here, we expect lags of one to two days to have a minor impact.
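
The expectation that short reporting lags matter little over long study periods can be explored with a simple sensitivity sketch. It delays one series of daily counts by a day or two and measures how far the delayed cumulative curve moves from the original, compared with how far a differently shaped curve sits. The synthetic counts, the function names, and the use of Euclidean distance are illustrative assumptions, not the study's data or its similarity metric.

```python
from itertools import accumulate


def delay(daily, lag):
    """Shift daily counts later by `lag` days, keeping the series length fixed."""
    return [0] * lag + list(daily[: len(daily) - lag])


def curve_distance(daily_a, daily_b):
    """Euclidean distance between the cumulative curves of two daily series."""
    cum_a, cum_b = accumulate(daily_a), accumulate(daily_b)
    return sum((x - y) ** 2 for x, y in zip(cum_a, cum_b)) ** 0.5


# Synthetic 60-day series (hypothetical): steady growth versus a late surge.
steady = [d // 3 for d in range(60)]
late_surge = [0] * 40 + [3 * d for d in range(20)]

lag1 = curve_distance(steady, delay(steady, 1))
lag2 = curve_distance(steady, delay(steady, 2))
different_shape = curve_distance(steady, late_surge)
```

Comparing `lag1` and `lag2` against `different_shape` gives a rough sense of whether a short lag is small relative to the separation between genuinely distinct trajectories.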

Supporting information

S1 Table. Comparison of space-time clusters from SaTScan and STES based hierarchical clustering with the dataset from 1/23-3/13/2020.

This table is merged through FIPS of US counties, and also includes other selected output parameters from SaTScan such as p-values, LOC_RR (location or county relative risk), CLU_RR (cluster relative risk), LOC_LAT (location latitude), LOC_LONG (location longitude).

(XLSX)

S2 Table. Comparison of space-time clusters from SaTScan and STES based hierarchical clustering with the dataset from 1/23-3/31/2020.

This table is merged through FIPS of US counties, and also includes other selected output parameters from SaTScan such as p-values, LOC_RR (location or county relative risk), CLU_RR (cluster relative risk), LOC_LAT (location latitude), LOC_LONG (location longitude).

(XLSX)

S3 Table. Comparison of space-time clusters from SaTScan and STES based hierarchical clustering with the dataset from 1/23-4/19/2020.

This table is merged through FIPS of US counties, and also includes other selected output parameters from SaTScan such as p-values, LOC_RR (location or county relative risk), CLU_RR (cluster relative risk), LOC_LAT (location latitude), LOC_LONG (location longitude).

(XLSX)

S4 Table. Comparison of space-time clusters from SaTScan and STES based hierarchical clustering with the dataset from 1/23-5/20/2020.

This table is merged through FIPS of US counties, and also includes other selected output parameters from SaTScan such as p-values, LOC_RR (location or county relative risk), CLU_RR (cluster relative risk), LOC_LAT (location latitude), LOC_LONG (location longitude).

(XLSX)

S5 Table. The minimal data set underlying the results described in this manuscript.

(CSV)

Data Availability

All data used in the study are available as S1–S5 Tables provided with this submission.

Funding Statement

The authors received no specific funding for this work.

References

  • 1. Huang C, Wang Y, Li X, Ren L, Zhao J, Hu Y, et al. Clinical features of patients infected with 2019 novel coronavirus in Wuhan, China. Lancet. 2020;395(10223):497–506. doi: 10.1016/S0140-6736(20)30183-5
  • 2. Saglietto A, D’Ascenzo F, Zoccai GB, De Ferrari GM. COVID-19 in Europe: the Italian lesson. Lancet. 2020;395(10230):1110–1. doi: 10.1016/S0140-6736(20)30690-5
  • 3. Danon L, Brooks-Pollock E, Bailey M, Keeling MJ. A spatial model of CoVID-19 transmission in England and Wales: early spread and peak timing. medRxiv [Preprint]. 2020. medRxiv 20022566 [posted 2020 Feb 14; cited 2020 Sept 10]. Available from: https://www.medrxiv.org/content/10.1101/2020.02.12.v1.
  • 4. Alamo T, Reina DG, Mammarella M, Abella A. Open data resources for fighting covid-19. arXiv 200406111 [Preprint]. 2020. [posted Apr 13; last revised May 11; cited Sept 10]. Available from: https://arxiv.org/abs/04.06111.
  • 5. Latif S, Usman M, Manzoor S, Iqbal W, Qadir J, Tyson G, et al. Leveraging data science to combat covid-19: A comprehensive review. IEEE Transactions on Artificial Intelligence. 2020;1(1):85–103.
  • 6. Moorthy V, Restrepo AMH, Preziosi M-P, Swaminathan S. Data sharing for novel coronavirus (COVID-19). Bulletin of the World Health Organization. 2020;98(3):150. doi: 10.2471/BLT.20.251561
  • 7. Kulldorff M. A spatial scan statistic. Communications in Statistics-Theory and methods. 1997;26(6):1481–96.
  • 8. Kulldorff M. Spatial scan statistics: models, calculations, and applications. Scan statistics and applications: Springer; 1999. p. 303–22.
  • 9. Kulldorff M. Prospective time periodic geographical disease surveillance using a scan statistic. Journal of the Royal Statistical Society: Series A (Statistics in Society). 2001;164(1):61–72.
  • 10. Kulldorff M, Heffernan R, Hartman J, Assuncao R, Mostashari F. A space-time permutation scan statistic for disease outbreak detection. PLoS Med. 2005;2(3):e59. doi: 10.1371/journal.pmed.0020059
  • 11. Khan D, Rossen LM, Hamilton BE, He Y, Wei R, Dienes E. Hot spots, cluster detection and spatial outlier analysis of teen birth rates in the U.S., 2003–2012. Spatial and Spatio-temporal Epidemiology. 2017;21:67–75. doi: 10.1016/j.sste.2017.03.002
  • 12. Desjardins MR, Hohl A, Delmelle EM. Rapid surveillance of COVID-19 in the United States using a prospective space-time scan statistic: Detecting and evaluating emerging clusters. Applied Geography. 2020;118:102202. doi: 10.1016/j.apgeog.2020.102202
  • 13. Qi H, Xiao S, Shi R, Ward MP, Chen Y, Tu W, et al. COVID-19 transmission in Mainland China is associated with temperature and humidity: A time-series analysis. Science of the Total Environment. 2020:138778. doi: 10.1016/j.scitotenv.2020.138778
  • 14. Jones RC, Liberatore M, Fernandez JR, Gerber SI. Use of a prospective space-time scan statistic to prioritize shigellosis case investigations in an urban jurisdiction. Public Health Reports. 2006;121(2):133–9. doi: 10.1177/003335490612100206
  • 15. Yih WK, Deshpande S, Fuller C, Heisey-Grove D, Hsu J, Kruskal BA, et al. Evaluating real-time syndromic surveillance signals from ambulatory care data in four states. Public Health Reports. 2010;125(1):111–20. doi: 10.1177/003335491012500115
  • 16. Yin F, Li X, Ma J, Feng Z. The early warning system based on the prospective space-time permutation statistic. Wei Sheng Yan Jiu (in Chinese: Journal of Hygiene Research). 2007;36(4):455–8.
  • 17. Duczmal LH, Moreira GJ, Burgarelli D, Takahashi RH, Magalhães FC, Bodevan EC. Voronoi distance based prospective space-time scans for point data sets: a dengue fever cluster analysis in a southeast Brazilian town. International Journal of Health Geographics. 2011;10(1):29.
  • 18. Hohl A, Delmelle E, Desjardins M, Lan Y. Daily surveillance of COVID-19 using the prospective space-time scan statistic in the United States. Spatial and Spatio-temporal Epidemiology. 2020:100354.
  • 19. Li M, Shi X, Li X, Ma W, He J, Liu T. Sensitivity of disease cluster detection to spatial scales: an analysis with the spatial scan statistic method. International Journal of Geographical Information Science. 2019;33(11):2125–52.
  • 20. Dong E, Du H, Gardner L. An interactive web-based dashboard to track COVID-19 in real time. Lancet Infectious Diseases. 2020;20(5):533–4. doi: 10.1016/S1473-3099(20)30120-1
  • 21. He W, Yi GY, Zhu Y. Estimation of the basic reproduction number, average incubation time, asymptomatic infection rate, and case fatality rate for COVID-19: Meta-analysis and sensitivity analysis. Journal of Medical Virology. 2020;92(11):2543–50. doi: 10.1002/jmv.26041
  • 22. Kulldorff M, Mostashari F, Duczmal L, Katherine Yih W, Kleinman K, Platt R. Multivariate scan statistics for disease surveillance. Statistics in Medicine. 2007;26(8):1824–33. doi: 10.1002/sim.2818
  • 23. Jaccard P. Étude comparative de la distribution florale dans une portion des Alpes et des Jura. Bull Soc Vaudoise Sci Nat. 1901;37:547–79.
  • 24. Sun S-B, Zhang Z-H, Dong X-L, Zhang H-R, Li T-J, Zhang L, et al. Integrating Triangle and Jaccard similarities for recommendation. PloS One. 2017;12(8):e0183570. doi: 10.1371/journal.pone.0183570
  • 25. Ayub M, Ghazanfar MA, Maqsood M, Saleem A, editors. A Jaccard base similarity measure to improve performance of CF based recommender systems. 2018 International Conference on Information Networking (ICOIN), Chiang Mai, Thailand; 2018, pp. 1–6.
  • 26. Ros F, Guillaume S. A hierarchical clustering algorithm and an improvement of the single linkage criterion to deal with noise. Expert Systems with Applications. 2019;128:96–108.
  • 27. Syakur M, Khotimah B, Rochman E, Satoto B. Integration k-means clustering method and elbow method for identification of the best customer profile cluster. IOP Conference Series: Materials Science and Engineering. 2018;336(1):012017.
  • 28. Gustriansyah R, Suhandi N, Antony F. Clustering optimization in RFM analysis based on k-means. Indones J Electr Eng Comput Sci. 2020;18(1):470–7.
  • 29. Zambelli AE. A data-driven approach to estimating the number of clusters in hierarchical clustering. F1000Research. 2016;5. doi: 10.12688/f1000research.10103.1
  • 30. Dowd JB, Andriano L, Brazel DM, Rotondi V, Block P, Ding X, et al. Demographic science aids in understanding the spread and fatality rates of COVID-19. Proceedings of the National Academy of Sciences. 2020;117(18):9696–8. doi: 10.1073/pnas.2004911117
  • 31. Gonzalez-Reiche AS, Hernandez MM, Sullivan MJ, Ciferri B, Alshammary H, Obla A, et al. Introductions and early spread of SARS-CoV-2 in the New York City area. Science. 2020;369(6501):297–301. doi: 10.1126/science.abc1917
  • 32. Worobey M, Pekar J, Larsen BB, Nelson MI, Hill V, Joy JB, et al. The emergence of SARS-CoV-2 in Europe and North America. Science. 2020;370(6516):564–70. doi: 10.1126/science.abc8169
  • 33. Deng X, Gu W, Federman S, Du Plessis L, Pybus OG, Faria NR, et al. Genomic surveillance reveals multiple introductions of SARS-CoV-2 into Northern California. Science. 2020;369(6503):582–7. doi: 10.1126/science.abb9263
  • 34. Zhang W, Govindavari JP, Davis BD, Chen SS, Kim JT, Song J, et al. Analysis of genomic characteristics and transmission routes of patients with confirmed SARS-CoV-2 in Southern California during the early stage of the US COVID-19 pandemic. JAMA Network Open. 2020;3(10):e2024191–e. doi: 10.1001/jamanetworkopen.2020.24191
  • 35. Casella F. Can the COVID-19 epidemic be controlled on the basis of daily test reports? IEEE Control Systems Letters. 2020;5(3):1079–84.
  • 36. Aliprantis D, Tauber K. Measuring deaths from COVID-19. Economic Commentary. 2020;18:1–7.
  • 37. Angelopoulos AN, Pathak R, Varma R, Jordan MI. On identifying and mitigating bias in the estimation of the COVID-19 case fatality rate. Harvard Data Science Review. 2020;Special Issue 1-COVID-19.
  • 38. Kogan NE, Clemente L, Liautaud P, Kaashoek J, Link NB, Nguyen AT, et al. An early warning approach to monitor COVID-19 activity with multiple digital traces in near real time. Science Advances. 2021;7(10):eabd6989. doi: 10.1126/sciadv.abd6989

Decision Letter 0

Agricola Odoi

29 Dec 2020

PONE-D-20-26698

Space-Time Surveillance of COVID-19 Emerging Hotspots using Prospective Scan Statistics Enhanced by Spatiotemporal Event Sequence Based Clustering

PLOS ONE

Dear Dr. Beard,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses all the issues raised by the two reviewers.

Please submit your revised manuscript by Jan 29 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Agricola Odoi, BVM, MSc, PhD, FAHA, FACE

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. In your Data Availability statement, you have not specified where the minimal data set underlying the results described in your manuscript can be found. PLOS defines a study's minimal data set as the underlying data used to reach the conclusions drawn in the manuscript and any additional data required to replicate the reported study findings in their entirety. All PLOS journals require that the minimal data set be made fully available. For more information about our data policy, please see http://journals.plos.org/plosone/s/data-availability.

Upon re-submitting your revised manuscript, please upload your study’s minimal underlying data set as either Supporting Information files or to a stable, public repository and include the relevant URLs, DOIs, or accession numbers within your revised cover letter. For a list of acceptable repositories, please see http://journals.plos.org/plosone/s/data-availability#loc-recommended-repositories. Any potentially identifying patient information must be fully anonymized.

Important: If there are ethical or legal restrictions to sharing your data publicly, please explain these restrictions in detail. Please see our guidelines for more information on what we consider unacceptable restrictions to publicly sharing data: http://journals.plos.org/plosone/s/data-availability#loc-unacceptable-data-access-restrictions. Note that it is not acceptable for the authors to be the sole named individuals responsible for ensuring data access.

We will update your Data Availability statement to reflect the information you provide in your cover letter.

3. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ

4. We note that Figures 1- 8 in your submission contain map images which may be copyrighted. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For these reasons, we cannot publish previously copyrighted maps or satellite images created using proprietary data, such as Google software (Google Maps, Street View, and Earth). For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

4.1.    You may seek permission from the original copyright holder of Figures 1-8 to publish the content specifically under the CC BY 4.0 license. 

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

4.2.    If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

The following resources for replacing copyrighted map figures may be helpful:

USGS National Map Viewer (public domain): http://viewer.nationalmap.gov/viewer/

The Gateway to Astronaut Photography of Earth (public domain): http://eol.jsc.nasa.gov/sseop/clickmap/

Maps at the CIA (public domain): https://www.cia.gov/library/publications/the-world-factbook/index.html and https://www.cia.gov/library/publications/cia-maps-publications/index.html

NASA Earth Observatory (public domain): http://earthobservatory.nasa.gov/

Landsat: http://landsat.visibleearth.nasa.gov/

USGS EROS (Earth Resources Observatory and Science (EROS) Center) (public domain): http://eros.usgs.gov/#

Natural Earth (public domain): http://www.naturalearthdata.com/

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Partly

Reviewer #2: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The authors of the study used prospective space-time scan statistic on number of positive COVID-19 cases in counties to determine location and period of most likely space-time clusters in the continental US. Case counts from counties in most likely clusters were then analyzed through hierarchical agglomerative clustering method to further characterize groups of counties with similar patterns in how frequency of cases (standardized by population size) develop over time. Demographic factors were then used in order to evaluate any differences between clusters. Researchers reported that counties detected in the same space-time clusters were classified into different hierarchical clusters. This is an interesting finding, and a topic worthy of investigation; although not completely surprising. When methods are combined, this could be done in many different ways and authors used one possible approach - at the very end of the manuscript, they offered some alternatives. This alternative approach was something that I was wondering from the very beginning (no actionable comment here).

In my opinion, by far the largest limitation of the manuscript is its length. Manuscript with 38 pages of text (including tables) is more a technical report than a classical journal article. This is clearly a consequence of thorough analysis, but reading of such contribution is very demanding on the reader, particularly when a lot of technical results consists of reporting counties in most likely clusters. This was done for different time periods, followed by hierarchical clustering. Of course, number of clusters in spatial analysis and in multivariate analysis will be different and this further complicates reading. Is it possible to present an analysis for only one time period (the one authors feel is the most informative), and for the rest of results to be offered in the supplementary material? With a short commentary point on how are other results similar or dissimilar? Also, in multivariate cluster analysis, it is often attempted to explain nature of clusters identified. Although this was attempted to some degree here, this information was diluted among many other details.

Because of manuscript length, it is easy to miss some technical details.

For example.

L104 - are the authors suggesting that RR is available for each location in the cluster? I am not sure that this is in line with how scan statistics work. The RR risk is estimated for the population in the scanning window, or maybe I missed the point here.

L116. Is it possible to determine false positives based on results from scan statistics? Could RR<1 be indication of cluster of cases with lower than expected risk, rather than false positive?

L121. Perhaps check wording. Is it all counties with high RR or all counties from significant clusters with high RR?

L159. Is "severe" the best adjective to use. severe typically refers to clinical expression, whereas cases are just reflection of incidence, some are likely asymptomatic.

L163. Authors should perhaps explain whether they used Poisson model or space-time permutation model. They are different methods.

L163. In this section, it would also be useful to explain how is prospective statistic different from retrospective which is commonly applied in retrospective research studies.

In the methods section - authors did not mention division into different time periods, and what was the rationale for that. reader learns about that in the results section.

L239. "spread" may not be the best word to use here. Clusters do not spread.

Throughout the manuscript, some wording should be improved. e.g. L384 writes about 4 statistically significant counties. is it counties or clusters that are significant?

In addition, the authors try to compare demographics across different clusters, but are not trying to do any statistical testing which could be helpful as a decision point. Is there a reason for that? Are the authors concerned that many comparisons could lead to some false findings?

Reviewer #2: Manuscript #: PONE-D-20-26698

Title: Space-time surveillance of COVID-19 emerging hotspots using prospective scan statistics enhanced by spatiotemporal event sequence based clustering.

Authors: Fuyu Xu & Kate Beard

General comments:

Xu and Beard have used an innovative approach to classifying the temporal patterns of “epidemic curves” at the county level that could make a very important contribution to the epidemiological and disease surveillance literature. Unfortunately, the current manuscript suffers from too much repetition, taking on too many objectives, and perhaps taking a less than ideal approach for integrating their event sequence based clustering with scan statistics. Specifically, the following should be addressed:

1. While there are a number of good reasons to conduct prospective scans, in the context of this manuscript repeating the analysis several times for different periods during the pandemic in the contiguous US states makes the manuscript exceedingly repetitive. I would recommend the authors pick a single period for the analysis and use it consistently throughout the manuscript to exemplify their approach.

2. In terms of the use of the scan statistic, there are a number of major issues that should be addressed by the authors:

i. The decision to limit the maximum size of a cluster to 10% of the population appears to be arbitrary. I would recommend using the default of 50% or less (also the maximum that can be used) of the population. This value does not prevent smaller clusters from being detected, but prevents the need for arbitrary values and is the whole point of the flexible scanning window. In reviewing the figures, it is clear that during certain periods a large number of small clusters are likely part of a larger cluster.

ii. The authors need to clearly state what rule was used for reporting space-time clusters in terms of spatial overlap and preferably spatio-temporal overlap.

iii. The authors are unclear whether they are using a Poisson model or the space-time permutation model for their space-time scans. On lines 166-167, they refer to the permutation model and later to the Poisson distribution on line 185. The space-time permutation model is not based on a Poisson distribution and only requires case data.

iv. If the authors used a Poisson model in SaTScan, it is unclear why they did not adjust for age and sex; this type of standardization is common in most epidemiological analyses. If they used a space-time permutation model, they should recognize that these models identify space-time clusters while inherently adjusting for purely spatial and temporal clusters; in other words this model would adjust for demographic and socio-economic factors if they were related to spatial location.

3. It is unclear why the authors did not apply their event sequence based clustering to all counties. The value seems to be diminished by only applying the technique to locations within active space-time clusters. In fact, investigating if there were spatial clusters of these event sequence clusters, using a multinomial model, would have been a very interesting and perhaps a more appropriate way to combine the two approaches and comment on whether or not these event sequence clusters were randomly distributed or had particular “hotspots”. It would certainly be interesting to compare space-time “hotspots” for rates of disease with spatial clusters of these event sequence clusters.

4. The event sequence clusters shared in geographically distant regions (e.g., Pacific Northwest and New York) likely reflect when COVID-19 was introduced into the US even though molecular sequencing suggests the introduction in these regions was likely from different continents of origin (i.e., Europe vs. Asia). Some discussion concerning the epidemiological interpretation of clusters based on the greater literature is warranted.

5. The socio-demographic analyses should probably be removed from this manuscript. I would recommend performing these analyses using multivariable multinomial regression models based on the event sequence cluster classification for each county in a separate manuscript. Currently, the descriptive comments concerning socio-demographic factors within clusters identified using different methods are not particularly insightful in the current draft of the manuscript. I suspect the authors have put too much in one manuscript and this section deserves much more detail and a stronger analysis.

Specific comments:

Title:

i. The event sequence based clustering was applied to counties during a specific time period, but it did not account for spatial location. It might be better for the authors to state that they are examining the distribution of counties classified based on event sequence based clustering with respect to space-time clusters of COVID-19. If they agree with comment 3 in my general comments, it might be better to focus on the spatial clustering of event sequence based clusters of COVID-19.

Introduction:

i. Lines 60-62: This statement is not really correct. Where and when these measures were implemented, there was success in “flattening the curve” (even in the US), and strict measures did control the disease in parts of the world where the political, economic, and sociological conditions allowed for their strict implementation. I would remove this sentence since it is not relevant to the authors’ work or the need for surveillance tools for the continuing pandemic.

ii. Lines 97-98. This statement is not accurate. The space-time permutation model that is available with SaTScan does not use or require background population data although as a result, it is subject to population shift bias.

iii. Lines 108-109. There’s nothing that prevents an individual from making these comparisons. Please note that the point of the scan statistic is to detect if there are clusters in space, time, and space-time with significantly higher or lower levels of disease without predefining the geographical or temporal size of these clusters. There is no implication that every sub-region within a cluster shares the same rates any more than one could assume that the rate of disease among cities, towns, or villages within a county were homogeneous.

iv. Lines 110-129. This reads more like a summary of methods. I would strongly suggest the authors clearly state their research objectives (i.e., what are they trying to discover or compare rather than a brief description of the methods).

Methods:

i. Please review suggestions in the general comments especially concerning the scan statistic.

ii. Lines 156-161. The authors should use appropriate epidemiological terms. The authors need to clearly state they are calculating the crude incidence rate. However, I would strongly encourage them to consider calculating the age and sex adjusted incidence rates for their subsequent event sequence based clustering.

iii. Lines 183-184. The authors state that the duration of a cluster was set at 2 days, but in the results they have space-time clusters of 1 day in length. The authors should revise the methods or results for consistency.

iv. Line 188. Please note the number of Monte Carlo replications performed and whether scans were performed as 1-tailed tests needs to be stated.

v. Line 198. Please replace the phrase “cases normalized to the county population” with the term incidence rate throughout the manuscript.

Results:

i. Please make certain tables of clusters have consistent titles and select one short form to differentiate event sequence based clusters from space-time clusters and use it consistently throughout the manuscript. Please avoid repeating the methods in the results and use subheadings to differentiate the different statistical approaches being used. Avoid including discussion in the results section.

ii. The tables and figures provide a great deal of summary detail. Please use the text to describe general locations and major characteristics of clusters. The listing of each county within a cluster or each county with a high rate of disease in the text is not necessary. For tables of space-time clusters, please include a column for the radius; the column for the log likelihood is unnecessary. Some authors would include latitude and longitude in these tables, but the figures are sufficient for spatial information.

iii. The figures concerning the “elbow method” and the dendrograms should be moved from the supplemental material into the main manuscript. If the authors follow the suggestions in the general comments, this figure would only be needed for the one period being examined.

iv. Line 339. Please remove the term “emerging”.

v. Line 450. Please note that it is the cluster that is statistically significant and not the counties. Please remove similar phrases from the text.

vi. Line 473. Figures and tables need to be numbered in the order they appear in the text. Currently, figure 9 is listed before figures 7 and 8.

vii. Line 522-523. Please remove results that are not statistically significant from the text and figures.

Discussion:

i. Please revise the discussion after revising the manuscript.

References:

i. Please make certain the references are consistently formatted and all information is included. For instance, the formatting of journal titles in terms of capital letter is inconsistent and the journal is missing from some references.

Tables and figures

i. Please do not include Excel files in the supplemental material. If you believe this material is useful for the reader, generate tables in PDF format with proper variable names and footnotes for any short forms.

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Jun 10;16(6):e0252990. doi: 10.1371/journal.pone.0252990.r002

Author response to Decision Letter 0


12 Feb 2021

Response to reviewers

Reviewer #1: The authors of the study used prospective space-time scan statistic on number of positive COVID-19 cases in counties to determine location and period of most likely space-time clusters in the continental US. Case counts from counties in most likely clusters were then analyzed through hierarchical agglomerative clustering method to further characterize groups of counties with similar patterns in how frequency of cases (standardized by population size) develop over time. Demographic factors were then used in order to evaluate any differences between clusters. Researchers reported that counties detected in the same space-time clusters were classified into different hierarchical clusters. This is an interesting finding, and a topic worthy of investigation; although not completely surprising. When methods are combined, this could be done in many different ways and authors used one possible approach - at the very end of the manuscript, they offered some alternatives. This alternative approach was something that I was wondering from the very beginning (no actionable comment here).

In my opinion, by far the largest limitation of the manuscript is its length. Manuscript with 38 pages of text (including tables) is more a technical report than a classical journal article. This is clearly a consequence of thorough analysis, but reading of such contribution is very demanding on the reader, particularly when a lot of technical results consists of reporting counties in most likely clusters. This was done for different time periods, followed by hierarchical clustering. Of course, number of clusters in spatial analysis and in multivariate analysis will be different and this further complicates reading. Is it possible to present an analysis for only one time period (the one authors feel is the most informative), and for the rest of results to be offered in the supplementary material? With a short commentary point on how are other results similar or dissimilar? Also, in multivariate cluster analysis, it is often attempted to explain nature of clusters identified. Although this was attempted to some degree here, this information was diluted among many other details. We recognize that there was too much redundancy and detailing of clusters in the initial manuscript. We did consider a focus on a single study period, but we feel there are some interesting comparisons to convey over the four periods. To be responsive to the reviewer’s concern of length and redundancy we have substantially simplified the results section to focus on a more directed comparison of the two approaches.

Because of manuscript length, it is easy to miss some technical details.

For example.

L104 - are the authors suggesting that RR is available for each location in the cluster? I am not sure that this is in line with how scan statistics work. The RR risk is estimated for the population in the scanning window, or maybe I missed the point here.

SaTScan does compute the RR for individual locations within a cluster as well as the cluster RR. The RR for a county within a cluster is calculated using the following equation, as used in Hohl et al. 2020:

$$RR_{cty} = \frac{c/e}{(C-c)/(C-e)}$$

where $c$ is the total number of cases in a county, $C$ is the total number of observed cases in the conterminous US, and $e$ is the expected number of cases in a county, calculated as $e = p_{cty} \cdot C / P$, with $p_{cty}$ the county population and $P$ the total population.
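As a rough illustration of the county-level relative risk described in this response, the calculation can be sketched in Python. The denominator is read here as the ratio (C - c)/(C - e), as in Hohl et al. 2020. The function name, argument names, and example numbers are hypothetical and are not taken from the manuscript or from SaTScan.

```python
import math


def county_relative_risk(cases_cty, pop_cty, total_cases, total_pop):
    """Relative risk of one county versus the rest of the study area.

    Hypothetical helper: RR = (c / e) / ((C - c) / (C - e)),
    with e = p_cty * C / P the expected case count for the county.
    """
    expected = pop_cty * total_cases / total_pop  # e
    if expected <= 0 or expected >= total_cases:
        raise ValueError("expected count must lie strictly between 0 and C")
    inside = cases_cty / expected  # observed / expected in the county
    outside = (total_cases - cases_cty) / (total_cases - expected)
    if outside == 0:
        # Every case falls in this county, so the ratio is unbounded.
        return math.inf
    return inside / outside


# Made-up numbers: 50 cases in a county of 10,000 people, out of
# 1,000 cases among 1,000,000 people overall (expected count = 10).
rr = county_relative_risk(50, 10_000, 1_000, 1_000_000)
print(round(rr, 3))  # 5.211
```

An RR above 1 means the county has more cases than its population share would predict, which is the criterion the authors describe for selecting counties for the STES analysis.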

L116. Is it possible to determine false positives based on results from scan statistics? Could RR<1 be indication of cluster of cases with lower than expected risk, rather than false positive?

An RR < 1 would not be considered a false positive in the sense of the RR for the cluster. Our thinking here is that, in the context of a detected cluster, a specific county within the cluster with a county RR < 1 might be construed as a false positive. We have, however, changed the text to avoid any misrepresentation in this regard.

L121. Perhaps check wording. Is it all counties with high RR or all counties from significant clusters with high RR? Individual counties with RR>1 computed as indicated above were used in the STES analysis.

L159. Is "severe" the best adjective to use. severe typically refers to clinical expression, whereas cases are just reflection of incidence, some are likely asymptomatic. We agree that this term is not appropriate here and have removed this text.

L163. Authors should perhaps explain whether they used Poisson model or space-time permutation model. They are different methods. The prospective Poisson space-time model was used. We apologize for the oversight and ambiguity.

L163. In this section, it would also be useful to explain how is prospective statistic different from retrospective which is commonly applied in retrospective research studies. A more specific distinction between them has been added to the text.

In the methods section - authors did not mention division into different time periods, and what was the rationale for that. reader learns about that in the results section. In the revised manuscript we have explicitly noted the study time period in the Methods section with some justification for these period intervals.

L239. "spread" may not be the best word to use here. Clusters do not spread. We agree and have changed the text to "as the drivers for these identified clusters."

Throughout the manuscript, some wording should be improved. e.g. L384 writes about 4 statistically significant counties. is it counties or clusters that are significant? We agree and have removed the incorrect reference to statistically significant counties.

In addition, the authors try to compare demographics across different clusters, but are not trying to do any statistical testing which could be helpful as a decision point. Is there a reason for that?

Are the authors concerned that many comparisons could lead to some false findings? In the revisions to the manuscript, we have removed the demographic analysis section as suggested by reviewer 2.

Reviewer #2: Manuscript #: PONE-D-20-26698

Title: Space-time surveillance of COVID-19 emerging hotspots using prospective scan statistics enhanced by spatiotemporal event sequence based clustering.

Authors: Fuyu Xu & Kate Beard

General comments:

Xu and Beard have used an innovative approach to classifying the temporal patterns of “epidemic curves” at the county level that could make a very important contribution to the epidemiological and disease surveillance literature. Unfortunately, the current manuscript suffers from too much repetition, taking on too many objectives, and perhaps taking a less than ideal approach for integrating their event sequence based clustering with scan statistics. Specifically, the following should be addressed:

We recognize that there were too many objectives, too much redundancy and detailing of clusters in the initial manuscript. We did consider a focus on a single study period, but we feel there are some interesting comparisons to convey over the four periods, so we have retained the four study period comparison. To be responsive to the reviewer’s concern of length and redundancy we have substantially simplified the results section to focus on a more directed comparison of the two approaches.

1. While there are a number of good reasons to conduct prospective scans, in the context of this manuscript repeating the analysis several times for different periods during the pandemic in the contiguous US states makes the manuscript exceedingly repetitive. I would recommend the authors pick a single period for the analysis and use it consistently throughout the manuscript to exemplify their approach. We have revised the manuscript to remove the redundancies of describing each period in detail. We feel that the comparison between the space-time scan and the sequence similarity clusters benefits from a comparison over a sequence of time periods. Thus, we have retained the 4 study periods but limit the results and discussion to key comparisons of what is conveyed in the space-time scan view versus the temporal view conveyed by the sequence similarity.

2. In terms of the use of the scan statistic, there are a number of major issues that should be addressed by the authors:

i. The decision to limit the maximum size of a cluster to 10% of the population appears to be arbitrary. I would recommend using the default of 50% or less (also the maximum that can be used) of the population. This value does not prevent smaller clusters from being detected, but prevents the need for arbitrary values and is the whole point of the flexible scanning window. In reviewing the figures, it is clear that during certain periods a large number of small clusters are likely part of a larger cluster. Determining a specific upper bound on the scanning window size is explained in papers by Kulldorff et al. and in the SaTScan User Guide 9.6. The optimum maximum size of the scanning window should be determined on a case-by-case basis. A limit of 10% of the population was used in similar research (Hohl et al. 2020), and it seemed reasonable to replicate that here. We did experiment with using 50% of the population as an upper limit, which resulted in some extremely large clusters, especially at the early stage of the pandemic.

ii. The authors need to clearly state what rule was used for reporting space-time clusters in terms of spatial overlap and preferably spatio-temporal overlap. There is no spatial overlap in clusters in the output results, but temporal overlaps do occur.

iii. The authors are unclear whether they are using a Poisson model or the space-time permutation model for their space-time scans. On lines 166-167, they refer to the permutation model and later to the Poisson distribution on line 185. The space-time permutation model is not based on a Poisson distribution and only requires case data. The prospective Poisson space-time model was used. We apologize for the oversight and ambiguity and have made clarifications in the text.

iv. If the authors used a Poisson model in SaTScan, it is unclear why they did not adjust for age and sex; this type of standardization is common in most epidemiological analyses. If they used a space-time permutation model, they should recognize that these models identify space-time clusters while inherently adjusting for purely spatial and temporal clusters; in other words this model would adjust for demographic and socio-economic factors if they were related to spatial location. We did not have information on age and sex for confirmed cases, so we were not able to make these adjustments. We also note in the text that while deaths from COVID-19 are several times higher in older age groups, infections can affect all segments of the population.

3. It is unclear why the authors did not apply their event sequence based clustering to all counties. The value seems to be diminished by only applying the technique to locations within active space-time clusters. In fact, investigating if there were spatial clusters of these event sequence clusters, using a multinomial model, would have been a very interesting and perhaps a more appropriate way to combine the two approaches and comment on whether or not these event sequence clusters were randomly distributed or had particular “hotspots”. It would certainly be interesting to compare space-time “hotspots” for rates of disease with spatial clusters of these event sequence clusters. We did apply the sequence similarity-based clustering to all counties, but in this early period of the pandemic many counties had no cases. We have now included these in one “Outsiders” category in the temporal profiles that have been added to sequence similarity cluster maps. As an objective was to compare differences between space-time scan and sequence similarity clusters, we also felt it was useful to focus on the most active case locations. While some of the sequence similarity clusters exhibit some spatial clustering, an aim of the sequence similarity approach was to offer a temporal view on the locations identified by the space-time scan.

4. The event sequence clusters shared in geographically distant regions (e.g., Pacific Northwest and New York) likely reflect when COVID-19 was introduced into the US even though molecular sequencing suggests the introduction in these regions was likely from different continents of origin (i.e., Europe vs. Asia). Some discussion concerning the epidemiological interpretation of clusters based on the greater literature is warranted. We have added references to research on the different continents of origin noting that isolates from China primarily seeded the original COVID-19 outbreak on the West Coast and that European isolates seeded the pandemic in New York (and the US East Coast).

5. The socio-demographic analyses should probably be removed from this manuscript. I would recommend performing these analyses using multivariable multinomial regression models based on the event sequence cluster classification for each county in a separate manuscript. Currently, the descriptive comments concerning socio-demographic factors within clusters identified using different methods are not particularly insightful in the current draft of the manuscript. I suspect the authors have put too much in one manuscript and this section deserves much more detail and a stronger analysis. We agree that this analysis would be better served in another manuscript and have removed it.

Specific comments:

Title:

i. The event sequence based clustering was applied to counties during a specific time period, but it did not account for spatial location. It might be better for the authors to state that they are examining the distribution of counties classified based on event sequence based clustering with respect to space-time clusters of COVID-19. If they agree with comment 3 in my general comments, it might be better to focus on the spatial clustering of event sequence based clusters of COVID-19. We have revised the title to reflect changes to: “A comparison of prospective space-time scan statistics and event sequence similarity based clustering for COVID 19 surveillance.”

Additionally, we have made revisions in the manuscript to clarify that the sequence similarity clustering was applied in each study period, revised the maps of these clusters and added timelines to show the temporal profiles of these clusters. Some of the sequence similarity clusters do exhibit spatial clustering of some members but the intent was to examine membership with respect to sequence similarity rather than spatial clustering.

Introduction:

i. Lines 60-62: This statement is not really correct. Where and when these measures were implemented, there was success in “flattening the curve” (even in the US), and strict measures did control the disease in parts of the world where the political, economic, and sociological conditions allowed for their strict implementation. I would remove this sentence since it is not relevant to the authors’ work or the need for surveillance tools for the continuing pandemic. This text has been removed in the revised manuscript.

ii. Lines 97-98. This statement is not accurate. The space-time permutation model that is available with SaTScan does not use or require background population data although as a result, it is subject to population shift bias. We did not use the permutation space-time scan statistic. The prospective Poisson space-time model was used. We apologize for the ambiguity in line 168 and have corrected this oversight in the revised manuscript.

iii. Lines 108-109. There’s nothing that prevents an individual from making these comparisons. Please note that the point of the scan statistic is to detect if there are clusters in space, time, and space-time with significantly higher or lower levels of disease without predefining the geographical or temporal size of these clusters. There is no implication that every sub-region within a cluster shares the same rates any more than one could assume that the rate of disease among cities, towns, or villages within a county were homogeneous. The text in this section was removed.

iv. Lines 110-129. This reads more like a summary of methods. I would strongly suggest the authors clearly state their research objectives (i.e., what are they trying to discover or compare rather than a brief description of the methods). We have revised this section to specifically address the paper objectives and have moved the methods related text to the Methods section.

Methods:

i. Please review suggestions in the general comments especially concerning the scan statistic.

We have addressed the suggestions in the general comments on the scan statistic.

ii. Lines 156-161. The authors should use appropriate epidemiological terms. The authors need to clearly state they are calculating the crude incidence rate. However, I would strongly encourage them to consider calculating the age and sex adjusted incidence rates for their subsequent event sequence based clustering. We thank the reviewer for this correction and have revised the manuscript to refer to incidence rate. We did not have information on age and sex for confirmed cases, so we were not able to make these adjustments. We also note in the text that while deaths from COVID-19 are several times higher in older age groups, infections can affect all segments of the population.

iii. Lines 183-184. The authors state that the duration of a cluster was set at 2 days, but in the results they have space-time clusters of 1 day in length. The authors should revise the methods or results for consistency. The setting with 2 days is correct. A date math error occurred when presenting the cluster duration in the table, which we have corrected in the revised manuscript.

iv. Line 188. Please note the number of Monte Carlo replications performed and whether scans were performed as 1-tailed tests needs to be stated. In the SaTScan settings we chose the standard Monte Carlo option for the p-value (significance threshold p <= 0.05), and 999 replications were run.

v. Line 198. Please replace the phrase “cases normalized to the county population” with the term incidence rate throughout the manuscript. This has been corrected throughout the revised manuscript.
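
Two quantities named in the Methods exchange above, the crude incidence rate and a Monte Carlo p-value from 999 replications, can be illustrated with a minimal Python sketch. The function names, the per-100,000 scaling, the sample county figures, and the stand-in null distribution are all assumptions made for illustration; the p-value uses the usual rank-based estimate and is not taken from SaTScan's code or from the manuscript.

```python
import random


def crude_incidence_rate(cases, population, per=100_000):
    """Crude incidence rate: cases divided by population at risk, scaled by `per`."""
    if population <= 0:
        raise ValueError("population must be positive")
    return cases * per / population


def monte_carlo_p_value(observed_stat, simulated_stats):
    """Rank-based Monte Carlo p-value: (1 + #{simulated >= observed}) / (1 + #simulated)."""
    n_at_least = sum(1 for s in simulated_stats if s >= observed_stat)
    return (1 + n_at_least) / (1 + len(simulated_stats))


# Hypothetical county: 150 confirmed cases among 60,000 residents.
rate = crude_incidence_rate(150, 60_000)

# Stand-in null distribution (assumption): 999 replicate test statistics drawn
# from an arbitrary distribution, only to show how the rank is turned into p.
rng = random.Random(0)
simulated = [rng.expovariate(1.0) for _ in range(999)]

# An observed statistic larger than every replicate gets the smallest possible
# p-value for 999 replications, 1 / (999 + 1).
p_extreme = monte_carlo_p_value(float("inf"), simulated)
```

With 999 replications the smallest attainable p-value is 0.001, which is one reason replication counts of the form 10^k - 1 are commonly chosen.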

Results:

i. Please make certain tables of clusters have consistent titles and select one short form to differentiate event sequence based clusters from space-time clusters and use it consistently throughout the manuscript. Please avoid repeating the methods in the results and use subheadings to differentiate the different statistical approaches being used. Avoid including discussion in the results section. We have revised the text to clearly separate results from discussion.

ii. The tables and figures provide a great deal of summary detail. Please use the text to describe general locations and major characteristics of clusters. The listing of each county within a cluster or each county with a high rate of disease in the text is not necessary. For tables of space-time clusters, please include a column for the radius; the column for the log likelihood is unnecessary. Some authors would include latitude and longitude in these tables, but the figures are sufficient for spatial information. We have included a column for the radius and removed the column of log likelihood (LLR).

iii. The figures concerning the “elbow method” and the dendrograms should be moved from the supplemental material into the main manuscript. If the authors follow the suggestions in the general comments, this figure would only be needed for the one period being examined. In the revised manuscript we have moved the figures of the elbow method graphs and cluster dendrograms to the main body.

iv. Line 339. Please remove the term “emerging”. We have removed this term.

v. Line 450. Please note that it is the cluster that is statistically significant and not the counties. Please remove similar phrases from the text. We have corrected this in the revised manuscript.

vi. Line 473. Figures and tables need to be numbered in the order they appear in the text. Currently, figure 9 is listed before figures 7 and 8. We have revised some figures and taken care that they are numbered in the order in which they are referenced in the text.

vii. Line 522-523. Please remove results that are not statistically significant from the text and figures. We have removed the non-statistically significant clusters in the figures and the corresponding description in the text.

Discussion:

i. Please revise the discussion after revising the manuscript. The discussion has been revised to reflect revisions to the manuscript.

Attachment

Submitted filename: Response to reviewers.docx

Decision Letter 1

Agricola Odoi

18 Mar 2021

PONE-D-20-26698R1

A comparison of prospective space-time scan statistics and event sequence similarity-based clustering for COVID 19 surveillance

PLOS ONE

Dear Dr. Beard,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses all the issues raised by both reviewers.

Please submit your revised manuscript by May 02 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Agricola Odoi, BVM, MSc, PhD, FAHA, FACE

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Thank you to the authors for addressing the points raised in the previous review. I only have a couple of minor suggestions for the authors to consider.

The authors stated that incubation is 14 days, which is approaching the upper limit. Ideally, the authors should carefully word whether this was the maximum or the average and provide a reference.

Next, on images displaying case sequence clusters, it may be useful to indicate, possibly in the figure legend, what the outsiders are.

Similarly, for cluster dendrograms it may be helpful to indicate which group of observations belongs to which cluster. It seems there is a place for this, but the cluster designation is not visible on the figures.

In addition, the authors used calendar time to look into similarity of case incidence within clusters. For future consideration, it may actually be interesting to consider time in terms of number of days since detection of the first case in a county.

Reviewer #2: Manuscript ID: PONE-D-20-26698R1

Title: A comparison of prospective space-time scan statistics and event sequence similarity-based clustering of COVID-19 surveillance

Authors: Xu, F. & Beard, K.

General comments: The revised draft of the authors’ manuscript is greatly improved. Although I am not sure I agree with all their decisions (e.g., maximum scanning window), I believe they have documented/defended their decisions well. My remaining suggestions are mainly cosmetic in nature. Below are some general comments/suggestions:

i. Please put subheadings for the space-time cluster and sequence similarity-based cluster paragraphs in each study period section to avoid confusion over what type of “clusters” are being discussed.

ii. Spell “sequence similarity-based cluster” consistently throughout the manuscript. It is written at least three different ways in the text (e.g., “sequence-similarity based cluster”, “sequence similarity based cluster”).

iii. In paragraphs concerning sequence similarity-based clusters, please do not refer to “spatial clusters” of these clusters. It is very confusing to use the term in a non-statistical sense in a manuscript describing two types of statistical clusters. Just indicate that these sequence similarity-based clusters concentrate around particular cities or regions rather than state they form “spatial clusters”.

iv. In the discussion, make certain to state clearly the value of extracting the information concerning the sequence similarity-based clusters from within the space-time clusters.

Manuscript text:

i. Line 45: It should read “share a similar”.

ii. Line 109: It should read “understanding disease dynamics”.

iii. Lines 113 & 386: It should read “complementary” not “complimentary”.

iv. Line 165: It might be better to write “missing” rather than “avoiding”.

v. Lines 171-173: This statement is not accurate. The age structure of a population will influence disease reporting and the real incidence of disease. The authors should just state they did not have access to age and sex data for cases and were unable to adjust for these variables as they reported in their response letter. This limitation should be addressed in the discussion.

vi. Line 219: Would it be better to state “decreased” rather than “improved”?

vii. Line 263: It should read, “statistical”.

viii. Line 333: Please remove the sentence, “In this period, we note little activity on the west coast.” This statement is not correct. The authors did not identify any active space-time clusters during this period, but there was a lot of disease activity on the West Coast.

ix. Line 347: Replace “covering through” with “ending in”.

x. Line 348: Replace “statistics” with “statistic”.

xi. Line 362: Replace “determined” with “selected”.

xii. Lines 371-372: Make certain to explain what “OC” means in the text.

xiii. Lines 389-390: It might be better to state, “the expected number of cases in space-time based on …...”. The following sentence from 390-391 should be removed since it is redundant with the addition of the above phrase.

xiv. Line 392: It should read, “at such a location during a period of time.”

xv. Line 393: Replace “temporal dimension” with “temporal pattern” since one could argue the “temporal dimension” is part of the space-time cluster.

xvi. Line 400: Rephrase as “we found that in all study periods, similar sequence patterns of COVID-19...”

xvii. Line 408: Consider including “similar changes in surveillance programs” to the list of reasons explaining these common temporal patterns.

xviii. Lines 413-414: Do the authors mean “counties in New York State”? Please clarify.

xix. Line 447: It should read, “pathway studies”.

Tables and figures:

i. Tables 1-4. In the titles, please replace “SaTScan space-time clusters” with “prospective space-time clusters”. A footnote can be added to the tables stating, “Space-time clusters were identified using the spatial scan statistic with a Poisson model”.

ii. Fig 1. Remove “covering” from the title.

iii. I believe the journal expects supplementary materials to be labeled as “Table S1” and “Fig. S1” rather than “S1 Table” or “S1 Figure”. Please correct accordingly.

References:

i. Please properly edit the references. Journal titles and manuscript titles are inconsistently formatted. Additional “PubMed” information is sometimes accidentally included at the end of references. If the authors wish to include “doi” information, please include it consistently or not at all.

ii. Reference 36 should be replaced with a more formal reference (e.g., journal article or government report).

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Jun 10;16(6):e0252990. doi: 10.1371/journal.pone.0252990.r004

Author response to Decision Letter 1


17 Apr 2021

Reviewer #1: Thank you to the authors for addressing the points raised in the previous review. I only have a couple of minor suggestions for the authors to consider.

The authors stated that incubation is 14 days, which is approaching the upper limit. Ideally, the authors should carefully word whether this was the maximum or the average and provide a reference. Thanks for pointing this out. We have cited a reference reporting that the incubation time for COVID-19 mostly ranges from 1 to 14 days, with an average of 5 days.

Next, on images displaying case sequence clusters, it may be useful to indicate, possibly in the figure legend, what the outsiders are. We added text in the methods section on comparison describing the group outside the clusters, which is the Outsiders (OC) group. This is also now noted in the captions of all four related maps.

Similarly, for cluster dendrograms it may be helpful to indicate which group of observations belong to which cluster. It seems there is a place for this, but cluster designation is not visible on figures. We added the cluster numbers to the dendrogram figures to facilitate reference between them and the maps. They are also consistently color coded now between figures.

In addition, the authors used calendar time to look into similarity of case incidence within clusters. For future consideration, it may actually be interesting to consider time in terms of number of days since detection of the first case in a county. In this paper we emphasize the purpose of surveillance, so using calendar time seemed appropriate. We agree that using the number of days since detection of the first case in a county as the time measure is an interesting approach for future research. In fact, we had considered this earlier and believe that counting time from the first detected case in a county would be better for identifying similar patterns in the development of the COVID-19 disease itself.
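
As a rough illustration of the alternative time measure discussed above, the following Python sketch re-indexes a county's daily case sequence so that day 0 is the day of its first reported case. The function name and the sample counts are hypothetical and are not taken from the study data; this is a sketch of the idea, not the method used in the manuscript.

```python
def align_to_first_case(daily_counts):
    """Drop leading zero-count days so index 0 is the day of the first reported case."""
    for day, count in enumerate(daily_counts):
        if count > 0:
            return daily_counts[day:]
    return []  # no cases reported in the window


# Hypothetical daily new-case counts, indexed by calendar day.
county_a = [0, 0, 1, 3, 7, 12]
county_b = [0, 0, 0, 0, 1, 3, 7, 12]

# In calendar time the two sequences are offset by two days; after alignment
# they describe the same progression from the first detected case.
aligned_a = align_to_first_case(county_a)
aligned_b = align_to_first_case(county_b)
```

Under calendar time these two hypothetical counties would look dissimilar over the same window, while under the aligned time measure they are identical, which is the trade-off between the surveillance view and the disease-progression view described in the response.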

Reviewer #2: Manuscript ID: PONE-D-20-26698R1

Title: A comparison of prospective space-time scan statistics and event sequence similarity-based clustering of COVID-19 surveillance

Authors: Xu, F. & Beard, K.

General comments: The revised draft of the authors’ manuscript is greatly improved. Although I am not sure I agree with all their decisions (e.g., maximum scanning window), I believe they have documented/defended their decisions well. My remaining suggestions are mainly cosmetic in nature. Below are some general comments/suggestions:

i. Please put subheadings for the space-time cluster and sequence similarity-based cluster paragraphs in each study period section to avoid confusion over what type of “clusters” are being discussed. We added subheadings for the space-time scan cluster and sequence similarity-based cluster paragraphs for each of the study periods.

ii. Spell “sequence similarity-based cluster” consistently throughout the manuscript. It is written at least three different ways in the text (e.g., “sequence-similarity based cluster”, “sequence similarity based cluster”). Thank you for catching this. We now use “sequence similarity-based cluster” consistently throughout the manuscript.

iii. In paragraphs concerning sequence similarity-based clusters, please do not refer to “spatial clusters” of these clusters. It is very confusing to use the term in a non-statistical sense in a manuscript describing two types of statistical clusters. Just indicate that these sequence similarity-based clusters concentrate around particular cities or regions rather than state they form “spatial clusters”. Where some members (counties) of sequence similarity-based clusters are spatially clustered, we changed the expression to “some members within this group appear spatially concentrated or grouped around metropolitan areas …”.

iv. In the discussion, make certain to state clearly the value of extracting the information concerning the sequence similarity-based clusters from within the space-time clusters. We have added a paragraph to the methods section on comparison to address this issue.

Manuscript text:

i. Line 45: It should read “share a similar”. “share similar” has been replaced with “share a similar”.

ii. Line 109: It should read “understanding disease dynamics”. “understanding of the disease dynamics” has been replaced with “understanding disease dynamics”.

iii. Lines 113 & 386: It should read “complementary” not “complimentary”. “complimentary” has been corrected to “complementary”.

iv. Line 165: It might be better to write “missing” rather than “avoiding”. In this case we did want to avoid very large clusters, such as ones covering over a quarter of the country, as these are not particularly meaningful as “clusters”.

v. Lines 171-173: This statement is not accurate. The age structure of a population will influence disease reporting and the real incidence of disease. The authors should just state they did not have access to age and sex data for cases and were unable to adjust for these variables as they reported in their response letter. This limitation should be addressed in the discussion. We agree that the age structure of a population affects the actual incidence of disease. We added “we were unable to access age and sex data at this time for cases in this study, thus we did not adjust for age and sex”.

vi. Line 219: Would it be better to state “decreased” rather than “improved”? We made this change.

vii. Line 263: It should read, “statistical”. We made this change.

viii. Line 333: Please remove the sentence, “In this period, we note little activity on the west coast.” This statement is not correct. The authors did not identify any active space-time clusters during this period, but there was a lot of disease activity on the West Coast. We agree with this and the sentence has been deleted.

ix. Line 347: Replace “covering through” with “ending in”. We replaced “covering through” with “ending in” since it is followed by May 20, 2020.

x. Line 348: Replace “statistics” with “statistic”. We replaced “statistics” with “statistic”.

xi. Line 362: Replace “determined” with “selected”. “determined” has been replaced with “selected”. (note: “determined” was in Line 372)

xii. Lines 371-372: Make certain to explain what “OC” means in the text. We added a paragraph in the methods section to explain OC and additionally added an explanation of “Outsiders” (OC) in the captions of all four related maps.

xiii. Lines 389-390: It might be better to state, “the expected number of cases in space-time based on …...”. The following sentence from 390-391 should be removed since it is redundant with the addition of the above phrase. We added “in space-time” after “the expected number of cases” and deleted the following sentence.

xiv. Line 392: It should read, “at such a location during a period of time.” We replaced “at such location and times” with “at such a location during a period of time”.

xv. Line 393: Replace “temporal dimension” with “temporal pattern” since one could argue the “temporal dimension” is part of the space-time cluster. Good point, we replaced “temporal dimension” with “temporal pattern”.

xvi. Line 400: Rephrase as “we found that in all study periods, similar sequence patterns of COVID-19...”. We agree that this is a better expression, so we replaced “all study periods showed that similar sequences in COVID-19 cases” with “we found that in all study periods, similar sequence patterns of COVID-19 cases”.

xvii. Line 408: Consider including “similar changes in surveillance programs” to the list of reasons explaining these common temporal patterns. We agree with the reviewer and added this to the list of reasons.

xviii. Lines 413-414: Do the authors mean “counties in New York State”? Please clarify. There is a New York County, so for clarity we changed the order of the statement to read “Bronx, Kings, Queens, New York, and Nassau counties in New York State”.

xix. Line 447: It should read, “pathway studies”. We replaced “pathways studies” with “pathway studies”.

Tables and figures:

i. Tables 1-4. In the titles, please replace “SaTScan space-time clusters” with “prospective space-time clusters”. A footnote can be added to the tables stating, “Space-time clusters were identified using the spatial scan statistic with a Poisson model”. We replaced “SaTScan” with “prospective” and placed the “Note: Space-time clusters were identified using the spatial scan statistic with a Poisson model” below the tables as a footnote.

ii. Fig 1. Remove “covering” from the title. We deleted “covering”.

iii. I believe the journal expects supplementary materials to be labeled as “Table S1” and “Fig. S1” rather than “S1 Table” or “S1 Figure”. Please correct accordingly. We double checked the author guidelines which state the following: “You may use almost any description as the item name of your supporting information as long as it contains an "S" and number. For example, “S1 Appendix” and “S2 Appendix,” “S1 Table” and “S2 Table,” and so forth”. So we followed this guideline. However, the way of labeling that the reviewer suggested may also be accepted.

References:

i. Please properly edit the references. Journal titles and manuscript titles are inconsistently formatted. Additional “PubMed” information is sometimes accidentally included at the end of references. If the authors wish to include “doi” information, please include it consistently or not at all. We manually rechecked and edited the EndNote references according to the PLOS ONE author guidelines. We removed “PubMed” and “doi” information from some references and updated and formatted all references consistently.

ii. Reference 36 should be replaced with a more formal reference (e.g., journal article or government report). The previous Ref 36 was removed and replaced with two recent formal references (Ref 36 and 37).

Attachment

Submitted filename: response_Reviewers_v3.docx

Decision Letter 2

Agricola Odoi

18 May 2021

PONE-D-20-26698R2

A comparison of prospective space-time scan statistics and spatiotemporal event sequence similarity-based clustering for COVID 19 surveillance

PLOS ONE

Dear Dr. Beard,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit and will most likely be accepted after the minor revisions suggested by reviewer 2 have been implemented. Therefore, we invite you to submit a revised version of the manuscript after making the recommended revisions.

Please submit your revised manuscript by Jul 02 2021 11:59PM.  When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Agricola Odoi, BVM, MSc, PhD, FAHA, FACE

Academic Editor

PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: Manuscript ID: PONE-D-20-26698R2

Manuscript title: A comparison of prospective space-time scan-statistics and spatiotemporal event sequence-based clustering for COVID 19 surveillance

Corresponding author: Beard-K

General comments:

The authors have made all the requested revisions. Below are some minor edits/suggestions they should consider. There is no need for me to see the manuscript again.

Specific comments:

Line 172: Should it read, “required”?

Lines 174-175: It might sound better to write, “...for cases in this study, so we could not adjust for age and sex.”

References:

Fix the formatting of journal titles for references 17, 21, 34 (also the article title).

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Decision Letter 3

Agricola Odoi

27 May 2021

A comparison of prospective space-time scan statistics and spatiotemporal event sequence similarity-based clustering for COVID 19 surveillance

PONE-D-20-26698R3

Dear Dr. Beard,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Agricola Odoi, BVM, MSc, PhD, FAHA, FACE

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Agricola Odoi

2 Jun 2021

PONE-D-20-26698R3

A comparison of Prospective Space-time Scan Statistics and Spatiotemporal Event Sequence Based Clustering for COVID-19 Surveillance

Dear Dr. Beard:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Prof. Agricola Odoi

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Table. Comparison of space-time clusters from SaTScan and STES-based hierarchical clustering for the dataset from 1/23/2020 to 3/13/2020.

    The two sets of results are merged on US county FIPS codes. The table also includes selected SaTScan output parameters such as p-values, LOC_RR (location, i.e. county, relative risk), CLU_RR (cluster relative risk), LOC_LAT (location latitude), and LOC_LONG (location longitude).

    (XLSX)

    S2 Table. Comparison of space-time clusters from SaTScan and STES-based hierarchical clustering for the dataset from 1/23/2020 to 3/31/2020.

    The two sets of results are merged on US county FIPS codes. The table also includes selected SaTScan output parameters such as p-values, LOC_RR (location, i.e. county, relative risk), CLU_RR (cluster relative risk), LOC_LAT (location latitude), and LOC_LONG (location longitude).

    (XLSX)

    S3 Table. Comparison of space-time clusters from SaTScan and STES-based hierarchical clustering for the dataset from 1/23/2020 to 4/19/2020.

    The two sets of results are merged on US county FIPS codes. The table also includes selected SaTScan output parameters such as p-values, LOC_RR (location, i.e. county, relative risk), CLU_RR (cluster relative risk), LOC_LAT (location latitude), and LOC_LONG (location longitude).

    (XLSX)

    S4 Table. Comparison of space-time clusters from SaTScan and STES-based hierarchical clustering for the dataset from 1/23/2020 to 5/20/2020.

    The two sets of results are merged on US county FIPS codes. The table also includes selected SaTScan output parameters such as p-values, LOC_RR (location, i.e. county, relative risk), CLU_RR (cluster relative risk), LOC_LAT (location latitude), and LOC_LONG (location longitude).

    (XLSX)

    S5 Table. The minimal data set underlying the results described in this manuscript.

    (CSV)

    Attachment

    Submitted filename: Response to reviewers.docx

    Attachment

    Submitted filename: response_Reviewers_v3.docx

    Attachment

    Submitted filename: response_Reviewers_v4_5-20-2021.docx

    Data Availability Statement

    All data used in the study are available as S1-S5 Tables provided with this submission.


    Articles from PLoS ONE are provided here courtesy of PLOS
