Enhanced cluster detection and noise reduction for geospatial time series data of COVID-19

Sabitri Gaire; Abeer Alsadoon; P W C Prasad; Nada Alsallami; Simi Kamini Bajaj; Ahmed Dawoud; Trung Hung VO

doi:10.1007/s11042-023-15901-0

2023 Jun 4:1–32. Online ahead of print. doi: 10.1007/s11042-023-15901-0


PMCID: PMC10239308 PMID: 37362721

Abstract

Spatial-temporal analysis of COVID-19 cases is critical for understanding transmission behaviour and detecting possible emerging clusters. Poisson prospective space-time analysis has been successfully implemented for cluster detection in geospatial time series data. However, its accuracy, number of detected clusters, and processing time remain major problems when detecting small-sized clusters. The aim of this research is to improve the accuracy of COVID-19 cluster detection at the county level in the U.S.A. by detecting small-sized clusters and reducing noisy data. The proposed system combines Poisson prospective space-time analysis with the Enhanced Cluster Detection and Noise Reduction (ECDeNR) algorithm to increase the number of detected clusters and decrease the processing time. Results for accuracy, processing time, number of clusters, and relative risk are obtained using different COVID-19 datasets in SaTScan. The proposed system increases the average number of clusters by 7 and the average relative risk by 9.19. It also provides a cluster detection accuracy of 91.35% against the current accuracy of 83.32%, and an average processing time of 5.69 minutes against the current 7.36 minutes. By using the ECDeNR algorithm, the proposed system improves the accuracy, number of clusters, and relative risk while reducing the processing time of cluster detection. This study addresses the problem of detecting small-sized clusters at an early stage and enhances overall cluster detection accuracy while decreasing the processing time.

Keywords: COVID-19, Geospatial time series data, Space-time analysis, Spatial-temporal analysis, Poisson prospective distribution

Introduction

Coronavirus disease (COVID-19) is an infectious disease that was first identified in Wuhan city, Hubei province, China, in December 2019. COVID-19 is caused by a newly discovered coronavirus, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [5, 12]. As of September 12, 2020, there were more than 28 million confirmed cases and more than 920 thousand deaths worldwide. In the United States, there were over 6,638,044 confirmed cases and 197,461 deaths [6]. Within a short time, the virus spread all over the world, and many countries implemented social distancing and even lockdowns. Around 80% of the confirmed cases were mild and the death rate was around 3.19% [28]. General symptoms include fever, shortness of breath, and cough, whereas severe cases may involve multi-organ failure, pneumonia, and death [22].

Because of the virus's high transmission rate and the challenges in developing a vaccine, bringing the disease under control will likely take more time. It is therefore very important to understand and visualize the behaviour of the virus in order to stay safe and reduce fatalities. A space-time statistic is an effective approach for analyzing a disease's behaviour over a given time [18, 26]. It helps in studying the number of cases in a particular location in a given time. In the past, space-time scan statistics have been used for analyzing chikungunya and dengue fever in Colombia and Panama [22], pointing to areas with increased criminal activity, and detecting hot spots of West Nile Virus infection in Italy [24]. They are also useful for analyzing the recurrence intervals of the data using software such as SaTScan [17]. Prospective distribution is a widely used approach for analyzing disease cases. This approach treats each case as an individual, following it over a given time and collecting its data as characteristics. It then detects active or emerging clusters on the current day while disregarding past clusters, which is very helpful for understanding the disease's behaviour. The current solution does not include the testing rate when calculating the number of expected cases [11].

The purpose of this paper is to improve the cluster detection accuracy, processing time, number of clusters, and relative risk by using the ECDeNR algorithm. This research aims to increase the cluster detection accuracy by including areas with a small number of cases and a low testing rate [3]. The proposed Modified Likelihood Ratio function adds important new features to the system for detecting clusters. For this, the Relative positive case count is used to calculate the number of expected cases, which helps to detect small-sized clusters and thereby increases the accuracy of cluster detection. Furthermore, the Modified Relative Risk function is used with the Monte Carlo simulation to detect secondary clusters. To ensure higher accuracy of the relative risk of each cluster, the proportion of positive tests is used, which reduces the noisy data in clusters.

The remainder of this paper is organized as follows. Section two gives the literature review. Section three describes all the major components and sub-components of the proposed system along with the related diagrams. Section four discusses the results of this study. Finally, section five concludes the research and highlights future work.

Literature review

The purpose of the literature review is to survey the existing papers and their limitations in order to improve the current system. This section reviews these papers and the related prospective field analysis.

Guliyev [9] examined COVID-19 cases using two variables, recovered cases and the death rate, together with their spatial spillover effects. To determine the relationship between these variables and their spatial effects, this work used a spatial panel data model. Guliyev [9] provided the most consistent and efficient model for capturing spatial effects according to the LR test, the maximum pseudo-R², and the minimum BIC and AICc values. It identifies the actual impacts and spatial interactions of the factor components on COVID-19. However, this paper cannot model the death rate because of the high proportion of zeros in the dataset, and the time period considered is short. Future work will be carried out with a bigger dataset. Balamchi and Torabi [1] enhanced the accurate detection of repeated events in spatial data. This research offered a spatial model with repeated events, known as the spatial compound Poisson model, to analyze the repeated events as well as the spatial variation of the data by incorporating spatial random effects. The work provided a better knowledge of the spatial trend of disease and its risk factors for future prevention, and the performance of the model increased by 4.04%. However, this research used only two covariates, sex and year, to account for the exact number of incidences, which may reduce the performance of the system. In the future, the model should use more covariates and extend the process to binary data as well. Saeed et al. [29] improved the modelling of location-wise alcohol-related driving crash rates by analyzing spatial effects and edge effects emerging from common roads that cross region boundaries. Saeed et al. [29] used the spatial Durbin model to account for spatial dependencies, evaluating alternative spatial weight structures and considering edge effects. This research provided a framework for macroscopic spatial analysis of different kinds of road crashes at different severity levels to identify effective safety intervention programs that minimize accidents. However, the estimated crash rate is still not sufficiently accurate. This research can be extended to examine road crashes based on neighborhood dynamics to improve crash rate accuracy. Lansiaux [20] found that sunlight exposure is remarkably correlated with the mortality rate of COVID-19 by using the Pearson correlation test. Sunlight exposure may have a protective effect against COVID-19 mortality. These findings may help in preventing and overcoming the COVID-19 pandemic. However, this work does not include two important factors, time and direct vitamin D measurements, which affects the accuracy of the results. Hammad et al. [10] found that patients with acute and chronic conditions are at greater risk because of many factors related to COVID-19, by analyzing blood pressure, heart rate, troponin, left ventricle ejection fraction (LVEF), and new Q-wave to assess severity. Hammad et al. [10] provided a result that was 11% more accurate than the previous system. However, this paper used a very small dataset (only 143 patients), which is not sufficient to generalize the behavior of COVID-19. Further research can apply the analysis methods to a larger dataset.

Corizzo et al. [4] proposed a new algorithm for detecting anomalies in data collected from multiple sensors positioned in different locations. This research achieved up to a 13.56% reduction in RMSE compared to the baseline scenario, which increased the accuracy of anomaly detection. However, this approach can sometimes lead to poor results if the sensors are geographically far from each other. To account for spatial autocorrelation in the system, future research can study the adoption of statistical indicators in the learning process. Krivoruchko and Gribov [15] enhanced the accuracy of finding the chordal distance between two geolocations. The work compared the accuracy of distances between different models. They provided a new computationally efficient and accurate Kernel evolution algorithm (EBK), which doubled the accuracy of calculating the distance between two geolocations. However, this research depends entirely on the ArcGIS software for simulating and mapping GIS models, and improper implementation of the modelling within the system can produce invalid results.

Leevy et al. [21] analyzed the effect on an existing predictive model of including training datasets from several year-groupings. This research provided well-calculated data showing how the distribution of the original dataset changes over time by grouping the data into different years and applying various data processing algorithms. However, this research was only conducted on the dataset collected from 2013 to 2015 and was limited to physicians who were active throughout these three years, which excluded some potentially useful data. Future work will inspect the effect of using other learners, metrics, and class ratios on big data from domains other than healthcare. Lakhani et al. [19] identified priority areas for palliative care in the city of Melbourne with a remarkably high number of adults with disabilities and barriers to accessing essential health services. It provided a framework to find the preferred regions for palliative care services during the COVID-19 pandemic. This model supports the health of vulnerable populations in the preferred regions that have limited access to health services during a pandemic. However, this paper only considered a small area (Melbourne). Future research can use a larger dataset covering larger areas, which will give a more generalized result. Mollalo et al. [23] investigated the county-level variation of COVID-19 cases across the USA by compiling 35 environmental, topographic, demographic, and socioeconomic variables. The work supported the substantial impact of healthcare professionals during the pandemic. However, the dataset used is at the county level while the calculation is done at the sub-county level, which did not produce accurate results. Future research can be done with sub-county level data and should include other variables to improve the quality of the service and the overall effort to combat the pandemic.

Cordes and Castro [3] identified clusters of high positivity rates and low testing rates by using spatial scan statistics. The research provided a list of areas with limited access to testing but a high number of cases, which is essential for assessing risk and allocating resources in the COVID-19 pandemic. The fine spatial resolution is the major strength of this research. It gave a better idea of which nearby regions have a higher case burden. However, it only explained the relationship between COVID-19 testing patterns and their dependent factors. More input parameters must be used and examined with a big dataset. Hohl et al. [11] conducted daily surveillance of COVID-19 to detect and characterize emerging clusters in the USA by applying the prospective space-time scan statistic. This work offered a web application that lets the user track the space-time distribution of significant clusters. It improved on previous work and enhanced the accuracy of cluster detection at an increased temporal resolution. However, it generated clusters with a circular shape, which is not a good choice in an area with significant spatial heterogeneity, and this decreased the accuracy of cluster detection in real life. Future research will work on detecting clusters of irregular shape. Rongyao et al. [27] proposed a framework for joint disease diagnosis and conversion time prediction. This work investigated distinguishing severe cases from mild cases and predicting the time taken for a mild case to convert to a severe case. The proposed method is evaluated against six comparison methods on synthetic multi-modality datasets and a COVID-19 dataset, based on binary classification and regression performance. However, the method of Rongyao et al. [27] is sensitive to the selection of the tuning parameters used in the objective function and focuses only on binary classification.

State of art

Hohl et al. [11] proposed a model for COVID-19 cluster detection using Poisson prospective space-time analysis in the SaTScan software, which builds cylindrical clusters. This paper calculates the expected cases and elevated risk by considering the total population and active cases. Expected cases were calculated using the population within a cluster (p), the total number of cases in the US (C), and the total population (P). They use 999 Monte Carlo simulations for significance testing, which increases the possibility of finding secondary clusters and reduces the uncertainty [30]. In the first half of the study period, the number of clusters was in the range of 6–10, but in the second half it was in the range of 23–24 [11]. This solution has some limitations. It cannot detect small-sized clusters with a low number of tests. Noise reduction of clusters is not implemented. The average accuracy of cluster detection is also a major limitation: the average cluster detection accuracy is 83.32% and the processing time is 7.37 minutes. Figure 1 shows the block diagram of the state of art; the blue borders show the good features of this solution, and the red border marks its limitations.

Fig. 1 — The block diagram of state of art system [11]. The blue borders show the good features of this state of art solution, and the red border refers to the limitation of it

The state of art consists of four major stages, namely Data collection and preparation, Data modelling, Data analysis, and Data visualization. The following subsections explain these stages further and describe and justify their limitations.

Data preparation and collection

Data is collected from the public COVID-19 case data of the USA provided by Johns Hopkins University for the selected study period (January 22 to June 5, 2020). To preserve the integrity of the data, cases from international cruise ships were removed [11]. Finally, daily new cases were calculated and grouped on a weekly basis for cluster calculation [11]. Grouping the daily case data on a weekly basis affects the daily nature of the temporal window considered, and the calculation may lose useful information such as the daily number of cases and the daily number of deaths, which affects the accuracy of the result.
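The preparation step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the input is assumed to be a cumulative case series for one county, and clipping negative differences to zero is a choice of this sketch rather than something reported in [11].

```python
from typing import List


def daily_new_cases(cumulative: List[int]) -> List[int]:
    """Difference a cumulative series into daily new cases.

    Negative differences (downward corrections in the source data) are
    clipped to zero; this clipping is an assumption of the sketch.
    """
    new_cases = []
    previous = 0
    for total in cumulative:
        new_cases.append(max(total - previous, 0))
        previous = total
    return new_cases


def weekly_totals(daily: List[int], days_per_week: int = 7) -> List[int]:
    """Sum daily counts into consecutive blocks of `days_per_week` days."""
    return [
        sum(daily[i:i + days_per_week])
        for i in range(0, len(daily), days_per_week)
    ]


# Made-up cumulative counts for one county over ten days.
cumulative = [0, 1, 1, 4, 9, 9, 15, 22, 30, 30]
daily = daily_new_cases(cumulative)
weekly = weekly_totals(daily)
```

The weekly totals keep the overall case count but hide the day-to-day variation, which is the information loss the paragraph above refers to.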

Data modelling

Hohl et al. [11] used Poisson prospective space-time analysis, which detects the most likely clusters from several cylindrical candidate clusters. This research restricted the spatial scanning window and the temporal scanning window to 10% and 50%, respectively. Expected cases were calculated using the population within a cluster (p), the total number of cases in the US (C), and the total population (P). The use of Poisson prospective space-time analysis is the main feature of this stage. It helped to analyze the cases by considering the geographical location and their behaviour over the study period. However, the model does not relate the expected cases to the testing rate, even though the number of cases depends directly on the number of tests performed.
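To make the idea of cylindrical candidate clusters concrete, the following Python sketch enumerates candidates for a toy set of counties. It is not the SaTScan implementation. It assumes that the spatial limit is a share of the total population and the temporal limit a share of the study period, that circles grow around county centroids nearest-first, and that every window ends on the latest day (the prospective setting). The record layout and all names are hypothetical.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical county record: (x, y, population).
County = Tuple[float, float, int]
# A cylinder: (member county names, first day index, last day index).
Cylinder = Tuple[List[str], int, int]


def candidate_cylinders(
    counties: Dict[str, County],
    n_days: int,
    max_pop_frac: float = 0.10,   # assumed: cap on share of total population
    max_time_frac: float = 0.50,  # assumed: cap on share of the study period
    min_days: int = 2,
) -> List[Cylinder]:
    total_pop = sum(pop for _, _, pop in counties.values())
    max_days = int(n_days * max_time_frac)
    last_day = n_days - 1  # prospective scan: each window ends on the latest day
    cylinders: List[Cylinder] = []
    for cx, cy, _ in counties.values():
        # Grow a circle around this centroid by adding counties nearest-first.
        nearest_first = sorted(
            counties,
            key=lambda name: math.hypot(counties[name][0] - cx,
                                        counties[name][1] - cy),
        )
        members: List[str] = []
        pop_inside = 0
        for name in nearest_first:
            pop_inside += counties[name][2]
            if pop_inside > max_pop_frac * total_pop:
                break  # circle would exceed the spatial limit
            members.append(name)
            # The height of the cylinder is the length of the time window.
            for length in range(min_days, max_days + 1):
                cylinders.append((list(members), last_day - length + 1, last_day))
    return cylinders


# Toy data: four counties on a line; the population cap is raised to 25%
# so that the small example yields some candidates.
counties = {
    "A": (0.0, 0.0, 10),
    "B": (1.0, 0.0, 10),
    "C": (5.0, 0.0, 10),
    "D": (6.0, 0.0, 70),
}
cylinders = candidate_cylinders(counties, n_days=10, max_pop_frac=0.25)
```

Each candidate would then be scored with the likelihood ratio, and the highest-scoring cylinder is the most likely cluster.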

Data analysis

For the data analysis, this research considers a null hypothesis H₀ and an alternative hypothesis H_A. First, a likelihood ratio test is performed against the null hypothesis to find the most elevated clusters, those with a likelihood ratio > 1. The researchers then run 999 Monte Carlo simulations by randomizing the spatial and temporal window to obtain a likelihood ratio for each run and candidate cluster, which forms a distribution under H₀ [11]. The 999 Monte Carlo simulations are the main feature of this stage and are performed to find the emerging clusters. By randomizing the locations and time window, the procedure calculates the likelihood ratio for each run, and candidate clusters are detected with their relative risk (RR). Selecting the clusters with a likelihood ratio > 1 also helps to keep only the most elevated clusters, indicated as elevated risk. However, this model has the limitation that it does not consider the Modified Relative Risk (MRR), which includes the proportion of positive tests in the calculation.
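The Monte Carlo step can be illustrated with a small Python sketch. It simplifies the procedure considerably and should be read as an assumption-laden illustration, not as the method of [11]: each region is its own candidate cluster (no cylinders), null replicates are generated by redistributing the observed cases in proportion to the expected counts, only excess risk is scored, and the p-value is the usual rank-based Monte Carlo estimate. All function names and the toy numbers are hypothetical.

```python
import math
import random
from typing import List, Sequence


def log_lr(n: int, mu: float, total: int) -> float:
    """Log likelihood ratio for one region; 0 when there is no excess (n <= mu)."""
    if n <= mu:
        return 0.0
    inside = n * math.log(n / mu)
    outside = (total - n) * math.log((total - n) / (total - mu)) if n < total else 0.0
    return inside + outside


def max_stat(counts: Sequence[int], expected: Sequence[float]) -> float:
    """Scan statistic: the largest log likelihood ratio over all candidates."""
    total = sum(counts)
    return max(log_lr(n, mu, total) for n, mu in zip(counts, expected))


def simulate_counts(expected: Sequence[float], total_cases: int,
                    rng: random.Random) -> List[int]:
    """One replicate under H0: spread the cases in proportion to `expected`."""
    counts = [0] * len(expected)
    for idx in rng.choices(range(len(expected)), weights=expected, k=total_cases):
        counts[idx] += 1
    return counts


def monte_carlo_p_value(observed_stat: float,
                        simulated_stats: Sequence[float]) -> float:
    """Rank-based p-value: (1 + #simulated >= observed) / (1 + #simulations)."""
    at_least = sum(1 for s in simulated_stats if s >= observed_stat)
    return (1 + at_least) / (1 + len(simulated_stats))


rng = random.Random(0)
expected = [10.0, 20.0, 30.0, 40.0]   # toy expected counts, summing to 100
observed = [25, 18, 27, 30]           # toy observed counts, summing to 100
observed_stat = max_stat(observed, expected)
simulated = [
    max_stat(simulate_counts(expected, sum(observed), rng), expected)
    for _ in range(999)
]
p_value = monte_carlo_p_value(observed_stat, simulated)
```

With 999 replicates the smallest attainable p-value is 1/1000, which is one reason this number of simulations is a common choice.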

Data visualization

The calculated relative risks are presented in the tabular format whereas, for the cluster's visualization, researchers have built a web application named Covid19Scan. It consists of a map and a slider divided into weekly steps. It shows that the number of clusters changes from 0 to 23 during the study period [11]. The pseudocode and the flowchart of the state of art algorithm are shown in Table 1 and Fig. 2, respectively.

Table 1. Poisson prospective space-time algorithm

Fig. 2 — The Flowchart of Poisson Prospective space-time algorithm

The state of art model detected clusters containing a minimum of 5 cases within a minimum duration of 2 days. The Poisson prospective space-time scan statistic algorithm is implemented in the data modelling phase to determine the number of expected cases, as shown in Eq. 1, and the likelihood ratio, as shown in Eq. 2 [11]. However, the accuracy can still be increased with improved cluster detection techniques.

$$\mu = p \cdot \frac{C}{P} \qquad (1)$$

where

p: the population inside the cylinder;
C: the total number of cases; and
P: the total population, taken from the U.S. census website.

The number of expected cases ($\mu$) is an objective function that is calculated for each cluster in the data modelling phase to perform the likelihood test against the null hypothesis [11]. Because it does not account for the testing rate, it reduces the accuracy of cluster detection and is prone to error.
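As a quick numerical check of Eq. 1, the expected count can be computed directly. The function name and the figures below are made up for illustration and do not come from the study data.

```python
def expected_cases(cylinder_population: float, total_cases: float,
                   total_population: float) -> float:
    """Eq. 1: mu = p * C / P, the expected cases under a uniform rate."""
    if total_population <= 0:
        raise ValueError("total_population must be positive")
    return cylinder_population * total_cases / total_population


# Hypothetical numbers: a cylinder holding 50,000 people, 6,000 cases
# overall, and a total population of 3,000,000.
mu = expected_cases(50_000, 6_000, 3_000_000)
```

The cylinder holds 1/60 of the population, so it is expected to hold 1/60 of the cases regardless of how much testing was done there.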

$$\frac{L(Z)}{L_0} = \frac{\left(\frac{n_Z}{\mu(Z)}\right)^{n_Z} \left(\frac{N - n_Z}{N - \mu(Z)}\right)^{N - n_Z}}{\left(\frac{N}{\mu(T)}\right)^{N}} \qquad (2)$$

where

L(Z): the likelihood function for candidate cylinder Z;
L₀: the likelihood function for H₀;
n_Z: the number of cases inside the cylinder;
μ(Z): the expected number of cases in cylinder Z;
μ(T): the total number of expected cases in the study area across all periods; and
N: the number of observed cases for the entire study area during the entire study period.
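Because the powers in Eq. 2 overflow quickly for realistic case counts, the ratio is usually easier to evaluate on the log scale. The sketch below does this with the symbols defined above. The function names are hypothetical, the inputs are toy values, and the sketch assumes μ(Z) is smaller than N.

```python
import math


def _xlog(count: float, ratio: float) -> float:
    """Return count * ln(ratio), using the limit 0 when count is 0."""
    return count * math.log(ratio) if count > 0 else 0.0


def log_likelihood_ratio(n_z: float, mu_z: float,
                         n_total: float, mu_total: float) -> float:
    """Natural log of Eq. 2 for one candidate cylinder Z."""
    inside = _xlog(n_z, n_z / mu_z)
    outside = _xlog(n_total - n_z, (n_total - n_z) / (n_total - mu_z))
    null = n_total * math.log(n_total / mu_total)
    return inside + outside - null


# Toy cylinder: 25 observed cases where 10 were expected, out of N = 100,
# with the expected counts summing to mu(T) = 100.
log_lr = log_likelihood_ratio(n_z=25, mu_z=10.0, n_total=100, mu_total=100.0)
likelihood_ratio = math.exp(log_lr)
```

A likelihood ratio greater than 1 corresponds to a log value greater than 0, so the selection rule described in the data analysis stage can be applied to either form.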

The relative risk is the risk within a county divided by the risk outside it, and is computed by Hohl et al. [11] for each cluster in a county as in Eq. 3:

$$RR_{cty} = \frac{e / \mu(Z)}{(E - e) / (E - \mu(Z))} \qquad (3)$$

where

RR_cty: the relative risk;
e: the total number of cases for a given county; and
E: the number of observed cases in the U.S.
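The relative risk can be checked numerically in the same way. The sketch reads the definition as the observed-to-expected ratio inside the county divided by the same ratio outside it, which matches the verbal description above. The function name and the numbers are illustrative assumptions.

```python
def relative_risk(county_cases: float, expected_in_county: float,
                  total_cases: float) -> float:
    """Observed/expected ratio inside the county over the same ratio outside."""
    risk_inside = county_cases / expected_in_county
    risk_outside = (total_cases - county_cases) / (total_cases - expected_in_county)
    return risk_inside / risk_outside


# Toy county: 25 observed cases against 10 expected, with 100 cases overall.
rr = relative_risk(county_cases=25, expected_in_county=10.0, total_cases=100)
```

A value above 1 means the county carries more cases than expected relative to the rest of the study area.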

Without the testing rate, it is difficult to determine the actual number of expected cases and to calculate the relative risk within a cluster compared to the outside. Even though the expected number can be calculated without the testing rate, it does not represent the actual number. In addition, without the Modified Relative Risk, the Monte Carlo simulation will miss some potential clusters, and there is a high risk of including noisy data.

Proposed System

After reviewing a range of methods for space-time analysis of COVID-19 cases, we analyzed the pros and cons of each method. Accuracy, cluster shape, processing time, relative risk, and expected cases were the main issues considered. Based on these considerations, we selected the work of Hohl et al. [11] as the basis for our proposed solution. Poisson prospective space-time analysis analyzes the cases based on the spatial location and time window. It also produces clusters of cylindrical shape, which can include both space and time [25]. Thus, Poisson prospective space-time analysis helps to analyze the cases by considering the geographical location and their behavior over the study period [2]. A prospective space-time scan statistic is beneficial because it detects active or emerging clusters of the present while neglecting clusters from the past [7]. In addition, the proposed solution increases the cluster detection accuracy by including areas with a small number of cases and a low testing rate. The proposed system uses the differential information about the Relative positive test count (RC) and the proportion of positive test cases [3] to find clusters. This is a completely new feature adapted from the work of Cordes and Castro [3]. This approach calculates the testing rate (T), 0 < T ≤ 1, and uses it when calculating the number of expected cases to improve accuracy. For this reason, we selected the work of Cordes and Castro [3] as the second-best solution. The block diagram of the proposed system is given in Figure 3. The proposed system consists of the same four major stages as the state of art system: Data collection and preparation, Data modelling, Data analysis, and Data visualization. The following paragraphs give more details of each stage.

Fig. 3 — The Block diagram of the proposed system for daily visualization of COVID-19 cases using Enhanced cluster detection and noise reduction algorithm. [The green borders refer to the new parts in our proposed system].